Monday, July 3, 2017

Awarding 90 quality points instead of 89

I have written before about the over-representation, by most wine commentators, of certain wine-quality scores compared to others. For example, I have discussed this for certain wine professionals (Biases in wine quality scores) and for certain semi-professionals (Are there biases in wine quality scores from semi-professionals?); and I have discussed it for the pooled scores from many amateurs (Are there biases in community wine-quality scores?). It still remains for me to analyze some data for the pooled scores of professionals as a group. This is what I will do here.


The data I will look at are the compilation provided by Suneal Chaudhary and Jeff Siegel in their report entitled Expert Scores and Red Wine Bias: a Visual Exploration of a Large Dataset. I have discussed these data in a previous post (What's all this fuss about red versus white wine quality scores?). The data are described this way:
We obtained 14,885 white wine scores and 46,924 red wine scores dating from the 1970s that appeared in the major wine magazines. They were given to us on the condition of anonymity. The scores do not include every wine that the magazines reviewed, so the data may not be complete, and the data was not originally collected with any goal of being a representative sample.
This is as big a compilation of wine scores as is readily available, and presumably represents a wide range of professional wine commentators. It is likely to represent widespread patterns of wine-quality scores among the critics, even today.

In my previous analyses, and those of Alex Hunt, who has also commented on this (What's in a number? Part the second), the most obvious and widespread bias when assigning quality scores on a 100-point scale is the over-representation of the score 90 and under-representation of 89. That is, the critics are more likely to award 90 than 89, when given a choice between the two scores. A similar thing often happens for the score 100 versus 99. In an unbiased world, some of the "90" scores should actually have been 89, and some of the "100" scores should actually have been 99. However, assigning wine-quality scores is not an unbiased procedure — wine assessors often have subconscious biases about what scores to assign.

It would be interesting to estimate just how many scores are involved, as this would quantify the magnitude of these two biases. Since we have at hand a dataset that represents a wide range of commentators, analyzing this particular set would tell us about general biases, not just those specific to each individual commentator.

Estimating the biases

As in my earlier posts, the analysis involves frequency distributions. The first two graphs show the quality-score data for the red wines and the white wines, arranged as two frequency distributions. The height of each vertical bar in the graphs represents the proportion of wines receiving the score indicated.

Frequency histogram of red wine scores

Frequency histogram of white wine scores
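For anyone wanting to construct this sort of graph themselves, here is a minimal sketch of the calculation in Python. The scores array is toy data, and the variable names are mine rather than anything from the original analysis; the point is simply that each bar's height is the count for one score divided by the total number of scores.

```python
# Minimal sketch: normalized frequency distribution of wine-quality scores.
# The scores here are toy data; a real analysis would load the full dataset.
import numpy as np

scores = np.array([87, 88, 88, 89, 90, 90, 90, 90, 91, 92, 93, 90])

values, counts = np.unique(scores, return_counts=True)
proportions = counts / counts.sum()  # bar heights: proportion per score

for score, prop in zip(values, proportions):
    print(f"{score}: {prop:.3f}")
```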

The biases involving 90 versus 89 are clear in both graphs; and the bias involving 100 is clear in the graph for the red wines (we all know that white wines usually do not get scores as high as red wines do — see What's all this fuss about red versus white wine quality scores?).

For these data, the "expectation" is that, in an unbiased world, the quality scores would show a relatively smooth frequency distribution, rather than having dips and spikes in the frequency at certain score values (such as 90 or 100). Mathematically, the expected scores would come from an "expected frequency distribution", which we can derive from a probability distribution (see Wikipedia).

In my earlier post (Biases in wine quality scores), I used a Weibull distribution (see Wikipedia) as being a suitable probability distribution for wine-score data. In that post I also described how to use this as an expectation to estimate the degree of bias in our red- and white-wine frequency distributions.
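As a hedged illustration of that procedure, the sketch below fits a Weibull-family distribution to a set of scores and converts it into an expected proportion for each integer score. Because wine scores pile up near a ceiling with a long tail toward lower values, I use scipy's reversed Weibull (weibull_max) here; the exact fitting code behind my earlier analysis is not reproduced in the post, so treat these details as assumptions.

```python
# Hedged sketch: fit a reversed Weibull (scipy's weibull_max) to scores and
# derive the expected proportion of wines at each integer score value.
# The fitting details are assumptions, not the original analysis's code.
import numpy as np
from scipy import stats

# Toy data: simulated scores with a ceiling near 100 and a low-score tail.
scores = np.round(stats.weibull_max.rvs(3.0, loc=99, scale=9,
                                        size=5000, random_state=0))

# Generic maximum-likelihood fit; real data may need tuned starting values.
shape, loc, scale = stats.weibull_max.fit(scores)

# Expected proportion at integer score s = P(s - 0.5 < X <= s + 0.5).
grid = np.arange(scores.min(), scores.max() + 1)
expected = (stats.weibull_max.cdf(grid + 0.5, shape, loc, scale)
            - stats.weibull_max.cdf(grid - 0.5, shape, loc, scale))
```

Plotting the expected proportions next to the observed ones reproduces the paired-bar comparison shown in the next two graphs.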

The resulting frequency distributions are shown in the next two graphs. In these graphs, the blue bars represent the (possibly biased) scores from the critics, and the maroon bars are the unbiased expectations (from the model). Note that the mathematical expectations both form nice smooth distributions, with no dips or spikes. Those quality scores where the heights of the paired bars differ greatly are the ones where bias is indicated.

Frequency histogram of modeled red wine scores

Frequency histogram of modeled white wine scores

We can now estimate the degree of bias by comparing the observed scores to their expectations. For the red wines, a score of "90" occurs 1.53 times as often as expected, and for the white wines 1.44 times as often. So, we can now say that there is a consistent bias among the critics, whereby a score of "90" occurs c.50% more often than it should. Put another way, (1.53 − 1) / 1.53 ≈ 35% of the observed 90-point scores are excess, or roughly one in every three. This is not a small bias!
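The bias calculation itself is just a ratio of observed to expected proportions, as in this sketch (the proportions are illustrative numbers chosen to match the reported ratio, not values taken from the dataset):

```python
# Hedged sketch: bias at a given score = observed proportion / expected one.
observed_90 = 0.153   # illustrative values, not the actual dataset figures
expected_90 = 0.100

bias = observed_90 / expected_90        # 1.53, as reported for the reds
excess_fraction = (bias - 1.0) / bias   # ~0.35: one in three 90s is excess
print(f"bias = {bias:.2f}, excess fraction = {excess_fraction:.2f}")
```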

For a score of "100" we can only refer to the red-wine data. These data indicate that this score occurs more than 8 times as often as expected from the model. This is what people are referring to when they talk about "score inflation" — the increasing presence of 100-point scores. It might therefore be an interesting future analysis to see whether we can estimate any change in 100-point bias through recent time, and thereby quantify this phenomenon.

Finally, having produced unbiased expectations for the red and white wines, we can now compare their average scores. These are c.91.7 and c.90.3 for the reds and whites, respectively. That is, on average, red wines get about 1⅓ more points than the whites do. This is much less of a difference than has been claimed by some wine commentators.
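These unbiased averages come straight from the fitted models: the mean of each fitted distribution is the expected average score once the dips and spikes are smoothed out. A brief sketch, with hypothetical parameter values standing in for the actual fitted ones:

```python
# Hedged sketch: compare the mean scores implied by two fitted distributions.
# The parameters below are hypothetical placeholders, not the fitted values.
from scipy import stats

red = stats.weibull_max(3.0, loc=99.0, scale=8.0)
white = stats.weibull_max(3.0, loc=98.0, scale=8.6)

print(f"red mean:   {red.mean():.1f}")
print(f"white mean: {white.mean():.1f}")
print(f"difference: {red.mean() - white.mean():.1f}")
```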

Conclusion

Personal wine-score biases are easy to demonstrate for individual commentators, whether professional or semi-professional. We now know that there are also general biases shared among commentators, whether they are professional or amateur. The most obvious of these is a preference for over-using a score of 90 points, instead of 89 points. I have shown here that one in every three 90-point wines from the professional critics is actually an 89-point wine with an inflated score. Moreover, the majority of the 100-point wines of the world are actually 99-point wines that are receiving a bit of emotional support from the critics.

6 comments:

  1. Quoting from Wine Spectator (March 16-31, 1982, page 12):

    "Scoring"

    The Wine Spectator Tasting Panel uses a nine-point tasting scale, first introduced in 1974 by the Oenological and Viticultural Research Institute of South Africa, and modified by researchers at the University of California-Davis.

    Panelists are required to grade a wine against five sections (unacceptable to superior) and to provide written comments about each wine tasted. The section division is:

    Unacceptable . . . 1 point
    Average quality with some defects . . . 2 to 3 points
    Average quality . . . 4 to 6 points
    Above average quality with some superior qualities . . . 7 to 8 points
    Superior . . . 9 points

    Space is provided on the tasting sheet for panelists to describe appearance, aroma, taste, and to list general comments.

    Following the scoring, a panel discussion is held on each flight of wines.

    Total points given for each wine are tallied and an average score calculated. Only the top four wines (or more if ties occur at any or all of the four levels) are reviewed in detail. All other wines are listed only as having been tasted in the flight.

    "Criteria"

    In selecting the wines to be evaluated by The Wine Spectator Tasting Panel, care is taken to be as fair and equitable as possible.

    We believe that Wine Spectator Tasting Panel is the most significant series of wine tasting reports available to the consumer. This is why:

    • All flights contain no more than 12 wines. From each flight of wines tasted, only the top four wines are fully rated with scores and condensed tasting notes from the full panel. When ties occur, all wines in the first four places are reviewed fully.

    • All judging panels will normally consist of five or six members, most of whom are selected for their reputations as winemakers, wine merchants, educators, or wine-interested consumers with acknowledged palates. Further, every effort is made to select panel members to taste and rate wine for which they have established a particular knowledge and expertise.

    • All wines are tasted blind against others of like type. No “ringer,” or European-American comparisons are permitted. When a selected wine is available in more than one vintage, a mixing of vintages is allowed.

    • All wines are poured in each flight, then each panelist tastes and rates each wine individually. A panel discussion of each flight of wines is held following the tasting and rating.

    • All wines are rated on a modified UC-Davis nine-point scale, recommended for The Wine Spectator Tasting Panel by Emeritus Professor Maynard Amerine.

    • All wines are purchased at Southern California retail prices and are selected for their general availability in most major U.S. markets. However, because of distribution and pricing variables, availability will vary throughout the country.

    [See next comment.]

  2. In 1985, Wine Spectator dropped its 9-point scale and adopted a 100-point scale.

One consequence of this change is the absence of the printed 9-point-scale wine reviews (dating from the late 1970s to the mid-1980s) from the Wine Spectator “Buyer’s Guide” biannual review books.

    That being the case, one can conclude that these (and I quote) "14,885 white wine scores and 46,924 red wine scores dating from the 1970s that appeared in the major wine magazines" exclude Wine Spectator reviews.

    So what other "major" wine magazine(s) from the 1970s could be the sources of these scores?

    Replies
    1. The original authors may tell you privately, but their manuscript says that they were given the dataset on the basis that they would not publicly reveal its source.

  3. Regarding:

    "... assigning wine-quality scores is not an unbiased procedure — wine assessors often have subconscious biases about what scores to assign."

    -- and --

    "... the majority of the 100-point wines of the world are actually 99-point wines that are receiving a bit of emotional support from the critics."

    Let me quote Robert Parker:

    “. . . Readers often wonder what a 100-point score means, and the best answer is that it is pure emotion that makes me give a wine 100 instead of 96, 97, 98 or 99."

    Source: The Wine Advocate (unknown issue from 2002).

  4. Quoting from The San Francisco Chronicle “Food & Wine” Letters-to-the-Editor Section
    (June 22, 2007, Page Unknown):

    “Keeping Score on Ratings”

    Link: http://www.sfgate.com/wine/article/LETTERS-TO-WINE-Keeping-score-on-ratings-2572416.php

    Editor -- Re [article titled] "Are Ratings Pointless?" (June 15, 2007):

    Well, gee, all this fuss (yet again) about the 100-point scale. So it doesn't tell everything you need to know about a wine. Do four stars? Twenty points? No, you have to read the critic's words.

    Does it reward subtle wines? That's up to the critic using the scale. Do subtle wines that lack drama ever get four-star ratings from the [San Francisco] Chronicle's team? No, they get two or 2½ stars, which pretty much corresponds to a 100-point score in the 80s.

The main reason I like to use the 100-point scale is that it lets me communicate more to my readers. They can tell that I liked a 90-point wine just a little better than an 89-point wine, but a 94-point wine a lot more than one rated 86. Doesn't that say more than giving both the 90- and 94-point wines three stars and both the 89- and 86-point wines 2½ stars?

    (signed)

    HARVEY STEIMAN
    Editor at Large, Wine Spectator

  5. Excerpts from Wine Times (September/October 1989) interview
    with Robert Parker, publisher of The Wine Advocate

    PARKER: . . . The newsletter was always meant to be a guide, one person's opinion. The scoring system was always meant to be an accessory to the written reviews, tasting notes. That's why I use sentences and try and make it interesting. Reading is a lost skill in America. There's a certain segment of my readers who only look at numbers, but I think it is a much smaller segment than most wine writers would like to believe. The tasting notes are one thing, but in order to communicate effectively and quickly where a wine placed vis-à-vis its peer group, a numerical scale was necessary. If I didn't do that, it would have been a sort of cop-out.
