Monday, July 8, 2019

The sources of wine quality-score variation

Wine-quality scores are coming under increasing pressure these days, not least because they seem to miss the idea that wine might be something special — see, for example: It’s time to rethink wine criticism. Wine is, after all, more than just numbers!

On the other hand, I have also commented in this blog on the characteristics of wine scores as numbers, noting that almost all scores are biased, whether they come from professionals, semi-professionals, or the general wine community. As noted elsewhere (Wine ratings might not pass the sobriety test):
A rating system that draws a distinction between a cabernet scoring 90 and one receiving an 89 implies a precision of the senses that even many wine critics agree that human beings do not possess. Ratings are quick judgments that a single individual renders early in the life of a bottle of wine that, once expressed numerically, magically transform the nebulous and subjective into the authoritative and objective.

The main issue, as I see it, is the lack of repeatability of the ratings between tasters. I have previously noted (The poor mathematics of wine-quality scores):
Most wine commentators’ wine-quality scores are personal to themselves. That is, the best we can expect from each commentator is that their wine scores can be compared among themselves so that we can work out which wines they liked and which ones they didn’t.
This has also been discussed in this post: In their own words, this is how seven professional wine writers and critics go about rating a bottle.

On the other hand, little has been said about repeatability by the same taster, when they re-taste a wine. However, even at the top, Robert M. Parker Jr once commented:
How often do I go back and re-taste a wine that I gave 100 points and repeat the score? Probably about 50% of the time.
When the Points Guru tells you that even his scores are not repeatable, you should believe him! *

It therefore seems to be of some interest to illustrate a few specific examples where we can clearly see the lack of repeatability of wine-quality scores, and the source of the variation in scores. This is what I do below.

Between magazines

Let's start by looking at the same wines as scored by two different wine magazines, in this case the Wine Spectator and the Wine Advocate. I have used these data as part of several earlier posts (e.g. How large is between-critic variation in quality scores?).

In the following graph, each point represents one wine, with the Spectator wine-quality score shown vertically and the Advocate score shown horizontally. The wines are from the top Bordeaux chateaux (Latour, Lafite, Margaux, Mouton, and Haut-Brion) for the vintages 1975-2014. There are a total of 195 wines. Points that lie on the line scored the same from both magazines, whilst those above the line did better from the Spectator, and those below the line did better from the Advocate.

Wine Spectator versus Wine Advocate scores for the top Bordeaux chateaux

Note, first, that 36 of the points lie on the line (18.5%), showing that only one-fifth of the wines were evaluated identically. The remaining wines differ by up to 14 quality points, with an average difference of 2.8 points. A correlation analysis shows that, overall, 65% of the variation in scores is shared between the two magazines, which we can interpret as the magazines sharing two-thirds of their opinions about the wines.
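For anyone who wants to reproduce these sorts of numbers for their own data, the three agreement statistics reported above (the fraction of identical scores, the average difference, and the proportion of shared variation, which is simply the squared Pearson correlation) take only a few lines of code. The score pairs below are invented for illustration; they are not the actual Bordeaux dataset.

```python
# Agreement statistics for paired wine-quality scores.
# Each pair is (score from magazine A, score from magazine B) for one wine.
# The example pairs are made up for illustration only.

def agreement_stats(pairs):
    """Return (fraction identical, mean absolute difference, r-squared)."""
    n = len(pairs)
    identical = sum(1 for a, b in pairs if a == b) / n
    mean_diff = sum(abs(a - b) for a, b in pairs) / n
    # Squared Pearson correlation = proportion of shared variation
    xs = [a for a, _ in pairs]
    ys = [b for _, b in pairs]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    r2 = cov * cov / (vx * vy)
    return identical, mean_diff, r2

pairs = [(95, 95), (92, 96), (100, 97), (88, 90), (94, 91), (90, 90)]
ident, diff, r2 = agreement_stats(pairs)
print(f"identical: {ident:.0%}, mean diff: {diff:.1f}, shared variation: {r2:.0%}")
```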

The second thing to note is that none of the wines scored 100 points from both magazines simultaneously, although there are 7 perfect scores from the Spectator and 13 from the Advocate. So, about 10% of the wines (20 of 195) are considered to be potentially of the very top quality, although there is no agreement on which wines they actually are.

Within a magazine

Most magazines have several people tasting their wines, often covering different geographical areas, although these usually overlap.

The next graph shows the scores from 9 of the people who tasted wines for the Wine Spectator, covering the period 2006–2015. Each of them tasted 5,000–25,000 wines during that time (the data come from Wineinformatics: a quantitative analysis of wine reviewers). In the graph, the quality scores are grouped horizontally, with the percent of scores for each group shown vertically, for each taster.

Wine-quality scores from the Wine Spectator tasting team

Obviously, most of the people scored their wines in the 85–89 range, except for Bruce Sanderson, who preferred 90–94 scores. Also, very few of the wines scored 95–100, from any taster.

However, there are some very different patterns here. For example, compared to his colleagues, James Molesworth greatly preferred the 80–84 and 85–89 ranges, at the expense of the 90–94 range. The two people who differed most from their colleagues were: MaryAnn Worobiec, who used the 85–89 range much more than her colleagues did, and the 90–94 range less; and Bruce Sanderson, who showed the strongest preference for 90–94 over 85–89. Harvey Steiman and James Laube preferred scores of 90–94 over 80–84, although they might both claim that their wines justify those scores. The other tasters showed patterns that were fairly similar to each other.
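Distributions like those in the graph are straightforward to tabulate: group each taster's scores into the ranges used above, and report the percentage falling in each range. The scores in this sketch are invented for illustration; they are not the Wineinformatics data.

```python
# Tabulating a taster's scores into the score ranges used in the graph.
# The example scores are made up for illustration only.
from collections import Counter

BINS = ["<80", "80-84", "85-89", "90-94", "95-100"]

def bin_label(score):
    """Map a 100-point score to its range label."""
    if score < 80: return "<80"
    if score < 85: return "80-84"
    if score < 90: return "85-89"
    if score < 95: return "90-94"
    return "95-100"

def distribution(scores):
    """Percentage of scores falling in each range."""
    counts = Counter(bin_label(s) for s in scores)
    n = len(scores)
    return {b: 100 * counts[b] / n for b in BINS}

print(distribution([82, 86, 88, 91, 87, 93, 96, 89]))
```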

Repeat tastings by one person

Finally, it is worth noting that, while wine critics sometimes do retrospective tastings of particular wines, there are very few published data about attempts to re-taste (and re-score) wines not long after they were originally tasted. One person who has done this is Rusty Gaffney (Quick trigger: are reviews done too soon?).

The following graph shows his scores for 21 Pinot noir wines, with each point representing one wine. The original score is shown horizontally (note that all of the wines scored ≥ 90 points), and the score when tasted again 16–26 months later is shown vertically. Points that lie on the line scored the same on both occasions, whilst those above the line did better at the second tasting, and those below the line did better the first time.

Rusty Gaffney re-tasting 21 pinot noirs

Note that only 3 wines got the same score on both occasions, with 10 doing better at the re-tasting and 8 doing worse. The maximum difference was 4 points.

So, about half of the wines were better and half were the same or worse when re-tasted 2 years later, which is what might be expected from random chance. While bottle variation may be a factor here, it is unlikely to change the results (although it might determine which wines did better or worse).
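The "random chance" suggestion can be checked with a simple sign test: dropping the 3 ties, there were 10 wines that went up and 8 that went down, and we can ask how likely a split at least that uneven would be if each wine were equally likely to move either way.

```python
# Sign test for the re-tasting results: 10 wines scored higher, 8 lower
# (the 3 ties are dropped). Under pure chance, each wine is equally
# likely to move up or down, so the counts follow a binomial(n, 0.5).
from math import comb

def sign_test_p(up, down):
    """One-sided probability of at least max(up, down) successes
    out of up + down fair coin flips."""
    n = up + down
    k = max(up, down)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

p = sign_test_p(10, 8)
print(f"p = {p:.2f}")  # prints p = 0.41 -- well above 0.05, consistent with chance
```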


All three datasets show that variation in wine-quality scores is substantial, and that it arises from several sources. When you combine these sources of variation, it is difficult to attribute any mathematical precision to the use of numbers for wine commentary.

So, why aren't wines given a range of points, rather than a single score? It would make much more sense, given the mathematical reality of the situation.

* Perhaps more tellingly, in a 1999 article for the Los Angeles Times, David Shaw (He sips and spits — and the world listens) noted of Parker:
More than once he’ll be asked if he’d be willing to demonstrate his consistency. Would he taste and score five or six wines “blind” — without knowing what they are — and then taste and score them again a day or two later? “No,” he says. “I'm not doing trained dog tricks. I’ve got everything to lose and nothing to gain.”
Apparently Parker neither respects scientific experiments nor understands their use.


  1. Wines are given a range of scores when barrel sampled.

    However, not when they are later bottle sampled.

    1. From Wine Spectator Magazine Online:

      "Other Aspects of Our Tastings: Barrel tastings"


      "We also conduct both blind and non-blind tastings of barrel samples -- that is, wines that are not yet finished and bottled—from certain wine regions, including Bordeaux, California Cabernet and Vintage Port.

      "Each wine is rated using a RANGE of scores, and we clearly identify that these ratings and reviews apply to barrel samples. The filtering, fining and blending that may occur from barrel to bottle can alter the finished wines, and we feel these broader score RANGES are a more reliable indicator of the wine's future potential.

      "As of March 2008, to give the tasters more flexibility and describe the wines more accurately, we changed the score RANGES we use for unfinished wines to rolling four-point spreads. For example, one wine may be scored 85-88, another 87-90, another 89-92. We believe this better reflects the subtle differences between wines and gives our readers better information for their buying decisions."

      [CAPITALIZATION used for emphasis. ~~ Bob]

    2. They are insecure about how the barrels will later turn out. They should be equally concerned about the variation in their "final" score.

  2. Leonard Mlodinow, a Caltech lecturer on randomness, discussed the subject of wine quality-score variation in his 2009 guest column for The Wall Street Journal, cited below.

    Two take-aways:

    Robert Hodgson, a retired professor who taught statistics at Humboldt State University (and a current winemaker), has looked at wine judging at the California State Fair Wine Competition, and found that “judges' wine ratings typically varied by ±4 points on a standard ratings scale running from 80 to 100.”

    Robert Parker has stated that he “generally stay[s] within a three-point deviation” when he re-reviews wines.

    From The Wall Street Journal “Weekend” Section
    (November 20, 2009, Page W6):

    “A Hint of Hype, A Taste of Illusion;
    They pour, sip and, with passion and snobbery, glorify or doom wines.
    But studies say the wine-rating system is badly flawed.
    How the experts fare against a coin toss.”


  3. The Concours Mondial de Bruxelles gives the judges the same wine, blind, once per day. The resulting scores are released to the judges of that panel, but no analysis or publication of the scores is done. In my own case, I averaged a single point of difference over three days, while another judge's scores varied by as much as fifteen points on a single wine.

    Then again, if you want to make sure your scores are consistent, you should score every wine 85 points...

    1. Paul:

      Your note reminded me of W. Blake Gray's report on judging the 2011 Concours Mondial de Bruxelles.

      "The narrow range of scoring wine"



      ~~ Bob

    2. And here is a close-up view of W. Blake Gray's score card.


    3. And let me add this W. Blake Gray post on judging a winetasting using WSET Level 3 guidelines.

      "A tale of three wine competitions"


    4. I have found two versions of the WSET Level 3 scoring sheet online.

      From 2012:

      From 2016:

  4. You could score 89, just to annoy the winemaker.

    It would be an interesting study, to take a group of judges, give them the same wines (blind) at various times and in various contexts during their Show, and see how variable the scores are.

  5. What about the "specials"? These are wines that are made especially for the judges and magazines, and are not the same wine released to the public on the shelf. The judges should rate the wines, and then the ones they rate the highest should be bought off the shelf when released, and compared using both a GC and taste buds. I think it would shock the world how many of these specials are out there. The magazines and competitions don't want to do it, because that would be bad for their reputations and entries. Who would buy a magazine, or pay attention to scores from a competition, when the scores are for the specials and not the wines on the shelf that the consumer can buy?

    1. Specials have always been a problem for wine shows and other tastings. So, it would, indeed, be interesting to make a formal comparison of the "show" wine to the retail wine. On those occasions when this has been reported, the difference is apparently quite obvious.