On the other hand, I have also commented in this blog on the characteristics of wine scores as numbers, noting that almost all scores are biased, whether they come from professionals, semi-professionals, or the general wine community. As noted elsewhere (Wine ratings might not pass the sobriety test):
A rating system that draws a distinction between a cabernet scoring 90 and one receiving an 89 implies a precision of the senses that even many wine critics agree that human beings do not possess. Ratings are quick judgments that a single individual renders early in the life of a bottle of wine that, once expressed numerically, magically transform the nebulous and subjective into the authoritative and objective.
The main issue, as I see it, is the lack of repeatability of the ratings between tasters. I have previously noted (The poor mathematics of wine-quality scores):
Most wine commentators’ wine-quality scores are personal to themselves. That is, the best we can expect from each commentator is that their wine scores can be compared among themselves, so that we can work out which wines they liked and which ones they didn’t.

This has also been discussed in this post: In their own words, this is how seven professional wine writers and critics go about rating a bottle.
On the other hand, little has been said about repeatability by the same taster, when they re-taste a wine. However, even at the top, Robert M. Parker Jr once commented:
How often do I go back and re-taste a wine that I gave 100 points and repeat the score? Probably about 50% of the time.

When the Points Guru tells you that even his scores are not repeatable, you should believe him! *
It therefore seems to be of some interest to illustrate a few specific examples where we can clearly see the lack of repeatability of wine-quality scores, and the source of the variation in scores. This is what I do below.
Between two magazines

Let's start by looking at the same wines as scored by two different wine magazines, in this case the Wine Spectator and the Wine Advocate. I have used these data in several earlier posts (e.g. How large is between-critic variation in quality scores?).
In the following graph, each point represents one wine, with the Spectator wine-quality score shown vertically and the Advocate score shown horizontally. The wines are from the top Bordeaux chateaux (Latour, Lafite, Margaux, Mouton, and Haut-Brion) for the vintages 1975-2014. There are a total of 195 wines. Points that lie on the line scored the same from both magazines, whilst those above the line did better from the Spectator, and those below the line did better from the Advocate.
Note, first, that 36 of the points lie on the line (18.5%), showing that only one-fifth of the wines were evaluated identically. The remaining wines differ by up to 14 quality points, with an average difference of 2.8 points. A correlation analysis shows that, overall, 65% of the variation in scores is shared between the two magazines, which we can interpret as the magazines sharing two-thirds of their opinions about the wines.
The second thing to note is that no wine scored 100 points from both magazines simultaneously, although there are 7 perfect scores from the Spectator and 13 from the Advocate. So, roughly 10% of the wines (20 of the 195) are considered to be potentially of the very top quality, although there is no agreement on which wines those actually are.
Within a magazine
Most magazines have several people tasting their wines, often covering different geographical areas, although these usually overlap.
The next graph shows the scores from 9 of the people who tasted wines for the Wine Spectator, covering the period 2006–2015. Each of them tasted 5,000–25,000 wines during that time (the data come from Wineinformatics: a quantitative analysis of wine reviewers). In the graph, the quality scores are grouped horizontally, with the percent of scores for each group shown vertically, for each taster.
Obviously, most of the people scored their wines in the 85–89 range, except for Bruce Sanderson, who preferred 90–94 scores. Also, very few of the wines scored 95–100, from any taster.
However, there are some very different patterns here. For example, compared to his colleagues, James Molesworth greatly preferred the 80–84 and 85–89 ranges, at the expense of the 90–94 range. The two people who were most different from their colleagues were MaryAnn Worobiec, who used the 85–89 range much more than her colleagues did, and the 90–94 range less; and Bruce Sanderson, who showed the strongest preference for 90–94 scores over 85–89. Harvey Steiman and James Laube preferred scores of 90–94 over 80–84, although they might both claim that their wines justify those scores. The other tasters showed patterns that were fairly similar to each other.
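The score bands in the graph are easy to tabulate. A minimal Python sketch of the binning, applied to an invented set of scores for one hypothetical taster:

```python
from collections import Counter

def score_distribution(scores):
    """Percent of a taster's scores falling in each 5-point band (80-100),
    as in the Wine Spectator graph described above."""
    bands = {80: "80-84", 85: "85-89", 90: "90-94", 95: "95-100"}
    counts = Counter(bands[min(95, (s // 5) * 5)] for s in scores)
    return {band: 100 * counts[band] / len(scores) for band in bands.values()}

# Invented scores for one hypothetical taster (not real Spectator data)
dist = score_distribution([83, 86, 88, 91, 92, 96, 87, 89, 90, 85])
print(dist)   # {'80-84': 10.0, '85-89': 50.0, '90-94': 30.0, '95-100': 10.0}
```

Comparing such distributions across tasters is all the graph does; the differences between tasters then stand out immediately.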
Repeat tastings by one person
Finally, it is worth noting that, while wine critics sometimes do retrospective tastings of particular wines, there are very few published data about attempts to re-taste (and re-score) wines not long after they were originally tasted. One person who has done this is Rusty Gaffney (Quick trigger: are reviews done too soon?).
The following graph shows his scores for 21 Pinot noir wines, with each point representing one wine. The original score is shown horizontally (note that all of the wines scored ≥ 90 points), and the score when tasted again 16–26 months later is shown vertically. Points that lie on the line scored the same on both occasions, whilst those above the line did better at the second tasting, and those below the line did better the first time.
Note that only 3 wines got the same score on both occasions, with 10 doing better at the re-tasting and 8 doing worse. The maximum difference was 4 points.
So, about half of the wines were better and half were the same or worse when re-tasted 2 years later, which is what might be expected from random chance. While bottle variation may be a factor here, it is unlikely to change the results (although it might determine which wines did better or worse).
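The "random chance" claim can be checked with a simple binomial calculation: ignoring the 3 ties, 10 of the 18 changed scores went up. If each wine were equally likely to go up or down on re-tasting, a split at least this lopsided would be entirely unremarkable:

```python
from math import comb

n, up = 18, 10          # 18 wines changed score; 10 of them improved
# Chance model: each changed score is equally likely to go up or down (p = 0.5)

# Probability of seeing 10 or more improvements out of 18 under that model
p_at_least = sum(comb(n, k) for k in range(up, n + 1)) * 0.5 ** n
print(round(p_at_least, 3))   # 0.407 -- nowhere near statistically surprising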
All three datasets show that variation in wine-quality scores is substantial, and that it arises from several sources. When you combine these sources of variation, it is difficult to attribute any mathematical precision to the use of numbers for wine commentary.
So, why aren't wines given a range of points, rather than a single score? It would make much more sense, given the mathematical reality of the situation.
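As a purely hypothetical illustration of what that could look like, a score could be reported as an interval whose width reflects the observed variation. The function name and the ±3-point spread below are my own inventions (the spread loosely echoes the 2.8-point average between-critic difference noted above), not any established convention:

```python
def score_range(score, spread=3, lo=50, hi=100):
    """Report a wine score as an interval rather than a single number.
    The +/-3-point spread is illustrative only, loosely based on the
    ~2.8-point average between-critic difference; it is not a standard."""
    return max(lo, score - spread), min(hi, score + spread)

print(score_range(92))   # (89, 95)
print(score_range(99))   # (96, 100) -- capped at the top of the 100-point scale
```

A range like 89–95 makes the underlying uncertainty visible, where a bare "92" conceals it.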
* Perhaps more tellingly, in a 1999 article for the Los Angeles Times, David Shaw (He sips and spits — and the world listens) noted of Parker:
More than once he has been asked if he’d be willing to demonstrate his consistency. Would he taste and score five or six wines “blind” — without knowing what they are — and then taste and score them again a day or two later? “No,” he says. “I'm not doing trained dog tricks. I’ve got everything to lose and nothing to gain.”

Apparently, Parker neither respects scientific experiments nor understands their use.