Wine quality scores from commentators tend to be set in stone while the wines are still available commercially. That is, a single score is produced when the wine is released, and it stands indefinitely. Occasionally, a wine is tasted several times and a consensus score is then produced, but this is not common. Sometimes only the latest score is presented, without reference to any previous scores; Wine-Searcher does this, for example, when compiling critics' scores.
One thing that is of interest to the consumer is how these repeat scores relate to each other. It would be nice if repeated tastings of a particular wine produced the same score, because then we could have some confidence in it. However, there will be some component of the scores that is due to different bottles of the wines, and so we cannot expect perfect repeatability. More importantly, this issue will be confounded by possible changes in the wines themselves as they age, so that any one bottle varies through time.
However, a lot of the variation in scores will be due to what is technically called intra-individual variation in the scorer, which you might call within-taster variation: the same wine tasted on repeat occasions by the same person does not necessarily receive the same score, even when the wine itself is unchanged. The assessment of quality is the result of a taster’s previous experiences as well as their personal conceptions, and even experienced wine tasters have been shown to incorporate their own preferences in their judgments. In addition, the environment of the tasting is also known to affect quality judgments.
This issue has only occasionally been studied in the professional literature; and I have included a list of relevant published papers at the end of this post.
What I will do in this post is look at some particular examples of scores from repeat tastings by five different commentators. Some of these tastings come from retrospective vertical comparisons of a single wine, where many of the previous vintages are tasted on a single occasion, or horizontal tastings of a number of wines from the same region and vintage — these new scores can be compared to the scores previously assigned to those same wines.
Most of these examples are restricted to wines where there are many vintages to compare, and where the producer actively provides retrospective vertical tastings of their products. Furthermore, my expertise in this regard is in Australian wine. This creates a distinct bias in which wines I can use as examples.
For my first example, I will use some scores from a book by the Australian commentator Jeremy Oliver, The Australian Wine Handbook, of which there were at least three editions: 1993, 1994, 1996. I introduced this book in a previous blog post (Wine writing, and wine books). The wine that I will look at is Penfolds Grange Bin 95, of which there are now more than 60 vintages, and which I also introduced in a previous post (Poor correlation among critics' quality scores).
The first graph compares the quality scores for 38 vintages of this wine in the 2nd and 3rd editions of the book, at which time Oliver was using a 10-point quality scale. Each point represents a single vintage; and if the scores were identical in the two editions, then the points would all lie along the pink line.
There are only 21 vintages (55%) for which the scores are the same, 8 (21%) that decrease from 1994 to 1996, and 9 (24%) that increase. The maximum decrease is 2 points, and the maximum increase is 3 points. The book does not make clear what circumstances led to the two sets of scores, but there is obviously considerable variation in the opinions about quality. Nevertheless, there is no evidence of any bias in Oliver's opinions about the wines, as the increases and decreases are almost equal in number.
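For readers who want to repeat this sort of tally on their own paired scores, here is a minimal Python sketch. The function names and the short score lists are hypothetical, invented purely for illustration; they are not the data behind the graph. Only the counts of 21, 8 and 9 come from the text above.

```python
from collections import Counter

def tally_changes(first, second):
    """Count how many paired scores stayed the same, rose, or fell."""
    counts = Counter(same=0, increase=0, decrease=0)
    for a, b in zip(first, second):
        if b == a:
            counts["same"] += 1
        elif b > a:
            counts["increase"] += 1
        else:
            counts["decrease"] += 1
    return counts

def as_percentages(counts):
    """Convert counts to whole-number percentages of the total."""
    total = sum(counts.values())
    return {key: round(100 * value / total) for key, value in counts.items()}

# Hypothetical scores on a 10-point scale (NOT the real data).
edition_2 = [7, 8, 9, 6, 8]
edition_3 = [7, 9, 8, 6, 8]
print(tally_changes(edition_2, edition_3))

# The counts reported above for the 38 vintages.
reported = Counter(same=21, decrease=8, increase=9)
print(as_percentages(reported))  # {'same': 55, 'decrease': 21, 'increase': 24}
```

If the increases greatly outnumbered the decreases (or vice versa), that would suggest a systematic shift rather than scatter around the line.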
Now let's consider an example where the second tasting involved a retrospective vertical comparison of the wines. This example involves the same wine and commentator. The first set of scores comes from The Onwine Australian Wine Annual, published in 2000, at which time Oliver was using a 20-point scale. These scores are not exactly the same as those from 1996. The second set of scores comes from a retrospective tasting, Making it a Date with Grange, published in 2004. There are 45 vintages included in this next graph.
Once again, there is no evidence of any bias in Oliver's opinions about the wines, although the scores change by a maximum of 1.8, both up and down. The correlation between the scores shows that they share approximately half (51%) of the variation, which is not particularly high, given that they are the same wines tasted only 4 years apart.
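The "shared variation" quoted here is conventionally the square of the Pearson correlation coefficient between the two sets of scores, so 51% corresponds to a correlation of about 0.71. The following Python sketch shows the calculation under that assumption; the function names and the example scores are hypothetical and are not the Grange data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either list has no variation at all.
    return sxy / sqrt(sxx * syy)

def shared_variation(x, y):
    """Proportion of variation shared by the two lists (r squared)."""
    return pearson_r(x, y) ** 2

# Hypothetical scores on a 20-point scale (NOT the real data).
first_tasting = [18.0, 17.5, 19.0, 16.8, 18.6, 17.2]
second_tasting = [18.4, 17.0, 18.7, 17.5, 18.2, 17.9]
print(round(shared_variation(first_tasting, second_tasting), 2))

# A shared variation of 51% implies a correlation of roughly 0.71.
print(round(sqrt(0.51), 2))
```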
As an alternative example of a retrospective vertical tasting, we can look at another Australian commentator, James Halliday, and the wine Cullen Cabernet Sauvignon Merlot (now called Diana Madeline) from the Margaret River region of Western Australia. The first set of scores comes from various editions of The Australian Wine Companion, using a 100-point scale. The second set of scores comes from a retrospective vertical, Tasting an Icon, published in the Halliday Wine Companion Magazine for Feb/Mar 2014. There are 19 vintages included in the next graph.
This time we see several vintages that have very different scores in the two datasets. Two of the vintages have their scores reduced by 5-6 points (bottom of the graph), and one gets a 5-point increase (top-left). Furthermore, the scores are not highly correlated even if we exclude these three vintages, with only 20% of the variation being shared between the two datasets.
Moving on, the wines of Bordeaux often have repeated scores from single sources that cover many vintages. For example, the American magazine Wine Spectator published two retrospective vertical tastings of the top wine from Château Lafite-Rothschild, one on 15 December 1988 and one on 30 November 1991. There are 34 vintages included in the next graph.
Once again, we see several vintages that have very different scores in the two datasets; these vintages are labeled in the graph. Even if we exclude these three vintages, the scores are still barely correlated with each other, as only 8% of the variation is shared between the two datasets. This is a very poor correlation, given that the scores refer to repeat tastings of the same wines only 3 years apart.
As an alternative approach, we could try comparing a range of wines from the same region in the same vintage year, that is, a horizontal tasting rather than a vertical one. To do this, let's look at the American commentator James Laube, and the 1986 vintage California Cabernets. The first set of scores comes from the 1988 book California's Great Cabernets: The Wine Spectator's Ultimate Guide for Consumers, Collectors and Investors. The second set of scores comes from a retrospective horizontal tasting, 10 Years After, published in Wine Spectator for December 15 1996 (pp. 67-70). (Thanks to Bob Henry for compiling these two datasets.) There are 45 different cabernet wines included in the next graph.
As is obvious, there was a major re-assessment of the wines at the second tasting, as almost all of the points lie below the pink line, rather than being scattered around the line (as in the above graphs). This vintage did not turn out to be as good as originally expected!
On release, the wines were assessed generally in the score range 88-96, but 10 years later these same wines scored only 86-94 points, with an average score reduction of 2.6 points per wine. This seems to relate to the idea that there are bonus quality points available for wines based on their expected longevity (see the post What's all this fuss about red versus white wine quality scores?). After 10 years, it was obvious that the 1986 cabernets were not going to last as long as expected, and so their bonus points disappeared.
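A systematic shift of this kind shows up as a non-zero average of the paired differences, with most wines moving in the same direction. Here is a minimal Python sketch of that check; the function name and the four pairs of scores are hypothetical, chosen only to illustrate the arithmetic, and are not Laube's scores.

```python
def paired_shift(first, second):
    """Return the mean change (second minus first) and the
    fraction of wines whose score went down."""
    diffs = [b - a for a, b in zip(first, second)]
    mean_change = sum(diffs) / len(diffs)
    fraction_lower = sum(d < 0 for d in diffs) / len(diffs)
    return mean_change, fraction_lower

# Hypothetical scores on a 100-point scale (NOT the real data).
on_release = [92, 90, 95, 88]
ten_years_later = [90, 87, 92, 88]

mean_change, fraction_lower = paired_shift(on_release, ten_years_later)
print(mean_change)     # -2.0
print(fraction_lower)  # 0.75
```

When the scores merely scatter around the line, the mean change is close to zero and roughly half of the wines move each way; here both numbers point in the same direction.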
It is clear that the consumer should pay attention to the Wine Spectator and the Wine Advocate, both of which are known to conduct reviews of California Cabernets at the 10th, 20th, 30th and sometimes even 40th anniversaries of the vintages.
Finally, we can return to the vertical tastings, and look at a British commentator, Jancis Robinson. We can also return to Australian wines, this time the Henschke Hill of Grace Shiraz, a wine from the Eden Valley region of South Australia, and with almost as long and distinguished a pedigree as the Penfolds Grange discussed above. Robinson took part in two retrospective vertical tastings of this wine, one on 18 March 2003 and the other on 12 March 2013, which marked the 40th and 50th anniversaries, respectively, of the first vintage. There are 23 vintages included in the next graph, scored on a 20-point quality scale.
Once again, there was apparently a major re-assessment of the wines at the second tasting, as almost all of the points lie above the pink line, rather than being scattered around the line. Note that this is a comparison of two retrospective tastings, unlike the above graphs.
The age of the wines at the first tasting was 5-28 years, and the scores ranged from 15 to 18.5, with an average of 16.7 points. The age of the same wines at the second tasting was 15-38 years, and the scores ranged from 16 to 19, with an average of 17.7 points, for an average increase of 1 point. Obviously, either the wines changed or Robinson's perception of them did. That is, is this score inflation, or did most of the wines get better through time?
To work this out, we can plot all of the scores from the two retrospective tastings, not just those tasted on both occasions. This final graph shows the two sets of scores for each vintage tasted, in different colors. (Note: the missing vintages in the second tasting are ones in which the wine was not considered good enough to release.)
Clearly, all of the scores on the second occasion (in green) are consistent, and so there is no evidence that the quality of the wines improved over time. Instead, we must conclude that Robinson simply scored the wines 1 point higher on the second occasion. This may reflect her better familiarity with this style of wine, or it may reflect her well-known dislike of assigning wine scores in the first place.
Wine quality scores are usually presented as though they are inviolate, and represent a critic's opinion that does not change. That is, we are given one number, which does not get changed. This may be valid if the commentator tastes each wine once, and once only. However, the commentators have been known to reevaluate wines at different times, especially if they are invited to a retrospective tasting, either vertical or horizontal. In this case, the evaluations at different times can be really quite different.
Robert H. Ashton (2012) Reliability and consensus of experienced wine judges: expertise within and between? Journal of Wine Economics 7:70-87.
Chris J. Brien, P. May, Oliver Mayo (1987) Analysis of judge performance in wine-quality evaluations. Journal of Food Science 52:1273-1279.
Richard Gawel, Peter W. Godden (2008) Evaluation of the consistency of wine quality assessments from expert wine tasters. Australian Journal of Grape and Wine Research 14:1-8.
Richard Gawel, Tony Royal, Peter Leske (2002) The effect of different oak types on the sensory properties of chardonnay. Australian and New Zealand Wine Industry Journal 17:14-20.
Robert T. Hodgson (2008) An examination of judge reliability at a major U.S. wine competition. Journal of Wine Economics 3:105-113.
Robert T. Hodgson (2009) How expert are "expert" wine judges? Journal of Wine Economics 4:233-241.
Harry Lawless, Yen-Fei Liu, Craig Goldwyn (1997) Evaluation of wine quality using a small-panel hedonic scaling method. Journal of Sensory Studies 12:317-332.