Monday, 19 March 2018

Laube versus Suckling — their scores differ, but what does that mean for us?

There seem to be two general attitudes toward professional wine-quality scores. First, they can be seen as the sum of assessments of various sensory "components" of the wine. The classic example of this is the UC Davis 20-point score, which was originally designed to train students in detecting wine faults. This approach has perhaps been taken to its logical extreme in the fascinating book by Clive S. Michelson, Tasting and Grading Wine (2005, JAC International).

The alternative view is that the scores are expert, but subjective, opinions about the quality of the wine. For example, on March 15, 1994, in response to a reader query, the Editor of the Wine Spectator magazine noted:
In brief, our editors do not assign specific values to certain properties of a wine when we score it. We grade it for overall quality as a professor grades an essay test. We look, smell and taste for many different attributes and flaws, then we assign a score based on how much we like the wine overall.
This seems to be the approach adopted by most of the professional media, especially when they use the 100-point scale. Some of them claim to be considering wine components individually (e.g. complexity, concentration, balance, texture, length, overall elegance), but there is little evidence of this in their final scores.

[Photos: James Laube and James Suckling]

I have shown in several blog posts that professional wine commentators do not necessarily provide comparable wine-quality scores when tasting the same wine. This can happen for many reasons, including different expertise, different personal preferences, different wine bottles, and different tasting conditions. This is why we seem to both love and hate wine critics. Let's look at this issue in more detail.

An interesting exercise

To look at variation in wine-quality scores, it is of interest to eliminate the last two factors listed above (bottles and tasting conditions), by having the scores be produced from the same bottle at the same time. This, of course, is what happens at most group wine tastings; but rarely do we see the scores from several people at a single tasting published, so that we can make the direct comparison.

However, one pair of commentators for whom we can do this is James Laube and James Suckling, who, at various times, have both provided wine-quality scores to Wine Spectator magazine regarding Cabernet wines, with Laube as the California expert and Suckling as the Bordeaux expert. Suckling has subsequently parted company with the magazine, but Laube remains as their California correspondent.

The dataset I will use here is from the "Cabernet Challenge" of 1996 (see Wine Spectator for September 15, 1996, pp. 32–48), in which the two James tasted 10 California Cabernet blends and 10 Bordeaux red wines from both the 1985 and 1990 vintages. This gives us 40 bottles of wine with which to compare their scores.

The data are shown in the first graph, with Laube's scores vertically and Suckling's horizontally. Each point represents one of the 40 bottles.

Suckling vs. Laube for 1985 and 1990 cabernets

I don't know about you, but this does not look too good, to me, in spite of the fact that Marvin Shanken, as the Editor of the article, claimed: "For the most part, our two critics found themselves in much agreement". To me, there is a wide spread of points in the graph — the scores differ by up to 9 points, with 5 of the bottles differing by more than 6 points. Furthermore, the squared correlation between the two sets of scores is only 0.29; that is, only 29% of the variation in one critic's scores can be predicted from the other's.

However, it is worth noting that the average scores from the two critics are almost identical (90.5), with very similar maximum (100 vs. 98) and minimum (both 82) scores. On average, Laube gave slightly higher scores to the California wines than to the Bordeaux wines; and Suckling gave slightly higher scores to the Bordeaux wines than to the California wines.

Now, let's look at what we might expect from critics who do agree. This next graph shows what perfect agreement would look like (the solid line) — for bottles whose points are on this line, the two James perfectly agreed with each other. Clearly, this is only 5 out of the 40 bottles. The Laube score is > the Suckling score 18 times, and 17 times it is the other way around.

Suckling vs. Laube for 1985 and 1990 cabernets

The two dashed lines in the graph show us ±2 points from perfect agreement — for bottles between the two lines, the two James' point scores were within 2 points of each other. This allows for the approximate nature of expert opinions — technically, we are allowing for the fact that the scores are presented with 1-point precision (e.g. 88 vs. 89 points) but the experts cannot actually be 1-point accurate in their assessment.

There are only 23 of the 40 bottles (58%) between the dashed lines. So, even when we allow for the approximate nature of expert opinions, there is not much more agreement here than there is disagreement.
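This within-2-points tally is straightforward to compute. A sketch, again using hypothetical scores standing in for the published data:

```python
# Count how many score pairs fall within 2 points of each other.
# Hypothetical scores, not the published tasting data.
laube    = [95, 88, 92, 85, 90]
suckling = [90, 89, 86, 88, 92]

within_2 = sum(1 for a, b in zip(laube, suckling) if abs(a - b) <= 2)
print(f"{within_2} of {len(laube)} bottles within 2 points")
```

Widening or narrowing the ±2 band is one way to explore how sensitive the agreement figure is to our tolerance for imprecision.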

Another way of dealing with the approximate nature of expert scores is to greatly reduce the number of score categories, so that all the experts need to do to agree is pick the same category. This is the reasoning behind using star scores instead of points (eg. 3 or 5 stars), or word descriptions instead of numbers. The Wine Spectator does it this way:
95 – 100  Classic: a great wine
90 – 94   Outstanding: a wine of superior character and style
85 – 89   Very good: a wine with special qualities
80 – 84   Good: a solid, well-made wine
75 – 79   Mediocre: a drinkable wine that may have minor flaws
50 – 74   Not recommended

So, I have shown this scheme in the third graph. For bottles within the boxes, the two James' point scores agree as to the word categories of wine quality. Once again, this is only 25 of the 40 wines (63%). Thus, even this broad-brush approach to wine-quality assessment provides only two-thirds agreement between the two critics.

Suckling vs. Laube for 1985 and 1990 cabernets
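The Wine Spectator bands translate directly into a small lookup, so category agreement reduces to comparing band labels. A sketch, with hypothetical score pairs rather than the published data:

```python
# Map 100-point scores to Wine Spectator's word categories,
# then check whether two critics land in the same band.
def category(score):
    if score >= 95: return "Classic"
    if score >= 90: return "Outstanding"
    if score >= 85: return "Very good"
    if score >= 80: return "Good"
    if score >= 75: return "Mediocre"
    return "Not recommended"

pairs = [(95, 90), (88, 89), (92, 86), (85, 88)]  # hypothetical scores
agree = sum(1 for a, b in pairs if category(a) == category(b))
print(f"{agree} of {len(pairs)} pairs fall in the same category")
```

Note that two scores only 1 point apart (e.g. 89 vs. 90) can still fall into different categories, which is one weakness of any banding scheme.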

As an aside, it is worth noting the overall low scores given to the wines. Only 17 of the wines scored >90 points, even though they are all quite expensive. The only one of the 40 wines that I have tasted is the 1985 Château Mouton-Rothschild, and I was no more impressed by it than was either of the two James (85 vs. 89 points).

What does this mean for us?

The magazine is presenting their scores as representing some sort of Wine Spectator standard of quality, but clearly this is not an objective standard of quality. The scores are personal (but expert) judgments by their individual critics, who may have very little in common. At issue here is whether quality is an intrinsic property of wine, or whether it is mainly context dependent (see Jamie Goode).

The formal explanation for the degree of disagreement is this: the tasters are not using the same scoring scheme to make their assessments, even though they are expressing those assessments using the same scale. This is not just a minor semantic distinction, but is instead a fundamental and important property of anything expressed mathematically. As an example, it means that when two tasters each produce a score of 85, this does not necessarily imply that they have a similar opinion about the wine; and if one produces 85 points and the other 90, then they do not necessarily differ in their opinion.

This situation is potentially a serious problem for all wine-quality assessments, when the scores represent expert, but subjective, opinions. Scores will look the same because they are written using the same scale, and people will inevitably try to compare them. But, if the scale does not have the same meaning for any given pair of people, then the numbers cannot be validly compared, because they have different meanings.** Not only would we be comparing apples and oranges, we would be comparing different (but unknown) numbers of apples and oranges. What is the point of that?

I will look at the mathematical consequences of this topic in a future post, illustrating the issue with a well-known data set.

Finally, one practical consequence of this mathematical characteristic is clearly being exploited by wine marketers. When I looked up these scores on the web, it quickly became obvious that the wine stores simply choose to report the higher of the two critics' scores when advertising any of the 40 wines, almost never quoting both. This is an interesting example of "cherry picking".

Reproduced from Robert Dwyer at Palate Press

Thanks to Bob Henry for all of his help with this post — he has long championed the use of standardized wine-quality scoring schemes, often in vain.

** As a specific example, here are quotes from each of the two critics. James Suckling: "I was more concerned with the texture and aftertaste of the wines than with their aromatic qualities or flavor characteristics." James Laube: "I like my wines young, rich, highly concentrated and loaded with fruit flavors."