The Wine Gourd: Laube versus Suckling — their scores differ, but what does that mean for us?

Monday, March 19, 2018

There seem to be two general attitudes toward professional wine-quality scores. First, they can be seen as the sum of assessments of various sensory "components" of the wine. The classic example of this is the UC Davis 20-point score, which was originally designed to train students in detecting wine faults. This approach has perhaps been taken to its logical extreme in the fascinating book by Clive S. Michelson, Tasting and Grading Wine (2005. JAC International).

The alternative view is that the scores are expert, but subjective, opinions about the quality of the wine. For example, on March 15, 1994, in response to a reader query, the Editor of the Wine Spectator magazine noted:

In brief, our editors do not assign specific values to certain properties of a wine when we score it. We grade it for overall quality as a professor grades an essay test. We look, smell and taste for many different attributes and flaws, then we assign a score based on how much we like the wine overall.

This seems to be the approach adopted by most of the professional media, especially when they use the 100-point scale. Some of them claim to be considering wine components individually (eg. complexity, concentration, balance, texture, length, overall elegance), but there is little evidence of this in their final scores.

I have shown in several blog posts that professional wine commentators do not necessarily provide comparable wine-quality scores when tasting the same wine. This can happen for many reasons, including different expertise, different personal preferences, different wine bottles, and different tasting conditions. This is why we seem to both love and hate wine critics. Let's look at this issue in more detail.

An interesting exercise

To look at variation in wine-quality scores, it helps to eliminate the last two factors listed above (bottles and tasting conditions) by having the scores produced from the same bottle at the same time. This, of course, is what happens at most group wine tastings; but we rarely see the scores of several people at a single tasting published, which is what a direct comparison needs.

However, one pair of commentators where we can do this is James Laube and James Suckling who, at various times, have both provided wine-quality scores to Wine Spectator magazine regarding Cabernet wines, with Laube as the California expert and Suckling as the Bordeaux expert. Suckling has subsequently parted company with the magazine, but Laube remains as their California correspondent.

The dataset I will use here is from the "Cabernet Challenge" of 1996 (see Wine Spectator for September 15, 1996, pp. 32–48), in which the two James tasted 10 California Cabernet blends and 10 Bordeaux red wines from both the 1985 and 1990 vintages. This gives us 40 bottles of wine with which to compare their scores.

The data are shown in the first graph, with Laube's scores vertically and Suckling's horizontally. Each point represents one of the 40 bottles.

Suckling vs. Laube for 1985 and 1990 cabernets

I don't know about you, but this does not look too good, to me, in spite of the fact that Marvin Shanken, as the Editor of the article, claimed: "For the most part, our two critics found themselves in much agreement". To me, there is a wide spread of points in the graph — the scores differ by up to 9 points, with 5 of the bottles differing by more than 6 points. Furthermore, the mathematical correlation indicates only 29% agreement between the two sets of scores.
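For anyone who wants to run this kind of check on their own tasting notes, here is a minimal sketch of the calculation. I am assuming that a figure like "29% agreement" refers to the squared Pearson correlation coefficient (r²), which is a common way of reporting it; that is my reading, not something stated in the Wine Spectator article. The scores in the code are made up for illustration and are not the Cabernet Challenge data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the two means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for illustration only; NOT the Cabernet Challenge data.
critic_a = [88, 92, 95, 85, 90, 93]
critic_b = [90, 89, 94, 91, 86, 95]

r = pearson_r(critic_a, critic_b)
r_squared = r ** 2  # share of the variation in one set of scores shared with the other
print(f"r = {r:.2f}, r squared = {r_squared:.2f}")
```

Note that r² is always smaller in magnitude than r (unless the agreement is perfect), so it matters which of the two a reported percentage refers to.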

However, it is worth noting that the average scores from the two critics are almost identical (90.5), with very similar maximum (100 vs. 98) and minimum (both 82) scores. On average, Laube gave slightly higher scores to the California wines than to the Bordeaux wines; and Suckling gave slightly higher scores to the Bordeaux wines than to the California wines.

Now, let's look at what we might expect from critics who do agree. This next graph shows what perfect agreement would look like (the solid line): for bottles whose points are on this line, the two James agreed perfectly with each other. Clearly, this applies to only 5 of the 40 bottles. Laube's score is higher than Suckling's 18 times, and 17 times it is the other way around.

The two dashed lines in the graph show us ±2 points from perfect agreement — for bottles between the two lines, the two James' point scores were within 2 points of each other. This allows for the approximate nature of expert opinions — technically, we are allowing for the fact that the scores are presented with 1-point precision (eg. 88 vs. 89 points) but the experts cannot actually be 1-point accurate in their assessment.

There are only 23 of the 40 bottles (58%) between the dashed lines. So, even when we allow for the approximate nature of expert opinions, there is not much more agreement here than there is disagreement.
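The counting behind these numbers is simple enough to sketch. The function below tallies exact agreement, which critic scored higher, and how many pairs fall within a chosen tolerance (2 points here, matching the dashed lines). The function name and the scores are my own illustrative choices, not the actual data from the tasting.

```python
def agreement_counts(scores_a, scores_b, tolerance=2):
    """Tally how two critics' scores for the same bottles compare."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return {
        "exact": sum(d == 0 for d in diffs),             # points on the solid line
        "a_higher": sum(d > 0 for d in diffs),           # first critic scored higher
        "b_higher": sum(d < 0 for d in diffs),           # second critic scored higher
        "within_tolerance": sum(abs(d) <= tolerance for d in diffs),  # between the dashed lines
    }


# Hypothetical scores for illustration only; NOT the Cabernet Challenge data.
first_critic = [88, 92, 95, 85, 90, 93]
second_critic = [90, 89, 94, 91, 86, 93]

counts = agreement_counts(first_critic, second_critic)
share_within = counts["within_tolerance"] / len(first_critic)
print(counts, f"{share_within:.0%} within tolerance")
```

The "within_tolerance" count includes the exact matches, just as the 23 bottles between the dashed lines include the 5 that sit on the solid line.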

Another way of dealing with the approximate nature of expert scores is to greatly reduce the number of score categories, so that all the experts need to do to agree is pick the same category. This is the reasoning behind using star scores instead of points (eg. 3 or 5 stars), or word descriptions instead of numbers. The Wine Spectator does it this way:

95 – 100   Classic: a great wine
90 – 94    Outstanding: a wine of superior character and style
85 – 89    Very good: a wine with special qualities
80 – 84    Good: a solid, well-made wine
75 – 79    Mediocre: a drinkable wine that may have minor flaws
50 – 74    Not recommended

So, I have shown this scheme in the third graph. For bottles within the boxes, the two James' point scores agree as to the word categories of wine quality. Once again, this is only 25 of the 40 wines (63%). So, even this broad-brush approach to wine quality assessment provides only two-thirds agreement between the two critics.
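The category comparison can be sketched in the same way, using the Wine Spectator bands listed above. The helper names and the score pairs are hypothetical; treating a score below 50 as an error is my own assumption, since the scale as published starts at 50.

```python
# Lower bound of each Wine Spectator band, from highest to lowest
BANDS = [
    (95, "Classic"),
    (90, "Outstanding"),
    (85, "Very good"),
    (80, "Good"),
    (75, "Mediocre"),
    (50, "Not recommended"),
]


def category(score):
    """Return the word category for a point score on the 50-100 scale."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    raise ValueError(f"score {score} is below the 50-point floor")


# Hypothetical score pairs for illustration only; NOT the Cabernet Challenge data.
pairs = [(89, 90), (90, 94), (96, 98), (84, 85), (87, 86)]

same_category = sum(category(a) == category(b) for a, b in pairs)
print(f"{same_category} of {len(pairs)} pairs fall in the same category")
```

The first two pairs show a quirk of any banded scheme: 89 and 90 differ by a single point but land in different categories, while 90 and 94 differ by four points and land in the same one. So counting category agreement and counting agreement within 2 points are not measuring quite the same thing.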

As an aside, it is worth noting the overall low scores given to the wines. Only 17 of the wines scored >90 points, even though they are all quite expensive. The only one of the 40 wines that I have tasted is the 1985 Château Mouton-Rothschild, and I was no more impressed by it than was either of the two James (85 vs. 89 points).

What does this mean for us?

The magazine is presenting their scores as representing some sort of Wine Spectator standard of quality, but clearly this is not an objective standard of quality. The scores are personal (but expert) judgments by their individual critics, who may have very little in common. At issue here is whether quality is an intrinsic property of wine, or whether it is mainly context dependent (see Jamie Goode).

The formal explanation for the degree of disagreement is this: the tasters are not using the same scoring scheme to make their assessments, even though they are expressing those assessments using the same scale. This is not just a minor semantic distinction, but is instead a fundamental and important property of anything expressed mathematically. As an example, it means that when two tasters produce a score of 85 it does not necessarily imply that they have a similar opinion about the wine; and if one produces 85 points and the other 90 then they do not necessarily differ in their opinion.

This situation is potentially a serious problem for all wine-quality assessments, when the scores represent expert, but subjective, opinions. Scores will look the same because they are written using the same scale, and people will inevitably try to compare them. But, if the scale does not have the same meaning for any given pair of people, then the numbers cannot be validly compared, because they have different meanings.** Not only would we be comparing apples and oranges, we would be comparing different (but unknown) numbers of apples and oranges. What is the point of that?

I will look at the mathematical consequences of this topic in a future post, illustrating the issue with a well-known data set.

Finally, one practical consequence of this mathematical characteristic is clearly being exploited by wine marketers. When looking at these scores on the web, it quickly became obvious that the wine stores are simply choosing to report the higher of the two critics' scores, when advertising any of the 40 wines, almost never producing both scores. This is an interesting example of "cherry picking".

Reproduced from Robert Dwyer at Palate Press

Thanks to Bob Henry for all of his help with this post — he has long championed the use of standardized wine-quality scoring schemes, often in vain.

** As a specific example, here are quotes from each of the two critics. James Suckling: "I was more concerned with the texture and aftertaste of the wines than with their aromatic qualities or flavor characteristics." James Laube: "I like my wines young, rich, highly concentrated and loaded with fruit flavors."

11 comments:

Unknown, March 20, 2018 at 4:20 AM
Hi wine gourd. I’m trying to better understand the vast amount of wine quality data being gathered by Vivino. One thing I’m particularly interested in quantifying is the degree to which the cost of the wine influences people’s opinion of its quality. Do you have any guidance in this regard? Perhaps it’s something you’ve looked at in a previous post? My email is tom at theotherbordeaux dot com dot au in case you wanted to contact me directly. Your guidance appreciated. Tom
Unknown, March 20, 2018 at 6:55 PM
Thanks, David.
Brogie62, March 23, 2018 at 2:55 AM
I have always found that while Jim Laube and I prefer different styles of wine, his descriptions tend to be accurate. I will read one of his descriptions and think: “that sounds good” but he gave it an 82. For other wines, he will describe them in a way that I think doesn’t meet my preferences but give it a 93. In both cases I will taste the wine and find that his descriptions were accurate. That’s why I value the description more than the score. The score only tells you how much they liked it, not how much you will like it.
Bob Henry, March 27, 2018 at 5:15 AM
Let me proffer this quote from Robert Parker on written reviews and tasting notes from a 1989 interview with Wine Times magazine (later rebranded Wine Enthusiast) . . .

WINE TIMES: How is your scoring system different from The Wine Spectator’s?

PARKER: Theirs is really a different animal than mine, though if someone just looks at both of them, they are, quote, two 100-point systems. Theirs, in fact, is advertised as a 100-point system; mine from the very beginning is a 50-point system. If you start at 50 and go to 100, it is clear it’s a 50-point system, and it has always been clear. Mine is basically two 20-point systems with a 10-point cushion on top for wines that have the ability to age. . . .

. . . The newsletter was always meant to be a guide, one person’s opinion. The scoring system was always meant to be an accessory to the WRITTEN REVIEWS, TASTING NOTES. That’s why I use sentences and try and make it interesting. Reading is a lost skill in America. There’s a certain segment of my readers who only look at numbers, but I think it is a much smaller segment than most wine writers would like to believe. The TASTING NOTES are one thing, but in order to communicate effectively and quickly where a wine placed vis-à-vis its peer group, a numerical scale was necessary. If I didn’t do that, it would have been a sort of cop-out.

Bob Henry, March 27, 2018 at 5:17 AM
And let me proffer this 2007 quote from Wine Spectator columnist/reviewer Harvey Steiman . . .

From The San Francisco Chronicle “Food & Wine” Section Letters-to-the-Editor (June 22, 2007, Page Unknown):

“Keeping Score on Ratings”

Link: http://www.sfgate.com/wine/article/LETTERS-TO-WINE-Keeping-score-on-ratings-2572416.php

Editor -- Re [article titled] "Are Ratings Pointless?" (June 15, 2007):

Well, gee, all this fuss (yet again) about the 100-point scale. So it doesn't tell everything you need to know about a wine. Do four stars? Twenty points? No, you have to read the critic's words.

Does it reward subtle wines? That's up to the critic using the scale. Do subtle wines that lack drama ever get four-star ratings from the [San Francisco] Chronicle's team? No, they get two or 2½ stars, which pretty much corresponds to a 100-point score in the 80s.

The main reason I like to use the 100-point scale is that it lets me communicate more to my readers. They can tell that I liked a 90-point wine just a little better than an 89-point wine, but a 94-point wine a lot more than one rated 86. Doesn't that say more than giving both the 90-and 94-point wines three stars and both the 89- and 86-point wines 2½ stars?

(signed)

HARVEY STEIMAN
Editor at Large, Wine Spectator
Bob Henry, March 27, 2018 at 5:28 AM
Let me proffer these quotes from Robert Parker . . .

From The Wine Advocate (unknown issue from 2002):

“. . . [The Wine Advocate] Readers often wonder what a 100-point score means, and the best answer is that it is pure emotion that makes me give a wine 100 instead of 96, 97, 98 or 99.”

How does Robert Parker repeat such a euphoric moment score?

He can’t:

“How often do I go back and re-taste a wine that I gave 100 points and repeat the score? Probably about 50% of the time.”

Source: “Perfection isn’t perfect: Parker says only 50% of his 100-point scores are repeatable,” W. Blake Gray, The Gray Report blog (May 13, 2015)

Link: http://blog.wblakegray.com/2015/05/perfection-isnt-perfect-parker-says.html
Bob Henry, March 27, 2018 at 5:33 AM
The inability to repeat one's score was discussed at length in an essay for The Wall Street Journal by Leonard Mlodinow, a California Institute of Technology lecturer (on randomness), titled “A Hint of Hype, A Taste of Illusion.”

Selective excerpts:

“But what if the successive judgments of the same wine, by the same wine expert, vary so widely that the ratings and [wine competition] medals on which wines base their reputations are merely a powerful illusion? That is the conclusion reached in two recent papers in the Journal of Wine Economics.”

“. . . The [California State Fair Wine Competition] judges’ wine ratings typically varied by ±4 points on a standard ratings scale running from 80 to 100. . . .”

“As a consumer, accepting that one taster’s tobacco and leather is another’s blueberries and currants, that a 91 and a 96 rating are interchangeable, or that a wine winning a gold medal in one competition is likely thrown in the pooper in others presents a challenge. If you ignore the web of medals and ratings, how do you decide where to spend your money?”

Link: http://online.wsj.com/article/SB10001424052748703683804574533840282653628.html
massimo, April 16, 2018 at 5:53 PM
In my restaurant I made clear that I choose the wines because I know the producers and the wine; the magazines have different interests in mind.
The worst part of scoring wines is that we lose the opportunity to enjoy wines for what they are, not because someone says they are great.
Massimo Navarretta
Wine ambassador