Monday, March 19, 2018

Laube versus Suckling — their scores differ, but what does that mean for us?

There seem to be two general attitudes toward professional wine-quality scores. First, they can be seen as the sum of assessments of various sensory "components" of the wine. The classic example of this is the UC Davis 20-point score, which was originally designed to train students in detecting wine faults. This approach has perhaps been taken to its logical extreme in the fascinating book by Clive S. Michelson, Tasting and Grading Wine (2005, JAC International).

The alternative view is that the scores are expert, but subjective, opinions about the quality of the wine. For example, on March 15, 1994, in response to a reader query, the Editor of Wine Spectator magazine noted:
In brief, our editors do not assign specific values to certain properties of a wine when we score it. We grade it for overall quality as a professor grades an essay test. We look, smell and taste for many different attributes and flaws, then we assign a score based on how much we like the wine overall.
This seems to be the approach adopted by most of the professional media, especially when they use the 100-point scale. Some of them claim to be considering wine components individually (e.g. complexity, concentration, balance, texture, length, overall elegance), but there is little evidence of this in their final scores.

James Laube and James Suckling

I have shown in several blog posts that professional wine commentators do not necessarily provide comparable wine-quality scores when tasting the same wine. This can happen for many reasons, including different expertise, different personal preferences, different wine bottles, and different tasting conditions. This is why we seem to both love and hate wine critics. Let's look at this issue in more detail.

An interesting exercise

To look at variation in wine-quality scores, it is of interest to eliminate the last two factors listed above (bottles and tasting conditions), by having the scores produced from the same bottle at the same time. This, of course, is what happens at most group wine tastings; but we rarely see the scores from several people at a single tasting published, so that we can make a direct comparison.

However, one pair of commentators for whom we can do this is James Laube and James Suckling who, at various times, have both provided wine-quality scores to Wine Spectator magazine for Cabernet wines, with Laube as the California expert and Suckling as the Bordeaux expert. Suckling has since parted company with the magazine, but Laube remains as its California correspondent.

The dataset I will use here is from the "Cabernet Challenge" of 1996 (see Wine Spectator for September 15, 1996, pp. 32–48), in which the two James tasted 10 California Cabernet blends and 10 Bordeaux red wines from each of the 1985 and 1990 vintages. This gives us 40 bottles of wine with which to compare their scores.

The data are shown in the first graph, with Laube's scores vertically and Suckling's horizontally. Each point represents one of the 40 bottles.

Suckling vs. Laube for 1985 and 1990 cabernets

I don't know about you, but this does not look too good to me, in spite of the fact that Marvin Shanken, as the Editor of the article, claimed: "For the most part, our two critics found themselves in much agreement". To me, there is a wide spread of points in the graph — the scores differ by up to 9 points, with 5 of the bottles differing by more than 6 points. Furthermore, the correlation between the two sets of scores indicates only 29% agreement.

However, it is worth noting that the average scores from the two critics are almost identical (90.5), with very similar maximum (100 vs. 98) and minimum (both 82) scores. On average, Laube gave slightly higher scores to the California wines than to the Bordeaux wines; and Suckling gave slightly higher scores to the Bordeaux wines than to the California wines.
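
For anyone who wants to repeat these calculations, here is a minimal sketch in Python (3.10+, for statistics.correlation). The score pairs are invented placeholders, since the full table is not reproduced here; substitute the actual 40 (Laube, Suckling) pairs from the magazine article. Note also that the "29% agreement" above is interpreted here as the squared correlation coefficient, which is an assumption on my part.

```python
# A minimal sketch of the summary statistics; placeholder data only.
import statistics

pairs = [(95, 90), (88, 91), (82, 88), (100, 94), (90, 90)]  # placeholders
laube = [l for l, s in pairs]
suckling = [s for l, s in pairs]

print("Laube:    mean %.1f, max %d, min %d"
      % (statistics.mean(laube), max(laube), min(laube)))
print("Suckling: mean %.1f, max %d, min %d"
      % (statistics.mean(suckling), max(suckling), min(suckling)))

# Pearson correlation; its square is the proportion of variation in one
# critic's scores accounted for by the other's (assumed to be what the
# "29% agreement" figure means, as the statistic is not defined above).
r = statistics.correlation(laube, suckling)
print("r = %.2f, r^2 = %.2f" % (r, r * r))
```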

Now, let's look at what we might expect from critics who do agree. This next graph shows what perfect agreement would look like (the solid line) — for bottles whose points are on this line, the two James perfectly agreed with each other. Clearly, this is true for only 5 of the 40 bottles. The Laube score is greater than the Suckling score for 18 of the bottles, and the reverse is true for the other 17.

Suckling vs. Laube for 1985 and 1990 cabernets

The two dashed lines in the graph show us ±2 points from perfect agreement — for bottles between the two lines, the two James' point scores were within 2 points of each other. This allows for the approximate nature of expert opinions — technically, the scores are presented with 1-point precision (e.g. 88 vs. 89 points), but the experts cannot actually be 1-point accurate in their assessments.

There are only 23 of the 40 bottles (58%) between the dashed lines. So, even when we allow for the approximate nature of expert opinions, there is not much more agreement here than there is disagreement.
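
These counts are easy to script. The following sketch, using the same placeholder pairs as before, tallies exact agreement, the direction of the differences, and agreement within the 2-point band.

```python
# Counting the kinds of agreement shown in the second graph.
pairs = [(95, 90), (88, 91), (82, 88), (100, 94), (90, 90)]  # placeholders

exact = sum(1 for l, s in pairs if l == s)                 # on the solid line
laube_higher = sum(1 for l, s in pairs if l > s)           # above the line
suckling_higher = sum(1 for l, s in pairs if l < s)        # below the line
within_band = sum(1 for l, s in pairs if abs(l - s) <= 2)  # inside the dashed band

print("exact agreement: %d of %d" % (exact, len(pairs)))
print("Laube higher: %d; Suckling higher: %d" % (laube_higher, suckling_higher))
print("within 2 points: %d (%.0f%%)"
      % (within_band, 100 * within_band / len(pairs)))
```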

Another way of dealing with the approximate nature of expert scores is to greatly reduce the number of score categories, so that all the experts need to do to agree is pick the same category. This is the reasoning behind using star scores instead of points (e.g. 3 or 5 stars), or word descriptions instead of numbers. The Wine Spectator does it this way:
95 – 100  Classic: a great wine
90 – 94   Outstanding: a wine of superior character and style
85 – 89   Very good: a wine with special qualities
80 – 84   Good: a solid, well-made wine
75 – 79   Mediocre: a drinkable wine that may have minor flaws
50 – 74   Not recommended

I have shown this scheme in the third graph. For bottles within the boxes, the two James' point scores agree as to the word categories of wine quality. Once again, the concordance is limited: only 25 of the 40 wines (63%). So, even this broad-brush approach to wine-quality assessment provides only two-thirds agreement between the two critics.

Suckling vs. Laube for 1985 and 1990 cabernets
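
For completeness, the category comparison can be scripted in the same way, using the Wine Spectator boundaries listed above (placeholder score pairs again stand in for the real data).

```python
# Mapping point scores onto the Wine Spectator word categories, then
# counting bottles for which both critics land in the same category.
def category(score):
    if score >= 95: return "Classic"
    if score >= 90: return "Outstanding"
    if score >= 85: return "Very good"
    if score >= 80: return "Good"
    if score >= 75: return "Mediocre"
    return "Not recommended"

pairs = [(95, 90), (88, 91), (82, 88), (100, 94), (90, 90)]  # placeholders
same = sum(1 for l, s in pairs if category(l) == category(s))
print("same category: %d of %d" % (same, len(pairs)))
```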

As an aside, it is worth noting the generally low scores given to the wines. Only 17 of the 40 wines scored more than 90 points, even though they are all quite expensive. The only one of the 40 wines that I have tasted is the 1985 Château Mouton-Rothschild, and I was no more impressed by it than was either of the two James (85 vs. 89 points).

What does this mean for us?

The magazine presents these scores as representing some sort of Wine Spectator standard of quality, but clearly this is not an objective standard of quality. The scores are personal (but expert) judgments by the individual critics, who may have very little in common. At issue here is whether quality is an intrinsic property of wine, or whether it is mainly context dependent (see Jamie Goode).

The formal explanation for the degree of disagreement is this: the tasters are not using the same scoring scheme to make their assessments, even though they are expressing those assessments using the same scale. This is not just a minor semantic distinction, but is instead a fundamental and important property of anything expressed mathematically. As an example, it means that when two tasters produce a score of 85 it does not necessarily imply that they have a similar opinion about the wine; and if one produces 85 points and the other 90 then they do not necessarily differ in their opinion.
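
A small simulation makes this concrete. Here, two hypothetical critics map the same underlying opinion onto the 100-point scale using different personal calibrations; both mappings are invented purely for illustration.

```python
# Two hypothetical critics: each maps an underlying opinion q
# (0 = dislike, 1 = adore) onto the 100-point scale in their own way.
def critic_a(q):
    return round(88 + 10 * q)   # compresses all opinions into 88-98

def critic_b(q):
    return round(75 + 25 * q)   # spreads opinions over 75-100

# The same underlying opinion yields different scores ...
print(critic_a(0.5), critic_b(0.5))   # 93 88
# ... and the same score can come from quite different opinions.
print(critic_a(0.2), critic_b(0.6))   # 90 90
```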

This situation is potentially a serious problem for all wine-quality assessments, when the scores represent expert, but subjective, opinions. Scores will look the same because they are written using the same scale, and people will inevitably try to compare them. But, if the scale does not have the same meaning for any given pair of people, then the numbers cannot be validly compared, because they have different meanings.** Not only would we be comparing apples and oranges, we would be comparing different (but unknown) numbers of apples and oranges. What is the point of that?

I will look at the mathematical consequences of this topic in a future post, illustrating the issue with a well-known data set.

Finally, one practical consequence of this mathematical characteristic is clearly being exploited by wine marketers. When I looked for these scores on the web, it quickly became obvious that wine stores simply report the higher of the two critics' scores when advertising any of the 40 wines, almost never quoting both. This is an interesting example of "cherry picking".

Reproduced from Robert Dwyer at Palate Press

Thanks to Bob Henry for all of his help with this post — he has long championed the use of standardized wine-quality scoring schemes, often in vain.

** As a specific example, here are quotes from each of the two critics. James Suckling: "I was more concerned with the texture and aftertaste of the wines than with their aromatic qualities or flavor characteristics." James Laube: "I like my wines young, rich, highly concentrated and loaded with fruit flavors."

11 comments:

  1. Hi wine gourd. I’m trying to better understand the vast amount of wine quality data being gathered by Vivino. One thing I’m particularly interested in quantifying is the degree to which the cost of the wine influences people’s opinion of its quality. Do you have any guidance in this regard? Perhaps it’s something you’ve looked at in a previous post? My email is tom at theotherbordeaux dot com dot au in case you wanted to contact me directly. Your guidance appreciated. Tom

    1. Hi Tom. This is an interesting question. I am pretty sure that Vivino has the data to answer it, for every wine in their database. However, they are unlikely to release this information, especially given the problems that Cellar Tracker initially had. These days, this sort of information is commercially valuable, and I believe that several companies are now actively accumulating databases of this type. However, Wine-Searcher might be a suitable place to ask about their data, as they are known to be quite co-operative. /David

  2. I have always found that, while Jim Laube and I prefer different styles of wine, his descriptions tend to be accurate. I will read one of his descriptions and think: "that sounds good", but he gave it an 82. For other wines, he will describe them in a way that I think doesn't match my preferences, but give them a 93. In both cases I will taste the wine and find that his descriptions were accurate. That's why I value the description more than the score. The score only tells you how much they liked it, not how much you will like it.

  3. Let me proffer this quote from Robert Parker on written reviews and tasting notes from a 1989 interview with Wine Times magazine (later rebranded Wine Enthusiast) . . .

    WINE TIMES: How is your scoring system different from The Wine Spectator’s?

    PARKER: Theirs is really a different animal than mine, though if someone just looks at both of them, they are, quote, two 100-point systems. Theirs, in fact, is advertised as a 100-point system; mine from the very beginning is a 50-point system. If you start at 50 and go to 100, it is clear it’s a 50-point system, and it has always been clear. Mine is basically two 20-point systems with a 10-point cushion on top for wines that have the ability to age. . . .

    . . . The newsletter was always meant to be a guide, one person’s opinion. The scoring system was always meant to be an accessory to the WRITTEN REVIEWS, TASTING NOTES. That’s why I use sentences and try and make it interesting. Reading is a lost skill in America. There’s a certain segment of my readers who only look at numbers, but I think it is a much smaller segment than most wine writers would like to believe. The TASTING NOTES are one thing, but in order to communicate effectively and quickly where a wine placed vis-à-vis its peer group, a numerical scale was necessary. If I didn’t do that, it would have been a sort of cop-out.

  4. And let me proffer this 2007 quote from Wine Spectator columnist/reviewer Harvey Steiman . . .

    From The San Francisco Chronicle “Food & Wine” Section Letters-to-the-Editor (June 22, 2007, Page Unknown):

    “Keeping Score on Ratings”

    Link: http://www.sfgate.com/wine/article/LETTERS-TO-WINE-Keeping-score-on-ratings-2572416.php

    Editor -- Re [article titled] "Are Ratings Pointless?" (June 15, 2007):

    Well, gee, all this fuss (yet again) about the 100-point scale. So it doesn't tell everything you need to know about a wine. Do four stars? Twenty points? No, you have to read the critic's words.

    Does it reward subtle wines? That's up to the critic using the scale. Do subtle wines that lack drama ever get four-star ratings from the [San Francisco] Chronicle's team? No, they get two or 2½ stars, which pretty much corresponds to a 100-point score in the 80s.

    The main reason I like to use the 100-point scale is that it lets me communicate more to my readers. They can tell that I liked a 90-point wine just a little better than an 89-point wine, but a 94-point wine a lot more than one rated 86. Doesn't that say more than giving both the 90- and 94-point wines three stars and both the 89- and 86-point wines 2½ stars?

    (signed)

    HARVEY STEIMAN
    Editor at Large, Wine Spectator

  5. Let me proffer these quotes from Robert Parker . . .

    From The Wine Advocate (unknown issue from 2002):

    “. . . [The Wine Advocate] Readers often wonder what a 100-point score means, and the best answer is that it is pure emotion that makes me give a wine 100 instead of 96, 97, 98 or 99.”

    How does Robert Parker repeat such a euphoric moment score?

    He can’t:

    “How often do I go back and re-taste a wine that I gave 100 points and repeat the score? Probably about 50% of the time.”

    Source: “Perfection isn’t perfect: Parker says only 50% of his 100-point scores are repeatable,” W. Blake Gray, The Gray Report blog (May 13, 2015)

    Link: http://blog.wblakegray.com/2015/05/perfection-isnt-perfect-parker-says.html

  6. The inability to repeat one's score was discussed at length in California Institute of Technology lecturer (on randomness) Leonard Mlodinow's essay for The Wall Street Journal, titled “A Hint of Hype, A Taste of Illusion.”

    Selective excerpts:

    “But what if the successive judgments of the same wine, by the same wine expert, vary so widely that the ratings and [wine competition] medals on which wines base their reputations are merely a powerful illusion? That is the conclusion reached in two recent papers in the Journal of Wine Economics.”

    “. . . The [California State Fair Wine Competition] judges’ wine ratings typically varied by ±4 points on a standard ratings scale running from 80 to 100. . . .”

    “As a consumer, accepting that one taster’s tobacco and leather is another’s blueberries and currants, that a 91 and a 96 rating are interchangeable, or that a wine winning a gold medal in one competition is likely thrown in the pooper in others presents a challenge. If you ignore the web of medals and ratings, how do you decide where to spend your money?”

    Link: http://online.wsj.com/article/SB10001424052748703683804574533840282653628.html

    1. Find a critic whose palate agrees with yours, and read the words, not the score. My main challenge is deciding how long to keep a wine before drinking it; in practice, this means tasting a bottle early and deciding when next to try the wine.

    2. Norman:

      Quoting from that same Wine Times 1989 interview:

      PARKER: ". . . My system applies best to young wines because older wines, once they've passed their prime, end up getting lower scores.

      . . .

      "If a vintage can provide pleasure after 4 or 5 years and continue for 25 to 30 years, all the time being drinkable and providing immense satisfaction, that's an extraordinary vintage. If you have to wait 20 years before you can drink the wines and you have basically a 5 or 10 year period to drink them before [the fruit flavors] 'dry out,' it's debatable then whether that's a great vintage.

      "Most people are hung up on wines that are brawny and tannic. One thing I'm certain about in the wine business is that wines are often too tannic. People perceive that all that tannin is going to melt away and this gorgeous fruit will emerge. But that rarely ever happens. The good wines in good vintages not only have the depth but also the precociousness. I used to think some of the softer ones wouldn't last more than a couple of years, but they get more and more interesting. Most California wines are not only overly acidified, but the type of tannins they have in most of their Cabernets -- whether the vines are too immature, the climate is different, whatever -- are too hard, too astringent. And you see that even in the older ones. . . ."

      Better to drink your wines on the young side, rather than see them slip quietly into senescence.

      ~~ Bob

  7. In my restaurant, I make it clear that I choose the wines because I know the producers and the wine; the magazines have different interests in mind.
    The worst part of scoring wines is that we lose the opportunity to enjoy wines for what they are, not because someone says they are great.
    Massimo Navarretta
    wine ambassador
