Monday, February 10, 2020

Are wine scores from different reviewers correlated with each other?

I recently noted that Quality scores changed the wine industry, and created confusion. The main source of confusion is that the scores are numbers but they do not have any coherent mathematical properties. This issue does not occur for word evaluations of wine quality, of course.

One possible reaction to this situation has been to deride the scores, or even reject the very concept of scores being useful in the wine industry. For example, back in 2005 Elin McCoy (The Emperor of Wine: The Rise of Robert Parker, Jr. and the Reign of American Taste) expressed the view of many people:
I find scoring wine with numbers a joke in scientific terms and misleading in thinking about either the quality or pleasure of wine, something that turns wine into a contest instead of an experience.
We can thus ask ourselves whether wine-quality scores will play as big a part in the future, after Parker’s retirement, as they have over the past three decades (see After Parker: wine in the ‘Age of Re-Discovery’).

From: The World of Fine Wine Magazine

However, let us suppose for a moment that they will. If we take this road, then we need to evaluate the wine scores themselves, to try to understand how scores from different critics relate to each other. If we are going to use numbers, then those numbers need to be interpretable, preferably without us having to know about the unique facets of each and every wine critic.

That is, we need to ask: is there some common basis to wine scores, a shared scale (see Denton Marks, A critique of wine ratings as psychophysical scaling)? We need this to be so, beyond the trivial knowledge that a higher number is better than a lower one. It would be particularly useful in situations where multiple critics are employed, such as at widely published wine magazines. Surely we can do better than the old fall-back of saying: “Find a critic you like and follow their advice”. After all, we can do that without any numbers at all.

The basic issue when trying to compare critics is finding circumstances under which a direct comparison between them would be fair. Speaking as a professional scientist, I can tell you that it has long been established that, for a comparison to be valid, all of the circumstances need to be identical except for the one characteristic being studied. In the case of wine scores, this means that the critics should ideally be tasting the same wines, at the same time and in the same place.

This does not usually happen. Either different critics are tasting different wines, or they are doing so at very different times (months or years apart) and in very different places (even on different continents). I do know, however, of one situation where the exact same wines do get tasted at almost the same time and place. This is worth looking at.


I have noted before (Wine monopolies, and the availability of wine) that the Swedish wine chain Systembolaget, along with its general assortment of wines, releases small quantities of wines 20 times per year (c. 60-90 products per release). These wines are tasted by various media commentators shortly before their release. So, while these critics are not actually in the same room doing the tasting, this situation may be as close as we can expect to find in practice.

So, in order to address the question posed in this blog post’s title, I will compare the data from 2019 for two of these media sources. I am well aware that comparing only two critics is rather limited, especially as most of you have never heard of either of them. After all, in 2018 Morten Scholer (Coffee and Wine: Two Worlds Compared) listed 44 different sources of 100-point schemes and 18 different 20-point schemes, plus 16 others, none of which were from the modern social media; and neither of the ones discussed here was included.


Both score sources use a 20-point scale. The first source is one I have used before, from Jack Jakobsson at BKWine Magazine. I deleted the data for beer, cider, sake, fortified wines, and spirits, leaving the reds, whites and rosés. The points are provided in 0.5-point increments.


The second source is from Johan Edström at Vinbanken. The scores are reported separately for reds and whites, with rosés included with the whites. The points are usually provided in 0.5-point increments, although occasionally finer divisions appear.

There were 1,034 wines scored by both sources during 2019, with another 20 solely by Vinbanken and 15 solely by BKWine. This makes for a healthy sample size. The direct comparison between them is shown in the first graph. Each point represents one or more wines (depending on how many wines got the same scores), with the Vinbanken score shown horizontally and the BKWine score shown vertically.
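For anyone assembling a similar comparison, the pairing itself is simply an inner join of the two score lists on the wine identity. Here is a sketch in Python, with made-up wine names as placeholders, since the real lists would be matched on whatever identifier the two sources share:

```python
import pandas as pd

# Placeholder score lists; the real data come from the two websites,
# matched on whatever identifier both sources share (e.g. the product number).
bkwine = pd.DataFrame({"wine": ["A", "B", "C", "D"],
                       "bkwine": [15.0, 16.0, 16.5, 17.0]})
vinbanken = pd.DataFrame({"wine": ["B", "C", "D", "E"],
                          "vinbanken": [16.5, 16.5, 17.5, 18.0]})

both = bkwine.merge(vinbanken, on="wine", how="inner")        # wines scored by both
only_vb = vinbanken[~vinbanken["wine"].isin(bkwine["wine"])]  # Vinbanken only
only_bk = bkwine[~bkwine["wine"].isin(vinbanken["wine"])]     # BKWine only
print(len(both), len(only_vb), len(only_bk))                  # 1034, 20 and 15 for 2019
```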

Scatterplot of BKWine versus Vinbanken wine-quality scores

The dashed line is the line of equal point scores for the wines. Clearly, many of the points lie below this line, indicating that the BKWine score is often less than the Vinbanken score for the same wine. Indeed, the average difference is 0.57 points — this is shown by the solid line, which clearly runs through the center of the distribution of points.
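For anyone who wants to repeat this kind of comparison from their own pair of score lists, here is a minimal sketch in Python (the handful of scores in it are placeholders, not the 2019 data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder paired scores for the same wines; substitute the real 2019 lists.
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])

# Average difference between the paired scores (0.57 points for the 2019 data).
mean_diff = np.mean(vinbanken - bkwine)
print(f"Average difference (Vinbanken - BKWine): {mean_diff:.2f} points")

# Scatterplot with the line of equal scores (dashed) and the line
# offset downwards by the average difference (solid).
plt.scatter(vinbanken, bkwine)
lims = [vinbanken.min() - 0.5, vinbanken.max() + 0.5]
plt.plot(lims, lims, linestyle="--", label="equal scores")
plt.plot(lims, [x - mean_diff for x in lims], label="offset by the average difference")
plt.xlabel("Vinbanken score")
plt.ylabel("BKWine score")
plt.legend()
plt.show()
```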

Another way of seeing this same pattern of difference is shown in the next graph, which displays the counts of the difference in points for each wine — Vinbanken score minus BKWine score. It shows that the Vinbanken score varies from 2.5 points less than the equivalent BKWine score to 4 points greater. However, for most of the wines (71%), the scores are either equal or the Vinbanken score is ≤ 1 point greater. So, the evaluations of wine quality are in broad agreement.
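The difference counts, and the 71% figure, come from a simple tabulation of the paired scores; a sketch, again with placeholder numbers rather than the real 1,034 pairs:

```python
import numpy as np
from collections import Counter

# Placeholder paired scores; substitute the real 2019 lists.
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])

# Counts of each 0.5-point difference (Vinbanken score minus BKWine score).
diff = vinbanken - bkwine
print(sorted(Counter(diff.tolist()).items()))

# Proportion of wines where the scores are equal, or the Vinbanken score
# is at most 1 point greater (reported as 71% for the 2019 data).
in_band = np.mean((diff >= 0) & (diff <= 1))
print(f"Equal, or Vinbanken at most 1 point greater: {in_band:.0%}")
```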

Difference between Vinbanken and BKWine wine-quality scores

However, the amount of information shared by the two sets of scores is c. 55% (i.e. R² = 0.55), and the other 45% is unique to one set of scores or the other — quite literally, the glass is both half full and half empty. In one sense this is quite good, because R² values this high are relatively rare for subjective (hedonic) judgements. On the other hand, I suspect that most wine drinkers expect better than this. If critics only half agree, then the consumer may not be much better off with them than without them.
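The shared-information figure is simply the squared Pearson correlation between the two sets of scores; as a sketch (placeholder numbers again):

```python
import numpy as np

# Placeholder paired scores; substitute the real 2019 lists.
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])

# Shared information between the two sets of scores, measured as the squared
# Pearson correlation (c. 0.55 for the 2019 data).
r = np.corrcoef(vinbanken, bkwine)[0, 1]
print(f"R^2 = {r ** 2:.2f}")
```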

Note, also, that the differences in points are more pronounced for smaller point scores — that is, there is more variation in points at the left of the first graph. Indeed, the biggest variation is at 15 Vinbanken points. So, it seems that there is more agreement for the better-quality wines than for the lower-quality ones.
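One way to check this more formally would be to look at the spread of one critic's scores within each score level of the other critic; a sketch, with placeholder data:

```python
import pandas as pd

# Placeholder paired scores; substitute the real 2019 lists.
scores = pd.DataFrame({
    "vinbanken": [15.0, 15.0, 15.0, 16.5, 16.5, 17.5, 17.5],
    "bkwine":    [13.0, 15.0, 16.0, 16.0, 16.5, 17.0, 17.5],
})

# Spread of the BKWine scores at each Vinbanken score level; a larger standard
# deviation at the lower levels indicates less agreement about the lesser wines.
spread = scores.groupby("vinbanken")["bkwine"].agg(["count", "std"])
print(spread)
```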

Finally, it is worth considering the relationship between the assessed quality scores and the prices of the wines (see my prior post The relationship of price to wine-quality scores). Based on the exponential relationship used in my previous posts, the BKWine scores correlate slightly better (54%) with the prices than do the Vinbanken scores (51%).
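The exponential relationship amounts to fitting a straight line to the logarithm of the price. The sketch below shows one common way to do this fit, with placeholder scores and prices rather than the real data:

```python
import numpy as np

# Placeholder scores and prices; substitute the real 2019 data.
score = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
price = np.array([ 99., 120., 149., 179., 249., 299.])

# Exponential relationship price = a * exp(b * score), fitted as a straight
# line on the log scale: log(price) = log(a) + b * score.
b, log_a = np.polyfit(score, np.log(price), 1)

# Percentage of the variation in log(price) explained by the score.
r = np.corrcoef(score, np.log(price))[0, 1]
print(f"a = {np.exp(log_a):.2f}, b = {b:.3f}, R^2 = {r ** 2:.2f}")
```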

However, this is still the same situation as above (a glass both half full and half empty). Wine prices are only partly associated with wine quality, which means that there are both good-value wines and complete rip-offs. Nevertheless, for one-third of the wines, the “expected” prices based on their assessed quality under the two scoring systems are within $US3 of each other, so that either set of scores could be used to identify wines that are selling for below their assessed worth.
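To make that comparison concrete: for each wine, we can calculate the price implied by each critic's fitted curve, and then ask how often the two implied prices fall within $US3 of each other. A sketch, using placeholder coefficients and scores rather than the real fitted values:

```python
import numpy as np

def expected_price(score, log_a, b):
    """Expected price from a fitted exponential curve: price = a * exp(b * score)."""
    return np.exp(log_a + b * score)

# Placeholder fitted coefficients for the two critics (not the real fits),
# and placeholder paired scores for the same wines.
fit_bkwine    = {"log_a": -1.5, "b": 0.35}
fit_vinbanken = {"log_a": -2.0, "b": 0.37}
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])

exp_bk = expected_price(bkwine, **fit_bkwine)
exp_vb = expected_price(vinbanken, **fit_vinbanken)

# Fraction of wines whose two "expected" prices are within $US3 of each other
# (about one-third for the 2019 data).
close = np.mean(np.abs(exp_bk - exp_vb) <= 3)
print(f"Expected prices within $3 of each other: {close:.0%}")
```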

10 comments:

  1. "This issue does not occur for word evaluations of wine quality, of course."
    Words are a horrible way to express wine quality. Terse descriptions ("great" vs. "wonderful") are much less useful than 92 points vs. 94 points. Long-form descriptive paragraphs are even worse as the same text can describe wildly different quality levels and they're unusably inconvenient for consumers.

    Numbers, even with imprecision, are a great way to compare two products at a glance. Consumers see points and stars for almost every product category and don't ponder the ontological dilemma of what it means to be a number. 4.3 stars is better than 4.1 stars and is much better than 3.9 stars. 7.2 IMDB score is better than 6.9 - all things being equal, let's watch the 7.2.

    I just don't see "Confused about difference between 92 and 91 points" showing up much in consumer complaints about wine.

    However, there must be consumer confusion due to easy/hard scorers. For a project I'm working on, I have about a million professional reviews and have to normalize scores (which I bucket on region/price band - close enough for my needs). Of course a Decanter or Suckling is going to score much higher than a Vinous or Burghound. Indeed, it's an intentional act for newer reviewers who understand they profit from publishing high scores.

    Is there a need for an industry-wide metacritic that normalizes these scores? Maybe, but I still haven't seen evidence that ratings are a top concern for consumers.

    PS: Thanks so much for doing this analysis. I love this stuff!

    Replies
    1. Thanks for your comments. Word descriptions of wine have rarely been of help to me, personally, because it is hard to work out what the writer actually means. In that sense, scores are potentially better. The issue, as I see it, is that points are personal, and if I don't know the person then their points don't help, either.

One of the interesting things about today's comparison is that the scores correlate reasonably well, but one lot is half a point higher than the other. That helps me interpret the scores, should I wish to buy any of the wines they recommend.

      One day, I may yet be able to buy wine in Sweden based on the reviewers!

    2. Good morning, Michael.

      From a statistical perspective, a score of “92 points” is no different from a score of “94 points,” given the ± 2 points to ± 4 points margin of error found by winemaker/scientist/statistician/emeritus professor Robert Hodgson.

      Chronicled in the pages of The Wall Street Journal by Caltech lecturer Leonard Mlodinow.

      Excerpts from The Wall Street Journal “Weekend” Section
      (November 20, 2009, Page W6):

      “A Hint of Hype, A Taste of Illusion;
      They pour, sip and, with passion and snobbery, glorify or doom wines.
      But studies say the wine-rating system is badly flawed.
      How the experts fare against a coin toss.”

      URL: http://online.wsj.com/article/SB10001424052748703683804574533840282653628.html

      Essay by Leonard Mlodinow

      ". . . what if the successive judgments of the same wine, by the same wine expert, vary so widely that the ratings and medals on which wines base their reputations are merely a powerful illusion? That is the conclusion reached in two recent papers in the Journal of Wine Economics [by Robert Hodgson when he analyzed the judging at the California State Fair wine competition, 'North America’s oldest and most prestigious.']

      . . .

      "The results astonished Mr. Hodgson. The judges’ wine ratings typically varied by ± 4 points on a standard ratings scale running from 80 to 100. A wine rated 91 on one tasting would often be rated an 87 or 95 on the next. Some of the judges did much worse, and only about one in 10 regularly rated the same wine within a range of ± 2 points."

  2. David writes:

    “The main source of confusion is that the scores are numbers but they do not have any coherent mathematical properties. This issue does not occur for word evaluations of wine quality, of course.”

    Can we jettison the numbers and devise a “Rosetta Stone” to compare the relative quality of wines solely connoted by words?

Each of these U.S. wine reviewers (individuals or publications) uses a scoring scale with six discrete ranges.

    Citing the Wine Spectator [*] (https://www.winespectator.com/articles/scoring-scale):

    95-100: Classic
    90-94: Outstanding
    85-89: Very good
    80-84: Good
    75-79: Mediocre
    50-74: Not recommended

    Citing The Wine Advocate (https://www.robertparker.com/ratings):

    96-100: Extraordinary
    90 - 95: Outstanding
    80 - 89: Barely above average to very good wine
    70 - 79: Average
    60 - 69: Below average
    50 - 59: Unacceptable

    Citing Stephen Tanzer’s International Wine Cellar (https://www.wine-searcher.com/critics-11-stephen+tanzer):

    95–100: Extraordinary
    90–94: Outstanding
    85–89: Very good to excellent
    80–84: Good
    75–79: Average
    70–74: Below average

    Citing the Wine Enthusiast website (https://www.winemag.com/2010/04/09/you-asked-how-is-a-wines-score-determined/):

    98-100: Classic.
    94-97: Superb
    90-93: Excellent
    87-89: Very Good
    83-86: Good
    80-82: Acceptable
    Wines receiving a rating below 80 are not reviewed.

    [Comment continued below.]

  3. Is a wine rated “classic” [95-100] by the Wine Spectator the equivalent of a wine rated “extraordinary” [96-100] by The Wine Advocate?  Equivalent of a wine rated “extraordinary” [95-100] by Stephen Tanzer?  Equivalent to a wine rated “classic” [98-100] by Wine Enthusiast?

    Is a wine rated “outstanding” [90-94] by the Wine Spectator the equivalent of a wine rated “outstanding” [90-95] by The Wine Advocate?  Equivalent of a wine rated “outstanding” [90-94] by Stephen Tanzer?  Equivalent to a wine rated “excellent” [90-93] to even “superb” [94-97] by Wine Enthusiast?

    Is a wine rated “very good” [85-89] by the Wine Spectator the equivalent of a wine rated (at the upper end of the range) “very good” [80-89] by The Wine Advocate?  Equivalent of a wine rated “very good to excellent” [85-89] by Stephen Tanzer?  Equivalent to a wine rated “excellent” [87-89] by Wine Enthusiast?

If we say “yes,” then let’s take these reviewers literally at their word, and match up those words when assessing quality levels across reviewers.

The next step would be to have access to reviews of the same wine bottles from multiple reviewers, and see if there is general congruence or incongruence among the reviewers.

    [Comment continued below.]

    Replies
Differences in the absolute points scale are important, but equally important are potentially non-linear relationships. Maybe "extraordinary" is sometimes equal to "classic" and maybe sometimes it isn't — it may depend on the style of wine, for example. For me, one obvious example is rosé wine, which most reviewers score rather low, in my opinion — I therefore need to adjust my interpretation of the scale accordingly.

To the best of my memory, there has never been a "100-point" score awarded to a dry nonsparkling rosé.

      "Why"?

      Well, citing Robert Parker and his 1989 interview with Wine Times magazine (later rebranded Wine Enthusiast), such wines do not improve with age in the bottle.

      Hence they garner no "bonus" points that place the wine somewhere between 91 points and 100 points.

      Consequently they bump up against a "glass ceiling" of 90 points.

      And yet, we have these words decades later from Robert Parker about tasting the "best example" of a particular wine and having an obligation to award it a "perfect score."

      Is a "perfect score" from Parker 90-points or 100-points?

      (At the time of his 1989 interview, a "perfect score" for cru Beaujolais was "90 points." And he had never given one . . . until a few years later, when he exceeded his 90-point glass ceiling with reviews on the stunning 2009 vintage cru Beaujolaises in the low to mid-90s.)

      EXCERPTS FROM THE DRINKS BUSINESS
      (May 7, 2015):

      "[ROBERT] PARKER [SAYS]: NOT AWARDING 100 POINTS 'IRRESPONSIBLE';
      Wine critics who fail to give perfect scores are 'dodging responsibility' according to the world’s most influential wine reviewer, Robert Parker."

      URL: https://www.thedrinksbusiness.com/2015/05/parker-not-awarding-100-points-irresponsible/

      By Patrick Schmitt

      "During an interview with the drinks business earlier this year, Parker -– who developed the 100-point rating system – expressed his urge to award full marks to great wines, and his dismay at those who don’t.

      "'When, in your mind, the wine is the best example you have ever tasted of this particular wine, you have an obligation to give it a perfect score,' he told db.

      "On the other hand, he branded those who are incapable of awarding a perfect score 'irresponsible'.

      "'I think the person who can’t give 100 is really dodging responsibility, because there’s no way they haven’t tasted a wine that is the best example they have tasted from this producer, the best example they could ever think of.”'

      "He then stated, 'I think it’s irresponsible not to give a perfect score if you think the wine is perfect.'"

  4. *Historical footnote.  The Wine Spectator has revised its 100-point scale over time.

    Citing their June 30, 1994 issue:

    URL: https://backissues.com/cgi-bin/backissues.cgi?mid/WS19940630.JPG

    URL: https://backissues.com/issue/Wine-Spectator-June-30-1994

    95 – 100: Classic
    90 – 94: Outstanding
    80 – 89: Good to very good
    70 – 79: Average
    60 – 69: Below average
    50 – 59: Poor, undrinkable, not recommended

    And now:

    95-100: Classic
    90-94: Outstanding
    85-89: Very good
    80-84: Good
    75-79: Mediocre
    50-74: Not recommended

  5. Yes Bob Henry there is a need for an industry-wide metacritic that normalizes these scores! How great would that be - if they only take reputable critics’ scores, normalize them and then (weighted?) average them. And include it on wine-searcher ⭐️⭐️⭐️⭐️⭐️ (I give that idea 5 stars 😉😉)

  6. Dear Anonymous:

I wrote above: "The next step would be to have access to reviews of the same wine bottles from multiple reviewers, and see if there is general congruence or incongruence among the reviewers."

    I invite you to visit these Wine Gourd blog posts:

    "Laube versus Suckling — their scores differ, but what does that mean for us?"

    URL: http://winegourd.blogspot.com/2018/03/laube-versus-suckling-their-scores.html

    "Laube versus Suckling — do their scores relate to wine price?"

    URL: http://winegourd.blogspot.com/2018/04/laube-versus-suckling-do-their-scores.html

    Two Wine Spectator reviewers sampled in real time from the same bottles of wine -- and came to diametrically opposite conclusions about relative "quality" (personal preference?) on a number of well-regarded / well-reviewed submissions.

    With apologies to Winston Churchill, this was not Wine Spectator magazine's "finest hour."

    It debunked the myth about their 100-point wine scale "methodology" being rigorously adhered to by their reviewers.

    The rating point difference on some wines was stunningly wide.

    (A "one-off" circumstance? No. I invite you to visit this Wine Gourd blog post: "How much difference can there be between critics?"

    URL: http://winegourd.blogspot.com/2019/02/how-much-difference-can-there-be.html)
