Monday, February 10, 2020

Are wine scores from different reviewers correlated with each other?

I recently noted that Quality scores changed the wine industry, and created confusion. The main source of confusion is that the scores are numbers but they do not have any coherent mathematical properties. This issue does not occur for word evaluations of wine quality, of course.

One possible reaction to this situation has been to deride the scores, or even reject the very concept of scores being useful in the wine industry. For example, back in 2005 Elin McCoy (The Emperor of Wine: The Rise of Robert Parker, Jr. and the Reign of American Taste) expressed the view of many people:
I find scoring wine with numbers a joke in scientific terms and misleading in thinking about either the quality or pleasure of wine, something that turns wine into a contest instead of an experience.
We can thus ask ourselves whether wine-quality scores will play as big a part in the future, after Parker’s retirement, as they have over the past three decades (see After Parker: wine in the ‘Age of Re-Discovery’).

From: The World of Fine Wine Magazine

However, let us suppose for a moment that they will. If we take this road, then we need to evaluate the wine scores themselves, to try to understand how scores from different critics relate to each other. If we are going to use numbers, then those numbers need to be interpretable, preferably without us having to know about the unique facets of each and every wine critic.

That is, we need to ask: is there some common basis to wine scores, a shared scale (see Denton Marks, A critique of wine ratings as psychophysical scaling)? We need this to be so, beyond the trivial knowledge that a higher number is better than a lower one. It would be particularly useful in situations where multiple critics are employed, such as at widely published wine magazines. Surely we can do better than the old fall-back of saying: “Find a critic you like and follow their advice”. After all, we can do that without any numbers at all.

The basic issue when trying to compare critics is finding circumstances under which a direct comparison between them would be fair. Speaking as a professional scientist, I can tell you that it has long been established that, for a comparison to be valid, all of the circumstances need to be identical except for the one characteristic being studied. In the case of wine scores, this means that the critics should ideally be tasting the same wines, at the same time and in the same place.

This does not usually happen. Either different critics are tasting different wines, or they are doing so at very different times (months or years apart) and in very different places (even on different continents). I do know, however, of one situation where the exact same wines do get tasted at almost the same time and place. This is worth looking at.


I have noted before (Wine monopolies, and the availability of wine) that the Swedish wine chain Systembolaget, along with its general assortment of wines, releases small quantities of wines 20 times per year (c. 60-90 products per release). These wines are tasted by various media commentators shortly before their release. So, while these critics are not actually in the same room doing the tasting, this situation may be as close as we can expect to find in practice.

So, in order to address the question posed in this blog post’s title, I will compare the data from 2019 for two of these media sources. I am well aware that comparing only two critics is rather limited, especially as most of you have never heard of either of them. After all, in 2018 Morten Scholer (Coffee and Wine: Two Worlds Compared) listed 44 different sources of 100-point schemes and 18 different 20-point schemes, plus 16 others, none of which were from the modern social media; and neither of the ones discussed here was included.


Both score sources use a 20-point scale. The first source is one I have used before, from Jack Jakobsson at BKWine Magazine. I deleted the data for beer, cider, sake, fortified wines, and spirits, leaving the reds, whites and rosés. The points are provided in 0.5-point increments.


The second source is from Johan Edström at Vinbanken. The scores are reported separately for reds and whites, with rosés included with the whites. The points are usually provided in 0.5-point increments, although occasionally finer divisions appear.

There were 1,034 wines scored by both sources during 2019, with another 20 solely by Vinbanken and 15 solely by BKWine. This makes for a healthy sample size. The direct comparison between them is shown in the first graph. Each point represents one or more wines (depending on how many wines got the same scores), with the Vinbanken score shown horizontally and the BKWine score shown vertically.
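For anyone assembling a similar comparison, the pairing itself is simply an inner join of the two score lists on the wine identity. Here is a sketch in Python, with made-up wine names as placeholders, since the real lists would be matched on whatever identifier the two sources share:

```python
import pandas as pd

# Placeholder score lists; the real data come from the two websites,
# matched on whatever identifier both sources share (e.g. the product number).
bkwine = pd.DataFrame({"wine": ["A", "B", "C", "D"],
                       "bkwine": [15.0, 16.0, 16.5, 17.0]})
vinbanken = pd.DataFrame({"wine": ["B", "C", "D", "E"],
                          "vinbanken": [16.5, 16.5, 17.5, 18.0]})

both = bkwine.merge(vinbanken, on="wine", how="inner")        # wines scored by both
only_vb = vinbanken[~vinbanken["wine"].isin(bkwine["wine"])]  # Vinbanken only
only_bk = bkwine[~bkwine["wine"].isin(vinbanken["wine"])]     # BKWine only
print(len(both), len(only_vb), len(only_bk))                  # 1034, 20 and 15 for 2019
```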

Scatterplot of BKWine versus Vinbanken wine-quality scores

The dashed line is the line of equal point scores for the wines. Clearly, many of the points lie below this line, indicating that the BKWine score is often less than the Vinbanken score for the same wine. Indeed, the average difference is 0.57 points — this is shown by the solid line, which clearly runs through the center of the distribution of points.
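For anyone who wants to repeat this kind of comparison from their own pair of score lists, here is a minimal sketch in Python (the handful of scores in it are placeholders, not the 2019 data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder paired scores for the same wines; substitute the real 2019 lists.
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])

# Average difference between the paired scores (0.57 points for the 2019 data).
mean_diff = np.mean(vinbanken - bkwine)
print(f"Average difference (Vinbanken - BKWine): {mean_diff:.2f} points")

# Scatterplot with the line of equal scores (dashed) and the line
# offset downwards by the average difference (solid).
plt.scatter(vinbanken, bkwine)
lims = [vinbanken.min() - 0.5, vinbanken.max() + 0.5]
plt.plot(lims, lims, linestyle="--", label="equal scores")
plt.plot(lims, [x - mean_diff for x in lims], label="offset by the average difference")
plt.xlabel("Vinbanken score")
plt.ylabel("BKWine score")
plt.legend()
plt.show()
```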

Another way of seeing this same pattern of difference is shown in the next graph, which displays the counts of the difference in points for each wine — Vinbanken score minus BKWine score. It shows that the Vinbanken score varies from 2.5 points less than the equivalent BKWine score to 4 points greater. However, for most of the wines (71%), the scores are either equal or the Vinbanken score is ≤ 1 point greater. So, the evaluations of wine quality are in broad agreement.
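The difference counts, and the 71% figure, come from a simple tabulation of the paired scores; a sketch, again with placeholder numbers rather than the real 1,034 pairs:

```python
import numpy as np
from collections import Counter

# Placeholder paired scores; substitute the real 2019 lists.
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])

# Counts of each 0.5-point difference (Vinbanken score minus BKWine score).
diff = vinbanken - bkwine
print(sorted(Counter(diff.tolist()).items()))

# Proportion of wines where the scores are equal, or the Vinbanken score
# is at most 1 point greater (reported as 71% for the 2019 data).
in_band = np.mean((diff >= 0) & (diff <= 1))
print(f"Equal, or Vinbanken at most 1 point greater: {in_band:.0%}")
```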

Difference between Vinbanken and BKWine wine-quality scores

However, the amount of information shared by the two sets of scores is c. 55% (i.e. R² = 0.55), and the other 45% is unique to one set of scores or the other — quite literally, the glass is both half full and half empty. In one sense this is quite good, because R² values this high are relatively rare for subjective (hedonic) judgements. On the other hand, I suspect that most wine drinkers expect better than this. If critics only half agree, then the consumer may not be much better off with them than without them.
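The shared-information figure is simply the squared Pearson correlation between the two sets of scores; as a sketch (placeholder numbers again):

```python
import numpy as np

# Placeholder paired scores; substitute the real 2019 lists.
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])

# Shared information between the two sets of scores, measured as the squared
# Pearson correlation (c. 0.55 for the 2019 data).
r = np.corrcoef(vinbanken, bkwine)[0, 1]
print(f"R^2 = {r ** 2:.2f}")
```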

Note, also, that the differences in points are more pronounced for smaller point scores — that is, there is more variation in points at the left of the first graph. Indeed, the biggest variation is at 15 Vinbanken points. So, it seems that there is more agreement for the better-quality wines than for the lower-quality ones.
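One way to check this more formally would be to look at the spread of one critic's scores within each score level of the other critic; a sketch, with placeholder data:

```python
import pandas as pd

# Placeholder paired scores; substitute the real 2019 lists.
scores = pd.DataFrame({
    "vinbanken": [15.0, 15.0, 15.0, 16.5, 16.5, 17.5, 17.5],
    "bkwine":    [13.0, 15.0, 16.0, 16.0, 16.5, 17.0, 17.5],
})

# Spread of the BKWine scores at each Vinbanken score level; a larger standard
# deviation at the lower levels indicates less agreement about the lesser wines.
spread = scores.groupby("vinbanken")["bkwine"].agg(["count", "std"])
print(spread)
```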

Finally, it is worth considering the relationship between the assessed quality scores and the prices of the wines (see my prior post The relationship of price to wine-quality scores). Based on the exponential relationship used in my previous posts, the BKWine scores correlate slightly better (54%) with the prices than do the Vinbanken scores (51%).
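The exponential relationship amounts to fitting a straight line to the logarithm of the price. The sketch below shows one common way to do this fit, with placeholder scores and prices rather than the real data:

```python
import numpy as np

# Placeholder scores and prices; substitute the real 2019 data.
score = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])
price = np.array([ 99., 120., 149., 179., 249., 299.])

# Exponential relationship price = a * exp(b * score), fitted as a straight
# line on the log scale: log(price) = log(a) + b * score.
b, log_a = np.polyfit(score, np.log(price), 1)

# Percentage of the variation in log(price) explained by the score.
r = np.corrcoef(score, np.log(price))[0, 1]
print(f"a = {np.exp(log_a):.2f}, b = {b:.3f}, R^2 = {r ** 2:.2f}")
```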

However, this is still the same situation as above (a glass both half full and half empty). Wine prices are only partly associated with wine quality, which means that there are both good-value wines and complete rip-offs. Nevertheless, for one-third of the wines, the “expected” prices based on their assessed quality under the two scoring systems are within $US3 of each other, so that either set of scores could be used to identify wines that are selling for below their assessed worth.
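To make that comparison concrete: for each wine, we can calculate the price implied by each critic's fitted curve, and then ask how often the two implied prices fall within $US3 of each other. A sketch, using placeholder coefficients and scores rather than the real fitted values:

```python
import numpy as np

def expected_price(score, log_a, b):
    """Expected price from a fitted exponential curve: price = a * exp(b * score)."""
    return np.exp(log_a + b * score)

# Placeholder fitted coefficients for the two critics (not the real fits),
# and placeholder paired scores for the same wines.
fit_bkwine    = {"log_a": -1.5, "b": 0.35}
fit_vinbanken = {"log_a": -2.0, "b": 0.37}
bkwine    = np.array([14.5, 15.0, 15.0, 16.5, 16.0, 17.0])
vinbanken = np.array([15.0, 15.5, 16.0, 16.5, 17.0, 17.5])

exp_bk = expected_price(bkwine, **fit_bkwine)
exp_vb = expected_price(vinbanken, **fit_vinbanken)

# Fraction of wines whose two "expected" prices are within $US3 of each other
# (about one-third for the 2019 data).
close = np.mean(np.abs(exp_bk - exp_vb) <= 3)
print(f"Expected prices within $3 of each other: {close:.0%}")
```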

10 comments:

  1. "This issue does not occur for word evaluations of wine quality, of course."
    Words are a horrible way to express wine quality. Terse descriptions ("great" vs. "wonderful") are much less useful than 92 points vs. 94 points. Long-form descriptive paragraphs are even worse as the same text can describe wildly different quality levels and they're unusably inconvenient for consumers.

    Numbers, even with imprecision, are a great way to compare two products at a glance. Consumers see points and stars for almost every product category and don't ponder the ontological dilemma of what it means to be a number. 4.3 stars is better than 4.1 stars and is much better than 3.9 stars. 7.2 IMDB score is better than 6.9 - all things being equal, let's watch the 7.2.

    I just don't see "Confused about difference between 92 and 91 points" showing up much in consumer complaints about wine.

    However, there must be consumer confusion due to easy/hard scorers. For a project I'm working on, I have about a million professional reviews and have to normalize scores (which I bucket on region/price band - close enough for my needs). Of course a Decanter or Suckling is going to score much higher than a Vinous or Burghound. Indeed, it's an intentional act for newer reviewers who understand they profit from publishing high scores.

    Is there a need for an industry-wide metacritic that normalizes these scores? Maybe, but I still haven't seen evidence that ratings are a top concern for consumers.

    PS: Thanks so much for doing this analysis. I love this stuff!

    Replies
    1. Thanks for your comments. Word descriptions of wine have rarely been of help to me, personally, because it is hard to work out what the writer actually means. In that sense, scores are potentially better. The issue, as I see it, is that points are personal, and if I don't know the person then their points don't help, either.

One of the interesting things about today's comparison is that the scores correlate reasonably well, but one lot is half a point higher than the other. That helps me interpret the scores, should I wish to buy any of the wines they recommend.

      One day, I may yet be able to buy wine in Sweden based on the reviewers!

    2. Good morning, Michael.

      From a statistical perspective, a score of “92 points” is no different from a score of “94 points,” given the ± 2 points to ± 4 points margin of error found by winemaker/scientist/statistician/emeritus professor Robert Hodgson.

      Chronicled in the pages of The Wall Street Journal by Caltech lecturer Leonard Mlodinow.

      Excerpts from The Wall Street Journal “Weekend” Section
      (November 20, 2009, Page W6):

      “A Hint of Hype, A Taste of Illusion;
      They pour, sip and, with passion and snobbery, glorify or doom wines.
      But studies say the wine-rating system is badly flawed.
      How the experts fare against a coin toss.”

      URL: http://online.wsj.com/article/SB10001424052748703683804574533840282653628.html

      Essay by Leonard Mlodinow

      ". . . what if the successive judgments of the same wine, by the same wine expert, vary so widely that the ratings and medals on which wines base their reputations are merely a powerful illusion? That is the conclusion reached in two recent papers in the Journal of Wine Economics [by Robert Hodgson when he analyzed the judging at the California State Fair wine competition, 'North America’s oldest and most prestigious.']

      . . .

      "The results astonished Mr. Hodgson. The judges’ wine ratings typically varied by ± 4 points on a standard ratings scale running from 80 to 100. A wine rated 91 on one tasting would often be rated an 87 or 95 on the next. Some of the judges did much worse, and only about one in 10 regularly rated the same wine within a range of ± 2 points."

  2. David writes:

    “The main source of confusion is that the scores are numbers but they do not have any coherent mathematical properties. This issue does not occur for word evaluations of wine quality, of course.”

    Can we jettison the numbers and devise a “Rosetta Stone” to compare the relative quality of wines solely connoted by words?

Each of these U.S. wine reviewers (individuals or publications) uses a scoring scale with six discrete ranges.

    Citing the Wine Spectator [*] (https://www.winespectator.com/articles/scoring-scale):

    95-100: Classic
    90-94: Outstanding
    85-89: Very good
    80-84: Good
    75-79: Mediocre
    50-74: Not recommended

    Citing The Wine Advocate (https://www.robertparker.com/ratings):

    96-100: Extraordinary
    90 - 95: Outstanding
    80 - 89: Barely above average to very good wine
    70 - 79: Average
    60 - 69: Below average
    50 - 59: Unacceptable

    Citing Stephen Tanzer’s International Wine Cellar (https://www.wine-searcher.com/critics-11-stephen+tanzer):

    95–100: Extraordinary
    90–94: Outstanding
    85–89: Very good to excellent
    80–84: Good
    75–79: Average
    70–74: Below average

    Citing the Wine Enthusiast website (https://www.winemag.com/2010/04/09/you-asked-how-is-a-wines-score-determined/):

    98-100: Classic.
    94-97: Superb
    90-93: Excellent
    87-89: Very Good
    83-86: Good
    80-82: Acceptable
    Wines receiving a rating below 80 are not reviewed.

    [Comment continued below.]

  3. Is a wine rated “classic” [95-100] by the Wine Spectator the equivalent of a wine rated “extraordinary” [96-100] by The Wine Advocate?  Equivalent of a wine rated “extraordinary” [95-100] by Stephen Tanzer?  Equivalent to a wine rated “classic” [98-100] by Wine Enthusiast?

    Is a wine rated “outstanding” [90-94] by the Wine Spectator the equivalent of a wine rated “outstanding” [90-95] by The Wine Advocate?  Equivalent of a wine rated “outstanding” [90-94] by Stephen Tanzer?  Equivalent to a wine rated “excellent” [90-93] to even “superb” [94-97] by Wine Enthusiast?

    Is a wine rated “very good” [85-89] by the Wine Spectator the equivalent of a wine rated (at the upper end of the range) “very good” [80-89] by The Wine Advocate?  Equivalent of a wine rated “very good to excellent” [85-89] by Stephen Tanzer?  Equivalent to a wine rated “excellent” [87-89] by Wine Enthusiast?

If we say “yes,” then let’s take these reviewers literally at their word, and match up those words when assessing quality levels across reviewers.

The next step would be to have access to reviews of the same wine bottles from multiple reviewers, and see if there is general congruence or incongruence among the reviewers.

    [Comment continued below.]

    Replies
Differences in the absolute points scale are important, but equally important are potentially non-linear relationships. Maybe "extraordinary" is sometimes equal to "classic" and maybe sometimes it isn't — it may depend on the style of wine, for example. For me, one obvious example is rosé wine, which most reviewers score rather low, in my opinion — I therefore need to adjust my interpretation of the scale accordingly.

To the best of my memory, there has never been a "100-point" score awarded to a dry nonsparkling rosé.

      "Why"?

      Well, citing Robert Parker and his 1989 interview with Wine Times magazine (later rebranded Wine Enthusiast), such wines do not improve with age in the bottle.

      Hence they garner no "bonus" points that place the wine somewhere between 91 points and 100 points.

      Consequently they bump up against a "glass ceiling" of 90 points.

      And yet, we have these words decades later from Robert Parker about tasting the "best example" of a particular wine and having an obligation to award it a "perfect score."

      Is a "perfect score" from Parker 90-points or 100-points?

      (At the time of his 1989 interview, a "perfect score" for cru Beaujolais was "90 points." And he had never given one . . . until a few years later, when he exceeded his 90-point glass ceiling with reviews on the stunning 2009 vintage cru Beaujolaises in the low to mid-90s.)

      EXCERPTS FROM THE DRINKS BUSINESS
      (May 7, 2015):

      "[ROBERT] PARKER [SAYS]: NOT AWARDING 100 POINTS 'IRRESPONSIBLE';
      Wine critics who fail to give perfect scores are 'dodging responsibility' according to the world’s most influential wine reviewer, Robert Parker."

      URL: https://www.thedrinksbusiness.com/2015/05/parker-not-awarding-100-points-irresponsible/

      By Patrick Schmitt

      "During an interview with the drinks business earlier this year, Parker -– who developed the 100-point rating system – expressed his urge to award full marks to great wines, and his dismay at those who don’t.

      "'When, in your mind, the wine is the best example you have ever tasted of this particular wine, you have an obligation to give it a perfect score,' he told db.

      "On the other hand, he branded those who are incapable of awarding a perfect score 'irresponsible'.

      "'I think the person who can’t give 100 is really dodging responsibility, because there’s no way they haven’t tasted a wine that is the best example they have tasted from this producer, the best example they could ever think of.”'

      "He then stated, 'I think it’s irresponsible not to give a perfect score if you think the wine is perfect.'"

  4. *Historical footnote.  The Wine Spectator has revised its 100-point scale over time.

    Citing their June 30, 1994 issue:

    URL: https://backissues.com/cgi-bin/backissues.cgi?mid/WS19940630.JPG

    URL: https://backissues.com/issue/Wine-Spectator-June-30-1994

    95 – 100: Classic
    90 – 94: Outstanding
    80 – 89: Good to very good
    70 – 79: Average
    60 – 69: Below average
    50 – 59: Poor, undrinkable, not recommended

    And now:

    95-100: Classic
    90-94: Outstanding
    85-89: Very good
    80-84: Good
    75-79: Mediocre
    50-74: Not recommended

  5. Yes Bob Henry there is a need for an industry-wide metacritic that normalizes these scores! How great would that be - if they only take reputable critics’ scores, normalize them and then (weighted?) average them. And include it on wine-searcher ⭐️⭐️⭐️⭐️⭐️ (I give that idea 5 stars 😉😉)

  6. Dear Anonymous:

I wrote above: "The next step would be to have access to reviews of the same wine bottles from multiple reviewers, and see if there is general congruence or incongruence among the reviewers."

    I invite you to visit these Wine Gourd blog posts:

    "Laube versus Suckling — their scores differ, but what does that mean for us?"

    URL: http://winegourd.blogspot.com/2018/03/laube-versus-suckling-their-scores.html

    "Laube versus Suckling — do their scores relate to wine price?"

    URL: http://winegourd.blogspot.com/2018/04/laube-versus-suckling-do-their-scores.html

    Two Wine Spectator reviewers sampled in real time from the same bottles of wine -- and came to diametrically opposite conclusions about relative "quality" (personal preference?) on a number of well-regarded / well-reviewed submissions.

    With apologies to Winston Churchill, this was not Wine Spectator magazine's "finest hour."

    It debunked the myth about their 100-point wine scale "methodology" being rigorously adhered to by their reviewers.

    The rating point difference on some wines was stunningly wide.

    (A "one-off" circumstance? No. I invite you to visit this Wine Gourd blog post: "How much difference can there be between critics?"

    URL: http://winegourd.blogspot.com/2019/02/how-much-difference-can-there-be.html)
