Monday, 16 April 2018

Why comparing wine-quality scores might make no sense

There is no mathematical meaning to comparing wine-quality scores between different critics.

If you do want to compare scores, then it can only validly be done between those scores produced by any one critic (eg. the same critic tasting different wines, or even the same wine on different occasions). There is no mathematical justification for comparing scores between critics (eg. different critics tasting the same wine, even at the same time). That is, quality scores provide a ranking only for the wines tasted by any given critic, nothing more.


Background

Wine-quality scores are an important concept in the wine industry, for several reasons. First, wine critics produce them, and there would be precious little wine writing without them. Second, wine drinkers and buyers use them to help make their wine-purchasing and wine-drinking decisions. Third, marketers use them as an advertising tool, usually along with a lot of flowery words about the wines.

So, these scores are not going away any time soon, no matter how many pundits proclaim their demise. Instead, what we need to do is come to terms with their characteristics, so that we can use them effectively.

To this end, there is actually a small body of professional literature about the vagaries of the range of wine-quality scores that are currently in use; and I have produced several blog posts myself, trying to make sense of what is going on.

Reasoning

I have finally concluded that there are two fundamentally different sorts of wine-quality scores in use: (1) what we might call an objective score, based on explicitly assigning points to a series of pre-defined wine characteristics, and then summing them to get the wine score; and (2) subjective (but expert) scores, where the overall score comes from whatever characteristics the scorer wants to express. There are many variants of these two score types, especially the subjective scores, but for our purposes in this post this variation is not relevant.
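As a minimal sketch of the first type, an objective score can be thought of as a sum of points awarded over a fixed list of characteristics. The category names and maximum points below are hypothetical, chosen only so that they total 20; they are not taken from any real scoring sheet.

```python
# Hypothetical scoring sheet: characteristic -> maximum points (totals 20).
# These categories are illustrative assumptions, not a real published sheet.
MAX_POINTS = {
    "appearance": 4,
    "aroma": 6,
    "taste": 6,
    "overall": 4,
}


def objective_score(points):
    """Sum the points awarded for each pre-defined characteristic."""
    if set(points) != set(MAX_POINTS):
        raise ValueError("every pre-defined characteristic must be scored")
    for name, awarded in points.items():
        if not 0 <= awarded <= MAX_POINTS[name]:
            raise ValueError(f"{name}: {awarded} is outside 0-{MAX_POINTS[name]}")
    return sum(points.values())


# Two invented critics filling in the same sheet for the same wine.
critic_a = {"appearance": 3, "aroma": 5, "taste": 4, "overall": 3}
critic_b = {"appearance": 3, "aroma": 4, "taste": 4, "overall": 3}
score_a = objective_score(critic_a)
score_b = objective_score(critic_b)
```

Because both invented critics fill in the same sheet, the one-point gap between their totals can be traced to a specific characteristic (aroma, in this made-up case).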

What is important, instead, is that these two types of scores should not be confused, although most people still seem to do this — people often refer to "wine scores" as though they are all the same. However, the two types have fundamentally different mathematical behaviors. Their mathematical behavior is of the utmost importance because this is what numbers are all about — if numbers have any meaning then it must be a mathematical meaning, otherwise words would be enough.


So, we need to distinguish between the scoring scheme, which contains the information about wine quality, and the scale, which is the way that the quality is expressed (stars, points, words, etc). Formally, for the objective scores there is a single scoring scheme and a single scale being used to express that scheme (eg. the numbers 1-20 used by the UCDavis quality score). However, for the subjective scores there are many different scoring schemes, even though a single scale is being used to express those schemes (eg. the numbers 50-100 used by the majority of wine critics, as well as by community sites such as Cellar Tracker or Vivino).

This distinction can be illustrated using this 2x2 table:

                  Objective scores       Subjective scores
Scale             one (eg. 20 points)    one (eg. 20 points)
Scoring scheme    one (pre-defined)      many (chosen by critic)

This means that: for the objective points scores, since there is only one scoring scheme, then differences in points always reflect differences in the perceived qualities of the wines; but for the subjective points scores, there is a wide choice of scoring schemes — the scoring scheme can mean anything the person wants it to mean. In both cases, there can be personal choices about the wine quality, but in the subjective case there are also choices about how to interpret the scoring scheme (ie. what it actually means).

Of most importance, then, is that the objective scores can be directly compared between critics, because any difference in score will almost always represent a difference of opinion about the quality of the wine. On the other hand, the subjective scores cannot be compared, because any difference or similarity of the scores could be interpreted as either (i) a difference of opinion about the wine quality or (ii) the use of different scoring schemes. For example, there can be different schemes for reds versus whites, or sweet versus dry wines, or even different grape varieties.

For subjective wine-quality scores, we thus cannot tell what numerical similarity or difference of scores actually means. The same scores could mean different qualities (because the scoring schemes are different), and different scores could mean the same quality (because the scoring schemes are different). How on earth are we to know? We can't!
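A small sketch may make this concrete. It assumes, purely for illustration, that each critic has some perceived quality between 0 and 1 and turns it into points with a personal linear rule; the two critics and their ranges are invented, and real scoring schemes need not be linear at all.

```python
def make_scheme(lowest, highest):
    """Return a hypothetical critic's scheme: perceived quality (0-1) -> points."""
    def scheme(quality):
        return round(lowest + quality * (highest - lowest))
    return scheme


# Two invented critics who share the 100-point scale but use it differently.
critic_a = make_scheme(80, 100)  # rarely scores below 80
critic_b = make_scheme(50, 100)  # uses the whole 50-100 range

# Same score, different perceived quality.
same_score = (critic_a(0.5), critic_b(0.8))

# Same perceived quality, different scores.
same_quality = (critic_a(0.5), critic_b(0.5))
```

Here both critics print 90 points for wines they rate quite differently, and they print 90 and 75 for a wine they rate identically. A reader who sees only the numbers, and not the schemes behind them, has no way of telling these cases apart.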

I have listed some of my previous posts at the bottom of this page, which provide illustrative examples of just how many different scoring schemes there are among critics, even when they are ostensibly using the same scale.


Finally, note that it is the combination of a value judgment with a variable scoring system that is the issue. Variable scoring systems on their own are not problematic, provided they are measuring an objective phenomenon. For example, if we are measuring the length of something, then it does not matter whether we use yards, meters or cubits, because the length itself will be the same in all three cases, and we are just describing this using different units. But wine quality is not an objective phenomenon in this same sense; it is to a large extent a value judgment, and this creates the problem. Different scores may mean different judgments or they may mean different scoring schemes.

Conclusions

Harvey Steiman (Editor at Large, Wine Spectator) once wrote (Are ratings pointless? June 15, 2007):
The main reason I like to use the 100-point scale is that it lets me communicate more to my readers. They can tell that I liked a 90-point wine just a little better than an 89-point wine, but a 94-point wine a lot more than one rated 86.
And that ranking is all the score does — we cannot compare Mr Steiman's numbers to anyone else's numbers. This is a pity.

However, the average quality score produced by community sites like Cellar Tracker might possibly have some meaning, but only if it is an average of enough scores. I have no idea what "enough" would be in this case, but it has to be a large enough set of scores to "average out" the fact that the many people producing the scores may all mean different things. If the individual scores vary randomly about some average value (as is likely), then calculating an average score will, indeed, address the issue. But an average derived from a small number of scores is itself subject to random variation, although this decreases as the sample size increases. Trying to work out the required sample size might be the topic of another blog post.
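As a rough illustration of how that random variation shrinks, the textbook standard error of a mean is the spread of the individual scores divided by the square root of their number. The sketch below assumes the scores are independent and scatter around a common average; the 3-point spread is an invented figure, not an estimate from any real site.

```python
import math


def standard_error(score_sd, n_scores):
    """Standard error of a mean of n independent scores with spread score_sd."""
    if n_scores < 1:
        raise ValueError("need at least one score")
    return score_sd / math.sqrt(n_scores)


# Assumed spread of individual scores, in points (an invented figure).
ASSUMED_SD = 3.0

for n in (4, 25, 100, 400):
    se = standard_error(ASSUMED_SD, n)
    print(f"{n:4d} scores: average uncertain by about {se:.2f} points")
```

Under these assumptions, quadrupling the number of scores halves the uncertainty in the average. This does not by itself say what "enough" is, since that depends on how precise the average needs to be and on how well the assumptions hold.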

Moreover, as I have emphasized, if a consistent scoring scheme is used (ie. an objective score), then the scores can naturally be compared among tasters. I have done this comparison, for example, when I have used data from the Vintners Club, which employs the standard UC Davis 20-point scoring system for its tastings (see the list of posts below). Here, averaging the scores does, indeed, make perfect mathematical sense, because all of the scores are based on the same scoring scheme — differences in scores can only mean differences of opinion about wine quality.

Finally, there are a number of contributions to the professional literature that cover the implications of this topic for wine competitions; but I will cover that in another post.

Previous blog posts illustrating the differences between scoring schemes
Previous blog posts using an objective scoring scheme

1 comment:

  1. Scoring is the patrimony of the reviewer. It can never be accurately compared to scoring of other reviewers. That said, it doesn't really matter, because consumers find a reviewer who aligns with their palate and everything is good in the hood.
