Monday, January 14, 2019

The fundamental problem with wine scores

I have written several blog posts about wine-quality scores, pointing out that even though they are expressed as numbers they do not have many useful mathematical properties; and, to me, a score with no mathematical meaning is like trying to construct a Swedish sentence by knowing the words but not the grammar. However, what I have not done, until now, is point out the fundamental issue that leads to this situation in the first place. That is, I have previously pointed out effects, but not causes.

Before proceeding to discuss the cause, however, I will point out that many wine commentators seem to treat wine scores as nothing more than a convenient way to express their own personal preferences (i.e. increasing score indicates increasing preference). Under these circumstances the scores have nothing to do with mathematics, at all. Preferences could just as easily be expressed with words; and in this case they probably should be. They certainly used to be, before the 1990s, and for some commentators they still are.


The basic issue

Put formally, wine scores represent multidimensional properties that have been summarized as a single point in one dimension.

Sounds good, doesn't it? Let's put it another way: the single wine-quality number is trying to do too many things all at once.

Whenever a critic tells us how they construct their scoring scheme, they usually list a series of characteristics of wines that purportedly contribute to quality (mainly based on color, aroma, palate and body). Formally, each of these characteristics is a "dimension" of any given wine's quality.

Here is an example, taken from Steve Charters and Simone Pettigrew (2007. The dimensions of wine quality. Food Quality and Preference 18: 997-1007).

The dimensions of wine quality

In terms of quality, most commentators are interested solely in the intrinsic dimensions. However, in order to describe a wine mathematically, we would need a number for each of these intrinsic dimensions. Given this collection of numbers, we would then have a complete description of any given wine's quality.

The situation

As a prime example, take the original UCDavis wine scoring system, which covers the score range 0-20.** The characteristics of quality and their associated numbers are:
Dimension            Score
Appearance             2
Color                  2
Aroma & bouquet        4
Volatile acidity       2
Total acidity          2
Sweetness              1
Body                   1
Flavor                 2
Bitterness             2
General quality        2

There are 10 dimensions here, and we need all 10 numbers to completely describe any given wine's quality. That is, wine quality is multi-dimensional, and we need to "see" all of those dimensions in order to evaluate the wine.

However, rather than doing this, the UCDavis system summarizes the wine down to a single number — in this case, we add the numbers for each dimension, to get a score out of 20. That is, we reduce the multi-dimensional idea of quality down to a single point in one dimension — that dimension simply goes from 0 to 20, and the point on that dimension is the quality score.

The ensuing problem

The problem that arises from this situation actually applies any time we reduce a multi-dimensional concept down to a single dimension. I encountered this issue many times in my professional life as an environmental and evolutionary biologist,* so there is nothing unique about the situation as it arises in wine commentary.

The problem is this: many quite-different wines could end up with the same final score. Summarizing a set of numbers down to a single number must, by definition, lose most of the numerical information (the multiple dimensions become one dimension only). If a wine gets a score of 0, then we know the score for each dimension (it must be 0 in each case), and we have lost no information. The same applies for a wine that scores 20, as this must mean that the wine got the maximum score for each dimension. But for all other scores the situation is ambiguous.
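To make the ambiguity concrete, here is a small Python sketch (the function name and the brute-force approach are mine) that counts how many distinct score vectors the UCDavis card allows for any given total:

```python
from itertools import product

# Per-dimension maximum scores from the UCDavis 20-point card
# (Appearance, Color, Aroma & bouquet, Volatile acidity, Total acidity,
#  Sweetness, Body, Flavor, Bitterness, General quality).
MAXIMA = [2, 2, 4, 2, 2, 1, 1, 2, 2, 2]

def count_score_vectors(total, maxima=MAXIMA):
    """Count the distinct per-dimension score vectors that sum to `total`."""
    ranges = [range(m + 1) for m in maxima]
    return sum(1 for vector in product(*ranges) if sum(vector) == total)

# Only the extremes are unambiguous: exactly one vector each.
print(count_score_vectors(0))   # 1
print(count_score_vectors(20))  # 1
# Every intermediate total can be produced by many different vectors.
print(count_score_vectors(15))
```

Only 0 and 20 map back to a single vector; every score in between can be produced by many different combinations of the dimension scores.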

Consider these two wines, which I have described using the 10 UCDavis dimensions listed above:
2 + 2 + 2 + 2 + 2 + 0 + 1 + 1 + 2 + 1 = 15
2 + 2 + 4 + 1 + 1 + 1 + 1 + 2 + 0 + 1 = 15
These would be two very different wines; but I would never know it from the final quality score.
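For anyone who wants to check this, a minimal Python sketch (the vectors simply follow the order of the table above):

```python
# The two hypothetical wines above, scored on the ten UCDavis dimensions
# in the order listed in the table.
wine_1 = [2, 2, 2, 2, 2, 0, 1, 1, 2, 1]
wine_2 = [2, 2, 4, 1, 1, 1, 1, 2, 0, 1]

print(sum(wine_1), sum(wine_2))  # 15 15

# Yet the two wines disagree on most of the dimensions.
differences = sum(a != b for a, b in zip(wine_1, wine_2))
print(differences)  # 6
```

The two wines agree on only four of the ten dimensions, yet their totals are identical.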

So, you should now see why wine quality scores have a fundamental problem, if we try to treat them as mathematical concepts: how do we interpret the quality score? We have no way of knowing what the score represents in terms of the multi-dimensional concept of wine quality. Two identical scores could easily represent two very different wines.


A problem for all ratings systems

The problem discussed here is general. All ratings systems are one-dimensional, while the data on which they are based are multi-dimensional. A linear rating system is combining quite different characteristics, and there is no way to combine multiple features into a single number that preserves their separate meanings. That is, when we look at the final rating score we cannot tell which characteristics were important in producing it.

Take this simple situation, where value for money has two dimensions, quality and price:
A (high quality) a (expensive)
A (high quality) b (inexpensive)
B (low quality)  a (expensive)
B (low quality)  b (inexpensive)
How could I sensibly put these four groups in a single order based on value for money? We know which group is likely to be the best value for money (Ab), and we might put this at the top; and we know which is the worst value for money (Ba), and we might put this at the bottom; but what do we do with Aa and Bb in terms of value for money? If we did put them in some order, we would be doing so solely for the sake of doing so, not because it would be informative.

We have two totally different criteria, and combining them vitiates any attempt at a single order. The only system that would make sense would be multi-dimensional. That is, we should keep the ratings as Aa, Ab, Ba and Bb — the categories would thus have meaning even though their order does not.
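This partial order can be made explicit in code. Here is a minimal Python sketch (the numerical encodings are mine, for illustration only): one group dominates another only if it is at least as good on every axis.

```python
# Encode the four quality/price groups as (quality, cheapness):
# higher is better on both axes. These encodings are illustrative only.
groups = {
    "Ab": (1, 1),  # high quality, inexpensive
    "Aa": (1, 0),  # high quality, expensive
    "Bb": (0, 1),  # low quality, inexpensive
    "Ba": (0, 0),  # low quality, expensive
}

def dominates(x, y):
    """x dominates y if it is at least as good on every axis and not equal."""
    return all(a >= b for a, b in zip(x, y)) and x != y

# Ab beats everything, and Ba loses to everything...
print(all(dominates(groups["Ab"], g) for k, g in groups.items() if k != "Ab"))

# ...but Aa and Bb cannot be ordered either way.
print(dominates(groups["Aa"], groups["Bb"]), dominates(groups["Bb"], groups["Aa"]))
```

Neither Aa nor Bb dominates the other, so any single ranking of the four groups is arbitrary.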

This is very similar to America's Got Talent, where the judges are trying to compare a magician with a pole dancer, and deciding which is "better". Better at what? Both of them are very good with their hands, but in very different ways! No wonder most of these shows worldwide end up being won by singers.

Wine shows

So, the issue for wine-quality ratings should now be clear. The ratings are based on trying to combine a series of different characteristics, some of which are very different from each other.

This explains why a wine can win a gold medal at one show and nothing at all at the next. The judges at each show combined the different quality dimensions in different ways when deciding which wine is best; and that is all that the wine shows tell us.

The wine shows try to alleviate the problem a bit, by having a lot of different categories, based on all sorts of features (grape variety, wine style, vintage age, etc). This certainly helps, but it brings us back to the same problem of comparing two bottles of wine based on a series of vinous characteristics that are very hard to combine into a single number. And this approach certainly does not help at all with "best wine in show" awards.

A solution?

I have discussed multi-dimensional data previously in this blog. I pointed out at the time that, if we are going to take the numbers seriously, then we actually need to draw graphs of them, not reduce them to a single number:
Summarizing multi-dimensional wine data as graphs, Part 1: ordinations
Summarizing multi-dimensional wine data as graphs, Part 2: networks
It is difficult to see the wine-buying public going for this solution, but I might discuss it in a future post.
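For the curious, here is a rough sketch of what an ordination involves, using principal components computed with NumPy (the wine score vectors are made up for illustration): each wine becomes a point on a two-dimensional map, rather than a single number on a line.

```python
import numpy as np

# Five made-up wines scored on the ten UCDavis dimensions (rows = wines).
scores = np.array([
    [2, 2, 2, 2, 2, 0, 1, 1, 2, 1],
    [2, 2, 4, 1, 1, 1, 1, 2, 0, 1],
    [1, 1, 3, 2, 2, 1, 0, 2, 1, 2],
    [2, 0, 4, 0, 2, 1, 1, 1, 2, 2],
    [0, 2, 2, 2, 1, 1, 1, 2, 2, 2],
])

# A basic ordination: centre the data and project onto the first two
# principal components, so each wine becomes an (x, y) point on a map
# instead of a single number on a line.
centred = scores - scores.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
coords = centred @ vt[:2].T  # shape (5, 2): one point per wine

for wine, (x, y) in zip("ABCDE", coords):
    print(f"wine {wine}: ({x:+.2f}, {y:+.2f})")
```

Wines that are similar across all ten dimensions end up close together on the map, which is exactly the information that a single summed score throws away.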

An alternative solution?

It has sometimes been claimed that a wine score is not a number, but is more like an adjective. Well, it sure looks like a number to me, so this simply exacerbates the problem. If it is an adjective then it should be a word, not a number. I will discuss this in my next post, but as a preview: it still takes multiple words to describe all aspects of a wine's quality, and summarizing this in a word or two does not change anything — we are still summarizing multiple dimensions (expressed as words, this time) into one dimension (a small set of words).



* For example, in ecology Species Diversity is measured as a combination of two dimensions: (1) a count of the number of species, and (2) the abundance of each species. These two concepts are combined into a single number.
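One common such combination is the Shannon index, and it shows the same loss of information. A minimal Python sketch (the example communities are invented):

```python
import math

def shannon(abundances):
    """Shannon diversity H = -sum(p * ln p) over relative abundances p."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)

even_3 = [10, 10, 10]        # 3 species, perfectly even
uneven_4 = [70, 10, 10, 10]  # 4 species, one strongly dominant

print(round(shannon(even_3), 3))  # ln(3) ≈ 1.099
print(round(shannon(uneven_4), 3))
# The 4-species community gets the *lower* single number, even though it
# has more species: richness and evenness have been merged into one value.
```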

** Here is a more detailed overview of the UCDavis scoring scheme, taken from George Vierra (A better wine scorecard?).


2 comments:

  1. Many readers of this blog may not be able to access this article.

    A sidebar exhibit excerpt:

    "A Better Wine Scorecard?;
    Napa Valley College's new wine scoring system objectively analyzes wine while also allowing for relevant notes on wine style, character, aging, cost and where the wine can be purchased."

    By George Vierra

    "Napa Valley College 25-Point Scorecard"

    APPEARANCE (3 total)
    Clarity (cloudy - clear - brilliant)
    Color (hue) ___________
    Color (depth) ___________
    Other ___________

    ODOR (7 total)
    First impression ___________
    Second impression ___________
    Odor intensity ___________
    Off characters ___________

    TASTE (15 total)
    First impression ___________
    Middle of palate ___________
    Finish ___________
    Aroma in mouth ___________
    Aftersmell ___________
    Duration of aroma and taste ___
    Taste intensity ___________

    FINAL SCORE (25 total) ____
    25-Perfect; 24-23 Excellent; 22-21 Very High Quality; 20-19 Very Good; 18-16 Good; 15-9 Ordinary;
    8-6 Below Average; 5-3 Bad; 2-0 Very Bad

    FINAL PRAISES

    THUMBS DOWN? (If yes, why?) CONCLUDING REMARKS
    Wine style (table or social?) ___
    Drinkable for how long? ________
    Serve with ________
    Serving temperature ________
    Where to buy ________
    Price ________
    Value (great value, fair, bit dear, overpriced) ________

    Taster: ____________________
    Date:______________________

  2. For a discussion of how the movie review ratings system came about, see this article:

    Excerpt from The Wall Street Journal “Main News” Section
    (January 23, 2009, Page A12):

    “Let’s Rate the Ranking Systems of Film Reviews;
    The Stars, Grades and Thumbs Applied to Movies Suffer From Lackluster Performance, Low Production Values”

URL: https://www.wsj.com/articles/SB123265679206407369

    By Carl Bialik
    “The Numbers Guy” Column

    "More than 80 years ago [now, almost 90 years ~~ Bob], Hollywood's star system was born -- not the studio machine for building franchises around actors, but the method of rating movies with a certain number of stars.

    "The first appearance may have been on July 31, 1928, in the New York Daily News, which several critics and film historians remember as the pioneer in the field of quantifying movies' merits. The one-star review of 'The Port of Missing Girls' launched the star system, which the newspaper promised would be 'a permanent thing.' Three stars meant 'excellent,' two 'good,' and one star meant 'mediocre.' And no stars at all 'means the picture's right bad,' the News's Irene Thirer wrote.

    "Today, the star system is ubiquitous but far from simple for critics who must fit an Oscar hopeful and a low-ambition horror movie on the same scale. Even those critics who don't assign stars or grades find their carefully wrought opinions converted into numbers -- or a thumbs up or thumbs down -- and mashed together with other critics' opinions. Critics tend to loathe the system and succumb to it at the same time. It all makes for an odd scale that, under the veneer of objective numerical measurement, is really just an apples-to-oranges mess. . . ."
