Monday, November 30, 2020

Are CellarTracker wine scores meaningful?

I have written before about the quality points awarded to wine by many critics (Quality scores changed the wine industry, and created confusion). This sort of thing is not unusual in the modern world, where just about every human endeavor is rated by someone, somewhere: everything from restaurants and hotels to books, music and holidays. The only thing that is not rated is the customers, which may explain a lot of the problems.

All of these scores are assigned by individuals, and thus represent a single opinion, whether by an expert or not. However, particularly in the modern world, there are now commercial groups that aggregate what might loosely be called user ratings, to provide some sort of consensus score. The important word there is "consensus", which is a fairly nebulous concept, but which needs to be made concrete if a combined score is going to be reported.

This is the topic of this post. CellarTracker, as but one example, aggregates scores provided by wine drinkers, and provides a consensus score. So, the blog post title translates as: Does the concept of a consensus rating make much sense for wines?


There are many types of ratings, and also many types of consensus. Ratings can be quite detailed, such as scores out of 100 or 20, or relatively simple, such as 3 or 5 stars, or even binary, such as approve vs. disapprove. How we then get a quantitative consensus out of these ratings is the step being examined here.

For the third type of rating, the usual consensus is simply to report the percentage of raters who approve (or disapprove). For example, in the world of politics, elections are reported as the percentage of voters who chose each candidate, and a "winner" is then declared. Does this approach actually make sense? After all, in a three-candidate race, a winner can be elected even if more than half of the people disapprove of them (say, a 40% vs. 35% vs. 25% split). Indeed, in situations where voting is not compulsory, it is commonly reported that only c. 40% of the people actually vote, which does suggest that 60% of people think that the consensus approach is nonsense. It is tempting to see this majority of the people as being the sensible group.

For more complex situations, the consensus is usually reported as some sort of mathematical "average" score. How do we go about combining ratings into a consensus? Does it even make sense to combine user opinions, or expert judgements, in this way? What on earth does such a mathematical consensus represent?

Would I really proceed to work out what color the sky is by trying to combine people's opinions on the matter? Would that really make the sky the same as the calculated consensus color? Or would the consensus be some sort of wishy-washy middle result that does not represent any sky ever observed?

Obviously, I could work out the light's wavelength on the standard color scale, and make lots of such measurements, and then calculate their average. That might make some sense. But that is not what we are doing when we average people's opinions about wines, or about almost anything else.

I think that a reasonable case can be made that individual opinions about wines do matter, but a consensus of those opinions does not mean much at all. These aggregation sites are likely to be wasting our time, as well as their own.


Leaving aside these philosophical questions, we could, of course, proceed by simply calculating an average rating of whatever product we are interested in. Mathematically, there are actually three quite different ways to calculate an average: the mean, the median, and the mode. You are all familiar with the mean, since it is far and away the most common calculation method when people refer to an average.

However, the median (which is the middle value when the scores are arranged in increasing order) makes much more sense as a consensus. This situation is explicitly acknowledged in economics, where a typical salary, for example, is usually reported as the median. This takes into account the obvious bias introduced by billionaires and the like. If Bill Gates walks into the room, then the mean salary in the room would make us all millionaires, but the median would not change much, if at all.
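As a minimal sketch of this Bill Gates effect (the salary figures are invented purely for illustration), Python's standard statistics module shows how a single extreme value drags the mean while barely moving the median:

    from statistics import mean, median

    # Five ordinary salaries, invented for illustration
    salaries = [45_000, 52_000, 58_000, 61_000, 75_000]

    print(f"mean:   {mean(salaries):>14,.0f}")    # 58,200
    print(f"median: {median(salaries):>14,.0f}")  # 58,000

    # A billionaire walks into the room
    salaries.append(100_000_000_000)

    print(f"mean:   {mean(salaries):>14,.0f}")    # 16,666,715,167 (we are all "rich" now)
    print(f"median: {median(salaries):>14,.0f}")  # 59,500 (barely moved)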

Is the median used by sites like Amazon and CellarTracker, who provide us with consensus ratings? Not usually, although CellarTracker actually does (as illustrated above). We usually get the mean, instead, which is easily distorted by a handful of extreme scores.

The mode can also make much more sense than the mean, as a consensus score. This is defined as the most commonly reported score among the ratings being collated. The problem here is that you need a lot of data to actually calculate the mode; and so we rarely see it, in practice.
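To make the three calculations concrete, here is a sketch over a small, made-up batch of CellarTracker-style scores out of 100. Note that, with only ten scores, the mode rests on just three votes, which is exactly the data-hunger problem just described:

    from statistics import mean, median, multimode

    # A made-up batch of scores for a single wine
    scores = [88, 89, 90, 90, 90, 91, 91, 92, 93, 96]

    print(f"mean:   {mean(scores):.1f}")   # 91.0
    print(f"median: {median(scores):.1f}") # 90.5
    print(f"mode:   {multimode(scores)}")  # [90], the most commonly given score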


For ratings using a small number of stars or the like, a better approach would be to simply report all of the individual ratings, and that is what many aggregators actually do. This focuses attention on the individual ratings, not on their consensus.

However, the individual ratings only make much sense if they are accompanied by written comments. Sadly, even then we may not get much information. After all, we all know what a 5-star review is going to say — the person loved the holiday, or whatever else they are rating. We also know what a 1-star review is going to say, although, rather bizarrely, this will combine the true 1-star ratings with the 0-star ratings, since most rating scales offer no 0-star option — why do we live in a world that pools these two quite different things together?

So, the real information resides in the comments from the 3-star and 4-star reviews, where something was not quite right, and we need to work out whether this particular thing would affect us or not. Someone else's dislike might not be our own, after all.

I well remember once trying to find a place for my wife and me to stay in Sicily. One place I looked at had an average rating (3 stars), but when I looked at the comments I realized that this was a perfect example of why a consensus rating means nothing. Half the reviews said that the place was great and the other half said to avoid it completely. So, the average of "good" and "poor" is (mathematically) "okay". Not likely, I say.
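As a sketch of that hotel's arithmetic (the star counts are invented), a perfectly split set of ratings averages out to a middling score that not a single reviewer actually gave; only the full tally reveals the real story:

    from collections import Counter
    from statistics import mean, median, multimode

    # Half the reviewers loved the place, half said to avoid it
    stars = [5, 5, 5, 5, 5, 1, 1, 1, 1, 1]

    print(f"mean:   {mean(stars):.1f}")   # 3.0, "okay" according to the average
    print(f"median: {median(stars):.1f}") # 3.0, no better
    print(f"modes:  {multimode(stars)}")  # [5, 1], a bimodal split
    print(f"tally:  {Counter(stars)}")    # Counter({5: 5, 1: 5})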

After reading the user comments, I realized that the issue was that the good reviews all occurred when the manager was on site, and the poor reviews occurred when the manager was absent — when the cat is away ...

Anyway, we did not stay there, because I could hardly ring up and ask if the manager was going to be there, just for my holiday. So, we chose to stay somewhere else (shown above); and it turned out to be one of the most memorable experiences of our trip. We were the only people there that night; and, in Sicily, mamma's home cooking leaves every Michelin-starred restaurant for dead. I can still remember it all.

So, what is the problem here? Well, the words in most short wine reviews are piffle. Indeed, winespeak is frequently laughable. User comments are more useful than points, except in the wine industry, which is rather a sad thing to say. The most useful information the reviews could provide is what food to drink the wine with, and when — and we rarely get even that. Whether the wine is value for money would also be useful, but that information is even rarer (Calculating value for money wines).

So, what practical use, then, is CellarTracker's score?

1 comment:

  1. David writes:

    "So, what is the problem here? Well, the words in most short wine reviews are piffle. Indeed, winespeak is frequently laughable. . . ."

    I am reminded of this opinion piece:

    Excerpt from Slate (posted June 15, 2007):

    “Cherries, Berries, Asphalt, and Jam. Why wine writers talk that way.”
    By Mike Steinberger, from the “Drink: Wine, beer, and other potent potables” column.
    URL: http://www.slate.com/articles/life/drink/2007/06/cherries_berries_asphalt_and_jam.html

    One of the more famous assaults on the new language of wine came from novelist and children's writer Roald Dahl, a renowned oenophile himself. In 1988, he wrote a letter to Britain's Decanter magazine in which he lambasted as "tommyrot" the "extravagant, meaningless similes" that were suddenly being used to describe wines. "Wine … tastes primarily of wine -- grape-juice, tannin, and so on," Dahl wrote. "If I am wrong about this, and the great wine-writers are right, then there is only one conclusion. The chateaux in Bordeaux have begun to lace their grape-juice with all manner of other exotic fruit juices, as well as slinging in a bale or two of straw and a few packets of ginger biscuits for extra flavouring. Someone had better look into this." He went on, "I wonder, by the way, if these distinguished persons know that their language has become a source of ridicule in many sensible wine-drinking households. We sit around reading them aloud and shrieking with laughter."




