Monday, December 11, 2017

Do community wine-quality scores converge to the middle ground?

The short answer appears to be: not very often. This is surprising, given what is reported for other communities, and it may indicate something unique about the wine community.

A few weeks ago, I discussed community wine-quality scores, such as those in the Cellar Tracker database (see Cellar Tracker wine scores are not impartial). One of the subjects I commented on was the suggestion that the "wisdom of crowds" can mean that members of the crowd allow their judgement to be skewed by their peers. In the case of wine-quality scores, this would mean that scores from large groups of tasters may converge towards the middle ground as the number of scores increases.

In the formal literature, this topic has been examined by, for example, Omer Gokcekus, Miles Hewstone & Huseyin Cakal (2014. In vino veritas? Social influence on ‘private’ wine evaluations at a wine social networking site. American Association of Wine Economists Working Paper No. 153). They looked at the trend in Cellar Tracker scores for wines through time, from when the first score is added for each wine. They wanted to see whether the variation in scores for a wine decreases as more scores are added for that wine, which would support the thesis about crowd behavior. They concluded that there is some evidence of this.

The important practical point here is that Cellar Tracker displays the average score for each wine when a user tries to add a new score of their own, and it is hard to ignore this information. So, it would be rather easy for a user to be aware of the difference between their own proposed score and the current "wisdom of the crowds". This would presumably have little or no effect when only a few scores have been added for each wine, but it might have an effect as more scores are added, because the crowd opinion then becomes so much clearer.

It has occurred to me that some data that I used in another blog post (Are there biases in community wine-quality scores?) might also be used to examine the possibility that Cellar Tracker scores are biased in this way. In my case, I will look at individual wines, rather than pooling the data across all wines, as was done in the research study described above.

The data at hand are the publicly available scores from Cellar Tracker for eight wines (for my data, only 55-75% of the scores were available as community scores, with the rest not being shared by the users). These eight wines included red wines from several different regions, a sweet white, a still white, a sparkling wine, and a fortified wine. In each case I searched the database for a wine with at least 300 community scores; but I did not succeed for the still white wine (which had only 189 scores).

The results for the eight wines are shown in the graphs at the end of the post. Each point represents one quality score for the wine (some users enter multiple scores through time). For each wine, each score is shown (vertically) as the difference from the mean score for the wine: positive values indicate that the score was greater than the average score, while negative values indicate that it was less than the average. The time is shown (horizontally) as the number of days after the first tasting recorded for that wine.
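For readers who want to build the same kind of graph from their own data, here is a minimal sketch of the transformation just described. The function name `to_plot_points` and the example scores are made up for illustration; they are not the real Cellar Tracker data, and this is not necessarily how the graphs in this post were produced.

```python
from datetime import date
from statistics import mean

def to_plot_points(scores):
    """Convert (tasting_date, score) pairs into
    (days_since_first_tasting, deviation_from_mean_score) pairs."""
    first_day = min(d for d, _ in scores)
    avg = mean(s for _, s in scores)
    return [((d - first_day).days, s - avg) for d, s in sorted(scores)]

# Hypothetical scores for a single wine (illustrative values only).
example = [
    (date(2010, 1, 1), 90),
    (date(2010, 1, 11), 88),
    (date(2010, 3, 2), 92),
    (date(2011, 1, 1), 90),
]

points = to_plot_points(example)
for days, deviation in points:
    print(f"day {days:4d}: {deviation:+.1f}")
```

The horizontal value is the day count since the first recorded tasting, and the vertical value is positive when the score is above that wine's average.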

The expectation is that, if the wine-quality scores do converge towards the middle ground, then the variability of the scores should decrease through time. That is, the points in the graphs will be more spread out vertically during the earliest times, compared to the later times.
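In this post the spread is judged by eye from the graphs, but the same expectation could also be checked numerically. The sketch below is one possible way to do so, under two assumptions of mine: the time range is split into equal-length periods (thirds by default), and the spread is measured as the standard deviation of the deviations in each period. The function name `spread_by_period` and the example points are hypothetical.

```python
from statistics import pstdev

def spread_by_period(points, n_periods=3):
    """Split the time range into equal-length periods and return the
    standard deviation of the score deviations within each period.

    `points` is a list of (days_since_first_tasting, deviation) pairs.
    A period with fewer than two scores yields None.
    """
    last_day = max(day for day, _ in points)
    buckets = [[] for _ in range(n_periods)]
    for day, deviation in points:
        # Scores on the final day belong to the last period.
        idx = min(int(day * n_periods / last_day), n_periods - 1) if last_day else 0
        buckets[idx].append(deviation)
    return [pstdev(b) if len(b) > 1 else None for b in buckets]

# Hypothetical points showing the pattern that convergence would produce.
converging = [(0, 4), (15, -4), (35, 2), (50, -2), (70, 1), (90, -1)]
spreads = spread_by_period(converging)
print(spreads)
```

If scores did converge towards the middle ground, the values would shrink from the first period to the last, as they do for these made-up points. Note that the periods here have equal length in days, not equal numbers of scores, so a period with few scores gives a less reliable estimate.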

The results seem to be quite consistent, with one exception. That exception is the first graph, where the scores are, indeed, more variable through the first third of the time period. In all of the other cases, the scores are most variable during the middle period, which is when most of the scores get added to the database, or sometimes also in the late period.

So, for these wines at least, I find little evidence that Cellar Tracker scores converge towards the middle ground. This seems to disagree with the study of Gokcekus, Hewstone & Cakal (mentioned above), who concluded that community scores are normative ("to conform with the positive expectations of another") rather than informational ("to accept information obtained from another as evidence about reality").

However, a study by Julian McAuley & Jure Leskovec (2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. Proceedings of the 22nd International Conference on the World Wide Web, pp. 897-908) found that user behavior in the Cellar Tracker database was quite different from that in the other four community databases that they studied (Beer Advocate; Rate Beer; Amazon Fine Foods; Amazon Movies).

So, maybe wine drinkers really are different from beer drinkers and moviegoers, when it comes to community assessment of their products? The wisdom of the wine crowd may be unique! In particular, you will note that wine drinkers are not afraid to give rather low scores for wines: the scores in the graphs go much further below the average than they do above it. Note that the dataset excludes wines that are considered to be flawed, which are usually not given scores at all (although very rarely they receive scores in the 50-60 range, which I excluded, as representing faulty wines).

It seems to me that community wine scores are actually informational, rather than normative, expressing the opinion of the drinker rather than that of the crowd. This also fits in with the easily observed fact that the community scores are consistently lower than those of the professional wine critics (see my previous post Cellar Tracker wine scores are not impartial); the wine community is not easily swayed by expert opinion. However, the tendency of all wine reviewers, professional, semi-professional and amateur, to favor a score of 90 over a score of 89 certainly represents an unfortunate bias.

Cellar Tracker wine-quality scores through time for Alvear Pedro Ximenez

Cellar Tracker wine-quality scores through time for Barbaresco 2006

Cellar Tracker wine-quality scores through time for Caymus Cabernet 2012

Cellar Tracker wine-quality scores through time for Clicquot NV Brut

Cellar Tracker wine-quality scores through time for Edwards Sauvignon Blanc 2012

Cellar Tracker wine-quality scores through time for Pontet-Canet 2003

Cellar Tracker wine-quality scores through time for Rieussec 2001

Cellar Tracker wine-quality scores through time for Tondonia 2001


  1. Here in the United States, what influences a consumer to purchase a bottle of wine?

    Well . . . not necessarily a published wine critic.

    Referencing a Wine Opinions online survey cited by W. Blake Gray in his wine blog titled "Social media doesn't sell much wine":

    "Purchase Influences"


    Dedicated wine enthusiasts seem to be an independent-minded lot!

  2. Regarding the Veuve Clicquot Brut Champagne, quality scores were recorded over 5,000 days (if I read the X-axis correctly in the two-axis plotted graph).

    Given that the wine is a non-vintage blend, it raises the question: are all 894 respondents drinking the same wine?

    With quality scores posted over 5,000 days (13.7 years), multiple sequential bottlings of that wine would have been released to the market, each comprising a different blend of vintages.

    That would be an "apples-to-oranges" comparison.

    (Wines blended from "great" vintages would be expected to garner more favorable quality scores than those from "lesser" vintages.)

    What would the plotted data look like if you narrow the analysis to a single declared vintage of Veuve Clicquot La Grande Dame?

    1. Bob,

      I don't think that the comparison is too much like apples and oranges, more like different cultivars of apples. The champagne companies go to a fair bit of trouble to make their blends consistent.

      Sadly, there are not enough scores to be worth looking at any of the Grand Cru marques.

  3. One of the great "non-vintage" (the French would correctly describe it as "multi-vintage") tête de cuvée Champagnes is Laurent-Perrier "Grand Siècle" Brut Champagne -- a blend of three declared vintages.

    Back in the early 2000s when the 1982/1985/1988 vintage blend was released, it was an overachiever.

    In my humble opinion, no subsequent three vintage blend has topped it.

    Contemporary collectors are frustrated in trying to buy that specific bottle at auction, because there is no notation on any of the bottles identifying the vintage blend.

    That information came from the press release accompanying its original debut.

    (Based on the Laurent-Perrier example, I suspect the multi-vintage Veuve Clicquot Brut Champagne has benefited on occasion from "greater" vintages going into the blend.)

    With the recent fraud prevention practice of laser etching disgorgement dates on Champagne bottles, future collectors will know what blended vintage bottling they have.

    1. There are quite a few Cellar Tracker scores for the NV version, spread over the past 10 years. These presumably represent different releases, but not back as far as you are discussing. These data might be worth looking at.