The current thinking seems to be that the "wisdom of crowds" (CellarTracker, TripAdvisor, Amazon Reviews) is the most reliable way to judge something; but that thinking is deeply flawed. It's not just the anonymity of those making the judgements, and the fact that they may lack experience or independence, or may have an agenda, but that crowd behaviour itself is far from impartial. We've all experienced people "tasting the label"; but there is also no doubt that members of the crowd allow their judgement to be skewed by others. That's why, in big groups, scores always converge toward the safe middle ground.

So, we can treat this as Proposition #1 against the potential impartiality of the CellarTracker wine-quality scores.
For Proposition #2, I have previously asked Are there biases in community wine-quality scores? In answering that question I showed that CellarTracker users have (for the eight wines I examined) the usual tendency to over-use quality scores of 90 at the expense of 89 scores.
For Proposition #3, Reddit user mdaquan has suggested:
Seems like the CellarTracker score is consistently lower than multiple professional reviewers on a consistent basis. I would think that the populus would trend higher, [and] not be as rigorous as "pro" reviewers. But consistently the CT scores are markedly lower than the pros.

So, we have three different suggestions for ways in which the wine-quality scores of the CellarTracker community might be biased. This means that it is about time that someone took a detailed look at the CellarTracker quality scores, to see how much bias is involved, if any.
The quality scores assigned by some (but not all) of the CellarTracker community are officially defined on the CellarTracker web site: 98-100 A+ Extraordinary; 94-97 A Outstanding; 90-93 A– Excellent; 86-89 B+ Very Good; 80-85 B Good; 70-79 C Below average; 0-69 D Avoid. However, the "wisdom of crowds" never follows any particular formal scheme, and therefore we can expect each user to be doing their own thing.
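For concreteness, that official scheme is just a simple lookup. Here is a minimal Python sketch of my own (CellarTracker itself publishes no such code, of course):

def official_grade(score: float) -> str:
    """Map a 0-100 quality score to CellarTracker's official letter grade."""
    if score >= 98: return "A+ Extraordinary"
    if score >= 94: return "A Outstanding"
    if score >= 90: return "A- Excellent"
    if score >= 86: return "B+ Very Good"
    if score >= 80: return "B Good"
    if score >= 70: return "C Below average"
    return "D Avoid"

print(official_grade(91))   # A- Excellent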
But what does that "thing" look like when you pool all of the data together, to look at the community as a whole? This is a daunting question to answer, because (at the time of writing) CellarTracker boasts of having "7.1 million tasting notes (community and professional)". Not all of these notes have quality scores attached to them, but that is still a serious piece of Big Data (see The dangers of over-interpreting Big Data). So, I will look at only a subset of the data.
This subset is from the study by Julian McAuley & Jure Leskovec (2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. Proceedings of the 22nd International Conference on the World Wide Web, pp. 897-908). This dataset contains the 2,025,995 review notes entered and made public by the CellarTracker users up until October 2012. I stripped out those notes without associated quality scores; and I then kept those notes where the wine was stated to have been tasted between 2 October 2002 and 14 October 2012. This left me with 1,569,655 public quality scores (and their associated tasting date), which covers the first 10 years of CellarTracker but not the most recent 5 years.
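For anyone who wants to repeat this filtering, here is a rough pandas sketch. Note that the file name and the column names (score, tasted) are my own assumptions for illustration, not the names used in the McAuley & Leskovec files.

import pandas as pd

# Sketch only: the file name and the column names ("score", "tasted")
# are assumed for illustration, not taken from the actual data files.
notes = pd.read_csv("cellartracker_reviews.csv")

# Keep only the notes that carry a quality score ...
scored = notes.dropna(subset=["score"]).copy()

# ... and whose stated tasting date falls within the first ten years.
scored["tasted"] = pd.to_datetime(scored["tasted"], errors="coerce")
mask = (scored["tasted"] >= "2002-10-02") & (scored["tasted"] <= "2012-10-14")
subset = scored.loc[mask]

print(len(subset))   # close to 1,569,655 for the real data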
Time patterns in the number of scores
The obvious first view of this dataset is to look at the time-course of the review scores. The first graph shows how many public user quality scores are represented for each month of the study period.
CellarTracker was designed in 2003 and released in 2004; therefore, all wines before that time have been retrospectively added. So, the graph's time-course represents recorded tasting time, not time of entry into the database, although the two are obviously related. The database reached its maximum number of monthly scored wines at the beginning of 2011, after which it remained steady. The dip at the end of the graph is due to the absence of wines that were tasted before the cutoff date but had not yet been added to the database at that time.
The annual cycle of wine tasting is obvious from 2005 onwards — the peak of tasted wines is at the end of each year, with a distinct dip in the middle of the year. This presumably represents wine consumption during the different northern hemisphere seasons — wine drinking is an early winter thing.
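The monthly tally shown in the first graph is easy to reproduce from the subset sketched above (with the usual caveat that the column names are my assumptions):

import matplotlib.pyplot as plt

# Number of scored notes per tasting month (continuing the sketch above).
monthly_counts = subset.set_index("tasted").resample("M").size()

monthly_counts.plot()
plt.ylabel("Public quality scores per month")
plt.show()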
The quality scores
The next graph shows the frequency (vertically) of the wine-quality scores (horizontally). This should be a nice smooth distribution if the quality scores are impartial; any deviations might be due to any one of the three propositions described above. Although it is somewhat smooth, this distribution shows distinct peaks and dips.
For the lower scores, there are distinct peaks at scores of 50, 55, 60, 65, 70, 75, and 80. This is not unexpected, as wine tasters are unlikely to be interested in fine-scale differences in wine quality at this level, or even be able to detect them.
For the scores above 80, 57% of the scores are in the range 88-92. If we are expecting some sort of mathematical average score for wines, then these data make it clear that it is a score of 89-90. That is, the "average" quality of wine consumed by the CellarTracker community is represented by a score of c. 90, with wines assessed as being either better or worse than this.
However, a quality score of 90 shows a very large peak compared to a score of 89, exactly as discussed under Proposition #2 above. I have previously reported this fact for both professional (Biases in wine quality scores) and semi-professional (Are there biases in wine quality scores from semi-professionals?) wine commentators, as well as the CellarTracker community. So, there is apparently nothing unusual about this, although it could be seen as questioning the general utility of wine-quality scores. If subjective factors make people use 90 in preference to 89, then what exactly is the use of a score in the first place?
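Both of these observations (the 57% concentration and the 90-versus-89 peak) are simple to check from the subset sketched earlier; this assumes the scores are whole numbers:

# Frequency of each quality score (continuing the earlier sketch).
freq = subset["score"].value_counts().sort_index()

# Share of all scores that fall in the narrow 88-92 band (c. 57% here).
share_88_92 = freq.loc[88:92].sum() / freq.sum()

# The 90-versus-89 peak discussed under Proposition #2.
ratio_90_89 = freq[90] / freq[89]

print(f"{share_88_92:.0%} of scores lie in 88-92; 90 is used {ratio_90_89:.1f}x as often as 89")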
Moving on, we now need to look for other possible biases in the scores. In order to evaluate whether any of the scores are biased, we need an unbiased comparison. As I explained in my first post about Biases in wine quality scores, this comes from an "expected frequency distribution", also known as a probability distribution. As before, it seems to me that a Weibull distribution is suitable for wine-score data.
This Weibull expected distribution is compared directly with the observed frequency distribution in the next graph. In this graph, the blue bars represent the (possibly biased) scores from CellarTracker, and the maroon bars are the unbiased expectations (from the probability model). Those scores where the heights of the paired bars differ greatly are the ones where bias is being indicated.
This analysis shows that quality scores of 88, 89, and 90 are all over-represented, while scores of 93, 94, and 95 are under-represented, compared to the expectation. This indicates that the CellarTracker users are not giving as many high quality scores as expected, but are instead giving too many scores of 88-90, so that the scores are skewed towards values just below 90 rather than just above it.
This is exactly what was discussed under Proposition #3 above, where the professionals seem to give somewhat higher scores when the same wines are compared. Furthermore, it is in line with Proposition #1 as well, where the community scores simply converge on a middle ground: a CellarTracker score is more likely to be in the small range 88-90 than almost anywhere else.
Furthermore, quality scores of 81, 83, and 86 are also under-represented, according to the analysis. This creates a clustering of the lower scores at certain values. Presumably, the tasters are not bothering to make fine distinctions among wines below their favorite scores of 88-90.
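For those who want to reconstruct the expected distribution, here is a rough scipy sketch of my own. It fits a Weibull to the distance of each score below 101, which is one convenient way to get the right sort of skew; it is not necessarily the exact fitting procedure behind the graph above.

import numpy as np
from scipy import stats

# Continuing the earlier sketch: fit a Weibull to the scores of 80 and above.
scores = subset.loc[subset["score"] >= 80, "score"].to_numpy()

# Work with the distance below 101, so that the long Weibull tail
# corresponds to the (rarer) low scores.
x = 101.0 - scores
shape, loc, scale = stats.weibull_min.fit(x, floc=0)

# Expected count for each integer score, from the fitted distribution.
grid = np.arange(80, 101)
upper = stats.weibull_min.cdf(101.0 - grid + 0.5, shape, loc, scale)
lower = stats.weibull_min.cdf(101.0 - grid - 0.5, shape, loc, scale)
expected = (upper - lower) * len(x)

observed = np.array([(scores == s).sum() for s in grid])   # assumes whole-number scores
for s, o, e in zip(grid, observed, expected):
    print(f"score {s}: observed {o:8d}   expected {e:10.0f}")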
Time patterns in the quality scores
We can now turn to the time-course of the wine-quality scores. This next graph shows the average quality score for the wines tasted during each month of the study.
The average score was erratic until mid 2005, which is when the number of wines (with scores) reached 3,000 per month. So, that seems to be the number of wine scores required to reliably assess the community average.
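In the same pandas sketch, the monthly average (and the 3,000-scores-per-month reliability threshold) looks like this:

# Average quality score per tasting month (continuing the earlier sketch).
by_month = subset.set_index("tasted")["score"].resample("M")
monthly_mean = by_month.mean()
monthly_n = by_month.size()

# Months with fewer than c. 3,000 scores give an erratic community average.
reliable_mean = monthly_mean[monthly_n >= 3000]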
From 2007 to 2009 inclusive, the average quality score was c. 88.5, although there was a clear annual cycle of variation. Notably, after 2009 the average quality score rose to >89. Does this represent the proverbial score inflation? Or perhaps it is simply the community maturing, and starting to give scores more in line with those of the professionals (which are higher)?
To try to assess this, the final graph shows the time-course of the proportion of scores of 95 or above. Many of the professional reviewers have been accused (quite rightly) of over-using these very high scores, compared to the situation 20 years ago, and so we can treat this as an indication of score inflation.
This graph shows no post-2009 increase in the proportion of very high scores. So, the increase in the average CellarTracker quality score does not represent an increased usage of very high scores, but is instead a general tendency to assign higher scores than before. Or perhaps it represents the community drinking better wines?
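The proportion of very high scores plotted above can be computed the same way, again continuing the sketch (and again with my assumed column names):

import matplotlib.pyplot as plt

# Proportion of scores of 95 or above, per tasting month.
high = subset.set_index("tasted")["score"].ge(95)
monthly_high_share = high.resample("M").mean()

monthly_high_share.plot()
plt.ylabel("Proportion of scores of 95+")
plt.show()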
Finally, it is worth pointing out the annual cycle in the average scores and in the proportion of very high scores. The annual peak in quality scores is in December. That is, wines get higher scores in December than at most other times of the year. I hope that this represents people buying better wines between All Hallows' Day and New Year, rather than drinking too much wine and losing their sense of values!
Conclusions
All three predicted biases in the CellarTracker wine-quality scores are there! The community scores are generally lower than expected, they cluster in a smaller range around the average than expected, and a score of 90 is over-used compared to 89. There are also very distinct seasonal patterns, not only in the number of wines tasted but also in the scores assigned to them.
These conclusions are not necessarily unexpected. For example, Julian McAuley & Jure Leskovec (cited above) noted: "experienced users rate top products more generously than beginners, and bottom products more harshly." Furthermore, Omer Gokcekus, Miles Hewstone & Huseyin Cakal (2014. In vino veritas? Social influence on ‘private’ wine evaluations at a wine social networking site. American Association of Wine Economists Working Paper No. 153) have noted that community scores are normative (= "to conform with the positive expectations of another") rather than informational ("to accept information obtained from another as evidence about reality").
In the modern world, it may well be true that "the crowd is the new critic", but it turns out that the crowd as a group is actually no more impartial than is any single person.
"Can You Trust Crowd Wisdom?;
ReplyDeleteResearchers say online recommendation systems can be distorted by a minority of users."
https://www.technologyreview.com/s/415337/can-you-trust-crowd-wisdom/
-- and --
"Better wisdom from crowds
MIT scholars produce new method of harvesting correct answers from groups."
http://news.mit.edu/2017/algorithm-better-wisdom-crowds-0125