The data I will look at is the compilation provided by Suneal Chaudhary and Jeff Siegel in their report entitled Expert Scores and Red Wine Bias: a Visual Exploration of a Large Dataset. I have discussed these data in a previous post (What's all this fuss about red versus white wine quality scores?). The data are described this way:
We obtained 14,885 white wine scores and 46,924 red wine scores dating from the 1970s that appeared in the major wine magazines. They were given to us on the condition of anonymity. The scores do not include every wine that the magazines reviewed, so the data may not be complete, and the data was not originally collected with any goal of being a representative sample.This is as big a compilation of wine scores as is readily available, and presumably represents a wide range of professional wine commentators. It is likely to represent widespread patterns of wine-quality scores among the critics, even today.
In my previous analyses, and those of Alex Hunt, who has also commented on this (What's in a number? Part the second), the most obvious and widespread bias when assigning quality scores on a 100-point scale is the over-representation of the score 90 and under-representation of 89. That is, the critics are more likely to award 90 than 89, when given a choice between the two scores. A similar thing often happens for the score 100 versus 99. In an unbiased world, some of the "90" scores should actually have been 89, and some of the "100" scores should actually have been 99. However, assigning wine-quality scores is not an unbiased procedure — wine assessors often have subconscious biases about what scores to assign.
It would be interesting to estimate just how many scores are involved, as this would quantify the magnitude of these two biases. Since we have at hand a dataset that represents a wide range of commentators, analyzing this particular set would tell us about general biases, not just those specific to each individual commentator.
Estimating the biases
As in my earlier posts, the analysis involves frequency distributions. The first two graphs show the quality-score data for the red wines and the white wines, arranged as two frequency distributions. The height of each vertical bar in the graphs represents the proportion of wines receiving the score indicated.
The biases involving 90 versus 89 are clear in both graphs; and the bias involving 100 is clear in the graph for the red wines (we all know that white wines usually do not get scores as high as for red wines — see What's all this fuss about red versus white wine quality scores?).
For these data, the "expectation" is that, in an unbiased world, the quality scores would show a relatively smooth frequency distribution, rather than having dips and spikes in the frequency at certain score values (such as 90 or 100). Mathematically, the expected scores would come from an "expected frequency distribution", also known as a probability distribution (see Wikipedia).
In my earlier post (Biases in wine quality scores), I used a Weibull distribution (see Wikipedia) as being a suitable probability distribution for wine-score data. In that post I also described how to use this as an expectation to estimate the degree of bias in our red- and white-wine frequency distributions.
The resulting frequency distributions are shown in the next two graphs. In these graphs, the blue bars represent the (possibly biased) scores from the critics, and the maroon bars are the unbiased expectations (from the model). Note that the mathematical expectations both form nice smooth distributions, with no dips or spikes. Those quality scores where the heights of the paired bars differ greatly are the ones where bias is indicated.
For a score of "100" we can only refer to the red-wine data. These data indicate that this score occurs more than 8 times as often as expected from the model. This is what people are referring to when they talk about "score inflation" — the increasing presence of 100-point scores. It might therefore be an interesting future analysis to see whether we can estimate any change in 100-point bias through recent time, and thereby quantify this phenomenon.
Finally, having produced unbiased expectations for the red and white wines, we can now compare their average scores. These are c.91.7 and c.90.3 for the reds and whites, respectively. That is, on average, red wines get 1⅓ more points than do the whites. This is much less of a difference than has been claimed by some wine commentators.
Personal wine-score biases are easy to demonstrate for individual commentators, whether professional or semi-professional. We now know that there are also general biases shared among commentators, whether they are professional or amateur. The most obvious of these is a preference for over-using a score of 90 points, instead of 89 points. I have shown here that one in every three 90-point wines from the professional critics is actually an 89-point wine with an inflated score. Moreover, the majority of the 100-point wines of the world are actually 99-point wines that are receiving a bit of emotional support from the critics.