Wine competitions, and many web sites, involve summing assessors’ scores into a consensus “average” score for each wine. However, consider an example with three assessors, where the scores of:
5,5,5 have the same sum / average as 0,5,10
The former situation indicates complete agreement among the three assessors about the quality of the wine, while the latter is indistinguishable from random quality scores. Surely this difference matters?
This contradictory situation has long been ignored. Obviously, the issue does not matter when looking at a single critic’s review in a magazine, for example. However, it may matter enormously at sites like CellarTracker, which claim to represent a consensus about wine quality among many people. Yet such sites seem not to have addressed this issue at all.
Recently, Jeffrey Bodington has looked at this situation in detail:
Here, I will summarize some of his ideas.
First, however, let’s make the situation clear. The following lines show three assessor scores (each out of a maximum of 10) with increasing agreement around the same average of 5:
0,5,10
1,5,9
2,5,8
3,5,7
4,5,6
5,5,5
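As a minimal sketch of what is going on here, the following Python snippet computes the mean and the standard deviation for each of these triples. The mean is identical throughout, while the spread shrinks steadily to zero (standard deviation is used here simply as a convenient measure of disagreement, not as Bodington's measure):

```python
from statistics import mean, pstdev

# Triples of assessor scores (out of 10) that all share the same average of 5,
# ordered from least to most agreement among the three assessors.
triples = [
    (0, 5, 10),
    (1, 5, 9),
    (2, 5, 8),
    (3, 5, 7),
    (4, 5, 6),
    (5, 5, 5),
]

for scores in triples:
    # The mean is the same for every triple, but the spread shrinks to zero.
    print(scores, "mean =", mean(scores), "std dev =", round(pstdev(scores), 2))
```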
Furthermore, there can also be clusters of scores (e.g. some scores indicating poor quality and some indicating good quality, with nothing in between), such as:
1,7,7
3,6,6
The basic mathematical issues here are that (potentially) many billions of combinations of scores have the same sum (and therefore average), and that uncertain ratings can have many different sums. Mathematically, an observed wine rating is one draw from a (latent) distribution of all possible ratings that is both wine-specific and judge-specific.
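To give an idea of the scale involved, here is a small Python sketch (my own illustration, not Bodington's) that counts how many distinct ordered combinations of integer scores produce a given sum:

```python
from functools import lru_cache

def count_score_combinations(n_judges, max_score, target_sum):
    """Count the ordered combinations of n_judges integer scores,
    each in 0..max_score, that add up exactly to target_sum."""
    @lru_cache(maxsize=None)
    def ways(judges_left, remaining):
        if judges_left == 0:
            return 1 if remaining == 0 else 0
        return sum(ways(judges_left - 1, remaining - s)
                   for s in range(0, max_score + 1) if s <= remaining)
    return ways(n_judges, target_sum)

# Three judges scoring 0-10: how many score combinations average exactly 5?
print(count_score_combinations(3, 10, 15))    # 91 combinations

# Nine judges scoring 0-20 (ignoring half points), as at the Judgment of Paris,
# summing to 130: the count runs into the billions.
print(count_score_combinations(9, 20, 130))
```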
The further practical issues for wine assessments are that: (i) sample sizes (number of assessments) are often small (especially in competitions); (ii) some wine judges are more reliable or consistent assessors than are others; and (iii) clusters of scores can happen, for example in the case of stylistically distinctive wines. These three situations mean that the issue discussed by Bodington can potentially have a big effect.
Bodington proceeds mathematically:
A weighted sum of judges’ wine ratings is proposed and tested that (1) recognizes the uncertainty about a sum and (2) minimizes the disagreement among judges about that sum. A simple index of dispersion is [also] proposed and tested that measures a continuum from perfect consensus (dispersion = 1), to ratings that are indistinguishable from random assignments (dispersion = 0), and then to distant clusters of ratings when groups of judges disagree (dispersion is negative).

To make sense of this, he then illustrates his ideas with a straightforward example. This involves the 10 white wines from the 1976 Judgment of Paris comparative tasting of French and American wines, with 9 assessors per wine. The sums of the blind scores are shown in the graph above. The blue bars indicate the distribution of all possible sums of 9 scores of up to 20 points each (ie. a minimum of 0 and a maximum of 180). The black lines represent the sums for each of the 10 Judgment wines (as labeled). As shown:
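The shape of that blue-bar distribution can be approximated with a few lines of Python, assuming for the purpose of illustration that each judge’s score is an integer from 0 to 20 and that every score value is counted equally:

```python
import numpy as np

# Shape of the distribution of all possible sums of 9 integer scores (0-20),
# treating every individual score value as equally likely (an assumption made
# here purely for illustration).
single = np.ones(21) / 21        # one judge: uniform over 0..20
dist = np.array([1.0])
for _ in range(9):               # add each of the 9 judges in turn
    dist = np.convolve(dist, single)

sums = np.arange(dist.size)      # possible sums run from 0 to 180
print("most likely sum:", sums[dist.argmax()])            # 90, the midpoint
print("share of all combinations at 130 or more:", round(dist[130:].sum(), 4))
```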
... the sums of points for the top two white wines, Chateau Montelena and Meursault Charmes are calculated to be the same at 130.5. The respective ranges of points assigned to those two wines were 3.0-to-18.5 and 12.0-to-16.0 ...

So, the overall mathematical assessment is the same for the two wines, but there is clearly much more consensus among the judges for Meursault Charmes (scores 12-to-16 out of 20) than for Chateau Montelena (scores 3-to-18.5). Bodington thinks that this difference should be dealt with, and that is the purpose of his weighted sum and his index of dispersion.
These calculations are shown in the next figure, with the weighted sum shown horizontally (ie. increasing assessed quality of the wine) and the index of dispersion vertically (ie. increasing agreement among the assessors), and each wine represented by a labeled point.
Bodington notes:
Results for the dispersion index show that none are close to zero so they do not appear to be random results, and none are negative to indicate distant clusters. The weighted sum of points for highest-scoring Chateau Montelena has the second lowest dispersion index of any wine and the [other] wine Meursault Charmes has the highest index of any wine. Considering that finding, does it make sense to conclude that Montelena was better than Charmes?

In other words, the consistent critic judgements for Meursault Charmes should outweigh the relatively inconsistent ones for Chateau Montelena. On this view, Meursault Charmes can be interpreted as the “best” white wine at the Judgment of Paris.
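Bodington’s own index of dispersion is defined in his paper, and I have not reproduced it here. However, to show how an index with the behavior he describes can work, here is a much simpler statistic of my own construction: 1 minus the ratio of the observed variance to the variance of purely random (uniform) scores. It equals 1 for perfect consensus, sits near 0 when the scores are about as scattered as random assignments, and goes negative when the scores fall into widely separated clusters:

```python
from statistics import pvariance

def toy_dispersion_index(scores, max_score=10):
    """A rough consensus measure (NOT Bodington's actual index):
    1 = perfect agreement, ~0 = as scattered as random scores,
    negative = widely separated clusters of scores."""
    # Variance of a single score drawn uniformly at random from 0..max_score.
    random_variance = ((max_score + 1) ** 2 - 1) / 12
    return 1 - pvariance(scores) / random_variance

print(toy_dispersion_index([5, 5, 5, 5]))   # 1.0   -> complete consensus
print(toy_dispersion_index([1, 4, 6, 9]))   # ~0.15 -> little better than random
print(toy_dispersion_index([0, 1, 9, 10]))  # ~-1.05 -> two opposing clusters
```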
This sort of situation can have a strong effect any time there is a relatively small number of wine assessments.