For example, at the 1984 Winter Olympics the ice-dancing pair of Jayne Torvill and Christopher Dean received the maximum artistic-impression score of 6.0 from every one of the nine judges, which had never happened before for a single performance. Does this mean that no one can ever do better? Not unexpectedly, the International Skating Union's International Judging System eventually replaced the previous 6.0 system (in 2004), so that scores no longer get anywhere near the maximum possible.
In a similar vein, it has been pointed out innumerable times that the top end of the 100-point wine-quality scale has become unnaturally crowded. This graph of the frequency distribution of some of Robert Parker's wine scores illustrates the issue (taken from my post Biases in wine quality scores). Here, the height of each vertical bar represents the proportion of wines receiving the score shown on the horizontal axis.
There is a distinct bump in the graph at a score of 100, indicating that more wines are being awarded this score than would be expected. This is precisely what happens when we reach the ceiling of any quality scale — there are lots of very good wines, and we cannot distinguish among them because we have to give them all the same score: 100.
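For anyone who wants to draw this sort of graph for their own data, a frequency distribution of scores takes only a few lines of code. This is a minimal sketch; the scores listed here are made up purely for illustration, and are not Parker's actual data.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical critic scores on the 100-point scale (illustration only)
scores = [88, 90, 91, 92, 92, 93, 94, 95, 95, 96, 97, 98, 100, 100, 100]

# Proportion of wines receiving each score
counts = Counter(scores)
total = len(scores)
points = sorted(counts)
proportions = [counts[p] / total for p in points]

plt.bar(points, proportions)
plt.xlabel("Quality score (100-point scale)")
plt.ylabel("Proportion of wines")
plt.show()
```

A ceiling effect shows up in such a graph as an excess of wines piled up at the maximum score, exactly as described above.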
We probably need to address this issue. Given the large subjective component in such ratings, there are only two general ways to go about this. We either:
1. re-scale the 100-point scale, thus reducing the quality implication of the scores, so that "100-point wines" no longer get 100 points but instead get a wider range of lower points; or
2. go past the 100 limit, and start doling out scores that exceed 100 points.
In September 1998, Jancis Robinson posted on her web site a set of quality scores from a vertical tasting of the wines of Château d'Yquem (Notes from attending an Yquem vertical tasting).** The data are shown in the next graph, with the quality scores vertically and the wine vintages horizontally. The first two vintages were the "Thomas Jefferson wines" supplied by Hardy Rodenstock, and so their provenance is considered doubtful.
The quality of the remaining wines is nominally scored on Robinson's usual 20-point scale. Note that three of the wines received a score of 20, while four of them were awarded scores that notably exceed 20 points (marked by the red line). Robinson made no comment about her unexpected scores, but she did use a series of superlatives in her tasting notes, the like of which we do not usually see from her pen (eg. "absolutely extraordinary").
Obviously, Robinson has her own personal quality scale, and what we are presumably being told here is that these wines exceed her usual expectations for a "20-point wine". It therefore seems to me that this is a prime example of option (2) presented above.
The question therefore arises as to whether this approach was actually necessary in this particular case. We might find a possible answer by looking at what other people have done when confronted with these same wines.
As one example, Per-Henrik Mansson published a set of quality scores for many of the same wines in the May 1999 issue of the Wine Spectator magazine (Three centuries of Château d'Yquem). He used a 100-point scale for his scores, so I have converted them to a 20-point scale for the comparison shown in the next graph (Mansson's relevant scores are in maroon).
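The exact conversion used here is not critical, but for the record, a simple linear re-scaling of this general sort is one plausible way to do it. The anchor points below (mapping 50-100 onto 0-20) are my own choice for illustration, not necessarily the ones used for the graph.

```python
def to_20_point(score_100, low=50, high=100):
    """Linearly map a score from the (low, high) part of the 100-point
    scale onto the 0-20 range.

    Illustrative only; other anchor points (e.g. 0-100 onto 0-20)
    would give somewhat different converted values.
    """
    return 20 * (score_100 - low) / (high - low)

print(to_20_point(100))  # 20.0
print(to_20_point(90))   # 16.0
```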
The correlation between the two sets of scores is 48%, which is slightly higher than we have come to expect from wine professionals (10-40%). However, Mansson never exceeded the nominal limit of his scale — of the 121 scores in his article, there are four 100-point scores, but none scored higher. Indeed, a comparison of the scores on the 20-point scale shows that Robinson's scores are generally 25% higher than Mansson's, across the board.
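For those who want to check such numbers for themselves, the calculation is straightforward, assuming a standard Pearson correlation. The scores below are made-up placeholders rather than the actual Robinson and Mansson data; substituting the real values would reproduce the figures quoted above.

```python
import numpy as np

# Hypothetical paired scores (20-point scale) for the same vintages
robinson = np.array([20.0, 19.5, 18.0, 20.5, 17.0, 21.0])
mansson  = np.array([17.0, 16.0, 15.5, 18.0, 14.5, 17.5])

# Pearson correlation coefficient, expressed as a percentage
r = np.corrcoef(robinson, mansson)[0, 1]
print(f"correlation = {100 * r:.0f}%")

# Average ratio of the two sets of scores (Robinson relative to Mansson)
print(f"Robinson / Mansson = {np.mean(robinson / mansson):.2f}")
```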
I think that we might therefore argue that Mansson has provided an example of option (1) presented above (ie. re-scaling so that we do not bump our heads against the score ceiling). Indeed, Mansson gave nine scores below 70 and 30 scores below 80, so that he used a large part of the score range from 50 to 100 points (his lowest score is 55). Such a wide range of scores would have been considered very unusual at any time in the 20 years since he published them!
As a final note, there are only two vintages for which Robinson and Mansson strongly disagree — Robinson scored the 1931 vintage much higher than did Mansson, and he returned the favor with the 1971 vintage.
* This was not actually true at the undergraduate university I attended. The final (research) year of my science degree was assessed on a scale of 1-20. In this case, 20 points represented perfection, which could not be attained in practice by anyone, let alone a student. Nor could a student get 18 or 19 points, although these might be attained by a professional scientist. The best that a student might expect was 16 points, in which case the student was awarded the University Medal, which happened only occasionally. The top mark that might regularly be expected (ie. every year) was 14 points. At the other end, 0 points was a fail for the Honours year, which meant that the student would receive a Pass award instead.
** Thanks to Bob Henry for providing a copy of the blog post.