Monday, May 14, 2018

11 tasters and 20 wines, and very little consensus

In my previous post about wine-quality scores (Why comparing wine-quality scores might make no sense), I outlined why it is often (usually?) a bad idea to mathematically compare the numbers — even when the scales look the same, the scoring systems are probably not the same, so that we cannot tell what any similarities or differences in scores actually mean.

I also provided a list of blog posts in which I have previously illustrated just how diverse the scoring schemes of different critics are. What I have not done, until now, is discuss the best example that I know of, where it is clear that many members of a single group of wine tasters had a different idea of what the wine-quality numbers should mean.

The tasting I am referring to is what has become known as the Judgment of Paris. I have previously provided an introduction, for anyone not familiar with this event (A mathematical analysis of the Judgment of Paris).


My interest in this tasting actually has nothing to do with its social interest or importance — here, I am solely interested in the fact that the data highlight almost all of the mathematical problems with evaluating wine quality using numbers. Nominally, the 11 wine-tasters were using the same 20-point wine-quality scale, but in reality there was a different scoring scheme for each person — that is, there are clear differences in their interpretation of the 20-point scale.

For us, this means that there was a lack of repeatability among the wine-quality assessments, and therefore severe problems with taking average quality scores.

Background

The tasting was organized by Steven Spurrier and Patricia Gallagher in 1976. Its fame derives from the fact that it (deliberately) coincided with the US Bicentennial. Its infamy derives from the fact that several people with a modicum of mathematical knowledge have been criticizing it ever since (these critiques are listed in the blog post referred to above).

The situation is this: 9 French people from diverse wine backgrounds were invited to a tasting of French and Californian wines, 10 cabernet-based and 10 chardonnay-based. All 11 people (including the organizers) were asked to score the wines on a 20-point scale, without further instruction — they were all blind to the origins of the wines (including Spurrier and Gallagher).

The outcome was a set of scores based on whatever scoring system the taster chose (ie. their own personal criteria), but their results were all presented on the same scale (ie. a number from 0-20). There is thus no reason whatsoever to assume that the numbers are comparable, unless we can demonstrate that the tasters were all using the same scoring system.

There is one technical difficulty concerning the data for the white wines, which I have previously presented (Why we no longer have the data from the Judgment of Paris). For our purposes here, I have left the scores as originally reported, even though the numbers do not add up to the reported totals.

The data

I am not interested here in evaluating either the wines or the people, so both will remain anonymous for most of this blog post. I have presented the original data in the first pair of graphs, with the scores shown vertically and the tasters arranged horizontally. Each wine is represented by either a + or an x, so that if two wines were given the same score by the same taster then the result looks like *. Note, also, that the tasters are not in the same order in the two graphs, but are simply listed in order of decreasing average score for that wine type.

Scores for the Judgment of Paris cabernet wines
Scores for the Judgment of Paris chardonnay wines

Two things are immediately obvious: (i) some assessors scored higher than others, and (ii) some assessors used a much greater range of scores. The mathematical consequences of these characteristics will be explained below.

For the moment, though, let's look at some of the features of the data. For example, the top score for the red wines (17) was awarded by five different people, although it was given to four different wines. There is thus an effective ceiling on the upper scores, but no corresponding floor: the lowest score awarded varied greatly between tasters (from 2 to 8).

Labeling the tasters A-K (in alphabetical order), we can say the following: (a) at one extreme, G and K produced one of the highest or lowest two scores only 3 times, while (b) at the other extreme, E and J produced one of the highest or lowest two scores 13 times out of the 20 wines. That is, some of the tasters quite consistently produced extreme scores.
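To make that counting procedure concrete, here is a minimal sketch in Python. The score matrix is randomly generated purely for illustration (the actual Judgment of Paris scores are not reproduced here), so the particular counts it prints mean nothing.

import numpy as np

tasters = list("ABCDEFGHIJK")                            # the 11 tasters, labeled A-K
rng = np.random.default_rng(0)
scores = rng.integers(2, 18, size=(20, len(tasters)))    # illustrative only: 20 wines x 11 tasters

extreme_counts = dict.fromkeys(tasters, 0)
for wine_scores in scores:
    order = np.argsort(wine_scores)                      # tasters sorted by their score for this wine
    for idx in list(order[:2]) + list(order[-2:]):       # the two lowest and the two highest
        extreme_counts[tasters[idx]] += 1

print(extreme_counts)                                    # times each taster gave an extreme score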

If we were to drop the top and bottom scores for each of the 20 wines (ie. 40 scores), as is often done for sporting competitions where value judgments are involved, then: (a) at one extreme, B would never have a point dropped, and G would have only one dropped (for being the highest), while (b) at the other extreme, J would lose at least 6 out of 20, and C and D would lose at least 5 out of 20 scores.
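The "drop the extremes" rule itself is easy to express in code. This is a sketch of that procedure, again run on an illustrative random matrix rather than the real data.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(2, 18, size=(20, 11)).astype(float)  # illustrative only: 20 wines x 11 tasters

sorted_scores = np.sort(scores, axis=1)    # sort each wine's 11 scores
trimmed = sorted_scores[:, 1:-1]           # discard the single lowest and single highest score
trimmed_means = trimmed.mean(axis=1)       # one trimmed average per wine (20 values in total)

print(trimmed_means.round(2))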

These patterns can be summarized as shown in the next graph, where each point represents one of the assessors, with their average score shown horizontally and their score range (maximum minus minimum) shown vertically.

Average scores versus the variability of the scores for the Judgment of Paris

Clearly, some tasters scored consistently high (those at the bottom-right of the graph), while others used a broad range of scores (at the upper-left), which must consequently lower their average score. The two non-French tasters are shown in pink. Obviously, including or excluding these two assessors can be expected to make a difference to the calculation of the average score for any given wine.
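For anyone wanting to produce this kind of summary for their own tasting data, the two plotted quantities are simply each taster's mean score and score range. Here is a sketch, once more on illustrative random data.

import numpy as np

tasters = list("ABCDEFGHIJK")
rng = np.random.default_rng(0)
scores = rng.integers(2, 18, size=(20, len(tasters))).astype(float)  # illustrative only

means = scores.mean(axis=0)                         # average score per taster (horizontal axis)
ranges = scores.max(axis=0) - scores.min(axis=0)    # score range per taster (vertical axis)

for name, m, r in zip(tasters, means, ranges):
    print(f"Taster {name}: mean = {m:.1f}, range = {r:.0f}")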

Finally, we can try to summarize the whole thing using a multivariate ordination. This technique is described in a previous blog post (Summarizing multi-dimensional wine data as graphs, Part 1: ordinations) — in this case, the method tries to put the tasters in some order based on their scoring patterns. This order is shown in the final graph, where the tasters are named — at the top are the people who scored highest, and at the bottom are those who scored lowest (ie. used a lot of the score range).

Multivariate ordering of the score patterns for the tasters at the Judgment of Paris

There are five people whose scoring schemes were rather similar to each other, clustered in the middle of the order — Claude Dubois-Millot and Aubert de Villaine are at what is called the centroid (the weighted average), making them the "middle-scoring" tasters. The other six people deviated more or less strongly in their scoring schemes, from this middle ground of five people, with some of them diametrically opposed to each other in their behavior.
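The exact ordination method is not specified here, so the following sketch is only one possibility: a principal components analysis of the tasters' score profiles, using scikit-learn. It illustrates the idea of placing tasters along an axis according to their scoring patterns, not the precise analysis behind the graph above, and it again uses random illustrative data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
scores = rng.integers(2, 18, size=(20, 11)).astype(float)  # illustrative only: 20 wines x 11 tasters

# Treat each taster (column) as one observation described by their 20 wine scores,
# and extract a single axis that best separates their scoring patterns.
pca = PCA(n_components=1)
taster_positions = pca.fit_transform(scores.T).ravel()

print(taster_positions.round(2))   # tasters with similar scoring schemes get similar values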

What does this all mean?

So, what is the importance of this obvious variation in scoring schemes among the tasters? Mathematically, this has an enormous influence on calculating average scores for the wines. In effect, those people who use a large part of the scoring scale will have a disproportionate effect on the average score.

Consider this simple example of two people scoring two wines on a 20-point scale:

          Taster A    Taster B    Average
Wine 1       16          18          17
Wine 2       18          14          16
The two tasters disagree as to the relative quality of the two wines (A prefers 2 and B prefers 1). However, Taster B uses a greater range of scores than does Taster A, and consequently the average scores for the wines reflect Taster B's preference, rather than Taster A's — using a low score greatly down-weights the average score for that wine.
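The arithmetic behind that table can be checked directly; the snippet below uses the same numbers as the table.

wine1 = {"A": 16, "B": 18}
wine2 = {"A": 18, "B": 14}

avg1 = sum(wine1.values()) / len(wine1)   # (16 + 18) / 2 = 17.0
avg2 = sum(wine2.values()) / len(wine2)   # (18 + 14) / 2 = 16.0

# Taster A prefers Wine 2 (18 vs 16) and Taster B prefers Wine 1 (18 vs 14),
# yet the averages rank Wine 1 (17) above Wine 2 (16), following Taster B.
print(avg1, avg2)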

So, the people who had the biggest effect on the average scores of the Judgment of Paris wines are those listed at the bottom of the final graph above — and the further down the list they are then the bigger was their effect.

Conclusions

We cannot possibly work out whether the tasters agreed about the wines or not — each person had their own scoring scheme (although five of them were similar), and so we cannot tell how to interpret their scores. And yet, their scores were averaged, and then presented to the world as having some meaning about the quality of the wines. This was mathematical nonsense, because it allowed some of the people to have a disproportionate effect on the outcome.

Not unexpectedly, the results of this Paris tasting were not repeatable at the time (see Was the Judgment of Paris repeatable?), least of all by those people who used an objective wine-quality scheme rather than a subjective one (eg. the Vintners Club, where all of the tasters must use the UC Davis scoring scheme — Did California wine-tasters agree with the results of the Judgment of Paris?).

Clearly, if everyone scored identically (ie. used the same scoring scheme) then there would be no problem (as at the Vintners Club); and if everyone scored differently then there would be little purpose to averaging the scores. At a tasting like the Judgment of Paris, the obvious procedure is for everyone to get together beforehand and agree on which scoring scheme will be used, as happens at many modern-day wine shows.

The only alternative is to try some fancy mathematics on the scores, to try to make them comparable, before averaging them. I will write about this idea in a later post.
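As a taste of what such an adjustment can look like, here is a minimal sketch of one simple possibility (not necessarily the approach that later post will take): convert each taster's scores to z-scores, by subtracting their mean and dividing by their standard deviation, so that every taster contributes on the same footing before averaging. The data are again random and illustrative.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(2, 18, size=(20, 11)).astype(float)  # illustrative only: 20 wines x 11 tasters

# Standardize each taster's column: subtract their mean score, divide by their standard deviation.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)

adjusted_wine_scores = z.mean(axis=1)   # one comparable (dimensionless) score per wine
print(adjusted_wine_scores.round(2))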
