Monday, August 3, 2020

The Judgment of Paris demonstrated nothing, statistically speaking

Way back in 1976 there was a comparative tasting of several fancy (and expensive) American and French wines, in which the US wines acquitted themselves quite well. This became known as The Judgment of Paris, a classical allusion that probably escapes most people.

The media made much of this outcome at the time, claiming that the American wines “won” some sort of contest, and should now be considered to be the best in the world, or at least the equal of the French stuff. Indeed, whole books have been written about this; and the volume of words in the media is horrendous to think of.


However, it seems to me that one thing has always been missing here. Since the quality scores assigned by each taster to each wine are available to us, we can judge for ourselves whether the scores actually show anything more than random variation. After all, the sample size was very small: 20 wines tasted by 11 people. I have already argued the case that the results are far too variable, in all the wrong ways, for comparing the wines of two countries, with very little consensus among the tasters. Well, I have finally decided to bite the bullet, and apply some formal statistical analyses to the data; this blog is probably the only place likely to try this exercise!

Now, before you all jump up and down reminding everyone of the old saying that there are “Lies, damned lies, and statistics”, bear with me for a moment. That statement dates from the 1890s, long before the development of modern mathematical methods to study data involving probabilities. There are now many powerful methods for analyzing data in an objective and repeatable way, ones that generate no controversy about their outcome. Statistics is no longer whatever you want to make of it.

In the case at hand, there are several possible causes of variation in the quality scores assigned to the wines:
  1. the wines come from two countries — this is what we want to examine
  2. for each country, there were 10 reds and 10 whites selected out of all of the high-quality wines that could have been chosen for inclusion — the results might have been different if other wines had been chosen
  3. there were only 11 people selected as tasters, out of all of the wine critics that could have been chosen — the results might have been different if other tasters had been chosen
  4. these people were of different genders, and there were not equal numbers of males and females.
There are other factors that might be important, of course, but this list covers all the information that we have. The point here is that we wish to study factor 1, and to do so we need to take into account the variation caused by factors 2, 3 and 4. This is what modern statistical analysis is all about: studying one source of variation in the face of variation caused by other factors. How do we do this in some objective and repeatable way?

The answer, as formalized by Ronald Fisher in the 1920s, is called Analysis of Variance. It does precisely what the name says — it tries to measure the amount of variation caused by each of the factors. If the variation for any particular factor has a strong pattern, compared to the others, then we can consider it to be an important one, and if the variation has no particular pattern then it is not important. This comparison is made with respect to the variation that is not accounted for by the specified factors.

The concept is fairly simple, but (of course) the mathematics is not. It can often be done by hand, if you are inclined to try that sort of thing, but I always use a computer program. I have used this type of analysis many times in my own research career; and I have even taught this analysis to undergraduate and postgraduate biology students. So, what happens if I apply it to the data from the Judgment of Paris?
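To make the idea concrete, here is a minimal one-way example in Python, using entirely hypothetical scores (three wines, five scores each), not the Paris data. It asks whether the variation between the wines stands out against the variation within each wine's scores:

```python
from scipy.stats import f_oneway

# Hypothetical quality scores for three wines (five tasters each);
# these numbers are invented purely for illustration.
wine_a = [14.0, 15.5, 13.0, 16.0, 14.5]
wine_b = [12.0, 11.5, 13.5, 10.0, 12.5]
wine_c = [16.5, 17.0, 15.0, 18.0, 16.0]

# One-way Analysis of Variance: compares the variation between the
# three groups with the variation within each group.
f_value, p_value = f_oneway(wine_a, wine_b, wine_c)
print(f"F = {f_value:.2f}, p = {p_value:.4f}")
```

Here the between-wine variation is large relative to the within-wine variation, so the F-value is well above 1 and the probability is small. The real analysis has four factors and a nested structure, but the underlying comparison is the same.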

Well, you already know the answer to that, from the title of this blog post. The formal details of the analysis are included at the bottom of the post, for those of you who might be interested in them. Here is a summary of the results:

Factor     F-value   Probability
Country      1.37      0.256
Wine         5.64      0.000
Gender       0.01      0.908
Person       3.63      0.000

Each row of the table refers to a formal statistical test of one of the four factors. The F-value (named after Fisher himself) is the result of the calculations; for purely random data this value will be close to 1. The probability (p-value) measures how likely an F-value at least this large would be if the factor had no real effect. By convention, we might choose p < 0.05 as our criterion for concluding that a factor shows important variation.
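The claim that the F-value hovers around 1 for purely random data is easy to check by simulation. This sketch (again hypothetical, not the Paris data) repeatedly generates scores with no real group differences and averages the resulting F-values:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Simulate 2000 tastings of 4 "wines" scored 10 times each, where
# every score is drawn from the same distribution, so there is no
# real wine effect at all.
f_values = []
for _ in range(2000):
    groups = rng.normal(loc=14, scale=2, size=(4, 10))
    f_val, _ = f_oneway(*groups)
    f_values.append(f_val)

mean_f = float(np.mean(f_values))  # close to 1, as expected for random data
```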

As you can see, there is almost no variation between the two genders in our sample, which may surprise none of you. There is some variation between the two countries, but it is not very large, and is nowhere near “statistical significance”. All of the fuss about the results is nonsense; a difference between the countries this large would arise from random chance alone about 26% of the time.

This does not mean that there were no differences between the wines, because there surely were. However, the analysis indicates that these differences had nothing to do with which country they came from. Simply put, some wines scored consistently much higher than others, meaning that the tasters agreed that they were better wines, irrespective of where they came from.

Unsurprisingly, the analysis also shows that there were big differences between the critics, with some of them giving much higher scores than others, irrespective of the wine. I produced a graph of this in my previous post on this topic, which I have reproduced here. It shows who scored highest (at the top) and who scored lowest (at the bottom).


Conclusion

The statistical analysis makes it clear that the variation in scores was no different from what we would expect from any single collection of wines and people at one place and one time. Some wines scored better than others, and some people gave higher scores than others, and the country of origin had nothing to do with it.

That is, we would not expect the results to be repeatable. Nor were they, as I have already pointed out in two previous blog posts (Was the Judgment of Paris repeatable? Did California wine-tasters agree with the results of the Judgment of Paris?). Different people tasting the same wines produced different results.

There were lots of differences in scores between the wines and between the people, but not much difference between the countries. So, what was all of the fuss about? Cultural politics, is my guess.




Analyses

I tried several different General Linear Models, using the Minitab package. The Country and Gender factors were both fixed; Wine was nested within Country, and Person within Gender, both as random factors.

Factor           Type     Levels
Country          fixed       2
Wine(Country)    random     20
Gender           fixed       2
Person(Gender)   random     11

The simplest model (as reported above) has only the four main factors.

Factor      DF   Sum Squares   F-value   P-value
Country      1       69.323      1.37      0.256
Wine        18      908.359      5.64      0.000
Gender       1        0.456      0.01      0.908
Person       9      292.376      3.63      0.000
Residual   190     1699.168
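The F-values and p-values in this table can be reproduced directly from the sums of squares and degrees of freedom, assuming the usual convention for a nested model like this one: each fixed factor is tested against the random factor nested within it, and each random factor against the residual. A quick check in Python:

```python
from scipy.stats import f

# Mean squares = sum of squares / degrees of freedom (from the table above)
ms_country  = 69.323 / 1
ms_wine     = 908.359 / 18
ms_gender   = 0.456 / 1
ms_person   = 292.376 / 9
ms_residual = 1699.168 / 190

# Fixed factors are tested against their nested random factor;
# random factors are tested against the residual.
f_country = ms_country / ms_wine      # reproduces 1.37
f_wine    = ms_wine / ms_residual     # reproduces 5.64
f_gender  = ms_gender / ms_person     # reproduces 0.01
f_person  = ms_person / ms_residual   # reproduces 3.63

# The p-value is the upper-tail probability of the F distribution,
# e.g. for Country: F on (1, 18) degrees of freedom.
p_country = f.sf(f_country, 1, 18)    # reproduces 0.256
```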

It is also possible to add several of the interactions between pairs of the four factors. However, in this case some of the F-tests will only be approximate, based on adjusted sums-of-squares (these are indicated with an asterisk). The most complex model has three 2-factor interactions; adding the fourth 2-factor interaction collapses the model.

Factor            DF   Adj. Sum Squares   F-value   P-value
Country            1         50.464         1.47     0.242 *
Wine              18        576.866         8.59     0.000
Gender             1          0.767         0.03     0.859 *
Person             9        250.450         2.39     0.105
Country*Gender     1          1.146         0.19     0.701 *
Country*Person     9        104.767         1.24     0.277
Gender*Wine       18         67.162         0.40     0.987
Residual         162       1526.094

Note that in this analysis the Person factor is no longer statistically significant (due to the presence of the Person*Country interaction). Otherwise, the conclusions do not change.

9 comments:

  1. I think you've missed a much larger point in this tasting. Regardless of the final rankings, the panel of experts did not recognize that some of the wines were not either Burgundy or Bordeaux. And there is no statistical doubt about that.

    From which we can conclude that the whole idea of those regions having a unique terroir that is unachievable in any other region goes right out the window. And so does one of the primary building blocks of most wine education.

    IF the experts can't tell the difference, why are we teaching it?

    ReplyDelete
    Replies
    1. Absolutely!

      Delete
    2. Paul:

      Steven Spurrier did not "task" the panelists with identifying blind which wines were French and which wines were Californian.

      Rather, it was a subjective personal preference vote.

      (Or as Robert Lawrence Balzer -- who later staged his own version of the Judgment of Paris back in California -- would instruct us wine appreciation course students: "Okay children, time for the beauty contest!")

      Quoting from Wikipedia's entry:

      "Blind tasting was performed and the judges were asked to grade each wine out of 20 points. No specific grading framework was given, leaving the judges free to grade according to their own criteria.

      Rankings of the wines preferred by individual judges were based on the grades they individually attributed.

      An overall ranking of the wines preferred by the jury was also established in averaging the sum of each judge's individual grades (arithmetic mean). However, grades of Patricia Gallagher and Steven Spurrier were not taken into account, thus counting only grades of French judges.[3]"

      So questions about "terroir" and "typicity" by AOC and AVA were neither addressed nor resolved.

      And Paul, speaking as a fellow wine marketer, the waggish short answer to your question . . .

      "IF the experts can't tell the difference, why are we teaching it?"

      . . . is: because it's good marketing "story telling."

      Regards,

      ~~ Bob

      Delete
    3. By the way, I am NOT convinced that the genuine experts can't tell the difference.

      Let me nominate my expert: Darrell Corti, the famous grocer of Sacramento (California).

      And let me proffer this blind tasting anecdote:

      Excerpt from the Los Angeles Times “Sunday Calendar”
      (December 18, 1988):

      “A Connoisseur's Connoisseur:
      Darrell Corti probably knows more about food
      than anybody else in the state”

      URL: https://www.latimes.com/archives/la-xpm-1988-12-18-ca-762-story.html

      By Ruth Reichl
      Times Restaurant Critic

      "Oh no," groaned one of the guests, "do we have to?" A bottle of wine had just been plunked on the table, neatly wrapped in a plain brown bag. Once again it was time to play "Guess the Vintage."

      The mystery wine was poured for each of the dozen invitees. They swirled and sniffed. They hazarded guesses. For this group of august wine experts, it was not a new game. Somebody suggested that the wine came from Napa. Another detected the flavor of the 1953 Medoc.

      After 15 minutes they had determined only that it was red, not American, and made before 1935. "Oh, let's give up," said wine authority Robert Finigan, "none of us has a clue." There were nods all around the table as wine writer Barbara Ensrud, wine maker Jack Cakebread and wine connoisseur Narsai David all admitted that they were stumped.

      "Not yet," said a measured voice from the end of the table; it was the first time it had been heard. "I know the wine. I've just been trying to decide if it was the 1928 or the 1929. I think it must be the 1928."

      There was a gasp from the back of the room and winemaker John Trefethen announced the name of the wine he had placed on the table. Fifty-two years after the grapes were gathered, Darrell Corti had not only correctly named the vintage year, but placed the wine within yards of where it was made.

      Even the experts are not supposed to be able to do this sort of parlor trick; it is not for nothing that Corti is known as the "walking wine encyclopedia."

      Delete
  2. The great thing about the Judgement of Paris was the international publicity it gave to California wines. Not points or who won.

    ReplyDelete
  3. Paul Wagner has no clue what he's talking about. EVERYONE KNEW that all the reds were Cabernet based or Bordeaux blends, and all the whites were of Burgundian variety (Chardonnay). And terroir does make a huge difference. You can't tell me that Chablis tastes like Russian River Chardonnay.
    And as for David Morrison's analysis, I guess every gold medal won in the Olympics by means of "scores" (gymnastics, figure skating, diving, etc) demonstrates nothing, statistically speaking.

    ReplyDelete
    Replies
    1. Your last point is NOT correct. Each athlete is evaluated on their own merits. If each wine is assessed the same way, then that would be one thing. However, in this case the wines were assessed as a group of California versus French wines. This then requires the statistical analysis to conclude anything. Otherwise, it is just another collection of individual wines — and then there would be no Judgment of Paris!

      Delete
  4. It has always seemed to me that the primary takeaway from the '76 tasting had to do with the expectations brought to the tasting, which were, apparently, that the perceived differences in quality between French and California wines (and winemaking capability) could not be reconciled with the actual experience of the participants, once the wines were tasted side-by-side. It made for a story that people seemed to welcome eagerly.

    ReplyDelete
    Replies
    1. I agree with both your points: the expectation, and the response. These were, indeed, the main points. This does not obviate the idea that these are subjective, rather than objective, things.

      Delete