Monday, August 3, 2020

The Judgment of Paris demonstrated nothing, statistically speaking

Way back in 1976 there was a comparative tasting of several fancy (and expensive) American and French wines, in which the US wines acquitted themselves quite well. This became known as The Judgment of Paris, a classical allusion that probably escapes most people.

The media made much of this outcome at the time, claiming that the American wines “won” some sort of contest, and should now be considered to be the best in the world, or at least the equal of the French stuff. Indeed, whole books have been written about this; and the volume of words in the media is horrendous to think of.


However, it seems to me that one thing has always been missing here. Since the quality scores assigned by each taster to each wine are available to us, we can judge for ourselves whether the scores actually show anything more than random variation. After all, the sample size was very small: 20 wines tasted by 11 people. I have already argued the case that the results are far too variable, in all the wrong ways, for comparing the wines of two countries, with very little consensus among the tasters. Well, I have finally decided to bite the bullet, and apply some formal statistical analyses to the data; this blog is probably the only place likely to try this exercise!

Now, before you all jump up and down reminding everyone of the old saying that there are “Lies, damned lies, and statistics”, bear with me for a moment. That statement dates from the 1890s, long before the development of modern mathematical methods to study data involving probabilities. There are now many powerful methods for analyzing data in an objective and repeatable way, ones that generate no controversy about their outcome. Statistics is no longer whatever you want to make of it.

In the case at hand, there are several possible causes of variation in the quality scores assigned to the wines:
  1. the wines come from two countries — this is what we want to examine
  2. for each country, there were 10 reds and 10 whites selected out of all of the high-quality wines that could have been chosen for inclusion — the results might have been different if other wines had been chosen
  3. there were only 11 people selected as tasters, out of all of the wine critics that could have been chosen — the results might have been different if other tasters had been chosen
  4. these people were of different genders, and there were not equal numbers of males and females.
There are other factors that might be important, of course, but this list covers all the information that we have. The point here is that we wish to study factor 1, and to do so we need to take into account the variation caused by factors 2, 3 and 4. This is what modern statistical analysis is all about: studying one source of variation in the face of variation caused by other factors. How do we do this in some objective and repeatable way?

The answer, as formalized by Ronald Fisher in the 1920s, is called Analysis of Variance. It does precisely what the name says — it tries to measure the amount of variation caused by each of the factors. If the variation for any particular factor has a strong pattern, compared to the others, then we can consider it to be an important one, and if the variation has no particular pattern then it is not important. This comparison is made with respect to the variation that is not accounted for by the specified factors.

The concept is fairly simple, but (of course) the mathematics is not. It can often be done by hand, if you are inclined to try that sort of thing, but I always use a computer program. I have used this type of analysis many times in my own research career; and I have even taught this analysis to undergraduate and postgraduate biology students. So, what happens if I apply it to the data from the Judgment of Paris?
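To make the idea concrete, here is a minimal one-way example in Python, using entirely hypothetical scores (three wines, five scores each), not the Paris data. It asks whether the variation between the wines stands out against the variation within each wine's scores:

```python
from scipy.stats import f_oneway

# Hypothetical quality scores for three wines (five tasters each);
# these numbers are invented purely for illustration.
wine_a = [14.0, 15.5, 13.0, 16.0, 14.5]
wine_b = [12.0, 11.5, 13.5, 10.0, 12.5]
wine_c = [16.5, 17.0, 15.0, 18.0, 16.0]

# One-way Analysis of Variance: compares the variation between the
# three groups with the variation within each group.
f_value, p_value = f_oneway(wine_a, wine_b, wine_c)
print(f"F = {f_value:.2f}, p = {p_value:.4f}")
```

Here the between-wine variation is large relative to the within-wine variation, so the F-value is well above 1 and the probability is small. The real analysis has four factors and a nested structure, but the underlying comparison is the same.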

Well, you already know the answer to that, from the title of this blog post. The formal details of the analysis are included at the bottom of the post, for those of you who might be interested in them. Here is a summary of the results:

Factor     F-value   Probability
Country      1.37      0.256
Wine         5.64      0.000
Gender       0.01      0.908
Person       3.63      0.000

Each row of the table refers to a formal statistical test of one of the four factors. The F-value (named after Fisher himself) is the result of the calculations; for purely random data this value will be close to 1. The probability (p-value) measures how likely an F-value at least this large would be if the factor had no real effect. By convention, we might choose p < 0.05 as our criterion for concluding that a factor shows important variation.
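The claim that the F-value hovers around 1 for purely random data is easy to check by simulation. This sketch (again hypothetical, not the Paris data) repeatedly generates scores with no real group differences and averages the resulting F-values:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Simulate 2000 tastings of 4 "wines" scored 10 times each, where
# every score is drawn from the same distribution, so there is no
# real wine effect at all.
f_values = []
for _ in range(2000):
    groups = rng.normal(loc=14, scale=2, size=(4, 10))
    f_val, _ = f_oneway(*groups)
    f_values.append(f_val)

mean_f = float(np.mean(f_values))  # close to 1, as expected for random data
```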

As you can see, there is almost no variation between the two genders in our sample, which may surprise none of you. There is some variation between the two countries, but it is not very large, and is nowhere near “statistical significance”. All of the fuss about the results is nonsense; a difference between the countries this large would arise from random chance alone about 26% of the time.

This does not mean that there were no differences between the wines, because there surely were. However, the analysis indicates that these differences had nothing to do with which country they came from. Simply put, some wines scored consistently much higher than others, meaning that the tasters agreed that they were better wines, irrespective of where they came from.

Unsurprisingly, the analysis also shows that there were big differences between the critics, with some of them giving much higher scores than others, irrespective of the wine. I produced a graph of this in my previous post on this topic, which I have reproduced here. It shows who scored highest (at the top) and who scored lowest (at the bottom).


Conclusion

The statistical analysis makes it clear that the variation in scores was no different from what we would expect from any single collection of wines and people at one place and one time. Some wines scored better than others, and some people gave higher scores than others, and the country of origin had nothing to do with it.

That is, we would not expect the results to be repeatable. Nor were they, as I have already pointed out in two previous blog posts (Was the Judgment of Paris repeatable? Did California wine-tasters agree with the results of the Judgment of Paris?). Different people tasting the same wines produced different results.

There were lots of differences in scores between the wines and between the people, but not much difference between the countries. So, what was all of the fuss about? Cultural politics, is my guess.




Analyses

I tried several different General Linear Models, using the Minitab package. The Country and Gender factors were both fixed; Wine was nested within Country, and Person within Gender, both as random factors.

Factor           Type     Levels
Country          fixed       2
Wine(Country)    random     20
Gender           fixed       2
Person(Gender)   random     11

The simplest model (as reported above) has only the four main factors.

Factor      DF   Sum Squares   F-value   P-value
Country      1       69.323      1.37      0.256
Wine        18      908.359      5.64      0.000
Gender       1        0.456      0.01      0.908
Person       9      292.376      3.63      0.000
Residual   190     1699.168
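The F-values and p-values in this table can be reproduced directly from the sums of squares and degrees of freedom, assuming the usual convention for a nested model like this one: each fixed factor is tested against the random factor nested within it, and each random factor against the residual. A quick check in Python:

```python
from scipy.stats import f

# Mean squares = sum of squares / degrees of freedom (from the table above)
ms_country  = 69.323 / 1
ms_wine     = 908.359 / 18
ms_gender   = 0.456 / 1
ms_person   = 292.376 / 9
ms_residual = 1699.168 / 190

# Fixed factors are tested against their nested random factor;
# random factors are tested against the residual.
f_country = ms_country / ms_wine      # reproduces 1.37
f_wine    = ms_wine / ms_residual     # reproduces 5.64
f_gender  = ms_gender / ms_person     # reproduces 0.01
f_person  = ms_person / ms_residual   # reproduces 3.63

# The p-value is the upper-tail probability of the F distribution,
# e.g. for Country: F on (1, 18) degrees of freedom.
p_country = f.sf(f_country, 1, 18)    # reproduces 0.256
```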

It is also possible to add several of the interactions between pairs of the four factors. However, in this case some of the F-tests will only be approximate, based on adjusted sums-of-squares (these are indicated with an asterisk). The most complex model has three 2-factor interactions; adding the fourth 2-factor interaction collapses the model.

Factor            DF   Adj. Sum Squares   F-value   P-value
Country            1         50.464         1.47     0.242 *
Wine              18        576.866         8.59     0.000
Gender             1          0.767         0.03     0.859 *
Person             9        250.450         2.39     0.105
Country*Gender     1          1.146         0.19     0.701 *
Country*Person     9        104.767         1.24     0.277
Gender*Wine       18         67.162         0.40     0.987
Residual         162       1526.094

Note that in this analysis the Person factor is no longer statistically significant (due to the presence of the Person*Country interaction). Otherwise, the conclusions do not change.

9 comments:

  1. I think you've missed a much larger point in this tasting. Regardless of the final rankings, the panel of experts did not recognize that some of the wines were not either Burgundy or Bordeaux. And there is no statistical doubt about that.

    From which we can conclude that the whole idea of those regions having a unique terroir that is unachievable in any other region goes right out the window. And so does one of the primary building blocks of most wine education.

    IF the experts can't tell the difference, why are we teaching it?

    ReplyDelete
    Replies
    1. Absolutely!

      Delete
    2. Paul:

      Steven Spurrier did not "task" the panelists with identifying blind which wines were French and which wines were Californian.

      Rather, it was a subjective personal preference vote.

      (Or as Robert Lawrence Balzer -- who later staged his own version of the Judgment of Paris back in California -- would instruct us wine appreciation course students: "Okay children, time for the beauty contest!")

      Quoting from Wikipedia's entry:

      "Blind tasting was performed and the judges were asked to grade each wine out of 20 points. No specific grading framework was given, leaving the judges free to grade according to their own criteria.

      Rankings of the wines preferred by individual judges were based on the grades they individually attributed.

      An overall ranking of the wines preferred by the jury was also established in averaging the sum of each judge's individual grades (arithmetic mean). However, grades of Patricia Gallagher and Steven Spurrier were not taken into account, thus counting only grades of French judges.[3]"

      So questions about "terroir" and "typicity" by AOC and AVA were neither addressed nor resolved.

      And Paul, speaking as a fellow wine marketer, the waggish short answer to your question . . .

      "IF the experts can't tell the difference, why are we teaching it?"

      . . . is: because it's good marketing "story telling."

      Regards,

      ~~ Bob

      Delete
    3. By the way, I am NOT convinced that the genuine experts can't tell the difference.

      Let me nominate my expert: Darrell Corti, the famous grocer of Sacramento (California).

      And let me proffer this blind tasting anecdote:

      Excerpt from the Los Angeles Times “Sunday Calendar”
      (December 18, 1988):

      “A Connoisseur's Connoisseur:
      Darrell Corti probably knows more about food
      than anybody else in the state”

      URL: https://www.latimes.com/archives/la-xpm-1988-12-18-ca-762-story.html

      By Ruth Reichl
      Times Restaurant Critic

      "Oh no," groaned one of the guests, "do we have to?" A bottle of wine had just been plunked on the table, neatly wrapped in a plain brown bag. Once again it was time to play "Guess the Vintage."

      The mystery wine was poured for each of the dozen invitees. They swirled and sniffed. They hazarded guesses. For this group of august wine experts, it was not a new game. Somebody suggested that the wine came from Napa. Another detected the flavor of the 1953 Medoc.

      After 15 minutes they had determined only that it was red, not American, and made before 1935. "Oh, let's give up," said wine authority Robert Finigan, "none of us has a clue." There were nods all around the table as wine writer Barbara Ensrud, wine maker Jack Cakebread and wine connoisseur Narsai David all admitted that they were stumped.

      "Not yet," said a measured voice from the end of the table; it was the first time it had been heard. "I know the wine. I've just been trying to decide if it was the 1928 or the 1929. I think it must be the 1928."

      There was a gasp from the back of the room and winemaker John Trefethen announced the name of the wine he had placed on the table. Fifty-two years after the grapes were gathered, Darrell Corti had not only correctly named the vintage year, but placed the wine within yards of where it was made.

      Even the experts are not supposed to be able to do this sort of parlor trick; it is not for nothing that Corti is known as the "walking wine encyclopedia."

      Delete
  2. The great thing about the Judgement of Paris was the international publicity it gave to California wines. Not points or who won.

    ReplyDelete
  3. Paul Wagner has no clue what he's talking about. EVERYONE KNEW that all the reds were Cabernet based or Bordeaux blends, and all the whites were of Burgundian variety (Chardonnay). And terroir does make a huge difference. You can't tell me that Chablis tastes like Russian River Chardonnay.
    And as for David Morrison's analysis, I guess every gold medal won in the Olympics by means of "scores" (gymnastics, figure skating, diving, etc) demonstrates nothing, statistically speaking.

    ReplyDelete
    Replies
    1. Your last point is NOT correct. Each athlete is evaluated on their own merits. If each wine is assessed the same way, then that would be one thing. However, in this case the wines were assessed as a group of California versus French wines. This then requires the statistical analysis to conclude anything. Otherwise, it is just another collection of individual wines — and then there would be no Judgment of Paris!

      Delete
  4. It has always seemed to me that the primary takeaway from the '76 tasting had to do with the expectations brought to the tasting, which were, apparently, that the perceived differences in quality between French and California wines (and winemaking capability) could not be reconciled with the actual experience of the participants, once the wines were tasted side-by-side. It made for a story that people seemed to welcome eagerly.

    ReplyDelete
    Replies
    1. I agree with both your points: the expectation, and the response. These were, indeed, the main points. This does not obviate the idea that these are subjective, rather than objective, things.

      Delete