The Wine Gourd: How large is between-critic variation in quality scores?

I have written before about the Poor correlation among critics' quality scores (see also Can non-experts distinguish anything about wine?). This topic refers to what is technically called inter-individual variation in the scorer, which you might call "between-taster variation" — the same wine tasted by different people, even on the same occasion, does not necessarily receive the same quality score, even when it comes from the same bottle.

This results from two things: (i) variation in personal assessment of the wine (the assessment of quality is the result of each taster’s previous experiences as well as their personal conceptions); and (ii) differences in how this assessment is expressed in terms of a score.

This is an important issue for anyone who reads the opinions of wine commentators. After all, if there is more disagreement than agreement, then we might ask ourselves what it is that we are expecting to get out of reading the critics in the first place. It is for this reason that we are often advised to find a commentator whose wine tastes match our own, and read that person's reviews only.

This issue of inter-individual variation has been studied in the professional literature; and, indeed, many authors have concluded that wine criticism is a somewhat fraudulent activity, given the large personal component in the scores. I have included a list of relevant published papers at the end of this post.

What I will do in this post is take a broader look at this topic than I did in my previous post, but still examine particular examples of scores from particular wines and wineries. All of the wines will be red, since it seems to be rather hard to find large datasets of white wines that have been evaluated by many people (the wines of Sauternes are the most obvious ones).

Bordeaux First Growth wines

I will start by looking at the "Grand vin" wines of the five First Growth wineries from the Left Bank of the Bordeaux region, in France: Château Haut-Brion, Château Lafite-Rothschild, Château Latour, Château Margaux, and Château Mouton-Rothschild. All five of these wines have vintages going back centuries, although most of the available quality scores cover only the period after 1900.

For each of the five wines, I have compiled as many publicly available scores as I can, using principally the information provided by Wine-Searcher, 90plus Wines and Cellar Tracker. For each wine, I then restricted the dataset to those post-1900 vintages with quality scores from at least two commentators; and then I pooled the five wines together. [Note: In the previous post I analyzed a single wine.] Finally, I separately converted all scores to use a 100-point scale.
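
The conversion used to bring the scores onto a 100-point scale is not described here. As a minimal sketch only, assuming a simple linear rescaling in which a 20-point score s maps to 50 + 2.5 × s (a hypothetical mapping chosen for illustration, and not necessarily the one applied to these data), the step might look like this:

```python
def to_100_point(score, scale_max):
    """Convert a quality score to a 100-point scale.

    Assumption (hypothetical, for illustration only): 20-point scores are
    mapped linearly onto the 50-100 range; 100-point scores are unchanged.
    """
    if scale_max == 100:
        return float(score)
    if scale_max == 20:
        return 50.0 + 2.5 * score
    raise ValueError(f"unsupported scale: {scale_max}")


# Made-up example scores: one on a 20-point scale, one on a 100-point scale.
converted = [to_100_point(18.5, 20), to_100_point(96, 100)]
print(converted)
```

Other conversions are in use, and the choice affects the absolute values but not a correlation computed within a single commentator's scores, since a linear rescaling leaves correlations unchanged.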

For the analyses presented here, I have divided the data into two subsets: (i) 11 commentators with scores for at least 15 vintages of each of the five wines, covering the period from 1945–2014, inclusive; and (ii) 11 commentators with scores for at least 14 vintages of each of the five wines, covering the period from 1988–2014, inclusive.
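
The selection rule for each subset can be sketched in a few lines. This is an illustration under assumptions, not the procedure actually used: the record layout `(critic, wine, vintage, score)`, the function name `eligible_commentators` and the sample data are all hypothetical.

```python
from collections import defaultdict

WINES = ["Haut-Brion", "Lafite-Rothschild", "Latour", "Margaux", "Mouton-Rothschild"]


def eligible_commentators(records, first_year, last_year, min_vintages):
    """Return the commentators who scored at least `min_vintages` vintages
    of every wine in WINES within the period first_year..last_year."""
    vintages = defaultdict(set)  # (critic, wine) -> set of vintages scored
    for critic, wine, vintage, _score in records:
        if first_year <= vintage <= last_year:
            vintages[(critic, wine)].add(vintage)
    critics = {critic for critic, _wine in vintages}
    return sorted(
        c for c in critics
        if all(len(vintages.get((c, w), ())) >= min_vintages for w in WINES)
    )


# Made-up records: critic "A" scored two vintages of all five wines,
# critic "B" scored two vintages of only four of them.
records = [("A", w, y, 95) for w in WINES for y in (1990, 1995)]
records += [("B", w, y, 93) for w in WINES[:4] for y in (1990, 1995)]
print(eligible_commentators(records, 1988, 2014, 2))
```

With the thresholds described above, the two calls would be `eligible_commentators(records, 1945, 2014, 15)` and `eligible_commentators(records, 1988, 2014, 14)`.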

There are eight commentators who appear in both datasets (Falstaff Magazin, Jeff Leve, Robert Parker, Jean-Marc Quarin, Jancis Robinson, James Suckling, Stephen Tanzer, Wine Spectator), and six who appear in one but not the other (Michel Bettane and Thierry Desseauve, Jeannie Cho Lee, Richard Jennings, John Kapon, La Revue du Vin de France, Vinum Weinmagazin). There are many well-known sources of Bordeaux wine commentary for whom I could not find sufficient data, including Wine Enthusiast, Decanter, Vinous (Antonio Galloni), Gault & Millau, Wine & Spirits Magazine, and Tim Atkin. It is worth noting that most of the Wine Spectator's scores were actually from James Suckling, along with a few from Thomas Matthews, James Molesworth and Harvey Steiman (who have all reviewed the red wines of Bordeaux for that magazine), plus some that were unattributed.

So, we can now look at how similar the quality scores of these commentators are, when pooled across these five wines. [Aside: the picture does not change much if we consider each of the five wines separately.] Let's start with the first dataset, covering the period since the 1945 vintage. All of these commentators are from the USA except for Jancis Robinson (UK), Falstaff Magazin (Austria) and Jean-Marc Quarin (France).

As before, we can quantify the relationships among the scores using correlation analysis. The table below shows, for each pair of the 11 commentators, the percentage of the variation in their scores that is held in common. In this table, values >60% are highlighted in green, those between 50 and 60% are in blue, and those <10% are in yellow.
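
For readers who want to try this on their own data, here is a minimal sketch of one way to compute such a pairwise percentage. It assumes that "percentage held in common" means the squared Pearson correlation (times 100), calculated only over the wines and vintages that both commentators scored; the function names and the sample scores are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists (non-constant values)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def shared_percentage(scores_a, scores_b):
    """Squared correlation, as a percentage, between two commentators.

    scores_a, scores_b: dicts mapping (wine, vintage) -> score.
    Only the (wine, vintage) pairs scored by both are used.
    """
    common = sorted(set(scores_a) & set(scores_b))
    xs = [scores_a[k] for k in common]
    ys = [scores_b[k] for k in common]
    r = pearson_r(xs, ys)
    return 100.0 * r * r


# Made-up scores for four vintages of one wine.
critic_a = {("Latour", 2000): 1, ("Latour", 2001): 2, ("Latour", 2002): 3, ("Latour", 2003): 4}
critic_b = {("Latour", 2000): 1, ("Latour", 2001): 3, ("Latour", 2002): 2, ("Latour", 2003): 4}
print(round(shared_percentage(critic_a, critic_b), 1))
```

Note that a correlation of 0.8 between two commentators, which sounds high, corresponds to only 64% shared variation.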

Table showing correlations among the quality scores of 11 wine critics

As far as wine quality is concerned, the average agreement among the commentators is less than 50% in almost all cases, and more than half of the values are in the range 10–40%, which is rather low. Certainly, the critics disagree with each other much more than they agree. The only commentators who appear to be in strong agreement with each other are Jeff Leve and Robert Parker. At the other extreme, neither Jancis Robinson nor Richard Jennings has much in common with the other commentators.

It might be more useful to look at a picture of these data, rather than a table of numbers. To do this, we can employ a network, as described in the post on Summarizing multi-dimensional wine data as graphs, Part 2: networks. This is shown in the next graph. [Technical note: the correlation scores were first converted to euclidean distances, and the NeighborNet graph was drawn using the SplitsTree program.]
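
The exact conversion from correlations to distances is not given in the technical note. One common choice, assumed here purely for illustration, is d = sqrt(2 × (1 − r)), which is the Euclidean distance between two standardized score vectors up to a constant factor. A minimal sketch (function names are hypothetical):

```python
from math import sqrt


def correlation_to_distance(r):
    """Assumed conversion: d = sqrt(2 * (1 - r)).

    r = 1 gives distance 0, r = 0 gives sqrt(2), and r = -1 gives 2.
    The max() guards against tiny negative values from rounding.
    """
    return sqrt(max(0.0, 2.0 * (1.0 - r)))


def distance_matrix(corr):
    """Convert a square matrix (list of lists) of correlations to distances."""
    return [[correlation_to_distance(r) for r in row] for row in corr]


# Made-up 2 x 2 correlation matrix for two commentators.
corr = [[1.0, 0.5],
        [0.5, 1.0]]
dist = distance_matrix(corr)
print(dist)
```

The resulting matrix is the kind of input that a distance-based network method such as NeighborNet works from.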

Network showing similarities among the quality scores of 11 wine critics

In this graph, the lengths of the lines represent the amount of information. The interconnected lines in the centre represent the shared information, with the terminal lines ("leaves") representing the unique information. In this case the longest lines are the terminals, indicating that there is little commonality among the quality scores.

The connections among the lines represent who is agreeing with whom. For example, Parker and Leve are closely associated in the network, as expected from the results shown in the table above (their association is indicated by the short distance separating them along the lines of the network). You can see that there is also some association between Tanzer and Cho Lee, between Suckling and Kapon, and between Robinson and Jennings (and also between Robinson and Kapon). The Spectator magazine and Jean-Marc Quarin appear to have some similarity to the scores of Robert Parker; but the relationships of Falstaff magazine are unclear. It is worth noting that the scores of Suckling and the Spectator are not closely associated, in spite of the fact that most of them come from the same person (see Are the quality scores from repeat tastings correlated?).

We can now do the same two analyses for the second dataset, covering the period since the 1988 vintage. The correlation analysis reveals that the following percentages are held in common between the 11 commentators pairwise.

A similar pattern emerges, although the average values are slightly larger for this restricted dataset. The average agreement among the commentators is still less than 50% in most cases, and more than half of the values are in the range 10–40%. Thus, the critics disagree with each other much more than they agree. However, in this dataset there are now several pairs of critics who share more than 50% agreement. Jancis Robinson is once again involved in the values that are less than 10%.

The network picture of the same data is shown in the next graph. Several of the previous associations are not present, because three of the commentators are not in this dataset (Cho Lee, Jennings, Kapon).

Suckling, Falstaff and Spectator are closely associated, as expected from the results in the table, as are Parker and Leve. More interestingly, we can now evaluate three new commentators, all from France. Indeed, the four French commentators are closely associated in the graph (La Revue du Vin de France, Bettane et Desseauve, Quarin, and Vinum Weinmagazin [from Switzerland]). Of these, only the scores of La Revue du Vin de France seem to be associated with any of the non-French commentators, having some connection to those of Robert Parker.

Thus, there is little commonality among the scores of different commentators, and this is especially true for Jancis Robinson. Furthermore, the four French commentators do seem to form a separate group from the others. Perhaps it is relevant that these five commentators (Robinson and the four French ones) are the only ones in the dataset who use a 20-point quality scale rather than a 100-point scale.

Hill of Grace (Australia)

As an addendum, we can take a quick look at another single Australian wine that has a long record of critics' scores, following the Penfolds Grange used in the previous post.

Unlike Grange, the Henschke Hill of Grace is a single vineyard wine, made from c. 7 ha of shiraz. The oldest vines were planted in the 1860s; and in a good year there are about 2,000 cases of wine produced. The vintages date back to 1958, with four vintages when the wine was not made (leaving 50 released vintages for analysis). There are five commentators who have provided quality scores for almost all of these vintages, and two more who have covered at least one third of them.

As above, we can quantify the relationships among the scores using correlation analysis. This analysis reveals that the following percentages are held in common between the seven commentators pairwise. As above, values >60% are highlighted in green, those between 50 and 60% are in blue, and those <10% are in yellow.

Table showing correlations among the quality scores of 7 wine critics

As you can see, the correlations are extremely poor, except for those involving Huon Hooke, which are not quite so bad. None of the values exceeds 50%, and most are actually <10%. This means that there is very little agreement among the commentators in their quality scores.

This is the most extreme example of disagreement in quality scores that I have encountered.

Conclusion

The answer to the question posed in the title is: "a lot".

This broader analysis (six wines) confirms the results from the previous blog post (Poor correlation among critics' quality scores). The idea that wine commentators have some sort of consensus opinion with regard to wine quality is completely untenable, for all of the wines checked so far. In general, the agreement varies from 0–50%, so that the critics disagree more than they agree.

However, there are patterns of association among the commentators, so that their quality scores are not completely random. Unfortunately, this seems to be a relatively minor component of the data patterns. Nevertheless, the four French commentators do seem to have opinions about the French wines that differ from those of the other commentators.

Research Literature

Johan Almenberg, Anna Dreber (2009) When does the price affect the taste? Results from a wine experiment. American Association of Wine Economists Working Paper No. 35.

Orley Ashenfelter, Richard Quandt (1999) Analyzing a wine tasting statistically. Chance 12(3):16-20.

Robert H. Ashton (2011) Improving experts’ wine quality judgments: two heads are better than one. Journal of Wine Economics 6:160-178.

Robert H. Ashton (2012) Reliability and consensus of experienced wine judges: expertise within and between? Journal of Wine Economics 7:70-87.

Robert H. Ashton (2013) Is there consensus among wine quality ratings of prominent critics? An empirical analysis of red Bordeaux, 2004-2010. Journal of Wine Economics 8:225-234.

George A. Baker, Maynard A. Amerine (1953) Organoleptic ratings of wines estimated from analytical data. Food Research 18:381-389.

Jeffrey C. Bodington (2012) 804 tastes: evidence on preferences, randomness, and value from double-blind wine tastings. Journal of Wine Economics 7:181-191.

Jeffrey C. Bodington (2015) Evaluating wine-tasting results and randomness with a mixture of rank preference models. Journal of Wine Economics 10:31-46.

Jeffrey C. Bodington (2015) Testing a mixture of rank preference models on judges’ scores in Paris and Princeton. Journal of Wine Economics 10:173-189.

Chris J. Brien, P. May, Oliver Mayo (1987) Analysis of judge performance in wine-quality evaluations. Journal of Food Science 52:1273-1279.

Jing Cao (2014) Quantifying randomness versus consensus in wine quality ratings. Journal of Wine Economics 9:202-213.

Jing Cao, Lynne Stokes (2010) Evaluation of wine judge performance through three characteristics: bias, discrimination, and variation. Journal of Wine Economics 5:132-142.

Jean-Marie Cardebat, Emmanuel Paroissien (2015) Reducing quality uncertainty for Bordeaux en primeur wines: a uniform wine score. American Association of Wine Economists Working Paper No. 180.

Jean-Marie Cardebat, Jean-Marc Figuet, Emmanuel Paroissien (2014) Expert opinion and Bordeaux wine prices: an attempt to correct biases in subjective judgments. Journal of Wine Economics 9:282-303.

Domenic V. Cicchetti (2004) Who won the 1976 blind tasting of French Bordeaux and US Cabernets? Parametrics to the rescue. Journal of Wine Research 15:211-220.

Domenic V. Cicchetti (2006) The Paris 1976 Wine Tasting revisited once more: comparing ratings of consistent and inconsistent tasters. Journal of Wine Economics 1:125-140.

Domenic Cicchetti, Arnold Cicchetti (2008) The balancing act in consistent wine tasting and wine appreciation: Part II: Consistency in wine tasting and appreciation: an empirical-objective perspective. Journal of Wine Research 19:185-191.

Domenic V. Cicchetti, Arnie F. Cicchetti (2013) As wine experts disagree, consumers’ taste buds flourish: how two experts rate the 2004 Bordeaux vintage. Journal of Wine Research 24:311-317.

Dom Cicchetti, Arnie Cicchetti (2014) Two enological titans rate the 2009 Bordeaux wines. Wine Economics and Policy 3:28-36.

Margaret A. Cliff, Marjorie C. King (1996) A proposed approach for evaluating expert wine judge performance using descriptive statistics. Journal of Wine Research 7:83-90.

Margaret A. Cliff, Marjorie C. King (1997) The evaluation of judges at wine competitions: the application of eggshell plots. Journal of Wine Research 8:75-80.

Margaret A. Cliff, Marjorie C. King (1999) Use of principal component analysis for the evaluation of judge performance at wine competitions. Journal of Wine Research 10:25-32.

Margaret A. Cliff, Mike O’Mahony, Lana Fukumoto, Marjorie C. King (2000) Development of a ‘bipolar’ R-index. Journal of Sensory Studies 15:219-229.

Victor Ginsburgh, Israël Zang (2012) Shapley ranking of wines. Journal of Wine Economics 7:169-180.

Neal D. Hulkower (2009) The Judgment of Paris according to Borda. Journal of Wine Research 20:171-182.

Dennis V. Lindley (2006) Analysis of a wine tasting. Journal of Wine Economics 1:33-41.

Jonas De Maere (2014) Do expert tasters evaluate wines consistently? A statistical analysis and a proposal for improvement. Weinakademiker thesis, Weinakademie Österreich.

Philippe Masset, Jean-Philippe Weisskopf, Mathieu Cossutta (2015) Wine tasters, ratings, and en primeur prices. Journal of Wine Economics 10:75-107.

Ingram Olkin, Ying Lou, Lynne Stokes, Jing Cao (2015) Analyses of wine-tasting data: a tutorial. Journal of Wine Economics 10:4-30.

Wendy V. Parr, James A. Green, K. Geoffrey White (2006) Wine judging, context and New Zealand sauvignon blanc. Revue Européenne de Psychologie Appliquée 56:231-238.

Anthony Pecotich, Steven Ward (2010) Taste testing of wine by expert and novice consumers in the presence of variations in quality, brand and country of origin cues. American Association of Wine Economists Working Paper No. 66.

Richard E. Quandt (2006) Measurement and inference in wine tasting. Journal of Wine Economics 1:7-30.

Richard E. Quandt (2012) Comments on the Judgment of Princeton. Journal of Wine Economics 7:152-154.

Christine H. Scaman, J. Dou (2001) Evaluation of wine competition judge performance using principal component similarity analysis. Journal of Sensory Studies 16:287-300.

Eric T. Stuen, Jon R. Miller, Robert W. Stone (2015) An analysis of wine critic consensus: a study of Washington and California wines. Journal of Wine Economics 10:47-61.

Daniel L. Ward (2012) A graphical and statistical analysis of the Judgment of Princeton wine tasting. Journal of Wine Economics 7:155-168.

Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.

Roman L. Weil (2005) Analysis of reserve and regular bottlings: Why pay for a difference only critics claim to notice? Chance 18(3):9-15.

Roman L. Weil (2007) Debunking critics' wine words: can amateurs distinguish the smell of asphalt from the taste of cherries? Journal of Wine Economics 2:136-144.

Monday, April 3, 2017
