Monday, August 21, 2017

Wine tastings: should we assess wines by quality points or rank order of preference?

At formal wine tastings, the participants often finish by putting the wines in some sort of consensus quality order, from the wine most-preferred by the tasting group to the least-preferred. This is especially true of wine competitions, of course, but trade and home tastings are often organized this way, as well.

One interesting question, then, is how this consensus ordering should be achieved, and whether different methods consistently produce different results.


At the bottom of this post I have listed a small selection of the professional literature on the subject of ranking wines. In the post itself, I will look at some data on the subject, ranking the wines in two different ways.

Dataset

The data I will look at come from the Vintners Club. This club was formed in San Francisco in 1971, to organize weekly blind tastings (usually of 12 wines). Remarkably, the club is still extant, although the tastings are now monthly, instead of weekly. The early tastings are reported in the book Vintners Club: Fourteen Years of Wine Tastings 1973-1987 (edited by Mary-Ellen McNeil-Draper, 1988).

The Vintners Club data consist of three pertinent pieces of information for each wine at each tasting:
  • the total score, determined by summing each taster's ranking (1-12) of the wines in descending order of preference (1 is most preferred, 12 is least preferred)
  • the average of the UCDavis points (out of 20) assigned by each taster — the Vintners Club has "always kept to the Davis point system" for its tastings and, therefore, averaging the scores is mathematically valid
  • the number of tasters voting for the wine as 1st place (and also 2nd and 12th).
The Vintners Club uses the total score as their preferred ranking of the wines for each tasting. That is, in the book the wines are ranked in ascending order of their total score, with the minimum score representing the "winning" wine.
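
To make the two summary statistics concrete, here is a minimal sketch in Python, using invented ranks and Davis scores for four tasters and three wines (not Vintners Club data):

```python
import numpy as np

# Hypothetical data: rows = tasters, columns = wines A, B, C.
# Each taster ranks the wines (1 = most preferred) and also gives
# each wine a Davis score out of 20.
ranks = np.array([
    [1, 2, 3],
    [2, 1, 3],
    [1, 3, 2],
    [2, 1, 3],
])
davis_points = np.array([
    [17.0, 16.5, 15.0],
    [16.0, 17.5, 14.5],
    [18.0, 15.0, 16.0],
    [16.5, 17.0, 15.5],
])

total_score    = ranks.sum(axis=0)           # lower total   = more preferred
average_points = davis_points.mean(axis=0)   # higher average = more preferred

print("Total scores:  ", total_score)        # wine A has the lowest total (6)
print("Average points:", average_points)     # wine A also has the highest average (16.875)
```

In this invented example the two summaries happen to agree on the winning wine; the question addressed below is how often they agree in the real Vintners Club data.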

For my dataset, I chose the results of the 45 "Taste-offs" of California wine. These tastings were the play-offs / grand finals (depending on your sporting idiom), consisting of the first- and second-place wines from a series of previous tastings of the same grape varieties. The Vintners Club apparently began its annual Taste-off program in 1973, and has pursued the concept ever since.

In my dataset, there are 14 Taste-offs for cabernet sauvignon, 12 for chardonnay, 9 for zinfandel, 4 for pinot noir, 3 for riesling, and one each for sauvignon blanc, gamay, and petite sirah. There were 17-103 people attending each of the 45 Taste-offs (median 56 people per tasting), of whom 43-96% submitted scores and ranks (median 70%).

For each tasting, I calculated the Spearman rank correlation between the rank-order of the wines as provided by the total scores and the rank-order of the wines as provided by the average Davis points. This correlation, expressed as a percentage (here 0-100%), measures how closely the two rank-orders (total scores versus average points) match; the higher the percentage, the greater the agreement between the two rankings for that tasting.
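
As a minimal sketch of this calculation (assuming the two rank-orders for a single tasting have already been tabulated; the 12 ranks below are invented for illustration, not Vintners Club data), the correlation can be computed with scipy:

```python
from scipy.stats import spearmanr

# Hypothetical rank-orders of the 12 wines from one tasting:
# element i is the rank of wine i under each summary method.
rank_by_total_score    = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rank_by_average_points = [2, 1, 3, 5, 4, 7, 6, 8, 10, 9, 12, 11]

rho, p_value = spearmanr(rank_by_total_score, rank_by_average_points)
print(f"Spearman correlation: {100 * rho:.0f}%")   # agreement expressed as a percentage
```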

Total scores and average points

The graph shows the results of the 45 tastings, with each point representing one of the Taste-offs. The horizontal axis represents the number of people providing scores for that tasting, while the vertical axis is the Spearman correlation for that tasting.

Correlation between two methods for ranking wines

As you can see, in most cases the correlation lies between 50% and 100%. However, in only about 1 in every 5 tastings is the correlation above 90%, the level that would indicate almost the same ranking from the two schemes. So, we may conclude that the total score and the average points do not usually provide the same rank-order of the wines at each tasting.

Indeed, in two cases the two schemes provide very different rank-orders for the wines, with correlations of only 41% and 23%. This is actually rather surprising. These two tastings both involved chardonnay wines, for some reason.

It is a moot point whether to sum the ranks or average the scores. That is, we cannot easily claim that one approach is better than the other — they produce different results, not better or worse results. However, for both approaches there are technical issues that need to be addressed.

For averaging, we need to ensure that everyone is using the same scale, otherwise the average is mathematically meaningless (see How many wine-quality scales are there? and How many 100-point wine-quality scales are there?). Similarly, when trying to combine ranks, there is no generally agreed method for doing so — in fact, different ways of doing it can produce quite inconsistent outcomes (see the literature references below).

Number of first places

For those wines ranked first overall at each tasting, only 4-60% of the scorers had actually chosen them as their personal top-ranked wines of the evening, with an average of 22%. That is, on average, less than one-quarter of the scorers ranked the overall "winning" wine as being at the top of their own personal list. This indicates that rarely was there a clear winner.

Indeed, for only about half of the tastings was the "winning" wine also the one that received the largest number of first places, whether the winner was determined by the sum of ranks or by the average points. For the wines ranked first overall (lowest total score) at each tasting, in only 24 of the 45 tastings was that wine the one that received the greatest number of 1st-place votes during the evening. Similarly, for the wines with the highest average score at each tasting, this was true in only 25 of the 45 tastings.
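
As a minimal sketch of how this sort of comparison can be tallied (again with a small invented example rather than the Vintners Club data), the wine with the lowest sum of ranks need not be the wine with the most 1st-place votes:

```python
import numpy as np

# Hypothetical ranks: rows = tasters, columns = wines (1 = that taster's top wine).
ranks = np.array([
    [2, 1, 3, 4],
    [2, 1, 4, 3],
    [1, 4, 2, 3],
    [2, 3, 1, 4],
    [2, 4, 3, 1],
])

winner_by_total   = np.argmin(ranks.sum(axis=0))   # lowest sum of ranks wins
first_place_votes = (ranks == 1).sum(axis=0)       # count of 1st-place votes per wine
winner_by_firsts  = np.argmax(first_place_votes)

# Here wine 0 wins on the sum of ranks (the consensus "winner"),
# but wine 1 actually received the most 1st-place votes.
print("Winner by sum of ranks:   wine", winner_by_total)
print("Winner by 1st-place votes: wine", winner_by_firsts)
```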

We may safely conclude that neither being ranked 1st by a lot of people, nor getting a high average score from those people, will necessarily make a wine the top-ranked wine of the evening. As I have noted in a previous blog post, often the winning wine is simply the least-worst wine.

Footnote

Confusingly, for each tasting, the Vintners Club rank data very rarely add up to the expected total for the number of people providing results. That is, since the ranks 1+2+...+12 sum to 78, the sum of the ranks should equal 78 x the number of people providing scores. A total a few points less than the expected number likely represents a few tied votes by some of the scorers. However, there are also many tastings where the total scores add up to much more than is possible for the number of people present at the tasting. I have no explanation for this. (And yes, I have considered the effect of alcohol on the human ability to add up numbers!)



Research Literature

Michel Balinski, Rida Laraki (2013) How best to rank wines: majority judgment. In: E. Giraud-Héraud and M.-C. Pichery (editors) Wine Economics: Quantitative Studies and Empirical Applications, pp. 149-172. Palgrave Macmillan.

Jeffrey C. Bodington (2015) Testing a mixture of rank preference models on judges’ scores in Paris and Princeton. Journal of Wine Economics 10:173-189.

Victor Ginsburgh, Israël Zang (2012) Shapley ranking of wines. Journal of Wine Economics 7:169-180.

Neal D. Hulkower (2009) The Judgment of Paris according to Borda. Journal of Wine Research 20:171-182.

Neal D. Hulkower (2012) A mathematician meddles with medals. American Association of Wine Economists Working Paper No. 97.

2 comments:

  1. (American mathematician-turned-songwriter/singer satirist Tom Lehrer, in the lyrics to his song titled "New Math," asks the concert audience, on a live recording, for a volunteer to solve the arithmetic problem 342 minus 173 . . . in base eight. In reaction to the crowd response, he quips: "Now, let's not always see the same hands!"

    Lehrer's reply feels like it could be directed at me for invariably going first with a comment to David's Monday morning blog.

    Wait! What? You say nobody else is awake at 1:30 in the morning reading his blog?)

    I have "a little" experience on this topic du jour.

    At the more than 100 sit-down winetasting luncheons I organized or co-organized here in Los Angeles (influenced by Vintners Club tastings in San Francisco), I insisted on participants casting First, Second and Third overall preference votes.

    With upwards of 24 submissions in the line-up comprising single-vintage, single-variety, single-region (e.g., California) wines, I felt that asking the participants to "force rank" all the wines from First to Last (e.g., 24th) would be impossible. They simply wouldn't be able to break their ties on preference rankings -- creating havoc in the vote-taking.

    When a participant was stymied trying to break a tie when assigning a Top 3 Overall ranking, I asked this simple rhetorical question: "Imagine you can take home as my gift only one of those tied-ranking bottles. Which one will it be?"

    That nudged participants into compliance.

    I never asked the participants for their numerical scores. Since each individual scored wines differently -- thumbs up/thumbs down, 5 stars, 20 points, 100 points -- there would be no statistically correct way to compare them.

    And as we repeatedly found, the highest scoring wine according to The Wine Advocate or Wine Spectator was not the crowd favorite.

    "De gustibus non est disputandum."

    Robert Parker's and James Laube's palates have been formed through the prism of their life's experiences.

    Don't expect your palate to align with theirs.

  2. Excerpts from Robert Parker on How He “Rates” Wines:

    Source: Robert Parker, The Wine Advocate (issue 84, dated 12-11-92):

    “Long-time readers know that I am more critical of older wines than many other writers. To merit high ratings, an older wine must still be fully alive with its personality intact.”

    Source: Robert Parker, The Wine Advocate (issue 90, dated 12-20-93):

    “Readers should recognize that when tasting old bottles the expression, ‘There are no great wines, only great bottles,’ is applicable. . . . Long-time readers have noted that I prefer my wines younger rather than older. Therefore, regardless of its historical significance, no wine which tastes old and decrepit will receive a good review. Those old wines that receive enthusiastic evaluations do so because they remain well-preserved and loaded with remarkable quantities of rich, pure fruit. They possess a freshness, in addition to the profound complexity that developed with significant bottle age. . . . bottles that received perfect or exceptional reviews are living, rich, concentrated, compelling wines that justify the enormous expense and considerable patience collectors invest in maturing the finest young wines from top vintages.”

    Source: Robert Parker, The Wine Advocate (issue 103, dated 2-23-96):

    “Long-time readers know that I am a fruit fanatic, and if a wine does not retain this essential component, it is not going to receive a satisfactory review.”

    Source: Robert Parker, The Wine Advocate (issue 109, dated 6-27-97):

    “The 1990 Le Pin [red Bordeaux, rated 98 points] is a point or two superior to the 1989 [Le Pin, rated 96 points], but at this level of quality comparisons are indeed tedious. Both are exceptional vintages, and the scores could easily be reversed at other tastings.”

    Source: Robert Parker, The Wine Advocate (unknown issue from 2002):

    “. . . Readers often wonder what a 100-point score means, and the best answer is that it is pure emotion that makes me give a wine 100 instead of 96, 97, 98 or 99. ”
