The Wine Gourd: Can non-experts distinguish anything about wine?

Roman L. Weil is a professor of accounting, with an interest in wine. During the early 2000s he conducted three similar experiments to assess the ability of non-experts (primarily educated, upper middle-class individuals who were experienced and enthusiastic wine drinkers) to distinguish various characteristics of wine. These distinctions included:

vintages rated by an expert as good versus poor
wines selected for a special "reserve" bottling versus the normal wine
different taste descriptors provided by an expert.

Here, I summarize the results of those experiments, because they seem not to be widely known and yet they lead to very interesting conclusions. In my usual fashion, I present pictures of the results (i.e. graphs) rather than the original tabulated numbers, because it is then much easier to see the patterns in the data and thus to appreciate the conclusions.

Methods

All of the experiments were designed in the same way. Several different pairs of wines were chosen for each experiment, the pairing being determined by the particular objective of each experiment; these wine pairs constitute the experimental replication. The paired wines were presented to several hundred different tasters, spread over a number of different places and occasions; these people constitute the replicate sample units.

In each case, each taster was presented with three unlabeled glasses, one glass containing one of the wines and two glasses containing the other wine from the same pair. In this triangular experiment, the taster was asked to pick out the singleton wine (i.e. one of the glasses should taste different to the other two glasses). The taster was then asked to identify certain characteristics of the two wines. On any one occasion, tasters received 1–3 of the wine pairs.

The results were summed for each wine pair separately, listing the number of people who correctly distinguished the two wines in each pair, and then how many of those successful people correctly identified the chosen characteristics. Note that distinguishing the characteristics is not relevant unless the taster could actually distinguish the paired wines in the first place!

By random chance, the tasters should be able to distinguish the paired wines one-third of the time (i.e. identifying the singleton glass out of three). So, our "expected" result is 33% if the tasters can do no better than random (i.e. guessing). For the second task, identifying the characteristics, the expectation is 50% if the tasters can do no better than random (i.e. there are only two alternatives to choose between).
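To make these chance expectations concrete, here is a minimal Python sketch of how one might ask whether a given number of correct answers is plausible from guessing alone. It computes an exact one-sided binomial tail probability. The function name and the counts are my own hypothetical choices for illustration (they are not taken from Weil's tables), and this is not the analysis used for the graphs in this post.

```python
from math import comb

# Chance success rates described in the text.
TRIANGLE_CHANCE = 1 / 3        # picking the singleton glass out of three
CHARACTERISTIC_CHANCE = 1 / 2  # choosing between two alternatives


def binomial_tail(successes: int, trials: int, chance: float) -> float:
    """Probability of at least `successes` correct answers out of `trials`
    if every taster is purely guessing with success rate `chance`."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts, for illustration only (not Weil's data):
# 45 of 100 tasters pick the singleton glass, and 25 of those 45
# then identify the characteristic correctly.
p_triangle = binomial_tail(45, 100, TRIANGLE_CHANCE)
p_characteristic = binomial_tail(25, 45, CHARACTERISTIC_CHANCE)

print(f"triangle task: P(at least 45 of 100 by guessing) = {p_triangle:.4f}")
print(f"second task:   P(at least 25 of 45 by guessing)  = {p_characteristic:.4f}")
```

A small tail probability suggests that the tasters did better than guessing; a large one means the result is consistent with guessing. Note that the second calculation uses only the tasters who succeeded at the first task, as described above.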

Distinguishing different vintages of the same wine

Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.

The hypotheses being tested in this experiment are that the amateurs:

cannot distinguish in blind tastings the wines of years rated by an expert as high from those of years rated low, and
if they can, they do not agree with the vintage chart's preferences.

To test these hypotheses, Weil selected six "pairs of wines with the following characteristics: the pairs have identical features (such as shipper, vineyard, and producer) except vintage, and Robert Parker rated one the vintages of those two wines Average to Appalling while he ranked the other Excellent to The Finest in The Wine Advocates Vintage Guide 1970-1999." So, the only difference between the paired wines should be that they came from vintages that Parker thought were very different from each other.

There were 593 tasters. One of the wine pairs was presented to wine professionals ("experts") on two occasions, as well as to the amateurs on the other occasions, and so these experts are treated separately in the results. The pairs of wine were tasted by 54-119 tasters each.

The results of the first hypothesis test are shown in the next graph. For each of the graphs presented below, the interpretation is as follows. Each wine-pair is represented by a horizontal line, as indicated by the legend. The central point on each of the lines represents the percentage of the tasters who succeeded at the task for that wine pair. The two end points on each line are the boundaries of the estimated 95% confidence interval (formally: the Score binomial 95% confidence interval). This interval gets smaller as the sample size (the number of tasters) gets larger, as it represents our statistical "confidence" in the results of the experiment. The dashed line represents the expected results if the tasters are performing in a random manner — the idea of the experiment is to see whether people can do better than random. So, if the dashed line passes through the 95% confidence interval for a particular wine pair, then the tasters have done no better than random for that pair, whereas if the dashed line lies outside the 95% confidence interval then the tasters have done better than random.
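For readers who want to reproduce this kind of interval, the following is a minimal Python sketch of the score (Wilson) binomial confidence interval, together with the "does the dashed line fall below the interval?" check. The function names and the counts in the example are hypothetical, and I am assuming the usual normal quantile of 1.96 for a 95% interval; the software used for the graphs may differ in detail.

```python
from math import sqrt


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Score (Wilson) confidence interval for a binomial proportion.
    z = 1.96 corresponds to an approximate 95% interval."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return centre - half_width, centre + half_width


def better_than_chance(successes: int, trials: int, chance: float) -> bool:
    """True if the chance level lies below the whole interval,
    i.e. the dashed line sits to the left of the plotted line."""
    lower, _upper = wilson_interval(successes, trials)
    return lower > chance


# Hypothetical counts, for illustration only (not Weil's data).
for correct, tasters in [(38, 100), (50, 100), (500, 1000)]:
    low, high = wilson_interval(correct, tasters)
    verdict = better_than_chance(correct, tasters, 1 / 3)
    print(f"{correct}/{tasters}: interval {low:.3f} to {high:.3f}, better than 1/3? {verdict}")
```

Comparing the last two rows shows the point made above: the same observed percentage gives a much narrower interval when there are ten times as many tasters.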

Results of Roman Weil's experimental test of wines from different vintages

For the first graph, only the two groups of tasters receiving the Bordeaux wine performed better than random chance. Formally: for five of the wine pairs, the experiment provides no evidence that amateur wine tasters can distinguish between good and poor vintages any better than taking a guess. For the Bordeaux wine pair, both the amateurs and experts did better than taking a guess, with the wine experts doing slightly better than the amateurs.

This outcome calls into serious question the alleged difference of quality between different vintages in the modern world. Remember, Robert Parker (or his delegate) detected big differences in the vintages within a wine pair, but the amateurs could not consistently detect this for themselves when presented with actual examples of the wines. The different result for the Bordeaux wines may reflect the common conception that vintages really do still differ in Bordeaux.

The results of the second hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 30-50% of the tasters. The sample sizes therefore refer to only 21-60 tasters per wine pair.

Note that in all cases the tasters behaved in a random manner. That is, there was no consistent preference for the wine from the highly rated vintage compared to the poorer vintage, for any of the wines. We may conclude from this that expert vintage ratings are not related to wine preferences among wine drinkers. The wine from an allegedly poor vintage can taste just as good to an amateur drinker as a wine from a supposedly better vintage.

Distinguishing reserve and normal bottlings of the same wine

Roman L. Weil (2005) Analysis of reserve and regular bottlings: why pay for a difference only critics claim to notice? Chance 18(3):9-15.

The hypotheses being tested in this experiment are that the amateurs:

cannot distinguish in blind tastings the wines of reserve bottlings (or first wines) from the normal wines (or second wines), and
if they can, they do not prefer the reserve wine.

To test these hypotheses, Weil selected fourteen "pairs of wines based on the following characteristics: the pairs had identical features in all respects, except that one was a regular bottling and one was a reserve bottling. Common features included all label items (e.g. shipper, vineyard, and producer), retail source, and date of purchase." So, the only difference between the paired wines should be that the winemaker specially selected the reserve or first wine for separate bottling, at a much higher price (there was a price ratio of 1.13-3.57 for Weil's choices).

The results of the first hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. There were 855 tasters, with the pairs of wine being tasted by 38-136 tasters each. The two pairs of Champagne wines were each tasted by a small number of people only, and so I have pooled their results here (they did not differ from each other).

Results of Roman Weil's experimental test of wines from different bottlings

Note that the tasters do very much better here than in the previous experiment. That is, for six of the thirteen wine pairs the tasters did better than random when asked to distinguish the more expensive bottle of wine from the same winemaker. Mind you, they rarely did better than 50%, as opposed to the 33% expected from guessing. Interestingly, there are three wine types that are repeated in the experiment: the cabernet blend from Bordeaux, the cabernet wine from the western USA, and the white wine from California; and in all three cases the tasters succeeded with one wine but not the other.

Nevertheless, the results do indicate that, for tasters, what the winemaker does with the wine (selects wine for different bottlings, to be charged at different prices) often makes a bigger difference than what nature does with the wine (produces different climatic conditions in different years).

The results of the second hypothesis test are shown in the next graph. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 30-50% of the tasters. The sample sizes therefore refer to only 13-56 tasters per wine pair.

Here, the tasters did not consistently prefer the reserve wine over the normal wine, except in two cases. We may conclude from this that winemakers are, indeed, generally selecting wines of different taste for their different bottlings, but that this is not necessarily related to wine preferences among wine drinkers. The wine from an expensive bottle can taste just as good to an amateur drinker as one from a supposedly inferior bottle of the same wine.

The two exceptions are informative. For one of the California chardonnays there was a strong preference for the more expensive wine. This suggests that the winemaker succeeded in this particular case: they charged more ($26 versus $13) for a wine that drinkers actually prefer. In the opposite manner, for one of the Bordeaux wines there was actually a preference for the cheaper wine. It may surprise you to learn that this was a preference for the 1994 Les Forts de Latour ($56 at the time) over 1994 Château Latour ($200), the most expensive wine in the experiment. The Bordeaux first-growth chateaux might like to take note of this result (as might your wallet!). (Note: in general, the first wines of the Bordeaux first growths cost 3-4 times as much as their second wines; see the Liv-Ex blog.)

Matching wines and their descriptions

Roman L. Weil (2007) Debunking critics' wine words: can amateurs distinguish the smell of asphalt from the taste of cherries? Journal of Wine Economics 2:136-144.

The hypotheses being tested here are that the amateurs:

cannot distinguish in blind tastings wines that are described by an expert using different words, and
if they can, they cannot match the descriptions to the wines.

To test these hypotheses, Weil selected ten "pairs of wines with the following characteristics: the pairs have similar features, and the same writer / critic wrote about these two wines with disjoint word sets. That is, the reviewer used different words in describing the two wines." Note that the wines could actually come from different vintages or even continents, provided that they had similar grapes, etc.

The results of the first hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. There were 321 tasters, with the pairs of wine being tasted by 13-86 tasters each, which means much smaller sample sizes than for the other experiments.

Results of Roman Weil's experimental test of wines with different descriptions

Since the objective was to choose wines that differ in description by an expert, it is hardly surprising that the tasters succeeded in distinguishing the wine pairs in six out of the ten cases. However, in only one case did they do better than 60-70%, which does call into question the experts' abilities to describe wine in any quantitative way. After all, there are many examples in wine lore of different experts also describing exactly the same wine in completely disjunct words.

The results of the second hypothesis test are shown in the next graph. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 40-60% of the tasters. The sample sizes therefore refer to only 5-45 tasters per wine pair.

Sadly, in only one case could the tasters consistently match the wines to the expert descriptions. So, we may conclude that reading a description of a wine does not necessarily tell you what it will taste like to you.

Conclusions

Combined, these three experiments do not paint a happy picture of the wine business. Amateur wine tasters cannot consistently distinguish wines from different vintages or different bottlings, or with different descriptions. And when they can do so, their preferences do not necessarily agree with the professionals' assessments of quality — they are about as likely to prefer the one as the other. So, what is it that these professionals are doing? Whatever it is, it seems to be somewhat divorced from their customer base. In any case, there seems to be little reason to pay more for a "special" wine (a better year or a better selection), unless you have already checked it out and decided that you prefer it.

Quality versus preference

One potentially confusing aspect of Weil's experiments is that in two of his three experiments his second hypothesis is not actually related to the first one. In the first experiment his second question concerns which wine the tasters prefer, not which one they think is from the higher-rated vintage; and similarly for the second experiment, they are asked which wine they prefer rather than which one is the reserve wine. Only in the third experiment is the second question directly related to the objective — which wine matches which description.

It is important to recognize the distinction between "prefer / like" and "high quality" (otherwise, one of the two expressions would be redundant!). These are often treated as though they both mean "better", as in the expression "if you like it then it is good". However, these are two very different ideas — supposedly better quality does not mean that you should prefer it in any personal sense. Personal preference is all in your head, but differences in quality also exist outside of it.

For example, one does not need to like opera in order to recognize a poor opera singer, nor does one have to be a practicing Christian to appreciate the architectural and artistic merits of a church. So, recognition of quality is not necessarily related to personal choice. In my own case, I can accept that there are high-quality characteristics of Champagne, but I do not like the taste of those distinctive characteristics; I actually prefer the crémant wines from Alsace, Die or the Loire, or the sparkling wines of southern Australia. Financially, of course, this is to my benefit!

This point is important for a wine drinker. The ability to recognize which wine the professionals think has higher quality is a separate issue from whether you actually like that wine. Do I like the wines recommended by Robert Parker? Perhaps so, or perhaps not, but either way I can probably recognize them, because they have a similar set of characteristics. He sees those characteristics as denoting high quality, but I may well see them as something I don't particularly care for.

Weil is probably right to focus on "prefer / like", since that is of most practical relevance to a consumer; but we should not confuse this with "quality". It would be of interest to experimentally examine the latter, also.

Monday, November 14, 2016
