Monday, November 28, 2016

Summarizing multi-dimensional wine data as graphs, Part 1: ordinations

When collecting data, it is quite common to record several characteristics for each of a set of "objects". For example, a wine (the object) might come from a particular region, and be based on a particular grape type, have a particular winemaker, and be of a particular quality (four characteristics). Such data are referred to as being multi-dimensional.

When dealing with multi-dimensional data, we could analyze each characteristic separately. However, this would not give us an overview of the whole dataset, but merely tell us about each of the details. If we want an overview, then we need to summarize the multiple dimensions down into something that we can illustrate as a graph.

This summarization process is part of multivariate data analysis, sometimes also called pattern analysis. There are many mathematical techniques for doing this, because there are many ways of summarizing anything, data included. In particular, the result of each summary may be unique, because there may be many possible patterns in the data that could be included in a summary, and each analysis technique may pick a different part of the data to summarize. After all, a summary must lose information, by definition, and there can be different opinions about which bits not to lose — each analysis technique can be seen as having its own "opinion".

I will be illustrating two different types of multivariate data analysis, using examples of data from the wine world. In this post I will look at ordination analyses, and in the next post I will look at network analyses (see Summarizing multi-dimensional wine data as graphs, Part 2: networks).

Ordinations

Ordination analyses try to put the objects in some sort of rank order (hence the name), which can then be displayed as a one- or two-dimensional graph. In the graph, each point represents an object, and their positions relative to each other illustrate their similarity based on the original multi-dimensional characteristics. That is, the many original dimensions are reduced to one or two dimensions, and we then get a picture of the result. Points close together in the picture are more similar than are points further apart.

The specific example shown here is taken from this research paper:
María-Pilar Sáenz-Navajas, Eva Campo, Angela Sutan, Jordi Ballester, Dominique Valentin (2013) Perception of wine quality according to extrinsic cues: the case of Burgundy wine consumers. Food Quality and Preference 27: 44-53.
As part of their work, these authors showed 23 wine bottles to each of 48 people, and asked them to subjectively assess what they thought was the likely quality of the wines (ie. based solely on looking at the bottle and its label). Their responses were categorized as Low quality, Average quality, or High quality.

In this example, there are 23 objects (the wines), and the characteristics are the three quality outcomes. For each object, we have a count of how many people placed it in each of the three quality classes (ie. we have three dimensions).

An example ordination summary

We wish to summarize the three-dimensional data down to one dimension, showing us the order of assessed quality of the wines, averaged across the 48 people. The authors chose to produce this summary with an ordination technique called Correspondence Analysis, which is certainly appropriate for their data. The resulting order of the wines is shown at the top of the first graph, with each dot representing a single wine, ordered along the dotted line from lowest quality at the left to highest quality at the right.
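For readers who like to see the mechanics, here is a minimal sketch of how a single ordination axis can be extracted from a table of counts. It uses reciprocal averaging, a classic iterative route to the first axis of a correspondence analysis; it is not the authors' own computation, and the counts below are hypothetical (four made-up wines, 48 tasters each), as is the function name.

```python
def reciprocal_averaging(counts, n_iter=100):
    """First correspondence-analysis axis via reciprocal averaging."""
    n_rows, n_cols = len(counts), len(counts[0])
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(row[j] for row in counts) for j in range(n_cols)]
    total = sum(row_tot)

    def row_average(col_scores):
        # Each wine's score is the count-weighted average of the class scores
        return [sum(n * s for n, s in zip(row, col_scores)) / row_tot[i]
                for i, row in enumerate(counts)]

    # Arbitrary starting scores, increasing from Low to High
    col_scores = [float(j) for j in range(n_cols)]
    for _ in range(n_iter):
        row_scores = row_average(col_scores)
        # Each class's score is the count-weighted average of the wine scores
        col_scores = [
            sum(counts[i][j] * row_scores[i] for i in range(n_rows)) / col_tot[j]
            for j in range(n_cols)
        ]
        # Centre and rescale so the scores do not shrink towards a constant
        centre = sum(w * s for w, s in zip(col_tot, col_scores)) / total
        col_scores = [s - centre for s in col_scores]
        spread = (sum(w * s * s for w, s in zip(col_tot, col_scores)) / total) ** 0.5
        col_scores = [s / spread for s in col_scores]
    return row_average(col_scores), col_scores


# Hypothetical counts of (Low, Average, High) votes for four wines
counts = [
    [30, 15, 3],
    [20, 20, 8],
    [8, 22, 18],
    [4, 14, 30],
]
row_scores, col_scores = reciprocal_averaging(counts)
# Wines listed from lowest to highest assessed quality
order = sorted(range(len(counts)), key=lambda i: row_scores[i])
print(order)
```

The wine scores returned here play the role of the positions along the dotted line in the graph; a statistics package would normally be used for a full analysis with more than one axis.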

This is all very well, as we now have the wines in order, but obviously this isn't all that we want to know — we want to know what features of the wine labels led the participants to put the wine bottles in this particular order. This is easy to do for ordinations, and it is shown in the bottom five rows of the graph. Each row represents a different feature of the bottle labels, as indicated in the legend. The location of the colored dots within each row represents the average position along the dotted line of the wines with that feature.
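The averaging step described above can be sketched in a few lines. The axis positions and the labels below are hypothetical values chosen for illustration (including the "negociant" category), not numbers from the paper.

```python
from collections import defaultdict


def mean_position_by_feature(positions, labels):
    """Average the axis position of the wines sharing each label value."""
    groups = defaultdict(list)
    for position, label in zip(positions, labels):
        groups[label].append(position)
    return {label: sum(values) / len(values) for label, values in groups.items()}


# Hypothetical positions along the ordination axis for five wines
positions = [-1.2, -0.8, 0.1, 0.9, 1.0]
# Hypothetical "bottled by" feature for the same five wines
bottled_by = ["co-operative", "co-operative", "negociant", "winemaker", "winemaker"]

means = mean_position_by_feature(positions, bottled_by)
for label, value in sorted(means.items(), key=lambda item: item[1]):
    print(label, round(value, 2))
```

Each value in the result corresponds to one colored dot within a single row of the graph.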

For example, the second row indicates that the wines from the Pays d'Oc region are mostly down the left-hand (low quality) end of the graph, while the Burgundy and Jura wines are preferentially at the right (assessed as likely to be of high quality). Similarly, the fifth row indicates that wines bottled by a co-operative are preferentially at the low-quality end of the order, while wines bottled by the winemaker are at the high-quality end.

We can thus see at a glance which label features are associated with the decision that a wine might be of high quality, as assessed by the participants. This is what ordinations are all about — producing a picture of data once it has been arranged in some relevant order.

Ordinations seem to be rarely used in wine research, but I think that a case can be made that they should be used more often, as a very convenient way of summarizing complex data.

Monday, November 21, 2016

Gattinara 1958

The northern Italian region of Piedmont is famous for, among other things, the long-lived Barolo and Barbaresco wines made from the Nebbiolo grape. What is less well known is that in the northern part of Piedmont is a gathering of other DOC and DOCG areas, in the Vercelli hills a long way north-east of Turin. Perhaps the best of these is the Gattinara DOCG. Here, they also make Nebbiolo wines, but under the local name of Spanna. In the past, the Gattinara wines have been at least as admired as those of Barolo for their longevity; but this reputation has slipped in the modern world. A recent visit to the region is described by Quentin Sadler on his wine page.

Old bottles of Spanna can still be found, as indicated by the recent tasting of some 1964s recorded on the Barolista blog.

A bottle of 1958 Berteletti Spanna

Gattinara 1958, Spanna del Castello di Lozzolo, from Fratelli Berteletti
Purchased on eBay (in July), for €35 delivered to my door from Italy

When first opened there was only a faint aroma, but after a few hours with a cork lightly inserted in the neck (the François Audouze method for opening old bottles of wine) the aroma had increased remarkably. This wine was very much still in its prime. On pouring, the wine had a pale amber hue, fading significantly towards the edge of the glass. The aroma showed plum, honey and toast, with hints of prune, peach and plum jam, along with an earthy tone. In the mouth, there was still plenty of fresh acidity, with low tannins of course, balanced perfectly with flavors of plum, lemon, prune, almond and tobacco, all complemented by a long aftertaste.

This wine was among the best old wines I have tasted. It was every bit the equal of a more expensive 1958 Giacomo Borgogno Barolo Riserva, tasted back in August 2009. Indeed, 1958 was among the best of the old Piedmont vintages, and there are still quite a few bottles available even now.

The wine was drunk with a dinner of meatballs in tomato sauce and parmesan cheese, but it went especially well with the pre-dinner Pecorino Smeraldo, which is a Sardinian sheep's cheese.

[Tasting notes by Susanne Stenlund.]

Note: Among the 1964 Spanna wines tasted by Barolista was also a bottle from Fratelli Berteletti, which received a tasting note very similar to the one above:
A stunner from the first pour until the last sip. The nose is big, mature and velvety with notes of dried black cherries, liquorice, asphalt and dried flowers. Very deep. Some coffee ground notes after a while. Very much alive and kicking. The taste is broad and steady with notes of black olives, dried mushrooms, rosehips and warm gravel. Long and rich. This is really good. Given this blind I would have guessed it to be from the 80s. Will go on for another 20 years.
Viva Spanna!

Monday, November 14, 2016

Can non-experts distinguish anything about wine?

Roman L. Weil is a professor of accounting, with an interest in wine. During the early 2000s he conducted three similar experiments to assess the ability of non-experts (primarily educated, upper middle-class individuals who were experienced and enthusiastic wine drinkers) to distinguish various characteristics of wine. These distinctions included:
  • vintages rated by an expert as good versus poor
  • wines selected for a special "reserve" bottling versus the normal wine
  • different taste descriptors provided by an expert.
Here, I summarize the results of those experiments, as they seem not to be widely known, and yet they provide very interesting conclusions. In my usual fashion, I present pictures of the results (ie. graphs) rather than the original tabulated numbers, because it is then much easier to see the patterns in the data and thus to appreciate the conclusions.


Methods

All of the experiments were designed in the same way. Several different pairs of wines were chosen for each experiment, the pairing being determined by the particular objective of each experiment; these wine pairs constitute the experimental replication. The paired wines were presented to several hundred different tasters, spread over a number of different places and occasions; these people constitute the replicate sample units.

In each case, each taster was presented with three unlabeled glasses, one glass containing one of the wines, and two glasses with the other wine from the same pair. In this triangular experiment, the taster was asked to distinguish the singleton wine (ie. one of the glasses should taste different to the other two glasses). The taster was then asked to identify certain characteristics of the two wines. On any one occasion, tasters received 1–3 of the wine pairs.

The results were summed for each wine pair separately, listing the number of people who correctly distinguished the two wines in each pair, and then how many of those successful people correctly identified the chosen characteristics. Note that distinguishing the characteristics is not relevant unless the taster could actually distinguish the paired wines in the first place!

By random chance, the tasters should be able to distinguish the paired wines one-third of the time (ie. identifying the singleton glass out of three). So, our "expected" result is 33% if the tasters can do no better than random (ie. guessing). Then, for identifying the characteristics of the two wines, the expectation is 50% if the tasters can do no better than random (ie. there are only two wines to choose between).
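As a rough illustration of what "no better than random" means in practice, the sketch below computes the probability of seeing at least a given number of correct answers if every taster were simply guessing. The taster counts are hypothetical, and this exact binomial calculation is one reasonable way to frame the comparison, not necessarily the one used in Weil's papers.

```python
from math import comb


def binomial_tail(successes, n, p):
    """Probability of at least `successes` correct answers out of n by chance."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(successes, n + 1)
    )


# Hypothetical triangle test: 60 tasters, chance of picking the singleton is 1/3
p_many_correct = binomial_tail(30, 60, 1 / 3)  # 30 of 60 correct
p_few_correct = binomial_tail(22, 60, 1 / 3)   # 22 of 60 correct

print(round(p_many_correct, 4), round(p_few_correct, 4))
```

A small probability suggests the tasters did better than guessing; for the second-stage question the same function can be called with a chance level of 1/2.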


Distinguishing different vintages of the same wine

Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.

The hypotheses being tested in this experiment are that the amateurs:
  • cannot distinguish in blind tastings the wines of years rated by an expert as high from those of years rated low, and
  • if they can, they do not agree with the vintage chart's preferences.
To test these hypotheses, Weil selected six "pairs of wines with the following characteristics: the pairs have identical features (such as shipper, vineyard, and producer) except vintage, and Robert Parker rated one the vintages of those two wines Average to Appalling while he ranked the other Excellent to The Finest in The Wine Advocates Vintage Guide 1970-1999." So, the only difference between the paired wines should be that they came from vintages that Parker thought were very different from each other.

There were 593 tasters. One of the wine pairs was presented to wine professionals ("experts") on two occasions, as well as to the amateurs on the other occasions, and so these experts are treated separately in the results. The pairs of wine were tasted by 54-119 tasters each.

The results of the first hypothesis test are shown in the next graph. For each of the graphs presented below, the interpretation is as follows. Each wine-pair is represented by a horizontal line, as indicated by the legend. The central point on each of the lines represents the percentage of the tasters who succeeded at the task for that wine pair. The two end points on each line are the boundaries of the estimated 95% confidence interval (formally: the Score binomial 95% confidence interval). This interval gets smaller as the sample size (the number of tasters) gets larger, as it represents our statistical "confidence" in the results of the experiment. The dashed line represents the expected results if the tasters are performing in a random manner — the idea of the experiment is to see whether people can do better than random. So, if the dashed line passes through the 95% confidence interval for a particular wine pair, then the tasters have done no better than random for that pair, whereas if the dashed line lies outside the 95% confidence interval then the tasters have done better than random.
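The score interval mentioned above (often called the Wilson score interval) is easy to compute directly. The sketch below is a minimal version with hypothetical taster counts; the function names are my own, and "better than random" is taken to mean that the whole interval lies above the chance level.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Score (Wilson) confidence interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def better_than_chance(successes, n, chance):
    """True if the whole 95% interval lies above the chance level."""
    low, _high = wilson_interval(successes, n)
    return low > chance


# Hypothetical results for two wine pairs, 60 tasters each, chance = 1/3
print(wilson_interval(30, 60), better_than_chance(30, 60, 1 / 3))
print(wilson_interval(22, 60), better_than_chance(22, 60, 1 / 3))
```

With more tasters the interval narrows, which is why the lines in the graphs differ in length from one wine pair to another.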

Results of Roman Weil's experimental test of wines from different vintages

For the first graph, only the two groups of tasters receiving the Bordeaux wine performed better than random chance. Formally: for five of the wine pairs, the experiment provides no evidence that amateur wine tasters can distinguish between good and poor vintages any better than taking a guess. For the Bordeaux wine pair, both the amateurs and experts did better than taking a guess, with the wine experts doing slightly better than the amateurs.

This outcome calls into serious question the alleged difference of quality between different vintages in the modern world. Remember, Robert Parker (or his delegate) detected big differences in the vintages within a wine pair, but the amateurs could not consistently detect this for themselves when presented with actual examples of the wines. The different result for the Bordeaux wines may reflect the common conception that vintages really do still differ in Bordeaux.

The results of the second hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 30-50% of the tasters. The sample sizes therefore refer to only 21-60 tasters per wine pair.

Results of Roman Weil's experimental test of wines from different vintages

Note that in all cases the tasters behaved in a random manner. That is, there was no consistent preference for the wine from the highly rated vintage compared to the poorer vintage, for any of the wines. We may conclude from this that expert vintage ratings are not related to wine preferences among wine drinkers. The wine from an allegedly poor vintage can taste just as good to an amateur drinker as a wine from a supposedly better vintage.


Distinguishing reserve and normal bottlings of the same wine

Roman L. Weil (2005) Analysis of reserve and regular bottlings: why pay for a difference only critics claim to notice? Chance 18(3):9-15.

The hypotheses being tested in this experiment are that the amateurs:
  • cannot distinguish in blind tastings the wines of reserve bottlings (or first wines) from the normal wines (or second wines), and
  • if they can, they do not prefer the reserve wine.
To test these hypotheses, Weil selected fourteen "pairs of wines based on the following characteristics: the pairs had identical features in all respects, except that one was a regular bottling and one was a reserve bottling. Common features included all label items (e.g. shipper, vineyard, and producer), retail source, and date of purchase." So, the only difference between the paired wines should be that the winemaker specially selected the reserve or first wine for separate bottling, at a much higher price (there was a price ratio of 1.13-3.57 for Weil's choices).

The results of the first hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. There were 855 tasters, with the pairs of wine being tasted by 38-136 tasters each. The two pairs of Champagne wines were each tasted by a small number of people only, and so I have pooled their results here (they did not differ from each other).

Results of Roman Weil's experimental test of wines from different bottlings

Note that the tasters do very much better here than in the previous experiment. That is, for six of the thirteen wine pairs the tasters did better than random when asked to distinguish the more expensive bottle of wine from the same winemaker. Mind you, they rarely did better than 50%, compared with the 33% expected from guessing. Interestingly, there are three wine types that are repeated in the experiment: the cabernet blend from Bordeaux, the cabernet wine from the western USA, and the white wine from California; and in all three cases the tasters succeeded with one wine but not the other.

Nevertheless, the results do indicate that, for tasters, what the winemaker does with the wine (selects wine for different bottlings, to be charged at different prices) often makes a bigger difference than what nature does with the wine (produces different climatic conditions in different years).

The results of the second hypothesis test are shown in the next graph. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 30-50% of the tasters. The sample sizes therefore refer to only 13-56 tasters per wine pair.

Results of Roman Weil's experimental test of wines from different bottlings

Here, the tasters did not consistently prefer the reserve wine over the normal wine, except in two cases. We may conclude from this that winemakers are, indeed, generally selecting wines of different taste for their different bottlings, but that this is not necessarily related to wine preferences among wine drinkers. The wine from an expensive bottle can taste just as good to an amateur drinker as one from a supposedly inferior bottle of the same wine.

The two exceptions are informative. For one of the California chardonnays there was a strong preference for the more expensive wine. This suggests that the winemaker succeeded in this particular case: they charged more ($26 versus $13) for a wine that drinkers actually prefer. In the opposite manner, for one of the Bordeaux wines there was actually a preference for the cheaper wine. It may surprise you to learn that this was a preference for the 1994 Les Forts de Latour ($56 at the time) over 1994 Château Latour ($200), the most expensive wine in the experiment. The Bordeaux first-growth chateaux might like to take note of this result (as might your wallet!). (Note: in general, the first wines of the Bordeaux first growths cost 3-4 times as much as their second wines; see the Liv-Ex blog.)


Matching wines and their descriptions

Roman L. Weil (2007) Debunking critics' wine words: can amateurs distinguish the smell of asphalt from the taste of cherries? Journal of Wine Economics 2:136-144.

The hypotheses being tested here are that the amateurs:
  • cannot distinguish in blind tastings wines that are described by an expert using different words, and
  • if they can, they cannot match the descriptions to the wines.
To test these hypotheses, Weil selected ten "pairs of wines with the following characteristics: the pairs have similar features, and the same writer / critic wrote about these two wines with disjoint word sets. That is, the reviewer used different words in describing the two wines." Note that the wines could actually come from different vintages or even continents, provided that they had similar grapes, etc.

The results of the first hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. There were 321 tasters, with the pairs of wine being tasted by 13-86 tasters each, which means much smaller sample sizes than for the other experiments.

Results of Roman Weil's experimental test of wines with different descriptions

Since the objective was to choose wines that differ in description by an expert, it is hardly surprising that the tasters succeeded in distinguishing the wine pairs in six out of the ten cases. However, in only one case did they do better than 60-70%, which does call into question the experts' abilities to describe wine in any quantitative way. After all, there are many examples in wine lore of different experts also describing exactly the same wine in completely disjunct words.

The results of the second hypothesis test are shown in the next graph. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 40-60% of the tasters. The sample sizes therefore refer to only 5-45 tasters per wine pair.

Results of Roman Weil's experimental test of wines with different descriptions

Sadly, in only one case could the tasters consistently match the wines to the expert descriptions. So, we may conclude that reading a description of a wine does not necessarily tell you what it will taste like to you.


Conclusions

Combined, these three experiments do not paint a happy picture of the wine business. Amateur wine tasters cannot consistently distinguish wines from different vintages or different bottlings, or with different descriptions. And when they can do so, their preferences do not necessarily agree with the professionals' assessments of quality; they are about as likely to prefer the one as the other. So, what is it that these professionals are doing? Whatever it is, it seems to be somewhat divorced from their customer base. In any case, there seems to be little reason to pay more for a "special" wine (a better year or a better selection), unless you have already checked it out and decided that you prefer it.


Quality versus preference

One potentially confusing aspect of Weil's experiments is that in two of his three experiments his second hypothesis is not actually related to the first one. In the first experiment his second question concerns which wine the tasters prefer, not which one they think is from the higher-rated vintage; and similarly for the second experiment, they are asked which wine they prefer rather than which one is the reserve wine. Only in the third experiment is the second question directly related to the objective — which wine matches which description.

It is important to recognize the distinction between "prefer / like" and "high quality" (otherwise, one of the two expressions would be redundant!). These are often treated as though they both mean "better", as in the expression "if you like it then it is good". However, these are two very different ideas — supposedly better quality does not mean that you should prefer it in any personal sense. Personal preference is all in your head, but differences in quality also exist outside of it.

For example, one does not need to like opera in order to recognize a poor opera singer, nor does one have to be a practicing Christian to appreciate the architectural and artistic merits of a church. So, recognition of quality is not necessarily related to personal choice. For example, I can accept that there are high-quality characteristics of Champagne, but I do not actually like the taste of those distinctive characteristics; I prefer the crémant wines from Alsace, Die or the Loire, or the sparkling wines of southern Australia. Financially, of course, this is to my benefit!

This point is important for a wine drinker. The ability to recognize which wine the professionals think has higher quality is a separate issue from whether you actually like that wine. Do I like the wines recommended by Robert Parker? Perhaps so, or perhaps not, but either way I can probably recognize them, because they have a similar set of characteristics. He sees those characteristics as denoting high quality, but I may well see them as something I don't particularly care for.

Weil is probably right to focus on "prefer / like", since that is of most practical relevance to a consumer; but we should not confuse this with "quality". It would be of interest to experimentally examine the latter, also.

Sunday, November 6, 2016

Modern wine vintage charts: pro or con?

Vintage charts, which provide a quality score for each wine vintage in some specified wine-making region, have been a conspicuous part of the wine landscape for many decades. However, there has also been an increasing amount of criticism in recent years.

For your reading pleasure, at the bottom of this post I have included links to a selection of online commentaries (mostly negative) about this issue. The principal objections seem to be one or more of these:
  • They are broad generalizations: they do not account for within-region variation in quality
  • The ratings oversimplify: there is also between-vineyard variation within local areas
  • There is no recognition of site selection: there is even within-vineyard variation
  • Modern wine-making (along with global warming) produces reasonably consistent quality, so that vintage variation mainly concerns quantity, instead
  • Do the charts rate wine longevity or drinkability?
  • Vintage variation influences style, but not necessarily quality
  • Charts produced by different people inevitably differ, often strongly disagreeing
  • Do wine drinkers actually prefer highly rated vintages?
Many of these points are easy to quantify, and most of them make vintage charts redundant in the modern world. Here, I present specific examples to illustrate some of these points.

Different people, different charts

Most of the well-known wine magazines produce vintage charts, which are available online. The first graph below compares two of these charts for the vintages from 2000-2011. The dots represent the vintage scores from the Wine Advocate (vertically) and the Wine Enthusiast (horizontally) pooled for the following Italian regions: Barolo, Barbaresco, Brunello di Montalcino, and Chianti. If the two magazines gave each vintage the same score, then the dots would all be along the pink line.

Wine Advocate versus Wine Enthusiast vintage scores for Italy

As you can see, there is a great deal of disagreement between these two charts, as only four of the dots are actually on the line, and another five differ by 1 point. But more importantly, the eight Wine Enthusiast scores between 80 and 88 form two clusters of quality scores as far as the Wine Advocate is concerned, with four of the vintages scoring much lower (74-77) than the other four (89-93).
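Counting how often two charts agree is a simple tally, sketched below. The scores are hypothetical values made up for illustration, not the actual Wine Advocate or Wine Enthusiast numbers.

```python
def agreement_counts(scores_a, scores_b):
    """Tally how far apart two charts are for the same region-vintages."""
    diffs = [abs(a - b) for a, b in zip(scores_a, scores_b)]
    return {
        "same": sum(d == 0 for d in diffs),
        "off_by_1": sum(d == 1 for d in diffs),
        "larger": sum(d > 1 for d in diffs),
    }


# Hypothetical scores from two charts for the same six region-vintages
chart_a = [95, 90, 76, 93, 88, 74]
chart_b = [95, 91, 86, 93, 85, 84]

print(agreement_counts(chart_a, chart_b))
```

The "same" count corresponds to dots lying exactly on the line in the graph.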

As an alternative example, Jancis Robinson has organized some blind tastings of the red Bordeaux vintages from this century (C21 Bordeaux vintages - a ranking). During the tastings in 2015 and 2016, the attending wine professionals were "asked to rank the last 13 vintages in qualitative order." We thus have a total of 18 (2016) and 15 (2015) rankings for the same 12 vintages (2000-2011). These are compared in the next graph, where each dot represents a single vintage, located according to the sum of ranks from 2015 (horizontally) and 2016 (vertically). Note that a smaller rank indicates a "better" vintage.
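The "sum of ranks" used for each axis can be sketched as follows. The three rankings are hypothetical, standing in for the 15 or 18 real ones, and the function name is my own.

```python
def rank_sums(rankings):
    """Sum each vintage's rank position (1 = best) across all tasters."""
    totals = {}
    for ranking in rankings:
        for position, vintage in enumerate(ranking, start=1):
            totals[vintage] = totals.get(vintage, 0) + position
    return totals


# Hypothetical rankings by three tasters, each listed from best to worst
rankings = [
    [2005, 2009, 2000, 2010],
    [2009, 2005, 2010, 2000],
    [2005, 2010, 2009, 2000],
]
totals = rank_sums(rankings)
# Vintages from "best" (smallest sum) to "worst" (largest sum)
consensus = sorted(totals, key=totals.get)
print(totals, consensus)
```

Dividing each sum by the number of tasters gives a mean rank, which may be easier to compare when two tastings have different numbers of participants.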

Jancis Robinson vintage assessment from 2015 and 2016 for Bordeaux

There is obviously a lot of agreement here. However, there are four vintages in the middle of the graph that all had very similar ranks in 2015 but had two very different ranks in 2016, so that two of the dots are a long way below the line. That is, the 2006 and 2008 vintages were evaluated similarly in the two tastings, but the 2003 and 2004 vintages dropped significantly in the ranking between 2015 and 2016.

Andrew Jefford has a comment on these rankings at Decanter (Kicking the hell out of Bordeaux 2011).

Within-region variation

Wine vintage charts must apply to specified wine-making regions, with a score for each vintage in each region. Unfortunately, these regions are often unconscionably large, so that a single number cannot possibly describe the wine quality across the whole region. While countries like France, Spain and Italy usually get divided into several wine-making regions, even somewhere as large as California sometimes gets treated as a single region.

However, to me, the classic example of silliness is trying to treat an entire continent like Australia as a single region, or even "south-eastern Australia". The following maps compare the size of Australia to both Europe (minus Scandinavia and the Baltic states) and the USA. As you can see, even south-eastern Australia is as large as Spain + Portugal, or California + Oregon + Washington. Moreover, the variation in wine-growing climates throughout south-eastern Australia is at least as large as any of these other conglomerations.


Within-location variation

Traditional wine-making regions sometimes get subdivided, on the grounds that the within-region climate variation produces different wines. Thus, Bordeaux red wine is sometimes divided into the Right Bank (Saint Emilion and Pomerol) and the Left Bank (the Médoc).

The next graph compares the vintage rankings for these two Banks, from the Wine Cellar Insider, for the vintages from 1982-2014. Each dot represents a single vintage, located according to its rank for the Left Bank (horizontally) and the Right Bank (vertically). Once again, smaller ranks indicate "better" vintages; and if the vintages had the same rank in both Banks then the dots would lie along the pink line. Not all vintages made it into the rankings (ie. some were not considered good enough to be worth ranking).

Wine Cellar Insider vintage ranks for Left and Right Bank Bordeaux

While there is some consistency in the rankings there are many anomalies, where the two Banks had very different qualities in the same vintage. In particular, there are six vintages (shown as red dots) where a vintage made it into the ranking for one Bank but not the other.

The Global Wine Score blog has a similar analysis for these two Banks (Bordeaux 2015 vintage: Right Or Left Bank?).

Modern consistency of vintage quality

I have published several recent blog posts that illustrate the changing nature of vintage scores over the past 25 years (Two centuries of Bordeaux vintages — Tastet & Lawton; A century of Barolo vintages — Fontanafredda; More than a century of Barolo vintages — Marchesi di Barolo). The bottom line is that the scores have increased during that time, as well as becoming less variable from year to year.
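One simple way to check a claim of this kind is to compare the average and the spread of the scores in an earlier and a later period. The sketch below does this with hypothetical scores; it is not the analysis used in the posts listed above.

```python
from statistics import mean, stdev


def summarize_periods(scores, split):
    """Return (mean, standard deviation) for the early and late periods."""
    early, late = scores[:split], scores[split:]
    return (mean(early), stdev(early)), (mean(late), stdev(late))


# Hypothetical vintage scores in chronological order (ten consecutive years)
scores = [70, 85, 60, 90, 75, 88, 90, 87, 92, 89]
(early_mean, early_sd), (late_mean, late_sd) = summarize_periods(scores, split=5)

print(round(early_mean, 1), round(early_sd, 1))
print(round(late_mean, 1), round(late_sd, 1))
```

A higher mean with a smaller standard deviation in the later period is the pattern described above: better scores that vary less from year to year.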

A similar point is made for Australian vintages in this paper:
     V.O. Sadras, C.J. Soar, P.R. Petrie (2007) Quantification of time trends in vintage scores and their variability for major wine regions of Australia. Australian Journal of Grape and Wine Research 13:117-123.

Preference for high versus low vintage ratings

This issue of whether wine drinkers actually prefer the vintages recommended by the wine charts is addressed in another published article:
     Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.
A free copy is available here.

The paper discusses an empirical test of the claim (specifically by Frank Prial; see the link below) that the modern vintage chart is redundant. The author got many people to do tastings of paired wines, one from a good vintage as decreed by the Wine Advocate chart and one a poor vintage; and his conclusion is:
The 240 wine drinkers on whom I’ve systematically tested Prial’s hypothesis cannot distinguish between wines of good and bad vintages, except for Bordeaux, and even when they can distinguish, their preferences and the chart’s do not match better than a random process would imply.
In other words, a high vintage score in a chart is no guarantee that you will actually like the wines.

Selected commentaries

Frank J. Prial, The New York Times
So who needs vintage charts?

Paul Gregutt, The Seattle Times
Rating vintage ratings; not high

Paul Kaan, Filthy Good Vino blog
Using a vintage chart to pick wines sucks … here’s a better way!

W. Blake Gray, The Gray Report blog
Vintage charts for California are worthless

Dan Berger, Vintage Experiences newsletter
Vintage chart fallacies

Richard Hemming, Jancis Robinson blog
Vintage nonsense

Decanter staff
Are official vintage charts meaningless?