Monday, 20 March 2017

A comparative tasting of some old Barolo wines

On March 11 this year we held a comparative tasting of some old wines from the Barolo DOCG, in northern Italy:
  • 1988 Marchesi di Barolo
  • 1985 Marchesi di Barolo "Cannubi La Valletta"
  • 1971 Alessandria Luigi & Figli
  • 1971 Fontanafredda "Vigna La Rosa"
  • 1964 Alessandria Luigi & Figli
  • 1964 Fontanafredda (La Grande Annata)
  • 1964 Cantina Terre del Barolo
None of these are top-of-the-line wines, but they are all from top vintages, so it seemed interesting to see how they have fared over the decades.

The seven wines compared, in tasting order left to right

Six of the seven wines were still in good condition, but the remaining wine had oxidized into a sherry-like form. The oldest wines were past their best, although still very interesting, with their aroma being much more complex than their taste. There was rarely a long aftertaste in the older wines.

The wines were opened at noon, for the tasting at 7:00 pm. This followed the François Audouze method, in which the wines are opened but not decanted. Instead, the bottles are lightly recorked if they smell okay, but otherwise are left open for slow breathing. This "slow oxidation" approach was apparently insufficient for these wines — they all still had fairly closed noses when poured, but opened up with time in glass. This was especially true of the five oldest wines.

All of the wines had considerable amounts of sediment. For most of the wines this had settled to the very bottom of the bottle, as they had been standing upright for several days. However, for the two 1971 wines there was some very fine sediment still suspended in the lower quarter of the bottle.

The wine glasses in the same order as above


1988 Marchesi di Barolo

Still a dark orange-purple. Early on, the taste was stronger than the aroma, the latter being of sweetish plum, with an alcohol lift. Later on, an aroma of Vegemite (or Marmite, for some of you) appeared. In the mouth the tannins were still in evidence, along with citrus, although the overall sensation was of somewhat sour plumminess.

1985 Marchesi di Barolo "Cannubi La Valletta"

More of an amber color than the previous wine. The aroma was much stronger than the taste. Early on this aroma was of dark chocolate, fig and barnyard, with an alcohol lift and a slight oxidized tang. Later, this became more fruity, along with some honey. The mouth was still rather drying, with a taste of toffee or muscovado. By consensus, this was the best wine of the evening.

1971 Alessandria Luigi & Figli

This was a paler brown, with fine suspended sediment in the lower part of the bottle. The initial aroma was very muted, with some sweaty horse detectable, along with a slight alcohol lift and some oxidation. The aroma developed somewhat during the evening, although remaining muted, with finally a distinctly smoky tang. The taste remained muted, being rather cheesy (in a good sense), along with apricot.

1971 Fontanafredda "Vigna La Rosa"

Slightly paler than the previous wine, with fine suspended sediment (but less). The aroma was distinctly of rubber and tar at first (without alcohol or oxidation), but this eventually blew off, to be replaced by roses. A muted strawberry taste was evident, although still mouth-drying and somewhat acidic.

1964 Alessandria Luigi & Figli

This wine had oxidized into a very nice dry sherry. Very pale brown, almost yellowish, with a slight fine sediment. The aroma was of honey, and the taste distinctly of apricot.

1964 Fontanafredda (La Grande Annata)

This was the most surprising wine of the evening, leading to some accusations of tampering. This seems very unlikely, given the price paid for the bottle. The cork was seeping very slightly when the wine was delivered, a month before the tasting.

The color was a bright reddish purple, with only a slight paleness at the edge of the glass; it thus looked much younger than its label. The aroma was initially of plasticine, followed later by flowers (roses), honey and a bit of dark chocolate. At first there was no distinct taste, but eventually red currants and cherries (with kernels!) appeared. The aftertaste remained short.

In defense of the wine, I note that in 2011 Jancis Robinson apparently reported an authenticated 1954 Lopez de Heredia "Vina Tondonia" (Rioja) wine as still being "bright crimson".

1964 Cantina Terre del Barolo

This was the only 1964 wine to look and behave as expected. It was a pale brownish purple, but not as brown as either of the two 1971 wines. The aroma was slightly oxidized (but not alcoholic), showing mainly vanilla. The taste was of plum and prune, with a touch of cherry and strawberry. The aftertaste was still astringent.

This bottle had the only cork that was not tightly wedged in the neck; indeed, I ended up (inadvertently) pushing the cork in rather than pulling it out.

Monday, 13 March 2017

How many wine prices are there?

My short answer is "three", as I will explain in this blog post.

It would be nice for the consumer if wine prices varied in some predictable way. From the viewpoint of a data analyst, this implies that there is some specifiable model of price variation that can be used to describe the variation in observed prices.

From the viewpoint of the consumer, having such a model is valuable, because it can be used to make a rational decision about whether a particular wine is a bargain or a rip-off (as explained in Choosing value-for-money wines). That is, models suggest that prices have a predictable component (an "average value") and a random component, and too much deviation from the predictable component indicates either a good deal or a bad one.

I have suggested in previous blog posts that one model that seems to fit wine prices reasonably well is what is known as the Lognormal Model (eg. see The relationship of wine quality to price). This model indicates that prices are expected to increase exponentially in response to winemaker effort. That is, rather than one unit of effort leading to the addition of one unit of price, the prices are multiplied, instead.
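To make the difference between adding and multiplying concrete, here is a minimal Python sketch. The starting price, the additive step and the multiplier are all made-up values chosen for illustration; they are not estimated from any wine data.

```python
# Hypothetical numbers, purely for illustration (not estimated from data).
BASE_PRICE = 50.0      # assumed starting price, in SEK
ADDITIVE_STEP = 25.0   # assumed price added per unit of effort
MULTIPLIER = 1.5       # assumed price factor per unit of effort


def additive_price(effort):
    """Price if each unit of effort adds a fixed amount."""
    return BASE_PRICE + ADDITIVE_STEP * effort


def multiplicative_price(effort):
    """Price if each unit of effort multiplies the price by a fixed factor."""
    return BASE_PRICE * MULTIPLIER ** effort


for effort in range(6):
    print(effort, additive_price(effort), round(multiplicative_price(effort), 2))
```

Under the multiplicative version the gap between successive prices keeps widening, which is what produces the long tail of expensive wines.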


One consequence of this model is that prices do vary around some average value but the variation is not symmetrical — prices vary much more above the average than below it. I am sure that you have all noticed this in real life; and not just for wines, but for most consumer products. There are plenty of very expensive wines but not so many very cheap ones.

The most expensive wines are often featured in the media as being "luxury" products, beyond the purchase ability of mere mortals. On the face of it, it seems unlikely that the pricing of these wines is in any way related to the pricing of the wines bought by the rest of us. Recently, Thach, Olsen, Cogan-Marie & Charters suggested dividing these wines into the following price categories per bottle (see What Price is Luxury Wine?): Affordable Luxury (US$ 50-100), Luxury wines (US$ 100-500), Icon wines (US$ 500-1,000) and Dream wines (US$ >1,000). Below these, we might also recognize Everyday wines (US$ <10), Better wines (US$ 10-20) and Premium wines (US$ 20-50).

Maybe there are different pricing models for each of these wine groups? I thought that it might be worthwhile to try modeling some real data, to see how all of these ideas fit together.

Systembolaget

The data come from the online database of the national liquor chain in Sweden, known as Systembolaget. Being government owned, the complete product information is freely available, as both an XLS file and an XML file. I have used the prices of the bottled red wines that were available in May 2016, when there were 5,487 such wines listed in the database. [Note that bag-in-box wines are not included.]

This first graph shows the frequency distribution of how many wines (vertically) fit into each of the bottle prices (horizontally). [Note the logarithmic scale for the prices.] Obviously, the prices are given in Swedish crowns (SEK). To convert to other common currencies you can divide the SEK by approximately 10 — if you want more precision, for USD divide by 9, for EUR divide by 9.5, and for GBP divide by 11.

Frequency distribution of the prices of bottled red wines in Sweden

The graph is rather spiky, because of the worldwide practice of setting prices at particular "desirable" values (99, 149, 199, etc), but there is otherwise a clear general pattern to the data. The minimum price is 40 SEK (US$ 4.50) and the maximum is 22,500 SEK (US$ 2,500). The most common price is 100 SEK (US$ 11), with the median at 190 SEK (US$ 20) — that is, half the wines cost US$ 20 or less.
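The US$ figures quoted above are rough conversions using the approximate divisors given earlier. A small Python sketch of that arithmetic follows; the function name is hypothetical, and the divisors are the post's approximations rather than actual exchange rates.

```python
# Approximate divisors quoted in the post (SEK per unit of foreign currency).
SEK_PER_UNIT = {"USD": 9.0, "EUR": 9.5, "GBP": 11.0}


def convert_sek(amount_sek, currency):
    """Convert a SEK price using the rough divisors quoted in the post."""
    return amount_sek / SEK_PER_UNIT[currency]


# The minimum, most common, median and maximum prices mentioned above.
for price in (40, 100, 190, 22500):
    print(price, "SEK is roughly", round(convert_sek(price, "USD"), 2), "USD")
```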

Now, we can compare these data with the pricing categories outlined above. This next graph superimposes them onto the frequency distribution.

Frequency distribution of the prices of bottled red wines in Sweden

We can then see how many wines fit into each category:
  • Everyday wines: 558
  • Better wines: 2,070
  • Premium wines: 2,045
  • Affordable Luxury: 538
  • Luxury wines: 246
  • Icon wines: 19
  • Dream wines: 11
Not unexpectedly, there is a dearth of the cheapest wines, which mainly sell in cardboard boxes, not bottles. However, the other wine price categories are well represented in Sweden. Moreover, the wines themselves are the ones typically available in other Western countries.
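For anyone wanting to repeat this tally, here is a minimal Python sketch. The category boundaries and the counts are those given in this post; the function name and the treatment of a price that falls exactly on a boundary (it goes to the higher category) are my assumptions, since the category definitions do not say which side a boundary belongs to.

```python
import bisect

# Upper price bounds (US$) separating the seven categories described earlier.
UPPER_BOUNDS_USD = [10, 20, 50, 100, 500, 1000]
NAMES = [
    "Everyday wines", "Better wines", "Premium wines", "Affordable Luxury",
    "Luxury wines", "Icon wines", "Dream wines",
]


def category(price_usd):
    """Return the category name for a bottle price in US$.

    Assumption: a price exactly on a boundary goes to the higher category.
    """
    return NAMES[bisect.bisect_right(UPPER_BOUNDS_USD, price_usd)]


# Counts as listed in the post.
counts = dict(zip(NAMES, [558, 2070, 2045, 538, 246, 19, 11]))
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:18s} {n:5d} {100 * n / total:5.1f}%")
```

The counts sum to the 5,487 wines in the database, and the Icon and Dream wines together come to about 0.5% of the total.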

Modeling the price data

The frequency distribution does, indeed, look very much like what would be expected for a lognormal model. So, we can start our analysis by trying to fit the data to such a model.

This next graph shows the probability distribution of the best-fitting lognormal model in pink. [Technical note: the model was fitted using maximum likelihood, with the Regress+ program; I subtracted 35 SEK from each price before fitting the model, and then added it back for the probability distribution.]
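As a rough illustration of what such a fit involves, here is a minimal Python sketch. It is not the Regress+ procedure. It assumes the 35 SEK shift is fixed in advance, in which case the maximum-likelihood estimates are simply the mean and standard deviation of log(price - 35). The prices are synthetic, generated from made-up parameters, and are not the Systembolaget data.

```python
import math
import random
import statistics

SHIFT = 35.0  # SEK subtracted before fitting, as described in the post


def fit_shifted_lognormal(prices, shift=SHIFT):
    """Maximum-likelihood mu and sigma of log(price - shift), shift held fixed."""
    logs = [math.log(p - shift) for p in prices]
    mu = statistics.fmean(logs)
    sigma = statistics.pstdev(logs, mu)  # ML estimate divides by n, not n - 1
    return mu, sigma


def model_median(mu, shift=SHIFT):
    """Median of the fitted model, with the shift added back."""
    return shift + math.exp(mu)


def model_mode(mu, sigma, shift=SHIFT):
    """Mode (most common price) of the fitted model, with the shift added back."""
    return shift + math.exp(mu - sigma ** 2)


# Synthetic prices (NOT the real data), drawn from assumed parameters 5.0 and 0.8.
rng = random.Random(1)
prices = [SHIFT + rng.lognormvariate(5.0, 0.8) for _ in range(5000)]

mu, sigma = fit_shifted_lognormal(prices)
print(round(mu, 2), round(sigma, 2),
      round(model_median(mu)), round(model_mode(mu, sigma)))
```

The predicted median and mode from a fit like this are the quantities that can be compared with the observed median and most common price.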

The prices of bottled red wines in Sweden do not fit a simple lognormal model

In this sort of analysis, we interpret the pink line as representing the predictable component of the variation in wine prices, and the differences between the blue line and the pink line represent the random component of the wine prices.

This lognormal model fits the price data reasonably well, and the predicted parameters are close to the observed ones (eg. the predicted median and mode are within 7% of the observed values). However, the fit is not good enough, because the pink line (the model) deviates too far from the blue line (the data) in too many places.

I suggest that the basic problem here is that we are trying to fit a single price model to the data, when the data actually follow several different models.

First, the highest prices do not fit the model, notably those for the Icon wines and Dream wines. It should not surprise us that these wines have their own price structure, which comes from cloud-cuckoo land, and is tolerated only by businessmen with more money than common sense. So, we should not be trying to cram these wines into our model. Indeed, there is a big price gap at US$ 500, and so it is easy to decide where to divide the model in two — we simply drop the Icon and Dream wines from our analysis.

Second, it looks to me like the remaining frequency distribution (US$ <500) still cannot be modeled by a single lognormal model. That is, the shape of the pink line cannot be made to move closer to the blue line. Even these data are more complicated than our simple lognormal model.

So, let's try fitting two lognormal models to the data, instead. This is straightforward to do; and it implies that the wine prices actually live in two different worlds. This next graph shows the probability distributions of the two best-fitting lognormal models in pink, along with their combination shown in black. [Technical note: the Akaike Information Criterion for two models versus one improves by 231.7 units.]
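For readers who want to see how such a comparison works, here is a minimal Python sketch. It is not the analysis behind the graph. It assumes the values have already been converted to log(price - shift), so that each lognormal becomes a normal curve, and it fits the two-curve model with a basic EM loop. The data are synthetic, with made-up parameters. The AIC is 2k minus twice the log-likelihood, where k is the number of fitted parameters: 2 for one curve, and 5 for two curves (two means, two spreads, one mixing weight).

```python
import math
import random


def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mean_sd(ys):
    m = sum(ys) / len(ys)
    return m, math.sqrt(sum((y - m) ** 2 for y in ys) / len(ys))


def loglik_one(ys):
    """Log-likelihood of a single normal curve fitted by maximum likelihood."""
    mu, sigma = mean_sd(ys)
    return sum(math.log(normal_pdf(y, mu, sigma)) for y in ys)


def loglik_two(ys, iterations=100):
    """Log-likelihood of a two-curve mixture fitted by a basic EM loop."""
    ys = sorted(ys)
    half = len(ys) // 2
    # Crude starting values: treat the lower and upper halves as the two groups.
    mu1, s1 = mean_sd(ys[:half])
    mu2, s2 = mean_sd(ys[half:])
    w = 0.5
    for _ in range(iterations):
        # E-step: probability that each value belongs to the first curve.
        resp = []
        for y in ys:
            a = w * normal_pdf(y, mu1, s1)
            b = (1.0 - w) * normal_pdf(y, mu2, s2)
            resp.append(a / (a + b))
        # M-step: update weight, means and spreads from those probabilities.
        n1 = sum(resp)
        n2 = len(ys) - n1
        mu1 = sum(r * y for r, y in zip(resp, ys)) / n1
        mu2 = sum((1.0 - r) * y for r, y in zip(resp, ys)) / n2
        s1 = math.sqrt(sum(r * (y - mu1) ** 2 for r, y in zip(resp, ys)) / n1)
        s2 = math.sqrt(sum((1.0 - r) * (y - mu2) ** 2 for r, y in zip(resp, ys)) / n2)
        w = n1 / len(ys)
    return sum(
        math.log(w * normal_pdf(y, mu1, s1) + (1.0 - w) * normal_pdf(y, mu2, s2))
        for y in ys
    )


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


# Synthetic log-scale values (NOT the real data): two overlapping groups.
rng = random.Random(7)
ys = [rng.gauss(4.3, 0.4) if rng.random() < 0.65 else rng.gauss(5.6, 0.8)
      for _ in range(2000)]

aic_one = aic(loglik_one(ys), 2)
aic_two = aic(loglik_two(ys), 5)
print("AIC improvement from using two curves:", round(aic_one - aic_two, 1))
```

A lower AIC indicates the better model after penalizing the extra parameters. Working on the log scale does not affect the comparison, because the change-of-scale term in the likelihood is the same for both models.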

The prices of bottled red wines in Sweden do fit a paired lognormal model

The black line clearly fits the blue line much better than did the pink line in the previous graph. So, this is now a very good fit of the data and the overall model. We could, of course, keep fitting even more complex models, but there seems to be no good practical reason to do so. [Technical note: by definition both lognormal curves must start at the same left-hand point on the horizontal axis. This is an unfortunate inadequacy of using the lognormal, because the smaller curve would make more sense if it started further to the right. That is, a model for the more expensive wines should not start at US$4!]

This model means that the two pink lines represent two different pricing structures for the red wines. One of the structures covers the cheaper wines (principally US$ <20) and the other one covers somewhat more expensive wines (US$ 20-500). Roughly speaking, there is one pricing structure for the Everyday and Better wines, and another one for the Premium, Affordable Luxury, and Luxury wines.

The majority of wine consumers are likely to be dealing with the first model, which actually covers 65% of the red wines. However, the cognoscenti will principally be interested in the second model, which covers almost all of the remaining 35% — these are the wines that the wine media tend to write about most often. The Icon and Dream categories, with their own (third) model, include only 0.5% of the wines.

This has important practical consequences for buying wine. For example, there is little point in a consumer trying to compare the quality:price ratios (QPR) of US$10 wines and US$50 wines, because they probably won't be the same — the pricing structures are not actually connected. You need to be wearing a different hat when you shop for Premium wines rather than Everyday wines!

Conclusion

For these data (bottled red wines available in Sweden), there appear to be at least three price models. One model covers the most expensive wines (US$ >500), one principally covers the cheapest wines (US$ <20), and the third model covers most of the wines in between. Price variation within each of these models is unrelated to price variation within the other models.

In practice, this means that price comparisons only make sense within each of these three groups — for example, the quality:price ratios (QPR) will be different between the groups. There seems to be no good reason why these conclusions would not apply to the wines sold in other countries with similar wines available, as well, although the details may be different.

Monday, 6 March 2017

Are the quality scores from repeat tastings correlated? Sometimes!

Wine quality scores from commentators tend to be set in stone while the wines are still available commercially. That is, a single score is produced when the wine is released, and it stands indefinitely. Sometimes, wines are tasted several times, and a consensus score is then produced, but not often. Sometimes, only the latest score is the one presented, without reference to any previous scores — Wine-Searcher does this, for example, when compiling critics' scores.


One thing that is of interest to the consumer is how these repeat scores relate to each other. It would be nice if repeated tastings of a particular wine produced the same score, because then we could have some confidence in it. However, there will be some component of the scores that is due to different bottles of the wines, and so we cannot expect perfect repeatability. More importantly, this issue will be confounded by possible changes in the wines themselves as they age, so that any one bottle varies through time.

However, a lot of the variation in scores will be due to what is technically called intra-individual variation in the scorer, which you might call within-taster variation — the same wine tasted on repeat occasions by the same person does not receive the same score no matter how similar it is. The assessment of quality is the result of a taster’s previous experiences as well as their personal conceptions; and even experienced wine tasters have been shown to incorporate their own preferences in their judgments. In addition to this, the environment of the tasting is also known to affect quality judgments.

This issue has only occasionally been studied in the professional literature; and I have included a list of relevant published papers at the end of this post.

What I will do in this post is look at some particular examples of scores from repeat tastings by five different commentators. Some of these tastings come from retrospective vertical comparisons of a single wine, where many of the previous vintages are tasted on a single occasion, or horizontal tastings of a number of wines from the same region and vintage — these new scores can be compared to the scores previously assigned to those same wines.

Some examples

Most of these examples are restricted to wines where there are many vintages to compare, and where the producer actively provides retrospective vertical tastings of their products. Furthermore, my expertise in this regard is in Australian wine. This creates a distinct bias in which wines I can use as examples.

For my first example, I will use some scores from a book by the Australian commentator Jeremy Oliver, The Australian Wine Handbook, of which there were at least three editions: 1993, 1994, 1996. I introduced this book in a previous blog post (Wine writing, and wine books). The wine that I will look at is Penfolds Grange Bin 95, of which there are now more than 60 vintages, and which I also introduced in a previous post (Poor correlation among critics' quality scores).

The first graph compares the quality scores for 38 vintages of this wine in the 2nd and 3rd editions of the book, at which time Oliver was using a 10-point quality scale. Each point represents a single vintage; and if the scores were identical in the two editions, then the points would all lie along the pink line.


There are only 21 vintages (55%) for which the scores are the same, 8 (21%) that decrease from 1994 to 1996, and 9 (24%) that increase. The maximum decrease is 2 points, and the maximum increase is 3 points. The book does not make clear what circumstances led to the two sets of scores, but there is obviously considerable variation in the opinions about quality. Nevertheless, there is no evidence of any bias in Oliver's opinions about the wines, since the increases and decreases are about equally common.
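The percentages follow directly from the counts, and one simple way to check the balance between increases and decreases is a sign test. The Python sketch below uses the counts given above; the sign test is my illustration and is not a calculation reported in the book or used for the graph.

```python
from math import comb

# Counts from the comparison of the 1994 and 1996 editions.
same, down, up = 21, 8, 9
total = same + down + up

for label, n in (("same", same), ("decrease", down), ("increase", up)):
    print(f"{label:9s} {n:2d} {round(100 * n / total)}%")


def sign_test_p(n_up, n_down):
    """Two-sided sign test p-value for up versus down changes, ignoring ties."""
    n = n_up + n_down
    k = min(n_up, n_down)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print("sign test p-value:", sign_test_p(up, down))
```

A p-value near 1 means that the split between increases and decreases is just what would be expected if changes in either direction were equally likely.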

Now let's consider an example where the second tasting involved a retrospective vertical comparison of the wines. This example involves the same wine and commentator. The first set of scores comes from The Onwine Australian Wine Annual, published in 2000, at which time Oliver was using a 20-point scale. These scores are not exactly the same as those from 1996. The second set of scores comes from a retrospective tasting, Making it a Date with Grange, published in 2004. There are 45 vintages included in this next graph.


Once again, there is no evidence of any bias in Oliver's opinions about the wines, although the scores change by a maximum of 1.8, both up and down. The correlation between the scores shows that they share approximately half (51%) of the variation, which is not particularly high, given that they are the same wines tasted only 4 years apart.
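The "shared variation" here is the square of the correlation coefficient between the two sets of scores. A minimal Python sketch of the calculation follows; the scores in it are hypothetical values on a 20-point scale, not Oliver's actual scores.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores for the same six vintages on two occasions.
first = [18.0, 17.5, 19.0, 16.8, 18.5, 17.0]
second = [18.2, 16.9, 18.6, 17.5, 18.8, 17.4]

r = pearson_r(first, second)
largest_change = max(abs(b - a) for a, b in zip(first, second))
print(f"r = {r:.2f}, shared variation = {100 * r * r:.0f}%, "
      f"largest change = {largest_change:.1f}")
```

A shared variation of 100% would mean that the second scores could be predicted exactly from the first ones, while 0% would mean that they are unrelated.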

As an alternative example of a retrospective vertical tasting, we can look at another Australian commentator, James Halliday, and the wine Cullen Cabernet Sauvignon Merlot (now called Diana Madeline) from the Margaret River region of Western Australia. The first set of scores comes from various editions of The Australian Wine Companion, using a 100-point scale. The second set of scores comes from a retrospective vertical, Tasting an Icon, published in the Halliday Wine Companion Magazine for Feb/Mar 2014. There are 19 vintages included in the next graph.


This time we see several vintages that have very different scores in the two datasets. Two of the vintages have their scores reduced by 5-6 points (bottom of the graph), and one gets a 5-point increase (top-left). Furthermore, the scores are not highly correlated even if we exclude these three vintages, with only 20% of the variation being shared between the two datasets.

Moving on, the wines of Bordeaux often have repeated scores from single sources that cover many vintages. For example, the American magazine Wine Spectator published two retrospective vertical tastings of the top wine from Château Lafite-Rothschild, one on 15 December 1988 and one on 30 November 1991. There are 34 vintages included in the next graph.


Once again, we see several vintages that have very different scores in the two datasets; these vintages are labeled in the graph. Even if we exclude these three vintages, then the scores are still not correlated with each other, as only 8% of the variation is shared between the two datasets. This is a very poor correlation, given that they refer to repeat tastings of the same wines tasted only 3 years apart.

As an alternative approach, we could try comparing a range of wines from the same region in the same vintage year — that is, a horizontal tasting rather than using a vertical one. To do this, let's look at the American commentator James Laube, and the 1986 vintage California Cabernets. The first set of scores comes from the 1988 book California's Great Cabernets: The Wine Spectator's Ultimate Guide for Consumers, Collectors and Investors. The second set of scores comes from a retrospective horizontal tasting, 10 Years After, published in Wine Spectator for December 15 1996 (pp. 67-70). (Thanks to Bob Henry for compiling these two datasets.) There are 45 different cabernet wines included in the next graph.


As is obvious, there was a major re-assessment of the wines at the second tasting, as almost all of the points lie below the pink line, rather than being scattered around the line (as in the above graphs). This vintage did not turn out to be as good as originally expected!

On release, the wines were assessed generally in the score range 88-96, but 10 years later these same wines scored only 86-94 points, with an average score reduction of 2.6 points per wine. This seems to relate to the idea that there are bonus quality points available for wines based on their expected longevity (see the post What's all this fuss about red versus white wine quality scores?). After 10 years, it was obvious that the 1986 cabernets were not going to last as long as expected, and so their bonus points disappeared.

It is clear that the consumer should pay attention to the Wine Spectator and the Wine Advocate, both of which are known to conduct reviews of California Cabernets at the 10th, 20th, 30th and sometimes even 40th anniversaries of the vintages.

Finally, we can return to the vertical tastings, and look at a British commentator, Jancis Robinson. We can also return to Australian wines, this time the Henschke Hill of Grace Shiraz, a wine from the Eden Valley region of South Australia, and with almost as long and distinguished a pedigree as the Penfolds Grange discussed above. Robinson took part in two retrospective vertical tastings of this wine, one on 18 March 2003 and the other on 12 March 2013, which marked the 40th and 50th anniversaries, respectively, of the first vintage. There are 23 vintages included in the next graph, scored on a 20-point quality scale.


Once again, there was apparently a major re-assessment of the wines at the second tasting, as almost all of the points lie above the pink line, rather than being scattered around the line. Note that this is a comparison of two retrospective tastings, unlike the above graphs.

The age of the wines at the first tasting was 5-28 years, and the scores ranged from 15-18.5, with an average of 16.7 points. The age of the same wines at the second tasting was 15-38 years, and the scores ranged from 16-19, with an average of 17.7 points, for an average increase of 1 point. Obviously, either the wines changed or Robinson's perception of them did. That is, is this score inflation, or did most of the wines get better through time?

To work this out, we can plot all of the scores from the two retrospective tastings, not just those tasted on both occasions. This final graph shows the two sets of scores for each vintage tasted, in different colors. (Note: the missing vintages in the second tasting are ones in which the wine was not considered good enough to release.)


Clearly, all of the scores on the second occasion (in green) are consistent, and so there is no evidence that the quality of the wines improved over time. Instead, we must conclude that Robinson simply scored the wines 1 point higher on the second occasion. This may reflect her better familiarity with this style of wine, or it may reflect her well-known dislike of assigning wine scores in the first place.

Conclusion

Wine quality scores are usually presented as though they are inviolate, and represent a critic's opinion that does not change. That is, we are given one number, which does not get changed. This may be valid if the commentator tastes each wine once, and once only. However, the commentators have been known to reevaluate wines at different times, especially if they are invited to a retrospective tasting, either vertical or horizontal. In this case, the evaluations at different times can be really quite different.



Research Literature

Robert H. Ashton (2012) Reliability and consensus of experienced wine judges: expertise within and between? Journal of Wine Economics 7:70-87.

Chris J. Brien, P. May, Oliver Mayo (1987) Analysis of judge performance in wine-quality evaluations. Journal of Food Science 52:1273-1279.

Richard Gawel, Peter W. Godden (2008) Evaluation of the consistency of wine quality assessments from expert wine tasters. Australian Journal of Grape and Wine Research 14:1-8.

Richard Gawel, Tony Royal, Peter Leske (2002) The effect of different oak types on the sensory properties of chardonnay. Australian and New Zealand Wine Industry Journal 17:14-20.

Robert T. Hodgson (2008) An examination of judge reliability at a major U.S. wine competition. Journal of Wine Economics 3:105-113.

Robert T. Hodgson (2009) How expert are "expert" wine judges? Journal of Wine Economics 4:233-241.

Harry Lawless, Yen-Fei Liu, Craig Goldwyn (1997) Evaluation of wine quality using a small-panel hedonic scaling method. Journal of Sensory Studies 12:317-332.