Monday, February 25, 2019

How much difference can there be between critics?

I have previously written a couple of posts in which I looked at wine-quality scores where critics tasted exactly the same bottles of wine at the same time.
In those cases, the quality scores differed to one extent or another. However, what I have not yet considered is just how big those differences can get, between any given pair of professional wine tasters. Here, I present an example where those differences are very big indeed.


When looking at variation in wine-quality scores, it is important to eliminate the effects of different bottles and tasting conditions, by having the scores be produced from the same bottles at the same time. This is, of course, what happens at most group wine tastings. Wine Spectator magazine is a useful source of this sort of information, as it has occasionally held direct tasting comparisons between pairs of its critics, in addition to the tastings conducted by each of its regional experts on their own.

The exercise

I have previously used data concerning James Laube and James Suckling, who have both provided wine-quality scores to Wine Spectator regarding Cabernet wines (Laube versus Suckling — their scores differ, but what does that mean for us?). This time, I will compare James Laube with Per-Henrik Mansson, as they have both provided scores for Chardonnay wines, with Laube as the California expert and Mansson as the Burgundy expert. Mansson has subsequently moved on from the magazine.*

James Laube and Per-Henrik Mansson

The dataset I will use here is from the "Chardonnay Challenge" of 1997 (see Wine Spectator for November 15, 1997, pp. 46–70), in which our two critics tasted 10 California Chardonnay wines and 10 Burgundy white wines from both the 1990 and 1995 vintages.** However, there were only 39 bottles of wine with which to compare their scores, as one of the Burgundies from 1990 was not available in time for the tasting.

The data are shown in the first graph, with Laube's scores vertically and Mansson's horizontally. Each point represents one of the 39 bottles.

Mansson vs. Laube for 1990 and 1995 chardonnays

This does not look too good, to me — in fact, it looks terrible. There is a wide spread of points in the graph (note, also, that Mansson's scores cover a bigger range than Laube's). The mathematical correlation indicates only 3% agreement between the two sets of scores, which is almost no agreement at all. To make this clear, the solid pink line shows what agreement would look like — for bottles whose points are on this line, the two critics perfectly agreed with each other. Clearly, only 2 of the 39 bottles lie on this line. The Laube score is > the Mansson score 22 times, and 15 times it is the other way around.

The two dashed lines in the graph show us ±2 points from perfect agreement — for bottles between the two lines, the two sets of point scores were within 2 points of each other. This allows for the approximate nature of expert opinions — technically, we are allowing for the fact that the scores are presented with 1-point precision (eg. 88 vs. 89 points) but the experts cannot actually be 1-point accurate in their assessment.

There are only 10 of the 39 bottles (26%) between the dashed lines. So, even when we allow for the approximate nature of expert opinions, there is much more disagreement here than there is agreement.
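For anyone who wants to replicate this sort of summary, here is a minimal sketch in Python. The scores shown are invented placeholders, not the actual tasting data; and I am treating the "3% agreement" figure as the squared correlation coefficient expressed as a percentage, which is my assumption about the calculation.

```python
import numpy as np

# Placeholder scores for illustration only — not the actual 39 tasting scores
laube   = np.array([88, 90, 86, 91, 89, 87, 92, 85, 90, 88])
mansson = np.array([92, 84, 95, 88, 80, 93, 89, 91, 86, 94])

# Squared Pearson correlation, expressed as a percentage "agreement"
r = np.corrcoef(laube, mansson)[0, 1]
print(f"correlation-based agreement: {100 * r**2:.0f}%")

# Bottles on the line of perfect agreement, and within 2 points of it
exact  = np.sum(laube == mansson)
within = np.sum(np.abs(laube - mansson) <= 2)
print(f"identical scores: {exact} bottles; within 2 points: {within} bottles")

# Who scored higher, bottle by bottle
print(f"Laube higher: {np.sum(laube > mansson)}; Mansson higher: {np.sum(mansson > laube)}")
```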

Another way of dealing with the approximate nature of expert scores is to greatly reduce the number of score categories, so that all the experts need to do to agree is pick the same category. The Wine Spectator does it this way:
95 – 100  Classic: a great wine
90 – 94   Outstanding: a wine of superior character and style
85 – 89   Very good: a wine with special qualities
80 – 84   Good: a solid, well-made wine
75 – 79   Mediocre: a drinkable wine that may have minor flaws
50 – 74   Not recommended

Mansson vs. Laube for 1990 and 1995 chardonnays

So, I have shown this scheme in the second graph. For bottles within the boxes, the two critics' point scores agree as to the word categories of wine quality. Sadly, this is true for only 6 of the 39 wines (15%). So, even this broad-brush approach to wine-quality assessment provides only about one-sixth agreement between the two critics.
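The word categories make this sort of comparison easy to automate. Here is a minimal sketch of how the category agreement could be counted — the paired scores are placeholders, not the real data from the tasting.

```python
def ws_category(score):
    """Map a 100-point score to its Wine Spectator word category."""
    if score >= 95: return "Classic"
    if score >= 90: return "Outstanding"
    if score >= 85: return "Very good"
    if score >= 80: return "Good"
    if score >= 75: return "Mediocre"
    return "Not recommended"

# Placeholder paired (Laube, Mansson) scores, for illustration only
pairs = [(88, 92), (90, 84), (86, 95), (91, 88), (89, 80)]

same_category = sum(ws_category(a) == ws_category(b) for a, b in pairs)
print(f"{same_category} of {len(pairs)} bottles fall in the same category")
```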

For comparison, the Laube versus Suckling Cabernet tasting (mentioned above) produced much better agreement. Their mathematical correlation was 29% (versus only 3% this time), there were 5 out of 40 bottles on the solid line (2 out of 39 this time), 23 out of 40 bottles between the dashed lines (10 out of 39 this time), and 25 out of 40 bottles within the squares (6 out of 39 this time). Suckling and Laube did not agree much with each other, but Mansson and Laube hardly agreed at all.

To make this point clear, the third graph illustrates the differences in the paired scores, expressed as the Mansson score minus the Laube score (horizontally), with the number of bottles counted vertically. Clearly, the scores differ by up to 10 points (Mansson greater than Laube) and 13 points (Laube greater than Mansson). I have rarely seen scores differ by this much — 13 points is a lot of quality-score difference. It is pertinent, I think, to ask whether these two people were actually tasting the same wines!

Mansson vs. Laube for 1990 and 1995 chardonnays

As an aside, it is worth noting the overall low scores given to the wines. Only 16 of the wines scored >90 points, even though they were all quite expensive. This is quite comparable to the previous year's Cabernet tasting, where only 17 wines scored >90 points.

What does this mean for us?

Obviously, we should be asking what is going on here. The magazine is presenting their scores as representing some sort of Wine Spectator standard of quality, but clearly this is not an objective standard of quality. The scores are personal (but expert) judgments by their individual critics, who may have very little in common.

In this case, the situation is illustrated in the final graph, which shows the average scores for each critic for the four types of wine — California versus Burgundy, for both the 1990 and 1995 vintages. Put simply, James Laube preferred the California wines in both years, and Per-Henrik Mansson particularly liked the 1995 Burgundies. The only wines they agreed about were the 1990 Burgundies.

Mansson vs. Laube for 1990 and 1995 chardonnays

Mansson's preference for the 1995 Burgundies is explained in his notes:
I looked beyond aromas and flavors for what I think are the two most important factors in determining a great Chardonnay: a seamless, silky texture in the mid-palate and a clean, elegant, balanced finish ... I often find that young California Chardonnays taste overly oaky and acidic. After a glass or two, they seem heavy, even dull and flat. The 95s reinforced this negative impression; compared to the beautifully balanced, elegant, supple yet succulent white Burgundies, the California whites tasted slightly bitter to me, with a few notable exceptions.
Laube's consistent preference for the California wines, however, is not explicitly explained. His published notes are almost entirely about how much better value for money the California wines were compared to the Burgundies — the Burgundies cost up to 10 times as much but were no better. However, since the wines were tasted blind, this cannot explain his scores. His only brief comment is:
California Chardonnays tend to be fruitier, white Burgundies a shade earthier.
This is consistent with his notes for the previous Cabernet tasting:
I like my wines young, rich, highly concentrated and loaded with fruit flavors.
The problem for us is that these critics' quality scores are not really comparable. They give us a rank order of preference for each critic, but any attempt to directly compare them makes little sense. Unfortunately, comparing them is precisely what the magazine actually asks us to do (and I did!).



* I bet his name was Månsson before he left Sweden.

** Thanks to Bob Henry for sending me a copy of the magazine.

Monday, February 18, 2019

If not scores or words to describe wine quality, then what?

The point that I have been making in a number of my recent blog posts is that wine-quality scores give the illusion of being mathematical without having any useful mathematical properties. This is not quite fraudulent, but it is unfortunate — the apparent precision of the numbers gives an illusion of accuracy.

This means that, in practice, quality scores do nothing more than express personal preferences — they allow the critic to put wines in some sort of rank order. There is nothing wrong with wanting to do that, of course (see Wine tastings: should we assess wines by quality points or rank order of preference?). Indeed, if you can find someone with the same tastes as yourself, knowing about their preferences can be very helpful for buying wines.



However, this does raise the obvious question as to what we might use, instead, to express a rank order of preference. This question has been raised by a number of people who have expressed their conclusions publicly, as well as by many who have not. This blog post gathers together some of the public ones. The collection is neither comprehensive nor exhaustive, but merely introduces you to some of the suggestions that I have found interesting.

Using letters not numbers

This approach is what teachers use, whether they are at a high school, a university, or a community college. So, you are all familiar with it. Each student's academic score (expressed as a %) for each semester or year-end is converted to a simpler grading system, which then expresses the student's achievement.

Sometimes, the grade ranking is something like this (from best to worst): A, B, C, D, E and F, where D conveniently means "Deficient" and F means "Fail". Other times it might be like this: A+, A, A-, B+, B, B-, C. The one I had at university in Australia used: High Distinction, Distinction, Credit, Pass, Terminating Pass, and Fail. Note that in this case D means "Distinction", not "Deficient"! There is nothing universal about educational grading systems.


This general approach is thought to work better than the original percentage scores because there is an explicit meaning to each of the grades — getting an A means something quite different to getting a B. This forces the teacher to be very careful about which students get which grades. As noted by David Schildknecht:
Scores of wine critics aren't bound to intuitively meaningful letter grades as are those of educators — although I grant that the association of a 100-point scale with educational grading is so strong that even in wine criticism, despite incommensurability, many consumers will think of e.g. "90" as having some intuitive sense.
This grading system can work provided that the institution involved has a convention for assigning a particular numerical range to a given letter grade, and that the grades objectively correspond to a specified level of academic achievement.
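As a small illustration of such a convention, here is one possible mapping from percentage scores to letter grades — the cut-off values are invented for the example, since every institution sets its own.

```python
def letter_grade(percentage):
    """Convert a percentage score to a letter grade (illustrative cut-offs only)."""
    if percentage >= 85: return "A"
    if percentage >= 70: return "B"
    if percentage >= 55: return "C"
    if percentage >= 45: return "D"
    return "F"

print(letter_grade(78))  # -> "B"
```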

Moody credit rating system

This idea was suggested by wine merchant David Farmer (John Moody and brave or foolish judging). David was unhappy with current wine judging systems, particularly the well-known fact that a wine can win a gold medal at one show and nothing at all at the next show. He was therefore looking for a more complex judging system, which he calls Three Dimensional:
After a bit of searching I stumbled across the inventiveness of John Moody who in 1900 developed a rating system for financial paper. I found his ideas had a three dimensional character to grading which offered, if not solutions for wine judging, a display of technique which could perhaps be adapted ... As I studied the Moody system I found it offered many parallels to wine judging.
Moody's rating system is one of several that are used to express credit ratings in the financial world (the others being from Standard & Poor's and the Fitch Group). It can be applied to bonds and to countries, for example, to indicate their credit-worthiness. The original idea was to provide investors with impartial information about the credit-worthiness of financial entities, which the credit-rating agencies were paid to compile.

We do not need to concern ourselves with the methodology used for the economic evaluation, but only with the way the outcome is expressed. The Moody system uses a linear ranking with 21 levels (from best to worst):
Aaa, Aa1, Aa2, Aa3, A1, A2, A3, Baa1, Baa2, Baa3, Ba1, Ba2, Ba3, B1, B2, B3, Caa1, Caa2, Caa3, Ca, C
Clearly, the system is more complex than a simple score of 0 to 20, such as might be used for wines. In the Moody system, it matters whether the entity being rated is in group A, B or C, and within the groups it matters how many "a"s it gets. For example, being downgraded from Aaa to Aa1 has a specific meaning within the system (see Wikipedia). This is in contrast to a wine-scoring scheme, where adjacent scores have little or no objective meaning.
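To make the contrast with wine scores concrete, here is a minimal sketch of the idea — the 21 levels below follow the published Moody order, while the code around them is simply my illustration of how an explicitly ordered scale can be compared step by step.

```python
MOODY_SCALE = [
    "Aaa", "Aa1", "Aa2", "Aa3", "A1", "A2", "A3",
    "Baa1", "Baa2", "Baa3", "Ba1", "Ba2", "Ba3",
    "B1", "B2", "B3", "Caa1", "Caa2", "Caa3", "Ca", "C",
]

def rank(rating):
    """Position on the scale: 0 is the best (Aaa), 20 is the worst (C)."""
    return MOODY_SCALE.index(rating)

# A downgrade from Aaa to Aa1 is a move of exactly one defined step on the scale
print(rank("Aa1") - rank("Aaa"))  # -> 1
```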

This general approach to expressing a rank order has a lot of similarity to that of the educational system, as described above, being at heart a more elaborate version of it.

Wine tally

James Ho uses the name WineTally on Facebook, Instagram and CellarTracker. He has suggested what he describes as "a Simple, Sensible Scoring System for wines to enable every consumer to articulate one's evaluation and experience." This boils down to providing the reader with the individual sub-scores as well as the final score.


There are four wine-character dimensions (Accuracy, Balance, Complexity, Depth), with 3, 4, 4 and 3 possible levels respectively, giving 3 × 4 × 4 × 3 = 144 possible outcomes. Technically, the wines are thus placed into one of 144 points in multi-dimensional space (ie. there are only 144 possible different wines). These 144 outcomes are then reduced to one of 11 possible scores (scale 0-10).
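To make the combinatorics concrete, here is a small sketch that enumerates the 144 possible outcomes — the level labels are placeholders of my own, since only the counts per dimension are implied by the description above.

```python
from itertools import product

# Placeholder level labels — only the counts (3, 4, 4, 3) come from the WineTally description
levels = {
    "Accuracy":   ["A1", "A2", "A3"],
    "Balance":    ["B1", "B2", "B3", "B4"],
    "Complexity": ["C1", "C2", "C3", "C4"],
    "Depth":      ["D1", "D2", "D3"],
}

outcomes = list(product(*levels.values()))
print(len(outcomes))  # -> 144 possible wine profiles
```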

Most of Ho's wines get a score of 7, 8, or 9, occasionally 6, rarely 5 or 10 (see the examples below). There are 18 ways to get a score of 7 (ie. 18 distinct wines could get a score of 7), ten ways to get 8, and four ways to get 9. So, this certainly keeps things simple, in terms of expressing a rank order of preferences.


We might see this as a simplification of the previous two systems, while at the same time providing more detail.

Music rating

Bob Henry has suggested using a form of reviewing that already exists for assessing music. He has noted:
Audiophiles who read British review magazines have been introduced to a twin scale: one judging the sonic fidelity of the recording (against the reference standard of live acoustic music performed, without electronic amplification, in a concert hall), and artistic interpretation of the musical piece.
As an extreme example: over 100-year-old Edison phonograph cylinder recordings of operatic tenor Enrico Caruso have low sonic fidelity (by contemporary standards) to the sound of the human voice and accompanying acoustic instruments ... yet could represent the finest artistic interpretation in recorded history. The recording gets two scores: an “F” letter grade for [poor] fidelity and an “A” letter grade for [great] artistry.
Similarly, a wine could be less than true-to-type or even technically flawed (50 points) ... and yet be hedonistically sublime (100 points).
Thus, this expands the wine score to two separate components, one technical and one artistic, rather than the usual approach of confounding the two. Each wine gets two scores, which may be numerical or not.
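As a minimal sketch of what such a two-part assessment might look like as a data record, here is one possibility — the field names and the example values are mine, not part of Henry's suggestion.

```python
from dataclasses import dataclass

@dataclass
class TwoPartScore:
    """A wine assessed on two separate axes, as suggested for music reviews."""
    technical: str   # fidelity / true-to-type / freedom from faults
    artistic: str    # hedonistic or interpretive quality

# The Caruso-style case: technically poor, artistically sublime
flawed_but_sublime = TwoPartScore(technical="F", artistic="A")
print(flawed_but_sublime)
```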

Further thoughts on music

The connection between wine and music has been taken much further by other people. For example, Morten Scholer has noted:
Some winegrowers, in particular in Italy, have loudspeakers in their vineyards and in their cellars playing classical and baroque music. They have experienced that the sound waves have a positive influence on the quality of the grapes and wine.
He lists a few specific examples; and David Schildknecht adds another (2012. The World of Fine Wine, issue 36, p. 32). This idea has been formally examined by Charles Spence and Qian Wang (2015. Wine and music (I): on the crossmodal matching of wine and music. Flavour 4: 3).

Of more relevance to the discussion here, Michel Bettane has suggested (The music of wine):
For me, the individuality of a noble wine is analogous to music-tone, voice, song. I think of wine as a musical score, composed of geological, climatic, agronomic, and enological notes that must be sight-read, understood, and interpreted. The taste is the result of that interpretation.
The Australian wine critic Mark Shield took things even further than this. He used an alternative approach to that of numerically scoring wines, by comparison to music itself — that is, the music that the wine most resembles, or the music it makes you wish you were listening to. This is analogous more to a wine note than to a wine score.

Mark Shield

Philip White has noted:
Mark Shield, the dreadfully missed Melbourne wine writer, made no bones about how the music of Thelonius Monk influenced his palate as he reviewed wine. He said he couldn’t taste properly without it, and often referred to different compositions of his favourite jazz composer and player relative to specific wines.
As but one specific example of Shield's approach (using classical music):
If they ever bottled Beethoven's 5th Symphony it would be the 1982 Cape Mentelle Cabernet sauvignon.
Shield was a widely admired writer in a self-confessed Australian larrikin style. It is a great pity that his wine columns (Rough Marc in Wine and Spirit, Noble Rot in the Sunday Age, as well as others) are not freely available on the web, for all to read. His missives from the Rat Shack are among my fondest memories of my first forays into the wine literature.

Monday, February 11, 2019

US 2018 grape harvest prices visualized

It is of some interest to know what prices are paid for grapes, as this obviously affects wine prices. In particular, there are often big price differences between areas and between grape varieties. The problem, for me at least, is that this information is usually provided in a large table full of numbers, which is a bit overwhelming.

Even when a table is arranged in a visually helpful way (and many of them are not), I still find it awkward to find the patterns I am looking for. What I want is a picture, not a mass of numbers. This blog post provides such a picture, for the 2018 grape harvest prices in the USA.


The data come from Grape Connect, which describes itself as "a transparent and secure wine-grape, juice, and bulk wine marketplace". They have a neatly arranged table of the Average Grape Pricing by U.S. Appellation [2018 Harvest]. It "includes pricing data for the 25 most represented varietals in respect to online listing count from 1/01/18 until 9/26/18; it shows average per-ton pricing by AVA for the 2018 harvest."

The table is arranged in four columns: Grape variety, US state, American Viticultural Area (AVA), and $US price (weighted average price per ton, with listing quantity as the weight). So, each row in the table presents the data for one grape variety in one AVA. This allows the reader to look up the price of a particular grape in a particular area, but it makes it very difficult to compare prices between areas and between varieties, and those comparisons are also of interest.

In order to see pricing patterns among the grape varieties and AVAs, the table would be much better arranged with the varieties as the columns and the AVAs as the rows. Each cell in the table would then contain the price. That way, I could compare prices between grape varieties within each AVA by simply looking across a single row; and I could compare prices between AVAs for each grape variety by simply looking down a single column.

Even better would be a picture, not a table. Such a figure is called a Heat Map. This uses colors to represent the prices, rather than using the actual numbers. Here is an example, based on the Grape Connect data.

Heat map of the US 2018 grape harvest prices

Click the image to see the full size (879x1600 pixels), where everything is readable.

The grape varieties are in alphabetical order, horizontally; and the AVAs are in alphabetical order within states, vertically. The prices (average weighted price per ton) are shown by the colors, as indicated by the scale in the bottom-right corner. The prices have been log10 transformed — the minimum price is $400 = 2.6 (orange); and the maximum is $9,500 = 4.0 (crimson). Missing combinations are colored white (ie. areas without price data for that grape variety).
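For anyone who wants to try this themselves, here is a minimal sketch of how such a heat map can be produced with pandas and matplotlib — the file name and column names are my assumptions, standing in for however the Grape Connect table has been saved.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names — adjust to match the actual Grape Connect export
prices = pd.read_csv("grape_prices_2018.csv")   # columns: Variety, State, AVA, Price

# Pivot so that the varieties are the columns and the AVAs (within states) are the rows
pivot = prices.pivot_table(index=["State", "AVA"], columns="Variety", values="Price")

# Log10-transform the prices, as in the figure; missing combinations stay as NaN (shown blank)
log_prices = np.log10(pivot)

fig, ax = plt.subplots(figsize=(8, 16))
image = ax.imshow(log_prices.to_numpy(), aspect="auto", cmap="YlOrRd")
ax.set_xticks(range(log_prices.shape[1]))
ax.set_xticklabels(log_prices.columns, rotation=90, fontsize=6)
ax.set_yticks(range(log_prices.shape[0]))
ax.set_yticklabels([f"{state}: {ava}" for state, ava in log_prices.index], fontsize=5)
fig.colorbar(image, label="log10 price per ton")
plt.tight_layout()
plt.show()
```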

The heat map quickly allows us to see how widespread each grape variety is, by how much of each column is filled — for example, Cabernet is more common in California than elsewhere, and Pinot is more common in Oregon*. We can also see which AVAs have lots of grape types, by how much of each row is filled — for example, the Washington AVAs tend to have more varieties than the other states (4 out of the 12 AVAs have at least 10 varieties**). We can also see the grape prices, by the colors — for example, Cabernet has the highest prices. These patterns are not necessarily unexpected, but the point is that we can easily see them using the heat map, which we cannot do in the original table.

It might be even better to re-arrange the rows and columns of the heat map, to put the highest prices near each other. This can be done, but I do not currently have access to suitable software. The downside of doing this is that it would probably no longer allow the state prices to be seen (because AVAs from different states would be mixed).
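One possible way to do this, for those with the software, is a clustered heat map, which reorders the rows and columns so that similar price profiles sit next to each other. Here is a hedged sketch using seaborn, again with assumed file and column names; the missing prices are crudely filled in, because the clustering step cannot cope with gaps.

```python
import numpy as np
import pandas as pd
import seaborn as sns

prices = pd.read_csv("grape_prices_2018.csv")   # assumed columns: Variety, State, AVA, Price
pivot = np.log10(prices.pivot_table(index="AVA", columns="Variety", values="Price"))

# Hierarchical clustering cannot handle missing cells, so fill them with the column medians
filled = pivot.fillna(pivot.median())

# Reorders both rows and columns so that similar price profiles end up adjacent
sns.clustermap(filled, cmap="YlOrRd", figsize=(10, 16))
```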

[Postscript: the Grape Connect web page has been updated with a neat heat map of its own.]



* Cabernet sauvignon has prices for 60 of the AVAs in the table, which is 50% more than its nearest competitor (Syrah).

** Columbia Valley, WA and Rogue Valley, OR each have 15 grape varieties out of the 25 varieties that make it into the table.

Monday, February 4, 2019

How many 100-point scores do critics really give?

There has been much talk about the apparently inexorable increase in the values of wine-quality scores over the past few decades. The scores from 1983 (when Robert Parker introduced his 100-point scale) to the year 2000 averaged much less than have the scores from 2000 to 2018. Indeed, Jamie Goode noted two years ago: "The 100-point scale is very compressed at the top end …. and this scale is becoming so bunched at the top end that it is nearing the end of its useful life."

This leads to the obvious question about just how crowded the top end might be, irrespective of whether we use 100 points or 20 points, or something else.


Before looking at the actual data, though, it is important to note that there are two possible interpretations of a maximum-point score: (i) the wine is as good as we expect to meet in our lifetime; or (ii) it is the best that could ever be. If we mean the latter, then we run the risk of claiming that we are the arbiters of perfection. As Ambrose Bierce defined it:
PERFECTION, n. An imaginary state of quality distinguished from the actual by an element known as excellence; an attribute of the critic.
So, it may be best to claim that we mean option (i), not option (ii). Under these circumstances, of course, when we subsequently do encounter a better wine then we would need to assign a score in excess of 100 points (see Why not expand the 100-point scale?).

Some data

One suitable place to look for data about how often wine commentators use maximum scores is the Wine-Searcher database. Here, we are provided with hundreds of thousands of wine-quality scores from 30 or so critic sources (some of which represent groups of people). For most of the critics we can, at the click of a button or two, get a list of their 500 top scores. This allows us to compile the data reported here (mostly compiled at the beginning of this year).

The only tricky data to compile come from Robert Parker himself, or more generally his publication the Wine Advocate, as the scores actually come from a number of people. The issue is that there are more than 500 100-point scores in the database. For example, Lisa Perrotti-Brown recently noted just how many 100-point scores the Advocate has for the 2016 Napa wines, alone. So, I would like to thank the people at Wine-Searcher (especially Robert Anding) for looking up some of the numbers for me.

The data that I compiled cover 23 critics who use the 100-point scale and 6 critics who use the 20-point scale. In each case, I calculated the percentage of their scores that are the maximum; and for the 100-point scale also how many were near-maximum (99 and 98 points).
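The calculation itself is trivial, as this minimal sketch shows — the scores listed are placeholders, not any critic's actual data.

```python
def score_summary(scores):
    """Percentage of 100-point-scale scores at the maximum, and in the 98-100 range."""
    n = len(scores)
    at_max   = 100 * sum(s == 100 for s in scores) / n
    near_max = 100 * sum(s >= 98 for s in scores) / n
    return at_max, near_max

# Placeholder data for illustration only
example_scores = [92, 95, 98, 100, 97, 99, 96, 100, 94, 93]
print(score_summary(example_scores))  # -> (20.0, 40.0)
```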

Proportion of maximum wine-quality scores from selected commentators

The first graph shows the 29 critics, ranked in order of how many of their scored wines received maximum points. Not unexpectedly, Robert Parker and the Wine Advocate team are the principal culprits, although Jeff Leve (at the Wine Cellar Insider) is trying very hard. However, quite a range of the other commentators have non-negligible numbers of top scores. Indeed, only 9 of the critics (one-third) have no maximum-point wines in the database. Interestingly, this includes 3 of the 4 critics from Australia (Jeremy Oliver, Huon Hooke, and the three-headed Wine Front).

This leads me to wonder whether some of these people are giving high scores but without being willing to produce the attention-getting maximum scores. I examined this by looking at the percentage of scores covering the 98-100 points range for those 23 critics using the 100-point scale. (This does not really work for the 20-point scales, since the data collection would involve half-points, which Wine-Searcher does not record.)

Proportion of 98-100-point wines from selected commentators

These data are shown in the second graph, with the critics still listed in the same order as above. This shows that everyone uses near-top scores, and that some people use them a lot. In particular, Luca Gardini has no 100-point scores but quite a few at 98 and 99 points. Furthermore, Tim Atkin and Daniele Cernilli (at Doctor Wine) clearly use an over-abundance of 98 and 99 points, compared to 100 points.

If we exclude the Parker/Advocate scores, then 0.3% of the scores in the database have maximum points (ie. 3 out of every 1000 wines); and 0.8% of the scores are in the range 98-100 points. These numbers may be lower than many people are expecting. This seems to be mainly because 34% of the Wine-Searcher scores actually come from the Wine Spectator magazine (or 32% if we include all of the database scores), and its contributors produce relatively few scores in the 98-100 range (0.07%). The Wine Enthusiast is the next-biggest contributor (17% of the scores), and even it has only 0.16% of its scores in the 98-100 range.

Finally, it is instructive to look at the four Australian critics plus the lone New Zealander. James Halliday has long been singled out in Australia for handing out a lot of high scores (eg. What's in a number? Part the second), and the data show that he does indeed use 100 points more than do any of his compatriots. However, the 98-100-point data show a very different picture. Jeremy Oliver is the only one with fewer 98-100-point wines than Halliday; and Huon Hooke and Bob Campbell exceed Halliday to the tune of 3.0 and 3.6 times as many wines! Apparently, only Oliver has not yet succumbed to the lure of the high-level scores.