Wine cannot be advertised for sale on the English-language eBay sites without a liquor license (e.g. in the U.S.A., U.K., Australia, Canada, Ireland). However, it can be sold privately on many of the mainland European sites (e.g. Austria, Belgium, France, Germany, Italy, Netherlands, Spain), except to minors. The wine can then easily be sent anywhere within the European Union. Indeed, many European wine shops use eBay as one of their online portals.
This is generally a Useful Thing for customers, because older vintage wines are widely available, usually much cheaper than in wine shops or at other auctions. However, the buyer must beware. In eBay terms, for older wines you are formally buying the bottle, not its contents, since there is no independent evaluation of the condition of the wine, as there is at traditional auction houses (an evaluation for which the buyer is charged a substantial premium).
I have purchased some very nice wines from 1945-2000 this way, although I have also had a few rather mediocre ones.
I have not yet been ripped off. Indeed, eBay prides itself on dealing with shonky activities by its members, although these activities still exist, and will presumably continue to do so. Last year, I encountered the following example, which I explain here for your education, because it involves a general issue with eBay.
A Milan-based seller became active selling old vintages of Barolo wine. This in itself is not unusual, but what attracted my attention was that the seller was offering free shipping, apparently worldwide. That is very unusual, because international shipping costs from Italy (even within the European Union) are often more expensive than the wine itself. How could the seller afford this? Buyer beware!
So, I decided to keep a curious eye on several of the wines. When I did so, an unusual bidding pattern appeared.
I have attached at the bottom of this post images of the final bidding results for all seven of the wines that I followed. Many more wines were offered by the seller, but I did not check their results. You will note that in all seven cases a previously unknown bidder (i.e. one who had never bought anything on eBay before) put in a late bid. In six of the seven cases this newbie bidder won the auction.
This is a quite unbelievable coincidence, and I do not for one moment believe it. I occasionally see newbies bidding high prices on wine, but not seven different newbies bidding on all of the wines that I happen to be watching. If you are prepared to accept this, then I have this bridge in Brooklyn that I would like to sell you ...
Indeed, this looks exactly like shill bidding — defined as "bids on an item with the intent to artificially increase its price or desirability." Normally, the shill bidder does not win the item, but merely forces the other bidders into bidding unnecessarily high, preferably by forcing them to their maximum possible bid. This happened for one of the seven auctions shown below (the fourth one), in which an inexperienced bidder paid €151 for a wine that no-one else thought was worth more than €100. So, the shill bidder managed to extract an extra 50% of profit from the auction. This also happened for the third auction.
The other five auctions require a somewhat different explanation for their profitability.
Unfortunately, eBay has a mechanism that allows shill bidders to ostensibly "win" the item while still achieving their purpose of forcing another buyer to pay more for the item than they needed to. This is called a Second Chance Offer. After the auction, the highest losing bidder is contacted by the seller and told that they have another chance to buy the item, by paying their maximum bid amount.
So, the purpose of the shill bidding in this case is to reveal, to the seller, the buyer's maximum bid. Normally in an auction, the maximum bid for the highest bidder is not revealed to the seller, only the fact that they bid higher than everyone else. Of course, all of the losers' maximum bids are revealed.
Let's take one example from below, the sixth one. The highest bid is the shill bid (from bidder t***t), which was more than €114 — we do not know the actual bid, but one of the other examples (the fourth one) suggests that it was most likely €150. The second highest bid was €112.98 (from genuine bidder 7***8), and the third highest was €79 (from genuine bidder o***2). This means that, without the shill bid, the item would have sold for €79.50 to bidder 7***8. Instead, a Second Chance Offer is sent to 7***8 for sale of the item at €112.98, with a handsome extra profit of €33 to the seller (in collaboration with the shill bidder, who may or may not actually be a separate person).
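To make the arithmetic concrete, here is a minimal sketch (in Python) of how the Second Chance Offer extracts the extra profit. It is an illustration only, not eBay's actual bidding engine; the €150 shill maximum and the €0.50 bid increment are assumptions based on the figures above.

```python
# Illustration only: a simplified proxy-bidding auction, not eBay's actual engine.

def proxy_auction(max_bids, increment=0.50):
    """Highest maximum bid wins, paying one increment above the second-highest maximum."""
    ranked = sorted(max_bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = min(ranked[0][1], ranked[1][1] + increment)
    return winner, price

genuine = {"7***8": 112.98, "o***2": 79.00}        # the two genuine maximum bids

# Without the shill, the genuine high bidder pays just above the under-bidder.
print(proxy_auction(genuine))                      # ('7***8', 79.5)

# With the shill, the shill "wins" -- but the genuine bidder's maximum is now revealed.
with_shill = dict(genuine, **{"t***t": 150.00})    # 150 is an assumed shill maximum
print(proxy_auction(with_shill))                   # ('t***t', 113.48)

# The Second Chance Offer then asks 7***8 to buy at their revealed maximum of 112.98,
# instead of the 79.50 they would otherwise have paid.
print(round(112.98 - 79.50, 2))                    # 33.48 extra profit for the seller
```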
Note that this approach to shill bidding does also deal with snipe bidders (i.e. those who bid during the last few seconds of the auction — there are some examples below). Snipe bidding is sometimes considered to be immune to the actions of shill bidding (e.g. How to snipe a winning bid), but it is not immune to the Second Chance Offer problem on eBay.
Caveat emptor. Be very wary of eBay's Second Chance Offers. If you want to play safe, ignore them.
Note: This post is modified from a post on my other blog: The Genealogical World of Phylogenetic Networks.
Monday, April 17, 2017
Was the Judgment of Paris repeatable?
A few months ago I wrote a blog post for the Academic Wino, discussing the 1976 wine tasting that has become known as the Judgment of Paris, organized by Steven Spurrier and Patricia Gallagher. At that tasting, wines from France were tasted alongside some wines from California, and the latter acquitted themselves very well in the opinions of the tasters.
Given its outcome, this tasting is possibly the third most important event in the social and economic history of wine in the USA, after the imposition and then the repeal of Prohibition. It was certainly made much of by the media during the Bicentennial; and this has been repeated every 10 years since. Henceforth, wines from around the world were taken seriously, not just those from Europe.
However, one of the things that struck me most strongly about this tasting was just how variable the results were amongst the tasters — hardly any of the tasters agreed closely with each other about the quality scoring of the wines, and especially about which wines were the best among the 10 reds (bordeaux grapes) and the 10 whites (chardonnays).
This immediately calls the repeatability of the results into question. After all, only one bottle of each wine was tasted, on one occasion, by one group of people. What would happen under other circumstances?
This is particularly important to me as a scientist, because it is the ability to independently repeat an experimental result that is considered to be the only really good evidence in science. For example, if no-one else can replicate my experiments for themselves, then my results will not be widely accepted in the scientific community.
So, given that it is common knowledge that the results of wine tastings are often barely repeatable, why was the Judgment of Paris tasting not widely repeated by other people at other places? The results were widely reported, but apparently only Frank J. Prial, writing in the New York Times (June 16 1976, p. 39), warned against taking the unreplicated wine-tasting results too seriously: "One would be foolish to take Mr Spurrier's little tasting as definitive." And yet, this is what the media very much did.
A first attempt at replication
However, Robert Lawrence Balzer did partly replicate the tasting, later in the same year. Balzer was among the earliest of the wine journalists in the USA, specializing in California wines. He was the wine columnist for the Los Angeles Times, and he also wrote his own newsletter, Robert Lawrence Balzer’s Private Guide to Food and Wine. More importantly, he had previously (in 1973) organized an important tasting of French and US wines, in New York (see Wikipedia).
So, if anyone was going to try replicating the Judgment of Paris, and publish the results, it was likely to be Balzer. The resulting tasting was discussed on pages 77-84 of Volume 6 Number 8 of his newsletter. [Thanks to Christine Graham for kindly sending me a copy of this article.]
Unfortunately, Balzer explicitly stated that his tasting was inspired by the Judgment "without any attempt at exact duplication". This is a pity, because an attempt at exact duplication is what we require. So, Balzer had only 9 of the 20 wines duplicated exactly, while some of the others differed either as to vintage or producer, and some were completely different.
For the red wines, 6 wines were identical to the Paris tasting (4 from the US, 2 French), 2 had different vintages (both French), and 2 of the Paris wines were not tasted (both US). For the white wines, 3 were identical (all US), 4 had different vintages (2 US, 2 French), 1 differed as to producer (French), 1 differed as to both vintage and producer (French), and 1 was not re-tasted (US). For the French wines, it was at that time recognized that there could be big differences between wines from different producers even when harvesting grapes from the same vineyard, and also between vintages from the same producers; and so, these differences prevent those wines from being treated as repeats of the Paris tasting.
The results for the 9 repeated wines, averaged across the 9 tasters' scores, are shown in the first graph, with the red wines in blue and the whites in green. If the scores of the two tastings were identical, then the points should lie along the pink line.
The scores for the American tasting are considerably higher than those of the Paris tasting. The Americans were presumably using the UC Davis 20-point scoring system, which the French tasters were definitely not. The Davis system does not use very much of the 20-point range, as it reserves a large part of the range for faulty wines, which was an important part of its development as a teaching tool (see Steve De Long's comparison of wine scoring systems). Even today, French tasters still often use much more of the 20-point range than do Americans (e.g. La Revue du Vin de France).
In spite of this, the scores from the two tastings are correlated — indeed, 67% of the variation in the Balzer scores is directly related to the Paris scores. This is quite a good degree of repeatability. However, it is not the complete picture.
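For readers who want to check such figures themselves, here is a minimal sketch of the calculation, assuming that the "variation in common" is the squared Pearson correlation (r²) between the two tastings' average scores. The score values in the sketch are invented placeholders, not the actual Paris or Balzer data.

```python
# Sketch only: squared Pearson correlation between two sets of average scores.
# The score values are invented placeholders, not the real tasting data.
import numpy as np

paris  = np.array([14.1, 13.2, 11.0, 10.5, 12.3, 13.5, 14.2, 10.1, 9.8])
balzer = np.array([17.0, 16.1, 15.2, 14.8, 15.9, 16.4, 17.2, 14.5, 14.0])

r = np.corrcoef(paris, balzer)[0, 1]
print(f"Variation in common: {r**2:.0%}")
```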
First, note that the rank order of the white wines is not the same in the two tastings — the Chateau Montelena 1973 Chardonnay was ranked first in the Paris tasting, while the Chalone Vineyard 1974 Chardonnay was ranked first in the Balzer tasting. Second, the red wines form two score groups in the Paris tasting, whereas they do not in the later tasting — indeed, the Château Montrose 1970 and the Mayacamas Vineyards 1971 Cabernet had the same average score in the Balzer tasting, whereas they had very different scores in Paris.
Perhaps more importantly, however, the erratic nature of the wine preferences among tasters was repeated in the American tasting. For example, among the red wines, only one person actually chose the Stag's Leap Wine Cellars 1973 Cabernet as their top-scoring wine, in spite of the wine getting the highest average score — and even that person scored it joint top with Château Léoville-Las-Cases 1970. In fact, the 9 tasters chose 7 different wines as their top-rank! The whites were no different, with only one person recorded as picking the Montelena as their (joint) top wine.
So, the things that were repeatable at the repeated tasting were a lot of the "wrong" things. The unreliability of wine tastings was strongly in evidence, and the preference rankings varied (particularly the "winner" among the whites).
A second replication
The only other published tasting that was a serious attempt to evaluate the results of the Judgment tasting occurred nearly 2 years afterwards, in January 1978, at the Vintners Club. This club was formed in San Francisco in 1971, to organize weekly wine tastings (usually 12 wines). Remarkably, the club is still extant (having had only four presidents), although tastings are now monthly, instead of weekly. The early tastings are reported in the book Vintners Club: Fourteen Years of Wine Tastings 1973-1987 (edited by Mary-Ellen McNeil-Draper, 1988).
For the Judgment of Paris replication, 98-99 people tasted the wines over two evenings (white then red), "with Steven Spurrier himself in charge". The tasting allegedly "duplicated [the Paris] tasting to the last bottle", but in fact the vintage listed for the Bâtard-Montrachet Ramonet-Prudhon differs from the Paris event, leaving 19 duplicated wines. The Vintners Club has "always kept to the Davis point system" for its tastings; and so the scores were higher than for the Paris tasting, as discussed above.
The next graph shows the results for the 19 repeated wines, averaged across the 88 (red) and 55 (white) people who provided scores; once again, the red wines are in blue and the whites in green.
As before, variability of the results is the name of the game. Indeed, every red wine was placed first by at least one of the tasters, as well as being placed last by at least one of the tasters; and every white wine was placed first at least once, except for the David Bruce Winery 1973 Chardonnay, and every white was placed last, except for the Chalone Vineyard 1974 Chardonnay.
The Vintners book claims that "the results were very similar to the preceding tasting in Paris", but in fact the scores from the two tastings are not well correlated at all. For the white wines, only 35% of the variation in the Vintners scores is directly related to the Paris scores; and for the red wines it is a measly 10%. For tastings of the same wines under reasonably similar circumstances, these are very low values, and they indicate very poor repeatability.
For the red wines, the Stag's Leap Wine Cellars 1973 Cabernet was placed 1st, as it had been in the previous two tastings. However, the Heitz Wine Cellars Martha’s Vineyard 1970 Cabernet was placed 2nd, having been placed 9th in Paris. For the white wines, the Chalone Vineyard 1974 Chardonnay was placed 1st, as it had been in the Balzer tastings (3rd in Paris), with the Chateau Montelena 1973 Chardonnay placed 2nd (1st in Paris).
With one exception, the Balzer and Vintners tastings are reasonably well correlated (64% of the variation in common), although the Balzer group's scores were (on average) 1 point higher per wine than for the Vintners group. The exception is the Heitz Wine Cellars Martha’s Vineyard 1970 Cabernet, which the Balzer group scored as 14.6 and the Vintners group scored as 16.9. The Spurrier group's result is more in accord with the Balzer group, for this wine.
Conclusion
Neither of these two tastings inspires much confidence in the replicability of wine tastings, let alone the repeatability of the Judgment of Paris in particular. Even to this day, I still read of people expressing the opinion that the difference between Californian and French wines is "obvious". Well, it wasn't obvious to the people at any of these three tastings.
As Mike Steinberger noted in Slate (Nov. 7 2007, In blindness Veritas?): "there is a tendency to overlook the fact that wines and palates are fickle, and to read more into the results than is justified. This was certainly true of history's most famous blind tasting, the 1976 Judgment of Paris".
You will, however, have noted, I am sure, that all three tastings produced a California wine as the "top" for both the reds and whites! There is simply some disagreement about which one it is.
There seems to be little here that supports the media hoopla that ensued in 1976, at least in terms of California versus France "winners". It was the California wine industry that was the big winner, not the individual wines.
Monday, April 10, 2017
Napa versus Bordeaux red-wine prices
In a recent article on Wine-Searcher, Blake Gray addressed the question: Does Napa have too much Cab? The accountant's answer is: "not if you can sell it", which is a very short-term point of view. Somebody with a longer perspective would be more interested in whether the domination of the Napa Valley vineyards by cabernet sauvignon wines is sustainable, because sustainability is related to biodiversity, not to monocultures.
Blake looks at the issue by making some comparisons between the Napa area of California and the Bordeaux region of France, which also relies on cabernet as a principal grape. Let's look at some of the points made in the article.
First, cabernet sauvignon comprises nearly half of the vineyard area in Napa, which is certainly not a monoculture, but it does represent quite a degree of domination. How does this domination compare to Bordeaux? Actually, the comparison needs to be among all of the common grape varieties, because Bordeaux has more merlot than cabernet.
A number of mathematical measurements of what is called "diversity" have been developed in science, for making precisely this sort of comparison. The idea is to reduce the various grape-variety areas down to a single number that quantifies their diversity, from a monoculture of one grape variety at one extreme to equal amounts of each grape variety at the other extreme.
The one I will use here is called the Shannon Diversity Index (see Wikipedia); and I will apply it to the vineyard area of each of the seven most common grape varieties in both Napa and Bordeaux. The Index will be a number between 0 (for a monoculture) and the natural logarithm of 7 (for equal amounts of each grape variety). The data come from Blake Gray's article for Napa, and from the Wine Cellar Insider for Bordeaux.
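For anyone who wants to repeat the calculation, here is a minimal sketch of the Shannon index, H = -Σ p_i ln(p_i), where p_i is the proportion of the total vineyard area planted to variety i. The hectare figures in the sketch are invented placeholders, not the published Napa or Bordeaux data.

```python
# Sketch only: Shannon Diversity Index from vineyard areas (placeholder figures).
import math

def shannon_index(areas):
    """Returns 0 for a monoculture, ln(n) when all n areas are equal."""
    total = sum(areas)
    return -sum((a / total) * math.log(a / total) for a in areas if a > 0)

napa_areas = [18000, 7000, 4000, 3000, 2500, 2000, 1500]   # hypothetical hectares
print(round(shannon_index(napa_areas), 2))                 # the diversity index
print(round(math.log(7), 2))                               # maximum possible value: 1.95
```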
It turns out that Napa (1.51) actually has a slightly greater diversity of grape varieties than does Bordeaux (1.28). So, we certainly cannot yet claim that there is anything unusual about the domination of Napa by cabernet sauvignon. However, there seems to be no reason why Napa's domination won't eventually exceed that of Bordeaux, if the current trends continue.
[Aside: Blake Gray makes the erroneous claim that Europe has grapes that are "disallowed" in certain regions. No grapes are disallowed anywhere in Europe; and any type of wine can be made in any region. What is disallowed is the name that can be used for those wines — names must match the definition of those names. In Bordeaux, in addition to its top seven grape types, which are Merlot, Cabernet Sauvignon, Cabernet Franc, Semillon, Sauvignon Blanc, Malbec and Muscadelle, there are also notable areas of: Petit Verdot, Carménère, Sauvignon Gris, Colombard, Folle Blanche and Ugni Blanc.]
Moving on, Blake Gray points out that Napa winemakers charge the consumer more money for their cabernet wines than do the winemakers of Bordeaux. He quantifies this claim by looking at the most popular wine searches on Wine-Searcher, for the cabernet blends of both Napa and Bordeaux. Search popularity is a convenient way to compare the wine regions, especially as it turns out in practice that the most expensive wines are the most popular searches.
Blake does not show us a picture of the dollar comparison, but we can generate one of our own. As of 18 March 2017, c. 450 of the most popular 500 searches on Wine-Searcher for both Napa and Bordeaux involve wines dominated by one or more of the principal Bordeaux red-wine grapes (cabernet sauvignon, merlot, cabernet franc, malbec). Now, that's what I call domination!
These two lots of 450 wines are the ones that I have compared in the first graph. The two lines show a running average of the Wine-Searcher average bottle price (vertically) against the Wine-Searcher search popularity (horizontally). [The running average is based on nine-unit blocks.]
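For those curious about the smoothing, here is a sketch of the sort of running average used for the lines, assuming a simple moving average over blocks of nine wines ordered by search popularity; the prices in the sketch are randomly generated placeholders.

```python
# Sketch only: nine-wine moving average of bottle price along the popularity ranking.
import numpy as np

def running_average(values, window=9):
    """Simple moving average; 'valid' mode drops the incomplete window ends."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

rng = np.random.default_rng(1)
prices_by_rank = rng.lognormal(mean=4.5, sigma=0.8, size=450)  # placeholder prices
print(len(running_average(prices_by_rank)))                    # 442 smoothed points
```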
For the first 30 most-popular searches, the Bordeaux wines are more expensive than are the Napa wines, but after that the Napa wines are consistently 2–3 times more expensive than the Bordeaux wines. The exception is for the Le Pin wine from Pomerol, which is ranked 45th in search popularity but is the second most expensive Bordeaux wine. Furthermore, some rather expensive Napa wines have low search popularity, so that the Napa line on the graph goes up and down like a yo-yo.
Conclusion: a Napa cabernet will cost most of us a lot more money than will a Bordeaux wine, unless we go for the few most expensive Bordeaux wines.
It turns out that this pattern is independent of wine quality. Wine-Searcher also provides an average quality score for each wine, averaged across a number of wine critics. So, we can compare the Quality:Price Ratio for the two regions, as well.
I have done this in the second graph. Each point represents one of the 450 wines from each region, plotted with the average bottle price (vertically) and the average quality score (horizontally). Note the log price scale, which deals with the ridiculous prices of the wines from Screaming Eagle (Napa), Petrus (Pomerol) and Le Pin (Pomerol), at the top of the graph. Also shown are the exponential price models fitted to each dataset (see the post on The relationship of wine quality to price) — each appears as a straight line on the graph because of the log price scale.
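As a rough sketch of how such a model can be fitted: taking logarithms of the prices turns the exponential model price = a * exp(b * score) into a straight line, which can then be fitted by ordinary least squares. The scores and prices below are invented placeholders.

```python
# Sketch only: fit price = a * exp(b * score) by a linear fit of log(price) vs score.
import numpy as np

scores = np.array([88, 89, 90, 91, 92, 93, 94, 95, 96])
prices = np.array([30, 35, 45, 55, 75, 95, 130, 180, 260])  # placeholder US$ values

b, log_a = np.polyfit(scores, np.log(prices), 1)
print(f"price ~ {np.exp(log_a):.3g} * exp({b:.3f} * score)")
```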
The Bordeaux wines, on average, score 0.7 quality points less than do the Napa wines. However, the QPR line for Napa is consistently above the line for Bordeaux, indicating that the Bordeaux wines are generally cheaper for the same quality score. This cannot be a good thing for US wine drinkers, but is much better for the wine drinkers of France (where most of the Bordeaux wines are consumed).
Note that there is only one isolated dot below the main wines, which represents the only wine with an outstanding Quality:Price Ratio, relative to the other wines. There are, however, plenty of points above the main group, which represent poor value for money!
In other words, neither Napa nor Bordeaux has red wines that represent particularly good value for money; and it seems unlikely that this situation will change any time soon (see At what price, To Kalon?).
This odd QPR wine, incidentally, is Chateau Tour Saint-Christophe (Saint-Emilion Grand Cru), with an average score of 91 points and an average cost of US$26 — this seems to be commonly available in the USA. The property was recently renovated by Hong Kong-based entrepreneur Peter Kwok, so the wine may not remain cheap for much longer.
Monday, April 3, 2017
How large is between-critic variation in quality scores?
I have written before about the Poor correlation among critics' quality scores (see also Can non-experts distinguish anything about wine?). This topic refers to what is technically called inter-individual variation in the scorer, which you might call "between-taster variation" — the same wine tasted by different people, even on the same occasion, does not necessarily receive the same quality score, even when it comes from the same bottle.
This results from two things: (i) variation in personal assessment of the wine (the assessment of quality is the result of each taster’s previous experiences as well as their personal conceptions); and (ii) differences in how this assessment is expressed in terms of a score.
This is an important issue for anyone who reads the opinions of wine commentators. After all, if there is more disagreement than agreement, then we might ask ourselves what it is that we are expecting to get out of reading the critics in the first place. It is for this reason that we are often advised to find a commentator whose wine tastes match our own, and read that person's reviews only.
This issue of inter-individual variation has been studied in the professional literature; and, indeed, many authors have concluded that wine criticism is a somewhat fraudulent activity, given the large personal component in the scores. I have included a list of relevant published papers at the end of this post.
What I will do in this post is take a broader look at this topic than I did in my previous post, but still examine particular examples of scores from particular wines and wineries. All of the wines will be red, since it seems to be rather hard to find large datasets of white wines that have been evaluated by many people (the wines of Sauternes being the most obvious exception).
Bordeaux First Growth wines
I will start by looking at the "Grand vin" wines of the five First Growth wineries from the Left Bank of the Bordeaux region, in France: Château Haut-Brion, Château Lafite-Rothschild, Château Latour, Château Margaux, and Château Mouton-Rothschild. All five of these wines have vintages going back centuries, although most of the available quality scores cover only the period after 1900.
For each of the five wines, I have compiled as many publicly available scores as I can, using principally the information provided by Wine-Searcher, 90plus Wines and Cellar Tracker. For each wine, I then restricted the dataset to those post-1900 vintages with quality scores from at least two commentators; and then I pooled the five wines together. [Note: In the previous post I analyzed a single wine.] Finally, I separately converted all scores to use a 100-point scale.
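As a sketch of this data preparation, the snippet below assumes a long-format table with one row per wine/vintage/critic score; the file name is hypothetical, and the linear rescaling of 20-point scores to a 100-point scale is my assumption, since the conversion actually used is not specified here.

```python
# Sketch only: keep post-1900 vintages scored by at least two critics, then
# rescale any 20-point scores to a 100-point scale (an assumed linear mapping).
import pandas as pd

scores = pd.read_csv("first_growth_scores.csv")   # hypothetical columns: wine, vintage, critic, score

scores = scores[scores["vintage"] > 1900]
critics_per_wine = scores.groupby(["wine", "vintage"])["critic"].transform("nunique")
scores = scores[critics_per_wine >= 2].copy()

on_20_point_scale = scores["score"] <= 20
scores.loc[on_20_point_scale, "score"] = scores.loc[on_20_point_scale, "score"] * 5
```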
For the analyses presented here, I have divided the data into two subsets: (i) 11 commentators with scores for at least 15 vintages of each of the five wines, covering the period from 1945–2014, inclusive; and (ii) 11 commentators with scores for at least 14 vintages of each of the five wines, covering the period from 1988–2014, inclusive.
There are eight commentators who appear in both datasets (Falstaff Magazin, Jeff Leve, Robert Parker, Jean-Marc Quarin, Jancis Robinson, James Suckling, Stephen Tanzer, Wine Spectator), and six who appear in one but not the other (Michel Bettane and Thierry Desseauve, Jeannie Cho Lee, Richard Jennings, John Kapon, La Revue du Vin de France, Vinum Weinmagazin). There are many well-known sources of Bordeaux wine commentary for whom I could not find sufficient data, including Wine Enthusiast, Decanter, Vinous (Antonio Galloni), Gault & Millau, Wine & Spirits Magazine, and Tim Atkin. It is worth noting that most of the Wine Spectator's scores were actually from James Suckling, along with a few from Thomas Matthews, James Molesworth and Harvey Steiman (who have all reviewed the red wines of Bordeaux for that magazine), plus some that were unattributed.
So, we can now look at how similar are the quality scores of these commentators, when pooled across these five wines. [Aside: the picture does not change much if we consider each of the five wines separately.] Let's start with the first dataset, covering the period since the 1945 vintage. All of these commentators are from the USA except for Jancis Robinson (UK), Falstaff Magazin (Austria) and Jean-Marc Quarin (France).
As before, we can quantify the relationships among the scores using correlation analysis. This analysis reveals that the following percentages are held in common between the 11 commentators pairwise. In this table, values >60% are highlighted in green, those between 50 and 60% are in blue, and those <10% are in yellow.
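Here is a minimal sketch of how such a pairwise table can be produced, assuming that the "percentage held in common" is the squared Pearson correlation (r²) between each pair of commentators, computed over the vintages that both critics scored. The file and column names are illustrative only.

```python
# Sketch only: pairwise r^2 (as percentages) between commentators' scores.
import pandas as pd

scores = pd.read_csv("first_growth_scores.csv")   # hypothetical: wine, vintage, critic, score

# One column per commentator, one row per wine/vintage; NaN where no score was given.
wide = scores.pivot_table(index=["wine", "vintage"], columns="critic", values="score")

shared = (wide.corr() ** 2) * 100                 # pairwise squared correlations
print(shared.round(0))
```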
As far as wine quality is concerned, the average agreement among the commentators is less than 50% in almost all cases, and more than half of the values are in the range 10–40%, which is rather low. Certainly, the critics disagree with each other much more than they agree. The only commentators who appear to be in strong agreement with each other are Jeff Leve and Robert Parker. At the other extreme, neither Jancis Robinson nor Richard Jennings has much in common with the other commentators.
It might be more useful to look at a picture of these data, rather than a table of numbers. To do this, we can employ a network, as described in the post on Summarizing multi-dimensional wine data as graphs, Part 2: networks. This is shown in the next graph. [Technical note: the correlation scores were first converted to euclidean distances, and the NeighborNet graph was drawn using the SplitsTree program.]
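The exact conversion from correlations to distances is not stated; one common choice, shown here purely as an assumption, is d = sqrt(2 * (1 - r)), which is proportional to the Euclidean distance between standardized score vectors. The resulting matrix can then be saved in NEXUS format as input for SplitsTree's NeighborNet.

```python
# Sketch only: one possible correlation-to-distance conversion for NeighborNet input.
import numpy as np

def correlation_to_distance(r_matrix):
    """Map correlations (in [-1, 1]) to distances; identical scorers get distance 0."""
    return np.sqrt(2 * (1 - np.asarray(r_matrix, dtype=float)))

r_example = np.array([[1.0, 0.6, 0.1],
                      [0.6, 1.0, 0.3],
                      [0.1, 0.3, 1.0]])
print(correlation_to_distance(r_example).round(2))
```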
In this graph, the lengths of the lines represent the amount of information. The interconnected lines in the centre represent the shared information, with the terminal lines ("leaves") representing the unique information. In this case the longest lines are the terminals, indicating that there is little commonality among the quality scores.
The connections among the lines represent who is agreeing with whom. For example, Parker and Leve are closely associated in the network, as expected from the results shown in the table above (their association is indicated by the short distance separating them along the lines of the network). You can see that there is also some association between Tanzer and Cho Lee, between Suckling and Kapon, and between Robinson and Jennings (and also between Robinson and Kapon). The Spectator magazine and Jean-Marc Quarin appear to have some similarity to the scores of Robert Parker; but the relationships of Falstaff magazine are unclear. It is worth noting that the scores of Suckling and the Spectator are not closely associated, in spite of the fact that most of them come from the same person (see Are the quality scores from repeat tastings correlated?).
We can now do the same two analyses for the second dataset, covering the period since the 1988 vintage. The correlation analysis reveals that the following percentages are held in common between the 11 commentators pairwise.
A similar pattern emerges, although the average values are slightly larger for this restricted dataset. The average agreement among the commentators is still less than 50% in most cases, and more than half of the values are in the range 10–40%. Thus, the critics disagree with each other much more than they agree. However, in this dataset there are now several pairs of critics who share more than 50% agreement. Jancis Robinson is once again involved in the values that are less than 10%.
The network picture of the same data is shown in the next graph. Several of the previous associations are not present, because three of the commentators are not in this dataset (Cho Lee, Jennings, Kapon).
Suckling, Falstaff and Spectator are closely associated, as expected from the results in the table, as are Parker and Leve. More interestingly, we can now evaluate three new commentators, all from France. Indeed, the four French commentators are closely associated in the graph (Revue de France, Bettane et Desseauve, Quarin, and Vinum Weinmagazin [from Switzerland]). Of these, only the Revue de France scores seem to be associated with any of the non-French commentators, having some connection to those of Robert Parker.
Thus, there is little commonality among the scores of different commentators, and this is especially true for Jancis Robinson. Furthermore, the four French-language commentators do seem to form a separate group from the others. Perhaps it is relevant that these commentators, together with Robinson, are the only five in the dataset who use a 20-point quality scale rather than a 100-point scale.
Hill of Grace (Australia)
As an addendum, we can take a quick look at another Australian wine that has a long record of critics' scores, following the Penfolds Grange used in the previous post.
Unlike Grange, the Henschke Hill of Grace is a single vineyard wine, made from c. 7 ha of shiraz. The oldest vines were planted in the 1860s; and in a good year there are about 2,000 cases of wine produced. The vintages date back to 1958, with four vintages when the wine was not made (leaving 50 released vintages for analysis). There are five commentators who have provided quality scores for almost all of these vintages, and two more who have covered at least one third of them.
As above, we can quantify the relationships among the scores using correlation analysis. This analysis reveals that the following percentages are held in common between the seven commentators pairwise. As above, values >60% are highlighted in green, those between 50 and 60% are in blue, and those <10% are in yellow.
As you can see, the correlations are extremely poor, except for those involving Huon Hooke, which are not quite so bad. None of the values exceeds 50%, and most are actually <10%. This means that there is very little agreement among the commentators in their quality scores.
This is the most extreme example of disagreement in quality scores that I have encountered.
Conclusion
The answer to the question posed in the title is: "a lot".
This broader analysis (six wines) confirms the results from the previous blog post (Poor correlation among critics' quality scores). The idea that wine commentators have some sort of consensus opinion with regard to wine quality is completely untenable, for all of the wines checked so far. In general, the agreement varies from 0–50%, so that the critics disagree more than they agree.
However, there are patterns of association among the commentators, so that their quality scores are not completely random. Unfortunately, this seems to be a relatively minor component of the data patterns. Nevertheless, the four French commentators do seem to have opinions about the French wines that differ from those of the other commentators.
Research Literature
Johan Almenberg, Anna Dreber (2009) When does the price affect the taste? Results from a wine experiment. American Association of Wine Economists Working Paper No. 35.
Orley Ashenfelter, Richard Quandt (1999) Analyzing a wine tasting statistically. Chance 12(3):16-20.
Robert H. Ashton (2011) Improving experts’ wine quality judgments: two heads are better than one. Journal of Wine Economics 6:160-178.
Robert H. Ashton (2012) Reliability and consensus of experienced wine judges: expertise within and between? Journal of Wine Economics 7:70-87.
Robert H. Ashton (2013) Is there consensus among wine quality ratings of prominent critics? An empirical analysis of red Bordeaux, 2004-2010. Journal of Wine Economics 8:225-234.
George A. Baker, Maynard A. Amerine (1953) Organoleptic ratings of wines estimated from analytical data. Food Research 18:381-389.
Jeffrey C. Bodington (2012) 804 tastes: evidence on preferences, randomness, and value from double-blind wine tastings. Journal of Wine Economics 7:181-191.
Jeffrey C. Bodington (2015) Evaluating wine-tasting results and randomness with a mixture of rank preference models. Journal of Wine Economics 10:31-46.
Jeffrey C. Bodington (2015) Testing a mixture of rank preference models on judges’ scores in Paris and Princeton. Journal of Wine Economics 10:173-189.
Chris J. Brien, P. May, Oliver Mayo (1987) Analysis of judge performance in wine-quality evaluations. Journal of Food Science 52:1273-1279.
Jing Cao (2014) Quantifying randomness versus consensus in wine quality ratings. Journal of Wine Economics 9:202-213.
Jing Cao, Lynne Stokes (2010) Evaluation of wine judge performance through three characteristics: bias, discrimination, and variation. Journal of Wine Economics 5:132-142.
Jean-Marie Cardebat, Emmanuel Paroissien (2015) Reducing quality uncertainty for Bordeaux en primeur wines: a uniform wine score. American Association of Wine Economists Working Paper No. 180.
Jean-Marie Cardebat, Jean-Marc Figuet, Emmanuel Paroissien (2014) Expert opinion and Bordeaux wine prices: an attempt to correct biases in subjective judgments. Journal of Wine Economics 9:282-303.
Domenic V. Cicchetti (2004) Who won the 1976 blind tasting of French Bordeaux and US Cabernets? Parametrics to the rescue. Journal of Wine Research 15:211-220.
Domenic V. Cicchetti (2006) The Paris 1976 Wine Tasting revisited once more: comparing ratings of consistent and inconsistent tasters. Journal of Wine Economics 1:125-140.
Domenic Cicchetti, Arnold Cicchetti (2008) The balancing act in consistent wine tasting and wine appreciation: Part II: Consistency in wine tasting and appreciation: an empirical-objective perspective. Journal of Wine Research 19:185-191.
Domenic V. Cicchetti, Arnie F. Cicchetti (2013) As wine experts disagree, consumers’ taste buds flourish: how two experts rate the 2004 Bordeaux vintage. Journal of Wine Research 24:311-317.
Dom Cicchetti, Arnie Cicchetti (2014) Two enological titans rate the 2009 Bordeaux wines. Wine Economics and Policy 3:28-36.
Margaret A. Cliff, Marjorie C. King (1996) A proposed approach for evaluating expert wine judge performance using descriptive statistics. Journal of Wine Research 7:83-90.
Margaret A. Cliff, Marjorie C. King (1997) The evaluation of judges at wine competitions: the application of eggshell plots. Journal of Wine Research 8:75-80.
Margaret A. Cliff, Marjorie C. King (1999) Use of principal component analysis for the evaluation of judge performance at wine competitions. Journal of Wine Research 10:25-32.
Margaret A. Cliff, Mike O’Mahony, Lana Fukumoto, Marjorie C. King (2000) Development of a ‘bipolar’ R-index. Journal of Sensory Studies 15:219-229.
Victor Ginsburgh, Israël Zang (2012) Shapley ranking of wines. Journal of Wine Economics 7:169-180.
Neal D. Hulkower (2009) The Judgment of Paris according to Borda. Journal of Wine Research 20:171-182.
Dennis V. Lindley (2006) Analysis of a wine tasting. Journal of Wine Economics 1:33-41.
Jonas De Maere (2014) Do expert tasters evaluate wines consistently? A statistical analysis and a proposal for improvement. Weinakademiker thesis, Weinakademie Österreich.
Philippe Masset, Jean-Philippe Weisskopf, Mathieu Cossutta (2015) Wine tasters, ratings, and en primeur prices. Journal of Wine Economics 10:75-107.
Ingram Olkin, Ying Lou, Lynne Stokes, Jing Cao (2015) Analyses of wine-tasting data: a tutorial. Journal of Wine Economics 10:4-30.
Wendy V. Parr, James A. Green, K. Geoffrey White (2006) Wine judging, context and New Zealand sauvignon blanc. Revue Européenne de Psychologie Appliquée 56:231-238.
Anthony Pecotich, Steven Ward (2010) Taste testing of wine by expert and novice consumers in the presence of variations in quality, brand and country of origin cues. American Association of Wine Economists Working Paper No. 66.
Richard E. Quandt (2006) Measurement and inference in wine tasting. Journal of Wine Economics 1:7-30.
Richard E. Quandt (2012) Comments on the Judgment of Princeton. Journal of Wine Economics 7:152-154.
Christine H. Scaman, J. Dou (2001) Evaluation of wine competition judge performance using principal component similarity analysis. Journal of Sensory Studies 16:287-300.
Eric T. Stuen, Jon R. Miller, Robert W. Stone (2015) An analysis of wine critic consensus: a study of Washington and California wines. Journal of Wine Economics 10:47-61.
Daniel L. Ward (2012) A graphical and statistical analysis of the Judgment of Princeton wine tasting. Journal of Wine Economics 7:155-168.
Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.
Roman L. Weil (2005) Analysis of reserve and regular bottlings: why pay for a difference only critics claim to notice? Chance 18(3):9-15.
Roman L. Weil (2007) Debunking critics' wine words: can amateurs distinguish the smell of asphalt from the taste of cherries? Journal of Wine Economics 2:136-144.
This results from two things: (i) variation in personal assessment of the wine (the assessment of quality is the result of each taster’s previous experiences as well as their personal conceptions); and (ii) differences in how this assessment is expressed in terms of a score.
This is an important issue for anyone who reads the opinions of wine commentators. After all, if there is more disagreement than agreement, then we might ask ourselves what it is that we are expecting to get out of reading the critics in the first place. It is for this reason that we are often advised to find a commentator whose wine tastes match our own, and read that person's reviews only.
This issue of inter-individual variation has been studied in the professional literature; and, indeed, many authors have concluded that wine criticism is a somewhat fraudulent activity, given the large personal component in the scores. I have included a list of relevant published papers at the end of this post.
What I will do in this post is taker a broader look at this topic than I did in my previous post, but still examine particular examples of scores from particular wines and wineries. All of the wines will be red, since it seems to be rather hard to find large datasets of white wines that have been evaluated by many people (the wines of Sauternes are the most obvious ones).
Bordeaux First Growth wines
I will start by looking at the "Grand vin" wines of the five First Growth wineries from the Left Bank of the Bordeaux region, in France: Château Haut-Brion, Château Lafite-Rothschild, Château Latour, Château Margaux, and Château Mouton-Rothschild. All five of these wines have vintages going back centuries, although most of the available quality scores cover only the period after 1900.
For each of the five wines, I have compiled as many publicly available scores as I can, using principally the information provided by Wine-Searcher, 90plus Wines and Cellar Tracker. For each wine, I then restricted the dataset to those post-1900 vintages with quality scores from at least two commentators; and then I pooled the five wines together. [Note: In the previous post I analyzed a single wine.] Finally, I separately converted all scores to use a 100-point scale.
For the analyses presented here, I have divided the data into two subsets: (i) 11 commentators with scores for at least 15 vintages of each of the five wines, covering the period 1945–2014, inclusive; and (ii) 11 commentators with scores for at least 14 vintages of each of the five wines, covering the period 1988–2014, inclusive.
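To make these data-handling steps concrete, here is a minimal sketch in Python (pandas). The input file name, its column names, and the simple linear rescaling of 20-point scores are all my own illustrative assumptions, not details taken from the post.

```python
import pandas as pd

# Hypothetical long-format input: one row per (wine, vintage, commentator) score.
# The file name and column names are illustrative assumptions.
scores = pd.read_csv("first_growth_scores.csv")
# expected columns: wine, vintage, commentator, score, scale (20 or 100)

# Convert 20-point scores to the 100-point scale. A simple linear rescaling is
# assumed here; the post does not specify the conversion actually used.
is_20 = scores["scale"] == 20
scores.loc[is_20, "score"] = scores.loc[is_20, "score"] * 5

# Keep only post-1900 vintages that were scored by at least two commentators.
scores = scores[scores["vintage"] > 1900]
n_critics = scores.groupby(["wine", "vintage"])["commentator"].transform("nunique")
scores = scores[n_critics >= 2]

# Pool the five wines: one row per (wine, vintage), one column per commentator.
pooled = scores.pivot_table(index=["wine", "vintage"],
                            columns="commentator", values="score")

# Subset (ii): commentators with scores for at least 14 vintages of each wine,
# restricted to the 1988-2014 vintages. Subset (i) is built analogously,
# using the 1945-2014 vintages and a threshold of 15 vintages per wine.
recent = pooled.loc[(slice(None), slice(1988, 2014)), :]
per_wine = recent.notna().groupby(level="wine").sum()
keep = per_wine.columns[(per_wine >= 14).all()]
subset = recent[keep]
```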
There are eight commentators who appear in both datasets (Falstaff Magazin, Jeff Leve, Robert Parker, Jean-Marc Quarin, Jancis Robinson, James Suckling, Stephen Tanzer, Wine Spectator), and six who appear in one but not the other (Michel Bettane and Thierry Desseauve, Jeannie Cho Lee, Richard Jennings, John Kapon, La Revue du Vin de France, Vinum Weinmagazin). There are many well-known sources of Bordeaux wine commentary for whom I could not find sufficient data, including Wine Enthusiast, Decanter, Vinous (Antonio Galloni), Gault & Millau, Wine & Spirits Magazine, and Tim Atkin. It is worth noting that most of the Wine Spectator's scores were actually from James Suckling, along with a few from Thomas Matthews, James Molesworth and Harvey Steiman (who have all reviewed the red wines of Bordeaux for that magazine), plus some that were unattributed.
So, we can now look at how similar the quality scores of these commentators are, when pooled across these five wines. [Aside: the picture does not change much if we consider each of the five wines separately.] Let's start with the first dataset, covering the period since the 1945 vintage. All of these commentators are from the USA except for Jancis Robinson (UK), Falstaff Magazin (Austria) and Jean-Marc Quarin (France).
As before, we can quantify the relationships among the scores using correlation analysis. This analysis reveals that the following percentages are held in common between the 11 commentators pairwise. In this table, values >60% are highlighted in green, those between 50 and 60% are in blue, and those <10% are in yellow.
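As a rough illustration of how such a table can be produced, here is a minimal sketch that continues from the hypothetical `subset` data frame above (either of the two subsets is handled the same way). I am assuming that the "percentage held in common" is the squared Pearson correlation, i.e. the shared variance, expressed as a percentage; the post does not state the exact calculation.

```python
# Pairwise Pearson correlations between commentators, computed over the
# vintages that each pair has scored in common (pandas handles the missing
# values pairwise).
r = subset.corr(method="pearson")

# Assumed "percentage held in common": squared correlation (shared variance),
# expressed as a percentage.
shared = (r ** 2 * 100).round()
print(shared)
```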
As far as wine quality is concerned, the pairwise agreement between the commentators is less than 50% in almost all cases, and more than half of the values are in the range 10–40%, which is rather low. Certainly, the critics disagree with each other much more than they agree. The only commentators who appear to be in strong agreement with each other are Jeff Leve and Robert Parker. At the other extreme, neither Jancis Robinson nor Richard Jennings has much in common with the other commentators.
It might be more useful to look at a picture of these data, rather than a table of numbers. To do this, we can employ a network, as described in the post on Summarizing multi-dimensional wine data as graphs, Part 2: networks. This is shown in the next graph. [Technical note: the correlation scores were first converted to Euclidean distances, and the NeighborNet graph was drawn using the SplitsTree program.]
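The technical note can be sketched as follows, continuing from the hypothetical correlation matrix `r` above. The exact correlation-to-distance conversion is not stated in the post, so the common transformation d = sqrt(2(1 − r)) is assumed here; the distances are then written to a NEXUS file, which the SplitsTree program can read in order to construct the NeighborNet.

```python
import numpy as np

# Convert the correlation matrix to distances. d = sqrt(2 * (1 - r)) is one
# standard Euclidean-style transformation; this is an assumption, not
# necessarily the exact conversion used in the post.
d = np.sqrt(2.0 * (1.0 - r))

# Write a NEXUS distance matrix that SplitsTree can read; the NeighborNet
# itself is then computed within SplitsTree.
labels = [str(c).replace(" ", "_") for c in d.columns]
with open("critics.nex", "w") as f:
    f.write("#NEXUS\n\nBEGIN TAXA;\n")
    f.write(f"  DIMENSIONS NTAX={len(labels)};\n")
    f.write("  TAXLABELS " + " ".join(labels) + ";\nEND;\n\n")
    f.write("BEGIN DISTANCES;\n")
    f.write(f"  DIMENSIONS NTAX={len(labels)};\n")
    f.write("  FORMAT TRIANGLE=BOTH DIAGONAL LABELS;\n  MATRIX\n")
    for name, row in zip(labels, d.values):
        f.write("    " + name + " " + " ".join(f"{x:.4f}" for x in row) + "\n")
    f.write("  ;\nEND;\n")
```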
In this graph, the lengths of the lines represent the amount of information. The interconnected lines in the centre represent the shared information, with the terminal lines ("leaves") representing the unique information. In this case the longest lines are the terminals, indicating that there is little commonality among the quality scores.
The connections among the lines represent who is agreeing with whom. For example, Parker and Leve are closely associated in the network, as expected from the results shown in the table above (their association is indicated by the short distance separating them along the lines of the network). You can see that there is also some association between Tanzer and Cho Lee, between Suckling and Kapon, and between Robinson and Jennings (and also between Robinson and Kapon). The Spectator magazine and Jean-Marc Quarin appear to have some similarity to the scores of Robert Parker; but the relationships of Falstaff magazine are unclear. It is worth noting that the scores of Suckling and the Spectator are not closely associated, in spite of the fact that most of them come from the same person (see Are the quality scores from repeat tastings correlated?).
We can now do the same two analyses for the second dataset, covering the period since the 1988 vintage. The correlation analysis reveals that the following percentages are held in common between the 11 commentators pairwise.
A similar pattern emerges, although the values are generally slightly larger for this restricted dataset. The pairwise agreement between the commentators is still less than 50% in most cases, and more than half of the values are in the range 10–40%. Thus, the critics disagree with each other much more than they agree. However, in this dataset there are now several pairs of critics who share more than 50% agreement. Jancis Robinson is once again involved in the values that are less than 10%.
The network picture of the same data is shown in the next graph. Several of the previous associations are not present, because three of the commentators are not in this dataset (Cho Lee, Jennings, Kapon).
Suckling, Falstaff and the Spectator are closely associated, as expected from the results in the table, as are Parker and Leve. More interestingly, we can now evaluate three new commentators: two from France (La Revue du Vin de France, and Bettane et Desseauve), plus Vinum Weinmagazin from Switzerland. Indeed, these three, together with Jean-Marc Quarin, form a closely associated group in the graph. Of these four, only the scores of the Revue du Vin de France seem to be associated with those of any of the other commentators, having some connection to those of Robert Parker.
Thus, there is little commonality among the scores of different commentators, and this is especially true for Jancis Robinson. Furthermore, the four commentators just discussed (Revue du Vin de France, Bettane et Desseauve, Quarin and Vinum) do seem to form a separate group from the others. Perhaps it is relevant that these four, along with Jancis Robinson, are the only ones in the dataset who use a 20-point quality scale rather than a 100-point scale.
Hill of Grace (Australia)
As an addendum, we can take a quick look at a second Australian wine with a long record of critics' scores, to complement the Penfolds Grange used in the previous post.
Unlike Grange, the Henschke Hill of Grace is a single-vineyard wine, made from c. 7 ha of shiraz. The oldest vines were planted in the 1860s; and in a good year about 2,000 cases of wine are produced. The vintages date back to 1958, with four vintages in which the wine was not made (leaving 50 released vintages for analysis). There are five commentators who have provided quality scores for almost all of these vintages, and two more who have covered at least one third of them.
As above, we can quantify the relationships among the scores using correlation analysis. This analysis reveals that the following percentages are held in common between the seven commentators pairwise. As above, values >60% are highlighted in green, those between 50 and 60% are in blue, and those <10% are in yellow.
As you can see, the correlations are extremely poor, except for those involving Huon Hooke, which are not quite so bad. None of the values exceeds 50%, and most are actually <10%. This means that there is very little agreement among the commentators in their quality scores.
This is the most extreme example of disagreement in quality scores that I have encountered.
Conclusion
The answer to the question posed in the title is: "a lot".
This broader analysis (six wines) confirms the results from the previous blog post (Poor correlation among critics' quality scores). The idea that wine commentators have some sort of consensus opinion with regard to wine quality is completely untenable, for all of the wines checked so far. In general, the agreement varies from 0% to 50%, so that the critics disagree more than they agree.
However, there are patterns of association among the commentators, so their quality scores are not completely random. Unfortunately, this seems to be a relatively minor component of the data patterns. Nevertheless, the four French and Swiss commentators do seem to have opinions about the Bordeaux wines that differ from those of the other commentators.
Research Literature
Johan Almenberg, Anna Dreber (2009) When does the price affect the taste? Results from a wine experiment. American Association of Wine Economists Working Paper No. 35.
Orley Ashenfelter, Richard Quandt (1999) Analyzing a wine tasting statistically. Chance 12(3):16-20.
Robert H. Ashton (2011) Improving experts’ wine quality judgments: two heads are better than one. Journal of Wine Economics 6:160-178.
Robert H. Ashton (2012) Reliability and consensus of experienced wine judges: expertise within and between? Journal of Wine Economics 7:70-87.
Robert H. Ashton (2013) Is there consensus among wine quality ratings of prominent critics? An empirical analysis of red Bordeaux, 2004-2010. Journal of Wine Economics 8:225-234.
George A. Baker, Maynard A. Amerine (1953) Organoleptic ratings of wines estimated from analytical data. Food Research 18:381-389.
Jeffrey C. Bodington (2012) 804 tastes: evidence on preferences, randomness, and value from double-blind wine tastings. Journal of Wine Economics 7:181-191.
Jeffrey C. Bodington (2015) Evaluating wine-tasting results and randomness with a mixture of rank preference models. Journal of Wine Economics 10:31-46.
Jeffrey C. Bodington (2015) Testing a mixture of rank preference models on judges’ scores in Paris and Princeton. Journal of Wine Economics 10:173-189.
Chris J. Brien, P. May, Oliver Mayo (1987) Analysis of judge performance in wine-quality evaluations. Journal of Food Science 52:1273-1279.
Jing Cao (2014) Quantifying randomness versus consensus in wine quality ratings. Journal of Wine Economics 9:202-213.
Jing Cao, Lynne Stokes (2010) Evaluation of wine judge performance through three characteristics: bias, discrimination, and variation. Journal of Wine Economics 5:132-142.
Jean-Marie Cardebat, Emmanuel Paroissien (2015) Reducing quality uncertainty for Bordeaux en primeur wines: a uniform wine score. American Association of Wine Economists Working Paper No. 180.
Jean-Marie Cardebat, Jean-Marc Figuet, Emmanuel Paroissien (2014) Expert opinion and Bordeaux wine prices: an attempt to correct biases in subjective judgments. Journal of Wine Economics 9:282-303.
Domenic V. Cicchetti (2004) Who won the 1976 blind tasting of French Bordeaux and US Cabernets? Parametrics to the rescue. Journal of Wine Research 15:211-220.
Domenic V. Cicchetti (2006) The Paris 1976 Wine Tasting revisited once more: comparing ratings of consistent and inconsistent tasters. Journal of Wine Economics 1:125-140.
Domenic Cicchetti, Arnold Cicchetti (2008) The balancing act in consistent wine tasting and wine appreciation: Part II: Consistency in wine tasting and appreciation: an empirical-objective perspective. Journal of Wine Research 19:185-191.
Domenic V. Cicchetti, Arnie F. Cicchetti (2013) As wine experts disagree, consumers’ taste buds flourish: how two experts rate the 2004 Bordeaux vintage. Journal of Wine Research 24:311-317.
Dom Cicchetti, Arnie Cicchetti (2014) Two enological titans rate the 2009 Bordeaux wines. Wine Economics and Policy 3:28-36.
Margaret A. Cliff, Marjorie C. King (1996) A proposed approach for evaluating expert wine judge performance using descriptive statistics. Journal of Wine Research 7:83-90.
Margaret A. Cliff, Marjorie C. King (1997) The evaluation of judges at wine competitions: the application of eggshell plots. Journal of Wine Research 8:75-80.
Margaret A. Cliff, Marjorie C. King (1999) Use of principal component analysis for the evaluation of judge performance at wine competitions. Journal of Wine Research 10:25-32.
Margaret A. Cliff, Mike O’Mahony, Lana Fukumoto, Marjorie C. King (2000) Development of a ‘bipolar’ R-index. Journal of Sensory Studies 15:219-229.
Victor Ginsburgh, Israël Zang (2012) Shapley ranking of wines. Journal of Wine Economics 7:169-180.
Neal D. Hulkower (2009) The Judgment of Paris according to Borda. Journal of Wine Research 20:171-182.
Dennis V. Lindley (2006) Analysis of a wine tasting. Journal of Wine Economics 1:33-41.
Jonas De Maere (2014) Do expert tasters evaluate wines consistently? A statistical analysis and a proposal for improvement. Weinakademiker thesis, Weinakademie Österreich.
Philippe Masset, Jean-Philippe Weisskopf, Mathieu Cossutta (2015) Wine tasters, ratings, and en primeur prices. Journal of Wine Economics 10:75-107.
Ingram Olkin, Ying Lou, Lynne Stokes, Jing Cao (2015) Analyses of wine-tasting data: a tutorial. Journal of Wine Economics 10:4-30.
Wendy V. Parr, James A. Green, K. Geoffrey White (2006) Wine judging, context and New Zealand sauvignon blanc. Revue Européenne de Psychologie Appliquée 56:231-238.
Anthony Pecotich, Steven Ward (2010) Taste testing of wine by expert and novice consumers in the presence of variations in quality, brand and country of origin cues. American Association of Wine Economists Working Paper No. 66.
Richard E. Quandt (2006) Measurement and inference in wine tasting. Journal of Wine Economics 1:7-30.
Richard E. Quandt (2012) Comments on the Judgment of Princeton. Journal of Wine Economics 7:152-154.
Christine H. Scaman, J. Dou (2001) Evaluation of wine competition judge performance using principal component similarity analysis. Journal of Sensory Studies 16:287-300.
Eric T. Stuen, Jon R. Miller, Robert W. Stone (2015) An analysis of wine critic consensus: a study of Washington and California wines. Journal of Wine Economics 10:47-61.
Daniel L. Ward (2012) A graphical and statistical analysis of the Judgment of Princeton wine tasting. Journal of Wine Economics 7:155-168.
Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.
Roman L. Weil (2005) Analysis of reserve and regular bottlings: why pay for a difference only critics claim to notice? Chance 18(3):9-15.
Roman L. Weil (2007) Debunking critics' wine words: can amateurs distinguish the smell of asphalt from the taste of cherries? Journal of Wine Economics 2:136-144.