Monday, December 25, 2017

Medical practitioners and malt whisky

Greetings of the season.

For Christmas this year I thought that I might follow up on two previous posts that have proved to be quite popular on my other blog, which are about the detectable qualities of Scotch whiskies:


Medical practitioners have been known to partake of these tipples; and along the way some of them have pondered whether it is possible for people to reliably distinguish among the various whiskies, in even the most basic way. For example, two medical groups have done some experiments, and published them in Christmas issues of the British Medical Journal:
  • Stephen J Chadwick, Hugh A Dudley (1983) Can malt whisky be discriminated from blended whisky? The proof. A modification of Ronald Fisher's hypothetical tea tasting experiment. British Medical Journal 287:1912-1913.
  • EJ Moran Campbell, Diana ME Campbell, Robin S Roberts (1994) Ability to distinguish whisky (uisge beatha) from brandy (cognac). British Medical Journal 309:1686-1688.
For those of you who think you know your whiskies, it turns out to be a lot harder to discriminate them than you think.

Here is the abstract of the first paper:
A modified version of Fisher's tea tasting experiment was performed to test the confident assertions of some members of an academic surgical unit that they could easily distinguish malt from blended whisky. Eight male volunteers from the unit, divided into regular and inexperienced whisky drinkers, were blindfolded and given a glass of each of six whiskies. The whiskies included three malts and three blends, and each subject tasted each whisky six times. They were asked whether the whisky was malt or blended, whether they could identify the distillery, and whether they liked it (ranked on a nine-point scale). Statistical analysis of the data suggested that within the [surgical] unit malt whisky could not be distinguished from blended whisky, and that experience did not alter powers of discrimination. These results suggest that, although "uisgebeatha" has unique properties, the inexpert drinker should choose his whisky to suit his taste and pocket and not his self image.
Here is the abstract of the second paper:
Objective: To assess the ability to distinguish between first rate malt whisky and brandy and between different brands of each. Design: Crossover with two sessions of 12 blindfold tastings of two whiskies and two brandies before and after supper, repeated not more than seven days later. Participants: four volunteers aged 50-68 years, all moderate drinkers of alcohol and members of a wine club. Results: Only one participant produced irrefutable statistical evidence of being able to distinguish between whisky and brandy, correctly identifying 50/51 (98%) samples. The participant who was best able to distinguish between whisky and brandy was also best able to identify correctly the brand of whisky (100%). Conclusion: The results show that some participants could distinguish neither between malt whisky and brandy nor between different brands of whisky and brandy. However, the success of one participant [a Scotsman] shows that "it can be done", and that his whisky specific ability is acquired not innate.
These experiments received comments from some of their medical colleagues, for those of you who might like to read them:
  • James Howie (1983) Good motivation but indifferent methods. British Medical Journal 287:1913-1914.
  • Douglas G Altman (1983) How blind were the volunteers? British Medical Journal 287:1914-1915.
  • Stephen J Chadwick, Hugh A Dudley (1983) In defense of the whisky drinker on the Clapham omnibus. British Medical Journal 287:1915.
  • Ken MacRae (1994) A spirited attempt. British Medical Journal 309:1688.
Of these, perhaps the most pertinent comment comes from both Howie and MacRae, who point out that some of the drinks chosen were rather similar. Obviously, this point alone obviates the need for an experiment at all: if it is known beforehand that whiskies are similar to each other (as shown in the two blog posts linked above), then why do we need an experiment to show it? Except for the fun of doing the tasting, of course!

Finally, Altman notes that "last Christmas I helped to perform a small experiment that demonstrated that white wine and red wine cannot always be distinguished (unpublishable results)." Christmas can have that effect on you.

Monday, December 18, 2017

Sample sizes, and the supposed wine differences between women and men

We have seen a number of web discussions this year about the ostensible differences between males and females when it comes to sensory perception, particularly the tasting of wine. For example:
More recently, a paper appeared in the Journal of Wine Economics that has been taken to shed some light on this issue (for example, see Men and women don't actually have different taste in wine ; Do men and women taste differently?). It apparently opposes the notion that such differences are of much practical importance:
The author compiled data from 23 different wine tastings (conducted by other people) in which the wine-quality scores could be subdivided into those from male and female tasters. He then proceeded to apply various statistical tests to the data, to assess whether there were differences between women and men in the wine-tasting results.


Leaving aside the physiology of wine tasting for the moment, there is one thing that immediately struck me when I read the paper: the incredibly small sample sizes used in the data analyses. This is a topic that I have commented on before, when I pointed out that there are issues with experiments that can far outweigh the effects of sample size, notably biased sampling (Why do people get hung up about sample size?). However, in this current case, the sample sizes seem quite unbelievable: there were never more than 20 people per gender at each of the 23 different tastings, and often far fewer.

Now, it might seem intuitively obvious that such small sizes are unlikely to lead us anywhere useful, but we can do better than merely express an opinion about this. It seems worthwhile for me to try to quantitatively assess the situation.

Sample sizes and experiments

First, however, we need to be clear about the consequences of sample sizes for experiments. Indeed, the paper's author himself directs attention to the issue, by mentioning "Type II error" in his title. This esoteric expression is the statistical term for what scientists call "false negatives", the failure to find something when it really is there to be found. Conversely, "Type I errors" are "false positives", the finding of something that is either illusory or trivially unimportant.

Using an analogy, if I am looking for a needle in a haystack then a false negative means that I fail to find it when there really is a needle in there — I have made a Type II error. If I find something in the haystack that I interpret as being a needle when it is not, then that is a false positive — I have made a Type I error.

Small sample sizes are prone to false negatives, whereas gratuitously large sample sizes are prone to false positives. Neither situation can be considered to be a good thing for an experiment.

Needless to say, statisticians have had a good look at these issues, and they have developed the notions of both statistical Power and statistical Sensitivity. Mathematically, Power is the complement of the Type II error rate, and thus expresses the theoretical probability that a statistical test will correctly reject the null hypothesis being tested. That is, Power tells us how probable it is that the statistical analysis will find something if there really is something to find. Sensitivity is a related concept, referring to the empirical ability of an experiment to correctly reject the null hypothesis. Both concepts can be expressed mathematically.

A look at the paper

Returning to the paper in question, two experimental hypotheses were tested, using different subsets of the 23 tastings:
    H1: Women’s and men’s scores have different means and standard deviations
    H2: Women and men have differently shaped distributions of scores
Various statistical tests were applied to test each hypothesis. I won't go into the technical details of what this all means, but will instead jump straight to the results presented in Tables 1, 2 and 3 of the paper.

Sufficient details are presented for me to perform a Power analysis of H1 and a Sensitivity analysis of both H1 and H2. (The test statistics are not presented for H2, only the Type I errors, and so the Power analysis cannot be performed.) My calculations were performed using the G*Power v.3.1.9.2 program; and the outcomes of my analyses are shown in the table below. Each row of the table represents one wine tasting, as listed in the first column. The second column shows the sample sizes, taken from the paper, while the final two columns are the results of my new calculations.

Power and Sensitivity analyses of wine tasting sample sizes

Formally, my Power analyses assess the post hoc achieved power of the statistical t-test at p=0.05 Type I error. More practically for you, all you need to know is that a Power of 80% is conventionally considered to be the minimum acceptable level for an experiment, and this represents an allowable probability of 20% for false negatives. For preference, a good experiment would require a Power of 95%, which is a 5% probability for false negatives.

As you can see in the table, the Power of Bodington's analyses is nowhere near these levels, as it never exceeds 11%. Indeed, his probabilities of false negatives are in the range of roughly 90-95%, meaning that he is coming close to certainty that he will accept the null hypothesis for each statistical test, and thus conclude that he found no difference between men and women, irrespective of whether there actually are such differences or not.
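To give a feel for why the achieved Power is so low, here is a minimal sketch of a post hoc power calculation. It is not the G*Power computation: it uses a normal approximation rather than the noncentral t distribution, it assumes equal group sizes, and the "small" effect size of 0.2 is an illustrative assumption, not a value taken from the paper.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(effect_size, n_per_group, alpha=0.05):
    """Approximate two-sided power for comparing two group means.

    Normal approximation with equal group sizes (an assumption of this
    sketch; G*Power itself uses the noncentral t distribution).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected size of the test statistic if the effect is real
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical group sizes, with an assumed effect size of 0.2
for n in (5, 10, 20):
    print(n, round(achieved_power(0.2, n), 3))
```

With group sizes of 20 or fewer, and an effect of this assumed size, the approximate power stays close to the 5% false-positive rate, which is the same order of magnitude as the values in the table.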

Formally, my Sensitivity analyses quantify what is called the required Effect size. This is a quantitative measure of the "strength" of the phenomenon that the experiment is seeking to find. Large Effect sizes mean that the phenomenon will be easy to detect statistically, while small Effect sizes will be hard to find. Using my earlier analogy, if I am looking for a sewing needle in my haystack, then that is a small Effect size, whereas looking for a knitting needle would be a large Effect size.

Small Effect sizes require large sample sizes, while large Effect sizes can be detected even with small sample sizes. I specified p=0.05 & power=0.80 for my calculations, which relate to the acceptable levels of false positives and negatives, respectively.

Using the standard rule of thumb (Shlomo S. Sawilowsky. 2009. New effect size rules of thumb. Journal of Modern Applied Statistical Methods 8: 597-599), Effect sizes are grouped from very small (0.1) to huge (2.0). Effect sizes larger than 1 are encountered only rarely in the real world, and they do not need statistical analysis because the phenomenon being studied will be obvious, even to the naked eye.

For Bodington's statistical analyses, the various Effect sizes are: 1 medium, 8 large, 15 very large, and 7 huge; none of them are small or very small. So, his tests were capable of detecting only the most blatant differences between women and men. Such differences would actually not need an experiment, because we would be able to see them for ourselves, anyway.
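The Sensitivity calculation can be sketched in the same way. This is a rough normal approximation, not the G*Power calculation, and the group sizes in the loop are hypothetical. The benchmark values from small (0.2) upward follow the Sawilowsky rule of thumb cited above; the label for anything below 0.2 is my own shorthand.

```python
from math import sqrt
from statistics import NormalDist

# Benchmarks from "small" upward, largest first
BENCHMARKS = [(2.0, "huge"), (1.2, "very large"), (0.8, "large"),
              (0.5, "medium"), (0.2, "small")]


def minimum_detectable_effect(n_women, n_men, alpha=0.05, power=0.80):
    """Smallest effect size detectable with the given two-sided alpha and power.

    Normal approximation to the two-sample t-test (an assumption of this
    sketch), so the result is slightly optimistic for very small groups.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sqrt(1 / n_women + 1 / n_men)


def describe(effect_size):
    """Return the largest benchmark label that the effect size reaches."""
    for threshold, label in BENCHMARKS:
        if effect_size >= threshold:
            return label
    return "below small"


# Hypothetical group sizes, not the ones in the paper
for n_women, n_men in [(20, 20), (10, 12), (4, 6)]:
    d = minimum_detectable_effect(n_women, n_men)
    print(n_women, n_men, round(d, 2), describe(d))
```

Even with 20 people per gender, the best case among the tastings, the smallest detectable effect under these assumptions is already in the "large" category.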


Required sample sizes

To put this another way, there would have to be almost no variation between people within each of the two genders, in order for Bodington's statistical tests to have detected anything. When there is only small variation, then small sample sizes can be effective. However, if there are large differences between people, irrespective of gender, then we would require large sample sizes. My Power and Sensitivity analyses show that we would, indeed, require very large sample sizes in this case.

To examine this, we can perform what is called a prospective (or a priori) Power analysis. My calculations show that sample sizes of 300-450 people would be required per gender, depending on the particular assumptions made during the calculations. That is, if you want to do this experiment for yourselves, you will need to find at least 300 men and 300 women, give each of them the same sets of wines to taste, and then record their scores. If your statistical analyses of the scores still do not detect any gender differences, then you can be fairly sure that any such differences are indeed small.
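As a rough check on the order of magnitude, a prospective calculation can be sketched as follows. The effect size of 0.2 is my assumption for illustration (the assumptions behind the 300-450 figure are not spelled out here), and the formula is the standard normal approximation for comparing two means, not the exact G*Power calculation.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(effect_size, alpha=0.05, power=0.80, two_sided=True):
    """Approximate number of people needed in each group.

    Uses n = 2 * ((z_alpha + z_power) / d)^2, the normal approximation
    for a two-sample comparison of means with equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# Assumed "small" effect size of 0.2, with one-sided and two-sided tests
print(required_n_per_group(0.2, two_sided=False))
print(required_n_per_group(0.2, two_sided=True))
```

Under these assumptions, both the one-sided and the two-sided versions call for several hundred people per gender, which is the same order of magnitude as the 300-450 quoted above.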

Conclusion

Bodington's conclusion that the experiments found no detectable difference between men and women is therefore unsurprising — the analyses have so little statistical power that any such differences, if they exist, were unlikely to be found. The author suggests that "these differences are small compared to non-gender-related idiosyncratic differences between individuals and random expressions of preference." If so, then any study of the idea that the genders prefer different wines will require much larger sample sizes than the ones used in the published paper.

On this particular wine topic, the jury is still out. Based on differences in physiology, there is good reason to expect differences in wine tasting between females and males. However, measuring the size and nature of this difference, if it really does exist, remains to be done.

Monday, December 11, 2017

Do community wine-quality scores converge to the middle ground?

The short answer appears to be: not very often. This is surprising, given what is reported for other communities. This may indicate something unique about the wine community.

A few weeks ago, I discussed community wine-quality scores, such as those in the Cellar Tracker database (Cellar Tracker wine scores are not impartial). One of the subjects I commented on was the suggestion that the "wisdom of crowds" can mean that members of the crowd allow their judgement to be skewed by their peers. In the case of wine-quality scores, this would mean that scores from large groups of tasters may converge towards the middle ground, as the number of scores increases.


In the formal literature, this topic has been examined by, for example, Omer Gokcekus, Miles Hewstone & Huseyin Cakal (2014. In vino veritas? Social influence on ‘private’ wine evaluations at a wine social networking site. American Association of Wine Economists Working Paper No. 153). They looked at the trend in Cellar Tracker scores for wines through time, from when the first score is added for each wine. They wanted to see whether the variation in scores for a wine decreases as more scores are added for that wine, which would support the thesis about crowd behavior. They concluded that there is some evidence of this.

The important practical point here is that Cellar Tracker displays the average score for each wine when a user tries to add a new score of their own, and it is hard to ignore this information. So, it would be rather easy for a user to be aware of the difference between their own proposed score and the current "wisdom of the crowds". This would presumably have little or no effect when only a few scores have been added for each wine, but it might potentially have an effect as more scores are added, because the crowd opinion then becomes so much clearer.

It has occurred to me that some data that I used in another blog post (Are there biases in community wine-quality scores?) might also be used to examine the possibility that Cellar Tracker scores are biased in this way. In my case, I will look at individual wines, rather than pooling the data across all wines, as was done in the research study described above.

The data at hand are the publicly available scores from Cellar Tracker for eight wines (for my data, only 55-75% of the scores were available as community scores, with the rest not being shared by the users). These eight wines included red wines from several different regions, a sweet white, a still white, a sparkling wine, and a fortified wine. In each case I searched the database for a wine with at least 300 community scores; but I did not succeed for the still white wine (which had only 189 scores).


The results for the eight wines are shown in the graphs at the end of the post. Each point represents one quality score for the wine (some users enter multiple scores through time). For each wine, each score is shown (vertically) as the difference from the mean score for the wine: positive values indicate that the score was greater than the average score, while negative values indicate that it was less than the average. The time is shown (horizontally) as the number of days after the first tasting recorded for that wine.

The expectation is that, if the wine-quality scores do converge towards the middle ground, then the variability of the scores should decrease through time. That is, the points in the graphs will be more spread out vertically during the earliest times, compared to the later times.
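The graphs are assessed by eye, but the same expectation could be checked numerically. Here is a minimal sketch, assuming the scores are available as parallel lists of days and scores; the data shown are invented for illustration and are not Cellar Tracker data, and splitting the time span into three equal periods is my own choice.

```python
from math import sqrt
from statistics import mean


def spread_by_period(days, scores, n_periods=3):
    """Root-mean-square deviation from the overall mean score, per time period.

    The time span is cut into n_periods equal-length periods; a period
    with fewer than two scores is reported as None.
    """
    overall = mean(scores)
    start, span = min(days), max(days) - min(days)
    buckets = [[] for _ in range(n_periods)]
    for day, score in zip(days, scores):
        # Integer arithmetic keeps period boundaries exact for integer days
        idx = min(int((day - start) * n_periods // span), n_periods - 1) if span else 0
        buckets[idx].append(score - overall)
    return [sqrt(mean(dev ** 2 for dev in b)) if len(b) > 1 else None
            for b in buckets]


# Invented example in which the scores do converge through time
days = [0, 10, 40, 100, 150, 200, 250, 290, 300]
scores = [95, 85, 92, 90, 91, 89, 90, 90, 91]
spreads = spread_by_period(days, scores)
print(spreads)
```

If the scores converge towards the middle ground, the first value should be clearly larger than the last, as it is for this invented example.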

The results seem to be quite consistent, with one exception. That exception is the first one, where the scores are, indeed, more variable through the first third of the time period. In all of the other cases, the scores are most variable during the middle period, which is when most of the scores get added to the database, or sometimes also in the late period.

So, for these wines at least, I find little evidence that Cellar Tracker scores do converge towards the middle ground. This seems to disagree with the study of Gokcekus, Hewstone & Cakal (mentioned above), who concluded that community scores are normative (= "to conform with the positive expectations of another") rather than informational ("to accept information obtained from another as evidence about reality").

However, a study by Julian McAuley & Jure Leskovec (2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. Proceedings of the 22nd International Conference on the World Wide Web, pp. 897-908) found that user behavior in the Cellar Tracker database was quite different from the other four community databases that they studied (Beer Advocate; Rate Beer; Amazon Fine Foods; Amazon Movies).


So, maybe wine drinkers really are different from beer drinkers and movie goers, when it comes to community assessment of their products? The wisdom of the wine crowd may be unique! In particular, you will note that wine drinkers are not afraid to give rather low scores for wines — the scores in the graphs go much further below the average than they do above it. Note that the dataset excludes wines that are considered to be flawed, which are usually not given scores at all (although very rarely they receive scores in the 50-60 range, which I excluded, as representing faulty wines).

It seems to me that community wine scores are actually informational, rather than normative, expressing the opinion of the drinker rather than that of the crowd. This also fits in with the easily observed fact that the community scores are consistently lower than are those of the professional wine critics (see my previous post Cellar Tracker wine scores are not impartial) — the wine community is not easily swayed by expert opinion. However, the tendency of all wine reviewers, professional, semi-professional and amateur, to favor a score of 90 over a score of 89 certainly represents an unfortunate bias.

Cellar Tracker wine-quality scores through time for Alvear Pedro Ximenez

Cellar Tracker wine-quality scores through time for Barbaresco 2006

Cellar Tracker wine-quality scores through time for Caymus Cabernet 2012

Cellar Tracker wine-quality scores through time for Clicquot NV Brut

Cellar Tracker wine-quality scores through time for Edwards Sauvignon Blanc 2012

Cellar Tracker wine-quality scores through time for Pontet-Canet 2003

Cellar Tracker wine-quality scores through time for Rieussec 2001

Cellar Tracker wine-quality scores through time for Tondonia 2001

Monday, December 4, 2017

California cabernets do not get the same quality scores at different tastings

We are commonly told that most wines are drunk within a very few days of purchase. On the other hand, it is a commonly held belief among connoisseurs that many wines are likely to improve with a bit of bottle age, especially red wines. Counter-balancing the latter idea, it is easy to demonstrate that the perception of wine quality depends on the people concerned and the circumstances under which the wines are tasted.

Therefore, I thought that it might be interesting to look at some repeated tastings of the same wines under circumstances where they are formally evaluated under roughly the same conditions. Do these wines get the same quality scores at the different tastings?

To look at this, I will use some of the data from the tastings of the Vintners Club, based in San Francisco. The results of the early club tastings are reported in the book Vintners Club: Fourteen Years of Wine Tastings 1973-1987 (edited by Mary-Ellen McNeil-Draper. 1988); and I have used this valuable resource in several previous blog posts (eg. Should we assess wines by quality points or rank order of preference?).


For each wine tasted, the book provides the average of the UCDavis points (out of 20) assigned by the group of tasters present at that meeting. The Vintners Club has "always kept to the Davis point system" for its tastings and, therefore, averaging these scores is mathematically valid, as is comparing them across tastings. Many of the wines were tasted at more than one meeting, although each meeting had its own theme. For our purposes here, the largest dataset is provided by the tastings involving California cabernet wines, which I will therefore use for my exploration.

In the book, there were 170 different California cabernets that the Club tasted more than once, sometimes up to five years apart. Most of these were tasted only twice, but some were tasted up to four times over four years. Of these 170 wines, only eight wines produced the same average score on their first two tasting occasions. Of the rest, 63 wines (37%) produced a lower score on the second occasion, and 99 (58%) produced a higher average score. Perhaps we might tentatively conclude that California cabernet wines do tend to increase in quality with a bit of time in bottle?

However, it is instructive to look at those 137 wines that were re-tasted within one year of their first tasting. This will give us some idea of the repeatability of wine quality scores, as we should not really be expecting California cabernet wines to change too much within their first year in the bottle. Any differences in scores are therefore likely to reflect the tasting group rather than the wine itself (unless there is much bottle variation, or the wines mature very rapidly).

The data are shown in the first graph, with the difference between the two tasting dates shown horizontally, and the difference between the two average UCDavis scores shown vertically (ie. second tasting score minus first tasting score). Each point represents one wine, tasted twice within a year.

Difference in quality scores for wines re-tasted within 1 year

Clearly, there is considerable variability in the quality scores between tastings (the scores are spread out vertically over several quality points). Moreover, there is not much pattern to this variability — even after only a few days the scores can differ by more than 1 point; and even after a year they can still be identical. Most of the wines (70%) produced scores within +/– 1 point at the two tastings.

Notably, however, there were more decreases in score than there were increases between the two tastings. Only eight wines produced an increase in score of more than 1 point, while 33 wines (24%) produced a decrease in score of more than 1 point, and four of these actually decreased by more than 2 points. (NB: some of the points on the graph sit on top of each other.) I was not expecting to see such a strong pattern.

A so-called Difference/Average plot can sometimes be informative, and this is shown in the next graph. This shows the same data as above, but this time the horizontal axis represents the average of the two quality scores for each wine (rather than representing time).
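For anyone wanting to repeat this with their own tasting notes, the quantities behind both graphs are simple to compute. A minimal sketch follows; the score pairs are invented for illustration and are not taken from the Vintners Club book.

```python
def difference_average(pairs):
    """Return (average, difference) for each (first, second) score pair.

    The difference is second tasting minus first tasting, as in the graphs.
    """
    return [((first + second) / 2, second - first) for first, second in pairs]


def tally(pairs, tolerance=1.0):
    """Proportion of wines whose score stayed within the tolerance, rose or fell."""
    diffs = [second - first for first, second in pairs]
    n = len(diffs)
    return {
        "within": sum(abs(d) <= tolerance for d in diffs) / n,
        "increase": sum(d > tolerance for d in diffs) / n,
        "decrease": sum(d < -tolerance for d in diffs) / n,
    }


# Invented average scores (out of 20) for five wines, each tasted twice
pairs = [(15.0, 14.5), (16.2, 16.0), (13.8, 12.1), (14.9, 16.3), (15.5, 15.5)]
points = difference_average(pairs)   # (x, y) coordinates for the plot
summary = tally(pairs)
print(points)
print(summary)
```

Plotting the first element of each point horizontally and the second vertically gives the Difference/Average plot; the tally gives the kind of percentages quoted above.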

Quality scores for wines re-tasted within 1 year

This graph does not reveal much in the way of tell-tale patterns, which is unusual. However, we might anticipate that high-scoring wines will get more consistent quality scores, and this appears to be so for those few wines scoring >16 points. Furthermore, wines scoring <14 points do not do well at their second tasting.

Finally, we can look at those six California cabernets that were each tasted on four separate occasions. The final graph shows the time-course (horizontally) of their score (vertically), with each wine represented by a single line (as labeled).

Quality scores of wines tasted four times

Note that only one wine (from Stag's Leap) consistently increased in assessed quality over the years, while two other wines (from Beaulieu, and Robert Mondavi) consistently decreased. The remaining three wines had more erratic patterns. These differences may simply reflect random variation, and so we shouldn't read too much into this small sample size. Nevertheless, we do not see the hoped-for general increase in assessed quality of California cabernets over their first few years in bottle.

So, do California cabernets get the same quality scores at different tastings? In general, yes, at least to within a point or so. However, a large number of them end up with notably lower scores if there is a second tasting within a year.

Monday, November 27, 2017

Do you think the new Penfolds G3 wine is too expensive?

Penfolds Grange Bin 95 is probably Australia's best known red wine among connoisseurs, famous for its longevity. In 1995, the Wine Spectator named the 1990 Penfolds Grange as its wine of the year, and the Wine Advocate proclaimed Grange to be “the leading candidate for the richest, most concentrated dry red table wine on planet Earth.” The release price of Grange has a huge effect on the value of other ultra-fine Australian wines, as Penfolds tries to ensure that it is the most expensive wine on general release from Australia (for some details, see Top 25 Most-Expensive Australian Wines).

Mind you, the prices don't even remotely challenge the limited-production wines that make up the top 100 most expensive wines currently available in the world. After all, the current Grange release is only $A850 ($US650) per bottle, whereas these other wines cost thousands (see Top 50 Most Expensive Wines in the World). That said, a complete set of Grange vintages (1951-2013) recently sold for $A294,320 ($US224K).


To compete with these more expensive wines, Penfolds also occasionally releases limited-production wines, at thousands of dollars per bottle. Recently, they announced the upcoming release of a wine called G3, which is a mix of three Grange vintages, first blended and then further aged together in barrels (see Introducing Penfolds G3, a new wine born from Grange DNA). The release price will be $A3000 per bottle, with 1200 bottles available by "expressions of interest" only. This is a wine for investment, not drinking.

Needless to say, the media has had a field day, especially in Australia. Philip White has provided a summary on his blog of some of the comments: Penfolds G3 reviews reviewed. For my purposes here, the most pertinent comment has been by Campbell Mattinson, at The Wine Front:
For $3000 you could buy three or four vintages of Grange and, if you really wished, make up a blend yourself. Not only would this be more economical, it would be a more interesting experience / exercise for a wine lover ... Penfolds G3 would have been more interesting, even compelling, had it been released at $1000 per bottle rather than at $3000.
In the tradition of this blog, I cannot let these numbers go by without taking a closer look at them. Penfolds has set the G3 at about 3.5 times the per-bottle cost of the three Grange wines being blended. Here, I show that this makes the G3 wine nearly twice as expensive as might be expected, even given that it is a Grange blend.

The expected price of Grange

For my analysis, I first need to work out whether Grange is being released at a reasonable price in the first place. Surprisingly, one can make a case that the current release is actually under-priced.

To see this, we need the prices of Australia's most expensive wines. I started with The Wine Front's most recent list (May 2017) of Australia’s Most Expensive Wines, which is comprehensive but not exhaustive. This compilation consists of current-release still wines listed at $A150 ($US115) and above (ie. no sparkling wines, no fortifieds; and excluding special & limited releases). I then updated a couple of the prices, and added a few more candidates, making a total of 117 wines for my analysis.

The idea here is to use these data to derive an "expected" price for the Grange wine based on the prices of the other wines, so that we can compare this to the actual Grange price. I will do this by relating the wine prices to the rank order of those prices. This is shown in the graph, where each point represents a single wine; only the two most expensive wines are labeled. The graph is plotted with the logarithm of both axes, which means that the Zipf "power model" can be represented by a straight line on the graph, as explained in a previous post.



As shown, the Power model fits the data extremely well (it accounts for 98% of the variation in the data), but only if we exclude the two most expensive wines (shown in pink). This means that neither the Penfolds Grange (the most expensive wine) nor the Henschke Hill of Grace (the second wine) has a bottle price that is in line with the other 115 wines. Indeed, both wines appear to be under-priced!

The Power model predicts that the current release of the Henschke Hill of Grace wine would be expected to have a bottle price of $A1,000, and the Penfolds Grange would have a price of $A1,400 per bottle, based on the prices of the other wines. Their current retail prices are $A825 ($US635) and $A850 ($US650), respectively.
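For those who want to try this sort of calculation themselves, here is a minimal sketch of fitting a Power model as a straight line on logarithmic axes, leaving out the top-ranked wines, and then extrapolating back to them. The prices are synthetic, generated from an invented curve chosen to give round numbers; they are not The Wine Front's data, and the exponent is not the one fitted in this post.

```python
from math import exp, log


def fit_power_model(prices, skip_top=0):
    """Least-squares fit of log(price) = log(a) + b * log(rank).

    Prices are ranked from most to least expensive, and the skip_top
    most expensive wines are left out of the fit. Returns (a, b).
    """
    ranked = sorted(prices, reverse=True)
    pts = [(log(rank), log(price))
           for rank, price in enumerate(ranked, start=1) if rank > skip_top]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    return exp(mean_y - slope * mean_x), slope


def expected_price(a, b, rank):
    """Price predicted by the fitted Power model at a given rank."""
    return a * rank ** b


# Synthetic list of 117 prices from an invented curve (not real data),
# with the top two replaced by lower values to mimic under-pricing
prices = [1400 * rank ** -0.5 for rank in range(1, 118)]
prices[0], prices[1] = 850, 825

a, b = fit_power_model(prices, skip_top=2)
print(round(expected_price(a, b, 1)), round(expected_price(a, b, 2)))
```

The gap between each extrapolated value and the listed price of the corresponding wine is what the post describes as under-pricing.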

So, don't be surprised to see these two wines increase to these "expected" prices in future releases. Neither Henschke nor Penfolds has recently been shy about raising their prices to ensure their primacy in the Australian market.

The expected price of G3

We can now try to derive an "expected" price for the new G3 wine, by repeating the above analysis, but moving all of the wines down one place in the rank order (ie. Grange becomes the #2 wine instead of #1, etc). The new equation for the Power model (excluding Grange and Hill of Grace) will give us a predicted price for the new #1 most-expensive wine in the list.
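The rank-shifting step can be sketched in the same way. Again, this is only an illustration under stated assumptions: the prices are synthetic (generated from an invented curve, not the real price list), and the function name is my own.

```python
from math import exp, log


def expected_new_top_price(prices, n_excluded=2):
    """Predicted price of a new #1 wine placed above an existing price list.

    Every existing wine is moved down one place (rank r becomes r + 1),
    the n_excluded most expensive existing wines are left out of the fit,
    and the fitted line is read off at rank 1.
    """
    ranked = sorted(prices, reverse=True)
    pts = [(log(rank + 1), log(price))
           for rank, price in enumerate(ranked, start=1) if rank > n_excluded]
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    intercept = mean_y - slope * mean_x
    # log(1) = 0, so the prediction at rank 1 is simply exp(intercept)
    return exp(intercept)


# Synthetic prices from an invented curve (not real data), top two lowered
prices = [1550 * (rank + 1) ** -0.5 for rank in range(1, 118)]
prices[0], prices[1] = 850, 825

print(round(expected_new_top_price(prices)))
```

Comparing the value returned with a proposed release price is then a matter of simple division.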

This turns out to be a bottle price of $A1,550, which is fractionally more than half of Penfolds' proposed price for the G3 wine. So, even given that the wine is a Grange, the proposed G3 price is, indeed, much more than we should expect to pay. Mind you, this expected price is 50% more than Campbell Mattinson is prepared to pay for a Grange blend, so he may actually be under-valuing the wine.

Monday, November 20, 2017

Wine collector fraud, and wine snobbery

The New Testament gospels warn us about the danger of putting new wine into old wineskins. This was a religious parable of Jesus, with several possible interpretations; but it has taken on a very different relevance in the modern world, with increasing incidences of collector fraud involving wines.

Counterfeit wine has been much in the media in recent weeks (eg. Wine maven Kurniawan, convicted of fraud, loses bid for freedom ; Billionaire Koch brother's crusade against counterfeit wine ; Why it’s so hard to tell if a $100,000 bottle of wine is fake ; Napa wine merchant accused of fraud in client's lawsuit). We have even gotten to the stage where there is fake news about allegedly fake wines (Penfolds hit by fake wine claims).


Discussion of these topics seems to range from outrage at the fraudster, through fascination with how it's done, to wondering how much of it has been done (eg. $100 million of counterfeit wine in circulation ; 20% of all wine in the world is fake). Among all of these news stories and commentaries, there is one general point that seems not to have been emphasized — wine collector fraud and wine consumption fraud are two different things. Furthermore, wine collector fraud requires a combination of massive wealth and massive snobbery on the part of the collectors — if there were no people with this combination of characteristics, then collector frauds would not even be conceived, let alone perpetrated.

There are two types of wine fraud

Fraud directed against wine collectors is a rather different thing from most other frauds, which are usually grouped as consumption fraud rather than collector fraud. Far too much of the wine discussion has failed to clearly distinguish these two types of fraud, which are clearly described by, for example, Lars Holmberg (2010. Wine fraud. International Journal of Wine Research 2: 105-113). The difference is very important, because consumers and collectors are very different people. The main purpose of this blog post is to call attention to this distinction.

Consumption wine fraud is usually directed at inexpensive or mid-price wines, and includes things like: misrepresenting the grape variety, grape origin or alcohol content; adulterating the wine with sugar, water, coloring, flavors, or something much worse (like glycol or lead); and running a retail Ponzi scheme. These things can be done on a large scale, and they potentially affect all consumers. Collector fraud, on the other hand, usually involves luxury wines, and is directed almost solely at individuals with more money to pay for the wine than they have technical ability to correctly identify that wine.

In the latter case, irrespective of what we may feel about the fraudster, we should recognize that the collectors who bought the wines are ultimately victims of their own snobbery, and of having the wealth to display that snobbery. Anyone who spends tens of thousands of dollars on a bottle of wine can only be doing so for the snob value of having people know that they did this (Campbell Mattinson: "the rich and powerful need something rich and powerful to spend their money on"). These are wine investors, not wine drinkers, and so we are actually talking about wine investment fraud, which is not too dissimilar to art investment fraud. This is a far cry from consumption frauds directed at wine drinkers in general.

Wine can be a good financial investment, of course, but only if you can authenticate the wine. This is a very hard and expensive thing to do. Perhaps these investors might consider some alternative means of disposing of their massive wealth? There are plenty of people besides fraudsters who would like the opportunity to make good use of the money; and many of these people actually perform publicly useful services, rather than the solely private one of enhancing investor snobbery.


Interestingly, there seems to have been no diminution of the prices of rare wines, in spite of all of the fuss about collector fraud (Q&A: François Audouze, wine collector). This illustrates the illogicality of luxury wine prices.

Snobbery

Wine snobbery comes in many guises. Snobs are conventionally considered to be those people who value exclusivity and status above everything else. However, there are alternative ideas about this characterization. For example, Jeany Miller (The parasitic nature of the wine fraud) has suggested that: “Wine snob is an affectionate term for people who understand and enjoy wine." This may be giving the real snobs a bit too much credibility, but it does emphasize the wide-ranging nature of the term. In particular, not all wine snobs have massive wealth, although a certain level of financial liquidity is obviously required. Snobbery on its own is usually relatively harmless, but combining it with increasing wealth is simply asking for increasing amounts of trouble.

Wine snobbery has been a topic of discussion for quite a while. For example, whole books on the topic have been around since the 1980s, varying from the humorous (The Official Guide to Wine Snobbery, by Leonard S. Bernstein, 1982) to the very serious (Wine Snobbery: an Insider's Guide to the Booze Business, by Andrew Barr, 1988).

Barr, in particular, describes how a large section of the drinks industry relies on snobbery for its profitability. Luxury wines cost an arm and a leg (see The cost of luxury wines), but they are not much better in quality than wines costing a tenth of the price (see Luxury wines and the relationship of quality to price). It takes snobbery and wealth to get involved in this segment of the refreshments business.

Alternatives

Fortunately for those of us who understand and enjoy wine, and therefore might conceivably be considered snobs, there is another segment of wine snobbery that requires expertise rather than wealth — knowing about little-known wines and regions requires time and effort, but not necessarily wealth. For example, few Americans know much about Australian wine, and yet Australia is a continent as well as a country, and it therefore has as wide a diversity of wine regions and wines as any other continent. Wine writers are often lazy, and they treat "Australia" as a single wine region, just as they do for any of the much smaller countries of South America or Europe, in spite of its greater vinous diversity than most of these other countries. You can get a lot of snob value out of knowing more about Australian wine than just shiraz! (Some examples: So much more than “just shiraz”! ; Why there's more to Australian wine than chardonnay ; Alternative Australian wines.)

Wine Cellar, Park Hotel

Old bottles of wine also provide snob value, of course, but they can often do this without much monetary expenditure. In Europe, old wine is available on eBay, but massive wealth is not usually to be found there — the wealthy shop elsewhere than eBay (or Amazon). Snobbery is available on eBay, like anywhere else, but it is not massive — there is little snob value to be gained from saying that you shop on eBay. But turning up to dinner with an old bottle of wine does not require that you tell anyone where you got it!

Consumer wine fraud has been detected involving some relatively inexpensive wines, as well as the more newsworthy expensive ones, and so caveat emptor always applies, on eBay as much as anywhere else. However, on eBay it is much more likely that an old bottle of wine will be undrinkable, rather than that it will be drinkable but not what the label says it is. Poor storage of old bottles is a far bigger risk than is a problematic pedigree. It is for this reason that reputable sellers on eBay emphasize that you are buying the bottle not its contents.

Perhaps that is a warning we should put on all old bottles, no matter what their price or provenance?
You are buying the snob value of the label, not the wine — pay accordingly, and don't complain.
Postscript

For a later, but similar, take on the importance of distinguishing the two types of wine fraud, see Oliver Styles' commentary: Worried about wine fraud? that's rich.

Monday, November 13, 2017

CellarTracker wine scores are not impartial

Among other things, CellarTracker consists of a community-driven database of wines, along with comment notes on the tasting of those wines, and often also a quality score assigned at each tasting. Some time ago, Tom Cannavan commented:
The current thinking seems to be that the "wisdom of crowds" (CellarTracker, TripAdvisor, Amazon Reviews) is the most reliable way to judge something; but that thinking is deeply flawed. It's not just the anonymity of those making the judgements, and the fact that they may or may not have experience, independence, or have an agenda, but that crowd behaviour itself is far from impartial. We've all experienced people "tasting the label"; but there is also no doubt that members of the crowd allow their judgement to be skewed by others. That's why in big groups scores always converge toward the safe middle ground.
So, we can treat this as Proposition #1 against the potential impartiality of the CellarTracker wine-quality scores.

For Proposition #2, I have previously asked Are there biases in community wine-quality scores? In answering that question I showed that CellarTracker users have (for the eight wines I examined) the usual tendency to over-use quality scores of 90 at the expense of 89 scores.

For Proposition #3, Reddit user mdaquan has suggested:
Seems like the CellarTracker score is consistently lower than multiple professional reviewers on a consistent basis. I would think that the populus would trend higher, [and] not be as rigorous as "pro" reviewers. But consistently the CT scores are markedly lower than the pros.
So, we have three different suggestions for ways in which the wine-quality scores of the CellarTracker community might be biased. This means that it is about time that someone took a detailed look at the CellarTracker quality scores, to see how much bias is involved, if any.


The quality scores assigned by some (but not all) of the CellarTracker community are officially defined on the CellarTracker web site: 98-100 A+ Extraordinary; 94-97 A Outstanding; 90-93 A– Excellent; 86-89 B+ Very Good; 80-85 B Good; 70-79 C Below average; 0-69 D Avoid. However, the "wisdom of crowds" never follows any particular formal scheme, and therefore we can expect the users to each be doing their own thing.

But what does that "thing" look like when you pool all of the data together, to look at the community as a whole? This is a daunting question to answer, because (at the time of writing) CellarTracker boasts of having "7.1 million tasting notes (community and professional)". Not all of these notes have quality scores attached to them, but that is still a serious piece of Big Data (see The dangers of over-interpreting Big Data). So, I will look at a subset of the data, only.

This subset is from the study by Julian McAuley & Jure Leskovec (2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. Proceedings of the 22nd International Conference on the World Wide Web, pp. 897-908). This dataset contains the 2,025,995 review notes entered and made public by the CellarTracker users up until October 2012. I stripped out those notes without associated quality scores; and I then kept those notes where the wine was stated to have been tasted between 2 October 2002 and 14 October 2012. This left me with 1,569,655 public quality scores (and their associated tasting date), which covers the first 10 years of CellarTracker but not the most recent 5 years.
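The filtering just described is simple enough to sketch. The following is only an illustration, assuming that each note has been parsed into a dictionary with hypothetical `tasted` and `score` fields; the real dataset has its own layout, and the four notes shown are invented.

```python
from datetime import date

# Study window, taken from the text above (both end dates inclusive).
START, END = date(2002, 10, 2), date(2012, 10, 14)

def keep_scored_in_window(notes, start=START, end=END):
    """Keep (tasting date, score) pairs for notes that have a score
    and a stated tasting date inside the study window."""
    kept = []
    for note in notes:
        score = note.get("score")
        tasted = note.get("tasted")
        if score is None or tasted is None:
            continue                      # no score, or no tasting date
        if start <= tasted <= end:
            kept.append((tasted, score))
    return kept

# Invented example notes; the field names are assumptions.
notes = [
    {"tasted": date(2005, 12, 24), "score": 91},
    {"tasted": date(2011, 6, 1), "score": None},   # note without a score
    {"tasted": date(1999, 5, 5), "score": 88},     # before the window
    {"tasted": date(2012, 10, 14), "score": 90},   # last day of the window
]
kept = keep_scored_in_window(notes)
print(len(kept), "of", len(notes), "notes kept")
```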

Time patterns in the number of scores

The obvious first view of this dataset is to look at the time-course of the review scores. The first graph shows how many public user quality scores are represented for each month of the study period.

Time-course of CellarTracker wine-quality scores 2002-2012

CellarTracker was designed in 2003 and released in 2004; therefore, all wines before that time have been retrospectively added. So, the graph's time-course represents recorded tasting time, not time of entry into the database, although the two are obviously related. The database reached its maximum number of monthly scored wines at the beginning of 2011, after which it remained steady. The dip at the end of the graph is due to the absence of wines that were tasted before the cutoff date but had not yet been added to the database at that time.

The annual cycle of wine tasting is obvious from 2005 onwards — the peak of tasted wines is at the end of each year, with a distinct dip in the middle of the year. This presumably represents wine consumption during the different northern hemisphere seasons — wine drinking is an early winter thing.

The quality scores

The next graph shows the frequency (vertically) of the wine-quality scores (horizontally). This should be a nice smooth distribution if the quality scores are impartial; any deviations might be due to any one of the three propositions described above. Although it is somewhat smooth, this distribution shows distinct peaks and dips.

CellarTracker wine-quality scores 2002-2012

For the lower scores, there are distinct peaks at scores of 50, 55, 60, 65, 70, 75, and 80. This is not unexpected, as wine tasters are unlikely to be interested in fine-scale differences in wine quality at this level, or even be able to detect them.

For the scores above 80, 57% of the scores are in the range 88-92. If we are expecting some sort of mathematically average score for wines, then these data make it clear that it is a score of 89-90. That is, the "average" quality of wine consumed by the CellarTracker community is represented by a score of c. 90, with wines assessed as being either better or worse than this.

However, a quality score of 90 shows a very large peak compared to a score of 89, exactly as discussed under Proposition #2 above. I have previously reported this fact for both professional (Biases in wine quality scores) and semi-professional (Are there biases in wine quality scores from semi-professionals?) wine commentators, as well as the CellarTracker community. So, there is apparently nothing unusual about this, although it could be seen as questioning the general utility of wine-quality scores. If subjective factors make people use 90 in preference to 89, then what exactly is the use of a score in the first place?

Moving on, we now need to look for other possible biases in the scores. In order to evaluate whether any of the scores are biased, we need an unbiased comparison. As I explained in my first post about Biases in wine quality scores, this comes from an "expected frequency distribution", also known as a probability distribution. As before, it seems to me that a Weibull distribution is suitable for wine-score data.

This Weibull expected distribution is compared directly with the observed frequency distribution in the next graph. In this graph, the blue bars represent the (possibly biased) scores from CellarTracker, and the maroon bars are the unbiased expectations (from the probability model). Those scores where the heights of the paired bars differ greatly are the ones where bias is being indicated.
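For readers who want to see the mechanics of such a comparison, here is a minimal sketch. It assumes a two-parameter Weibull distribution applied directly to the scores, with shape and scale values picked by eye rather than fitted; the counts are invented, and the 15% flagging threshold is arbitrary. None of this reproduces the actual model fit behind the graph.

```python
import math

def weibull_cdf(x, shape, scale):
    """Cumulative probability of a two-parameter Weibull distribution."""
    return 1.0 - math.exp(-((x / scale) ** shape)) if x > 0 else 0.0

def expected_counts(scores, total, shape, scale):
    """Expected count for each integer score, using the probability mass in
    [score - 0.5, score + 0.5], renormalised over the listed scores."""
    probs = [weibull_cdf(s + 0.5, shape, scale) - weibull_cdf(s - 0.5, shape, scale)
             for s in scores]
    norm = sum(probs)
    return [total * p / norm for p in probs]

# Invented counts per score, NOT the CellarTracker data.
observed = {85: 300, 86: 380, 87: 600, 88: 900, 89: 1000, 90: 1400,
            91: 1050, 92: 900, 93: 500, 94: 300, 95: 150}
scores = sorted(observed)
total = sum(observed.values())

# Assumed parameters for this toy example (not estimated from any data).
expected = expected_counts(scores, total, shape=40.0, scale=91.0)

for s, e in zip(scores, expected):
    ratio = observed[s] / e
    flag = "over" if ratio > 1.15 else "under" if ratio < 0.85 else ""
    print(f"{s}: observed={observed[s]:5d} expected={e:7.1f} ratio={ratio:4.2f} {flag}")
```

The expected counts are scaled to the same total as the observed counts, so that any excess at one score has to be balanced by a deficit somewhere else.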

Biases in CellarTracker wine-quality scores 2002-2012

This analysis shows that quality scores of 88, 89, and 90 are all over-represented, while scores of 93, 94, and 95 are under-represented, compared to the expectation. This indicates that the CellarTracker users are not giving as many high quality scores as expected, but are tending to give too many scores of 88-90, so that scores are skewed towards values just below 90 rather than just above.

This is exactly what was discussed under Proposition #3 above, where the professionals seem to give somewhat higher scores when the same wines are compared. Furthermore, it is in line with Proposition #1, as well, where the community scores simply converge on a middle ground — a CellarTracker score is more likely to be in the small range 88-90, rather than most other numbers.

Furthermore, quality scores of 81, 83, and 86 are also under-represented, according to the analysis. This creates a clustering of the lower scores at certain values. Presumably, the tasters are not bothering to make fine distinctions among wines below their favorite scores of 88-90.

Time patterns in the quality scores

We can now turn to look at the time-course of the wine-quality scores. This next graph shows the average quality score for the wines tasted during each month of the study.

Average CellarTracker wine-quality scores 2002-2012

The average score was erratic until mid 2005, which is when the number of wines (with scores) reached 3,000 per month. So, that seems to be the number of wine scores required to reliably assess the community average.

From 2007 to 2009 inclusive, the average quality score was c. 88.5, although there was a clear annual cycle of variation. Notably, after 2009 the average quality score rose to >89. Does this represent the proverbial score inflation? Or perhaps it is simply the community maturing, and starting to give scores more in line with those of the professionals (which are higher)?

To try to assess this, the final graph shows the time-course of the proportion of scores of 95 or above. Many of the professional reviewers have been accused (quite rightly) of over-using these very high scores, compared to the situation 20 years ago, and so we can treat this as an indication of score inflation.

High CellarTracker wine-quality scores 2002-2012

This graph shows no post-2009 increase in the proportion of very high scores. So, the increase in the average CellarTracker quality score does not represent an increased usage of very high scores, but is instead a general tendency to assign higher scores than before. Or perhaps it represents the community drinking better wines?
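The two monthly summaries used in this section (the average score, and the proportion of scores of 95 or above) can be sketched as a simple grouping by tasting month. This is an illustration only, assuming the scores are held as (tasting date, score) pairs; the function name `monthly_summary` and the six records are invented.

```python
from collections import defaultdict
from datetime import date

def monthly_summary(records, high=95):
    """Per (year, month): number of scores, mean score, and the share of
    scores that are at or above the `high` threshold."""
    by_month = defaultdict(list)
    for tasted, score in records:
        by_month[(tasted.year, tasted.month)].append(score)
    summary = {}
    for month, scores in sorted(by_month.items()):
        n = len(scores)
        summary[month] = {
            "n": n,
            "mean": sum(scores) / n,
            "share_high": sum(s >= high for s in scores) / n,
        }
    return summary

# Invented records, NOT the CellarTracker data.
records = [
    (date(2010, 11, 3), 88), (date(2010, 11, 20), 90),
    (date(2010, 12, 24), 95), (date(2010, 12, 25), 91),
    (date(2010, 12, 31), 96), (date(2011, 1, 2), 89),
]
summary = monthly_summary(records)
for month, stats in summary.items():
    print(month, stats)
```

Keeping the count `n` alongside each monthly average makes it easy to set aside months with too few scores for the average to be reliable.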

Finally, it is worth pointing out the annual cycle in the average scores and in the proportion of very high scores. The annual peak in quality scores is in December. That is, wines get higher scores in December than at most other times of the year. I hope that this represents people buying better wines between All Hallows Day and New Year, rather than drinking too much wine and losing their sense of values!

Conclusions

All three predicted biases in the CellarTracker wine-quality scores are there! The community scores are generally lower than expected, they cluster in a smaller range around the average than expected, and a score of 90 is over-used compared to 89. There are also very distinct seasonal patterns, not only in the number of wines tasted but also in the scores assigned to them.

These conclusions are not necessarily unexpected. For example, Julian McAuley & Jure Leskovec (cited above) noted: "experienced users rate top products more generously than beginners, and bottom products more harshly." Furthermore, Omer Gokcekus, Miles Hewstone & Huseyin Cakal (2014. In vino veritas? Social influence on ‘private’ wine evaluations at a wine social networking site. American Association of Wine Economists Working Paper No. 153) have noted that community scores are normative (= "to conform with the positive expectations of another") rather than informational ("to accept information obtained from another as evidence about reality").

In the modern world, it may well be true that "the crowd is the new critic", but it turns out that the crowd as a group is actually no more impartial than is any single person.

Monday, November 6, 2017

The dangers of over-interpreting Big Data (in the wine business)

In order to understand complex sets of information, we usually summarize them down into something much simpler. We extract what appear to be the most important bits of information, and try to interpret that summary. Only the simplest pieces of information can be left alone, and grasped on their own. This creates an inherent problem — data summaries also leave information out, and that information may actually be very important. Sadly, we may never find this out, because we left the information out of the summary.

Clearly, the biggest danger with what are known in the modern world as Big Data is that, in order to understand it, we first turn it into Small Data by ignoring most of it. That is, the bigger the dataset then the more extreme is the summary process, because of our desire to reduce the complexity. Data summaries tend to be all the same size, no matter how big the original dataset was. Unfortunately, most of the discussion about Big Data has involved only the technical aspects, along with the optimistic prospects for using the data, without much consideration for the obvious limitations of data summarizing.


One of the most common ways that we have historically used to summarize data is to organize the data into a few groups. We then focus on the groups, not on the original data. In this post, I will discuss this in the context of understanding wine-buying customers.

Grouping

By summarizing data, we are looking for some sort of mathematical structure in the dataset. That is, we are looking for simple patterns, which might then mean something to us, preferably in some practical sense.

Putting the data into groups is one really obvious way to do this; and we have clearly been doing it for millennia. For example, we might group plants as those that are good to eat, those that are poisonous, those that are good as building material, etc.

The biggest limitation of this approach is that we can end up treating the groups as real, rather than a mathematical summary, and thus ignore the complexity of the original data. For example, groups can overlap: a plant can be both poisonous and good for making house walls, and focusing on one group or the other can make us forget this.

Groups can also be fuzzy, which means that the boundaries between the groups are not always clear. Dog breeds are a classic example — pure-bred dogs clearly fit into very different groups, and we cannot mistake one breed for another. But dogs of mixed parentage do not fit neatly into any one group, although we often try to force them into one by emphasizing that they are mostly from one breed or another. That is, the breeds are treated as real groups, even though they overlap, and thus are not always distinct.

Examples of grouping

Let's consider two examples, one where the groups might make sense and one where they are more problematic.

When considering customers, one obvious grouping of people is gender, male versus female. In science, this is simply a genetic grouping (based on which genes you have), but elsewhere it is usually treated as also being a behavioral grouping. Businesses are therefore interested in what any gender-associated differences in behavior might mean for them.

Consider this example of using Twitter hashtags to quantify gender differences: The hard data behind how men and women drink. The data come from "half a million tweets collected over the course of a year (June 2014 - July 2015), with the gender detected from the first name of the tweeter." The first graph shows the frequency of 104 drink-related hashtags, arranged according to how often they originated from male versus female tweeters.


Note that no hashtags are used exclusively by either males or females; indeed, only two exceed 80% gender bias (homebrew, malt). Nor are any hashtags used equally by males and females; the closest are cachaca, patron and caipirinha. We thus might be tempted to recognize two groups, of 40 "female" words and 64 "male" words.

However, we have to be careful about simply confirming our starting point. We pre-defined two groups that represent observed differences (in genetics), and then we have demonstrated that there are other differences (in behavior). The data are essentially continuous, with some words having less than 47% vs. 53% gender distinction. In this case, gender still forms indistinct groups.
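A small sketch may make the point concrete. The tweet counts below are invented (they are not the data from the cited article), and the function name `male_share` is an assumption; the idea is simply that each hashtag has a continuous male share, and that a binary "male" or "female" label comes only from where we choose to cut.

```python
# Invented tweet counts per hashtag, as (male, female). NOT the real data.
counts = {
    "homebrew": (850, 150),
    "malt": (820, 180),
    "caipirinha": (505, 495),
    "prosecco": (300, 700),
}

def male_share(male, female):
    """Fraction of a hashtag's tweets that came from male-named accounts."""
    return male / (male + female)

shares = {tag: male_share(m, f) for tag, (m, f) in counts.items()}

# Sorting by share shows a continuum; the 50% line is only one possible cut.
for tag, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    label = "male" if share > 0.5 else "female"
    print(f"{tag:12s} {share:5.1%} -> labelled '{label}'")
```

In this toy example, a hashtag at 50.5% gets the same "male" label as one at 85%, which is exactly the information that the two-group summary throws away.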

Moving on, this situation becomes even more complex when we start to consider situations with many possible groups, based simultaneously on lots of different characteristics. In an earlier post, I discussed the mathematical technique of using ordinations to summarize this type of data (Summarizing multi-dimensional wine data as graphs, Part 1: ordinations).

This next graph shows an example of the resulting data summary, called an ordination diagram. If each point represents a person, then the spatial proximity of the points would represent how similar they are. So, points close together are similar based on the measured characteristics, while points progressively further apart are progressively more different.


This ordination diagram does not contain any obvious groups of people — they are spread pretty much at random. However, that does not mean that we cannot put the people into groups! Consider this next version of the same diagram, in which the points are now colored. The five different colors represent five groups, one in each corner of the diagram and one in the center.


Clearly, these groups do not overlap. More to the point, the centers of each group are quite distinct. Thus, the groups do have meaning as a summary of the data — combining the descriptions of each group of people would create an easily interpreted summary of the whole dataset.

However, these are fuzzy groups — the boundaries are not distinct, and the groups of people are not discrete. Thus, I am also losing a lot of information, as I must in a summary of complex data; and I need to care about that lost information as well. I cannot treat the groups as being real — they are a convenience only. As a technical aside, it is worth noting that the groups are not an illusion — they are an abstraction.
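The same effect can be sketched with points that have no structure at all. This is an illustration under stated assumptions: the points are uniformly random in a unit square, the five centers (four corners plus the middle) are hypothetical, and each point is simply assigned to its nearest center. It is not the procedure used to color the diagram above.

```python
import math
import random

# Five hypothetical group centers: the four corners and the middle of a unit square.
CENTERS = {
    "bottom-left": (0.0, 0.0), "bottom-right": (1.0, 0.0),
    "top-left": (0.0, 1.0), "top-right": (1.0, 1.0),
    "center": (0.5, 0.5),
}

def nearest_group(point):
    """Name of the center closest to the point (Euclidean distance)."""
    return min(CENTERS, key=lambda name: math.dist(point, CENTERS[name]))

rng = random.Random(0)   # fixed seed, so that the sketch is repeatable
points = [(rng.random(), rng.random()) for _ in range(500)]   # no built-in groups

groups = {}
for p in points:
    groups.setdefault(nearest_group(p), []).append(p)

# Each group gets a distinct average position, even though the points are random.
for name, members in groups.items():
    cx = sum(x for x, _ in members) / len(members)
    cy = sum(y for _, y in members) / len(members)
    print(f"{name:12s} n={len(members):3d} mean=({cx:.2f}, {cy:.2f})")
```

Every point receives a label, and the group averages differ, but two points on either side of a boundary can be closer to each other than to anything else in their own groups.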

The point of this blog post is to make it clear that this problem must especially be addressed when dealing with Big Data, because that is where techniques like ordination come into play.

Big Data

Wikipedia has this to say about Big Data:
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them ... Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data ... Analysis of data sets can find new correlations to spot business trends, prevent diseases, combat crime, and so on.
In business, the use of information from social media is the most obvious source of Big Data. People are often perceived as being much more honest in their online social interactions than they are in formal surveys; and so this relatively recent source of information could potentially be much more useful to modern business practices.


As this infographic indicates, the social media can generate some really big datasets. Making sense of these data involves some pretty serious summarizing of the data. Therefore, the principles that I have discussed above become particularly important — we have to be very careful about how we interpret those summaries, especially if we have summarized the data into groups.

An example from the world of wine

So, let's conclude with a real example from the world of wine buying: the 2016 Digital Wine Report: the Five Tribes of Online Wine Buyers, prepared by Paul Mabray, Ryan Flinn, James Jory and Louis Calli. (Thanks to Bob Henry for getting me a copy of the "Academic edition" of this report.)

This study was produced by a group originally called VinTank, which at the time was a subsidiary of W2O (which subsequently closed it down!). The objective of the report was to combine data about wine drinkers, based on the social media, with data about wine buyers, based on online purchases. This is a perfect example of using Big Data to help businesses understand their customers.


The social data were for 12,500 individuals, based on 183,000 Twitter posts assessed by the TMRW Engine software. The buying data were for 53,000 online wine purchases, recorded by Vin65. So, the report attempts to summarize the wine behavior of people who use both social media to discuss wine and online shopping to purchase wine, in the USA. Clearly, this does not attempt to represent all US wine drinkers and buyers — the people summarized "buy directly from wineries, they are digitally savvy and use both e-commerce and social media, and they like wine more than the casual consumer."

The crux of the report's methodology is this:
Using a methodology built upon the foundations of demographic and psychographic market research techniques, we segmented [= grouped] online wine customers according to their psychographic profiles: including hobbies, preferences, activities, and political outlooks ... We were [then] able to apply this segmentation to purchasing behavior and demographic profile at the individual customer level. As a result we've identified 5 common "tribes" of online wine buyers.
To personalize these five tribes, we've given each one a name, a theme and a personality description.
You can immediately see what I am warning you about here — these five tribes are not real, even though they have names and distinct personalities. The psychographic and demographic characteristics of the people vary continuously, and grouping them is merely a convenient mechanism for data summary.

In order to get a sense of what these groups look like, refer to the colored version of the ordination diagram shown above, where the group centers are different but the boundaries are fuzzy. I have carefully analyzed the data presented in the report, and I can assure you that the five "tribes" really do have different behavioral "centers"; but I would hate to have to assign anyone to one group or another. At a personal level, I can't see myself as being in any of these five tribes.

Part of the problem here is that categorizing people in this manner simply perpetuates cultural stereotypes. In this case we have: Anna, the sophistocrat; Graham, the info geek; Sofia (or Sophia), the digital native; Don, the southern conservative; and Kevin, the trophy hunter. If none of these people sounds like you, then you are probably right.

Conclusion

Big Data are useful, there is no doubt about it. However, big data can potentially have big problems, as well, and we need to guard against the consequences of this. One of the most common ways to summarize Big Data is to assign the study objects to groups, but these groups are not real — they are a conceptual convenience, nothing more. Hopefully, grouping their customers will help businesses provide services to those customers, but that does not mean that the businesses should ignore those people who do not fit neatly into any of their groups.

PS.

VinTank has reappeared as AveroBuzz, which is intended for the hospitality industry as a whole, not just the wine part.

Monday, October 30, 2017

The anti-social media arrives on the wine-blogging scene!

I have previously been subject to professional plagiarism, as I have described on my other blog:
Unacknowledged re-use of intellectual property

However, I was not expecting to have my wine blog plagiarized. Sadly, this has now happened, too. There is a Blogspot blog, called "artsforhealthmmu" that, as of today, consists entirely of copies of 19 of my posts from the Wine Gourd, without an iota of reference to me or my blog. Needless to say, I am not going to put a link to that blog here, but you will find it if you search for the name on the web.

From the University Libraries of the University of Tennessee Knoxville.

It is not immediately obvious what the game is, here. That other blog is well designed, and there are currently no advertisements or rogue links that I can find, although that may change. Interestingly, there is a genuine site called the "Arts and health blog" that has almost exactly the same web address as the plagiarizing blog. (The latter has an extra "s" in the address.) So, there are other people who seem to be as much victims of this situation as I am.

I have submitted a request to Blogger to have the offending blog removed, but that will take some time to implement. I am hoping that this will result in a better resolution than happened the last time I was plagiarized (mentioned above), where the offending (well-known) author had no real excuse or apology, and the (well-known) book publisher metaphorically just shrugged his shoulders. A pox on all of them!

Update 4 Nov.:

Notice from Google:
"In accordance with the Digital Millennium Copyright Act, we have completed processing your infringement notice. We are in the process of disabling access to the content in question"