The Wine Gourd: December 2017

Monday, December 25, 2017

Medical practitioners and malt whisky

Greetings of the season.

For Christmas this year I thought that I might follow up on two previous posts that have proved to be quite popular on my other blog, which are about the detectable qualities of Scotch whiskies:

Medical practitioners have been known to partake of these tipples; and along the way some of them have pondered the question as to whether it is possible for people to reliably distinguish among the various whiskies, in even the most basic way. For example, two medical groups have done some experiments, and published them in the Christmas issue of the British Medical Journal:

Stephen J Chadwick, Hugh A Dudley (1983) Can malt whisky be discriminated from blended whisky? The proof. A modification of Ronald Fisher's hypothetical tea tasting experiment. British Medical Journal 287:1912-1913.
EJ Moran Campbell, Diana ME Campbell, Robin S Roberts (1994) Ability to distinguish whisky (uisge beatha) from brandy (cognac). British Medical Journal 309:1686-1688.

For those of you who think you know your whiskies, it turns out to be a lot harder to discriminate them than you think.

Here is the abstract of the first paper:

A modified version of Fisher's tea tasting experiment was performed to test the confident assertions of some members of an academic surgical unit that they could easily distinguish malt from blended whisky. Eight male volunteers from the unit, divided into regular and inexperienced whisky drinkers, were blindfolded and given a glass of each of six whiskies. The whiskies included three malts and three blends, and each subject tasted each whisky six times. They were asked whether the whisky was malt or blended, whether they could identify the distillery, and whether they liked it (ranked on a nine-point scale). Statistical analysis of the data suggested that within the [surgical] unit malt whisky could not be distinguished from blended whisky, and that experience did not alter powers of discrimination. These results suggest that, although "uisgebeatha" has unique properties, the inexpert drinker should choose his whisky to suit his taste and pocket and not his self image.

Here is the abstract of the second paper:

Objective: To assess the ability to distinguish between first rate malt whisky and brandy and between different brands of each. Design: Crossover with two sessions of 12 blindfold tastings of two whiskies and two brandies before and after supper, repeated not more than seven days later. Participants: four volunteers aged 50-68 years, all moderate drinkers of alcohol and members of a wine club. Results: Only one participant produced irrefutable statistical evidence of being able to distinguish between whisky and brandy, correctly identifying 50/51 (98%) samples. The participant who was best able to distinguish between whisky and brandy was also best able to identify correctly the brand of whisky (100%). Conclusion: The results show that some participants could distinguish neither between malt whisky and brandy nor between different brands of whisky and brandy. However, the success of one participant [a Scotsman] shows that "it can be done", and that his whisky specific ability is acquired not innate.
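To get a feel for why 50 correct answers out of 51 counts as such strong evidence, we can work out how likely that score would be from pure guessing. The sketch below assumes that each tasting is an independent 50:50 guess between whisky and brandy; that is a simplifying assumption for illustration, not necessarily the analysis used in the paper, and the function name is hypothetical.

```python
from math import comb

def guessing_tail_probability(correct, trials, p_guess=0.5):
    """Probability of at least `correct` successes in `trials` independent guesses."""
    return sum(
        comb(trials, k) * p_guess**k * (1 - p_guess) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Chance of matching the best participant (50 of 51) by guessing alone.
p = guessing_tail_probability(50, 51)
print(f"P(at least 50 of 51 by guessing) = {p:.2e}")
```

Under that assumption the probability is vanishingly small, which is why a single participant can provide convincing evidence that "it can be done".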

These experiments received comments from some of their medical colleagues, for those of you who might like to read them:

James Howie (1983) Good motivation but indifferent methods. British Medical Journal 287:1913-1914.
Douglas G Altman (1983) How blind were the volunteers? British Medical Journal 287:1914-1915.
Stephen J Chadwick, Hugh A Dudley (1983) In defense of the whisky drinker on the Clapham omnibus. British Medical Journal 287:1915.
Ken MacRae (1994) A spirited attempt. British Medical Journal 309:1688.

Of these, perhaps the most pertinent point is the one made by both Howie and MacRae, who note that some of the drinks chosen were rather similar. Obviously, this point alone obviates the need for an experiment at all: if it is known beforehand that whiskies are similar to each other (as shown in the two blog posts linked above), then why do we need an experiment to show it? Except for the fun of doing the tasting, of course!

Finally, Altman notes that "last Christmas I helped to perform a small experiment that demonstrated that white wine and red wine cannot always be distinguished (unpublishable results)." Christmas can have that effect on you.

Monday, December 18, 2017

Sample sizes, and the supposed wine differences between women and men

We have seen a number of web discussions this year about the ostensible differences between males and females when it comes to sensory perception, particularly the tasting of wine. For example:

More recently, a paper appeared in the Journal of Wine Economics that has been taken to shed some light on this issue (for example, see Men and women don't actually have different taste in wine; Do men and women taste differently?). It apparently opposes the notion that such differences are of much practical importance:

Jeffrey C. Bodington (2017) Wine, women, men, and Type II error.
Journal of Wine Economics 12: 161-172.

The author compiled data from 23 different wine tastings (conducted by other people) in which the wine-quality scores could be subdivided into those from male and female tasters. He then proceeded to apply various statistical tests to the data, to assess whether there were differences between women and men in the wine-tasting results.

Leaving aside the physiology of wine tasting for the moment, there is one thing that immediately struck me when I read the paper: the incredibly small sample sizes used in the data analyses. This is a topic that I have commented on before, when I pointed out that there are issues with experiments that can far outweigh the effects of sample size, notably biased sampling (Why do people get hung up about sample size?). However, in this current case, the sample sizes seem quite unbelievable: there were never more than 20 people per gender at each of 23 different tastings, and often far fewer.

Now, it might seem intuitively obvious that such small sizes are unlikely to lead us anywhere useful, but we can do better than merely express an opinion about this. It seems worthwhile for me to try to quantitatively assess the situation.

Sample sizes and experiments

First, however, we need to be clear about the consequences of sample sizes for experiments. Indeed, the paper's author himself directs attention to the issue, by mentioning "Type II error" in his title. This esoteric expression is the statistical term for what scientists call "false negatives", the failure to find something when it really is there to be found. Conversely, "Type I errors" are "false positives", the finding of something that is either illusory or trivially unimportant.

Using an analogy, if I am looking for a needle in a haystack then a false negative means that I fail to find it when there really is a needle in there — I have made a Type II error. If I find something in the haystack that I interpret as being a needle when it is not, then that is a false positive — I have made a Type I error.

Small sample sizes are prone to false negatives, whereas gratuitously large sample sizes are prone to false positives. Neither situation can be considered to be a good thing for an experiment.

Needless to say, statisticians have had a good look at these issues, and they have developed the notions of both statistical Power and statistical Sensitivity. Mathematically, Power is the complement of the Type II error rate, and thus expresses the theoretical probability that a statistical test will correctly reject the null hypothesis being tested; that is, Power tells us how probable it is that the statistical analysis will find something if there really is something to find. Sensitivity is a related concept, referring to the empirical ability of an experiment to correctly reject the null hypothesis. Both concepts can be expressed mathematically.
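The link between sample size and Power can be sketched numerically. The following is a rough illustration only: it uses a normal approximation to a two-sample comparison of means (an exact calculation would use the noncentral t distribution, so the numbers will differ somewhat for small samples), and the function name, the standardized effect size (Cohen's d) of 0.2 and the group sizes are all assumptions chosen for the example.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n1, n2, alpha=0.05):
    """Approximate two-sided Power for comparing the means of two groups.

    effect_size is the standardized difference between the group means (Cohen's d).
    Normal approximation; not an exact t-test calculation.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                  # critical value for the Type I error rate
    shift = effect_size * sqrt(n1 * n2 / (n1 + n2))    # expected size of the test statistic
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Assumed small effect (d = 0.2) at several hypothetical group sizes.
for n in (10, 20, 100, 400):
    power = approx_power(0.2, n, n)
    print(f"n = {n:3d} per group: Power = {power:.2f}, false-negative rate = {1 - power:.2f}")
```

Under these assumptions, groups of 10 or 20 people leave the Power at or below about 10%, while several hundred people per group are needed to approach the conventional 80%.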

A look at the paper

Returning to the paper in question, two experimental hypotheses were tested, using different subsets of the 23 tastings:
H1: Women’s and men’s scores have different means and standard deviations
H2: Women and men have differently shaped distributions of scores
Various statistical tests were applied to test each hypothesis. I won't go into the technical details of what this all means, but will instead jump straight to the results presented in Tables 1, 2 and 3 of the paper.

Sufficient details are presented for me to perform a Power analysis of H1 and a Sensitivity analysis of both H1 and H2. (The test statistics are not presented for H2, only the Type I errors, and so the Power analysis cannot be performed.) My calculations were performed using the G*Power v.3.1.9.2 program; and the outcomes of my analyses are shown in the table below. Each row of the table represents one wine tasting, as listed in the first column. The second column shows the sample sizes, taken from the paper, while the final two columns are the results of my new calculations.

Power and Sensitivity analyses of wine tasting sample sizes

Formally, my Power analyses assess the post hoc achieved power of the statistical t-test at p=0.05 Type I error. More practically for you, all you need to know is that a Power of 80% is conventionally considered to be the minimum acceptable level for an experiment, and this represents an allowable probability of 20% for false negatives. For preference, a good experiment would require a Power of 95%, which is a 5% probability for false negatives.

As you can see in the table, the Power of Bodington's analyses is nowhere near these levels, never exceeding 11%. Indeed, his probabilities of false negatives are in the range 90-95%, meaning that he is coming close to certainty that he will accept the null hypothesis for each statistical test, and thus conclude that he found no difference between men and women, irrespective of whether there actually are such differences or not.

Formally, my Sensitivity analyses quantify what is called the required Effect size. This is a quantitative measure of the "strength" of the phenomenon that the experiment is seeking to find. Large Effect sizes mean that the phenomenon will be easy to detect statistically, while small Effect sizes will be hard to find. Using my earlier analogy, if I am looking for a sewing needle in my haystack, then that is a small Effect size, whereas looking for a knitting needle would be a large Effect size.

Small Effect sizes require large sample sizes, while large Effect sizes can be detected even with small sample sizes. I specified p=0.05 & power=0.80 for my calculations, which relate to the acceptable levels of false positives and negatives, respectively.

Using the standard rule of thumb (Shlomo S. Sawilowsky. 2009. New effect size rules of thumb. Journal of Modern Applied Statistical Methods 8: 597-599), Effect sizes are grouped from very small (0.01) to huge (2.0). Effect sizes larger than 1 are encountered only rarely in the real world, and they do not need statistical analysis because the phenomenon being studied will be obvious, even to the naked eye.

For Bodington's statistical analyses, the various Effect sizes are: 1 medium, 8 large, 15 very large, and 7 huge; none of them are small or very small. So, his tests were capable of detecting only the most blatant differences between women and men. Such differences would actually not need an experiment, because we would be able to see them for ourselves, anyway.
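To see why small groups can detect only large effects, here is a minimal sketch of a Sensitivity calculation. It is not a reproduction of the G*Power output: it uses a normal approximation rather than the t distribution (so it slightly understates the required Effect size for small groups), the group sizes are hypothetical, and the rule for assigning a label (the largest Sawilowsky benchmark that the value reaches) is my own assumption.

```python
from math import sqrt
from statistics import NormalDist

# Benchmarks from Sawilowsky (2009), smallest to largest.
BENCHMARKS = [(0.01, "very small"), (0.2, "small"), (0.5, "medium"),
              (0.8, "large"), (1.2, "very large"), (2.0, "huge")]

def min_detectable_effect(n1, n2, alpha=0.05, power=0.80):
    """Smallest standardized difference detectable with the given group sizes."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * sqrt(1 / n1 + 1 / n2)

def label_effect(d):
    """Largest benchmark label that d reaches (an assumed binning rule)."""
    label = "below very small"
    for threshold, name in BENCHMARKS:
        if d >= threshold:
            label = name
    return label

# Hypothetical group sizes, up to the maximum of 20 per gender noted above.
for n in (3, 5, 10, 20):
    d = min_detectable_effect(n, n)
    print(f"n = {n:2d} per group: required Effect size = {d:.2f} ({label_effect(d)})")
```

Even with 20 people per group, this approximation gives a required Effect size in the "large" category, and smaller groups push it into "very large" or "huge".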

Required sample sizes

To put this another way, there would have to be almost no variation between people within each of the two genders, in order for Bodington's statistical tests to have detected anything. When there is only small variation, then small sample sizes can be effective. However, if there are large differences between people, irrespective of gender, then we would require large sample sizes. My Power and Sensitivity analyses show that we would, indeed, require very large sample sizes in this case.

To examine this, we can perform what is called a prospective (or a priori) Power analysis. My calculations show that sample sizes of 300-450 people would be required per gender, depending on the particular assumptions made during the calculations. That is, if you want to do this experiment for yourselves, you will need to find at least 300 men and 300 women, give each of them the same sets of wines to taste, and then record their scores. If your statistical analyses of the scores still do not detect any gender differences, then you can be fairly sure that any such differences are indeed small.
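For those who want to see how such a number arises, the following is a minimal sketch of a prospective sample-size calculation. The assumptions are mine for illustration, and are not necessarily the settings behind the 300-450 figure: a small Effect size of 0.2, a normal approximation to the t-test, and equal numbers of women and men. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80, two_sided=True):
    """Approximate number of people needed in each group (normal approximation)."""
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_sum = z.inv_cdf(1 - tail) + z.inv_cdf(power)
    return ceil(2 * (z_sum / effect_size) ** 2)

# Assumed small effect (d = 0.2), with two different choices for the test.
print("two-sided test:", n_per_group(0.2), "per gender")
print("one-sided test:", n_per_group(0.2, two_sided=False), "per gender")
```

Both of these results fall inside the 300-450 range quoted above, which illustrates how the answer shifts with the particular assumptions made.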

Conclusion

Bodington's conclusion that the experiments found no detectable difference between men and women is therefore unsurprising — the analyses have so little statistical power that any such differences, if they exist, were unlikely to be found. The author suggests that "these differences are small compared to non-gender-related idiosyncratic differences between individuals and random expressions of preference." If so, then any study of the idea that the genders prefer different wines will require much larger sample sizes than the ones used in the published paper.

On this particular wine topic, the jury is still out. Based on differences in physiology, there is good reason to expect differences in wine tasting between females and males. However, measuring the size and nature of this difference, if it really does exist, remains to be done.

Monday, December 11, 2017

Do community wine-quality scores converge to the middle ground?

The short answer appears to be: not very often. This is surprising, given what is reported for other communities. This may indicate something unique about the wine community.

A few weeks ago, I discussed community wine-quality scores, such as those in the Cellar Tracker database (Cellar Tracker wine scores are not impartial). One of the subjects I commented on was the suggestion that the "wisdom of crowds" can mean that members of the crowd allow their judgement to be skewed by their peers. In the case of wine-quality scores, this would mean that scores from large groups of tasters may converge towards the middle ground, as the number of scores increases.

In the formal literature, this topic has been examined by, for example, Omer Gokcekus, Miles Hewstone & Huseyin Cakal (2014. In vino veritas? Social influence on ‘private’ wine evaluations at a wine social networking site. American Association of Wine Economists Working Paper No. 153). They looked at the trend in Cellar Tracker scores for wines through time, from when the first score is added for each wine. They wanted to see whether the variation in scores for a wine decreases as more scores are added for that wine, which would support the thesis about crowd behavior. They concluded that there is some evidence of this.

The important practical point here is that Cellar Tracker displays the average score for each wine when a user tries to add a new score of their own, and it is hard to ignore this information. So, it would be rather easy for a user to be aware of the difference between their own proposed score and the current "wisdom of the crowds". This would presumably have little or no effect when only a few scores have been added for each wine, but it might potentially have an effect as more scores are added, because the crowd opinion then becomes so much clearer.

It has occurred to me that some data that I used in another blog post (Are there biases in community wine-quality scores?) might also be used to examine the possibility that Cellar Tracker scores are biased in this way. In my case, I will look at individual wines, rather than pooling the data across all wines, as was done in the research study described above.

The data at hand are the publicly available scores from Cellar Tracker for eight wines (for my data, only 55-75% of the scores were available as community scores, with the rest not being shared by the users). These eight wines included red wines from several different regions, a sweet white, a still white, a sparkling wine, and a fortified wine. In each case I searched the database for a wine with at least 300 community scores; but I did not succeed for the still white wine (which had only 189 scores).

The results for the eight wines are shown in the graphs at the end of the post. Each point represents one quality score for the wine (some users enter multiple scores through time). For each wine, each score is shown (vertically) as the difference from the mean score for the wine: positive values indicate that the score was greater than the average, while negative values indicate that it was less than the average. The time is shown (horizontally) as the number of days after the first tasting recorded for that wine.

The expectation is that, if the wine-quality scores do converge towards the middle ground, then the variability of the scores should decrease through time. That is, the points in the graphs will be more spread out vertically during the earliest times, compared to the later times.
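This expectation can also be checked numerically rather than by eye. The sketch below is one possible way to do it, not the method used for the graphs: it divides the time span into three equal periods and reports the spread of the scores about the wine's overall mean in each period. The function name and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean

def spread_by_period(records, n_periods=3):
    """Spread of scores about the overall mean, for each equal-length time period.

    records: (days_since_first_tasting, score) pairs for one wine.
    Returns one root-mean-square deviation per period (None for an empty period).
    """
    records = list(records)
    overall_mean = mean(score for _, score in records)
    first_day = min(day for day, _ in records)
    last_day = max(day for day, _ in records)
    width = (last_day - first_day) / n_periods or 1  # guard against a zero-length span
    buckets = [[] for _ in range(n_periods)]
    for day, score in records:
        index = min(int((day - first_day) / width), n_periods - 1)
        buckets[index].append(score - overall_mean)
    return [sqrt(mean(d * d for d in b)) if b else None for b in buckets]

# Made-up scores for one wine, purely to show the calculation.
example = [(0, 95.0), (10, 85.0), (100, 91.0), (150, 89.0), (290, 90.0), (300, 90.0)]
print(spread_by_period(example))
```

If scores really do converge towards the middle ground, the values should shrink from the first period to the last, as they do for this made-up wine.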

The results seem to be quite consistent, with one exception. That exception is the first one, where the scores are, indeed, more variable through the first third of the time period. In all of the other cases, the scores are most variable during the middle period, which is when most of the scores get added to the database, or sometimes also in the late period.

So, for these wines at least, I find little evidence that Cellar Tracker scores do converge towards the middle ground. This seems to disagree with the study of Gokcekus, Hewstone & Cakal (mentioned above), who concluded that community scores are normative (= "to conform with the positive expectations of another") rather than informational ("to accept information obtained from another as evidence about reality").

However, a study by Julian McAuley & Jure Leskovec (2013. From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. Proceedings of the 22nd International Conference on the World Wide Web, pp. 897-908), found that user behavior in the Cellar Tracker database was quite different from the other four community databases that they studied (Beer Advocate; Rate Beer; Amazon Fine Foods; Amazon Movies).

So, maybe wine drinkers really are different from beer drinkers and movie goers, when it comes to community assessment of their products? The wisdom of the wine crowd may be unique! In particular, you will note that wine drinkers are not afraid to give rather low scores for wines — the scores in the graphs go much further below the average than they do above it. Note that the dataset excludes wines that are considered to be flawed, which are usually not given scores at all (although very rarely they receive scores in the 50-60 range, which I excluded, as representing faulty wines).

It seems to me that community wine scores are actually informational, rather than normative, expressing the opinion of the drinker rather than that of the crowd. This also fits in with the easily observed fact that the community scores are consistently lower than are those of the professional wine critics (see my previous post Cellar Tracker wine scores are not impartial) — the wine community is not easily swayed by expert opinion. However, the tendency of all wine reviewers, professional, semi-professional and amateur, to favor a score of 90 over a score of 89 certainly represents an unfortunate bias.

Cellar Tracker wine-quality scores through time for Alvear Pedro Ximenez

Cellar Tracker wine-quality scores through time for Barbaresco 2006

Cellar Tracker wine-quality scores through time for Caymus Cabernet 2012

Cellar Tracker wine-quality scores through time for Clicquot NV Brut

Cellar Tracker wine-quality scores through time for Edwards Sauvignon Blanc 2012

Cellar Tracker wine-quality scores through time for Pontet-Canet 2003

Cellar Tracker wine-quality scores through time for Rieussec 2001

Cellar Tracker wine-quality scores through time for Tondonia 2001

Monday, December 4, 2017

California cabernets do not get the same quality scores at different tastings

We are commonly told that most wines are drunk within a very few days of purchase. On the other hand, it is a commonly held belief among connoisseurs that many wines are likely to improve with a bit of bottle age, especially red wines. Counter-balancing the latter idea, it is easy to demonstrate that the perception of wine quality depends on the people concerned and the circumstances under which the wines are tasted.

Therefore, I thought that it might be interesting to look at some repeated tastings of the same wines under circumstances where they are formally evaluated under roughly the same conditions. Do these wines get the same quality scores at the different tastings?

To look at this, I will use some of the data from the tastings of the Vintners Club, based in San Francisco. The results of the early club tastings are reported in the book Vintners Club: Fourteen Years of Wine Tastings 1973-1987 (edited by Mary-Ellen McNeil-Draper. 1988); and I have used this valuable resource in several previous blog posts (eg. Should we assess wines by quality points or rank order of preference?).

For each wine tasted, the book provides the average of the UCDavis points (out of 20) assigned by the group of tasters present at that meeting. The Vintners Club has "always kept to the Davis point system" for its tastings and, therefore, averaging these scores is mathematically valid, as is comparing them across tastings. Many of the wines were tasted at more than one meeting, although each meeting had its own theme. For our purposes here, the largest dataset is provided by the tastings involving California cabernet wines, which I will therefore use for my exploration.

In the book, there were 170 different California cabernets that the Club tasted more than once, sometimes up to five years apart. Most of these were tasted only twice, but some were tasted up to four times over four years. Of these 170 wines, only eight wines produced the same average score on their first two tasting occasions. Of the rest, 63 wines (37%) produced a lower score on the second occasion, and 99 (58%) produced a higher average score. Perhaps we might tentatively conclude that California cabernet wines do tend to increase in quality with a bit of time in bottle?
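One simple way to ask whether 99 increases against 63 decreases is more lopsided than chance would produce is a sign test. The sketch below is my own illustration rather than an analysis from the book: it ignores the eight tied wines, treats each wine as an independent coin toss, and uses a hypothetical function name.

```python
from math import comb

def sign_test_p(n_lower, n_higher):
    """Two-sided sign test p-value for the counts of decreases and increases (ties excluded)."""
    n = n_lower + n_higher
    k = min(n_lower, n_higher)
    # Probability of a split at least this uneven in one direction, under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

# Counts reported above: 63 wines scored lower and 99 higher at the second tasting.
print(f"p = {sign_test_p(63, 99):.4f}")
```

A small p-value here says only that the imbalance is unlikely to be chance; it cannot tell us whether the cause is the wine or the people tasting it.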

However, it is instructive to look at those 137 wines that were re-tasted within one year of their first tasting. This will give us some idea of the repeatability of wine quality scores, as we should not really be expecting California cabernet wines to change too much within their first year in the bottle. Any differences in scores are therefore likely to reflect the tasting group rather than the wine itself (unless there is much bottle variation, or the wines mature very rapidly).

The data are shown in the first graph, with the difference between the two tasting dates shown horizontally, and the difference between the two average UCDavis scores shown vertically (ie. second tasting score minus first tasting score). Each point represents one wine, tasted twice within a year.

Difference in quality scores for wines re-tasted within 1 year

Clearly, there is considerable variability in the quality scores between tastings (the scores are spread out vertically over several quality points). Moreover, there is not much pattern to this variability — even after only a few days the scores can differ by more than 1 point; and even after a year they can still be identical. Most of the wines (70%) produced scores within +/– 1 point at the two tastings.

Notably, however, there were more decreases in score than there were increases between the two tastings. Only eight wines produced an increase in score of more than 1 point, while 33 wines (24%) produced a decrease in score of more than 1 point, and four of these actually decreased by more than 2 points. (NB: some of the points on the graph sit on top of each other.) I was not expecting to see such a strong pattern.

A so-called Difference/Average plot can sometimes be informative, and this is shown in the next graph. This shows the same data as above, but this time the horizontal axis represents the average of the two quality scores for each wine (rather than representing time).

Quality scores for wines re-tasted within 1 year

This graph does not reveal much in the way of tell-tale patterns, which is unusual. However, we might anticipate that high-scoring wines will get more consistent quality scores, and this appears to be so for those few wines scoring >16 points. Furthermore, wines scoring <14 points do not do well at their second tasting.
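The coordinates for a Difference/Average plot are easy to compute from the paired scores. Here is a minimal sketch, with made-up scores and hypothetical names, that also reports the mean difference as a simple summary of whether second tastings tend to score lower.

```python
from statistics import mean

def difference_average_points(pairs):
    """Convert (first_score, second_score) pairs into (average, difference) points.

    The difference is the second tasting score minus the first tasting score.
    """
    return [((first + second) / 2, second - first) for first, second in pairs]

# Made-up average scores (out of 20) for four wines, each tasted twice.
scores = [(15.0, 14.5), (16.5, 16.5), (13.5, 12.0), (14.0, 15.0)]
points = difference_average_points(scores)
bias = mean(diff for _, diff in points)

for average, diff in points:
    print(f"average = {average:5.2f}, difference = {diff:+.2f}")
print(f"mean difference = {bias:+.2f}")
```

Plotting the first value of each pair horizontally and the second vertically gives the graph described above.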

Finally, we can look at those six California cabernets that were each tasted on four separate occasions. The final graph shows the time-course (horizontally) of their score (vertically), with each wine represented by a single line (as labeled).

Quality scores of wines tasted four times

Note that only one wine (from Stag's Leap) consistently increased in assessed quality over the years, while two other wines (from Beaulieu, and Robert Mondavi) consistently decreased. The remaining three wines had more erratic patterns. These differences may simply reflect random variation, and so we shouldn't read too much into this small sample size. Nevertheless, we do not see the hoped-for general increase in assessed quality of California cabernets over their first few years in bottle.

So, do California cabernets get the same quality scores at different tastings? In general, yes, at least to within a point or so. However, a large number of them end up with notably lower scores if there is a second tasting within a year.