Monday, December 18, 2017

Sample sizes, and the supposed wine differences between women and men

We have seen a number of web discussions this year about the ostensible differences between males and females when it comes to sensory perception, particularly the tasting of wine.
More recently, a paper appeared in the Journal of Wine Economics that has been taken to shed some light on this issue (see, for example: Men and women don't actually have different taste in wine; Do men and women taste differently?). It apparently opposes the notion that such differences are of much practical importance.
The paper's author, Bodington, compiled data from 23 different wine tastings (conducted by other people) in which the wine-quality scores could be subdivided into those from male and female tasters. He then applied various statistical tests to the data, to assess whether the wine-tasting results differed between women and men.

Leaving aside the physiology of wine tasting for the moment, one thing immediately struck me when I read the paper — the incredibly small sample sizes used in the data analyses. This is a topic that I have commented on before, when I pointed out that there are issues with experiments that can far outweigh the effects of sample size, notably biased sampling (Why do people get hung up about sample size?). However, in this current case, the sample sizes seem unbelievably small — there were never more than 20 people per gender at each of the 23 tastings, and often far fewer.

Now, it might seem intuitively obvious that such small sizes are unlikely to lead us anywhere useful, but we can do better than merely express an opinion about this. It seems worthwhile for me to try to quantitatively assess the situation.

Sample sizes and experiments

First, however, we need to be clear about the consequences of sample sizes for experiments. Indeed, the paper's author himself directs attention to the issue, by mentioning "Type II error" in his title. This esoteric expression is the statistical term for what scientists call "false negatives", the failure to find something when it really is there to be found. Conversely, "Type I errors" are "false positives", the finding of something that is either illusory or trivially unimportant.

Using an analogy, if I am looking for a needle in a haystack then a false negative means that I fail to find it when there really is a needle in there — I have made a Type II error. If I find something in the haystack that I interpret as being a needle when it is not, then that is a false positive — I have made a Type I error.

Small sample sizes are prone to false negatives, whereas gratuitously large sample sizes are prone to false positives. Neither situation can be considered to be a good thing for an experiment.

Needless to say, statisticians have had a good look at these issues, and they have developed the notions of both statistical Power and statistical Sensitivity. Mathematically, Power is the complement of the Type II error rate, and thus expresses the theoretical probability that a statistical test will correctly reject the null hypothesis being tested — that is, Power tells us how probable it is that the statistical analysis will find something if there really is something to find. Sensitivity is a related concept, referring to the empirical ability of an experiment to correctly reject the null hypothesis. Both concepts can be expressed mathematically.
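To make this concrete, here is a minimal sketch of a power calculation for a two-sided, two-sample t-test. It uses the standard normal approximation, so the figures will differ slightly from exact noncentral-t calculations (such as those G*Power performs); the effect size and group size shown are purely illustrative, not taken from the paper:

```python
from math import sqrt
from statistics import NormalDist

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test.

    Normal approximation: the noncentrality parameter is
    d * sqrt(n/2), where d is the standardized effect size
    and n is the number of subjects per group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value
    delta = d * sqrt(n_per_group / 2)   # noncentrality parameter
    return 1 - z.cdf(z_crit - delta)    # P(reject H0 | true effect d)

# A conventionally "medium" effect (d = 0.5) with 20 tasters per group:
print(round(two_sample_power(0.5, 20), 2))  # → 0.35
```

Even under these generous assumptions, 20 people per group detects a medium-sized effect only about a third of the time — well short of the conventional 80% benchmark.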

A look at the paper

Returning to the paper in question, two experimental hypotheses were tested, using different subsets of the 23 tastings:
    H1: Women’s and men’s scores have different means and standard deviations
    H2: Women and men have differently shaped distributions of scores
Various statistical tests were applied to test each hypothesis. I won't go into the technical details of what this all means, but will instead jump straight to the results presented in Tables 1, 2 and 3 of the paper.

Sufficient details are presented for me to perform a Power analysis of H1 and a Sensitivity analysis of both H1 and H2. (The test statistics are not presented for H2, only the Type I errors, and so the Power analysis cannot be performed.) My calculations were performed using the G*Power program; the outcomes of my analyses are shown in the table below. Each row of the table represents one wine tasting, as listed in the first column. The second column shows the sample sizes, taken from the paper, while the final two columns are the results of my new calculations.

Power and Sensitivity analyses of wine tasting sample sizes

Formally, my Power analyses assess the post hoc achieved power of the statistical t-test at p=0.05 Type I error. More practically, all you need to know is that a Power of 80% is conventionally considered to be the minimum acceptable level for an experiment, which represents an allowable probability of 20% for false negatives. Preferably, a good experiment would have a Power of 95%, which is a 5% probability of false negatives.

As you can see in the table, the Power of Bodington's analyses is nowhere near these levels — it never exceeds 11%. Indeed, his probabilities of false negatives are in the range 90-95%, meaning that he is close to certain to accept the null hypothesis for each statistical test, and thus conclude that he found no difference between men and women, irrespective of whether such differences actually exist.

Formally, my Sensitivity analyses quantify what is called the required Effect size. This is a quantitative measure of the "strength" of the phenomenon that the experiment is seeking to find. Large Effect sizes mean that the phenomenon will be easy to detect statistically, while small Effect sizes will be hard to find. Using my earlier analogy, if I am looking for a sewing needle in my haystack, then that is a small Effect size, whereas looking for a knitting needle would be a large Effect size.
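For a t-test, the Effect size in question is Cohen's d: the difference between the two group means divided by their pooled standard deviation. A minimal sketch, using made-up wine-quality scores purely for illustration (these are not data from any of the tastings):

```python
from math import sqrt

def cohens_d(scores_a, scores_b):
    """Cohen's d: standardized difference between two group means,
    using the pooled standard deviation."""
    na, nb = len(scores_a), len(scores_b)
    mean_a = sum(scores_a) / na
    mean_b = sum(scores_b) / nb
    var_a = sum((x - mean_a) ** 2 for x in scores_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in scores_b) / (nb - 1)
    pooled_sd = sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (mean_a - mean_b) / pooled_sd

# Hypothetical quality scores (out of 20), for illustration only:
women = [14.0, 15.5, 13.0, 16.0, 14.5]
men = [13.5, 14.0, 12.5, 15.0, 13.0]
print(round(cohens_d(women, men), 2))  # → 0.92
```

A d of about 0.9 means the group means differ by nearly one whole standard deviation — in Sawilowsky's terms, a "large" effect.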

Small Effect sizes require large sample sizes, while large Effect sizes can be detected even with small sample sizes. I specified p=0.05 & power=0.80 for my calculations, which relate to the acceptable levels of false positives and negatives, respectively.

Using the standard rule of thumb (Shlomo S. Sawilowsky. 2009. New effect size rules of thumb. Journal of Modern Applied Statistical Methods 8: 597-599), Effect sizes are grouped from very small (0.1) to huge (2.0). Effect sizes larger than 1 are encountered only rarely in the real world, and they do not need statistical analysis because the phenomenon being studied will be obvious, even to the naked eye.
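The same normal approximation sketched earlier can be inverted to give the smallest Effect size detectable at p=0.05 with 80% power for a given group size — a rough, approximate version of what a Sensitivity analysis computes:

```python
from math import sqrt
from statistics import NormalDist

def detectable_effect(n_per_group, alpha=0.05, power=0.80):
    """Smallest Cohen's d detectable by a two-sided, two-sample
    t-test with the given per-group sample size (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # Type I error criterion
    z_power = z.inv_cdf(power)          # allowance for Type II error
    return (z_alpha + z_power) * sqrt(2 / n_per_group)

for n in (5, 10, 20):
    print(n, round(detectable_effect(n), 2))
# → 1.77 ("huge"), 1.25 ("very large"), 0.89 ("large"), respectively
```

So with 20 tasters per gender the smallest reliably detectable effect is already "large", and with 10 or fewer it must be "very large" to "huge" — consistent with the pattern in my table.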

For Bodington's statistical analyses, the various Effect sizes are: 1 medium, 8 large, 15 very large, and 7 huge; none of them are small or very small. So, his tests were capable of detecting only the most blatant differences between women and men. Such differences would actually not need an experiment, because we would be able to see them for ourselves, anyway.

Required sample sizes

To put this another way, there would have to be almost no variation between people within each of the two genders, in order for Bodington's statistical tests to have detected anything. When there is only small variation, then small sample sizes can be effective. However, if there are large differences between people, irrespective of gender, then we would require large sample sizes. My Power and Sensitivity analyses show that we would, indeed, require very large sample sizes in this case.

To examine this, we can perform what is called a prospective (or a priori) Power analysis. My calculations show that sample sizes of 300-450 people would be required per gender, depending on the particular assumptions made during the calculations. That is, if you want to do this experiment for yourselves, you will need to find at least 300 men and 300 women, give each of them the same sets of wines to taste, and then record their scores. If your statistical analyses of the scores still do not detect any gender differences, then you can be fairly sure that any such differences are indeed small.
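Rearranging the same approximation gives the a priori required sample size per group. Assuming a "small" true difference (Cohen's d = 0.2), with p=0.05 and power=0.80:

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_n(d, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided, two-sample t-test to
    detect effect size d (normal approximation, so a slight
    underestimate relative to exact noncentral-t calculations)."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(2 * (z_total / d) ** 2)

# A "small" gender difference, Cohen's d = 0.2:
print(required_n(0.2))  # → 393
```

This comes out at a little under 400 people per gender, in line with the 300-450 range quoted above; the exact figure depends on the assumed effect size, power, and test.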


Bodington's conclusion that the experiments found no detectable difference between men and women is therefore unsurprising — the analyses had so little statistical power that any such differences, if they exist, were unlikely to be found. The author suggests that "these differences are small compared to non-gender-related idiosyncratic differences between individuals and random expressions of preference." If so, then any study of the idea that the genders prefer different wines will require much larger sample sizes than the ones used in the published paper.

On this particular wine topic, the jury is still out. Based on differences in physiology, there is good reason to expect differences in wine tasting between females and males. However, measuring the size and nature of this difference, if it really does exist, remains to be done.


  1. Given your publicized investigation [*] of the vote recording irregularities at the 1976 "Judgment of Paris," can Bodington's cited gender statistics [9 men and 2 women] for that tasting be trusted?

    [*"Why We No Longer Have The Data From The Judgment of Paris: A Guest Post by David Morrison"
    The Academic Wino blog - February 9, 2017


    -- and --

    "The Paris Tasting Results—And The Meaning Thereof: Part 2"
    Connoisseurs' Guide to California Wine website - January 19, 2017]


    1. The answer is "yes" and "no". The data are consistent for one lot of wines but not for the other. In the second case, the irregularities involve two of the males — there were only two females, anyway, so the data will never provide much help.

  2. It's been decades since I last thumbed through the "Vintners Club: Fourteen Years of Wine Tastings 1973-1987" tome.

    Do they identify the judges by gender?

    1. The Vintners Club book does not identify the individual tasters in any way. Like most wine tastings, the tasters got to be anonymous. Publishing the results of the Judgment of Paris can therefore be seen as a brazen violation of accepted protocol in the wine world. Given this situation, it is remarkable that Bodington found any data at all. The only realistic way to study this question would be to perform an actual experiment.