- Are women better at tasting wine than men? Science has the answer
- War of the noses
- The wine glass ceiling
- Jeffrey C. Bodington (2017) Wine, women, men, and Type II error.
Journal of Wine Economics 12: 161-172.
Leaving aside the physiology of wine tasting for the moment, there is one thing that immediately struck me when I read the paper — the incredibly small sample sizes used in the data analyses. This is a topic that I have commented on before, when I pointed out that there are issues with experiments that can far outweigh the effects of sample size, notably biased sampling (Why do people get hung up about sample size?). However, in this current case, the samples sizes seem quite unbelievable — there were never more than 20 people per gender at each of 23 different tastings, and often much fewer.
Now, it might seem intuitively obvious that such small sizes are unlikely to lead us anywhere useful, but we can do better than merely express an opinion about this. It seems worthwhile for me to try to quantitatively assess the situation.
Samples sizes and experiments
First, however, we need to be clear about the consequences of samples sizes for experiments. Indeed, the paper's author himself directs attention to the issue, by mentioning "Type II error" in his title. This esoteric expression is the statistical term for what scientists call "false negatives", the failure to find something when it really is there to be found. Alternatively, "Type I errors" are "false positives", the finding of something that is either illusory or trivially unimportant.
Using an analogy, if I am looking for a needle in a haystack then a false negative means that I fail to find it when there really is a needle in there — I have made a Type II error. If I find something in the haystack that I interpret as being a needle when it is not, then that is a false positive — I have made a Type I error.
Small sample sizes are prone to false negatives, whereas gratuitously large sample sizes are prone to false positives. Neither situation can be considered to be a good thing for an experiment.
Needless to say, statisticians have had a good look at these issues, and they have developed the notions of both statistical Power and statistical Sensitivity. Mathematically, Power is the complement of a Type II error, and thus expresses the theoretical probability that a statistical test will correctly reject the null hypothesis being tested — that is, Power tells us how probable it is that the statistical analysis will find something if there really is something to find. Sensitivity is a related concept, referring to the empirical ability of an experiment to to correctly reject the null hypothesis. Both concepts can be expressed mathematically.
A look at the paper
Returning to the paper in question, two experimental null hypotheses were tested, using different subsets of the 23 tastings:
H1: Women’s and men’s scores have different means and standard deviations
H2: Women and men have differently shaped distributions of scores
Various statistical tests were applied to test each hypothesis. I won't go into the technical details of what this all means, but will instead jump straight to the results presented in Tables 1, 2 and 3 of the paper.
Sufficient details are presented for me to perform a Power analysis of H1 and a Sensitivity analysis of both H1 and H2. (The test statistics are not presented for H2, only the Type I errors, and so the Power analysis cannot be performed.) My calculations were performed using the G*Power v.188.8.131.52 program; and the outcomes of my analyses are shown in the table below. Each row of the table represents one wine tasting, as listed in the first column. The second column shows the sample sizes, taken from the paper, while the final two columns are the results of my new calculations.
Formally, my Power analyses assess the post hoc achieved power of the statistical t-test at p=0.05 Type I error. More practically for you, all you need to know is that a Power of 80% is conventionally considered to be the minimum acceptable level for an experiment, and this represents an allowable probability of 20% for false negatives. For preference, a good experiment would require a Power of 95%, which is a 5% probability for false negatives.
As you can see in the table, the Power of Bodington's analyses are nowhere near these levels, as the Power of his analyses never exceeds 11%. Indeed, his probabilities of false negatives are in the range 90-95%, meaning that he is coming close to certainty that he will accept the null hypothesis for each statistical test, and thus conclude that he found no difference between men and women, irrespective of whether there actually are such differences or not.
Formally, my Sensitivity analyses quantify what is called the required Effect size. This is a quantitative measure of the "strength" of the phenomenon that the experiment is seeking to find. Large Effect sizes mean that the phenomenon will be easy to detect statistically, while small Effect sizes will be hard to find. Using my earlier analogy, if I am looking for a sewing needle in my haystack, then that is a small Effect size, whereas looking for a knitting needle would be a large Effect size.
Small Effect sizes require large sample sizes, while large Effect sizes can be detected even with small sample sizes. I specified p=0.05 & power=0.80 for my calculations, which relate to the acceptable levels of false positives and negatives, respectively.
Using the standard rule of thumb (Shlomo S. Sawilowsky. 2009. New effect size rules of thumb. Journal of Modern Applied Statistical Methods 8: 597-599), Effect sizes are grouped from very small (0.1) to huge (2.0). Effect sizes larger than 1 are encountered only rarely in the real world, and they do not need statistical analysis because the phenomenon being studied will be obvious, even to the naked eye.
For Bodington's statistical analyses, the various Effect sizes are: 1 medium, 8 large, 15 very large, and 7 huge; none of them are small or very small. So, his tests were capable of detecting only the most blatant differences between women and men. Such differences would actually not need an experiment, because we would be able to see them for ourselves, anyway.
Required sample sizes
To put this another way, there would have to be almost no variation between people within each of the two genders, in order for Bodington's statistical tests to have detected anything. When there is only small variation, then small sample sizes can be effective. However, if there are large differences between people, irrespective of gender, then we would require large sample sizes. My Power and Sensitivity analyses show that we would, indeed, require very large sample sizes in this case.
To examine this, we can perform what is called a prospective (or a priori) Power analysis. My calculations show that sample sizes of 300-450 people would be required per gender, depending on the particular assumptions made during the calculations. That is, if you want to do this experiment for yourselves, you will need to find at least 300 men and 300 women, give each of them the same sets of wines to taste, and then record their scores. If your statistical analyses of the scores still do not detect any gender differences, then you can be fairly sure that any such differences are indeed small.
Bodington's conclusion that the experiments found no detectable difference between men and women is therefore unsurprising — the analyses have so little statistical power that any such differences, if they exist, were unlikely to be found. The author suggests that "these differences are small compared to non-gender-related idiosyncratic differences between individuals and random expressions of preference." If so, then any study of the idea that the genders prefer different wines will require much larger sample sizes than the ones used in the published paper.
On this particular wine topic, the jury is still out. Based on differences in physiology, there is good reason to expect differences in wine tasting between females and males. However, measuring the size and nature of this difference, if it really does exist, remains to be done.