Monday, 28 August 2017

Why do people get hung up about sample size?

We cannot collect all of the data that we might want, in order to find out whatever it is that we want to know. A botanist cannot collect data from every rose plant on the planet, an ornithologist cannot collect data on every humming bird on the planet, and a wine researcher cannot collect data on every wine on the planet. So, instead, we collect data on a sub-sample, and we then generalize from that sample to the population that we are actually interested in.

Many people seem to think that the size of the sample is important, and they are right. However, size is not the most important thing, not by a long chalk. The most important thing is that the sample must not be biased. Even a small unbiased sample is much much better than a large biased sample.

Bias refers to whether the sample accurately represents the population we are taking the sample from. If the sample does represent the population then it is unbiased, and if it does not represent the population then it is biased. Bias is bad. In fact, it is often fatal to the work, because we will end up making claims about the population that are probably untrue.


Let's take the first example that I worked out for myself, when I was first learning about science. In 1976, Shere Hite published The Hite Report on Female Sexuality in the U.S.A. She had distributed questionnaires in many different ways, including direct mailouts and enclosures in magazines. She described the final sample of females as follows: "All in all, one hundred thousand questionnaires were distributed, and slightly over three thousand returned (more or less the standard rate of return for this type of questionnaire distribution)." She also emphasized that her sample size was much larger than had ever before been used for studies of human sexual behavior (eg. by Kinsey, or Masters and Johnson).

Here, the intended population from which the sample was taken is not the same as the actual sampled population — the questionnaires may well have been distributed to a group of females who were representative of women in the U.S.A., but there is no reason to expect that the respondents were. The respondents chose to respond, while other women chose not to.

It should be obvious that there are only two reasonable conclusions about females in the U.S.A. that can be drawn from this study: (1) it seems that c. 3% of the females will discuss their sex lives, and (2) it is likely that 97% of the females do not voluntarily discuss their sex lives. There is no necessary reason to expect that the sexual activities of these two groups will be the same, at least in the 1970s. Indeed, our general knowledge of people probably leads us to expect just the opposite. Hite’s report is thus solely about the smaller of these two groups (ie. those who will reveal their sex lives), and no justifiable conclusions can be reached about the larger group.

Note that the problem here is not the sample size of 3,000 — it is solely the non-representativeness of this sample that is at issue, since a sample of this size could easily be representative even of a population as large as that of the U.S.A. At one extreme, if I want to work out the ratio of males:females on this planet, then I will actually get the right answer even with a sample of two people, provided one is male and the other is female!

It is important to note that all samples are an unbiased representation of some population, whether large or small. The trick is that we need to work out what that population is. If it is not the same as the population that we intended, then we are in trouble, if we try to generalize our conclusions beyond the actual population. This was Shere Hite's problem, because she drew general conclusions about women in the U.S.A. (her intended population) rather than just those women who will discuss their sex lives (her sampled population).

It is for this reason that government censuses try to sample all (or almost all) of the relevant people. This is the best way to avoid biases — if you can get data from nearly everyone, then there cannot be much bias in your sample!


Professional survey organizations (e.g. Nielsen, Gallup, etc) usually try to address this issue by defining specific sub-samples of their intended population, and then pooling those sub-samples to get their final sample (this is called stratified sampling). For example, they will explicitly sample people from different ages, and different professions, and different ethnic backgrounds, etc — defining sub-groups using any criteria that they feel might be relevant to the question at hand. This greatly increases their chances of getting an unbiased sample of the general populace.

But even this approach does not guarantee that they will succeed. The example that I used to give my students involved predictions for nine consecutive Australian federal elections (1972-1990) from seven different survey organizations. These polling groups mostly forecast the winning political party correctly, although the winning percentages were sometimes quite inaccurately estimated. However, there was one year (1980) when they all got it wrong; that is, they all predicted that the Labor party would win, by margins of 2-9%, whereas the Liberal/NCP coalition actually won by 1% of the vote. In this case their stratified sampling failed to account for the geographical distribution of voters in the various electoral regions.

Note, also, that these types of survey organizations do not focus as much on sample size as they do on bias, as I emphasized above. For example, in 2014, the Nielsen survey organization announced an addition of 6,200 metered homes to its sample used for assessing television markets in the USA, in terms of which channels/shows are being watched (see Nielsen announces significant expansion to sample sizes in local television markets) — this represented "an almost 50% increase in sample size across the set meter market." That is, even after the increase, c. 20,000 homes are currently being used to sample an estimated population of nearly 120,000,000 US homes with TVs (see Nielsen estimates 118.4 million TV homes in the U.S. for the 2016-17 TV season).
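To see why a sample of this size can be enough, here is a rough sketch of the textbook margin of error for a proportion estimated from a simple random sample. It assumes an unbiased simple random sample and a worst-case proportion of 50%; these are simplifying assumptions for illustration, not a description of how Nielsen actually designs or weights its panel. The function name and parameters are hypothetical.

```python
import math

def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample.

    n          -- sample size
    population -- population size (optional; applies the finite-population correction)
    p          -- assumed true proportion (0.5 is the worst case)
    z          -- normal quantile (1.96 for 95% confidence)
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite-population correction: negligible when n is tiny relative to the population
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error

# Figures taken from the text: c. 20,000 homes sampled from c. 120,000,000 TV homes
moe_tv_panel = margin_of_error(20_000, population=120_000_000)
# A sample of 3,000, as in the Hite example, with the population size ignored
moe_3000 = margin_of_error(3_000)

print(f"n = 20,000: +/- {100 * moe_tv_panel:.2f} percentage points")
print(f"n = 3,000:  +/- {100 * moe_3000:.2f} percentage points")
```

Under these assumptions the random error is already small at a few thousand respondents, and the size of the population barely matters. None of this helps if the sample is biased.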


The points that I have made here also apply to the modern phenomenon of collecting and analyzing what is called "Big Data". This has become a buzz expression in the modern world, appearing, for example, in biology with the study of genomics and the business world with the study of social media. Apparently, the idea is that the sheer size of the samples will cure all data analysis ills.

However, data are data, and an enormous biased dataset is of no more use than is a small biased one. In fact, mathematically, all Big Data may do is make you much more confident of the wrong answer. To put it technically, large sample sizes will address errors due to stochastic variation (ie. random variability), but they cannot address errors due to bias.
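A small simulation can make this concrete. The sketch below is purely illustrative: the 30% true rate and the response probabilities are invented assumptions, not data from any study mentioned here. It compares a small unbiased sample with a very large sample in which people who have the trait are three times as likely to respond.

```python
import random

TRUE_RATE = 0.30  # assumed (made-up) share of the population with the trait

def survey_estimate(sample_size, biased, seed=0):
    """Return the estimated share of the trait among the respondents."""
    rng = random.Random(seed)
    responses = []
    while len(responses) < sample_size:
        has_trait = rng.random() < TRUE_RATE
        if biased:
            # Hypothetical response bias: trait-holders respond far more often
            respond_probability = 0.9 if has_trait else 0.3
        else:
            respond_probability = 1.0
        if rng.random() < respond_probability:
            responses.append(has_trait)
    return sum(responses) / sample_size

small_unbiased = survey_estimate(300, biased=False)
large_biased = survey_estimate(100_000, biased=True)

print(f"true rate:                 {TRUE_RATE:.3f}")
print(f"small unbiased (n=300):    {small_unbiased:.3f}")
print(f"large biased (n=100,000):  {large_biased:.3f}")
```

With these made-up response rates the biased survey converges on 0.27 / (0.27 + 0.21), or about 56%, no matter how many responses are collected. Extra data only narrow the uncertainty around that wrong value.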

So, Big Data can lead to big mistakes, unless we think about possible biases before we reach our conclusions.

Monday, 21 August 2017

Wine tastings: should we assess wines by quality points or rank order of preference?

At formal wine tastings, the participants often finish by putting the wines in some sort of consensus quality order, from the wine most-preferred by the tasting group to the least-preferred. This is especially true of wine competitions, of course, but trade and home tastings are often organized this way, as well.

One interesting question, then, is how should this consensus ordering be achieved; and do different methods consistently produce different results?


At the bottom of this post I have listed a small selection of the professional literature on the subject of ranking wines. In the post itself, I will look at some data on the subject, ranking the wines in two different ways.

Dataset

The data I will look at come from the Vintners Club. This club was formed in San Francisco in 1971, to organize weekly blind tastings (usually 12 wines). Remarkably, the club is still extant, although the tastings are now monthly, instead of weekly. The early tastings are reported in the book Vintners Club: Fourteen Years of Wine Tastings 1973-1987 (edited by Mary-Ellen McNeil-Draper. 1988).

The Vintners Club data consist of three pertinent pieces of information for each wine at each tasting:
  • the total score, determined by summing each taster's ranking (1-12) of the wines in descending order of preference (1 is most preferred, 12 is least preferred)
  • the average of the UCDavis points (out of 20) assigned by each taster — the Vintners Club has "always kept to the Davis point system" for its tastings and, therefore, averaging the scores is mathematically valid
  • the number of tasters voting for the wine as 1st place (and also 2nd and 12th).
The Vintners Club uses the total score as their preferred ranking of the wines for each tasting. That is, in the book the wines are ranked in ascending order of their total score, with the minimum score representing the "winning" wine.

For my dataset, I chose the results of the 45 "Taste-offs"  of California wine. These tastings were the play-offs / grand finals (depending on your sporting idiom), consisting of the first- and second-place wines from a series of previous tastings of the same grape varieties. The Vintners Club apparently began its annual Taste-off program in 1973, and has pursued the concept ever since.

In my dataset, there are 14 Taste-offs for cabernet sauvignon, 12 for chardonnay, 9 for zinfandel, 4 for pinot noir, 3 for riesling, and one each for sauvignon blanc, gamay, and petite sirah. There were 17-103 people attending each of the 45 Taste-offs (median 56 people per tasting), of whom 43-96% submitted scores and ranks (median 70%).

For each tasting, I calculated the Spearman correlation between the rank-order of the wines as provided by the total scores and the rank-order of the wines as provided by the average Davis points for each wine. This correlation provides a measure (scale: 0-100%) of how much of the variation in ranks is shared by the two sets of data (total scores versus average points). The percentage is thus a measure of agreement between the two rankings for each tasting.
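As a rough sketch of this calculation, the code below computes a Spearman correlation between the two orderings for a single tasting. The five wines and their numbers are invented for illustration and are not taken from the Vintners Club book. The helper functions are hypothetical, written in plain Python so that nothing beyond the standard library is needed. Note that a low total score is good while a high average of points is good, so the points are negated before ranking.

```python
def average_ranks(values):
    """Rank values from 1 (smallest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical tasting of five wines
total_scores = [110, 135, 150, 160, 185]     # sum of ranks: lower is better
average_points = [15.2, 15.6, 14.1, 14.4, 13.0]  # Davis points: higher is better

# Negate the points so that both lists run from best (smallest) to worst
rho = spearman(total_scores, [-p for p in average_points])
print(f"Spearman correlation: {100 * rho:.0f}%")
```

In this made-up tasting, two pairs of adjacent wines swap places between the two schemes, which is enough to pull the agreement well below 100%.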

Total scores and average points

The graph shows the results of the 45 tastings, with each point representing one of the Taste-offs. The horizontal axis represents the number of people providing scores for that tasting, while the vertical axis is the Spearman correlation for that tasting.

Correlation between two methods for ranking wines

As you can see, in most cases the correlation varies from 50-100%. However, only 1 in every 5 times is the correlation above 90%, which is the level that would indicate almost the same ranking for the two schemes. So, we may conclude that, in general, the total score and the average points do not usually provide the same rank-order of the wines at each tasting.

Indeed, in two cases the two schemes provide very different rank-orders for the wines, with correlations of only 41% and 23%. This is actually rather surprising. These two tastings both involved chardonnay wines, for some reason.

It is a moot point whether to sum the ranks or average the scores. That is, we cannot easily claim that one approach is better than the other — they produce different results, not better or worse results. However, for both approaches there are technical issues that need to be addressed.

For averaging, we need to ensure that everyone is using the same scale, otherwise the average is mathematically meaningless (see How many wine-quality scales are there? and How many 100-point wine-quality scales are there?). Similarly, when trying to combine ranks together, there is no generally agreed method for doing so; in fact, different ways of doing it can produce quite inconsistent outcomes (see the literature references below).

Number of first places

For those wines ranked first overall at each tasting, only 4-60% of the scorers had actually chosen them as their personal top-ranked wines of the evening, with an average of 22%. That is, on average, less than one-quarter of the scorers ranked the overall "winning" wine as being at the top of their own personal list. This indicates that rarely was there a clear winner.

Indeed, for only about half of the tastings was the "winning" wine the one that got the largest number of first places, based on either the sum of ranks or the average points. For the wines ranked first overall at each tasting, in only 24 of the 45 tastings was that wine the one that received the greatest number of 1st place votes during the evening. Similarly, for the wines with the highest average score at each tasting, in only 25 of the 45 tastings was that wine the one that received the greatest number of 1st place votes during the evening.

We may safely conclude that neither being ranked 1st by a lot of people, nor getting a high average score from those people, will actually make a wine the top-ranked wine of the evening. As I have noted in a previous blog post, often the winning wine is the least-worst wine.
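A toy example shows how this can happen. The seven ballots below are invented for illustration; they are not Vintners Club data. Wine A collects the most first places but is also ranked last by everyone else, while wine B is nobody's disaster and wins on the sum of ranks.

```python
# Each hypothetical ballot lists the wines from most preferred to least preferred
ballots = [
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["B", "C", "A"],
    ["C", "B", "A"],
    ["C", "B", "A"],
]

rank_sums = {}     # total score per wine (lower is better)
first_places = {}  # number of 1st-place votes per wine

for ballot in ballots:
    for position, wine in enumerate(ballot, start=1):
        rank_sums[wine] = rank_sums.get(wine, 0) + position
    first_places[ballot[0]] = first_places.get(ballot[0], 0) + 1

winner_by_rank_sum = min(rank_sums, key=rank_sums.get)
most_first_places = max(first_places, key=first_places.get)

print("rank sums:          ", rank_sums)
print("first places:       ", first_places)
print("winner by rank sum: ", winner_by_rank_sum)
print("most first places:  ", most_first_places)
```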

Footnote

Confusingly, for each tasting, the Vintners Club rank data very rarely add up to the expected total for the number of people providing results. That is, with 12 wines, each scorer's ranks sum to 1 + 2 + ... + 12 = 78, so the sum of the ranks should equal 78 times the number of people providing scores. A few points less than the expected number likely represents a few tied votes by some of the scorers. However, there are also many tastings where the total scores add up to much more than is possible for the number of people present at the tasting. I have no explanation for this. (And yes, I have considered the effect of alcohol on the human ability to add up numbers!)



Research Literature

Michel Balinski, Rida Laraki (2013) How best to rank wines: majority judgment. In: E. Giraud-Héraud and M.-C. Pichery (editors) Wine Economics: Quantitative Studies and Empirical Applications, pp. 149-172. Palgrave Macmillan.

Jeffrey C. Bodington (2015) Testing a mixture of rank preference models on judges’ scores in Paris and Princeton. Journal of Wine Economics 10:173-189.

Victor Ginsburg, Israël Zang (2012) Shapley ranking of wines. Journal of Wine Economics 7:169-180.

Neal D. Hulkower (2009) The Judgment of Paris according to Borda. Journal of Wine Research 20:171-182.

Neal D. Hulkower (2012) A mathematician meddles with medals. American Association of Wine Economists Working Paper No. 97.

Monday, 14 August 2017

European wine taxes — and what to do about them

Almost all countries have some sort of tax on wine. Some places have a uniform tax throughout the country; and some countries have taxes that vary between their states (eg. the USA; see State and Local Alcohol Tax Revenue, 2014). The countries of the European Union (EU) may have many common economic policies, but uniform taxation is certainly not one of them. Therefore, the tax on wine varies considerably between the countries of Europe.

There are three possible taxes that might apply to wine:
  • Import Duty, if the wine comes from outside Europe
  • Value-Added Tax, which applies to all goods and services, and which goes by many names throughout the world (eg. VAT, TVA, IVA, GST, MwSt, Moms)
  • Excise Tax, which applies to specific products, such as alcohol or tobacco (and, in the dim past, also things like salt and sugar).
Here, we will concern ourselves only with the latter two taxes, on the assumption that the wine has originated in Europe or has already been imported there.

Recently, the Facebook page of the American Association of Wine Economists has considered this issue in one of their posts: Excise Taxes and VAT on Still Wine in the EU 2017. I have used these data as my basis; but I have modified them to account for the fact that in Europe the Value-Added Tax is charged on the total bottle price, including the Excise Tax (and also the Import Duty, for that matter). I have also added the data for Norway and Switzerland, which are currently not in the EU.
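The adjustment described here can be sketched as follows. The exact formula and base prices used for the graph are not given in the post, so the function below is an assumption about the general approach, and the prices and rates in the example are invented. The only point is that charging VAT on top of the excise raises the tax share of the final bottle price.

```python
def tax_share(pre_tax_price, excise, vat_rate):
    """Fraction of the final bottle price that is tax, with VAT charged on price + excise."""
    final_price = (pre_tax_price + excise) * (1 + vat_rate)
    total_tax = final_price - pre_tax_price
    return total_tax / final_price

def naive_tax_share(pre_tax_price, excise, vat_rate):
    """For comparison only: VAT charged on the pre-tax price alone."""
    final_price = pre_tax_price * (1 + vat_rate) + excise
    total_tax = final_price - pre_tax_price
    return total_tax / final_price

# Hypothetical bottle: 5.00 before tax, 2.00 excise, 25% VAT
with_vat_on_excise = tax_share(5.00, 2.00, 0.25)
without_vat_on_excise = naive_tax_share(5.00, 2.00, 0.25)

print(f"VAT on price + excise: {100 * with_vat_on_excise:.1f}% of the bottle price is tax")
print(f"VAT on price only:     {100 * without_vat_on_excise:.1f}% of the bottle price is tax")
```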

This graph shows the full data for 29 European countries, with the combined taxes on the wine expressed as a percentage of the final bottle price paid by the consumer.

Percent taxes on a bottle of wine in various European countries

Note that the EU countries are neatly bracketed by the two non-EU countries, with Swiss alcohol being dramatically cheaper than the Norwegian stuff. Also, note that the six "top" countries stand out from the rest — there is not a great difference between Estonia (23%) and Germany (16%), but the denizens of Denmark, Sweden, the UK, Finland, Ireland and Norway pay notably more tax for their wine than do other Europeans.

What to do?

Needless to say, the residents of these six countries do not take this situation lying down. Something must be done!

And the thing to do is to take advantage of the fact that there is free trade throughout the EU. That is, if you are a resident of a country with a sales tax that you don't like, then you can simply purchase your goods in another country, with a tax more to your liking. [*]

So, the British go to France, this being their closest country with lower alcohol taxes. These days there are large shopping complexes at the appropriate entry points into France (or exit points if you are going the other way). These are often owned and run by British companies. You simply order your goods online before you leave Britain, travel to France for the day (or a few days, if you want a proper holiday), and pick up your pre-packed goods on the way home. It is as simple as that.

I have no idea what the wine drinkers plan to do if Brexit is implemented.

Similarly, the Swedes and Danes go to Germany. Once again, there are large shopping complexes just where the main roads cross the Danish border, or near the boats to / from Sweden. These are usually run by Germans, and you don't order your goods ahead of time; but Scandinavians do commonly stop at them on their drive home at the end of any trip down south.

The Finns go to Sweden, on boats. There are a couple of large cruise liners that go back and forth between Stockholm and Helsinki every evening. Each trip runs overnight, and you spend the following day in the destination city, returning home that evening (ie. the entire trip, there and back, takes 40 hours). These boats are nominally run to transport trucked goods to Finland, which would otherwise need to be accessed via a long land detour through Russia; the boat is officially part of European road number E18. But, in reality, these are duty-free party boats, with a casino, nightclub, movie theater, duty-free shops, etc. Both Swedes and Finns are observed to depart from these boats with shopping carts loaded with duty-free alcohol.

Finally, the Norwegians go to Sweden for their alcohol. This might seem odd, since Norway is not in the EU and Sweden is, so that the goods are technically being imported into Norway (and should therefore attract Import Duty). However, there is unrestricted land movement of people between these two countries (see Nordic Passport Union), and only large trucks are stopped at the customs points. It has always been like this, even though the two countries parted company over a century ago.


So, if you drive south from, say, Oslo (the largest city in Norway), then just after you cross the Swedish border (about 90 minutes from Oslo) you will encounter two large shopping complexes, labeled Nordby and Svinesund in the satellite photo above. These are in the middle of a forest, with no nearby Swedish town — the Swedes have constructed them solely for the Norwegians. If you visit these shops, you will see lots of these people packing their station wagons with goods that attract much lower taxes in Sweden than they do in Norway. Indeed, Nordby actually has the liquor store with the biggest annual turnover in Sweden!

Such is life in modern Europe.



[*] In this sense, the European Union is more united than is the United States, where free trade in alcohol is still not widespread (see Consumers short-changed again on shipping), in spite of the fact that Prohibition was repealed eight decades ago.

Monday, 7 August 2017

The blind leading the blind?

Recently, The Economist magazine tried to champion the cause of blind tastings by using the results of the 2017 wine-tasting contest between Oxford and Cambridge universities (Think wine connoisseurship is nonsense? Blind-tasting data suggest otherwise). The conclusion was that the tasters "performed far better than random chance would indicate." However, very little data analysis was performed, and so a look at their data is in order.


The Economist notes:
The main results of the 2017 Varsity blind-tasting match, held on February 15th, are depicted above. Two teams of seven tasters each (including one reserve per side) were presented with 12 wines, six whites and six reds. The judges granted each taster between zero and 20 points per wine, depending on how close (in their estimation) the drinkers’ guesses were to the correct answers, and how convincingly they explained their reasoning. However, we prefer a simpler scoring system: one point for getting the country of origin right, another point for getting the grape variety right and a judicious half-point of partial credit only in a handful of specific cases.
The group’s overall accuracy was far superior to what could be expected from random chance. Given the thousands of potential country-variety pairs, a monkey throwing darts would have virtually no hope of getting a single one right. But 47% of the Oxbridge tasters' guesses on grape variety were correct, as were 37% on country of origin.
The Economist does point out the rather obvious variation in success, among both the tasters and the wines — some tasters did much better than others, and some wines were identified much more commonly than others. However, a variance-components analysis of the data indicates that it is the variation among the wines that dominates the dataset — for the successful identification of grape variety, 90% of the variability is due to the variation among the wines and only 5% is due to the variation among the tasters; and for the identification of country of origin, it is 65% and 25%, respectively.

So, any general comments about the success of blind wine-tasting must be tempered by the fact that some wines are apparently much easier to identify (by grape or country) than are others.

Statistical evaluation

The Economist's assessment of the probability of success is based on a mathematically naïve set of assumptions. As an example of their "dart-throwing" calculation: there are c. 100 common red-grape varieties, and so there is a 1% chance of me getting one right at a blind tasting by simply guessing. I would then have a 6% chance of getting at least one wine right if I simply guess the same red grape each time, for the six wines. This makes the 47% success rate of the tasters look pretty good.

However, this calculation is mathematically naïve because human beings are not monkeys, with or without darts. Some grape varieties occur in wines much more commonly than do others, and those grapes are more likely to be represented in the tasting contest; and human beings know this, even if the monkeys do not. Similarly, some countries are more likely to be represented in a wine tasting than are others, especially given the presence of certain grape varieties. For example, how many Gamay wines are made outside of France? If I simply assume "Beaujolais" for a Gamay wine then I have a 95% chance of being right!

We therefore cannot assume that an educated wine taster is the same as a monkey throwing darts. The wine taster is not guessing, any more than a motor mechanic is guessing when diagnosing a fault in your car. They both have prior knowledge, which even at worst produces an educated guess (and at best is professional expertise). That is, an "educated guess" should be the basis of our statistical comparison, not a "random guess", as done by The Economist.

So, in order to work out the actual probabilities of success for each grape (and country) I need to know the probability of one of the wines in the contest being, say, Chardonnay. That is, I would need to know the probability of the competition organizers choosing each of the grape varieties and countries for the tasting. Sadly, I do not have this information.

As a realistic substitute, I will use how common the different varieties/countries are in liquor stores. That is, I will assume that the bottles have been chosen from the selection available in the shops.

For this, I will use the wine database of the Systembolaget liquor chain, in Sweden. I have used this database before (eg. How many wine prices are there?) because, being the third largest liquor chain in the world, its selection of wines is extensive. Furthermore, being a European chain, it is likely to match the British organizers' probabilities of choice better than would many other sources. Indeed, for both the red and the white varieties, the organizers chose 4 of the 5 most common grapes in the Systembolaget database (out of the 6 chosen). So, my probabilities may be pretty good, at least from the point of view of the participants working out which wines they are likely to encounter in the tasting.

As an example, 25% of the white wines in Systembolaget's database have Chardonnay listed as a principal grape variety. This means that we would expect an 82% chance of at least one of the 6 white wines being Chardonnay. The participants actually had an 86% success rate at identifying the Chardonnay. So, my analysis suggests that in this one case they have not actually done any better than they could have done by taking an educated guess based simply on how common the wines are in the shops. The question they are answering in the tasting is not "is this a Chardonnay?" but "which one is the Chardonnay?"!
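Both the "dart-throwing" figure and the Chardonnay figure come from the same complement rule: the chance that at least one of six wines is a given variety is one minus the chance that none of them is. The sketch below assumes the six wines are chosen independently, each with the same probability; the function name is hypothetical.

```python
def prob_at_least_one(share, n_wines):
    """Probability that at least one of n_wines is of a variety with the given share.

    Assumes each wine is an independent draw with the same probability.
    """
    return 1 - (1 - share) ** n_wines

# Random guess: c. 100 equally likely red varieties, six red wines
dart_throwing = prob_at_least_one(0.01, 6)
# Educated guess: Chardonnay is 25% of the white wines in the database, six white wines
chardonnay = prob_at_least_one(0.25, 6)

print(f"random guess:   {100 * dart_throwing:.0f}%")
print(f"educated guess: {100 * chardonnay:.0f}%")
```

The gap between the two figures is the whole argument: the benchmark against which the tasters are judged depends heavily on which prior probability is assumed.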

Statistical results

So, my basis for estimating the prior probabilities of expected success for the participants is to work out the probability of at least one of the wines being of that variety or region (based on its frequency in the Systembolaget database). We can then compare this to the tasting results for each grape variety and each country, to see if the participants actually did better than an educated guess.

For each of the graphs presented below, the interpretation is as follows. Each variety or country is represented by a horizontal line, as indicated by the legend. The central point on each of the lines represents the percentage of the tasters who succeeded at the task for that wine. The two end points on each line are the boundaries of the estimated 95% confidence interval (formally: the Score binomial 95% confidence interval). This interval gets smaller as the sample size (the number of tasters) gets larger, as it represents our statistical "confidence" in the results. The asterisk represents the expected results if the tasters are performing in accordance with the estimated prior probabilities. So, if the asterisk is within the 95% confidence interval for a particular wine, then the tasters have done no better than an educated guess for that wine, whereas if the asterisk lies outside the 95% confidence interval then the tasters have done better (or worse) than expected.
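For readers who want to reproduce this kind of comparison, here is a minimal sketch of the score (Wilson) binomial confidence interval. The counts in the example are an assumption: 12 correct out of 14 tasters gives roughly the 86% Chardonnay success rate quoted above, but the actual counts are not stated in the post. The expected value of 0.82 is the Chardonnay prior worked out earlier.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Score (Wilson) confidence interval for a binomial proportion.

    z = 1.96 corresponds to a 95% interval.
    """
    p = successes / n
    denominator = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denominator
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denominator
    return centre - half_width, centre + half_width

# Assumed counts: 12 of 14 tasters correct (about 86%)
lower, upper = wilson_interval(12, 14)
expected = 0.82  # prior probability for Chardonnay from the text

print(f"95% interval: {100 * lower:.0f}% to {100 * upper:.0f}%")
if lower <= expected <= upper:
    print("The expected value lies inside the interval: no better than an educated guess.")
else:
    print("The expected value lies outside the interval: better (or worse) than expected.")
```

With so few tasters the interval is wide, which is why an observed rate has to differ a lot from the expected one before it counts as better than an educated guess.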

Expected versus actual correctness for grape varieties

Expected versus actual correctness for countries

The analyses indicate that in only 2 out of 12 cases did the participants identify the grape variety with any more success than would be expected based on the commonness of the wines: the Pinot Noir and the Gamay. Otherwise, they did as well as we would expect using an educated guess — except in the case of the Riesling wine, where they did rather poorly. In this case, Riesling is apparently a more common wine grape than the participants realize!

The analyses also indicate that the tasters did both better and worse than expected with the identification of country of origin. In three cases they did better than expected (France and New Zealand for the red wines, and Australia for the white wines), and in three cases they did worse than expected (Spain for the red wines, and France and Italy for the white wines). That is, French white wine is apparently a more common type than the participants realize, as also are Italian white wine and Spanish red wine.

Conclusion

I have indicated before that blind tastings are notoriously hard (see Can non-experts distinguish anything about wine?). The results and analyses presented here confirm that conclusion — for some wines the participants did very well, but in most cases they could have done just as well by guessing based on how commonly the wines are encountered. The Economist's optimism in this case is misplaced, due to a naive assessment of the prior probabilities of success.