## Tuesday, December 27, 2016

### Look at your data before calculating statistics!

It seems to me that there are a few misconceptions about analyzing data, at least among non-experts. I will discuss some of these over the next few weeks, and illustrate a few points with some real data from the wine world. Let's start with the vital idea that we should actually look at our data before we rush into any sort of formal analysis.

As the example, we will look at the recent publication by Vox magazine of an article entitled "Why amateur wine scores are every bit as good as professionals’".

Statistics is a field of data analysis that calculates various summaries of a set of data, to help identify data patterns that are unlikely to have been produced by random chance. However, these summaries are no substitute for also looking at the data themselves.

Statistical calculations summarize only some of the patterns in the set of data, and therefore the data analyst needs to be very careful about the interpretation of the statistics. It is depressingly easy to do a set of calculations that seem to point one way when the data clearly point a different way. It behooves the analyst to look at the data first — this is called exploratory data analysis. The classic textbook on this topic is by John W. Tukey (1977. Exploratory Data Analysis. Addison-Wesley).

A case in point is the recent publication by Vox magazine. The magazine performed a series of correlation analyses, and from these analyses they reached the conclusion that "amateur wine scores are every bit as good as professionals’." However, looking at their data, as displayed in their graphs, shows something quite different.

Vox collated data from thousands of wines, finding the quality ratings from each of four sources: Cellar Tracker (as an example of wine ratings by non-experts, or amateurs), and the Wine Advocate, the International Wine Cellar, and Jancis Robinson (representing the diversity of ratings by professionals). These four rating systems were statistically compared by calculating Spearman correlations among the ratings, pairwise. Check out the original article if you are unfamiliar with this type of analysis, which is quite standard for this type of data.
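For readers who would like to see what a pairwise Spearman calculation involves, here is a minimal sketch in Python. The scores are invented purely for illustration and are not the Vox data. The function names (`average_ranks`, `pearson`, `spearman`) are hypothetical helpers written for this sketch, not the code that Vox used. The idea is simply to replace each score by its rank and then correlate the ranks, which is why the different scoring scales do not matter.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for the same six wines (NOT the Vox data).
scores = {
    "Cellar Tracker": [88, 90, 91, 86, 93, 89],
    "Wine Advocate": [87, 92, 90, 85, 95, 88],
    "International Wine Cellar": [89, 91, 90, 87, 94, 88],
    "Jancis Robinson": [16, 17, 15.5, 16.5, 18, 16],  # assumed 20-point scale
}

# Four raters give six possible pairs.
pairwise = {
    (a, b): spearman(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}

for (a, b), rho in pairwise.items():
    print(f"{a} vs {b}: rho = {rho:.2f}")
```

Note that each pair of raters is boiled down to a single number, whatever the scatter of points actually looks like.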

Let's look at the first Vox graph as an example. This shows a direct comparison, for the same wines, of the (average) quality scores from Cellar Tracker (vertically) and the (single) score from the Wine Advocate (horizontally). Each point represents a single wine (there are nearly 10,000 points in the graph). As an aside, to be correct the graph should actually be square, since it compares two things on nominally the same scale, and yet it is rectangular, instead. This is not a trivial point, because the graph distorts the data display — this is the first hint that something might be wrong.

I have provided an annotated version of the Vox figure below, to illustrate the points I am about to make. You can check out the unadorned original here.

The two scoring systems seem to agree for wines rated at >84 points (on the standard 50-100 scale). That is, both systems agree that the high-scoring wines deserve 84 points or more. How much more? Well, the Wine Advocate uses scores from 84-100, but Cellar Tracker rarely uses scores greater than 97. So, that is a kind of agreement, but not necessarily a very large one — a score of 90 on one scale is not necessarily a score of 90 on the other scale.

However, the graph also clearly shows that there is no agreement whatsoever below a score of 84, or so. The wines that the Cellar Tracker raters like are ones that the Wine Advocate dislikes (top left of the graph), and vice versa (bottom right of the graph). Moreover, there are no wines that both rating groups agree deserve low scores — these should be in the bottom-left part of the graph. Agreement between the raters requires some points in the graph both at the top-right and the bottom-left, but the latter are missing.
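One simple way to check this sort of pattern numerically, rather than only by eye, is to count how many wines fall into each quadrant of the graph. The following is a minimal sketch in Python, using invented scores (not the Vox data) and a hypothetical helper called `quadrant_counts`. The cutoff of 84 points is taken from the discussion above, and treating a score of exactly 84 as "high" is an assumption of the sketch.

```python
def quadrant_counts(x_scores, y_scores, cutoff=84):
    """Count wines in each quadrant defined by a single score cutoff.

    x_scores are the horizontal-axis ratings and y_scores the vertical-axis
    ratings for the same wines; a score >= cutoff counts as "high".
    """
    counts = {"both_high": 0, "x_only_high": 0, "y_only_high": 0, "both_low": 0}
    for x, y in zip(x_scores, y_scores):
        x_high, y_high = x >= cutoff, y >= cutoff
        if x_high and y_high:
            key = "both_high"      # top right of the graph
        elif x_high:
            key = "x_only_high"    # bottom right
        elif y_high:
            key = "y_only_high"    # top left
        else:
            key = "both_low"       # bottom left
        counts[key] += 1
    return counts


# Hypothetical scores for ten wines (NOT the Vox data).
wine_advocate = [92, 95, 88, 90, 70, 75, 93, 91, 86, 78]   # horizontal axis
cellar_tracker = [91, 93, 89, 90, 90, 88, 72, 76, 87, 92]  # vertical axis

counts = quadrant_counts(wine_advocate, cellar_tracker)
print(counts)
```

In this made-up example the bottom-left count is zero, which is the same kind of gap that is visible in the Vox graph.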

So, I am tempted to suggest that this is actually a prime example of disagreement between two ratings, possibly the most blatant one that I have seen in 40 years as a scientist.

As a data summary, the statistical correlation value is clearly meaningless — it completely misses the most obvious patterns in the data. That is, there are two clear patterns: (i) wines that are rated highly on Cellar Tracker and (ii) wines that are rated highly by the Wine Advocate. Often, these are the same wines, but sometimes not; and there are no wines that are rated poorly by both groups of raters.

The practical problem here is that there are two patterns in the data, whereas the summary consists of only one number — there is no practical way to get two pieces of information into a single number and then extract that information again. So, the mathematical calculations focus on a single pattern, which turns out to be a compromise between the two main patterns, and therefore does not correspond to either of them. As we all know, sometimes a compromise means that no-one gets what they want!

All of the graphs in the Vox article show this same two-pattern feature to one extent or another (only five of the six possible pairwise graphs are presented). This means that this is a general issue with the Vox analysis, rather than something specific to the Wine Advocate ratings.

I therefore cannot agree with the conclusions reached in the Vox article, based on the data analyses that they present. Looking at the data first would indicate that the correlation analysis is inappropriate as a summary of these data sets, and some other analysis is required.

Note that this does not mean that I think community scores are in any way useless. Indeed, I use them all the time, myself, because they have considerable practical utility. What I am saying is that the data analysis presented by Vox does not actually support their conclusions: community scores are not the same as professional scores. Indeed, why see either type of score as necessarily superior? Why not see them, instead, as equally useful because they are different?

Finally, what if, for the sake of argument, we decide that we do accept the Vox correlations as meaningful summaries of the data? What are these summaries telling us? I recently showed a network analysis based on exactly this same type of data (Summarizing multi-dimensional wine data as graphs, Part 2: networks). So, we can perform that same type of data analysis again here, as a way to summarize the outcome of the correlation analyses. What does it show?

It shows that Jancis Robinson is the odd one out among the four raters, and that the Cellar Tracker scores are a bit more similar to the Wine Advocate scores than to the International Wine Cellar's ones. This does not in any way validate the Cellar Tracker scores as being "every bit as good as" the other three ratings (although it might validate Robinson's self-proclaimed doubts about the whole business of rating wines using numbers).