Tuesday, December 27, 2016

Look at your data before calculating statistics!

It seems to me that there are a few misconceptions about analyzing data, at least among non-experts. I will discuss some of these over the next few weeks, and illustrate a few points with some real data from the wine world. Let's start with the vital idea that we should actually look at our data before we rush into any sort of formal analysis.

As the example, we will look at the recent publication by Vox magazine of an article entitled Why amateur wine scores are every bit as good as professionals’.

Statistics is a field of data analysis that calculates various summaries of a set of data, to help identify data patterns that are unlikely to have been produced by random chance. However, this is no substitute for actually looking at the data as well.

Statistical calculations summarize only some of the patterns in the set of data, and therefore the data analyst needs to be very careful about the interpretation of the statistics. It is depressingly easy to do a set of calculations that seem to point one way when the data clearly point a different way. It behooves the analyst to look at the data first — this is called exploratory data analysis. The classic textbook on this topic is by John W. Tukey (1977. Exploratory Data Analysis. Addison-Wesley).

A case in point is the recent publication by Vox magazine. The magazine performed a series of correlation analyses, and from these analyses they reached the conclusion that "amateur wine scores are every bit as good as professionals’." However, looking at their data, as displayed in their graphs, shows something quite different.

Vox collated data from thousands of wines, finding the quality ratings from each of four sources: Cellar Tracker (as an example of wine ratings by non-experts, or amateurs), and the Wine Advocate, the International Wine Cellar, and Jancis Robinson (representing the diversity of ratings by professionals). These four rating systems were statistically compared by calculating Spearman correlations among the ratings, pairwise. Check out the original article if you are unfamiliar with this type of analysis, which is quite standard for this type of data.
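For readers who have not met this analysis before, here is a minimal sketch of a pairwise Spearman correlation in plain Python. The scores are invented for illustration (they are not the Vox data), and the helper names `rank`, `pearson` and `spearman` are my own. The idea is simply to replace each score by its rank and then calculate an ordinary correlation on the ranks.

```python
from itertools import combinations

def rank(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

# Hypothetical scores for the same six wines (invented, not the Vox data).
scores = {
    "Cellar Tracker": [88, 90, 91, 87, 93, 89],
    "Wine Advocate": [85, 92, 90, 86, 96, 88],
    "Jancis Robinson": [90, 88, 91, 85, 92, 89],
}

for a, b in combinations(scores, 2):
    print(f"{a} vs {b}: rho = {spearman(scores[a], scores[b]):.2f}")
```

Note that each pair of raters is reduced to a single number, which is exactly the limitation discussed in the rest of this post.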

Let's look at the first Vox graph as an example. This shows a direct comparison, for the same wines, of the (average) quality scores from Cellar Tracker (vertically) and the (single) score from the Wine Advocate (horizontally). Each point represents a single wine (there are nearly 10,000 points in the graph). As an aside, to be correct the graph should actually be square, since it compares two things on nominally the same scale, and yet it is rectangular, instead. This is not a trivial point, because the graph distorts the data display — this is the first hint that something might be wrong.

I have provided an annotated version of the Vox figure below, to illustrate the points I am about to make. You can check out the unadorned original here.

Graph from Vox magazine

The two scoring systems seem to agree for wines rated at >84 points (on the standard 50-100 scale). That is, both systems agree that the high-scoring wines deserve 84 points or more. How much more? Well, the Wine Advocate uses scores from 84-100, but Cellar Tracker rarely uses scores greater than 97. So, that is a kind of agreement, but not necessarily a very large one — a score of 90 on one scale is not necessarily a score of 90 on the other scale.

However, the graph also clearly shows that there is no agreement whatsoever below a score of 84, or so. The wines that the Cellar Tracker raters like are ones that the Wine Advocate dislikes (top left of the graph), and vice versa (bottom right of the graph). Moreover, there are no wines that both rating groups agree deserve low scores — these should be in the bottom-left part of the graph. Agreement between the raters requires some points in the graph both at the top-right and the bottom-left, but the latter are missing.
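One simple way to check this sort of pattern without a graph is to count the wines in each quadrant around the threshold. The following is only a sketch: the score pairs are invented, `quadrant_counts` is a hypothetical helper, and I have assumed that a score of exactly 84 counts as "high".

```python
from collections import Counter

def quadrant_counts(pairs, threshold=84):
    """Count (community, professional) score pairs in each quadrant."""
    counts = Counter()
    for community, professional in pairs:
        vertical = "top" if community >= threshold else "bottom"
        horizontal = "right" if professional >= threshold else "left"
        counts[f"{vertical}-{horizontal}"] += 1
    return counts

# Hypothetical (Cellar Tracker, Wine Advocate) pairs, invented for illustration.
pairs = [(90, 92), (88, 95), (91, 89), (89, 78), (87, 80), (80, 90), (82, 88)]

counts = quadrant_counts(pairs)
for name in ("top-left", "top-right", "bottom-left", "bottom-right"):
    print(name, counts[name])
```

With these made-up numbers the bottom-left count is zero, which is the kind of gap that a single correlation value cannot reveal.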

So, I am tempted to suggest that this is actually a prime example of disagreement between two ratings, possibly the most blatant one that I have seen in 40 years as a scientist.

As a data summary, the statistical correlation value is clearly meaningless — it completely misses the most obvious patterns in the data. That is, there are two clear patterns: (i) wines that are rated highly on Cellar Tracker and (ii) wines that are rated highly by the Wine Advocate. Often, these are the same wines, but sometimes not; and there are no wines that are rated poorly by both groups of raters.

The practical problem here is that there are two patterns in the data, whereas the summary consists of only one number — there is no practical way to get two pieces of information into a single number and then extract that information again. So, the mathematical calculations focus on a single pattern, which turns out to be a compromise between the two main patterns, and therefore does not correspond to either of them. As we all know, sometimes a compromise means that no-one gets what they want!

All of the graphs in the Vox article show this same two-pattern feature to one extent or another (only five of the six possible pairwise graphs are presented). This means that this is a general issue with the Vox analysis, rather than something specific to the Wine Advocate ratings.

I therefore cannot agree with the conclusions reached in the Vox article, based on the data analyses that they present. Looking at the data first would indicate that the correlation analysis is inappropriate as a summary of these data sets, and some other analysis is required.

Note that this does not mean that I think community scores are in any way useless. Indeed, I use them all the time, myself, because they have considerable practical utility. What I am saying is that the data analysis presented by Vox does not actually support their conclusions: community scores are not the same as professional scores. Indeed, why see either type of score as necessarily superior? Why not see, instead, that they are equally useful because they are different?

Finally, what if, for the sake of argument, we decide that we do accept the Vox correlations as meaningful summaries of the data? What are these summaries telling us? I recently showed a network analysis based on exactly this same type of data (Summarizing multi-dimensional wine data as graphs, Part 2: networks). So, we can perform that same type of data analysis again here, as a way to summarize the outcome of the correlation analyses. What does it show?

Phylogenetic network of wine quality scores

It shows that Jancis Robinson is the odd one out among the four raters, and that the Cellar Tracker scores are a bit more similar to the Wine Advocate scores than to the International Wine Cellar's ones. This does not in any way validate the Cellar Tracker scores as being "every bit as good as" the other three ratings (although it might validate Robinson's self-proclaimed doubts about the whole business of rating wines using numbers).

Sunday, December 25, 2016

James Bond, certified alcoholic

Merry Christmas
God jul och gott nytt år!

On my other blog I have a new post, which you may enjoy reading. It is about the alcohol intake of James Bond, the world's least secret secret agent, and well-known "label drinker".

Monday, December 19, 2016

The Rheingau — the grand-daddy of all vintage charts

Most of us probably think that vintage charts, which give a quality score for each vintage in a particular wine region, are a fairly modern thing, along with the idea of giving a quality score to each producer's wines.

Nevertheless, I have previously discussed long-term continuous records of vintage quality for several vineyard regions, including century-long recording for Bordeaux, in southern France (Two centuries of Bordeaux vintages — Tastet & Lawton) and Piemonte, in northern Italy (A century of Barolo vintages — Fontanafredda; More than a century of Barolo vintages — Marchesi di Barolo).

Intriguingly, the oldest known continuous vintage-quality record is for the Rheingau region in southern Germany, covering the years 1682-1884 CE, which thus includes scores for 203 consecutive vintages.

The oldest known vintage chart

The Rheingau

The Rhine River generally flows north from the Alps to the North Sea. However, at one point it turns west, having encountered the southern part of the Taunus plateau. After 20 km or so it breaks northwards again, between the Taunus and Hunsrück plateaus, forming the best known part of the river, the Romantic Rhine so beloved of tourists, with the old castles on the tops of the river gorge, and even in the river itself.

The east-west part of the river is the Rheingau, with most of the vineyards on the gently sloping south-facing slopes next to the river itself.

As Stuart Pigott recently noted about the period covered by the vintage chart:
The Rheingau may be much older than the Medoc in Bordeaux, for example, but the most decisive period of its history came in the 18th century, beginning with the world’s first varietal plantings of the Riesling grape, the introduction of late harvesting, and the selective harvesting of bunches. All of this happened at the same property: Schloss Johannisberg, in 1720-21, 1775 and 1787, respectively. [Down the road, Schloss Vollrads is the oldest operating commercial winery in the world, with its first documented release of wine in 1211 CE.]

For a century following the breakthrough vintage of 1811, Rheingau Rieslings were the most sought-after and expensive wines in the world. By the 1850s, the Rheingau was on a roll. The majority of the region’s wines were dry, but those that wrote the headlines were sweet wines made from nobly rotten grapes. Then, at the end of the 19th century, it was overtaken by the Mosel.

Vintage chart

The vintage chart in question appears as Table V of a book called Karte und Statistik des Weinbaues im Rheingau, compiled in 1885 by Heinrich Wilhelm Dahlen. This book is available online at the Landesbibliothekszentrum Rheinland-Pfalz.

The chart itself is entitled Uebersicht von Menge und Güte der Wein-Erträge in dem vormaligen Herzogthume Nassau in den Jahren 1682 bis 1884 (Overview of the quantity and quality of the wine-income in the former duchy of Nassau in the years 1682 to 1884). A direct link to the chart is available here.

The chart uses a color code to indicate the wine quality for each vintage, along with a written indication of the quantity of the harvest. The quantity is indicated by words in the first three columns of the chart, but there are actual volumes (in hectoliters) in the final column; the length of the colored bars in the final column also indicates the quantity. The 4-point quality color code is:
Gering und schlecht: poor and bad
mediocre (or fair)
light green
dark green

In the rest of this post, I provide a transcription of this vintage chart, along with some analysis of the data. Thanks to the Hogshead blog (Buy 1684, avoid 1687: an historic German vintage chart) for drawing my attention to this extraordinary historical record.


Here is a summary of the harvest-quality data for the 203 vintages:
Excellent 26
Good/excellent 1
Good 50
Mediocre/good 14
Mediocre 45
Poor/mediocre 9
Poor 63

Here are the same data presented as a frequency histogram of increasing quality. For random data this would follow what is known as a binomial probability distribution. It approximately does so, but for a perfect fit there are actually a few too many vintages of the "poor and bad" sort relative to the "mediocre" sort.

In the next graph I have shown the harvest-quality data as a time series, with the quality codes converted to the scores 1-4. Each data point represents a vintage, and the pink line is a running average (it shows the average value across groups of 9 consecutive years, thus smoothing out the year-to-year fluctuations so that the long-term trends are easier to see). [Technical note: the data are of ordinal type but not necessarily interval type, and so calculating an average may not actually be valid. I have simply assumed that it is appropriate, given the relatively close fit to the binomial probability distribution.]
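A 9-year running average of this kind can be sketched as follows. The scores are those of the first 18 vintages (1682-1699), copied from the transcription later in this post. Labeling each average with the middle year of its window is my assumption, as is the helper name `running_mean`.

```python
def running_mean(values, window=9):
    """Mean of each full run of `window` consecutive values."""
    if window > len(values):
        return []
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Quality scores (1-4) for 1682-1699, copied from the transcription.
years = list(range(1682, 1700))
scores = [1, 2, 4, 1, 3, 1, 1, 3, 2, 2, 1, 1, 3, 1, 1, 2, 1, 3]

# Assumption: each average is labeled with the middle year of its window.
half = 9 // 2
for year, mean in zip(years[half:], running_mean(scores)):
    print(year, round(mean, 2))
```

Only full windows are averaged here, so the first and last four years of a series get no smoothed value.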

Rheingau vintage quality scores 1682-1884

Using the scale 1-4, the average vintage score is 2.2, whereas it would be 2.5 for random data, so that the average harvest across the 203 years was slightly below expectation (as also noted above for the frequency distribution). There is no general long-term trend in vintage quality across these two centuries, which cover the second half of the global cold period known as the Little Ice Age (1300-1850 CE).

There are, however, remarkably regular peaks in quality every 25-30 years (as shown by the peaks and valleys of the pink line). The cause of this is not immediately obvious, although it is presumably related to cyclical weather patterns. The first two of the quality peaks actually run together (ie. there is no intermediate dip in quality), so that the vintages were generally good from 1700-1730.

Rheingau vintage quality and quantity 1830-1884

The second graph shows the relationship between vintage quality (vertically) and vintage quantity (horizontally), with each point representing a vintage from 1830-1884. There is a general positive association between quality and quantity (correlation r=0.59), so that, for example, small numbers of grapes are never associated with the best quality score. Mark Matthews, in Terroir and Other Myths of Winegrowing (University of California Press, 2016) points out that this is often true of wine making.
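For anyone who wants to repeat this sort of calculation, here is a sketch of a product-moment correlation using the 1830-1884 rows of the transcription at the end of this post. How to treat the mixed quality codes (such as 2/3) is an assumption on my part: I have used the midpoint, so the result need not match the value quoted above exactly. The names `parse_score` and `pearson` are my own.

```python
from statistics import mean

# "year quality quantity(hl)" for 1830-1884, copied from the transcription.
RAW = """
1830 1 2700; 1831 3 32412; 1832 1/3 33840; 1833 2/3 95472; 1834 4 106368;
1835 3 87120; 1836 2/3 42768; 1837 1 31236; 1838 2 21768; 1839 1/3 43644;
1840 1/2 39660; 1841 2/3 28572; 1842 2/3 67728; 1843 1 34486; 1844 1/2 34392;
1845 1 34548; 1846 4 117000; 1847 1/2 102804; 1848 3 63264; 1849 1 44916;
1850 1/2 51216; 1851 1 51300; 1852 2 53232; 1853 2 53256; 1854 1/2 9516;
1855 3 43968; 1856 1 27888; 1857 4 109968; 1858 3 97104; 1859 3 71040;
1860 1 64800; 1861 3 24624; 1862 4 96480; 1863 1 54960; 1864 1 33612;
1865 4 89220; 1866 1 99000; 1867 1/2 77676; 1868 4 129485; 1869 2 57552;
1870 2 62616; 1871 1 25874; 1872 1 11612; 1873 2 27839; 1874 2/3 84284;
1875 2/3 131088; 1876 2/3 75070; 1877 1 61827; 1878 2 37416; 1879 1 13928;
1880 2/3 14452; 1881 2/3 67691; 1882 1 38392; 1883 2/3 74220; 1884 3 76820;
"""

def parse_score(text):
    """Turn '3' into 3.0, and a mixed code such as '2/3' into its midpoint 2.5."""
    parts = [float(p) for p in text.split("/")]
    return sum(parts) / len(parts)

def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

records = [r.split() for r in RAW.replace("\n", " ").split(";") if r.strip()]
quality = [parse_score(score) for _, score, _ in records]
quantity = [float(amount) for _, _, amount in records]

print(len(records), "vintages, r =", round(pearson(quality, quantity), 2))
```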

Interestingly, this vintage chart is not the only presentation of the Rheingau wine quality from this time period. Karl Storchmann (2005. English weather and Rhine wine quality: an ordered probit model. Journal of Wine Research 16:105-120) has transcribed a set of verbal descriptions of vintages into a set of quality scores. His data are for a single vineyard, Schloss Johannisberg (mentioned above), covering the period 1700-2000 CE. I have not yet obtained a copy of these data, to make a direct comparison with the data shown above.


Notes: The following is a transcription of the original Gothic script into modern German. I have translated the quality color codes using the 1-4 scores. For some of the years the score is shown as being a mixture of two different codes (eg. 1/2), as explained in the Remarks (Bemerkungen) below. The first part of the chart has only abbreviated comments about the harvest quantity (Menge = amount). The middle part of the chart also provides a score for the amount (xx). The final part of the chart provides the estimated harvest quantity, in hectoliters. If you are interested, Google Translate does a reasonable job of translating the German text.

Übersicht von Menge und Güte der Wein-Erträge in dem vormaligen Herzogthume Nassau in den Jahren 1682 bis 1884.


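Wenn allgemeine Angaben über die Menge bis zum Jahre 1829 in den benützten Chroniken* nicht vorhanden waren, wurden dieselben weggelassen und solche nur für die Jahre eingesetzt, über welche entsprechende Mittheilungen sich vorfanden. Stimmten die diesbezüglichen Aufzeichnungen nicht überein, so sind die sich widersprechenden Angaben einander gegenübergestellt.
[English: Where general information about the quantity up to the year 1829 was not available in the chronicles consulted*, it was omitted, and entries were made only for those years for which corresponding reports were found. Where the relevant records did not agree, the contradictory statements are set side by side.]

Die Menge für die Jahre 1830 bis 1884 ist nach den officiellen Erhebungen für das Gebiet des vormaligen Herzogthumes Nassau in Hektolitern angegeben und deren Verschiedenheit graphisch dargestellt. Bis 1868 wurden die Angaben benützt, welche Polizeirath Höhn in Wiesbaden bereits in einer für die Wiener Weltausstellung 1873 zusammengestellten Tabelle ausgeführt hatte.
[English: The quantity for the years 1830 to 1884 is given in hectoliters, according to the official surveys for the territory of the former Duchy of Nassau, and its variation is shown graphically. Up to 1868, the figures used are those that Polizeirath Höhn of Wiesbaden had already set out in a table compiled for the 1873 Vienna World Exhibition.]

Die Güte ist entsprechend der Qualität der Rheingauer Weine im Allgemeinen durch die nachstehend erwähnten Farben ausgedrückt. Da die Darstellung den Charakter der Weine im Allgemeinen ausdrücken soll, so ist natürlich nicht ausgeschlossen, daß in speciellen Fällen d. h. engeren Bezirken in den betreffenden Jahren auch bessere oder geringere Qualitäten erzielt wurden, als es den gewählten Farben entspricht.
[English: The quality is expressed, in accordance with the quality of Rheingau wines in general, by the colors mentioned below. Since the chart is meant to express the character of the wines in general, it is of course not ruled out that in special cases, i.e. in smaller districts, better or poorer qualities were achieved in the years concerned than the chosen colors indicate.]

Haben bis zum Jahre 1829 für denselben Jahrgang zwei Farben Verwendung gefunden, so stimmten die Angaben der Chroniken nicht überein, sondern wichen in der Weise von einander ab, wie die betreffenden Farben veranschaulichen.
[English: Where two colors have been used for the same vintage up to the year 1829, the statements in the chronicles did not agree, but differed from one another in the way that the colors concerned illustrate.]

Die Güte für die Jahre 1830 bis 1884 wird wie oben durch Farben veranschaulicht und ist die Darstellung auf Grund der diesbezüglichen Mittheilungen eines der hervorragendsten Rheingauer Weinkenner, dessen Erfahrungen bis zu dem zweiten Decennium dieses Jahrhunderts hinaufreichen, erfolgt. Sind in besagtem Zeitraum für einen Jahrgang zwei Farben benutzt, so bewegt sich die Güte innerhalb des hierdurch angedeuteten Werthes.
[English: The quality for the years 1830 to 1884 is illustrated by colors as above, and the chart is based on the relevant reports of one of the most eminent Rheingau wine connoisseurs, whose experience reaches back to the second decade of this century. Where two colors are used for a vintage in this period, the quality lies within the range thereby indicated.]

Die Qualität ist durch folgende Farben ausgedrückt.
[English: The quality is expressed by the following colors.]

* Es wurden hierbei folgende Quellen benützt:
[English: The following sources were used:]
1. Rheingauer Geschichts- und Wein-Chronik. Von Dr. Rob. Haas. Wiesbaden 1854.
2. Der Weinbau in Nassau. Von O. Sartorius. Wiesbaden 1871.
3. Der Weinbau der letzten hundert Jahre im Rheingau. Von T. B. Weinbau und Weinhandel 1885, S. 51.
4. Über das Schätzen der Weinernten. Von W. Rasch. Ebenda S. 60.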

Jahr Score Menge
1682 1 wenig
1683 2 wenig
1684 4 voller herbst
1685 1 viel
1686 3
1687 1 viel
1688 1 viel
1689 3
1690 2
1691 2 sehr wenig
1692 1 sehr wenig
1693 1 wenig
1694 3
1695 1 wenig
1696 1
1697 2
1698 1 wenig
1699 3 viel
1700 4
1701 3
1702 2
1703 2
1704 4
1705 1 wenig
1706 4 voller herbst
1707 3
1708 2
1709 1 starker winterfrost
1710 3
1711 3
1712 4 sehr viel
1713 1
1714 2
1715 3
1716 1
1717 2
1718 4
1719 4
1720 2
1721 1
1722 1/2
1723 4 viel
1724 3
1725 1 sehr wenig
1726 4 voller herbst
1727 3 sehr viel
1728 3
1729 3 viel
1730 1
1731 2
1732 1
1733 2
1734 2
1735 1
1736 3
1737 3
1738 4
1739 2 viel
1740 1 frühfr. vielf. nicht gel.
1741 2/3 wenig
1742 1 wenig
1743 3
1744 3
1745 2/3 wenig
1746 4 sehr wenig
1747 4
1748 4
1749 4 wenig
1750 4 viel
1751 1 wenig
1752 1 viel
1753 3 sehr viel
1754 2/3
1755 3 wenig
1756 1 mittelertrag
1757 2 halber herbst
1758 1 viel
1759 3 viel
1760 3 viel
1761 3 viel
1762 3/4 sehr viel
1763 1 vielfach nicht gelesen
1764 2 wenig
1765 1 sehr wenig
1766 3 viel
1767 1 sehr wenig
1768 2 wenig
1769 1 viel
1770 2 wenig
1771 1/2 viel
1772 2 viel
1773 2
1774 3 viel
1775 3 viel
1776 1 wenig
1777 1 mittelertrag
1778 2/3 wenig
1779 3 viel
1780 3 viel
1781 4 sehr viel
1782 1 ziemlich viel, frühfr.
1783 4 hauptjahr
1784 3 sehr wenig (18)
1785 1 ziemlich viel (12)
1786 1 wenig (14)
1787 1 viel (12)
1788 3 viel (0)
1789 2 wenig, spätfrost (18)
1790 2 wenig (18)
1791 1/3 frühfrost, 12 herbst
1792 1/3 fehlj. & spätfr. (18)
1793 1 sehr wenig (18)
1794 3 mittelertrag (48)
1795 1/2 wenig (18)
1796 1/3 wenig (18)
1797 1 wenig (18)
1798 3 viel (12)
1799 1 sr. hagelbeschädg. (18)
1800 3 sehr wenig (13)
1801 3 wenig bis viel (12)
1802 3 sehr wenig bis zieml. viel (18)
1803 2 (28)
1804 3 sehr viel (11)
1805 1 sehr wenig (0), frühfr.
1806 4 ziemlich viel (13)
1807 3 ziemlich viel (13)
1808 2 viel (11)
1809 1 wenig (18)
1810 2 wenig (12)
1811 4 sehr viel (13)
1812 2 nicht viel (23)
1813 1 sehr wenig (18)
1814 2 sehr wenig (18)
1815 3 wenig (13)
1816 1 vielfach nicht gel. (0)
1817 1 wenig (18)
1818 3 mittelertrag (11)
1819 3 viel bis sehr viel (11)
1820 1 rein (12), herbst (1/3)
1821 1 unbedeutend (13)
1822 4 voller herbst (13)
1823 1 halber herbst (13)
1824 1 wenig (14)
1825 3 viel (12)
1826 3 voller herbst (11)
1827 3 sehr wenig (14)
1828 2 voller herbst (11)
1829 1 sehr wenig (16), herbst (13)
1830 1 2,700
1831 3 32,412
1832 1/3 33,840
1833 2/3 95,472
1834 4 106,368
1835 3 87,120
1836 2/3 42,768
1837 1 31,236
1838 2 21,768
1839 1/3 43,644
1840 1/2 39,660
1841 2/3 28,572
1842 2/3 67,728
1843 1 34,486
1844 1/2 34,392
1845 1 34,548
1846 4 117,000
1847 1/2 102,804
1848 3 63,264
1849 1 44,916
1850 1/2 51,216
1851 1 51,300
1852 2 53,232
1853 2 53,256
1854 1/2 9,516
1855 3 43,968
1856 1 27,888
1857 4 109,968
1858 3 97,104
1859 3 71,040
1860 1 64,800
1861 3 24,624
1862 4 96,480
1863 1 54,960
1864 1 33,612
1865 4 89,220
1866 1 99,000
1867 1/2 77,676
1868 4 129,485
1869 2 57,552
1870 2 62,616
1871 1 25,874
1872 1 11,612
1873 2 27,839
1874 2/3 84,284
1875 2/3 131,088
1876 2/3 75,070
1877 1 61,827
1878 2 37,416
1879 1 13,928
1880 2/3 14,452
1881 2/3 67,691
1882 1 38,392
1883 2/3 74,220
1884 3 76,820

Monday, December 12, 2016

Summarizing multi-dimensional wine data as graphs, Part 2: networks

In a previous blog post I introduced ordination analyses, as a mathematical technique for summarizing multivariate data — Summarizing multi-dimensional wine data as graphs, Part 1: ordinations.

Multivariate data have measurements of multiple characteristics for each of a set of "objects". The objective of the analysis is to mathematically summarize the multiple dimensions down to something manageable, which can then be displayed as a picture. This picture will give us an overview of the patterns in the data, rather than displaying the details of each characteristic.

Ordination analyses put the objects in some order along one or two dimensions (hence the name). In the resulting graph the points represent the objects, and the relationships of the points in the graph represent their similarity, based on the summary of the original data. Points close together are more similar than are points further apart.

In this post, I introduce the use of networks as an alternative way to summarize multivariate data. In particular, the networks that I introduce are called phylogenetic networks. I have a special interest in these networks, since it is what I have worked on in my professional life (some of this work is discussed at the end of this post).

Phylogenetic networks

Phylogenetic networks differ from what are called interaction networks, which many people are familiar with — a food web is a classic example of an interaction network, where the objects in the network are different organisms, which are connected by lines indicating who eats whom. This type of network uses lines to connect objects based on relationships that are directly observed.

A phylogenetic network, on the other hand, connects the objects with multiple lines showing a summary of their similarities. That is, the network is a bit like an ordination, which uses space to display the relationships among the objects, but the points are now connected by lines. Objects that are close together in the network are more similar to each other than are points further apart, but the relationships are traced only along the lines, not directly from point to point. That is, the lengths of the connecting lines contain information about the data summary, and the ways in which they connect also contain information.

The advantage here is that the lines will often indicate clusters of points in the multidimensional data that cannot be displayed in a 2-dimensional ordination. So, the network can be more informative.


An example

I have compiled some multidimensional data relating to the quality scores of a collection of Australian wines, as provided by a number of raters — Jeremy Oliver, James Halliday, the Wine Front, and Cellar Tracker. The first two raters are individual people, the third one is a group of three people (Mike Bennie, Campbell Mattinson, Gary Walsh), any one of whom may have rated the wine, and the fourth one is a community site that provides ratings averaged across many people.

In this example, the 4 wine-raters are the objects, and their multiple characteristics are the quality scores given to the wines (there are 114 wines and thus 114 dimensions to the data). We wish to see how similar are the raters, by summarizing the multidimensional data down to a single picture.

This can be done using a NeighborNet network, as shown in the first figure. Note that the four raters are connected by a set of lines, and it is these connections that summarize the data.
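NeighborNet works from a table of pairwise distances between the objects, so the first step is to turn the raters' scores into such a table. This post does not say which distance measure was used, so the sketch below assumes a Manhattan distance (the sum of the absolute score differences). The scores are invented, for five wines rather than the 114 in the real dataset, and the network construction itself is not shown.

```python
from itertools import combinations

# Hypothetical scores for the same five wines (invented for illustration).
scores = {
    "Jeremy Oliver": [89, 88, 92, 86, 90],
    "James Halliday": [94, 92, 95, 90, 94],
    "Wine Front": [93, 91, 94, 90, 92],
    "Cellar Tracker": [86, 84, 88, 83, 86],
}

def manhattan(x, y):
    """Sum of absolute differences between two score lists."""
    return sum(abs(a - b) for a, b in zip(x, y))

# One distance for every pair of raters (4 raters give 6 pairs).
distances = {
    (a, b): manhattan(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}

for (a, b), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{a} vs {b}: {d}")
```

A table like this is what a program such as SplitsTree would take as input in order to draw the network.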

Interpreting a network 1

The lengths of the lines are important. The longest line (technically called an edge) separates the Cellar Tracker ratings from the rest of the network, indicating that the Cellar Tracker ratings are often quite different to those of the other three raters. Indeed, the original data show that these scores are usually much lower than are the other three scores, for any given wine.

Interpreting a network 2

A similar thing applies to the scores from Jeremy Oliver, which are separated by the second-longest line; his scores are also often quite different to the others. Indeed, he has acknowledged this in his writings, pointing out that as a wine commentator he is often more critical than are other (unnamed) commentators.

In the middle of the network there is a box-like structure connecting the four raters together in various ways. This box is not square, which means that some of the raters are more similar to each other than are others, as represented by the distances connecting them along the edges.

Interpreting a network 3

For example, the shortest distance along the lines from James Halliday to the Wine Front (401 units) is less than the distance from Jeremy Oliver to the Wine Front (506 units). This means that the former pair are more similar than are the latter pair.

Furthermore, the distance along the edges from Jeremy Oliver to James Halliday is much longer (667 units), indicating that this pair of people produce the least similar scores of the three Australian raters. All of these relationships can be seen at a glance, which is what makes the network useful as a summary of the original data.

Interpreting a network 4

This is a relatively simple example, because there are only four objects. Clearly, a network will get more and more complex as we add more objects to the analysis. There are examples of this phenomenon in the links below.

Some other examples

Based on this explanation of phylogenetic networks, you might like to look at a few of the interesting examples from my professional blog, showing you the range of possible uses of networks. By "interesting" I mean that the subject matter of the blog post is interesting, not necessarily the networks themselves! The complete list of my network analyses is in this blog page: Analyses.

Simple datasets:
More complex datasets:
There is also a somewhat different explanation of how to interpret these networks in this blog post (which uses the results from a few Australian federal elections):
The use of phylogenetic networks was formally introduced in this research publication, which also contains a range of example analyses:
  • Morrison D.A. (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312.

Monday, December 5, 2016

Are there biases in community wine-quality scores?

In an earlier blog post I considered Biases in wine quality scores. This issue had been previously pointed out by Alex Hunt (What's in a number? Part the second), who presented graphs of the data from various well-known wine commentators showing that the scores they give wines have certain biases. In particular, some scores are consistently over-represented, such as 90 compared to 89 (using the 100-point scale) and 100 compared to 99. In my post, I performed a detailed analysis of a particular dataset (from the Wine Spectator), showing how we could mathematically estimate the extent of this bias.

This point immediately raises the question as to whether these biases also occur in community-based wine scores, or whether they are restricted to scores that are compiled from individual people. Community scores are compiled from the users of sites such as Cellar Tracker, where they simply pool all of their individual scores for each wine. It is possible that, even if each individual scorer shows biases towards certain scores, these biases might average out across all of the scorers, and thus be undetectable. Alternatively, the biases might be widespread, and thus still be evident even in the pooled data.

To find out, I downloaded the publicly available scores from Cellar Tracker for eight wines (for my data, only 55-75% of the scores were available as community scores, with the rest not being shared by the users). These eight wines included red wines from several different regions, a sweet white, a still white, a sparkling wine, and a fortified wine. In each case I searched for a wine with at least 300 community scores; but I did not succeed for the still white wine, and in that case the best I could find had only 189 scores.

Below, I have plotted the frequency distribution of the Cellar Tracker scores for each of the eight wines. In the same style as my previous post on this topic, the height of each vertical bar represents the proportion of the scores for that wine that had the value indicated on the horizontal axis.

As you can see, these graphs do show distinct biases, although some of the graphs are much less biased than are others.

The principal bias, when it occurs, is most commonly an over-representation of scores in units of five: 70, 75, 80, 85, 90, and 95. In particular, the first five graphs show this pattern to one extent or another. So, it seems that a lot of the users are actually scoring their wines on a 20-point scale, and then simply multiplying them by 5 to get to the 100-point scale required by Cellar Tracker.

The final three graphs show an over-representation of the score 88, compared to scores of 87 and 89 (and the first graph also has this pattern). This seems to be a manifestation of the same bias shown by the professional commentators, in which a score of 90 occurs more commonly than 89. That is, a score of 88 is used to indicate a wine that is well-liked, while a score of 90 represents a wine that is great, thus leaving a score of 89 in limbo.

Finally, the last graph shows an under-representation of the score 96, compared to scores of 95 and 97. There seems to be no obvious reason for this.
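A rough numerical check for this kind of bias is to compare the number of times a given score occurs with the average count of its two neighbors. The sketch below uses invented scores for a single wine, and `spike_ratio` is a hypothetical helper; a ratio well above 1 suggests over-representation, and a ratio well below 1 suggests under-representation.

```python
from collections import Counter

def spike_ratio(scores, target):
    """Count at `target` divided by the mean count of target-1 and target+1."""
    counts = Counter(scores)
    neighbors = (counts[target - 1] + counts[target + 1]) / 2
    if neighbors == 0:
        return float("inf") if counts[target] else 0.0
    return counts[target] / neighbors

# Hypothetical community scores for one wine (invented for illustration).
scores = (
    [84] * 2 + [85] * 8 + [86] * 3 + [87] * 4 + [88] * 6
    + [89] * 4 + [90] * 12 + [91] * 6 + [92] * 5
)

for target in (85, 88, 90):
    print(target, round(spike_ratio(scores, target), 2))
```

This is a much cruder measure than the model-based estimate in my earlier post, but it is enough to flag the scores that deserve a closer look in the graphs.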

CellarTracker scores for Caymus Cabernet Sauvignon 2012
CellarTracker scores for Viña Tondonia Reserva 2001
CellarTracker scores for Veuve Clicquot Ponsardin Champagne Brut NV
CellarTracker scores for Alvear Pedro Ximénez Montilla-Moriles Solera 1927
CellarTracker scores for Produttori del Barbaresco Barbaresco 2006
CellarTracker scores for Merry Edwards Sauvignon Blanc 2012

CellarTracker scores for Château Pontet-Canet 2003

CellarTracker scores for Château Rieussec 2001

Monday, November 28, 2016

Summarizing multi-dimensional wine data as graphs, Part 1: ordinations

When collecting data, it is quite common to record several characteristics for each of a set of "objects". For example, a wine (the object) might come from a particular region, and be based on a particular grape type, have a particular winemaker, and be of a particular quality (four characteristics). Such data are referred to as being multi-dimensional.

When dealing with multi-dimensional data, we could analyze each characteristic separately. However, this would not give us an overview of the whole dataset, but merely tell us about each of the details. If we want an overview, then we need to summarize the multiple dimensions down into something that we can illustrate as a graph.

This summarization process is part of multivariate data analysis, sometimes also called pattern analysis. There are many mathematical techniques for doing this, because there are many ways of summarizing anything, data included. In particular, the result of each summary may be unique, because there may be many possible patterns in the data that could be included in a summary, and each analysis technique may pick a different part of the data to summarize. After all, a summary must lose information, by definition, and there can be different opinions about which bits not to lose — each analysis technique can be seen as having its own "opinion".

I will be illustrating two different types of multivariate data analysis, using examples of data from the wine world. In this post I will look at ordination analyses, and in the next post I will look at network analyses.


Ordination analyses try to put the objects in some sort of rank order (hence the name), which can then be displayed as a one- or two-dimensional graph. In the graph, each point represents an object, and the positions of the points relative to each other illustrate their similarity based on the original multi-dimensional characteristics. That is, the many original dimensions are reduced to one or two dimensions, and we then get a picture of the result. Points close together in the picture are more similar than are points further apart.

The specific example shown here is taken from this research paper:
María-Pilar Sáenz-Navajas, Eva Campo, Angela Sutan, Jordi Ballester, Dominique Valentin (2013) Perception of wine quality according to extrinsic cues: the case of Burgundy wine consumers. Food Quality and Preference 27: 44-53.
As part of their work, these authors showed 23 wine bottles to each of 48 people, and asked them to subjectively assess what they thought was the likely quality of the wines (ie. based solely on looking at the bottle and its label). Their responses were categorized as Low quality, Average quality, or High quality.

In this example, there are 23 objects (the wines), and the characteristics are the three quality outcomes. For each object, we have a count of how many people placed it in each of the three quality classes (ie. we have three dimensions).

An example ordination summary

We wish to summarize the three-dimensional data down to one dimension, showing us the order of assessed quality of the wines, averaged across the 48 people. The authors chose to produce this summary with an ordination technique called Correspondence Analysis, which is certainly appropriate for their data. The resulting order of the wines is shown at the top of the first graph, with each dot representing a single wine, ordered along the dotted line from lowest quality at the left to highest quality at the right.
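For readers who like to see the arithmetic, here is a minimal sketch of how a first ordination axis can be extracted by a simple correspondence analysis. It is not the authors' own calculation: the function name `first_ca_axis`, the use of power iteration, and the tiny table of counts are all assumptions made purely for illustration (the real data have 23 wines, and each row sums to the 48 participants).

```python
import math

def first_ca_axis(table, iterations=500):
    """First-axis row coordinates from a simple correspondence analysis.

    table: rows of counts (objects x categories); every row and column
    is assumed to have a positive total.
    """
    total = sum(sum(row) for row in table)
    n_rows, n_cols = len(table), len(table[0])
    # Correspondence matrix, plus its row and column masses.
    p = [[x / total for x in row] for row in table]
    r = [sum(row) for row in p]
    c = [sum(p[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j).
    s = [[(p[i][j] - r[i] * c[j]) / math.sqrt(r[i] * c[j])
          for j in range(n_cols)] for i in range(n_rows)]
    # Power iteration on S^T S approximates the leading right singular vector.
    m = [[sum(s[i][a] * s[i][b] for i in range(n_rows))
          for b in range(n_cols)] for a in range(n_cols)]
    v = [1.0] + [0.0] * (n_cols - 2) + [-1.0]
    for _ in range(iterations):
        w = [sum(m[a][b] * v[b] for b in range(n_cols)) for a in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Row principal coordinates on the first axis: D_r^(-1/2) S v.
    coords = [sum(s[i][j] * v[j] for j in range(n_cols)) / math.sqrt(r[i])
              for i in range(n_rows)]
    # The sign of an axis is arbitrary; orient it so the last category is positive.
    if v[-1] < 0:
        coords = [-x for x in coords]
    return coords

# Hypothetical counts of (Low, Average, High) votes for four invented wines.
table = [
    [40, 6, 2],
    [20, 20, 8],
    [10, 20, 18],
    [2, 10, 36],
]
coords = first_ca_axis(table)
for label, position in zip("ABCD", coords):
    print(label, round(position, 2))
```

With these invented counts, the wine with mostly Low votes lands at the left end of the axis and the wine with mostly High votes at the right end, which is the kind of ordering shown along the dotted line in the graph.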

This is all very well, as we now have the wines in order, but obviously this isn't all that we want to know — we want to know what features of the wine labels led the participants to put the wine bottles in this particular order. This is easy to do for ordinations, and it is shown in the bottom five rows of the graph. Each row represents a different feature of the bottle labels, as indicated in the legend. The location of the colored dots within each row represents the average position along the dotted line of the wines with that feature.

For example, the second row indicates that the wines from the Pays d'Oc region are mostly down the left-hand (low quality) end of the graph, while the Burgundy and Jura wines are preferentially at the right (assessed as likely to be of high quality). Similarly, the fifth row indicates that wines bottled by a co-operative are preferentially at the low-quality end of the order, while wines bottled by the winemaker are at the high-quality end.
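The averaging behind the lower rows of the graph can be sketched in a few lines. This is only an illustration of the idea described above: the wine names, axis positions and feature labels are invented, and `mean_position_by_feature` is a hypothetical helper, not anything taken from the paper.

```python
from collections import defaultdict

def mean_position_by_feature(positions, features):
    """Average ordination position of the wines sharing each feature level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for wine, position in positions.items():
        level = features[wine]
        sums[level] += position
        counts[level] += 1
    return {level: sums[level] / counts[level] for level in sums}

# Hypothetical positions along the ordination axis (low quality is negative).
positions = {"wine1": -1.0, "wine2": -0.5, "wine3": 0.5, "wine4": 1.0}
# Hypothetical label feature: who bottled the wine.
bottler = {
    "wine1": "co-operative",
    "wine2": "co-operative",
    "wine3": "winemaker",
    "wine4": "winemaker",
}
means = mean_position_by_feature(positions, bottler)
print(means)
```

Each feature level then gets one dot, placed at its average position along the same axis as the wines themselves.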

We can thus see at a glance which label features are associated with the decision that a wine might be of high quality, as assessed by the participants. This is what ordinations are all about — producing a picture of data once it has been arranged in some relevant order.

Ordinations seem to be rarely used in wine research, but I think that a case can be made that they should be used more often, as a very convenient way of summarizing complex data.

Monday, November 21, 2016

Gattinara 1958

The northern Italian region of Piedmont is famous for, among other things, the long-lived Barolo and Barbaresco wines made from the Nebbiolo grape. What is less well known is that in the northern part of Piedmont is a gathering of other DOC and DOCG areas, in the Vercelli hills a long way north-east of Turin. Perhaps the best of these is the Gattinara DOCG. Here, they also make Nebbiolo wines, but under the local name of Spanna. In the past, the Gattinara wines have been at least as admired as those of Barolo for their longevity; but this reputation has slipped in the modern world. A recent visit to the region is described by Quentin Sadler on his wine page.

Old bottles of Spanna can still be found, as indicated by the recent tasting of some 1964s recorded on the Barolista blog.

A bottle of 1958 Berteletti Spanna

Gattinara 1958, Spanna del Castello di Lozzolo, from Fratelli Berteletti
Purchased on eBay (in July), for €35 delivered to my door from Italy

When first opened there was only a faint aroma, but after a few hours with a cork lightly inserted in the neck (the François Audouze method for opening old bottles of wine) the aroma had increased remarkably. This wine was very much still in its prime. On pouring, the wine had a pale amber hue, fading significantly towards the edge of the glass. The aroma showed plum, honey and toast, with hints of prune, peach and plum jam, along with an earthy tone. In the mouth, there was still plenty of fresh acidity, with low tannins of course, balanced perfectly with flavors of plum, lemon, prune, almond and tobacco, all complemented by a long aftertaste.

This wine was among the best old wines I have tasted. It was every bit the equal of a more expensive 1958 Giacomo Borgogno Barolo Riserva, tasted back in August 2009. Indeed, 1958 was among the best of the old Piedmont vintages, and there are still quite a few bottles available even now.

The wine was drunk with a dinner of meatballs in tomato sauce and parmesan cheese, but it went especially well with the pre-dinner Pecorino Smeraldo, which is a Sardinian sheep's cheese.

[Tasting notes by Susanne Stenlund.]

Note: Among the 1964 Spanna wines tasted by Barolista was also a bottle from Fratelli Berteletti, which received a tasting note very similar to the one above:
A stunner from the first pour until the last sip. The nose is big, mature and velvety with notes of dried black cherries, liquorice, asphalt and dried flowers. Very deep. Some coffee ground notes after a while. Very much alive and kicking. The taste is broad and steady with notes of black olives, dried mushrooms, rosehips and warm gravel. Long and rich. This is really good. Given this blind I would have guessed it to be from the 80s. Will go on for another 20 years.
Viva Spanna!

Monday, November 14, 2016

Can non-experts distinguish anything about wine?

Roman L. Weil is a professor of accounting, with an interest in wine. During the early 2000s he conducted three similar experiments to assess the ability of non-experts (primarily educated, upper middle-class individuals who were experienced and enthusiastic wine drinkers) to distinguish various characteristics of wine. These distinctions included:
  • vintages rated by an expert as good versus poor
  • wines selected for a special "reserve" bottling versus the normal wine
  • different taste descriptors provided by an expert.
Here, I summarize the results of those experiments, as they seem not to be widely known, and yet they provide very interesting conclusions. In my usual fashion, I present pictures of the results (ie. graphs) rather than the original tabulated numbers, because it is then much easier to see the patterns in the data and thus to appreciate the conclusions.


All of the experiments were designed in the same way. Several different pairs of wines were chosen for each experiment, the pairing being determined by the particular objective of each experiment; these wine pairs constitute the experimental replication. The paired wines were presented to several hundred different tasters, spread over a number of different places and occasions; these people constitute the replicate sample units.

In each case, each taster was presented with three unlabeled glasses, one glass containing one of the wines, and two glasses with the other wine from the same pair. In this triangular experiment, the taster was asked to distinguish the singleton wine (ie. one of the glasses should taste different to the other two glasses). The taster was then asked to identify certain characteristics of the two wines. On any one occasion, tasters received 1–3 of the wine pairs.

The results were summed for each wine pair separately, listing the number of people who correctly distinguished the two wines in each pair, and then how many of those successful people correctly identified the chosen characteristics. Note that distinguishing the characteristics is not relevant unless the taster could actually distinguish the paired wines in the first place!

By random chance, the tasters should be able to distinguish the paired wines one-third of the time (ie. identifying the singleton glass out of three). So, our "expected" result is 33% if the tasters can do no better than random (ie. guessing). Then, for the two characteristics the expectation is 50%, if the tasters can do no better than random (ie. there are two characteristics to identify).

Distinguishing different vintages of the same wine

Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.

The hypotheses being tested in this experiment are that the amateurs:
  • cannot distinguish in blind tastings the wines of years rated by an expert as high from those of years rated low, and
  • if they can, they do not agree with the vintage chart's preferences.
To test these hypotheses, Weil selected six "pairs of wines with the following characteristics: the pairs have identical features (such as shipper, vineyard, and producer) except vintage, and Robert Parker rated one the vintages of those two wines Average to Appalling while he ranked the other Excellent to The Finest in The Wine Advocates Vintage Guide 1970-1999." So, the only difference between the paired wines should be that they came from vintages that Parker thought were very different from each other.

There were 593 tasters. One of the wine pairs was presented to wine professionals ("experts") on two occasions, as well as to the amateurs on the other occasions, and so these experts are treated separately in the results. The pairs of wine were tasted by 54-119 tasters each.

The results of the first hypothesis test are shown in the next graph. For each of the graphs presented below, the interpretation is as follows. Each wine-pair is represented by a horizontal line, as indicated by the legend. The central point on each of the lines represents the percentage of the tasters who succeeded at the task for that wine pair. The two end points on each line are the boundaries of the estimated 95% confidence interval (formally: the Score binomial 95% confidence interval). This interval gets smaller as the sample size (the number of tasters) gets larger, as it represents our statistical "confidence" in the results of the experiment. The dashed line represents the expected results if the tasters are performing in a random manner — the idea of the experiment is to see whether people can do better than random. So, if the dashed line passes through the 95% confidence interval for a particular wine pair, then the tasters have done no better than random for that pair, whereas if the dashed line lies outside the 95% confidence interval then the tasters have done better than random.
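As a rough sketch of the calculation behind each horizontal line, the following assumes that the "Score" interval is the one commonly called the Wilson score interval. The function names and the taster counts are hypothetical and are not taken from Weil's papers; note also that the helper only checks whether the whole interval lies above the chance level, since that is the direction of interest here.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width

def better_than_chance(successes, n, chance):
    """True if the whole interval lies above the chance level (the dashed line)."""
    low, _high = wilson_interval(successes, n)
    return low > chance

# Hypothetical triangle-test results: chance of picking the odd glass is 1/3.
chance = 1 / 3
for successes, n in [(40, 100), (55, 100)]:
    low, high = wilson_interval(successes, n)
    print(successes, n, round(low, 3), round(high, 3),
          better_than_chance(successes, n, chance))
```

In this invented example, 40 correct tasters out of 100 gives an interval that still includes one-third, while 55 out of 100 gives an interval lying entirely above it.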

Results of Roman Weil's experimental test of wines from different vintages

For the first graph, only the two groups of tasters receiving the Bordeaux wine performed better than random chance. Formally: for five of the wine pairs, the experiment provides no evidence that amateur wine tasters can distinguish between good and poor vintages any better than taking a guess. For the Bordeaux wine pair, both the amateurs and experts did better than taking a guess, with the wine experts doing slightly better than the amateurs.

This outcome calls into serious question the alleged difference of quality between different vintages in the modern world. Remember, Robert Parker (or his delegate) detected big differences in the vintages within a wine pair, but the amateurs could not consistently detect this for themselves when presented with actual examples of the wines. The different result for the Bordeaux wines may reflect the common conception that vintages really do still differ in Bordeaux.

The results of the second hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 30-50% of the tasters. The sample sizes therefore refer to only 21-60 tasters per wine pair.

Results of Roman Weil's experimental test of wines from different vintages

Note that in all cases the tasters behaved in a random manner. That is, there was no consistent preference for the wine from the highly rated vintage compared to the poorer vintage, for any of the wines. We may conclude from this that expert vintage ratings are not related to wine preferences among wine drinkers. The wine from an allegedly poor vintage can taste just as good to an amateur drinker as a wine from a supposedly better vintage.

Distinguishing reserve and normal bottlings of the same wine

Roman L. Weil (2005) Analysis of reserve and regular bottlings: why pay for a difference only critics claim to notice? Chance 18(3):9-15.

The hypotheses being tested in this experiment are that the amateurs:
  • cannot distinguish in blind tastings the wines of reserve bottlings (or first wines) from the normal wines (or second wines), and
  • if they can, they do not prefer the reserve wine.
To test these hypotheses, Weil selected fourteen "pairs of wines based on the following characteristics: the pairs had identical features in all respects, except that one was a regular bottling and one was a reserve bottling. Common features included all label items (e.g. shipper, vineyard, and producer), retail source, and date of purchase." So, the only difference between the paired wines should be that the winemaker specially selected the reserve or first wine for separate bottling, at a much higher price (there was a price ratio of 1.13-3.57 for Weil's choices).

The results of the first hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. There were 855 tasters, with the pairs of wine being tasted by 38-136 tasters each. The two pairs of Champagne wines were each tasted by a small number of people only, and so I have pooled their results here (they did not differ from each other).

Results of Roman Weil's experimental test of wines from different bottlings

Note that the tasters do very much better here than in the previous experiment. That is, for six of the thirteen wine pairs the tasters did better than random when asked to distinguish the more expensive bottle of wine from the same winemaker. Mind you, they rarely did better than 50%, as opposed to the 33% expected from guessing. Interestingly, there are three wine types that are repeated in the experiment: the cabernet blend from Bordeaux, the cabernet wine from the western USA, and the white wine from California; and in all three cases the tasters succeeded with one wine but not the other.

Nevertheless, the results do indicate that, for tasters, what the winemaker does with the wine (selects wine for different bottlings, to be charged at different prices) often makes a bigger difference than what nature does with the wine (produces different climatic conditions in different years).

The results of the second hypothesis test are shown in the next graph. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 30-50% of the tasters. The sample sizes therefore refer to only 13-56 tasters per wine pair.

Results of Roman Weil's experimental test of wines from different bottlings

Here, the tasters did not consistently prefer the reserve wine over the normal wine, except in two cases. We may conclude from this that winemakers are, indeed, generally selecting wines of different taste for their different bottlings, but that this is not necessarily related to wine preferences among wine drinkers. The wine from an expensive bottle can taste just as good to an amateur drinker as one from a supposedly inferior bottle of the same wine.

The two exceptions are informative. For one of the California chardonnays there was a strong preference for the more expensive wine. This suggests that the winemaker succeeded in this particular case — they charged more ($26 versus $13) for a wine that drinkers actually prefer. In the opposite manner, for one of the Bordeaux wines there was actually a preference for the cheaper wine. It may surprise you to learn that this was a preference for the 1994 Les Forts de Latour ($56 at the time) over 1994 Château Latour ($200), the most expensive wine in the experiment. The Bordeaux first-growth chateaux might like to take note of this result (as might your wallet!). (Note: in general, the first wines of the Bordeaux first growths cost 3-4 times as much as their second wines; see the Liv-Ex blog.)

Matching wines and their descriptions

Roman L. Weil (2007) Debunking critics' wine words: can amateurs distinguish the smell of asphalt from the taste of cherries? Journal of Wine Economics 2:136-144.

The hypotheses being tested here are that the amateurs:
  • cannot distinguish in blind tastings wines that are described by an expert using different words, and
  • if they can, they cannot match the descriptions to the wines.
To test these hypotheses, Weil selected ten "pairs of wines with the following characteristics: the pairs have similar features, and the same writer / critic wrote about these two wines with disjoint word sets. That is, the reviewer used different words in describing the two wines." Note that the wines could actually come from different vintages or even continents, provided that they had similar grapes, etc.

The results of the first hypothesis test are shown in the next graph, which is interpreted in the same manner as described above. There were 321 tasters, with the pairs of wine being tasted by 13-86 tasters each, which means much smaller sample sizes than for the other experiments.

Results of Roman Weil's experimental test of wines with different descriptions

Since the objective was to choose wines that differ in description by an expert, it is hardly surprising that the tasters succeeded in distinguishing the wine pairs in six out of the ten cases. However, in only one case did they do better than 60-70%, which does call into question the experts' abilities to describe wine in any quantitative way. After all, there are many examples in wine lore of different experts also describing exactly the same wine in completely disjunct words.

The results of the second hypothesis test are shown in the next graph. Remember that the data here refer only to those tasters who successfully distinguished the wine pairs, which is 40-60% of the tasters. The sample sizes therefore refer to only 5-45 tasters per wine pair.

Results of Roman Weil's experimental test of wines with different descriptions

Sadly, in only one case could the tasters consistently match the wines to the expert descriptions. So, we may conclude that reading a description of a wine does not necessarily tell you what it will taste like to you.


Combined, these three experiments do not paint a happy picture of the wine business. Amateur wine tasters cannot consistently distinguish wines from different vintages or different bottlings, or with different descriptions. And when they can do so, their preferences do not necessarily agree with the professionals' assessments of quality —  they are about as likely to prefer the one as the other. So, what is it that these professionals are doing? Whatever it is, it seems to be somewhat divorced from their customer base. In any case, there seems to be little reason to pay more for a "special" wine (a better year or a better selection), unless you have already checked it out and decided that you prefer it.

Quality versus preference

One potentially confusing aspect of Weil's experiments is that in two of his three experiments his second hypothesis is not actually related to the first one. In the first experiment his second question concerns which wine the tasters prefer, not which one they think is from the higher-rated vintage; and similarly for the second experiment, they are asked which wine they prefer rather than which one is the reserve wine. Only in the third experiment is the second question directly related to the objective — which wine matches which description.

It is important to recognize the distinction between "prefer / like" and "high quality" (otherwise, one of the two expressions would be redundant!). These are often treated as though they both mean "better", as in the expression "if you like it then it is good". However, these are two very different ideas — supposedly better quality does not mean that you should prefer it in any personal sense. Personal preference is all in your head, but differences in quality also exist outside of it.

For example, one does not need to like opera in order to recognize a poor opera singer, nor does one have to be a practicing Christian to appreciate the architectural and artistic merits of a church. So, recognition of quality is not necessarily related to personal choice. For example, I can accept that there are high-quality characteristics of Champagne, but I do not actually like the taste of those distinctive characteristics — I actually prefer the crémant wines from Alsace, Die or the Loire, or the sparkling wines of southern Australia. Financially, of course, this is to my benefit!

This point is important for a wine drinker. The ability to recognize which wine the professionals think has higher quality is a separate issue from whether you actually like that wine. Do I like the wines recommended by Robert Parker? Perhaps so, or perhaps not, but either way I can probably recognize them, because they have a similar set of characteristics. He sees those characteristics as denoting high quality, but I may well see them as something I don't particularly care for.

Weil is probably right to focus on "prefer / like", since that is of most practical relevance to a consumer; but we should not confuse this with "quality". It would be of interest to experimentally examine the latter, also.

Sunday, November 6, 2016

Modern wine vintage charts: pro or con?

Vintage charts, which provide a quality score for each wine vintage in some specified wine-making region, have been a conspicuous part of the wine landscape for many decades. However, there has also been an increasing amount of criticism in recent years.

For your reading pleasure, at the bottom of this post I have included links to a selection of online commentaries (mostly negative) about this issue. The principal objections seem to be one or more of these:
  • They are broad generalizations — they do not account for within-region variation in quality
  • The ratings oversimplify — there is also between-vineyard variation within local areas
  • There is no recognition of site selection - there is even within-vineyard variation
  • Modern wine-making (along with global warming) produces reasonably consistent quality, so that vintage variation mainly concerns quantity, instead
  • Do the charts rate wine longevity or drinkability?
  • Vintage variation influences style, but not necessarily quality
  • Charts produced by different people inevitably differ, often strongly disagreeing
  • Do wine drinkers actually prefer highly rated vintages?
Many of these points are easy to quantify, and most of them make vintage charts redundant in the modern world. Here, I present specific examples to illustrate some of these points.

Different people, different charts

Most of the well-known wine magazines produce vintage charts, which are available online. The first graph below compares two of these charts for the vintages from 2000-2011. The dots represent the vintage scores from the Wine Advocate (vertically) and the Wine Enthusiast (horizontally) pooled for the following Italian regions: Barolo, Barbaresco, Brunello di Montalcino, and Chianti. If the two magazines gave each vintage the same score, then the dots would all be along the pink line.

Wine Advocate versus Wine Enthusiast vintage scores for Italy

As you can see, there is a great deal of disagreement between these two charts, as only four of the dots are actually on the line, and another five differ by 1 point. But more importantly, the eight Wine Enthusiast scores between 80 and 88 form two clusters of quality scores as far as the Wine Advocate is concerned, with four of the vintages scoring much lower (74-77) than the other four (89-93).

As an alternative example, Jancis Robinson has organized some blind tastings of the red Bordeaux vintages from this century (C21 Bordeaux vintages - a ranking). During the tastings in 2015 and 2016, the attending wine professionals were "asked to rank the last 13 vintages in qualitative order." We thus have a total of 18 (2016) and 15 (2015) rankings for the same 12 vintages (2000-2011). These are compared in the next graph, where each dot represents a single vintage, located according to the sum of ranks from 2015 (horizontally) and 2016 (vertically). Note that a smaller rank indicates a "better" vintage.

Jancis Robinson vintage assessment from 2015 and 2016 for Bordeaux

There is obviously a lot of agreement here. However, there are four vintages in the middle of the graph that all had very similar ranks in 2015 but had two very different ranks in 2016, so that two of the dots are a long way below the line. That is, the 2006 and 2008 vintages were evaluated similarly in the two tastings, but the 2003 and 2004 vintages dropped significantly in the ranking between 2015 and 2016.

Andrew Jefford has a comment on these rankings at Decanter (Kicking the hell out of Bordeaux 2011).

Within-region variation

Wine vintage charts must apply to specified wine-making regions, with a score for each vintage in each region. Unfortunately, these regions are often unconscionably large, so that a single number cannot possibly describe the wine quality across the whole region. While countries like France, Spain and Italy usually get divided into several wine-making regions, even somewhere as large as California sometimes gets treated as a single region.

However, to me, the classic example of silliness is trying to treat an entire continent like Australia as a single region, or even "south-eastern Australia". The following maps compare the size of Australia to both Europe (minus Scandinavia and the Baltic states) and the USA. As you can see, even south-eastern Australia is as large as Spain + Portugal, or California + Oregon + Washington. Moreover, the variation in wine-growing climates throughout south-eastern Australia is at least as large as any of these other conglomerations.

Within-location variation

Traditional wine-making regions sometimes get subdivided, on the grounds that the within-region climate variation produces different wines. Thus, Bordeaux red wine is sometimes divided into the Right Bank (Saint Emilion and Pomerol) and the Left Bank (the Médoc).

The next graph compares the vintage rankings for these two Banks, from the Wine Cellar Insider, for the vintages from 1982-2014. Each dot represents a single vintage, located according to its rank for the Left Bank (horizontally) and the Right Bank (vertically). Once again, smaller ranks indicate "better" vintages; and if the vintages had the same rank in both Banks then the dots would lie along the pink line. Not all vintages made it into the rankings (ie. some were not considered good enough to be worth ranking).

Wine Cellar Insider vintage ranks for Left and Right Bank Bordeaux

While there is some consistency in the rankings, there are many anomalies, where the two Banks had very different qualities in the same vintage. In particular, there are six vintages (shown as red dots) where a vintage made it into the ranking for one Bank but not the other.

The Global Wine Score blog has a similar analysis for these two Banks (Bordeaux 2015 vintage: Right Or Left Bank?).

Modern consistency of vintage quality

I have published several recent blog posts that illustrate the changing nature of vintage scores over the past 25 years (Two centuries of Bordeaux vintages — Tastet & Lawton; A century of Barolo vintages — Fontanafredda; More than a century of Barolo vintages — Marchesi di Barolo). The bottom line is that the scores have increased during that time, as well as becoming less variable from year to year.

A similar point is made for Australian vintages in this paper:
     V.O. Sadras, C.J. Soar, P.R. Petrie (2007) Quantification of time trends in vintage scores and their variability for major wine regions of Australia. Australian Journal of Grape and Wine Research 13:117-123.

Preference for high versus low vintage ratings

This issue of whether wine drinkers actually prefer the vintages recommended by the wine charts is addressed in another published article:
     Roman L. Weil (2001) Parker v. Prial: the death of the vintage chart. Chance 14(4):27-31.
A free copy is available here.

The paper discusses an empirical test of the claim (specifically by Frank Prial; see the link below) that the modern vintage chart is redundant. The author got many people to do tastings of paired wines, one from a good vintage as decreed by the Wine Advocate chart and one a poor vintage; and his conclusion is:
The 240 wine drinkers on whom I’ve systematically tested Prial’s hypothesis cannot distinguish between wines of good and bad vintages, except for Bordeaux, and even when they can distinguish, their preferences and the chart’s do not match better than a random process would imply.
In other words, a high vintage score in a chart is no guarantee that you will actually like the wines.

Selected commentaries

Frank J. Prial, The New York Times
So who needs vintage charts?

Paul Gregutt, The Seattle Times
Rating vintage ratings; not high

Paul Kaan, Filthy Good Vino blog
Using a vintage chart to pick wines sucks … here’s a better way!

W. Blake Gray, The Gray Report blog
Vintage charts for California are worthless

Dan Berger, Vintage Experiences newsletter
Vintage chart fallacies

Richard Hemming, Jancis Robinson blog
Vintage nonsense

Decanter staff
Are official vintage charts meaningless?

Monday, October 31, 2016

Precision and accuracy of numbers — getting it right

Every day we are bombarded with numbers, usually from the media. Unfortunately, the people presenting these numbers often do not understand the relationship between the precision and the accuracy of their numbers, and so they are prone to mislead both themselves and their readers.

In my experience as a scientist, this is true even in the professional literature; and my recent experience in the professional literature of the wine world suggests that it is no different there, either. So, it is worth explaining this situation, to see if I can't encourage people to get it right.

The accuracy of a number refers to how close it is to the truth. If I claim that something costs $10 when it actually costs $15, then I am not being very accurate.

The precision of a number refers to how many digits I am using, or how many decimal places I present. If I claim that something costs $10.11 rather than $10, then I am being more precise (I've used four digits rather than two).

This distinction is often illustrated using the idea of shooting at a target, as shown above. Precision refers to how close together the repeated shots are, while accuracy refers to how close the shots are to the center.

A problem occurs when the precision of any number is greater than its accuracy, because that will be misleading. For example, if I claim that something costs $10.11 when it actually costs $15, then the precision of my number (to the nearest cent) gives a spurious sense of accuracy (I am not even accurate to the nearest dollar). This is bad; and it can be easily avoided.

I can illustrate this using the following example from the recent wine literature. In this case, the data summarize some of the characteristics of 48 people who were sampled. When presented as percentages, the numbers cannot be more accurate than to the nearest 2% — after all, the only numbers possible are 0 people out of 48 = 0%, 1 / 48 = 2%, 2 / 48 = 4% .... 47 / 48 = 98%, 48 / 48 = 100%.

However, the numbers as presented in the paper were to the nearest 0.1%, which is 1 out of 1000 not 1 out of 48, as shown in the first table. The 60.4% actually refers to 29 out of 48 people, not to 604 out of 1000. This is misleading.

In this case, presenting the numbers to the nearest 1% (ie. dropping the decimal places) would be better, because the precision would more nearly represent the accuracy.
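The arithmetic for the sample of 48 people can be checked with a short sketch. The function names here are my own, chosen for illustration; the only numbers used are the 48 people and the 29 who make up the reported 60.4%.

```python
def percentage_step(n):
    """Smallest possible change in a percentage when one of n people changes group."""
    return 100 / n

def possible_percentages(n, decimals=0):
    """Every percentage that a count out of n people can produce."""
    return [round(100 * k / n, decimals) for k in range(n + 1)]

n = 48
step = percentage_step(n)                   # roughly 2 percentage points per person
reported = round(100 * 29 / n, 1)           # the value as printed in the paper
rounded = round(100 * 29 / n)               # the same value to the nearest 1%
distinct = len(set(possible_percentages(n, 1)))

print(round(step, 2), reported, rounded, distinct)
```

However many decimal places are printed, there are still only 49 possible values (0 to 48 people), each about 2 percentage points apart, so the extra decimal place adds digits without adding information.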

As an alternative example, the next table shows two different sample sizes, 136 and 50. A sample size of 136 may well justify an accuracy of one decimal place but not 2 such places; and a sample size of 50 probably does not justify even one decimal place. Just because we can calculate a number to many decimal places (lots of precision) does not mean that the accuracy justifies this.

These situations are easy to avoid — precision should simply never exceed the accuracy.

Note: I have not identified the authors of either of the examples illustrated here. I agree with Bjørn Andersen (in his book Methodological Errors in Medical Research) that we should not "pillory a few for errors which many commit with impunity".