Some time ago I published a post about getting the question right when analyzing data. I pointed out that the question usually leads to the choice of an appropriate mathematical model, which we then use to answer that question. We fit the model to the data (or the other way around), and reach some conclusions from what the model tells us. So, asking the *right* question will usually tell us something useful.

However, we need to think about the purpose of the model. Are we actually trying to create some model that helps us understand our data, or are we just trying to draw some line through a graph of the data? This is the difference between explaining our data and summarizing it, respectively (see An introduction to data modeling, and why we do it). Here, I will draw some different lines through a couple of wine-related data sets, representing different models, to show you what I mean.

**Bordeaux wine production**

Consider the following graph, which is taken from the *American Journal of Enology and Viticulture* 51: 249-261, 2000. It shows the total production of the Bordeaux wine region (vertically) over 60 years (horizontally). Each point represents one vintage.

The authors have added a polynomial line to their empirical data, to illustrate the trend in wine production. The line fits the data quite well, with 89% of the variation in the data being fitted by the model.

This model may well be adequate for the authors' purpose (in writing the paper). However, it cannot be a realistic model for the data. For example, the model suggests that production decreased during the first third of the 20th century — indeed, it implies that wine production in 1905 was the same as in 1995, which is not what happened (see the actual Bordeaux history as discussed by the Wine Cellar Insider).

So, this is simply a "line of best fit" to the data, used as a convenience. It cannot be used to discuss wine production outside the range of years shown on the graph. That is, the line does not model wine production, but merely summarizes the available data.
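To see how a polynomial can summarize data well and still mislead outside its range, here is a minimal sketch in Python. The numbers are invented for illustration (a flat series that starts rising in 1968); they are not the Bordeaux data. The quadratic degree is also my assumption, since the degree of the authors' polynomial is not stated here, and all of the function names are hypothetical.

```python
# Sketch: a quadratic "line of best fit" on made-up data (NOT the Bordeaux figures).

def solve_linear(a, b):
    """Solve a small dense system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial via the normal equations; coefficients, lowest power first."""
    size = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear(a, b)

def evaluate(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def r_squared(ys, fitted):
    """Proportion of the variation in ys that is accounted for by the fitted values."""
    mean = sum(ys) / len(ys)
    ss_total = sum((y - mean) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1.0 - ss_residual / ss_total

CENTER = 1968  # years are centered on this value to keep the arithmetic well conditioned
years = list(range(1938, 1999))
# Invented series: constant until 1968, then a steady rise (arbitrary units).
production = [3.0 + 0.1 * max(0, year - CENTER) for year in years]

xs = [year - CENTER for year in years]
coeffs = fit_polynomial(xs, production, degree=2)
fitted = [evaluate(coeffs, x) for x in xs]
r2 = r_squared(production, fitted)

def predict(year):
    return evaluate(coeffs, year - CENTER)

# A quadratic is symmetric about its turning point, so it must bend back upwards.
vertex_year = CENTER - coeffs[1] / (2 * coeffs[2])

print(f"R^2 within the data range: {r2:.3f}")
print(f"Year of the fitted minimum: {vertex_year:.1f}")
print(f"Fitted value for 1938 (first data year): {predict(1938):.2f}")
print(f"Extrapolated value for 1905: {predict(1905):.2f}")
```

On this toy series the curve fits closely within the data range, yet its extrapolated value for 1905 is well above the flat level from which the early data were generated. That is the same kind of artifact as the implied decrease in production described above. With NumPy available, `numpy.polyfit` would do the fitting step in one call.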

If we wanted to model the actual wine production, then we would need a model (and line) that shows a small increase in wine production from the mid-1700s until the mid-1900s (since that is what actually happened).

As an example, consider the following graph of the same data, to which I have added two straight lines. One line fits the data until the mid-1960s and the other fits the data from then onwards. This is called a piecewise model (i.e. it consists of a series of straight lines).

The two lines of this piecewise model happen to intersect in 1968, which turns out to be the last year in which Bordeaux had a poor vintage. This intersection may thus not be coincidence. Indeed, Pablo Almaraz (2015. Bordeaux wine quality and climate fluctuations during the last century: changing temperatures and changing industry. *Climate Research* 64: 187-199) suggests that production management in Bordeaux changed during the 1960s, under which circumstances this new model would have some realistic basis.
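The piecewise idea can be sketched in a few lines of Python: fit one straight line to the early vintages, another to the later ones, and solve for the year in which they cross. The series below is invented (two exact line segments built so that the fitted lines meet in 1968). The 1965 split year and the function name are my assumptions for illustration; they are not the values used for the graph.

```python
# Sketch: a two-segment piecewise linear model on made-up data (NOT the Bordeaux figures).

def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

CENTER = 1968      # years are centered on this value before fitting
SPLIT_YEAR = 1965  # ASSUMED boundary between the two segments

years = list(range(1938, 1999))
# Invented series (arbitrary units): a slow rise before the split, a faster rise after it.
production = [
    3.6 + (0.02 if year < SPLIT_YEAR else 0.12) * (year - CENTER)
    for year in years
]

early = [(y - CENTER, p) for y, p in zip(years, production) if y < SPLIT_YEAR]
late = [(y - CENTER, p) for y, p in zip(years, production) if y >= SPLIT_YEAR]

slope_early, intercept_early = fit_line(*zip(*early))
slope_late, intercept_late = fit_line(*zip(*late))

# The lines cross where slope_early*x + intercept_early == slope_late*x + intercept_late.
crossing_x = (intercept_late - intercept_early) / (slope_early - slope_late)
crossing_year = CENTER + crossing_x

print(f"Early slope: {slope_early:.3f} per year")
print(f"Late slope:  {slope_late:.3f} per year")
print(f"Lines intersect in: {crossing_year:.1f}")
```

Note that the crossing year is an output of the two fits, not something chosen in advance, so it can differ from the year used to split the data.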

However, this piecewise model also cannot be correct, because it suggests that there would be a continuing increase in production during the 2000s, and we know that this did not subsequently happen. A sigmoid model would be needed, instead.

To illustrate what I mean by this type of model, let's look at the wine production of a single château.

**Single château wine yield**

This next graph plots some data for an unnamed wine producer, although only Château Latour fits the description given in the *American Economic Review, Papers and Proceedings* 101: 142-146, 2011. This time, wine production is shown for the 1840s until the early 2000s.

We can see that wine production increased during the 160 years of data, and we could, if we were so inclined, fit a straight line as a "best fit". However, this line would fit only 50% of the variation in the data.

A more realistic model would be one that suggests little change in production until the 1950s, and little change in production from the 1990s onwards. Such a model is shown as the thick line in this final graph. Such models are known as sigmoid (the lines are shaped like the letter S) — technically, this one is a logistic model.

The model indicates that the long-term average production from 1850-1950 was c. 17 hL/ha. Production then rapidly increased to 45 hL/ha by 1990 (i.e. an almost 3-fold increase). The mid-point of the increase was between the 1967 and 1968 vintages. This model thus fits the conclusions from the piecewise model quite nicely.
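For readers who want to see the shape, a logistic curve with two plateaus can be written as in the Python sketch below. The plateaus (17 and 45 hL/ha) and the mid-point (between the 1967 and 1968 vintages) are the values quoted above. The steepness is purely my assumption, chosen so that the curve is close to its upper plateau by 1990, and the function name is hypothetical.

```python
import math

LOWER = 17.0       # hL/ha, long-term plateau before the increase (from the text)
UPPER = 45.0       # hL/ha, plateau reached by about 1990 (from the text)
MIDPOINT = 1967.5  # between the 1967 and 1968 vintages (from the text)
STEEPNESS = 0.3    # per year; ASSUMED, not a fitted value from the graph

def logistic_yield(year):
    """Logistic (sigmoid) curve rising from LOWER to UPPER around MIDPOINT."""
    return LOWER + (UPPER - LOWER) / (1.0 + math.exp(-STEEPNESS * (year - MIDPOINT)))

for year in (1850, 1950, 1967.5, 1990, 2000):
    print(year, round(logistic_yield(year), 1))

print("Ratio of upper to lower plateau:", round(UPPER / LOWER, 2))
```

Each parameter here has a direct interpretation (two production levels, the timing of the change, and its speed), which is what distinguishes this kind of model from a curve chosen only because it fits. Both tails flatten out, so extrapolating backwards returns the lower plateau.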

However, this model is probably not entirely correct, because it implies that Bordeaux wine production was unchanged in prior centuries, when it probably increased somewhat, from the 1700s.

**Discussion**

There is a difference between fitting a line to the data (curve fitting) and trying to model the biology represented by the data. Both types of analysis fit an equation to a set of data, which is then visualized as fitting a line to a set of points on a graph. However, curve fitting focuses on finding the best-fitting equation, while modeling focuses on finding a model with a realistic biological interpretation.

Fitting a line is a mathematical procedure of convenience — it summarizes the data. However, the resulting equation may not have much direct biological relevance — the parameters of the model equation need to have a reasonable biological interpretation. In particular, a model should have relevance outside the range of the observed data — if the equation predicts values that are known to be incorrect, then it cannot be a good model for the biology, and nor can it be good if it predicts outrageous unknown values. A fitted curve is relevant only within the range of the data.

It is thus important to understand the purpose the author(s) had in fitting the line, because this determines how we interpret its meaning.
