Monday, July 2, 2018

An introduction to data modeling, and why we do it

Among all of the current hype about quantitative data analysis (eg. The arms race for quants comes to the world’s biggest asset managers), especially with regard to what are called Big Data, I have noted a few negative comments about the idea of modeling data. Not unexpectedly, non-experts are often wary of things beyond their own expertise. (I know I am!) So, I thought that I might write a post outlining just what people are trying to do when they do the sorts of things that I normally do in this blog.

Background

If the world is a non-random place, then it is likely that we can find at least a few patterns in it that we can describe and explain in a simple way. We refer to this process of description and explanation as modeling. In this process, we try to find simple models that can be used for both describing and explaining the world around us; and, if we get it right, then these models can be used for forecasting (prediction), as well.

This is not data modeling

The main issue is that life is not entirely predictable — it has predictable components and unpredictable components. For example, we all know that we have a limited life-span, although we do not know where or when we will depart. Nevertheless, there is a measured average life-span (the predictable component), along with variation around that average (the unpredictable component). We can thus all expect to live for somewhere in the range of 80-85 years, and we can plan a future for ourselves on that basis.

Think of it this way: the predictable component gives us optimism, because we can make plans, while the unpredictable component makes our plans go astray.* Models are our formal, mathematical way of trying to identify the two components. The idea is to find out whether the predictable part dominates, or not. If it does, then forecasting is a viable activity for us all.

Another way of thinking about this is the classic question as to whether the glass of water is half full or half empty. The full part is the predictable component, and the empty part is the unpredictable component. Of course, the glass is both half full and half empty; and we should actually be interested in both components — why is it half full, and why is it half empty? Each will tell us something that might be of interest.

Modeling

So, models try to formalize this basic idea mathematically. If we have some quantitative data, then we can try to find an equation that tells us about the predictable component of the data, and about how much the real data deviate from the model in some (possibly unpredictable) way. For example, we anticipate that each person's life-span is not random, and we can thus model it as deviating in some unpredictable way from the (predictable) average life-span. Similarly, tomorrow's weather is not random, but instead it deviates from today's weather in more or less unpredictable ways.

To get a picture of what is happening, we often draw a graph. For example, our data might be shown as a series of points, and we can then fit a line to these data. This line represents the model, and the closeness of the points to the line tells us how well our model fits the data. The line is the predictable component, and the deviation of the points from the line represents the unpredictable component.
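
To make this concrete, here is a minimal sketch in Python (using the numpy library, and entirely made-up numbers) of the two components: we fit a straight line to some points, and then split each observation into the fitted value (the predictable part) and the residual left over (the unpredictable part).

    import numpy as np

    # Entirely made-up data: a predictable linear trend plus unpredictable noise.
    rng = np.random.default_rng(seed=1)
    x = np.arange(20.0)                         # e.g. an observation index
    y = 3.0 + 0.5 * x + rng.normal(0, 1, 20)    # trend plus random deviation

    # Fit a straight line (the model) to the points.
    slope, intercept = np.polyfit(x, y, deg=1)
    fitted = intercept + slope * x              # the predictable component
    residuals = y - fitted                      # the unpredictable component

    print(f"model: y = {intercept:.2f} + {slope:.2f} * x")
    print(f"typical deviation from the line: {residuals.std():.2f}")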

A couple of examples

Here are two wine-related examples of a type of modeling that I have used in previous blog posts. In both cases, I will be fitting an Exponential Model to a set of data, as this seems to be the simplest model that fits these data sets well (see the discussion at the end of the post).

The first data set comes from the EuroStat database. It lists the average size of a vineyard holding for the 18 European Union countries with the most vineyard area. Each point in the first graph represents a single country, with the countries ranked horizontally in decreasing order of average holding size, and the average vineyard size shown vertically.

Average vineyard holdings in the European Union

The line represents our model (the Exponential). Note that the vertical axis is on a logarithmic scale, which means that an exponential model forms a straight line on the graph. Also, the model fits 97% of the data, which is a very good fit indeed (see the discussion later in the post).
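
For the technically minded, here is a sketch of how such a fit can be done in Python (with numpy). The numbers below are invented stand-ins, not the EuroStat values, and I am assuming that the percentage fit is the usual R-squared statistic calculated on the logarithmic scale; the key point is that taking logarithms turns an exponential curve into a straight line.

    import numpy as np

    # Invented stand-in for the vineyard data (NOT the EuroStat numbers):
    # values that fall off roughly exponentially with a country's rank.
    rng = np.random.default_rng(seed=2)
    rank = np.arange(1, 19)
    size = 30.0 * np.exp(-0.3 * rank) * rng.lognormal(0, 0.15, rank.size)

    # An exponential model, size = a * exp(b * rank), becomes a straight line
    # once we take logarithms: log(size) = log(a) + b * rank.
    b, log_a = np.polyfit(rank, np.log(size), deg=1)
    fitted_log = log_a + b * rank

    # Assume the "percentage fit" is R-squared of this log-linear fit.
    ss_res = np.sum((np.log(size) - fitted_log) ** 2)
    ss_tot = np.sum((np.log(size) - np.log(size).mean()) ** 2)
    fit_percent = 100 * (1 - ss_res / ss_tot)
    print(f"fitted model: size = {np.exp(log_a):.1f} * exp({b:.2f} * rank)")
    print(f"percentage fit: {fit_percent:.0f}%")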

Using our glass metaphor, the graph shows us that the glass is almost full for all of the countries — the predictable component of the data is by far the largest (ie. the points are close to the line). However, for France our glass is not at all full, and there is a large unpredictable component (the point is not particularly near the line). Both of these conclusions should interest us, when studying the data. We should be happy that such a simple model allows us to describe, explain and forecast data about vineyard sizes across Europe; and we should wonder about the explanation for the obviously different situation in France.

The second example is very similar to the first one. This time, the data set comes from the AAWE. It lists the average 2015 dollar value of wine exports for 23 countries. As above, each point in the graph represents a single country, with the countries ranked horizontally in decreasing order of export value, and the export value shown vertically.

Wine export values per country

Everything said for the first example applies here, as well, except that this time the country with the greatest deviation from the model is the lowest-ranked one. We might ask ourselves: Is it important that Romanian wine exports do not fit? We do not know; but the model makes it clear that we might find something interesting if we look into it. This is the point of modeling — it tells us which bits of the data fit and which bits don't; and either of these things could turn out to be interesting.
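
In practice, spotting these interesting misfits simply means asking which point sits furthest from the fitted line. Here is a small Python sketch of that step, using invented numbers rather than the actual AAWE data, and placeholder country names:

    import numpy as np

    # Invented export values by rank (NOT the AAWE figures), roughly exponential,
    # with one deliberately poor-fitting country at the bottom of the ranking.
    countries = [f"country_{i}" for i in range(1, 24)]   # placeholder names
    rank = np.arange(1, 24)
    exports = 5000.0 * np.exp(-0.35 * rank)
    exports[-1] *= 0.1                                   # the misfit at the end

    # Fit the exponential model on a log scale, and look at the residuals.
    b, log_a = np.polyfit(rank, np.log(exports), deg=1)
    residuals = np.log(exports) - (log_a + b * rank)

    # The largest absolute residual marks the point worth investigating further.
    worst = int(np.argmax(np.abs(residuals)))
    print("largest deviation from the model:", countries[worst])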

Models

There is an old adage that models should be relatively simple, because otherwise we lose generality. Indeed, this idea goes back to Aristotle, although William of Ockham is usually given the most credit (ie. Occam's razor). So, simpler is better. Making a model more complex than the data justify is called "over-fitting" the data, which is bad.

We could try to model things exactly — for example, we could think in detail about all of the things that cause tomorrow's weather to differ from today's, or that cause individual life-spans to deviate from the average. However, it should be obvious that this would be unproductive, because there are simply too many possibilities. So, we try to use models that are as simple and general as possible.

The main practical issue is that there are lots of mathematical models, which differ from each other in oodles of ways, and many of them might fit our data equally well. This sometimes leads to unnecessary arguments among the experts.

However, we do have various ways of measuring how well our data fit any given model. In my case, I usually provide the percentage fit, as shown in the above two examples. This is sometimes described as the amount of the data that is "explained" by the model, although it is better to think of it as the amount of the data that is "described" by the model. Either way, it represents the percentage of the data that the model claims is predictable, with the remainder being unpredictable by the model.
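
A natural reading of this percentage fit is the familiar R-squared (coefficient of determination) statistic expressed as a percentage; that is an assumption on my part about the calculation, but it is the usual choice. A minimal Python version of that calculation looks like this:

    import numpy as np

    def percentage_fit(observed, fitted):
        """R-squared expressed as a percentage: the share of the variation in the
        data that the model describes; the remainder is left as unpredictable."""
        observed = np.asarray(observed, dtype=float)
        fitted = np.asarray(fitted, dtype=float)
        ss_res = np.sum((observed - fitted) ** 2)           # unexplained variation
        ss_tot = np.sum((observed - observed.mean()) ** 2)  # total variation
        return 100.0 * (1.0 - ss_res / ss_tot)

    # e.g. percentage_fit(data, model_predictions) returning 97.0 would be
    # reported as a 97% fit.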

We would, of course, do well to pick a model that provides as high a fit as possible. However, the best-fitting model might actually be over-fitting the data. To guard against this, we should also have some reasonable explanation for why we think that our chosen model is suitable for the data at hand.


Simply trying an arbitrary range of models, and then choosing the best-fitting one, is almost guaranteed to over-fit the data. At the other extreme, simply fitting a straight line to your graph can also be a very poor procedure — and I will discuss this in a later post.
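
To see why, here is a small Python demonstration with made-up data: a very flexible model (a degree-10 polynomial) fits the points it was given better than a straight line does, but it usually does worse on points that it has not seen.

    import numpy as np
    from numpy.polynomial import Polynomial

    # Made-up data: a simple linear trend plus noise, split into two halves.
    rng = np.random.default_rng(seed=3)
    x = np.linspace(0, 10, 30)
    y = 2.0 + 0.8 * x + rng.normal(0, 1.0, x.size)
    x_fit, y_fit = x[::2], y[::2]        # points used to fit the models
    x_new, y_new = x[1::2], y[1::2]      # points held back to test them

    def mse(model, xs, ys):
        """Mean squared error of a fitted polynomial on a data set."""
        return float(np.mean((model(xs) - ys) ** 2))

    line = Polynomial.fit(x_fit, y_fit, deg=1)       # a simple model
    wiggly = Polynomial.fit(x_fit, y_fit, deg=10)    # an over-flexible model

    print(f"line:      fit error {mse(line, x_fit, y_fit):.2f}, "
          f"new-data error {mse(line, x_new, y_new):.2f}")
    print(f"degree 10: fit error {mse(wiggly, x_fit, y_fit):.2f}, "
          f"new-data error {mse(wiggly, x_new, y_new):.2f}")
    # The flexible model scores better on the points it was fitted to, but
    # usually worse on the points it has not seen: that is over-fitting.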



* Robert Balzer: "Life is what happens to you ... when you are planning other things."
[Quote provided by Bob Henry.]

1 comment:

  1. ". . . we all know that we have a limited life-span . . ."

    A topic recently discussed in the pages of The Wall Street Journal.

    Since the two-part article ("Yes" and "No") may reside behind a "paywall," I will provide some introductory text, which through a key word search might redirect readers to alternate websites (such as affiliated newspapers) where this text might simultaneously be found.

    From The Wall Street Journal "Health Care" Special Report
    (June 25, 2018, Page R1ff):

    "Is There a Limit To the Human Lifespan?;
    Some researchers say there is a natural limit and we’ve reached it.
    Others say it might be possible to extend longevity by focusing on bioresilience."

    URL: https://www.wsj.com/articles/is-there-a-limit-to-the-human-lifespan-1529892420

    [Two-part article. Responses "Yes" and "No."]

    Article preface:

    "Average life expectancy around the world has climbed steadily over the past 100 years. But longevity seems to have topped out at about 120 years.

    "Whether there is an absolute limit to how long humans can live is a hotly disputed topic. Here are the facts: The all-time verified age record was set by Jeanne Calment of France, who died in 1997 at age 122. No one has come closer than age 119 since. Yet the ranks of people older than 110 continues to grow.

    "To some researchers, this suggests there’s a natural limit to how long humans can live—and we’ve pretty much reached it. Yes, average life expectancy has increased, thanks to things like clean water, improved living conditions and modern medicine. But these improvements can only do so much, and eventually the body wears out.

    "Others say, in effect, that past performance doesn’t guarantee future results. New and emerging medical technologies, they say, might be able to slow aging to such an extent that not only will we live much longer, but we’ll stay biologically 'younger' well into what used to be old age.

    "Brandon Milholland ['Yes'], co-author of several research papers on aging and longevity and a research associate at pharmaceutical-consulting firm Michael Allen Co., says that 125 is probably the upper limit for the human lifespan. Joon Yun ['No'], the president of Palo Alto Investors and the $2 million founding sponsor of the National Academy of Medicine’s Grand Challenge for Healthy Longevity, says it may be possible to extend the human lifespan by increasing the body’s ability to respond to stress."
