Monday, July 9, 2018

The ups and downs of wine-blog posting

A couple of weeks ago I wrote a post asking How long can wine bloggers keep it up? At the time, I mentioned that I recorded the number of posts per month for all of the Australian wine-related blogs that I could locate. This allows me to look at changes in the rate of blog posting throughout the life of each blog. In this new post, I will show you some of the more obvious patterns. I will use individual blogs as my examples, but I will group them into sets with similar patterns of posts. In other words: what types of wine blogs are there?


The individual qualities of wine blogs have long interested people. For example, back in 2013, Lettie Teague searched for Five wine blogs I really click with. She searched "not just a handful of blogs here and there but hundreds and hundreds of wine blogs from all over the world." However, the fate of blogs is almost always the same — only one of her chosen blogs has posted since the middle of 2017. Unusually, one of the bloggers did actually put up a "good-bye" post (Brooklynguy's Wine and Food Blog).

In my previous post on the subject, I illustrated the coming and going of the Australian wine blogs from the beginning of 2006 until May 2018 (150 months). In all of the graphs shown here, "Time 0" is the time of the first blog post for each blog, so that the graphs illustrate what happened to the blogs through their lifetime. I have excluded the three most prolific blogs, which all started long before 2006 (these would fit into the last two graphs below).
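For anyone wanting to try this re-alignment on their own counts, here is a minimal Python sketch. It assumes that each blog is stored as a list of post counts per calendar month, and that a blog's lifetime runs from its first post to its last; the function name and the example numbers are made up for illustration, and this is not necessarily how the graphs here were prepared.

```python
def align_to_first_post(monthly_posts):
    """Shift one blog's calendar-month counts so that index 0 is "Time 0".

    monthly_posts: list of post counts, one per calendar month
    (0 for months without posts). Months before the first post and
    after the last post are dropped (an assumption about "lifetime").
    """
    active = [i for i, count in enumerate(monthly_posts) if count > 0]
    if not active:
        return []  # this blog never posted in the recorded period
    return monthly_posts[active[0]:active[-1] + 1]


# Hypothetical blogs, recorded over the same eight calendar months.
blogs = {
    "Blog A": [0, 0, 5, 3, 0, 2, 0, 0],
    "Blog B": [4, 1, 0, 0, 0, 0, 0, 0],
}

for name, counts in blogs.items():
    print(name, align_to_first_post(counts))
```

Quiet months inside a blog's lifetime are kept as zeros, so that a pause in posting still counts as part of the blog's life.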

The first graph simply shows the number of blogs (in pink), illustrating that the number of blogs decreases through time (ie. many blogs last a short time and only a few make it for a long lifetime). For the cognoscenti, this is called a Type I survivorship curve (note the logarithmic vertical axis).

Number of Australian wine blogs and their posts

The blue line shows the average number of monthly posts for those blogs still surviving at any one time. The average remains steady at 4-5 posts per month for c. 5 years, by which time the number of blogs has halved. Thereafter, the average becomes much more variable, depending on which blogs are still going. The longest-lived blogs keep up a high average monthly number of posts (eg. >10 years = >10 monthly posts) — if the blogger is still going after 6 years, then they really have something to say!
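The two lines in this graph can be computed along the following lines. This is only a sketch: it assumes that each blog has already been aligned to its own Time 0 and ends at its last post, that a blog counts as "surviving" at a given age if its last post is at that age or later, and that quiet months count as zero posts in the average. The function name and the numbers are hypothetical.

```python
def survivorship(aligned_blogs):
    """Return (age, blogs still going, mean posts among those blogs) per month.

    aligned_blogs: list of lists of monthly post counts, each starting
    at that blog's first post and ending at its last post.
    """
    if not aligned_blogs:
        return []
    longest = max(len(blog) for blog in aligned_blogs)
    rows = []
    for age in range(longest):
        # A blog survives at this age if its lifetime extends at least this far.
        alive = [blog for blog in aligned_blogs if len(blog) > age]
        mean_posts = sum(blog[age] for blog in alive) / len(alive)
        rows.append((age, len(alive), mean_posts))
    return rows


# Three hypothetical blogs with lifetimes of 4, 2 and 3 months.
example = [[5, 3, 0, 2], [4, 1], [10, 12, 11]]
for age, n_alive, mean_posts in survivorship(example):
    print(age, n_alive, round(mean_posts, 2))
```

Because the average is taken only over the blogs still going, it becomes jumpy once few blogs remain, which is the variability described above.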

We can now look at the individual blogs, looking not at how long they last but at what happened along the way. The blogs are arranged in groups, although there is nothing definitive about the following groupings. They are merely examples of patterns that appear in the data. Not all of the blogs are actually shown here.

The next graph shows a few blogs that burst out of the blocks with a flurry of activity but then slowed down over the first year, settling into a steadier stream of activity.

Australian wine blog postings

The next group of blogs did the opposite, starting relatively slowly but following this with a burst of activity later on. This burst could take up to 2 years to kick in. In no case was the burst sustained by the blogger.

Australian wine blog postings

For the next group, each blog shows a series of episodes of bigger activity, rather than a single burst. These bursts usually represent different topics of interest to the writer; for example, reporting on travels to wine regions. It is easy to see these blogs as extensions of those in the previous graph — some bloggers get a second or third wind, but some do not.

Australian wine blog postings

We now move on to a group of blogs that all have regularly had a relatively high number of posts (eg. >3 per week). Some of these bloggers decreased their activity after an initial burst, but they still maintained their prolific rate of posting. For example, at one point Full Pour simply halved the number of posts from one month to the next, but then continued at the new rate.


Australian wine blog postings


The Intrepid Wino was the most erratically posting blogger I encountered. On some occasions wine-tasting notes were uploaded in bulk, with a maximum of 171 posts in one month (off the top of the graph); the nearest competitor was Wine Will Eat Itself, with a maximum of 98 (see below).

We now move on to those blogs that have consisted mostly of wine-tasting notes. Obviously, these notes are relatively short, and so there can be a lot of posts in any given month — here, we are talking of up to 1 per day, or even more. However, you will note that the bloggers illustrated in this next graph all decreased their activity after an initial burst.

Australian wine blog postings

The final graph shows those blogs consisting mostly of wine-tasting notes but where the number of posts increased dramatically at a particular time. You can all guess what that time was — the blogger started receiving large numbers of wine samples, for free, rather than basing their comments on their own drinking habits or on group tastings. The most blatant example is Wine Will Eat Itself — sadly, here the prolific activity was stopped by the death of the blogger.

Australian wine blog postings

This sort of activity by wine writers has long been questioned. For example, David Shaw wrote a pair of articles for the Los Angeles Times way back in August 1987 (Wine writers: squeezing the grape for news, and Wine critics: influence of writers can be heady), revealing what was then presumably unknown to much of the reading public — many if not most newspaper and magazine wine writers were paid very little money, and relied on wine producers and marketers in a way that could easily be seen as a conflict of interest.

The main issue, of course, is that the writers usually prefer to write favorable reviews, and therefore simply ignore all wines that they view unfavorably. This means that some of the Australian wine blogs simply catalog (mostly) Australia's wines, one bottle at a time, while actually ignoring most of them. This may not be of much help to the reader, who is not being warned about what to avoid.

This also produces an uncritical view of the world. We all know what a 5-star review says before we read it (as we also do for a 1-star review), so why read it? These reviews provide an unrelenting tone, which ultimately becomes tedious. The real interest lies in the 2- and 3-star reviews, because something went wrong, and we need to assess whether it would also be a deal-breaker for us. Wine bloggers, please take note.

Monday, July 2, 2018

An introduction to data modeling, and why we do it

Among all of the current hype about quantitative data analysis (eg. The arms race for quants comes to the world’s biggest asset managers), especially with regard to what are called Big Data, I have noted a few negative comments about the idea of modeling data. Not unexpectedly, non-experts are often wary of things beyond their own expertise. (I know I am!) So, I thought that I might write a post outlining just what people are trying to do when they do the sorts of things that I normally do in this blog.

Background

If the world is a non-random place, then it is likely that we can find at least a few patterns in it that we can describe and explain in a simple way. We refer to this process of description and explanation as modeling. In this process, we try to find simple models that can be used for both describing and explaining the world around us; and, if we get it right, then these models can be used for forecasting (prediction), as well.

This is not data modeling

The main issue is that life cannot be entirely predictable — there are predictable components and unpredictable components. For example, we all know that we have a limited life-span, although we do not know where or when we will depart. Nevertheless, there is a measured average life-span (which is the predictable component), along with variation around that average (the unpredictable component). We are thus all thinking that we might live for 80-85 years, and we can plan a future for ourselves on that basis.

Think of it this way: the predictable component gives us optimism, because we can make plans, while the unpredictable component makes our plans go astray.* Models are our formal, mathematical way of trying to identify the two components. The idea is to find out whether the predictable part dominates, or not. If it does, then forecasting is a viable activity for us all.

Another way of thinking about this is the classic question as to whether the glass of water is half full or half empty. The full part is the predictable component, and the empty part is the unpredictable component. Of course, the glass is both half full and half empty; and we should actually be interested in both components — why is it half full, and why is it half empty? Each will tell us something that might be of interest.

Modeling

So, models try to formalize this basic idea mathematically. If we have some quantitative data, then we can try to find an equation that tells us about the predictable component of the data, and about how much the real data deviate from the model in some (possibly unpredictable) way. For example, we anticipate that each person's life-span is not random, and we can thus model it by assuming that it deviates in some unpredictable way from the (predictable) average life-span. Similarly, tomorrow's weather is not random, but instead it deviates from today's weather in more or less unpredictable ways.
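Written out, the idea looks something like this. The symbols are my own shorthand for this post, not a standard that you need to remember: y is an observed value, the "hat" marks the model's predictable part, and e is the unpredictable deviation.

```latex
% Observed value = predictable part + unpredictable deviation
y_i = \hat{y}_i + e_i

% Life-span example: the predictable part is simply the average life-span
y_i = \bar{y} + e_i

% Weather example: the predictable part of tomorrow is today's weather
w_{t+1} = w_t + e_{t+1}
```

The question of whether forecasting is viable then becomes a question of whether the deviations e are small compared to the predictable part.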

To get a picture of what is happening, we often draw a graph. For example, our data might be shown as a series of points, and we can then fit a line to these data. This line represents the model, and the closeness of the points to the line tells us how well our model fits the data. The line is the predictable component, and the deviation of the points from the line represents the unpredictable component.
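As a small illustration of fitting a line and reading off the two components, here is a Python sketch using ordinary least squares, which is one common way of fitting a line (I am assuming it here; other fitting methods exist). The data values are invented purely for the example.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data points.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 15.0]

intercept, slope = fit_line(xs, ys)

# Predictable component: the value on the line at each x.
fitted = [intercept + slope * x for x in xs]
# Unpredictable component: how far each point sits from the line.
residuals = [y - f for y, f in zip(ys, fitted)]

for x, y, f, r in zip(xs, ys, fitted, residuals):
    print(x, y, round(f, 2), round(r, 2))
```

The column of fitted values is the line, and the column of residuals is what the line fails to capture; small residuals mean that the points sit close to the line.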

A couple of examples

Here are two wine-related examples, of a type of modeling that I have used in previous blog posts. In both cases, I will be fitting an Exponential Model to a set of data, as this seems to be the simplest model that fits these data sets well (see the discussion at the end of the post).

The first set of data comes from the EuroStat database. It lists the average size of a vineyard holding for the 18 European Union countries with the most vineyard area. Each point in the first graph represents a single country, with the countries ranked in decreasing order horizontally, and the average vineyard size shown vertically.

Average vineyard holdings in the European Union

The line represents our model (the Exponential). Note that the vertical axis is on a logarithmic scale, which means that our model will form a straight line on the graph. Also, the fit is 97%, which means that the model describes the data very well (see the discussion later in the post).
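To show why the logarithmic axis turns the Exponential model into a straight line, here is a hedged Python sketch: take the logarithm of each value, fit an ordinary least-squares line to the ranks, and report the percentage fit. I am assuming here that the fit is measured on the log scale; the function names are my own, and the numbers are invented (they are not the EuroStat values).

```python
import math


def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_exponential(ranks, values):
    """Fit log(value) = intercept + slope * rank.

    Returns (intercept, slope, r2), with r2 measured on the log scale.
    All values must be positive.
    """
    logs = [math.log(v) for v in values]
    intercept, slope = fit_line(ranks, logs)
    fitted = [intercept + slope * r for r in ranks]
    mean_log = sum(logs) / len(logs)
    ss_res = sum((l - f) ** 2 for l, f in zip(logs, fitted))
    ss_tot = sum((l - mean_log) ** 2 for l in logs)
    return intercept, slope, 1 - ss_res / ss_tot


# Hypothetical average holdings (hectares), ranked from largest to smallest.
ranks = list(range(1, 9))
sizes = [10.5, 7.9, 5.2, 4.1, 2.8, 2.2, 1.4, 1.1]

intercept, slope, r2 = fit_exponential(ranks, sizes)
print("decline per rank:", round(math.exp(slope), 3))
print("percentage fit:", round(100 * r2, 1))
```

A negative slope on the log scale means that each step down the ranking multiplies the value by a roughly constant factor, which is what an exponential decline is.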

Using our glass metaphor, the graph shows us that the glass is almost full for all of the countries — the predictable component of the data is by far the largest (ie. the points are close to the line). However, for France our glass is not at all full, and there is a large unpredictable component (the point is not particularly near the line). Both of these conclusions should interest us, when studying the data. We should be happy that such a simple model allows us to describe, explain and forecast data about vineyard sizes across Europe; and we should wonder about the explanation for the obviously different situation in France.

The second example is very similar to the first one. This time, the data set comes from the AAWE. It lists the average 2015 dollar value of wine exports for 23 countries. As above, each point in the graph represents a single country, with the countries ranked in decreasing order horizontally, and the export value shown vertically.

Wine export values per country

Everything said for the first example applies here, as well, except that this time the country with the greatest deviation from the model is the lowest-ranked one. We might ask ourselves: Is it important that Romanian wine exports do not fit? We do not know; but the model makes it clear that we might find something interesting if we look into it. This is the point of modeling — it tells us which bits of the data fit and which bits don't; and either of these things could turn out to be interesting.

Models

There is an old adage that models should be relatively simple, because otherwise we lose generality. Indeed, this idea goes back to Aristotle, although William of Ockham is usually given the most credit (ie. Occam's razor). So, simpler is better. The alternative is called "over-fitting" the data, which is bad, because the model then describes the quirks of one particular data set rather than the general pattern.

We could try to model things exactly. For example, we could think in detail about the things that cause tomorrow's weather to vary from today's, or people's lives to deviate from the average life-span. However, it should be obvious that this would be unproductive, because there are simply too many possibilities. So, we try to use models that are as simple and general as possible.

The main practical issue is that there are lots of mathematical models, which differ from each other in oodles of ways, and many of them might fit our data equally well. This sometimes leads to unnecessary arguments among the experts.

However, we do have various ways of helping us measure how well our data fit any given model. In my case, I usually provide the percentage fit, as shown in the above two examples. This is sometimes described as the amount of the data that is "explained" by the model, although it is better to think of it as the amount of the data that is "described" by the model. Either way, it represents the percentage of the data that the model claims is predictable, with the remainder being unpredictable by the model.
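For those who like to see the formula, a percentage fit of this kind is usually the coefficient of determination (R-squared) expressed as a percentage. I am assuming that definition here; the symbols are the usual ones, with y for the data, y-hat for the model's values, and y-bar for the average of the data.

```latex
% Unpredictable part: squared deviations of the data from the model
SS_{\mathrm{res}} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2

% Total variation: squared deviations of the data from their average
SS_{\mathrm{tot}} = \sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2

% Percentage fit
\text{fit} = 100 \times R^2 = 100 \times \left( 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}} \right)
```

If the points all sit exactly on the line, then the first sum is zero and the fit is 100%; if the model does no better than simply using the average, then the two sums are equal and the fit is 0%.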

We would, of course, do well to pick a model that provides as high a fit as possible. However, the best-fitting model might actually be over-fitting the data. To guard against this, we should also have some reasonable explanation for why we think that our chosen model is suitable for the data at hand.


Simply trying an arbitrary range of models, and then choosing the best-fitting one, is almost guaranteed to over-fit the data. At the other extreme, simply fitting a straight line to your graph can also be a very poor procedure — and I will discuss this in a later post.
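To make the danger of over-fitting concrete, here is a deliberately extreme Python sketch. Rather than trying a range of models, it compares just two: a least-squares straight line, and a polynomial forced to pass exactly through every data point (a Lagrange interpolating polynomial, which I have chosen only because it is the most over-fitted model possible). Both are fitted to five invented points and then asked to forecast a sixth, held-back point.

```python
def fit_line(xs, ys):
    """Least-squares straight line; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lagrange_predict(xs, ys, x_new):
    """Evaluate the polynomial that passes exactly through every (x, y)."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        weight = 1.0
        for m, xm in enumerate(xs):
            if m != j:
                weight *= (x_new - xm) / (xj - xm)
        total += yj * weight
    return total


# Hypothetical data: roughly y = 2x, with some unpredictable wobble.
train_x = [0, 1, 2, 3, 4]
train_y = [0.0, 2.5, 3.5, 6.5, 8.0]
holdout_x, holdout_y = 5, 10.0  # held back, not used for fitting

intercept, slope = fit_line(train_x, train_y)
line_forecast = intercept + slope * holdout_x
poly_forecast = lagrange_predict(train_x, train_y, holdout_x)

print("line forecast error:", abs(line_forecast - holdout_y))
print("polynomial forecast error:", abs(poly_forecast - holdout_y))
```

The polynomial has a perfect fit to the five points used to build it, yet its forecast for the sixth point is far worse than that of the simple line, because it has treated the wobble as if it were part of the pattern.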



* Robert Balzer: "Life is what happens to you ... when you are planning other things."
[Quote provided by Bob Henry.]