It would be nice for the consumer if wine prices varied in some predictable way. From the viewpoint of a data analyst, this implies that there is some specifiable model of price variation that can be used to describe the variation in observed prices.
From the viewpoint of the consumer, having such a model is valuable, because it can be used to make a rational decision about whether a particular wine is a bargain or a rip-off (as explained in Choosing value-for-money wines). That is, models suggest that prices have a predictable component (an "average value") and a random component, and too much deviation from the predictable component indicates either a good deal or a bad one.
I have suggested in previous blog posts that one model that seems to fit wine prices reasonably well is what is known as the Lognormal Model (eg. see The relationship of wine quality to price). This model indicates that prices are expected to increase exponentially in response to winemaker effort. That is, rather than one unit of effort leading to the addition of one unit of price, the prices are multiplied, instead.
One consequence of this model is that prices do vary around some average value but the variation is not symmetrical — prices vary much more above the average than below it. I am sure that you have all noticed this in real life; and not just for wines, but for most consumer products. There are plenty of very expensive wines but not so many very cheap ones.
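The multiplicative idea and the resulting asymmetry can be written down compactly. This is only a sketch in standard lognormal notation: the symbols P (price), μ, σ and Z are introduced here for illustration and are not taken from the fitted models.

```latex
\ln P = \mu + \sigma Z, \quad Z \sim \mathcal{N}(0, 1)
\qquad\Longrightarrow\qquad
P = e^{\mu} \, e^{\sigma Z}

\text{mode} = e^{\mu - \sigma^{2}}
\;<\;
\text{median} = e^{\mu}
\;<\;
\text{mean} = e^{\mu + \sigma^{2}/2}

e^{\mu + \sigma} - e^{\mu} = e^{\mu}\left(e^{\sigma} - 1\right)
\;>\;
e^{\mu}\left(1 - e^{-\sigma}\right) = e^{\mu} - e^{\mu - \sigma}
```

The first line says that each unit step in Z multiplies the price by the same factor. The last line shows why the spread is lopsided: a price one step above the median lies further from it than a price one step below.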
The most expensive wines are often featured in the media as being "luxury" products, beyond the purchasing power of mere mortals. On the face of it, it seems unlikely that the pricing of these wines is in any way related to the pricing of the wines bought by the rest of us. Recently, Thach, Olsen, Cogan-Marie & Charters suggested dividing these wines into the following price categories per bottle (see What Price is Luxury Wine?): Affordable Luxury (US$ 50-100), Luxury wines (US$ 100-500), Icon wines (US$ 500-1,000) and Dream wines (US$ >1,000). Below these, we might also recognize Everyday wines (US$ <10), Better wines (US$ 10-20) and Premium wines (US$ 20-50).
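For anyone who wants to apply these categories to a price list of their own, the seven bands can be looked up as in the sketch below. The function name and the sample prices are hypothetical. The sketch also assumes that a price sitting exactly on a boundary goes into the higher category; the published ranges share their end points and so do not settle this.

```python
from bisect import bisect_right
from collections import Counter

# Upper limits (US$ per bottle) of every category except the open-ended top one.
BOUNDS = [10, 20, 50, 100, 500, 1000]
NAMES = ["Everyday", "Better", "Premium", "Affordable Luxury",
         "Luxury", "Icon", "Dream"]

def price_category(usd):
    """Return the category name for a bottle price in US dollars.

    Assumption: a price exactly on a boundary (e.g. US$ 20) is placed
    in the higher of the two adjoining categories.
    """
    return NAMES[bisect_right(BOUNDS, usd)]

# Hypothetical prices, purely for illustration (not the Systembolaget data).
sample_prices = [4.5, 8, 11, 15, 22, 60, 120, 750, 2500]
counts = Counter(price_category(p) for p in sample_prices)
for name in NAMES:
    print(f"{name:18s} {counts[name]}")
```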
Maybe there are different pricing models for each of these wine groups? I thought that it might be worthwhile to try modeling some real data, to see how all of these ideas fit together.
The data come from the online database of the national liquor chain in Sweden, known as Systembolaget. Being government owned, the complete product information is freely available, as both an XLS file and an XML file. I have used the prices of the bottled red wines that were available in May 2016, when there were 5,487 such wines listed in the database. [Note that bag-in-box wines are not included.]
This first graph shows the frequency distribution of how many wines (vertically) fit into each of the bottle prices (horizontally). [Note the logarithmic scale for the prices.] Obviously, the prices are given in Swedish crowns (SEK). To convert to other common currencies you can divide the SEK by approximately 10 — if you want more precision, for USD divide by 9, for EUR divide by 9.5, and for GBP divide by 11.
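The rule-of-thumb conversions above are easy to wrap in a small helper. This is a minimal sketch: the function name is made up for illustration, and the divisors are the approximate ones quoted in the text, not live exchange rates.

```python
# Approximate number of Swedish crowns per unit of each currency,
# as quoted in the text (rough 2016 values, not live rates).
SEK_PER_UNIT = {"USD": 9.0, "EUR": 9.5, "GBP": 11.0}

def convert_sek(amount_sek, currency="USD"):
    """Convert a price in SEK to the given currency using the rough divisors."""
    return amount_sek / SEK_PER_UNIT[currency]

for code in SEK_PER_UNIT:
    print(f"100 SEK is roughly {convert_sek(100, code):.2f} {code}")
```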
The graph is rather spiky, because of the worldwide practice of setting prices at particular "desirable" values (99, 149, 199, etc), but there is otherwise a clear general pattern to the data. The minimum price is 40 SEK (US$ 4.50) and the maximum is 22,500 SEK (US$ 2,500). The most common price is 100 SEK (US$ 11), with the median at 190 SEK (US$ 20) — that is, half the wines cost US$ 20 or less.
Now, we can compare these data with the pricing categories outlined above. This next graph superimposes them onto the frequency distribution.
We can then see how many wines fit into each category:
Modeling the price data
The frequency distribution does, indeed, look very much like what would be expected for a lognormal model. So, we can start our analysis by trying to fit the data to such a model.
This next graph shows the probability distribution of the best-fitting lognormal model in pink. [Technical note: the model was fitted using maximum likelihood, with the Regress+ program; I subtracted 35 SEK from each price before fitting the model, and then added it back for the probability distribution.]
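For readers without Regress+, the same kind of fit can be sketched in a few lines of Python. This is not the Regress+ procedure itself. It assumes the 35 SEK shift is held fixed, in which case the maximum-likelihood estimates have a closed form: the mean and standard deviation of log(price - 35). The function names and the sample prices are hypothetical.

```python
import math

def fit_shifted_lognormal(prices, shift=35.0):
    """Maximum-likelihood fit of a lognormal to (price - shift).

    With the shift held fixed, the estimates are simply the mean and the
    (divide-by-n) standard deviation of the logged, shifted prices.
    """
    logs = [math.log(p - shift) for p in prices]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return mu, sigma

def shifted_lognormal_pdf(price, mu, sigma, shift=35.0):
    """Model density at a given price, with the shift added back."""
    x = price - shift
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def model_median(mu, shift=35.0):
    """Median price predicted by the fitted model."""
    return shift + math.exp(mu)

def model_mode(mu, sigma, shift=35.0):
    """Most common price predicted by the fitted model."""
    return shift + math.exp(mu - sigma ** 2)

# Hypothetical prices in SEK, for illustration only.
sample = [49, 79, 99, 99, 129, 149, 199, 249, 399, 899]
mu, sigma = fit_shifted_lognormal(sample)
print(f"mu={mu:.3f} sigma={sigma:.3f}")
print(f"median={model_median(mu):.0f} SEK, mode={model_mode(mu, sigma):.0f} SEK")
```

The predicted median and mode can then be compared with the observed ones, which is the check described in the text.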
In this sort of analysis, we interpret the pink line as representing the predictable component of the variation in wine prices, and the differences between the blue line and the pink line represent the random component of the wine prices.
This lognormal model fits the price data reasonably well, and the predicted parameters are close to the observed ones (eg. the predicted median and mode are within 7% of the observed values). However, the fit is not good enough, because the pink line (the model) deviates too far from the blue line (the data) in too many places.
I suggest that the basic problem here is that we are trying to fit a single price model to the data, when the data actually follow several different models.
First, the highest prices do not fit the model, notably those for the Icon wines and Dream wines. It should not surprise us that these wines have their own price structure, which comes from cloud-cuckoo land, and is tolerated only by businessmen with more money than common sense. So, we should not be trying to cram these wines into our model. Indeed, there is a big price gap at US$ 500, and so it is easy to decide where to divide the model in two — we simply drop the Icon and Dream wines from our analysis.
Second, it looks to me like the remaining frequency distribution (US$ <500) still cannot be modeled by a single lognormal model. That is, the shape of the pink line cannot be made to move closer to the blue line. Even these data are more complicated than our simple lognormal model.
So, let's try fitting two lognormal models to the data, instead. This is straightforward to do; and it implies that the wine prices actually live in two different worlds. This next graph shows the probability distributions of the two best-fitting lognormal models in pink, along with their combination shown in black. [Technical note: the Akaike Information Criterion for two models versus one improves by 231.7 units.]
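To make the idea of "two models versus one" concrete, here is a rough sketch of how such a comparison can be run. It is not the procedure used for the graph. It assumes an expectation-maximization (EM) fit of a two-component mixture on the log scale, where each lognormal component becomes a normal one. The data are synthetic, generated with made-up parameters, so the AIC difference it prints has nothing to do with the 231.7 units reported above.

```python
import math
import random

def normal_pdf(x, mu, sd):
    """Density of a Normal(mu, sd) distribution at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mean_sd(values, weights):
    """Weighted mean and maximum-likelihood standard deviation."""
    total = sum(weights)
    mu = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mu) ** 2 for w, v in zip(weights, values)) / total
    return mu, max(math.sqrt(var), 1e-6)  # floor avoids a zero-width component

def fit_one(logs):
    """Single model: returns (mu, sd) and the log-likelihood."""
    mu, sd = mean_sd(logs, [1.0] * len(logs))
    return (mu, sd), sum(math.log(normal_pdf(v, mu, sd)) for v in logs)

def fit_two(logs, iterations=200):
    """Two-component mixture fitted by EM, started from the lower/upper halves."""
    ordered = sorted(logs)
    half = len(ordered) // 2
    (mu1, sd1), _ = fit_one(ordered[:half])
    (mu2, sd2), _ = fit_one(ordered[half:])
    w = 0.5  # share of wines belonging to the first (cheaper) component
    for _ in range(iterations):
        # E-step: probability that each wine belongs to the first component.
        resp = []
        for v in logs:
            a = w * normal_pdf(v, mu1, sd1)
            b = (1.0 - w) * normal_pdf(v, mu2, sd2)
            resp.append(a / (a + b))
        # M-step: re-estimate the share and both components.
        w = sum(resp) / len(logs)
        mu1, sd1 = mean_sd(logs, resp)
        mu2, sd2 = mean_sd(logs, [1.0 - r for r in resp])
    loglik = sum(math.log(w * normal_pdf(v, mu1, sd1)
                          + (1.0 - w) * normal_pdf(v, mu2, sd2)) for v in logs)
    return (w, mu1, sd1, mu2, sd2), loglik

def aic(loglik, n_params):
    """Akaike Information Criterion: smaller is better."""
    return 2 * n_params - 2 * loglik

# Synthetic log(price - shift) values with made-up parameters (NOT the real data).
random.seed(1)
logs = ([random.gauss(4.3, 0.4) for _ in range(650)]
        + [random.gauss(6.0, 0.7) for _ in range(350)])

_, ll_one = fit_one(logs)
params, ll_two = fit_two(logs)
aic_one = aic(ll_one, 2)   # mu and sigma
aic_two = aic(ll_two, 5)   # share, plus mu and sigma for each component
print(f"AIC improvement with two models: {aic_one - aic_two:.1f}")
```

Both log-likelihoods are computed on the log scale. Converting back to the price scale changes each of them by the same amount, so the difference in AIC is unaffected.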
The black line clearly fits the blue line much better than did the pink line in the previous graph. So, the overall model is now a very good fit to the data. We could, of course, keep fitting even more complex models, but there seems to be no good practical reason to do so. [Technical note: by definition both lognormal curves must start at the same left-hand point on the horizontal axis. This is an unfortunate inadequacy of using the lognormal, because the smaller curve would make more sense if it started further to the right. That is, a model for the more expensive wines should not start at US$4!]
This model means that the two pink lines represent two different pricing structures for the red wines. One of the structures covers the cheaper wines (principally US$ <20) and the other one covers somewhat more expensive wines (US$ 20-500). Roughly speaking, there is one pricing structure for the Everyday and Better wines, and another one for the Premium, Affordable Luxury, and Luxury wines.
The majority of wine consumers are likely to be dealing with the first model, which actually covers 65% of the red wines. However, the cognoscenti will principally be interested in the second model, which covers almost all of the remaining 35% — these are the wines that the wine media tend to write about most often. The Icon and Dream categories, with their own (third) model, include only 0.5% of the wines.
This has important practical consequences for buying wine. For example, there is little point in a consumer trying to compare the quality:price ratios (QPR) of US$10 wines and US$50 wines, because they probably won't be the same — the pricing structures are not actually connected. You need to be wearing a different hat when you shop for Premium wines rather than Everyday wines!
For these data (bottled red wines available in Sweden), there appear to be at least three price models. One model covers the most expensive wines (US$ >500), one principally covers the cheapest wines (US$ <20), and the third model covers most of the wines in between. Price variation within each of these models is unrelated to price variation within the other models.
In practice, this means that price comparisons only make sense within each of these three groups — for example, the quality:price ratios (QPR) will be different between the groups. There seems to be no good reason why these conclusions would not apply to the wines sold in other countries with similar wines available, as well, although the details may be different.