Monday, November 6, 2017

The dangers of over-interpreting Big Data (in the wine business)

In order to understand complex sets of information, we usually summarize them down into something much simpler. We extract what appear to be the most important bits of information, and try to interpret that summary. Only the simplest pieces of information can be left alone, and grasped on their own. This creates an inherent problem — data summaries also leave information out, and that information may actually be very important. Sadly, we may never find this out, because we left the information out of the summary.

Clearly, the biggest danger with what are known in the modern world as Big Data is that, in order to understand them, we first turn them into Small Data by ignoring most of them. That is, the bigger the dataset, the more extreme the summary process, because of our desire to reduce the complexity. Data summaries tend to be all the same size, no matter how big the original dataset was. Unfortunately, most of the discussion about Big Data has involved only the technical aspects, along with the optimistic prospects for using the data, without much consideration for the obvious limitations of data summarizing.
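To make the point about summary size concrete, here is a minimal Python sketch. The two datasets and the `summarize` helper are invented for illustration (hypothetical spend per bottle, in dollars); they are not taken from any real study. The small dataset and the large one reduce to exactly the same two-number summary, even though they describe very different buyers.

```python
import statistics

def summarize(values):
    """Reduce a dataset of any size to the same two numbers."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
    }

# Hypothetical spend per bottle (in dollars); both datasets are invented.
# Small dataset: 30 buyers who all spend close to $20.
small = [18, 20, 22] * 10
# Large dataset: 100,000 buyers, half spending $10 and half spending $30.
large = [10, 30] * 50_000

small_summary = summarize(small)
large_summary = summarize(large)

print(small_summary)
print(large_summary)
# Check whether anyone in the large dataset actually spends the "typical" $20.
print(20 in large)
```

The summary says the typical buyer spends $20 in both cases, but in the large dataset no individual buyer does; the two-peaked shape is exactly the information that the summary left out.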

One of the most common ways that we have historically used to summarize data is to organize the data into a few groups. We then focus on the groups, not on the original data. In this post, I will discuss this in the context of understanding wine-buying customers.


By summarizing data, we are looking for some sort of mathematical structure in the dataset. That is, we are looking for simple patterns, which might then mean something to us, preferably in some practical sense.

Putting the data into groups is one really obvious way to do this; and we have clearly been doing it for millennia. For example, we might group plants as those that are good to eat, those that are poisonous, those that are good as building material, etc.

The biggest limitation of this approach is that we can end up treating the groups as real, rather than as a mathematical summary, and thus ignore the complexity of the original data. For example, groups can overlap: a plant can be both poisonous and good for making house walls, and focusing on one group or the other can make us forget this.

Groups can also be fuzzy, which means that the boundaries between the groups are not always clear. Dog breeds are a classic example — pure-bred dogs clearly fit into very different groups, and we cannot mistake one breed for another. But dogs of mixed parentage do not fit neatly into any one group, although we often try to force them into one by emphasizing that they are mostly from one breed or another. That is, the breeds are treated as real groups, even though they overlap, and thus are not always distinct.

Examples of grouping

Let's consider two examples, one where the groups might make sense and one where they are more problematic.

When considering customers, one obvious grouping of people is gender, male versus female. In science, this is simply a genetic grouping (based on which genes you have), but elsewhere it is usually treated as also being a behavioral grouping. Businesses are therefore interested in what any gender-associated differences in behavior might mean for them.

Consider this example of using Twitter hashtags to quantify gender differences: The hard data behind how men and women drink. The data come from "half a million tweets collected over the course of a year (June 2014 - July 2015), with the gender detected from the first name of the tweeter." The first graph shows the frequency of 104 drink-related hashtags, arranged according to how often they originated from male versus female tweeters.

Note that no hashtags are used exclusively by either males or females; indeed, only two exceed 80% gender bias (homebrew, malt). Likewise, no hashtag is split exactly evenly between males and females; the closest are cachaca, patron, and caipirinha. We thus might be tempted to recognize two groups, of 40 "female" words and 64 "male" words.

However, we have to be careful about simply confirming our starting point. We pre-defined two groups that represent observed differences (in genetics), and then we have demonstrated that there are other differences (in behavior). The data are essentially continuous, with some words showing a gender split narrower than 47% vs. 53%. In this case, gender still forms indistinct groups.
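A small Python sketch may help show what is lost when a continuous percentage is turned into two groups. The hashtag names and shares below are invented for illustration, not taken from the Twitter study, and the 50% cut-off in `label` is simply the obvious assumption for a two-group split.

```python
# Hypothetical male share of tweets per hashtag (invented numbers,
# NOT the values from the study discussed above).
male_share = {
    "strongly_male_tag": 0.85,
    "barely_male_tag": 0.53,
    "barely_female_tag": 0.47,
    "strongly_female_tag": 0.20,
}

def label(share):
    """Force a continuous share into one of two groups (assumed 50% cut-off)."""
    return "male" if share > 0.5 else "female"

labels = {tag: label(share) for tag, share in male_share.items()}

# Gap between two tags that receive the SAME label.
gap_within_group = abs(male_share["strongly_male_tag"] - male_share["barely_male_tag"])
# Gap between two tags that receive DIFFERENT labels.
gap_across_groups = abs(male_share["barely_male_tag"] - male_share["barely_female_tag"])

for tag, share in male_share.items():
    print(f"{tag:20s} share={share:.2f} label={labels[tag]}")
print(f"gap within the 'male' group: {gap_within_group:.2f}")
print(f"gap across the group boundary: {gap_across_groups:.2f}")
```

In this made-up case, the two "male" tags differ from each other far more than the two tags that sit on either side of the boundary, which is the sense in which the groups are a summary rather than a real division.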

Moving on, this situation becomes even more complex when we start to consider situations with many possible groups, based simultaneously on lots of different characteristics. In an earlier post, I discussed the mathematical technique of using ordinations to summarize this type of data (Summarizing multi-dimensional wine data as graphs, Part 1: ordinations).

This next graph shows an example of the resulting data summary, called an ordination diagram. If each point represents a person, then the spatial proximity of the points would represent how similar they are. So, points close together are similar based on the measured characteristics, while points progressively further apart are progressively more different.

This ordination diagram does not contain any obvious groups of people — they are spread pretty much at random. However, that does not mean that we cannot put the people into groups! Consider this next version of the same diagram, in which the points are now colored. The five different colors represent five groups, one in each corner of the diagram and one in the center.

Clearly, these groups do not overlap. More to the point, the centers of each group are quite distinct. Thus, the groups do have meaning as a summary of the data — combining the descriptions of each group of people would create an easily interpreted summary of the whole dataset.

However, these are fuzzy groups — the boundaries are not distinct, and the groups of people are not discrete. Thus, I am also losing a lot of information, as I must in a summary of complex data; and I need to care about that lost information as well. I cannot treat the groups as being real — they are a convenience only. As a technical aside, it is worth noting that the groups are not an illusion — they are an abstraction.
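The situation in the colored diagram can be imitated with a minimal Python sketch. Everything here is an assumption made for illustration: the 2,000 points are scattered uniformly at random (so there are no real clusters by construction), the five center positions are my own choice, and the 0.1 distance threshold for calling a point "borderline" is arbitrary. This is not the method used to draw the diagram.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

# 2,000 hypothetical people scattered uniformly at random over the diagram:
# by construction there are no real clusters here.
points = [(rng.random(), rng.random()) for _ in range(2000)]

# Five assumed group centers: one toward each corner and one in the middle.
centers = {
    "lower-left": (0.2, 0.2),
    "lower-right": (0.8, 0.2),
    "upper-left": (0.2, 0.8),
    "upper-right": (0.8, 0.8),
    "middle": (0.5, 0.5),
}

groups = {name: [] for name in centers}
borderline = 0
for p in points:
    # Distance from this point to every center, nearest first.
    ranked = sorted((math.dist(p, c), name) for name, c in centers.items())
    (nearest_d, nearest_name), (second_d, _) = ranked[0], ranked[1]
    groups[nearest_name].append(p)
    # "Borderline" (arbitrary threshold): almost as close to a second center.
    if second_d - nearest_d < 0.1:
        borderline += 1

def centroid(members):
    """Average position of the points assigned to one group."""
    xs, ys = zip(*members)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

centroids = {name: centroid(members) for name, members in groups.items()}
borderline_share = borderline / len(points)

for name, (x, y) in centroids.items():
    print(f"{name:12s} n={len(groups[name]):4d} centroid=({x:.2f}, {y:.2f})")
print(f"share of borderline points: {borderline_share:.0%}")
```

Every point receives exactly one color and the five group centroids come out clearly separated, yet a sizeable share of the points sit almost as close to a second center as to their own. The groups are a usable summary, but the assignment of any one person near a boundary is largely an accident of where the line was drawn.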

The point of this blog post is to make it clear that this problem must especially be addressed when dealing with Big Data, because that is where techniques like ordination come into play.

Big Data

Wikipedia has this to say about Big Data:
Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them ... Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data ... Analysis of data sets can find new correlations to spot business trends, prevent diseases, combat crime, and so on.

In business, information from social media is the most obvious source of Big Data. People are often perceived as being much more honest in their online social interactions than they are in formal surveys, and so this relatively recent source of information could potentially be much more useful to modern business practices.

As this infographic indicates, social media can generate some really big datasets. Making sense of these data involves some pretty serious summarizing. Therefore, the principles that I have discussed above become particularly important: we have to be very careful about how we interpret those summaries, especially if we have summarized the data into groups.

An example from the world of wine

So, let's conclude with a real example from the world of wine buying: the 2016 Digital Wine Report: the Five Tribes of Online Wine Buyers, prepared by Paul Mabray, Ryan Flinn, James Jory and Louis Calli. (Thanks to Bob Henry for getting me a copy of the "Academic edition" of this report.)

This study was produced by a group originally called VinTank, which at the time was a subsidiary of W2O (which subsequently closed it down!). The objective of the report was to combine data about wine drinkers, based on social media, with data about wine buyers, based on online purchases. This is a perfect example of using Big Data to help businesses understand their customers.

The social data were for 12,500 individuals, based on 183,000 Twitter posts assessed by the TMRW Engine software. The buying data were for 53,000 online wine purchases, recorded by Vin65. So, the report attempts to summarize the wine behavior of people who use both social media to discuss wine and online shopping to purchase wine, in the USA. Clearly, this does not attempt to represent all US wine drinkers and buyers — the people summarized "buy directly from wineries, they are digitally savvy and use both e-commerce and social media, and they like wine more than the casual consumer."

The crux of the report's methodology is this:
Using a methodology built upon the foundations of demographic and psychographic market research techniques, we segmented [= grouped] online wine customers according to their psychographic profiles: including hobbies, preferences, activities, and political outlooks ... We were [then] able to apply this segmentation to purchasing behavior and demographic profile at the individual customer level. As a result we've identified 5 common "tribes" of online wine buyers.
To personalize these five tribes, we've given each one a name, a theme and a personality description.

You can immediately see what I am warning you about here: these five tribes are not real, even though they have names and distinct personalities. The psychographic and demographic characteristics of the people vary continuously, and grouping them is merely a convenient mechanism for data summary.

In order to get a sense of what these groups look like, refer to the colored version of the ordination diagram shown above, where the group centers are different but the boundaries are fuzzy. I have carefully analyzed the data presented in the report, and I can assure you that the five "tribes" really do have different behavioral "centers"; but I would hate to have to assign anyone to one group or another. At a personal level, I can't see myself as being in any of these five tribes.

Part of the problem here is that categorizing people in this manner simply perpetuates cultural stereotypes. In this case we have: Anna, the sophistocrat; Graham, the info geek; Sofia (or Sophia), the digital native; Don, the southern conservative; and Kevin, the trophy hunter. If none of these people sounds like you, then you are probably right.


Big Data are useful; there is no doubt about it. However, Big Data can potentially have big problems as well, and we need to guard against the consequences. One of the most common ways to summarize Big Data is to assign the study objects to groups, but these groups are not real; they are a conceptual convenience, nothing more. Hopefully, grouping their customers will help businesses provide services to those customers, but that does not mean that the businesses should ignore those people who do not fit neatly into any of their groups.


VinTank has reappeared as AveroBuzz, which is intended for the hospitality industry as a whole, not just the wine part.
