## Monday, December 12, 2016

### Summarizing multi-dimensional wine data as graphs, Part 2: networks

In a previous blog post (Summarizing multi-dimensional wine data as graphs, Part 1: ordinations), I introduced ordination analyses as a mathematical technique for summarizing multivariate data.

Multivariate data have measurements of multiple characteristics for each of a set of "objects". The objective of the analysis is to mathematically summarize the multiple dimensions down to something manageable, which can then be displayed as a picture. This picture will give us an overview of the patterns in the data, rather than displaying the details of each characteristic.

Ordination analyses put the objects in some order along one or two dimensions (hence the name). In the resulting graph the points represent the objects, and the relationships of the points in the graph represent their similarity, based on the summary of the original data. Points close together are more similar than are points further apart.

In this post, I introduce the use of networks as an alternative way to summarize multivariate data. In particular, the networks that I introduce are called phylogenetic networks. I have a special interest in these networks, since they are what I have worked on in my professional life (some of this work is discussed at the end of this post).

#### Phylogenetic networks

Phylogenetic networks differ from what are called interaction networks, which many people are familiar with — a food web is a classic example of an interaction network, where the objects in the network are different organisms, which are connected by lines indicating who eats whom. This type of network uses lines to connect objects based on relationships that are directly observed.

A phylogenetic network, on the other hand, connects the objects with multiple lines showing a summary of their similarities. That is, the network is a bit like an ordination, which uses space to display the relationships among the objects, but the points are now connected by lines. Objects that are close together in the network are more similar to each other than are points further apart, but the relationships are traced only along the lines, not directly from point to point. That is, the lengths of the connecting lines contain information about the data summary, and the ways in which they connect also contain information.

The advantage here is that the lines will often indicate clusters of points in the multidimensional data that cannot be displayed in a 2-dimensional ordination. So, the network can be more informative.

#### An example

I have compiled some multidimensional data relating to the quality scores of a collection of Australian wines, as provided by a number of raters — Jeremy Oliver, James Halliday, the Wine Front, and Cellar Tracker. The first two raters are individual people, the third one is a group of three people (Mike Bennie, Campbell Mattinson, Gary Walsh), any one of whom may have rated the wine, and the fourth one is a community site that provides ratings averaged across many people.

In this example, the four wine raters are the objects, and their multiple characteristics are the quality scores given to the wines (there are 114 wines and thus 114 dimensions to the data). We wish to see how similar the raters are, by summarizing the multidimensional data down to a single picture.

This can be done using a NeighborNet network, as shown in the first figure. Note that the four raters are connected by a set of lines, and it is these connections that summarize the data.
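NeighborNet is a distance-based method, so the starting point is a table of pairwise distances between the objects. Here is a minimal Python sketch of that first step. Everything in it is an assumption made for illustration: the scores are invented (five wines rather than 114), and the Manhattan distance is just one plausible choice, not necessarily the measure used for the figure.

```python
from itertools import combinations

# Hypothetical scores, NOT the data from this post: one row per rater,
# one column per wine (the real dataset has 114 wines).
scores = {
    "Jeremy Oliver": [90, 88, 93, 86, 91],
    "James Halliday": [95, 94, 96, 93, 95],
    "Wine Front": [93, 92, 95, 91, 94],
    "Cellar Tracker": [89, 90, 91, 88, 90],
}


def manhattan(a, b):
    """Sum of absolute score differences across all wines (assumed measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(table):
    """Return {(rater1, rater2): distance} for every unordered pair of raters."""
    return {
        (r1, r2): manhattan(table[r1], table[r2])
        for r1, r2 in combinations(table, 2)
    }


distances = distance_matrix(scores)

# List the pairs from most similar to least similar.
for (r1, r2), d in sorted(distances.items(), key=lambda item: item[1]):
    print(f"{r1} vs {r2}: {d}")
```

A table like this has one number per pair of raters (six numbers for four raters), and it is this table that the network then summarizes as a picture.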

The lengths of the lines are important. The longest line (technically called an edge) separates the Cellar Tracker ratings from the rest of the network, indicating that the Cellar Tracker ratings are often quite different to those of the other three raters. Indeed, the original data show that these scores are usually much lower than the other three scores, for any given wine.

A similar thing applies to the scores from Jeremy Oliver, which are separated by the second-longest line; his scores are also often quite different to the others. Indeed, he has acknowledged this in his writings, pointing out that as a wine commentator he is often more critical than are other (unnamed) commentators.

In the middle of the network there is a box-like structure connecting the four raters together in various ways. This box is not square, which means that some of the raters are more similar to each other than are others, as represented by the distances connecting them along the edges.

For example, the shortest distance along the lines from James Halliday to the Wine Front (401 units) is less than the distance from Jeremy Oliver to the Wine Front (506 units). This means that the former pair are more similar than are the latter pair.

Furthermore, the distance along the edges from Jeremy Oliver to James Halliday is much longer (667 units), indicating that this pair of people produce the least similar scores of the three Australian raters. All of these relationships can be seen at a glance, which is what makes the network useful as a summary of the original data.
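To make "distance along the lines" concrete, here is a small Python sketch that stores a four-object network as a weighted graph and finds the shortest route between two raters. The individual edge lengths are hypothetical: they are not read from the figure, and were chosen only so that the three distances quoted above (401, 506 and 667 units) are reproduced. The layout (four edges leading to a central box, with Jeremy Oliver and James Halliday at opposite corners) is likewise an assumption.

```python
import heapq


def add_edge(graph, u, v, length):
    """Add an undirected edge of the given length."""
    graph.setdefault(u, {})[v] = length
    graph.setdefault(v, {})[u] = length


def path_length(graph, start, goal):
    """Shortest distance from start to goal along the edges (Dijkstra)."""
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbour, length in graph[node].items():
            new_dist = dist + length
            if new_dist < best.get(neighbour, float("inf")):
                best[neighbour] = new_dist
                heapq.heappush(queue, (new_dist, neighbour))
    raise ValueError(f"no path from {start} to {goal}")


graph = {}

# Edges from each rater to its corner of the box (hypothetical lengths).
add_edge(graph, "James Halliday", "corner JH", 150)
add_edge(graph, "Wine Front", "corner WF", 120)
add_edge(graph, "Jeremy Oliver", "corner JO", 330)
add_edge(graph, "Cellar Tracker", "corner CT", 500)

# The box: opposite sides have equal length, and the two pairs of sides
# differ in length, so the box is not square (hypothetical lengths).
add_edge(graph, "corner JH", "corner WF", 131)
add_edge(graph, "corner JO", "corner CT", 131)
add_edge(graph, "corner WF", "corner JO", 56)
add_edge(graph, "corner CT", "corner JH", 56)

print(path_length(graph, "James Halliday", "Wine Front"))     # 401
print(path_length(graph, "Jeremy Oliver", "Wine Front"))      # 506
print(path_length(graph, "Jeremy Oliver", "James Halliday"))  # 667
```

Note that the route between Jeremy Oliver and James Halliday has to cross both sides of the box, which is why that pair ends up furthest apart of the three.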

This is a relatively simple example, because there are only four objects. Clearly, a network will get more and more complex as we add more objects to the analysis. There are examples of this phenomenon in the links below.

#### Some other examples

Based on this explanation of phylogenetic networks, you might like to look at a few of the interesting examples from my professional blog, showing you the range of possible uses of networks. By "interesting" I mean that the subject matter of the blog post is interesting, not necessarily the networks themselves! The complete list of my network analyses is in this blog page: Analyses.

Simple datasets:

More complex datasets:

There is also a somewhat different explanation of how to interpret these networks in this blog post (which uses the results from a few Australian federal elections):

The use of phylogenetic networks was formally introduced in this research publication, which also contains a range of example analyses:
• Morrison D.A. (2014) Phylogenetic networks — a new form of multivariate data summary for data mining and exploratory data analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 4: 296-312.