Monday, November 16, 2020

How close are repeated wine-quality scores?

This year has been awkward for the purchase of wine, as many people have commented in the media. We can all trust a global pandemic to disrupt international, national and local events.

Here in Sweden, our national liquor chain (Systembolaget) has had to change the way it releases new wines, in order to maintain social distancing among their staff in the main warehouse. This has meant that the wine release schedule has been drastically changed from the previous steady state.

In turn, this change has affected the wine commentators, as well as the wine customers. In particular, the commentators have almost all had problems at some time during this year, tasting new wines and publishing their subsequent quality assessments. For example, both BKWine and Vinbanken did not publish their usual wine-quality scores for the new releases during May and June (I have used these score sources in previous posts; eg. Are wine scores from different reviewers correlated with each other?).

While compiling this year's new-release scores for an upcoming post, I noticed that for the BKWine reports a few of the wines appeared more than once (ie. in the reports for different months). Moreover, some of these repeated wines did not receive the same scores. This situation allows us to comment quantitatively on the repeatability of wine scores from the same person.

This is a topic that I have commented on before, notably in my post on: The sources of wine quality-score variation. In that post, I briefly discussed a dataset from Rusty Gaffney, who re-tasted 21 Pinot noir wines 16-26 months after their first tasting.

Well, in the current case, we have a much shorter period of time than that; and this gives us a much better insight into the process of scoring wines, which is, after all, rather subjective. The August and September BKWine commentaries were published much later than usual, in the middle of the month rather than at the beginning. This presumably reflects pandemic-induced problems, which led to what was apparently an unintended mix-up. The same person was responsible for all of the actual wine scores (Jack Jakobsson).

This graph shows us the scores for the 16 wines that appeared in both the August and September wine commentaries. Note that the scores have a maximum of 20 points (not 100).
Repeated wine-quality scores from BKWine

Only 4 of the 16 wines have the same score on both occasions; but 12 of them are within half a point (the smallest possible difference). However, 3 of the wines have a difference of 1 point; and 1 wine differs by 1.5 points. Nine of the wines had an increased score on the second occasion, while only 3 decreased. Circa 39% of the variation in scores is shared between the two occasions.
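For readers curious about the arithmetic: "shared variation" here is the squared Pearson correlation (r²) between the two sets of paired scores. A minimal sketch, using made-up scores on a 20-point scale rather than the actual BKWine data, looks like this:

```python
# Sketch of the "shared variation" calculation: r-squared of paired scores.
# The score values below are hypothetical, purely for illustration.
import numpy as np

# Hypothetical paired scores (20-point scale, half-point steps)
august    = np.array([15.0, 16.5, 17.0, 15.5, 16.0, 17.5, 16.0, 15.0])
september = np.array([15.5, 16.5, 17.5, 15.0, 16.5, 17.5, 17.0, 15.5])

r = np.corrcoef(august, september)[0, 1]  # Pearson correlation
shared = r ** 2                           # proportion of variance shared

print(f"r = {r:.2f}; shared variation = {shared:.0%}")
```

With the real scores, this calculation yields the circa 39% figure quoted above.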

The differences in scores are somewhat disappointing. Although the similarities are much better than would be expected from random chance (p<0.01), we are still faced with a situation where the differences are slightly bigger than the similarities.

However, this situation is vastly better than the previous one that I reported (see above), where only 6% of the variation in scores was shared between the two occasions (which were much further apart in time, of course). Tastings close together in time are expected to be more consistent; so we at least get that.


  1. It's pretty typical to have a variation in scores. The reviewer will be in a different physiological state from one time to another, but so will the wine. If sealed with natural cork, one bottle will definitely age differently from another as well.

    I know a lot of people want to cry "Foul!" when they read that, as it would imply that scoring wines is just BS; but what's important is that the scores be quite close to one another. On a 100-point scale, a variance of +/- one point shouldn't be a problem, and I would in fact think the reviewer is being honest if so.

    But this is why I also use a three-star system on my site along with the numerical scores because any wine should consistently taste within the same star bracket. If not, there's a problem.

    If someone has properly trained on tasting analysis and not just picked it up as they went, there should ultimately be little variation.



    1. There are, indeed, many reasons for variation in wine-quality scores, which I tried to look at in my previous post on the topic (linked above). Expecting perfect replication would therefore be very naive. Indeed, I am impressed that commentators can provide fairly consistent scores at all.

      So, discussing score variation should not be seen as a disparagement of wine commentators in general, nor a personal criticism of any one commentator.

      On the other hand, publishing actual numbers is risky, because people will take them seriously, and might expect them to have actual mathematical properties! Stars can certainly help obviate this potential problem.

  2. David:

    Does wine reviewer Jack Jakobsson or Systembolaget publish the 20-point scale?

    If his/their scale is built upon components ("dimensions") like the U.C. Davis scale -- appearance, color, aroma & bouquet, volatile acidity, total acidity, sweetness, body, flavor, bitterness, and general quality -- are the discrete component scores publicized?

    That could reveal whether, upon retasting, the wine's score improved or fell based on (say) its aromatic or flavor appeal.

    ~~ Bob

    Additional reading: "The fundamental problem with wine scores"


    1. Systembolaget tastes all of the wines, but provides descriptions only (no scores). So, all scores come from independent wine critics. There is, as far as I know, no breakdown of the BKWine score components.

  3. Are you familiar with the Saturnalia work on Bordeaux vintages and wine-writer scoring variation?
    Clearly, scores are influenced by various factors. Notable in recent history is the obvious difference between North American and European criteria of evaluation. Different scoring systems usefully serve the purpose of not allowing too much critical comparison, which in itself has had a significant effect on the pricing hierarchy.

    1. I have not looked at the Saturnalia work in any detail, but may one day do so. Comparing critics' scores is a tricky business, in any case, because there is no actual reason to expect different systems to correlate, other than via the supposed underlying quality of the wines. Sadly, the variation between critics is often larger than the variation between wines.

  4. In this essay by Caltech lecturer (on randomness) Leonard Mlodinow, he quotes Robert Parker who prides himself for consistently being “within a 2-3 point deviation” on his scores for wines tasted repeatedly over time.

    The Wall Street Journal “Weekend” Section
    (November 20, 2009, Page W6):

    “A Hint of Hype, A Taste of Illusion”


    1. Parker built his reputation around his scores. No-one since then has been able to repeat this, at least to the same extent, because scores have too many personal characteristics that do not mean much to the wine consumers.