Score inflation, and my failed quest to find 88s

Last month I posted a request on Instagram for 88-point coffees, with the promise to promote any samples that I (and a Q-Grader friend) scored 88 in a blind cupping. The purpose of the offer was not to get lots of free coffee (that’s the last thing I need), but rather to understand more about how roasters score, and what they know about scoring. 
The experience was eye-opening in unexpected ways.

The results

First, no coffee scored 88; the highest score was 87.5 (cottage Colombian pink bourbon from Escape in Montreal). The roaster said he didn’t think it was an 88, but given that it was a pink bourbon recommended by Jonathan Gagné, I was keen to try it. It was delicious, and I’m impressed that Escape didn’t try to oversell it.

The other candidates scored in a range of 83–87. The 83 may have once been a great coffee, but it tasted old and baggy.


The process

I only accepted samples from roasters who offered credible reasons to believe the coffees might be 88. Several roasters offered coffees they were confident were 88 because “the green seller said it was 92” or “the coffee is so fruity that it has to be 88” or “it must be 88 because my customers love it so much.” I rejected those offers to save those roasters the time and expense of sending the coffee to me. It’s apparent that most roasters don’t understand the mechanics of scoring. 

All coffees were cupped blindly, multiple times, by my friend Mark (Q Grader) and me. All cupping sessions contained coffees from multiple roasters. All cupping sessions included “anchor” coffees we had tasted multiple times, whose scores were consistent and known. Usually one of us had no idea what coffees would be on the table each day. 

 
 


Calibrating with Sey

During this process, Matt from Sey Coffee reached out and offered some samples. Matt didn’t claim any of the samples would be 88, but he promised they would be tasty, and he was interested in calibrating, which I appreciated. I reciprocated by sending various samples to Sey. We compared our respective scores, as well as the scores of some importers with which Sey deals. 

While my scores on all samples were a tad lower than those of Sey and the importers, we generally agreed on how we ranked various coffees. I believe everyone involved enjoyed and benefitted from the experience, though I doubt Sey enjoyed that baggy sample much :0 (sorry guys). 


How roast quality affects score

When Ryan Brown and I ran Facsimile, a cupping-oriented subscription service, I occasionally sent multiple sample roasts of the same coffee to Ryan. Ryan preferred that the samples he received not be marked in any identifiable way (not even country of origin). Every sample roast we cupped was well within the range of “typical” third-wave roast quality, i.e., nothing was grossly flicked or underdeveloped.

Our scores for a given coffee roasted a variety of ways would land in a 1.5-point range. Since then I have assumed semi-competent roasting could influence a cup score by up to 1.5 points. One way to look at that would be a “perfect” roast would capture the full potential of a coffee, while a flawed-but-not-awful roast would cause a deduction of up to 1.5 points.
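As a rough illustration of that way of looking at it, here is a minimal Python sketch. It assumes the best-scoring roast stands in for the coffee's full potential, which is a simplification, and the function names and sample scores are made up for the example.

```python
def roast_spread(scores_by_roast):
    """Return the gap between the best and worst roast of one coffee."""
    values = scores_by_roast.values()
    return max(values) - min(values)


def roast_deductions(scores_by_roast):
    """Map each roast to how many points it fell below the best roast.

    Assumption: the best roast on the table captured the coffee's
    full potential, so its deduction is zero.
    """
    best = max(scores_by_roast.values())
    return {label: round(best - score, 2)
            for label, score in scores_by_roast.items()}


# Hypothetical blind scores for three sample roasts of the same green coffee.
example = {"roast_a": 87.0, "roast_b": 86.0, "roast_c": 85.5}
print(roast_spread(example))
print(roast_deductions(example))
```

If the spread for semi-competent roasts regularly exceeds about 1.5 points, that suggests either the roasts or the cuppers need a closer look.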


Score inflation is real

As with everything, it seems, at some point marketers hijack something good, exaggerate claims of quality, and eventually make quality claims meaningless. Coffee scoring is no different. Green sellers and roasters are under competitive pressure to inflate scores. After all, if two green suppliers offer nearly identical coffees from a given origin at roughly the same price, but one importer scores theirs “88” and a competitor claims theirs is “86,” most roasters will be more inclined to buy the “88.” 

In my experience, score inflation can be reasonably predicted based on the source of a score: 

  • Coffee-review websites will habitually over-score by 6-9 points (no joke; if they scored conservatively, no one would send them samples)

  • Green importers will over-score by 1-5 points depending on the audience

  • Roasters inflate scores by an average of 1-3 points

  • Cup of Excellence scores seem quite accurate

  • CQI scores are accurate, by definition


Even if you disagree with my estimates, the simple fact that the main sources of public scores disagree by such large amounts means the system is full of bias and inaccuracy.
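For anyone who wants to apply those estimates to a claimed score, here is a minimal Python sketch. The ranges come straight from the list above; treating Cup of Excellence and CQI scores as having zero inflation is an assumption of the sketch, and the dictionary keys and function name are hypothetical.

```python
# Estimated over-scoring, in points, by source of the score.
# (low, high) ranges are the estimates from the list above.
# Modeling "accurate" sources as (0, 0) is an assumption of this sketch.
INFLATION_BY_SOURCE = {
    "review_site": (6.0, 9.0),
    "importer": (1.0, 5.0),
    "roaster": (1.0, 3.0),
    "cup_of_excellence": (0.0, 0.0),
    "cqi": (0.0, 0.0),
}


def plausible_score_range(claimed, source):
    """Return (lowest, highest) plausible score behind a claimed score."""
    low_inflation, high_inflation = INFLATION_BY_SOURCE[source]
    # The largest inflation gives the lowest plausible real score.
    return (claimed - high_inflation, claimed - low_inflation)


# An importer's "92" would plausibly sit somewhere between 87 and 91.
print(plausible_score_range(92, "importer"))
```

The output is a range, not a corrected score; it is only as good as the estimates that feed it.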

The vast majority of the world’s roasters are quite small, and as far as I can tell, the majority of small roasters don’t sample roast and cup green samples blindly before purchase. A great many small and new roasters get their introduction to scoring from importers, which may compound score inflation over time.

A friend of mine who has been a Cup of Excellence judge since the beginning of COE says he is frustrated by scores creeping higher in COE competitions as well. I doubt score creep is nearly as bad there as it is in most of the industry, but that was still concerning to hear. 

Thankfully the industry has a scoring “anchor” in the form of the CQI, which offers Q-Grader scoring of coffees for a fee. Unfortunately that fee is rather high for most small roasters, but it would behoove roasters and importers to send occasional samples to the CQI for calibration.


Scoring should be semi-logarithmic 

I recommend interpreting seemingly linear scoring systems, such as that used in coffee, as non-linear. Although 88 is just over 1% higher than 87, I don’t view 88s as approximately 1% better than 87s; 88s are obviously much more than 1% better. Scoring is not linear. 

I assume that over the course of cupping an extremely large number of samples (at least 10,000, preferably more) from a great variety of sources and price points, there should be approximately 10x more 87s than 88s. In my experience, that has been the approximate trend.*

Of course in a smaller sample size, or a set of samples with some sort of selection bias (e.g., you only taste washed coffees that cost above a certain price), this 10x relationship won’t hold. But it’s a good mental model; if you score only 2x more coffees 87 than 88, something is wrong. If you are scoring or offering 90-point coffees on a regular basis, there is a calibration problem.
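The 10x rule of thumb can be written down in a few lines of Python. This is only a sketch of the mental model, not a formal statistical test: the function names are hypothetical, and rounding half-point scores down to the whole point (so 87.5 counts as an 87) is an assumption.

```python
from collections import Counter


def relative_frequency(score, reference=88, factor=10.0):
    """Expected number of coffees at `score` for every one at `reference`,
    if each extra point is `factor` times rarer."""
    return factor ** (reference - score)


def observed_ratio(scores, lower=87, higher=88):
    """Ratio of coffees scored `lower` to coffees scored `higher`.

    Assumption: fractional scores are rounded down to the whole point.
    Returns None when nothing was scored `higher`.
    """
    counts = Counter(int(s) for s in scores)
    if counts[higher] == 0:
        return None
    return counts[lower] / counts[higher]


# Under the model there are about 10 coffees at 87 and 100 at 86 per 88.
print(relative_frequency(87), relative_frequency(86))

# A hypothetical score sheet with only 2x more 87s than 88s.
print(observed_ratio([87, 87, 87.5, 87, 88, 88, 86.5]))
```

On a large, unbiased set of scores, an observed ratio far below 10 would be a hint to recalibrate. On a small or hand-picked set it says very little.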

*Think of this 10x trend as a bit like the stock market. The stock market may not return 10% in any given year, and its annual returns may fluctuate between -10% and +40%, but over time the market may return 10%. In 2018, I probably scored 2-3 coffees (all Kenyans) at 90 and 15 of them at 88-89. In 2022, due primarily to the collapse in Kenyan-coffee quality, so far I have scored nothing 89 or 90. Much like the rare years in which the stock market returns 40%, coming across coffees that score above 90 is so rare that it’s hard to know what their frequency would be over hundreds of thousands of samples over many decades. My guess is the 10x relationship would have decent predictive power up to about 93-94 points. To be honest, I have no idea what a 95+ coffee would taste like. If I ever taste one, I’ll probably cry and make that the last cup of coffee I ever drink. (Note to the haters: serve me a 95 and you’ll never have to hear from me again. Thanks.)


The future of scoring

I fear that a few current trends will contribute to further score inflation: 

  • The dramatic fall in the quality of coffee from Kenya and Ethiopia, historically the highest-scoring origins, over the past few years.

  • The continued explosion in the number of new roasters who learn scoring calibration from their green suppliers.

  • The explosion in popularity of funky processing methods that are objectively dirtier than washed but somehow get a pass. (The phrase “clean for a natural” alone should make one suspicious.)

The first trend is most critical: Literally every 90-point coffee I have tasted over my 30-year career was from Kenya. If there are no more 90-point Kenyas, “90” will lose its historical meaning, and today’s 87s may become tomorrow’s 90s. Once that happens, how will we score the Kenyans sitting in George Howell’s freezer? 105?

This slide is already happening, most obviously in 2022 due to “framing” effects and roasters “grading on a curve.” I’m frankly a little frustrated with the number of 88–90 scores roasters have claimed for their Kenyans this year. Literally twenty roasters this year have said to me a version of “I know Kenyans aren’t as good as they were, but I found one that tastes like the Kenyans of old.” Right, and everyone is a better-than-average driver :0.


Bias is rampant in all human pursuits. 

If you want to know what a true 90+ coffee is, try some Kenyan from George’s freezer. It’s probably the only current, reliable source of such coffees. If you want to know what an 88 is, perhaps purchase a coffee that scored 90+ in a Cup of Excellence competition. I say 90+ because the competition coffees will have been scored when they were optimally fresh. By the time they land in your local roaster’s warehouse, they will likely have lost one or two points.

How to combat score inflation

If you are interested and can afford it, consider taking the Q Exam. It’s probably the quickest route to learning to score accurately. 

Whether you are Q-certified or not, the most important practice is to calibrate with other experienced scorers who have no motivation to inflate the scores of the coffees you cup together. I didn’t know how to score until cupping with Ryan Brown, and I’m grateful to have been introduced by someone so competent and objective. 

In recent years, I have found myself well-calibrated with Lance Hedrick, the entire, well-calibrated team (!) at Nomad in Barcelona, Jaroslav of Doubleshot in Prague, my daily co-cupper Mark Benedetto, Elliot at Steady State, and a few others. 

Over time I hope to expand my circle of calibrated cuppers, as I believe it is the best way to prevent bias or score drift. In the future I plan to send samples to CQI periodically to calibrate with them.

I welcome your comments.

 


 
Scott Rao