Why is the harmonic mean used for the F-score?

· Amanda Ng

Precision and recall are common metrics for assessing the performance of binary classification models. By definition, however, there is usually a trade-off between them: pushing precision higher tends to lower recall, and vice versa. The F-score is a single metric that represents both precision and recall. It is often defined as the harmonic mean of precision and recall. What is the harmonic mean and why is it used in the F-score?

TL;DR

Precision and recall share a common numerator but have different denominators. To ‘unify’ them, we take the sum of the reciprocals of precision and recall. The mean that preserves a sum of reciprocals is the harmonic mean, which is why the F-score uses the harmonic mean.

What’s in a mean?

Before diving into what a harmonic mean is, let’s revisit the definition of a mean. Growing up, in mathematics classes, I was taught that the mean is the total divided by the number of samples. This, however, defines one specific type of mean: the arithmetic mean. I’m not alone in growing up with this definition. I’ve come across quite a few blog articles that still use the arithmetic mean as the definition of a mean. The Wikipedia page on means does a slightly better job:

https://en.wikipedia.org/wiki/Mean
A mean is a quantity representing the "center" of a collection of numbers and is intermediate to the extreme values of the set of numbers.[1] There are several kinds of means (or "measures of central tendency") in mathematics, especially in statistics. Each attempts to summarize or typify a given group of data, illustrating the magnitude and sign of the data set. Which of these measures is most illuminating depends on what is being measured, and on context and purpose.[2]...

[1] “Mean | mathematics”. Encyclopedia Britannica. Retrieved 2020-08-21.
[2] Why Few Math Students Actually Understand the Meaning of Means (YouTube video). Math The World. 2024-08-27. Retrieved 2024-09-10.

Wikipedia’s definition emphasizes that a mean is a representation of central tendency and that there are multiple means (not just the arithmetic mean). Depending on the context, different means should be used. This is better than equating the mean with the arithmetic mean, but I still found “a representation of central tendency” too ambiguous to wrap my head around.

I personally found the second reference in the Wikipedia page particularly useful for re-framing this definition. It points to a YouTube video by Math The World (which I highly recommend watching) that goes through some simple problems that actually require different means to solve and, importantly, brings up the following definition of what a mean is:

“The mean of a set of values applies to a situation, where we have a set of values that we combine in some way to get a total. The mean of the set of values is the single value we can replace all of the values with and still get the same total.”

The mathematical notation looks something like this:

Let a set of values be $\{x_1, x_2, \dots\}$. To calculate the total, we can carry out different operations on the values.

$$\text{Total} = x_1 \square x_2 \square \dots$$

where $\square$ can be any operation.

The mean is a single value we can replace all the $x$ values with that retains the same total i.e.

$$\text{Total} = \text{mean} \square \text{mean} \square \dots$$

The type of mean used depends on the operation (the $\square$) we’re interested in. For example, the arithmetic mean is used when the operation is a sum (where $\square$ is $+$). The harmonic mean is used when the operation is a sum of reciprocals, i.e.

$$\text{Total or sum of reciprocals} = \frac{1}{x_1} + \frac{1}{x_2} + \dots$$

To calculate the harmonic mean ($H$):

$$ \begin{aligned} H &= \left( \frac{1}{n} \left( x_1^{-1} + x_2^{-1} + \dots + x_n^{-1} \right) \right)^{-1} \\ &= \left( \frac{x_1^{-1} + x_2^{-1} + \dots + x_n^{-1}}{n} \right)^{-1} \\ &= \frac{n}{x_1^{-1} + x_2^{-1} + \dots + x_n^{-1}} \\ &= \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \\ &= n \times \left( \text{Sum of reciprocals} \right)^{-1} \end{aligned} $$

where $n$ is the number of $x$ values.
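As a quick numerical check of the "replace every value with the mean and keep the same total" idea, here is a minimal Python sketch. The function names and the example values are my own assumptions for illustration, and the harmonic mean here assumes every value is positive.

```python
import math


def arithmetic_mean(values):
    """Total under '+' divided by the number of values."""
    return sum(values) / len(values)


def harmonic_mean(values):
    """n divided by the sum of reciprocals; assumes every value is positive."""
    return len(values) / sum(1 / x for x in values)


values = [2.0, 3.0, 6.0]  # hypothetical example values
n = len(values)

a = arithmetic_mean(values)
h = harmonic_mean(values)

# Replacing every value with the arithmetic mean keeps the plain sum.
sum_original = sum(values)
sum_replaced = a * n

# Replacing every value with the harmonic mean keeps the sum of reciprocals.
recip_original = sum(1 / x for x in values)
recip_replaced = n * (1 / h)

print(math.isclose(sum_original, sum_replaced))
print(math.isclose(recip_original, recip_replaced))
```

For these values the arithmetic mean is about 3.67 and the harmonic mean is about 3, so the two means really are different numbers: each one preserves only its own kind of total.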

Why is the harmonic mean used in the F-score?

Now that we know what the harmonic mean is, a better question is: how is the sum of reciprocals relevant to the F-score? Let’s recall the formulas for precision and recall.

$$\text{Precision} = \frac{TP}{TP+FP}$$

$$\text{Recall} = \frac{TP}{TP+FN}$$

where:

  • $TP$ is the number of true positives
  • $FN$ is the number of false negatives
  • $FP$ is the number of false positives

The F-score is a single value representing precision and recall. We can’t directly add precision and recall together because they are calculated under different contexts (represented by the different denominators): precision is a fraction of all predicted positives, while recall is a fraction of all actual positives. What we can do, however, is calculate the sum of the reciprocals of precision and recall, because they share the same numerator ($TP$), which becomes a common denominator once we take reciprocals.

$$ \begin{aligned} \text{Sum of reciprocals} &= \text{Precision}^{-1} + \text{Recall}^{-1} \\ &= \left(\frac{TP}{TP+FP}\right)^{-1} + \left(\frac{TP}{TP+FN}\right)^{-1} \\ &= \frac{TP+FP}{TP} + \frac{TP+FN}{TP} \\ &= \frac{2TP+FN+FP}{TP} \end{aligned} $$

Since we’re representing two values, i.e. $n = 2$, the F-score is:

$$ \begin{aligned} \text{F-score} &=2 \times \left( \frac{2TP+FN+FP}{TP} \right)^{-1} \\ &=2 \times \frac{TP}{2TP+FN+FP} \\ &=\frac{2TP}{2TP+FN+FP} \end{aligned} $$
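The count-based formula above can be sketched in Python as follows. The function name `f_score_from_counts` and the confusion-matrix counts are hypothetical, chosen only for illustration; raising an error when all counts are zero is my own choice, since the formula is undefined in that case.

```python
import math


def f_score_from_counts(tp, fp, fn):
    """F-score as 2TP / (2TP + FN + FP)."""
    denominator = 2 * tp + fn + fp
    if denominator == 0:
        # No true positives, false positives or false negatives at all.
        raise ValueError("F-score is undefined when TP, FP and FN are all zero")
    return 2 * tp / denominator


tp, fp, fn = 8, 2, 4  # hypothetical counts

# Sum of reciprocals of precision and recall, written in terms of counts.
sum_of_reciprocals = (2 * tp + fn + fp) / tp

# n = 2 values, so the harmonic mean is 2 divided by the sum of reciprocals.
f_via_reciprocals = 2 / sum_of_reciprocals
f_via_counts = f_score_from_counts(tp, fp, fn)

print(f_via_counts)
print(math.isclose(f_via_reciprocals, f_via_counts))
```

With these counts the F-score is 16/22, roughly 0.73, whichever of the two routes is used.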

Alternatively, we can calculate the F-score from precision and recall, which would be:

$$ \begin{aligned} \text{F-score} &=2 \times \left( \text{Precision}^{-1} + \text{Recall}^{-1} \right)^{-1} \\ &=2 \times \left( \frac{1}{\text{Precision}} + \frac{1}{\text{Recall}} \right)^{-1} \\ &=2 \times \left( \frac{\text{Recall} + \text{Precision}}{\text{Precision} \times \text{Recall}} \right)^{-1} \\ &=2 \left( \frac{\text{Precision} \times \text{Recall}}{ \text{Recall} + \text{Precision}} \right) \end{aligned} $$
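To check that this closed form really is the harmonic mean, here is a small sketch that compares it against `statistics.harmonic_mean` from the Python standard library. The function name `f_score` and the precision and recall values are assumptions made up for the example.

```python
import math
from statistics import harmonic_mean


def f_score(precision, recall):
    """F-score as 2 * (precision * recall) / (precision + recall)."""
    if precision + recall == 0:
        # The formula divides by zero here; treated as undefined in this sketch.
        raise ValueError("F-score is undefined when precision and recall are both zero")
    return 2 * (precision * recall) / (precision + recall)


precision, recall = 0.8, 0.5  # hypothetical values

f_closed_form = f_score(precision, recall)
f_harmonic = harmonic_mean([precision, recall])
arithmetic = (precision + recall) / 2

print(f_closed_form, f_harmonic, arithmetic)
print(math.isclose(f_closed_form, f_harmonic))
```

Here the F-score is roughly 0.62, below the arithmetic mean of 0.65: the harmonic mean is pulled towards the smaller of the two values.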

Reflections

Writing this blog post has been quite an enriching experience. My conception of what a mean is has fundamentally shifted. The mean is a representative value that depends on how the total is calculated from a set of values. As precision and recall have different denominators, we cannot add them directly. They do, however, share a common numerator, which allows us to aggregate them by using their reciprocals. The mean that preserves a sum of reciprocals is the harmonic mean. In short, the F-score isn’t just a formula; it follows from how recall and precision are defined.

I can’t help but wonder what other operations require a careful selection of the mean? 💡