How is the log loss function derived?
Loss functions are used to guide the development of machine learning (ML) models, and each model type has one or more preferred loss functions. I recently started learning about them in a more structured manner to strengthen my foundation in ML. I started with linear regression and its loss functions, then moved on to logistic regression. The log loss function used for training logistic regression models looks much more complicated than the loss functions used in linear regression, and I couldn’t help but wonder why. How is the log loss function derived? Why is it the way it is?
TL;DR
Logistic regression models predict the probability of a positive outcome. These models work with prediction values running from 0 to 1 and actual values reflecting a binary outcome (0 or 1). To calculate the loss between these prediction-actual value pairs, we use the log loss function, which is derived from the probability mass function of the Bernoulli distribution.
Table-of-contents
- Loss functions of linear regression vs log loss
- Binary classifications can be modelled by the Bernoulli distribution
- Maximum likelihood
- Converting likelihood to loss
- Prologue: How does log loss perform when we know the ground truth?
Loss functions of linear regression vs log loss
Loss functions are used to calculate the difference between a prediction by a machine learning model and the actual observed value. They are used for model training and for assessing the progress of the training. In model training, the loss function guides how the model updates its parameters (weights and bias) automatically without human intervention. For assessing the progress of the training, the loss values are read by humans. They’re the benchmark for checking model training and they guide decisions about the model, such as: Does the model require more training? Should the hyperparameters (influencing training) be adjusted? Is the model overfitted? Loss functions are, thus, designed to report losses in an interpretable manner (for assessing model training) and in a way that can be implemented computationally (for model training).
Here’s a recap of the loss functions in linear and logistic regression:
| Loss | Formula |
|---|---|
| L1 Norm | $$\Vert \mathbf{y} - \mathbf{y'} \Vert_1$$ |
| Mean Absolute Error (MAE) | $$\frac{1}{n} \sum_{i=1}^{n} \vert y_i - y'_i \vert$$ |
| L2 Norm | $$\Vert \mathbf{y} - \mathbf{y'} \Vert_2$$ |
| Mean Squared Error (MSE) | $$\frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2$$ |
| Root Mean Squared Error (RMSE) | $$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2}$$ |
| Log loss | $$-\frac{1}{n} \sum_{i=1}^{n} [y_i \ln(y'_i) + (1 - y_i) \ln(1 - y'_i)]$$ |
where $y'$ is the predicted value and $y$ is the actual value.
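To make the table more concrete, here is a minimal sketch of four of these formulas in plain Python. The function names and the sample values are my own assumptions for illustration; they don't come from any particular library.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: the average of |y - y'|."""
    return sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error: the average of (y - y')^2."""
    return sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(y_true, y_pred))

def log_loss(y_true, y_pred):
    """Log loss: y_true holds 0/1 outcomes, y_pred holds probabilities strictly between 0 and 1."""
    total = sum(
        y * math.log(yp) + (1 - y) * math.log(1 - yp)
        for y, yp in zip(y_true, y_pred)
    )
    return -total / len(y_true)

# Made-up house prices (in thousands) for the regression losses.
actual_prices = [300.0, 450.0, 500.0]
predicted_prices = [310.0, 430.0, 500.0]
print("MAE: ", mae(actual_prices, predicted_prices))
print("MSE: ", mse(actual_prices, predicted_prices))
print("RMSE:", rmse(actual_prices, predicted_prices))

# Made-up binary outcomes and predicted probabilities for log loss.
actual_outcomes = [1, 0, 1]
predicted_probs = [0.9, 0.2, 0.6]
print("Log loss:", log_loss(actual_outcomes, predicted_probs))
```

Notice that the regression losses take two lists on the same scale, while `log_loss` takes binary outcomes on one side and probabilities on the other.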
The first five loss functions can be used in linear regression, while the log loss is for logistic regression. Feels like there’s a bit of a jump, right?
Let’s think about the predicted and actual values being compared in linear and logistic regression models.
Linear regression models predict a numerical value on a linear scale. For example, we could have a model that predicts the price of a house based on its location. The loss would be the difference between the predicted value and the actual value of a house. Regardless of whether the predicted value is higher or lower than the actual value, this difference should be treated the same, i.e. the direction of the difference is not important for representing loss. As such, the loss functions for linear regression remove the sign of the difference (positive/negative) by taking the absolute value of the difference or squaring it.
Logistic regression models predict the probability of a positive outcome. This predicted probability is a continuous value between 0 and 1. The actual value it is compared to, however, is a binary outcome (0 or 1). For example, we have a bag of blue and red balls, and we have a model that predicts the probability of us drawing a blue ball. The probability is any value from 0 to 1, but the actual value is ‘blue ball’ (1) or ‘red ball’ (0).
This distinction between linear and logistic regression models is crucial. When I first came across the definition of loss functions as a means to calculate the difference between the predicted and actual values, my first thought was that the difference is merely $y - y'$. This thought, however, assumes that $y$ (actual value) and $y'$ (prediction value) are on the same scale. The way difference/loss is calculated depends on the scales of the actual and prediction values. Since linear and logistic regression handle different values (especially in terms of the actual values they work with), the way loss is calculated is different. That’s the intuition behind why the log loss function looks so different from the other five loss functions in the table.
The log loss function was derived from answering a fundamental question: how can we calculate the loss between a probability ranging from 0 to 1 and a binary outcome (0 or 1)?
As it turns out, this loss can be reframed as a question: what is the probability of outcome $y$ given probability $y'$? That question can be modelled with the probability mass function of a Bernoulli distribution. That’s our starting point, which eventually evolves into our log loss function as we’ll see. In the subsequent sections, I’ll use $p$ instead of $y'$ to represent the probability calculated by the model, so that it is easier to distinguish the actual value/outcome $y$ from the predicted value/probability $p$.
Rabbit hole 🐰
The log loss function reframes loss calculation into a probabilistic problem, but there are other ‘frames’ you can use too. I’ve been told, for example, that support vector machines reframe the calculation into a geometric problem.
Binary classifications can be modelled by the Bernoulli distribution
Logistic regression models work on binary classification problems that can be modelled by the Bernoulli distribution. The probability of outcome $y$ given probability $p$ can be captured with the probability mass function:

$$P(y|p) = p^y (1-p)^{1-y}$$
where:
- $y$ is the actual outcome; a positive class is 1 and a negative class is 0
- $p$ is the probability of the positive class
In our context, if the probability of a positive class is $p$, the probability of a negative class must be $1-p$. This equation mirrors this outcome:
- When $y = 1$ (positive class), the equation simplifies to $p^1(1-p)^0=p$.
- When $y = 0$ (negative class), the equation simplifies to $p^0(1-p)^1=1-p$.
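The two cases above can be checked with a few lines of Python. This is only a sketch; `bernoulli_pmf` is a name I made up for illustration, and 0.7 is an arbitrary example probability.

```python
def bernoulli_pmf(y, p):
    """Probability of outcome y (0 or 1) when the positive class has probability p."""
    if y not in (0, 1):
        raise ValueError("y must be 0 or 1")
    return p ** y * (1 - p) ** (1 - y)

p = 0.7  # arbitrary example probability of the positive class
print(bernoulli_pmf(1, p))  # positive class: reduces to p
print(bernoulli_pmf(0, p))  # negative class: reduces to 1 - p
```

Because the exponent is either 0 or 1, one of the two factors always becomes 1, which is how a single formula covers both outcomes.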
Maximum likelihood
The probability mass function gives the probability of one actual outcome given one prediction. During training and evaluation, we typically assess more than one pair of predictions and outcomes. With our bag of balls, the assessment is similar to drawing a ball from the bag $n$ times with replacement. We want to run our prediction model $n$ times and check how the model does overall compared to our draws (the actual outcomes). This overall measure is known as the “likelihood”, which is the product of $P(y_i|p)$ over all draws:

$$L(p) = \prod_{i=1}^{n} P(y_i|p) = \prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$

where:
- $L(p)$ is the likelihood of the predicted probability.
- $y_i$ is the actual outcome of draw $i$, either 0 or 1.
- $p$ is the predicted probability.
The more probabilities you multiply, the smaller the likelihood becomes, because every factor is between 0 and 1. This shrinking product causes a computational problem known as “numerical underflow”. Computers store decimals as floats like float64, so 0.5 and 0.000000005 each use exactly 64 bits. A product that keeps shrinking eventually falls below the smallest value that 64 bits can represent and gets rounded to 0. To avoid this problem, we can calculate the natural log of the likelihood, which turns the product into a sum:

$$\ln L(p) = \sum_{i=1}^{n} [y_i \ln(p) + (1 - y_i) \ln(1 - p)]$$
⭐️ Tip
Some formulas write $\log$ instead of $\ln$. In this context, $\log$ usually means the natural logarithm, so the result is identical. From a coding perspective, most packages like TensorFlow, PyTorch and Scikit-Learn use $\ln$.
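Here is a small sketch of the underflow problem described above. It assumes 2,000 draws that each contribute a probability of 0.5 (an arbitrary choice) and compares the raw product with the sum of natural logs.

```python
import math

# Assumption for illustration: 2,000 draws, each contributing a probability of 0.5.
probabilities = [0.5] * 2000

# Multiplying them directly: the product shrinks past what float64 can represent.
product = math.prod(probabilities)

# Summing the natural logs instead: the result stays a perfectly ordinary number.
log_sum = sum(math.log(q) for q in probabilities)

print(product)  # 0.0, the true value 0.5**2000 has underflowed
print(log_sum)  # roughly -1386.29, i.e. 2000 * ln(0.5)
```

The product carries no usable information once it hits 0, while the log version can still be compared between models.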
The closer the predicted probability is to the ground truth, the higher the likelihood. If we drew blue-blue-red, we would expect that the probability of drawing a blue ball would be $p > 0.5$:
- If $p = 0.9$, the likelihood would be $0.9 \times 0.9 \times (1-0.9) = 0.081$.
- Conversely, if $p = 0.1$, the likelihood would be $0.1 \times 0.1 \times (1-0.1) = 0.009$.
In other words, the model does well when it maximises the likelihood.
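The blue-blue-red example can be reproduced with a short sketch. The `likelihood` helper and the grid of candidate probabilities are my own assumptions for illustration; a real model would not search a grid like this.

```python
import math

def likelihood(outcomes, p):
    """Product of the Bernoulli probabilities of each outcome (1 = blue, 0 = red)."""
    return math.prod(p ** y * (1 - p) ** (1 - y) for y in outcomes)

draws = [1, 1, 0]  # blue, blue, red

print(likelihood(draws, 0.9))  # approximately 0.081
print(likelihood(draws, 0.1))  # approximately 0.009

# Try candidate probabilities 0.01, 0.02, ..., 0.99 and keep the one with the highest likelihood.
candidates = [i / 100 for i in range(1, 100)]
best_p = max(candidates, key=lambda p: likelihood(draws, p))
print(best_p)  # 0.67 on this grid, close to 2/3 (two blue balls out of three draws)
```

The candidate that maximises the likelihood lands next to the observed share of blue balls, which is the intuition behind “maximum likelihood”.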
Converting likelihood to loss
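Finally, to convert likelihood into loss, we can multiply the natural log of the likelihood by $-1$. To get the average loss per sample, we can further divide it by $n$ samples:

$$\text{Log loss} = -\frac{1}{n} \ln L(p) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \ln(p) + (1 - y_i) \ln(1 - p)]$$

When the model predicts a separate probability for each sample, $p$ becomes $p_i$ (written as $y'_i$ in the earlier table).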
The purpose of this conversion is to align with the general idea of model performance. In linear regression, we saw that lower loss values are indicative of better model performance. If we used likelihood for assessing logistic regression, the opposite would be true; the higher the likelihood, the better the model performance. By making this conversion from likelihood to loss, we keep a similar grading system for model performance—the lower the loss score, the better the model performance.
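As a sketch of this conversion, the snippet below takes the blue-blue-red draws from earlier, computes the likelihood for two candidate probabilities, and turns each into an average loss. The helper names are hypothetical.

```python
import math

def likelihood(outcomes, p):
    """Product of the Bernoulli probabilities of each outcome."""
    return math.prod(p ** y * (1 - p) ** (1 - y) for y in outcomes)

def loss_from_likelihood(outcomes, p):
    """Negative natural log of the likelihood, averaged over the n samples."""
    return -math.log(likelihood(outcomes, p)) / len(outcomes)

def loss_from_sum(outcomes, p):
    """The same loss written as a sum of logs, which is the numerically safer form."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in outcomes)
    return -total / len(outcomes)

draws = [1, 1, 0]  # blue, blue, red

for p in (0.9, 0.1):
    print(p, likelihood(draws, p), loss_from_likelihood(draws, p), loss_from_sum(draws, p))
```

The candidate with the higher likelihood ($p = 0.9$) ends up with the lower loss, and both ways of computing the loss agree up to floating-point rounding.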
That’s how the log loss formula is derived. Or at least how I think it’s derived.
Prologue: How does log loss perform when we know the ground truth?
What we’ve discussed so far stems from what happens in practice for logistic regression models. The data we have access to are the predicted probability (any value from 0 to 1) and the actual outcome (0 or 1). We can use the log loss function for loss calculation because it allows us to compare the predicted probability and actual outcome even though they’re on different scales.
Here’s the nifty part: the log loss function also performs better at guiding the model towards the ground truth probability that underlies the data (better than the loss functions of linear regression models at least).
Let’s say we know our bag of balls contains 6 blue balls and 4 red balls. We, thus, know the ground truth probability of drawing a blue ball ($y = 0.6$).
Since we have access to the actual probability, which is on the same scale of 0 to 1 as the predicted probability, we can compare the difference between the actual and predicted probabilities using any of the loss functions of the linear regression model or the log loss function.
For $n = 1$, the loss calculated between a predicted probability $p$ from 0 to 1 and the actual probability of 0.6 is:
- MAE: $\vert 0.6 - p \vert$
- MSE: $(0.6 - p)^2$
- Log loss: $-[0.6 \ln(p) + 0.4 \ln(1 - p)]$
I’ve chosen two of the linear regression loss functions to represent the two common approaches used to calculate loss (taking either the absolute value or the square of the difference) and simulated their loss values alongside those of the log loss function. We can see how the minimum loss of each function is at a predicted probability of 0.6, which coincides with the actual probability. Where these functions differ is in the shape of their curves.
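The simulation described above can be approximated with a short sketch. It assumes a single sample ($n = 1$), an actual probability of 0.6, and a grid of predicted probabilities from 0.01 to 0.99; the grid and function names are my own choices.

```python
import math

TRUE_P = 0.6  # ground truth probability of drawing a blue ball (6 of 10 balls)

def mae(p):
    return abs(TRUE_P - p)

def mse(p):
    return (TRUE_P - p) ** 2

def log_loss(p):
    return -(TRUE_P * math.log(p) + (1 - TRUE_P) * math.log(1 - p))

losses = {"MAE": mae, "MSE": mse, "Log loss": log_loss}

# Predicted probabilities 0.01, 0.02, ..., 0.99 (the endpoints are skipped to avoid ln(0)).
grid = [i / 100 for i in range(1, 100)]

# For each loss function, find the predicted probability with the smallest loss.
minima = {name: min(grid, key=fn) for name, fn in losses.items()}

for name, fn in losses.items():
    print(f"{name}: minimum at p = {minima[name]:.2f}, loss at p = 0.01 is {fn(0.01):.3f}")
```

All three functions bottom out at 0.6, but the log loss is far larger for a prediction as far off as 0.01. One difference worth noting: the log loss at its minimum is not 0, since the outcome of a single draw stays uncertain even when the predicted probability is exactly right.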
During model training, the loss function guides the model toward the minimum. I like picturing the model as a ball that I’ve dropped at any given predicted value; where it settles is where the model has ‘reached’ a predicted probability most similar to the actual probability (the ‘best prediction’). The log loss function has a U-shaped curve with large losses when the predicted probability deviates significantly from the actual probability. In contrast, MAE and MSE output smaller losses for such deviations, creating a gentler V- or U-shaped curve. A model tends toward the best prediction or minimum loss more efficiently when ‘guided’ along the log loss curve than along the curves of MAE or MSE.
In other words, aside from the log loss function being able to handle difference calculations in practice (where we don’t know the actual probability), it is a more effective function for guiding model training where we do know the actual probability.