How is the log loss function derived?

· Amanda Ng

Loss functions are used to guide the development of machine learning (ML) models, and each model type has one or more preferred loss functions. I started learning about them in a more structured manner recently to strengthen my foundation in ML. I started with linear regression and its loss functions and moved on to logistic regression. Logistic regression uses the log loss function, which looks much more complicated than the loss functions used in linear regression, and I couldn’t help but wonder why. How is the log loss function derived? Why is it the way it is?

TL;DR

Logistic regression models predict the probability of a positive outcome. These models work with prediction values on a continuous scale from 0 to 1 and actual values/outcomes that are binary (either 0 or 1). To calculate the loss between these prediction-actual value pairs, we use the log loss function, which is derived from the probability mass function of the Bernoulli distribution.

Loss functions of linear regression vs log loss

Loss functions are used to calculate the difference between a prediction by a machine learning model and the actual observed value. They are used during model training and evaluation. In training, the loss guides how the model updates its parameters (weights and bias). In evaluation, the loss values are read by humans: they are the benchmark for checking model performance and guide decisions about the model, such as: Does the model require more training? Should the hyperparameters (which influence training) be adjusted? Is the model overfitted? Loss functions are, thus, designed to report losses in an interpretable manner (for evaluation) and in a way that can be implemented computationally (for model training).

Here’s a recap of the loss functions in linear and logistic regression:

| Loss | Formula |
| --- | --- |
| L1 Norm | $$\Vert \mathbf{y} - \mathbf{y'} \Vert_1$$ |
| Mean Absolute Error (MAE) | $$\frac{1}{n} \sum_{i=1}^{n} \vert y_i - y'_i \vert$$ |
| L2 Norm | $$\Vert \mathbf{y} - \mathbf{y'} \Vert_2$$ |
| Mean Squared Error (MSE) | $$\frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2$$ |
| Root Mean Squared Error (RMSE) | $$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - y'_i)^2}$$ |
| Log loss | $$-\frac{1}{n} \sum_{i=1}^{n} [y_i \ln(y'_i) + (1 - y_i) \ln(1 - y'_i)]$$ |

where $y'$ is the predicted value and $y$ is the actual value.

The first five loss functions can be used in linear regression, while the log loss is for logistic regression. Feels like there’s a bit of a jump, right?
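To make the table concrete, here is a minimal numpy sketch that computes each of these losses. The array values are made up purely for illustration:

```python
import numpy as np

# Illustrative values only: y is the actual value, y_pred the prediction.
# For the regression losses both are continuous numbers.
y = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

l1_norm = np.sum(np.abs(y - y_pred))          # L1 norm
mae = np.mean(np.abs(y - y_pred))             # Mean Absolute Error
l2_norm = np.sqrt(np.sum((y - y_pred) ** 2))  # L2 norm
mse = np.mean((y - y_pred) ** 2)              # Mean Squared Error
rmse = np.sqrt(mse)                           # Root Mean Squared Error

# Log loss needs binary outcomes and predicted probabilities instead.
y_bin = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
log_loss = -np.mean(y_bin * np.log(p) + (1 - y_bin) * np.log(1 - p))

print(l1_norm, mae, l2_norm, mse, rmse, log_loss)
```

Notice that the last two lines already hint at the difference we are about to discuss: the log loss takes 0/1 outcomes and probabilities, not two numbers on the same scale.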

Let’s recall what prediction and actual values these models work with.

Linear regression predicts a continuous numerical value and the actual values are on the same scale. For example, we could have a model that predicts the value of a house based on its location. The difference between the predicted and actual values can simply be calculated by $y - y'$ because they are on the same scale. Since the goal is to report a loss representative of both negative and positive differences between the predicted and observed values, the loss functions remove the sign of the difference (positive/negative) by taking the absolute value or squaring the difference.

Logistic regression, on the other hand, predicts the probability of a positive outcome. This probability is a continuous value between 0 and 1. The actual value it is compared with, however, is not a continuous value. It can only be 1 (positive outcome) or 0 (negative outcome). For example, we have a bag of 10 blue and red balls, and we have a model that predicts the probability of us drawing a blue ball from the bag. The probability can range from 0 to 1, but the outcome is binary: blue (1) or red (0). The predicted and actual values are on different scales, which affects how we calculate the difference between the predicted and actual values.

This distinction between linear and logistic regression models is crucial. When I first came across the definition of loss functions as a means to calculate the difference between the predicted and actual values, the first thought that came to mind was that the difference is simply $y - y'$. This thought, however, assumes that $y$ (actual value) and $y'$ (prediction value) are on the same scale. The way difference/loss is calculated depends on the scales of the actual and prediction values. Since linear and logistic regression handle different values (especially in terms of the actual values they work with), the way loss is calculated is different. That’s the intuition behind why the log loss function looks so different from the five loss functions in the table.

The log loss function was derived from answering a fundamental question: how can we calculate the loss between a probability ranging from 0 to 1 and a binary outcome (0 or 1)?

As it turns out, this loss can be reframed as “what is the probability of outcome $y$ given probability $y'$?”, which can be nicely calculated with the probability mass function of a Bernoulli distribution. That’s our starting point, which eventually evolves into the log loss function, as we’ll see. In the subsequent sections, I’ll use $p$ to represent the probability calculated by the model instead of $y'$ so that distinguishing the actual value/outcome $y$ from the predicted value/probability $p$ is easier.

Binary classifications follow the Bernoulli distribution

Logistic regression models work on binary classification problems that follow the Bernoulli distribution. The probability of outcome $y$ given probability $p$ can be captured with the probability mass function:

$$ P(y|p) = p^y (1-p)^{(1-y)} $$

where:

  • $y$ is the actual outcome; a positive class is 1 and a negative class is 0
  • $p$ is the probability of the positive class

In our context, if the probability of the positive class is $p$, the probability of the negative class must be $1-p$. The equation captures both outcomes:

  • When $y = 1$ (positive class), the equation simplifies to $p^1(1-p)^0=p$.
  • When $y = 0$ (negative class), the equation simplifies to $p^0(1-p)^1=1-p$.
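A quick way to see this is to evaluate the probability mass function for both outcomes in code. A minimal sketch, where the function name and the value of $p$ are my own choices:

```python
def bernoulli_pmf(y, p):
    """Probability of outcome y (0 or 1) given predicted probability p."""
    return p**y * (1 - p) ** (1 - y)

p = 0.7                      # assumed probability of the positive class
print(bernoulli_pmf(1, p))   # 0.7   -> simplifies to p
print(bernoulli_pmf(0, p))   # ~0.3  -> simplifies to 1 - p
```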

Maximum likelihood

The probability mass function scores one prediction against one actual outcome. During training and evaluation, we typically assess more than one pair of predictions and outcomes. With our bag of balls, the assessment is similar to drawing a ball from the bag $n$ times with replacement. We want to run our prediction model $n$ times and check how the model does overall compared to our draws (the actual outcomes). This overall measure is known as the “likelihood”, which is the product of $P(y|p)$ at each draw:

$$ \begin{aligned} L(p) &= P(y_1|p_1) \times \dots \times P(y_n|p_n) \\ &= \prod_{i=1}^{n} P(y_i|p_i) \\ &= \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{(1 - y_i)} \end{aligned} $$

where:

  • $L(p)$ is the likelihood of the predicted probability.
  • $y$ values are the actual outcomes. Either 0 or 1.
  • $p$ is the predicted probability.

The more probabilities you multiply, the smaller the product becomes. This shrinking product becomes a computational problem known as “numerical underflow”: the likelihood gets so close to zero that floating-point numbers can no longer represent it accurately, and it eventually rounds to exactly zero. To avoid this (and to turn the awkward product into a sum), we can calculate the natural log of the likelihood:

$$ \begin{aligned} \ln(L(p)) &= \sum_{i=1}^{n} \left[ \ln\left(p_i^{y_i}\right) + \ln\left((1 - p_i)^{(1 - y_i)}\right) \right] \\ &= \sum_{i=1}^{n} \left[ y_i\ln(p_i) + (1-y_i)\ln(1-p_i) \right] \end{aligned} $$
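Here is a small numpy sketch of both quantities on made-up data. It also demonstrates the underflow problem: the raw product of many probabilities rounds to zero in floating point, while the sum of logs stays finite and easy to work with:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)                  # actual outcomes, 0 or 1
p = np.clip(rng.random(n), 1e-9, 1 - 1e-9)      # predicted probabilities

per_sample = p**y * (1 - p) ** (1 - y)          # P(y_i | p_i) for each draw

likelihood = np.prod(per_sample)                # underflows for large n
log_likelihood = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(likelihood)      # 0.0: the true value is below the smallest float64
print(log_likelihood)  # a large negative but perfectly representable number
```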

⭐️ Tip

Some formulas write $\log$ instead of $\ln$; in machine learning contexts this usually denotes the natural log, so the result is the same. From a coding perspective, most packages like TensorFlow, PyTorch and scikit-learn use the natural log.

The closer the predicted probability is to the ground truth, the higher the likelihood. If we drew blue-blue-red, we would expect that the probability of drawing a blue ball would be $p > 0.5$:

  • If $p = 0.9$, the likelihood would be $0.9 \times 0.9 \times (1-0.9) = 0.081$.
  • Conversely, if $p = 0.1$, the likelihood would be $0.1 \times 0.1 \times (1-0.1) = 0.009$.

In other words, the model does well when it maximises the likelihood.
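Here is a small check of those two numbers, plus a grid search (my own addition, purely for illustration) over candidate values of $p$ to see which one maximises the likelihood of the blue-blue-red draws:

```python
import numpy as np

y = np.array([1, 1, 0])  # blue, blue, red

def likelihood(p, y):
    """Product of Bernoulli probabilities of outcomes y given probability p."""
    return np.prod(p**y * (1 - p) ** (1 - y))

print(likelihood(0.9, y))   # ~0.081
print(likelihood(0.1, y))   # ~0.009

# Which p maximises the likelihood for these three draws?
grid = np.linspace(0.01, 0.99, 99)
best_p = grid[np.argmax([likelihood(p, y) for p in grid])]
print(best_p)               # ~0.67, i.e. the observed fraction of blue draws
```

The maximising value lands on the observed fraction of blue draws (2 out of 3), which is exactly what “maximum likelihood” promises.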

Converting likelihood to loss

Finally, to convert likelihood into loss, we can multiply the natural log of the likelihood by $-1$. To get the average loss per sample, we can further divide it by the $n$ samples:

$$ \text{Log loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i\ln(p_i) + (1-y_i)\ln(1-p_i) \right] $$

The purpose of this conversion is to align with the general idea of model performance. In linear regression, we saw that lower loss values are indicative of better model performance. If we used likelihood for assessing logistic regression, the opposite would be true; the higher the likelihood, the better the model performance. By making this conversion from likelihood to loss, we keep a similar grading system for model performance—the lower the loss score, the better the model performance.
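Putting it all together, here is a minimal implementation of the formula above (the variable names and sample values are mine). If you have scikit-learn installed, `sklearn.metrics.log_loss` should give the same value on data like this, since it also averages the negative log-likelihood:

```python
import numpy as np

def log_loss_manual(y, p):
    """Average negative log-likelihood of binary outcomes y given probabilities p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = [1, 0, 1, 1, 0]             # actual outcomes
p = [0.9, 0.2, 0.8, 0.6, 0.3]   # predicted probabilities of the positive class

good = log_loss_manual(y, p)
bad = log_loss_manual(y, [0.1, 0.8, 0.2, 0.4, 0.7])  # confidently wrong model

print(good, bad)  # the better model has the lower loss (closer to 0)

# Optional cross-check against scikit-learn:
# from sklearn.metrics import log_loss
# print(log_loss(y, p))  # should match log_loss_manual(y, p)
```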

That’s how the log loss formula is derived. Or at least how I think it’s derived. 🌸