05. Loss Functions

Introduction

In machine learning (ML), a loss function, also known as a cost function or objective function, measures the difference between the predicted values generated by a model and the actual ground truth values. Loss functions play a crucial role in training models by quantifying how well the model is performing. The goal during training is to minimize this loss function, thereby improving the model's ability to make accurate predictions.

A loss function or cost function \(L(\phi)\) returns a single number that describes the mismatch between the model predictions \(f[x_i, \phi]\) and their corresponding ground-truth outputs \(y_i\). During training, we seek parameter values \(\phi\) that minimize the loss and hence map the training inputs to the outputs as closely as possible.

Maximum Likelihood

Consider a model \(f[x, \phi]\) with parameters \(\phi\) that computes an output from input x. Consider the model as computing a conditional probability distribution \(Pr(y|x)\) over possible outputs y given input x. The loss encourages each training output \(y_i\) to have a high probability under the distribution \(Pr(y_i | x_i)\) computed from the corresponding input \(x_i\).

Computing a distribution over outputs

This raises the question of exactly how a model \( f[x, \phi] \) can be adapted to compute a probability distribution. The solution is simple. First, we choose a parametric distribution \( Pr(y | \theta) \) defined on the output domain y. Then we use the network to compute one or more of the parameters θ of this distribution.

Maximum likelihood criterion

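We now choose the model parameters \(\phi\) so that they maximize the combined probability across all \(I\) training examples:

$$ \hat \phi = \underset{\phi}{\operatorname{argmax}} \left[ \prod\limits_{i=1}^{I} Pr(y_i | f[x_i, \phi]) \right] \tag{5.1} $$

The combined probability term is the likelihood of the parameters, and hence equation 5.1 is known as the maximum likelihood criterion.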

Here we are implicitly making two assumptions. First, we assume that the data are identically distributed. Second, we assume that the conditional distributions \( Pr(y_i | x_i) \) of the output given the input are independent, so the total likelihood of the training data decomposes as

$$ Pr(y_1, y_2, \ldots, y_I | x_1, x_2, \ldots, x_I) = \prod\limits_{i=1}^{I} Pr(y_i | x_i) $$

In other words, we assume the data are independent and identically distributed (i.i.d.).

Maximizing Log likelihood

The maximum likelihood criterion (equation 5.1) is not very practical. Each term \(Pr(y_i | f[x_i, \phi])\) can be small, so the product of many of these terms can be tiny. It may be difficult to represent this quantity with finite-precision arithmetic. Fortunately, we can equivalently maximize the logarithm of the likelihood.

\[ \begin{align} \hat \phi &= \underset{\phi}{\operatorname{argmax}} \left[ \prod\limits_{i=1}^{I} Pr(y_i | f[x_i, \phi]) \right] \\ &= \underset{\phi}{\operatorname{argmax}} \left[ \log \left[ \prod\limits_{i=1}^{I} Pr(y_i | f[x_i, \phi]) \right] \right] \\ &= \underset{\phi}{\operatorname{argmax}} \left[ \sum\limits_{i=1}^{I} \log \left[ Pr(y_i | f[x_i, \phi]) \right] \right] \end{align} \]

Since the logarithm is a monotonically increasing function, it follows that when we change the model parameters \(\phi\) to improve the log-likelihood criterion, we also improve the original maximum likelihood criterion, so both criteria reach their maximum at the same parameters. However, the log-likelihood criterion has the practical advantage of using a sum of terms, not a product, so representing it with finite precision isn’t problematic.
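The precision problem is easy to reproduce. The sketch below uses 200 made-up per-example probabilities of 0.01 each (an assumption chosen only for illustration, not data from the text) and compares the raw product with the sum of logs in double precision.

```python
import math

# Hypothetical per-example probabilities Pr(y_i | f[x_i, phi]); values are made up.
probs = [0.01] * 200

# The true product is 1e-400, which is smaller than any positive double,
# so the computed likelihood underflows.
likelihood = math.prod(probs)

# The sum of logs is an ordinary, well-scaled negative number.
log_likelihood = sum(math.log(p) for p in probs)

print("likelihood:", likelihood)
print("log-likelihood:", log_likelihood)
```

Once the product has underflowed, every further comparison between parameter settings is lost, whereas the log-likelihood still distinguishes them.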

Minimizing Negative log likelihood

Finally, we note that, by convention, model fitting problems are framed in terms of minimizing a loss. To convert the maximum log-likelihood criterion to a minimization problem, we multiply by minus one, which gives us the negative log-likelihood criterion

\[ \begin{align} \hat \phi &= \underset{\phi}{\operatorname{argmin}} \left[ - \sum\limits_{i=1}^{I} \log \left[ Pr(y_i | f[x_i, \phi]) \right] \right] \\ &= \underset{\phi}{\operatorname{argmin}} \left[ L[\phi] \right] \end{align} \]
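As a rough check that maximizing the likelihood and minimizing the negative log-likelihood select the same parameters, the sketch below runs a grid search over a toy problem. The data, the constant-output "model", and the names `prob`, `likelihood`, and `negative_log_likelihood` are all assumptions made for illustration.

```python
import math

# Made-up binary observations y_i. The toy "model" ignores the input and
# outputs a constant probability phi that y = 1 (an illustrative assumption).
ys = [1, 1, 0, 1, 0, 1, 1, 1]

def prob(y, phi):
    """Pr(y | phi) for the toy model."""
    return phi if y == 1 else 1.0 - phi

def likelihood(phi):
    """Product of the per-example probabilities."""
    return math.prod(prob(y, phi) for y in ys)

def negative_log_likelihood(phi):
    """Minus the sum of the per-example log probabilities."""
    return -sum(math.log(prob(y, phi)) for y in ys)

# Candidate parameter values 0.05, 0.10, ..., 0.95.
candidates = [i / 20 for i in range(1, 20)]

best_by_likelihood = max(candidates, key=likelihood)
best_by_nll = min(candidates, key=negative_log_likelihood)

print(best_by_likelihood, best_by_nll)
```

Both searches land on the same candidate, which here is the fraction of ones in the data (6 out of 8).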

Inference

The network no longer directly predicts the outputs y but instead determines a probability distribution over y. When we perform inference, we often want a point estimate rather than a distribution, so we return the maximum of the distribution

\[ \hat y = \underset{y}{\operatorname{argmax}} \left [ Pr(y | f[x, \phi])\right ] \]

Univariate Regression

Least Squares

The goal is to predict a single scalar output \( y \in \mathbb{R} \) from input x using a model \( f(x, \phi) \) with parameters \( \phi \). Let's select the probability distribution over the output domain \( y \) to be a univariate normal distribution. This distribution has two parameters, the mean \(\mu\) and the variance \(\sigma^2\), and has the probability density function

$$ Pr(y | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(y - \mu)^2}{2\sigma^2}} $$

We set the machine learning model to predict one or more parameters of this distribution. Here we just compute the mean \(\mu = f(x, \phi)\), so the equation above becomes

$$ Pr(y | f(x, \phi), \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \cdot e^{-\frac{(y - f(x, \phi))^2}{2\sigma^2}} $$

Now we aim to find the parameters \(\phi\) that make the training data \(\{x_i, y_i\}\) most probable under this distribution. We choose a loss function \(L(\phi)\) based on the negative log-likelihood:

\[ \begin{align} L(\phi) &= -\sum_{i=1}^{I} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i, \phi))^2}{2\sigma^2}\right) \right] \\ \hat \phi &= \underset{\phi}{\operatorname{argmin}} \left[ -\sum_{i=1}^{I} \left( \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \right] - \frac{(y_i - f(x_i, \phi))^2}{2\sigma^2} \right) \right] \\ &= \underset{\phi}{\operatorname{argmin}} \left[ \sum_{i=1}^{I} \frac{(y_i - f(x_i, \phi))^2}{2\sigma^2} \right] \\ &= \underset{\phi}{\operatorname{argmin}} \left[ \sum_{i=1}^{I} (y_i - f(x_i, \phi))^2 \right] \end{align} \]
  • We have removed the first term because it doesn't depend on the model parameters \(\phi\).
  • We have removed the denominator because it is just a positive scaling factor and doesn't affect the position of the minimum.
  • We observe that the least squares loss function follows naturally from the assumptions that the prediction errors are independent and drawn from a normal distribution with mean \(\mu = f(x, \phi)\).
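The equivalence in the derivation above can be checked numerically. The sketch below assumes a hypothetical one-parameter model `f(x, phi) = phi * x` and four made-up training pairs, and compares the grid minimizer of the sum of squares with the grid minimizer of the full Gaussian negative log-likelihood for two different fixed variances.

```python
import math

# Made-up training pairs and a hypothetical one-parameter model (assumptions).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.9]

def f(x, phi):
    return phi * x

def sum_of_squares(phi):
    """Least squares loss."""
    return sum((y - f(x, phi)) ** 2 for x, y in zip(xs, ys))

def gaussian_nll(phi, sigma2):
    """Negative log-likelihood under a normal with mean f(x, phi) and variance sigma2."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + (y - f(x, phi)) ** 2 / (2 * sigma2)
        for x, y in zip(xs, ys)
    )

# Grid of candidate parameters 0.00, 0.01, ..., 2.00.
candidates = [i / 100 for i in range(201)]

best_ls = min(candidates, key=sum_of_squares)
best_nll_small = min(candidates, key=lambda phi: gaussian_nll(phi, 0.1))
best_nll_large = min(candidates, key=lambda phi: gaussian_nll(phi, 10.0))

print(best_ls, best_nll_small, best_nll_large)
```

The three minimizers coincide: the variance changes the value of the loss but not where its minimum lies.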

Inference

The network no longer directly predicts y but instead predicts the mean \(\mu = f[x, \phi]\) of the normal distribution over y. When we perform inference, we usually want a single “best” point estimate \(\hat y\), so we take the maximum of the predicted distribution:

$$ \hat y = \underset{y}{\operatorname{argmax}} \left[ Pr(y | f[x, \hat \phi], \sigma^2) \right] $$

For the univariate normal, the maximum position is determined by the mean parameter \(\mu\). This is precisely what the model computed, so \(\hat y = f[x, \hat \phi]\).
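A small numerical illustration of this point: searching a grid of candidate outputs for the highest normal density returns the mean. The values of `mu` and `sigma2` below are hypothetical stand-ins for a model output and a variance.

```python
import math

def normal_pdf(y, mu, sigma2):
    """Density of a univariate normal with mean mu and variance sigma2."""
    return math.exp(-((y - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

mu = 1.5       # hypothetical model output f[x, phi_hat]
sigma2 = 0.4   # hypothetical variance

# Candidate outputs 0.00, 0.01, ..., 3.00.
grid = [i / 100 for i in range(301)]

# Point estimate: the candidate with the highest density.
y_hat = max(grid, key=lambda y: normal_pdf(y, mu, sigma2))

print(y_hat)
```

In practice no search is needed, since the maximum of the normal density is always at its mean.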

Estimate Variance

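To formulate the least squares loss function, we assumed that the network predicted the mean of a normal distribution. The final least squares expression does not depend on the variance \(\sigma^2\), but nothing stops us from treating \(\sigma^2\) as a learned parameter and minimizing the negative log-likelihood with respect to both \(\phi\) and \(\sigma^2\). In inference, the model then predicts the mean \(\mu = f[x, \hat \phi]\) from the input, and we use the variance \(\hat \sigma^2\) learned during training. The former is the best prediction. The latter tells us about the uncertainty of the prediction. The minimization becomes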

\[ \hat \phi, \hat \sigma^2 = \underset{\phi, \sigma^2}{\operatorname{argmin}} \left[ -\sum_{i=1}^{I} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - f(x_i, \phi))^2}{2\sigma^2}\right) \right] \right] \]
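For fixed model parameters, setting the derivative of this negative log-likelihood with respect to \(\sigma^2\) to zero gives the mean squared residual as the minimizing variance. The sketch below checks this on made-up data, assuming a hypothetical model `f(x, phi) = phi * x` whose least squares fit on this data is `phi_hat = 1.0`.

```python
import math

# Made-up training pairs; phi_hat is the least squares fit of the
# hypothetical model f(x, phi) = phi * x to them (assumptions).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.9]
phi_hat = 1.0

residuals = [y - phi_hat * x for x, y in zip(xs, ys)]

def gaussian_nll(sigma2):
    """Negative log-likelihood of the residuals as a function of the variance."""
    return sum(
        0.5 * math.log(2 * math.pi * sigma2) + r ** 2 / (2 * sigma2)
        for r in residuals
    )

# Closed-form minimizer: the mean squared residual.
sigma2_hat = sum(r ** 2 for r in residuals) / len(residuals)

print("estimated variance:", sigma2_hat)
print("NLL at estimate:", gaussian_nll(sigma2_hat))
print("NLL at half / double:", gaussian_nll(0.5 * sigma2_hat), gaussian_nll(2 * sigma2_hat))
```

Moving the variance away from the estimate in either direction increases the negative log-likelihood.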

Heteroscedastic regression

The model above assumes that the variance of the data is constant everywhere. However, this might be unrealistic. When the uncertainty of the model varies as a function of the input data, we refer to this as heteroscedastic (as opposed to homoscedastic, where the uncertainty is constant).
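One way to sketch a heteroscedastic model is to let it return two values per input, a mean and a variance, and to score both with the same Gaussian negative log-likelihood. Everything in the example below is an assumption for illustration: the data are made up, `model` is a hand-written stand-in for a trained network, and squaring the second output is just one possible way to keep the variance positive.

```python
import math

def gaussian_nll(y, mu, sigma2):
    """Negative log-likelihood of one observation under a normal distribution."""
    return 0.5 * math.log(2 * math.pi * sigma2) + (y - mu) ** 2 / (2 * sigma2)

def model(x):
    """Hypothetical two-output model returning (mean, variance) for input x.

    The second raw output is squared so that the variance is positive
    (an illustrative choice, not the only option).
    """
    mean = x
    raw = 0.1 if x < 2.0 else 1.0
    return mean, raw ** 2

# Made-up data: small noise for x < 2, large noise for x >= 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 3.0, 2.0]

# Heteroscedastic loss: each example is scored with its own predicted variance.
hetero_nll = sum(gaussian_nll(y, *model(x)) for x, y in zip(xs, ys))

# Homoscedastic baseline: same means, one shared variance (mean squared residual).
residuals = [y - model(x)[0] for x, y in zip(xs, ys)]
shared_sigma2 = sum(r ** 2 for r in residuals) / len(residuals)
homo_nll = sum(gaussian_nll(y, model(x)[0], shared_sigma2) for x, y in zip(xs, ys))

print("heteroscedastic NLL:", hetero_nll)
print("homoscedastic NLL:", homo_nll)
```

On this data the input-dependent variance gives a lower negative log-likelihood than the best single shared variance, because it can be confident where the noise is small and uncertain where it is large.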

Binary Classification

In binary classification the goal is to assign the data x to one of two discrete classes \(y \in \{0,1\}\). First we choose a probability distribution over the output space \(y \in \{0,1\}\). A suitable choice is the Bernoulli distribution, which has a single parameter \(\lambda\) that represents the probability that \(y=1\).

\[ Pr(y | \lambda) = \begin{cases} \lambda & \text{if } y = 1 \\ 1 - \lambda & \text{if } y = 0 \end{cases} \]

The equation above can also be written as

$$ Pr(y | \lambda) = (1-\lambda)^{1-y} \cdot \lambda^y $$

We set the machine learning model to predict the single parameter \(\lambda\). Since \(\lambda\) must lie in the range [0, 1], we pass the network output through the sigmoid function \(sig[\cdot]\), which maps any real number into this range. The likelihood of a single example is then

$$ Pr(y | x) = (1-sig[f[x, \phi]])^{1-y} \cdot sig[f[x, \phi]]^y $$

Negative log likelihood for a training set would be

\[ L[\phi] = \sum\limits_{i=1}^{I} - (1-y_i) \log \left[ 1-sig[f[x_i, \phi]] \right] - y_i \log \left[ sig[f[x_i, \phi]] \right] \]

The transformed model output \(sig[f[x, \phi]]\) predicts the parameter \(\lambda\) of the Bernoulli distribution. This represents the probability that y = 1, and it follows that 1 − \(\lambda\) represents the probability that y = 0. When we perform inference, we may want a point estimate of y, so we set y = 1 if \(\lambda\) > 0.5 and y = 0 otherwise.
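The loss and the thresholding rule can be sketched in a few lines. The logits and labels below are made up, and the names `sigmoid`, `binary_nll`, and `predict` are chosen for this illustration. The sketch is not numerically hardened: for very large positive or negative logits the logarithm would receive zero.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_nll(logits, labels):
    """Negative log-likelihood of binary labels under a Bernoulli distribution."""
    total = 0.0
    for z, y in zip(logits, labels):
        lam = sigmoid(z)  # predicted probability that y = 1
        total += -(1 - y) * math.log(1.0 - lam) - y * math.log(lam)
    return total

def predict(logits):
    """Point estimates: y = 1 if lambda > 0.5, otherwise y = 0."""
    return [1 if sigmoid(z) > 0.5 else 0 for z in logits]

logits = [2.0, -1.0, 0.5]   # hypothetical network outputs f[x_i, phi]
labels = [1, 0, 1]          # made-up ground-truth classes

loss = binary_nll(logits, labels)
predictions = predict(logits)

print("loss:", loss)
print("predictions:", predictions)
```

Because the sigmoid equals 0.5 exactly at zero, thresholding \(\lambda\) at 0.5 is the same as checking the sign of the raw network output.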

Cross-entropy loss

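The cross-entropy loss is based on the idea of finding parameters \(\theta\) that minimize the distance between the empirical distribution \(q(y)\) of the observed data y and a model distribution \(Pr(y|\theta)\). The distance between two probability distributions \(q(z)\) and \(p(z)\) can be evaluated using the Kullback-Leibler (KL) divergence: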

\[ D_{\text{KL}}[q || p] = \int_{-\infty}^{\infty} q(z) \log \left( \frac{q(z)}{p(z)} \right) \, dz \]

Now consider that we observe an empirical data distribution at points \(\{y_i\}_{i=1}^{I}\). We can describe this as a weighted sum of point masses:

$$ q(y) = \frac{1}{I} \sum\limits_{i=1}^{I} \delta[y-y_i] $$

where \(\delta\) is the Dirac delta function. We want to minimize the KL divergence between the model distribution \(Pr(y | \theta)\) and this empirical distribution:

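\[ \begin{align} \hat \theta &= \underset{\theta}{\operatorname{argmin}} \left[ \int_{-\infty}^{\infty} q(y) \log \left( \frac{q(y)}{Pr(y | \theta)} \right) \, dy \right] \\ &= \underset{\theta}{\operatorname{argmin}} \left[ \int_{-\infty}^{\infty} q(y) \log \left( q(y) \right) \, dy - \int_{-\infty}^{\infty} q(y) \log \left( Pr(y | \theta) \right) \, dy \right] \\ &= \underset{\theta}{\operatorname{argmin}} \left[ - \int_{-\infty}^{\infty} q(y) \log \left( Pr(y | \theta) \right) \, dy \right] \end{align} \]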

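The first term disappears since it doesn't depend on the model parameters \(\theta\). The remaining second term is known as the cross-entropy. It can be interpreted as the amount of uncertainty that remains in one distribution after taking into account what we already know from the other. Substituting the point-mass definition of \(q(y)\) turns the integral into \(-\frac{1}{I} \sum_{i=1}^{I} \log \left( Pr(y_i | \theta) \right)\), so minimizing the cross-entropy is equivalent to minimizing the negative log-likelihood.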
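The link between cross-entropy and negative log-likelihood can be checked on a small discrete example, where the integral becomes a sum over outcomes. The observations and model probabilities below are made up, and the discrete setting itself is an assumption made to keep the sketch short.

```python
import math
from collections import Counter

# Made-up observations and a hypothetical model distribution Pr(y | theta).
observations = ["a", "b", "a", "c"]
model_probs = {"a": 0.5, "b": 0.3, "c": 0.2}

# Empirical distribution q(y): each observation carries weight 1/I.
counts = Counter(observations)
q = {y: n / len(observations) for y, n in counts.items()}

# Cross-entropy between q and the model distribution.
cross_entropy = -sum(q[y] * math.log(model_probs[y]) for y in q)

# Average negative log-likelihood of the observations under the model.
mean_nll = -sum(math.log(model_probs[y]) for y in observations) / len(observations)

print(cross_entropy, mean_nll)
```

The two numbers agree up to floating-point rounding, since grouping identical observations only reorders the terms of the same sum.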

References

  1. Understanding Deep Learning
  2. Maximum Likelihood