08. Measuring Performance
Introduction
With sufficient capacity (i.e., number of hidden units), a neural network model will often fit the training data perfectly. However, this does not necessarily mean it will generalize well to new test data. We will see that test error has three distinct causes and that their relative contributions depend on (i) the inherent uncertainty in the task, (ii) the amount of training data, and (iii) the choice of model. We also discuss how to select both the model hyperparameters and the learning algorithm hyperparameters.
Sources of Error
Consider a quasi-sinusoidal function; both training and test data are generated by sampling input values in the range [0, 1], passing them through this function, and adding Gaussian noise with a fixed variance.
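As a concrete reference for the rest of the section, here is a minimal Python sketch of this data-generating process. The particular quasi-sinusoidal function, the noise standard deviation, and the helper names (`true_function`, `generate_data`) are illustrative assumptions, not the exact setup behind the book's figures.

```python
# Minimal sketch of the data-generating process described above.
import numpy as np

def true_function(x):
    # Hypothetical quasi-sinusoidal ground truth on [0, 1] (an assumption).
    return np.sin(2.0 * np.pi * x) + 0.3 * np.sin(6.0 * np.pi * x)

def generate_data(n, noise_std=0.2, rng=None):
    # Sample inputs uniformly in [0, 1], evaluate the true function,
    # and add Gaussian noise with fixed variance noise_std**2.
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 1.0, size=n)
    y = true_function(x) + rng.normal(0.0, noise_std, size=n)
    return x, y

x_train, y_train = generate_data(15, rng=0)
x_test, y_test = generate_data(200, rng=1)
```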
- Noise
- The data generation process includes the addition of noise, so there are multiple possible valid outputs y for each input x.
- Noise may arise because there is a genuine stochastic element to the data generation process, because some of the data are mislabeled, or because there are further explanatory variables that were not observed.
- Whatever its cause, noise usually represents a fundamental limit on the possible test performance.
- Bias
- A second source of error occurs when the model is not flexible enough to fit the true function perfectly.
- For example, the three-region neural network model cannot exactly describe the quasi-sinusoidal function, even when the parameters are chosen optimally (figure 8.5b). This is known as bias.
- Variance
- We have limited training examples, and there is no way to distinguish systematic changes in the underlying function from noise in the data.
- When we fit a model, we therefore do not obtain the closest possible approximation to the true underlying function.
- This additional variability in the fitted function is termed variance. In practice, there might also be additional variance due to the stochastic learning algorithm, which does not necessarily converge to the same solution each time. The simulation sketch after this list illustrates all three sources of error.
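Using the simulated data from the earlier sketch, the three components can be estimated empirically by refitting a model to many independently sampled training sets. The sketch below uses a low-degree polynomial as a stand-in for a low-capacity network; the helpers `fit_poly` and `decompose`, and the reuse of `generate_data`/`true_function`, are illustrative assumptions rather than anything from the book.

```python
# Hedged simulation of the three error sources, reusing generate_data and
# true_function from the sketch above.
import numpy as np

def fit_poly(x, y, degree):
    return np.polyfit(x, y, degree)            # least-squares polynomial fit

def decompose(degree, n_train=15, n_datasets=500, noise_std=0.2):
    x_grid = np.linspace(0.0, 1.0, 100)        # evaluation points
    preds = np.empty((n_datasets, x_grid.size))
    for i in range(n_datasets):
        x, y = generate_data(n_train, noise_std, rng=i)
        preds[i] = np.polyval(fit_poly(x, y, degree), x_grid)
    mean_pred = preds.mean(axis=0)
    bias2 = np.mean((mean_pred - true_function(x_grid)) ** 2)   # bias^2
    variance = preds.var(axis=0).mean()                         # variance
    noise = noise_std ** 2                                      # irreducible
    return noise, bias2, variance

print(decompose(degree=3))   # expected test MSE ≈ noise + bias^2 + variance
```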
Reducing Error
We saw that test error results from three sources: noise, bias, and variance. The noise component is irreducible: there is nothing we can do about it, and it represents a fundamental limit on model performance. However, it is possible to reduce the other two terms.
Reducing Variance
Recall that the variance results from fitting the model to limited, noisy training data: fitting the same model to two different training sets yields slightly different parameters. It follows that we can reduce the variance by increasing the quantity of training data. This averages out the inherent noise and ensures that the input space is well sampled. In general, adding training data almost always improves test performance.
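As a quick check, the `decompose` sketch above can be rerun with increasing amounts of training data; under the stated assumptions, the variance estimate should shrink while the noise and bias terms stay roughly fixed.

```python
# Variance vs. training-set size, reusing decompose() from the earlier sketch.
for n_train in (10, 40, 160, 640):
    noise, bias2, variance = decompose(degree=3, n_train=n_train)
    print(f"n_train={n_train:4d}  bias^2={bias2:.4f}  variance={variance:.4f}")
# Typically, variance falls toward zero while bias^2 and noise remain roughly constant.
```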
Reducing Bias
The bias term results from the inability of the model to describe the true underlying function. This suggests that we can reduce this error by making the model more flexible. This is usually done by increasing the model capacity. For neural networks, this means adding more hidden units and/or hidden layers.
Bias-Variance Trade-off
For a fixed-size training dataset, the variance term increases as the model capacity increases. Consequently, increasing the model capacity does not necessarily reduce the test error. This is known as the bias-variance trade-off.
As we add capacity to the model, the bias decreases, but the variance increases for a fixed-size training dataset. This suggests that there is an optimal capacity where the bias is not too large and the variance is still relatively small.
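The same simulation can sweep capacity instead of dataset size. Using polynomial degree as a rough proxy for capacity (an assumption carried over from the earlier sketches), the bias term shrinks and the variance term grows, and their sum plus the noise floor typically traces the U-shaped test error described above.

```python
# Capacity sweep for a fixed-size training set, reusing decompose().
for degree in (1, 2, 3, 5, 8, 12):
    noise, bias2, variance = decompose(degree=degree, n_train=15)
    expected_test_mse = noise + bias2 + variance
    print(f"degree={degree:2d}  bias^2={bias2:.4f}  "
          f"variance={variance:.4f}  expected test MSE={expected_test_mse:.4f}")
# Bias^2 decreases with degree while variance increases; the sum is usually
# smallest at an intermediate capacity.
```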
Double Descent
Test error decreases as we add capacity to the model, until the model starts to overfit the training data and the error rises. Then it does something unexpected: it starts to decrease again. Indeed, if we add enough capacity, the test error falls below the minimum achieved in the first part of the curve. This phenomenon is known as double descent.
The first part of the curve is referred to as the classical or under-parameterized regime, and the second part as the modern or over-parameterized regime. The central part where the error increases is termed the critical regime.
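Double descent is easiest to reproduce in simplified settings. The hedged sketch below uses random Fourier features fit by minimum-norm least squares (which `np.linalg.lstsq` returns in the under-determined case) as a proxy for varying network width; the frequency scale, feature counts, and reuse of `generate_data` are assumptions, and the exact shape of the curve depends on the noise level and random seed.

```python
# Hedged sketch of one setting where double descent is often visible:
# random Fourier features fit by minimum-norm least squares.
import numpy as np

x_tr, y_tr = generate_data(30, rng=2)     # 30 training points
x_te, y_te = generate_data(500, rng=3)    # large test set

for n_features in (5, 15, 25, 30, 35, 60, 200, 1000):
    rng_f = np.random.default_rng(4)                  # same features for train/test
    w = rng_f.normal(0.0, 10.0, size=n_features)      # random frequencies
    b = rng_f.uniform(0.0, 2.0 * np.pi, size=n_features)
    phi_tr = np.cos(np.outer(x_tr, w) + b)            # (n_train, n_features)
    phi_te = np.cos(np.outer(x_te, w) + b)
    coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    test_mse = np.mean((phi_te @ coef - y_te) ** 2)
    print(f"n_features={n_features:5d}  test MSE={test_mse:.3f}")
# The test error typically peaks near n_features ≈ n_train (the critical regime)
# and then falls again in the over-parameterized regime.
```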
Explanation
The discovery of double descent is recent, unexpected, and somewhat puzzling. It results from an interaction of two phenomena. First, the test performance becomes temporarily worse when the model has just enough capacity to memorize the data. Second, the test performance continues to improve with capacity even after the training performance is perfect.
Once the model fits the training data almost perfectly, adding further capacity cannot improve the fit to the training data; any remaining differences in test performance must come from how the model behaves between and beyond the training points. The tendency of a model to prioritize one solution over another as it extrapolates between data points is known as its inductive bias.
As the input dimension grows, the volume of the space quickly overwhelms the number of training points; this is termed the curse of dimensionality. Consequently, the data occupy only small regions of input space, with significant gaps between them. The putative explanation for double descent is that as we add capacity to the model, it interpolates between the nearest data points increasingly smoothly.
It’s certainly true that as we add more capacity to the model, it will have the capability to create smoother functions. When the number of parameters is very close to the number of training data examples, the model is forced to contort itself to fit the training data exactly, resulting in erratic predictions. As we add more hidden units, the model has the ability to construct smoother functions that are likely to generalize better to new data.
However, this does not explain why over-parameterized models should produce smooth functions. The answer to this question is uncertain, but there are two likely possibilities. First, the network initialization may encourage smoothness, and the model never departs from the sub-domain of smooth functions during the training process. Second, the training algorithm may somehow “prefer” to converge to smooth functions.
Choosing Hyperparameters
In the classical regime, we don’t have access to either the bias (which requires knowledge of the true underlying function) or the variance (which requires multiple independently sampled datasets to estimate). In the modern regime, there is no way to tell how much capacity should be added before the test error stops improving. This raises the question of exactly how we should choose model capacity in practice.
For a deep network, the model capacity depends on both the number of hidden layers and the number of hidden units per layer. Furthermore, the choice of learning algorithm and any associated parameters (e.g., the learning rate) also affects the test performance.
The hyperparameter space is generally smaller than the parameter space but still too large to try every combination exhaustively. Hyperparameter optimization algorithms intelligently sample the space of hyperparameters, contingent on previous results. This procedure is computationally expensive since we must train an entire model and measure the validation performance for each combination of hyperparameters.
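In practice this usually means holding out a validation set and searching the hyperparameter space, for example by random search. The sketch below is a minimal illustration using scikit-learn's `MLPRegressor` as a stand-in network and the earlier `generate_data` helper; the search ranges, number of trials, and dataset sizes are all illustrative assumptions.

```python
# Hedged sketch of random hyperparameter search with a held-out validation set.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x_tr, y_tr = generate_data(200, rng=5)      # training set
x_val, y_val = generate_data(100, rng=6)    # validation set

best = None
for trial in range(20):
    # Sample one hyperparameter combination at random.
    hidden_units = int(rng.integers(5, 200))
    n_layers = int(rng.integers(1, 4))
    learning_rate = 10 ** rng.uniform(-4, -1)
    model = MLPRegressor(hidden_layer_sizes=(hidden_units,) * n_layers,
                         learning_rate_init=learning_rate,
                         max_iter=2000)
    model.fit(x_tr.reshape(-1, 1), y_tr)
    val_mse = np.mean((model.predict(x_val.reshape(-1, 1)) - y_val) ** 2)
    if best is None or val_mse < best[0]:
        best = (val_mse, hidden_units, n_layers, learning_rate)

print("best (val MSE, hidden units, layers, learning rate):", best)
```

Each trial trains a complete model and evaluates it on the validation set, which is why hyperparameter search is computationally expensive; more sophisticated hyperparameter optimization algorithms choose the next combination to try based on the results so far.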