01/06/2018
So far, we’ve looked at linear regression and K-Nearest Neighbors as potential models for estimating real-valued data. But how do we know which model is best to use? In this post, we discuss overfitting, the bias-variance decomposition, and regularization as factors to consider when choosing a model.
When choosing a model, it might be tempting to select the one with the minimum loss. After all, if our model matches our data perfectly, why wouldn’t we use it? Unfortunately, this leads to the problem of overfitting, where the model adheres too closely to the data we happen to possess. Overfitting can occur for several reasons:
Insufficient data: the data doesn’t capture enough variation. If we roll a die 6 times and all six rolls happen to be odd, our model might conclude that the probability of rolling an even number is 0.
Noise: we might mistake noise for a genuine property of the data. With high noise, a high-degree polynomial regression might achieve a lower training loss than a linear model, even though the linear model would generalize perfectly well.
Irrelevant features: our model might rely on features that are correlated with the output but have little to do with it. For example, if it tends to rain on days a teacher wears red socks, we still wouldn’t care about the teacher’s sock color, since it’s absurd to think the two are related (unless the teacher chooses to wear red socks on rainy days).
To prevent overfitting, we need to make decisions about what data is relevant to our problem and how much we should believe in our judgment.
To measure our model’s performance, we often split our data into a training set, a validation set, and a test set. The training set is used to fit each candidate model, the validation set is used to compare the models, and the test set is used to estimate the error of the final model. The idea is that we train different models on the training set, compare them based on their error on the validation set (the validation error), and finally select a model (usually the one with the minimum validation error), reporting its error on the test set (the test error). Notice that we do not look at the test set until the model has been selected. The test set is meant to simulate the performance of the model on new, unforeseen data. If we were to select or tune our model based on its test-set performance, we would essentially be cheating about the efficacy of our model. This is known as peeking and should be avoided. The test error is an unbiased estimate of the generalization error, the prediction error on unseen data, whereas the validation error is a biased estimate of the generalization error (since we picked the model that minimized it).
When considering how exactly to split your data, there are two factors to keep in mind. If your training set is too small, then none of your models will learn very well. However, if your validation set is too small, then the performance statistic will have high variance. A safe split of the data would be 60% train, 20% validation, 20% test.
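For concreteness, a minimal NumPy sketch of such a split might look like the following, where `X` and `y` are placeholder arrays of inputs and targets:

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data, then carve off validation and test sets (60/20/20 by default)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```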
Sometimes, you might not have enough data to split it into the three groups. One solution to this problem is to perform $k$-fold cross validation. The idea is to split your data into $k$ partitions, assign $k-1$ of them to the training set, assign the remaining one to the validation set, and measure your model’s error. The average of the errors over the $k$ different configurations is your model’s validation error. This is a standard procedure, even though the $k$ experiments are no longer independent.
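A minimal sketch of this procedure, where `fit_model` and `mse` are hypothetical stand-ins for whatever training routine and loss function are in use:

```python
import numpy as np

def k_fold_cv_error(X, y, fit_model, mse, k=5, seed=0):
    """Average the validation error over k train/validation splits.

    fit_model(X_train, y_train) should return a predictor h with h(X) -> predictions,
    and mse(y_true, y_pred) computes the loss; both are placeholders here.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        h = fit_model(X[train_idx], y[train_idx])
        errors.append(mse(y[val_idx], h(X[val_idx])))
    return float(np.mean(errors))
```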
$k$-fold cross validation makes better use of limited data, but we then have to train each model $k$ times, once per configuration, greatly increasing training time. For a better handle on improving generalization, we need to understand the tradeoff between bias and variance.
Bias is systematic error, the error that arises because the model tends to deviate from the observed data. Variance is the sensitivity of the prediction, or how much influence an individual data point has on the model. Simpler models tend to underfit the data: they deviate from the provided data (high bias) but are robust to outliers (low variance). More complex models tend to overfit the data: they adhere closely to the data (low bias) but are sensitive to outliers (high variance).
The generalization error can be summed up as the following:
\[\text{generalization error} = \text{noise} + \text{bias}^2 + \text{variance}\]
The right tradeoff between bias and variance depends on the amount of data. With more data, we can afford to use more complex models.
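To see the tradeoff in action, here is a small illustrative experiment (the sine data, noise level, and polynomial degrees are arbitrary choices): training error typically keeps falling as the degree grows, while validation error eventually rises again once the model starts fitting noise.

```python
import numpy as np

# Fit polynomials of increasing degree and compare training vs. validation error.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * X) + rng.normal(0, 0.3, X.size)   # noisy samples of an assumed true function
X_train, y_train, X_val, y_val = X[:15], y[:15], X[15:], y[15:]

for degree in (1, 3, 9):
    coeffs = np.polyfit(X_train, y_train, degree)     # train only on the training set
    train_err = np.mean((np.polyval(coeffs, X_train) - y_train) ** 2)
    val_err = np.mean((np.polyval(coeffs, X_val) - y_val) ** 2)
    print(f"degree {degree}: train MSE {train_err:.3f}, validation MSE {val_err:.3f}")
```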
Here, we detail the mathematics behind the bias-variance decomposition. We’ll define the following variables:
$D$, the training data, treated as a random variable
$h_D: \mathcal{X} \to \mathbb{R}$, the model trained on $D$
$\mathbf{x}$, a new input feature vector
$y$, the true target value at $\mathbf{x}$, treated as a random variable
We would like to calculate the generalization error for $\mathbf{x}$:
\[\mathbb{E}_{D, y|\mathbf{x}} [(y - h_D(\mathbf{x}))^2]\]This is the expectation of the squared error between the true $y$ and the model’s estimate $h_D(\mathbf{x})$, taken with respect to the data $D$ and the true $y$.
We now define the true conditional mean:
\[\bar{y} = \mathbb{E}_{y|\mathbf{x}} [y]\]We then do a bit of arithmetic:
\[\mathbb{E}_{D, y|\mathbf{x}} [(y - h_D(\mathbf{x}))^2] = \mathbb{E}_{D, y|\mathbf{x}}[(y-\bar{y}+\bar{y} - h_D(\mathbf{x}))^2]\] \[=\mathbb{E}_{y|\mathbf{x}}[(y-\bar{y})^2]+\mathbb{E}_D[(\bar{y}-h_D(\mathbf{x}))^2] + 2\mathbb{E}_{D,y|\mathbf{x}}[(y-\bar{y})(\bar{y}-h_D(\mathbf{x}))]\]The last term can be written as:
\[2\mathbb{E}_{D}[\bar{y} - h_D(\mathbf{x})] \cdot \mathbb{E}_{y|\mathbf{x}}[y-\bar{y}] = 2\mathbb{E}_{D}[\bar{y} - h_D(\mathbf{x})] \cdot 0 = 0\]since the target $y$ at the new point is independent of the training data $D$, which lets the expectation factor, and $\mathbb{E}_{y|\mathbf{x}}[y-\bar{y}] = 0$ by definition of $\bar{y}$. So the generalization error simplifies to:
\[=\underbrace{\mathbb{E}_{y|\mathbf{x}}[(y-\bar{y})^2]}_\text{noise}+\underbrace{\mathbb{E}_D[(\bar{y}-h_D(\mathbf{x}))^2]}_\text{bias+variance}\]Let’s now split up the bias/variance term. Let the prediction mean be:
\[\bar{h}(\mathbf{x}) = \mathbb{E}_D[h_D(\mathbf{x})]\]Using the same trick as before, we get:
\[\mathbb{E}_D[(\bar{y}-h_D(\mathbf{x}))^2] = \mathbb{E}_D[(\bar{y}-\bar{h}(\mathbf{x}) + \bar{h}(\mathbf{x}) - h_D(\mathbf{x}))^2]\] \[=\underbrace{(\bar{y} - \bar{h}(\mathbf{x}))^2}_\text{bias squared} + \underbrace{\mathbb{E}_D[(\bar{h}(\mathbf{x}) - h_D(\mathbf{x}))^2]}_\text{variance} + \underbrace{2\mathbb{E}_D[(\bar{y}-\bar{h}(\mathbf{x}))(\bar{h}(\mathbf{x}) - h_D(\mathbf{x}))]}_0\]The cross term vanishes because $\bar{y}-\bar{h}(\mathbf{x})$ does not depend on $D$ and $\mathbb{E}_D[\bar{h}(\mathbf{x}) - h_D(\mathbf{x})] = 0$. Plugging this back into the generalization error, we get:
\[=\underbrace{\mathbb{E}_{y|\mathbf{x}}[(y-\bar{y})^2]}_\text{noise}+\underbrace{(\bar{y} - \bar{h}(\mathbf{x}))^2}_\text{bias squared} + \underbrace{\mathbb{E}_D[(\bar{h}(\mathbf{x}) - h_D(\mathbf{x}))^2]}_\text{variance}\] \[=\text{noise}(\mathbf{x})+\text{bias}(h(\mathbf{x}))^2+\text{Var}_D(h_D(\mathbf{x}))\]The expectation of this over $\mathbf{x}$ is the generalization error. In conclusion, the generalization error is a function of the noise, the bias, and the variance. Notice that as the number of data points $n \to \infty$, the variance falls, allowing us to use a more complex model while keeping the generalization error low.
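As a sanity check, a small Monte Carlo sketch can estimate each term at a single query point $x_0$; the sine data, noise level, and cubic model below are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                    # assumed "true" function
sigma = 0.3                   # noise standard deviation
n, degree, trials = 20, 3, 2000
x0 = 1.5                      # fixed query point
y_bar = f(x0)                 # true conditional mean at x0

preds = np.empty(trials)
for t in range(trials):
    X = rng.uniform(0, np.pi, n)           # draw a fresh dataset D
    y = f(X) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(X, y, degree)      # train h_D
    preds[t] = np.polyval(coeffs, x0)      # h_D(x0)

noise = sigma ** 2
bias_sq = (y_bar - preds.mean()) ** 2
variance = preds.var()
print(f"noise={noise:.4f}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
print(f"sum={noise + bias_sq + variance:.4f}")   # approximates the expected squared error at x0
```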
To control this tradeoff, we can incorporate a regularization term into the loss function. Regularization terms penalize the loss based on the magnitude of the weights (and, for some penalties, effectively the number of nonzero weights). Take, for example, the least-squares loss:
\[\mathcal{L}(\mathbf{w}) = \sum_{i=1}^n (y_i - h(\mathbf{x}_i; \mathbf{w}))^2\]One such regularization term gives ridge regression, where we add the squared $\ell_2$ norm of the weights:
\[||\mathbf{w}||^2_2 = (\ell_2(\mathbf{w}))^2 = (\sqrt{\mathbf{w} \cdot \mathbf{w}})^2 = \langle\mathbf{w}, \mathbf{w}\rangle = \mathbf{w}^\top\mathbf{w}\]So our new loss to minimize would be:
\[\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) + ||\mathbf{w}||^2_2\]Fortunately, the ridge objective is convex and differentiable, so it can be minimized in closed form with a matrix inversion. However, the inversion can be expensive when there are many features.
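A minimal sketch of that closed-form solution, assuming a design matrix `X` of shape `(n, d)` and a target vector `y`:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    # Solving the linear system is cheaper and more stable than forming an explicit inverse.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```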
This is not the case for LASSO regression, which uses the $\ell_1$ norm
\[\ell_1(\mathbf{w}) = ||\mathbf{w}||_1 = \sum_{i=1}^{|\mathbf{w}|}|w_i|\]in the loss function:
\[\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) + ||\mathbf{w}||_1\]Since the $\ell_1$ norm is not differentiable at zero, there is no closed-form solution; instead, we can always fall back on (sub)gradient descent or stochastic methods to optimize these loss functions.
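For illustration, a rough subgradient-descent sketch for the LASSO objective above (the step size and iteration count are arbitrary, untuned choices):

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam=1.0, lr=1e-3, iters=5000):
    """Minimize ||y - Xw||^2 + lam * ||w||_1 by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y) + lam * np.sign(w)   # sign(0) = 0 is a valid subgradient
        w -= lr * grad
    return w
```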
Note that we can use both regularization terms together and add hyperparameter constants to adjust the strength of each, like so:
\[\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) + \lambda_1 ||\mathbf{w}||_1 + \lambda_2 \mathbf{w}^\top\mathbf{w}\]Finally, a decrease in variance usually comes with an increase in bias, but we can also reduce variance without increasing bias by using ensemble methods. In bagging, we train multiple models on bootstrapped samples of the data and average their predictions for regression (or take a majority vote for classification). In boosting, we train models sequentially, giving more weight to the examples that earlier models got wrong.
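As a final sketch, a simple bagging loop for regression, where `fit_model` is the same hypothetical training routine as before:

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, fit_model, n_models=10, seed=0):
    """Train n_models on bootstrap resamples of the data and average their predictions."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)            # sample n points with replacement
        h = fit_model(X_train[idx], y_train[idx])
        preds.append(h(X_test))
    return np.mean(preds, axis=0)                   # average for regression
```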
Phew, that was a lot about understanding how to choose good models. Once we get to neural networks, we don’t have to worry about any of this. :)