01/04/2018
Linear Regression: A Probabilistic Approach
Today, we look at linear regression in a probabilistic modeling context that helps us understand the reasons behind the least squares loss function.
Recall that in linear regression, we assume that our prediction is a linear combination of the weights and the feature vector, with a prepended bias term:
\[\hat{y} = h(\mathbf{x}; \mathbf{w}) = \sum_{i=0}^{m} w_i x_i = \mathbf{w^\top x}\] \[\mathbf{x}, \mathbf{w} \in \mathbb{R}^{m+1}, \mathbf{x} = \begin{bmatrix}1 \\ \mathbf{x^1}\end{bmatrix}, \mathbf{w} = \begin{bmatrix}w_0 \\ \mathbf{w^1} \end{bmatrix}\]We posited that the residual sum-of-squares was a good loss function, but its use seemed rather ad hoc. Let's show some of the theoretical underpinnings behind this loss function.
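Before diving into the probability review, here is a minimal sketch in Python (assuming NumPy is available; the feature matrix and weights below are made up for illustration) of how the prediction above is computed once a column of ones is prepended for the bias:

```python
import numpy as np

def predict(X_raw, w):
    """Compute y_hat = w^T x for every row of X_raw, prepending a 1 for the bias."""
    X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # each row becomes [1, x^1]
    return X @ w

# Toy example: 3 points with m = 2 features, weights w = [w_0, w_1, w_2].
X_raw = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
w = np.array([0.5, 1.0, -1.0])
print(predict(X_raw, w))  # [-0.5 -0.5 -0.5]
```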
To help understand the reasons for the least squares loss, we’ll need to review some fundamentals of probability theory. This is by no means a comprehensive overview of these concepts, but it hopefully brings back some of the most important ideas from your probability theory class. Consider a single data point. We can write the conditional probability of seeing its output $y$ given its feature vector $\mathbf{x}$ as:
\[p(y|\mathbf{x})\]If we assume that the data points are independent of one another, we can express the joint probability, the probability of seeing all of the outputs $y$ given their corresponding $\mathbf{x}$'s, as:
\[p(y_1, y_2, \ldots, y_n | \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n) = p(y_1|\mathbf{x}_1)p(y_2|\mathbf{x}_2)\ldots p(y_n|\mathbf{x}_n) = \prod_{i=1}^n p(y_i|\mathbf{x}_i)\]When these probabilities depend on parameters $\mathbf{w}$, this joint probability, viewed as a function of the parameters, is called the likelihood $L(\mathbf{w})$.
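As a tiny numerical illustration of the independence assumption (the per-point probabilities below are made-up numbers, not the output of any model), the joint probability is just the product of the individual conditionals:

```python
import numpy as np

# Hypothetical conditional probabilities p(y_i | x_i) for four data points.
per_point = np.array([0.8, 0.55, 0.9, 0.7])

# Under independence, the joint probability is the product of the factors.
joint = np.prod(per_point)
print(joint)  # 0.2772
```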
The value of a probability is determined by its distribution. We can think of a distribution as assigning a probability of occurrence to every possible value in its support, the set of possible outcomes. For instance, a biased coin with probability 0.6 of landing heads would have the probability distribution:
\[\begin{cases} 0.4 & \text{tails} \\ 0.6 & \text{heads} \end{cases}\]The distribution we will be discussing is the Normal/Gaussian distribution. The Normal distribution has the following probability density function:
\[\mathcal{N}(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)\]The multivariate case can be written as:
\[\mathcal{N}(\mathbf{x}|\mathbf{\mu}, \mathbf{\Sigma}) = \frac{1}{\sqrt{|2\pi \mathbf{\Sigma}|}} \exp\left( -\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^\top\mathbf{\Sigma}^{-1} (\mathbf{x}-\mathbf{\mu}) \right)\]Note that the probabilities of a distribution must add up to 1; for continuous distributions like the Normal, the density must integrate to 1.
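As a sanity check of the univariate density above, here is a short sketch (assuming NumPy and SciPy; the inputs are arbitrary) that evaluates the formula directly and compares it against `scipy.stats.norm`:

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2) written out exactly as in the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x, mu, sigma2 = 1.3, 0.0, 2.0
print(gaussian_pdf(x, mu, sigma2))                 # direct evaluation of the formula
print(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))  # SciPy gives the same value
```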
Now to tie it all together. In linear regression, we think of the data as coming from a generative model of normally distributed data. A generative model is a probabilistic statement that describes the joint distribution over feature vectors and their output values. Here, the outputs $y$ are modeled as being generated from the inputs $\mathbf{x}$. In the general regression form, we can say that our data follows a true function $h$ plus noise $\epsilon$:
\[y = h(\mathbf{x}, \mathbf{w}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \beta^{-1})\]Another way to view this is that $y$ follows a Normal distribution with mean $h$ and variance $\beta^{-1}$ (here $\beta$ is the noise precision, the inverse of the variance).
\[p(y|\mathbf{x}, \mathbf{w}) = \mathcal{N}(y|h(\mathbf{x}, \mathbf{w}), \beta^{-1}) \longleftrightarrow y \sim \mathcal{N}(h(\mathbf{x}, \mathbf{w}), \beta^{-1})\]Plugging in our assumption of linearity $(h(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x})$, we get:
\[p(y|\mathbf{x}, \mathbf{w}) = \mathcal{N}(y|\mathbf{w}^\top \mathbf{x}, \beta^{-1}) \longleftrightarrow y \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}, \beta^{-1})\]The moral of the story is that we are currently making the following assumptions:

- The outputs are a linear function of the inputs: $h(\mathbf{x}, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}$.
- The data points are independent of one another.
- The noise is Gaussian with zero mean and constant variance $\beta^{-1}$.
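To make the generative story concrete, here is a small sketch (assuming NumPy; the true weights and noise precision are arbitrary choices for illustration) that draws a dataset from this model:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 100, 2
beta = 4.0                           # noise precision, so the noise variance is 1 / beta
w_true = np.array([0.5, 2.0, -1.0])  # [w_0, w_1, w_2], chosen arbitrarily

X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])         # prepend the bias column
epsilon = rng.normal(loc=0.0, scale=np.sqrt(1.0 / beta), size=n)  # eps ~ N(0, beta^{-1})
y = X @ w_true + epsilon                                          # y ~ N(w^T x, beta^{-1})
```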
In this generative model, we would like to find a $\mathbf{w}^\star$ that maximizes the likelihood of seeing our data. Not surprisingly, we can estimate the optimal $\mathbf{w}^\star$ with the maximum likelihood estimator (MLE) $\hat{\mathbf{w}}_{MLE}$:
\[\hat{\mathbf{w}}_{MLE} = \text{arg}_\mathbf{w}\text{max }p(y_1, \ldots, y_n | \mathbf{x}_1, \ldots, \mathbf{x}_n) = \text{arg}_\mathbf{w}\text{max } L(\mathbf{w})\]Using our linear regression model and converting it to its joint Gaussian form, we get:
\[\hat{\mathbf{w}}_{MLE} = \text{arg}_\mathbf{w}\text{max } \prod_{i=1}^n \mathcal{N}(y_i|\mathbf{w}^\top \mathbf{x}_i, \beta^{-1}) = \text{arg}_\mathbf{w}\text{max } \mathcal{N}(\mathbf{y}|\mathbf{Xw}, \beta^{-1}I_{n})\]Notice that the likelihood is a product of probabilities. Since each factor is typically much smaller than 1, the product shrinks toward 0 as the number of data points grows, and floating-point underflow becomes a problem. To avoid these numerical issues, we often work in log-space. Furthermore, it's conventional to work with minimization problems, so instead of maximizing the log-likelihood, we minimize the negative log-likelihood:
\[\ell(\mathbf{w}) = - \log L(\mathbf{w})\]So the new equation is:
\[\begin{align*}\hat{\mathbf{w}}_{MLE} &= \text{arg}_\mathbf{w}\text{min } \ell(\mathbf{w})\\ &= \text{arg}_\mathbf{w}\text{min } - \log \mathcal{N}(\mathbf{y}|\mathbf{Xw}, \beta^{-1}I_{n}) \\ &= \text{arg}_\mathbf{w}\text{min } \frac{1}{2} \log|2\pi\beta^{-1}I_n| + \frac{\beta}{2}(\mathbf{y} - \mathbf{Xw})^\top(\mathbf{y} - \mathbf{Xw}) \end{align*}\]The first term does not depend on $\mathbf{w}$, so only the quadratic term matters. To find the argmin, we take the derivative of the negative log-likelihood, set it to zero, and solve for $\mathbf{w}$.
\[\begin{align*} \frac{\partial \ell(\mathbf{w})}{\partial \mathbf{w}} &= \frac{\beta}{2}(-2\mathbf{X}^\top\mathbf{y} + 2\mathbf{X}^\top\mathbf{Xw}) = \beta(\mathbf{X}^\top\mathbf{Xw} - \mathbf{X}^\top\mathbf{y}) \\ \hat{\mathbf{w}}_{MLE}&= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y} \end{align*}\]Note that the precision $\beta$ cancels when we set the derivative to zero. The maximum likelihood estimator is precisely the least squares solution we found last week, the one that projects $\mathbf{y}$ onto the column space of $\mathbf{X}$! In conclusion, using the least squares loss is equivalent to maximum likelihood estimation in a probabilistic linear regression model where the data is assumed to come independently from a Normal distribution.
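As a quick numerical check of this equivalence (a sketch assuming NumPy; the simulated dataset and true weights are made up), the closed-form MLE matches NumPy's least squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 200, 4.0
w_true = np.array([0.5, 2.0, -1.0])
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = X @ w_true + rng.normal(scale=np.sqrt(1.0 / beta), size=n)

w_mle = np.linalg.solve(X.T @ X, X.T @ y)        # (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
print(np.allclose(w_mle, w_lstsq))               # True: the two estimates coincide
```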
Previous: Linear Regression: A Mathematical Approach | Next: Model Selection