01/08/2018
Understanding linear regression from a probabilistic perspective allows us to perform more advanced statistical inference. Today, we’ll apply Bayesian inference concepts to linear regression. As a result, we’ll have a way to update our model’s beliefs as more data becomes available and to account for prior knowledge when looking at data.
I highly recommend taking a look at this introductory post on inference before delving into this one, as it covers the fundamentals of MLE, MAP, and conjugacy.
Recall from the last post, Nonlinearity: Basis Functions, that we can avoid overfitting our models with regularization terms. Here, for example, is a loss with an $\ell_2$ penalty:
\[\min_{\mathbf{w}} \mathcal{L}(\mathbf{w}) + \lambda \mathbf{w}^\top\mathbf{w}\]Here, $\lambda$ is an adjustable hyperparameter that specifies how much weight we put on the regularization term. Regularization terms can be effective, but they’re rather ad hoc and inflexible.
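As a quick illustration (not from the original post), here is a minimal NumPy sketch of minimizing this objective in closed form, assuming $\mathcal{L}$ is the usual sum-of-squares error; `X`, `y`, and `lam` are placeholder names:

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Closed-form minimizer of ||y - Xw||^2 + lam * w^T w."""
    d = X.shape[1]
    # Solve (X^T X + lam * I) w = X^T y rather than inverting explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```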
Another way to avoid overfitting the data is to use Bayesian methods. In the Bayesian framework, we place a prior belief on our weights, $\mathbf{w}$, and combine it with the likelihood of our data $D$, denoted $p(D|\mathbf{w})$, to update our beliefs about $\mathbf{w}$.
For a super brief overview of Bayes’ rule: the posterior, $p(\mathbf{w}|D)$, is the product of the likelihood and the prior, divided by the evidence $p(D)$:
\[p(\mathbf{w}|D) = \frac{\overbrace{p(D|\mathbf{w})}^\text{likelihood}\overbrace{p(\mathbf{w})}^\text{prior}}{p(D)} \propto p(D|\mathbf{w})p(\mathbf{w})\] \[p(D) = \int_{\mathbf{w}} p(D|\mathbf{w})p(\mathbf{w})d\mathbf{w} = \text{constant for all }\mathbf{w}\]We’ve discussed how to calculate the likelihood before: it’s simply the product of the probabilities of the individual data points, which we assume to be independent. But how do we choose the prior? Often, the prior distribution is informed by a field expert or taken from a previous experiment. Otherwise, you can use a flat prior, one that encodes very little prior information.
In the linear regression context, the likelihood of our data comes from a multivariate Normal distribution $\mathcal{N}(\mathbf{y}|\mathbf{Xw}, \beta^{-1}\mathbf{I})$ with an assumed (fixed) precision $\beta$. Suppose we define our prior on $\mathbf{w}$ to be $\mathcal{N}(\mathbf{w}|\mathbf{m}_0, \mathbf{S}_0)$. Our posterior is then proportional to:
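Because the noise on each observation is independent, the (log-)likelihood is just a sum of per-point Gaussian log densities. Here is a minimal sketch of that computation, where `w`, `X`, `y`, and `beta` are placeholder names of my own choosing:

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(w, X, y, beta):
    """log p(D | w): sum of independent per-point Gaussian log densities,
    with mean x_i . w and variance 1/beta for each observation y_i."""
    sigma = np.sqrt(1.0 / beta)               # noise standard deviation
    return norm.logpdf(y, loc=X @ w, scale=sigma).sum()
```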
\[p(\mathbf{w}|D) \propto p(D|\mathbf{w})p(\mathbf{w})\] \[=\mathcal{N}(\mathbf{y}|\mathbf{Xw}, \beta^{-1}\mathbf{I})\mathcal{N}(\mathbf{w}|\mathbf{m}_0, \mathbf{S}_0)\]After multiplying out the two distributions, completing the square, and doing a whole lot of algebra in between, we arrive at the conclusion that the posterior distribution is indeed multivariate Normal:
\[\mathbf{w} \sim \mathcal{N}(\mathbf{m}_n, \mathbf{S}_n)\] \[\mathbf{S}_n = (\mathbf{S}_0^{-1} + \beta \mathbf{X}^\top\mathbf{X})^{-1}\] \[\mathbf{m}_n = \mathbf{S}_n(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta \mathbf{X}^\top \mathbf{y})\]This means that the Normal distribution is conjugate to itself: a Normal prior paired with a Normal likelihood (with known precision) produces a Normal posterior with the parameters above.
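To make the update concrete, here is a minimal NumPy sketch of the posterior computation (my own, not from the post); `m0`, `S0`, `X`, `y`, and `beta` are placeholder names, and the batch loop at the end is a toy illustration of updating our beliefs as more data arrives:

```python
import numpy as np

def posterior(m0, S0, X, y, beta):
    """Posterior N(m_n, S_n) over the weights, given prior N(m0, S0),
    design matrix X, targets y, and known noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    Sn = np.linalg.inv(S0_inv + beta * X.T @ X)    # S_n
    mn = Sn @ (S0_inv @ m0 + beta * X.T @ y)       # m_n
    return mn, Sn

# Toy sequential update: each posterior becomes the prior for the next batch.
rng = np.random.default_rng(0)
beta, d = 25.0, 2                                  # noise precision, # features
w_true = np.array([0.5, -0.3])
m, S = np.zeros(d), np.eye(d)                      # broad starting prior
for _ in range(3):                                 # three batches of data
    X = rng.normal(size=(20, d))
    y = X @ w_true + rng.normal(scale=1 / np.sqrt(beta), size=20)
    m, S = posterior(m, S, X, y, beta)             # belief after this batch
```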
The MAP estimate of $\mathbf{w}$ is then the posterior mean $\mathbf{m}_n$ rather than the prior mean $\mathbf{m}_0$: a weight vector that balances our prior beliefs against the evidence in the data. In fact, with a zero-mean isotropic prior ($\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$), the MAP estimate is exactly the ridge solution with $\lambda = \alpha/\beta$, recovering the regularization term we started with.