01/09/2018
We now leave the land of predicting real-numbered values to look at data classification. The discussion will conclude with one of the fundamental concepts behind classification, the Perceptron algorithm.
The classification problem involves specifying which category a data point belongs to. For instance, if you have a feature vector of an individual’s political preferences, a classification problem might be to classify the individual as a Democrat or Republican. This is an example of a binary classification since there are only two classes, Democrat or Republican.
We can express our output space as:
\[\mathcal{y} = \{-1, 1\}\]Note that this is different from the regression problem where $\mathcal{y} = \mathbb{R}$.
What if we wanted to also consider Libertarians, the Green Party, and Independents as possible political parties for an individual? We can instead do multiclass classification for $K$ classes where \(\mathcal{y} = \{C_1, C_2, \ldots, C_K \}\). Multiclass classification can be found in a variety of applications such as image recognition, disease diagnostics, and product recommendation. In this post, we will only cover the binary case.
Before discussing the parameterized classification problem, we first discuss the KNN solution. For some input feature vector $\mathbf{x}$, we classify $\mathbf{x}$ by taking the majority of the $k$ nearest points:
\[\hat{y} = \begin{cases} 1 & \sum_{i=1}^k y^{(i)} > 0\\ -1 & \text{otherwise} \end{cases}\]Of course, KNN is still subject to the same problems it had in the linear regression case, namely overfitting and the need to keep all your data around at all times.
Let’s take the same modeling approach we used in linear regression and use it to classify. Define a hypothesis function $h$:
\[h(\mathbf{x}; \mathbf{w}) = \mathbf{w}^\top\mathbf{x}\]We can use this line (hyperplane) to split the space into two subspaces. If \(h(\mathbf{x}; \mathbf{w}) > 0\), then we predict $\hat{y} = 1$. Otherwise, we predict \(\hat{y} = -1\).
The set of points on this hyperplane is known as the decision boundary. These are defined as the $\mathbf{x}$ where \(\mathbf{w}^\top\mathbf{x} = 0\). To determine the decision boundary, we would like to find its orientation and position.
The orientation of this hyperplane is determined solely by $\mathbf{w}$ since $\mathbf{w}$ is orthogonal to the hyperplane. The proof is simple:
\[\mathcal{S} = \{\mathbf{x} \in \mathcal{X} | \mathbf{w}^\top\mathbf{x} + w_0 = 0 \}\]Let $\mathbf{x}_1, \mathbf{x}_2$ be in $\mathcal{S}$. Let $\mathbf{s} = \mathbf{x}_1 - \mathbf{x}_2$.
\[\begin{align*} \mathbf{w}^\top\mathbf{s} &= \mathbf{w}^\top(\mathbf{x}_1 - \mathbf{x}_2) \\ &= \mathbf{w}^\top\mathbf{x}_1 - \mathbf{w}^\top\mathbf{x}_2 \\ & = w_0 - w_0 = 0 \end{align*}\]So $\mathbf{w}$ is orthogonal to the hyperplane.
To determine the decision boundary’s location, we can use Lagrange multipliers to solve an optimization problem. We will try to find the closest point to the origin (the location by definition):
\[\arg_{\mathbf{x}}\min \mathbf{x}^\top\mathbf{x}\] \[s.t. \mathbf{w}^\top \mathbf{x} + w_0 = 0\]We then incorporate the constraint for the Lagrange multiplier:
\[L(\mathbf{x}, \lambda) = \mathbf{x}^\top\mathbf{x} + \lambda(\mathbf{w}^\top\mathbf{x} + w_0 - 0)\] \[\frac{\partial L(\mathbf{x}, \lambda)}{\partial \mathbf{x}} = 2\mathbf{x} + \lambda \mathbf{w} := 0\] \[\frac{\partial L(\mathbf{x}, \lambda)}{\partial \lambda} = \mathbf{w}^\top\mathbf{x} + w_0 := 0\]Solving the system of equations:
\[\mathbf{w}^\top(-\lambda\mathbf{w} / 2) + w_0 = 0\] \[\lambda = \frac{2w_0}{\mathbf{w}^\top\mathbf{w}}\]Thus, the distance from the origin is a function of \(\frac{2w_0}{\mathbf{w}^\top\mathbf{w}}\).
We know how to construct a decision boundary with a given $\mathbf{w}$, but we haven’t looked at the data yet. Oftentimes, data may not be linearly separable, meaning there may not be a hyperplane that separates the data into the two classes. This could be due to noise or an incorrect linear space.
One solution would be to use basis functions. Doing so introduces nonlinearity that may correctly divide the data.
Now that we have an idea of how to construct a classification model, we now need to train it. But in order to train a model, we need a loss function. One instinct might be to try using the same distance-based loss functions we’ve seen thus far. Take, for instance, least squares:
\[\mathcal{L}(\mathbf{w}) = \sum_{i=1}^n (y_i - h(\mathbf{x}_i; \mathbf{w}))^2, y_i \in \{-1, 1\}\]But think about what this loss is doing. Here, we penalize estimations based on their distance to the true class, \(\pm 1\). This means that points that are correctly classified will be penalized based on how far away from the true class they are. Thus, an outlier point can drastically shift the boundary, even if it’s classified correctly.
Instead, we need to target the accuracy of a classification, and for that, we’ll use the Perceptron algorithm.
The Perceptron algorithm is as follows:
To run perceptron, we’ll need an accuracy-based loss function with an informative derivative that tells us how to change our weights.
Activation functions are the answer to our need for accuracy-based loss functions. Activation functions map real numbers to a discrete set of values. The first activation function we’ll look it is the 0/1 activation function:
\[f_{0, 1}(z) = \begin{cases} 1 & z > 0 \\ 0 & \text{otherwise} \end{cases}\]Our loss function would then be the sum of occurrences where the prediction and true value differ:
\[\mathcal{L}(\mathbf{w}) = \sum_{i=1}^n f_{0/1}(-h(\mathbf{x_i}; \mathbf{w}) \cdot y_i)\]There is a problem with the 0/1 activtion loss, but to understand it, we’ll first talk about how the derivative-based update would work.
The idea of gradient descent is that we wish to update our weights in a way that minimizes the loss. To do so, we first compute the gradient of our loss with respect to our weights, \(\frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}}\). This represents the slope vector of greatest ascent. We can then update our weights by subtracting this gradient vector scaled by some learning rate parameter, $\eta$:
\[\mathbf{w} \leftarrow \mathbf{w} - \eta \cdot \frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}}\]Unfortunately, the 0/1 activation function does not have an informative derivative. At $z=0$, the slope is undefined, and 0 otherwise. As a result, performing a weight adjustment based on the derivative will yield 0 change.
The solution to this would be to use the rectified linear unit activation function (ReLU). The function is defined as:
\[f_{relu}(z) = \begin{cases} z & z > 0 \\ 0 & \text{otherwise} \end{cases} = \max(0, z)\]In the 0/1 activation function, a misclassification always had an error value of 1. Here, the farther off a point is from its correct classification, the worse the penalty. As a result, the derivative is 1 for values greater than 0 and 0 otherwise, giving us a nonzero value during gradient descent!
Let’s tie this all together. The Perceptron loss is:
\[\mathcal{L}(\mathbf{w}) = \sum_{i=1}^n f_{relu}(-h(\mathbf{x}_i; \mathbf{w}) \cdot y_i) = \sum_{i=1; y_i \neq \hat{y}_i}^n (\mathbf{w}^\top\mathbf{x}_i + w_0) \cdot y_i\]In other words, this loss is the total distance of all the misclassified points.
The update step requires taking the derivative of the loss with respect to your weights. This requires going through all your data over and over again and can often be time consuming. Another strategy commonly used in practice is stochastic gradient descent (SGD). In SGD, a gradient descent update is performed for every individual datapoint’s loss until model convergence, rather than after observing all the data. Oftentimes, this performs better in practice than regular gradient descent.
To be clear, the gradient of the total-data loss would be:
\[\frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}} = - \sum_{i=1; y_i \neq \hat{y}_i}^n y_i \cdot \mathbf{x}_i\]Whereas the stochastic, single-point version would be:
\[\frac{\partial \mathcal{L}^{(i)}(\mathbf{w})}{\partial \mathbf{w}} = -y_i \mathbf{x}_i\]So the update after each datapoint would be:
\[\mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} + \eta y_i \mathbf{x}_i\]The final Perceptron algorithm is therefore:
If the data is linearly separable, the algorithm will converge!
08/22/2018
Toward the Light: Behind the Scenes
07/01/2018
Arch Linux: Chromebook C720 Webcam Microphone Disappeared
06/21/2018
SSH: How to Set Up a Simple Local Media Server
02/28/2018
Pacman: File Conflicts
01/17/2018
Making an Arch Linux USB Flash Install Medium
01/17/2018
Arch Linux: Post-Install Notes
01/15/2018
Binary Classification Metrics
01/14/2018
Probabilitistic Classification
01/09/2018
Classification and Perceptron
01/08/2018
Linear Regression: Bayesian Approach, Normal Conjugacy
01/08/2018
Nonlinearity: Basis Functions
01/04/2018
Linear Regression: A Probabilistic Approach
12/30/2017
Linear Regression: A Mathematical Approach
12/20/2017
2017 Reflections: A Year of Curating
12/19/2017
Introduction to Regression: K-Nearest Neighbors
12/18/2017
Welcome to my Miscellaneous Blog!
12/18/2017
A Definitive Arch Linux Install Guide for the Chromebook C720
10/01/2017
C4D: Volume Effector
09/18/2017
Algorithms: Maximum Sliding Window
09/10/2017
Introduction to Inference: Coins and Discrete Probability
09/05/2017
C4D: Unreliable Booles
08/30/2017
Welcome to my Tech Blog!
08/30/2017
Welcome to my Problem Solving Blog!
Previous: Linear Regression: Bayesian Approach, Normal Conjugacy | Next: Probabilitistic Classification