6.2 Maximum Variance Perspective
Principal Component Analysis (PCA) can be derived from the idea that the most informative directions in data are those with the highest variance. When reducing dimensionality, we aim to retain as much of this variance (and hence information) as possible. In essence, PCA finds a low-dimensional subspace that maximizes the variance of the projected data.
For simplicity, we assume that the dataset is centered (mean = 0). This assumption can be made without loss of generality because shifting the data by its mean does not affect its variance:
\[ V_z[\mathbf{z}] = V_x[\mathbf{B}^\top(\mathbf{x} - \boldsymbol{\mu})] = V_x[\mathbf{B}^\top \mathbf{x}]. \]
Therefore, we can safely assume that both the data \(\mathbf{x}\) and the low-dimensional code \(\mathbf{z} = \mathbf{B}^\top \mathbf{x}\) have zero mean.
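As a quick numerical sanity check of this invariance, the following sketch (NumPy with synthetic data; all names and numbers are illustrative, not part of the text) projects uncentered and centered versions of the same dataset onto an arbitrary unit direction and confirms the variances agree:

```python
import numpy as np

rng = np.random.default_rng(0)                  # illustrative random data (assumption)
X = rng.normal(loc=[10.0, -3.0], scale=[2.0, 0.5], size=(500, 2))  # N x D, nonzero mean
B = np.linalg.qr(rng.normal(size=(2, 1)))[0]    # arbitrary unit-norm projection direction

Xc = X - X.mean(axis=0)                         # centered copy of the data
var_raw = np.var(X @ B)                         # variance of the code z = B^T x
var_centered = np.var(Xc @ B)                   # variance after subtracting the mean

# Shifting the data by its mean does not change the variance of the code.
assert np.isclose(var_raw, var_centered)
```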
6.2.1 Direction with Maximal Variance
We start by finding a single direction \(\mathbf{b}_1 \in \mathbb{R}^D\) along which the variance of the projected data is maximized.
For centered data: \[ z_{1n} = \mathbf{b}_1^\top \mathbf{x}_n \] and the variance of the projection is: \[ V_1 = \frac{1}{N} \sum_{n=1}^{N} (\mathbf{b}_1^\top \mathbf{x}_n)^2 = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1, \] where \(\mathbf{S} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^\top\) is the data covariance matrix.
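The identity \(\frac{1}{N}\sum_{n=1}^{N}(\mathbf{b}_1^\top \mathbf{x}_n)^2 = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1\) is easy to verify numerically; a minimal sketch with synthetic NumPy data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # toy data, N x D
X -= X.mean(axis=0)                    # center so that S = X^T X / N
b1 = rng.normal(size=3)
b1 /= np.linalg.norm(b1)               # unit-norm direction

S = X.T @ X / len(X)                   # data covariance matrix (1/N convention)
proj_var = np.mean((X @ b1) ** 2)      # (1/N) * sum_n (b1^T x_n)^2

# Both expressions give the same projected variance.
assert np.isclose(proj_var, b1 @ S @ b1)
```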
To prevent arbitrary scaling of \(\mathbf{b}_1\), we impose the constraint: \[ \|\mathbf{b}_1\|^2 = 1. \]
This leads to the constrained optimization problem: \[ \max_{\mathbf{b}_1} \; \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 \quad \text{subject to } \|\mathbf{b}_1\|^2 = 1. \]
Using a Lagrangian formulation: \[ L(\mathbf{b}_1, \lambda_1) = \mathbf{b}_1^\top \mathbf{S} \mathbf{b}_1 + \lambda_1(1 - \mathbf{b}_1^\top \mathbf{b}_1). \]
Setting derivatives to zero gives: \[ \mathbf{S} \mathbf{b}_1 = \lambda_1 \mathbf{b}_1. \]
Thus, \(\mathbf{b}_1\) is an eigenvector of \(\mathbf{S}\), and \(\lambda_1\) is the corresponding eigenvalue.
The variance of the projection onto \(\mathbf{b}_1\) is: \[ V_1 = \lambda_1. \]
Hence, the first principal component is the eigenvector corresponding to the largest eigenvalue of \(\mathbf{S}.\)
The projected data point can be reconstructed as: \[ \tilde{\mathbf{x}}_n = \mathbf{b}_1 \mathbf{b}_1^\top \mathbf{x}_n. \]
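Putting these pieces together, a small sketch (NumPy, synthetic Gaussian data; the setup is illustrative, not part of the text) computes \(\mathbf{b}_1\) as the eigenvector of \(\mathbf{S}\) with the largest eigenvalue, checks that the projected variance equals \(\lambda_1\), and forms the reconstruction \(\tilde{\mathbf{x}}_n = \mathbf{b}_1 \mathbf{b}_1^\top \mathbf{x}_n\):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.0], [1.0, 2.0]], size=1000)
X -= X.mean(axis=0)                          # center the data

S = X.T @ X / len(X)                         # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)         # eigh returns eigenvalues in ascending order
b1 = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
lambda1 = eigvals[-1]

# Variance of the projection onto b1 equals the largest eigenvalue.
assert np.isclose(np.mean((X @ b1) ** 2), lambda1)

# Rank-one reconstruction of a data point: x_tilde = b1 b1^T x
x = X[0]
x_tilde = b1 * (b1 @ x)
```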
6.2.2 M-Dimensional Subspace with Maximal Variance
To generalize, we seek an M-dimensional subspace spanned by orthonormal vectors: \[ \mathbf{B} = [\mathbf{b}_1, \mathbf{b}_2, \dots, \mathbf{b}_M], \] where each \(\mathbf{b}_m\) is an eigenvector of \(\mathbf{S}\) associated with one of its \(M\) largest eigenvalues. After finding the first \(m-1\) principal components, we can define the remaining (unexplained) data as: \[ \hat{\mathbf{X}} = \mathbf{X} - \sum_{i=1}^{m-1} \mathbf{b}_i \mathbf{b}_i^\top \mathbf{X} = \mathbf{X} - \mathbf{B}_{m-1} \mathbf{X}, \] where \(\mathbf{B}_{m-1} := \sum_{i=1}^{m-1} \mathbf{b}_i \mathbf{b}_i^\top\) is the projection matrix onto the subspace spanned by the first \(m-1\) principal components. We then seek the next direction \(\mathbf{b}_m\) that maximizes the variance of this residual data. This again leads to an eigenvalue problem: \[ \mathbf{S} \mathbf{b}_m = \lambda_m \mathbf{b}_m, \] where \(\lambda_m\) is the m-th largest eigenvalue of \(\mathbf{S}\). Thus, each principal component corresponds to an eigenvector of the covariance matrix, and the variance captured by the m-th component is: \[ V_m = \mathbf{b}_m^\top \mathbf{S} \mathbf{b}_m = \lambda_m. \]
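The equivalence between this deflation view (maximizing the variance of the residual data) and simply taking the eigenvectors of \(\mathbf{S}\) for the \(M\) largest eigenvalues can be checked numerically. A sketch under the assumption of generic synthetic data with distinct eigenvalues (note that the rows of `X` are data points here, unlike the column convention in the formulas above):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X -= X.mean(axis=0)
S = X.T @ X / len(X)
M = 2

# Direct route: take the eigenvectors of S for the M largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)
B = eigvecs[:, ::-1][:, :M]                  # columns sorted by decreasing eigenvalue

# Deflation route: remove the first component, then find the top direction of the residual.
b1 = B[:, 0]
X_hat = X - X @ np.outer(b1, b1)             # residual data x_n - b1 b1^T x_n (row-wise)
S_hat = X_hat.T @ X_hat / len(X_hat)
b2 = np.linalg.eigh(S_hat)[1][:, -1]         # top eigenvector of the residual covariance

# Same direction as the second eigenvector of S (up to sign),
# and its projected variance equals the second-largest eigenvalue.
assert np.isclose(abs(b2 @ B[:, 1]), 1.0)
assert np.isclose(b2 @ S @ b2, eigvals[::-1][1])
```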
Theorem 6.1 The total variance captured by the first \(M\) principal components is: \[ V_M = \sum_{m=1}^{M} \lambda_m. \]
The variance lost (due to compression) is: \[ J_M = \sum_{j=M+1}^{D} \lambda_j = V_D - V_M. \]
Alternatively, the relative variance captured is: \[ \frac{V_M}{V_D}, \] and the relative variance lost is: \[ 1 - \frac{V_M}{V_D}. \]
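A tiny helper (the name `explained_variance` is hypothetical) makes these ratios concrete, using the eigenvalues 6 and 4 that appear in Example 6.3 below:

```python
import numpy as np

def explained_variance(eigvals, M):
    """Fraction of total variance captured by the M largest eigenvalues."""
    lams = np.sort(eigvals)[::-1]            # eigenvalues in decreasing order
    return lams[:M].sum() / lams.sum()

# With eigenvalues (6, 4), keeping one component retains 60% of the variance
# and loses the remaining 40%.
print(explained_variance(np.array([6.0, 4.0]), M=1))       # 0.6 (relative variance captured)
print(1 - explained_variance(np.array([6.0, 4.0]), M=1))   # 0.4 (relative variance lost)
```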
Example 6.3 If we project the point \(\mathbf{x} = [5, 3]^\top = 5\mathbf{e}_1 + 3\mathbf{e}_2\) onto the subspace spanned by \(\mathbf{e}_2\): \[ \tilde{\mathbf{x}} = [0, z]^\top = z\mathbf{e}_2 = [0, 3]^\top. \] We can calculate the error as \[ \mathbf{x} - \tilde{\mathbf{x}} = [5, 3]^\top - [0, 3]^\top = [5, 0]^\top.\]
Suppose this projection comes from a large dataset whose covariance matrix has eigenvalues 6 and 4, with the larger eigenvalue associated with \(\mathbf{e}_2\). Then the projection onto \(\mathbf{e}_2\) captures 60% (\(6/(6+4)\)) of the total variance, while the remaining 40% is lost. In other words, 60% of the variability in the data is explained by the second component alone.
Example 6.4 When PCA is applied to the MNIST images of the digit “8” (see the sketch after the list below):
- The eigenvalues of the data covariance matrix decrease rapidly.
- Most variance is captured by only a small number of principal components.
- This demonstrates that the dataset has an intrinsic low-dimensional structure.
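A sketch of such an experiment, assuming the MNIST digits are obtained through scikit-learn's `fetch_openml` (the exact preprocessing behind the original experiment is not specified here), computes the eigenvalue spectrum of the “8” images and reports how many components are needed to capture 90% of the variance:

```python
import numpy as np
from sklearn.datasets import fetch_openml    # assumption: scikit-learn is available

# Fetch MNIST and keep only the images of the digit "8" (assumed preprocessing).
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X = mnist.data[mnist.target == "8"].astype(float)
X -= X.mean(axis=0)

S = X.T @ X / len(X)                         # 784 x 784 covariance matrix
eigvals = np.linalg.eigvalsh(S)[::-1]        # eigenvalues, largest first

# Cumulative fraction of variance captured by the first M components.
cumulative = np.cumsum(eigvals) / eigvals.sum()
print(np.argmax(cumulative >= 0.9) + 1)      # components needed for 90% of the variance
```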
Exercises
Exercise 6.1 Show that the variance of the low-dimensional code does not depend on the mean of the data. This allows us to assume the data has mean 0.
Exercise 6.2 Consider the constrained optimization problem \[\begin{align*} &\max_{\mathbf{b}_1} \mathbf{b}_1^\top\mathbf{S}\mathbf{b}_1\\ & \text{subject to } \|\mathbf{b}_1\|^2 = 1. \end{align*}\] Derive the necessary Lagrangian conditions for this optimization problem.
Exercise 6.3 Explain why we maximize the variance when we select the basis vector that is associated with the largest eigenvalue.