6.3 Projection Perspective
Principal Component Analysis (PCA) can also be viewed as finding the low-dimensional linear projection of the data that minimizes the average squared reconstruction error between the original data points and their projections. This interpretation connects PCA with optimal linear autoencoders.
6.3.1 Setting and Objective
Any data vector \(\mathbf{x} \in \mathbb{R}^D\) can be expressed as a linear combination of orthonormal basis (ONB) vectors \((\mathbf{b}_1, \dots, \mathbf{b}_D)\):
\[
\mathbf{x} = \sum_{d=1}^{D} \zeta_d \mathbf{b}_d.
\]
We seek an approximation \(\tilde{\mathbf{x}}\) in a lower-dimensional subspace \(U \subseteq \mathbb{R}^D\) of dimension \(M < D\), spanned by the first \(M\) basis vectors:
\[
\tilde{\mathbf{x}} = \sum_{m=1}^{M} z_m \mathbf{b}_m.
\]
The goal is to make \(\mathbf{x}\) and \(\tilde{\mathbf{x}}\) as close as possible by minimizing the average squared Euclidean distance over the dataset:
\[
J_M = \frac{1}{N} \sum_{n=1}^{N} \| \mathbf{x}_n - \tilde{\mathbf{x}}_n \|^2.
\]
This quantity is known as the reconstruction error.
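As a minimal sketch (assuming NumPy, a hypothetical data matrix `X` with one example per row, and a matching matrix of approximations `X_tilde`; the names are illustrative, not from the text), the reconstruction error can be computed directly from its definition:

```python
import numpy as np

def avg_reconstruction_error(X, X_tilde):
    """Average squared Euclidean distance J_M between data and approximations.

    X, X_tilde : (N, D) arrays; row n holds x_n and its approximation x~_n.
    """
    return np.mean(np.sum((X - X_tilde) ** 2, axis=1))
```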
6.3.2 Finding Optimal Coordinates
For a fixed ONB \((\mathbf{b}_1, \dots, \mathbf{b}_M)\), minimizing \(J_M\) with respect to the coordinates (using the orthonormality of the basis) shows that the optimal coordinates are obtained by orthogonal projection onto the basis vectors:
\[
z_{in} = \mathbf{b}_i^\top \mathbf{x}_n, \quad i = 1, \dots, M.
\]
Hence, the optimal projection of \(\mathbf{x}_n\) is
\[
\tilde{\mathbf{x}}_n = \mathbf{B} \mathbf{B}^\top \mathbf{x}_n,
\]
where \(\mathbf{B} = [\mathbf{b}_1, \dots, \mathbf{b}_M]\).
Interpretation:
PCA’s projection step is an orthogonal projection of the data onto the principal subspace spanned by the top \(M\) basis vectors.
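A small sketch of this projection step, assuming NumPy, a data matrix `X` with one example per row, and an orthonormal basis stored as the columns of `B` (all names are illustrative):

```python
import numpy as np

def project_onto_subspace(X, B):
    """Orthogonally project each row of X onto the subspace spanned by B.

    X : (N, D) data matrix (rows x_n).
    B : (D, M) matrix with orthonormal columns b_1, ..., b_M.
    Returns the coordinates Z (N, M) and the projections X_tilde (N, D).
    """
    Z = X @ B          # z_n = B^T x_n  (optimal coordinates)
    X_tilde = Z @ B.T  # x~_n = B B^T x_n  (orthogonal projection)
    return Z, X_tilde

# Sanity check: projecting an already projected point changes nothing,
# because B B^T is idempotent (B B^T B B^T = B B^T).
rng = np.random.default_rng(1)
B, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # random orthonormal basis
X = rng.normal(size=(10, 5))
_, X_tilde = project_onto_subspace(X, B)
_, X_tilde2 = project_onto_subspace(X_tilde, B)
assert np.allclose(X_tilde, X_tilde2)
```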
6.3.3 Finding the Basis of the Principal Subspace
Rewriting the reconstruction error using projection matrices:
\[
J_M = \frac{1}{N} \sum_{n=1}^{N} \| \mathbf{x}_n - \tilde{\mathbf{x}}_n \|^2 = \frac{1}{N} \sum_{n=1}^{N} \| ( \mathbf{I} - \mathbf{B}\mathbf{B}^\top ) \mathbf{x}_n \|^2.
\]
- \(\mathbf{B}\mathbf{B}^\top\) is a rank-\(M\) symmetric projection matrix, so minimizing \(J_M\) can be interpreted as finding the best rank-\(M\) approximation of the identity matrix \(\mathbf{I}\).
- The reconstruction error can also be expressed in terms of the data covariance matrix \(\mathbf{S}\):
  \[
  J_M = \sum_{j=M+1}^{D} \mathbf{b}_j^\top \mathbf{S} \mathbf{b}_j = \sum_{j=M+1}^{D} \lambda_j,
  \]
  where the second equality holds when the \(\mathbf{b}_j\) are eigenvectors of \(\mathbf{S}\) with eigenvalues \(\lambda_j\). Minimizing \(J_M\) therefore amounts to discarding the directions with the smallest eigenvalues; equivalently, the principal subspace is spanned by the eigenvectors of \(\mathbf{S}\) belonging to the \(M\) largest eigenvalues. The sketch after this list checks the identity numerically.
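A minimal numerical check, assuming NumPy, synthetic centered data, and the covariance convention \(\mathbf{S} = \frac{1}{N}\mathbf{X}^\top\mathbf{X}\) (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 500, 5, 2

# Centered synthetic data and its covariance matrix S.
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
X = X - X.mean(axis=0)
S = X.T @ X / N

# Eigendecomposition of S; sort eigenvalues in decreasing order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal subspace: eigenvectors of the M largest eigenvalues.
B = eigvecs[:, :M]
X_tilde = X @ B @ B.T

J_M = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print(J_M, eigvals[M:].sum())   # the two numbers agree up to rounding
```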
Example 6.5 Applying PCA to the MNIST digits “0” and “1”:
- Data embedded in 2D (using the top two principal components) shows clear class separation.
- Digits “0” show greater internal variation than digits “1”.
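A sketch of such an embedding, using scikit-learn's built-in 8×8 digits dataset as a lightweight stand-in for MNIST (the dataset choice, class filtering, and plotting details are illustrative assumptions, not the original experiment):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in for MNIST: scikit-learn's 8x8 digits, restricted to classes 0 and 1.
digits = load_digits()
mask = np.isin(digits.target, [0, 1])
X, y = digits.data[mask], digits.target[mask]

# Embed in 2D using the top two principal components.
Z = PCA(n_components=2).fit_transform(X)

for label, marker in [(0, "o"), (1, "x")]:
    pts = Z[y == label]
    plt.scatter(pts[:, 0], pts[:, 1], marker=marker, label=f"digit {label}")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.legend()
plt.show()
```

In such a plot, the two classes typically form well-separated clusters, with the "0" cluster more spread out than the "1" cluster, mirroring the observation above.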