5.4 Maximum Likelihood as Orthogonal Projection
Maximum Likelihood Estimation (MLE) in linear regression has a clear geometric interpretation. It corresponds to the orthogonal projection of the target vector \(\mathbf{y}\) onto the subspace spanned by the input data.
5.4.1 Simple Linear Regression Case
Consider the model: \[ y = x\theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2), \] with scalar input \(x\) and scalar parameter \(\theta\). Given training data \(\{(x_1, y_1), \ldots, (x_N, y_N)\}\), the MLE for \(\theta\) is: \[ \theta_{ML} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} = \frac{\mathbf{X}^\top \mathbf{y}}{\mathbf{X}^\top \mathbf{X}} \] where:
- \(\mathbf{X} = [x_1, \ldots, x_N]^\top \in \mathbb{R}^N\),
- \(\mathbf{y} = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N\).
This gives the fitted (reconstructed) outputs: \[ \mathbf{X} \theta_{ML} = \frac{\mathbf{X}\mathbf{X}^\top}{\mathbf{X}^\top \mathbf{X}} \mathbf{y} \]
Geometrically, we can view the matrix \[ P = \frac{\mathbf{X}\mathbf{X}^\top}{\mathbf{X}^\top \mathbf{X}} \] as the projection matrix onto the one-dimensional subspace spanned by \(\mathbf{X}\). The MLE solution \(\mathbf{X}\theta_{ML}\) is thus the orthogonal projection of \(\mathbf{y}\) onto this subspace. This projection minimizes the squared distance between \(\mathbf{y}\) and the model predictions \(\mathbf{X}\theta\): \[ \min_{\theta} \|\mathbf{y} - \mathbf{X}\theta\|^2. \] Hence, MLE not only gives the best statistical fit (in the least-squares sense) but also the geometrically optimal fit — the closest point to \(\mathbf{y}\) within the space of linear combinations of \(\mathbf{X}\).
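A minimal NumPy sketch of this one-dimensional case follows; the synthetic data, noise level, and variable names are illustrative choices, not part of the text. It checks that \(P\mathbf{y} = \mathbf{X}\theta_{ML}\) and that the residual is orthogonal to \(\mathbf{X}\).

```python
import numpy as np

# Illustrative synthetic data (not from the text)
rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=N)                          # inputs x_1, ..., x_N stacked into a vector
y = 2.5 * X + rng.normal(scale=0.5, size=N)     # noisy targets

# MLE / least-squares estimate: theta_ML = (X^T y) / (X^T X)
theta_ml = (X @ y) / (X @ X)

# Projection matrix P = X X^T / (X^T X) onto the line spanned by X
P = np.outer(X, X) / (X @ X)
y_fit = P @ y                                   # equals X * theta_ML

print(np.allclose(y_fit, X * theta_ml))         # True: P y = X theta_ML
print(np.isclose(X @ (y - y_fit), 0.0))         # True: residual orthogonal to X
```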
5.4.2 General Linear Regression Case
For the more general model: \[ y = \phi(\mathbf{x})^\top \theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2), \] where \(\phi(\mathbf{x}) \in \mathbb{R}^K\) is a vector of feature functions, the results extend naturally. The MLE estimate is: \[ \theta_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}, \] and the fitted outputs are: \[ \mathbf{y} \approx \Phi \theta_{ML}. \]
Here:
- \(\Phi \in \mathbb{R}^{N \times K}\) is the feature matrix,
- The column space of \(\Phi\) defines a K-dimensional subspace of \(\mathbb{R}^N\),
- The projection matrix is: \[ P = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top. \]

Thus, MLE corresponds to the orthogonal projection of \(\mathbf{y}\) onto the subspace spanned by the columns of \(\Phi\).
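As a numerical check of the general case, the sketch below uses polynomial features for \(\Phi\) (an illustrative choice, not prescribed by the text), computes \(\theta_{ML}\), and verifies that \(\Phi\theta_{ML} = P\mathbf{y}\) with the residual orthogonal to the column space of \(\Phi\).

```python
import numpy as np

# Illustrative data and features (assumptions, not from the text)
rng = np.random.default_rng(1)
N, K = 100, 4
x = rng.uniform(-1, 1, size=N)
Phi = np.vander(x, K, increasing=True)          # feature matrix Phi in R^{N x K}: 1, x, x^2, x^3
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=N)

# MLE: theta_ML = (Phi^T Phi)^{-1} Phi^T y, computed via a stable least-squares solver
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Projection matrix P = Phi (Phi^T Phi)^{-1} Phi^T
P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
y_fit = P @ y

print(np.allclose(y_fit, Phi @ theta_ml))       # True: P y equals Phi theta_ML
print(np.allclose(Phi.T @ (y - y_fit), 0.0))    # True: residual orthogonal to col(Phi)
```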
5.4.3 Special Case: Orthonormal Basis
If the columns \(\phi_k\) of \(\Phi\) form an orthonormal basis, then: \[ \Phi^\top \Phi = I \] and the projection simplifies to: \[ P = \Phi \Phi^\top = \sum_{k=1}^K \phi_k \phi_k^\top. \] In this case:
- The projection of \(\mathbf{y}\) is simply the sum of individual projections onto each basis vector.
- The coupling between features disappears because of orthogonality.
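The sketch below constructs an orthonormal \(\Phi\) via a QR factorization of a random matrix (an illustrative construction, not from the text) and checks that the projection decomposes into a sum of rank-one projections, with each coefficient \(\theta_k = \phi_k^\top \mathbf{y}\) computed independently.

```python
import numpy as np

# Illustrative orthonormal feature matrix (assumption: built via QR of random data)
rng = np.random.default_rng(2)
N, K = 60, 3
Phi, _ = np.linalg.qr(rng.normal(size=(N, K)))      # columns of Phi are orthonormal
y = rng.normal(size=N)

print(np.allclose(Phi.T @ Phi, np.eye(K)))          # True: Phi^T Phi = I

# P = Phi Phi^T = sum_k phi_k phi_k^T
P_full = Phi @ Phi.T
P_sum = sum(np.outer(Phi[:, k], Phi[:, k]) for k in range(K))
print(np.allclose(P_full, P_sum))                   # True: projection is a sum of rank-one projections

# With orthonormal columns, each coefficient decouples: theta_k = phi_k^T y
theta = Phi.T @ y
print(np.allclose(P_full @ y, Phi @ theta))         # True: P y = Phi theta
```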
Example 5.9 Fourier bases and wavelets are examples of orthogonal bases commonly used in signal processing.