5.4 Maximum Likelihood as Orthogonal Projection
Maximum Likelihood Estimation (MLE) in linear regression has a clear geometric interpretation. It corresponds to the orthogonal projection of the target vector \(\mathbf{y}\) onto the subspace spanned by the input data.
5.4.1 Simple Linear Regression Case
Consider the model: \[ y = x\theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2), \] with scalar input \(x\) and scalar parameter \(\theta\). Given training data \(\{(x_1, y_1), \ldots, (x_N, y_N)\}\), the MLE for \(\theta\) is: \[ \theta_{ML} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} = \frac{\mathbf{X}^\top \mathbf{y}}{\mathbf{X}^\top \mathbf{X}} \] where:
- \(\mathbf{X} = [x_1, \ldots, x_N]^\top \in \mathbb{R}^N\),
- \(\mathbf{y} = [y_1, \ldots, y_N]^\top \in \mathbb{R}^N\).
This gives the fitted (reconstructed) outputs: \[ \mathbf{X} \theta_{ML} = \frac{\mathbf{X}\mathbf{X}^\top}{\mathbf{X}^\top \mathbf{X}} \mathbf{y} \]
Geometrically, we can view the matrix \[ P = \frac{\mathbf{X}\mathbf{X}^\top}{\mathbf{X}^\top \mathbf{X}} \] as the projection matrix onto the one-dimensional subspace spanned by \(\mathbf{X}\). The MLE solution \(\mathbf{X}\theta_{ML}\) is thus the orthogonal projection of \(\mathbf{y}\) onto this subspace. This projection minimizes the squared distance between \(\mathbf{y}\) and the model predictions \(\mathbf{X}\theta\): \[ \min_{\theta} \|\mathbf{y} - \mathbf{X}\theta\|^2. \] Hence, MLE not only gives the best statistical fit (in the least-squares sense) but also the geometrically optimal fit — the closest point to \(\mathbf{y}\) within the space of linear combinations of \(\mathbf{X}\).
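A minimal NumPy sketch of this one-dimensional case follows; the synthetic data, noise level, and variable names are illustrative choices, not part of the text. It checks that \(P\mathbf{y} = \mathbf{X}\theta_{ML}\) and that the residual is orthogonal to \(\mathbf{X}\).

```python
import numpy as np

# Illustrative synthetic data (not from the text)
rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=N)                          # inputs x_1, ..., x_N stacked into a vector
y = 2.5 * X + rng.normal(scale=0.5, size=N)     # noisy targets

# MLE / least-squares estimate: theta_ML = (X^T y) / (X^T X)
theta_ml = (X @ y) / (X @ X)

# Projection matrix P = X X^T / (X^T X) onto the line spanned by X
P = np.outer(X, X) / (X @ X)
y_fit = P @ y                                   # equals X * theta_ML

print(np.allclose(y_fit, X * theta_ml))         # True: P y = X theta_ML
print(np.isclose(X @ (y - y_fit), 0.0))         # True: residual orthogonal to X
```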
5.4.2 General Linear Regression Case
For the more general model: \[ y = \phi(\mathbf{x})^\top \theta + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2), \] where \(\phi(\mathbf{x}) \in \mathbb{R}^K\) is a vector of feature functions, the results extend naturally. The MLE estimate is: \[ \theta_{ML} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{y}, \] and the fitted outputs are: \[ \mathbf{y} \approx \Phi \theta_{ML}. \]
Here:
- \(\Phi \in \mathbb{R}^{N \times K}\) is the feature matrix,
- The column space of \(\Phi\) defines a K-dimensional subspace of \(\mathbb{R}^N\),
- The projection matrix is: \[ P = \Phi (\Phi^\top \Phi)^{-1} \Phi^\top. \]

Thus, MLE corresponds to the orthogonal projection of \(\mathbf{y}\) onto the subspace spanned by the columns of \(\Phi\).
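As a numerical check of the general case, the sketch below uses polynomial features for \(\Phi\) (an illustrative choice, not prescribed by the text), computes \(\theta_{ML}\), and verifies that \(\Phi\theta_{ML} = P\mathbf{y}\) with the residual orthogonal to the column space of \(\Phi\).

```python
import numpy as np

# Illustrative data and features (assumptions, not from the text)
rng = np.random.default_rng(1)
N, K = 100, 4
x = rng.uniform(-1, 1, size=N)
Phi = np.vander(x, K, increasing=True)          # feature matrix Phi in R^{N x K}: 1, x, x^2, x^3
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=N)

# MLE: theta_ML = (Phi^T Phi)^{-1} Phi^T y, computed via a stable least-squares solver
theta_ml, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Projection matrix P = Phi (Phi^T Phi)^{-1} Phi^T
P = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
y_fit = P @ y

print(np.allclose(y_fit, Phi @ theta_ml))       # True: P y equals Phi theta_ML
print(np.allclose(Phi.T @ (y - y_fit), 0.0))    # True: residual orthogonal to col(Phi)
```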
5.4.3 Special Case: Orthonormal Basis
If the columns \(\phi_k\) of \(\Phi\) form an orthonormal basis, then: \[ \Phi^\top \Phi = I \] and the projection simplifies to: \[ P = \Phi \Phi^\top = \sum_{k=1}^K \phi_k \phi_k^\top. \] In this case:
- The projection of \(\mathbf{y}\) is simply the sum of individual projections onto each basis vector.
- The coupling between features disappears because of orthogonality.
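The sketch below constructs an orthonormal \(\Phi\) via a QR factorization of a random matrix (an illustrative construction, not from the text) and checks that the projection decomposes into a sum of rank-one projections, with each coefficient \(\theta_k = \phi_k^\top \mathbf{y}\) computed independently.

```python
import numpy as np

# Illustrative orthonormal feature matrix (assumption: built via QR of random data)
rng = np.random.default_rng(2)
N, K = 60, 3
Phi, _ = np.linalg.qr(rng.normal(size=(N, K)))      # columns of Phi are orthonormal
y = rng.normal(size=N)

print(np.allclose(Phi.T @ Phi, np.eye(K)))          # True: Phi^T Phi = I

# P = Phi Phi^T = sum_k phi_k phi_k^T
P_full = Phi @ Phi.T
P_sum = sum(np.outer(Phi[:, k], Phi[:, k]) for k in range(K))
print(np.allclose(P_full, P_sum))                   # True: projection is a sum of rank-one projections

# With orthonormal columns, each coefficient decouples: theta_k = phi_k^T y
theta = Phi.T @ y
print(np.allclose(P_full @ y, Phi @ theta))         # True: P y = Phi theta
```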
Example 5.9 Fourier bases and wavelets are examples of orthogonal bases commonly used in signal processing.