5.2 Parameter Estimation

In linear regression, we are given a training dataset
\[ \mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}, \] where \(\mathbf{x}_n \in \mathbb{R}^D\) are inputs and \(y_n \in \mathbb{R}\) are the corresponding observations. Let \(\mathcal{X}\) and \(\mathcal{Y}\) denote the sets of training inputs and training targets, respectively. We assume each target is generated according to a probabilistic model: \[ p(y_n | \mathbf{x}_n, \theta) = \mathcal{N}(y_n| \mathbf{x}_n^\top \theta, \sigma^2), \] where \(\theta \in \mathbb{R}^D\) are the model parameters and \(\sigma^2\) is the noise variance. Because the observations are conditionally independent given the inputs and parameters, the likelihood factorizes as: \[ p(\mathcal{Y} | \mathcal{X}, \theta) = \prod_{n=1}^N p(y_n | \mathbf{x}_n, \theta) = \prod_{n=1}^N \mathcal{N}(y_n| \mathbf{x}_n^\top \theta, \sigma^2). \]


5.2.1 Maximum Likelihood Estimation (MLE)

The Maximum Likelihood Estimation approach finds parameters that maximize the likelihood: \[ \theta_{\text{ML}} = \arg \max_\theta p(\mathcal{Y} | \mathcal{X}, \theta). \]

Since taking the logarithm does not change the location of the optimum, we minimize the negative log-likelihood instead: \[ L(\theta) = -\log p(\mathcal{Y} | \mathcal{X}, \theta) = \frac{1}{2\sigma^2} \| y - X\theta \|^2 + \text{const.} \] This expression is quadratic in \(\theta\), so setting its gradient to zero yields the closed-form solution: \[ \theta_{\text{ML}} = (X^\top X)^{-1} X^\top y. \]

Here:

  • \(X \in \mathbb{R}^{N \times D}\) is the design matrix, where each row is \(\mathbf{x}_n^\top\).
  • The matrix \(X^\top X\) must be invertible, which holds exactly when \(X\) has full column rank.

This solution minimizes the sum of squared errors between the predictions \(X\theta\) and the observed targets \(y\).
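As a minimal numerical sketch (the data and dimensions below are illustrative assumptions, not taken from the text), the closed-form estimate can be computed directly as a linear system, or via a least-squares solver, which avoids explicitly inverting \(X^\top X\):

```python
import numpy as np

# Illustrative data: N = 50 points in D = 3 dimensions (assumed for the sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # design matrix, rows are x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)    # noisy targets

# Closed-form MLE: theta_ML = (X^T X)^{-1} X^T y, solved as a linear system.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative: solve the least-squares problem directly.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```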


5.2.2 MLE with Nonlinear Features

Although the term “linear regression” refers to models that are linear in the parameters, the inputs can undergo nonlinear transformations through a feature mapping: \[ f(x) = \mathbf{\phi}(x)^\top \theta, \] where \(\mathbf{\phi}(x)\) is a feature vector (e.g., polynomial terms, basis functions).

We define the feature matrix as: \[ \mathbf{\Phi} = \begin{bmatrix} \mathbf{\phi}(x_1)^\top \\ \mathbf{\phi}(x_2)^\top \\ \vdots \\ \mathbf{\phi}(x_N)^\top \end{bmatrix} \in \mathbb{R}^{N \times K}. \]

For example, with second-order polynomial features: \[ \mathbf{\phi}(x) = [1, x, x^2]^\top, \quad \mathbf{\Phi} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_N & x_N^2 \end{bmatrix}. \] The corresponding MLE solution becomes: \[ \theta_{\text{ML}} = (\mathbf{\Phi}^\top \mathbf{\Phi})^{-1} \mathbf{\Phi}^\top y. \] The estimated noise variance can also be obtained as: \[ \sigma^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^N (y_n - \mathbf{\phi}(x_n)^\top \theta_{\text{ML}})^2. \]
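A short sketch of the same computation with second-order polynomial features; the inputs and targets are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Illustrative scalar inputs and noisy targets (assumed for the sketch).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Feature matrix Phi with rows [1, x_n, x_n^2].
Phi = np.vander(x, N=3, increasing=True)          # shape (N, K), K = 3

# MLE for the parameters and the noise variance.
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
residuals = y - Phi @ theta_ml
sigma2_ml = np.mean(residuals ** 2)               # (1/N) * sum of squared residuals
```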


5.2.3 Overfitting in Linear Regression

Overfitting occurs when a model fits the training data too closely, capturing noise rather than the underlying trend. To assess performance, we often compute the Root Mean Square Error (RMSE): \[ \text{RMSE} = \sqrt{\frac{1}{N} \| \mathbf{y} - \mathbf{\Phi}\mathbf{\theta} \|^2}. \]

  • Training error tends to decrease as model complexity (e.g., polynomial degree \(M\)) increases.
  • Test error initially decreases but then increases beyond an optimal model complexity, indicating overfitting.

For polynomial regression:

  • Low-degree polynomials underfit (poor fit to data).
  • Moderate-degree polynomials (e.g., \(M = 4\)) generalize well.
  • High-degree polynomials (\(M \geq N - 1\)) fit all training points but fail to generalize.
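The following sketch traces training and test RMSE as the polynomial degree \(M\) grows; the data, noise level, and degrees are illustrative assumptions, and the typical outcome matches the pattern described above (training error keeps falling, test error eventually rises):

```python
import numpy as np

# Illustrative data: a noisy sine, 10 training and 100 test points (assumed).
rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
x_train, x_test = rng.uniform(0, 1, 10), rng.uniform(0, 1, 100)
y_train = f(x_train) + 0.2 * rng.normal(size=x_train.size)
y_test = f(x_test) + 0.2 * rng.normal(size=x_test.size)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

for M in range(10):                                          # polynomial degree
    Phi_train = np.vander(x_train, M + 1, increasing=True)
    Phi_test = np.vander(x_test, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    print(M, rmse(y_train, Phi_train @ theta), rmse(y_test, Phi_test @ theta))
```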

5.2.4 Maximum A Posteriori (MAP) Estimation

Maximum likelihood estimation (MLE) can lead to overfitting, often resulting in very large parameter values. To mitigate this, we introduce a prior distribution \(p(\theta)\) on the parameters, which encodes our beliefs about plausible parameter values before observing any data.

For example: \[ p(\theta) = \mathcal{N}(0, 1) \] suggests that \(\theta\) is likely to lie within approximately \([-2, 2]\), since roughly 95% of the probability mass of a standard normal distribution falls within two standard deviations of the mean.

Once data \((\mathcal{X}, \mathcal{Y})\) are available, we seek parameters that maximize the posterior distribution: \[ p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) \, p(\mathbf{\theta})}{p(\mathcal{Y} | \mathcal{X})} \]

The MAP estimate is the maximizer of this posterior: \[ \mathbf{\theta}_{\text{MAP}} = \arg \max_{\mathbf{\theta}} \, p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}). \] Taking the logarithm, we get: \[ \log p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \log p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) + \log p(\mathbf{\theta}) + \text{const.} \]

This shows that the MAP estimate balances:

  • The log-likelihood term (fit to data), and
  • The log-prior term (penalizing unlikely parameters).

Hence, MAP can be viewed as a compromise between fitting the data and respecting prior beliefs.


5.2.5 Optimization

MAP estimation is equivalent to minimizing the negative log-posterior: \[ \mathbf{\theta}_{\text{MAP}} = \arg \min_{\mathbf{\theta}} \{-\log p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) - \log p(\mathbf{\theta})\}. \]

If we assume a Gaussian prior \(p(\mathbf{\theta}) = \mathcal{N}(\mathbf{0}, b^2 \mathbf{I})\), the negative log-posterior becomes: \[ - \log p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2}(y - \mathbf{\Phi} \mathbf{\theta})^T (y - \mathbf{\Phi} \mathbf{\theta}) + \frac{1}{2b^2}\mathbf{\theta}^T \mathbf{\theta} + \text{const.} \] Solving for \(\mathbf{\theta}\) yields the MAP estimate: \[ \mathbf{\theta}_{\text{MAP}} = \left(\mathbf{\Phi}^T \mathbf{\Phi} + \frac{\sigma^2}{b^2} \mathbf{I} \right)^{-1} \mathbf{\Phi}^T y. \]
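A minimal sketch of this computation, assuming the Gaussian likelihood and prior above; the helper name and the values of \(\sigma^2\) and \(b^2\) passed to it are illustrative:

```python
import numpy as np

def map_estimate(Phi, y, sigma2, b2):
    """MAP estimate for linear regression with a N(0, b^2 I) prior (illustrative helper).

    Solves (Phi^T Phi + (sigma^2 / b^2) I) theta = Phi^T y.
    """
    K = Phi.shape[1]
    lam = sigma2 / b2
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)
```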


5.2.6 Comparison to MLE

Compared to the MLE solution: \[ \mathbf{\theta}_{\text{ML}} = (\mathbf{\Phi}^T \mathbf{\Phi})^{-1} \mathbf{\Phi}^T y, \] MAP adds the regularization term \(\frac{\sigma^2}{b^2} \mathbf{I}\), which:

  • Ensures invertibility of the matrix (illustrated in the sketch after this list),
  • Reduces overfitting,
  • Shrinks parameter values toward zero.
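For instance, if two feature columns coincide, \(\mathbf{\Phi}^T \mathbf{\Phi}\) is singular and the plain MLE formula cannot be applied, while the regularized matrix remains invertible. A small illustrative check (the numbers are assumptions for the sketch):

```python
import numpy as np

# Two identical feature columns make Phi^T Phi rank-deficient (singular).
Phi = np.array([[1.0, 2.0, 2.0],
                [1.0, 3.0, 3.0],
                [1.0, 4.0, 4.0]])
y = np.array([1.0, 2.0, 3.0])

A = Phi.T @ Phi
print(np.linalg.matrix_rank(A))                    # 2 < 3, so A is not invertible

lam = 0.1                                          # sigma^2 / b^2, illustrative value
theta_map = np.linalg.solve(A + lam * np.eye(3), Phi.T @ y)   # well defined
print(theta_map)
```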

Example 5.2 In polynomial regression:

  • Using a Gaussian prior \(p(\mathbf{\theta}) = \mathcal{N}(0, \mathbf{I})\),
  • The MAP estimate smooths high-degree polynomials (reducing overfitting),
  • The effect is minimal for low-degree models.

However, while MAP helps, it is not a complete solution to overfitting.


5.2.7 MAP Estimation as Regularization

Instead of using a prior, we can add a regularization term directly to the loss function: \[ \min_{\mathbf{\theta}} \|y - \mathbf{\Phi} \mathbf{\theta}\|^2 + \lambda \|\mathbf{\theta}\|_2^2. \] Here:

  • The first term is the data-fit (misfit) term,
  • The second term is the regularizer, which penalizes large parameter values,
  • \(\lambda \ge 0\) controls the strength of regularization.

The regularization term corresponds to the negative log of a Gaussian prior: \[ -\log p(\mathbf{\theta}) = \frac{1}{2b^2}\|\mathbf{\theta}\|_2^2 + \text{const.} \] If we set \(\lambda = \frac{\sigma^2}{b^2}\), the regularized least squares and MAP estimation solutions are equivalent: \[ \mathbf{\theta}_{\text{RLS}} = (\mathbf{\Phi}^T \mathbf{\Phi} + \lambda I)^{-1} \mathbf{\Phi}^T y. \]
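For completeness, the closed form follows by setting the gradient of the regularized loss to zero: \[ \nabla_{\mathbf{\theta}} \left( \|y - \mathbf{\Phi}\mathbf{\theta}\|^2 + \lambda \|\mathbf{\theta}\|_2^2 \right) = -2\,\mathbf{\Phi}^T (y - \mathbf{\Phi}\mathbf{\theta}) + 2\lambda \mathbf{\theta} = \mathbf{0} \quad \Longrightarrow \quad (\mathbf{\Phi}^T \mathbf{\Phi} + \lambda \mathbf{I})\,\mathbf{\theta}_{\text{RLS}} = \mathbf{\Phi}^T y. \]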

Example 5.3 Linear Regression on a Small Dataset

We consider a small dataset with 5 observations.
Let the input be a scalar \(x\) and the response be \(y\):

\[ \begin{array}{c|ccccc} x & 0 & 1 & 2 & 3 & 4 \\ \hline y & 1 & 2 & 2 & 3 & 5 \end{array} \]

We assume Gaussian noise throughout.

Model with Nonlinear Features

Instead of a purely linear model, we use nonlinear features: \[ \phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix} \]

The regression model is: \[ y = \mathbf{w}^\top \phi(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

Let: \[ \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} \]

Design Matrix

The design matrix, whose rows are \(\phi(x_n)^\top\), is: \[ \mathbf{X} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 3 \\ 5 \end{bmatrix} \]

Maximum Likelihood Estimation (MLE)

The likelihood is: \[ p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}) \] Maximizing the likelihood is equivalent to minimizing: \[ \mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 \]

The MLE solution is: \[ \hat{\mathbf{w}}_{\text{MLE}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]

This shows:

  • MLE = least squares
  • Nonlinear features handled via \(\mathbf{X}\) (see the numerical sketch below)
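A minimal numerical sketch for this concrete dataset (computed with NumPy; the rounded values in the comment follow from the least-squares formula):

```python
import numpy as np

# Design matrix and targets from Example 5.3.
X = np.array([[1, 0, 0],
              [1, 1, 1],
              [1, 2, 4],
              [1, 3, 9],
              [1, 4, 16]], dtype=float)
y = np.array([1, 2, 2, 3, 5], dtype=float)

# MLE / least squares: w = (X^T X)^{-1} X^T y.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_mle)        # approximately [1.23, 0.04, 0.21]
```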

Regularization (Ridge Regression)

To prevent overfitting, add an \(\ell_2\) penalty: \[ \mathcal{L}_{\text{reg}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2 \]

The solution becomes: \[ \hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \] This is regularized MLE.

MAP Estimation

Assume a Gaussian prior on the parameters: \[ p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I}) \] The posterior is: \[ p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w}) \] Maximizing the posterior is equivalent to minimizing: \[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\sigma^2}{\tau^2}\|\mathbf{w}\|^2 \] Thus the MAP estimator is: \[ \hat{\mathbf{w}}_{\text{MAP}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \quad \text{where } \lambda = \frac{\sigma^2}{\tau^2} \]
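Continuing the numerical sketch for this dataset (the value of \(\lambda\) is an illustrative assumption), the MAP/ridge estimate shrinks the parameters relative to the MLE:

```python
import numpy as np

X = np.array([[1, 0, 0],
              [1, 1, 1],
              [1, 2, 4],
              [1, 3, 9],
              [1, 4, 16]], dtype=float)
y = np.array([1, 2, 2, 3, 5], dtype=float)

lam = 1.0                                          # lambda = sigma^2 / tau^2 (illustrative)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# The ridge/MAP solution never has a larger norm than the MLE solution.
print(np.linalg.norm(w_map), np.linalg.norm(w_mle))
```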


Exercises

Exercise 5.5 Why should we maximize the log-likelihood rather than the likelihood?

Exercise 5.6 Show that \(p(Y|X, \mathbf{\theta}) = \prod_{n=1}^N \mathcal{N}(y_n|\mathbf{x}_n^T\mathbf{\theta}, \sigma^2)\).

Exercise 5.7 Show that \[-\log p(Y|X, \mathbf{\theta}) = -\sum_{n=1}^N \log p(y_n|\mathbf{x}_n, \mathbf{\theta}).\]

Exercise 5.8 State the formula for \(p(y_n|\mathbf{x}_n, \mathbf{\theta})\), assuming that \(\sigma^2\) is known. Use it to show that \[\log p(y_n|\mathbf{x}_n, \mathbf{\theta}) = -\dfrac{1}{2\sigma^2}(y_n - \mathbf{x}_n^T \mathbf{\theta})^2 + \text{const.}\] What is the constant? Why does it matter that the constant does not depend on \(\mathbf{\theta}\)?

Exercise 5.9 Suppose \(y = 1 + 2x_1 + 3x_2\), \(\mathbf{X} = \begin{bmatrix}1 & 4 & -3 \\ 1 & 1 & 1 \end{bmatrix}\), and \(\mathbf{y} = \begin{bmatrix}0\\5 \end{bmatrix}\). Show that \[\dfrac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mathbf{x}_n^T \mathbf{\theta})^2 = \dfrac{1}{2\sigma^2} ||\mathbf{y} - \mathbf{X}\mathbf{\theta}||^2. \]

Exercise 5.10 Define \(L(\theta)\) as the negative log-likelihood function. Show that \[\begin{align*} L(\mathbf{\theta}) &= \dfrac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mathbf{x}_n^T \mathbf{\theta})^2 \\ &= \dfrac{1}{2\sigma^2} ||\mathbf{y} - \mathbf{X}\mathbf{\theta}||^2. \end{align*}\]

Exercise 5.11 Show that \[\dfrac{dL}{d\mathbf{\theta}} = \dfrac{1}{\sigma^2}\left(-\mathbf{y}^T\mathbf{X} + \mathbf{\theta}^T\mathbf{X}^T\mathbf{X}\right) \in \mathbb{R}^{1 \times D}.\] Justify each step.

Exercise 5.12 Solve \(\dfrac{dL}{d\mathbf{\theta}} = 0\) to find \(\mathbf{\theta}_{ML}\).

Exercise 5.13 To find the MAP estimate, begin with \[p(\mathbf{\theta}|X,Y) = \dfrac{p(Y|X, \mathbf{\theta})p(\mathbf{\theta})}{p(Y|X)}.\] Take the logarithm and find the maximizer of the log-posterior, \(\mathbf{\theta}_{\text{MAP}}\).