5.2 Parameter Estimation

In linear regression, we are given a training dataset
\[ \mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}, \] where \(\mathbf{x}_n \in \mathbb{R}^D\) are inputs and \(y_n \in \mathbb{R}\) are the corresponding observations. Let \(\mathcal{X}\) and \(\mathcal{Y}\) denote the sets of training inputs and training targets, respectively. We assume each target is generated according to a probabilistic model: \[ p(y_n | \mathbf{x}_n, \theta) = \mathcal{N}(y_n| \mathbf{x}_n^\top \theta, \sigma^2), \] where \(\theta \in \mathbb{R}^D\) are the model parameters and \(\sigma^2\) is the noise variance. Because the observations are conditionally independent given the inputs and parameters, the likelihood factorizes as: \[ p(\mathcal{Y} | \mathcal{X}, \theta) = \prod_{n=1}^N p(y_n | \mathbf{x}_n, \theta) = \prod_{n=1}^N \mathcal{N}(y_n| \mathbf{x}_n^\top \theta, \sigma^2). \]


5.2.1 Maximum Likelihood Estimation (MLE)

The Maximum Likelihood Estimation approach finds parameters that maximize the likelihood: \[ \theta_{\text{ML}} = \arg \max_\theta p(\mathcal{Y} | \mathcal{X}, \theta). \]

Since taking the logarithm does not change the location of the optimum, we minimize the negative log-likelihood instead: \[ L(\theta) = -\log p(\mathcal{Y} | \mathcal{X}, \theta) = \frac{1}{2\sigma^2} \| y - X\theta \|^2 + \text{const.} \] This expression is quadratic in \(\theta\), so setting its gradient to zero yields the closed-form solution: \[ \theta_{\text{ML}} = (X^\top X)^{-1} X^\top y. \]

Here:

  • \(X \in \mathbb{R}^{N \times D}\) is the design matrix, where each row is \(\mathbf{x}_n^\top\).
  • The matrix \(X^\top X\) must be invertible, which holds exactly when \(X\) has full column rank.

This solution minimizes the sum of squared errors between the predictions \(X\theta\) and the observed targets \(y\).
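As a minimal numerical sketch (the data and dimensions below are illustrative assumptions, not taken from the text), the closed-form estimate can be computed directly as a linear system, or via a least-squares solver, which avoids explicitly inverting \(X^\top X\):

```python
import numpy as np

# Illustrative data: N = 50 points in D = 3 dimensions (assumed for the sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # design matrix, rows are x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=50)    # noisy targets

# Closed-form MLE: theta_ML = (X^T X)^{-1} X^T y, solved as a linear system.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically more stable alternative: solve the least-squares problem directly.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```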


5.2.2 MLE with Nonlinear Features

Although the term “linear regression” refers to models that are linear in the parameters, the inputs can undergo nonlinear transformations through a feature mapping: \[ f(x) = \mathbf{\phi}(x)^\top \theta, \] where \(\mathbf{\phi}(x)\) is a feature vector (e.g., polynomial terms, basis functions).

We define the feature matrix as: \[ \mathbf{\Phi} = \begin{bmatrix} \mathbf{\phi}(x_1)^\top \\ \mathbf{\phi}(x_2)^\top \\ \vdots \\ \mathbf{\phi}(x_N)^\top \end{bmatrix} \in \mathbb{R}^{N \times K}. \]

For example, with second-order polynomial features: \[ \mathbf{\phi}(x) = [1, x, x^2]^\top, \quad \mathbf{\Phi} = \begin{bmatrix} 1 & x_1 & x_1^2 \\ 1 & x_2 & x_2^2 \\ \vdots & \vdots & \vdots \\ 1 & x_N & x_N^2 \end{bmatrix}. \] The corresponding MLE solution becomes: \[ \theta_{\text{ML}} = (\mathbf{\Phi}^\top \mathbf{\Phi})^{-1} \mathbf{\Phi}^\top y. \] The estimated noise variance can also be obtained as: \[ \sigma^2_{\text{ML}} = \frac{1}{N} \sum_{n=1}^N (y_n - \mathbf{\phi}(x_n)^\top \theta_{\text{ML}})^2. \]
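A short sketch of the same computation with second-order polynomial features; the inputs and targets are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Illustrative scalar inputs and noisy targets (assumed for the sketch).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Feature matrix Phi with rows [1, x_n, x_n^2].
Phi = np.vander(x, N=3, increasing=True)          # shape (N, K), K = 3

# MLE for the parameters and the noise variance.
theta_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
residuals = y - Phi @ theta_ml
sigma2_ml = np.mean(residuals ** 2)               # (1/N) * sum of squared residuals
```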


5.2.3 Overfitting in Linear Regression

Overfitting occurs when a model fits the training data too closely, capturing noise rather than the underlying trend. To assess performance, we often compute the Root Mean Square Error (RMSE): \[ \text{RMSE} = \sqrt{\frac{1}{N} \| \mathbf{y} - \mathbf{\Phi}\mathbf{\theta} \|^2}. \]

  • Training error tends to decrease as model complexity (e.g., polynomial degree \(M\)) increases.
  • Test error initially decreases but then increases beyond an optimal model complexity, indicating overfitting.

For polynomial regression:

  • Low-degree polynomials underfit (poor fit to data).
  • Moderate-degree polynomials (e.g., \(M = 4\)) generalize well.
  • High-degree polynomials (\(M \geq N - 1\)) fit all training points but fail to generalize.
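The following sketch traces training and test RMSE as the polynomial degree \(M\) grows; the data, noise level, and degrees are illustrative assumptions, and the typical outcome matches the pattern described above (training error keeps falling, test error eventually rises):

```python
import numpy as np

# Illustrative data: a noisy sine, 10 training and 100 test points (assumed).
rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
x_train, x_test = rng.uniform(0, 1, 10), rng.uniform(0, 1, 100)
y_train = f(x_train) + 0.2 * rng.normal(size=x_train.size)
y_test = f(x_test) + 0.2 * rng.normal(size=x_test.size)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

for M in range(10):                                          # polynomial degree
    Phi_train = np.vander(x_train, M + 1, increasing=True)
    Phi_test = np.vander(x_test, M + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    print(M, rmse(y_train, Phi_train @ theta), rmse(y_test, Phi_test @ theta))
```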

5.2.4 Maximum A Posteriori (MAP) Estimation

Maximum likelihood estimation (MLE) can lead to overfitting, often resulting in very large parameter values. To mitigate this, we introduce a prior distribution \(p(\theta)\) on the parameters, which encodes our beliefs about plausible parameter values before observing any data.

For example: \[ p(\theta) = \mathcal{N}(0, 1) \] suggests that \(\theta\) is likely to lie within approximately \([-2, 2]\), since roughly 95% of the probability mass of a standard normal distribution falls within two standard deviations of the mean.

Once data \((\mathcal{X}, \mathcal{Y})\) are available, we seek parameters that maximize the posterior distribution: \[ p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) \, p(\mathbf{\theta})}{p(\mathcal{Y} | \mathcal{X})} \]

The MAP estimate is the maximizer of this posterior: \[ \mathbf{\theta}_{\text{MAP}} = \arg \max_{\mathbf{\theta}} \, p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}). \] Taking the logarithm, we get: \[ \log p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \log p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) + \log p(\mathbf{\theta}) + \text{const.} \]

This shows that the MAP estimate balances:

  • The log-likelihood term (fit to data), and
  • The log-prior term (penalizing unlikely parameters).

Hence, MAP can be viewed as a compromise between fitting the data and respecting prior beliefs.


5.2.5 Optimization

MAP estimation is equivalent to minimizing the negative log-posterior: \[ \mathbf{\theta}_{\text{MAP}} = \arg \min_{\mathbf{\theta}} \{-\log p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) - \log p(\mathbf{\theta})\}. \]

If we assume a Gaussian prior \(p(\mathbf{\theta}) = \mathcal{N}(\mathbf{0}, b^2 \mathbf{I})\), the negative log-posterior becomes: \[ - \log p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \frac{1}{2\sigma^2}(y - \mathbf{\Phi} \mathbf{\theta})^T (y - \mathbf{\Phi} \mathbf{\theta}) + \frac{1}{2b^2}\mathbf{\theta}^T \mathbf{\theta} + \text{const.} \] Solving for \(\mathbf{\theta}\) yields the MAP estimate: \[ \mathbf{\theta}_{\text{MAP}} = \left(\mathbf{\Phi}^T \mathbf{\Phi} + \frac{\sigma^2}{b^2} \mathbf{I} \right)^{-1} \mathbf{\Phi}^T y. \]
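A minimal sketch of this computation, assuming the Gaussian likelihood and prior above; the helper name and the values of \(\sigma^2\) and \(b^2\) passed to it are illustrative:

```python
import numpy as np

def map_estimate(Phi, y, sigma2, b2):
    """MAP estimate for linear regression with a N(0, b^2 I) prior (illustrative helper).

    Solves (Phi^T Phi + (sigma^2 / b^2) I) theta = Phi^T y.
    """
    K = Phi.shape[1]
    lam = sigma2 / b2
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(K), Phi.T @ y)
```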


5.2.6 Comparison to MLE

Compared to the MLE solution: \[ \mathbf{\theta}_{\text{ML}} = (\mathbf{\Phi}^T \mathbf{\Phi})^{-1} \mathbf{\Phi}^T y, \] MAP adds the regularization term \(\frac{\sigma^2}{b^2} \mathbf{I}\), which:

  • Ensures invertibility of the matrix (illustrated in the sketch after this list),
  • Reduces overfitting,
  • Shrinks parameter values toward zero.
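For instance, if two feature columns coincide, \(\mathbf{\Phi}^T \mathbf{\Phi}\) is singular and the plain MLE formula cannot be applied, while the regularized matrix remains invertible. A small illustrative check (the numbers are assumptions for the sketch):

```python
import numpy as np

# Two identical feature columns make Phi^T Phi rank-deficient (singular).
Phi = np.array([[1.0, 2.0, 2.0],
                [1.0, 3.0, 3.0],
                [1.0, 4.0, 4.0]])
y = np.array([1.0, 2.0, 3.0])

A = Phi.T @ Phi
print(np.linalg.matrix_rank(A))                    # 2 < 3, so A is not invertible

lam = 0.1                                          # sigma^2 / b^2, illustrative value
theta_map = np.linalg.solve(A + lam * np.eye(3), Phi.T @ y)   # well defined
print(theta_map)
```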

Example 5.2 In polynomial regression:

  • Using a Gaussian prior \(p(\mathbf{\theta}) = \mathcal{N}(0, \mathbf{I})\),
  • The MAP estimate smooths high-degree polynomials (reducing overfitting),
  • The effect is minimal for low-degree models.

However, while MAP helps, it is not a complete solution to overfitting.


5.2.7 MAP Estimation as Regularization

Instead of using a prior, we can add a regularization term directly to the loss function: \[ \min_{\mathbf{\theta}} \|y - \mathbf{\Phi} \mathbf{\theta}\|^2 + \lambda \|\mathbf{\theta}\|_2^2. \] Here:

  • The first term is the data-fit (misfit) term,
  • The second term is the regularizer, which penalizes large parameter values,
  • \(\lambda \ge 0\) controls the strength of regularization.

The regularization term corresponds to the negative log of a Gaussian prior: \[ -\log p(\mathbf{\theta}) = \frac{1}{2b^2}\|\mathbf{\theta}\|_2^2 + \text{const.} \] If we set \(\lambda = \frac{\sigma^2}{b^2}\), the regularized least squares and MAP estimation solutions are equivalent: \[ \mathbf{\theta}_{\text{RLS}} = (\mathbf{\Phi}^T \mathbf{\Phi} + \lambda I)^{-1} \mathbf{\Phi}^T y. \]
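For completeness, the closed form follows by setting the gradient of the regularized loss to zero: \[ \nabla_{\mathbf{\theta}} \left( \|y - \mathbf{\Phi}\mathbf{\theta}\|^2 + \lambda \|\mathbf{\theta}\|_2^2 \right) = -2\,\mathbf{\Phi}^T (y - \mathbf{\Phi}\mathbf{\theta}) + 2\lambda \mathbf{\theta} = \mathbf{0} \quad \Longrightarrow \quad (\mathbf{\Phi}^T \mathbf{\Phi} + \lambda \mathbf{I})\,\mathbf{\theta}_{\text{RLS}} = \mathbf{\Phi}^T y. \]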

Example 5.3 Linear Regression on a Small Dataset

We consider a small dataset with 5 observations.
Let the input be a scalar \(x\) and the response be \(y\):

\[ \begin{array}{c|ccccc} x & 0 & 1 & 2 & 3 & 4 \\ \hline y & 1 & 2 & 2 & 3 & 5 \end{array} \]

We assume Gaussian noise throughout.

Model with Nonlinear Features

Instead of a purely linear model, we use nonlinear features: \[ \phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \end{bmatrix} \]

The regression model is: \[ y = \mathbf{w}^\top \phi(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2) \]

Let: \[ \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} \]

Design Matrix

The design matrix, whose rows are \(\phi(x_n)^\top\), is: \[ \mathbf{X} = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 3 \\ 5 \end{bmatrix} \]

Maximum Likelihood Estimation (MLE)

The likelihood is: \[ p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \mathcal{N}(\mathbf{y} \mid \mathbf{X}\mathbf{w}, \sigma^2 \mathbf{I}) \] Maximizing the likelihood is equivalent to minimizing: \[ \mathcal{L}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 \]

The MLE solution is: \[ \hat{\mathbf{w}}_{\text{MLE}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \]

This shows:

  • MLE = least squares
  • Nonlinear features handled via \(\mathbf{X}\) (see the numerical sketch below)
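A minimal numerical sketch for this concrete dataset (computed with NumPy; the rounded values in the comment follow from the least-squares formula):

```python
import numpy as np

# Design matrix and targets from Example 5.3.
X = np.array([[1, 0, 0],
              [1, 1, 1],
              [1, 2, 4],
              [1, 3, 9],
              [1, 4, 16]], dtype=float)
y = np.array([1, 2, 2, 3, 5], dtype=float)

# MLE / least squares: w = (X^T X)^{-1} X^T y.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)
print(w_mle)        # approximately [1.23, 0.04, 0.21]
```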

Regularization (Ridge Regression)

To prevent overfitting, add an \(\ell_2\) penalty: \[ \mathcal{L}_{\text{reg}}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2 \]

The solution becomes: \[ \hat{\mathbf{w}}_{\text{ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \] This is regularized MLE.

MAP Estimation

Assume a Gaussian prior on the parameters: \[ p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \tau^2 \mathbf{I}) \] The posterior is: \[ p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) \propto p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \, p(\mathbf{w}) \] Maximizing the posterior is equivalent to minimizing: \[ \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\sigma^2}{\tau^2}\|\mathbf{w}\|^2 \] Thus the MAP estimator is: \[ \hat{\mathbf{w}}_{\text{MAP}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y} \quad \text{where } \lambda = \frac{\sigma^2}{\tau^2} \]
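Continuing the numerical sketch for this dataset (the value of \(\lambda\) is an illustrative assumption), the MAP/ridge estimate shrinks the parameters relative to the MLE:

```python
import numpy as np

X = np.array([[1, 0, 0],
              [1, 1, 1],
              [1, 2, 4],
              [1, 3, 9],
              [1, 4, 16]], dtype=float)
y = np.array([1, 2, 2, 3, 5], dtype=float)

lam = 1.0                                          # lambda = sigma^2 / tau^2 (illustrative)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# The ridge/MAP solution never has a larger norm than the MLE solution.
print(np.linalg.norm(w_map), np.linalg.norm(w_mle))
```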


Exercises

Exercise 5.5 Why should we maximize the log-likelihood rather than the likelihood?

Exercise 5.6 Show that \(p(Y|X, \mathbf{\theta}) = \prod_{n=1}^N \mathcal{N}(y_n|\mathbf{x}_n^T\mathbf{\theta}, \sigma^2)\).

Exercise 5.7 Show that \[-\log p(Y|X, \mathbf{\theta}) = -\sum_{n=1}^N \log p(y_n|\mathbf{x}_n, \mathbf{\theta}).\]

Exercise 5.8 State the formula for \(p(y_n|\mathbf{x}_n, \mathbf{\theta})\), assuming that \(\sigma^2\) is known. Use it to show that \[\log p(y_n|\mathbf{x}_n, \mathbf{\theta}) = -\dfrac{1}{2\sigma^2}(y_n - \mathbf{x}_n^T \mathbf{\theta})^2 + \text{const.}\] What is the constant? Why does it matter that the constant does not depend on \(\mathbf{\theta}\)?

Exercise 5.9 Suppose \(y = 1 + 2x_1 + 3x_2\), \(\mathbf{X} = \begin{bmatrix}1 & 4 & -3 \\ 1 & 1 & 1 \end{bmatrix}\), and \(\mathbf{y} = \begin{bmatrix}0\\5 \end{bmatrix}\). Show that \[\dfrac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mathbf{x}_n^T \mathbf{\theta})^2 = \dfrac{1}{2\sigma^2} ||\mathbf{y} - \mathbf{X}\mathbf{\theta}||^2. \]

Exercise 5.10 Define \(L(\theta)\) as the negative log-likelihood function. Show that \[\begin{align*} L(\mathbf{\theta}) &= \dfrac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mathbf{x}_n^T \mathbf{\theta})^2 \\ &= \dfrac{1}{2\sigma^2} ||\mathbf{y} - \mathbf{X}\mathbf{\theta}||^2. \end{align*}\]

Exercise 5.11 Show that \[\dfrac{dL}{d\mathbf{\theta}} = \dfrac{1}{\sigma^2}\left(-\mathbf{y}^T\mathbf{X} + \mathbf{\theta}^T\mathbf{X}^T\mathbf{X}\right) \in \mathbb{R}^{1 \times D}.\] Justify each step.

Exercise 5.12 Solve \(\dfrac{dL}{d\mathbf{\theta}} = 0\) to find \(\mathbf{\theta}_{ML}\).

Exercise 5.13 To find the MAP estimate, begin with \[p(\mathbf{\theta}|X,Y) = \dfrac{p(Y|X, \mathbf{\theta})p(\mathbf{\theta})}{p(Y|X)}.\] Take the logarithm and find the maximizer of the log-posterior, \(\mathbf{\theta}_{\text{MAP}}\).