5.3 Bayesian Linear Regression
Bayesian linear regression extends the concept of linear regression by treating parameters as random variables instead of estimating a single best set of parameters (as in MLE or MAP). Rather than finding a point estimate for parameters \(\mathbf{\theta}\), it computes the full posterior distribution over them and uses this distribution to make predictions.
This approach:
- Incorporates prior beliefs about parameters,
- Naturally accounts for uncertainty,
- Mitigates overfitting, especially with limited data.
5.3.1 The Model
We define the model as: \[ p(\mathbf{\theta}) = \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0) \quad \text{(prior on parameters)} \] \[ p(y | \mathbf{x}, \mathbf{\theta}) = \mathcal{N}(y | \phi(\mathbf{x})^\top \mathbf{\theta}, \sigma^2) \quad \text{(likelihood)}. \] The joint distribution is: \[ p(y, \mathbf{\theta} | \mathbf{x}) = p(y | \mathbf{x}, \mathbf{\theta}) \, p(\mathbf{\theta}). \] Here:
- \(\mathbf{\theta}\) is now a random variable,
- \(\mathbf{m}_0\) and \(\mathbf{S}_0\) are the prior mean and covariance.
Predictions are obtained by integrating out the parameter uncertainty: \[ p(y_* | \mathbf{x}_*) = \int p(y_* | \mathbf{x}_*, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}. \] Since both likelihood and prior are Gaussian, the predictive distribution is also Gaussian: \[ p(y_* | \mathbf{x}_*) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_0, \phi(\mathbf{x}_*)^\top \mathbf{S}_0 \phi(\mathbf{x}_*) + \sigma^2) \]
The term \(\phi(\mathbf{x}_*)^\top \mathbf{S}_0 \phi(\mathbf{x}_*)\) reflects uncertainty due to parameter variability. The term \(\sigma^2\) reflects observation noise.
For noise-free function values \(f(\mathbf{x}_*) = \phi(\mathbf{x}_*)^\top \mathbf{\theta}\): \[ p(f(\mathbf{x}_*)) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_0, \phi(\mathbf{x}_*)^\top \mathbf{S}_0 \phi(\mathbf{x}_*)) \]
The parameter prior \(p(\mathbf{\theta})\) induces a distribution over functions:
- Each sampled parameter vector \(\mathbf{\theta}_i \sim p(\mathbf{\theta})\) defines a function \(f_i(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{\theta}_i\).
- The collection of these defines \(p(f(\cdot))\), a distribution over possible functions.
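To make this concrete, here is a minimal numpy sketch (the feature map \(\phi(x) = [1, x]^\top\) and the zero-mean, unit-covariance prior are illustrative choices, matching Example 5.4 below): each draw \(\mathbf{\theta}_i\) from the prior produces one candidate function.

```python
import numpy as np

rng = np.random.default_rng(0)

m0 = np.zeros(2)                                  # prior mean
S0 = np.eye(2)                                    # prior covariance

xs = np.linspace(-3.0, 3.0, 50)
Phi = np.column_stack([np.ones_like(xs), xs])     # phi(x) = [1, x] for each input

# Each row of thetas is one draw theta_i ~ N(m0, S0); each column of
# prior_functions is the corresponding line f_i(x) = phi(x)^T theta_i.
thetas = rng.multivariate_normal(m0, S0, size=5)  # shape (5, 2)
prior_functions = Phi @ thetas.T                  # shape (50, 5)
```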
Example 5.4 Bayesian Linear Regression with a Single Feature
We model a simple linear relationship \[ y = \theta_0 + \theta_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2). \] Define the feature map \[ \phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}, \quad \mathbf{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}. \] Then \[ p(y \mid x, \mathbf{\theta}) = \mathcal{N}(y \mid \phi(x)^\top \mathbf{\theta}, \sigma^2). \]
Assume a Gaussian prior: \[ p(\mathbf{\theta}) = \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0), \quad \mathbf{m}_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \mathbf{S}_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]
Interpretation:
- Before seeing data, we believe slopes and intercepts are near 0.
- Large uncertainty allows many possible linear functions.
Each draw \(\mathbf{\theta}_i \sim p(\mathbf{\theta})\) defines a function \[ f_i(x) = \theta_{0,i} + \theta_{1,i} x. \] Thus the prior defines a distribution over lines.
For a new input \(x_*\), \[ p(y_* \mid x_*) = \mathcal{N} \left( \phi(x_*)^\top \mathbf{m}_0,\; \phi(x_*)^\top \mathbf{S}_0 \phi(x_*) + \sigma^2 \right). \] Because \(\mathbf{m}_0 = \mathbf{0}\): \[ \mathbb{E}[y_*] = 0. \] The variance \[ \phi(x_*)^\top \mathbf{S}_0 \phi(x_*) = 1 + x_*^2 \] grows with distance from the origin, so the prior predictive uncertainty increases the further \(x_*\) is from zero.
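A short sketch of this prior predictive computation (the noise level \(\sigma^2 = 0.25\) is an assumed value, since the example does not fix one):

```python
import numpy as np

m0 = np.zeros(2)       # prior mean from Example 5.4
S0 = np.eye(2)         # prior covariance from Example 5.4
sigma2 = 0.25          # assumed observation-noise variance, for illustration

def prior_predictive(x_star):
    """Prior predictive mean and variance of y_* at a scalar input x_*."""
    phi_star = np.array([1.0, x_star])
    mean = phi_star @ m0                         # = 0 because m0 = 0
    var = phi_star @ S0 @ phi_star + sigma2      # = 1 + x_*^2 + sigma^2
    return mean, var

print(prior_predictive(0.0))   # (0.0, 1.25)
print(prior_predictive(2.0))   # (0.0, 5.25)
```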
5.3.2 Posterior Distribution
After observing training data \((\mathcal{X}, \mathcal{Y})\), we compute the posterior over parameters:
\[ p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta})}{p(\mathcal{Y} | \mathcal{X})} \]
Where:
- \(p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{\theta}, \sigma^2 \mathbf{I})\)
- \(p(\mathcal{Y} | \mathcal{X}) = \int p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}\) is the marginal likelihood (normalizing constant).
Because both prior and likelihood are Gaussian, the posterior is also Gaussian: \[ p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\mathbf{\theta} | \mathbf{m}_N, \mathbf{S}_N), \] with \[ \mathbf{S}_N = (\mathbf{S}_0^{-1} + \sigma^{-2} \mathbf{\Phi}^\top \mathbf{\Phi})^{-1}, \quad \mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \sigma^{-2} \mathbf{\Phi}^\top \mathcal{Y}). \] This is derived by completing the square in the exponent of the unnormalized posterior.
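A minimal numpy sketch of this update, following the formulas literally (the function name is illustrative; in practice one would prefer a Cholesky-based solve to the explicit inverses):

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma2):
    """Gaussian posterior N(m_N, S_N) over theta, given design matrix Phi and targets y."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)   # S_N
    mN = SN @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)        # m_N
    return mN, SN
```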
Example 5.5 Posterior After Observing Data
Suppose we observe: \[ \mathcal{X} = \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}, \quad \mathcal{Y} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \] The design matrix is \[ \mathbf{\Phi} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}. \]
Because the prior and likelihood are Gaussian, the posterior is: \[ p(\mathbf{\theta} \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\mathbf{m}_N, \mathbf{S}_N), \] with \[ \mathbf{S}_N = (\mathbf{S}_0^{-1} + \sigma^{-2} \mathbf{\Phi}^\top \mathbf{\Phi})^{-1}, \quad \mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \sigma^{-2} \mathbf{\Phi}^\top \mathcal{Y}). \]
Interpretation:
- Posterior mean \(\mathbf{m}_N\): point estimate of the parameters; for this Gaussian posterior it coincides with the MAP estimate.
- Posterior covariance \(\mathbf{S}_N\): remaining uncertainty after seeing data.
Uncertainty shrinks in directions well-supported by data.
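Plugging the numbers from this example into the update gives a concrete posterior; a minimal sketch, assuming the prior from Example 5.4 and a noise level \(\sigma^2 = 0.25\) (not specified in the example):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 2.0])
Phi = np.column_stack([np.ones_like(X), X])      # [[1,0],[1,1],[1,2]]

m0, S0, sigma2 = np.zeros(2), np.eye(2), 0.25    # prior from Example 5.4; sigma2 assumed

S0_inv = np.linalg.inv(S0)
SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)
mN = SN @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)

print("m_N:", mN)    # posterior mean of [intercept, slope]
print("S_N:", SN)    # posterior covariance; entries much smaller than in S_0 = I
```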
5.3.3 Posterior Predictions
To predict at a new point \(\mathbf{x}_*\), we again integrate over \(\mathbf{\theta}\), but now using the posterior instead of the prior: \[ p(y_* | \mathcal{X}, \mathcal{Y}, \mathbf{x}_*) = \int p(y_* | \mathbf{x}_*, \mathbf{\theta}) p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) d\mathbf{\theta}. \] This yields: \[ p(y_* | \mathcal{X}, \mathcal{Y}, \mathbf{x}_*) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_N, \phi(\mathbf{x}_*)^\top \mathbf{S}_N \phi(\mathbf{x}_*) + \sigma^2). \]
- The predictive mean \(\phi(\mathbf{x}_*)^\top \mathbf{m}_N\) equals the prediction made with the MAP estimate \(\mathbf{m}_N\).
- The predictive variance accounts for both model and observation uncertainty.
For \(f(\mathbf{x}_*) = \phi(\mathbf{x}_*)^\top \mathbf{\theta}\): \[ \mathbb{E}[f(\mathbf{x}_*) | \mathcal{X}, \mathcal{Y}] = \phi(\mathbf{x}_*)^\top \mathbf{m}_N, \quad \text{Var}[f(\mathbf{x}_*) | \mathcal{X}, \mathcal{Y}] = \phi(\mathbf{x}_*)^\top \mathbf{S}_N \phi(\mathbf{x}_*) \]
Sampling \(\mathbf{\theta}_i \sim p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y})\) gives functions \(f_i(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{\theta}_i\). These represent the posterior distribution over functions, with:
- Mean function: \(\mathbf{m}_N^\top \phi(\mathbf{x})\),
- Variance: \(\phi(\mathbf{x})^\top \mathbf{S}_N \phi(\mathbf{x})\).
Interpretation:
Higher uncertainty (larger variance) reflects regions with sparse or no data.
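The sketch below packages both the predictive computation and the function sampling, again assuming \(\phi(x) = [1, x]^\top\); the function names are illustrative, and \(\mathbf{m}_N\), \(\mathbf{S}_N\), \(\sigma^2\) would come from a posterior update such as the one in Section 5.3.2.

```python
import numpy as np

def predictive(x_star, mN, SN, sigma2):
    """Posterior predictive mean and variance of y_* for phi(x) = [1, x]."""
    phi_star = np.array([1.0, x_star])
    mean = phi_star @ mN
    var_f = phi_star @ SN @ phi_star      # uncertainty inherited from the parameters
    return mean, var_f + sigma2           # observation noise added for y_*

def sample_posterior_functions(mN, SN, xs, n_samples=5, seed=0):
    """Draw theta_i ~ N(m_N, S_N) and evaluate the induced lines at the inputs xs."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    thetas = rng.multivariate_normal(mN, SN, size=n_samples)   # (n_samples, 2)
    Phi = np.column_stack([np.ones_like(xs), xs])              # (len(xs), 2)
    return Phi @ thetas.T                                      # (len(xs), n_samples)
```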
Example 5.6 Posterior Predictive Distribution
For a new input \(x_* = 1.5\):
\[ p(y_* \mid \mathcal{X}, \mathcal{Y}, x_*) = \mathcal{N} \left( \phi(x_*)^\top \mathbf{m}_N,\; \phi(x_*)^\top \mathbf{S}_N \phi(x_*) + \sigma^2 \right). \]
This variance has two components:
- \(\phi(x_*)^\top \mathbf{S}_N \phi(x_*)\): parameter uncertainty
- \(\sigma^2\): observation noise
Even with infinite data, the parameter uncertainty shrinks to zero but the noise term \(\sigma^2\) remains.
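Continuing the running example (prior from Example 5.4, data from Example 5.5, and an assumed \(\sigma^2 = 0.25\)), a short sketch that separates the two variance components at \(x_* = 1.5\):

```python
import numpy as np

Phi = np.column_stack([np.ones(3), np.array([0.0, 1.0, 2.0])])
y = np.array([1.0, 2.0, 2.0])
sigma2 = 0.25                                             # assumed noise level
SN = np.linalg.inv(np.eye(2) + (Phi.T @ Phi) / sigma2)    # S_0 = I, m_0 = 0
mN = SN @ ((Phi.T @ y) / sigma2)

phi_star = np.array([1.0, 1.5])                   # phi(x_*) for x_* = 1.5
mean = phi_star @ mN                              # predictive mean
var_param = phi_star @ SN @ phi_star              # parameter uncertainty
var_total = var_param + sigma2                    # total predictive variance of y_*
print(mean, var_param, var_total)
```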
Example 5.7 Noise-Free Function Values
For the latent function \[ f(x_*) = \phi(x_*)^\top \mathbf{\theta}, \] we have \[ p(f(x_*) \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N} \left( \phi(x_*)^\top \mathbf{m}_N,\; \phi(x_*)^\top \mathbf{S}_N \phi(x_*) \right). \] This is a distribution over functions, not just numbers.
5.3.4 Marginal Likelihood
The marginal likelihood (or evidence) measures how well the model explains the observed data: \[ p(\mathcal{Y} | \mathcal{X}) = \int p(\mathcal{Y} |\mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}. \] Using Gaussian conjugacy, the result is: \[ p(\mathcal{Y} | \mathcal{X}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{m}_0, \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I}), \] with: \[ \mathbb{E}[\mathcal{Y} | \mathcal{X}] = \mathbf{\Phi} \mathbf{m}_0, \quad \text{Cov}[\mathcal{Y} | \mathcal{X}] = \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I} \] The marginal likelihood is crucial for model comparison and selection, as it measures the model’s fit after integrating over parameter uncertainty.
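A minimal numpy sketch that evaluates this Gaussian density at the observed targets, reusing the running example's design matrix and prior with an assumed \(\sigma^2 = 0.25\):

```python
import numpy as np

Phi = np.column_stack([np.ones(3), np.array([0.0, 1.0, 2.0])])
y = np.array([1.0, 2.0, 2.0])
m0, S0, sigma2 = np.zeros(2), np.eye(2), 0.25     # prior and assumed noise level

# Evidence: y ~ N(Phi m0, Phi S0 Phi^T + sigma^2 I)
mean = Phi @ m0
cov = Phi @ S0 @ Phi.T + sigma2 * np.eye(len(y))

# Gaussian log-density of the observed targets under the marginal likelihood
diff = y - mean
sign, logdet = np.linalg.slogdet(cov)
log_evidence = -0.5 * (diff @ np.linalg.solve(cov, diff)
                       + logdet + len(y) * np.log(2.0 * np.pi))
print(log_evidence)
```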
Example 5.8 Marginal Likelihood (Evidence)
The marginal likelihood integrates out parameters: \[ p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}. \]
Result: \[ p(\mathcal{Y} \mid \mathcal{X}) = \mathcal{N} (\mathcal{Y} \mid \mathbf{\Phi} \mathbf{m}_0, \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I}). \]
Interpretation:
- Measures how well the model explains data on average over parameters
- Automatically penalizes overly flexible models
- Central to Bayesian model selection
5.3.4.1 Key Insights
| Concept | Description |
|---|---|
| Parameter Treatment | Bayesian regression treats parameters \(\mathbf{\theta}\) as random variables. |
| Prior → Posterior | Prior \(p(\mathbf{\theta})\) is updated using data to yield posterior \(p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y})\). |
| Predictions | Are averages over all plausible parameter values (not point estimates). |
| Uncertainty | Captured via predictive variance combining model and observation noise. |
| Marginal Likelihood | Used for model evidence and Bayesian model selection. |
5.3.4.2 Summary of Distributions
| Distribution | Expression | Description |
|---|---|---|
| Prior | \(p(\mathbf{\theta}) = \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)\) | Beliefs before seeing data |
| Likelihood | \(p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{\theta}, \sigma^2 \mathbf{I})\) | Data model |
| Posterior | \(p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\mathbf{\theta} | \mathbf{m}_N, \mathbf{S}_N)\) | Updated parameter beliefs |
| Predictive | \(p(y_* | \mathcal{X}, \mathcal{Y}, \mathbf{x}_*) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_N, \phi(\mathbf{x}_*)^\top \mathbf{S}_N \phi(\mathbf{x}_*) + \sigma^2)\) | Predictions at new points |
| Marginal Likelihood | \(p(\mathcal{Y} | \mathcal{X}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{m}_0, \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I} )\) | Model evidence |