5.3 Bayesian Linear Regression
Bayesian linear regression extends the concept of linear regression by treating parameters as random variables instead of estimating a single best set of parameters (as in MLE or MAP). Rather than finding a point estimate for parameters \(\mathbf{\theta}\), it computes the full posterior distribution over them and uses this distribution to make predictions.
This approach:
- Incorporates prior beliefs about parameters,
- Naturally accounts for uncertainty,
- Mitigates overfitting, especially with limited data.
5.3.1 The Model
We define the model as: \[ p(\mathbf{\theta}) = \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0) \quad \text{(prior on parameters)} \] \[ p(y | \mathbf{x}, \mathbf{\theta}) = \mathcal{N}(y | \phi(\mathbf{x})^\top \mathbf{\theta}, \sigma^2) \quad \text{(likelihood)}. \] The joint distribution is: \[ p(y, \mathbf{\theta} | \mathbf{x}) = p(y | \mathbf{x}, \mathbf{\theta}) \, p(\mathbf{\theta}). \] Here:
- \(\mathbf{\theta}\) is now a random variable,
- \(\mathbf{m}_0\) and \(\mathbf{S}_0\) are the prior mean and covariance.
Predictions are obtained by integrating out the parameter uncertainty: \[ p(y_* | \mathbf{x}_*) = \int p(y_* | \mathbf{x}_*, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}. \] Since both likelihood and prior are Gaussian, the predictive distribution is also Gaussian: \[ p(y_* | \mathbf{x}_*) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_0, \phi(\mathbf{x}_*)^\top \mathbf{S}_0 \phi(\mathbf{x}_*) + \sigma^2) \]
The term \(\phi(\mathbf{x}_*)^\top \mathbf{S}_0 \phi(\mathbf{x}_*)\) reflects uncertainty due to parameter variability. The term \(\sigma^2\) reflects observation noise.
For noise-free function values \(f(\mathbf{x}_*) = \phi(\mathbf{x}_*)^\top \mathbf{\theta}\): \[ p(f(\mathbf{x}_*)) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_0, \phi(\mathbf{x}_*)^\top \mathbf{S}_0 \phi(\mathbf{x}_*)) \]
The parameter prior \(p(\mathbf{\theta})\) induces a distribution over functions:
- Each sampled parameter vector \(\mathbf{\theta}_i \sim p(\mathbf{\theta})\) defines a function \(f_i(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{\theta}_i\).
- The collection of these defines \(p(f(\cdot))\), a distribution over possible functions.
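To make this concrete, here is a minimal numpy sketch (the feature map \(\phi(x) = [1, x]^\top\) and the zero-mean, unit-covariance prior are illustrative choices, matching Example 5.4 below): each draw \(\mathbf{\theta}_i\) from the prior produces one candidate function.

```python
import numpy as np

rng = np.random.default_rng(0)

m0 = np.zeros(2)                                  # prior mean
S0 = np.eye(2)                                    # prior covariance

xs = np.linspace(-3.0, 3.0, 50)
Phi = np.column_stack([np.ones_like(xs), xs])     # phi(x) = [1, x] for each input

# Each row of thetas is one draw theta_i ~ N(m0, S0); each column of
# prior_functions is the corresponding line f_i(x) = phi(x)^T theta_i.
thetas = rng.multivariate_normal(m0, S0, size=5)  # shape (5, 2)
prior_functions = Phi @ thetas.T                  # shape (50, 5)
```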
Example 5.4 Bayesian Linear Regression with a Single Feature
We model a simple linear relationship \[ y = \theta_0 + \theta_1 x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2). \] Define the feature map \[ \phi(x) = \begin{bmatrix} 1 \\ x \end{bmatrix}, \quad \mathbf{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \end{bmatrix}. \] Then \[ p(y \mid x, \mathbf{\theta}) = \mathcal{N}(y \mid \phi(x)^\top \mathbf{\theta}, \sigma^2). \]
Assume a Gaussian prior: \[ p(\mathbf{\theta}) = \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0), \quad \mathbf{m}_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad \mathbf{S}_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}. \]
Interpretation:
- Before seeing data, we believe slopes and intercepts are near 0.
- Large uncertainty allows many possible linear functions.
Each draw \(\mathbf{\theta}_i \sim p(\mathbf{\theta})\) defines a function \[ f_i(x) = \theta_{0,i} + \theta_{1,i} x. \] Thus the prior defines a distribution over lines.
For a new input \(x_*\), \[ p(y_* \mid x_*) = \mathcal{N} \left( \phi(x_*)^\top \mathbf{m}_0,\; \phi(x_*)^\top \mathbf{S}_0 \phi(x_*) + \sigma^2 \right). \] Because \(\mathbf{m}_0 = \mathbf{0}\): \[ \mathbb{E}[y_*] = 0. \] The variance \[ \phi(x_*)^\top \mathbf{S}_0 \phi(x_*) = 1 + x_*^2 \] grows with distance from the origin, so the prior predictive uncertainty increases the further \(x_*\) is from zero.
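A short sketch of this prior predictive computation (the noise level \(\sigma^2 = 0.25\) is an assumed value, since the example does not fix one):

```python
import numpy as np

m0 = np.zeros(2)       # prior mean from Example 5.4
S0 = np.eye(2)         # prior covariance from Example 5.4
sigma2 = 0.25          # assumed observation-noise variance, for illustration

def prior_predictive(x_star):
    """Prior predictive mean and variance of y_* at a scalar input x_*."""
    phi_star = np.array([1.0, x_star])
    mean = phi_star @ m0                         # = 0 because m0 = 0
    var = phi_star @ S0 @ phi_star + sigma2      # = 1 + x_*^2 + sigma^2
    return mean, var

print(prior_predictive(0.0))   # (0.0, 1.25)
print(prior_predictive(2.0))   # (0.0, 5.25)
```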
5.3.2 Posterior Distribution
After observing training data \((\mathcal{X}, \mathcal{Y})\), we compute the posterior over parameters:
\[ p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \frac{p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta})}{p(\mathcal{Y} | \mathcal{X})} \]
Where:
- \(p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{\theta}, \sigma^2 \mathbf{I})\)
- \(p(\mathcal{Y} | \mathcal{X}) = \int p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}\) is the marginal likelihood (normalizing constant).
Because both prior and likelihood are Gaussian, the posterior is also Gaussian: \[ p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\mathbf{\theta} | \mathbf{m}_N, \mathbf{S}_N), \] with \[ \mathbf{S}_N = (\mathbf{S}_0^{-1} + \sigma^{-2} \mathbf{\Phi}^\top \mathbf{\Phi})^{-1}, \quad \mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \sigma^{-2} \mathbf{\Phi}^\top \mathcal{Y}). \] This is derived by completing the square in the exponent of the unnormalized posterior.
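A minimal numpy sketch of this update, following the formulas literally (the function name is illustrative; in practice one would prefer a Cholesky-based solve to the explicit inverses):

```python
import numpy as np

def posterior(Phi, y, m0, S0, sigma2):
    """Gaussian posterior N(m_N, S_N) over theta, given design matrix Phi and targets y."""
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)   # S_N
    mN = SN @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)        # m_N
    return mN, SN
```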
Example 5.5 Posterior After Observing Data
Suppose we observe: \[ \mathcal{X} = \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}, \quad \mathcal{Y} = \begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}. \] The design matrix is \[ \mathbf{\Phi} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}. \]
Because the prior and likelihood are Gaussian, the posterior is: \[ p(\mathbf{\theta} \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\mathbf{m}_N, \mathbf{S}_N), \] with \[ \mathbf{S}_N = (\mathbf{S}_0^{-1} + \sigma^{-2} \mathbf{\Phi}^\top \mathbf{\Phi})^{-1}, \quad \mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \sigma^{-2} \mathbf{\Phi}^\top \mathcal{Y}). \]
Interpretation:
- Posterior mean \(\mathbf{m}_N\): point estimate of the parameters; for this Gaussian posterior it coincides with the MAP estimate.
- Posterior covariance \(\mathbf{S}_N\): remaining uncertainty after seeing data.
Uncertainty shrinks in directions well-supported by data.
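Plugging the numbers from this example into the update gives a concrete posterior; a minimal sketch, assuming the prior from Example 5.4 and a noise level \(\sigma^2 = 0.25\) (not specified in the example):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 2.0, 2.0])
Phi = np.column_stack([np.ones_like(X), X])      # [[1,0],[1,1],[1,2]]

m0, S0, sigma2 = np.zeros(2), np.eye(2), 0.25    # prior from Example 5.4; sigma2 assumed

S0_inv = np.linalg.inv(S0)
SN = np.linalg.inv(S0_inv + (Phi.T @ Phi) / sigma2)
mN = SN @ (S0_inv @ m0 + (Phi.T @ y) / sigma2)

print("m_N:", mN)    # posterior mean of [intercept, slope]
print("S_N:", SN)    # posterior covariance; entries much smaller than in S_0 = I
```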
5.3.3 Posterior Predictions
To predict at a new point \(\mathbf{x}_*\), we again integrate over \(\mathbf{\theta}\), but now using the posterior instead of the prior: \[ p(y_* | \mathcal{X}, \mathcal{Y}, \mathbf{x}_*) = \int p(y_* | \mathbf{x}_*, \mathbf{\theta}) p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) d\mathbf{\theta}. \] This yields: \[ p(y_* | \mathcal{X}, \mathcal{Y}, \mathbf{x}_*) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_N, \phi(\mathbf{x}_*)^\top \mathbf{S}_N \phi(\mathbf{x}_*) + \sigma^2). \]
- The predictive mean \(\phi(\mathbf{x}_*)^\top \mathbf{m}_N\) equals the prediction made with the MAP estimate \(\mathbf{m}_N\).
- The predictive variance accounts for both model and observation uncertainty.
For \(f(\mathbf{x}_*) = \phi(\mathbf{x}_*)^\top \mathbf{\theta}\): \[ \mathbb{E}[f(\mathbf{x}_*) | \mathcal{X}, \mathcal{Y}] = \phi(\mathbf{x}_*)^\top \mathbf{m}_N, \quad \text{Var}[f(\mathbf{x}_*) | \mathcal{X}, \mathcal{Y}] = \phi(\mathbf{x}_*)^\top \mathbf{S}_N \phi(\mathbf{x}_*) \]
Sampling \(\mathbf{\theta}_i \sim p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y})\) gives functions \(f_i(\mathbf{x}) = \phi(\mathbf{x})^\top \mathbf{\theta}_i\). These represent the posterior distribution over functions, with:
- Mean function: \(\mathbf{m}_N^\top \phi(\mathbf{x})\),
- Variance: \(\phi(\mathbf{x})^\top \mathbf{S}_N \phi(\mathbf{x})\).
Interpretation:
Higher uncertainty (larger variance) reflects regions with sparse or no data.
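The sketch below packages both the predictive computation and the function sampling, again assuming \(\phi(x) = [1, x]^\top\); the function names are illustrative, and \(\mathbf{m}_N\), \(\mathbf{S}_N\), \(\sigma^2\) would come from a posterior update such as the one in Section 5.3.2.

```python
import numpy as np

def predictive(x_star, mN, SN, sigma2):
    """Posterior predictive mean and variance of y_* for phi(x) = [1, x]."""
    phi_star = np.array([1.0, x_star])
    mean = phi_star @ mN
    var_f = phi_star @ SN @ phi_star      # uncertainty inherited from the parameters
    return mean, var_f + sigma2           # observation noise added for y_*

def sample_posterior_functions(mN, SN, xs, n_samples=5, seed=0):
    """Draw theta_i ~ N(m_N, S_N) and evaluate the induced lines at the inputs xs."""
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    thetas = rng.multivariate_normal(mN, SN, size=n_samples)   # (n_samples, 2)
    Phi = np.column_stack([np.ones_like(xs), xs])              # (len(xs), 2)
    return Phi @ thetas.T                                      # (len(xs), n_samples)
```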
Example 5.6 Posterior Predictive Distribution
For a new input \(x_* = 1.5\):
\[ p(y_* \mid \mathcal{X}, \mathcal{Y}, x_*) = \mathcal{N} \left( \phi(x_*)^\top \mathbf{m}_N,\; \phi(x_*)^\top \mathbf{S}_N \phi(x_*) + \sigma^2 \right). \]
This variance has two components:
- \(\phi(x_*)^\top \mathbf{S}_N \phi(x_*)\): parameter uncertainty
- \(\sigma^2\): observation noise
Even with infinite data, the parameter uncertainty shrinks to zero but the noise term \(\sigma^2\) remains.
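Continuing the running example (prior from Example 5.4, data from Example 5.5, and an assumed \(\sigma^2 = 0.25\)), a short sketch that separates the two variance components at \(x_* = 1.5\):

```python
import numpy as np

Phi = np.column_stack([np.ones(3), np.array([0.0, 1.0, 2.0])])
y = np.array([1.0, 2.0, 2.0])
sigma2 = 0.25                                             # assumed noise level
SN = np.linalg.inv(np.eye(2) + (Phi.T @ Phi) / sigma2)    # S_0 = I, m_0 = 0
mN = SN @ ((Phi.T @ y) / sigma2)

phi_star = np.array([1.0, 1.5])                   # phi(x_*) for x_* = 1.5
mean = phi_star @ mN                              # predictive mean
var_param = phi_star @ SN @ phi_star              # parameter uncertainty
var_total = var_param + sigma2                    # total predictive variance of y_*
print(mean, var_param, var_total)
```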
Example 5.7 Noise-Free Function Values
For the latent function \[ f(x_*) = \phi(x_*)^\top \mathbf{\theta}, \] we have \[ p(f(x_*) \mid \mathcal{X}, \mathcal{Y}) = \mathcal{N} \left( \phi(x_*)^\top \mathbf{m}_N,\; \phi(x_*)^\top \mathbf{S}_N \phi(x_*) \right). \] This is a distribution over functions, not just numbers.
5.3.4 Marginal Likelihood
The marginal likelihood (or evidence) measures how well the model explains the observed data: \[ p(\mathcal{Y} | \mathcal{X}) = \int p(\mathcal{Y} |\mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}. \] Using Gaussian conjugacy, the result is: \[ p(\mathcal{Y} | \mathcal{X}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{m}_0, \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I}), \] with: \[ \mathbb{E}[\mathcal{Y} | \mathcal{X}] = \mathbf{\Phi} \mathbf{m}_0, \quad \text{Cov}[\mathcal{Y} | \mathcal{X}] = \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I} \] The marginal likelihood is crucial for model comparison and selection, as it measures the model’s fit after integrating over parameter uncertainty.
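A minimal numpy sketch that evaluates this Gaussian density at the observed targets, reusing the running example's design matrix and prior with an assumed \(\sigma^2 = 0.25\):

```python
import numpy as np

Phi = np.column_stack([np.ones(3), np.array([0.0, 1.0, 2.0])])
y = np.array([1.0, 2.0, 2.0])
m0, S0, sigma2 = np.zeros(2), np.eye(2), 0.25     # prior and assumed noise level

# Evidence: y ~ N(Phi m0, Phi S0 Phi^T + sigma^2 I)
mean = Phi @ m0
cov = Phi @ S0 @ Phi.T + sigma2 * np.eye(len(y))

# Gaussian log-density of the observed targets under the marginal likelihood
diff = y - mean
sign, logdet = np.linalg.slogdet(cov)
log_evidence = -0.5 * (diff @ np.linalg.solve(cov, diff)
                       + logdet + len(y) * np.log(2.0 * np.pi))
print(log_evidence)
```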
Example 5.8 Marginal Likelihood (Evidence)
The marginal likelihood integrates out parameters: \[ p(\mathcal{Y} \mid \mathcal{X}) = \int p(\mathcal{Y} \mid \mathcal{X}, \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta}. \]
Result: \[ p(\mathcal{Y} \mid \mathcal{X}) = \mathcal{N} (\mathcal{Y} \mid \mathbf{\Phi} \mathbf{m}_0, \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I}). \]
Interpretation:
- Measures how well the model explains data on average over parameters
- Automatically penalizes overly flexible models
- Central to Bayesian model selection
5.3.4.1 Key Insights
| Concept | Description |
|---|---|
| Parameter Treatment | Bayesian regression treats parameters \(\mathbf{\theta}\) as random variables. |
| Prior → Posterior | Prior \(p(\mathbf{\theta})\) is updated using data to yield posterior \(p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y})\). |
| Predictions | Are averages over all plausible parameter values (not point estimates). |
| Uncertainty | Captured via predictive variance combining model and observation noise. |
| Marginal Likelihood | Used for model evidence and Bayesian model selection. |
5.3.4.2 Summary of Distributions
| Distribution | Expression | Description |
|---|---|---|
| Prior | \(p(\mathbf{\theta}) = \mathcal{N}(\mathbf{m}_0, \mathbf{S}_0)\) | Beliefs before seeing data |
| Likelihood | \(p(\mathcal{Y} | \mathcal{X}, \mathbf{\theta}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{\theta}, \sigma^2 \mathbf{I})\) | Data model |
| Posterior | \(p(\mathbf{\theta} | \mathcal{X}, \mathcal{Y}) = \mathcal{N}(\mathbf{\theta} | \mathbf{m}_N, \mathbf{S}_N)\) | Updated parameter beliefs |
| Predictive | \(p(y_* | \mathcal{X}, \mathcal{Y}, \mathbf{x}_*) = \mathcal{N}(\phi(\mathbf{x}_*)^\top \mathbf{m}_N, \phi(\mathbf{x}_*)^\top \mathbf{S}_N \phi(\mathbf{x}_*) + \sigma^2)\) | Predictions at new points |
| Marginal Likelihood | \(p(\mathcal{Y} | \mathcal{X}) = \mathcal{N}(\mathcal{Y} | \mathbf{\Phi} \mathbf{m}_0, \mathbf{\Phi} \mathbf{S}_0 \mathbf{\Phi}^\top + \sigma^2 \mathbf{I} )\) | Model evidence |