4.6 Model Selection
Model selection in machine learning is the problem of choosing, among candidate models or configurations, the one that generalizes best to unseen data. The goal is to balance model complexity against data fit, avoiding both underfitting and overfitting.
Complex models (e.g., higher-degree polynomials) are more expressive and can describe a wider variety of datasets. However, greater flexibility often leads to overfitting on the training set, which reduces performance on unseen data. Therefore, model selection aims to identify the simplest model that explains the data sufficiently well — a concept known as Occam’s Razor.
4.6.1 Nested Cross-Validation
Definition 4.14 Cross-validation estimates a model’s generalization error by repeatedly splitting data into training and validation sets.
Definition 4.15 Nested cross-validation extends cross-validation by applying it at two levels:
- Inner loop: Chooses the best model or hyperparameters based on validation performance.
- Outer loop: Estimates the generalization performance of the chosen model on unseen test data.
Lemma 4.1 The expected validation error estimate is given by: \[ \mathbb{E}_V[\mathbf{R}(\mathcal{V} | M)] \approx \frac{1}{K} \sum_{k=1}^{K} \mathbf{R}(\mathcal{V}^{(k)} | M), \] where \(\mathbf{R}(\mathcal{V} | M)\) is the empirical risk (e.g., RMSE) of model \(M\) on validation set \(\mathcal{V}\).
This process yields both a mean generalization estimate and a standard error for uncertainty quantification.
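For concreteness, the following is a minimal sketch of nested cross-validation using scikit-learn; the choice of ridge regression, the alpha grid, and the fold counts are illustrative assumptions rather than prescriptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Toy data; any supervised dataset works here.
X, y = make_regression(n_samples=200, n_features=10, noise=0.5, random_state=0)

# Inner loop: choose the regularization strength by validation performance.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
model = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="neg_root_mean_squared_error",
)

# Outer loop: estimate generalization performance of the whole
# selection procedure on held-out folds.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)
scores = -cross_val_score(model, X, y, cv=outer_cv,
                          scoring="neg_root_mean_squared_error")

# Mean estimate and standard error, as in Lemma 4.1.
print(f"RMSE: {scores.mean():.3f} "
      f"+/- {scores.std(ddof=1) / np.sqrt(len(scores)):.3f}")
```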
4.6.2 Bayesian Model Selection
Bayesian model selection provides a probabilistic framework for comparing models. It incorporates both data fit and model complexity via Bayes’ theorem: \[ p(M_k | \mathcal{D}) \propto p(M_k) \, p(\mathcal{D} | M_k). \] Here:
- \(p(M_k)\): prior probability of model \(M_k\)
- \(p(\mathcal{D} | M_k)\): model evidence or marginal likelihood \[ p(\mathcal{D} | M_k) = \int p(\mathcal{D} | \boldsymbol{\theta}_k) \, p(\boldsymbol{\theta}_k | M_k) \, d\boldsymbol{\theta}_k \]
This integral marginalizes over the model parameters \(\boldsymbol{\theta}_k\), automatically penalizing overly complex models.
Definition 4.16 Model evidence quantifies how well a model predicts observed data after accounting for parameter uncertainty.
Thus, the MAP estimate of the best model is: \[ M^* = \arg\max_{M_k} p(M_k | \mathcal{D}) \]
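To make the evidence computation concrete, the sketch below approximates \(p(\mathcal{D} | M_k)\) by simple Monte Carlo: draw parameters from the prior and average the likelihood. The polynomial models, the Gaussian likelihood with known noise level, and the standard-normal priors are all illustrative assumptions, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a noisy line (assumed; not from the text).
x = np.linspace(-1.0, 1.0, 30)
sigma = 0.2                               # assumed known noise level
y = 1.5 * x + rng.normal(0.0, sigma, size=x.shape)

def log_likelihood(theta, degree):
    """Gaussian log-likelihood of the data under a polynomial model."""
    pred = np.polyval(theta, x)           # theta holds degree + 1 coefficients
    return (-0.5 * np.sum(((y - pred) / sigma) ** 2)
            - len(x) * np.log(sigma * np.sqrt(2.0 * np.pi)))

def log_evidence(degree, n_samples=20_000):
    """Monte Carlo estimate of log p(D | M): average the likelihood
    over parameters drawn from the prior p(theta | M) = N(0, I)."""
    thetas = rng.normal(0.0, 1.0, size=(n_samples, degree + 1))
    logls = np.array([log_likelihood(t, degree) for t in thetas])
    m = logls.max()                       # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(logls - m)))

# Under a uniform prior over models, M* maximizes the evidence.
for d in (1, 2, 3):
    print(f"degree {d}: log p(D | M) ~ {log_evidence(d):.1f}")
```

Because higher-degree models spread their prior mass over many more datasets, their evidence for this simple dataset is typically lower: the marginalization in Definition 4.16 acts as an automatic Occam's razor.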
4.6.3 Bayes Factors for Model Comparison
To compare two models \(M_1\) and \(M_2\), we consider their posterior odds: \[ \frac{p(M_1 | \mathcal{D})}{p(M_2 | \mathcal{D})} = \underbrace{\frac{p(M_1)}{p(M_2)}}_{\text{prior odds}} \times \underbrace{\frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_2)}}_{\text{Bayes factor}} \]
The Bayes factor \(\frac{p(\mathcal{D} | M_1)}{p(\mathcal{D} | M_2)}\) measures the relative support the data give to each model. With uniform model priors, selection depends only on the Bayes factor: if it is greater than 1, \(M_1\) is preferred; otherwise, \(M_2\) is chosen.
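As a worked illustration, consider comparing a point-null "fair coin" model \(M_1\) against a model \(M_2\) with a uniform prior on the coin's bias. The data and all numbers below are assumptions chosen for the example:

```python
from math import comb, exp, lgamma, log

# Illustrative data (assumed): k heads in n coin flips.
n, k = 100, 55

# Evidence under M1: a fair coin, theta fixed at 0.5.
log_ev_m1 = log(comb(n, k)) + n * log(0.5)

# Evidence under M2: theta ~ Uniform(0, 1). The Beta integral gives
# p(D | M2) = C(n, k) * B(k + 1, n - k + 1) = 1 / (n + 1).
log_b = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)
log_ev_m2 = log(comb(n, k)) + log_b

bayes_factor = exp(log_ev_m1 - log_ev_m2)
print(f"BF(M1 : M2) = {bayes_factor:.2f}")   # > 1 here: the data favor M1
# With uniform model priors, the posterior odds equal the Bayes factor.
```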
The Jeffreys-Lindley paradox is a phenomenon in Bayesian statistics that highlights a surprising difference between Bayesian model comparison and classical (frequentist) hypothesis testing.
Suppose we want to test:
\[ H_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0 \quad \text{vs.} \quad H_1: \boldsymbol{\theta} \neq \boldsymbol{\theta}_0 \]
- In frequentist hypothesis testing, we might reject \(H_0\) if the p-value is small.
- In Bayesian model selection, we compute the Bayes factor: \[ \text{BF} = \frac{p(\text{data} | H_0)}{p(\text{data} | H_1)} \]
The paradox arises when data that strongly reject \(H_0\) according to a classical test (small p-value) nonetheless yield a Bayes factor that favors the null hypothesis \(H_0\). This happens especially when the prior for the alternative hypothesis \(H_1\) is diffuse (spread over a wide range of possible parameter values). As a result, \(H_0\) can appear more probable in the Bayesian sense even when the observed data look extreme.
The result is that Bayesian and frequentist conclusions can disagree: prior choices have a strong influence on Bayesian model comparison, and the paradox emphasizes the importance of carefully selecting priors for alternative hypotheses.
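A small simulation makes the paradox tangible. Below, a large-sample Gaussian test yields a p-value under 0.05 while the Bayes factor favors \(H_0\); the sample size, observed mean, and prior scale are assumed values chosen to exhibit the effect:

```python
import numpy as np
from scipy.stats import norm

# Illustrative numbers (assumed): a large sample, small observed effect.
n, sigma, tau = 10_000, 1.0, 1.0   # sample size, noise sd, prior sd under H1
xbar = 0.025                       # observed sample mean
se = sigma / np.sqrt(n)            # standard error of the mean

# Frequentist two-sided test of H0: theta = 0.
z = xbar / se
p_value = 2 * norm.sf(abs(z))

# Bayes factor BF(H0 : H1). Under H0 the sample mean is N(0, se^2);
# under H1 with theta ~ N(0, tau^2), marginally xbar ~ N(0, tau^2 + se^2).
bf_01 = norm.pdf(xbar, 0.0, se) / norm.pdf(xbar, 0.0, np.sqrt(tau**2 + se**2))

print(f"z = {z:.2f}, p-value = {p_value:.4f}")   # ~0.012: reject at 5%
print(f"BF(H0 : H1) = {bf_01:.1f}")              # ~4.4: favors H0
# Widening the prior (larger tau) inflates BF(H0 : H1) further.
```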
4.6.4 Computing the Marginal Likelihood
The marginal likelihood integral \[ p(\mathcal{D} | M_k) = \int p(\mathcal{D} | \boldsymbol{\theta}_k) \, p(\boldsymbol{\theta}_k | M_k) \, d\boldsymbol{\theta}_k \] is often analytically intractable. Common approximation techniques include:
- Numerical integration
- Monte Carlo sampling
- Bayesian Monte Carlo methods
However, when using conjugate priors, this term can sometimes be computed in closed form.
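As a sketch of the conjugate case, consider a Gaussian likelihood with known variance and a conjugate Gaussian prior on the mean: marginalizing the mean gives a Gaussian evidence in closed form, which can be checked against a Monte Carlo estimate. The model and all numbers here are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Assumed model: y_i ~ N(mu, sigma^2), conjugate prior mu ~ N(mu0, tau^2).
sigma, mu0, tau = 1.0, 0.0, 2.0
y = rng.normal(0.5, sigma, size=20)      # illustrative data
n = len(y)

# Closed form: marginalizing mu gives y ~ N(mu0 * 1, sigma^2 I + tau^2 11^T).
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
log_ev_exact = multivariate_normal.logpdf(y, mean=np.full(n, mu0), cov=cov)

# Monte Carlo check: p(D) = E_{mu ~ prior}[ p(D | mu) ].
mus = rng.normal(mu0, tau, size=200_000)
logls = (-0.5 * np.sum((y[None, :] - mus[:, None]) ** 2, axis=1) / sigma**2
         - n * np.log(sigma * np.sqrt(2.0 * np.pi)))
m = logls.max()                          # log-sum-exp for stability
log_ev_mc = m + np.log(np.mean(np.exp(logls - m)))

print(f"closed form: {log_ev_exact:.3f}, Monte Carlo: {log_ev_mc:.3f}")
```

The two estimates agree closely, which is the practical appeal of conjugacy: the integral that is generally intractable becomes a single density evaluation.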