4.2 Empirical Risk Minimization

Empirical Risk Minimization (ERM) is a foundational principle of machine learning: model parameters are estimated from training data so as to minimize prediction error. This section introduces four main design components:

  1. Hypothesis Class: From what set of functions may the predictor be chosen?
  2. Loss Function: How do we measure how well the predictor fits the training data?
  3. Regularization: How do we prevent overfitting and ensure good generalization?
  4. Model Assessment: How do we estimate generalization and choose the best model?

4.2.1 Hypothesis Class of Functions

In supervised learning, we are given a dataset of input-output pairs: \[ \mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}, \] where each \(\mathbf{x}_n \in \mathbb{R}^D\) represents an input feature vector, and \(y_n \in \mathbb{R}\) (or a discrete set) represents the target label. The goal of learning is to find a predictor function \(f(\mathbf{x}, \mathbf{\theta})\) that maps input features to outputs, such that the predictions are as close as possible to the true values \(y_n\). Here, \(\mathbf{\theta}\) denotes the parameters (or weights) that define the specific function within a family of possible functions.

Definition 4.4 The hypothesis class is the set of all functions that the learning algorithm can choose from, based on the model type and its parameters.

Formally: \[ \mathcal{H} = \{ f(\mathbf{x}, \mathbf{\theta}) : \mathbf{\theta} \in \Theta \}, \] where:

  • \(\mathcal{H}\) is the hypothesis class (the space of possible predictors),
  • \(f(\mathbf{x}, \mathbf{\theta})\) is the model function,
  • \(\Theta\) is the parameter space that determines which functions are possible.

The hypothesis class defines the capacity of a model — how flexible or expressive it can be in fitting data.

Example 4.2 Linear (Affine) Models:

A common and simple choice for \(f(\mathbf{x}, \mathbf{\theta})\) is a linear model: \[ f(\mathbf{x}, \mathbf{\theta}) = \mathbf{\theta}^\top \mathbf{x} + b, \] where:

  • \(\mathbf{\theta} \in \mathbb{R}^D\) is a weight vector,
  • \(b \in \mathbb{R}\) is a bias term.

The hypothesis class for linear regression is: \[ \mathcal{H}_{\text{linear}} = \{ f(\mathbf{x}) = \mathbf{\theta}^\top \mathbf{x} + b : \mathbf{\theta} \in \mathbb{R}^D, b \in \mathbb{R} \}. \] This means all hyperplanes in \(\mathbb{R}^D\) are possible predictors.
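
To make the parameterization concrete, here is a minimal NumPy sketch of an affine predictor (the dimension \(D = 3\) and the parameter values are illustrative assumptions):

```python
import numpy as np

def linear_predictor(x, theta, b):
    """Affine model f(x, theta) = theta^T x + b."""
    return theta @ x + b

# Illustrative values for D = 3; each choice of (theta, b)
# selects one member of the hypothesis class H_linear.
theta = np.array([0.5, -1.2, 2.0])
b = 0.7
x = np.array([1.0, 0.0, 3.0])
print(linear_predictor(x, theta, b))  # 0.5 + 6.0 + 0.7 = 7.2
```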

Linear models are interpretable and efficient but limited in representing nonlinear relationships. To capture more complex patterns, we can expand the hypothesis class by transforming the inputs or using nonlinear functions.

Example 4.3 Nonlinear Models:

Polynomial models (for a scalar input \(x\)): \[ f(x, \mathbf{\theta}) = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k \]

Neural networks: \[ f(\mathbf{x}, \mathbf{\theta}) = \sigma(W_2 \, \sigma(W_1 \mathbf{x} + b_1) + b_2) \] where \(\sigma(\cdot)\) is a nonlinear activation function.
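
As a concrete instance of the two-layer form above, here is a minimal NumPy forward pass, assuming \(\sigma = \tanh\) and small illustrative layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: D = 4 inputs, 8 hidden units, 1 output.
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((1, 8)), np.zeros(1)

def f(x):
    """Two-layer network: sigma(W2 sigma(W1 x + b1) + b2), sigma = tanh."""
    return np.tanh(W2 @ np.tanh(W1 @ x + b1) + b2)

print(f(rng.standard_normal(4)))
```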

These types of models can approximate complex relationships but risk overfitting if the hypothesis class is too flexible relative to the data size.

  • A small hypothesis class (e.g., linear functions) may lead to underfitting, failing to capture important patterns.
  • A large hypothesis class (e.g., deep neural networks) can represent almost any function but may overfit training data.

The key is to choose a hypothesis class that is rich enough to capture underlying relationships but constrained enough to generalize well.

Geometrically, each function \(f(\mathbf{x}, \mathbf{\theta})\) in \(\mathcal{H}\) corresponds to a surface (in regression) or decision boundary (in classification) in feature space. Learning means selecting one of these functions (via parameter estimation) that best fits the observed data under a chosen loss function.


4.2.2 Loss Function for Training

Definition 4.5 A loss function \(\ell(y_n, \hat{y}_n)\) measures the discrepancy between the true label \(y_n\) and prediction \(\hat{y}_n = f(\mathbf{x}_n, \mathbf{\theta})\).

Definition 4.6 The empirical risk is the average loss across all training examples: \[ R_{\text{emp}}(f, \mathbf{X}, \mathbf{y}) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(\mathbf{x}_n, \mathbf{\theta})) \]

ERM (Empirical Risk Minimization) seeks to minimize this risk: \[ \min_{\mathbf{\theta}} R_{\text{emp}}(f, \mathbf{X}, \mathbf{y}). \]

Example 4.4 In least squares regression, the empirical risk is given by \[ \frac{1}{N} \sum_{n=1}^N (y_n - \mathbf{\theta}^\top \mathbf{x}_n)^2. \] ERM seeks to find \[ \min_{\mathbf{\theta}} \frac{1}{N} \sum_{n=1}^N (y_n - \mathbf{\theta}^\top \mathbf{x}_n)^2, \] or equivalently, in matrix form: \[ \min_{\mathbf{\theta}} \frac{1}{N} \|\mathbf{y} - \mathbf{X}\mathbf{\theta}\|^2, \] where \(\mathbf{X} \in \mathbb{R}^{N \times D}\) stacks the inputs \(\mathbf{x}_n^\top\) as rows and \(\mathbf{y} \in \mathbb{R}^N\) collects the labels.
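
A minimal sketch of this minimization on synthetic data, using NumPy's least squares solver (the data-generating parameters below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3

# Synthetic data y = X theta_true + noise (all values illustrative).
X = rng.standard_normal((N, D))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.standard_normal(N)

# ERM for squared loss: minimize (1/N) ||y - X theta||^2.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_hat)                          # close to theta_true
print(np.mean((y - X @ theta_hat) ** 2))  # minimized empirical risk
```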

Definition 4.7 The true (expected) risk measures performance on unseen data: \[ R_{\text{true}}(f) = \mathbb{E}_{\mathbf{x},y}[\ell(y, f(\mathbf{x}))] \]

ERM approximates the true risk using a finite training sample; the hope is that a minimizer of \(R_{\text{emp}}\) also achieves small \(R_{\text{true}}\).
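
In practice \(R_{\text{true}}\) cannot be computed because the data distribution is unknown, but in a simulation where we choose the distribution we can approximate it with a large fresh sample and compare it to the empirical risk on a small sample; a sketch (all distributional choices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, -0.5])  # a fixed predictor to evaluate

def sample(n):
    """Draw n pairs from an assumed (illustrative) data distribution."""
    X = rng.standard_normal((n, 2))
    y = X @ np.array([1.2, -0.3]) + 0.5 * rng.standard_normal(n)
    return X, y

def squared_risk(X, y):
    return np.mean((y - X @ theta) ** 2)

X_small, y_small = sample(20)         # empirical risk on a small sample
X_large, y_large = sample(1_000_000)  # large fresh sample ~ true risk
print(squared_risk(X_small, y_small), squared_risk(X_large, y_large))
```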


4.2.3 Regularization to Reduce Overfitting

Regularization reduces overfitting and improves generalization.

Once a hypothesis class \(\mathcal{H}\) is chosen, the next challenge is ensuring that the model not only fits the training data well but also generalizes effectively to unseen data. This problem is known as the bias-variance trade-off, and one of the most powerful tools to manage it is regularization.

Definition 4.8 A model overfits when it learns patterns that exist only in the training data — including noise or random fluctuations — rather than the underlying structure of the problem.

For example:

  • A high-degree polynomial may perfectly interpolate training points but produce wild oscillations on new inputs.
  • A deep neural network with too many parameters may memorize the data rather than learn general features.

Overfitting leads to low training error but high test error.
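
The polynomial case is easy to demonstrate numerically; a sketch in NumPy (sample size, noise level, and degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# 10 noisy training points from a smooth underlying function.
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_mse, test_mse)
# The degree-9 fit drives training error to (near) zero, yet its
# test error is typically much larger than the degree-3 fit's.
```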

Definition 4.9 Regularization introduces a penalty for model complexity directly into the optimization objective. It biases the learning algorithm toward simpler models that are less likely to overfit.

The general form of a regularized optimization problem is: \[ \min_{\mathbf{\theta}} \; \mathcal{L}_{\text{train}}(\mathbf{\theta}) + \lambda \, \Omega(\mathbf{\theta}) \] where:

  • \(\mathcal{L}_{\text{train}}(\mathbf{\theta})\): training loss (e.g., mean squared error, cross-entropy)
  • \(\Omega(\mathbf{\theta})\): regularization term that penalizes large or complex parameter values
  • \(\lambda \ge 0\): regularization coefficient controlling the strength of the penalty

A large \(\lambda\) enforces more simplicity (strong regularization), while a small \(\lambda\) allows more flexibility.
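
For instance, taking \(\Omega(\mathbf{\theta}) = \|\mathbf{\theta}\|^2\) (the ridge penalty) keeps the least squares problem solvable in closed form; a minimal sketch (the data and \(\lambda\) values are illustrative assumptions):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N)||y - X theta||^2 + lam * ||theta||^2.
    Setting the gradient to zero gives
    (X^T X / N + lam I) theta = X^T y / N."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X / N + lam * np.eye(D), X.T @ y / N)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)
print(np.linalg.norm(ridge_fit(X, y, lam=0.0)))  # unregularized fit
print(np.linalg.norm(ridge_fit(X, y, lam=1.0)))  # shrunk toward zero
```

Increasing \(\lambda\) shrinks the parameter norm, trading a slightly worse training fit for a simpler model.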


4.2.4 Cross-Validation to Assess Generalization

Even after applying regularization, a key question remains: How well does our model generalize to unseen data?

In machine learning, generalization refers to a model’s ability to perform well not just on the training data (used to learn parameters), but also on new, unseen data drawn from the same underlying distribution. Because we only have access to a finite dataset, we must estimate generalization performance empirically. Cross-validation provides a systematic way to do this.


4.2.4.1 The Challenge of Estimating Generalization

If we train and test on the same dataset, the model will almost always appear to perform well — often unrealistically so. This is because the model has already “seen” the data during training, and may have overfit to it.

To better approximate the true risk \[ R_{\text{true}}(f) = \mathbb{E}_{\mathbf{x}, y}[\ell(y, f(\mathbf{x}))],\] we need a separate dataset on which the model has not been trained. A simple approach is to split the data into:

  • a training set for fitting parameters, and
  • a test (or validation) set for evaluating generalization.

However, when data is limited, such a simple split can be unreliable — the test performance may depend heavily on which points happened to be held out. This motivates cross-validation.
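
A minimal sketch of such a split (the 80/20 ratio and the random seed are illustrative conventions):

```python
import numpy as np

def train_test_split(X, y, test_frac=0.2, seed=0):
    """Randomly hold out a fraction of the data for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_frac * len(y))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```

On a small dataset, re-running this with different seeds can give noticeably different test errors, which is precisely the instability that motivates cross-validation.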


4.2.4.2 The Idea of Cross-Validation

Cross-validation (CV) is a statistical resampling method that allows us to:

  1. Use all the data for both training and validation (just not at the same time), and
  2. Obtain a more robust estimate of model performance by averaging across multiple train/test splits.

The most widely used form is K-fold cross-validation.


4.2.4.3 K-Fold Cross-Validation

In K-fold cross-validation, the dataset is randomly partitioned into \(K\) equally (or nearly equally) sized subsets, or folds. Then, the procedure is:

  1. For each \(k = 1, \ldots, K\):
    • Hold out fold \(k\) as the validation set \(V^{(k)}\).
    • Use the remaining \(K - 1\) folds as the training set \(R^{(k)}\).
    • Train the model on \(R^{(k)}\), producing a predictor \(f^{(k)}\).
    • Compute the empirical risk (validation error) on \(V^{(k)}\):
      \[ R_{\text{emp}}(f^{(k)}, V^{(k)}) = \frac{1}{|V^{(k)}|} \sum_{(\mathbf{x}_i, y_i) \in V^{(k)}} \ell(y_i, f^{(k)}(\mathbf{x}_i)) \]
  2. Average the validation errors across all folds: \[ E_V[R(f, V)] \approx \frac{1}{K} \sum_{k=1}^{K} R_{\text{emp}}(f^{(k)}, V^{(k)}) \]

The resulting average gives an estimate of the expected generalization error.
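
A minimal NumPy sketch of the procedure, assuming a generic `fit(X, y)` routine that returns a callable predictor and using squared error as the loss (both are illustrative assumptions):

```python
import numpy as np

def k_fold_cv(X, y, fit, K=5, seed=0):
    """Average validation MSE over K folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        val = folds[k]                                   # fold V^(k)
        trn = np.concatenate(folds[:k] + folds[k + 1:])  # folds R^(k)
        f_k = fit(X[trn], y[trn])                        # predictor f^(k)
        errors.append(np.mean((y[val] - f_k(X[val])) ** 2))
    return np.mean(errors)

# Example: least squares as the `fit` routine.
def fit_least_squares(X, y):
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda X_new: X_new @ theta

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + 0.1 * rng.standard_normal(100)
print(k_fold_cv(X, y, fit_least_squares, K=5))
```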


4.2.4.4 Choosing \(K\)

The choice of \(K\) controls the bias–variance tradeoff of the cross-validation estimate:

K                      Training Proportion   Bias       Variance   Computational Cost
2                      50%                   High       Low        Low
5                      80%                   Moderate   Moderate   Moderate
10                     90%                   Low        Moderate   Higher
N (Leave-One-Out CV)   ~100%                 Very low   High       Very high

Generally speaking, a smaller \(K\) requires less computation but gives a more biased estimate, since each model is trained on a smaller fraction of the data. A larger \(K\) yields a better estimate of performance but is more computationally expensive. In practice, \(K = 5\) or \(K = 10\) are common choices.


Exercises

Exercise 4.3 Verify that \[\mathcal{L}(\mathbf{\theta}) = -\log p(\mathcal{Y}|\mathcal{X},\mathbf{\theta}) = -\sum_{n=1}^N \log p(y_n|\mathbf{x}_n,\mathbf{\theta}).\] You may assume the training examples are conditionally independent given \(\mathbf{\theta}\). Make sure to justify all of your steps.

Exercise 4.4 Assuming a Gaussian likelihood \(p(y_n|\mathbf{x}_n,\mathbf{\theta}) = \mathcal{N}(y_n \,|\, \mathbf{x}_n^\top\mathbf{\theta}, \sigma^2)\), verify that \[\mathcal{L}(\mathbf{\theta})=\dfrac{1}{2\sigma^2}\sum_{n=1}^N (y_n - \mathbf{x}_n^\top\mathbf{\theta})^2 - \sum_{n=1}^N \log \dfrac{1}{\sqrt{2\pi\sigma^2}}.\] Make sure to justify all of your steps.

Exercise 4.5 If we have prior knowledge about the distribution of the parameters \(\mathbf{\theta}\), we can multiply our likelihood \(p(\mathbf{x} | \mathbf{\theta})\) by an additional term, the prior \(p(\mathbf{\theta})\). By Bayes' theorem, \[p(\mathbf{\theta}|\mathbf{x})=\dfrac{p(\mathbf{x} | \mathbf{\theta})p(\mathbf{\theta})}{p(\mathbf{x})}.\] However, we can ignore \(p(\mathbf{x})\) and write \[p(\mathbf{\theta}|\mathbf{x}) \propto p(\mathbf{x} | \mathbf{\theta})p(\mathbf{\theta}).\] Why?