4.1 The Three Components of Machine Learning
Machine learning systems involve three fundamental components:
- Data — numerical or structured information used for training and testing.
- Models — mathematical structures or functions that represent relationships in the data.
- Learning — the process of adjusting model parameters to improve performance.
A “good” model is one that performs well on unseen data, as judged by a performance metric such as classification accuracy or the distance of its predictions from the ground truth.
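As a minimal illustration of such metrics (the predictions and ground-truth values below are made up, not drawn from any real dataset), the sketch computes classification accuracy and a root-mean-squared distance from the ground truth:

```python
import numpy as np

# Toy ground truth and predictions; illustrative values only.
y_true_class = np.array([1, 0, 1, 1, 0])
y_pred_class = np.array([1, 0, 0, 1, 0])

y_true_reg = np.array([3.2, 1.5, 2.8])
y_pred_reg = np.array([3.0, 1.9, 2.5])

# Classification accuracy: fraction of predictions matching the ground truth.
accuracy = np.mean(y_true_class == y_pred_class)

# Root-mean-squared error: a "distance from the ground truth" for real-valued outputs.
rmse = np.sqrt(np.mean((y_true_reg - y_pred_reg) ** 2))

print(accuracy, rmse)  # 0.8 and roughly 0.31
```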
This chapter outlines common frameworks for training and evaluating models, including:
- Empirical risk minimization (ERM)
- Maximum likelihood estimation (MLE)
- Probabilistic modeling
- Graphical models
- Model selection techniques
4.1.1 Data as Vectors
Machine learning assumes data can be represented numerically, typically in tabular form where:
- Rows correspond to examples (or data points).
- Columns correspond to features (also called attributes or covariates).
Each example is a vector: \[ \mathbf{x}_n \in \mathbb{R}^D, \quad n = 1, \ldots, N \] where \(D\) is the number of features and \(N\) the number of samples.
It is also important to remember that:
- Categorical data must be encoded numerically (e.g., gender as 0/1).
- Scaling: Data should typically be standardized so each feature has mean 0 and variance 1.
- Identifiers (e.g., names) are often dropped for privacy and because they provide no predictive power.
Therefore, the dataset is a set of example-label pairs
\[
\{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\},
\]
and the examples can be stacked row-wise into a matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\).
Example 4.1 Predicting annual salary \(y\) from age \(x\) is a supervised learning problem, where each data point has an associated label.
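As a minimal numerical sketch of these conventions (the ages, salaries, and the extra gender column are made up for illustration), the NumPy code below encodes a categorical feature, stacks the examples of Example 4.1 into a matrix \(\mathbf{X} \in \mathbb{R}^{N \times D}\), and standardizes each feature to mean 0 and variance 1:

```python
import numpy as np

# Hypothetical raw table: age, gender (categorical), annual salary (the label).
ages    = np.array([25.0, 32.0, 47.0, 51.0, 62.0])
genders = np.array(["F", "M", "F", "M", "M"])      # categorical, must be encoded
salary  = np.array([38_000.0, 52_000.0, 71_000.0, 66_000.0, 80_000.0])  # labels y_n

# Encode the categorical feature numerically (here simply F -> 0, M -> 1).
gender_encoded = (genders == "M").astype(float)

# Stack the examples as rows: X has shape (N, D) with N = 5 examples, D = 2 features.
X = np.column_stack([ages, gender_encoded])

# Standardize each feature (column) to mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.shape)          # (5, 2): each row is one example x_n
print(X_std.mean(axis=0))   # approximately [0, 0]
```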
4.1.2 Models as Functions
Definition 4.1 A model defines a mapping from inputs to outputs — a predictor: \[ f: \mathbb{R}^D \to \mathbb{R} \]
For simplicity, many algorithms use linear predictors: \[ f(\mathbf{x}) = \boldsymbol{\theta}^{\top}\mathbf{x} + \theta_0 \] where \(\boldsymbol{\theta}\) and \(\theta_0\) are parameters to be learned.
Linear models balance mathematical simplicity and expressive power, forming the foundation for regression and classification methods.
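A minimal sketch of such a linear predictor is given below; the particular values of \(\boldsymbol{\theta}\) and \(\theta_0\) are placeholders, since nothing has been learned yet (see Section 4.1.4):

```python
import numpy as np

def linear_predictor(x, theta, theta0):
    """Evaluate f(x) = theta^T x + theta0 for a single example x in R^D."""
    return theta @ x + theta0

# Placeholder parameters for D = 2 features; training would set these values.
theta = np.array([0.5, -1.0])
theta0 = 2.0

x_new = np.array([3.0, 1.0])
print(linear_predictor(x_new, theta, theta0))  # 0.5*3.0 - 1.0*1.0 + 2.0 = 2.5
```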
4.1.3 Models as Probability Distributions
Real-world data is noisy, so models must handle uncertainty.
Definition 4.2 A probabilistic model represents predictions not as fixed outputs but as distributions over possible outcomes.
Instead of a single predictor \(f(\mathbf{x})\), we consider a distribution over functions parameterized by finite-dimensional variables. These models express:
- Uncertainty in predictions (e.g., confidence intervals)
- Uncertainty in parameters
Probability theory (Chapter 6) provides the foundation for these ideas. Probabilistic modeling is used to describe machine learning systems in Section 8.4 and to represent them compactly with graphical models in Section 8.5.
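One concrete instance of such a model (an assumption of this sketch, not the only choice) treats the prediction for an input as a Gaussian whose mean is the linear predictor above and whose variance is a fixed noise level, so the model returns a distribution over outcomes rather than a single number:

```python
import numpy as np
from scipy.stats import norm

def predictive_distribution(x, theta, theta0, sigma):
    """Return a Gaussian over the outcome y for input x: N(theta^T x + theta0, sigma^2)."""
    mean = theta @ x + theta0
    return norm(loc=mean, scale=sigma)

# Assumed parameters and noise level, for illustration only.
theta, theta0, sigma = np.array([0.5, -1.0]), 2.0, 0.3

dist = predictive_distribution(np.array([3.0, 1.0]), theta, theta0, sigma)
print(dist.mean())          # point prediction: 2.5
print(dist.interval(0.95))  # an approximate 95% predictive interval around it
```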
4.1.4 Learning as Finding Parameters
Definition 4.3 Learning is the process of finding model parameters that perform well on unseen data.
This involves three algorithmic phases:
- Prediction (Inference) — Using a trained model on new data.
- Training (Parameter Estimation) — Adjusting parameters based on training data.
- Model Selection (Hyperparameter Tuning) — Choosing among competing models or configurations.
There are many different training strategies; some of the most common include:
- Empirical Risk Minimization (ERM) — Optimize parameters by minimizing prediction error on training data (Section 8.2).
- Maximum Likelihood Estimation (MLE) — Choose parameters that make observed data most probable (Section 8.3).
- Bayesian Inference — Model uncertainty over parameters using probability distributions (Section 8.4).
Training typically involves numerical optimization (Chapter 7), often framed as minimizing a cost function. Cross-validation (Section 8.2.4) is used to simulate performance on unseen data.
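To make the ERM view concrete, the sketch below fits the linear predictor on a made-up training set by minimizing the average squared training error (for squared loss with Gaussian noise this coincides with maximum likelihood estimation); the data, noise level, and use of a least-squares solver are all assumptions of this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: N = 50 examples with D = 2 features and noisy linear labels.
N, D = 50, 2
X = rng.normal(size=(N, D))
true_theta, true_theta0 = np.array([1.5, -2.0]), 0.5
y = X @ true_theta + true_theta0 + 0.1 * rng.normal(size=N)

# Empirical risk minimization with squared loss:
#   minimize (1/N) * sum_n (y_n - (theta^T x_n + theta0))^2.
# Appending a constant column folds theta0 into the parameter vector.
X_aug = np.column_stack([X, np.ones(N)])
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
theta_hat, theta0_hat = params[:-1], params[-1]

empirical_risk = np.mean((y - X_aug @ params) ** 2)
print(theta_hat, theta0_hat, empirical_risk)
```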
4.1.5 Regularization and Model Complexity
To achieve good generalization, we balance:
- Fit to training data, and
- Model simplicity.
This is achieved through:
- Regularization — adding penalty terms to discourage complexity (Section 8.2.3)
- Bayesian priors — probabilistic constraints on parameters (Section 8.3.2)
This process reflects abduction — inference to the best explanation — rather than strict induction or deduction.
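A minimal sketch of L2 regularization in this setting: adding a penalty \(\lambda\|\boldsymbol{\theta}\|^2\) to the squared-error objective gives ridge regression, whose closed-form solution is shown below (the toy data and the values of \(\lambda\) are assumptions of this example):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X theta||^2 + lam * ||theta||^2 via (X^T X + lam I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=30)

print(ridge_fit(X, y, lam=0.0))   # no penalty: ordinary least squares
print(ridge_fit(X, y, lam=10.0))  # larger penalty shrinks the parameters toward zero
```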
4.1.6 Model Selection and Hyperparameters
Model selection chooses the best model or hyperparameters (e.g., the number of components or the type of distribution). This can be done using:
- Cross-validation (a sketch follows this list)
- Nested cross-validation for hyperparameter tuning
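The sketch below shows a plain K-fold cross-validation loop for choosing the regularization strength \(\lambda\) of the ridge model from the previous subsection; the fold count, candidate values, and toy data are assumptions of this example:

```python
import numpy as np

def kfold_cv_error(X, y, lam, K=5):
    """Average validation error of ridge regression over K folds."""
    N = X.shape[0]
    folds = np.array_split(np.arange(N), K)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        theta = np.linalg.solve(
            X[train_idx].T @ X[train_idx] + lam * np.eye(X.shape[1]),
            X[train_idx].T @ y[train_idx],
        )
        errors.append(np.mean((y[val_idx] - X[val_idx] @ theta) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.2 * rng.normal(size=100)

# Pick the candidate value with the lowest cross-validated error.
candidates = [0.01, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=lambda lam: kfold_cv_error(X, y, lam))
print(best_lam)
```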
Exercises
Exercise 4.1 Suppose we have a model \[\hat{y}_i = \beta_0 + \beta_1 x_i.\] The least squares problem aims to minimize the loss function \[l(\beta_0, \beta_1) = \sum_{i=1}^{N} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2.\] Suppose we add a regularization term \(\lambda \|\boldsymbol{\beta}\|^2\) to the function we want to minimize. Construct the function we now want to minimize and find the equations that minimize the sum of squared residuals. This is known as L2 regularization.
Exercise 4.2 Suppose we have a model \[\hat{y}_i = \beta_0 + \beta_1 x_i.\] The least squares problem aims to minimize the loss function \[l(\beta_0, \beta_1) = \sum_{i=1}^{N} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2.\] Suppose we add a regularization term \(\lambda \sum_{j=0}^{1} |\beta_j|\) to the function we want to minimize. Construct the function we now want to minimize and find the equations that minimize the sum of squared residuals. This is known as L1 regularization.
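The sketch below is not a solution to either exercise, only a way to check one numerically: it minimizes both regularized objectives on made-up data with `scipy.optimize.minimize`, so hand-derived values of \((\beta_0, \beta_1)\) can be compared against the numerical minimizers (the data and the value of \(\lambda\) are assumptions of this example):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up one-dimensional data for checking derivations numerically.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
lam = 0.5

def l2_objective(beta):
    """Sum of squared residuals plus the L2 penalty of Exercise 4.1."""
    b0, b1 = beta
    return np.sum((y - (b0 + b1 * x)) ** 2) + lam * (b0 ** 2 + b1 ** 2)

def l1_objective(beta):
    """Sum of squared residuals plus the L1 penalty of Exercise 4.2."""
    b0, b1 = beta
    return np.sum((y - (b0 + b1 * x)) ** 2) + lam * (abs(b0) + abs(b1))

print(minimize(l2_objective, x0=np.zeros(2)).x)                        # compare with Exercise 4.1
print(minimize(l1_objective, x0=np.zeros(2), method="Nelder-Mead").x)  # compare with Exercise 4.2
```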