Chapter 8 Classification with Support Vector Machines (SVMs)

In many machine learning problems, we aim to predict discrete outcomes rather than continuous values. For example:

  • An email classifier distinguishes between personal mail and junk mail.
  • A telescope identifies whether an object is a galaxy, star, or planet.

This type of task is known as classification, and when there are only two possible outcomes, it is called binary classification.

In binary classification, the goal is to predict labels \(y_n \in \{+1, -1\}\) for examples \(\mathbf{x}_n \in \mathbb{R}^D\). The two labels are often referred to as the positive and negative classes, although these names do not imply any moral or semantic “positiveness.”

Example 8.1 In a cancer detection problem, \(y = +1\) might indicate the presence of cancer in a patient, while \(y = -1\) might indicate its absence.

Formally, the predictor is a function: \[ f: \mathbb{R}^D \to \{+1, -1\}. \]

The task is to find model parameters that minimize classification error on a labeled dataset: \[ \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\}. \]
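The objective can be stated directly: count how often the predicted label disagrees with the true label. As a small illustration (a sketch only; the predictor and the toy data below are hypothetical and not taken from the chapter), the following snippet computes this empirical classification error:

```python
import numpy as np

def classification_error(f, X, y):
    """Average zero-one loss: the fraction of examples (x_n, y_n)
    for which the predicted label f(x_n) differs from y_n."""
    predictions = np.array([f(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical labeled dataset: N = 4 examples in R^2, labels in {+1, -1}.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

# An illustrative linear predictor f(x) = sign(<w, x> + b).
w, b = np.array([1.0, 1.0]), 0.0
f = lambda x: np.sign(w @ x + b)

print(classification_error(f, X, y))  # 0.0: this toy data is separable
```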

As in regression (Chapter 9), SVMs assume a linear model in some transformed feature space: \[ f(\mathbf{x}) = \text{sign}(\langle \mathbf{w}, \phi(\mathbf{x}) \rangle + b), \] where \(\phi(\mathbf{x})\) is a possibly nonlinear transformation of the input. This approach allows SVMs to represent nonlinear decision boundaries through kernel functions.
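To make the role of \(\phi\) concrete, here is a minimal sketch of this predictor, assuming a hand-picked degree-2 polynomial feature map and illustrative (not learned) parameters \(\mathbf{w}\) and \(b\):

```python
import numpy as np

def phi(x):
    """A hypothetical nonlinear feature map phi: R^2 -> R^3,
    here the degree-2 monomials (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def predict(x, w, b):
    """SVM-style predictor: f(x) = sign(<w, phi(x)> + b)."""
    return np.sign(w @ phi(x) + b)

# Illustrative parameters in feature space (in practice, learned from data).
w = np.array([1.0, 0.0, 1.0])
b = -1.0

print(predict(np.array([2.0, 0.0]), w, b))  # +1: x1^2 + x2^2 > 1
print(predict(np.array([0.5, 0.5]), w, b))  # -1: inside the unit circle
```

Although the model is linear in feature space, the resulting decision boundary in input space is the unit circle \(x_1^2 + x_2^2 = 1\), i.e., nonlinear.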

SVMs are a widely used and theoretically grounded method for binary classification. They are particularly valuable because:

  1. They provide a geometric interpretation of classification, based on concepts such as inner products, projection, and margins.
  2. The optimization problem in SVMs does not have a closed-form solution, requiring tools from convex optimization (see Chapter 7).
  3. They achieve strong theoretical guarantees and excellent empirical performance (Steinwart & Christmann, 2008).

Unlike probabilistic approaches such as maximum likelihood estimation (Chapter 9), SVMs are derived from geometric principles and empirical risk minimization (Section 8.2).
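As a practical preview of where the chapter is heading, the following sketch trains a soft-margin SVM on synthetic data using scikit-learn's `SVC`; the choice of this library, the linear kernel, and the hyperparameter `C` are assumptions of the illustration, not something derived in the text.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic binary classification data in R^2 with labels in {+1, -1}.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=+2.0, scale=1.0, size=(50, 2))
X_neg = rng.normal(loc=-2.0, scale=1.0, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.full(50, +1), np.full(50, -1)])

# A linear soft-margin SVM; C controls the margin/error trade-off.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict(np.array([[2.5, 2.0], [-3.0, -1.0]])))  # expected: [ 1 -1]
```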