8.2 Primal Support Vector Machine

Support Vector Machines (SVMs) are based on the geometric concept of finding a separating hyperplane that best divides positive and negative examples. When data is linearly separable, there are infinitely many such hyperplanes. To obtain a unique and robust solution, SVMs choose the hyperplane that maximizes the margin—the distance between the hyperplane and the nearest data points.

A large-margin classifier tends to generalize well to unseen data (Steinwart & Christmann, 2008).

Definition 8.2 The margin is the distance between the separating hyperplane and the closest training examples.

Example 8.2 For a hyperplane defined as: \[ f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b = 0, \] and an example \(\mathbf{x}_a\), the orthogonal distance \(r\) to the hyperplane is obtained by projecting \(\mathbf{x}_a\) onto it: \[ \mathbf{x}_a = \mathbf{x}_a' + r \frac{\mathbf{w}}{\|\mathbf{w}\|}, \] where \(\mathbf{x}_a'\) lies on the hyperplane and \(\|\mathbf{w}\|\) is the Euclidean norm of \(\mathbf{w}\). Taking the inner product of both sides with \(\mathbf{w}\), adding \(b\), and using \(\langle \mathbf{w}, \mathbf{x}_a' \rangle + b = 0\) gives \[ r = \frac{\langle \mathbf{w}, \mathbf{x}_a \rangle + b}{\|\mathbf{w}\|}. \]
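As a small numerical check (the values of \(\mathbf{w}\), \(b\), and \(\mathbf{x}_a\) below are made up for illustration), the following numpy sketch computes \(r\) and verifies that the projected point \(\mathbf{x}_a' = \mathbf{x}_a - r\,\mathbf{w}/\|\mathbf{w}\|\) lies on the hyperplane:

```python
import numpy as np

# Toy hyperplane parameters and an example point (illustrative values only).
w = np.array([3.0, 4.0])    # normal vector of the hyperplane
b = -5.0                    # offset
x_a = np.array([4.0, 3.0])  # example whose distance we want

# Signed distance from x_a to the hyperplane <w, x> + b = 0.
r = (w @ x_a + b) / np.linalg.norm(w)

# Orthogonal projection of x_a onto the hyperplane: x_a' = x_a - r * w / ||w||.
x_a_proj = x_a - r * w / np.linalg.norm(w)

print("distance r:", r)                                          # 3.8
print("x_a' on hyperplane:", np.isclose(w @ x_a_proj + b, 0.0))  # True
```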

To ensure that all examples lie at least a distance \(r\) from the hyperplane, we require: \[ y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge r. \] The left-hand side equals the true distance only when \(\|\mathbf{w}\| = 1\) (unit-length normalization). Under this assumption, the optimization problem becomes: \[ \max_{\mathbf{w},b,r} \quad r, \] subject to
\[ y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge r, \quad \|\mathbf{w}\| = 1, \quad r > 0. \]

This formulation seeks the largest margin \(r\) such that every training example is correctly classified and lies at least a distance \(r\) from the hyperplane.


8.2.1 Traditional Derivation of the Margin

An equivalent approach fixes the scale of \(\mathbf{w}\) and \(b\) such that the closest examples satisfy: \[ y_a(\langle \mathbf{w}, \mathbf{x}_a \rangle + b) = 1. \] Combining this with the distance formula from Example 8.2, the closest examples lie at distance \[ r = \frac{1}{\|\mathbf{w}\|} \] from the hyperplane. Thus, maximizing the margin is equivalent to minimizing the norm of \(\mathbf{w}\). The optimization problem becomes: \[ \min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2, \] subject to
\[ y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge 1, \quad \forall n. \]

This is known as the Hard Margin SVM, which assumes perfect separability and allows no violations of the margin condition.
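As a minimal sketch (assuming the cvxpy package is available; the tiny linearly separable dataset below is made up for illustration), the hard-margin problem can be handed directly to a generic convex solver:

```python
import cvxpy as cp
import numpy as np

# Tiny linearly separable toy dataset (made up for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2) ||w||^2  subject to  y_n (<w, x_n> + b) >= 1 for all n.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin r = 1/||w|| =", 1.0 / np.linalg.norm(w.value))
```

The recovered margin equals \(1/\|\mathbf{w}\|\), matching the derivation in Section 8.2.1.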


8.2.2 Why Can We Set the Margin to 1?

Theorem 8.1 below shows that the two formulations, maximizing \(r\) under \(\|\mathbf{w}\|=1\) and minimizing \(\|\mathbf{w}\|^2\) under a unit margin, are equivalent.

Theorem 8.1 Maximizing the margin \(r\), where we consider normalized weights: \[ \begin{aligned} \max_{\mathbf{w}, b, r} \quad & r \quad && \text{(margin)} \\ \text{subject to} \quad & y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge r \quad && \text{(data fitting)} \\ & \|\mathbf{w}\| = 1 \quad && \text{(normalization)} \\ & r > 0 \end{aligned} \] is equivalent to scaling the data such that the margin is unity: \[ \begin{aligned} \min_{\mathbf{w}, b} \quad & \frac{1}{2} \|\mathbf{w}\|^2 \quad && \text{(margin)} \\ \text{subject to} \quad & y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge 1 \quad && \text{(data fitting)} \end{aligned} \]

Hence, setting the margin to 1 simplifies computation without changing the solution. The two problems are related by a rescaling of the parameters, and the optimal margin of the first formulation is recovered from the solution of the second as \[ r = \frac{1}{\|\mathbf{w}\|}. \]
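To see the rescaling explicitly, take any feasible \((\mathbf{w}, b, r)\) of the first problem with \(\|\mathbf{w}\| = 1\) and \(r > 0\), and divide the data-fitting constraint by \(r\): \[ y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge r \quad\Longleftrightarrow\quad y_n\left(\left\langle \frac{\mathbf{w}}{r}, \mathbf{x}_n \right\rangle + \frac{b}{r}\right) \ge 1. \] Writing \(\mathbf{w}' = \mathbf{w}/r\) and \(b' = b/r\), we obtain \(\|\mathbf{w}'\| = 1/r\), so maximizing \(r\) is the same as minimizing \(\|\mathbf{w}'\|\) (equivalently \(\frac{1}{2}\|\mathbf{w}'\|^2\)) subject to \(y_n(\langle \mathbf{w}', \mathbf{x}_n \rangle + b') \ge 1\), and the optimal margin is recovered as \(r = 1/\|\mathbf{w}'\|\).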


8.2.3 Soft Margin SVM: Geometric View

When data is not linearly separable, we relax the hard constraints by introducing slack variables \(\xi_n \ge 0\), allowing individual points to lie inside the margin or even on the wrong side of the hyperplane.

The resulting Soft Margin SVM optimization problem is: \[ \min_{\mathbf{w},b,\xi} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \xi_n, \] subject to
\[ y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) \ge 1 - \xi_n, \quad \xi_n \ge 0. \]

Here:

  • \(C > 0\) is the regularization parameter controlling the trade-off between a large margin and small classification error.
  • The term \(\|\mathbf{w}\|^2\) acts as the regularizer, encouraging simpler models.
  • Larger \(C\) penalizes margin violations more strongly (less regularization); a short fitting sketch for this formulation follows this list.
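As a brief sketch (assuming scikit-learn is installed; the toy data below is made up purely for illustration), a linear-kernel SVC fits a soft-margin SVM of this form, and sweeping \(C\) exposes the trade-off described above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping 2-D data (illustrative values only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[2.0, 2.0], size=(20, 2)),
               rng.normal(loc=[-2.0, -2.0], size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Margin width is 1/||w||; larger C tolerates fewer/smaller violations.
    print(f"C={C:>6}:  margin = {1.0 / np.linalg.norm(w):.3f}, "
          f"support vectors = {len(clf.support_)}")
```

With small \(C\) the margin is typically wide and many points violate it; as \(C\) grows, the solution approaches the hard-margin classifier whenever the data happen to be separable.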

8.2.4 Soft Margin SVM: Loss Function View

Using the empirical risk minimization framework, we can rewrite the SVM objective as minimizing a regularized loss: \[ \min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \ell(y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b)), \] where \(\ell(t)\) is the hinge loss: \[ \ell(t) = \max(0, 1 - t). \] This view is equivalent to the slack formulation because, at the optimum, each slack variable takes the value \(\xi_n = \max(0, 1 - y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b))\). The hinge loss penalizes points as follows:

  • Correctly classified and on or beyond the margin (\(t \ge 1\)): no penalty (\(\ell = 0\)),
  • Correctly classified but inside the margin (\(0 < t < 1\)): a penalty that grows linearly as the point approaches the decision boundary,
  • On the decision boundary or misclassified (\(t \le 0\)): a penalty of at least 1, growing linearly with the distance from the margin.

Hence, the SVM objective can be written as an unconstrained optimization: \[ \min_{\mathbf{w},b} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^N \max(0, 1 - y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b)). \] The first term acts as a regularizer (encouraging a large margin), while the second term measures training error.
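To connect this unconstrained form to a concrete procedure, here is a minimal subgradient-descent sketch in plain numpy (the learning rate, epoch count, and toy data are arbitrary assumptions for illustration, not a tuned configuration):

```python
import numpy as np

def svm_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum_n max(0, 1 - y_n(<w, x_n> + b)) by subgradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        t = y * (X @ w + b)          # functional margins y_n(<w, x_n> + b)
        violated = t < 1             # points with nonzero hinge loss
        # Subgradient of the objective with respect to (w, b).
        grad_w = w - C * (y[violated, None] * X[violated]).sum(axis=0)
        grad_b = -C * y[violated].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy usage with made-up, mostly separable data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 1.0, (30, 2)), rng.normal(-2.0, 1.0, (30, 2))])
y = np.array([1.0] * 30 + [-1.0] * 30)
w, b = svm_subgradient(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```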

Exercises

Put some exercises here.