8.3 Dual Support Vector Machine

The Dual Support Vector Machine (SVM) is an alternative formulation of the SVM optimization problem. While the primal SVM works with the variables \(\mathbf{w}\) and \(b\) (whose dimensionality depends on the number of features \(D\)), the dual SVM expresses the problem in terms of Lagrange multipliers \(\alpha_n\), one per training example, so the number of parameters depends instead on the number of examples \(N\). This formulation is especially useful when:

  • The number of features is much larger than the number of examples.
  • Kernel functions are used, as they can be incorporated easily in the dual form.
  • We want to leverage convex duality, discussed earlier in the text.

8.3.1 Convex Duality via Lagrange Multipliers

The dual formulation is derived from the primal soft-margin SVM by introducing Lagrange multipliers:

  • \(\alpha_n \ge 0\) for the classification constraints.
  • \(\gamma_n \ge 0\) for the non-negativity of the slack variables \(\xi_n\).

The Lagrangian is defined as: \[ L(\mathbf{w}, b, \xi, \alpha, \gamma) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{n=1}^N \xi_n - \sum_{n=1}^N \alpha_n[y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) - 1 + \xi_n] - \sum_{n=1}^N \gamma_n \xi_n. \] Setting the derivatives with respect to the primal variables \(\mathbf{w}\), \(b\), and \(\xi_n\) to zero gives: \[ \frac{\partial L}{\partial \mathbf{w}} = \mathbf{0} \Rightarrow \mathbf{w} = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n, \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{n=1}^N \alpha_n y_n = 0, \qquad \frac{\partial L}{\partial \xi_n} = 0 \Rightarrow C - \alpha_n - \gamma_n = 0. \] The conditions for \(b\) and \(\xi_n\) reappear as the equality and box constraints of the dual problem below.

This equation illustrates the representer theorem: \(\mathbf{w}\) is a linear combination of the training examples. Only points with \(\alpha_n > 0\) contribute to \(\mathbf{w}\); these are the support vectors.
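Why only the examples with active constraints contribute follows from the complementary slackness conditions that accompany the Lagrangian above (stated here for completeness): \[ \alpha_n \big[ y_n(\langle \mathbf{w}, \mathbf{x}_n \rangle + b) - 1 + \xi_n \big] = 0, \qquad \gamma_n \xi_n = 0, \qquad n = 1, \ldots, N. \] If an example satisfies its margin constraint strictly (with \(\xi_n = 0\)), then \(\alpha_n = 0\), so removing that example leaves \(\mathbf{w}\) unchanged.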


8.3.2 Dual Optimization Problem

Substituting \(\mathbf{w}\) back into the Lagrangian yields the dual problem: \[ \begin{aligned} \min_{\alpha} \quad & \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i y_j \alpha_i \alpha_j \langle \mathbf{x}_i, \mathbf{x}_j \rangle - \sum_{i=1}^N \alpha_i \\ \text{subject to} \quad & \sum_{i=1}^N y_i \alpha_i = 0, \\ & 0 \le \alpha_i \le C, \; \forall i = 1, \ldots, N. \end{aligned} \] The box constraints restrict each \(\alpha_i\) to the interval \([0, C]\). Support vectors with \(0 < \alpha_i < C\) lie exactly on the margin boundary, while examples with \(\alpha_i = C\) lie on or inside the margin. Note that the training inputs enter only through the inner products \(\langle \mathbf{x}_i, \mathbf{x}_j \rangle\), which is what allows kernel functions to be substituted directly.
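The dual problem above is a quadratic program and can be handed to an off-the-shelf convex solver. The following is a minimal sketch, assuming the CVXPY library and randomly generated toy data (neither is part of the text); for the linear kernel the double sum in the objective equals \(\frac{1}{2}\big\|\sum_n \alpha_n y_n \mathbf{x}_n\big\|^2\), which the code exploits.

    import numpy as np
    import cvxpy as cp

    # Illustrative toy data: two Gaussian blobs labelled +1 and -1.
    rng = np.random.default_rng(0)
    N, D, C = 40, 2, 1.0
    X = np.vstack([rng.normal(+2.0, 1.0, (N // 2, D)),
                   rng.normal(-2.0, 1.0, (N // 2, D))])
    y = np.hstack([np.ones(N // 2), -np.ones(N // 2)])

    Z = y[:, None] * X          # row n is y_n * x_n
    alpha = cp.Variable(N)

    # Dual objective: 1/2 * sum_ij y_i y_j alpha_i alpha_j <x_i, x_j> - sum_i alpha_i,
    # written as 1/2 * ||Z^T alpha||^2 - sum(alpha) for the linear kernel.
    objective = cp.Minimize(0.5 * cp.sum_squares(Z.T @ alpha) - cp.sum(alpha))
    constraints = [alpha >= 0,                          # box constraints ...
                   alpha <= C,
                   cp.sum(cp.multiply(y, alpha)) == 0]  # ... and the equality constraint

    cp.Problem(objective, constraints).solve()
    alpha_opt = alpha.value     # optimal Lagrange multipliers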

The primal parameters can be recovered as: \[ \mathbf{w}^* = \sum_{n=1}^N \alpha_n y_n \mathbf{x}_n, \quad b^* = y_n - \langle \mathbf{w}^*, \mathbf{x}_n \rangle \] for any support vector \(\mathbf{x}_n\) with \(0 < \alpha_n < C\), i.e., an example lying exactly on the margin.
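As a small sketch of this recovery step (assuming arrays X, y, the solved multipliers alpha, and the constant C as in the snippet above), one can compute \(\mathbf{w}^*\) directly and obtain \(b^*\) from the margin support vectors; averaging over all of them, rather than picking a single example as in the formula above, is a common numerically robust variant.

    import numpy as np

    def recover_primal(X, y, alpha, C, tol=1e-6):
        """Recover (w*, b*) of the soft-margin SVM from the solved dual multipliers."""
        w = (alpha * y) @ X                       # w* = sum_n alpha_n y_n x_n
        # Examples with 0 < alpha_n < C lie exactly on the margin, so
        # y_n (<w*, x_n> + b*) = 1 holds for them; average b* over these points.
        on_margin = (alpha > tol) & (alpha < C - tol)
        b = float(np.mean(y[on_margin] - X[on_margin] @ w))
        return w, b

    # Example usage with the previous sketch: w_star, b_star = recover_primal(X, y, alpha_opt, C)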


8.3.3 Dual SVM: Convex Hull View

An alternative geometric view of the dual SVM comes from convex hulls:

  • Each class (positive and negative) forms a convex hull from its examples.
  • The SVM finds the closest points \(\mathbf{c}\) and \(\mathbf{d}\) between these two convex hulls.
  • The difference vector \(\mathbf{w} = \mathbf{c} - \mathbf{d}\) is normal to the maximum-margin hyperplane, which bisects the segment between \(\mathbf{c}\) and \(\mathbf{d}\).

Mathematically: \[ \text{conv}(X) = \left\{ \sum_{n=1}^N \alpha_n \mathbf{x}_n \; \Big| \; \sum_{n=1}^N \alpha_n = 1, \; \alpha_n \ge 0 \right\}. \] The optimization problem can then be expressed as minimizing the distance between the convex hulls: \[ \min_{\alpha} \frac{1}{2} \left\| \sum_{n: y_n = +1} \alpha_n \mathbf{x}_n - \sum_{n: y_n = -1} \alpha_n \mathbf{x}_n \right\|^2, \] subject to: \[ \sum_{n: y_n = +1} \alpha_n = 1, \quad \sum_{n: y_n = -1} \alpha_n = 1, \quad \alpha_n \ge 0, \] so that each sum is a genuine convex combination of its class's examples (and the condition \(\sum_{n=1}^N y_n \alpha_n = 0\) follows automatically). For the soft-margin case, a reduced convex hull is used by additionally limiting \(\alpha_n \le C\), which shrinks each hull and allows for misclassified examples.
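As an illustrative sketch of this geometric view (again assuming CVXPY and arrays X, y as in the earlier snippets), the closest points of the two hulls can be computed by solving the problem above directly; passing an upper bound on the weights gives the reduced hulls of the soft-margin case.

    import cvxpy as cp

    def hull_closest_points(X, y, C=None):
        """Closest points c, d of the two class convex hulls (reduced hulls if C is given)."""
        pos, neg = X[y == +1], X[y == -1]
        a_pos = cp.Variable(len(pos), nonneg=True)   # convex-combination weights, class +1
        a_neg = cp.Variable(len(neg), nonneg=True)   # convex-combination weights, class -1
        c = pos.T @ a_pos                            # a point in conv(positive examples)
        d = neg.T @ a_neg                            # a point in conv(negative examples)
        constraints = [cp.sum(a_pos) == 1, cp.sum(a_neg) == 1]
        if C is not None:
            # Reduced convex hulls: needs C >= 1/len(pos) and C >= 1/len(neg) to stay feasible.
            constraints += [a_pos <= C, a_neg <= C]
        cp.Problem(cp.Minimize(0.5 * cp.sum_squares(c - d)), constraints).solve()
        return pos.T @ a_pos.value, neg.T @ a_neg.value

    # Example usage: c_star, d_star = hull_closest_points(X, y); w = c_star - d_star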

Exercises

Put some exercises here.