8.4 Kernels
The dual formulation of the SVM depends only on inner products between training examples \(\mathbf{x}_i\) and \(\mathbf{x}_j\). This modularity allows us to replace the inner product with a more general similarity function, giving rise to the concept of kernels.
8.4.1 Feature Representations and Kernels
Let each input \(\mathbf{x}\) be represented in some (possibly nonlinear) feature space by a mapping \[ \phi : \mathcal{X} \rightarrow \mathcal{H}, \] where \(\mathcal{H}\) is a Hilbert space. In this case, the dual SVM objective involves terms of the form: \[ \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}. \]
Rather than explicitly computing \(\phi(\mathbf{x})\), we define a kernel function: \[ k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}. \] This allows the SVM to implicitly operate in a high-dimensional (or even infinite-dimensional) space, without directly computing the transformation. This substitution is known as the kernel trick — it hides the explicit feature mapping while preserving the geometric relationships defined by the inner product.
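To make the kernel trick concrete, the following sketch (illustrative NumPy code; the feature map and inputs are my own example, not part of the text) compares a degree-2 polynomial kernel \((\langle \mathbf{x}, \mathbf{y} \rangle + c)^2\) against an explicit feature map \(\phi\) for two-dimensional inputs. Both routes yield the same number, but the kernel never constructs the six-dimensional features.

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit degree-2 polynomial feature map for a 2-D input x.
    Chosen so that <phi(x), phi(y)> = (<x, y> + c)^2."""
    x1, x2 = x
    return np.array([
        x1**2, x2**2, np.sqrt(2) * x1 * x2,
        np.sqrt(2 * c) * x1, np.sqrt(2 * c) * x2, c
    ])

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel evaluated directly, without constructing phi."""
    return (np.dot(x, y) + c) ** d

x = np.array([1.0, 2.0])
y = np.array([-0.5, 3.0])

print(np.dot(phi(x), phi(y)))   # explicit feature-space inner product: 42.25
print(poly_kernel(x, y))        # kernel trick, same value: 42.25
```

The same idea scales to higher degrees and dimensions, where the explicit expansion grows combinatorially while the kernel evaluation remains a single dot product followed by a power.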
8.4.2 The Kernel Function and RKHS
- A kernel is a function \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) that defines inner products in a reproducing kernel Hilbert space (RKHS).
- Every valid kernel has a unique RKHS (Aronszajn, 1950).
- The canonical feature map is \(\phi(\mathbf{x}) = k(\cdot, \mathbf{x})\).
A valid kernel must be symmetric, \(k(\mathbf{x}_i, \mathbf{x}_j) = k(\mathbf{x}_j, \mathbf{x}_i)\), and its Gram matrix (or kernel matrix) \(K\), defined by \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\), must be positive semidefinite: \[ \forall \mathbf{z} \in \mathbb{R}^N : \mathbf{z}^\top K \mathbf{z} \ge 0 \] for any finite set of \(N\) inputs. In other words, every Gram matrix built from the kernel is symmetric and positive semidefinite.
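As a quick numerical illustration (a sketch assuming NumPy, using the Gaussian RBF kernel defined in the next subsection), one can sample arbitrary points, build the Gram matrix, and check symmetry and positive semidefiniteness through its eigenvalues:

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian RBF kernel, used here as one example of a valid kernel."""
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 arbitrary points in R^3

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

print(np.allclose(K, K.T))            # symmetry
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)        # eigenvalues (numerically) non-negative => PSD
```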
8.4.3 Common Kernels
Some widely used kernels for data \(\mathbf{x}_i \in \mathbb{R}^D\) include:
- Polynomial kernel:
\[ k(\mathbf{x}_i, \mathbf{x}_j) = (\langle \mathbf{x}_i, \mathbf{x}_j \rangle + c)^d. \]
- Gaussian radial basis function (RBF) kernel:
\[ k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right). \]
- Rational quadratic kernel:
\[ k(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\|\mathbf{x}_i - \mathbf{x}_j\|^2 + c}. \]
These kernels yield nonlinear decision boundaries in the input space, while the SVM still computes a linear separator in feature space.
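For reference, the following sketch (illustrative NumPy code; the function names are my own) evaluates the Gram matrices of the three kernels above for two sets of points stacked as the rows of \(X\) and \(Y\):

```python
import numpy as np

def polynomial_gram(X, Y, c=1.0, d=3):
    """(<x_i, y_j> + c)^d for all pairs of rows of X and Y."""
    return (X @ Y.T + c) ** d

def _sq_dists(X, Y):
    """Pairwise squared Euclidean distances ||x_i - y_j||^2."""
    return (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2 * X @ Y.T
    )

def rbf_gram(X, Y, sigma=1.0):
    """exp(-||x_i - y_j||^2 / (2 sigma^2)) for all pairs."""
    return np.exp(-_sq_dists(X, Y) / (2 * sigma**2))

def rational_quadratic_gram(X, Y, c=1.0):
    """1 - ||x_i - y_j||^2 / (||x_i - y_j||^2 + c) for all pairs."""
    sq = _sq_dists(X, Y)
    return 1.0 - sq / (sq + c)
```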
8.4.4 Practical Aspects
The kernel trick offers computational efficiency — the kernel can be computed directly without constructing \(\phi(\mathbf{x})\). For example,
- The polynomial kernel avoids explicit polynomial expansion.
- The Gaussian RBF kernel corresponds to an infinite-dimensional feature space.
The kernel and its parameters (e.g., degree \(d\), bandwidth \(\sigma\)) are typically selected via nested cross-validation. Kernels are not restricted to vector data; they can be defined over strings, graphs, sets, distributions, and other structured objects.
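One possible recipe for nested cross-validation is sketched below (a minimal example using scikit-learn on synthetic data; the parameter grid is purely illustrative, and scikit-learn parameterizes the RBF kernel by \(\gamma = 1/(2\sigma^2)\)):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic classification data in place of a real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: select C and the RBF bandwidth (via gamma) on 3 folds.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)

# Outer loop: estimate generalization performance of the whole procedure on 5 folds.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

The inner loop chooses hyperparameters, while the outer loop estimates the generalization performance of the entire selection procedure, avoiding the optimistic bias of tuning and evaluating on the same folds.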