8.4 Kernels

The dual formulation of the SVM depends on the training examples \(\mathbf{x}_i\) and \(\mathbf{x}_j\) only through their inner products. This modularity allows us to replace the inner product with a more general similarity function, giving rise to the concept of kernels.


8.4.1 Feature Representations and Kernels

Let each input \(\mathbf{x}\) be represented in some (possibly nonlinear) feature space by a mapping \[ \phi : \mathcal{X} \rightarrow \mathcal{H}, \] where \(\mathcal{H}\) is a Hilbert space. The dual SVM objective then involves terms of the form: \[ \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}. \]
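
For example, with inputs in \(\mathbb{R}^2\), the map \[ \phi(\mathbf{x}) = \big(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\big)^\top \] sends \(\mathcal{X} = \mathbb{R}^2\) into \(\mathcal{H} = \mathbb{R}^3\), and a short calculation shows that \(\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}} = \langle \mathbf{x}_i, \mathbf{x}_j \rangle^2\): the feature-space inner product can be evaluated from the original two-dimensional inputs alone.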

Rather than explicitly computing \(\phi(\mathbf{x})\), we define a kernel function: \[ k(\mathbf{x}_i, \mathbf{x}_j) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle_{\mathcal{H}}. \] This allows the SVM to implicitly operate in a high-dimensional (or even infinite-dimensional) space, without directly computing the transformation. This substitution is known as the kernel trick — it hides the explicit feature mapping while preserving the geometric relationships defined by the inner product.
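
This can be checked numerically for the degree-2 example above. The following sketch (using NumPy; the helper names are illustrative, not part of any standard API) compares the explicit feature map with the direct kernel evaluation:

    import numpy as np

    def phi2(x):
        # Explicit degree-2 feature map for 2-D inputs: (x1^2, sqrt(2)*x1*x2, x2^2).
        return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

    def poly2_kernel(x, y):
        # Kernel evaluated directly on the original inputs: <x, y>^2.
        return np.dot(x, y) ** 2

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=2), rng.normal(size=2)

    # Both routes give the same value, up to floating-point error.
    assert np.isclose(np.dot(phi2(x), phi2(y)), poly2_kernel(x, y))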


8.4.2 The Kernel Function and RKHS

  • A kernel is a function \(k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}\) that defines inner products in a reproducing kernel Hilbert space (RKHS).
  • Every valid kernel has a unique RKHS (Aronszajn, 1950).
  • The canonical feature map is \(\phi(\mathbf{x}) = k(\cdot, \mathbf{x})\).

A valid kernel must be symmetric, \(k(\mathbf{x}_i, \mathbf{x}_j) = k(\mathbf{x}_j, \mathbf{x}_i)\), and positive semidefinite: for every finite collection of inputs \(\mathbf{x}_1, \dots, \mathbf{x}_N\) and every \(z \in \mathbb{R}^N\), \[ z^\top K z \ge 0, \] where \(K\) is the Gram matrix (or kernel matrix) defined by \(K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)\).
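
These properties can be checked numerically for any candidate kernel on a small sample. The sketch below (NumPy only, using the simple linear kernel \(k(\mathbf{x}_i, \mathbf{x}_j) = \langle \mathbf{x}_i, \mathbf{x}_j \rangle\) for brevity) builds a Gram matrix and tests symmetry and positive semidefiniteness:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))        # five inputs in R^3

    # Gram matrix for the linear kernel: K_ij = <x_i, x_j>.
    K = X @ X.T

    assert np.allclose(K, K.T)                    # symmetry
    assert np.linalg.eigvalsh(K).min() >= -1e-10  # PSD, up to round-off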


8.4.3 Common Kernels

Some widely used kernels for data \(\mathbf{x}_i \in \mathbb{R}^D\) include:

  • Polynomial kernel:
    \[ k(\mathbf{x}_i, \mathbf{x}_j) = (\langle \mathbf{x}_i, \mathbf{x}_j \rangle + c)^d. \]
  • Gaussian radial basis function (RBF) kernel:
    \[ k(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right). \]
  • Rational quadratic kernel:
    \[ k(\mathbf{x}_i, \mathbf{x}_j) = 1 - \frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{\|\mathbf{x}_i - \mathbf{x}_j\|^2 + c}. \]

These kernels allow nonlinear decision boundaries while the SVM still computes a linear separator in feature space.
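
As a rough sketch (parameter defaults are illustrative only, not recommendations), these kernels can be written as plain functions of two vectors:

    import numpy as np

    def polynomial_kernel(xi, xj, c=1.0, d=2):
        # (<xi, xj> + c)^d, with c >= 0 and integer degree d.
        return (np.dot(xi, xj) + c) ** d

    def rbf_kernel(xi, xj, sigma=1.0):
        # exp(-||xi - xj||^2 / (2 * sigma^2)).
        return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

    def rational_quadratic_kernel(xi, xj, c=1.0):
        # 1 - ||xi - xj||^2 / (||xi - xj||^2 + c), i.e. c / (||xi - xj||^2 + c).
        sq = np.sum((xi - xj) ** 2)
        return 1.0 - sq / (sq + c)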


8.4.4 Practical Aspects

The kernel trick offers computational efficiency — the kernel can be computed directly without constructing \(\phi(\mathbf{x})\). For example,

  • The polynomial kernel avoids explicit polynomial expansion.
  • The Gaussian RBF kernel corresponds to an infinite-dimensional feature space.

The kernel and its parameters (e.g., degree \(d\), bandwidth \(\sigma\)) are typically selected via nested cross-validation, as sketched below. Kernels are not restricted to vector data: they can be defined over strings, graphs, sets, distributions, and other structured objects.
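
A minimal sketch of this selection procedure, assuming scikit-learn is available and using a synthetic dataset purely for illustration: an inner grid search tunes \(C\) and the RBF parameter, while an outer cross-validation loop estimates the generalization error of the whole procedure.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.svm import SVC

    # Synthetic data standing in for a real problem.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # Inner loop: grid search over the penalty C and the RBF parameter gamma,
    # where gamma = 1 / (2 * sigma^2) in terms of the bandwidth sigma above.
    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
    inner = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)

    # Outer loop: nested cross-validation score of the tuned model.
    outer_scores = cross_val_score(inner, X, y, cv=5)
    print(outer_scores.mean())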


8.4.5 Terminology Note

The term “kernel” can have different meanings:

  1. In this context: kernel functions defining an RKHS (machine learning).
  2. In linear algebra: the null space of a matrix.
  3. In statistics: smoothing kernels in kernel density estimation.

Exercises

Put some exercises here.