5.5 Useful Identities for Computing Gradients

This section lists gradient identities that are frequently used in machine learning (based on Petersen and Pedersen, 2012). Throughout, tr(·) denotes the trace, det(·) the determinant, and f(X)^{-1} the inverse of f(X), assuming it exists.

Theorem 5.3

Transpose Rule \[ \frac{\partial \mathbf{f}(\mathbf{X})^{\top}}{\partial \mathbf{X}} = \left(\frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right)^{\top} \]
Trace Rule \[ \frac{\partial \, \text{tr}(\mathbf{f}(\mathbf{X}))}{\partial \mathbf{X}} = \text{tr}\left(\frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right) \]
Determinant Rule \[ \frac{\partial \, \text{det}(\mathbf{f}(\mathbf{X}))}{\partial \mathbf{X}} = \text{det}(\mathbf{f}(\mathbf{X})) \, \text{tr}\left(\mathbf{f}(\mathbf{X})^{-1} \frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right) \]
Inverse Rule \[ \frac{\partial \mathbf{f}(\mathbf{X})^{-1}}{\partial \mathbf{X}} = -\mathbf{f}(\mathbf{X})^{-1} \frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}} \mathbf{f}(\mathbf{X})^{-1} \]
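
As a sanity check on the determinant and inverse rules, the following sketch specializes to \(\mathbf{f}(\mathbf{X}) = \mathbf{X}\) with a 2 × 2 matrix and compares each rule against a central finite difference, one entry \(X_{ij}\) at a time. In that special case \(\partial \mathbf{X} / \partial X_{ij}\) is the single-entry matrix \(\mathbf{E}_{ij}\), so the determinant rule reads \(\text{det}(\mathbf{X}) \, \text{tr}(\mathbf{X}^{-1} \mathbf{E}_{ij})\). The helper names (`det2`, `inv2`, `check_rules`), the test matrix, and the step size `h` are illustrative assumptions, not part of the original text.

```python
# Finite-difference check of the determinant and inverse rules for f(X) = X.
# All names and numbers here are illustrative assumptions (2x2 case only).

def det2(M):
    """Determinant of a 2x2 matrix given as a list of rows."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def inv2(M):
    """Inverse of a 2x2 matrix (assumes det2(M) != 0)."""
    d = det2(M)
    return [[M[1][1] / d, -M[0][1] / d],
            [-M[1][0] / d, M[0][0] / d]]

def matmul(A, B):
    """Matrix product of two list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

def perturbed(M, i, j, h):
    """Copy of M with h added to entry (i, j)."""
    P = [row[:] for row in M]
    P[i][j] += h
    return P

def single_entry(i, j):
    """E_ij: the derivative of X with respect to its entry X_ij."""
    E = [[0.0, 0.0], [0.0, 0.0]]
    E[i][j] = 1.0
    return E

def check_rules(X, h=1e-6):
    """Largest gap between each rule and a central difference, over all (i, j)."""
    Xinv = inv2(X)
    det_err = inv_err = 0.0
    for i in range(2):
        for j in range(2):
            E = single_entry(i, j)
            Xp, Xm = perturbed(X, i, j, h), perturbed(X, i, j, -h)

            # Determinant rule: det(X) * tr(X^{-1} dX/dX_ij)
            rule_det = det2(X) * trace(matmul(Xinv, E))
            num_det = (det2(Xp) - det2(Xm)) / (2 * h)
            det_err = max(det_err, abs(rule_det - num_det))

            # Inverse rule: -X^{-1} (dX/dX_ij) X^{-1}
            rule_inv = [[-v for v in row]
                        for row in matmul(matmul(Xinv, E), Xinv)]
            Ip, Im = inv2(Xp), inv2(Xm)
            for r in range(2):
                for c in range(2):
                    num = (Ip[r][c] - Im[r][c]) / (2 * h)
                    inv_err = max(inv_err, abs(rule_inv[r][c] - num))
    return det_err, inv_err

X = [[2.0, 1.0], [0.5, 3.0]]
det_err, inv_err = check_rules(X)
print(det_err, inv_err)
```

Both gaps should sit at the level of floating-point rounding; a large gap would point to a sign or ordering mistake when applying a rule.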
Linear and Quadratic Form Rules
- For vectors \(\mathbf{a}, \mathbf{b}\) and invertible matrix \(\mathbf{X}\): \[ \frac{\partial (\mathbf{a}^{\top} \mathbf{X}^{-1} \mathbf{b})}{\partial \mathbf{X}} = - (\mathbf{X}^{-1})^{\top} \mathbf{a} \mathbf{b}^{\top} (\mathbf{X}^{-1})^{\top} \]
- For vector \(\mathbf{x}\) and constant vector \(\mathbf{a}\): \[ \frac{\partial (\mathbf{x}^{\top} \mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}^{\top} \quad \text{and} \quad \frac{\partial (\mathbf{a}^{\top} \mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}^{\top} \]
- For \(\mathbf{a}^{\top} \mathbf{X} \mathbf{b}\): \[ \frac{\partial (\mathbf{a}^{\top} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{a} \mathbf{b}^{\top} \]
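- For any square matrix \(\mathbf{B}\) (symmetry is not required): \[ \frac{\partial (\mathbf{x}^{\top} \mathbf{B} \mathbf{x})}{\partial \mathbf{x}} = \mathbf{x}^{\top} (\mathbf{B} + \mathbf{B}^{\top}) \] which reduces to \(2 \mathbf{x}^{\top} \mathbf{B}\) when \(\mathbf{B}\) is symmetric.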
- For symmetric \(\mathbf{W}\): \[ \frac{\partial (\mathbf{x} - \mathbf{A}\mathbf{s})^{\top} \mathbf{W} (\mathbf{x} - \mathbf{A}\mathbf{s})}{\partial \mathbf{s}} = -2 (\mathbf{x} - \mathbf{A}\mathbf{s})^{\top} \mathbf{W} \mathbf{A} \]
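
The last two rules in the list can be checked numerically in the same spirit. The sketch below compares \(\mathbf{x}^{\top} (\mathbf{B} + \mathbf{B}^{\top})\) and \(-2 (\mathbf{x} - \mathbf{A}\mathbf{s})^{\top} \mathbf{W} \mathbf{A}\) against central finite differences, using a deliberately non-symmetric \(\mathbf{B}\) and a symmetric \(\mathbf{W}\). The helper names (`numeric_grad`, `max_gap`) and all numerical values are assumptions chosen for illustration.

```python
# Finite-difference check of two quadratic-form gradients.
# All helper names and numbers are illustrative assumptions.

def matvec(M, v):
    """Matrix-vector product M v for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def numeric_grad(f, v, h=1e-6):
    """Central-difference gradient of a scalar function f at the point v."""
    grad = []
    for k in range(len(v)):
        vp, vm = v[:], v[:]
        vp[k] += h
        vm[k] -= h
        grad.append((f(vp) - f(vm)) / (2 * h))
    return grad

def max_gap(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

# Rule: d(x^T B x)/dx = x^T (B + B^T), with B deliberately non-symmetric.
B = [[1.0, 2.0], [-3.0, 4.0]]
x0 = [0.5, -1.5]

def quad(v):
    return dot(v, matvec(B, v))

B_plus_Bt = [[B[i][k] + B[k][i] for k in range(2)] for i in range(2)]
# Entries of the row vector x^T (B + B^T), computed as (B + B^T)^T x.
analytic_quad = matvec(transpose(B_plus_Bt), x0)
quad_gap = max_gap(analytic_quad, numeric_grad(quad, x0))

# Rule: d/ds (x - A s)^T W (x - A s) = -2 (x - A s)^T W A, with W symmetric.
x = [1.0, 2.0, 3.0]
A = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
W = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.25], [0.0, 0.25, 3.0]]
s0 = [0.3, -0.7]

def residual(s):
    return [xi - ai for xi, ai in zip(x, matvec(A, s))]

def loss(s):
    r = residual(s)
    return dot(r, matvec(W, r))

r0 = residual(s0)
# Entries of the row vector r^T W A, computed as A^T (W^T r); here W^T = W.
analytic_ls = [-2.0 * g
               for g in matvec(transpose(A), matvec(transpose(W), r0))]
ls_gap = max_gap(analytic_ls, numeric_grad(loss, s0))

print(quad_gap, ls_gap)
```

Because both objectives are quadratic, the central difference has no truncation error, so any remaining gap is rounding noise. Replacing \(\mathbf{B} + \mathbf{B}^{\top}\) with \(2\mathbf{B}\) in the first check makes the gap large for this non-symmetric \(\mathbf{B}\), which illustrates why the general form is stated.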

Exercises

Exercise 5.25 Prove each of the identities in Theorem 5.3 for small matrices, for example by writing out the entries in the 2 × 2 case.