5.5 Useful Identities for Computing Gradients
This section lists gradient identities that are frequently needed in machine learning (see Petersen and Pedersen, 2012). Throughout, tr(·) denotes the trace, det(·) the determinant, and f(X)^{-1} the inverse of f(X), assuming it exists.
Theorem 5.3
Transpose Rule \[ \frac{\partial \mathbf{f}(\mathbf{X})^{\top}}{\partial \mathbf{X}} = \left(\frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right)^{\top} \]
Trace Rule \[ \frac{\partial \, \text{tr}(\mathbf{f}(\mathbf{X}))}{\partial \mathbf{X}} = \text{tr}\left(\frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right) \]
Determinant Rule \[ \frac{\partial \, \text{det}(\mathbf{f}(\mathbf{X}))}{\partial \mathbf{X}} = \text{det}(\mathbf{f}(\mathbf{X})) \, \text{tr}\left(\mathbf{f}(\mathbf{X})^{-1} \frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right) \]
Inverse Rule \[ \frac{\partial \mathbf{f}(\mathbf{X})^{-1}}{\partial \mathbf{X}} = -\mathbf{f}(\mathbf{X})^{-1} \frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}} \mathbf{f}(\mathbf{X})^{-1} \]
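The Determinant Rule can be sanity-checked numerically. The sketch below is only an illustration, not part of the original text: it assumes the simplest case \(\mathbf{f}(\mathbf{X}) = \mathbf{X}\) with a \(2 \times 2\) matrix, where the derivative of \(\mathbf{X}\) with respect to the entry \(X_{ij}\) is the single-entry matrix \(\mathbf{E}_{ij}\), so the rule reduces entrywise to \(\text{det}(\mathbf{X}) \, (\mathbf{X}^{-1})_{ji}\). The helper names (`det2`, `inv2`, `perturbed`) and the test matrix are hypothetical choices made for this example.

```python
def det2(X):
    """Determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = X
    return a * d - b * c


def inv2(X):
    """Inverse of a 2x2 matrix via the adjugate formula (assumes det != 0)."""
    (a, b), (c, d) = X
    det = det2(X)
    return [[d / det, -b / det], [-c / det, a / det]]


def perturbed(X, i, j, eps):
    """Copy of X with eps added to entry (i, j)."""
    Y = [row[:] for row in X]
    Y[i][j] += eps
    return Y


def det_gradient_fd(X, eps=1e-6):
    """Central finite differences of det(X) with respect to each entry."""
    return [
        [
            (det2(perturbed(X, i, j, eps)) - det2(perturbed(X, i, j, -eps)))
            / (2 * eps)
            for j in range(2)
        ]
        for i in range(2)
    ]


def det_gradient_rule(X):
    """Determinant Rule with f(X) = X: entry (i, j) is det(X) * (X^{-1})[j][i]."""
    X_inv = inv2(X)
    d = det2(X)
    return [[d * X_inv[j][i] for j in range(2)] for i in range(2)]


# Arbitrary invertible test matrix (an assumption of this example).
X = [[2.0, 1.0], [0.5, 3.0]]
numeric = det_gradient_fd(X)
analytic = det_gradient_rule(X)
max_error = max(
    abs(numeric[i][j] - analytic[i][j]) for i in range(2) for j in range(2)
)
print(max_error)  # should be close to zero, up to floating-point rounding
```

Because the determinant is linear in each individual entry, the central difference agrees with the analytic value up to rounding, which makes this a convenient first check before attempting a proof.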
Linear and Quadratic Form Rules
For vectors \(\mathbf{a}, \mathbf{b}\) and invertible matrix \(\mathbf{X}\): \[ \frac{\partial (\mathbf{a}^{\top} \mathbf{X}^{-1} \mathbf{b})}{\partial \mathbf{X}} = - (\mathbf{X}^{-1})^{\top} \mathbf{a} \mathbf{b}^{\top} (\mathbf{X}^{-1})^{\top} \]
For vector \(\mathbf{x}\) and constant vector \(\mathbf{a}\): \[ \frac{\partial (\mathbf{x}^{\top} \mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}^{\top} \quad \text{and} \quad \frac{\partial (\mathbf{a}^{\top} \mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}^{\top} \]
For \(\mathbf{a}^{\top} \mathbf{X} \mathbf{b}\): \[ \frac{\partial (\mathbf{a}^{\top} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{a} \mathbf{b}^{\top} \]
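For a square matrix \(\mathbf{B}\) (symmetry is not required): \[ \frac{\partial (\mathbf{x}^{\top} \mathbf{B} \mathbf{x})}{\partial \mathbf{x}} = \mathbf{x}^{\top} (\mathbf{B} + \mathbf{B}^{\top}) \] If \(\mathbf{B}\) is symmetric, this simplifies to \(2 \mathbf{x}^{\top} \mathbf{B}\).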
For symmetric \(\mathbf{W}\): \[ \frac{\partial (\mathbf{x} - \mathbf{A}\mathbf{s})^{\top} \mathbf{W} (\mathbf{x} - \mathbf{A}\mathbf{s})}{\partial \mathbf{s}} = -2 (\mathbf{x} - \mathbf{A}\mathbf{s})^{\top} \mathbf{W} \mathbf{A} \]
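The last two identities can be checked with finite differences. The following is a minimal sketch under stated assumptions: all numerical values, helper names (`matvec`, `grad_fd`, and so on), and the name `x_obs` for the vector \(\mathbf{x}\) in the weighted least-squares expression are hypothetical and chosen only for illustration. The matrix `B` is deliberately non-symmetric, which shows that the expression \(\mathbf{x}^{\top}(\mathbf{B} + \mathbf{B}^{\top})\) holds as written without symmetry, whereas `W` is symmetric as the last rule requires.

```python
def matvec(M, v):
    """Matrix-vector product for a nested-list matrix M and a list v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def dot(u, v):
    """Inner product of two lists."""
    return sum(a * b for a, b in zip(u, v))


def transpose(M):
    """Transpose of a nested-list matrix."""
    return [list(col) for col in zip(*M)]


def grad_fd(f, v, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at v."""
    grad = []
    for i in range(len(v)):
        plus, minus = v[:], v[:]
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


def max_abs_diff(u, v):
    """Largest absolute entrywise difference between two lists."""
    return max(abs(a - b) for a, b in zip(u, v))


# Quadratic form x^T B x with a non-symmetric B.
B = [[1.0, 2.0], [0.0, 3.0]]
x = [0.5, -1.0]


def quadratic_form(v):
    return dot(v, matvec(B, v))


# The row vector x^T (B + B^T) has the same entries as B x + B^T x.
rule_quadratic = [p + q for p, q in zip(matvec(B, x), matvec(transpose(B), x))]
error_quadratic = max_abs_diff(grad_fd(quadratic_form, x), rule_quadratic)

# Weighted least squares (x - A s)^T W (x - A s) with symmetric W.
x_obs = [1.0, 2.0, 0.0]
A = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
W = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 3.0]]
s = [0.3, -0.2]


def residual(s_vec):
    return [xi - yi for xi, yi in zip(x_obs, matvec(A, s_vec))]


def weighted_loss(s_vec):
    r = residual(s_vec)
    return dot(r, matvec(W, r))


# The row vector -2 (x - A s)^T W A has the entries of -2 A^T W (x - A s),
# using the symmetry of W.
rule_weighted = [-2.0 * g for g in matvec(transpose(A), matvec(W, residual(s)))]
error_weighted = max_abs_diff(grad_fd(weighted_loss, s), rule_weighted)

print(error_quadratic, error_weighted)  # both should be close to zero
```

Both functions are quadratic in their argument, so central differences reproduce the gradient up to rounding error. A check of this kind cannot replace a proof, but it catches sign and transpose mistakes quickly.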
Exercises
Exercise 5.25 Prove each of the following identities for small matrices.
Transpose Rule \[ \frac{\partial \mathbf{f}(\mathbf{X})^{\top}}{\partial \mathbf{X}} = \left(\frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right)^{\top} \]
Trace Rule \[ \frac{\partial \, \text{tr}(\mathbf{f}(\mathbf{X}))}{\partial \mathbf{X}} = \text{tr}\left(\frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right) \]
Determinant Rule \[ \frac{\partial \, \text{det}(\mathbf{f}(\mathbf{X}))}{\partial \mathbf{X}} = \text{det}(\mathbf{f}(\mathbf{X})) \, \text{tr}\left(\mathbf{f}(\mathbf{X})^{-1} \frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}}\right) \]
Inverse Rule \[ \frac{\partial \mathbf{f}(\mathbf{X})^{-1}}{\partial \mathbf{X}} = -\mathbf{f}(\mathbf{X})^{-1} \frac{\partial \mathbf{f}(\mathbf{X})}{\partial \mathbf{X}} \mathbf{f}(\mathbf{X})^{-1} \]
Linear and Quadratic Form Rules
For vectors \(\mathbf{a}, \mathbf{b}\) and invertible matrix \(\mathbf{X}\): \[ \frac{\partial (\mathbf{a}^{\top} \mathbf{X}^{-1} \mathbf{b})}{\partial \mathbf{X}} = - (\mathbf{X}^{-1})^{\top} \mathbf{a} \mathbf{b}^{\top} (\mathbf{X}^{-1})^{\top} \]
For vector \(\mathbf{x}\) and constant vector \(\mathbf{a}\): \[ \frac{\partial (\mathbf{x}^{\top} \mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}^{\top} \quad \text{and} \quad \frac{\partial (\mathbf{a}^{\top} \mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}^{\top} \]
For \(\mathbf{a}^{\top} \mathbf{X} \mathbf{b}\): \[ \frac{\partial (\mathbf{a}^{\top} \mathbf{X} \mathbf{b})}{\partial \mathbf{X}} = \mathbf{a} \mathbf{b}^{\top} \]
For a square matrix \(\mathbf{B}\) (symmetry is not required): \[ \frac{\partial (\mathbf{x}^{\top} \mathbf{B} \mathbf{x})}{\partial \mathbf{x}} = \mathbf{x}^{\top} (\mathbf{B} + \mathbf{B}^{\top}) \]
For symmetric \(\mathbf{W}\): \[ \frac{\partial (\mathbf{x} - \mathbf{A}\mathbf{s})^{\top} \mathbf{W} (\mathbf{x} - \mathbf{A}\mathbf{s})}{\partial \mathbf{s}} = -2 (\mathbf{x} - \mathbf{A}\mathbf{s})^{\top} \mathbf{W} \mathbf{A} \]
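As a possible starting point for the Inverse Rule, the following sketch works with the derivative with respect to a single entry \(X_{ij}\) and assumes that \(\mathbf{f}(\mathbf{X})\) is differentiable and invertible near \(\mathbf{X}\). It is one route among several, not necessarily the intended solution. Differentiating the identity \(\mathbf{f}(\mathbf{X}) \, \mathbf{f}(\mathbf{X})^{-1} = \mathbf{I}\) with the product rule gives \[ \frac{\partial \mathbf{f}(\mathbf{X})}{\partial X_{ij}} \, \mathbf{f}(\mathbf{X})^{-1} + \mathbf{f}(\mathbf{X}) \, \frac{\partial \mathbf{f}(\mathbf{X})^{-1}}{\partial X_{ij}} = \mathbf{0} \] because the identity matrix is constant. Multiplying from the left by \(\mathbf{f}(\mathbf{X})^{-1}\) and rearranging yields \[ \frac{\partial \mathbf{f}(\mathbf{X})^{-1}}{\partial X_{ij}} = -\mathbf{f}(\mathbf{X})^{-1} \, \frac{\partial \mathbf{f}(\mathbf{X})}{\partial X_{ij}} \, \mathbf{f}(\mathbf{X})^{-1} \] which is the Inverse Rule applied entry by entry.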