5.4 Gradients of Matrices

In many machine learning applications, we need to compute gradients of matrices with respect to vectors or other matrices. The result is typically a multidimensional tensor, which can be viewed as an array of partial derivatives.


5.4.1 Gradients as Tensors

Definition 5.10 If a matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\) depends on a matrix \(\mathbf{B} \in \mathbb{R}^{p \times q}\), then the Jacobian of \(\mathbf{A}\) with respect to \(\mathbf{B}\) is a four-dimensional tensor of shape: \[ \mathbf{J} \in \mathbb{R}^{(m \times n) \times (p \times q)} \] with entries: \[ \mathbf{J}_{ijkl} = \frac{\partial \mathbf{A}_{ij}}{\partial \mathbf{B}_{kl}} \]

Because \(\mathbb{R}^{m \times n}\) is an \(mn\)-dimensional vector space, there exists a vector-space isomorphism (a linear, invertible map) between \(\mathbb{R}^{m \times n}\) and \(\mathbb{R}^{mn}\). This means matrices can be flattened (by stacking columns) into vectors, allowing gradients to be represented as matrices instead of tensors.

Working with flattened matrices simplifies computation: the chain rule becomes standard matrix multiplication, whereas tensor gradients require summing across multiple dimensions.
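
As a concrete illustration of this flattening, here is a minimal NumPy sketch (the dimensions and the randomly generated stand-in tensor are assumptions made only for this example). It reshapes a four-dimensional gradient tensor with entries \(\mathbf{J}_{ijkl} = \partial \mathbf{A}_{ij} / \partial \mathbf{B}_{kl}\) into an \(mn \times pq\) Jacobian matrix; since column-stacking corresponds to Fortran (column-major) ordering, `order="F"` indexes rows and columns by \(\operatorname{vec}(\mathbf{A})\) and \(\operatorname{vec}(\mathbf{B})\).

```python
import numpy as np

# Stand-in gradient tensor J[i, j, k, l] = dA_ij / dB_kl (random values, shape illustration only).
m, n, p, q = 2, 3, 3, 2
rng = np.random.default_rng(0)
J = rng.standard_normal((m, n, p, q))

# vec() stacks columns, which is Fortran (column-major) ordering in NumPy,
# so order="F" turns the 4D tensor into the (mn) x (pq) Jacobian matrix.
J_mat = J.reshape(m * n, p * q, order="F")
print(J_mat.shape)  # (6, 6)

# The matrix entry at (i + j*m, k + l*p) recovers the tensor entry J[i, j, k, l].
i, j, k, l = 1, 2, 0, 1
print(np.isclose(J_mat[i + j * m, k + l * p], J[i, j, k, l]))  # True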

Example 5.14 Given: \[ \mathbf{f} = \mathbf{A} \mathbf{x}, \quad \mathbf{f} \in \mathbb{R}^M, \; \mathbf{A} \in \mathbb{R}^{M \times N}, \; \mathbf{x} \in \mathbb{R}^N \] we seek \(\frac{d\mathbf{f}}{d\mathbf{A}}\).

  • The gradient has dimension: \[ \frac{d \mathbf{f}}{d\mathbf{A}} \in \mathbb{R}^{M \times (M \times N)} \]

  • For each component: \[ f_i = \sum_{j=1}^{N} \mathbf{A}_{ij} x_j \Rightarrow \frac{\partial f_i}{\partial \mathbf{A}_{iq}} = x_q \]

  • The partial derivatives with respect to each row of \(\mathbf{A}\) are: \[ \frac{\partial f_i}{\partial \mathbf{A}_{i,:}} = \mathbf{x}^\top, \quad \frac{\partial f_i}{\partial \mathbf{A}_{k \neq i,:}} = \mathbf{0}^\top \]

  • Stacking these gives the full gradient tensor, with \(\mathbf{x}^\top\) in the \(i\)-th row and zeros elsewhere: \[ \frac{\partial f_i}{\partial \mathbf{A}} = \begin{bmatrix} \mathbf{0}^\top \\ \vdots \\ \mathbf{0}^\top \\ \mathbf{x}^\top \\ \mathbf{0}^\top \\ \vdots \\ \mathbf{0}^\top \end{bmatrix} \in \mathbb{R}^{1 \times (M \times N)} \]
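
The structure derived in Example 5.14 can be checked numerically. The following sketch (with randomly generated test values for \(\mathbf{A}\) and \(\mathbf{x}\), an assumption made only for illustration) builds the gradient tensor from the closed-form expression and compares it against finite differences.

```python
import numpy as np

M, N = 3, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

# Closed-form tensor from the derivation: df_i/dA_kq = x_q if k == i, else 0.
grad = np.zeros((M, M, N))
for i in range(M):
    grad[i, i, :] = x  # the i-th slice is zero except for x^T in its i-th row

# Finite-difference approximation of df_i/dA_kq.
eps = 1e-6
fd = np.zeros_like(grad)
for k in range(M):
    for q in range(N):
        A_pert = A.copy()
        A_pert[k, q] += eps
        fd[:, k, q] = (A_pert @ x - A @ x) / eps

# f is linear in A, so the two tensors agree up to floating-point round-off.
print(np.max(np.abs(grad - fd)))
```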

Example 5.15 Let: \[ \mathbf{f}(\mathbf{R}) = \mathbf{R}^\top \mathbf{R} = \mathbf{K}, \quad \mathbf{R} \in \mathbb{R}^{M \times N}, \; \mathbf{K} \in \mathbb{R}^{N \times N} \] We want \(\frac{d\mathbf{K}}{d\mathbf{R}}\).

  • The gradient tensor has dimensions: \[ \frac{d\mathbf{K}}{d\mathbf{R}} \in \mathbb{R}^{(N \times N) \times (M \times N)} \]

  • Each entry of \(\mathbf{K}\) is the inner product of two columns of \(\mathbf{R}\): \[ K_{pq} = \mathbf{r}_p^\top \mathbf{r}_q = \sum_{m=1}^{M} R_{mp} R_{mq}, \] where \(\mathbf{r}_p\) denotes the \(p\)-th column of \(\mathbf{R}\).

  • Differentiating gives: \[ \frac{\partial K_{pq}}{\partial R_{ij}} = \begin{cases} R_{iq}, & \text{if } j = p, \; p \neq q \\ R_{ip}, & \text{if } j = q, \; p \neq q \\ 2R_{iq}, & \text{if } j = p, \; p = q \\ 0, & \text{otherwise} \end{cases} \]

  • Therefore, every element of the gradient tensor \(\frac{d\mathbf{K}}{d\mathbf{R}}\) is determined by how each matrix element \(R_{ij}\) contributes to every element \(K_{pq}\) of \(\mathbf{K}\).
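
As with the previous example, the case analysis can be checked numerically. The sketch below (again with a randomly generated test matrix \(\mathbf{R}\), an assumption made only for illustration) fills the gradient tensor using \(\partial K_{pq}/\partial R_{ij} = R_{iq}\,\delta_{jp} + R_{ip}\,\delta_{jq}\), which reproduces all four cases, and compares it against finite differences.

```python
import numpy as np

M, N = 4, 3
rng = np.random.default_rng(2)
R = rng.standard_normal((M, N))
K = R.T @ R

# Closed-form tensor: dK_pq/dR_ij = R_iq if j == p, plus R_ip if j == q;
# when p == q the two contributions add up to 2 R_ip, matching the case analysis.
grad = np.zeros((N, N, M, N))
for p in range(N):
    for q in range(N):
        for i in range(M):
            grad[p, q, i, p] += R[i, q]
            grad[p, q, i, q] += R[i, p]

# Finite-difference approximation of dK_pq/dR_ij.
eps = 1e-6
fd = np.zeros_like(grad)
for i in range(M):
    for j in range(N):
        R_pert = R.copy()
        R_pert[i, j] += eps
        fd[:, :, i, j] = (R_pert.T @ R_pert - K) / eps

# Agreement up to the O(eps) truncation error of the forward difference.
print(np.max(np.abs(grad - fd)))
```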


Exercises

Exercise 5.24 The following rules describe matrix derivatives in a general setting. Each rule can first be checked on small examples (e.g., \(2\times2\) or \(3\times3\) matrices) and then proved in general by differentiating entrywise. Prove each result below. Throughout, let:

  • \(\mathbf{A}(t), \mathbf{B}(t)\) be matrix-valued functions of compatible sizes
  • \(c \in \mathbb{R}\) be a constant
  • \(\mathbf{x}(t)\) be a vector-valued function
  1. Entrywise Differentiation: Matrix derivatives are defined componentwise: \[ \frac{d}{dt}\mathbf{A}(t) = \left[ \frac{d}{dt} a_{ij}(t) \right]. \] This rule justifies all others.

  2. Constant Matrix Rule: If \(\mathbf{A}\) is constant (independent of \(t\)), then \[ \frac{d}{dt}\mathbf{A} = \mathbf{0}. \]

  3. Constant Multiple Rule: \[ \frac{d}{dt}[c\mathbf{A}(t)] = c\,\mathbf{A}'(t). \]

  4. Sum Rule: If \(\mathbf{A}(t)\) and \(\mathbf{B}(t)\) have the same dimensions, \[ \frac{d}{dt}[\mathbf{A}(t) + \mathbf{B}(t)] = \mathbf{A}'(t) + \mathbf{B}'(t). \]

  5. Matrix–Vector Product Rule: If \(\mathbf{A}(t) \in \mathbb{R}^{m\times n}\) and \(\mathbf{x}(t) \in \mathbb{R}^n\), \[ \frac{d}{dt}[\mathbf{A}(t)\mathbf{x}(t)] = \mathbf{A}'(t)\mathbf{x}(t) + \mathbf{A}(t)\mathbf{x}'(t). \] This is the matrix analogue of the product rule.

  6. Scalar–Matrix Product Rule: If \(f(t) \in \mathbb{R}\), \[ \frac{d}{dt}[f(t)\mathbf{A}(t)] = f'(t)\mathbf{A}(t) + f(t)\mathbf{A}'(t). \]

  7. Matrix–Matrix Product Rule: If \(\mathbf{A}(t)\mathbf{B}(t)\) is defined, \[ \frac{d}{dt}[\mathbf{A}(t)\mathbf{B}(t)] = \mathbf{A}'(t)\mathbf{B}(t) + \mathbf{A}(t)\mathbf{B}'(t). \]

  8. Transpose Rule: \[ \frac{d}{dt}[\mathbf{A}(t)^\top] = (\mathbf{A}'(t))^\top. \]

  9. Quadratic Form (Vector Case): If \(\mathbf{x}(t) \in \mathbb{R}^n\) and \(\mathbf{A}\) is constant, \[ \frac{d}{dt}[\mathbf{x}(t)^\top \mathbf{A} \mathbf{x}(t)] = \mathbf{x}'(t)^\top \mathbf{A} \mathbf{x}(t) + \mathbf{x}(t)^\top \mathbf{A} \mathbf{x}'(t). \] If \(\mathbf{A}\) is symmetric, this simplifies to \[ \frac{d}{dt}[\mathbf{x}(t)^\top \mathbf{A} \mathbf{x}(t)] = 2\,\mathbf{x}'(t)^\top \mathbf{A} \mathbf{x}(t). \]

  10. Inverse Matrix Rule (Advanced but Verifiable): If \(\mathbf{A}(t)\) is invertible for all \(t\), \[ \frac{d}{dt}[\mathbf{A}(t)^{-1}] = -\,\mathbf{A}(t)^{-1}\mathbf{A}'(t)\mathbf{A}(t)^{-1}. \] This follows from differentiating: \[ \mathbf{A}(t)\mathbf{A}(t)^{-1} = \mathbf{I}. \]
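
These rules are meant to be proved by entrywise differentiation, but they can also be sanity-checked numerically. The sketch below is such a check (not a proof) of rules 7 and 10, using a hypothetical smooth \(2\times2\) family \(\mathbf{A}(t)\), \(\mathbf{B}(t)\) chosen only for this illustration, with central differences approximating the derivative in \(t\).

```python
import numpy as np

# Hypothetical smooth 2x2 matrix families and their hand-computed derivatives.
def A(t):
    return np.array([[np.cos(t), t**2],
                     [np.exp(t), 1.0 + t]])

def dA(t):
    return np.array([[-np.sin(t), 2 * t],
                     [np.exp(t), 1.0]])

def B(t):
    return np.array([[t, 1.0],
                     [np.sin(t), t**3]])

def dB(t):
    return np.array([[1.0, 0.0],
                     [np.cos(t), 3 * t**2]])

t, eps = 0.7, 1e-6

# Rule 7: d/dt [A(t) B(t)] = A'(t) B(t) + A(t) B'(t).
fd_prod = (A(t + eps) @ B(t + eps) - A(t - eps) @ B(t - eps)) / (2 * eps)
print(np.max(np.abs(fd_prod - (dA(t) @ B(t) + A(t) @ dB(t)))))  # close to zero

# Rule 10: d/dt [A(t)^{-1}] = -A(t)^{-1} A'(t) A(t)^{-1}  (A(t) is invertible at t = 0.7).
Ainv = np.linalg.inv(A(t))
fd_inv = (np.linalg.inv(A(t + eps)) - np.linalg.inv(A(t - eps))) / (2 * eps)
print(np.max(np.abs(fd_inv + Ainv @ dA(t) @ Ainv)))  # close to zero
```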