2.8 Gaussian Distribution
The Gaussian (or normal) distribution is one of the most fundamental probability distributions for continuous-valued random variables. Its importance arises from its computational convenience and its natural appearance in many real-world and theoretical contexts, most notably via the Central Limit Theorem, which states that the suitably normalized sum of many independent and identically distributed random variables converges to a Gaussian distribution.
The Gaussian distribution plays a central role in machine learning: it serves as the likelihood and the prior in linear regression, as the component density in mixture models for density estimation, and as the foundation of Gaussian processes.
Definition 2.36 For a scalar random variable \(x\), the Gaussian (normal) probability density function (pdf) is defined as \[ p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \] where:
- \(\mu\) is the mean (location parameter),
- \(\sigma^2\) is the variance (spread or scale parameter).
Example 2.59 Consider a random variable \(X\) representing the heights (in cm) of a population of adults, modeled as: \[ X \sim \mathcal{N}(\mu = 170, \sigma^2 = 25) \] so that the mean height is 170 cm and the standard deviation is \(\sigma = 5\) cm.
The probability density function is: \[ p(x \mid 170, 25) = \frac{1}{\sqrt{2\pi \cdot 25}} \exp\left( -\frac{(x - 170)^2}{2 \cdot 25} \right) = \frac{1}{5\sqrt{2\pi}} \exp\left( -\frac{(x - 170)^2}{50} \right). \]
We evaluate the density at \(x = 175\) cm:
\[ p(175) = \frac{1}{5\sqrt{2\pi}} \exp\left( -\frac{(175 - 170)^2}{50} \right) = \frac{1}{5\sqrt{2\pi}} \exp\left( -\frac{25}{50} \right) = \frac{1}{5\sqrt{2\pi}} \exp(-0.5) \approx 0.048. \]
This tells us that a height of 175 cm has a density of approximately 0.048 under this normal distribution. This value is simply the height of the curve; it is not the probability of someone having a height of exactly 175 cm.
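As a sanity check, the same number can be computed in a few lines of Python (a minimal sketch, assuming SciPy is available; note that `scipy.stats.norm` is parameterized by the standard deviation, not the variance):

```python
# Density of N(170, 25) at x = 175, as in Example 2.59.
from scipy.stats import norm

mu, sigma = 170.0, 5.0                    # mean and standard deviation
density = norm.pdf(175.0, loc=mu, scale=sigma)
print(density)                            # ~0.0484: curve height, not a probability
```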
Definition 2.37 For a random vector \(\mathbf{x} \in \mathbb{R}^D\), the multivariate Gaussian distribution is given by: \[ p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-D/2} |\boldsymbol{\Sigma}|^{-1/2} \exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right), \] where:
- \(\boldsymbol{\mu}\) is the mean vector , and
- \(\boldsymbol{\Sigma}\) is the covariance matrix.
Example 2.60 Consider a 2-dimensional random vector
\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \] representing the height (in cm) and weight (in kg) of a person. Suppose the data is modeled as a 2D Gaussian with \[ \boldsymbol{\mu} = \begin{bmatrix} 170 \\ 65 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} 25 & 10 \\ 10 & 16 \end{bmatrix}. \] Then the probability density function is: \[ p(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{2 \pi \sqrt{|\boldsymbol{\Sigma}|}} \exp\!\Bigg( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \Bigg). \]
Let \(\mathbf{x} = \begin{bmatrix} 175 \\ 70 \end{bmatrix}\). Then the difference from the mean is \[ \mathbf{x} - \boldsymbol{\mu} = \begin{bmatrix} 5 \\ 5 \end{bmatrix}. \]
The determinant is \[ |\boldsymbol{\Sigma}| = 25\cdot16 - 10^2 = 300, \] so that \[ \boldsymbol{\Sigma}^{-1} = \frac{1}{300} \begin{bmatrix} 16 & -10 \\ -10 & 25 \end{bmatrix}. \] Compute the Mahalanobis term: \[ (\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) = \frac{1}{300} \begin{bmatrix} 5 & 5 \end{bmatrix} \begin{bmatrix} 16 & -10 \\ -10 & 25 \end{bmatrix} \begin{bmatrix} 5 \\ 5 \end{bmatrix} = \frac{525}{300} = 1.75. \] Thus, the density is \[ p(\mathbf{x}) = \frac{1}{2 \pi \sqrt{300}} \exp(-1.75/2) \approx 0.0038. \]
This is the height of the Gaussian density at the vector \([175, 70]^\top\); as in the scalar case, it is not a probability.
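The calculation above can be verified numerically (a sketch, assuming NumPy and SciPy are available; the values of `mu`, `Sigma`, and `x` are taken from Example 2.60):

```python
# Mahalanobis term and density for the 2D Gaussian of Example 2.60.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([170.0, 65.0])
Sigma = np.array([[25.0, 10.0],
                  [10.0, 16.0]])
x = np.array([175.0, 70.0])

d = x - mu
maha = d @ np.linalg.solve(Sigma, d)          # (x - mu)^T Sigma^{-1} (x - mu)
print(maha)                                   # 1.75
print(multivariate_normal(mu, Sigma).pdf(x))  # ~0.0038
```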
2.8.1 Joint, Marginal, and Conditional Gaussians
Lemma 2.9 Consider a joint Gaussian over concatenated variables: \[ p(\mathbf{x}, \mathbf{y}) = \mathcal{N}\!\left( \begin{bmatrix} \boldsymbol{\mu}_x \\[4pt] \boldsymbol{\mu}_y \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{xx} & \boldsymbol{\Sigma}_{xy} \\[4pt] \boldsymbol{\Sigma}_{yx} & \boldsymbol{\Sigma}_{yy} \end{bmatrix} \right). \] Then:
- The marginals \(p(\mathbf{x})\) and \(p(\mathbf{y})\) are Gaussian.
- The conditional distribution \(p(\mathbf{x} \mid \mathbf{y})\) is also Gaussian, with mean and covariance derived as:
\[ \begin{aligned} \boldsymbol{\mu}_{x|y} &= \boldsymbol{\mu}_x + \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} (\mathbf{y} - \boldsymbol{\mu}_y), \\ \boldsymbol{\Sigma}_{x|y} &= \boldsymbol{\Sigma}_{xx} - \boldsymbol{\Sigma}_{xy} \boldsymbol{\Sigma}_{yy}^{-1} \boldsymbol{\Sigma}_{yx}. \end{aligned} \]
Example 2.61 Consider two random variables \(\mathbf{x} \in \mathbb{R}\) and \(\mathbf{y} \in \mathbb{R}\) with the joint Gaussian distribution: \[ \begin{bmatrix} \mathbf{x} \\ \mathbf{y} \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \right). \]
The marginal of \(\mathbf{x}\) is Gaussian: \[ p(\mathbf{x}) = \mathcal{N}(\mu_x, \Sigma_{xx}) = \mathcal{N}(1, 2). \]
The marginal of \(\mathbf{y}\) is Gaussian: \[ p(\mathbf{y}) = \mathcal{N}(\mu_y, \Sigma_{yy}) = \mathcal{N}(2, 3). \]
Suppose we observe \(\mathbf{y} = 3\). Then the conditional distribution \(p(\mathbf{x} \mid \mathbf{y}=3)\) is Gaussian with:
\[ \mu_{x|y} = \mu_x + \Sigma_{xy} \Sigma_{yy}^{-1} (y - \mu_y) = 1 + 1 \cdot 3^{-1} \cdot (3 - 2) = 1 + \frac{1}{3} \approx 1.333 \]
\[ \Sigma_{x|y} = \Sigma_{xx} - \Sigma_{xy} \Sigma_{yy}^{-1} \Sigma_{yx} = 2 - 1 \cdot 3^{-1} \cdot 1 = 2 - \frac{1}{3} \approx 1.667 \]
Thus, the conditional distribution is: \[ p(\mathbf{x} \mid \mathbf{y}=3) = \mathcal{N}(1.333, 1.667) \]
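The conditioning formulas of Lemma 2.9 translate directly into code (a minimal sketch for this 2D example, assuming NumPy):

```python
# Conditional Gaussian p(x | y = 3) for Example 2.61.
import numpy as np

mu = np.array([1.0, 2.0])                 # [mu_x, mu_y]
Sigma = np.array([[2.0, 1.0],
                  [1.0, 3.0]])            # [[S_xx, S_xy], [S_yx, S_yy]]
y_obs = 3.0

S_xx, S_xy, S_yx, S_yy = Sigma[0, 0], Sigma[0, 1], Sigma[1, 0], Sigma[1, 1]
mu_cond = mu[0] + S_xy / S_yy * (y_obs - mu[1])   # 1.333...
var_cond = S_xx - S_xy / S_yy * S_yx              # 1.666...
print(mu_cond, var_cond)
```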
2.8.2 Product of Gaussian Densities
Lemma 2.10 The product of two Gaussian densities is proportional to another Gaussian.
This property is essential for Bayesian inference, where the posterior distribution is obtained by multiplying the likelihood and prior, both often modeled as Gaussians.
Example 2.62 If \(\mathcal{N}(\mathbf{x}|\mathbf{a}, \mathbf{A})\) and \(\mathcal{N}(\mathbf{x}|\mathbf{b}, \mathbf{B})\) are the two Gaussians, their product is \(c\,\mathcal{N}(\mathbf{x}|\mathbf{c}, \mathbf{C})\), where
- \(\mathbf{C} = \left( \mathbf{A}^{-1} + \mathbf{B}^{-1} \right)^{-1}\)
- \(\mathbf{c} = \mathbf{C}\left( \mathbf{A}^{-1}\mathbf{a} + \mathbf{B}^{-1}\mathbf{b} \right)\)
- \(c = (2\pi)^{-D/2} |\mathbf{A} + \mathbf{B}|^{-1/2} \exp\!\left( -\frac{1}{2} (\mathbf{a} - \mathbf{b})^\top (\mathbf{A} + \mathbf{B})^{-1} (\mathbf{a} - \mathbf{b}) \right)\) is the normalizing constant, which can itself be recognized as the Gaussian density \(\mathcal{N}(\mathbf{a}|\mathbf{b}, \mathbf{A} + \mathbf{B})\).
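These formulas are easy to check pointwise. The sketch below does so in one dimension with illustrative values for \(a, A, b, B\) (the same expressions hold with vectors and matrices):

```python
# Product of two Gaussian densities: N(x|a,A) * N(x|b,B) == c * N(x|c_mean, C).
import numpy as np

def pdf(x, m, v):
    """Univariate Gaussian density with mean m and variance v."""
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

a, A = 0.0, 1.0                        # first Gaussian (mean, variance)
b, B = 2.0, 4.0                        # second Gaussian (mean, variance)

C = 1.0 / (1.0 / A + 1.0 / B)          # combined variance
c_mean = C * (a / A + b / B)           # combined mean
c_const = pdf(a, b, A + B)             # normalizing constant N(a | b, A + B)

x = 0.7                                # arbitrary test point
print(pdf(x, a, A) * pdf(x, b, B))     # ~0.05042
print(c_const * pdf(x, c_mean, C))     # same value
```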
2.8.3 Mixtures of Gaussians
A mixture of Gaussians combines multiple Gaussian components to form a more flexible distribution:
\[ p(x) = \alpha p_1(x) + (1 - \alpha)p_2(x), \] where \(0 < \alpha < 1\) is the mixture weight.
Lemma 2.11 Let
\[
p(x) = \alpha p_1(x) + (1 - \alpha)p_2(x),
\]
where \(0 < \alpha < 1\). If \(p_1(x) = \mathcal{N}(\mu_1, \sigma_1^2)\) and \(p_2(x) = \mathcal{N}(\mu_2, \sigma_2^2)\),
then:
\[
\begin{aligned}
E[x] &= \alpha \mu_1 + (1 - \alpha)\mu_2, \\
V[x] &= \alpha \sigma_1^2 + (1 - \alpha)\sigma_2^2
+ \alpha(1 - \alpha)(\mu_1 - \mu_2)^2.
\end{aligned}
\]
This expression illustrates the law of total variance: \[ \mathrm{Var}(X) = E_Y[\mathrm{Var}(X \mid Y)] + \mathrm{Var}_Y(E[X \mid Y]). \]
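A quick Monte Carlo experiment confirms these moment formulas (a sketch with illustrative parameter values, assuming NumPy):

```python
# Empirical vs. analytic mean and variance of a two-component Gaussian mixture.
import numpy as np

rng = np.random.default_rng(0)
alpha, mu1, s1, mu2, s2 = 0.3, -1.0, 1.0, 2.0, 0.5   # s1, s2 are std. deviations

n = 1_000_000
pick_first = rng.random(n) < alpha                   # component 1 with prob. alpha
x = np.where(pick_first, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))

mean_formula = alpha * mu1 + (1 - alpha) * mu2
var_formula = (alpha * s1**2 + (1 - alpha) * s2**2
               + alpha * (1 - alpha) * (mu1 - mu2) ** 2)
print(x.mean(), mean_formula)                        # both ~1.1
print(x.var(), var_formula)                          # both ~2.365
```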
2.8.4 Linear and Affine Transformations of Gaussians
If \(\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) and \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \boldsymbol{\mu}\), then \[ \mathbf{Y} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{A}\mathbf{A}^\top). \]
Hence, any linear or affine transformation of a Gaussian random variable is also Gaussian. This property is fundamental in probabilistic modeling, regression, and state estimation.
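This property gives the standard recipe for sampling from \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\): draw standard normal samples and apply the affine map with \(\mathbf{A}\) a Cholesky factor of \(\boldsymbol{\Sigma}\). A sketch, reusing the covariance from Example 2.60:

```python
# Sampling N(mu, Sigma) as an affine transform of standard normals.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([170.0, 65.0])
Sigma = np.array([[25.0, 10.0],
                  [10.0, 16.0]])

A = np.linalg.cholesky(Sigma)            # lower-triangular, A @ A.T == Sigma
z = rng.standard_normal((100_000, 2))    # rows are samples from N(0, I)
y = z @ A.T + mu                         # rows are samples from N(mu, Sigma)

print(y.mean(axis=0))                    # ~[170, 65]
print(np.cov(y, rowvar=False))           # ~Sigma
```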
Exercises
Exercise 2.96 Derive the formula for the product of Gaussians. That is, prove \[\mathcal{N}(\mathbf{x}|\mathbf{a},\mathbf{A})\,\mathcal{N}(\mathbf{x}|\mathbf{b},\mathbf{B}) = c\,\mathcal{N}(\mathbf{x}|\mathbf{c},\mathbf{C}),\] showing the values of \(c\), \(\mathbf{c}\), and \(\mathbf{C}\).
Exercise 2.97
- Select any two integers \(a\) and \(b\). What is \(\alpha a + (1-\alpha)b\), where \(0 \leq \alpha \leq 1\)?
- Select any \(a, b \in \mathbb{R}^2\). What is \(\alpha a + (1-\alpha)b\), where \(0 \leq \alpha \leq 1\)?
- Generalize the results from the previous two parts to \(a, b \in \mathbb{R}^n\). Justify your answer.
Exercise 2.98 Suppose we have a full-rank matrix \(\mathbf{A} \in \mathbb{R}^{M \times N}\), where \(M \geq N\), and \(\mathbf{y} \in \mathbb{R}^M\) is a Gaussian random variable with mean \(\mathbf{A}\mathbf{x}\), i.e., \[p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\mathbf{A}\mathbf{x}, \boldsymbol{\Sigma}).\] If \(\mathbf{A}\) is invertible (which requires \(M = N\)), find \(p(\mathbf{x})\).
Exercise 2.99 Justify the following statement for a full-rank matrix \(\mathbf{A} \in \mathbb{R}^{M \times N}\), where \(M \geq N\), \(\mathbf{y} \in \mathbb{R}^M\), and \(\mathbf{x} \in \mathbb{R}^N\): \[\mathbf{y} = \mathbf{A}\mathbf{x} \Longleftrightarrow \left(\mathbf{A}^\top\mathbf{A}\right)^{-1}\mathbf{A}^\top\mathbf{y} = \mathbf{x}.\] Be careful to justify all of your steps.