7.2 Parameter Learning via Maximum Likelihood

We are given a dataset \(\mathcal{X} = \{\textbf{x}_1, \dots, \textbf{x}_N\}\), where each data point \(\textbf{x}_n\) is drawn i.i.d. from an unknown distribution \(p(\textbf{x})\). Our goal is to approximate this distribution using a Gaussian mixture model with \(K\) components, whose parameters are collected in \[ \boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k : k = 1, \dots, K\} \] where:

  • \(\pi_k\) is the mixture weight for the \(k^{th}\) distribution,
  • \(\boldsymbol{\mu}_k\) is the mean for the \(k^{th}\) distribution, and
  • \(\boldsymbol{\Sigma}_k\) is the covariance associated with the \(k^{th}\) distribution.

7.2.1 Likelihood and Log-Likelihood

The likelihood of the dataset is: \[ p(\mathcal{X} | \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\textbf{x}_n | \boldsymbol{\theta}), \quad p(\textbf{x}_n | \boldsymbol{\theta}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\textbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \]

The log-likelihood becomes: \[ L(\boldsymbol{\theta}) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k \mathcal{N}(\textbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k). \]

Maximizing this log-likelihood yields the maximum likelihood estimate (MLE) \(\boldsymbol{\theta}_{ML}\). However, because of the log-sum term, there is no closed-form solution, so we rely on iterative optimization, leading to the Expectation-Maximization (EM) algorithm.
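Although \(L(\boldsymbol{\theta})\) has no closed-form maximizer, it is straightforward to evaluate numerically for a given parameter setting. The following sketch does so, assuming NumPy and SciPy are available; the function and variable names are illustrative choices, not part of the text.

```python
# A minimal sketch, assuming NumPy/SciPy; names such as `gmm_log_likelihood`
# are illustrative, not from the text.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """L(theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    K = len(pis)
    # log_probs[n, k] = log pi_k + log N(x_n | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pis[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
        for k in range(K)
    ])
    # The inner sum over k is done in log space (log-sum-exp) for numerical stability.
    return logsumexp(log_probs, axis=1).sum()
```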


7.2.2 Responsibilities

Definition 7.3 The responsibility \(r_{nk}\) is defined as the probability that mixture component \(k\) generated data point \(\textbf{x}_n\): \[ r_{nk} = \frac{\pi_k \mathcal{N}(\textbf{x}_n | \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(\textbf{x}_n | \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}. \]

Each row \(r_n = [r_{n1}, \dots, r_{nK}]\) of the responsibility matrix forms a probability vector, i.e., \(r_{nk} \geq 0\) and \(\sum_{k} r_{nk} = 1\).
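Computing the responsibilities constitutes the E-step of the algorithm developed in the next subsection. A minimal sketch of this computation, assuming NumPy and SciPy and using illustrative names, is:

```python
# A minimal sketch of Definition 7.3, assuming NumPy/SciPy; the name
# `responsibilities` is an illustrative choice.
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """Return the N x K matrix R with R[n, k] = r_nk."""
    K = len(pis)
    # Unnormalized terms pi_k N(x_n | mu_k, Sigma_k)
    weighted = np.column_stack([
        pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
        for k in range(K)
    ])
    # Divide by the row sums so that each row is a probability vector.
    return weighted / weighted.sum(axis=1, keepdims=True)
```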


7.2.3 Updating Parameters

The EM algorithm alternates between computing responsibilities (E-step) and updating parameters (M-step):

1. Update Means \[ \boldsymbol{\mu}_k^{new} = \frac{\sum_{n=1}^{N} r_{nk} \textbf{x}_n}{\sum_{n=1}^{N} r_{nk}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \textbf{x}_n, \] where \(N_k = \sum_{n=1}^{N} r_{nk}\) is the effective number of points assigned to component \(k\).

2. Update Covariances \[ \boldsymbol{\Sigma}_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (\textbf{x}_n - \boldsymbol{\mu}_k)(\textbf{x}_n - \boldsymbol{\mu}_k)^{\top}. \] This represents a responsibility-weighted covariance.

3. Update Mixture Weights \[ \pi_k^{new} = \frac{N_k}{N}. \] This ensures that the mixture weights sum to 1, reflecting the proportion of data explained by each component.
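The following sketch implements the three updates above as a single M-step, assuming NumPy and that the responsibility matrix \(R\) with entries \(r_{nk}\) has already been computed; the function name and the vectorized formulation are illustrative choices.

```python
# A minimal sketch of the three M-step updates, assuming NumPy; it takes the
# data X (N x D) and a responsibility matrix R (N x K).
import numpy as np

def m_step(X, R):
    N, D = X.shape
    K = R.shape[1]
    Nk = R.sum(axis=0)                     # N_k: effective number of points per component
    pis = Nk / N                           # pi_k^new = N_k / N
    mus = (R.T @ X) / Nk[:, None]          # mu_k^new = (1/N_k) sum_n r_nk x_n
    Sigmas = np.empty((K, D, D))
    for k in range(K):
        diff = X - mus[k]                  # rows are (x_n - mu_k)^T
        # Sigma_k^new = (1/N_k) sum_n r_nk (x_n - mu_k)(x_n - mu_k)^T
        Sigmas[k] = (R[:, [k]] * diff).T @ diff / Nk[k]
    return pis, mus, Sigmas
```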


7.2.4 Interpretation

At each update step, three things happen in the algorithm:

  • It re-estimates model parameters based on the soft assignment of data points (responsibilities).
  • It moves the Gaussian components toward regions of high data density.
  • It never decreases the log-likelihood, so the iterates converge to a local optimum.

Together, these iterative updates form the Expectation-Maximization (EM) algorithm for GMMs, which alternates between inferring hidden assignments (E-step) and maximizing parameters (M-step) until convergence.
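To make the alternation concrete, here is a minimal EM loop that reuses the sketches from the previous subsections (responsibilities, m_step, and gmm_log_likelihood, assumed to be in scope); the initialization and stopping rule are illustrative assumptions, not prescriptions from the text.

```python
# A minimal sketch of the full EM loop, assuming the helpers sketched above
# (`responsibilities`, `m_step`, `gmm_log_likelihood`) are in scope.
import numpy as np

def fit_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Simple (illustrative) initialization: uniform weights, K random data
    # points as means, identity covariances.
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]
    Sigmas = np.stack([np.eye(D)] * K)

    prev_ll = -np.inf
    for _ in range(n_iter):
        R = responsibilities(X, pis, mus, Sigmas)     # E-step: infer soft assignments
        pis, mus, Sigmas = m_step(X, R)               # M-step: re-estimate parameters
        ll = gmm_log_likelihood(X, pis, mus, Sigmas)  # monitor the non-decreasing log-likelihood
        if ll - prev_ll < tol:                        # stop once the improvement is negligible
            break
        prev_ll = ll
    return pis, mus, Sigmas
```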


Exercises

Exercise 7.1 If we consider a single Gaussian as the desired density (\(K = 1\)), the sum over \(k\) in \[\sum_{n=1}^N \log \sum_{k=1}^K \pi_k \mathcal{N}(\textbf{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\] collapses to a single term. What does \(\log p(\mathcal{X}|\boldsymbol{\theta})\) then equal?

Exercise 7.2 Given \(p(\mathcal{X}|\boldsymbol{\theta}) = \prod_{n=1}^N p(\textbf{x}_n|\boldsymbol{\theta})\) and \(p(\textbf{x}_n|\boldsymbol{\theta}) = \sum_{k=1}^K \pi_k \mathcal{N}(\textbf{x}_n|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\), derive the log-likelihood and the necessary conditions for a local optimum.

Exercise 7.3 Prove that the update of the mean parameters \(\boldsymbol{\mu}_k\) of the GMM is given by \[\boldsymbol{\mu}_k^{new} = \dfrac{\sum_{n=1}^N r_{nk}\textbf{x}_n}{\sum_{n=1}^N r_{nk}},\] where \(r_{nk}\) is the responsibility.

Exercise 7.4 Prove that the update of the mixture weights \(\pi_k\) of the GMM is given by \[\pi_k^{new} = \dfrac{N_k}{N},\] where \(N\) is the number of data points and \(N_k\) is the column sum of the responsibility matrix.

Exercise 7.5 Prove that the update of the covariance parameters \(\boldsymbol{\Sigma}_k\) of the GMM is given by \[\boldsymbol{\Sigma}_k^{new} = \dfrac{1}{N_k}\sum_{n=1}^N r_{nk}(\textbf{x}_n-\boldsymbol{\mu}_k)(\textbf{x}_n - \boldsymbol{\mu}_k)^{\top},\] where \(r_{nk}\) is the responsibility and \(N_k\) is the column sum of the responsibility matrix.