4.4 Probabilistic Modeling and Inference
In machine learning, we use probabilistic models to represent uncertainty in data and parameters. These models describe how observed data are generated from underlying parameters, allowing us to reason about prediction, inference, and decision-making under uncertainty.
4.4.1 Probabilistic Models
Definition 4.11 A probabilistic model is defined by the joint distribution of all random variables: \[ p(\mathbf{x}, \mathbf{\theta}) \] where
- \(\mathbf{x}\) represents the observed data, and
- \(\mathbf{\theta}\) represents the model parameters.
This joint distribution encapsulates:
- The likelihood \(p(\mathbf{x} | \mathbf{\theta})\)
- The prior \(p(\mathbf{\theta})\)
- The posterior \(p(\mathbf{\theta} | \mathbf{x})\)
- The marginal likelihood \(p(\mathbf{x}) = \int p(\mathbf{x} | \mathbf{\theta})p(\mathbf{\theta}) d\mathbf{\theta}\)
The probabilistic framework provides a consistent way to model, infer, and predict using probability theory.
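To make these four quantities concrete, here is a minimal numerical sketch for a coin-flip model, assuming a Bernoulli likelihood and a Beta(2, 2) prior; the data and grid resolution are made up for illustration, and the marginal likelihood is approximated by a simple Riemann sum.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical coin-flip data: 1 = heads, 0 = tails (made up for illustration).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Grid of candidate parameter values theta = P(heads).
theta = np.linspace(1e-3, 1 - 1e-3, 1000)
dtheta = theta[1] - theta[0]

# Likelihood p(x | theta): product of Bernoulli terms over the observations.
likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())

# Prior p(theta): Beta(2, 2), a mild preference for fair coins (assumption).
prior = beta(2, 2).pdf(theta)

# Marginal likelihood p(x) = integral of p(x | theta) p(theta) dtheta (Riemann sum).
marginal = np.sum(likelihood * prior) * dtheta

# Posterior p(theta | x) = p(x | theta) p(theta) / p(x).
posterior = likelihood * prior / marginal

print(f"marginal likelihood p(x) ~ {marginal:.5f}")
print(f"posterior mean of theta ~ {np.sum(theta * posterior) * dtheta:.3f}")
```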
4.4.2 Bayesian Inference
Bayesian inference is concerned with computing the posterior distribution of parameters given data: \[ p(\mathbf{\theta} | \mathcal{X}) = \frac{p(\mathcal{X} | \mathbf{\theta}) p(\mathbf{\theta})}{p(\mathcal{X})} \] where \[ p(\mathcal{X}) = \int p(\mathcal{X} | \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta} \] acts as a normalization constant (marginal likelihood). Bayesian inference inverts the relationship between parameters and data. Instead of finding a single “best” parameter estimate (as in MLE or MAP), it computes a distribution over parameters, capturing full uncertainty.
Predictions for a new observation are then made by marginalizing over the parameter posterior (the posterior predictive distribution): \[ p(\mathbf{x} | \mathcal{X}) = \int p(\mathbf{x} | \mathbf{\theta}) p(\mathbf{\theta} | \mathcal{X}) d\mathbf{\theta} = \mathbb{E}_{\mathbf{\theta} | \mathcal{X}}[p(\mathbf{x} | \mathbf{\theta})], \] so parameter uncertainty is carried into the prediction rather than discarded.
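For the same assumed Beta-Bernoulli setup as in the earlier sketch, posterior and posterior predictive are available in closed form, so the marginalization reduces to a one-line computation; the pseudo-counts and data are again made up.

```python
# Conjugate Beta-Bernoulli sketch (assumed prior Beta(alpha0, beta0)).
alpha0, beta0 = 2.0, 2.0            # prior pseudo-counts (assumption)
heads, tails = 6, 2                 # observed data X (made up)

# Posterior: p(theta | X) = Beta(alpha0 + heads, beta0 + tails).
alpha_n, beta_n = alpha0 + heads, beta0 + tails

# Posterior predictive: p(x_new = heads | X) = E_{theta | X}[theta].
p_next_heads = alpha_n / (alpha_n + beta_n)
print(f"P(next flip is heads | X) = {p_next_heads:.3f}")   # 0.667
```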
4.4.2.1 Comparison with Parameter Estimation
| Approach | Output | Main Computation | Example Methods |
|---|---|---|---|
| MLE / MAP | Point estimate \(\mathbf{\theta}^*\) | Optimization | Gradient Descent, Least Squares |
| Bayesian Inference | Distribution \(p(\mathbf{\theta} | \mathcal{X})\) | Integration | MCMC, Laplace, Variational Inference |
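The two rows of the table can be contrasted on a toy problem. The sketch below assumes a Gaussian likelihood with known noise variance and a conjugate Gaussian prior on the mean (all numbers invented): MLE returns a single value, while Bayesian inference returns a full Gaussian posterior whose width reports the remaining uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small dataset: N noisy observations of an unknown mean mu.
sigma = 1.0                                   # known noise std (assumption)
x = rng.normal(loc=1.5, scale=sigma, size=5)

# MLE: a single point estimate, here simply the sample mean.
mu_mle = x.mean()

# Bayesian inference with a Gaussian prior mu ~ N(m0, s0^2). The posterior is
# again Gaussian (conjugate case):
#   1 / s_n^2 = 1 / s0^2 + N / sigma^2,
#   m_n = s_n^2 * (m0 / s0^2 + sum(x) / sigma^2).
m0, s0 = 0.0, 2.0                             # prior mean and std (assumption)
s_n2 = 1.0 / (1.0 / s0**2 + len(x) / sigma**2)
m_n = s_n2 * (m0 / s0**2 + x.sum() / sigma**2)

print(f"MLE point estimate:  mu* = {mu_mle:.3f}")
print(f"Bayesian posterior:  N({m_n:.3f}, {np.sqrt(s_n2):.3f}^2)")
```

Note that the posterior has a closed form here only because the prior is conjugate to the likelihood, which leads directly to the need for the approximations discussed below.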
Bayesian methods allow:
- Incorporation of prior knowledge
- Uncertainty propagation in predictions
- Better handling of small datasets or noisy data
However, Bayesian inference often requires approximations, since the integrals needed for the marginal likelihood and for posterior predictions are rarely available in closed form. Common approximation methods include (a minimal MCMC sketch follows this list):
- Stochastic methods: Markov Chain Monte Carlo (MCMC)
- Deterministic methods: Laplace approximation, Variational Inference, Expectation Propagation
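As an illustration of the stochastic route, here is a minimal random-walk Metropolis-Hastings sketch (not a production sampler) for a made-up one-dimensional model with a Gaussian likelihood and a Laplace prior, a combination chosen because the posterior has no convenient closed form; step size, chain length, and burn-in are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data and a non-conjugate model: Gaussian likelihood (sigma = 1)
# with a Laplace(0, 1) prior, so the posterior lacks a simple closed form.
x = rng.normal(loc=2.0, scale=1.0, size=20)

def log_post(theta):
    """Unnormalized log-posterior log p(theta | x) + const."""
    log_lik = -0.5 * np.sum((x - theta) ** 2)   # Gaussian log-likelihood (up to const)
    log_prior = -np.abs(theta)                  # Laplace log-prior (up to const)
    return log_lik + log_prior

# Random-walk Metropolis-Hastings: propose theta' ~ N(theta, step^2) and
# accept with probability min(1, p(theta' | x) / p(theta | x)).
theta, step, samples = 0.0, 0.5, []
for _ in range(5000):
    proposal = theta + step * rng.normal()
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[1000:])              # discard burn-in
print(f"posterior mean ~ {samples.mean():.3f}, posterior std ~ {samples.std():.3f}")
```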
4.4.3 Latent-Variable Models
Definition 4.12 A latent-variable model introduces hidden variables \(\mathbf{z}\) that help explain the data.
These variables are not directly observed but simplify or enrich the model’s structure. The generative process is: \[ p(\mathbf{x} | \mathbf{z}, \mathbf{\theta}) \] with a prior on latent variables: \[ p(\mathbf{z}). \] To obtain the likelihood of observed data, we integrate out the latent variables: \[ p(\mathbf{x} | \mathbf{\theta}) = \int p(\mathbf{x} | \mathbf{z}, \mathbf{\theta})p(\mathbf{z})d\mathbf{z}. \] Once the likelihood is known, we can:
- Perform maximum likelihood or MAP estimation for parameters
- Conduct Bayesian inference to obtain posterior distributions
The posterior over parameters is: \[ p(\mathbf{\theta} | \mathcal{X}) = \frac{p(\mathcal{X} | \mathbf{\theta})p(\mathbf{\theta})}{p(\mathcal{X})} \] and the posterior over latent variables is: \[ p(\mathbf{z} | \mathcal{X}, \mathbf{\theta}) = \frac{p(\mathcal{X} | \mathbf{z}, \mathbf{\theta})p(\mathbf{z})}{p(\mathcal{X} | \mathbf{\theta})}. \]
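For a discrete latent variable the integral over \(\mathbf{z}\) becomes a sum, which makes both quantities easy to write down. The sketch below uses a two-component one-dimensional Gaussian mixture with fixed, made-up parameters: the mixture weights play the role of \(p(\mathbf{z})\), each component density is \(p(\mathbf{x} | \mathbf{z}, \mathbf{\theta})\), and the normalized products are the posteriors \(p(\mathbf{z} | \mathbf{x}, \mathbf{\theta})\) (often called responsibilities).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Two-component 1D Gaussian mixture with fixed (assumed) parameters theta.
weights = np.array([0.3, 0.7])        # p(z = k)
means = np.array([-2.0, 1.5])
stds = np.array([0.8, 1.2])

x = np.array([-1.9, 0.4, 2.1])        # observed data points (made up)

# log p(x_n | z = k, theta) for every data point and component.
log_px_given_z = norm.logpdf(x[:, None], loc=means, scale=stds)

# Marginal likelihood: p(x_n | theta) = sum_k p(z = k) p(x_n | z = k, theta).
log_px = logsumexp(log_px_given_z + np.log(weights), axis=1)

# Posterior over the latent variable (responsibilities):
# p(z = k | x_n, theta) = p(x_n | z = k, theta) p(z = k) / p(x_n | theta).
log_resp = log_px_given_z + np.log(weights) - log_px[:, None]

print("log p(x_n | theta):", np.round(log_px, 3))
print("responsibilities p(z | x_n, theta):\n", np.round(np.exp(log_resp), 3))
```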
4.4.4 Examples of Latent-Variable Models
- Principal Component Analysis (PCA) – dimensionality reduction
- Gaussian Mixture Models (GMMs) – density estimation
- Hidden Markov Models (HMMs) – time-series analysis
- Dynamical Systems – modeling temporal dependencies
- Meta-Learning / Task Generalization – learning across tasks
Although introducing latent variables can make models more interpretable and flexible, inference becomes more challenging: the marginalization over \(\mathbf{z}\) is often intractable and must itself be approximated (a short fitted example is sketched below).
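As one concrete instance of the models listed above, the sketch below fits a Gaussian mixture to synthetic data with scikit-learn (an assumed dependency); GaussianMixture handles the latent cluster assignments internally with the EM algorithm and returns both fitted parameters and the posteriors \(p(\mathbf{z} | \mathbf{x}, \mathbf{\theta})\) via predict_proba. The data and settings are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Synthetic data from two clusters; the cluster label is the latent variable z.
X = np.concatenate([rng.normal(-2.0, 0.8, size=(100, 1)),
                    rng.normal(1.5, 1.2, size=(200, 1))])

# GaussianMixture estimates the parameters theta by maximizing the
# marginal likelihood p(X | theta) with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("weights:", np.round(gmm.weights_, 2))
print("means:  ", np.round(gmm.means_.ravel(), 2))
# Posterior over the latent variable for a few points, p(z | x, theta):
print("responsibilities:\n", np.round(gmm.predict_proba(X[:3]), 3))
```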