4.4 Probabilistic Modeling and Inference
In machine learning, we use probabilistic models to represent uncertainty in data and parameters. These models describe how observed data are generated from underlying parameters, allowing us to reason about prediction, inference, and decision-making under uncertainty.
4.4.1 Probabilistic Models
Definition 4.11 A probabilistic model is defined by the joint distribution of all random variables: \[ p(\mathbf{x}, \mathbf{\theta}) \] where
- \(\mathbf{x}\) represents the observed data, and
- \(\mathbf{\theta}\) represents the model parameters.
This joint distribution encapsulates:
- The likelihood \(p(\mathbf{x} | \mathbf{\theta})\)
- The prior \(p(\mathbf{\theta})\)
- The posterior \(p(\mathbf{\theta} | \mathbf{x})\)
- The marginal likelihood \(p(\mathbf{x}) = \int p(\mathbf{x} | \mathbf{\theta})p(\mathbf{\theta}) d\mathbf{\theta}\)
The probabilistic framework provides a consistent way to model, infer, and predict using probability theory.
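To make these four quantities concrete, here is a minimal numerical sketch for a coin-flip model, assuming a Bernoulli likelihood and a Beta(2, 2) prior; the data and grid resolution are made up for illustration, and the marginal likelihood is approximated by a simple Riemann sum.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical coin-flip data: 1 = heads, 0 = tails (made up for illustration).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

# Grid of candidate parameter values theta = P(heads).
theta = np.linspace(1e-3, 1 - 1e-3, 1000)
dtheta = theta[1] - theta[0]

# Likelihood p(x | theta): product of Bernoulli terms over the observations.
likelihood = theta ** x.sum() * (1 - theta) ** (len(x) - x.sum())

# Prior p(theta): Beta(2, 2), a mild preference for fair coins (assumption).
prior = beta(2, 2).pdf(theta)

# Marginal likelihood p(x) = integral of p(x | theta) p(theta) dtheta (Riemann sum).
marginal = np.sum(likelihood * prior) * dtheta

# Posterior p(theta | x) = p(x | theta) p(theta) / p(x).
posterior = likelihood * prior / marginal

print(f"marginal likelihood p(x) ~ {marginal:.5f}")
print(f"posterior mean of theta ~ {np.sum(theta * posterior) * dtheta:.3f}")
```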
4.4.2 Bayesian Inference
Bayesian inference is concerned with computing the posterior distribution of parameters given data: \[ p(\mathbf{\theta} | \mathcal{X}) = \frac{p(\mathcal{X} | \mathbf{\theta}) p(\mathbf{\theta})}{p(\mathcal{X})} \] where \[ p(\mathcal{X}) = \int p(\mathcal{X} | \mathbf{\theta}) p(\mathbf{\theta}) d\mathbf{\theta} \] acts as a normalization constant (marginal likelihood). Bayesian inference inverts the relationship between parameters and data. Instead of finding a single “best” parameter estimate (as in MLE or MAP), it computes a distribution over parameters, capturing full uncertainty.
Predictions for a new observation are then made by marginalizing over the parameter posterior (the posterior predictive distribution): \[ p(\mathbf{x} | \mathcal{X}) = \int p(\mathbf{x} | \mathbf{\theta}) p(\mathbf{\theta} | \mathcal{X}) d\mathbf{\theta} = \mathbb{E}_{\mathbf{\theta} | \mathcal{X}}[p(\mathbf{x} | \mathbf{\theta})], \] so parameter uncertainty is carried into the prediction rather than discarded.
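For the same assumed Beta-Bernoulli setup as in the earlier sketch, posterior and posterior predictive are available in closed form, so the marginalization reduces to a one-line computation; the pseudo-counts and data are again made up.

```python
# Conjugate Beta-Bernoulli sketch (assumed prior Beta(alpha0, beta0)).
alpha0, beta0 = 2.0, 2.0            # prior pseudo-counts (assumption)
heads, tails = 6, 2                 # observed data X (made up)

# Posterior: p(theta | X) = Beta(alpha0 + heads, beta0 + tails).
alpha_n, beta_n = alpha0 + heads, beta0 + tails

# Posterior predictive: p(x_new = heads | X) = E_{theta | X}[theta].
p_next_heads = alpha_n / (alpha_n + beta_n)
print(f"P(next flip is heads | X) = {p_next_heads:.3f}")   # 0.667
```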
4.4.2.1 Comparison with Parameter Estimation
| Approach | Output | Main Computation | Example Methods |
|---|---|---|---|
| MLE / MAP | Point estimate \(\mathbf{\theta}^*\) | Optimization | Gradient Descent, Least Squares |
| Bayesian Inference | Distribution \(p(\mathbf{\theta} | \mathcal{X})\) | Integration | MCMC, Laplace, Variational Inference |
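The two rows of the table can be contrasted on a toy problem. The sketch below assumes a Gaussian likelihood with known noise variance and a conjugate Gaussian prior on the mean (all numbers invented): MLE returns a single value, while Bayesian inference returns a full Gaussian posterior whose width reports the remaining uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small dataset: N noisy observations of an unknown mean mu.
sigma = 1.0                                   # known noise std (assumption)
x = rng.normal(loc=1.5, scale=sigma, size=5)

# MLE: a single point estimate, here simply the sample mean.
mu_mle = x.mean()

# Bayesian inference with a Gaussian prior mu ~ N(m0, s0^2). The posterior is
# again Gaussian (conjugate case):
#   1 / s_n^2 = 1 / s0^2 + N / sigma^2,
#   m_n = s_n^2 * (m0 / s0^2 + sum(x) / sigma^2).
m0, s0 = 0.0, 2.0                             # prior mean and std (assumption)
s_n2 = 1.0 / (1.0 / s0**2 + len(x) / sigma**2)
m_n = s_n2 * (m0 / s0**2 + x.sum() / sigma**2)

print(f"MLE point estimate:  mu* = {mu_mle:.3f}")
print(f"Bayesian posterior:  N({m_n:.3f}, {np.sqrt(s_n2):.3f}^2)")
```

Note that the posterior has a closed form here only because the prior is conjugate to the likelihood, which leads directly to the need for the approximations discussed below.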
Bayesian methods allow:
- Incorporation of prior knowledge
- Uncertainty propagation in predictions
- Better handling of small datasets or noisy data
However, Bayesian inference often requires approximations, since the integrals needed for the marginal likelihood and for posterior predictions are rarely available in closed form. Common approximation methods include (a minimal MCMC sketch follows this list):
- Stochastic methods: Markov Chain Monte Carlo (MCMC)
- Deterministic methods: Laplace approximation, Variational Inference, Expectation Propagation
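As an illustration of the stochastic route, here is a minimal random-walk Metropolis-Hastings sketch (not a production sampler) for a made-up one-dimensional model with a Gaussian likelihood and a Laplace prior, a combination chosen because the posterior has no convenient closed form; step size, chain length, and burn-in are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data and a non-conjugate model: Gaussian likelihood (sigma = 1)
# with a Laplace(0, 1) prior, so the posterior lacks a simple closed form.
x = rng.normal(loc=2.0, scale=1.0, size=20)

def log_post(theta):
    """Unnormalized log-posterior log p(theta | x) + const."""
    log_lik = -0.5 * np.sum((x - theta) ** 2)   # Gaussian log-likelihood (up to const)
    log_prior = -np.abs(theta)                  # Laplace log-prior (up to const)
    return log_lik + log_prior

# Random-walk Metropolis-Hastings: propose theta' ~ N(theta, step^2) and
# accept with probability min(1, p(theta' | x) / p(theta | x)).
theta, step, samples = 0.0, 0.5, []
for _ in range(5000):
    proposal = theta + step * rng.normal()
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal
    samples.append(theta)

samples = np.array(samples[1000:])              # discard burn-in
print(f"posterior mean ~ {samples.mean():.3f}, posterior std ~ {samples.std():.3f}")
```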
4.4.3 Latent-Variable Models
Definition 4.12 A latent-variable model introduces hidden variables \(\mathbf{z}\) that help explain the data.
These variables are not directly observed but simplify or enrich the model’s structure. The generative process is: \[ p(\mathbf{x} | \mathbf{z}, \mathbf{\theta}) \] with a prior on latent variables: \[ p(\mathbf{z}). \] To obtain the likelihood of observed data, we integrate out the latent variables: \[ p(\mathbf{x} | \mathbf{\theta}) = \int p(\mathbf{x} | \mathbf{z}, \mathbf{\theta})p(\mathbf{z})d\mathbf{z}. \] Once the likelihood is known, we can:
- Perform maximum likelihood or MAP estimation for parameters
- Conduct Bayesian inference to obtain posterior distributions
The posterior over parameters is: \[ p(\mathbf{\theta} | \mathcal{X}) = \frac{p(\mathcal{X} | \mathbf{\theta})p(\mathbf{\theta})}{p(\mathcal{X})} \] and the posterior over latent variables is: \[ p(\mathbf{z} | \mathcal{X}, \mathbf{\theta}) = \frac{p(\mathcal{X} | \mathbf{z}, \mathbf{\theta})p(\mathbf{z})}{p(\mathcal{X} | \mathbf{\theta})}. \]
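For a discrete latent variable the integral over \(\mathbf{z}\) becomes a sum, which makes both quantities easy to write down. The sketch below uses a two-component one-dimensional Gaussian mixture with fixed, made-up parameters: the mixture weights play the role of \(p(\mathbf{z})\), each component density is \(p(\mathbf{x} | \mathbf{z}, \mathbf{\theta})\), and the normalized products are the posteriors \(p(\mathbf{z} | \mathbf{x}, \mathbf{\theta})\) (often called responsibilities).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Two-component 1D Gaussian mixture with fixed (assumed) parameters theta.
weights = np.array([0.3, 0.7])        # p(z = k)
means = np.array([-2.0, 1.5])
stds = np.array([0.8, 1.2])

x = np.array([-1.9, 0.4, 2.1])        # observed data points (made up)

# log p(x_n | z = k, theta) for every data point and component.
log_px_given_z = norm.logpdf(x[:, None], loc=means, scale=stds)

# Marginal likelihood: p(x_n | theta) = sum_k p(z = k) p(x_n | z = k, theta).
log_px = logsumexp(log_px_given_z + np.log(weights), axis=1)

# Posterior over the latent variable (responsibilities):
# p(z = k | x_n, theta) = p(x_n | z = k, theta) p(z = k) / p(x_n | theta).
log_resp = log_px_given_z + np.log(weights) - log_px[:, None]

print("log p(x_n | theta):", np.round(log_px, 3))
print("responsibilities p(z | x_n, theta):\n", np.round(np.exp(log_resp), 3))
```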
4.4.4 Examples of Latent-Variable Models
- Principal Component Analysis (PCA) – dimensionality reduction
- Gaussian Mixture Models (GMMs) – density estimation
- Hidden Markov Models (HMMs) – time-series analysis
- Dynamical Systems – modeling temporal dependencies
- Meta-Learning / Task Generalization – learning across tasks
Although introducing latent variables can make models more interpretable and flexible, inference becomes more challenging: the marginalization over \(\mathbf{z}\) is often intractable and must itself be approximated (a short fitted example is sketched below).
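As one concrete instance of the models listed above, the sketch below fits a Gaussian mixture to synthetic data with scikit-learn (an assumed dependency); GaussianMixture handles the latent cluster assignments internally with the EM algorithm and returns both fitted parameters and the posteriors \(p(\mathbf{z} | \mathbf{x}, \mathbf{\theta})\) via predict_proba. The data and settings are invented for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Synthetic data from two clusters; the cluster label is the latent variable z.
X = np.concatenate([rng.normal(-2.0, 0.8, size=(100, 1)),
                    rng.normal(1.5, 1.2, size=(200, 1))])

# GaussianMixture estimates the parameters theta by maximizing the
# marginal likelihood p(X | theta) with the EM algorithm.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print("weights:", np.round(gmm.weights_, 2))
print("means:  ", np.round(gmm.means_.ravel(), 2))
# Posterior over the latent variable for a few points, p(z | x, theta):
print("responsibilities:\n", np.round(gmm.predict_proba(X[:3]), 3))
```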