4.3 Parameter Estimation

In this section, we introduce how probability distributions can model uncertainty in data and parameters. This extends the concepts from empirical risk minimization to a probabilistic framework, allowing us to reason about both the data-generating process and the model parameters.

4.3.1 Maximum Likelihood Estimation (MLE)

Definition 4.10 Maximum Likelihood Estimation (MLE) is a statistical method used to estimate the parameters of a model by finding the values that make the observed data most probable.

Suppose we have data \(\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)\) drawn independently and identically (i.i.d.) from a probability distribution \(p(\mathbf{x} | \mathbf{\theta})\) that depends on some unknown parameter(s) \(\mathbf{\theta}\). The i.i.d. assumption is what lets the likelihood below factorize over the data points.
The likelihood function represents how likely the observed data is for different parameter values: \[ L(\mathbf{\theta}) = p(\mathbf{x} | \mathbf{\theta}) = \prod_{n=1}^{N} p(\mathbf{x}_n | \mathbf{\theta}) \] The goal of MLE is to find the parameter value \(\hat{\mathbf{\theta}}\) that maximizes this likelihood: \[ \hat{\mathbf{\theta}}_{\text{MLE}} = \arg\max_{\mathbf{\theta}} L(\mathbf{\theta}) \] Because products of probabilities can become very small, it is common to work with the log-likelihood, which turns products into sums and simplifies computation: \[ \ell(\mathbf{\theta}) = \log L(\mathbf{\theta}) = \sum_{n=1}^{N} \log p(\mathbf{x}_n | \mathbf{\theta}) \] Maximizing the log-likelihood is equivalent to maximizing the likelihood itself, since the logarithm is a monotonic transformation.

In practice, optimization algorithms often minimize rather than maximize functions. Therefore, we typically minimize the negative log-likelihood (NLL): \[ \mathcal{L}(\mathbf{\theta}) = -\ell(\mathbf{\theta}) = - \sum_{n=1}^{N} \log p(\mathbf{x}_n | \mathbf{\theta}) \]
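
As a concrete worked case (a standard textbook example with a scalar parameter \(\theta\)), suppose each \(x_n \in \{0, 1\}\) is modelled as Bernoulli with success probability \(\theta\). The negative log-likelihood is \[ \mathcal{L}(\theta) = -\sum_{n=1}^{N} \left[ x_n \log \theta + (1 - x_n) \log(1 - \theta) \right], \] and setting \(\frac{d\mathcal{L}}{d\theta} = 0\) yields the closed-form estimate \(\hat{\theta}_{\text{MLE}} = \frac{1}{N} \sum_{n=1}^{N} x_n\), the sample proportion of successes.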

When we maximize the log-likelihood (or minimize the negative log-likelihood), \(\mathbf{\theta}\) varies while the data \(\mathbf{x}\) is fixed. Minimizing the negative log-likelihood \(\mathcal{L}(\mathbf{\theta})\) therefore corresponds to maximizing the likelihood, that is, finding the parameters most likely to have produced the observed data.

Example 4.5 If \(p(y_n | \mathbf{x}_n, \mathbf{\theta})\) is Gaussian with fixed noise variance, MLE corresponds to minimizing the sum of squared residuals, which is the classic linear regression (least-squares) case.
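
To see the connection explicitly (assuming a fixed noise variance \(\sigma^2\) and a model mean \(f(\mathbf{x}_n, \mathbf{\theta})\)), the Gaussian negative log-likelihood is \[ -\sum_{n=1}^{N} \log \mathcal{N}\big(y_n \,|\, f(\mathbf{x}_n, \mathbf{\theta}), \sigma^2\big) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - f(\mathbf{x}_n, \mathbf{\theta})\big)^2 + \frac{N}{2} \log(2\pi\sigma^2), \] so minimizing the NLL over \(\mathbf{\theta}\) is exactly least squares; for a linear model \(f(\mathbf{x}_n, \mathbf{\theta}) = \mathbf{\theta}^\top \mathbf{x}_n\) this recovers ordinary linear regression.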

Although MLE can yield closed-form solutions in simple cases, it may suffer from overfitting and may require numerical optimization when closed forms are unavailable.
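
The sketch below shows numerical MLE in a minimal setting: fitting the mean and standard deviation of a one-dimensional Gaussian by minimizing the negative log-likelihood with SciPy. The synthetic data, the parameterization via \(\log \sigma\), and the function names are our own choices for illustration, not a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Synthetic data: 200 draws from a Gaussian with mean 2.0 and std 0.5.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=0.5, size=200)

def negative_log_likelihood(params, data):
    """Gaussian NLL; params = (mean, log_std). Optimizing log_std keeps std > 0."""
    mean, log_std = params
    return -np.sum(norm.logpdf(data, loc=mean, scale=np.exp(log_std)))

# Minimize the NLL numerically from an arbitrary starting point.
result = minimize(negative_log_likelihood, x0=[0.0, 0.0], args=(x,))
mean_hat, std_hat = result.x[0], np.exp(result.x[1])

# For a Gaussian the MLE is also available in closed form: sample mean and std.
print(mean_hat, std_hat)   # numerical MLE
print(x.mean(), x.std())   # closed-form MLE for comparison
```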


4.3.2 Maximum A Posteriori (MAP) Estimation

When prior knowledge about parameters is available, we can combine it with the likelihood using Bayes’ theorem: \[ p(\mathbf{\theta} | \mathbf{x}) = \frac{p(\mathbf{x} | \mathbf{\theta}) p(\mathbf{\theta})}{p(\mathbf{x})}. \] Since \(p(\mathbf{x})\) does not depend on \(\mathbf{\theta}\), maximizing the posterior is equivalent to maximizing \(p(\mathbf{x} | \mathbf{\theta})p(\mathbf{\theta})\). This leads to Maximum A Posteriori (MAP) Estimation, where we minimize the negative log-posterior: \[ L_{\text{MAP}}(\mathbf{\theta}) = -\log p(\mathbf{x} | \mathbf{\theta}) - \log p(\mathbf{\theta}) \]

MAP estimation adds a regularizing effect, since the prior \(p(\mathbf{\theta})\) discourages implausible parameter values.
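
As a small sketch (under our own assumptions: a Gaussian likelihood with known noise standard deviation \(\sigma = 0.5\) and a zero-mean Gaussian prior with standard deviation \(\tau = 1\) on the unknown mean), MAP estimation simply adds the negative log-prior to the MLE objective:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma = 0.5   # assumed known noise standard deviation
tau = 1.0     # standard deviation of the zero-mean Gaussian prior on mu
x = rng.normal(loc=2.0, scale=sigma, size=20)   # small sample, so the prior matters

def negative_log_posterior(mu):
    """-log p(x | mu) - log p(mu), up to constants that do not depend on mu."""
    nll = -np.sum(norm.logpdf(x, loc=mu, scale=sigma))   # negative log-likelihood
    nlp = -norm.logpdf(mu, loc=0.0, scale=tau)           # negative log-prior
    return nll + nlp

mu_map = minimize_scalar(negative_log_posterior).x
mu_mle = x.mean()
print(mu_mle, mu_map)   # the MAP estimate is pulled slightly towards the prior mean 0
```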

Example 4.6 With a Gaussian likelihood and a Gaussian prior on parameters (e.g., zero-mean prior), the MAP estimate resembles ridge regression — balancing data fit and parameter simplicity.
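
Concretely (assuming noise variance \(\sigma^2\) and an isotropic zero-mean Gaussian prior with variance \(\tau^2\) on the parameters), dropping terms that do not depend on \(\mathbf{\theta}\) gives \[ L_{\text{MAP}}(\mathbf{\theta}) = \frac{1}{2\sigma^2} \sum_{n=1}^{N} \big(y_n - \mathbf{\theta}^\top \mathbf{x}_n\big)^2 + \frac{1}{2\tau^2} \|\mathbf{\theta}\|^2 + \text{const}, \] which is the ridge regression objective with regularization strength \(\lambda = \sigma^2 / \tau^2\).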


4.3.3 Model Fitting

Model fitting involves optimizing parameters \(\mathbf{\theta}\) to minimize a loss (e.g., the negative log-likelihood). The model class \(\mathcal{M}_\mathbf{\theta}\) defines the family of possible predictors, and fitting finds the instance within this class that best approximates the true data-generating process \(\mathcal{M}^*\).

There are three main fitting outcomes:

  1. Overfitting:
    • The model class is too flexible.
    • Captures noise as if it were signal.
    • Low training error but high test error.
  2. Underfitting:
    • The model class is too simple.
    • Fails to capture the true data structure.
    • High error on both training and test data.
  3. Good Fit:
    • The model class is appropriately complex.
    • Balances bias and variance.
    • Exhibits good generalization.

To mitigate overfitting, we can apply:

  • Regularization (Section 4.2.3), or
  • Priors (Section 4.3.2).

In practice, large model classes such as deep neural networks rely on these techniques to control generalization and improve performance.
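
A compact way to see these regimes is to fit polynomials of increasing degree to a small noisy dataset and compare training and test error. The sketch below uses synthetic data and arbitrary choices of noise level and degrees; the exact numbers depend on the random seed.

```python
import numpy as np

rng = np.random.default_rng(2)

def true_function(x):
    """Smooth underlying signal that the models try to recover."""
    return np.sin(2 * np.pi * x)

# Small noisy training set and a larger test set from the same process.
x_train = rng.uniform(0, 1, size=15)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.shape)
x_test = rng.uniform(0, 1, size=200)
y_test = true_function(x_test) + rng.normal(scale=0.2, size=x_test.shape)

for degree in [1, 3, 9]:   # too simple, moderate, very flexible
    coeffs = np.polyfit(x_train, y_train, deg=degree)          # least-squares fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Typically the degree-1 fit underfits (both errors high), the degree-9 fit overfits (low training error, higher test error), and the intermediate degree generalizes best.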


Exercises

Exercise 4.6 Suppose that you would like to estimate the proportion of voters in your town who plan to vote for Party A in an upcoming election. To do so, you take a random sample of size \(n\) from the likely voters in the town. Since you have a limited amount of time and resources, your sample is relatively small. Specifically, suppose that \(n = 20\). After doing your sampling, you find that 6 people in your sample say they will vote for Party A. In the previous election, 40% of voters voted for Party A. Provide both a frequentist and a Bayesian approach to this problem.