6.1 Construction of a Probability Space

The goal of probability theory is to define a mathematical framework for describing random outcomes of experiments. For example, while we cannot predict the outcome of a single coin toss, repeating the experiment many times reveals regular patterns in the long run. This structure allows us to automate reasoning under uncertainty, generalizing Boolean logic to a continuous scale of plausibility.


6.1.1 Philosophical Issues

Classical Boolean logic cannot express plausible reasoning — the type of reasoning we use in uncertain, real-world situations. Probability theory extends logic to handle degrees of plausibility rather than discrete true/false statements.

“For plausible reasoning it is necessary to extend the discrete true and false values of truth to continuous plausibilities.” — E. T. Jaynes (2003)

Example 6.2 If your friend is late, you may rule out the hypothesis that she is on time and find the hypothesis that she is delayed by traffic more plausible, even though neither conclusion is logically required.

This everyday reasoning can be formalized through probability theory.

Definition 6.4 E. T. Jaynes identified three key criteria for plausibility that form the foundation of probability:

  1. Plausibility as real numbers.
    • Degrees of plausibility are represented by real numbers.
    • This allows for continuous gradations of belief between complete disbelief (false) and certainty (true).
  2. Consistency with common sense.
    • The rules governing these plausibility values must be consistent with the rules of common sense reasoning.
    • For example, if evidence makes an event more likely, the numerical plausibility should increase accordingly.
  3. Consistency in reasoning, meaning:
    • (a) Non-contradiction: If the same conclusion can be reached by different arguments, it must have the same plausibility.
    • (b) Honesty: All available and relevant information must be considered.
    • (c) Reproducibility: Identical states of knowledge must lead to identical plausibility assignments.

The Cox–Jaynes theorem shows that any system satisfying these principles must follow the rules of probability.

Theorem 6.1 (Cox–Jaynes) Under the Jaynes assumptions, the rules governing plausibilities are isomorphic to the rules of probability theory.

That is, for a plausibility measure \(p\) and probability measure \(P\), there exists a monotonic transformation \(f\) such that for any propositions \(A\) and \(B\): \[P(A|B) = f(p(A|B)),\] where \(P\) satisfies the standard product and sum rules of probability: \[ P(A \land B|C) = P(A|C) \, P(B|A, C), \]

\[ P(A \lor B|C) = P(A|C) + P(B|C) - P(A \land B|C). \]
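The product and sum rules can be checked numerically on any concrete joint distribution. The sketch below uses a hypothetical joint distribution over two binary propositions \(A\) and \(B\) (the numbers are illustrative only, chosen for this example):

```python
# Hypothetical joint distribution P(A, B | C) over two binary propositions.
joint = {
    (True, True): 0.2, (True, False): 0.3,
    (False, True): 0.1, (False, False): 0.4,
}

P_A = sum(p for (a, b), p in joint.items() if a)   # marginal P(A|C)
P_B = sum(p for (a, b), p in joint.items() if b)   # marginal P(B|C)
P_AB = joint[(True, True)]                         # P(A ∧ B|C)
P_B_given_A = P_AB / P_A                           # conditional P(B|A, C)

# Product rule: P(A ∧ B|C) = P(A|C) · P(B|A, C)
assert abs(P_AB - P_A * P_B_given_A) < 1e-12

# Sum rule: P(A ∨ B|C) = P(A|C) + P(B|C) − P(A ∧ B|C)
P_AorB = sum(p for (a, b), p in joint.items() if a or b)
assert abs(P_AorB - (P_A + P_B - P_AB)) < 1e-12
```

Both identities hold for any valid joint distribution, not just these numbers; that universality is exactly what the theorem asserts.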


6.1.2 Bayesian vs. Frequentist Interpretations

There are two different interpretations of probability that we need to consider when dealing with machine learning models: Bayesian and Frequentist.

| Interpretation | Description | Example Usage |
|---|---|---|
| Bayesian | Probability represents a degree of belief or subjective uncertainty about an event. | Updating a belief after observing data. |
| Frequentist | Probability represents the long-run relative frequency of an event over repeated trials. | Estimating an event's probability as the data size \(\rightarrow \infty\). |

Both perspectives appear in machine learning, depending on whether the focus is on modeling uncertainty (Bayesian) or empirical frequency (Frequentist).

Example 6.3 We have a coin that might be biased. We toss it 10 times and observe 7 heads. We want to know: What is the probability that the coin is biased toward heads?

Frequentist Perspective: The frequentist treats the probability as an objective long-run frequency.

  • The coin has a fixed, but unknown probability \(p\) of landing heads.
  • The data (7 heads out of 10 tosses) are used to estimate this parameter.

\[ \hat{p} = \frac{7}{10} = 0.7 \] A 95% confidence interval for \(p\) (using a binomial model) might be approximately: \[ p \in [0.35, 0.93] \] Therefore, if we were to repeat the entire experiment many times, then 95% of the confidence intervals constructed this way would contain the true \(p\). The parameter \(p\) is fixed — only the data vary.
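One closed-form way to compute such an interval is the Wilson score method, sketched below; its endpoints differ slightly from the exact binomial interval quoted above, which is expected for an approximation at \(n = 10\):

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score 95% confidence interval for a binomial proportion."""
    p_hat = k / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_interval(7, 10)   # ≈ (0.40, 0.89)
```

The interpretation is unchanged: over many repetitions of the experiment, about 95% of intervals constructed this way would cover the fixed true \(p\).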

Bayesian Perspective: The Bayesian treats probability as a degree of belief about \(p\). The parameter \(p\) itself is a random variable with a prior distribution. The data are used to update beliefs via Bayes’ theorem.

Assume a uniform prior: \[ p \sim \text{Beta}(1, 1) \] After observing 7 heads and 3 tails: \[ p | \text{data} \sim \text{Beta}(8, 4) \] then, \[ E[p|\text{data}] = \frac{8}{8 + 4} = 0.67 \] A 95% credible interval is approximately: \[ p \in [0.42, 0.88] \] Therefore, given the data and prior beliefs, there is a 95% probability that \(p\) lies between 0.42 and 0.88. Here, the data are fixed and the parameter is uncertain.
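The posterior quantities above can be reproduced with the Python standard library alone: the mean analytically, and an equal-tailed 95% credible interval by Monte Carlo sampling from the \(\text{Beta}(8, 4)\) posterior (the sample size and seed below are arbitrary choices):

```python
import random

random.seed(0)

# Posterior after a Beta(1, 1) prior and 7 heads, 3 tails: Beta(8, 4).
alpha, beta_ = 8, 4
posterior_mean = alpha / (alpha + beta_)   # = 8/12 ≈ 0.67

# Monte Carlo approximation of the equal-tailed 95% credible interval
# (random.betavariate is in the standard library).
samples = sorted(random.betavariate(alpha, beta_) for _ in range(100_000))
lo = samples[int(0.025 * len(samples))]
hi = samples[int(0.975 * len(samples))]
```

Because the interval is estimated by sampling, its endpoints wobble slightly from run to run, but they concentrate near the quoted values as the sample size grows.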

Below are some of the key differences between the frequentist and Bayesian perspectives.

| Aspect | Frequentist | Bayesian |
|---|---|---|
| Definition of probability | Long-run frequency of events | Degree of belief about parameters |
| Parameter \(p\) | Fixed but unknown | Random variable |
| Data | Random | Fixed (once observed) |
| Interval meaning | 95% of intervals contain the true \(p\) in repeated experiments | 95% probability that \(p\) lies in the given range |
| Uses prior information | No | Yes |

6.1.3 Probability and Random Variables

There are three key ideas in probability theory:

  1. Probability Space – the fundamental mathematical setup.
  2. Random Variables – functions mapping outcomes to measurable quantities.
  3. Distributions (Laws) – describe how probabilities are assigned to these quantities.

Definition 6.5 A probability space is defined by three components:

  • Sample Space (\(\Omega\)):
    The set of all possible outcomes of an experiment.
    Example: two coin tosses \(\longrightarrow \Omega = \{hh, ht, th, tt\}\).

  • Event Space (\(\mathcal{A}\)):
    The collection of subsets of \(\Omega\) that can be assigned probabilities (formally, a \(\sigma\)-algebra). For discrete spaces, \(\mathcal{A}\) is often the power set of \(\Omega\).

  • Probability Measure (\(P\)):
    A function assigning probabilities to events:

    • \(0 \leq P(A) \leq 1\) for all \(A \in \mathcal{A}\)
    • \(P(\Omega) = 1\)
    • \(P(A \cup B) = P(A) + P(B)\) for disjoint events \(A, B \in \mathcal{A}\)

Definition 6.6 A random variable is a function \(X: \Omega \rightarrow T\) that maps outcomes \(\omega \in \Omega\) to values \(x \in T\), the target space.

Example 6.4 Toss two coins and let \(X\) = number of heads. Then \(T = \{0, 1, 2\}\), and: \[ X(hh) = 2, \quad X(ht) = 1, \quad X(th) = 1, \quad X(tt) = 0. \] Here, we see that our function \(X\) maps from our set of outcomes \(\Omega = \{hh, ht, th, tt\}\) into our target space \(T = \{0,1,2\}\).

The probability of \(X\) taking certain values can be expressed as: \[ P_X(S) = P(X \in S) = P(\{\omega \in \Omega : X(\omega) \in S\}) \]

This defines the distribution (law) of \(X\), written \(P_X = P \circ X^{-1}\).

  • If \(T\) is finite or countable, \(X\) is a discrete random variable.
  • If \(T = \mathbb{R}^D\), \(X\) is a continuous random variable.
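For Example 6.4, the law \(P_X = P \circ X^{-1}\) can be computed directly by summing \(P\) over the preimage of each value in \(T\); a minimal sketch:

```python
from fractions import Fraction
from collections import defaultdict

# Sample space for two fair coin tosses; each outcome is equally likely.
P = {w: Fraction(1, 4) for w in ["hh", "ht", "th", "tt"]}

# Random variable X(ω) = number of heads.
def X(w):
    return w.count("h")

# Pushforward law: P_X(x) = P({ω ∈ Ω : X(ω) = x}).
P_X = defaultdict(Fraction)
for w, p in P.items():
    P_X[X(w)] += p

# P_X == {2: 1/4, 1: 1/2, 0: 1/4}
```

Using exact fractions avoids floating-point rounding and makes the resulting distribution easy to verify against the hand calculation.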

Example 6.5 Suppose we draw two coins (with replacement) from a bag containing U.S. coins ($) with probability 0.3 and U.K. coins (£) with probability 0.7.

Sample space:
Ω = {($,$), ($,£), (£,$), (£,£)}

Random variable:
\(X\) = number of U.S. coins drawn \(\longrightarrow T = \{0,1,2\}\)

Probability mass function: \[ \begin{aligned} P(X=2) &= 0.3 \times 0.3 = 0.09 \\ P(X=1) &= 2 \times 0.3 \times 0.7 = 0.42 \\ P(X=0) &= 0.7 \times 0.7 = 0.49 \end{aligned} \] Thus, \(X\) defines a discrete probability distribution over \(T\).
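The PMF in Example 6.5 can likewise be obtained as the pushforward of the product measure on \(\Omega\) (the two draws are independent because we sample with replacement); a sketch:

```python
from itertools import product

# Per-draw probabilities from Example 6.5.
p_coin = {"$": 0.3, "£": 0.7}

# Product measure on Ω: two independent draws with replacement.
P = {(a, b): p_coin[a] * p_coin[b] for a, b in product(p_coin, repeat=2)}

# X = number of U.S. coins drawn.
pmf = {x: 0.0 for x in (0, 1, 2)}
for (a, b), p in P.items():
    pmf[(a == "$") + (b == "$")] += p

# pmf == {0: 0.49, 1: 0.42, 2: 0.09}  (up to floating-point rounding)
```

The values match the hand calculation, and summing the PMF over \(T\) recovers \(P(\Omega) = 1\).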


6.1.4 Statistics

Probability and statistics are related but distinct disciplines:

| Field | Main Focus |
|---|---|
| Probability | Starts with a model and derives what outcomes we expect. |
| Statistics | Starts with observed data and infers the underlying model. |

In machine learning, we use both:

  • Probability helps model uncertainty and future generalization.
  • Statistics helps infer model parameters from data.

Probability theory provides the foundation for analyzing generalization error and for developing data-driven models that learn from uncertainty.


Exercises

Exercise 6.1 A hospital researcher is interested in the number of times the average post-op patient will ring the nurse during a 12-hour shift. For a random sample of 50 patients, the following information was obtained. Let \(X =\) the number of times a patient rings the nurse during a 12-hour shift.

| \(X\) | \(P(X)\) |
|---|---|
| 0 | \(\frac{4}{50}\) |
| 1 | \(\frac{8}{50}\) |
| 2 | \(\frac{16}{50}\) |
| 3 | \(\frac{14}{50}\) |
| 4 | \(\frac{6}{50}\) |
| 5 | \(\frac{2}{50}\) |

What is \(T\)? Do these data follow the rules of probability? What is \(P(X\geq 3)\) (use notation)?
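The numerical parts of this exercise can be checked mechanically (the written justification is left to the reader); a sketch using exact fractions:

```python
from fractions import Fraction

# PMF from the table in Exercise 6.1.
pmf = {x: Fraction(n, 50) for x, n in
       {0: 4, 1: 8, 2: 16, 3: 14, 4: 6, 5: 2}.items()}

# Rules of probability: each value lies in [0, 1] and the total is 1.
assert all(0 <= p <= 1 for p in pmf.values())
assert sum(pmf.values()) == 1

# P(X >= 3) = P(X = 3) + P(X = 4) + P(X = 5)
p_at_least_3 = sum(p for x, p in pmf.items() if x >= 3)   # = 22/50
```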

Exercise 6.2 Suppose Nancy has classes three days a week. She attends classes three days a week 80% of the time, two days 15% of the time, one day 4% of the time, and no days 1% of the time. Suppose one week is selected at random. What is \(T\)? Do these data follow the rules of probability? What is the probability associated with each value in \(T\)? What is the probability that Nancy attends classes on at least two days in the week (use notation)?