6.3 Sum Rule, Product Rule, and Bayes’ Theorem
Probability theory can be viewed as an extension of logic that allows reasoning under uncertainty. All of probability theory can be built from two fundamental rules — the Sum Rule and the Product Rule — both of which arise naturally from the desiderata of plausible reasoning (Jaynes, 2003).
6.3.1 The Sum Rule (Marginalization Property)
The Sum Rule relates a joint distribution to its marginal distribution by summing or integrating over unobserved variables.
Theorem 6.2 Sum Rule: If \(x\) and \(y\) are random variables with joint probability \(p(x, y)\), then: \[ p(x) = \begin{cases} \sum\limits_{y \in Y} p(x, y), & \text{if } y \text{ is discrete} \\ \int_Y p(x, y) \, dy, & \text{if } y \text{ is continuous} \end{cases} \]
- \(p(x, y)\): joint probability of \(x\) and \(y\)
- \(p(x)\): marginal probability of \(x\)
- \(Y\): set of all possible values of \(y\)
Example 6.12 Discrete Joint Distribution
Let \(X\) and \(Y\) be discrete random variables with joint probability mass function given by:
| \(Y \backslash X\) | 1 | 2 | 3 |
|---|---|---|---|
| 0 | 0.10 | 0.15 | 0.05 |
| 1 | 0.20 | 0.30 | 0.20 |
Use the sum rule to find the marginal probability \(P(X = 2)\).
Solution For discrete random variables, the sum rule states: \[ P(X=x) = \sum_{y} P(X=x, Y=y) \]
Thus, \[ P(X=2) = P(2,0) + P(2,1) = 0.15 + 0.30 = 0.45 \]
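To make the computation concrete, here is a minimal sketch in Python (the data structure and names are our own choices, not part of the example) that stores the joint pmf and marginalizes over \(Y\):

```python
# Discrete sum rule for Example 6.12: marginalize the joint table over Y.
joint = {  # joint[(x, y)] = P(X = x, Y = y), copied from the table above
    (1, 0): 0.10, (2, 0): 0.15, (3, 0): 0.05,
    (1, 1): 0.20, (2, 1): 0.30, (3, 1): 0.20,
}

def marginal_x(x):
    """Sum rule: P(X = x) = sum over y of P(X = x, Y = y)."""
    return sum(p for (xv, yv), p in joint.items() if xv == x)

print(marginal_x(2))  # ~0.45 (up to floating-point rounding)
```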
Example 6.13 Continuous Joint Distribution
Let \(X\) and \(Y\) be continuous random variables with joint probability density function: \[ f_{X,Y}(x,y) = \begin{cases} \frac{1}{6}, & 0 \le x \le 3,\; 0 \le y \le 2 \\ 0, & \text{otherwise} \end{cases} \]
Use the sum rule (integral form) to find the marginal probability
\[
P(1 \le X \le 2)
\]
Solution For continuous random variables, the sum rule is: \[ P(X \in A) = \int_A \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy\,dx \]
So, \[ P(1 \le X \le 2) = \int_1^2 \int_0^2 \frac{1}{6}\,dy\,dx \]
Evaluate the integrals: \[ = \int_1^2 \frac{1}{6}(2)\,dx = \int_1^2 \frac{1}{3}\,dx = \frac{1}{3} \]
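As a sanity check, the same probability can be computed numerically. The sketch below assumes SciPy is available; note that `dblquad` integrates the inner variable first, so the integrand takes its arguments in the order \((y, x)\):

```python
# Numerical check of Example 6.13 using SciPy's double integrator.
from scipy.integrate import dblquad

def f(y, x):
    """Joint density: 1/6 on [0, 3] x [0, 2], zero elsewhere."""
    return 1.0 / 6.0 if (0 <= x <= 3 and 0 <= y <= 2) else 0.0

# P(1 <= X <= 2): x ranges over [1, 2], y over [0, 2]
prob, _err = dblquad(f, 1, 2, 0, 2)
print(prob)  # ~0.3333, matching the analytic answer 1/3
```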
For multiple variables \(x = [x_1, \dots, x_D]^\top\), the marginal distribution of one component \(x_i\) is obtained by integrating/summing over all other variables: \[ p(x_i) = \int p(x_1, \dots, x_D) \, dx_{\backslash i} \]
Marginalization often involves high-dimensional integrals or sums, which are computationally expensive to evaluate exactly.
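For discrete variables, this amounts to summing the joint array over the unwanted axes. A small illustrative sketch (the array shape and variable names are arbitrary choices of ours):

```python
# Marginalizing a three-variable discrete joint p(x1, x2, x3) with NumPy.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((4, 3, 5))
p /= p.sum()               # normalize so the array is a valid joint pmf

p_x1 = p.sum(axis=(1, 2))  # sum rule: sum out x2 and x3
print(p_x1.sum())          # ~1.0 -- the marginal is itself a distribution
```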
6.3.2 The Product Rule (Factorization Property)
The Product Rule expresses a joint distribution in terms of a marginal and a conditional distribution:
Theorem 6.3 If \(x\) and \(y\) are random variables, then: \[ p(x, y) = p(y \mid x)\,p(x) \] Equivalently, \[ p(x, y) = p(x \mid y)\,p(y) \] where:
- \(p(x, y)\) — joint probability of \(x\) and \(y\)
- \(p(y \mid x)\) — conditional probability of \(y\) given \(x\)
- \(p(x)\) — marginal (or prior) probability of \(x\)
The product rule states that the joint probability can always be factorized into two parts:
- The marginal probability \(p(x)\)
- The conditional probability \(p(y \mid x)\)
Example 6.14 Let \(X\) and \(Y\) be discrete random variables defined as follows:
- \(X\): the type of coin selected
- \(x_1\): Fair coin
- \(x_2\): Biased coin
- \(Y\): outcome of a single coin toss
- \(y_1\): Heads
- \(y_2\): Tails
Suppose: \[ P(X = x_1) = 0.6, \quad P(X = x_2) = 0.4 \] Assume the following conditional probabilities:
If the coin is fair: \[ P(y_1 \mid x_1) = 0.5, \quad P(y_2 \mid x_1) = 0.5 \]
If the coin is biased: \[ P(y_1 \mid x_2) = 0.8, \quad P(y_2 \mid x_2) = 0.2 \]
Using the formula \[ P(x,y) = P(y \mid x)\,P(x), \] we compute each joint probability.
\(P(x_1, y_1)\): \[ P(y_1 \mid x_1)P(x_1) = (0.5)(0.6) = 0.30 \]
\(P(x_1, y_2)\): \[ (0.5)(0.6) = 0.30 \]
\(P(x_2, y_1)\): \[ (0.8)(0.4) = 0.32 \]
\(P(x_2, y_2)\): \[ (0.2)(0.4) = 0.08 \]
Thus, we have the following results:
| \(Y \backslash X\) | \(x_1\) | \(x_2\) |
|---|---|---|
| \(y_1\) (Heads) | 0.30 | 0.32 |
| \(y_2\) (Tails) | 0.30 | 0.08 |
Because the ordering of variables is arbitrary, the rule is symmetric: \[ p(x, y) = p(x \mid y) \, p(y) = p(y \mid x) \, p(x) \]
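The factorization is easy to verify in code. A minimal sketch of Example 6.14 (the dictionary layout and names are our own choices):

```python
# Product rule for Example 6.14: joint = conditional * marginal.
prior = {"x1": 0.6, "x2": 0.4}            # P(X)
cond = {                                   # P(Y | X), keyed as (y, x)
    ("y1", "x1"): 0.5, ("y2", "x1"): 0.5,
    ("y1", "x2"): 0.8, ("y2", "x2"): 0.2,
}

joint = {key: cond[key] * prior[key[1]] for key in cond}
print(joint)                # ~{..., ('y1','x2'): 0.32, ('y2','x2'): 0.08}
print(sum(joint.values()))  # ~1.0 -- a valid joint distribution
```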
6.3.3 Bayes’ Theorem (Probabilistic Inversion)
By combining the Sum and Product Rules, we obtain Bayes’ Theorem:
Theorem 6.4 Let \(p(x)\) be our initial belief about \(x\) before seeing the data (the prior), \(p(y \mid x)\) the probability of the data \(y\) given \(x\) (the likelihood), and \(p(y)\) the probability of the data \(y\) (the evidence or marginal likelihood). Then the posterior \(p(x \mid y)\), the updated belief about \(x\) after observing \(y\), is given by:
\[
p(x \mid y) =
\frac{p(y \mid x) \, p(x)}{p(y)}
\]
Bayes’ Theorem allows us to invert the relationship between \(x\) and \(y\), making it a cornerstone of Bayesian inference.
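To see how the two rules combine: the product rule factorizes the joint in two ways, \(p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x)\); equating the two factorizations and dividing by \(p(y)\) yields the theorem, while the sum rule expands the evidence in the denominator: \[ p(y) = \sum_x p(y \mid x)\,p(x) \quad \text{(discrete case)}, \qquad p(y) = \int p(y \mid x)\,p(x)\,dx \quad \text{(continuous case)}. \]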
Example 6.15 Suppose a factory produces light bulbs, and 5% of the bulbs are defective. A quality-control test is used with the following accuracy:
- If a bulb is defective, the test correctly identifies it as defective 90% of the time.
- If a bulb is not defective, the test incorrectly labels it as defective 8% of the time.
Let
- \(D\) = the bulb is defective
- \(T\) = the test indicates the bulb is defective
Then, we see that \[ P(D) = 0.05, \quad P(D^c) = 0.95, \quad P(T \mid D) = 0.90, \quad P(T \mid D^c) = 0.08. \]
Using the law of total probability, we can determine the overall probability that the test comes back positive. We apply the sum rule over the two ways a positive test can occur: the test flags a bulb that is actually defective, and the test flags a bulb that is actually not defective. \[ P(T) = P(T \mid D)P(D) + P(T \mid D^c)P(D^c) \]
\[ P(T) = (0.90)(0.05) + (0.08)(0.95) = 0.045 + 0.076 = 0.121 \]
To find the probability of the bulb being defective given that the test says it is defective, \[ P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T)} \]
\[ P(D \mid T) = \frac{(0.90)(0.05)}{0.121} \approx 0.372 \] Therefore, even though the test is fairly accurate, only about 37.2% of bulbs that test defective are actually defective.
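The whole computation fits in a few lines. A minimal sketch (variable names are ours):

```python
# Example 6.15: evidence via the sum rule, posterior via Bayes' theorem.
p_d = 0.05       # P(D): prior probability a bulb is defective
p_t_d = 0.90     # P(T | D): test flags a truly defective bulb
p_t_nd = 0.08    # P(T | D^c): test flags a non-defective bulb

p_t = p_t_d * p_d + p_t_nd * (1 - p_d)  # P(T) = 0.121
posterior = p_t_d * p_d / p_t           # P(D | T)
print(round(posterior, 3))              # 0.372
```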
The posterior combines all available information from both the prior and the observed data. In many applications, such as machine learning, reinforcement learning, and Bayesian statistics, the posterior is the key object of interest. However, in practice, it is often difficult to compute \(p(y)\) exactly due to the integral over all possible \(x\).
Example 6.16 Consider a spam detection model that classifies emails as spam or not spam. Of all email traffic, 80% is not spam and 20% is spam. The word “free” appears in 65% of spam emails and in 10% of non-spam emails. What is the probability that an email is spam given that it contains the word “free”?
The events are:
- Let \(S\) = the email is spam
- Let \(N\) = the email is not spam
- Let \(W\) = the email contains the word “free”
From historical email data: \[ P(S) = 0.20, \quad P(N) = 0.80 \] \[ P(W \mid S) = 0.65 \] \[ P(W \mid N) = 0.10 \]
Using the law of total probability: \[ P(W) = P(W \mid S)P(S) + P(W \mid N)P(N) \]
\[ P(W) = (0.65)(0.20) + (0.10)(0.80) = 0.13 + 0.08 = 0.21 \]
We now compute the probability that an email is spam given that it contains the word “free”: \[ P(S \mid W) = \frac{P(W \mid S)P(S)}{P(W)} \] \[ P(S \mid W) = \frac{(0.65)(0.20)}{0.21} \approx 0.619 \] So, about 62% of all emails that contain the word free are spam emails.
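The same update can be written as “prior times likelihood, then normalize,” a pattern that generalizes directly to more than two hypotheses. A brief sketch (names are ours):

```python
# Example 6.16 as a normalized Bayes update over two hypotheses.
import numpy as np

prior = np.array([0.20, 0.80])       # P(spam), P(not spam)
likelihood = np.array([0.65, 0.10])  # P("free" | spam), P("free" | not spam)

unnormalized = likelihood * prior              # Bayes numerators
posterior = unnormalized / unnormalized.sum()  # divide by evidence P(W)
print(posterior[0])                            # ~0.619 = P(spam | "free")
```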
Exercises
Exercise 6.10
A doctor is called to see a sick child. The doctor has prior information that 90% of sick children in that neighborhood have the flu, while the other 10% are sick with measles. Let \(F\) denote the event that a child is sick with flu and \(M\) the event that a child is sick with measles. Assume for simplicity that \(F \cup M = \Omega\), i.e., that there are no other maladies in that neighborhood. A well-known symptom of measles is a rash (the event of having one we denote \(R\)). The probability of having a rash given measles is \(P(R \mid M) = 0.95\). However, children with flu also occasionally develop a rash; the probability of having a rash given flu is \(P(R \mid F) = 0.08\). Upon examining the child, the doctor finds a rash. What is the probability that the child has measles?
Exercise 6.11
In a study, physicians were asked what the odds of breast cancer would be in a woman who was initially thought to have a 1% risk of cancer but who ended up with a positive mammogram result (a mammogram accurately classifies about 80% of cancerous tumors and 90% of benign tumors). 95 out of 100 physicians estimated the probability of cancer to be about 75%. Do you agree?
Exercise 6.12
Suppose we have 3 cards identical in form except that both sides of the first card are colored red, both sides of the second card are colored black, and one side of the third card is colored red and the other side black. The 3 cards are mixed up in a hat, and 1 card is randomly selected and put down on the ground. If the upper side of the chosen card is colored red, what is the probability that the other side is colored black?
Exercise 6.13
Dangerous fires are rare (1% of the time). Smoke is not rare because of BBQs (10% of the time). We know that 90% of dangerous fires make smoke. What is the probability of a dangerous fire when we see smoke?
Exercise 6.14
Suppose an HIV test is 99% accurate (in both directions) and 0.3% of the population is HIV positive. If someone tests positive for HIV, what is the probability that they are actually positive?
Exercise 6.15
We know that 40% of all spam emails have more exclamation marks than periods. Only 2% of non-spam emails have more exclamation marks than periods. We also know that about 35% of all emails are spam. We just received an email that has more exclamation marks than periods. What is the probability that it is spam?
Exercise 6.16
State and prove Bayes’ theorem.