Introduction to Probability for ML
Probability Distributions and Random Variables
- A discrete random variable $X$ has some finite number of possible outcomes, drawn from a sample space $\Omega$:
- E.g. the sample space for $\text{Fever}$ might be $\{T, F\}$
- The sample space for $\text{Condition}$ might be $\{\text{cold}, \text{flu}, \text{healthy}\}$
- A probability distribution or probability mass function maps each event in the sample space to a probability between 0 and 1, where the total of all probabilities must sum to 1: $P: \Omega \to [0, 1]$ with $\sum_{x \in \Omega} P(X = x) = 1$
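As a concrete sketch (not from the slides), a discrete probability mass function can be stored as a plain Python dict mapping outcomes to probabilities; the outcome names below match the health example used later in these slides:

```python
# Illustrative PMF over the "Condition" variable as a Python dict.
condition_pmf = {"cold": 0.12, "flu": 0.18, "healthy": 0.70}

# Every probability lies in [0, 1] and the total mass sums to 1.
assert all(0.0 <= p <= 1.0 for p in condition_pmf.values())
assert abs(sum(condition_pmf.values()) - 1.0) < 1e-9
```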
Joint Probability Distributions
The joint probability distribution $P(X, Y)$ represents the probability of two events, described by different random variables, happening together:
| Fever | Condition | $P(\text{Fever}, \text{Condition})$ |
|-------|-----------|-------------------------------------|
| T     | cold      | .04 |
| T     | flu       | .14 |
| T     | healthy   | .02 |
| F     | cold      | .08 |
| F     | flu       | .04 |
| F     | healthy   | .68 |
So $P(\text{Fever} = T, \text{Condition} = \text{flu}) = .14$.
This could also be written: $P(\text{fever} \wedge \text{flu}) = .14$
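A minimal sketch of the joint table above as a Python dict keyed by (fever, condition) pairs; the variable names and key layout are illustrative choices, not part of the slides:

```python
# Joint distribution from the table, keyed by (fever, condition).
joint = {
    (True,  "cold"):    0.04,
    (True,  "flu"):     0.14,
    (True,  "healthy"): 0.02,
    (False, "cold"):    0.08,
    (False, "flu"):     0.04,
    (False, "healthy"): 0.68,
}

# The joint probabilities still sum to 1 over the whole table.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# P(fever = T, condition = flu)
print(joint[(True, "flu")])  # 0.14
```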
Marginalization
| Fever | Condition | $P(\text{Fever}, \text{Condition})$ |
|-------|-----------|-------------------------------------|
| T     | cold      | .04 |
| T     | flu       | .14 |
| T     | healthy   | .02 |
| F     | cold      | .08 |
| F     | flu       | .04 |
| F     | healthy   | .68 |
- Given the joint probability distribution, we can use marginalization to retrieve the probability distribution for any individual variable (see the sketch after these examples): $P(X = x) = \sum_{y} P(X = x, Y = y)$
- For example:
- The probability that someone has the flu, regardless of fever: $P(\text{flu}) = P(\text{fever}, \text{flu}) + P(\neg\text{fever}, \text{flu}) = .14 + .04 = .18$
- The probability that someone has a fever, regardless of health: $P(\text{fever}) = .04 + .14 + .02 = .20$
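A small sketch of marginalization over the same joint table, assuming the dict representation used above (names are illustrative):

```python
# Marginalization: sum the joint table over the variable we want to ignore.
joint = {
    (True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
    (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68,
}

def p_condition(condition):
    """P(condition) = sum over fever values of P(fever, condition)."""
    return sum(p for (fever, c), p in joint.items() if c == condition)

def p_fever(fever):
    """P(fever) = sum over conditions of P(fever, condition)."""
    return sum(p for (f, c), p in joint.items() if f == fever)

print(p_condition("flu"))  # 0.14 + 0.04 = 0.18
print(p_fever(True))       # 0.04 + 0.14 + 0.02 = 0.20
```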
Conditional Probability
- Definition: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$
- For example: $P(\text{flu} \mid \text{fever}) = \dfrac{P(\text{fever}, \text{flu})}{P(\text{fever})} = \dfrac{.14}{.20} = .7$
- It follows that: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$ (the chain rule)
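Using the same illustrative dict representation, the conditional probability above can be computed directly from the table:

```python
# Conditional probability: divide the joint probability by the
# marginal probability of the conditioning event.
joint = {
    (True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
    (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68,
}

def p_fever(fever):
    return sum(p for (f, c), p in joint.items() if f == fever)

def p_condition_given_fever(condition, fever):
    """P(condition | fever) = P(fever, condition) / P(fever)."""
    return joint[(fever, condition)] / p_fever(fever)

print(p_condition_given_fever("flu", True))  # 0.14 / 0.20 = 0.7
```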
Bayes Theorem / Bayes Rule
Note that $P(B) = \sum_{a} P(B \mid A = a)\,P(A = a)$
(by combining marginalization with the chain rule)
So Bayes rule can be expressed as: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)} = \dfrac{P(B \mid A)\,P(A)}{\sum_{a} P(B \mid A = a)\,P(A = a)}$
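A sketch of Bayes rule with the marginalized denominator; the priors and likelihoods below are derived from the joint table earlier, and the function and variable names are assumptions made for illustration:

```python
# Bayes rule: P(A = a | B = b), with P(B = b) obtained by marginalizing
# over all values of A using the chain rule.
def bayes(prior, likelihood, a, b):
    """prior[a] = P(A = a); likelihood[a][b] = P(B = b | A = a)."""
    evidence = sum(prior[x] * likelihood[x][b] for x in prior)  # P(B = b)
    return prior[a] * likelihood[a][b] / evidence

# Numbers consistent with the joint table above:
# prior = P(condition), likelihood = P(fever | condition).
prior = {"cold": 0.12, "flu": 0.18, "healthy": 0.70}
likelihood = {
    "cold":    {True: 0.04 / 0.12, False: 0.08 / 0.12},
    "flu":     {True: 0.14 / 0.18, False: 0.04 / 0.18},
    "healthy": {True: 0.02 / 0.70, False: 0.68 / 0.70},
}

print(bayes(prior, likelihood, "flu", True))  # P(flu | fever) = 0.7
```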
Bayes Classifier
We can use Bayes rule to build a classifier: $P(Y \mid X_1, \ldots, X_n) = \dfrac{P(X_1, \ldots, X_n \mid Y)\,P(Y)}{P(X_1, \ldots, X_n)}$
Where $Y$ corresponds to the class label and each $X_i$ is an attribute.
There is a serious problem with this! What is it?
Naive Bayes Classifier
- We assume that the attributes are conditionally independent given the class label, so: $P(X_1, \ldots, X_n \mid Y) = \prod_{i} P(X_i \mid Y)$
- We can also recognize that $P(X_1, \ldots, X_n)$ is the same regardless of class, so dividing by it won't change the class with the largest value.
- This leads to the naive Bayes classifier: $\hat{y} = \arg\max_{y} P(Y = y) \prod_{i} P(X_i \mid Y = y)$
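A minimal naive Bayes sketch; the class set, attributes, and probabilities below are hypothetical illustrative numbers, not taken from the slides:

```python
# Naive Bayes: pick the class y maximizing P(y) * prod_i P(x_i | y).
def naive_bayes_predict(priors, cond_probs, attributes):
    """priors[y] = P(y); cond_probs[y][i][x] = P(X_i = x | y)."""
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, x in enumerate(attributes):
            score *= cond_probs[y][i][x]
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Hypothetical two-class, two-attribute toy problem: (fever, cough).
priors = {"flu": 0.18, "healthy": 0.70}
cond_probs = {
    "flu":     [{True: 0.78, False: 0.22}, {True: 0.60, False: 0.40}],
    "healthy": [{True: 0.03, False: 0.97}, {True: 0.10, False: 0.90}],
}
print(naive_bayes_predict(priors, cond_probs, (True, True)))  # "flu"
```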