JMU CS 445: Introduction to Probability for ML

Probability Distributions and Random Variables

  • A discrete random variable has some finite number of outcomes drawn from a sample space:
    • E.g. the sample space for Fever might be {T, F}
    • The sample space for Disease might be {cold, flu, healthy}
  • A probability distribution or probability mass function maps each outcome in the sample space to a probability between 0 and 1, where the probabilities must sum to 1:
    • Disease: P(Disease = cold) = .12, P(Disease = flu) = .18, P(Disease = healthy) = .70
      • Or, more concisely: P(Disease) = <.12, .18, .70>
    • Fever: P(Fever = T) = .2, P(Fever = F) = .8
      • Or, more concisely: P(Fever) = <.2, .8>
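
As a minimal sketch (using the values above; the dictionary names are made up for illustration), these distributions can be written as Python dictionaries and checked for validity:

```python
# Hypothetical representation of the two distributions above as Python dicts.
P_fever = {True: 0.2, False: 0.8}
P_disease = {"cold": 0.12, "flu": 0.18, "healthy": 0.70}

# A valid probability mass function assigns values in [0, 1] that sum to 1.
for dist in (P_fever, P_disease):
    assert all(0.0 <= p <= 1.0 for p in dist.values())
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```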

Joint Probability Distributions

The joint probability distribution represents the probability of two events, described by different random variables, happening together. In the running example, X is the fever variable (T or F) and Y is the disease (cold, flu, or healthy):

X Y P(X,Y)
T cold .04
T flu .14
T healthy .02
F cold .08
F flu .04
F healthy .68

So, reading from the table, P(X = T, Y = flu) = .14.

This could also be written more compactly as P(T, flu) = .14, with the variable names left implicit.
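
A small sketch of the same joint table as a Python dictionary (keys are (fever, disease) pairs; the name P_joint is made up):

```python
# Joint distribution P(X, Y) from the table above, keyed by (fever, disease).
P_joint = {(True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
           (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68}

assert abs(sum(P_joint.values()) - 1.0) < 1e-9  # joint probabilities sum to 1
print(P_joint[(True, "flu")])                   # P(X=T, Y=flu) = 0.14
```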

Marginalization

X Y P(X,Y)
T cold .04
T flu .14
T healthy .02
F cold .08
F flu .04
F healthy .68
  • Given the joint probability distribution, we can use marginalization to retrieve the probability distribution for any individual variable: P(X = x) = Σ_y P(X = x, Y = y)
  • For example:
    • The probability that someone has the flu, regardless of fever: P(Y = flu) = P(T, flu) + P(F, flu) = .14 + .04 = .18

    • What is the probability that someone has a fever, regardless of health? (Both computations are sketched below.)
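
Here is a sketch of both marginalizations over the joint table above (reusing the hypothetical P_joint dictionary):

```python
P_joint = {(True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
           (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68}

# P(Y=flu): sum the joint over every value of X.
p_flu = sum(p for (fever, disease), p in P_joint.items() if disease == "flu")
print(p_flu)    # 0.14 + 0.04 = 0.18

# P(X=T): sum the joint over every value of Y.
p_fever = sum(p for (fever, disease), p in P_joint.items() if fever)
print(p_fever)  # 0.04 + 0.14 + 0.02 = 0.20
```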

Conditional Probability

  • Definition: P(X | Y) = P(X, Y) / P(Y)
  • For example: P(Y = flu | X = T) = P(T, flu) / P(X = T) = .14 / .20 = .7
  • It follows that: P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)
    • (the chain rule)
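
A sketch of the definition applied to the running example (again assuming the P_joint dictionary):

```python
P_joint = {(True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
           (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68}

# P(Y=flu | X=T) = P(X=T, Y=flu) / P(X=T)
p_fever = sum(p for (fever, disease), p in P_joint.items() if fever)  # 0.20
print(P_joint[(True, "flu")] / p_fever)  # 0.14 / 0.20 = 0.7
```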

Independence and Conditional Independence

  • Two random variables X and Y are independent if and only if: P(X, Y) = P(X) P(Y)
  • X and Y are conditionally independent given Z, if and only if: P(X, Y | Z) = P(X | Z) P(Y | Z)
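
For instance, fever and disease in the running example are not independent; a quick numeric check (a sketch, reusing the hypothetical P_joint dictionary):

```python
P_joint = {(True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
           (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68}

p_fever = sum(p for (f, d), p in P_joint.items() if f)         # P(X=T) = 0.20
p_flu = sum(p for (f, d), p in P_joint.items() if d == "flu")  # P(Y=flu) = 0.18

# Independence would require P(X=T, Y=flu) == P(X=T) * P(Y=flu).
print(P_joint[(True, "flu")], p_fever * p_flu)  # 0.14 vs 0.036, so not independent
```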

Bayes Theorem / Bayes Rule

Bayes rule lets us compute one conditional probability from the other: P(Y | X) = P(X | Y) P(Y) / P(X). Note that P(X) = Σ_y P(X, Y = y) = Σ_y P(X | Y = y) P(Y = y)

(by combining marginalization with the chain rule)

So Bayes rule can be expressed as: P(Y | X) = P(X | Y) P(Y) / Σ_y P(X | Y = y) P(Y = y)
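
A sketch of Bayes rule in code, computing P(flu | fever = T) with the denominator obtained by marginalization (the prior and likelihood values are derived from the joint table above):

```python
# Prior P(Y) and likelihood P(X=T | Y), derived from the joint table above.
prior = {"cold": 0.12, "flu": 0.18, "healthy": 0.70}
likelihood = {"cold": 0.04 / 0.12, "flu": 0.14 / 0.18, "healthy": 0.02 / 0.70}

# Denominator P(X=T), by combining marginalization with the chain rule.
p_fever = sum(likelihood[y] * prior[y] for y in prior)  # 0.20

# Bayes rule: P(Y=flu | X=T) = P(X=T | Y=flu) P(Y=flu) / P(X=T)
print(likelihood["flu"] * prior["flu"] / p_fever)  # 0.7
```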

Bayes Classifier

We can use Bayes rule to build a classifier: predict the class y that maximizes P(Y = y | x_1, ..., x_n) = P(x_1, ..., x_n | Y = y) P(Y = y) / P(x_1, ..., x_n)

Where Y corresponds to the class label and each x_i is an attribute.

There is a serious problem with this! What is it?

Naive Bayes Classifier

  • We assume that the attributes are conditionally independent given the class label, so: P(x_1, ..., x_n | Y) = P(x_1 | Y) P(x_2 | Y) ... P(x_n | Y)
  • We can also recognize that P(x_1, ..., x_n) is the same regardless of class, so dividing by it won't change which class has the largest value.
  • This leads to the naive Bayes classifier: predict the class y that maximizes P(Y = y) P(x_1 | Y = y) P(x_2 | Y = y) ... P(x_n | Y = y)

P(x) = .2        P(~x) = .8
P(c | x) = .2    P(c | ~x) = .1
P(f | x) = .7    P(f | ~x) = .05
P(h | x) = .1    P(h | ~x) = .85
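
One possible reading of the numbers above (an assumption made only for illustration): x and ~x are the two class labels, and c, f, and h are possible observed attribute values. Under that reading, a minimal naive Bayes sketch looks like this:

```python
# Assumed interpretation: x / ~x are class labels; c, f, h are observations.
prior = {"x": 0.2, "~x": 0.8}
likelihood = {"x":  {"c": 0.2, "f": 0.7,  "h": 0.1},
              "~x": {"c": 0.1, "f": 0.05, "h": 0.85}}

def naive_bayes_predict(observed):
    """Return the class with the largest P(class) * product of P(obs | class)."""
    scores = {}
    for label in prior:
        score = prior[label]
        for obs in observed:
            score *= likelihood[label][obs]
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical query: which class is more likely if f is observed?
print(naive_bayes_predict(["f"]))  # 'x' wins: 0.2 * 0.7 = 0.14 vs 0.8 * 0.05 = 0.04
```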