JMU CS 445: Introduction to Probability for ML

Probability Theory - Basic Terminology

  • Sample Space (Ω): The set of all possible outcomes

    • Example: When rolling a die, Ω = {1, 2, 3, 4, 5, 6}
  • Outcome: A single result

    • Example: Rolling a 3
  • Event: A subset of the sample space (a collection of one or more outcomes)

    • Example: "Rolling an even-numbered face"
    • Example: "Rolling a face numbered greater than "
    • Example: "Rolling face 3"

Probability

Probability is a numerical measure of the likelihood that an event will occur.

A probability function (or probability measure) is a function that assigns a number between 0 and 1 to each event in the sample space, where:

  • P(A) = 0 indicates the event A is impossible
  • P(A) = 1 indicates the event A is certain
  • Values between 0 and 1 indicate varying degrees of likelihood

Interpretation:

  • P(A) = 0.2 means event A has a 20% chance of occurring
  • P(A) = 0.8 means event A has an 80% chance of occurring

Examples with a fair six-sided die:

  • P({7}) = 0 (technically, we would need to add 7 to the sample space for this to make sense)
  • P({3}) = 1/6 ≈ 0.167 (16.7% chance)
  • P({1, 2, 3, 4, 5, 6}) = P(Ω) = 1 (certain)

Laws of Probability

  1. Non-negativity: For any event A, P(A) ≥ 0

    • Example: P({2, 4, 6}) = 1/2 ≥ 0
  2. Total Probability: The probability of the sample space is 1, P(Ω) = 1

    • Example: P({1, 2, 3, 4, 5, 6}) = 1
  3. Additivity: For mutually exclusive events A and B, P(A ∪ B) = P(A) + P(B)

    • Example: P({1} ∪ {2}) = P({1}) + P({2}) = 1/6 + 1/6 = 1/3
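These laws are easy to check numerically. Below is a small Python sketch of the fair-die model (the helper name prob and the particular events are ours, chosen only to illustrate the three laws):

    from fractions import Fraction

    # Fair six-sided die: every face gets probability 1/6.
    omega = {1, 2, 3, 4, 5, 6}
    p = {face: Fraction(1, 6) for face in omega}

    def prob(event):
        """Probability of an event, i.e. a subset of the sample space."""
        return sum(p[outcome] for outcome in event)

    evens, odds = {2, 4, 6}, {1, 3, 5}

    assert prob(evens) >= 0                                 # 1. Non-negativity
    assert prob(omega) == 1                                 # 2. Total probability
    assert prob(evens | odds) == prob(evens) + prob(odds)   # 3. Additivity (disjoint events)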

Random Variables

Definition: A random variable is a function that assigns a value (numeric or categorical) to each outcome in the sample space Ω.

Examples:

  • Die roll - Parity: If Ω = {1, 2, 3, 4, 5, 6}, then random variable X could map:

    • X(1) = X(3) = X(5) = odd and X(2) = X(4) = X(6) = even
  • Die roll - Numerical: Same sample space, but random variable Y maps each outcome to its number of dots:

    • Y(1) = 1, Y(2) = 2, ..., Y(6) = 6
    • This gives us a numerical random variable with range {1, 2, 3, 4, 5, 6}

Range and Events:

  • The range of a random variable is the set of possible values it can take:
    • Parity RV X: range is {odd, even} (categorical)
    • Numerical RV Y: range is {1, 2, 3, 4, 5, 6} (numerical)
  • For any value x, the event X = x is the subset of Ω where the RV takes the value x:
    • X = odd is the event {1, 3, 5}, the outcomes where the die shows an odd number
    • Y = 4 is the event {4}, the outcome where the die shows the face with 4 dots
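In code, a finite random variable is just a mapping from outcomes to values. Here is a small sketch of the parity RV X and the numerical RV Y described above (the dictionary representation is one convenient choice, not the only one):

    omega = {1, 2, 3, 4, 5, 6}

    # Parity RV X: maps each face to a categorical value.
    X = {1: "odd", 2: "even", 3: "odd", 4: "even", 5: "odd", 6: "even"}

    # Numerical RV Y: maps each face to its number of dots.
    Y = {face: face for face in omega}

    # The event "X = odd" is the subset of outcomes that X maps to "odd".
    event_X_odd = {o for o in omega if X[o] == "odd"}
    print(sorted(event_X_odd))   # [1, 3, 5]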

Probability Mass Function (PMF)

A Probability Mass Function (PMF) assigns probabilities to the values of a discrete random variable.

Definition: For a discrete random variable X, the PMF is P_X(x) = P(X = x), the probability that X takes the value x.

Requirements: A valid PMF must satisfy:

  1. Non-negativity: P_X(x) ≥ 0 for all x
  2. Normalization: Σ_x P_X(x) = 1 (probabilities sum to 1)

Example: For our die parity random variable X:

P_X(odd) = 1/2, P_X(even) = 1/2

Verification: 1/2 + 1/2 = 1 ✓ and both probabilities are ≥ 0 ✓
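A PMF over a finite range can be stored as a dictionary from values to probabilities. This short sketch checks both requirements for the parity example (the tolerance guards against floating-point rounding):

    # PMF for the die-parity random variable X.
    pmf_X = {"odd": 0.5, "even": 0.5}

    # Requirement 1: non-negativity.
    assert all(p >= 0 for p in pmf_X.values())

    # Requirement 2: normalization (the probabilities sum to 1).
    assert abs(sum(pmf_X.values()) - 1.0) < 1e-9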

Disease and Fever Example

Consider a medical scenario where we observe a single patient:

Experiment: Examine one patient and record their disease status and fever status

Sample Space: Ω = {(cold, fever), (cold, no fever), (flu, fever), (flu, no fever), (healthy, fever), (healthy, no fever)}

Disease Random Variable (Y):

  • Y = flu is shorthand for the event {(flu, fever), (flu, no fever)}

Fever Random Variable (X):

  • X = T is shorthand for the event {(cold, fever), (flu, fever), (healthy, fever)}

Joint Probability Distributions

The joint probability distribution represents the probability of two events, described by different random variables, happening together:

X (Fever)   Y (Disease)   P(X, Y)
T           cold          .04
T           flu           .14
T           healthy       .02
F           cold          .08
F           flu           .04
F           healthy       .68

So P(X = T, Y = flu) = .14.

This could also be written: P(X = T ∧ Y = flu) = .14

Notice that the six rows of the table sum to 1.
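A joint distribution over two discrete RVs can be stored as a dictionary keyed by value pairs. The sketch below transcribes the table above (True/False stand in for the T/F fever values) and confirms that the entries form a valid distribution:

    # Joint distribution P(X, Y): keys are (fever, disease) pairs.
    joint = {
        (True,  "cold"):    0.04,
        (True,  "flu"):     0.14,
        (True,  "healthy"): 0.02,
        (False, "cold"):    0.08,
        (False, "flu"):     0.04,
        (False, "healthy"): 0.68,
    }

    # The six entries sum to 1.
    assert abs(sum(joint.values()) - 1.0) < 1e-9

    print(joint[(True, "flu")])   # P(X = T, Y = flu) = 0.14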

"Learning" a Probability Distributions

We can learn a joint probability distribution from data by counting occurrences and computing relative frequencies.

Example: Suppose we examine 10 patients and observe:

Data

Patient   Disease   Fever
1         cold      yes
2         healthy   no
3         flu       yes
4         healthy   no
5         cold      no
6         flu       yes
7         healthy   yes
8         flu       no
9         cold      yes
10        healthy   no

Count each combination:

  • (cold, yes fever): 2 patients
  • (cold, no fever): 1 patient
  • (flu, yes fever): 2 patients
  • (flu, no fever): 1 patient
  • (healthy, yes fever): 1 patient
  • (healthy, no fever): 3 patients

Compute probabilities: divide each count by the total number of patients (10):

Estimated Joint Distribution

Fever   Disease   P(X, Y)
T       cold      0.2
T       flu       0.2
T       healthy   0.1
F       cold      0.1
F       flu       0.1
F       healthy   0.3
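The counting procedure is easy to automate. A minimal sketch (the list of records is simply the patient table above transcribed as (disease, fever) pairs):

    from collections import Counter

    # One (disease, fever) observation per patient, from the table above.
    data = [
        ("cold", "yes"), ("healthy", "no"),  ("flu", "yes"), ("healthy", "no"),
        ("cold", "no"),  ("flu", "yes"),     ("healthy", "yes"), ("flu", "no"),
        ("cold", "yes"), ("healthy", "no"),
    ]

    counts = Counter(data)                                   # count each combination
    joint_hat = {pair: c / len(data) for pair, c in counts.items()}

    print(joint_hat[("cold", "yes")])    # 0.2, matching the estimated table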

Notation: Events vs. Random Variables

As we move from specific examples to general probability identities, we shift our notation:

  • Event-based notation:

    • P(X = x, Y = y) refers to the probability of a specific event: the random variables X and Y taking on the values x and y.
  • Variable-based notation:

    • P(X, Y) refers to the joint distribution of the random variables X and Y: a function that assigns a probability to every combination of values.

This shift allows us to write general identities, such as the chain rule P(X, Y) = P(X | Y) P(Y), that hold for all combinations of values.

Marginalization

X (Fever)   Y (Disease)   P(X, Y)
T           cold          .04
T           flu           .14
T           healthy       .02
F           cold          .08
F           flu           .04
F           healthy       .68
  • Given the joint probability distribution we can use marginalization to retrieve the probability distribution for any individual variable: P(X = x) = Σ_y P(X = x, Y = y)
  • For example:
    • The probability that someone has the flu, regardless of fever: P(Y = flu) = P(X = T, Y = flu) + P(X = F, Y = flu) = .14 + .04 = .18
    • What is the probability that someone has a fever, regardless of health? P(X = T) = P(X = T, Y = cold) + P(X = T, Y = flu) + P(X = T, Y = healthy) = .04 + .14 + .02 = .20
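Marginalization in code amounts to summing the joint entries that match the value we care about. A sketch using the dictionary representation of the joint table (the helper names are ours):

    joint = {
        (True, "cold"): 0.04,  (True, "flu"): 0.14,  (True, "healthy"): 0.02,
        (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68,
    }

    def p_disease(disease):
        """Marginal P(Y = disease): sum over both fever values."""
        return sum(p for (fever, d), p in joint.items() if d == disease)

    def p_fever(fever):
        """Marginal P(X = fever): sum over all three diseases."""
        return sum(p for (f, d), p in joint.items() if f == fever)

    print(p_disease("flu"))   # 0.14 + 0.04 = 0.18
    print(p_fever(True))      # 0.04 + 0.14 + 0.02 = 0.20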

Conditional Probability

  • Definition: P(X | Y) = P(X, Y) / P(Y)
  • For example: P(Y = flu | X = T) = P(X = T, Y = flu) / P(X = T) = .14 / .20 = .70
  • It follows that:
    • P(X, Y) = P(X | Y) P(Y) (the chain rule)
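The same computation in code, reusing the joint table from above (again, the function names are ours, for illustration only):

    joint = {
        (True, "cold"): 0.04,  (True, "flu"): 0.14,  (True, "healthy"): 0.02,
        (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68,
    }

    def p_fever(fever):
        """Marginal P(X = fever)."""
        return sum(p for (f, _), p in joint.items() if f == fever)

    def p_disease_given_fever(disease, fever):
        """Conditional P(Y = disease | X = fever) = P(X, Y) / P(X)."""
        return joint[(fever, disease)] / p_fever(fever)

    print(p_disease_given_fever("flu", True))   # 0.14 / 0.20 = 0.70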

Independence and Conditional Independence

  • Two random variables X and Y are independent if and only if: P(X, Y) = P(X) P(Y)
  • X and Y are conditionally independent given Z if and only if: P(X, Y | Z) = P(X | Z) P(Y | Z)

Bayes Theorem / Bayes Rule

Note that P(X) = Σ_y P(X | Y = y) P(Y = y)

(by combining marginalization with the chain rule)

So Bayes rule can be expressed as:

P(Y | X) = P(X | Y) P(Y) / P(X) = P(X | Y) P(Y) / Σ_y P(X | Y = y) P(Y = y)
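As a concrete check, we can recover P(Y = flu | X = T) from the prior P(Y) and the likelihoods P(X = T | Y), with the denominator computed by the marginalization identity above. The numbers below are read off (or derived from) the joint table; this is a minimal sketch, not a general routine:

    diseases = ["cold", "flu", "healthy"]

    prior = {"cold": 0.12, "flu": 0.18, "healthy": 0.70}    # P(Y), the disease marginals
    like = {"cold": 0.04 / 0.12,                            # P(X = T | Y = y)
            "flu": 0.14 / 0.18,
            "healthy": 0.02 / 0.70}

    # Denominator: P(X = T) = sum over y of P(X = T | Y = y) P(Y = y)
    p_fever = sum(like[d] * prior[d] for d in diseases)

    # Bayes rule: P(Y = flu | X = T)
    posterior_flu = like["flu"] * prior["flu"] / p_fever
    print(round(posterior_flu, 2))   # 0.7, matching the conditional computed earlier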

Bayes Classifier

We can use Bayes rule to build a classifier:

y* = argmax_y P(Y = y | X_1 = x_1, ..., X_n = x_n) = argmax_y P(X_1 = x_1, ..., X_n = x_n | Y = y) P(Y = y) / P(X_1 = x_1, ..., X_n = x_n)

Where Y corresponds to the class label and each X_i is an attribute.

There is a serious problem with this! What is it?

Naive Bayes Classifier

  • We assume that the attributes are conditionally independent given the class label, so: P(X_1, ..., X_n | Y) = P(X_1 | Y) × P(X_2 | Y) × ... × P(X_n | Y)
  • We can also recognize that P(X_1, ..., X_n) is the same regardless of class, so dividing by it won't change which class has the largest value.
  • This leads to the naive Bayes classifier: y* = argmax_y P(Y = y) × P(X_1 = x_1 | Y = y) × ... × P(X_n = x_n | Y = y)

P(x) = .2         P(~x) = .8
P(c | x) = .2     P(c | ~x) = .1
P(f | x) = .7     P(f | ~x) = .05
P(h | x) = .1     P(h | ~x) = .85
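One way to read these numbers is as a tiny naive Bayes problem: x and ~x are the two class labels, and c, f, h are the possible values of a single attribute (this reading is an assumption on our part). Under it, the classifier picks the class that maximizes P(class) × P(value | class):

    # Assumed reading of the table above: x / ~x are classes,
    # c, f, h are the values of one attribute.
    prior = {"x": 0.2, "~x": 0.8}
    likelihood = {
        "x":  {"c": 0.2, "f": 0.7,  "h": 0.1},
        "~x": {"c": 0.1, "f": 0.05, "h": 0.85},
    }

    def classify(value):
        """Naive Bayes decision: argmax over classes of P(class) * P(value | class)."""
        return max(prior, key=lambda cls: prior[cls] * likelihood[cls][value])

    for v in ["c", "f", "h"]:
        print(v, "->", classify(v))
    # c -> ~x   (0.2 * 0.2 = 0.04  vs  0.8 * 0.1  = 0.08)
    # f -> x    (0.2 * 0.7 = 0.14  vs  0.8 * 0.05 = 0.04)
    # h -> ~x   (0.2 * 0.1 = 0.02  vs  0.8 * 0.85 = 0.68)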