Introduction to Probability for ML
Probability Distributions and Random Variables
- A discrete random variable $X$ has some finite number of possible outcomes, drawn from a sample space $\Omega$:
- E.g. the sample space for $\text{Fever}$ might be $\{T, F\}$
- The sample space for $\text{Condition}$ might be $\{\text{cold}, \text{flu}, \text{healthy}\}$
- A probability distribution or probability mass function maps each event in the sample space to a probability between 0 and 1, where the total of all probabilities must sum to 1: $P: \Omega \to [0, 1]$ with $\sum_{x \in \Omega} P(X = x) = 1$
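As a concrete sketch (not from the slides), a discrete probability mass function can be stored as a plain Python dict mapping outcomes to probabilities; the outcome names below match the health example used later in these slides:

```python
# Illustrative PMF over the "Condition" variable as a Python dict.
condition_pmf = {"cold": 0.12, "flu": 0.18, "healthy": 0.70}

# Every probability lies in [0, 1] and the total mass sums to 1.
assert all(0.0 <= p <= 1.0 for p in condition_pmf.values())
assert abs(sum(condition_pmf.values()) - 1.0) < 1e-9
```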
Joint Probability Distributions
The joint probability distribution $P(X, Y)$ represents the probability of two events, described by different random variables, happening together:
| Fever | Condition | $P(\text{Fever}, \text{Condition})$ |
|-------|-----------|-------------------------------------|
| T     | cold      | .04 |
| T     | flu       | .14 |
| T     | healthy   | .02 |
| F     | cold      | .08 |
| F     | flu       | .04 |
| F     | healthy   | .68 |
So $P(\text{Fever} = T, \text{Condition} = \text{flu}) = .14$.
This could also be written: $P(\text{fever} \wedge \text{flu}) = .14$
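A minimal sketch of the joint table above as a Python dict keyed by (fever, condition) pairs; the variable names and key layout are illustrative choices, not part of the slides:

```python
# Joint distribution from the table, keyed by (fever, condition).
joint = {
    (True,  "cold"):    0.04,
    (True,  "flu"):     0.14,
    (True,  "healthy"): 0.02,
    (False, "cold"):    0.08,
    (False, "flu"):     0.04,
    (False, "healthy"): 0.68,
}

# The joint probabilities still sum to 1 over the whole table.
assert abs(sum(joint.values()) - 1.0) < 1e-9

# P(fever = T, condition = flu)
print(joint[(True, "flu")])  # 0.14
```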
Marginalization
| Fever | Condition | $P(\text{Fever}, \text{Condition})$ |
|-------|-----------|-------------------------------------|
| T     | cold      | .04 |
| T     | flu       | .14 |
| T     | healthy   | .02 |
| F     | cold      | .08 |
| F     | flu       | .04 |
| F     | healthy   | .68 |
- Given the joint probability distribution, we can use marginalization to retrieve the probability distribution for any individual variable (see the sketch after these examples): $P(X = x) = \sum_{y} P(X = x, Y = y)$
- For example:
- The probability that someone has the flu, regardless of fever: $P(\text{flu}) = P(\text{fever}, \text{flu}) + P(\neg\text{fever}, \text{flu}) = .14 + .04 = .18$
- The probability that someone has a fever, regardless of health: $P(\text{fever}) = .04 + .14 + .02 = .20$
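A small sketch of marginalization over the same joint table, assuming the dict representation used above (names are illustrative):

```python
# Marginalization: sum the joint table over the variable we want to ignore.
joint = {
    (True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
    (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68,
}

def p_condition(condition):
    """P(condition) = sum over fever values of P(fever, condition)."""
    return sum(p for (fever, c), p in joint.items() if c == condition)

def p_fever(fever):
    """P(fever) = sum over conditions of P(fever, condition)."""
    return sum(p for (f, c), p in joint.items() if f == fever)

print(p_condition("flu"))  # 0.14 + 0.04 = 0.18
print(p_fever(True))       # 0.04 + 0.14 + 0.02 = 0.20
```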
Conditional Probability
- Definition: $P(A \mid B) = \dfrac{P(A, B)}{P(B)}$
- For example: $P(\text{flu} \mid \text{fever}) = \dfrac{P(\text{fever}, \text{flu})}{P(\text{fever})} = \dfrac{.14}{.20} = .7$
- It follows that: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$ (the chain rule)
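Using the same illustrative dict representation, the conditional probability above can be computed directly from the table:

```python
# Conditional probability: divide the joint probability by the
# marginal probability of the conditioning event.
joint = {
    (True, "cold"): 0.04, (True, "flu"): 0.14, (True, "healthy"): 0.02,
    (False, "cold"): 0.08, (False, "flu"): 0.04, (False, "healthy"): 0.68,
}

def p_fever(fever):
    return sum(p for (f, c), p in joint.items() if f == fever)

def p_condition_given_fever(condition, fever):
    """P(condition | fever) = P(fever, condition) / P(fever)."""
    return joint[(fever, condition)] / p_fever(fever)

print(p_condition_given_fever("flu", True))  # 0.14 / 0.20 = 0.7
```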
Bayes Theorem / Bayes Rule
Note that $P(B) = \sum_{a} P(B \mid A = a)\,P(A = a)$
(by combining marginalization with the chain rule)
So Bayes rule can be expressed as: $P(A \mid B) = \dfrac{P(B \mid A)\,P(A)}{P(B)} = \dfrac{P(B \mid A)\,P(A)}{\sum_{a} P(B \mid A = a)\,P(A = a)}$
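A sketch of Bayes rule with the marginalized denominator; the priors and likelihoods below are derived from the joint table earlier, and the function and variable names are assumptions made for illustration:

```python
# Bayes rule: P(A = a | B = b), with P(B = b) obtained by marginalizing
# over all values of A using the chain rule.
def bayes(prior, likelihood, a, b):
    """prior[a] = P(A = a); likelihood[a][b] = P(B = b | A = a)."""
    evidence = sum(prior[x] * likelihood[x][b] for x in prior)  # P(B = b)
    return prior[a] * likelihood[a][b] / evidence

# Numbers consistent with the joint table above:
# prior = P(condition), likelihood = P(fever | condition).
prior = {"cold": 0.12, "flu": 0.18, "healthy": 0.70}
likelihood = {
    "cold":    {True: 0.04 / 0.12, False: 0.08 / 0.12},
    "flu":     {True: 0.14 / 0.18, False: 0.04 / 0.18},
    "healthy": {True: 0.02 / 0.70, False: 0.68 / 0.70},
}

print(bayes(prior, likelihood, "flu", True))  # P(flu | fever) = 0.7
```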
Bayes Classifier
We can use Bayes rule to build a classifier: $P(Y \mid X_1, \ldots, X_n) = \dfrac{P(X_1, \ldots, X_n \mid Y)\,P(Y)}{P(X_1, \ldots, X_n)}$
Where $Y$ corresponds to the class label and each $X_i$ is an attribute.
There is a serious problem with this! What is it?
Naive Bayes Classifier
- We assume that the attributes are conditionally independent given the class label, so: $P(X_1, \ldots, X_n \mid Y) = \prod_{i} P(X_i \mid Y)$
- We can also recognize that $P(X_1, \ldots, X_n)$ is the same regardless of class, so dividing by it won't change the class with the largest value.
- This leads to the naive Bayes classifier: $\hat{y} = \arg\max_{y} P(Y = y) \prod_{i} P(X_i \mid Y = y)$
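A minimal naive Bayes sketch; the class set, attributes, and probabilities below are hypothetical illustrative numbers, not taken from the slides:

```python
# Naive Bayes: pick the class y maximizing P(y) * prod_i P(x_i | y).
def naive_bayes_predict(priors, cond_probs, attributes):
    """priors[y] = P(y); cond_probs[y][i][x] = P(X_i = x | y)."""
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for i, x in enumerate(attributes):
            score *= cond_probs[y][i][x]
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Hypothetical two-class, two-attribute toy problem: (fever, cough).
priors = {"flu": 0.18, "healthy": 0.70}
cond_probs = {
    "flu":     [{True: 0.78, False: 0.22}, {True: 0.60, False: 0.40}],
    "healthy": [{True: 0.03, False: 0.97}, {True: 0.10, False: 0.90}],
}
print(naive_bayes_predict(priors, cond_probs, (True, True)))  # "flu"
```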