Naive Bayes

"Training" a Naive Bayes Classifier

  • Recall the naive Bayes classifier:

      ŷ = argmax_c P(C = c) ∏_i P(A_i = a_i | C = c)

  • To perform classification we need the:

    • Class priors: P(C = c)
    • Class-conditional attribute probabilities: P(A_i = a_i | C = c), for every attribute A_i and value a_i.
  • These were the tallies from our in-class exercise:

    Spy Golfer Fedora Count
    T T T 1
    T T F 3
    T F T 1
    T F F 0
    F T T 4
    F T F 3
    F F T 6
    F F F 2

Relevant Distributions

  • From this we can easily estimate our priors:

      P(Spy = T) = 5/20 = .25,  P(Spy = F) = 15/20 = .75
  • We can also calculate the (full) class-conditional probability distributions:
    Golfer Fedora P(Golfer, Fedora | Spy = True)
    T      T      1/5 = .2
    T      F      3/5 = .6
    F      T      1/5 = .2
    F      F      0/5 = 0

    Golfer Fedora P(Golfer, Fedora | Spy = False)
    T      T      4/15 ≈ .27
    T      F      3/15 = .2
    F      T      6/15 = .4
    F      F      2/15 ≈ .13
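
A minimal sketch (not from the slides) of these estimates in Python; the names (tallies, priors, joint) are made up for illustration:

```python
from collections import defaultdict

# (Spy, Golfer, Fedora) -> count, copied from the tally table above
tallies = {
    (True,  True,  True):  1,
    (True,  True,  False): 3,
    (True,  False, True):  1,
    (True,  False, False): 0,
    (False, True,  True):  4,
    (False, True,  False): 3,
    (False, False, True):  6,
    (False, False, False): 2,
}

total = sum(tallies.values())        # 20 observations
class_counts = defaultdict(int)      # observations per class
for (spy, _, _), n in tallies.items():
    class_counts[spy] += n           # 5 spies, 15 non-spies

# Class priors: P(Spy = s)
priors = {s: n / total for s, n in class_counts.items()}
print(priors)  # {True: 0.25, False: 0.75}

# Full class-conditional joint: P(Golfer = g, Fedora = f | Spy = s)
joint = {(s, g, f): n / class_counts[s] for (s, g, f), n in tallies.items()}
print(joint[(True, True, True)])    # 0.2 = 1/5
print(joint[(False, False, True)])  # 0.4 = 6/15
```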

Naive Bayes Distributions

  • Under the naive assumption we only need per-attribute conditionals, found by summing the tally counts over the other attribute:

    Golfer P(Golfer | Spy = True)
    T      4/5 = .8
    F      1/5 = .2

    Golfer P(Golfer | Spy = False)
    T      7/15 ≈ .47
    F      8/15 ≈ .53

    Fedora P(Fedora | Spy = True)
    T      2/5 = .4
    F      3/5 = .6

    Fedora P(Fedora | Spy = False)
    T      10/15 ≈ .67
    F      5/15 ≈ .33
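
Continuing the sketch above (and assuming the tallies and class_counts it defined), these naive tables fall out of a one-line marginalization:

```python
# P(attribute = value | Spy = spy); attr_index 1 = Golfer, 2 = Fedora
def conditional(attr_index, value, spy):
    count = sum(n for key, n in tallies.items()
                if key[0] == spy and key[attr_index] == value)
    return count / class_counts[spy]

print(conditional(1, True, True))   # P(Golfer = T | Spy = T) = 4/5 = 0.8
print(conditional(2, True, False))  # P(Fedora = T | Spy = F) = 10/15 ≈ 0.67
```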

Performing Classification (Non-Naive):

  • Assume we have a suspect who is a golfer and who wears a fez (not a fedora, so Fedora = F): are they a spy?
    • We could apply (non-naive) Bayes rule, reading P(Golfer, Fedora | Spy) off the full joint tables:

      P(Spy = T | G = T, F = F) = P(G = T, F = F | Spy = T) P(Spy = T) / Σ_s P(G = T, F = F | Spy = s) P(Spy = s)
                                = (.6)(.25) / [(.6)(.25) + (.2)(.75)]
                                = .15 / .30 = .5
Performing Classification (Naive-Bayes):

  • Under the naive assumption the attributes are conditionally independent given the class, so we use the per-attribute tables instead:

      P(Spy = T) P(G = T | Spy = T) P(F = F | Spy = T) = .25 × .8 × .6 = .12
      P(Spy = F) P(G = T | Spy = F) P(F = F | Spy = F) = .75 × 7/15 × 5/15 ≈ .117

    • Normalizing: P(Spy = T | G = T, F = F) = .12 / (.12 + .117) ≈ .51, close to the non-naive answer here.
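
A sketch of both rules side by side, assuming the priors, joint, and conditional helpers from the earlier snippets:

```python
# Posterior P(Spy = T | golfer, fedora) from the full joint tables
def non_naive_posterior(golfer, fedora):
    num = joint[(True, golfer, fedora)] * priors[True]
    den = num + joint[(False, golfer, fedora)] * priors[False]
    return num / den

# Same posterior under the naive (conditional independence) assumption
def naive_posterior(golfer, fedora):
    scores = {s: priors[s]
                 * conditional(1, golfer, s)
                 * conditional(2, fedora, s)
              for s in (True, False)}
    return scores[True] / (scores[True] + scores[False])

print(non_naive_posterior(True, False))  # 0.5
print(naive_posterior(True, False))      # ≈ 0.507
```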

Properties of Naive-Bayes

  • Pros:
    • Provides a meaningful class probability, not just a class label
    • Works in the face of missing attributes (just don't include them in the calculation; see the sketch after this list)
    • Relatively easy to interpret: we can examine the class-conditional probabilities for individual attributes.
  • Cons:
    • Classification performance may be worse than that of other classifiers: most real classification tasks violate the independence assumption to some extent.
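
As a sketch of the missing-attribute point above (again assuming the earlier priors and conditional helpers), omitting an unobserved attribute just drops its factor from the product:

```python
# None marks an attribute we did not observe for this suspect
def naive_posterior_partial(golfer=None, fedora=None):
    scores = {}
    for s in (True, False):
        p = priors[s]
        if golfer is not None:
            p *= conditional(1, golfer, s)
        if fedora is not None:
            p *= conditional(2, fedora, s)
        scores[s] = p
    return scores[True] / (scores[True] + scores[False])

print(naive_posterior_partial(golfer=True))  # headwear unknown: ≈ 0.36
```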

Implementation Issues

  • Naive Bayes classifier:

      ŷ = argmax_c P(c) ∏_i P(a_i | c)

  • Each P(a_i | c) is less than 1.
  • What is the product of 100 such values? Of 1,000? (Small enough to underflow floating-point precision.)
  • Recall that log(ab) = log(a) + log(b)
  • Also, the log function is monotonic: if a > b then log(a) > log(b)
  • So, practical implementations generally work with logs:

      ŷ = argmax_c [ log P(c) + Σ_i log P(a_i | c) ]
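
A sketch of the log-space version, reusing the hypothetical helpers from the earlier snippets:

```python
import math

# Same decision as naive_posterior, computed as a sum of logs
# (math.log(0) raises an error -- one more reason zeros are a problem; see below)
def naive_log_scores(golfer, fedora):
    return {s: math.log(priors[s])
               + math.log(conditional(1, golfer, s))
               + math.log(conditional(2, fedora, s))
            for s in (True, False)}

scores = naive_log_scores(True, False)
print(max(scores, key=scores.get))  # True: the same argmax as before
```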

Implementation Issues

  • How to handle zeros for some attributes?

    • If P(a_i | c) = 0 for some attribute value a_i, then the entire product becomes 0
    • This means P(c | a_1, ..., a_n) = 0 regardless of other evidence
    • Problem: A single zero probability can dominate the classification
  • Solution: Laplace Smoothing (Add-one smoothing)

    • Instead of: P(a_i | c) = count(a_i, c) / count(c)
    • Use: P(a_i | c) = (count(a_i, c) + 1) / (count(c) + K)
    • Where K is the number of possible values for attribute A_i
  • Example: If we never saw "Golfer=True, Spy=True" in training:

    • Without smoothing: P(Golfer = T | Spy = T) = 0/5 = 0
    • With Laplace (K = 2): P(Golfer = T | Spy = T) = (0 + 1) / (5 + 2) = 1/7 ≈ .14
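
A sketch of the smoothed estimate, again using the tallies and class_counts defined earlier:

```python
# Add-one smoothed estimate: (count + 1) / (class count + K)
def smoothed_conditional(attr_index, value, spy, k=2):
    count = sum(n for key, n in tallies.items()
                if key[0] == spy and key[attr_index] == value)
    return (count + 1) / (class_counts[spy] + k)

# The slide's hypothetical zero count would become (0 + 1) / (5 + 2) ≈ 0.14;
# with the actual tallies, smoothing just shrinks the estimate toward 1/2:
print(smoothed_conditional(1, True, True))  # (4 + 1) / (5 + 2) ≈ 0.71
```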