Naive Bayes Examples and Discussion
“Training” a Naive Bayes Classifier
- Recall the naive Bayes classifier:
- \[P(Y \mid X_1, X_2, ..., X_d) \propto
P(Y)\prod_{i=1}^d P(X_i \mid Y)\]
- To perform classification we need:
- Class priors: \(P(Y)\)
- Class-conditional attribute probabilities: \(P(X_i \mid Y)\) for all \(i\).
- These were the tallies from our in-class exercise:
| \(Spy\) | \(Golfer\) | \(Fedora\) | Count |
|---|---|---|---|
| T | T | T | 1 |
| T | T | F | 3 |
| T | F | T | 1 |
| T | F | F | 0 |
| F | T | T | 4 |
| F | T | F | 3 |
| F | F | T | 6 |
| F | F | F | 2 |
- From this we can easily estimate our priors:
- \(P(Spy=True) = 5/20 = .25\)
- \(P(Spy=False) = 15/20 = .75\)
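The tally and the prior estimates can be reproduced in a few lines of Python (a sketch; the dictionary layout and variable names are my own):

```python
# Tally from the in-class exercise: (Spy, Golfer, Fedora) -> count
counts = {
    (True, True, True): 1, (True, True, False): 3,
    (True, False, True): 1, (True, False, False): 0,
    (False, True, True): 4, (False, True, False): 3,
    (False, False, True): 6, (False, False, False): 2,
}

total = sum(counts.values())                                  # 20 examples
spy_true = sum(c for (s, g, f), c in counts.items() if s)     # 5 spies

p_spy_true = spy_true / total       # 5/20 = .25
p_spy_false = 1 - p_spy_true        # 15/20 = .75
```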
- We can also calculate the (full) class-conditional probability
distributions:
| \(Golfer\) | \(Fedora\) | \(P(Golfer, Fedora \mid Spy = True)\) |
|---|---|---|
| T | T | 1/5 = .2 |
| T | F | 3/5 = .6 |
| F | T | 1/5 = .2 |
| F | F | 0/5 = 0 |

| \(Golfer\) | \(Fedora\) | \(P(Golfer, Fedora \mid Spy = False)\) |
|---|---|---|
| T | T | 4/15 \(\approx\) .27 |
| T | F | 3/15 = .2 |
| F | T | 6/15 = .4 |
| F | F | 2/15 \(\approx\) .13 |
- However, for naive Bayes classification we instead need this:
| \(Golfer\) | \(P(Golfer \mid Spy = True)\) |
|---|---|
| T | 4/5 = .8 |
| F | 1/5 = .2 |

| \(Golfer\) | \(P(Golfer \mid Spy = False)\) |
|---|---|
| T | 7/15 \(\approx\) .47 |
| F | 8/15 \(\approx\) .53 |

| \(Fedora\) | \(P(Fedora \mid Spy = True)\) |
|---|---|
| T | 2/5 = .4 |
| F | 3/5 = .6 |

| \(Fedora\) | \(P(Fedora \mid Spy = False)\) |
|---|---|
| T | 10/15 \(\approx\) .67 |
| F | 5/15 \(\approx\) .33 |
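With these tables in hand, a full classification can be sketched in a few lines. Exact fractions are used so the arithmetic matches the tables; the function name and data layout are my own:

```python
from fractions import Fraction as F

# Priors and marginal class-conditionals from the tables above.
p_spy = {True: F(5, 20), False: F(15, 20)}
p_golfer_given_spy = {True:  {True: F(4, 5),  False: F(1, 5)},
                      False: {True: F(7, 15), False: F(8, 15)}}
p_fedora_given_spy = {True:  {True: F(2, 5),  False: F(3, 5)},
                      False: {True: F(10, 15), False: F(5, 15)}}

def posterior_spy(golfer, fedora):
    """P(Spy | Golfer, Fedora) via the naive Bayes proportionality."""
    score = {y: p_spy[y]
                * p_golfer_given_spy[y][golfer]
                * p_fedora_given_spy[y][fedora]
             for y in (True, False)}
    z = score[True] + score[False]   # normalize so the posteriors sum to 1
    return {y: s / z for y, s in score.items()}

# e.g. a golfer wearing a fedora:
post = posterior_spy(golfer=True, fedora=True)   # P(Spy=T | ...) = 12/47
```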
Properties of Naive Bayes
- Pros:
- Provides a meaningful class probability, not just a class label
- Works in the face of missing attributes (just don’t include them in
the calculation)
- Relatively easy to interpret: we can examine the class-conditional
probabilities for individual attributes.
- Cons:
- Classification performance may be worse than other classifiers: Most
real classification tasks will violate the independence
assumption to some extent.
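The tables above make the last point concrete: under \(Spy = True\), the joint probability of \(Golfer = T\) and \(Fedora = T\) is .2, but the product of the marginals is \(.8 \times .4 = .32\), so even this toy data violates the independence assumption. A quick check:

```python
# Values read off the tables in this handout (Spy = True column).
p_joint_tt   = 1 / 5   # P(Golfer=T, Fedora=T | Spy=True) = .2
p_golfer_t   = 4 / 5   # P(Golfer=T | Spy=True) = .8
p_fedora_t   = 2 / 5   # P(Fedora=T | Spy=True) = .4

product = p_golfer_t * p_fedora_t   # .32 -- not equal to the joint .2
```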
Implementation Issues
- Naive Bayes classifier: \[P(Y \mid X_1,
X_2, ..., X_d) \propto P(Y)\prod_{i=1}^d P(X_i \mid Y)\]
- Each \(P(X_i \mid Y)\) is less than
1.
- What is \(.5^{100}\)? \(.5^{1000}\)?
- Recall that \(\log(ab) = \log(a) +
\log(b)\)
- Also, the log function is monotonic: if \(a > b\) then \(\log(a) > \log(b)\)
- So, practical implementations generally work with logs: \[\log \left(P(Y)\prod_{i=1}^d P(X_i \mid Y)\right)
= \log (P(Y)) + \sum_{i=1}^d \log(P(X_i \mid Y))\]
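A small sketch of the underflow problem and the log-space fix (the number of attributes and the probabilities here are illustrative, not from the exercise):

```python
import math

# Multiplying many probabilities of .5 underflows a 64-bit float to 0:
prod = 1.0
for _ in range(1100):       # e.g. 1100 attributes, each P(X_i | Y) = .5
    prod *= 0.5
# prod is now exactly 0.0 -- the score is lost.

# Summing logs instead keeps the score finite and comparable:
log_score = math.log(0.5)   # illustrative log prior
for _ in range(1100):
    log_score += math.log(0.5)
# log_score is a finite negative number, usable for argmax over classes.
```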
Implementation Issues
- How to handle zeros for some attributes? …
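One standard remedy, assumed here since the notes break off, is Laplace (add-\(\alpha\)) smoothing: pretend each (attribute value, class) pair was observed \(\alpha\) extra times, so no conditional probability is ever exactly zero. A minimal sketch:

```python
def smoothed_conditional(count_xy, count_y, num_values=2, alpha=1):
    """Laplace-smoothed estimate of P(X = x | Y = y).

    count_xy:   joint count of X = x and Y = y
    count_y:    count of Y = y
    num_values: number of values X can take (2 for boolean attributes)
    """
    return (count_xy + alpha) / (count_y + alpha * num_values)

# An attribute value never seen with a class of 5 training examples no
# longer gets probability 0:
p = smoothed_conditional(0, 5)   # 1/7, instead of 0/5 = 0
```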