Decision Tree Warmup Lab

Enter your name(s) in the cell below:

YOUR ANSWER HERE

Gini Impurity

We've discussed entropy as a possible measure of impurity for the decision tree construction algorithm. Another option is Gini impurity. Gini impurity is defined as:

$$ \phi(\mathbf{p}) = \sum_i p_i(1-p_i) $$

Where $\mathbf{p} = (p_1, ... , p_n)$ and each $p_i$ is the fraction of elements from class $i$. This is the expected fraction of incorrect predictions in the node if each element's class were predicted by randomly selecting a label according to the distribution of classes in the node. This value is 0 when all elements are from the same class, and it increases as the class mix becomes more uniform. Because the $p_i$ must sum to one, this can be rewritten as:

$$ \phi(\mathbf{p}) = 1 - \sum_i p_i^2 $$

* Notation follows: Breiman, Leo. "Technical note: Some properties of splitting criteria." Machine Learning 24.1 (1996): 41-47.
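As a quick sanity check on the formula (a toy node made up for illustration, not related to Exercise 1): a node with 4 items from class A and 12 from class B has $\mathbf{p} = (0.25, 0.75)$, so the Gini impurity is $1 - (0.25^2 + 0.75^2) = 0.375$. The same arithmetic in NumPy:

```python
import numpy as np

# Toy node: 4 items of class A, 12 items of class B.
counts = np.array([4.0, 12.0])
p = counts / counts.sum()      # class fractions: [0.25, 0.75]
gini = 1.0 - np.sum(p ** 2)    # 1 - (0.0625 + 0.5625)
print(gini)                    # 0.375
```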

Exercise 1

What is the Gini impurity of a node containing 3 items from class A, 7 items from class B and 10 items from class C?

YOUR ANSWER HERE

Split Generator for the Decision Tree PA

I'm providing one potentially useful utility function that you can use in your decision tree construction algorithm. The split_generator function illustrated below will allow you to iterate over all possible splits of a data set. Make sure you understand what's happening in this code.

In [1]:
import numpy as np
import decision_tree

# This is the same housing data we worked with in class. 
# The three columns are "homeowner", "marital status" and "income".  
# The label array (y) represents "defaulted borrower".

X = np.array([[1., 0., 120.],
              [0., 1., 100.],
              [1., 0., 70.],
              [0., 0., 150.],
              [1., 2., 85.],
              [0., 1., 80.],
              [0., 0., 75.]])

y = np.array([0, 0, 0, 1, 0, 1, 1])

# Instantiate a generator
split_gen = decision_tree.split_generator(X, y)

# Print the information associated with the first two splits:
print("FIRST SPLIT:")
print(next(split_gen))
print("\nSECOND SPLIT:")
print(next(split_gen))

# Now let's count to see if we get the expected number of splits:
counter = 0
for split in decision_tree.split_generator(X, y):
    counter += 1

print("\nThere are {} possible splits.".format(counter))
FIRST SPLIT:
Split(dim=0, pos=0.5,
X_left=
array([[  0.,   1., 100.],
       [  0.,   0., 150.],
       [  0.,   1.,  80.],
       [  0.,   0.,  75.]]),
y_left=array([0, 1, 1, 1]),
counts_left=Counter({1: 3, 0: 1}),
X_right=
array([[  1.,   0., 120.],
       [  1.,   0.,  70.],
       [  1.,   2.,  85.]]),
y_right=array([0, 0, 0]),
counts_right=Counter({0: 3}))

SECOND SPLIT:
Split(dim=1, pos=0.5,
X_left=
array([[  1.,   0., 120.],
       [  1.,   0.,  70.],
       [  0.,   0., 150.],
       [  0.,   0.,  75.]]),
y_left=array([0, 0, 1, 1]),
counts_left=Counter({0: 2, 1: 2}),
X_right=
array([[  0.,   1., 100.],
       [  0.,   1.,  80.],
       [  1.,   2.,  85.]]),
y_right=array([0, 1, 0]),
counts_right=Counter({0: 2, 1: 1}))

There are 9 possible splits.

Exercise 2 - Implement Gini Impurity

Complete the two unfinished functions below. These will be useful in your decision tree implementation.

Note that np.fromiter can be used to convert any iterable of scalars into a numpy array:

In [ ]:
a = [1.0, 2.0, 3.0]
array = np.fromiter(a, dtype=float)
print(array)
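Since each split carries Counter objects (counts_left and counts_right), it may be worth noting that np.fromiter also accepts lazy iterables such as a Counter's values view. A small illustration (the toy Counter here is made up for this example):

```python
from collections import Counter
import numpy as np

# Counters preserve insertion order (Python 3.7+), so the
# values view here yields the 'a' count first, then 'b'.
counts = Counter(['a', 'a', 'b', 'b', 'b'])
values = np.fromiter(counts.values(), dtype=float)
print(values)  # [2. 3.]
```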
In [ ]:
from collections import Counter

def impurity(y, y_counts=None):
    """ Calculate Gini impurity for the class labels y.
        If y_counts is provided, it is a Counter containing the counts of the labels in y.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# TESTS FOR IMPURITY

np.testing.assert_allclose(impurity(y), 0.48979, atol=.001)

split_gen = decision_tree.split_generator(X, y)

split = next(split_gen)
np.testing.assert_allclose(impurity(split.y_left, split.counts_left), 0.375, atol=.001)
np.testing.assert_allclose(impurity(split.y_right, split.counts_right), 0, atol=.001)
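As a reminder, the weighted impurity of a split is the standard weighted average of the two child impurities, with each child weighted by the fraction of the node's samples it receives:

$$ \phi_{\text{split}} = \frac{n_{\text{left}}}{n}\,\phi(\mathbf{p}_{\text{left}}) + \frac{n_{\text{right}}}{n}\,\phi(\mathbf{p}_{\text{right}}) $$

For the first split above, this gives $\frac{4}{7} \cdot 0.375 + \frac{3}{7} \cdot 0 \approx 0.214286$, which agrees with the test below.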
In [ ]:
def weighted_impurity(split):
    """ Weighted gini impurity for a possible split. """ 
    # YOUR CODE HERE
    raise NotImplementedError()
In [ ]:
# TESTS FOR WEIGHTED IMPURITY

split_gen = decision_tree.split_generator(X, y)

split = next(split_gen)
np.testing.assert_allclose(weighted_impurity(split), 0.214286, atol=.001)
split = next(split_gen)
np.testing.assert_allclose(weighted_impurity(split), 0.47619, atol=.001)