Enter your name(s) in the cell below:
YOUR ANSWER HERE
We've discussed entropy as a possible measure of impurity for the decision tree construction algorithm. Another option is Gini impurity, which is defined as:
$$ \phi(\mathbf{p}) = \sum_i p_i(1-p_i) $$

where $\mathbf{p} = (p_1, \ldots, p_n)$ and each $p_i$ is the fraction of elements from class $i$. This expresses the expected fraction of incorrect predictions in the node if the class of each element were predicted by randomly selecting a label according to the distribution of classes in the node. This value is 0 if all elements belong to the same class, and it increases as the class mix becomes more uniform. Because the $p_i$ must sum to one, this can be rewritten as:
$$ \phi(\mathbf{p}) = 1 - \sum_i p_i^2 $$

* Notation follows: Breiman, Leo. "Technical note: Some properties of splitting criteria." Machine Learning 24.1 (1996): 41-47.
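For example, a node holding 2 items from class A and 6 items from class B (counts chosen purely for illustration) has $\mathbf{p} = (\frac{1}{4}, \frac{3}{4})$, so

$$ \phi(\mathbf{p}) = 1 - \left(\frac{1}{4}\right)^2 - \left(\frac{3}{4}\right)^2 = 1 - \frac{10}{16} = 0.375 $$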
What is the Gini impurity of a node containing 3 items from class A, 7 items from class B and 10 items from class C?
YOUR ANSWER HERE
I'm providing one potentially useful utility function that you can use in your decision tree construction algorithm. The split_generator
function illustrated below allows you to iterate over all possible splits for a data set. Make sure you understand what's happening in this code.
import numpy as np
import decision_tree
# This is the same housing data we worked with in class.
# The three columns are "homeowner", "marital status" and "income".
# The label array (y) represents "defaulted borrower".
X = np.array([[1., 0., 120.],
              [0., 1., 100.],
              [1., 0., 70.],
              [0., 0., 150.],
              [1., 2., 85.],
              [0., 1., 80.],
              [0., 0., 75.]])
y = np.array([0, 0, 0, 1, 0, 1, 1])
# Instantiate a generator
split_gen = decision_tree.split_generator(X, y)
# Print the information associated with the first two splits:
print("FIRST SPLIT:")
print(next(split_gen))
print("\nSECOND SPLIT:")
print(next(split_gen))
# Now let's count to see if we get the expected number of splits:
counter = 0
for split in decision_tree.split_generator(X, y):
    counter += 1
print("\nThere are {} possible splits.".format(counter))
Complete the two unfinished functions below. These will be useful in your decision tree implementation.
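For weighted_impurity, the tests further below are consistent with the standard definition: each side's Gini impurity weighted by the fraction of the node's samples it receives. With $n_L$ and $n_R$ samples going left and right out of $n = n_L + n_R$ total:

$$ \phi_{split} = \frac{n_L}{n}\,\phi(\mathbf{p}_{left}) + \frac{n_R}{n}\,\phi(\mathbf{p}_{right}) $$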
Note that np.fromiter
can be used to convert any iterable into a one-dimensional NumPy array:
a = [1.0, 2.0, 3.0]
array = np.fromiter(a, dtype=float)
print(array)
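This is handy because, for example, the values of a collections.Counter can be converted to an array of counts and then normalized into class fractions. A small illustration (not part of the required solution):
from collections import Counter

# Illustration only: turn label counts into an array of class fractions.
counts = Counter([0, 0, 1, 1, 1])
p = np.fromiter(counts.values(), dtype=float)
p /= p.sum()
print(p)  # [0.4 0.6]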
from collections import Counter
def impurity(y, y_counts=None):
    """Calculate the Gini impurity for the class labels y.

    If y_counts is provided, it contains the precomputed counts
    of each label in y.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
# TESTS FOR IMPURITY
np.testing.assert_allclose(impurity(y), 0.48979, atol=.001)
split_gen = decision_tree.split_generator(X, y)
split = next(split_gen)
np.testing.assert_allclose(impurity(split.y_left, split.counts_left), 0.375, atol=.001)
np.testing.assert_allclose(impurity(split.y_right, split.counts_right), 0, atol=.001)
def weighted_impurity(split):
    """Weighted Gini impurity for a possible split."""
    # YOUR CODE HERE
    raise NotImplementedError()
# TESTS FOR WEIGHTED IMPURITY
split_gen = decision_tree.split_generator(X, y)
split = next(split_gen)
np.testing.assert_allclose(weighted_impurity(split), 0.214286, atol=.001)
split = next(split_gen)
np.testing.assert_allclose(weighted_impurity(split), 0.47619, atol=.001)