Bagging (or bootstrap aggregation) describes a class of ensemble methods in which a collection of base classifiers is trained, each on a different bootstrap sample of the original training data (drawn with replacement). At classification time, each base classifier then "votes" on the final class label. In the examples below we will use bagging to train an ensemble of decision tree regressors. For regression, the ensemble prediction is simply the mean of the predictions made by the component trees.
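To make the recipe concrete, here is a minimal sketch of a bagged tree regressor built directly on scikit-learn's DecisionTreeRegressor. The function name bagged_tree_predict and its interface are purely illustrative; the fit_experiments.BaggingTreeRegression helper used below is course-provided and may be implemented differently.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_tree_predict(X_train, y_train, X_test, num_trees=30, seed=0):
    """Fit num_trees trees, each on a bootstrap sample, and average their predictions.
    X_train and X_test are 2D arrays of shape (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    predictions = []
    for _ in range(num_trees):
        idx = rng.integers(0, n, size=n)      # bootstrap sample: n draws with replacement
        tree = DecisionTreeRegressor()
        tree.fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_test))
    return np.mean(predictions, axis=0)       # ensemble prediction = mean over trees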
First, let's look at what happens when we train a single decision tree on the full training set...
import numpy as np
import matplotlib.pyplot as plt
import datasource
import fit_experiments
plt.rcParams["figure.figsize"] = (10.5,7.5)
ds = datasource.DataSource(seed=200)
x, y = ds.gen_data(80)
model1 = fit_experiments.BaggingTreeRegression(num_trees=1,
                                               max_leaf_nodes=10000000)
model1.fit(x,y)
model1.plot(x, y)
This looks like overfitting. We could try tuning the hyperparameters of the tree, but instead, let's try training 30 decision trees, each of which will see a different draw from our training set:
num_trees2 = 30
model2 = fit_experiments.BaggingTreeRegression(num_trees=num_trees2,
                                               max_leaf_nodes=10000000)
model2.fit(x,y)
model2.plot(x, y)
As you can see, each individual training point has less of an impact on the final prediction, because any given point is missing from some of the trees' bootstrap samples. The ensemble appears to do a better job of capturing the overall trend in the training data.
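If we assume the helper uses the standard bootstrap (samples of size n drawn with replacement, as in the sketch above), each tree sees only about 63% of the distinct training points on average and never sees the rest. A quick check of that fraction:
rng = np.random.default_rng(0)
n = 80
fractions = [len(np.unique(rng.integers(0, n, size=n))) / n for _ in range(1000)]
print(np.mean(fractions))   # roughly 0.63, i.e. about 1 - 1/e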
We can quantify the improvement by running some experiments to estimate the bias and variance of a single decision tree versus a bagged ensemble:
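The fit_experiments.bias_variance_experiment helper does the bookkeeping for us. Conceptually, one way such an estimate can be computed is sketched below, reusing the illustrative bagged_tree_predict function from above; it assumes the data source can generate fresh training sets (as ds.gen_data does) and that we have access to the noise-free target values y_true at the test points, which is an assumption about the helper's internals, not a description of them.
def estimate_bias_variance(
        source, x_test, y_true, num_trials=100, train_size=80, num_trees=1):
    """Refit on many fresh training sets; report squared bias and variance,
    each averaged over the test points."""
    X_te = x_test.reshape(-1, 1)              # assumes 1-D inputs, as in this notebook
    all_preds = []
    for _ in range(num_trials):
        x_train, y_train = source.gen_data(train_size)   # fresh training draw
        preds = bagged_tree_predict(
            x_train.reshape(-1, 1), y_train, X_te, num_trees=num_trees)
        all_preds.append(preds)
    all_preds = np.array(all_preds)           # shape: (num_trials, num_test_points)
    mean_pred = all_preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - y_true) ** 2)   # squared bias, averaged over x
    variance = np.mean(all_preds.var(axis=0))      # variance, averaged over x
    return bias_sq, variance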
num_trees1 = 1
max_leaf_nodes1 = 1000000
num_trees2 = 30
max_leaf_nodes2 = 1000000
plt.subplot(121)
print("Single Tree:")
fit_experiments.bias_variance_experiment(num_trials=100, train_size=80,
                                         max_leaf_nodes=max_leaf_nodes1,
                                         num_trees=num_trees1,
                                         source=ds)
plt.title('trees: {}'.format(num_trees1))
plt.subplot(122)
print("\n{} Tree Ensemble:".format(num_trees2))
fit_experiments.bias_variance_experiment(num_trials=100, train_size=80,
                                         max_leaf_nodes=max_leaf_nodes2,
                                         num_trees=num_trees2,
                                         source=ds)
plt.title('trees: {}'.format(num_trees2))
plt.show()
Notice that bagging significantly decreases the variance while leaving the already-low bias essentially unchanged. This suggests that bagging should lead to lower overall generalization error.
Our textbook suggests that bagging can also improve bias. Let's try a highly biased decision tree limited to only four leaf nodes:
model1 = fit_experiments.BaggingTreeRegression(num_trees=1,
                                               max_leaf_nodes=4)
model1.fit(x,y)
model1.plot(x, y)
It looks like this model is underfitting the data... Let's see what happens when we use bagging to create an ensemble of 30 of these high-bias decision trees:
num_trees2 = 30
model2 = fit_experiments.BaggingTreeRegression(num_trees=num_trees2,
                                               max_leaf_nodes=4)
model2.fit(x,y)
model2.plot(x, y)
It looks like the ensemble prediction is still underfitting the data, but not as badly as before. Once again, we can quantify this experimentally:
num_trees1 = 1
max_leaf_nodes1 = 4
num_trees2 = 30
max_leaf_nodes2 = 4
plt.subplot(121)
print("Single Tree:")
fit_experiments.bias_variance_experiment(num_trials=100, train_size=80,
                                         max_leaf_nodes=max_leaf_nodes1,
                                         num_trees=num_trees1,
                                         source=ds)
plt.title('trees: {}'.format(num_trees1))
plt.subplot(122)
print("\n{} Tree Ensemble:".format(num_trees2))
fit_experiments.bias_variance_experiment(num_trials=100, train_size=80,
                                         max_leaf_nodes=max_leaf_nodes2,
                                         num_trees=num_trees2,
                                         source=ds)
plt.title('trees: {}'.format(num_trees2))
plt.show()
Sure enough, the ensemble predictions are better in terms of both bias and variance.
Ensemble methods work best when the errors made by the component models are as uncorrelated as possible. Bagging attempts to decorrelate the models by training each one on a different bootstrap sample of the data. Random forests introduce additional randomization into the tree-construction process to further reduce the correlation between the component trees. We'll talk more about random forests next time.
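As a closing note on why decorrelation pays off: a standard result says that if B identically distributed predictors each have variance sigma^2 and pairwise correlation rho, the variance of their average is rho * sigma^2 + (1 - rho) * sigma^2 / B, so averaging only drives down the uncorrelated part of the variance. A quick simulation (with illustrative numbers) confirms this:
rng = np.random.default_rng(0)
B, sigma, rho, n_samples = 30, 1.0, 0.3, 200_000
# Construct B predictors with variance sigma^2 and pairwise correlation rho:
# a shared component plus an independent component for each predictor.
shared = rng.normal(0.0, sigma, size=(n_samples, 1))
indep = rng.normal(0.0, sigma, size=(n_samples, B))
preds = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * indep
avg = preds.mean(axis=1)
print(avg.var())                                   # empirical variance of the average
print(rho * sigma**2 + (1 - rho) * sigma**2 / B)   # theoretical value, about 0.323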