CS 445 Machine Learning

PA1: Decision Tree

Learning Objectives

After completing this activity, students should be able to:
  • Construct a Python object that can build a decision tree and classify new data
  • Utilize tree depth and leaf density/cardinality, and explain how they relate to underfitting and overfitting
  • Utilize matplotlib to visualize and present results for analyzing decision trees

Partners

This assignment may be completed individually or in pairs. If you are working in a pair, you must notify me at the beginning of the project. My expectation for pairs is that both members are actively involved and take full responsibility for all aspects of the project. In other words, I expect that you are either sitting together or meeting virtually to work, not splitting up tasks to be completed separately. If both members of the group are not able to fully explain the code to me, then this expectation has not been met.

Part 1: Implementation

Complete the following stubbed-out decision tree classifier so that all public methods and attributes correspond to the provided docstring comments:
  • decision_tree.py
    • impurity (completed as part of tree warmup in-class exercise)
    • weighted_impurity
    • The DecisionTree class methods. These methods must be present and keep the same signature (parameters) as in the stub code. You can add more methods as required:
      • __init__
      • fit
      • predict
      • get_depth
The decision_tree_tests.py module provides some diagnostics for validating your decision tree object. If you run it, it will build a decision tree and validate some of the output. These tests are not exhaustive, so consider adding a few tests of your own. The main method in decision_tree.py exists solely for you to perform testing. You are not required to implement it, and it will not be graded. All of my internal and Autolab testing happens by instantiating a DecisionTree object and calling its member functions.
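
As a point of reference for the two helper functions listed above, here is a minimal sketch that assumes Gini impurity and a binary split described by a boolean mask. The names `gini_impurity` and `weighted_gini` and their signatures are hypothetical; the docstrings in the stub code define the actual impurity measure and parameters your functions must match.

```python
import numpy as np


def gini_impurity(y):
    """Gini impurity of an integer label array (hypothetical helper)."""
    if y.size == 0:
        return 0.0
    # Fraction of samples belonging to each class label
    p = np.bincount(y) / y.size
    return 1.0 - np.sum(p ** 2)


def weighted_gini(y, left_mask):
    """Impurity of both sides of a split, weighted by their sizes."""
    y_left, y_right = y[left_mask], y[~left_mask]
    n = y.size
    return ((y_left.size / n) * gini_impurity(y_left)
            + (y_right.size / n) * gini_impurity(y_right))


# Made-up example: one binary attribute that separates the labels perfectly
y = np.array([0, 0, 1, 1])
x = np.array([0, 0, 1, 1])
print(gini_impurity(y))          # 0.5
print(weighted_gini(y, x == 0))  # 0.0
```

A split that lowers the weighted impurity the most is usually the one a tree-building routine would prefer, which is why the second value dropping to zero indicates a perfect split on this toy data.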

Part 2: Analysis on a Demographics Dataset

You will apply your decision tree to the problem of determining a respondent's age based on their answers to an online quiz. The quiz consists of 30 Yes/No questions like

  • Have you ever broken a bone?
  • Have you ever been on the radio or television?
  • etc.
Determining demographic information about customers based on seemingly innocuous data revealed online is big business these days.

For this part of the assignment, you will create and submit a jupyter notebook that satisfies the following requirements:

  • Write code to perform hyper-parameter tuning using cross validation, including:
    • A plot of the cross validation error for various hyper-parameter settings
    • A short discussion of the number of partitions used in the cross validation
  • Illustrate model evaluation by determining the classification error on the test set
  • Construct a 4x4 confusion matrix that illustrates the results on the test set
  • Discuss which attributes were used by your decision tree
All plots and tables should be accompanied by text explaining them. Plots without accompanying text will not receive any credit. You may use sklearn modules for cross validation (creating training/test splits, splits for k-folds, etc.). No sklearn methods should appear in decision_tree.py.
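
The cross validation requirement above can be sketched roughly as follows. This is only an illustration under several assumptions: `MajorityClassifier` is a hypothetical stand-in for your DecisionTree, the `max_depth` constructor parameter is assumed rather than taken from the stub, the data is synthetic, and the folds are built with plain numpy (sklearn's fold-splitting utilities could be substituted in the notebook).

```python
import numpy as np


class MajorityClassifier:
    """Stand-in model (hypothetical); swap in your DecisionTree."""

    def __init__(self, max_depth=None):
        self.max_depth = max_depth  # unused here; shown only as a tunable

    def fit(self, X, y):
        self.label_ = np.bincount(y).argmax()

    def predict(self, X):
        return np.full(X.shape[0], self.label_)


def cross_val_error(make_model, X, y, k=5, seed=0):
    """Average validation error over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(X.shape[0]), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[val_idx]) != y[val_idx]))
    return float(np.mean(errors))


# Synthetic stand-in data: 200 rows, 30 binary attributes, labels 0-3
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 30))
y = rng.integers(0, 4, size=200)

cv_errors = {}
for depth in (1, 2, 4, 8):
    cv_errors[depth] = cross_val_error(
        lambda: MajorityClassifier(max_depth=depth), X, y)

# Setting with the lowest average validation error
best_depth = min(cv_errors, key=cv_errors.get)
```

The `cv_errors` dictionary holds one average error per setting, which is the quantity the required plot would show on its vertical axis. The stand-in model ignores `max_depth`, so the errors here are identical across settings; a real tree would produce a curve.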

Demographics Dataset
Use the provided training and test data files for your analysis. The class labels are integers in the range 0-3 where:
  • 0 represents 0-18 years old
  • 1 represents 19-24 years old
  • 2 represents 25-34 years old
  • 3 represents 35+ years old
The attributes all have values of 1 (representing Yes) or 0 (representing No). A reference for this dataset, "Have You Ever", collected by Benjamin Soyka, is available on Kaggle (a machine learning competition site). This data is distributed under the CC BY-SA 4.0 license.
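
Because the class labels are the integers 0-3, the 4x4 confusion matrix required in Part 2 can be built by using the labels directly as row and column indices. The sketch below is one possible approach; the function name `confusion_matrix` and the label arrays are made up for illustration and are not results from the dataset.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes=4):
    """Rows index the true label, columns index the predicted label."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    # np.add.at accumulates correctly when an index pair repeats
    np.add.at(cm, (y_true, y_pred), 1)
    return cm


# Made-up labels, not taken from the assignment data
y_true = np.array([0, 1, 2, 3, 3, 2])
y_pred = np.array([0, 1, 3, 3, 2, 2])

cm = confusion_matrix(y_true, y_pred)
print(cm)

# Classification error is the fraction of off-diagonal entries
error = 1.0 - np.trace(cm) / cm.sum()
```

Off-diagonal entries next to the diagonal would indicate confusion between adjacent age groups, which is one kind of conclusion the notebook discussion could draw.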

Submission

This PA has the following required submissions:
Deliverable | Location to Submit | Description
decision_tree.py | Autolab | Graded using the test file provided plus a few hidden tests
decision_tree.py | Canvas | Submit to Canvas too so that your ipynb file will run
HaveYouEverAnalysis.ipynb notebook | Canvas | Your notebook can expect that the data files and your decision_tree.py file are located in the same directory. Your notebook MUST include the plots already rendered (in other words, nothing should need to be run).

Grading

Grades will be calculated according to the following distribution.
Criterion | Percentage | Description
Overall Readability/Style | 5% | Your code should follow PEP8 conventions. It should be well documented and well organized.
Part 1: Reference Tests | 40% | decision_tree_tests.py
Part 1: Efficiency | 5% | The main concern is clarity and correctness. That said, your implementation must be efficient enough to execute the testing code within 3 seconds. This means that you should avoid Python loops where possible. One strategy might be to use loops for the initial implementation, test for correctness, and then try replacing the loops with numpy methods.
Part 1: Hidden Tests | 10% | These tests are run on Autolab, so you can keep submitting until you pass these as well (but the contents of the tests remain hidden).
Part 2: Hyperparameter Tuning | 10% | Notebook illustrates hyperparameter tuning utilizing cross validation.
Part 2: Plot and Selection of Hyperparameter | 10% | Plot showing the performance of your model (average validation error or average validation accuracy) for different settings of the hyperparameters. Discuss your selection of the hyperparameter.
Part 2: Confusion Matrix | 10% | After retraining your model with the best hyperparameters using all the training data, construct a 4x4 confusion matrix using Python code and print the results. Discuss any conclusions you can reach from the confusion matrix.
Part 2: Explanation | 10% | Discuss and, where possible, explain your model's choice of features/attributes. Distribution plots are recommended if they help support your discussion.
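
The efficiency criterion's suggestion to replace Python loops with numpy methods can be illustrated with a small, hypothetical helper that counts the class labels on the "No" (0) side of a binary attribute. Both function names are invented for this sketch; neither is part of the stub code.

```python
import numpy as np


def left_counts_loop(x, y, num_classes=4):
    """Count labels of rows where the attribute is 0, with a Python loop."""
    counts = [0] * num_classes
    for xi, yi in zip(x, y):
        if xi == 0:
            counts[yi] += 1
    return np.array(counts)


def left_counts_numpy(x, y, num_classes=4):
    """Same counts using a boolean mask and np.bincount."""
    return np.bincount(y[x == 0], minlength=num_classes)


# Made-up attribute column and labels
x = np.array([0, 1, 0, 0])
y = np.array([2, 3, 2, 0])
print(left_counts_loop(x, y))   # [1 0 2 0]
print(left_counts_numpy(x, y))  # [1 0 2 0]
```

Writing the loop version first and then checking that the numpy version returns the same array is one way to follow the loop-then-vectorize strategy described in the table.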
This assignment was co-developed by Kevin Molloy and Nathan Sprague.