CS 445 Machine Learning
PA1: Decision Tree
Learning Objectives
After completing this activity, students should be able to:
- Construct a Python object that can build a decision tree and classify new data
- Explain how tree depth and leaf density/cardinality relate to under- and overfitting
- Use matplotlib to visualize and present results when analyzing decision trees
Partners
This assignment may be completed individually or in pairs. If you are working in pairs, you must notify me at the beginning of the project. My expectation for pairs is that both members are actively involved and take full responsibility for all aspects of the project. In other words, I expect that you are sitting together (or meeting virtually) to work, not splitting up tasks to be completed separately. If both members of the group cannot fully explain the code to me, then this expectation has not been met.
Part 1: Implementation
Complete the following stubbed-out decision tree classifier so that all public methods and attributes correspond to the provided docstring comments (a sketch of the impurity helpers appears after this list):
- decision_tree.py
  - impurity (completed as part of the tree warmup in-class exercise)
  - weighted_impurity
- The DecisionTree class methods. These methods must be present and keep the same signature (parameters) as in the stub code. You can add more methods as required:
- __init__
- fit
- predict
- get_depth
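As a reference point, the impurity helpers might look like the sketch below. This is a minimal sketch assuming Gini impurity and a two-argument weighted_impurity; the stub's docstrings are authoritative, so follow them if they specify different signatures or a different impurity measure (e.g., entropy).

```python
import numpy as np

def impurity(y):
    """Impurity of a label array (sketch assumes Gini; follow the stub's docstring)."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y) / y.size          # class proportions
    return 1.0 - np.sum(p ** 2)          # Gini: 1 - sum of squared proportions

def weighted_impurity(y_left, y_right):
    """Impurity of a candidate split, weighted by the size of each branch."""
    n = y_left.size + y_right.size
    return (y_left.size * impurity(y_left) + y_right.size * impurity(y_right)) / n
```

Note that both helpers stay loop-free by leaning on numpy, which matters for the efficiency criterion in the grading table.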
Part 2: Analysis on a Demographics Dataset
You will apply your decision tree to the problem of determining a respondent's age based on their answers to an online quiz. The quiz consists of 30 Yes/No questions such as:
- Have you ever broken a bone?
- Have you ever been on the radio or television?
- etc.
For this part of the assignment, you will create and submit a Jupyter notebook that satisfies the following requirements:
- Write code to perform hyper-parameter tuning using cross-validation (see the sketch after this list), including:
  - a plot of the cross-validation error for various hyper-parameter settings
  - a short discussion of the number of partitions (folds) used in the cross-validation
- Illustrate model evaluation by determining the classification error on the test set
- Construct a 4x4 confusion matrix that illustrates the results on the test set
- Discuss which attributes were used by your decision tree
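A minimal sketch of the tuning loop follows. It assumes X_train and y_train are numpy arrays already loaded from the provided data files, and that your DecisionTree exposes a max_depth hyperparameter; that name is hypothetical, so swap in whichever hyperparameter your stub actually defines and you choose to tune.

```python
import numpy as np
import matplotlib.pyplot as plt
from decision_tree import DecisionTree  # your Part 1 implementation

def cv_error(X, y, depth, n_folds=5):
    """Average validation error for one hyperparameter setting via k-fold CV."""
    rng = np.random.default_rng(0)                # same folds for every setting
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        tree = DecisionTree(max_depth=depth)      # hyperparameter name is hypothetical
        tree.fit(X[train], y[train])
        errors.append(np.mean(tree.predict(X[val]) != y[val]))
    return np.mean(errors)

# assumes X_train, y_train were already loaded from the provided data files
depths = list(range(1, 11))
cv_errors = [cv_error(X_train, y_train, d) for d in depths]

plt.plot(depths, cv_errors, marker="o")
plt.xlabel("maximum tree depth")
plt.ylabel("average validation error")
plt.title("5-fold cross-validation")
plt.show()
```

The resulting curve typically falls and then flattens or rises again; the setting at or near the minimum is the one to discuss and carry forward.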
Demographics Dataset
The following files contain the training and test data you must use
for your analysis:
The class labels are integers in the range 0-3 where:
- 0 represents 0-18 years old
- 1 represents 19-24 years old
- 2 represents 25-34 years old
- 3 represents 35+ years old
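Because the labels are the integers 0 through 3, the 4x4 confusion matrix required in Part 2 can be tallied directly. A minimal sketch, assuming y_true and y_pred are integer numpy arrays:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=4):
    """Rows index the true age bracket, columns the predicted one."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```

Entry (i, j) then counts test respondents in bracket i whom the tree placed in bracket j, so off-diagonal mass shows which age brackets get confused with each other.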
Submission
This PA has the following required submissions:

| Deliverable | Location to Submit | Description |
| --- | --- | --- |
| decision_tree.py | Autolab | Graded using the test file provided plus a few hidden tests |
| decision_tree.py | Canvas | Submit to Canvas too so that your ipynb file will run |
| HaveYouEverAnalysis.ipynb notebook | Canvas | Your notebook can expect that the data files and your decision_tree.py file are located in the same directory. Your notebook MUST include the plots already rendered (in other words, nothing should need to be run). |
Grading
Grades will be calculated according to the following distribution.

| Criterion | Percentage | Description |
| --- | --- | --- |
| Overall Readability/Style | 5% | Your code should follow PEP8 conventions. It should be well documented and well organized. |
| Part 1: Reference Tests | 40% | decision_tree_tests.py |
| Part 1: Efficiency | 5% | The main concern is clarity and correctness. That said, your implementation must be efficient enough to execute the testing code within 3 seconds. This means that you should avoid Python loops where possible. One strategy is to use loops for the initial implementation, test for correctness, and then replace the loops with numpy methods (see the sketch after this table). |
| Part 1: Hidden Tests | 10% | These tests are run on Autolab, so you can keep submitting until you pass these as well (but the contents of the tests remain hidden). |
| Part 2: Hyperparameter Tuning | 10% | Notebook illustrates hyperparameter tuning using cross-validation. |
| Part 2: Plot and Selection of Hyperparameter | 10% | Plot showing the performance of your model (average validation error or average validation accuracy) for different settings of the hyperparameter. Discuss your selection of the hyperparameter. |
| Part 2: Confusion Matrix | 10% | After retraining your model with the best hyperparameter on all the training data, construct a 4x4 confusion matrix using Python code and print the results. Discuss any conclusions you can reach from the confusion matrix. |
| Part 2: Explanation | 10% | Discuss and, where possible, explain your model's choice of features/attributes. Distribution plots are recommended if they help support your discussion. |
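To illustrate the loop-removal strategy mentioned under Part 1: Efficiency, here is one small, self-contained example of swapping a Python loop for a single numpy call. The label-counting task is illustrative only, not part of the assignment.

```python
import numpy as np

y = np.array([0, 2, 2, 3, 1, 2, 0])

# Loop version: easy to write first and verify for correctness.
counts_loop = [0, 0, 0, 0]
for label in y:
    counts_loop[label] += 1

# Vectorized version: same result with no Python-level loop.
counts_np = np.bincount(y, minlength=4)

assert list(counts_np) == counts_loop
```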