CSCI 5521: Machine Learning Fundamentals (Spring 2021)
Homework 3
1. (15 points) Consider a binary classification problem with two possible choices (Price and WaitEstimate) for the root of a decision tree for the restaurant dataset, as shown in the figure. The green circles (top row) and red circles (bottom row) are the data points of the two classes: green circles belong to class 0 and red circles belong to class 1.
Calculate the information gain for the attributes ‘Price’ and ‘WaitEstimate’ showing all steps. Based on the calculated information gain, which attribute will you use as the root of the decision tree (Price or WaitEstimate)? Justify your answer by describing what the information gain numbers mean.
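As a reminder of the quantity being asked for, the following is a minimal sketch of an information gain computation: the entropy of the parent node minus the size-weighted entropy of its children. The helper names entropy and information_gain are illustrative, and the class counts in the usage lines are hypothetical placeholders, not the counts from the figure.

```python
import math

def entropy(counts):
    """Shannon entropy in bits of a class-count vector, e.g. [n_class0, n_class1]."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # the term 0 * log2(0) is taken to be 0
            p = c / total
            h -= p * math.log2(p)
    return h

def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent_counts)
    remainder = sum(sum(child) / total * entropy(child) for child in children_counts)
    return entropy(parent_counts) - remainder

# Hypothetical counts (NOT taken from the figure): 6 points per class,
# split three ways by some attribute; each child is [n_class0, n_class1].
parent = [6, 6]
children = [[2, 0], [4, 2], [0, 4]]
gain = information_gain(parent, children)
print(round(gain, 3))
```

A gain of 0 means the split leaves the class mixture unchanged in every branch, while a gain equal to the parent entropy means every branch is pure.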
Programming assignment:
2. (85 points) In this problem, we will use the Digits dataset to classify ‘3’ vs ‘8’. Your task is to build three classifiers: (a) bagging, (b) random forests, and (c) AdaBoost. You must implement each classifier from scratch, using only the provided code, the functions specified in this problem description, and the additional instructions below.
(a) (25 points) Write code for class MyBagging in file MyBagging.py with two key functions: MyBagging.fit(self, X, r) and MyBagging.predict(self, X).
For the class, the __init__(self, num_trees, max_depth) function takes as input num_trees, which is the number of tree classifiers in the ensemble (must be an integer ≥ 1), and max_depth, which is the maximum depth of the trees.
For fit(self, X, r), the inputs (X, r) are the feature matrix and class labels, respectively. The fit function will construct num_trees bootstrap datasets and learn a distinct decision tree on each bootstrap dataset. To learn a decision tree, you must use the class DecisionTreeClassifier from scikit-learn. You must set the decision tree hyperparameters as: criterion='entropy' and random_state=0.
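The bootstrap step can be sketched as below. This is a minimal illustration that assumes X and r are NumPy arrays; the helper name bootstrap_sample, the seed, and the toy data are hypothetical and not part of the provided code.

```python
import numpy as np

def bootstrap_sample(X, r, rng):
    """Draw n rows with replacement from (X, r), where n is the number of rows in X."""
    n = X.shape[0]
    idx = rng.integers(0, n, size=n)  # indices may repeat; some rows are left out
    return X[idx], r[idx]

# Toy data (hypothetical): 10 points with 2 features and alternating labels.
rng = np.random.default_rng(0)
X = np.arange(20).reshape(10, 2)
r = np.array([0, 1] * 5)
Xb, rb = bootstrap_sample(X, r, rng)
print(Xb.shape, rb.shape)
```

Because every tree uses the same random_state, the differences between the trees come from the bootstrap datasets they are trained on.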
For predict(self, X), the input X is the feature matrix corresponding to the validation set and the output should be a list of length equal to num_trees. Each element in the list should be an array of length equal to the number of data points in the validation set. The array should contain the predicted labels for each data point in the validation set using the ensemble of trees built up to that point. For example, the predictions in the first element of the output list should be made using only the first decision tree constructed in the ensemble, and the predictions in the second element should be made using the first two decision trees. In general, the predictions in the t-th element should be made using the first t decision trees, so the predictions in the last element are made using all decision trees in the ensemble. By constructing the output list in this way, we can use the hw3q2.py script to plot the mean error rate as we increase the number of trees in the ensemble.
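The shape of this output list can be illustrated with a small sketch. It assumes the labels are encoded as 0 and 1, that the ensemble combines trees by majority vote, and that ties go to class 0; none of these choices is fixed by the text above, and the helper name cumulative_majority_vote is hypothetical.

```python
import numpy as np

def cumulative_majority_vote(per_tree_preds):
    """per_tree_preds has shape (num_trees, n_points) with labels in {0, 1}.

    Returns a list whose t-th element (1-based) is the majority vote
    of the first t trees for every data point.
    """
    per_tree_preds = np.asarray(per_tree_preds)
    votes_for_one = np.cumsum(per_tree_preds, axis=0)  # running count of class-1 votes
    outputs = []
    for t in range(per_tree_preds.shape[0]):
        # Class 1 wins only with strictly more than half of the t + 1 votes
        # (assumption: ties are resolved in favor of class 0).
        outputs.append((2 * votes_for_one[t] > (t + 1)).astype(int))
    return outputs

# Hypothetical predictions of 3 trees on 3 validation points.
preds = [[0, 1, 1],
         [1, 1, 0],
         [1, 0, 0]]
out = cumulative_majority_vote(preds)
print(len(out), [o.tolist() for o in out])
```

Here the first element equals the first tree's predictions, and the last element is the vote over all three trees.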
(b) (25 points) Write code for class MyRandomForest in file MyRandomForest.py with two key functions: MyRandomForest.fit(self,X,r) and MyRandomForest.predict(self,X).
(Due Tue, Mar. 30, 11:59 PM central)