## Description

1 Recitation Exercises

These exercises are from: Introduction to Data Mining, 2nd Edition, by Pang-Ning Tan, Michael Steinbach, Anuj Karpatne, Vipin Kumar.

1.1 Chapter 3

Exercises: 2, 3, 5, 6, 7, 8, 12

2 Practicum Problems

These problems will primarily reference the lecture materials and the examples given in class using Python. It is suggested that a Jupyter/IPython notebook be used for the programmatic components.

2.1 Problem 1

Load the iris sample dataset from sklearn (load_iris()) into Python using a Pandas dataframe. Induce a set of binary Decision Trees with a minimum of 2 instances in the leaves, no splits of subsets below 5, and a maximal tree depth from 1 to 5 (you can leave other parameters at their defaults). Which depth values result in the highest Recall? Why? Which value results in the lowest Precision? Why? Which value results in the best F1 score? Explain the difference between the micro/macro/weighted methods of score calculation.
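One possible starting point for the depth sweep is sketched below. It is not a reference solution: the stratified 70/30 hold-out split, the fixed `random_state`, and the reading of "no splits of subsets below 5" as `min_samples_split=5` are assumptions that the problem statement does not spell out.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load iris as a pandas DataFrame (features) and Series (labels).
X, y = load_iris(return_X_y=True, as_frame=True)

# Assumption: a stratified 70/30 hold-out split; the assignment does not specify one.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

scores = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(
        max_depth=depth,
        min_samples_leaf=2,   # at least 2 instances in every leaf
        min_samples_split=5,  # assumed meaning of "no splits of subsets below 5"
        random_state=0,
    )
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    # Compute each score under all three averaging methods for comparison.
    for average in ("micro", "macro", "weighted"):
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_test, y_pred, average=average, zero_division=0
        )
        scores[(depth, average)] = (precision, recall, f1)

for (depth, average), (p, r, f1) in scores.items():
    print(f"depth={depth} {average:>8}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Printing all three averages side by side makes it easier to see where they diverge: micro averaging pools every prediction, macro averaging gives each class equal weight, and weighted averaging weights each class by its support.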

2.2 Problem 2

Load the Breast Cancer Wisconsin (Diagnostic) sample dataset from the UCI Machine Learning Repository (the discrete version at: breast-cancer-wisconsin.data) into Python using a Pandas dataframe. Induce a binary Decision Tree with a minimum of 2 instances in the leaves, no splits of subsets below 5, and a maximal tree depth of 2 (use the default Gini criterion). Calculate the Entropy, Gini, and Misclassification Error of the first split. What is the Information Gain? What is the feature selected for the first split, and what value determines the decision boundary?
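The three impurity measures and the information gain can be computed directly from class counts. The sketch below uses only the standard library; the function names and the example counts are hypothetical illustrations and are not taken from the dataset.

```python
from math import log2


def class_proportions(counts):
    """Convert class counts to proportions, skipping empty classes."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]


def entropy(counts):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    return -sum(p * log2(p) for p in class_proportions(counts))


def gini(counts):
    """Gini index: 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in class_proportions(counts))


def misclassification_error(counts):
    """Misclassification error: 1 - max(p)."""
    return 1.0 - max(class_proportions(counts))


def information_gain(parent, children, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical counts in the form [class A, class B]; not real data.
parent = [10, 10]
children = [[9, 1], [1, 9]]

for measure in (entropy, gini, misclassification_error):
    gain = information_gain(parent, children, impurity=measure)
    print(f"{measure.__name__}: parent={measure(parent):.3f} gain={gain:.3f}")
```

For the actual problem, the parent and child counts would come from the fitted tree's root node and its two children rather than from made-up numbers.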

CS 422 – Data Mining, Spring 2021 (All Sections): Homework 2

Prof. Panchal, Wed. 6:45PM-9:35PM

Assigned: February 14, 2021

Due: February 28, 2021

2.3 Problem 3

Load the Breast Cancer Wisconsin (Diagnostic) sample dataset from the UCI Machine Learning Repository (the continuous version at: wdbc.data) into Python using a Pandas dataframe. Induce the same binary Decision Tree as above (now using the continuous data) but perform a PCA dimensionality reduction beforehand. Using only the first principal component of the data for a model fit, what is the F1, Precision, and Recall of the PCA-based single factor model compared to the original (continuous) data? Repeat using the first and second principal components. Using the Confusion Matrix, what are the values for FP and TP as well as FPR/TPR? Is using continuous data in this case beneficial within the model? How?
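A rough sketch of the comparison follows. Several choices here are assumptions rather than requirements of the problem: it uses scikit-learn's bundled copy of the diagnostic data (`load_breast_cancer`) instead of reading wdbc.data from UCI, standardizes the features before PCA, and evaluates on a stratified 70/30 hold-out split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled copy stands in for wdbc.data.
# In this encoding, label 0 is malignant and label 1 (the "positive" class) is benign.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def make_tree():
    """Tree settings from Problem 2: depth 2, 2 per leaf, no splits below 5."""
    return DecisionTreeClassifier(
        max_depth=2, min_samples_leaf=2, min_samples_split=5, random_state=0
    )


models = {
    "original": make_tree(),
    # Assumption: features are standardized before PCA.
    "pca_1": make_pipeline(StandardScaler(), PCA(n_components=1), make_tree()),
    "pca_2": make_pipeline(StandardScaler(), PCA(n_components=2), make_tree()),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", zero_division=0
    )
    # ravel() on a 2x2 confusion matrix yields TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    results[name] = {
        "precision": precision, "recall": recall, "f1": f1,
        "fp": int(fp), "tp": int(tp),
        "fpr": fp / (fp + tn), "tpr": tp / (tp + fn),
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

If malignant should be treated as the positive class instead, pass `pos_label=0` to the scoring function and read the confusion matrix accordingly.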

2.4 Problem 4

Simulate a binary classification dataset with a single feature using a mixture of normal distributions with NumPy (Hint: generate two data frames, each holding the random numbers and a class label, and combine them). The normal distribution parameters (np.random.normal) should be (5, 2) and (-5, 2) for the pair of samples. Induce a binary Decision Tree of maximum depth 2, and obtain the threshold value for the feature in the first split. How does this value compare to the empirical distribution of the feature?
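The simulation might be set up along the following lines. The sample size of 500 per class, the seed, and the column names `x` and `label` are assumptions made for illustration; the problem only fixes the distribution parameters and the tree depth.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)  # assumption: fixed seed for repeatability
n = 500            # assumption: samples per class (not specified in the problem)

# One data frame per mixture component: the random draws plus a class label.
positive = pd.DataFrame({"x": np.random.normal(5, 2, n), "label": 1})
negative = pd.DataFrame({"x": np.random.normal(-5, 2, n), "label": 0})
data = pd.concat([positive, negative], ignore_index=True)

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data[["x"]], data["label"])

# Node 0 of the fitted tree is the root; its threshold is the first split.
root_threshold = tree.tree_.threshold[0]
print(f"first split: x <= {root_threshold:.3f}")

# Empirical summary of the feature per class, for comparison with the threshold.
print(data.groupby("label")["x"].describe())
```

Comparing the printed threshold with the per-class means and quantiles gives a concrete basis for answering the final question.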
