Q1. Decision Tree Classifier (50 points)

Q1.1 Growing Decision Trees from scratch (40 points)

Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal of this question is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. You must also print the Decision Tree. Use information gain based on entropy as the splitting measure.

Use the data.csv dataset for this question; it is provided on Canvas with Assignment 1. Split the dataset into training and test data and calculate the testing accuracy.
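As a rough sketch of the splitting measure (the helper names `entropy` and `information_gain` are illustrative, not part of any provided template), entropy and information gain could be computed like this:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H(S) = -sum_i p_i * log2(p_i) over the class proportions in S
    counts = np.array(list(Counter(labels).values()), dtype=float)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(parent_labels, split_subsets):
    # IG(split) = H(parent) - weighted average of the child entropies
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in split_subsets)
    return entropy(parent_labels) - weighted

# A perfectly separating split of a balanced binary parent gains 1 bit
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # → 1.0
```

At each node, the split with the largest information gain over all candidate features (and thresholds, for continuous features) is chosen.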

Q1.2 Decision Tree using Sklearn Library (10 points)

Use the DecisionTreeClassifier from the sklearn library with the gini index as the splitting measure, again on the data.csv dataset. Calculate the accuracy of this model. Print the Decision Tree and compare the trees generated by your code and by sklearn.
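A minimal sketch of this step follows. Since the column layout of data.csv isn't specified here, the iris dataset stands in for it; swap in `pd.read_csv('data.csv')` and your own feature/target split:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

# Stand-in data; replace with the features and target from data.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Gini index as the splitting measure, as the question requires
clf = DecisionTreeClassifier(criterion='gini', random_state=0)
clf.fit(X_train, y_train)

print('accuracy:', accuracy_score(y_test, clf.predict(X_test)))
print(export_text(clf))  # text rendering of the fitted tree, for comparison
```

`export_text` gives a printable tree that can be compared side by side with the tree printed by your own implementation.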

Q2 Linear Regression (40 points)

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

θ := θ + (α/m) · Σᵢ₌₁ᵐ (yᵢ − h(xᵢ)) x̄ᵢ

This update minimizes the following cost function

J(x, θ, y) = (1/2m) · Σᵢ₌₁ᵐ (h(xᵢ) − yᵢ)²

where

h(xᵢ) = θᵀ x̄ᵢ
import numpy as np
import matplotlib.pyplot as plt

# Do not change the code in this cell
true_slope = 15
true_intercept = 2.4
input_var = np.arange(0.0, 100.0)
output_var = true_slope * input_var + true_intercept + 300.0 * np.random.rand(len(input_var))
# Do not change the code in this cell
plt.figure()
plt.scatter(input_var, output_var)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
def compute_cost(ip, op, params):
    """
    Compute the mean squared error cost for linear regression
    ip: input variables
    op: output variables
    params: corresponding parameters
    Returns cost
    """
    num_samples = len(ip)
    cost_sum = 0.0
    for x, y in zip(ip, op):
        y_hat = np.dot(params, np.array([1.0, x]))
        cost_sum += (y_hat - y) ** 2

    cost = cost_sum / num_samples

    return cost

Q2.1 Implement Linear Regression using Batch Gradient Descent from scratch. (15 points)

The algorithm can be given as follows (in batch gradient descent, the gradient is accumulated over all m samples before theta is updated once per iteration):

    repeat for max_iter iterations:
        theta += (alpha / m) * sum_{i=0}^{m-1} (y[i] - h(x[i])) * x_bar[i]
def linear_regression_using_batch_gradient_descent(ip, op, params, alpha, max_iter):
    """
    Compute the params for linear regression using batch gradient descent
    ip: input variables
    op: output variables
    params: corresponding parameters
    alpha: learning rate
    max_iter: maximum number of iterations
    Returns parameters, cost, params_store
    """
    # initialize iteration, number of samples, cost and parameter array
    iteration = 0
    num_samples = len(ip)
    cost = np.zeros(max_iter)
    params_store = np.zeros([2, max_iter])

    # Compute the cost and store the params for the corresponding cost
    while iteration < max_iter:
        cost[iteration] = compute_cost(ip, op, params)
        params_store[:, iteration] = params

        print('--------------------------')
        print(f'iteration: {iteration}')
        print(f'cost: {cost[iteration]}')

        # Batch update: accumulate the gradient over all samples, then step once
        gradient = np.zeros(2)
        for x, y in zip(ip, op):
            x_bar = np.array([1.0, x])
            gradient += (y - np.dot(params, x_bar)) * x_bar
        params = params + (alpha / num_samples) * gradient

        iteration += 1

    return params, cost, params_store
# Do not change the code in this cell
# Training the model
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(input_var, output_var, test_size=0.20)

params_0 = np.array([20.0, 80.0])

alpha_batch = 1e-3
max_iter = 100
params_hat_batch, cost_batch, params_store_batch =\
linear_regression_using_batch_gradient_descent(x_train, y_train, params_0, alpha_batch, max_iter)
--------------------------
iteration: 0
cost: 13192814.727851184
--------------------------
iteration: 1
cost: 22978.459683996098
--------------------------
iteration: 2
cost: 11885.25642780868

... (iterations 3-98 omitted; the cost decreases slowly toward its minimum) ...

--------------------------
iteration: 99
cost: 11774.789849230787


Q2.2 Implement Stochastic Gradient Descent from scratch. (15 points)

The algorithm can be given as follows (unlike batch gradient descent, theta is updated immediately after visiting each sample):

    for i in 0 -> m:
        theta += (alpha / m) * (y[i] - h(x[i])) * x_bar[i]
def lin_reg_stoch_gradient_descent(ip, op, params, alpha):
    """
    Compute the params for linear regression using stochastic gradient descent
    ip: input variables
    op: output variables
    params: corresponding parameters
    alpha: learning rate
    Returns parameters, cost, params_store
    """
    # initialize number of samples, cost and parameter array
    num_samples = len(ip)
    cost = np.zeros(num_samples)
    params_store = np.zeros([2, num_samples])

    i = 0
    # Compute the cost and store the params for the corresponding cost
    for x, y in zip(ip, op):
        cost[i] = compute_cost(ip, op, params)
        params_store[:, i] = params

        print('--------------------------')
        print(f'iteration: {i}')
        print(f'cost: {cost[i]}')

        # Stochastic update: step immediately on this single sample
        x_bar = np.array([1.0, x])
        params = params + (alpha / num_samples) * (y - np.dot(params, x_bar)) * x_bar

        i += 1

    return params, cost, params_store
# Do not change the code in this cell
alpha = 1e-3
params_0 = np.array([20.0, 80.0])
params_hat, cost, params_store =\
lin_reg_stoch_gradient_descent(x_train, y_train, params_0, alpha)
--------------------------
iteration: 0
cost: 13144888.680606477
--------------------------
iteration: 1
cost: 13144904.387611339
--------------------------
iteration: 2
cost: 13145433.87395811

... (iterations 3-98 omitted; the cost falls steadily as each sample updates the parameters) ...

--------------------------
iteration: 99
cost: 24597.82512646985


Q2.3 Calculate Root Mean Square error in batch gradient descent algorithm and stochastic gradient descent algorithm (5 points)

# Calculate Root Mean Square error in batch gradient descent algorithm and stochastic gradient descent algorithm

Batch RMS:      93.7947330891927
Stochastic RMS: 125.24063958434985

# Do not change the code in this cell
plt.figure()
plt.plot(np.arange(max_iter), cost_batch, 'r', label='batch')
plt.plot(np.arange(len(cost)), cost, 'g', label='stochastic')
plt.xlabel('iteration')
plt.ylabel('normalized cost')
plt.legend()
plt.show()
print(f'min cost with BGD: {np.min(cost_batch)}')
print(f'min cost with SGD: {np.min(cost)}')
min cost with BGD: 11774.789849230787
min cost with SGD: 24597.82512646985


Q3. Linear Regression Analytical Problem (10 points)

Consider the following training data.

X1  X2  Y
0   0   0
0   1   1.5
1   0   2
1   1   2.5
Suppose the data comes from a model y = θ₀ + θ₁x₁ + θ₂x₂ for unknown constants θ₀, θ₁, θ₂. Use least squares linear regression to find an estimate of θ₀, θ₁, θ₂.
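The hand calculation via the normal equations can be cross-checked numerically with numpy's least-squares solver (a sketch; the first column of the design matrix is the bias term for θ₀):

```python
import numpy as np

# Design matrix [1, x1, x2] for the four training points, and their targets
X = np.array([[1, 0, 0],
              [1, 0, 1],
              [1, 1, 0],
              [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.5, 2.0, 2.5])

# Solves the normal equations (X^T X) theta = X^T y in a numerically stable way
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)  # → approximately [0.25, 1.5, 1.0]
```

That is, the least-squares estimates are θ₀ = 0.25, θ₁ = 1.5, θ₂ = 1.0, leaving symmetric residuals of ±0.25 on the four points.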