CMPT 733 Assignment 4: Correlation Analysis and Bootstrapping

Part 1. Correlation Analysis
As a data scientist, you often face this kind of question: “Are A and B correlated?” For example,

Do Canadian Currency and Oil Price move together?
Do Vancouver Housing Price and US Stock Market have any correlation?
Are GPA and Gender independent?
To answer these questions, you need to conduct a correlation analysis.

Imagine you are a data scientist working at a real-estate company. You download a property_tax_report from this webpage. The dataset contains information on properties from BC Assessment (BCA) and City sources in 2019. You can find the schema information of the dataset from this webpage.

You might expect a newly built house to have a higher price than one built decades ago. In this assignment, your first job is to figure out whether YEAR_BUILT and HOUSE_PRICE are correlated.

We first load the data as a DataFrame.

In [ ]:
import pandas as pd

df = pd.read_csv("property_tax_report_2019.csv")

# House price in millions of dollars: land value plus improvement value
df['HOUSE_PRICE'] = (df['CURRENT_LAND_VALUE']
                     + df['CURRENT_IMPROVEMENT_VALUE']) / 1000000.0
Task A. Visualizations
Because housing prices vary widely by location, we will consider only the houses whose postcode starts with ‘V6A’. Furthermore, we remove the houses that were built before 1900.

In the following, please make two subplots in one row. The left subplot is a scatter plot with X = YEAR_BUILT and Y = HOUSE_PRICE; the right subplot is a hexbin plot (gridsize = 20) with the same X and Y.

In [ ]:
# <-- Write Your Code -->
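One possible way to approach this cell is sketched below. It is only a sketch under assumptions: the postal-code column name PROPERTY_POSTAL_CODE is not given in this notebook and should be checked against the schema, the helper names filter_houses and plot_scatter_and_hexbin are made up for illustration, and the small demo DataFrame holds invented values in place of the real report.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in rows; the real data comes from property_tax_report_2019.csv.
demo = pd.DataFrame({
    "PROPERTY_POSTAL_CODE": ["V6A 1A1", "V6A 2B2", "V5K 3C3", None, "V6A 4D4"],
    "YEAR_BUILT": [1895, 1950, 2001, 1980, 2015],
    "HOUSE_PRICE": [0.9, 1.2, 2.5, 1.1, 1.8],
})


def filter_houses(df, prefix="V6A", min_year=1900):
    """Keep rows whose postal code starts with `prefix` and YEAR_BUILT >= min_year."""
    # Assumed column name; na=False treats missing postal codes as non-matching.
    in_area = df["PROPERTY_POSTAL_CODE"].str.startswith(prefix, na=False)
    not_too_old = df["YEAR_BUILT"] >= min_year
    return df[in_area & not_too_old]


def plot_scatter_and_hexbin(df):
    """Draw a scatter plot (left) and a hexbin plot (right) in one row."""
    fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 4))
    ax_left.scatter(df["YEAR_BUILT"], df["HOUSE_PRICE"], s=10)
    ax_right.hexbin(df["YEAR_BUILT"], df["HOUSE_PRICE"], gridsize=20)
    for ax in (ax_left, ax_right):
        ax.set_xlabel("YEAR_BUILT")
        ax.set_ylabel("HOUSE_PRICE")
    return fig


houses = filter_houses(demo)
fig = plot_scatter_and_hexbin(houses)
```

On the real data, filter_houses(df) would take the DataFrame loaded earlier instead of demo.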
Please write down the two most interesting findings that you draw from the plot.

Findings

[ADD TEXT]
[ADD TEXT]
The above plots provide a general impression of the relationship between variables. There are some other visualizations that can provide more insights. One option is to bin one variable and plot percentiles of the other.

In the following, please make three subplots in a row, where each subplot is a scatter plot of one HOUSE_PRICE percentile against YEAR_BUILT.

The first subplot shows how the 25th percentile of HOUSE_PRICE changes over the years (X = YEAR_BUILT, Y = 25TH_HOUSE_PRICE);
The second subplot shows how the 50th percentile of HOUSE_PRICE changes over the years (X = YEAR_BUILT, Y = 50TH_HOUSE_PRICE);
The third subplot shows how the 75th percentile of HOUSE_PRICE changes over the years (X = YEAR_BUILT, Y = 75TH_HOUSE_PRICE).
In [ ]:
# <-- Write Your Code -->
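The binning step could look roughly like the sketch below, which groups houses by YEAR_BUILT and takes three quantiles of HOUSE_PRICE per year. The function name price_percentiles_by_year and the demo values are assumptions for illustration. The result is called dfcor only because a later cell uses that name; the notebook itself does not show how dfcor is built.

```python
import pandas as pd

# Invented stand-in rows (prices in millions).
demo = pd.DataFrame({
    "YEAR_BUILT": [1990, 1990, 1990, 2000, 2000, 2000],
    "HOUSE_PRICE": [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})


def price_percentiles_by_year(df):
    """Return one row per YEAR_BUILT with the 25th/50th/75th price percentiles."""
    grouped = df.groupby("YEAR_BUILT")["HOUSE_PRICE"]
    out = pd.DataFrame({
        "25TH_HOUSE_PRICE": grouped.quantile(0.25),
        "50TH_HOUSE_PRICE": grouped.quantile(0.50),
        "75TH_HOUSE_PRICE": grouped.quantile(0.75),
    })
    # Turn the YEAR_BUILT index back into an ordinary column.
    return out.reset_index()


dfcor = price_percentiles_by_year(demo)
```

Each of the three percentile columns can then be plotted against YEAR_BUILT in its own subplot.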
Please write down the two most interesting findings that you draw from the plot.

Findings

[ADD TEXT]
[ADD TEXT]
Task B. Correlation Coefficient
A correlation coefficient is a numerical measure of some type of correlation, meaning a statistical relationship between a pair of variables.

In the following, please implement calc_pearson() and calc_spearman(). Note that you are NOT allowed to use corr from Pandas or pearsonr/spearmanr from scipy.stats for this task. In other words, you need to implement the Pearson and Spearman algorithms yourself.
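For reference, the standard textbook definitions are given here. The notation is introduced for this note and is not part of the notebook: x_i and y_i are the paired observations, n is their number, and d_i is the difference between the ranks of x_i and y_i.

```latex
% Pearson's correlation coefficient
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}

% Spearman's rank correlation: Pearson's r applied to the ranks
\rho = r\bigl(\operatorname{rank}(x), \operatorname{rank}(y)\bigr)

% Shortcut that is valid only when there are no tied values
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}
```

When ties occur, the usual convention is to give tied values the average of their ranks and apply Pearson's formula to those ranks.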

In [ ]:
def calc_pearson(df, x, y):
    # <-- Write Your Code -->

def calc_spearman(df, x, y):
    # <-- Write Your Code -->
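A minimal sketch of the two coefficients on plain Python lists follows. The helper names pearson, average_ranks, and spearman are hypothetical, and giving tied values their average rank is one common convention rather than something this notebook specifies.

```python
import math


def pearson(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

Under this sketch, calc_pearson(df, x, y) could simply return pearson(df[x].tolist(), df[y].tolist()), and calc_spearman could wrap spearman in the same way.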
Then, you can use these two functions to compute Pearson’s correlation and Spearman’s rank correlation for three pairs of variables: <25TH_HOUSE_PRICE, YEAR_BUILT>, <50TH_HOUSE_PRICE, YEAR_BUILT>, and <75TH_HOUSE_PRICE, YEAR_BUILT>.

In [ ]:
print(dfcor.head(10))
print()

for TH in ["25TH", "50TH", "75TH"]:
    print(TH + "_HOUSE_PRICE\t pearson=%f\t spearman=%f"
          % (calc_pearson(dfcor, "YEAR_BUILT", TH + "_HOUSE_PRICE"),
             calc_spearman(dfcor, "YEAR_BUILT", TH + "_HOUSE_PRICE")))

Please write down the two most interesting findings that you draw from the result.

Findings

[ADD TEXT]
[ADD TEXT]
Part 2. Bootstrapping
In practice, you can usually collect only a sample of the data. Whenever you draw a conclusion from a sample (e.g., Vancouver’s housing price has increased by 10% since last year), you should ALWAYS ask yourself: “CAN I TRUST IT?” In other words, would the same analysis on the full data lead to the same conclusion? In Part 2, you will learn how to use bootstrapping to answer this question.

In [ ]:
df_sample = pd.read_csv("property_tax_report_2019_sample.csv")

# Total assessed value = land value + improvement value
df_sample['CURRENT_PRICE'] = (df_sample['CURRENT_LAND_VALUE']
                              + df_sample['CURRENT_IMPROVEMENT_VALUE'])

df_sample['PREVIOUS_PRICE'] = (df_sample['PREVIOUS_LAND_VALUE']
                               + df_sample['PREVIOUS_IMPROVEMENT_VALUE'])

# Keep strata properties only
df_sample = df_sample[df_sample['LEGAL_TYPE'] == 'STRATA']
Task 1. Analysis Result Without Bootstrapping
Please compute the median of PREVIOUS_PRICE and CURRENT_PRICE, respectively, and compare them in a bar chart.

In [ ]:
# --- Write your code below ---
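As a rough illustration of this task, the sketch below computes the two medians and draws them as bars. The demo_sample values are invented, and converting to millions is an assumption made only so the axis is easy to read.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-in for df_sample (prices in dollars).
demo_sample = pd.DataFrame({
    "PREVIOUS_PRICE": [500_000, 650_000, 700_000, 900_000],
    "CURRENT_PRICE": [550_000, 600_000, 800_000, 950_000],
})

# Median of each column, expressed in millions of dollars.
medians_m = demo_sample[["PREVIOUS_PRICE", "CURRENT_PRICE"]].median() / 1_000_000

fig, ax = plt.subplots(figsize=(4, 4))
bars = ax.bar(list(medians_m.index), medians_m.values)
ax.set_ylabel("Median price (millions)")
```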
Task 2. Analysis Result With Bootstrapping
From the above chart, we find that the median of PREVIOUS_PRICE is about 0.77 M and the median of CURRENT_PRICE is about 0.72 M. Since these numbers were obtained from a sample, “CAN WE TRUST THESE NUMBERS?”

In the following, please implement the bootstrap yourself, compute a 95% confidence interval for each number, and add the confidence intervals to the above bar chart. This document gives a good tutorial on the bootstrap. You can find the description of the algorithm in Section 7.

In [ ]:
# --- Write your code below ---
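To make the idea concrete, here is a small sketch of a percentile bootstrap for the median. It may differ from the variant described in the referenced tutorial, so treat it as an illustration rather than the expected solution. The function name bootstrap_ci, its parameters, and the demo prices are all assumptions.

```python
import random
import statistics


def bootstrap_ci(values, stat=statistics.median, n_resamples=1000,
                 alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for `stat` of `values`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(values)
    # Resample with replacement, recompute the statistic, and sort the results.
    estimates = sorted(
        stat(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Pick the entries nearest the alpha/2 and 1 - alpha/2 quantiles.
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return estimates[lo_idx], estimates[hi_idx]


# Invented prices in millions; real input would be a column of df_sample.
demo_prices = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]
low, high = bootstrap_ci(demo_prices)
```

With the real sample, a call such as bootstrap_ci(df_sample['CURRENT_PRICE'].tolist()) would give the interval for one bar. The distances from the median to each end of the interval can be passed to the yerr argument of matplotlib's bar, which accepts separate lower and upper error sizes.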
Submission
Complete the code in this notebook, and submit it to the CourSys activity Assignment 4.