Regression solved

This assignment will require you to implement and interpret some of the regression concepts
that were introduced in class. Keep in mind that the main objective of this assignment is to
highlight the insights that we can derive from applying these techniques; the coding aspect
is secondary. Accordingly, you are welcome to consult any online documentation and/or
code that has been posted to the course repo, so long as all references and sources are
properly cited. You are also encouraged to use code libraries, but please acknowledge any
source code that was not written by you by mentioning the original author(s) directly in your
source code (comment or header).
We will be using a collection of datasets that illustrate various weather and air-quality related
metrics for five Chinese cities: Shenyang, Shanghai, Guangzhou, Chengdu and Beijing. The
available data for these cities covers dates from Jan 2010 to Dec 2015.
Objectives:
1. Analyze a weather time-series and provide comments on observed patterns
2. Generate visualizations to illustrate specific metrics for specific time windows
3. Apply linear regression to a dataset containing numerical features and evaluate it using
R-squared
Grading Criteria:
Follow the instructions in the pdf, and complete each task. You will be graded on the
application of the modules’ topics, the completeness of your answers to the questions in the
assignment notebook, and the clarity of your writing and code.
1 Regression
The Data
For this assignment you will be using a modified version of the PM2.5 Data of Five Chinese
Cities dataset, which you can find in our GitHub repo here. This dataset was initially collected
and used in this publication.
The features in this dataset include:
Feature         Description
year            year of data in this row
month           month of data in this row
day             day of data in this row
hour            hour of data in this row
season          season of data in this row (1-4)
PM              PM2.5 concentration (ug/m^3)
DEWP            Dew point (degrees Celsius)
TEMP            Temperature (degrees Celsius)
HUMI            Humidity (%)
PRES            Pressure (hPa)
lws             Cumulated wind speed (m/s)
precipitation   Hourly precipitation (mm)
lprec           Cumulated precipitation (mm)
Data for each city is provided as a separate file, and the format is consistent across all of them.
Load each dataset individually and spend some time familiarizing yourself with the data.
1.1 Answer the following questions
1. Which of the 5 cities has the largest temperature range (i.e., highest temperature – lowest
temperature) during the period the dataset was collected?
2. Which cities would you consider to be the most and least polluted? Explain the logic for
your answer in detail.
3. What is the average temperature for each of the four seasons in each of the cities?
4. Where are the hottest summers and coldest winters observed? Explain how you defined
hottest and coldest.
5. Which feature appears to contain the largest amount of missing data overall?
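As a starting point for questions like these, the sketch below shows one way to compute a temperature range, per-season mean temperatures, and missing-value counts with pandas. It assumes each city has been loaded into a DataFrame with the TEMP and season columns described above; the helper name summarize_city and the small demo frame are hypothetical, and the demo values are made up for illustration only.

```python
import numpy as np
import pandas as pd


def summarize_city(df: pd.DataFrame) -> dict:
    """Return a few summary statistics for one city's DataFrame."""
    return {
        # Highest minus lowest observed temperature (NaN values are skipped)
        "temp_range": df["TEMP"].max() - df["TEMP"].min(),
        # Mean temperature for each season code (1-4)
        "season_mean_temp": df.groupby("season")["TEMP"].mean(),
        # Count of missing values per column, largest first
        "missing": df.isna().sum().sort_values(ascending=False),
    }


# Tiny synthetic frame standing in for one city's file (values are made up)
demo = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 4],
    "TEMP": [10.0, 14.0, 28.0, 30.0, 15.0, -5.0],
    "PM": [80.0, np.nan, 40.0, np.nan, np.nan, 120.0],
})

summary = summarize_city(demo)
print(summary["temp_range"])
print(summary["season_mean_temp"])
print(summary["missing"])
```

Calling the same helper on each of the five city DataFrames and comparing the results side by side is one possible approach; how you define "most polluted" or "hottest" remains your own choice to justify.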
1.2 Visualizations
For some of the visualizations you’ll create, it may be useful to reset your dataframe’s index
as a combined date-time index using the pandas built-in datetime type. You can use the
following snippet to do that:
import pandas as pd
# …
# Make sure to load your data first and repeat the process for each dataset you’re using
df.set_index(pd.to_datetime(df[["year", "month", "day", "hour"]]), inplace=True)
1. Using your library of choice, generate a line chart showing the temperature (y-axis) and
dates (x-axis) for one of the five cities. Is there a noticeable seasonal pattern?
2. Create a boxplot showing the temperature values aggregated by month for one of the five
cities.
3. Create a scatter plot using two features of your choice. Choose a pair of features that you
believe have some correlation between them. Based on your visualization, do they seem to
be correlated?
4. Create a single plot that illustrates the value of the PM column over time for each of the
five cities. Color and label each city differently so that they can be distinguished easily.
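For the multi-city plot, a minimal sketch with matplotlib might look like the following. It assumes each city's DataFrame already has the date-time index built from the year, month, day and hour columns; the helper names plot_pm_by_city and make_demo are hypothetical, the demo data is random, and averaging per day is just one possible way to make hourly data readable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_pm_by_city(frames: dict):
    """Plot the daily mean of the PM column for each city on shared axes."""
    fig, ax = plt.subplots(figsize=(10, 4))
    for city, df in frames.items():
        # Averaging per day smooths the hourly readings
        daily = df["PM"].resample("D").mean()
        # Each call to plot() gets the next color in matplotlib's default cycle
        ax.plot(daily.index, daily.values, label=city)
    ax.set_xlabel("Date")
    ax.set_ylabel("PM2.5 concentration (ug/m^3)")
    ax.legend()
    return fig, ax


rng = np.random.default_rng(0)


def make_demo(offset: float) -> pd.DataFrame:
    """Build 72 hours of made-up PM readings with a date-time index."""
    hours = np.arange(72)
    df = pd.DataFrame({
        "year": 2010,
        "month": 1,
        "day": 1 + hours // 24,
        "hour": hours % 24,
        "PM": offset + rng.normal(0, 5, size=72),
    })
    return df.set_index(pd.to_datetime(df[["year", "month", "day", "hour"]]))


# Two synthetic "cities"; with real data, pass one entry per loaded dataset
frames = {"Beijing": make_demo(100.0), "Shanghai": make_demo(50.0)}
fig, ax = plot_pm_by_city(frames)
```

In a notebook the figure renders when the cell finishes; in a script, call plt.show() or fig.savefig(...) afterwards.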
1.3 Regression
To make sure you evaluate your model fairly, split your dataset into train and test sets as in
the example below:
from sklearn.model_selection import train_test_split
# Drop rows that have missing values
df = df.dropna()
# Treat the column PM as our predictive objective
y = df["PM"]
# All other columns will be used as features when training our model
X = df.drop(["PM"], axis=1)
# Split 70% of the data for training and leave out 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
1. Train and evaluate (using R-squared) a linear regression model to predict the PM value.
Repeat this for each city and be sure to evaluate your model using the test set (X_test, y_test).
2. Use the dataset for one city of your choice for training and evaluate your linear regression
on the dataset of another city. How do your results compare to when you used data from the
same city for both training and testing?
3. Using a city of your choice, train and evaluate a linear regression to predict the values of
one of the other columns (not PM). Is there a particular column that seems to be easier to
predict than others? Why do you think that is?
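The same-city and cross-city evaluations can be sketched with scikit-learn roughly as follows. This is an illustration under stated assumptions, not a reference solution: the helper names fit_and_score and make_city are hypothetical, the two demo "cities" are synthetic, and the target parameter is only a convenience for switching the predicted column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(train_df, test_df=None, target="PM"):
    """Fit a linear regression and return R-squared on held-out data.

    With test_df=None, 30% of train_df is held out for testing.
    Otherwise the model is trained on all of train_df and scored on test_df.
    """
    train_df = train_df.dropna()
    X, y = train_df.drop([target], axis=1), train_df[target]
    if test_df is None:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42
        )
    else:
        test_df = test_df.dropna()
        X_train, y_train = X, y
        X_test, y_test = test_df.drop([target], axis=1), test_df[target]
    model = LinearRegression().fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


rng = np.random.default_rng(0)


def make_city(slope: float, n: int = 200) -> pd.DataFrame:
    """Synthetic city whose PM depends linearly on TEMP and HUMI (made up)."""
    temp = rng.uniform(-10, 35, size=n)
    humi = rng.uniform(10, 90, size=n)
    pm = 50 + slope * temp + 0.5 * humi + rng.normal(0, 5, size=n)
    return pd.DataFrame({"TEMP": temp, "HUMI": humi, "PM": pm})


city_a, city_b = make_city(-1.0), make_city(2.0)
same_city_r2 = fit_and_score(city_a)            # train and test on city A
cross_city_r2 = fit_and_score(city_a, city_b)   # train on A, test on B
print(same_city_r2, cross_city_r2)
```

Because the two synthetic cities were built with different TEMP slopes, the cross-city score comes out lower than the same-city score here; whether the real datasets behave similarly is for you to check. Passing a different target, such as target="TEMP", reuses the helper for predicting another column.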
1.4
If you had to choose one of these cities as your next home, based on the data you analyzed
from these datasets, which one would you choose?
Post your answer along with a detailed explanation of how you made that choice and any
supporting visualizations to #assignment5-bonus on Slack.