Introduction to Regression

Import libraries

%matplotlib inline
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
plt.rcParams["figure.figsize"] = [12,9]

Load some house value vs. crime rate data

Dataset is from Philadelphia, PA and includes average house sales price in a number of neighborhoods. The attributes of each neighborhood we have include the crime rate (‘CrimeRate’), miles from Center City (‘MilesPhila’), town name (‘Name’), and county name (‘County’).

sales = pd.read_csv('Philadelphia_Crime_Rate_noNA.csv')
sales.head(10)
   HousePrice  HsPrc ($10,000)  CrimeRate  MilesPhila  PopChg  Name        County
0      140463          14.0463       29.7          10    -1.0  Abington    Montgome
1      113033          11.3033       24.1          18     4.0  Ambler      Montgome
2      124186          12.4186       19.5          25     8.0  Aston       Delaware
3      110490          11.0490       49.4          25     2.7  Bensalem    Bucks
4       79124           7.9124       54.1          19     3.9  Bristol B.  Bucks
5       92634           9.2634       48.6          20     0.6  Bristol T.  Bucks
6       89246           8.9246       30.8          15    -2.6  Brookhaven  Delaware
7      195145          19.5145       10.8          20    -3.5  Bryn Athyn  Montgome
8      297342          29.7342       20.2          14     0.6  Bryn Mawr   Montgome
9      264298          26.4298       20.4          26     6.0  Buckingham  Bucks

Exploring the data

The house price in a town is correlated with the crime rate of that town. Low crime towns tend to be associated with higher house prices and vice versa.
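One way to put a number on this association is the Pearson correlation coefficient. The sketch below is only illustrative: it uses the ten rows displayed by `sales.head(10)` above rather than the full dataset, so the value it prints is not the correlation for all neighborhoods. The array names `crime_rate` and `house_price` are introduced here for the example.

```python
import numpy as np

# The ten rows displayed by sales.head(10) above (NOT the full dataset).
crime_rate = np.array([29.7, 24.1, 19.5, 49.4, 54.1, 48.6, 30.8, 10.8, 20.2, 20.4])
house_price = np.array([140463, 113033, 124186, 110490, 79124,
                        92634, 89246, 195145, 297342, 264298], dtype=float)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(crime_rate, house_price)[0, 1]
print(f"Pearson r on the ten displayed rows: {r:.3f}")
```

A negative value means higher crime rates go with lower prices in this sample. On the full DataFrame, `sales['CrimeRate'].corr(sales['HousePrice'])` should give the corresponding figure for all neighborhoods.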

sns.lmplot(x='CrimeRate', y='HousePrice', data=sales, fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0xa3af3c8>

[Figure: scatter plot of HousePrice vs. CrimeRate, all neighborhoods]

Fit the regression model using crime as the feature

# Column vectors of shape (n_samples, 1); a 2-D y keeps intercept_[0] and coef_[0][0] valid below
sales_X_crimerate  = sales['CrimeRate'].values.reshape(-1, 1)
sales_y_houseprice = sales['HousePrice'].values.reshape(-1, 1)
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the dataset
regr.fit(sales_X_crimerate, sales_y_houseprice)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
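For intuition about what the fit computes in the single-feature case, the least squares slope and intercept have a closed form. The sketch below is illustrative only: it uses just the ten rows displayed by `sales.head(10)` above rather than the full dataset, so its numbers will not match the coefficients of `regr`. It cross-checks the formula against `np.polyfit` instead of scikit-learn. The names `x`, `y`, `slope`, and `intercept` are introduced here for the example.

```python
import numpy as np

# The ten rows displayed by sales.head(10) above (NOT the full dataset).
x = np.array([29.7, 24.1, 19.5, 49.4, 54.1, 48.6, 30.8, 10.8, 20.2, 20.4])
y = np.array([140463, 113033, 124186, 110490, 79124,
              92634, 89246, 195145, 297342, 264298], dtype=float)

# Closed-form simple regression:
#   slope     = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   intercept = mean(y) - slope * mean(x)
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Cross-check against a degree-1 least squares polynomial fit.
check_slope, check_intercept = np.polyfit(x, y, deg=1)

print(f"closed form: slope={slope:.2f}, intercept={intercept:.2f}")
print(f"np.polyfit:  slope={check_slope:.2f}, intercept={check_intercept:.2f}")
```

Both routes minimize the same squared error, so the two printed lines should agree up to floating-point rounding.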

Let’s see what our fit looks like

sns.lmplot(x='CrimeRate', y='HousePrice', data=sales)
<seaborn.axisgrid.FacetGrid at 0x187795f8>

[Figure: scatter plot of HousePrice vs. CrimeRate with fitted regression line, all neighborhoods]

Above: the dots are the original data and the blue line is the fit from the simple regression.

Remove Center City and redo the analysis

Center City is the one observation with an extremely high crime rate, yet its house prices are not very low. This point does not follow the trend of the rest of the data very well. How much is including Center City influencing our fit on the other datapoints? Let’s remove this datapoint and see what happens.

sales_noCC = sales[sales['MilesPhila'] != 0.0]
sns.lmplot(x='CrimeRate', y='HousePrice', data=sales_noCC, fit_reg=False)
<seaborn.axisgrid.FacetGrid at 0x18c1ce48>

[Figure: scatter plot of HousePrice vs. CrimeRate, Center City removed]

Refit our simple regression model on this modified dataset:

# Column vectors of shape (n_samples, 1), as in the first fit
sales_noCC_X_crimerate  = sales_noCC['CrimeRate'].values.reshape(-1, 1)
sales_noCC_y_houseprice = sales_noCC['HousePrice'].values.reshape(-1, 1)

# Create linear regression object
regr_noCC = linear_model.LinearRegression()

# Train the model using the training sets
regr_noCC.fit(sales_noCC_X_crimerate, sales_noCC_y_houseprice)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Look at the fit:

sns.lmplot(x='CrimeRate', y='HousePrice', data=sales_noCC)
<seaborn.axisgrid.FacetGrid at 0x18c27358>

[Figure: scatter plot of HousePrice vs. CrimeRate with fitted regression line, Center City removed]

Compare coefficients for full-data fit versus no-Center-City fit

Visually, the fit seems different, but let’s quantify this by examining the estimated coefficients of our original fit and that of the modified dataset with Center City removed.

params_dict = {'intercept':regr.intercept_[0],'CrimeRate':regr.coef_[0][0]}
pd.DataFrame(list(params_dict.items()), columns=['name','value'])
name value
0 intercept 176629.408107
1 CrimeRate -576.908128
params_noCC_dict = {'intercept':regr_noCC.intercept_[0],'CrimeRate':regr_noCC.coef_[0][0]}
pd.DataFrame(list(params_noCC_dict.items()), columns=['name','value'])
name value
0 intercept 225233.551839
1 CrimeRate -2288.689430

Above: For the “no Center City” fit, each unit increase in crime rate lowers the predicted house price by about $2,289. For the original dataset, the drop is only about $577 per unit increase in crime rate. Removing a single observation makes the estimated slope roughly four times steeper.
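To make the difference concrete, the short calculation below applies both reported slopes to the same change in crime rate. The slopes are copied from the coefficient tables above; the 10-unit increase is a hypothetical value chosen only for illustration.

```python
# Slopes reported in the coefficient tables above (dollars per unit of CrimeRate).
slope_full = -576.908128     # fit on all neighborhoods
slope_no_cc = -2288.689430   # fit with Center City removed

# Hypothetical increase in CrimeRate, chosen only for illustration.
crime_increase = 10.0

# The intercept cancels out when comparing two predictions from the same line,
# so the predicted price change depends on the slope alone.
change_full = slope_full * crime_increase
change_no_cc = slope_no_cc * crime_increase
ratio = slope_no_cc / slope_full

print(f"Predicted price change, full data:      {change_full:,.0f}")
print(f"Predicted price change, no Center City: {change_no_cc:,.0f}")
print(f"Ratio of slopes: {ratio:.2f}")
```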

High leverage points:

Center City is said to be a “high leverage” point because it sits at an extreme x value where there are no other observations. Recalling the closed-form solution for simple regression, such a point can dramatically change the least squares line: the center of mass of x is heavily influenced by this one point, and the least squares line will try to pass close to that outlying (in x) point. If a high leverage point follows the trend of the other data, it might not have much effect. If it departs from that trend, it can be strongly influential in the resulting fit.
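Leverage can be computed directly. For simple regression with an intercept, the leverage of point i is h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2, which depends only on the x values. The sketch below uses hypothetical x values (nine typical points and one far to the right), not the Philadelphia data.

```python
import numpy as np

# Hypothetical x values: nine typical points and one far to the right.
x = np.array([10.0, 15.0, 20.0, 20.0, 25.0, 30.0, 30.0, 35.0, 40.0, 350.0])

n = x.size
x_centered = x - x.mean()

# Leverage for simple regression with an intercept:
#   h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2
leverage = 1.0 / n + x_centered ** 2 / np.sum(x_centered ** 2)

for xi, hi in zip(x, leverage):
    print(f"x = {xi:6.1f}   leverage = {hi:.3f}")
```

The leverages always sum to 2 here (one per fitted parameter), so a single point with leverage close to 1 is absorbing nearly a whole parameter's worth of the fit by itself.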

Influential observations:

An influential observation is one whose removal substantially changes the fit. As discussed above, high leverage points are good candidates for being influential observations, but need not be. Other observations that are not leverage points can also be influential observations (e.g., strongly outlying in y even if x is a typical value).
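A simple way to look for influential observations is to refit the line with each point left out in turn and record how far the slope moves. The sketch below does this on hypothetical data (nine points on a downward trend plus one high leverage point that does not follow it). It uses `np.polyfit` for the fits, and the names `slope_shift` and `most_influential` are introduced here for the example.

```python
import numpy as np

# Hypothetical data: nine points on a clear downward trend, plus one
# high leverage point (last entry) that does not follow that trend.
x = np.array([10.0, 15.0, 20.0, 20.0, 25.0, 30.0, 30.0, 35.0, 40.0, 350.0])
y = np.array([300.0, 280.0, 270.0, 250.0, 240.0, 210.0, 200.0, 190.0, 160.0, 100.0])

# Slope of the least squares line on all points.
full_slope = np.polyfit(x, y, deg=1)[0]

# Refit with each point removed and record the absolute change in slope.
slope_shift = np.empty(x.size)
for i in range(x.size):
    keep = np.arange(x.size) != i
    loo_slope = np.polyfit(x[keep], y[keep], deg=1)[0]
    slope_shift[i] = abs(loo_slope - full_slope)

most_influential = int(np.argmax(slope_shift))
print(f"slope on all points: {full_slope:.3f}")
print(f"largest slope change comes from removing index {most_influential} "
      f"(x = {x[most_influential]:.1f})")
```

The same loop could be run on the real data by substituting the `CrimeRate` and `HousePrice` columns for `x` and `y`.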

Remove high-value outlier neighborhoods and redo analysis

Based on the discussion above, we can also ask whether the outlying high-value towns are strongly influencing the fit. Let’s remove them and see what happens.

# Remove the outlying high-value towns
sales_nohighend = sales_noCC[sales_noCC['HousePrice'] < 350000]

# Column vectors of shape (n_samples, 1), as in the earlier fits
sales_noCCnohighend_X_crimerate  = sales_nohighend['CrimeRate'].values.reshape(-1, 1)
sales_noCCnohighend_y_houseprice = sales_nohighend['HousePrice'].values.reshape(-1, 1)

# Create linear regression object
regr_noCCnohighend = linear_model.LinearRegression()

# Train the model using the training sets
regr_noCCnohighend.fit(sales_noCCnohighend_X_crimerate, sales_noCCnohighend_y_houseprice)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Do the coefficients change much?

params_noCC_dict = {'intercept':regr_noCC.intercept_[0],'CrimeRate':regr_noCC.coef_[0][0]}
pd.DataFrame(list(params_noCC_dict.items()), columns=['name','value'])
name value
0 intercept 225233.551839
1 CrimeRate -2288.689430
params_noCCnohighend_dict = {'intercept':regr_noCCnohighend.intercept_[0],'CrimeRate':regr_noCCnohighend.coef_[0][0]}
pd.DataFrame(list(params_noCCnohighend_dict.items()), columns=['name','value'])
name value
0 intercept 199098.852670
1 CrimeRate -1838.562649

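Above: We see that removing the outlying high-value neighborhoods has some effect on the fit (the slope moves from about -2,289 to about -1,839, a change of roughly 20%), but not nearly as much as removing our high-leverage Center City datapoint, which changed the slope from about -577 to about -2,289.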

last edit 26/10/2016
