Simple Linear Regression

In this notebook I use data on house sales in King County to predict house prices using simple (one input) linear regression.

  • I will use pandas descripttive to compute summary statistics
  • I will compute the Simple Linear Regression weights using the closed form solution
  • I will make predictions of the output given the input feature
  • I will also try turning the regression around to predict the input given the output
  • Finally, I will compare two different models for predicting house prices

Import library

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(color_codes=True)

Load house sales data

Dataset is from house sales in King County, the region where the city of Seattle, WA is located.

The following section is to load the entire dataset in and made a dictionary of column names and data type as this was used latter when importing training datset and test dataset

data = pd.read_csv("kc_house_data.csv")
colname_lst = list(data.columns.values)
coltype_lst =  [str, str, float, float, float, float, int, str, int, int, int, int, int, int, int, int, str, float, float, float, float]
col_type_dict = dict(zip(colname_lst, coltype_lst))

Split data into training data and test data

From the entire dataset above, we spit data into training and test set using numpy.

idx = np.random.rand(len(data)) < 0.8
train = data[idx]
test = data[~idx]
#inspect first five elements in the training set
train.head(5)
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900 3 1.00 1180 5650 1 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
2 5631500400 20150225T000000 180000 2 1.00 770 10000 1 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000 4 3.00 1960 5000 1 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
5 7237550310 20140512T000000 1225000 4 4.50 5420 101930 1 0 0 ... 11 3890 1530 2001 0 98053 47.6561 -122.005 4760 101930
6 1321400060 20140627T000000 257500 3 2.25 1715 6819 2 0 0 ... 7 1715 0 1995 0 98003 47.3097 -122.327 2238 6819

5 rows × 21 columns

#inspect first five elements in the test set
test.head(5)
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
1 6414100192 20141209T000000 538000 3 2.25 2570 7242 2 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
4 1954400510 20150218T000000 510000 3 2.00 1680 8080 1 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503
10 1736800520 20150403T000000 662500 3 2.50 3560 9796 1 0 0 ... 8 1860 1700 1965 0 98007 47.6007 -122.145 2210 8925
18 16000397 20141205T000000 189000 2 1.00 1200 9850 1 0 0 ... 7 1200 0 1921 0 98002 47.3089 -122.210 1060 5095
23 8091400200 20140516T000000 252700 2 1.50 1070 9643 1 0 0 ... 7 1070 0 1985 0 98030 47.3533 -122.166 1220 8386

5 rows × 21 columns

Now let’s look at descriptive statistics for this data set

train.describe()
id price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
count 1.726500e+04 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000 17265.000000
mean 4.571240e+09 539130.778859 3.373183 2.114292 2081.189458 15239.504663 1.495743 0.007356 0.235042 3.409962 7.658384 1788.401390 292.788068 1971.046858 82.659485 98077.574109 47.559089 -122.213586 1989.928294 12918.148856
std 2.875593e+09 363000.241270 0.937721 0.767328 916.391531 41507.204789 0.541861 0.085453 0.767452 0.650456 1.172731 827.495956 444.716563 29.354455 397.708493 53.492064 0.138561 0.140198 688.513788 28142.809544
min 1.000102e+06 75000.000000 0.000000 0.000000 290.000000 572.000000 1.000000 0.000000 0.000000 1.000000 1.000000 290.000000 0.000000 1900.000000 0.000000 98001.000000 47.155900 -122.519000 460.000000 651.000000
25% 2.114700e+09 323000.000000 3.000000 1.750000 1430.000000 5065.000000 1.000000 0.000000 0.000000 3.000000 7.000000 1190.000000 0.000000 1952.000000 0.000000 98033.000000 47.469000 -122.328000 1490.000000 5100.000000
50% 3.904100e+09 450000.000000 3.000000 2.250000 1920.000000 7620.000000 1.500000 0.000000 0.000000 3.000000 7.000000 1560.000000 0.000000 1975.000000 0.000000 98065.000000 47.571000 -122.229000 1850.000000 7621.000000
75% 7.305300e+09 641250.000000 4.000000 2.500000 2550.000000 10766.000000 2.000000 0.000000 0.000000 4.000000 8.000000 2210.000000 560.000000 1997.000000 0.000000 98117.000000 47.677300 -122.124000 2370.000000 10123.000000
max 9.900000e+09 7700000.000000 33.000000 8.000000 13540.000000 1651359.000000 3.500000 1.000000 4.000000 5.000000 13.000000 9410.000000 4820.000000 2015.000000 2015.000000 98199.000000 47.777600 -121.315000 6210.000000 871200.000000

Data visualisation

In addition to descriptive statistics, data visualisation also helps to understand the gist of data. Therefore we will spend a bit of time looking at some of the variables, we interested.

sns.distplot(train["price"]);

png

sns.distplot(train["bedrooms"])
<matplotlib.axes._subplots.AxesSubplot at 0x9f1f2b0>

png

sns.distplot(train["bathrooms"])
<matplotlib.axes._subplots.AxesSubplot at 0x17945710>

png

sns.distplot(train["sqft_living"])
<matplotlib.axes._subplots.AxesSubplot at 0x1a5e47f0>

png

Build a generic simple linear regression function

This section we will create generic simple linear regression function. This function will take two parameters.

  • one parameter is for input variable
  • the other one is for output variable

aim: we want to find the line that give the least residual sum of square (RSS).

take derivation of the above function gives:

Then, we set gradient to 0.
Top term will be:

Bottom term will be:

which these will be implemented below.

def simple_linear_regression(input_feature, output):
    # compute the sum of input_feature and output
    input_sum = input_feature.sum()
    output_sum = output.sum()
    # compute the product of the output and the input_feature and its sum
    product_of_input_output = (input_feature*output).sum()
    # compute the squared value of the input_feature and its sum
    sum_of_sqrt_input = (input_feature*input_feature).sum()
    # use the formula for the slope or the top term
    slope = (product_of_input_output-(input_sum*output_sum/output.size))/(sum_of_sqrt_input-input_sum*input_sum/output.size)
    # use the formula for the intercept or the bottom term
    intercept = output_sum/output.size - slope*input_sum/output.size
    return (intercept, slope)

We can test that our function works by passing it something where we know the answer. In particular we can generate a feature and then put the output exactly on a line: output = 1 + 1*input_feature then we know both our slope and intercept should be 1

test_feature = pd.DataFrame(range(5))
test_output = pd.DataFrame(1+1*test_feature)
(test_intercept, test_slope) =  simple_linear_regression(test_feature, test_output)
print "Intercept: " + str(test_intercept[0])
print "Slope: " + str(test_slope[0])
Intercept: 1.0
Slope: 1.0

Now that we know it works. Let’s build a regression model for predicting price based on sqft_living. Rembember that we train on train_data!

sqft_intercept, sqft_slope = simple_linear_regression(train['sqft_living'], train['price'])

print "Intercept: " + str(sqft_intercept)
print "Slope: " + str(sqft_slope)
Intercept: -39378.4243047
Slope: 277.970465792

Predicting Values

Now that we have the model parameters: intercept & slope we can make predictions. The following function will return the predicted output given the input_feature, slope and intercept:

def get_regression_predictions(input_feature, intercept, slope):
    # calculate the predicted values:
    predicted_values = intercept + slope*input_feature
    return predicted_values

Now that we can calculate a prediction given the slope and intercept let’s make a prediction. Now we will predict the estimated price for a house with 2650 squarefeet according to the squarefeet model we estiamted above.

my_house_sqft = 2650
estimated_price = get_regression_predictions(my_house_sqft, sqft_intercept, sqft_slope)
print "The estimated price for a house with %d squarefeet is $%.2f" % (my_house_sqft, estimated_price)
The estimated price for a house with 2650 squarefeet is $697243.31

Residual Sum of Squares

Now that we have a model and can make predictions let’s evaluate our model using Residual Sum of Squares (RSS). Recall that RSS is the sum of the squares of the residuals and the residuals is just a fancy word for the difference between the predicted output and the true output.

The following function to compute the RSS of a simple linear regression model given the input_feature, output, intercept and slope:

def get_residual_sum_of_squares(input_feature, output, intercept, slope):
    # First get the predictions
    predicted_values_series = intercept + slope*input_feature
    # then compute the residuals (since we are squaring it doesn't matter which order you subtract)
    residual_series = output - predicted_values_series
    # square the residuals and add them up
    residual_square_series = residual_series * residual_series
    RSS = residual_square_series.sum()
    return(RSS)

Let’s test our get_residual_sum_of_squares function by applying it to the test model where the data lie exactly on a line. Since they lie exactly on a line the residual sum of squares should be zero!

print get_residual_sum_of_squares(test_feature, test_output, test_intercept, test_slope)
0    0
dtype: float64

Now use your function to calculate the RSS on training data from the squarefeet model calculated above.

rss_prices_on_sqft = get_residual_sum_of_squares(train['sqft_living'], train['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)
The RSS of predicting Prices based on Square Feet is : 1.15464936338e+15

Predict the squarefeet given price

What if we want to predict the squarefoot given the price? Since we have an equation y = a + b*x we can solve the function for x. So that if we have the intercept (a) and the slope (b) and the price (y) we can solve for the estimated squarefeet (x).

The following function will predict the input_feature given the output!

def inverse_regression_predictions(output, intercept, slope):
    # solve output = intercept + slope*input_feature for input_feature. Use this equation to compute the inverse predictions:
    estimated_feature = (output-intercept)/slope
    return estimated_feature

Now that we have a function to compute the squarefeet given the price from our simple regression model let’s see how big we might expect a house that costs $800,000 to be.

my_house_price = 800000
estimated_squarefeet = inverse_regression_predictions(my_house_price, sqft_intercept, sqft_slope)
print "The estimated squarefeet for a house worth $%.2f is %d" % (my_house_price, estimated_squarefeet)
The estimated squarefeet for a house worth $800000.00 is 3019

New Model: estimate prices from bedrooms

We have made one model for predicting house prices using squarefeet, but there are many other features in the sales dataframe. Now we will make a simple linear regression function to estimate the regression parameters from predicting Prices based on number of bedrooms.

# Estimate the slope and intercept for predicting 'price' based on 'bedrooms'
(bedrooms_intercept, bedrooms_slope) =  simple_linear_regression(train['bedrooms'], train['price'])
(bedrooms_intercept, bedrooms_slope)
(135689.62284521962, 119602.51997969164)

Test our Linear Regression Algorithm

Now we have two models for predicting the price of a house. How do we know which one is better? Calculate the RSS on the TEST data (remember this data wasn’t involved in learning the model). Let’s compute the RSS from predicting prices using bedrooms and from predicting prices using squarefeet.

# Compute RSS when using squarefeet on TEST data:
rss_prices_on_sqft = get_residual_sum_of_squares(test['sqft_living'], test['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Square Feet is : ' + str(rss_prices_on_sqft)
The RSS of predicting Prices based on Square Feet is : 3.22792745249e+14
# Compute RSS when using bedrooms on TEST data:
rss_prices_on_sqft = get_residual_sum_of_squares(test['bedrooms'], test['price'], sqft_intercept, sqft_slope)
print 'The RSS of predicting Prices based on Bedrooms is : ' + str(rss_prices_on_sqft)
The RSS of predicting Prices based on Bedrooms is : 2.11218065967e+15

*last edited:23/10/16 *

Go to top