Feature Selection and LASSO

In this notebook, we will use LASSO to select features, building on a pre-implemented solver for LASSO.

**outline for this notebook**

  • We will run LASSO with different L1 penalties.
  • We will choose the best L1 penalty using a validation set.
  • We will choose the best L1 penalty using a validation set, with an additional constraint on the size of the selected subset.

In the second notebook, we will implement our own LASSO solver, using coordinate descent.

import libraries

In this section, we will import the libraries we will use later.

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from math import log, sqrt
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt
from sklearn import linear_model

read data in

The dataset used in this notebook contains house sales from King County, the region where the city of Seattle, WA is located.

data = pd.read_csv("kc_house_data.csv")
colname_lst = list(data.columns.values)
coltype_lst =  [str, str, float, float, float, float, int, str, int, int, int, int, int, int, int, int, str, float, float, float, float]
col_type_dict = dict(zip(colname_lst, coltype_lst))
data.head()
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900 3 1.00 1180 5650 1 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000 3 2.25 2570 7242 2 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000 2 1.00 770 10000 1 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000 4 3.00 1960 5000 1 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000 3 2.00 1680 8080 1 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

create new features

In this notebook, we consider transforming some of the input features.

data['sqft_living_sqrt'] = data['sqft_living'].apply(sqrt)
data['sqft_lot_sqrt'] = data['sqft_lot'].apply(sqrt)
data['bedrooms_square'] = data['bedrooms']*data['bedrooms']

# In the dataset, 'floors' was defined with type string,
# so we'll convert them to float, before creating a new feature.
data['floors'] = data['floors'].astype(float)
data['floors_square'] = data['floors']*data['floors']
  • Squaring bedrooms will increase the separation between not many bedrooms (e.g. 1) and lots of bedrooms (e.g. 4), since 1^2 = 1 but 4^2 = 16. Consequently this variable will mostly affect houses with many bedrooms.
  • On the other hand, taking the square root of sqft_living will decrease the separation between a big house and a small house. The owner may not be exactly twice as happy about getting a house that is twice as big.

learn regression weights with L1 penalty

Let's fit a model with all the features available, in addition to the features we just created above.

all_features = ['bedrooms', 'bedrooms_square','bathrooms', 'sqft_living', 'sqft_living_sqrt',
            'sqft_lot', 'sqft_lot_sqrt', 'floors', 'floors_square', 'waterfront', 'view',
            'condition', 'grade','sqft_above','sqft_basement','yr_built', 'yr_renovated']

Applying an L1 penalty requires linear_model.Lasso() from sklearn.linear_model. In this section we will apply an L1 penalty of 1e5 to the linear regression.

regfit = linear_model.Lasso(alpha=1e5)
regfit.fit(data[all_features], data["price"])
Lasso(alpha=100000.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

Let's find which features received non-zero weights.

pd.DataFrame(np.append(regfit.intercept_,regfit.coef_), columns=["coefficient"])
coefficient
0 4622788.430771
1 -0.000000
2 -1590.124357
3 0.000000
4 291.126810
5 -0.000000
6 1.067444
7 -950.345152
8 0.000000
9 0.000000
10 0.000000
11 0.000000
12 0.000000
13 0.000000
14 38.764596
15 -0.000000
16 -2364.943534
17 39.865638

Note that a majority of the weights have been set to zero. So by setting an L1 penalty that is large enough, we are performing subset selection.
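To see exactly which features survived, we can map the non-zero coefficients back to their feature names. A minimal sketch (selected_features is our own name, not part of the original notebook):

# map the non-zero coefficients back to the feature names they belong to
# (selected_features is a hypothetical helper name, not from the original notebook)
selected_features = [f for f, w in zip(all_features, regfit.coef_) if w != 0]
print selected_features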

selecting an L1 penalty

To find a good L1 penalty, we will explore multiple values using a validation set. Let us do a three-way split into train, validation, and test sets:

  • Split our data into two sets: training and test.
  • Further split our training data into two sets: train and validation.
idx = np.random.rand(len(data))<0.8
train = data[idx]; test= data[~idx]
idx = np.random.rand(len(train))<0.8
valid = train[~idx]; train= train[idx]
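The split above is unseeded, so the exact rows in each set will vary from run to run. For reproducibility, the same three-way division can be sketched with scikit-learn's train_test_split (an assumption on our part, not the original notebook's approach; in older scikit-learn versions the function lives in sklearn.cross_validation, and the random_state value is an arbitrary choice of ours):

from sklearn.model_selection import train_test_split

# 80/20 split into training and test, then a further 80/20 split of training into train and validation
train, test = train_test_split(data, test_size=0.2, random_state=0)
train, valid = train_test_split(train, test_size=0.2, random_state=0)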

Next, we will write a loop that does the following:

  • For l1_penalty in [10^1, 10^1.5, 10^2, 10^2.5, …, 10^7] (to get this in Python, type np.logspace(1, 7, num=13).)
    • Fit a regression model with a given l1_penalty on TRAIN data, passing alpha=l1_penalty to linear_model.Lasso().
    • Compute the RSS on VALIDATION data for that l1_penalty (you will want to use .predict() here; see the RSS sketch after this list).
  • Report which l1_penalty produced the lowest RSS on validation data.
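Recall that the residual sum of squares is RSS = sum_i (predicted_i - actual_i)^2, i.e. the sum of squared prediction errors on the dataset in question. A small helper capturing this computation (a sketch; the rss name is ours, and the loop below simply inlines the same arithmetic):

def rss(model, features, target):
    # residual sum of squares: sum of squared prediction errors
    errors = model.predict(features) - target
    return (errors * errors).sum()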
l1_penalty = np.logspace(1, 7, num=13)
print l1_penalty
[  1.00000000e+01   3.16227766e+01   1.00000000e+02   3.16227766e+02
   1.00000000e+03   3.16227766e+03   1.00000000e+04   3.16227766e+04
   1.00000000e+05   3.16227766e+05   1.00000000e+06   3.16227766e+06
   1.00000000e+07]
rss_validation = []
rss_min = float('inf')
for penalty in l1_penalty:
    # fit a LASSO model with the current penalty on the training data
    lassofit = linear_model.Lasso(alpha=penalty)
    lassofit.fit(train[all_features], train["price"])
    # compute the RSS on the validation data
    predicted = lassofit.predict(valid[all_features])
    residuals = predicted - valid['price']
    rss = (residuals * residuals).sum()
    rss_validation.append(rss)
    if rss < rss_min:
        # re-assign the new minimum
        rss_min = rss
        # keep the best model found so far
        model_with_best_rss = lassofit
#print rss_validation
print 'best rss for validation set', rss_min
best rss for validation set 2.04059895908e+14

Now we will compute RSS from our test set.

residual_test = model_with_best_rss.predict(test[all_features]) - test['price']
print 'RSS from our best model with test data', (residual_test * residual_test).sum()
RSS from our best model with test data 2.82389002842e+14

Now, we will take a look at the coefficients of our model.

pd.DataFrame(np.append(model_with_best_rss.intercept_,model_with_best_rss.coef_), columns=["coefficient"])
coefficient
0 4579313.014629
1 -0.000000
2 -1174.696168
3 0.000000
4 289.501457
5 -0.000000
6 1.048240
7 -969.058142
8 0.000000
9 0.000000
10 0.000000
11 0.000000
12 0.000000
13 0.000000
14 43.450123
15 -0.000000
16 -2345.830952
17 38.317930

limit the number of nonzero weights

What if we absolutely wanted to limit ourselves to, say, 7 features? This may be important if we want to derive “a rule of thumb”, an interpretable model that has only a few features in it.

In this section, you are going to implement a simple, two-phase procedure to achieve this goal:

  1. Explore a large range of l1_penalty values to find a narrow region of l1_penalty values where models are likely to have the desired number of non-zero weights.
  2. Further explore the narrow region you found to find a good value for l1_penalty that achieves the desired sparsity. Here, we will again use a validation set to choose the best value for l1_penalty.
max_nonzeros = 7

exploring a large range of values to find a narrow range with the desired sparsity

Let’s define a wide range of possible l1_penalty_values:

l1_penalty_values = np.logspace(2, 10, num=20)
l1_penalty_values
array([  1.00000000e+02,   2.63665090e+02,   6.95192796e+02,
         1.83298071e+03,   4.83293024e+03,   1.27427499e+04,
         3.35981829e+04,   8.85866790e+04,   2.33572147e+05,
         6.15848211e+05,   1.62377674e+06,   4.28133240e+06,
         1.12883789e+07,   2.97635144e+07,   7.84759970e+07,
         2.06913808e+08,   5.45559478e+08,   1.43844989e+09,
         3.79269019e+09,   1.00000000e+10])

Now, we will implement a loop that searches through this space of possible l1_penalty values:

  • For l1_penalty in np.logspace(2, 10, num=20):
    • Fit a regression model with a given l1_penalty on TRAIN data.
    • Extract the weights of the model and count the number of nonzeros. Save the number of nonzeros to a list.
non_zero_l1 = []
for l1 in l1_penalty_values:
    lassofit = linear_model.Lasso(alpha=l1)
    lassofit.fit(train[all_features], train["price"])
    non_zero_l1.append(np.count_nonzero(np.append(lassofit.intercept_, lassofit.coef_)))

print non_zero_l1
[18, 18, 17, 17, 16, 13, 12, 9, 7, 7, 5, 4, 4, 3, 3, 3, 2, 2, 1, 1]

Out of this large range, we want to find the two ends of our desired narrow range of l1_penalty. At one end, we will have l1_penalty values that have too few non-zeros, and at the other end, we will have an l1_penalty that has too many non-zeros.

More formally, we want to find:

  • The largest l1_penalty that has more non-zeros than max_nonzeros; we will store this value in the variable l1_penalty_min.
  • The smallest l1_penalty that has fewer non-zeros than max_nonzeros; we will store this value in the variable l1_penalty_max.
i = 0
while (non_zero_l1[i] > max_nonzeros):
    i += 1
l1_penalty_min = l1_penalty_values[i - 1]
print 'largest l1 penalty %s with non-zero params more than max-non-zero index %s' % (l1_penalty_min, i-1)
l1_penalty_max = l1_penalty_values[i]
print 'smallest l1 penalty %s with non-zero params less than max-non-zero index %s' % (l1_penalty_max, i)
largest l1 penalty 88586.679041 with non-zero params more than max-non-zero index 7
smallest l1 penalty 233572.146909 with non-zero params less than max-non-zero index 8
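A vectorized equivalent of the loop above (a sketch; note that, like the loop's stopping condition, it treats the upper end as "at most max_nonzeros" rather than strictly fewer):

nz = np.array(non_zero_l1)
# largest penalty whose model still has more than max_nonzeros non-zero parameters
l1_penalty_min = l1_penalty_values[nz > max_nonzeros].max()
# smallest penalty whose model has at most max_nonzeros non-zero parameters
l1_penalty_max = l1_penalty_values[nz <= max_nonzeros].min()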

exploring the narrow range of values to find the solution with the right number of non-zeros that has lowest RSS on the validation set

We will now explore the narrow region of l1_penalty values we found:

l1_penalty_values = np.linspace(l1_penalty_min,l1_penalty_max,20)
print l1_penalty_values
[  88586.67904101   96217.49313932  103848.30723764  111479.12133596
  119109.93543427  126740.74953259  134371.5636309   142002.37772922
  149633.19182754  157264.00592585  164894.82002417  172525.63412248
  180156.4482208   187787.26231912  195418.07641743  203048.89051575
  210679.70461406  218310.51871238  225941.3328107   233572.14690901]
  • For l1_penalty in np.linspace(l1_penalty_min,l1_penalty_max,20):
    • Fit a regression model with a given l1_penalty on TRAIN data, again passing alpha=l1_penalty to linear_model.Lasso().
    • Measure the RSS of the learned model on the VALIDATION set

Now let's find the model that has the lowest RSS on the VALIDATION set and has sparsity equal to max_nonzeros.

rss_validation2 = []
rss_min2 = float('inf')

for l1 in l1_penalty_values:
    # fit a LASSO model with the current penalty on the training data
    lassofit = linear_model.Lasso(alpha=l1)
    lassofit.fit(train[all_features], train["price"])
    # only consider models with exactly max_nonzeros non-zero parameters
    if np.count_nonzero(np.append(lassofit.intercept_, lassofit.coef_)) == max_nonzeros:
        predicted = lassofit.predict(valid[all_features])
        residuals = predicted - valid['price']
        rss = (residuals * residuals).sum()
        rss_validation2.append(rss)
        if rss < rss_min2:
            l1_penalty_with_lowest_rss_max_nonzero = l1
            rss_min2 = rss
            model_with_best_rss2 = lassofit
print rss_validation2
print 'best rss for validation set in narrow range of l1 penalty', rss_min2
print 'l1_penalty with lowest rss on validation and max non zero', l1_penalty_with_lowest_rss_max_nonzero
pd.DataFrame(np.append(model_with_best_rss2.intercept_, model_with_best_rss2.coef_), columns=["coefficient"])
[205602567450156.88]
best rss for validation set in narrow range of l1 penalty 2.0560256745e+14
l1_penalty with lowest rss on validation and max non zero 10000000.0
coefficient
0 4092923.302669
1 -0.000000
2 -0.000000
3 0.000000
4 283.712672
5 -0.000000
6 0.813702
7 -790.985527
8 0.000000
9 0.000000
10 0.000000
11 0.000000
12 0.000000
13 0.000000
14 38.996378
15 -0.000000
16 -2103.695347
17 41.471200

last edited: 05/11/2016
