Multivariate Linear Regression in Python – Step 6.) Backward Elimination

The model we have built so far is not necessarily optimal.

This Multivariate Linear Regression Model takes all of the independent variables into consideration.

In reality, not all of the observed variables are statistically significant.

That means some variables have a greater impact on the dependent variable Y, while others are not statistically significant at all.

For that reason, backward elimination will be employed to remove the less significant variables.

What is Backward Elimination?

  • The idea of Backward Elimination is to remove independent variables that are not statistically significant.
  • If your dataset is huge, this can make a big difference, because your model can run with fewer variables.
  • Our goal here is to find a group of independent variables that each have a significant impact on the dependent variable.

Mechanism of Backward Elimination

  1. Select a significance level (e.g. 0.05; if a variable's P value is greater than this significance level, we will remove it).
  2. First, fit ALL variables to the model.
  3. Find the P values of ALL variables.
  4. Remove the variable with the largest P value.
  5. Fit the model again with the variable from Step 4 removed.
  6. Repeat Steps 4 & 5 until all P values are smaller than the significance level defined in Step 1.
  7. The model is ready. (A minimal sketch of this loop in code follows below.)
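
Here is a minimal sketch of that loop in Python, assuming x is a NumPy feature matrix that already includes the intercept column of 1s (explained in the next section) and y is the dependent variable. The function name backward_elimination and its return values are illustrative choices, not part of any library:

import numpy as np
import statsmodels.api as sm

def backward_elimination(x, y, significance_level=0.05):
    columns = list(range(x.shape[1]))                  # Step 2: start with ALL variables
    while columns:
        regressorOLS = sm.OLS(y, x[:, columns]).fit()  # Steps 2/5: fit the model
        pvalues = np.asarray(regressorOLS.pvalues)     # Step 3: P values of ALL variables
        if pvalues.max() <= significance_level:        # Step 6: all significant -> done
            return regressorOLS, columns               # Step 7: the model is ready
        columns.pop(int(pvalues.argmax()))             # Step 4: drop the largest P value
    return None, columns                               # no variable survived the threshold

The section below walks through the same procedure by hand, one removal at a time.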

How to do Backward Elimination in Python

  • Recall that the formula for Linear Regression is: y = b0 + b1*x1 + b2*x2 + … + bn*xn
    • b0 is the intercept.
    • Our feature matrix does not contain the intercept, and statsmodels' OLS does not add one automatically (unlike scikit-learn's LinearRegression).
    • To create the intercept term, we concatenate a column filled with 1s onto our feature matrix.
    • x = np.append(arr = np.ones((50, 1)).astype(int), values = x, axis = 1) (50 is the number of rows in this dataset; np.ones((x.shape[0], 1)) is the more general form).
    • np stands for numpy, which is a library that we imported at the beginning.
  • We are going to use statsmodels, imported as sm. (Older tutorials import statsmodels.formula.api, but in current versions OLS lives in statsmodels.api.)
  • xelimination is created (i.e. xelimination = x[:, [0, 1, 2, 3, 4, 5]]). The first ":" means that all of the rows are included, and the bracket after the comma lists the column indices to include. This way, we can remove a column (i.e. remove an independent variable) easily; see the short illustration after this list.
    • For example, if we want to remove the second column (column index = 1), then we write xelimination = x[:, [0, 2, 3, 4, 5]].
  • regressorOLS = sm.OLS(y, xelimination).fit() fits the Multivariate Linear Regression. Be careful to pass the y variable and the x variable in the correct order.
  • regressorOLS.summary() shows you the P values of the regression!
    • If regressorOLS.summary() does not display anything (e.g. in a plain script), use print(regressorOLS.summary()).
  • OLS here stands for "Ordinary Least Squares".
  • The example below shows you how to do Backward Elimination in Python.
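
First, to make the intercept column and the column-index selection concrete, here is a tiny standalone illustration (the demo matrix is made up for demonstration; it is not the dataset from the example):

import numpy as np

demo = np.arange(1, 9).reshape(4, 2)                             # a fake 4-row, 2-column matrix
demo = np.append(arr = np.ones((4, 1)).astype(int), values = demo, axis = 1)
print(demo)             # column 0 is now all 1s (the intercept)
print(demo[:, [0, 2]])  # all rows, column 1 removed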


Example of Backward Elimination in Python

#Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#Import data
dataset = pd.read_csv('multivariate_data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

#Encode Categorical Data (categorical_features was removed from OneHotEncoder,
#so current scikit-learn versions use ColumnTransformer instead)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder='passthrough')
x = np.array(ct.fit_transform(x), dtype=float)

#Remove Dummy Variable Trap
x=x[:, 1:]

#Splitting training set and testing set (sklearn.cross_validation was renamed to sklearn.model_selection)
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2)

# Training the Multivariate Linear Regression Model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(xtrain, ytrain)

# Predicting the Test set results
y_prediction= regressor.predict(xtest)

# Backward Elimination

# Add the intercept column of 1s (50 = number of rows in this dataset)
x = np.append(arr = np.ones((50, 1)).astype(int), values = x, axis = 1)

# Call Ordinary Least Squares (in current statsmodels versions, OLS lives in statsmodels.api)
import statsmodels.api as sm
xelimination = x[:, [0, 1, 2, 3, 4, 5]]  # start with ALL variables
regressorOLS = sm.OLS(y, xelimination).fit()
print(regressorOLS.summary())
xelimination = x[:, [0, 1, 3, 4, 5]]     # column 2 had the largest P value; remove it
regressorOLS = sm.OLS(y, xelimination).fit()
print(regressorOLS.summary())
xelimination = x[:, [0, 3, 4, 5]]        # column 1 removed next
regressorOLS = sm.OLS(y, xelimination).fit()
print(regressorOLS.summary())
xelimination = x[:, [0, 3, 5]]           # column 4 removed next
regressorOLS = sm.OLS(y, xelimination).fit()
print(regressorOLS.summary())
xelimination = x[:, [0, 3]]              # column 5 removed last
regressorOLS = sm.OLS(y, xelimination).fit()
print(regressorOLS.summary())
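
As a side note, instead of re-reading the printed summary table at each step, you can read the P values directly off the fitted results; pvalues is an attribute of the statsmodels results object:

# P values of the current fit, in column order
print(regressorOLS.pvalues)
# Position of the variable with the largest P value (the next candidate to remove)
print(regressorOLS.pvalues.argmax())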

  • If all of the independent variables still have large P values, then you should try models that are not linear.
    • Try Kernel SVM, for example.
