How to Handle Missing Data with Imputer in Python


 

One problem you will encounter when practicing data science is having to deal with missing data. In real life, missing data happens quite a lot. The technique demonstrated in the sections below will help you handle blank data so that your machine learning program can run correctly.

 

  • Say we have a dataset with some missing values. The easiest idea is to remove the rows with missing data, but this is risky because you are throwing away observations.
  • A better way to handle missing data is to replace the missing records with the MEAN value of that specific column.
  • The library that we are going to use here is scikit-learn, and the class name is Imputer.
  • We are going to replace ALL NaN values (missing data) in one go.
  • Imputer also lets you replace empty records with the median or the most frequent value in the column.
  • In other words, we can estimate the missing values using scikit-learn's Imputer class.
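To make the trade-off between dropping rows and filling them concrete, here is a minimal sketch using pandas only. The column names `label` and `value` are assumptions made for illustration; the values are the small example dataset used in this post.

```python
import numpy as np
import pandas as pd

# Small example dataset: one label column and one numeric column with a gap.
# The column names "label" and "value" are illustrative, not from the CSV file.
df = pd.DataFrame({
    "label": ["b", "c", "a", "b", "a", "a", "b", "c", "d", "a", "c"],
    "value": [2, 3, 8, 4, np.nan, 4, 8, 8, 3, 3, 6],
})

# Option 1: drop every row that contains a missing value (one observation is lost).
dropped = df.dropna()

# Option 2: keep all rows and fill the gap with the mean of the column.
filled = df.copy()
filled["value"] = filled["value"].fillna(filled["value"].mean())

print(len(df), len(dropped), len(filled))
print(filled.loc[4, "value"])
```

Dropping shrinks the dataset, while filling keeps every observation at the cost of inserting an estimated value.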

Visual Example of Imputer from sklearn:

Original Data

[‘b’, 2]

[‘c’, 3]

[‘a’, 8]

[‘b’, 4]

[‘a’, NaN]

[‘a’, 4]

[‘b’, 8]

[‘c’, 8]

[‘d’, 3]

[‘a’, 3]

[‘c’, 6]

New Data

[‘b’, 2]

[‘c’, 3]

[‘a’, 8]

[‘b’, 4]

[‘a’, 4.9]

[‘a’, 4]

[‘b’, 8]

[‘c’, 8]

[‘d’, 3]

[‘a’, 3]

[‘c’, 6]

 

The following code calculates the mean of the column and replaces the missing value with that mean.

import pandas as pd
dataset = pd.read_csv("Data_imputer.csv")
# Keep column 1 (the numeric column) as a 2D array
x = dataset.iloc[:, 1:2].values

# Imputer is available in scikit-learn versions before 0.22
from sklearn.preprocessing import Imputer
# Use a lowercase name so the instance does not shadow the Imputer class
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(x[:, 0:1])
x[:, 0:1] = imputer.transform(x[:, 0:1])
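The `Imputer` class was removed in scikit-learn 0.22. On newer releases, the same mean replacement can be sketched with `SimpleImputer` from `sklearn.impute`. The array below is a stand-in for `x`, typed in by hand because the CSV file is not reproduced here.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Stand-in for x = dataset.iloc[:, 1:2].values (hand-typed from the example data).
x = np.array([[2], [3], [8], [4], [np.nan], [4], [8], [8], [3], [3], [6]], dtype=float)

# SimpleImputer works column by column and expects np.nan rather than the string 'NaN'.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x_filled = imputer.fit_transform(x)

print(x_filled[4, 0])
```

`fit_transform` combines the fit and transform calls into one step; calling them separately gives the same result.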

 

Step 1.) Import Imputer from sklearn.preprocessing

from sklearn.preprocessing import Imputer

 

Step 2.) Create the Imputer

The Imputer constructor has some parameters that we have to specify.

Example 1: take the Mean of each column

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

Example 2: take the Median of each row

imputer = Imputer(missing_values='NaN', strategy='median', axis=1)

Example 3: take the Most Frequent value of each column

imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)

missing_values:

Define what counts as a missing value. In Python, if you inspect your data (for example, by double-clicking it in your IDE), you will see that blank cells appear as NaN. So in this case, we are going to set missing_values='NaN'.

strategy:

You can replace the missing data with one of the following values:
1.) 'mean'
2.) 'median'
3.) 'most_frequent'
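To see how the three strategies differ, here is a small sketch that fills the same gap with each one. It uses `SimpleImputer` (the replacement for `Imputer` in scikit-learn 0.22 and later) and a hand-typed copy of the example column; the dictionary name `fills` is purely illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hand-typed copy of the numeric column from the example; row 4 is missing.
x = np.array([[2], [3], [8], [4], [np.nan], [4], [8], [8], [3], [3], [6]], dtype=float)

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # Record the value that ends up in the previously missing cell.
    fills[strategy] = float(imputer.fit_transform(x)[4, 0])
    print(strategy, fills[strategy])
```

In this column both 3 and 8 occur three times, so the most frequent value is a tie. Pick the strategy that fits how your data is distributed.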

axis:

1.) axis=0  The imputer calculates the chosen statistic (for example, the mean) per column

2.) axis=1  The imputer calculates the chosen statistic per row
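The difference between the two axes can be illustrated with plain NumPy. This is only a sketch of the idea, not how scikit-learn implements it, and the small matrix `data` is made up for the example. It is also a way to impute per row on newer scikit-learn releases, where `SimpleImputer` has no axis parameter.

```python
import numpy as np

# Made-up 3x3 matrix with two missing cells.
data = np.array([
    [1.0, 2.0, np.nan],
    [4.0, np.nan, 6.0],
    [7.0, 8.0, 9.0],
])

# axis=0: one mean per column, ignoring NaN.
col_means = np.nanmean(data, axis=0)
# axis=1: one mean per row, ignoring NaN.
row_means = np.nanmean(data, axis=1)

# Positions of the missing cells.
rows, cols = np.where(np.isnan(data))

# Fill each gap with the mean of its column.
by_column = data.copy()
by_column[rows, cols] = col_means[cols]

# Fill each gap with the mean of its row.
by_row = data.copy()
by_row[rows, cols] = row_means[rows]

print(by_column)
print(by_row)
```

Column-wise imputation is the usual choice when each column is a feature, because values in the same row often have different units.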

Step 3.) Fit the Imputer to the Data

Apply the imputer to your data.

If your missing data is in column 1, fit the imputer on column 1 so that it calculates the mean of that column.

imputer = imputer.fit(dataset.iloc[:, 1:2])

Step 4.) Transform the Data

Replace your blank observations with the calculated value. In this case, the NaN entries are replaced with the mean value.

dataset.iloc[:, 1:2] = imputer.transform(dataset.iloc[:, 1:2])
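The split between fitting and transforming can be sketched as follows. Fitting learns the replacement value, and transforming applies it, so a fitted imputer can be reused on data it has not seen. The sketch uses `SimpleImputer` (scikit-learn 0.22 and later); the arrays `train` and `new_data` are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-column arrays, each with one missing value.
train = np.array([[2.0], [3.0], [np.nan], [7.0]])
new_data = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="mean")

# fit() learns the mean of the non-missing values: (2 + 3 + 7) / 3
imputer.fit(train)
learned_mean = imputer.statistics_[0]

# transform() fills the gaps with the learned mean.
train_filled = imputer.transform(train)
# The same learned mean is reused here; new_data is not used to recompute it.
new_filled = imputer.transform(new_data)

print(learned_mean)
print(new_filled)
```

Keeping the two steps separate lets you apply exactly the same replacement values to any later data.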

Other Sections on Data Handling in Python

1.) How to Import Libraries 

2.) How to Know and Change the Working Directory 

3.) How to Import CSV Data using Pandas

4.) How to Set Dependent Variables and Independent Variables using iloc

5.) How to Handle Missing data with Imputer

6.) How to Set Categorical Data (Dummy Variable) using LabelEncoder and OneHotEncoder

7.) How to Split Data into Training Set and Testing Set

8.) How to Apply Feature Scaling
