One problem you will run into while practicing data science is missing data. In real-world datasets, missing values are very common. The technique shown in this section helps you handle blank entries so that your machine learning program can run correctly.
- Say we have a dataset with some missing values. The easiest idea is to remove the rows that contain them, but this is risky because you are throwing away observations.
- A better solution is to replace each missing value with the MEAN of that specific column.
- The library that we are going to use here is scikit-learn, and the class is called Imputer. (In scikit-learn 0.22 and later, this class has been replaced by SimpleImputer in sklearn.impute.)
- We are going to replace ALL NaN values (missing data) in one go.
- This class also lets you replace missing values with the median or the most frequent value in the column.
- In other words, we can now fill in the missing values with an estimate using scikit-learn's Imputer class.
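To make the difference between the two ideas concrete, here is a minimal sketch that uses only pandas. It is not the Imputer approach itself, just an illustration. The values are the ones from the example in this section, and the column names `label` and `value` are assumptions made up for the sketch.

```python
import numpy as np
import pandas as pd

# Values from the example in this section; the column names are hypothetical.
data = pd.DataFrame({
    "label": ["b", "c", "a", "b", "a", "a", "b", "c", "d", "a", "c"],
    "value": [2, 3, 8, 4, np.nan, 4, 8, 8, 3, 3, 6],
})

# Idea 1: drop every row that has a missing value (one observation is lost).
dropped = data.dropna()

# Idea 2: keep every row and fill the gap with the mean of the column.
# Series.mean() skips NaN by default.
filled = data.fillna({"value": data["value"].mean()})

print(len(data), len(dropped), len(filled))
print(filled.loc[4, "value"])
```

Dropping shrinks the dataset, while filling keeps all eleven rows.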
Visual example of Imputer from sklearn:
Original Data
–['b', 2]
–['c', 3]
–['a', 8]
–['b', 4]
–['a', NaN]
–['a', 4]
–['b', 8]
–['c', 8]
–['d', 3]
–['a', 3]
–['c', 6]
New Data
–['b', 2]
–['c', 3]
–['a', 8]
–['b', 4]
–['a', 4.9]
–['a', 4]
–['b', 8]
–['c', 8]
–['d', 3]
–['a', 3]
–['c', 6]
The following code calculates the mean of the column and replaces the missing value with that mean.
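# Imputer exists only in scikit-learn releases before 0.22.
import pandas as pd
from sklearn.preprocessing import Imputer

dataset = pd.read_csv("Data_imputer.csv")
x = dataset.iloc[:, 1:2].values  # the numeric column as a 2-D array

# Replace NaN with the column mean (axis=0 means per column).
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(x[:, 0:1])
x[:, 0:1] = imputer.transform(x[:, 0:1])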
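On newer scikit-learn releases, where Imputer is no longer available, the same steps can be sketched with SimpleImputer. This is only a rough equivalent under a few assumptions: the DataFrame below stands in for Data_imputer.csv (which is not reproduced here), and its column names are made up. Note that SimpleImputer has no axis parameter and expects np.nan rather than the string 'NaN'.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Stand-in for Data_imputer.csv; column names are hypothetical.
dataset = pd.DataFrame({
    "label": ["b", "c", "a", "b", "a", "a", "b", "c", "d", "a", "c"],
    "value": [2, 3, 8, 4, np.nan, 4, 8, 8, 3, 3, 6],
})

x = dataset.iloc[:, 1:2].values  # the numeric column as a 2-D array

# SimpleImputer always works column by column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
x = imputer.fit_transform(x)  # fit and transform in one call

print(x.ravel())
```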
Step 1.) Import Imputer from sklearn.preprocessing
from sklearn.preprocessing import Imputer
Step 2.) Create the Imputer object
The constructor has some parameters that we have to specify.
Example 1: take the mean of each column
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
Example 2: take the median of each row
imputer = Imputer(missing_values='NaN', strategy='median', axis=1)
Example 3: take the most frequent value of each column
imputer = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
missing_values:
Defines which value counts as missing. When you inspect your data in Python, blank entries show up as NaN, so in this case we set missing_values='NaN'.
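Before choosing a value for missing_values, it helps to confirm where the NaN entries actually are. The sketch below uses pandas for that check. The tiny DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row dataset with one blank entry.
df = pd.DataFrame({"label": ["a", "a", "b"], "value": [8.0, np.nan, 4.0]})

# Count the missing entries in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Show the rows whose "value" entry is missing.
rows_with_gaps = df[df["value"].isna()]
print(rows_with_gaps)
```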
strategy:
You can replace the missing data with one of the following values
1.) Mean (strategy='mean')
2.) Median (strategy='median')
3.) Most frequent value (strategy='most_frequent')
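To see how the three strategies differ, here is a small sketch that fills the same gap three ways. It assumes a recent scikit-learn, so it uses SimpleImputer rather than Imputer, and the numbers in the column are made up so that each strategy gives a different answer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing entry at row 4.
column = np.array([[1.0], [1.0], [2.0], [4.0], [np.nan], [12.0]])

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(column)
    fills[strategy] = float(filled[4, 0])  # the value that replaced NaN

print(fills)
```

The mean is pulled up by the large value 12, while the median and the most frequent value are not, which is one reason to pick them when a column has outliers.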
axis:
1.) axis=0 The statistic (for example, the mean) is computed per column
2.) axis=1 The statistic (for example, the mean) is computed per row
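The effect of the axis setting can be illustrated with plain NumPy. This is not how Imputer is implemented, only a sketch of the idea on a made-up table: the same gap gets a different fill value depending on whether the mean is taken down the column or across the row.

```python
import numpy as np

# Hypothetical table with one missing entry at row 1, column 0.
table = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, 30.0]])

col_means = np.nanmean(table, axis=0)  # one mean per column, NaN ignored
row_means = np.nanmean(table, axis=1)  # one mean per row, NaN ignored

# Like axis=0: fill each gap with the mean of its column.
by_column = np.where(np.isnan(table), col_means, table)

# Like axis=1: fill each gap with the mean of its row.
by_row = np.where(np.isnan(table), row_means[:, None], table)

print(by_column[1, 0], by_row[1, 0])
```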
Step 3.) Fit the Imputer to the Data
Fitting calculates the replacement value from your data.
If your missing data is in column 1, fit the imputer object from Step 2 (named imputer here) on column 1 so that it learns the mean of that column.
imputer = imputer.fit(dataset.iloc[:, 1:2])
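Fitting only learns the replacement value and does not change the data yet. The sketch below shows this with SimpleImputer (assuming a recent scikit-learn), using the numeric column from the example in this section. The learned value is stored in the statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric column from the example in this section, as a 2-D array.
x = np.array([[2.0], [3.0], [8.0], [4.0], [np.nan], [4.0],
              [8.0], [8.0], [3.0], [3.0], [6.0]])

# fit() computes the column mean but leaves x untouched.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean").fit(x)

learned_mean = imputer.statistics_[0]  # one learned value per column
print(learned_mean)
print(np.isnan(x).sum())  # the gap is still there until transform is called
```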
Step 4.) Transform the Data
Replace your blank observations with the calculated value. In this case, transform turns each NaN into the mean of its column.
dataset.iloc[:, 1:2] = imputer.transform(dataset.iloc[:, 1:2])
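Because fit and transform are separate steps, a fitted imputer can also fill gaps in data it has not seen before, using the value it learned earlier. Here is a minimal sketch of that, assuming a recent scikit-learn with SimpleImputer. Both arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one array to learn from, one that arrives later.
train = np.array([[2.0], [4.0], [np.nan], [6.0]])
new = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean").fit(train)

train_filled = imputer.transform(train)  # NaN becomes the mean of train
new_filled = imputer.transform(new)      # NaN also becomes the mean of train

print(train_filled.ravel())
print(new_filled.ravel())
```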