How to Encode Categorical Data using LabelEncoder and OneHotEncoder in Python

 Why use LabelEncoder and OneHotEncoder

  • Most regression and machine learning models expect numeric input, so text categories have to be converted to numbers before training.
  • Encoders in Python make this conversion simple: they turn the text values in the dataset into numeric codes and then into dummy variables (0 and 1).
  • We are going to use the “LabelEncoder” and “OneHotEncoder” classes from scikit-learn.
  • The scikit-learn documentation covers both LabelEncoder and OneHotEncoder in detail.

Categorical Data in Dataset

Regression models and machine learning models perform best when every observation is quantifiable. Because these models are built on mathematical functions, categorical data (observations that cannot be described mathematically) is not ideal in a dataset. When your dataset contains variables that you cannot quantify, you need to convert those observations into dummy variables first. For example, an observation might be a gender (male vs. female) or a country name. What we want to do is convert these observations into 0 and 1.

LabelEncoder Example and OneHotEncoder Example

The listings below show a LabelEncoder example and a OneHotEncoder example for Python scikit-learn.

Original Data

['b', 2]
['c', 3]
['a', 8]
['b', 4]
['a', 4.9]
['a', 4]
['b', 8]
['c', 8]
['d', 3]
['a', 3]
['c', 6]

Applied LabelEncoder

[1, 2]
[2, 3]
[0, 8]
[1, 4]
[0, 4.9]
[0, 4]
[1, 8]
[2, 8]
[3, 3]
[0, 3]
[2, 6]

Applied OneHotEncoder

[a, b, c, d, value]
[0, 1, 0, 0, 2]
[0, 0, 1, 0, 3]
[1, 0, 0, 0, 8]
[0, 1, 0, 0, 4]
[1, 0, 0, 0, 4.9]
[1, 0, 0, 0, 4]
[0, 1, 0, 0, 8]
[0, 0, 1, 0, 8]
[0, 0, 0, 1, 3]
[1, 0, 0, 0, 3]
[0, 0, 1, 0, 6]

Removed Dummy Trap

[b, c, d, value]
[1, 0, 0, 2]
[0, 1, 0, 3]
[0, 0, 0, 8]
[1, 0, 0, 4]
[0, 0, 0, 4.9]
[0, 0, 0, 4]
[1, 0, 0, 8]
[0, 1, 0, 8]
[0, 0, 1, 3]
[0, 0, 0, 3]
[0, 1, 0, 6]
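To make the three listings above concrete, here is a minimal pure-Python sketch that reproduces them by hand. It does not use scikit-learn; the variable names (rows, code_of, one_hot, without_trap) are made up for this illustration, and it assumes, as the listings suggest, that categories are numbered in sorted order.

```python
# Rows copied from the "Original Data" listing.
rows = [['b', 2], ['c', 3], ['a', 8], ['b', 4], ['a', 4.9], ['a', 4],
        ['b', 8], ['c', 8], ['d', 3], ['a', 3], ['c', 6]]

# Step 1 (label encoding): sort the distinct categories and number them.
categories = sorted({label for label, _ in rows})  # ['a', 'b', 'c', 'd']
code_of = {label: i for i, label in enumerate(categories)}
label_encoded = [[code_of[label], value] for label, value in rows]

# Step 2 (one-hot encoding): one 0/1 column per category, then the value.
one_hot = [[1 if code == k else 0 for k in range(len(categories))] + [value]
           for code, value in label_encoded]

# Step 3 (dummy variable trap): drop the first dummy column (category 'a').
without_trap = [row[1:] for row in one_hot]

print(label_encoded[0])  # [1, 2]
print(one_hot[0])        # [0, 1, 0, 0, 2]
print(without_trap[0])   # [1, 0, 0, 2]
```

Each printed row matches the first row of the corresponding listing.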

 

  • To start, we create a variable named encoder_x that does the label encoding.
  • After running the code, the categorical values in column 0 are converted into numeric codes (0, 1, 2, 3).
  • However, this is not what we want, because a model might treat the code 2 as greater than 1, or 1 as greater than 0, even though the categories have no such order. We need to split these codes into dummy variables.
  • Hence, we use OneHotEncoder to create the dummy variables.
  • OneHotEncoder splits the data into separate columns, one per category; each column uses 0 and 1 to mark whether that category is present.
  • We need to specify the column that OneHotEncoder should be applied to. In this case the categorical data is in the first column (index 0), so we pass that index in the categorical_features argument: the [0] in “onehotencoder = OneHotEncoder(categorical_features = [0])” is the column index.

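from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# x is assumed to be a NumPy object array with the categories in column 0.
# Note: categorical_features was deprecated in scikit-learn 0.20 and removed
# in 0.22, so this snippet runs only on older versions.

# Turn the text labels in column 0 into integer codes.
encoder_x = LabelEncoder()
x[:, 0] = encoder_x.fit_transform(x[:, 0])

# One-hot encode column 0 only; the other columns are kept as they are.
onehotencoder = OneHotEncoder(categorical_features = [0])
x = onehotencoder.fit_transform(x).toarray()

# Drop the first dummy column to avoid the dummy variable trap.
x = x[:, 1:]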
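On recent scikit-learn releases the categorical_features argument is no longer available. The following is a minimal sketch of one way to get the same result with ColumnTransformer. It is not the original author's code: x is rebuilt here from the Original Data listing, and drop='first' and sparse_threshold=0 are choices made for this sketch (drop the first dummy column, always return a dense array).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Assumption: x holds the rows of the "Original Data" listing as an object array.
x = np.array([['b', 2], ['c', 3], ['a', 8], ['b', 4], ['a', 4.9], ['a', 4],
              ['b', 8], ['c', 8], ['d', 3], ['a', 3], ['c', 6]], dtype=object)

# One-hot encode column 0 and pass the remaining column through unchanged.
# drop='first' removes the dummy column of the first category ('a'),
# which plays the role of the x[:, 1:] slice in the older code.
transformer = ColumnTransformer(
    transformers=[('onehot', OneHotEncoder(drop='first'), [0])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
encoded = transformer.fit_transform(x).astype(float)

print(encoded.shape)        # (11, 4)
print(encoded[0].tolist())  # [1.0, 0.0, 0.0, 2.0]
```

OneHotEncoder accepts string categories directly here, so the separate LabelEncoder step is not needed in this variant.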

Removing the Dummy Variable Trap

  • If a dataset contains several dummy variables that are related to each other, it can fall into the dummy variable trap after LabelEncoder and OneHotEncoder are applied: one dummy column can be predicted exactly from the others.
  • In our case, if an observation is not B, C, or D, then it must be A.
  • An observation can belong to only one of A, B, C, or D, so the four dummy columns always add up to 1 and one of them is redundant. This is an example of the dummy variable trap.
  • Hence, we need to remove one dummy variable column at the end of encoding.
  • At the end of the code, x = x[:, 1:] keeps all the rows and every column from index 1 through the last column.
  • This removes the first column (index 0, the dummy column for A) and so removes the dummy variable trap.
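The redundancy described above can be checked numerically. The sketch below is an illustration using NumPy; it assumes a regression model with an intercept (a column of ones), which the text does not state explicitly. With all four dummy columns the design matrix loses a rank; after dropping one dummy column it has full column rank.

```python
import numpy as np

# Dummy columns a, b, c, d from the "Applied OneHotEncoder" listing
# (the value column is left out).
dummies = np.array([
    [0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0],
    [1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0],
    [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0],
])

# Assumption: the model includes an intercept, i.e. a column of ones.
intercept = np.ones((dummies.shape[0], 1))
with_trap = np.hstack([intercept, dummies])            # 5 columns
without_trap = np.hstack([intercept, dummies[:, 1:]])  # 4 columns

# Every row has exactly one 1, so a + b + c + d equals the intercept column.
rows_sum_to_one = bool((dummies.sum(axis=1) == 1).all())

rank_with_trap = int(np.linalg.matrix_rank(with_trap))
rank_without_trap = int(np.linalg.matrix_rank(without_trap))

print(rows_sum_to_one)    # True
print(rank_with_trap)     # 4, although the matrix has 5 columns
print(rank_without_trap)  # 4, equal to the number of columns
```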

Difference between LabelEncoder and OneHotEncoder

    • LabelEncoder turns the text values in a column into numeric codes.
    • For example, [apple, orange, apple, banana] becomes [0, 2, 0, 1].
    • OneHotEncoder turns the values in a column into one or more binary columns that contain only 0 and 1.
    • For example, [apple, orange, apple, banana] is split into 3 binary columns (apple, banana, orange): [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0].
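The fruit example can be run directly with scikit-learn. This is a small sketch, assuming a scikit-learn version in which OneHotEncoder accepts string input (0.20 or later); the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

fruits = ['apple', 'orange', 'apple', 'banana']

# LabelEncoder: one integer code per category, numbered in sorted order.
labels = LabelEncoder().fit_transform(fruits)

# OneHotEncoder expects 2-D input: one row per sample, one column per feature.
fruit_column = np.array(fruits).reshape(-1, 1)
one_hot = OneHotEncoder().fit_transform(fruit_column).toarray()

print(labels.tolist())   # [0, 2, 0, 1]
print(one_hot.tolist())  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
```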

Another Quick Example for LabelEncoder(LE) and OneHotEncoder(OHE)

[Image: code example of LabelEncoder and OneHotEncoder]
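The code in the original image is not reproduced here. As a stand-in, the following hypothetical sketch uses a made-up list of country names and shows how to inspect the mapping each encoder learned (classes_ on LabelEncoder, categories_ on OneHotEncoder) and how to convert codes back to text.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up sample data for illustration only.
countries = ['France', 'Spain', 'Germany', 'Spain', 'France']

# LE: fit, then inspect the learned categories and the resulting codes.
le = LabelEncoder()
codes = le.fit_transform(countries)
print(le.classes_.tolist())  # ['France', 'Germany', 'Spain']
print(codes.tolist())        # [0, 2, 1, 2, 0]

# inverse_transform maps the codes back to the original text values.
restored = le.inverse_transform(codes).tolist()
print(restored == countries)  # True

# OHE: one binary column per category, in the order given by categories_.
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(np.array(countries).reshape(-1, 1)).toarray()
print(ohe.categories_[0].tolist())  # ['France', 'Germany', 'Spain']
print(one_hot[1].tolist())          # [0.0, 0.0, 1.0] for 'Spain'
```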

Other Sections on Data Handling in Python

1.) How to Import Libraries 

2.) How to Know and Change the Working Directory 

3.) How to Import CSV Data using Pandas

4.) How to Set Dependent Variables and Independent Variables using iloc

5.) How to Handle Missing data with Imputer

6.) How to Set Categorical Data (Dummy Variable) using LabelEncoder and OneHotEncoder

7.) How to Split Data into Training Set and Testing Set

8.) How to Apply Feature Scaling
