Before we build a machine learning model, we have to make sure that no independent variable dominates the others simply because of its scale. For example, suppose we have two independent variables: Variable A takes values from 0 to 10, and Variable B takes values from 0 to 100,000. Because the values of Variable B are so much larger than those of Variable A, the mathematics behind the regression or machine learning model will treat Variable B as dominating Variable A. This is a very common pitfall in statistics and machine learning, because some models measure the relationship between observations using the Euclidean distance.
The idea is that large values in a variable do not necessarily mean that it is more important than the other variables. To correct for this mathematically, we apply feature scaling to the dataset.
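To make the dominance effect concrete, here is a minimal sketch using NumPy. The two points p and q and the helper euclidean are made up for illustration; they only follow the ranges of Variable A and Variable B described above.

```python
import numpy as np

# Hypothetical observations: column 0 is Variable A (0 to 10),
# column 1 is Variable B (0 to 100,000).
p = np.array([1.0, 50_000.0])
q = np.array([9.0, 50_500.0])  # A differs by 80% of its range, B by 0.5%

def euclidean(u, v):
    """Plain Euclidean distance between two 1-D points."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

dist_all = euclidean(p, q)           # uses both variables
dist_b_only = float(abs(p[1] - q[1]))  # ignores Variable A entirely

print(dist_all, dist_b_only)
```

The two printed distances are almost identical, even though the points are far apart in Variable A relative to its range. On the raw data, the distance is decided almost entirely by Variable B.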
In other words, feature scaling is a method to standardize the independent variables in the model.
The term normalization is often used for it as well, although strictly speaking standardization and normalization are two different formulas.
Standardized data implicitly gives every variable in the model equal weight.
In machine learning, models trained with iterative optimizers such as gradient descent tend to converge faster on normalized data.
2 Concepts of Feature Scaling
- Standardization: x_standard = (x - mean of x) / (standard deviation of x)
- Normalization (min-max): x_normalized = (x - min of x) / (max of x - min of x)
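The two formulas above can be sketched directly in NumPy. The function names standardize and min_max_normalize and the sample values are our own choices for illustration, and the sketch assumes the column is not constant (otherwise the denominator would be zero).

```python
import numpy as np

def standardize(x):
    """(x - mean of x) / (standard deviation of x), using the population std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_normalize(x):
    """(x - min of x) / (max of x - min of x), mapping the values onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

values = [6, 8, 3, 8, 3, 4, 2, 4]  # sample values for one variable
z = standardize(values)
n = min_max_normalize(values)

print(z.mean(), z.std())  # approximately 0 and 1
print(n.min(), n.max())   # 0.0 and 1.0
```

Standardization centers the variable at 0 with unit standard deviation, while min-max normalization squeezes it into the interval from 0 to 1.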
Example of Feature Scaling in Python
Before Feature Scaling
[0, 1, 0, 6]
[1, 0, 0, 8]
[0, 1, 0, 3]
[0, 1, 0, 8]
[0, 0, 1, 3]
[1, 0, 0, 4]
[1, 0, 0, 2]
[0, 0, 0, 4]
After Feature Scaling
[-0.77,  1.29, -0.37,  0.57]
[ 1.29, -0.77, -0.37,  1.50]
[-0.77,  1.29, -0.37, -0.80]
[-0.77,  1.29, -0.37,  1.50]
[-0.77, -0.77,  2.64, -0.80]
[ 1.29, -0.77, -0.37, -0.34]
[ 1.29, -0.77, -0.37, -1.27]
[-0.77, -0.77, -0.37, -0.34]
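The listing above can be reproduced with a short sketch, assuming it was produced by fitting scikit-learn's StandardScaler on the eight rows shown under Before Feature Scaling. The variable names before and after are our own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The eight rows from the "Before Feature Scaling" listing
before = np.array([
    [0, 1, 0, 6],
    [1, 0, 0, 8],
    [0, 1, 0, 3],
    [0, 1, 0, 8],
    [0, 0, 1, 3],
    [1, 0, 0, 4],
    [1, 0, 0, 2],
    [0, 0, 0, 4],
], dtype=float)

# Each column is centered on its mean and divided by its standard deviation
after = StandardScaler().fit_transform(before)

print(np.round(after, 2))
```

Some printed values may differ from the listing in the last digit, since the listed numbers appear to be truncated rather than rounded. Each column of the result has a mean of approximately 0 and a standard deviation of 1.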
Note that we call fit_transform on the training set and only transform on the testing set, so that the testing set is scaled with the mean and standard deviation learned from the training data.
Feature Scaling using StandardScaler in Python
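from sklearn.preprocessing import StandardScaler

# xtrain and xtest are the feature matrices from an earlier train/test split
ss = StandardScaler()
xtrain = ss.fit_transform(xtrain)  # learn mean and std from the training set, then scale it
xtest = ss.transform(xtest)        # reuse the training mean and std on the testing set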
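For readers who want to run the snippet end to end, here is a self-contained sketch. The data is synthetic and generated only for illustration, and the names x, y, xtrain_scaled and xtest_scaled are our own; the split parameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 100 rows, two features on very different scales
x = np.column_stack([
    rng.uniform(0, 10, 100),       # like Variable A
    rng.uniform(0, 100_000, 100),  # like Variable B
])
y = rng.integers(0, 2, 100)        # arbitrary binary target

xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.2, random_state=0
)

ss = StandardScaler()
xtrain_scaled = ss.fit_transform(xtrain)  # fit on the training set only
xtest_scaled = ss.transform(xtest)        # reuse the training statistics

print(xtrain_scaled.mean(axis=0))  # approximately 0 for each column
print(xtest_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```

The testing set does not end up with a mean of exactly 0, and that is expected: it is scaled with the statistics of the training set, which is what keeps information about the testing data out of the fitting step.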
When you collect data and extract features, the data is often recorded on different scales. For example, the ages of employees in a company may be between 21 and 70 years, the size of the houses they live in may be 500-5000 sq ft, and their salaries may range from $30000 to $80000. In this situation, if you use a simple Euclidean metric, the age feature will play almost no role because it is orders of magnitude smaller than the other features. However, it may contain important information that is useful for the task. Here, you may want to normalize the features independently to the same scale, say [0,1], so that they contribute equally when computing the distance.
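One way to bring such features onto the [0,1] scale is scikit-learn's MinMaxScaler. The sketch below uses four made-up employees whose values simply span the ranges mentioned above; the array name employees is our own.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical employees: age (years), house size (sq ft), salary (dollars)
employees = np.array([
    [21,  500, 30000],
    [35, 1200, 45000],
    [50, 3000, 60000],
    [70, 5000, 80000],
], dtype=float)

# Each column is rescaled independently so that its min maps to 0 and its max to 1
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(employees)

print(np.round(scaled, 3))
```

After scaling, every column runs from 0 to 1, so age, house size and salary carry comparable weight in a distance computation.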