How to Split Data into Training Set and Testing Set in Python


When we build a mathematical model to make predictions, we must split the dataset into a “Training Dataset” and a “Testing Dataset”.

For example, if we are building a machine learning model, the model first has to “learn” the mathematical relationship in the data. It does this using the “Training Dataset”.

To verify that the model is valid, we have to test it with data that it did not see during training. Therefore, we check the model using the “Testing Dataset”.

  • The idea is like this:
    • If we have 1000 observations, then we are going to train our model using 75% of them, or 750 observations.
    • After the model is built, we are going to check the model using the testing set, which is the remaining 25%, or 250 observations.
  • The results or the accuracies on the training set and the testing set should be similar.
  • “Overfitting” occurs when the model fits the training set too closely (including its noise) and then fails to predict the testing set well.
  • We are going to use train_test_split from scikit-learn's model_selection module.
  • train_test_split returns 4 variables (xtrain, xtest, ytrain, ytest) at once.
  • It is going to split the data RANDOMLY. (If you want the random split to be the same every time you run the code, you can set the random_state.)
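The 75% / 25% idea above can be sketched as follows. This is only a minimal illustration: the data is randomly generated and hypothetical (1000 observations with 3 independent variables), and random_state=0 is an arbitrary choice so that the run is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 observations, 3 independent variables, 1 dependent variable
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)

# Keep 25% of the observations aside as the testing set
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25, random_state=0)

print(xtrain.shape, xtest.shape)  # (750, 3) (250, 3)
print(ytrain.shape, ytest.shape)  # (750,) (250,)
```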

 

Example: Splitting a Dataset into a Training Set and a Testing Set

Say your data has 5 columns.

Columns 0 to 3 are the independent variables (X).

The last column on the right (column 4) is the dependent variable (Y).

This is how you create the training set and testing set.

#import dataset 

import pandas as pd
dataset=pd.read_csv('dataset.csv').values

#split dependent variable and independent variable

y=dataset[:,4]

x=dataset[:,0:4]

#split training set and testing set

from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25)
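Once the data is split, the check described earlier (similar results on the training set and the testing set) might look like the sketch below. Everything here is an assumption for illustration: the data is synthetic, the coefficients are made up, and LinearRegression is just one convenient model. It is not part of the original example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: y depends linearly on 3 independent variables, plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))
y = x @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=1000)

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25, random_state=0)

# Learn the relationship from the training set only
model = LinearRegression().fit(xtrain, ytrain)

# score() returns R^2 for regressors; compare it on both sets
train_score = model.score(xtrain, ytrain)
test_score = model.score(xtest, ytest)
print("train R^2:", round(train_score, 3))
print("test R^2: ", round(test_score, 3))
```

If the training score were much higher than the testing score, that gap would be a sign of overfitting.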

Changing the Parameters of the Function

If you would like to have training set = 80% and testing set = 20%, then you should change your test_size.

This is how you do it.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20)

If your independent and dependent variables have names other than x and y, then you should pass those names to the function instead.

This is how you do it.

xtrain, xtest, ytrain, ytest = train_test_split(independent_variable, dependent_variable, test_size=0.20)
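As a small sketch of this, the variables can also be pandas objects selected by column name. The table and its column names ("age", "income", "purchased") are hypothetical and only serve to show that train_test_split accepts a DataFrame and a Series and keeps their rows aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table with 10 observations
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 38, 44, 26, 57],
    "income": [28, 52, 61, 75, 39, 80, 58, 66, 31, 72],
    "purchased": [0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
})

independent_variable = df[["age", "income"]]
dependent_variable = df["purchased"]

xtrain, xtest, ytrain, ytest = train_test_split(
    independent_variable, dependent_variable, test_size=0.20, random_state=0
)

print(len(xtrain), len(xtest))  # 8 2
# The row labels of x and y still match after the split
print(xtrain.index.equals(ytrain.index))  # True
```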

By default, the random split is different every time you run the code. If you want to get the same training set and testing set on every run, then you should set the random state.

This is how you do it.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, random_state = 0)
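A short sketch of what random_state does, using a tiny made-up array: two calls with the same random_state give identical splits, although the rows are still shuffled. If the goal is no shuffling at all, train_test_split also has a shuffle parameter; with shuffle=False the last rows become the testing set. The data below is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations, 2 independent variables
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state -> the same (still shuffled) split every time
first = train_test_split(x, y, test_size=0.20, random_state=0)
second = train_test_split(x, y, test_size=0.20, random_state=0)
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True

# shuffle=False -> no randomness: the first 80% of rows train, the last 20% test
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, shuffle=False)
print(ytrain)  # [0 1 2 3 4 5 6 7]
print(ytest)   # [8 9]
```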

Other Sections on Data Handling in Python

1.) How to Import Libraries 

2.) How to Know and Change the Working Directory 

3.) How to Import CSV Data using Pandas

4.) How to Set Dependent Variables and Independent Variables using iloc

5.) How to Handle Missing data with Imputer

6.) How to Set Categorical Data (Dummy Variable) using LabelEncoder and OneHotEncoder

7.) How to Split Data into Training Set and Testing Set

8.) How to Apply Feature Scaling
