The Why
You might have heard about the train-test split of data. Training data is, as the name suggests, used to train your model. Test data is data that the model has not seen during training, and you report the performance of the model based on how it does on this test data. What if you develop a model that works excellently on the training data but gives very poor results on the test data? This suggests that you may have an overfitted model. Now, looking at the performance of the model on the test data, you may go back to training and make some tweaks to reduce the overfitting and improve performance on the test data.
Did you see what happened here? The test data indirectly influenced your model! While the model is still not trained on the test data, its performance on the test data made you re-train it. In real-world scenarios, this may happen again and again, because training a model is often an iterative process. By the time your model is finalized, you may have tweaked it against the test data so many times that the test score is no longer an honest estimate of how the model will do on truly unseen data.
This is where the validation set comes into the picture. The idea is that the test set has to be kept sacred, and is to be used only when reporting the final numbers. Until then, the model is trained on the training data, and its performance is evaluated on the validation set. If the performance is not up to the mark, you can go ahead and retrain it. Only after all iterations are done, and you are satisfied with the performance of the model on the validation set, do you evaluate it on the test set and report the final numbers.
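To make the workflow concrete, here is a minimal sketch of it using only NumPy. Everything in it is an assumption chosen for illustration: the synthetic noisy-cubic data, the 60/20/20 slicing, the hypothetical mse helper, and the use of polynomial degree as the thing being tweaked. The point is only that every candidate is scored on the validation set, and the test set is touched once at the end.

```python
import numpy as np

# Synthetic data (an assumption for illustration): a noisy cubic curve.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
y = 0.5 * x**3 - x + rng.normal(scale=2.0, size=x.shape)

# Shuffle once, then carve out 60% train, 20% validation, 20% test.
idx = rng.permutation(len(x))
train_idx, val_idx, test_idx = idx[:180], idx[180:240], idx[240:]

def mse(coeffs, indices):
    """Mean squared error of a fitted polynomial on the given rows."""
    predictions = np.polyval(coeffs, x[indices])
    return float(np.mean((predictions - y[indices]) ** 2))

# Iterate freely: each candidate is fitted on train and scored on validation.
val_errors = {}
for degree in range(1, 9):
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    val_errors[degree] = mse(coeffs, val_idx)

best_degree = min(val_errors, key=val_errors.get)

# Touch the test set exactly once, after the choice is final.
final_coeffs = np.polyfit(x[train_idx], y[train_idx], best_degree)
test_error = mse(final_coeffs, test_idx)
print(f"chosen degree: {best_degree}, test MSE: {test_error:.2f}")
```

Because the test rows played no part in choosing the degree, the final test MSE can be reported without the bias described above.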
The How (scikit-learn)
sklearn.model_selection contains a train_test_split() function that is used to split your data into training and test sets. But what if you want to split your data into train, validation, and test sets? While there is no single function that does this directly, you can call train_test_split() twice. The first call separates out the training set. The second call takes the remainder of the data, after the training set has been removed, and splits it into validation and test sets.
The sample code below illustrates how that can be done:
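from sklearn.model_selection import train_test_split
# X and y are assumed to be your features and labels, already loaded
# Target proportions of the full dataset (they should sum to 1)
train_ratio = 0.6
validation_ratio = 0.2
test_ratio = 0.2
# Get the train set; the remaining 40% is held back for the second split
X_train, X_rem, y_train, y_rem = train_test_split(X, y, test_size=1 - train_ratio, random_state=10)
# Get the validation and test sets by splitting the remainder
X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=test_ratio / (test_ratio + validation_ratio), random_state=10)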
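As you can see, we first split the data into train and remainder sets. Then we split the remainder set further into validation and test sets. Note that the second test_size is a fraction of the remainder, not of the full dataset: here it is 0.2 / (0.2 + 0.2) = 0.5, so half of the remaining 40% goes to the test set and half to the validation set, giving 20% of the original data each.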
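If you need this split in several places, the two calls can be wrapped in a small helper. The sketch below is one possible way to do that, not a scikit-learn API: the function name train_val_test_split, its stratify flag, and the toy arrays are all hypothetical. It relies only on train_test_split() and its documented test_size, random_state, and stratify parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def train_val_test_split(X, y, train_ratio=0.6, validation_ratio=0.2,
                         test_ratio=0.2, random_state=10, stratify=False):
    """Hypothetical helper: chain two train_test_split calls."""
    if abs(train_ratio + validation_ratio + test_ratio - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    # First split: training set versus everything else.
    X_train, X_rem, y_train, y_rem = train_test_split(
        X, y,
        test_size=1 - train_ratio,
        random_state=random_state,
        stratify=y if stratify else None,
    )

    # Second split: divide the remainder into validation and test sets.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rem, y_rem,
        test_size=test_ratio / (test_ratio + validation_ratio),
        random_state=random_state,
        stratify=y_rem if stratify else None,
    )
    return X_train, X_val, X_test, y_train, y_val, y_test

# Toy data (an assumption): 100 rows, 3 features, balanced binary labels.
X = np.arange(300).reshape(100, 3)
y = np.array([0, 1] * 50)

X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
    X, y, stratify=True
)
print(len(X_train), len(X_val), len(X_test))
```

With 100 rows and the default 60/20/20 ratios, the three sets hold 60, 20, and 20 rows. Passing stratify=True keeps the class proportions roughly equal across the three sets, which tends to matter for classification problems with imbalanced labels.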