K-fold cross-validation in scikit-learn


You may have heard of the train-test split, wherein we split the available data into training and test sets. The model is trained on the training set and evaluated on the test set. If you train the model again, using a different random_state argument in the train_test_split() function, you will most likely get a different score, because both the training and test sets are now different.
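To make this concrete, here is a minimal sketch that repeats a single train-test split with a few different random_state values. The seeds (0, 1, 2) and the 20% test size are arbitrary choices made for illustration only; the dataset and model are the same ones used later in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Repeat the same experiment with different seeds (arbitrary values).
scores = {}
for seed in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = SVC().fit(X_train, y_train)
    # score() returns the accuracy on the held-out test set
    scores[seed] = model.score(X_test, y_test)

for seed, score in scores.items():
    print(f"random_state={seed}: accuracy={score:.4f}")
```

The printed accuracies will typically differ from seed to seed, which is the variability that K-fold cross-validation tries to average out.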

The K-fold cross-validation approach builds on the observation that different train-test splits give different results, and endeavors to estimate the performance of the model with lower variance. Under this approach, the data is divided into K parts (folds). The model is then trained on (K-1) parts and tested on the remaining part. This is repeated K times, such that each part gets to be the test part exactly once. The outcome is K different performance scores, which can then be summarized (using aggregations like mean, standard deviation, etc.).
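The procedure described above can be sketched as an explicit loop over the folds. This is only an illustration of the idea, using K = 5 and a seed of 42 as assumed values; the test_counts array is a helper introduced here to check that every sample lands in the test part exactly once.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
test_counts = np.zeros(len(X), dtype=int)  # how often each sample is tested

for train_idx, test_idx in kfold.split(X):
    model = SVC()  # a fresh model for every fold
    model.fit(X[train_idx], y[train_idx])  # train on K-1 parts
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # test on 1 part
    test_counts[test_idx] += 1

fold_scores = np.array(fold_scores)
print("Per-fold accuracy:", fold_scores)
print("Mean:", fold_scores.mean(), "Std:", fold_scores.std())
```

In practice you rarely need to write this loop yourself, since cross_val_score wraps the same steps.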

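Compared with a single train-test split, K-fold cross-validation generally gives a less biased and more realistic estimate of model performance, because every sample is used for testing exactly once and the final score does not depend on one particular split.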

The choice of K is left to you. It should be such that a single part is large enough to act as a test set. K values of 3, 5, and 10 are common.
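One rough way to judge whether a given K leaves a large enough test part is to look at the fold sizes directly. The sketch below does this for the breast cancer dataset and the three common K values; it only counts samples and does not train anything.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold

X, y = load_breast_cancer(return_X_y=True)
n_samples = len(X)

# For each K, record the size of every test fold.
test_sizes = {}
for k in (3, 5, 10):
    splitter = KFold(n_splits=k)
    test_sizes[k] = [len(test_idx) for _, test_idx in splitter.split(X)]
    print(f"K={k}: test fold sizes = {test_sizes[k]}")
```

A larger K means more training data per fold but a smaller test part, so each individual fold score becomes noisier.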


The example below shows K-fold cross-validation in scikit-learn, using the breast cancer dataset.

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

# Load the data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Define the K-fold splitter
num_folds = 5
seed = 42
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

# Define the model
model = SVC()

# Perform K-fold CV: one score per fold
results = cross_val_score(model, X, y, cv=kfold)

print(f"Accuracy = {results.mean()}")
>> Accuracy = 0.9173109765564353

As you can see, KFold (a class, not a function) only generates the train/test index splits; it does not evaluate anything itself. The function that trains and evaluates the model on each split is cross_val_score. It uses the estimator's default scorer (its score method) unless we specify a different one through the scoring argument. We used SVC, which is a classifier, so the default scorer is accuracy.
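As a small illustration of specifying a different scorer, the sketch below passes scoring strings to cross_val_score. The choice of "f1" as the alternative metric is an assumption made for this example; any scorer name supported by scikit-learn could be used instead.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Default scorer of SVC (accuracy)
acc_default = cross_val_score(SVC(), X, y, cv=kfold)

# The same metric, requested explicitly
acc_explicit = cross_val_score(SVC(), X, y, cv=kfold, scoring="accuracy")

# A different metric: F1 score of the positive class
f1_scores = cross_val_score(SVC(), X, y, cv=kfold, scoring="f1")

print("Default scores match accuracy:", np.allclose(acc_default, acc_explicit))
print("Mean F1:", f1_scores.mean())
```

Because the splitter uses a fixed random_state, all three calls evaluate on the same folds, so the scores are directly comparable.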

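Note that if you don’t specify shuffle=True, there is no need for the random_state argument in KFold: without shuffling, the folds are simply consecutive blocks of the data, so the split is already deterministic. Recent versions of scikit-learn raise an error if you pass random_state while shuffle is False.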
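The effect of shuffle is easy to see on a toy array. The sketch below uses ten made-up samples (the values 0 to 9) purely for illustration and prints the test indices of each fold with and without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten samples, used only to inspect the indices
X_small = np.arange(10).reshape(-1, 1)

# Without shuffling: test folds are consecutive blocks
unshuffled = [
    test_idx.tolist() for _, test_idx in KFold(n_splits=5).split(X_small)
]

# With shuffling: indices are permuted before being split into folds
shuffled = [
    test_idx.tolist()
    for _, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X_small)
]

print("shuffle=False:", unshuffled)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print("shuffle=True: ", shuffled)
```

Shuffling matters when the rows of your dataset are ordered (for example, sorted by class), since consecutive blocks would then give unrepresentative folds.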

You can read more about these functions here:

  1. KFold
  2. cross_val_score
  3. SVC

