
K-Fold Cross-Validation in Scikit-Learn: Tutorial

Nimrod Kramer
Quick take

Learn how K-Fold Cross-Validation helps you build better machine learning models by providing reliable performance estimates and exposing overfitting.

K-Fold Cross-Validation helps you evaluate machine learning models more reliably. Here's what you need to know:

  • Splits data into K parts (folds) for training and testing
  • Uses every sample for training and, exactly once, for testing
  • Gives more reliable performance estimates than a single split
  • Helps detect overfitting

Key steps:

  1. Pick the number of folds (K)
  2. Split the data into K roughly equal parts
  3. Train on K-1 parts, test on the remaining part
  4. Repeat K times so each part is the test set once
  5. Average the results

Scikit-Learn code:

from sklearn.model_selection import KFold

# X and y are assumed to be NumPy arrays of features and labels
# (for pandas objects, index with .iloc instead)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and evaluate model here

Quick comparison:

| Method | Pros | Cons |
| --- | --- | --- |
| K-Fold CV | Uses all data, reduces bias | More computationally expensive |
| Simple Split | Fast, easy | Less reliable estimates |
| LOOCV | Low bias | Very computationally expensive |

K-Fold Cross-Validation helps you build more reliable models by giving a fuller picture of performance.
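
To make the comparison concrete, here is a minimal sketch that runs all three methods on the Iris dataset. The choice of dataset, the LogisticRegression model, and names such as holdout_score are assumptions for illustration, not part of the original comparison.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    cross_val_score,
    train_test_split,
)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)  # illustrative model choice

# Simple split: one fit, one score
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
holdout_score = model.fit(X_train, y_train).score(X_test, y_test)

# K-Fold CV: five fits, five scores
kfold_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=42)
)

# LOOCV: one fit per sample (150 fits for Iris)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(f"Simple split: {holdout_score:.3f} (1 fit)")
print(f"5-fold CV:    {kfold_scores.mean():.3f} ({len(kfold_scores)} fits)")
print(f"LOOCV:        {loo_scores.mean():.3f} ({len(loo_scores)} fits)")
```

The number of fits is the cost column of the table: LOOCV needs as many fits as there are samples, which is why it gets impractical on large datasets.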

What You Need to Know First

Before diving in, make sure you have:

  1. Python installed (3.5+ was the floor for older Scikit-Learn releases; current releases require a newer Python 3)
  2. Scikit-Learn installed:
    • Via pip: pip install -U scikit-learn
    • Via conda: conda install scikit-learn

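Key libraries (these minimums match older Scikit-Learn releases; check the installation docs for the versions your release needs):

| Library | Min Version |
| --- | --- |
| NumPy | 1.11.0 |
| SciPy | 0.17.0 |
| Joblib | 0.11 |
| Matplotlib | 1.5.1 |
| Pandas | 0.18.0 |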

Understand these concepts:

  • Model training and testing
  • Overfitting and underfitting
  • Model evaluation metrics

What is K-Fold Cross-Validation?

K-Fold Cross-Validation estimates how well a model will perform on data it has not seen. It works like this:

  1. Split the data into K roughly equal parts (folds)
  2. Train on K-1 parts, test on the remaining part
  3. Repeat K times so each part is the test set once
  4. Average the results
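
The steps above can be sketched by hand with NumPy, without Scikit-Learn's KFold class. This is an illustrative sketch only: the Iris data, the KNeighborsClassifier model, and the variable names are assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k = 5

# Step 1: shuffle the row indices, then cut them into k roughly equal folds
rng = np.random.default_rng(42)
indices = rng.permutation(len(X))
folds = np.array_split(indices, k)

scores = []
for i in range(k):
    # Step 2: fold i is the test set, the other k-1 folds form the training set
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    model = KNeighborsClassifier(n_neighbors=5)  # illustrative model choice
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
    # Step 3: the loop repeats this for every fold

# Step 4: average the per-fold scores
mean_score = float(np.mean(scores))
print(f"Fold accuracies: {np.round(scores, 3)}")
print(f"Mean accuracy: {mean_score:.3f}")
```

In practice KFold and cross_val_score do the same bookkeeping for you; the manual version is only useful for seeing what happens in each iteration.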

Example with 5-fold:

| Iteration | Training Folds | Testing Fold |
| --- | --- | --- |
| 1 | 2, 3, 4, 5 | 1 |
| 2 | 1, 3, 4, 5 | 2 |
| 3 | 1, 2, 4, 5 | 3 |
| 4 | 1, 2, 3, 5 | 4 |
| 5 | 1, 2, 3, 4 | 5 |
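
The same rotation can be seen directly in Scikit-Learn. The sketch below uses a made-up array of 10 samples (X_demo is a hypothetical name) and leaves shuffling off so that each test fold is a consecutive block of indices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten dummy samples, one feature each; only the row positions matter here
X_demo = np.arange(10).reshape(-1, 1)
kf_demo = KFold(n_splits=5)  # shuffle is off by default

test_folds = []
for iteration, (train_idx, test_idx) in enumerate(kf_demo.split(X_demo), start=1):
    test_folds.append(test_idx.tolist())
    print(f"Iteration {iteration}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Test folds, in order: [0, 1], [2, 3], [4, 5], [6, 7], [8, 9]
```

Each sample index appears in exactly one test fold and in the training set of the other four iterations.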

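Why it's better than a single train/test split:

  1. Uses more data for training
  2. Reduces the bias of the performance estimate
  3. Gives more stable performance estimates
  4. Shows how much scores vary across folds, which supports a rough confidence interval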

"Cross-validation can detect overfitting, showing if a model isn't generalizing well to new data." - Nisha Arya, Data Scientist
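
As a rough illustration of the confidence interval point, the sketch below turns fold scores into a mean and an approximate 95% interval. The normal approximation (1.96 standard errors) is an assumption made here for simplicity; fold scores share training data and are not independent, so treat the interval as a loose indication rather than a rigorous bound.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

mean = scores.mean()
# Standard error of the mean across folds (sample standard deviation, ddof=1)
sem = scores.std(ddof=1) / np.sqrt(len(scores))
# Approximate interval; assumes roughly normal, independent fold scores
lower, upper = mean - 1.96 * sem, mean + 1.96 * sem

print(f"Mean accuracy: {mean:.3f}")
print(f"Approximate 95% interval: [{lower:.3f}, {upper:.3f}]")
```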

Getting Ready to Code

Set up your environment:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris, load_diabetes

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

diabetes = load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target

print("Iris dataset shape:", X_iris.shape)
print("Diabetes dataset shape:", X_diabetes.shape)

For real-world data:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
banknote_df = pd.read_csv(url, header=None, names=['variance', 'skewness', 'curtosis', 'entropy', 'class'])

Using K-Fold Cross-Validation in Scikit-Learn


Basic setup:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Iterate over the Iris arrays loaded in the setup step
for train_index, test_index in kf.split(X_iris):
    X_train, X_test = X_iris[train_index], X_iris[test_index]
    y_train, y_test = y_iris[train_index], y_iris[test_index]
    # Train and evaluate model here

With a model:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Iris has three classes, so use the default solver rather than the
# binary-oriented liblinear; max_iter is raised to avoid convergence warnings
log_reg = LogisticRegression(max_iter=1000)
cv_results = cross_validate(log_reg, X_iris, y_iris, cv=kf, scoring='accuracy')

print("Cross-validation scores:", cv_results['test_score'])
print("Mean accuracy:", cv_results['test_score'].mean())
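
cross_validate can also report several metrics at once and return training scores, which makes the gap between training and test performance visible. A minimal sketch, assuming the Iris data and the accuracy and f1_macro metrics as illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=cv,
    scoring=["accuracy", "f1_macro"],  # several metrics in one run
    return_train_score=True,           # also score each model on its training folds
)

for metric in ("accuracy", "f1_macro"):
    train_mean = results[f"train_{metric}"].mean()
    test_mean = results[f"test_{metric}"].mean()
    print(f"{metric}: train={train_mean:.3f} test={test_mean:.3f}")
```

A training score far above the test score is a common sign of overfitting; similar scores suggest the model generalizes about as well as it fits.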

Changing K-Fold Settings

Adjust folds:

kf = KFold(n_splits=10, shuffle=True, random_state=42)
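
To see what changing n_splits does, the sketch below prints the training and test sizes of the first split for a few values of K on the 150-sample Iris dataset. The values 3, 5, and 10 and the name fold_sizes are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X, _ = load_iris(return_X_y=True)  # 150 samples

fold_sizes = {}
for k in (3, 5, 10):
    kf_k = KFold(n_splits=k, shuffle=True, random_state=42)
    # Look at the first split only; the other splits have the same sizes here
    train_idx, test_idx = next(kf_k.split(X))
    fold_sizes[k] = (len(train_idx), len(test_idx))
    print(f"K={k}: {len(train_idx)} training samples, {len(test_idx)} test samples")
```

A larger K means more training data per fit but smaller test folds and more fits overall.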

Use Stratified K-Fold:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
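
The difference shows up on imbalanced labels. The sketch below builds a made-up label vector with 90 negatives and 10 positives (y_imb and X_imb are hypothetical names) and counts the positives that land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic imbalanced labels: 90 of class 0, 10 of class 1
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # placeholder features; only the labels matter here

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
strat_counts = [
    int(y_imb[test_idx].sum()) for _, test_idx in skf_demo.split(X_imb, y_imb)
]

kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)
plain_counts = [int(y_imb[test_idx].sum()) for _, test_idx in kf_demo.split(X_imb)]

print("Positives per test fold (StratifiedKFold):", strat_counts)
print("Positives per test fold (KFold):          ", plain_counts)
```

StratifiedKFold puts 2 positives in every test fold, matching the overall 10% rate, while plain KFold makes no such guarantee and can leave a fold with very few positives.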

Checking How Well Your Model Works

Get scores:

from sklearn.model_selection import cross_val_score
from sklearn import svm

clf = svm.SVC(kernel='linear', C=1)
# An integer cv with a classifier uses stratified folds
scores = cross_val_score(clf, X_iris, y_iris, cv=5)
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.2f}")
print(f"Standard Deviation: {scores.std():.2f}")

Use different metrics:

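# Iris is multiclass, so plain 'f1' (binary only) does not apply; use an averaged variant
scores = cross_val_score(clf, X_iris, y_iris, cv=5, scoring='f1_macro')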

Tips and Common Mistakes

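  1. Pick 5-10 folds for most datasets
  2. Use Stratified K-Fold for classification with imbalanced classes
  3. Prevent data leakage:
    • Fit preprocessing (scaling, imputation, feature selection) on the training folds only, never on the full dataset
    • Use Scikit-Learn's Pipeline so this happens automatically inside each fold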
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC())
])
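
A pipeline like the one above can be passed straight to cross_val_score, so the scaler is refit on the training folds of every split. A minimal sketch, assuming the Iris data and a stratified 5-fold splitter as illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),  # fitted on the training folds only
    ("svm", SVC()),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv)

print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
```

Scaling the full dataset before splitting would let test-fold statistics influence training, which is the leakage the pipeline avoids.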

K-Fold Cross-Validation helps build reliable models by providing fuller performance estimates and exposing overfitting.
