K-Fold Cross-Validation in Scikit-Learn: Tutorial

By Nimrod Kramer

Learn how K-Fold Cross-Validation improves machine learning models by providing reliable performance estimates and preventing overfitting.

K-Fold Cross-Validation helps you build better machine learning models. Here's what you need to know:

  • Splits data into K parts for training and testing
  • Uses all data for both training and testing
  • Gives more reliable performance estimates
  • Helps prevent overfitting

Key steps:

  1. Pick number of folds (K)
  2. Split data into K equal parts
  3. Train on K-1 parts, test on 1 part
  4. Repeat K times
  5. Average the results

Scikit-Learn code (X and y are your feature matrix and target array):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and evaluate model here

Quick comparison:

Method       | Pros                        | Cons
K-Fold CV    | Uses all data, reduces bias | More computationally expensive
Simple Split | Fast, easy                  | Less reliable estimates
LOOCV        | Low bias                    | Very computationally expensive
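
To make the trade-offs concrete, here is a minimal sketch that runs the same classifier under all three strategies on Scikit-Learn's built-in iris dataset (note that LOOCV refits once per sample, so it is by far the slowest):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(solver='liblinear')

# Simple split: one train/test partition, one score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
simple_score = model.fit(X_train, y_train).score(X_test, y_test)

# K-Fold CV: five scores, one per fold
kfold_scores = cross_val_score(model, X, y, cv=5)

# LOOCV: one score per sample (150 model fits for iris)
loocv_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print(f"Simple split: {simple_score:.3f}")
print(f"5-fold mean:  {kfold_scores.mean():.3f}")
print(f"LOOCV mean:   {loocv_scores.mean():.3f}")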

K-Fold Cross-Validation helps you build more reliable models by giving a fuller picture of performance.

What You Need to Know First

Before diving in, make sure you have:

  1. Python (3.5+) installed
  2. Scikit-Learn installed:
    • Via pip: pip install -U scikit-learn
    • Via conda: conda install scikit-learn

Key libraries:

Library    | Min Version
NumPy      | 1.11.0
SciPy      | 0.17.0
Joblib     | 0.11
Matplotlib | 1.5.1
Pandas     | 0.18.0
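
To check that your installation meets these minimums, Scikit-Learn can print the versions of everything it depends on:

import sklearn

# Prints Python, scikit-learn, and dependency versions (NumPy, SciPy, etc.)
sklearn.show_versions()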

Understand these concepts:

  • Model training and testing
  • Overfitting and underfitting
  • Model evaluation metrics

What is K-Fold Cross-Validation?

K-Fold Cross-Validation estimates how well a model will perform on unseen data. It works like this:

  1. Split data into K equal parts
  2. Train on K-1 parts, test on 1 part
  3. Repeat K times
  4. Average the results

Example with 5 folds:

Iteration | Training Folds | Testing Fold
1         | 2, 3, 4, 5     | 1
2         | 1, 3, 4, 5     | 2
3         | 1, 2, 4, 5     | 3
4         | 1, 2, 3, 5     | 4
5         | 1, 2, 3, 4     | 5
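
To see this rotation in action, here is a minimal from-scratch sketch in plain NumPy that reproduces the table above with ten toy samples:

import numpy as np

samples = np.arange(10)                             # ten toy samples
folds = np.array_split(np.arange(len(samples)), 5)  # five index arrays, one per fold

for i, test_idx in enumerate(folds):
    # train on every fold except fold i, test on fold i
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Iteration {i + 1}: train indices {train_idx}, test indices {test_idx}")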

Why it's better than a single train/test split:

  1. Uses more of the data for training
  2. Reduces bias from any one unlucky split
  3. Gives more stable performance estimates
  4. Allows confidence interval calculation (see the sketch below)
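
The last point follows directly from having K separate scores. A rough sketch, using hypothetical per-fold accuracies and a normality assumption:

import numpy as np

scores = np.array([0.92, 0.95, 0.90, 0.93, 0.94])  # hypothetical per-fold accuracies
mean, std = scores.mean(), scores.std(ddof=1)
margin = 1.96 * std / np.sqrt(len(scores))          # ~95% CI half-width
print(f"Accuracy: {mean:.3f} +/- {margin:.3f}")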

"Cross-validation can detect overfitting, showing if a model isn't generalizing well to new data." - Nisha Arya, Data Scientist

Getting Ready to Code

Set up your environment:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.datasets import load_iris, load_diabetes

# Load datasets
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

diabetes = load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target

print("Iris dataset shape:", X_iris.shape)
print("Diabetes dataset shape:", X_diabetes.shape)

For real-world data:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
banknote_df = pd.read_csv(url, header=None, names=['variance', 'skewness', 'curtosis', 'entropy', 'class'])
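
Before cross-validating, separate the feature columns from the label column (column names as defined above):

X_banknote = banknote_df.drop('class', axis=1).values  # variance, skewness, curtosis, entropy
y_banknote = banknote_df['class'].values               # binary class labels (0 or 1)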

Using K-Fold Cross-Validation in Scikit-Learn


Basic setup (using the iris data loaded above):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, test_index in kf.split(X_iris):
    X_train, X_test = X_iris[train_index], X_iris[test_index]
    y_train, y_test = y_iris[train_index], y_iris[test_index]
    # Train and evaluate the model on this fold here

With a model:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

log_reg = LogisticRegression(solver='liblinear')
cv_results = cross_validate(log_reg, X_iris, y_iris, cv=kf, scoring='accuracy')

print("Cross-validation scores:", cv_results['test_score'])
print("Mean accuracy:", cv_results['test_score'].mean())

Changing K-Fold Settings

Adjust the number of folds:

kf = KFold(n_splits=10, shuffle=True, random_state=42)

Use Stratified K-Fold to preserve class proportions in every fold:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
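
Unlike plain KFold, StratifiedKFold needs the labels at split time so each fold keeps roughly the same class balance:

for train_index, test_index in skf.split(X_iris, y_iris):
    # each fold preserves iris's even three-way class balance
    X_train, X_test = X_iris[train_index], X_iris[test_index]
    y_train, y_test = y_iris[train_index], y_iris[test_index]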

Checking How Well Your Model Works

Get scores:

from sklearn.model_selection import cross_val_score
from sklearn import svm

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X_iris, y_iris, cv=5)
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.2f}")
print(f"Standard Deviation: {scores.std():.2f}")

Use different metrics (the plain 'f1' scorer only supports binary targets; for a multiclass problem like iris, use a macro-averaged variant):

scores = cross_val_score(clf, X_iris, y_iris, cv=5, scoring='f1_macro')
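
To collect several metrics in one pass, cross_validate accepts a list of scorer names:

from sklearn.model_selection import cross_validate

results = cross_validate(clf, X_iris, y_iris, cv=5, scoring=['accuracy', 'f1_macro'])
print("Accuracy:", results['test_accuracy'].mean())
print("Macro F1:", results['test_f1_macro'].mean())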

Tips and Common Mistakes

  1. Pick 5-10 folds for most datasets
  2. Use Stratified K-Fold for imbalanced class distributions
  3. Prevent data leakage:
    • Split the data before fitting any preprocessing
    • Use Scikit-Learn's Pipeline, as shown below:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipeline = Pipeline([
    ('scaler', StandardScaler()),  # scaling is refit inside each CV fold
    ('svm', SVC())
])
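
Passing the whole pipeline to cross_val_score makes the scaler refit on each fold's training portion only, which is exactly what prevents the leakage described above:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_iris, y_iris, cv=5)
print("Pipeline CV accuracy:", scores.mean())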

K-Fold Cross-Validation helps build reliable models by providing fuller performance estimates and preventing overfitting.
