
Train-Test Split

Practice with homework, get graded on new questions: why you must split your data
Training set: what the model learns from
Test set: what the model is graded on
Typical split: 80% train · 20% test

Imagine you're a teacher writing a final exam. You've been giving students homework all semester: 100 practice problems. Now it's exam day.

Would you put the exact same 100 problems on the exam? Of course not! A student who memorized the answers without understanding anything would ace it. That exam wouldn't test real knowledge at all.

Instead, you write new questions that test the same concepts but aren't identical to the homework. If a student truly learned the material (not just memorized answers), they'll do fine on the new questions too.

This is exactly why we split data in machine learning. The training set is the homework: the model practices on it. The test set is the exam: we grade the model on data it has never seen before.

Why can't the model just train on everything?

If you let the model see all the data during training, you have no way to know if it actually learned general patterns or just memorized the answers. It might score 99% on the training data but completely fall apart on new, real-world data.

By holding back a chunk of data (the test set), you get an honest grade. The model has never seen these examples, so its performance on them is an estimate of how it will do in the real world, as long as the test set resembles the data it will meet there.
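To make the memorization problem concrete, here is a minimal sketch. The synthetic dataset, the 20% label noise, and the variable names are assumptions chosen for illustration, not part of the lesson's own data. An unconstrained decision tree can fit the training labels almost perfectly, so the gap between the two scores is the thing to look at.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed for illustration): the label depends on the
# first feature, but about 20% of the labels are flipped to act as noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(200) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# With no depth limit, the tree keeps splitting until it has
# effectively memorized the training labels, noise included.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # graded on the homework
test_acc = model.score(X_test, y_test)     # graded on unseen questions

print(f"Train accuracy: {train_acc:.0%}")
print(f"Test accuracy:  {test_acc:.0%}")
```

The training score on its own would suggest a near-perfect model. Only the held-out score shows how much of that was memorized noise.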

The typical split

The most common approach is an 80/20 split:

  • 80% of the data → training set (the homework)
  • 20% of the data → test set (the exam)

Some people use 70/30 or 90/10; there's no magic number. The idea is always the same: train on most of the data, test on a held-out portion.
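As a quick sketch of how the ratio maps to sample counts, the loop below tries the three ratios mentioned above on 100 placeholder samples. The data is hypothetical; only the resulting sizes matter.

```python
from sklearn.model_selection import train_test_split

# 100 placeholder samples (hypothetical data, only the counts matter here).
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

sizes = {}
for test_size in (0.2, 0.3, 0.1):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    # Record (train count, test count) for this ratio.
    sizes[test_size] = (len(X_train), len(X_test))
    print(f"test_size={test_size}: {len(X_train)} train, {len(X_test)} test")
```

With small datasets the counts cannot always match the ratio exactly, because the test set has to be a whole number of samples.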

Critical rule: The model must never see the test set during training. Not even a peek. If it does, your evaluation is worthless.

Splitting Data with sklearn

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Sample dataset
X = [
    [6, 8], [2, 4], [7, 7], [1, 5],
    [5, 6], [8, 9], [3, 3], [4, 7],
    [9, 8], [2, 6], [6, 5], [7, 9],
]
y = [1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1]  # 1 = pass, 0 = fail
# Split: roughly 80% train, 20% test
# (with only 12 samples, the test set is rounded up to 3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
# Train on training data only
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Grade on test data (never seen!)
predictions = model.predict(X_test)
print(f"\nTest accuracy: {accuracy_score(y_test, predictions):.0%}")
Output
Training samples: 9
Test samples: 3

Test accuracy: 67%

Don't forget to shuffle!

Before splitting, shuffle your data randomly (sklearn's train_test_split does this by default). Why? Imagine your dataset is sorted: all the "pass" examples first, then all the "fail" examples. If you take the first 80% for training and the last 20% for testing, your training set could be almost all passes and your test set all fails. The model would learn a completely skewed picture, and its grade would be just as skewed. The one exception is time-ordered data, where the order itself has to be preserved.

Shuffling ensures both the training and test sets are representative of the overall data.
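The sorted-data scenario can be reproduced in a few lines. This is a sketch on hypothetical toy labels (40 passes followed by 10 fails). It uses two real train_test_split parameters: shuffle=False, which keeps the original order, and stratify, which shuffles while keeping the class ratio the same in both sets.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Sorted toy labels (hypothetical): 40 passes followed by 10 fails.
X = [[i] for i in range(50)]
y = [1] * 40 + [0] * 10

# shuffle=False keeps the original order: the test set is the last 20%.
_, _, y_train_sorted, y_test_sorted = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
print("No shuffle  train:", Counter(y_train_sorted), "test:", Counter(y_test_sorted))

# stratify=y shuffles and keeps the pass/fail ratio equal in both sets.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Stratified  train:", Counter(y_train_strat), "test:", Counter(y_test_strat))
```

Without shuffling, the model trains only on passes and is graded only on fails. The stratified split gives both sets the same 80/20 mix of passes and fails as the full dataset.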

The validation set: a third piece

In practice, many teams use three splits:

  • Training set (60-70%): the model learns on this
  • Validation set (15-20%): used to tune settings and compare models
  • Test set (15-20%): the final, untouched exam

Think of validation as practice exams. You use them to adjust your study strategy, but the real exam (test set) is still a surprise. This prevents you from accidentally overfitting to the test set by tweaking your model until it scores well on it.
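One common way to get three pieces is to call train_test_split twice. The sketch below assumes a 60/20/20 split on 100 placeholder samples; the variable names X_rest and X_val are illustrative choices, not a fixed convention.

```python
from sklearn.model_selection import train_test_split

# 100 placeholder samples (hypothetical data).
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# First carve off the test set (20%) and set it aside.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Then split the remaining 80 samples: 25% of 80 = 20 for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The second call uses test_size=0.25 because it is a fraction of what remains after the first split, not of the original dataset.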

Note: Data leakage is one of the most common beginner mistakes. It happens when information from the test set "leaks" into training, like a student who somehow got a copy of the exam beforehand. Common causes: preprocessing before splitting (such as normalizing with statistics from the whole dataset), splitting time-series data without respecting time order, or duplicate rows that end up in both sets. Always split FIRST, then fit any preprocessing on the training set only and apply that same fitted transform to the test set.
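For the time-order case, a minimal sketch is to turn shuffling off so the test set is the most recent slice. The daily readings below are hypothetical and assumed to be sorted from oldest to newest already.

```python
from sklearn.model_selection import train_test_split

# Hypothetical daily readings, already sorted from oldest to newest.
days = list(range(1, 11))
readings = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19]

# shuffle=False keeps the order, so the test set is the most recent 20%.
days_train, days_test, r_train, r_test = train_test_split(
    days, readings, test_size=0.2, shuffle=False
)

print("Train days:", days_train)
print("Test days: ", days_test)
```

Every training day comes before every test day, so the model is never graded on the past using knowledge of the future.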

The Right Way: Split THEN Preprocess

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[100, 3], [200, 6], [150, 4],
              [300, 8], [250, 7], [120, 3]])
y = np.array([200, 400, 280, 550, 480, 220])
# Step 1: Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Step 2: Fit scaler on training data ONLY
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit + transform
X_test_scaled = scaler.transform(X_test) # transform only!
print("Train mean:", X_train_scaled.mean(axis=0).round(2))
print("Test mean:", X_test_scaled.mean(axis=0).round(2))
print("\n(Test mean isn't [0,0], and that's correct!)")
Output
Train mean: [-0.  0.]
Test mean: [-0.63 -0.58]

(Test mean isn't [0,0], and that's correct!)
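To see what the leak looks like, the sketch below compares a scaler fitted on everything with one fitted on the training rows only. The five-row dataset and the names leaky_scaler and clean_scaler are made up for illustration. shuffle=False and test_size=1 force the last row to be the test set so the numbers are predictable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical single feature with one extreme value at the end.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Keep the order and hold out exactly one row: the extreme value.
X_train, X_test = train_test_split(X, test_size=1, shuffle=False)

leaky_scaler = StandardScaler().fit(X)        # wrong: has seen the test row
clean_scaler = StandardScaler().fit(X_train)  # right: training rows only

print("Mean learned from all data:  ", leaky_scaler.mean_)  # [22.]
print("Mean learned from train only:", clean_scaler.mean_)  # [2.5]
```

The leaky scaler's mean is pulled toward the test row, so the training data ends up scaled using information the model should not have had.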

Quick check

Why do we split data into training and test sets?