Train-Test Split
Imagine you're a teacher writing a final exam. You've been giving students homework all semester: 100 practice problems. Now it's exam day.
Would you put the exact same 100 problems on the exam? Of course not! A student who memorized the answers without understanding anything would ace it. That exam wouldn't test real knowledge at all.
Instead, you write new questions that test the same concepts but aren't identical to the homework. If a student truly learned the material (not just memorized answers), they'll do fine on the new questions too.
This is exactly why we split data in machine learning. The training set is the homework: the model practices on it. The test set is the exam: we grade the model on data it has never seen before.
Why can't the model just train on everything?
If you let the model see all the data during training, you have no way to know if it actually learned general patterns or just memorized the answers. It might score 99% on the training data but completely fall apart on new, real-world data.
By holding back a chunk of data (the test set), you get an honest grade. The model has never seen these examples, so its performance on them tells you how it'll do in the real world.
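To make the memorization problem concrete, here is a minimal sketch in plain Python. The `MemorizingModel` class and the even/odd toy task are made up for illustration and don't come from any library. The "model" stores every training answer in a lookup table and falls back to a fixed guess for anything it hasn't seen.

```python
class MemorizingModel:
    """Hypothetical toy model: memorizes training answers, learns no pattern."""

    def __init__(self, default=False):
        self.answers = {}
        self.default = default

    def fit(self, xs, ys):
        # "Training" is just storing each (input, answer) pair.
        for x, y in zip(xs, ys):
            self.answers[x] = y

    def predict(self, x):
        # Look up the memorized answer; guess the default for unseen inputs.
        return self.answers.get(x, self.default)


def accuracy(model, xs, ys):
    correct = sum(model.predict(x) == y for x, y in zip(xs, ys))
    return correct / len(xs)


# Toy task (an assumption for this sketch): the label is True for even numbers.
numbers = list(range(100))
labels = [n % 2 == 0 for n in numbers]

train_x, train_y = numbers[:80], labels[:80]
test_x, test_y = numbers[80:], labels[80:]

model = MemorizingModel()
model.fit(train_x, train_y)

train_acc = accuracy(model, train_x, train_y)  # 1.0: every answer was memorized
test_acc = accuracy(model, test_x, test_y)     # 0.5: same as always guessing False
print(train_acc, test_acc)
```

Graded only on the homework, this model looks perfect. The held-out numbers show that it never learned the even/odd rule at all.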
The typical split
The most common approach is an 80/20 split:
- 80% of the data → training set (the homework)
- 20% of the data → test set (the exam)
Some people use 70/30 or 90/10; there's no magic number. The idea is always the same: train on most of the data, test on a held-out portion.
Critical rule: The model must never see the test set during training. Not even a peek. If it does, your evaluation is worthless.
Splitting Data with sklearn
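Here is a minimal sketch of an 80/20 split using scikit-learn's `train_test_split`. The toy arrays `X` and `y` are made up for illustration; with real data you'd pass in your own features and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for this sketch): 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2  # alternating 0/1 labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # hold back 20% of the rows as the test set
    random_state=42,  # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)  # (80, 2) (20, 2)
```

Setting `random_state` is optional, but it means you get the same split every time you rerun the code, which makes results easier to compare.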
Don't forget to shuffle!
Before splitting, always shuffle your data randomly. Why? Imagine your dataset is sorted: all the "pass" examples first, then all the "fail" examples. If you take the first 80% for training and the last 20% for testing, your training set could end up almost entirely passes while your test set is all fails. The model would learn a completely skewed picture.
Shuffling ensures both the training and test sets are representative of the overall data.
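The sorted-data trap can be sketched in a few lines of plain Python. The 80 pass / 20 fail dataset below is made up for illustration, and the proportions were chosen so the unshuffled split lands exactly on the boundary between the two groups.

```python
import random

# Sorted toy dataset (made up for this sketch): 80 passes followed by 20 fails.
labels = ["pass"] * 80 + ["fail"] * 20
split_at = int(0.8 * len(labels))  # index where the test set begins

# Without shuffling: the split simply follows the sort order.
train_sorted, test_sorted = labels[:split_at], labels[split_at:]
print(set(train_sorted), set(test_sorted))  # {'pass'} {'fail'}

# With shuffling: mix the rows first, then split at the same index.
shuffled = labels[:]                # copy so the original order is untouched
random.Random(0).shuffle(shuffled)  # seeded so the result is reproducible
train_shuffled, test_shuffled = shuffled[:split_at], shuffled[split_at:]

# Count how many of each label landed in the shuffled test set.
print(test_shuffled.count("pass"), test_shuffled.count("fail"))
```

With real data, features and labels have to be shuffled together so each row keeps its own label. scikit-learn's `train_test_split` handles this for you and shuffles by default (`shuffle=True`).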
The validation set: a third piece
In practice, many teams use three splits:
- Training set (60-70%): the model learns on this
- Validation set (15-20%): used to tune settings and compare models
- Test set (15-20%): the final, untouched exam
Think of validation as practice exams. You use them to adjust your study strategy, but the real exam (test set) is still a surprise. This prevents you from accidentally overfitting to the test set by tweaking your model until it scores well on it.
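One common way to get three pieces is to call `train_test_split` twice. This is a sketch under assumed proportions (60/20/20) with made-up toy data, not the only way to do it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for this sketch): 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Step 1: set aside 20% of the data as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into training and validation.
# 25% of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

You would tune settings and compare models using `X_val`, and only touch `X_test` once at the very end.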
The Right Way: Split THEN Preprocess
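The steps for this section aren't spelled out here, so the following is a sketch of one common reading of the heading: split first, then fit any preprocessing on the training set only and reuse it on the test set. Otherwise, statistics computed from test rows leak into training, which breaks the "never peek" rule. `StandardScaler` and the random toy data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data (made up for this sketch): 100 samples with 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y = np.arange(100) % 2

# 1. Split FIRST, before any preprocessing touches the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Fit the scaler on the training set only.
#    Its mean and standard deviation come from training rows alone.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply the already-fitted scaler to the test set (transform, not fit).
X_test_scaled = scaler.transform(X_test)
```

The wrong order would be calling `fit_transform` on all of `X` and splitting afterwards: the scaler's statistics would then include the test rows.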