Model Evaluation
A doctor at a free clinic sees 1,000 patients a year. Only 10 actually have a rare disease. One day, the clinic gets a new AI diagnostic tool. The doctor runs it on all 1,000 patients and it announces: "Nobody has the disease!"
Accuracy? 99%. Impressive, right?
Except it missed all 10 people who actually have the disease. Those 10 patients walk out thinking they're healthy. The model is 99% accurate and 100% useless.
This is why accuracy alone is a terrible metric for many real-world problems, especially when one class is rare. We need a toolkit of evaluation measures, each revealing a different angle of model performance.
Beyond accuracy: the metrics toolkit
Precision & Recall (refresher)
- Precision: Of all the patients the model flagged as sick, what fraction actually were? (Quality of alarms)
- Recall: Of all the actually sick patients, what fraction did the model catch? (Coverage)
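Both definitions reduce to counting true positives (TP), false positives (FP), and false negatives (FN). The sketch below assumes binary labels (1 = sick, 0 = healthy). The helper name `precision_recall` and the example model, which flags 20 patients and is right about 8 of them, are hypothetical.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means 'sick'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Return 0.0 instead of dividing by zero when nothing was flagged
    # or when there are no sick patients at all.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical model: 10 sick patients out of 1,000; it flags 20 people,
# 8 of whom are really sick (so it misses 2 and raises 12 false alarms).
y_true = [1] * 10 + [0] * 990
y_pred = [1] * 8 + [0] * 2 + [1] * 12 + [0] * 978

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.2f}")  # precision = 0.40
print(f"recall    = {recall:.2f}")     # recall    = 0.80
```

Here most alarms are false (precision 0.40) even though most sick patients are caught (recall 0.80), which shows that the two metrics can move independently.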
F1 Score
The harmonic mean of precision and recall. It punishes you hard if either one is low. A model with 99% precision but 1% recall gets an F1 of about 0.02 (2%), not the 50% that a regular average would suggest.
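The 99% / 1% example can be checked directly. This is a small sketch of the standard harmonic-mean formula, F1 = 2PR / (P + R); the function name `f1_score` is just a local helper here, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 0.99, 0.01
f1 = f1_score(p, r)
arithmetic_mean = (p + r) / 2

print(f"F1              = {f1:.4f}")               # F1              = 0.0198
print(f"arithmetic mean = {arithmetic_mean:.2f}")  # arithmetic mean = 0.50
```

Because the product `precision * recall` is in the numerator, the low value drags the whole score down. The arithmetic mean lets the strong value hide the weak one.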
ROC Curve & AUC
Most classifiers don't just say "yes/no"; they output a probability (e.g., "78% chance of disease"). The ROC curve shows what happens to the True Positive Rate vs False Positive Rate as you move the decision threshold from 0% to 100%.
AUC (Area Under the Curve) summarizes the ROC curve as a single number: 1.0 = perfect, 0.5 = random guessing.
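As a rough illustration of how the curve and its area come about, the sketch below sweeps the threshold over every distinct score, records one (FPR, TPR) point per threshold, and integrates with the trapezoid rule. The helper names `roc_points` and `auc` and the six example scores are made up for illustration, and the code assumes both classes appear in `y_true`. In practice a library such as scikit-learn would normally do this.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for thr in sorted(set(scores), reverse=True):
        # Everything scoring at or above the threshold is predicted positive.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve through the points (trapezoid rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical scores: one healthy patient (0.7) outranks one sick patient (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc(roc_points(y_true, scores))
print(f"AUC = {area:.3f}")  # AUC = 0.889
```

The result, 8/9, equals the fraction of (sick, healthy) pairs in which the sick patient gets the higher score. This is a common way to read AUC.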
Comprehensive Model Evaluation
Choosing the right metric
| Use Case | Priority | Best Metric |
|---|---|---|
| Spam filter | Don't block real emails | High Precision |
| Cancer detection | Don't miss any cancer | High Recall |
| Balanced dataset | Overall performance | Accuracy or F1 |
| Comparing classifiers | Threshold-independent | ROC-AUC |
The threshold tradeoff
Most classifiers give you a probability. You decide the cutoff. Set it at 50%? That's the usual default. Lower it to 20%? You'll catch more true positives but get more false alarms too. The ROC curve plots this tradeoff across every possible threshold.
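The effect of moving the cutoff can be seen on a handful of made-up scores. This sketch assumes a patient is flagged when the score is at or above the threshold. The helper `counts_at` and the ten example scores are hypothetical.

```python
def counts_at(y_true, scores, threshold):
    """Return (true positives, false positives) when flagging scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    return tp, fp


# Hypothetical predicted probabilities for 4 sick and 6 healthy patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.90, 0.70, 0.40, 0.25, 0.60, 0.30, 0.22, 0.10, 0.05, 0.02]

for threshold in (0.5, 0.2):
    tp, fp = counts_at(y_true, scores, threshold)
    print(f"threshold {threshold:.1f}: caught {tp} of 4 sick, {fp} false alarms")
# threshold 0.5: caught 2 of 4 sick, 1 false alarms
# threshold 0.2: caught 4 of 4 sick, 3 false alarms
```

Lowering the cutoff from 0.5 to 0.2 doubles the number of sick patients caught in this toy data and triples the false alarms. Which side of that trade matters more depends on the use case.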