Model Evaluation

A doctor at a free clinic sees 1,000 patients a year. Only 10 actually have a rare disease. One day, the clinic gets a new AI diagnostic tool. The doctor runs it on all 1,000 patients and it announces: "Nobody has the disease!"

Accuracy? 99%. Impressive, right?

Except it missed all 10 people who actually have the disease. Those 10 patients walk out thinking they're healthy. The model is 99% accurate and 100% useless.

This is why accuracy alone is a terrible metric for many real-world problems. We need a toolkit of evaluation measures, each revealing a different angle of model performance.

Beyond accuracy: the metrics toolkit

Precision & Recall (refresher)

Precision: Of all the patients the model flagged as sick, how many actually were? (Quality of alarms)
Recall: Of all the actually sick patients, how many did the model catch? (Coverage)
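Both definitions reduce to simple ratios of confusion-matrix counts. Here is a minimal sketch in plain Python. The helper name `precision_recall` and the example counts (a hypothetical model that flags 20 patients, 8 of them truly sick, out of 10 sick in total) are illustrative assumptions, not numbers from a real tool:

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts.

    A ratio with a zero denominator is undefined and returned as None.
    """
    flagged = tp + fp        # everyone the model called sick
    actually_sick = tp + fn  # everyone who really is sick
    precision = tp / flagged if flagged else None
    recall = tp / actually_sick if actually_sick else None
    return precision, recall


# Hypothetical model: flags 20 patients, 8 of them truly sick, misses 2
precision, recall = precision_recall(tp=8, fp=12, fn=2)
print(f"Precision: {precision:.2f}")  # 8 / 20
print(f"Recall:    {recall:.2f}")     # 8 / 10

# The "nobody has the disease" model: no alarms at all, 10 missed cases
print(precision_recall(tp=0, fp=0, fn=10))
```

For the "nobody has the disease" model, recall is 0 and precision is undefined, because the model never raised an alarm in the first place.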

F1 Score

The harmonic mean of precision and recall. It punishes you hard if either one is low. A model with 99% precision but 1% recall gets an F1 of ~0.02 — not 50% as a regular average would suggest.
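To see the arithmetic, here is a small sketch that compares the regular average with the harmonic mean for that 99% / 1% model. The function name `f1` is just an illustrative helper written for this example:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.99, 0.01
arithmetic_mean = (precision + recall) / 2
harmonic_mean = f1(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.50
print(f"F1 (harmonic):   {harmonic_mean:.2f}")    # 0.02
```

The harmonic mean is dominated by the smaller of the two values, which is why one weak number drags the whole score down.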

ROC Curve & AUC

Most classifiers don't just say "yes/no"; they output a probability (e.g., "78% chance of disease"). The ROC curve plots the True Positive Rate (the same thing as recall) against the False Positive Rate (the share of healthy patients wrongly flagged) as you move the decision threshold from 0% to 100%.

AUC (Area Under the Curve) summarizes the ROC curve as a single number: 1.0 = perfect, 0.5 = random guessing.
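One way to build intuition for AUC: it equals the probability that a randomly chosen sick patient gets a higher score than a randomly chosen healthy one, with ties counting as half. The sketch below computes it that way in plain Python on four made-up scores. `pairwise_auc` is a hypothetical helper written for illustration; scikit-learn's `roc_auc_score(y_true, scores)` should give the same number for binary labels.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (sick, healthy) pairs that are ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # sick patient scored higher: correct ranking
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up example: two healthy patients (0) and two sick patients (1)
y_true = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]

auc = pairwise_auc(y_true, scores)
print(f"AUC: {auc:.2f}")  # 3 of the 4 pairs are ranked correctly -> 0.75
```

The pairwise loop is quadratic in the number of samples, so it is only meant for building intuition on tiny examples.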

Comprehensive Model Evaluation

from sklearn.metrics import accuracy_score, recall_score, f1_score
import numpy as np

# Simulated: 100 patients, 10 actually sick
np.random.seed(42)
y_true = np.array([1]*10 + [0]*90)

# Model A: "nobody is sick" (90% accuracy on this data, but useless)
y_pred_a = np.zeros(100)

# Model B: catches 8/10 sick, but also flags 15 healthy
y_pred_b = np.zeros(100)
y_pred_b[:8] = 1   # catches 8 sick
y_pred_b[15:30] = 1  # 15 false alarms

print("=== Model A: 'Nobody is sick' ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_a):.2f}")
print(f"Recall:   {recall_score(y_true, y_pred_a):.2f}")
print(f"F1:       {f1_score(y_true, y_pred_a):.2f}")

print("\n=== Model B: Catches most cases ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_b):.2f}")
print(f"Recall:   {recall_score(y_true, y_pred_b):.2f}")
print(f"F1:       {f1_score(y_true, y_pred_b):.2f}")

Output

=== Model A: 'Nobody is sick' ===
Accuracy: 0.90
Recall:   0.00
F1:       0.00

=== Model B: Catches most cases ===
Accuracy: 0.83
Recall:   0.80
F1:       0.48
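The printout above leaves out precision, which explains much of Model B's modest F1. Here is a small sketch in plain Python that rebuilds the same Model B predictions and counts the four confusion-matrix cells by hand (the variable names are illustrative):

```python
# Same setup as Model B: 10 sick patients first, then 90 healthy ones
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
for i in range(8):        # catches 8 of the 10 sick patients
    y_pred[i] = 1
for i in range(15, 30):   # 15 false alarms among the healthy
    y_pred[i] = 1

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # sick, flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # healthy, flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # sick, missed
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # healthy, cleared

precision = tp / (tp + fp)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Precision: {precision:.2f}")
```

Only 8 of the 23 alarms are real, so roughly two out of three flagged patients are actually healthy.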

Choosing the right metric

Use Case	Priority	Best Metric
Spam filter	Don't block real emails	High Precision
Cancer detection	Don't miss any cancer	High Recall
Balanced dataset	Overall performance	Accuracy or F1
Comparing classifiers	Threshold-independent	ROC-AUC

The threshold tradeoff

Most classifiers give you a probability. You decide the cutoff. Set it at 50%? That's standard. Set it at 20%? You'll catch more true positives but get more false alarms too. The ROC curve shows you every possible tradeoff at every possible threshold.
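A quick sketch of that tradeoff, using eight made-up patients and predicted probabilities (the data and the helper `counts_at` are assumptions for illustration only):

```python
# Made-up labels (1 = sick) and predicted probabilities for 8 patients
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.90, 0.60, 0.30, 0.15, 0.40, 0.25, 0.10, 0.05]


def counts_at(threshold):
    """Return (true positives, false positives) when flagging probs >= threshold."""
    flagged = [p >= threshold for p in probs]
    tp = sum(1 for y, f in zip(y_true, flagged) if y == 1 and f)
    fp = sum(1 for y, f in zip(y_true, flagged) if y == 0 and f)
    return tp, fp


for threshold in (0.5, 0.2):
    tp, fp = counts_at(threshold)
    print(f"threshold={threshold:.1f}: caught {tp}/4 sick, {fp} false alarms")
```

On this toy data, lowering the cutoff from 0.5 to 0.2 catches 3 of the 4 sick patients instead of 2, at the cost of 2 false alarms instead of none.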

Key Metrics

🎯 Accuracy

(TP+TN)/Total: simple and intuitive. Misleading with imbalanced classes; avoid as the sole metric.

⚖️ F1 Score

Harmonic mean: balances precision and recall. Use when false positives and false negatives both matter.

📈 ROC-AUC

Area under the curve: threshold-independent. Best for comparing classifiers across all thresholds.

📉 Log Loss

Probability-based: penalizes confident mistakes. Use when you care about calibrated probabilities.
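Log loss is the only metric here that has not been shown in action, so here is a minimal sketch of the standard binary formula, -[y·log(p) + (1-y)·log(1-p)], applied to a single patient. The helper `log_loss_single` is a hypothetical name for this example; scikit-learn's `log_loss` averages the same quantity over all samples.

```python
import math


def log_loss_single(y, p):
    """Log loss for one prediction: y is the true label (0 or 1), p is P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# The patient is actually sick (y = 1) in all three cases
for p in (0.90, 0.60, 0.01):
    print(f"predicted {p:.2f} -> loss {log_loss_single(1, p):.3f}")
```

A confident correct prediction (0.90) costs about 0.105, a hesitant one (0.60) about 0.511, and a confident mistake (0.01) about 4.605, which is what "penalizes confident mistakes" means in practice.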