Model Evaluation

Accuracy isn't everything: ROC, AUC, F1 explained
  • Accuracy: Simple but misleading on imbalanced data
  • ROC-AUC: Threshold-independent · robust metric
  • F1 score: Balances precision and recall

A doctor at a free clinic sees 1,000 patients a year. Only 10 actually have a rare disease. One day, the clinic gets a new AI diagnostic tool. The doctor runs it on all 1,000 patients and it announces: "Nobody has the disease!"

Accuracy? 99%. Impressive, right?

Except it missed all 10 people who actually have the disease. Those 10 patients walk out thinking they're healthy. The model is 99% accurate and 100% useless.

This is why accuracy alone is a terrible metric for many real-world problems. We need a toolkit of evaluation measures, each revealing a different angle of model performance.

Beyond accuracy: the metrics toolkit

Precision & Recall (refresher)

  • Precision: Of all the patients the model flagged as sick, how many actually were? (Quality of alarms)
  • Recall: Of all the actually sick patients, how many did the model catch? (Coverage)
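
To make the two definitions concrete, here is a minimal sketch that counts true positives, false positives, and false negatives by hand. The helper name `precision_recall` and the toy labels are made up for illustration; they are not from any library.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = sick."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against division by zero when nothing is flagged or nobody is sick.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical toy data: 4 sick patients, 6 healthy.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]  # 3 caught, 1 missed, 2 false alarms

precision, recall = precision_recall(y_true, y_pred)
print(f"Precision: {precision:.2f}")  # 3 of the 5 alarms were real
print(f"Recall:    {recall:.2f}")     # 3 of the 4 sick patients were caught
```

Here precision is 3/5 and recall is 3/4: the same predictions look different depending on which question you ask.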

F1 Score

The harmonic mean of precision and recall. It punishes you hard if either one is low. A model with 99% precision but 1% recall gets an F1 of ~0.02, not the 50% a regular average would suggest.
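
The arithmetic behind that claim is short enough to check by hand. A small sketch, with `f1_from` as an illustrative helper name rather than a library function:

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.99, 0.01
arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")
print(f"F1:              {f1:.4f}")
```

The regular average lands at 0.50, while the harmonic mean is dragged down to roughly 0.02 by the weak recall.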

ROC Curve & AUC

Most classifiers don't just say "yes/no"; they output a probability (e.g., "78% chance of disease"). The ROC curve plots the True Positive Rate (recall: the share of sick patients caught) against the False Positive Rate (the share of healthy patients wrongly flagged) as you move the decision threshold from 0% to 100%.

AUC (Area Under the Curve) summarizes the ROC curve as a single number: 1.0 = perfect, 0.5 = random guessing.
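
To see where the curve and its area come from, here is a rough hand-rolled sketch. The function names `roc_points` and `auc` and the scores are invented for illustration; in practice scikit-learn's `roc_curve` and `roc_auc_score` do this job.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, sweeping the threshold from strict to lenient."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        flagged = [s >= threshold for s in scores]
        tp = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f)
        fp = sum(1 for t, f in zip(y_true, flagged) if t == 0 and f)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Trapezoidal area under the (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical scores: higher means "more likely sick".
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

points = roc_points(y_true, scores)
roc_auc = auc(points)
print(f"AUC: {roc_auc:.3f}")
```

Only one healthy patient (score 0.7) outranks a sick one (score 0.6), so 8 of the 9 sick/healthy pairs are ordered correctly and the area comes out to 8/9.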

Comprehensive Model Evaluation

from sklearn.metrics import accuracy_score, recall_score, f1_score
import numpy as np

# Simulated: 100 patients, 10 actually sick (the first 10 entries)
y_true = np.array([1] * 10 + [0] * 90)

# Model A: "nobody is sick" (90% accuracy on this data, but useless)
y_pred_a = np.zeros(100, dtype=int)

# Model B: catches 8/10 sick, but also flags 15 healthy
y_pred_b = np.zeros(100, dtype=int)
y_pred_b[:8] = 1      # catches 8 sick
y_pred_b[15:30] = 1   # 15 false alarms

print("=== Model A: 'Nobody is sick' ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_a):.2f}")
print(f"Recall:   {recall_score(y_true, y_pred_a):.2f}")
# zero_division=0 silences the warning some scikit-learn versions
# emit when the model predicts no positives at all
print(f"F1:       {f1_score(y_true, y_pred_a, zero_division=0):.2f}")

print("\n=== Model B: Catches most cases ===")
print(f"Accuracy: {accuracy_score(y_true, y_pred_b):.2f}")
print(f"Recall:   {recall_score(y_true, y_pred_b):.2f}")
print(f"F1:       {f1_score(y_true, y_pred_b):.2f}")
Output
=== Model A: 'Nobody is sick' ===
Accuracy: 0.90
Recall:   0.00
F1:       0.00

=== Model B: Catches most cases ===
Accuracy: 0.83
Recall:   0.80
F1:       0.48
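
Neither printout shows precision, which is where Model B pays for its 15 false alarms. The sketch below recounts the four outcomes by hand, using plain lists in place of the arrays above (scikit-learn's `confusion_matrix` should give the same counts):

```python
# Rebuild Model B: the 10 sick patients come first, then 90 healthy ones.
y_true = [1] * 10 + [0] * 90
y_pred_b = [0] * 100
y_pred_b[:8] = [1] * 8       # 8 sick patients caught
y_pred_b[15:30] = [1] * 15   # 15 healthy patients flagged

# Count each (truth, prediction) combination.
pairs = list(zip(y_true, y_pred_b))
tp = pairs.count((1, 1))
fn = pairs.count((1, 0))
fp = pairs.count((0, 1))
tn = pairs.count((0, 0))

precision = tp / (tp + fp)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"Precision: {precision:.2f}")
```

Only 8 of the 23 flagged patients are actually sick, so precision is about 0.35 even though recall is 0.80.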

Choosing the right metric

Use Case              | Priority                | Best Metric
Spam filter           | Don't block real emails | High Precision
Cancer detection      | Don't miss any cancer   | High Recall
Balanced dataset      | Overall performance     | Accuracy or F1
Comparing classifiers | Threshold-independent   | ROC-AUC

The threshold tradeoff

Most classifiers give you a probability. You decide the cutoff. Set it at 50%? That's standard. Set it at 20%? You'll catch more true positives but get more false alarms too. The ROC curve shows you every possible tradeoff at every possible threshold.
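
A quick sketch of that tradeoff, comparing the 50% and 20% cutoffs on a handful of invented probabilities (the data and the helper `recall_and_false_alarms` are hypothetical):

```python
def recall_and_false_alarms(y_true, probs, threshold):
    """Flag a patient when the predicted probability meets the threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / sum(y_true), fp

# Hypothetical predicted probabilities for 3 sick and 5 healthy patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.85, 0.55, 0.30, 0.60, 0.25, 0.15, 0.10, 0.05]

for threshold in (0.5, 0.2):
    recall, false_alarms = recall_and_false_alarms(y_true, probs, threshold)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  false alarms={false_alarms}")
```

On this toy data, dropping the cutoff from 0.5 to 0.2 lifts recall from 2/3 to 3/3 and raises the false alarm count from 1 to 2.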

Note: Always ask: "What's the cost of each type of error?" If a false negative costs a life (missed cancer) and a false positive costs a phone call (follow-up appointment), you should heavily optimize for recall, even at the expense of precision.
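
One way to act on that question is to put rough numbers on each error and pick the threshold with the lowest total cost. This is only a sketch: the 100-to-1 cost ratio, the data, and the helper `total_cost` are assumptions chosen for illustration, not recommended values.

```python
# Assumed costs, purely illustrative: a missed case is 100x worse than a false alarm.
COST_FALSE_NEGATIVE = 100
COST_FALSE_POSITIVE = 1

def total_cost(y_true, probs, threshold):
    """Sum the cost of every missed case and every false alarm at this threshold."""
    y_pred = [1 if p >= threshold else 0 for p in probs]
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * COST_FALSE_NEGATIVE + fp * COST_FALSE_POSITIVE

# Hypothetical predicted probabilities for 3 sick and 5 healthy patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
probs = [0.85, 0.55, 0.30, 0.60, 0.25, 0.15, 0.10, 0.05]

candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
best = min(candidates, key=lambda t: total_cost(y_true, probs, t))
print(f"Cheapest threshold: {best} (cost {total_cost(y_true, probs, best)})")
print(f"Cost at 0.5:        {total_cost(y_true, probs, 0.5)}")
```

With these assumed costs the cheapest cutoff sits well below 0.5, because a single missed case outweighs many follow-up calls.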

Key Metrics

🎯 Accuracy
Misleading with imbalanced classes; avoid as the sole metric
(TP+TN)/Total · Simple, intuitive
⚖️ F1 Score
Use when false positives and false negatives both matter
Harmonic mean · Balances precision & recall
📈 ROC-AUC
Best for comparing classifiers across all thresholds
Area under curve · Threshold-independent
📉 Log Loss
Use when you care about calibrated probabilities
Probability-based · Penalizes confident mistakes
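
Log loss is the one metric here that the earlier sections don't walk through, so here is a small hand-written sketch of how it penalizes confident mistakes. The name `mean_log_loss` and both sets of probabilities are made up for illustration; scikit-learn provides `log_loss` for real use.

```python
import math

def mean_log_loss(y_true, probs, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
# Both hypothetical models miss the third patient at a 0.5 cutoff,
# so their accuracy is identical (3 of 4 correct).
hedged_miss = [0.9, 0.1, 0.40, 0.1]     # wrong, but only mildly
confident_miss = [0.9, 0.1, 0.01, 0.1]  # wrong and very sure about it

print(f"Hedged miss:    {mean_log_loss(y_true, hedged_miss):.3f}")
print(f"Confident miss: {mean_log_loss(y_true, confident_miss):.3f}")
```

Accuracy cannot tell these two models apart, but log loss charges the confident one far more for the same mistake.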

Quick check

A model achieves 99% accuracy on a dataset where 99% of samples are class A. Is this model useful?