Practical ML · 7 min read

Confusion Matrix

True positives, false alarms: measure what matters
true positive: Correct alarm · real fire detected
false positive: False alarm · burnt toast triggered it
false negative: Missed detection · real fire, no alarm

Your apartment has a fire alarm. It can do four things:

  • True Positive (TP): There's a real fire, and the alarm goes off. Perfect. This is what you want.
  • True Negative (TN): No fire, no alarm. Also perfect. Quiet night.
  • False Positive (FP): You burnt some toast, and the alarm screams at 3 AM. Annoying, but you're alive.
  • False Negative (FN): There's a real fire, but the alarm stays silent. You're asleep. This is the worst case.

A confusion matrix is simply a table that counts how many of each type your model produced. It's called "confusion" because it shows you exactly where your model gets confused.
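To make the counting concrete, here is a minimal sketch in plain Python. The helper name `count_outcomes` and the five-night sample data are made up for illustration; the only assumption is that labels follow the fire-alarm convention of 1 = fire, 0 = no fire.

```python
def count_outcomes(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["TP"] += 1      # real fire, alarm went off
        elif actual == 0 and predicted == 0:
            counts["TN"] += 1      # no fire, no alarm
        elif actual == 0 and predicted == 1:
            counts["FP"] += 1      # burnt toast, alarm went off
        else:
            counts["FN"] += 1      # real fire, alarm stayed silent
    return counts


# Made-up data: five nights, two of them with a real fire.
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0]
print(count_outcomes(y_true, y_pred))
# {'TP': 1, 'TN': 2, 'FP': 1, 'FN': 1}
```

Every prediction lands in exactly one of the four buckets, so the four counts always add up to the number of examples.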

Reading the matrix

For a binary classifier (like fire/no-fire), the confusion matrix is a 2×2 grid:

                    Predicted: Fire       Predicted: No Fire
Actual: Fire        TP (correct alarm)    FN (missed fire!)
Actual: No Fire     FP (false alarm)      TN (quiet night)
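One way to practice reading the grid is to build it by hand and add up its rows, columns, and diagonal. The sketch below uses hypothetical counts chosen only for illustration and keeps the same layout as the grid above (rows are actual classes, columns are predictions, Fire first).

```python
# Hypothetical counts, chosen only for illustration.
tp, fn, fp, tn = 3, 2, 1, 4

# Rows are actual classes, columns are predicted classes, Fire first.
grid = [
    [tp, fn],  # Actual: Fire
    [fp, tn],  # Actual: No Fire
]

row_labels = ["Actual: Fire", "Actual: No Fire"]
print(f"{'':<18}{'Pred: Fire':>12}{'Pred: No Fire':>15}")
for label, row in zip(row_labels, grid):
    print(f"{label:<18}{row[0]:>12}{row[1]:>15}")

# Row sums are actual class totals, column sums are prediction totals,
# and the main diagonal holds the correct predictions.
actual_fires = sum(grid[0])                  # TP + FN
predicted_fires = grid[0][0] + grid[1][0]    # TP + FP
correct = grid[0][0] + grid[1][1]            # TP + TN
print(actual_fires, predicted_fires, correct)
# 5 4 7
```

Keep in mind that scikit-learn's `confusion_matrix` sorts classes by label value, so with 0/1 labels it returns `[[TN, FP], [FN, TP]]`, which is this grid with both rows and columns flipped.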

The metrics that flow from it

  • Accuracy = (TP + TN) / Total: how often you're right overall
  • Precision = TP / (TP + FP): when you say "fire," how often is it real?
  • Recall = TP / (TP + FN): of all real fires, how many did you catch?
  • F1 Score = 2 × (Precision × Recall) / (Precision + Recall): harmonic mean of both
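The four formulas translate almost line for line into code. This is a minimal sketch, not a replacement for a library: `binary_metrics` is a hypothetical helper, the counts are invented, and returning 0.0 when a denominator is zero is a convention chosen here for simplicity.

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumption: fall back to 0.0 when a ratio is undefined (0 / 0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * (precision * recall) / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented counts: 6 fires caught, 4 missed, 2 false alarms, 8 quiet nights.
metrics = binary_metrics(tp=6, tn=8, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name:<10}{value:.2f}")
```

With these counts, precision (0.75) is higher than recall (0.60): most alarms are real, but the model still misses 4 of the 10 fires.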

For the fire alarm, recall matters most. You'd rather have 10 false alarms than miss one real fire.

Building a Confusion Matrix

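from sklearn.metrics import confusion_matrix, classification_report

# Actual labels and predictions (1 = fire, 0 = no fire)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

# Rows are actual classes, columns are predicted classes, in sorted label
# order (0 then 1), so the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(cm)
print()

# target_names follows the same sorted label order: 0 = No Fire, 1 = Fire
print(classification_report(y_true, y_pred,
                            target_names=['No Fire', 'Fire']))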
Output
Confusion Matrix:
[[4 1]
 [2 3]]

              precision    recall  f1-score
   No Fire       0.67      0.80      0.73
      Fire       0.75      0.60      0.67
  accuracy                           0.70

When accuracy lies

Imagine 1,000 people go through airport security. Only 2 are actually carrying something dangerous. A lazy detector that says "nobody is dangerous" gets 998/1000 = 99.8% accuracy. Sounds incredible, but it missed both actual threats. Its recall is 0%.

This is why the confusion matrix exists. Accuracy alone is dangerous for imbalanced datasets. You need to see the full picture: where exactly is the model failing?
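The airport numbers are easy to check yourself. The sketch below rebuilds the scenario in plain Python, assuming 1 = dangerous and 0 = safe (a label convention picked for this example).

```python
# 1,000 passengers: 2 real threats (1) and 998 harmless travelers (0).
y_true = [1] * 2 + [0] * 998
# The lazy detector says "safe" for everyone.
y_pred = [0] * 1000

pairs = list(zip(y_true, y_pred))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)
fn = sum(1 for a, p in pairs if a == 1 and p == 0)
fp = sum(1 for a, p in pairs if a == 0 and p == 1)
tn = sum(1 for a, p in pairs if a == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"accuracy = {accuracy:.1%}")   # accuracy = 99.8%
print(f"recall   = {recall:.1%}")     # recall   = 0.0%
# Precision is undefined here: the detector never predicts "dangerous",
# so TP + FP = 0 and the ratio would be 0 / 0.
```

The confusion matrix makes the failure obvious at a glance: the entire "predicted dangerous" column is empty.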

Note: The "cost" of errors is often asymmetric. A false negative in cancer screening (missed cancer) is far worse than a false positive (unnecessary follow-up). Always ask: which type of error is more expensive for YOUR use case?
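One rough way to act on that question is to put a price on each error type and compare models by total cost instead of error count. Everything in this sketch is hypothetical: the cost values, the two models, and their error counts are invented purely to show the arithmetic.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Weighted error cost: each false positive and false negative has a price."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical costs: a missed case is assumed to be 50x worse than a false alarm.
COST_FP = 1
COST_FN = 50

# Hypothetical error counts for two models on the same test set.
model_a = {"fp": 5, "fn": 10}    # 15 errors in total
model_b = {"fp": 30, "fn": 2}    # 32 errors in total

cost_a = total_cost(model_a["fp"], model_a["fn"], COST_FP, COST_FN)
cost_b = total_cost(model_b["fp"], model_b["fn"], COST_FP, COST_FN)

print(f"Model A: {sum(model_a.values())} errors, cost {cost_a}")
print(f"Model B: {sum(model_b.values())} errors, cost {cost_b}")
# Model A: 15 errors, cost 505
# Model B: 32 errors, cost 130
```

Under these assumed costs, Model B makes more than twice as many mistakes yet is far cheaper, because almost all of its mistakes are the inexpensive kind.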

Key Metrics

  • Accuracy: (TP + TN) / Total. Overall correctness. Misleading when classes are imbalanced.
  • Precision: TP / (TP + FP). Quality of positive predictions. High precision = few false alarms.
  • Recall: TP / (TP + FN). Coverage of actual positives. High recall = few missed cases.
  • F1 Score: harmonic mean of precision and recall. Balance of both. Use when you need both precision and recall.

Quick check

A spam filter blocks 100 emails. 90 are actually spam, 10 are legitimate. What is the filter's precision for spam detection?