Random Forests
Imagine you're trying to guess the number of jellybeans in a jar. You ask 100 different people to guess. Individually, most guesses are wildly off. But if you average all 100 guesses, the result is often remarkably close to the real number, because the overestimates and underestimates tend to cancel out.
This is the wisdom of the crowd, and it's the core idea behind Random Forests. Instead of relying on one decision tree (which might overfit), you build hundreds of slightly different trees and let them vote. For classification, the majority class wins; for regression, you average the trees' predictions.
Each tree might be mediocre on its own. But their combined vote? Surprisingly accurate.
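To see why voting helps, here is a minimal simulation sketch. It rests on an assumption that real trees only approximate: every voter is right 60% of the time and makes its mistakes independently of the others. The 60% figure, the 101 voters, and the function name `majority_vote_accuracy` are made up for illustration.

```python
import random


def majority_vote_accuracy(n_voters, p_correct, n_trials, rng):
    """Estimate how often a majority of independent voters is right."""
    wins = 0
    for _ in range(n_trials):
        # Each voter is independently correct with probability p_correct.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        # The ensemble is right when more than half the voters are right.
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials


rng = random.Random(0)  # fixed seed so the run is repeatable
single = majority_vote_accuracy(1, 0.6, 2000, rng)
ensemble = majority_vote_accuracy(101, 0.6, 2000, rng)  # odd count avoids ties
print(f"single voter: {single:.2f}, 101 voters: {ensemble:.2f}")
```

Under these assumptions the 101-voter ensemble should be right far more often than any single voter. The gain depends on the errors being independent: if every voter made the same mistakes, the majority would be no better than one voter.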
What makes each tree "random"?
If you built 100 identical trees, they'd all make the same mistakes, so voting wouldn't help. The trick is to make each tree different by introducing two types of randomness:
1. Random data (Bootstrap sampling)
Each tree gets its own random sample of the training data, drawn with replacement and typically the same size as the original set. Imagine putting all your data into a bag and pulling out samples one at a time, putting each one back before the next draw. Some examples appear twice or more, and some get left out entirely. Each tree sees a slightly different version of reality.
2. Random features
At each split, the tree doesn't consider all features, only a random subset. If you have 10 features, each split might only look at 3 or 4. This forces trees to find different patterns and prevents them from all latching onto the same dominant feature.
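The two kinds of randomness above can be sketched in a few lines of plain Python. This is an illustration of the sampling steps only, not a full tree builder. The dataset size (`n_rows = 1000`) and the per-split budget (`max_features = 3`) are assumed values; the 10 features come from the example above.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable

n_rows = 1000      # hypothetical training-set size
n_features = 10    # matches the 10-feature example in the text
max_features = 3   # assumed number of features considered per split

# 1. Bootstrap sample: draw n_rows row indices WITH replacement,
#    so some rows repeat and others never appear.
bootstrap_rows = [rng.randrange(n_rows) for _ in range(n_rows)]
unique_rows = set(bootstrap_rows)
left_out = n_rows - len(unique_rows)

# 2. Random feature subset: drawn WITHOUT replacement, and redrawn
#    at every split rather than once per tree.
split_features = rng.sample(range(n_features), max_features)

print(f"distinct rows in this tree's sample: {len(unique_rows)} of {n_rows}")
print(f"rows left out: {left_out}")
print(f"features considered at this split: {sorted(split_features)}")
```

On average a bootstrap sample of this kind contains about 63% of the distinct rows, since each row is missed with probability close to 1/e. If you use scikit-learn, the `bootstrap` and `max_features` parameters of `RandomForestClassifier` control these same two ideas.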