Core Algorithms7 min read

Random Forests

Many trees vote together β€” wisdom of the crowd
training:O(T * n * d * log n) β€” T trees, each on a subsetprediction:O(T * depth) β€” ask every tree, then voteaccuracy:Much higher than a single decision tree

Imagine you're trying to guess the number of jellybeans in a jar. You ask 100 different people to guess. Individually, most guesses are wildly off. But if you average all 100 guesses, the result is eerily close to the real number.

This is the wisdom of the crowd, and it's the core idea behind Random Forests. Instead of relying on one decision tree (which might overfit), you build hundreds of slightly different trees and let them vote.

Each tree might be mediocre on its own. But their combined vote? Surprisingly accurate.

What makes each tree "random"?

If you built 100 identical trees, they'd all make the same mistakes β€” voting wouldn't help. The trick is to make each tree different by introducing two types of randomness:

1. Random data (Bootstrap sampling)

Each tree gets a random subset of the training data, sampled with replacement. Imagine putting all your data into a bag, pulling out samples one at a time (putting each back before the next draw). Some examples appear twice, some get left out. Each tree sees a slightly different version of reality.

2. Random features

At each split, the tree doesn't consider all features β€” only a random subset. If you have 10 features, each split might only look at 3 or 4. This forces trees to find different patterns and prevents them from all latching onto the same dominant feature.

Random Forest from Simple Trees

import random
from collections import Counter
def build_simple_tree(data, features_subset):
"""Simplified: pick best feature from subset, split at median"""
best_feat = random.choice(features_subset)
values = [row[best_feat] for row, _ in data]
threshold = sorted(values)[len(values) // 2]
left = [(r, l) for r, l in data if r[best_feat] <= threshold]
right = [(r, l) for r, l in data if r[best_feat] > threshold]
left_label = Counter(l for _, l in left).most_common(1)[0][0]
right_label = Counter(l for _, l in right).most_common(1)[0][0]
return (best_feat, threshold, left_label, right_label)
def predict_tree(tree, x):
feat, thresh, left_l, right_l = tree
return left_l if x[feat] <= thresh else right_l
def random_forest(data, n_trees=5, n_features=2):
trees = []
n = len(data)
all_features = list(range(len(data[0][0])))
for _ in range(n_trees):
# Bootstrap sample
sample = [random.choice(data) for _ in range(n)]
feats = random.sample(all_features, min(n_features, len(all_features)))
trees.append(build_simple_tree(sample, feats))
return trees
# Data: [size, sweetness, color_score] β†’ fruit
data = [([3,8,1],'apple'),([2,3,5],'orange'),([4,9,2],'apple'),
([1,2,6],'orange'),([3,7,1],'apple'),([2,4,4],'orange')]
random.seed(42)
forest = random_forest(data, n_trees=7)
votes = [predict_tree(t, [3, 7, 2]) for t in forest]
print(f"Votes: {votes}")
print(f"Winner: {Counter(votes).most_common(1)[0][0]}")
Output
Votes: ['apple', 'apple', 'orange', 'apple', 'apple', 'apple', 'orange']
Winner: apple
Note: Random Forests are often the first algorithm data scientists reach for. They work well out of the box, handle both classification and regression, rarely overfit (thanks to averaging), and require minimal tuning. The trade-off? They're harder to interpret than a single decision tree β€” you can't easily draw 500 trees on a whiteboard.

Key Metrics

Training
T trees, but each can be trained in parallel
Moderate-High O(T * n * d * log n)
Prediction
Run input through all trees, then vote
Moderate O(T * depth)
Overfitting resistance
Individual trees overfit; the ensemble does not
High Averaging smooths errors
Interpretability
Hard to explain why 500 trees voted a certain way
Low Black-box-ish

Quick check

Why does a random forest use random subsets of features at each split?
Challenge

Continue reading