Random Forests

Imagine you're trying to guess the number of jellybeans in a jar. You ask 100 different people to guess. Individually, most guesses are wildly off. But if you average all 100 guesses, the result is eerily close to the real number.

This is the wisdom of the crowd, and it's the core idea behind Random Forests. Instead of relying on one decision tree (which might overfit), you build hundreds of slightly different trees and let them vote.

Each tree might be mediocre on its own. But their combined vote? Surprisingly accurate.

What makes each tree "random"?

If you built 100 identical trees, they'd all make the same mistakes — voting wouldn't help. The trick is to make each tree different by introducing two types of randomness:

1. Random data (Bootstrap sampling)

Each tree gets a random subset of the training data, sampled with replacement. Imagine putting all your data into a bag, pulling out samples one at a time (putting each back before the next draw). Some examples appear twice, some get left out. Each tree sees a slightly different version of reality.

2. Random features

At each split, the tree doesn't consider all features — only a random subset. If you have 10 features, each split might only look at 3 or 4. This forces trees to find different patterns and prevents them from all latching onto the same dominant feature.

Random Forest from Simple Trees

import random
from collections import Counter

def build_simple_tree(data, features_subset):
    """Simplified: pick best feature from subset, split at median"""
    best_feat = random.choice(features_subset)
    values = [row[best_feat] for row, _ in data]
    threshold = sorted(values)[len(values) // 2]
    left = [(r, l) for r, l in data if r[best_feat] <= threshold]
    right = [(r, l) for r, l in data if r[best_feat] > threshold]
    left_label = Counter(l for _, l in left).most_common(1)[0][0]
    right_label = Counter(l for _, l in right).most_common(1)[0][0]
    return (best_feat, threshold, left_label, right_label)

def predict_tree(tree, x):
    feat, thresh, left_l, right_l = tree
    return left_l if x[feat] <= thresh else right_l

def random_forest(data, n_trees=5, n_features=2):
    trees = []
    n = len(data)
    all_features = list(range(len(data[0][0])))
    for _ in range(n_trees):
        # Bootstrap sample
        sample = [random.choice(data) for _ in range(n)]
        feats = random.sample(all_features, min(n_features, len(all_features)))
        trees.append(build_simple_tree(sample, feats))
    return trees

# Data: [size, sweetness, color_score] → fruit
data = [([3,8,1],'apple'),([2,3,5],'orange'),([4,9,2],'apple'),
        ([1,2,6],'orange'),([3,7,1],'apple'),([2,4,4],'orange')]

random.seed(42)
forest = random_forest(data, n_trees=7)
votes = [predict_tree(t, [3, 7, 2]) for t in forest]
print(f"Votes: {votes}")
print(f"Winner: {Counter(votes).most_common(1)[0][0]}")

Output

Votes: ['apple', 'apple', 'orange', 'apple', 'apple', 'apple', 'orange']
Winner: apple

Key Metrics

Training

T trees, but each can be trained in parallel

Moderate-High O(T * n * d * log n)

Prediction

Run input through all trees, then vote

Moderate O(T * depth)

Overfitting resistance

Individual trees overfit; the ensemble does not

High Averaging smooths errors

Interpretability

Hard to explain why 500 trees voted a certain way

Low Black-box-ish

What makes each tree "random"?

1. Random data (Bootstrap sampling)

2. Random features

Random Forest from Simple Trees

Key Metrics

Quick check