RL & Generative AI · 10 min read

Reinforcement Learning Basics

Learn by trial, error, and reward

Scope: Core Concept · Difficulty: Intermediate

Training a Dog

You're teaching a puppy to sit. You say "sit." The puppy spins in a circle. Nothing happens. The puppy lies down. Nothing. The puppy accidentally sits. Treat! The puppy's eyes light up. After enough repetitions, the puppy learns: sit = treat.

That's reinforcement learning (RL) — learning by trial, error, and reward. No one showed the puppy a manual. No one labeled 10,000 examples of sitting. The puppy just tried things and learned from the consequences.

The RL Framework

Every RL problem has four components:

Agent — the learner (the puppy, a game bot, a robot)
Environment — the world the agent interacts with (the room, the game, the factory floor)
State — what the agent observes right now (the room layout, the game screen, sensor readings)
Action — what the agent can do (sit, jump, move left, grab an object)

The loop: the agent observes a state, takes an action, receives a reward (positive or negative), and transitions to a new state. This repeats until the episode ends (the puppy sits, the game is over) or, in a continuing task, indefinitely. The goal: learn a policy, a strategy that maximizes total reward over time.
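The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard API: the `CoinFlipEnv` class, the `run_episode` helper, and `random_policy` are hypothetical names invented for this example. The `reset`/`step` interface simply mirrors the one used by the grid world later in this lesson.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip, 5 rounds per episode."""
    def __init__(self, rounds=5):
        self.rounds = rounds
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                      # State: the current round number

    def step(self, action):
        coin = random.choice(['heads', 'tails'])
        reward = 1 if action == coin else -1
        self.t += 1
        done = self.t >= self.rounds       # Episode ends after the last round
        return self.t, reward, done

def run_episode(env, policy):
    """Observe a state, act, collect the reward, and repeat until done."""
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward

def random_policy(state):
    """A policy maps a state to an action; this one ignores the state."""
    return random.choice(['heads', 'tails'])

print(run_episode(CoinFlipEnv(), random_policy))
```

A random policy averages a total reward of about zero here, since no guess is better than another. Learning only pays off in environments where some actions really are better than others in a given state.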

The Exploration-Exploitation Dilemma

Imagine you're in a new city, hungry for dinner. You found a decent restaurant yesterday (7/10). Do you go back (exploit what you know) or try a new place (explore and maybe find a 9/10 — or a 3/10)?

This is the exploration-exploitation dilemma, and it's at the heart of RL. Too much exploitation: you settle for mediocrity. Too much exploration: you never capitalize on what you've learned. Every RL algorithm must balance the two.

A common approach: epsilon-greedy. With probability epsilon (say 10%), take a random action (explore). With probability 1-epsilon (90%), take the best known action (exploit). Over time, gradually decrease epsilon — explore a lot at first, then settle into exploiting what you've learned.
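As a rough sketch of epsilon-greedy with a decaying epsilon: the function names, the restaurant scores, and the decay schedule (exponential decay with a floor of 0.05) are all assumptions chosen for illustration. Linear and other schedules are equally reasonable.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, else the best-known one."""
    if random.random() < epsilon:
        return random.choice(list(q_values))      # Explore
    return max(q_values, key=q_values.get)        # Exploit

def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Shrink epsilon exponentially per episode, never going below `end`."""
    return max(end, start * decay ** episode)

# Estimated value of each option so far (illustrative numbers).
q_values = {'old_restaurant': 7.0, 'new_restaurant': 0.0}

for episode in (0, 100, 500):
    eps = decayed_epsilon(episode)
    choice = epsilon_greedy(q_values, eps)
    print(f"episode {episode}: epsilon={eps:.3f}, chose {choice}")
```

Keeping a small floor on epsilon means the agent never stops exploring entirely, which helps if the environment changes or an early value estimate was unlucky.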

Reward Shaping: The Art of Incentives

Designing the right reward function is crucial — and tricky. A robot told to "reach the door" might learn to fall forward because it's the fastest way to move the sensor closer. A game bot rewarded for score might discover an exploit instead of playing as intended.

This is called reward hacking, and it's a preview of the alignment challenges facing AI today. The agent will do exactly what you incentivize, which might not be what you actually want.
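A toy illustration of how a plausible-looking reward can be gamed. Everything here is hypothetical and made up for the example: a five-cell corridor with a door at the far end, and a flawed reward that pays +1 whenever a step ends closer to the door, with no penalty for backing away.

```python
DOOR = 4          # Position of the door in a 1-D corridor (cells 0..4)
MAX_STEPS = 20    # Cap on episode length

def proximity_reward(old_pos, new_pos):
    """Flawed reward: +1 whenever the agent ends a step closer to the door."""
    return 1 if abs(DOOR - new_pos) < abs(DOOR - old_pos) else 0

def run(policy):
    """Return (total reward, whether the door was reached) for a fixed policy."""
    pos, total = 0, 0
    for t in range(MAX_STEPS):
        move = policy(pos, t)                      # +1 = forward, -1 = back
        new_pos = min(DOOR, max(0, pos + move))
        total += proximity_reward(pos, new_pos)
        pos = new_pos
        if pos == DOOR:
            return total, True                     # Episode ends at the door
    return total, False

def walk_to_door(pos, t):
    return 1                                       # Always step forward

def shuffle_in_place(pos, t):
    return 1 if t % 2 == 0 else -1                 # Forward, back, repeat

print(run(walk_to_door))       # (4, True)
print(run(shuffle_in_place))   # (10, False)
```

The shuffling policy never reaches the door yet collects more than twice the reward, so an agent optimizing this signal would learn to shuffle. Rewarding only the outcome (reaching the door) or penalizing backward steps removes this particular loophole.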

A Simple RL Agent in a Grid World

import random

class GridWorld:
    """A 4x4 grid. Agent starts at (0,0), goal at (3,3)."""
    def __init__(self):
        self.size = 4
        self.reset()
    
    def reset(self):
        self.pos = (0, 0)
        return self.pos
    
    def step(self, action):
        x, y = self.pos
        if action == 'up':    y = max(0, y - 1)
        elif action == 'down':  y = min(self.size - 1, y + 1)
        elif action == 'left':  x = max(0, x - 1)
        elif action == 'right': x = min(self.size - 1, x + 1)
        
        self.pos = (x, y)
        
        if self.pos == (3, 3):
            return self.pos, 10, True   # Big reward for goal!
        return self.pos, -1, False      # Small penalty per step

def train_agent(episodes=500):
    env = GridWorld()
    actions = ['up', 'down', 'left', 'right']
    
    # Q-table: state -> action -> estimated value, initialised to zero
    Q = {}
    for x in range(env.size):
        for y in range(env.size):
            Q[(x, y)] = {a: 0.0 for a in actions}

    epsilon = 0.3  # Exploration rate (kept fixed here, not decayed)
    alpha = 0.1    # Learning rate
    gamma = 0.9    # Discount factor
    
    for ep in range(episodes):
        state = env.reset()
        for _ in range(50):  # Max steps
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[state][a])
            
            next_state, reward, done = env.step(action)
            
            # Q-learning update
            best_next = max(Q[next_state].values())
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            
            state = next_state
            if done:
                break
    
    return Q

random.seed(42)
Q = train_agent()

# Show learned policy
print("Learned policy (best action per cell):")
arrows = {'up': '^', 'down': 'v', 'left': '<', 'right': '>'}
for y in range(4):
    row = ""
    for x in range(4):
        if (x, y) == (3, 3):
            row += " G "
        else:
            best = max(Q[(x,y)], key=Q[(x,y)].get)
            row += f" {arrows[best]} "
    print(row)

Output

Learned policy (best action per cell):
 v  v  >  v
 >  v  >  v
 >  >  >  v
 >  >  >  G

Note: RL's greatest hits. AlphaGo (2016) used RL to master Go, beating world champion Lee Sedol. OpenAI Five (2019) used RL to beat professional Dota 2 teams. RL has also been used to optimize data center cooling (cutting the energy Google used for cooling by up to 40%), to control the plasma in a nuclear fusion reactor, and to fine-tune large language models through reinforcement learning from human feedback (RLHF), the technique that helped make ChatGPT so conversational.
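For reference, the update line inside `train_agent` is the standard Q-learning rule. Written out with the same symbols as the code (`alpha` = 0.1, `gamma` = 0.9), and followed by two hand-worked cases that assume the relevant Q-values are still at their initial value of zero:

```latex
% Q-learning update, using the same symbols as the code
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]

% First step of training (all Q-values are 0, step reward is -1):
Q(s, a) \leftarrow 0 + 0.1 \, \bigl[ -1 + 0.9 \cdot 0 - 0 \bigr] = -0.1

% First time the agent steps onto the goal (reward 10, Q(s, a) still 0;
% the goal cell's own values are never updated, so the max term is 0):
Q(s, a) \leftarrow 0 + 0.1 \, \bigl[ 10 + 0.9 \cdot 0 - 0 \bigr] = 1.0
```

Over many episodes the positive value at the goal spreads backward through the `max` term, one cell at a time, which is why the learned arrows end up pointing toward G.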

RL vs Supervised Learning

In supervised learning, you have a dataset of correct answers: "this image is a cat." In RL, there are no correct answers — only rewards that come after the action. The agent has to figure out which of its past actions led to the reward. Did it get the treat because it sat, because it wagged its tail, or because it barked three times first?

This is called the credit assignment problem — figuring out which actions deserve credit for a delayed reward. It's what makes RL both powerful and challenging.
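One simple mechanism for spreading credit backward is the discount factor already used in the grid-world code. The sketch below computes the discounted return at each step of a short episode. The helper name `discounted_returns` and the puppy's three-step episode are assumptions for illustration, and discounting only assigns credit by recency. It cannot tell which earlier action actually mattered.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step, computed back to front."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

# The puppy barks, wags its tail, then sits; only the last step earns a treat.
actions = ['bark', 'wag tail', 'sit']
rewards = [0, 0, 1]

for action, g in zip(actions, discounted_returns(rewards)):
    print(f"{action:>8}: return {g:.2f}")
```

With `gamma = 0.9`, sitting gets a return of 1.00, tail-wagging 0.90, and barking 0.81. Only by trying many different sequences can the agent discover that sitting alone is enough to earn the treat.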

Quick check

What is the exploration-exploitation dilemma?
