Reinforcement Learning Basics

Learn by trial, error, and reward
Scope: Core Concept | Difficulty: Intermediate

Training a Dog

You're teaching a puppy to sit. You say "sit." The puppy spins in a circle. Nothing happens. The puppy lies down. Nothing. The puppy accidentally sits. Treat! The puppy's eyes light up. After enough repetitions, the puppy learns: sit = treat.

That's reinforcement learning (RL): learning by trial, error, and reward. No one showed the puppy a manual. No one labeled 10,000 examples of sitting. The puppy just tried things and learned from the consequences.

The RL Framework

Every RL problem has four components:

  • Agent: the learner (the puppy, a game bot, a robot)
  • Environment: the world the agent interacts with (the room, the game, the factory floor)
  • State: what the agent observes right now (the room layout, the game screen, sensor readings)
  • Action: what the agent can do (sit, jump, move left, grab an object)

The loop: the agent observes a state, takes an action, receives a reward (positive or negative), and transitions to a new state. The cycle repeats until the episode ends or, in a continuing task, indefinitely. The goal: learn a policy, a strategy for choosing actions that maximizes total reward over time.
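
The loop described above can be sketched in a few lines. This is a minimal illustration, not a standard API: `PuppyEnv`, its state string, and `run_episode` are hypothetical names invented here, and the reward of 1 for sitting is an assumption chosen to match the puppy story.

```python
import random


class PuppyEnv:
    """Hypothetical toy environment: only the 'sit' action earns a treat."""

    ACTIONS = ["spin", "lie_down", "sit"]

    def reset(self):
        # The state is what the agent observes; here it never changes.
        return "heard_sit_command"

    def step(self, action):
        reward = 1 if action == "sit" else 0
        next_state = "heard_sit_command"
        return next_state, reward


def run_episode(env, policy, steps=10):
    """Run observe -> act -> reward -> new state, and return the total reward."""
    state = env.reset()
    total = 0
    for _ in range(steps):
        action = policy(state)
        state, reward = env.step(action)
        total += reward
    return total


rng = random.Random(0)


def random_policy(state):
    return rng.choice(PuppyEnv.ACTIONS)


def always_sit(state):
    return "sit"


print("random policy:", run_episode(PuppyEnv(), random_policy))
print("always sit:   ", run_episode(PuppyEnv(), always_sit))  # 10
```

A policy here is just a function from state to action. The random policy earns a treat only on the steps where it happens to sit, while the always-sit policy collects the maximum of 10, which is the kind of gap learning is meant to close.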

The Exploration-Exploitation Dilemma

Imagine you're in a new city, hungry for dinner. You found a decent restaurant yesterday (7/10). Do you go back (exploit what you know) or try a new place (explore and maybe find a 9/10 β€” or a 3/10)?

This is the exploration-exploitation dilemma, and it's at the heart of RL. Too much exploitation: you settle for mediocrity. Too much exploration: you never capitalize on what you've learned. Every RL algorithm must balance the two.

A common approach: epsilon-greedy. With probability epsilon (say 10%), take a random action (explore). With probability 1-epsilon (90%), take the best known action (exploit). Over time, gradually decrease epsilon: explore a lot at first, then settle into exploiting what you've learned.
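
As a rough sketch of the idea, epsilon-greedy selection plus one possible decay schedule might look like this. The function names, the geometric decay rate of 0.99, and the floor of 0.01 are illustrative assumptions, not values prescribed by the method.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit


def decayed_epsilon(start, episode, decay=0.99, floor=0.01):
    """Shrink epsilon geometrically per episode, but never below `floor`."""
    return max(floor, start * decay ** episode)


# Estimated value of each option, reusing the restaurant example.
q_values = {"old_restaurant": 7.0, "new_restaurant": 0.0}

print(epsilon_greedy(q_values, epsilon=0.0))   # old_restaurant (pure exploitation)
print(decayed_epsilon(0.1, episode=0))         # 0.1
print(decayed_epsilon(0.1, episode=500))       # 0.01 (the floor)
```

Keeping a small floor means the agent never stops exploring entirely, which is one common design choice; decaying all the way to zero is another.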

Reward Shaping: The Art of Incentives

Designing the right reward function is crucial, and tricky. A robot told to "reach the door" might learn to fall forward because it's the fastest way to move the sensor closer. A game bot rewarded for score might discover an exploit instead of playing as intended.

This is called reward hacking, and it's a preview of the alignment challenges facing AI today. The agent will do exactly what you incentivize, which might not be what you actually want.
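
A toy illustration of how a reasonable-looking reward can be gamed, under assumptions made up for this sketch: a 1-D corridor with the door at position 5, and two hypothetical reward functions. The first pays a bonus for any step that ends closer to the door. The second pays the change in distance, so stepping away costs exactly what stepping closer earned.

```python
GOAL = 5  # hypothetical door position in a 1-D corridor


def progress_bonus(old_pos, new_pos):
    """Naive reward: +1 whenever a step ends closer to the door, else 0."""
    return 1 if abs(GOAL - new_pos) < abs(GOAL - old_pos) else 0


def distance_change(old_pos, new_pos):
    """Alternative reward: how much closer (or farther) the step moved the agent."""
    return abs(GOAL - old_pos) - abs(GOAL - new_pos)


def total_reward(moves, reward_fn, start=0):
    """Sum rewards over a list of +1/-1 moves; the episode ends at the door."""
    pos, total = start, 0
    for move in moves:
        new_pos = pos + move
        total += reward_fn(pos, new_pos)
        pos = new_pos
        if pos == GOAL:
            break
    return total


walk_to_door = [+1] * 5
pace_back_and_forth = [+1, -1] * 10  # never arrives

print(total_reward(walk_to_door, progress_bonus))          # 5
print(total_reward(pace_back_and_forth, progress_bonus))   # 10
print(total_reward(walk_to_door, distance_change))         # 5
print(total_reward(pace_back_and_forth, distance_change))  # 0
```

Under the naive bonus, pacing back and forth earns twice as much as actually reaching the door, so an agent maximizing that reward has no reason to arrive. The second function removes this particular loophole, though it is only one possible fix and does not rule out other exploits.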

A Simple RL Agent in a Grid World

import random


class GridWorld:
    """A 4x4 grid. Agent starts at (0,0), goal at (3,3)."""

    def __init__(self):
        self.size = 4
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        # y grows downward, so 'up' decreases y; moves are clamped to the grid
        x, y = self.pos
        if action == 'up': y = max(0, y - 1)
        elif action == 'down': y = min(self.size - 1, y + 1)
        elif action == 'left': x = max(0, x - 1)
        elif action == 'right': x = min(self.size - 1, x + 1)
        self.pos = (x, y)
        if self.pos == (3, 3):
            return self.pos, 10, True  # Big reward for goal!
        return self.pos, -1, False  # Small penalty per step


def train_agent(episodes=500):
    env = GridWorld()
    actions = ['up', 'down', 'left', 'right']
    # Q-table: state -> action -> value
    Q = {}
    for x in range(4):
        for y in range(4):
            Q[(x,y)] = {a: 0.0 for a in actions}
    epsilon = 0.3
    alpha = 0.1  # Learning rate
    gamma = 0.9  # Discount factor
    for ep in range(episodes):
        state = env.reset()
        for _ in range(50):  # Max steps
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            # Q-learning update
            best_next = max(Q[next_state].values())
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
            if done:
                break
    return Q


random.seed(42)
Q = train_agent()

# Show learned policy
print("Learned policy (best action per cell):")
arrows = {'up': '^', 'down': 'v', 'left': '<', 'right': '>'}
for y in range(4):
    row = ""
    for x in range(4):
        if (x, y) == (3, 3):
            row += " G "
        else:
            best = max(Q[(x,y)], key=Q[(x,y)].get)
            row += f" {arrows[best]} "
    print(row)
Output
Learned policy (best action per cell):
 v  v  >  v
 >  v  >  v
 >  >  >  v
 >  >  >  G
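
To see what the Q-learning update line in the code above does numerically, here is a small worked sketch. The helper `q_update` is a hypothetical name introduced for illustration; it applies the same formula with the same alpha = 0.1 and gamma = 0.9 to three hand-picked situations.

```python
def q_update(q_old, reward, best_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update: nudge q_old toward reward + gamma * best_next."""
    target = reward + gamma * best_next
    return q_old + alpha * (target - q_old)


# Ordinary step early in training: every estimate is still 0, the reward is -1.
print(q_update(q_old=0.0, reward=-1, best_next=0.0))            # -0.1

# Step that lands on the goal: reward 10, and the goal cell's values stay 0.
print(q_update(q_old=0.0, reward=10, best_next=0.0))            # 1.0

# Later visit to a cell two steps from the goal, whose best neighbor is now worth 1.0.
print(round(q_update(q_old=0.0, reward=-1, best_next=1.0), 4))  # -0.01
```

Each update moves an estimate only a tenth of the way toward its target, which is why the value of reaching the goal takes many episodes to spread back toward the start cell.
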
Note: RL's greatest hits: AlphaGo (2016) used RL to master Go, beating world champion Lee Sedol. OpenAI Five (2019) used RL to beat professional Dota 2 teams. RL has also been used to optimize data center cooling (cutting the energy Google uses for cooling by up to 40%), control plasma in nuclear fusion reactors, and fine-tune large language models (RLHF, the technique that made ChatGPT so conversational).

RL vs Supervised Learning

In supervised learning, you have a dataset of correct answers: "this image is a cat." In RL, there are no correct answers β€” only rewards that come after the action. The agent has to figure out which of its past actions led to the reward. Did it get the treat because it sat, because it wagged its tail, or because it barked three times first?

This is called the credit assignment problem: figuring out which actions deserve credit for a delayed reward. It's what makes RL both powerful and challenging.
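
One simple way to spread a delayed reward back over earlier actions is the discounted return, which uses the same discount factor gamma that appears in the grid-world code. The sketch below is an illustration under assumptions: the function name `discounted_returns` and the four-action puppy episode are made up for this example, and discounting is only one crude answer to credit assignment, not a full solution.

```python
def discounted_returns(rewards, gamma=0.9):
    """Compute G_t = r_t + gamma * G_{t+1} for every step, working back to front."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# Hypothetical episode: the treat arrives only after the final action.
actions = ["bark", "wag tail", "spin", "sit"]
rewards = [0, 0, 0, 1]

for action, g in zip(actions, discounted_returns(rewards)):
    print(f"{action:>8}: {g:.3f}")
```

With gamma = 0.9 the returns come out as 0.729, 0.810, 0.900, and 1.000: the action closest to the treat gets the most credit, and earlier actions get progressively less. Discounting alone cannot tell that the bark was irrelevant; that only becomes clear over many episodes in which barking is sometimes skipped and the treat still arrives.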

Quick check

What is the exploration-exploitation dilemma?