Reinforcement Learning Basics
Training a Dog
You're teaching a puppy to sit. You say "sit." The puppy spins in a circle. Nothing happens. The puppy lies down. Nothing. The puppy accidentally sits. Treat! The puppy's eyes light up. After enough repetitions, the puppy learns: sit = treat.
That's reinforcement learning (RL): learning by trial, error, and reward. No one showed the puppy a manual. No one labeled 10,000 examples of sitting. The puppy just tried things and learned from the consequences.
The RL Framework
Every RL problem has four components:
- Agent: the learner (the puppy, a game bot, a robot)
- Environment: the world the agent interacts with (the room, the game, the factory floor)
- State: what the agent observes right now (the room layout, the game screen, sensor readings)
- Action: what the agent can do (sit, jump, move left, grab an object)
The loop: the agent observes a state, takes an action, receives a reward (positive or negative), and transitions to a new state. Then the cycle repeats, step after step. The goal: learn a policy, a strategy for choosing actions that maximizes total reward over time.
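The loop above can be sketched in a few lines of Python. Everything here is made up for illustration: CorridorEnv is a hypothetical toy environment (a short hallway with a goal at the right end), its reset/step methods are an assumed interface rather than any particular library's API, and the reward values are arbitrary.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: cells 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # the state is just the agent's cell index

    def step(self, action):
        # action: 0 = move left, 1 = move right; walls clamp the position
        move = 1 if action == 1 else -1
        self.position = max(0, min(self.length - 1, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.position, reward, done

def run_episode(env, policy, max_steps=100):
    """Observe a state, act, collect the reward, move to the next state; repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def random_policy(state):
    # A policy maps a state to an action; this one ignores the state entirely.
    return random.choice([0, 1])

random.seed(0)
print(run_episode(CorridorEnv(), random_policy))
```

A policy that always moves right reaches the goal in four steps; the random one usually takes longer and collects more step penalties along the way. Learning a good policy means closing that gap.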
The Exploration-Exploitation Dilemma
Imagine you're in a new city, hungry for dinner. You found a decent restaurant yesterday (7/10). Do you go back (exploit what you know) or try a new place (explore and maybe find a 9/10, or a 3/10)?
This is the exploration-exploitation dilemma, and it's at the heart of RL. Too much exploitation: you settle for mediocrity. Too much exploration: you never capitalize on what you've learned. Every RL algorithm must balance the two.
A common approach: epsilon-greedy. With probability epsilon (say 10%), take a random action (explore). With probability 1-epsilon (90%), take the best known action (exploit). Over time, gradually decrease epsilon: explore a lot at first, then settle into exploiting what you've learned.
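Here is a rough sketch of epsilon-greedy with a decaying epsilon, applied to the restaurant problem. The setup is an assumption for illustration: three restaurants with hidden average ratings, noisy ratings on each visit, and a multiplicative decay schedule (eps_start, eps_end, and decay are hypothetical parameter names, and the values are arbitrary).

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def run_bandit(true_means, steps=2000, eps_start=1.0, eps_end=0.05,
               decay=0.995, seed=0):
    rng = random.Random(seed)
    estimates = [0.0] * len(true_means)  # running average rating per restaurant
    counts = [0] * len(true_means)       # how often each one was visited
    epsilon = eps_start
    for _ in range(steps):
        action = epsilon_greedy(estimates, epsilon, rng)
        reward = rng.gauss(true_means[action], 1.0)  # noisy rating for this visit
        counts[action] += 1
        # Incremental mean: nudge the estimate toward the new rating.
        estimates[action] += (reward - estimates[action]) / counts[action]
        # Explore a lot at first, then mostly exploit.
        epsilon = max(eps_end, epsilon * decay)
    return estimates, counts

# Hidden average ratings: a 7/10, a 3/10, and a 9/10.
estimates, counts = run_bandit([7.0, 3.0, 9.0])
print(counts)
```

Early on, visits are spread across all three restaurants. As epsilon shrinks, most visits go to whichever restaurant has the highest estimated rating, while the floor at eps_end keeps a little exploration going in case the estimates are off.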
Reward Shaping: The Art of Incentives
Designing the right reward function is crucial, and tricky. A robot told to "reach the door" might learn to fall forward because it's the fastest way to move the sensor closer. A game bot rewarded for score might discover an exploit instead of playing as intended.
This is called reward hacking, and it's a preview of the alignment challenges facing AI today. The agent will do exactly what you incentivize, which might not be what you actually want.
A Simple RL Agent in a Grid World
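Here's one minimal sketch of such an agent, using tabular Q-learning (a standard RL algorithm, picked here as an assumption). The 4x4 grid, the rewards (-1 per step, +10 at the goal), the hyperparameters, and the names step, train, and greedy_path are all illustrative choices, not a definitive implementation.

```python
import random

SIZE = 4                                       # 4x4 grid; states are (row, col)
GOAL = (SIZE - 1, SIZE - 1)                    # bottom-right corner
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action_index):
    """Environment: move one cell (walls clamp), -1 per step, +10 at the goal."""
    dr, dc = ACTIONS[action_index]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    next_state = (row, col)
    done = next_state == GOAL
    reward = 10.0 if done else -1.0
    return next_state, reward, done

def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    # Q-table: estimated long-term value of each action in each state.
    q = {(r, c): [0.0] * len(ACTIONS) for r in range(SIZE) for c in range(SIZE)}
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.randrange(len(ACTIONS))
            else:
                action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward
            # reward + discounted value of the best next action.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

def greedy_path(q, max_steps=50):
    """Follow the learned policy from the start cell, with no exploration."""
    state, path = (0, 0), [(0, 0)]
    for _ in range(max_steps):
        action = max(range(len(ACTIONS)), key=lambda a: q[state][a])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path

q_table = train()
print(greedy_path(q_table))
```

The agent starts out wandering. Each time it stumbles onto the goal, the reward propagates one step further back through the Q-table, and after enough episodes the greedy path should head from the top-left corner to the bottom-right without detours.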
RL vs Supervised Learning
In supervised learning, you have a dataset of correct answers: "this image is a cat." In RL, there are no correct answers, only rewards that come after the action. The agent has to figure out which of its past actions led to the reward. Did it get the treat because it sat, because it wagged its tail, or because it barked three times first?
This is called the credit assignment problem: figuring out which actions deserve credit for a delayed reward. It's what makes RL both powerful and challenging.
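One simple and widely used way to spread credit backward is the discounted return: each action is credited with the rewards that follow it, with later rewards counting for less. The sketch below assumes a discount factor gamma (a value between 0 and 1 that is not part of the puppy story) and a hypothetical three-step episode where the treat arrives only at the end.

```python
def discounted_returns(rewards, gamma=0.9):
    """For each time step, sum the rewards from that step onward,
    shrinking each later reward by another factor of gamma."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backward so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Puppy episode: spin, lie down, sit. Only the last action is followed by a treat.
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
```

With gamma set to 0.5 this prints [0.25, 0.5, 1.0]: sitting gets full credit, lying down gets half, and the spin gets a quarter. Discounting doesn't solve credit assignment by itself (the spin still gets some undeserved credit), but over many episodes the actions that reliably precede treats accumulate the most.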