Q-Learning
The Cheat Sheet
Imagine you're playing a video game for the first time. You stumble through rooms, fall into pits, and occasionally find treasure. After dozens of attempts, you start writing notes:
- "In room A, going left gives +10 points"
- "In room A, going right leads to a pit (-5 points)"
- "In room B, going up reaches the treasure (+100 points!)"
Over time, your notebook becomes a cheat sheet — for every room (state) and every direction (action), you know roughly how much reward to expect. That cheat sheet is a Q-table, and the process of building it is Q-learning.
What is Q?
The letter Q stands for "quality." Q(state, action) is the expected total future reward, with later rewards counting for somewhat less than immediate ones, from taking that action in that state and then following the best strategy from there on.
A high Q-value means: "this is a great move." A low Q-value means: "this move leads to bad outcomes." A perfect Q-table tells you the optimal move in every situation — just pick the action with the highest Q-value.
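To make the cheat-sheet idea concrete, here is a minimal sketch that stores the Q-table as a plain Python dictionary keyed by (state, action). The rooms and values come from the notebook example above; the dictionary layout and the helper name `best_action` are assumptions chosen for illustration.

```python
# Q-table as a dictionary: (state, action) -> estimated future reward.
# Values mirror the notebook entries from the video-game example.
q_table = {
    ("room A", "left"): 10.0,
    ("room A", "right"): -5.0,
    ("room B", "up"): 100.0,
}

def best_action(q_table, state):
    """Return the action with the highest Q-value recorded for `state`."""
    candidates = {a: q for (s, a), q in q_table.items() if s == state}
    if not candidates:
        raise KeyError(f"no entries for state {state!r}")
    return max(candidates, key=candidates.get)

print(best_action(q_table, "room A"))  # left
print(best_action(q_table, "room B"))  # up
```

Picking the highest-valued action like this is only as good as the table itself, which is why the rest of the article is about how to fill it in.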
The Q-Learning Update Rule
Here's the magic formula that makes Q-learning work:
Q(s, a) ← Q(s, a) + α * [reward + γ * max Q(s', a') - Q(s, a)]
Let's decode this:
- Q(s, a): the current estimate of how good action a is in state s
- α (alpha): the learning rate. How much to trust new information vs old estimates. Typically 0.1.
- reward: the immediate reward received after taking action a
- γ (gamma): the discount factor. How much to value future rewards vs immediate ones. Typically 0.9.
- max Q(s', a'): the best Q-value available from the next state s'
In English: "Update my estimate of this move based on the reward I just got, plus the best possible future from where I ended up."
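The update rule can be sketched as a single function. The numbers in the worked example are made up for illustration, and the names `q_update`, `td_target`, and `td_error` are labels chosen here, not something the formula prescribes.

```python
def q_update(q_sa, reward, max_q_next, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new estimate for Q(s, a)."""
    td_target = reward + gamma * max_q_next   # what this experience suggests
    td_error = td_target - q_sa               # how far off the old estimate was
    return q_sa + alpha * td_error            # move a fraction alpha toward the target

# Worked example: old estimate 2.0, reward 1.0, best next-state value 5.0.
# target = 1.0 + 0.9 * 5.0 = 5.5; error = 5.5 - 2.0 = 3.5; new = 2.0 + 0.1 * 3.5 = 2.35
new_q = q_update(q_sa=2.0, reward=1.0, max_q_next=5.0)
print(round(new_q, 2))  # 2.35
```

With α = 0.1 the estimate moves only a tenth of the way toward the target, so a single surprising reward nudges the table rather than overwriting it.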
The Discount Factor: Patience vs Impatience
Gamma (γ) controls how far ahead the agent thinks:
- γ = 0: completely short-sighted. Only cares about immediate reward. Like a child who always wants candy now.
- γ = 0.99: very patient. Values future rewards almost as much as immediate ones. Like investing for retirement.
- γ = 0.9: a balanced middle ground, and a common starting point in practice.
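A short calculation shows how much γ changes what a delayed reward is worth. This sketch assumes a hypothetical episode in which nothing happens for three steps and then a treasure worth 100 arrives; the helper name `discounted_return` is chosen here for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards, weighting the reward at step t by gamma ** t."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# Hypothetical episode: nothing for three steps, then a treasure worth 100.
rewards = [0, 0, 0, 100]

for gamma in (0.0, 0.9, 0.99):
    print(gamma, round(discounted_return(rewards, gamma), 2))
# 0.0 0.0
# 0.9 72.9
# 0.99 97.03
```

With γ = 0 the treasure three steps away is invisible to the agent, while with γ = 0.99 it is worth nearly its full value.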
Q-Learning: From Scratch to Optimal Policy
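Putting the pieces together, here is a minimal sketch of tabular Q-learning that starts from an all-zero table and ends with a usable policy. The environment is an assumption made for illustration: a five-cell corridor with a treasure worth +1 in the last cell. The epsilon-greedy exploration rule, the episode count, and the random seed are also assumptions that do not come from the text above; α and γ use the typical values mentioned earlier.

```python
import random

N_STATES = 5            # corridor cells 0..4; cell 4 holds the treasure
ACTIONS = (-1, +1)      # step left or step right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def step(state, action):
    """Move within the corridor; reaching the last cell pays +1 and ends the episode."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(q, state, rng):
    """Epsilon-greedy: usually exploit the table, sometimes explore at random."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    # Break ties at random so an all-zero table does not favor one direction.
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, rng)
            next_state, reward, done = step(state, action)
            # A terminal state has no future, so its value is 0.
            max_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += ALPHA * (reward + GAMMA * max_next - q[(state, action)])
            state = next_state
    return q

q = train()
# Greedy policy for every non-terminal cell: +1 means "go right".
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)  # [1, 1, 1, 1]
```

After training, the learned values for moving right fall off by roughly a factor of γ per cell as you move away from the treasure, which is the discounting described above showing up in the table.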
From Q-Tables to Deep Q-Networks (DQN)
Q-tables work great for small problems (a 4x4 grid has just 16 states). But what about a video game with millions of possible screen configurations? You can't have a table that big.
In 2013, DeepMind had a breakthrough: replace the Q-table with a neural network. The network takes a state as input and outputs Q-values for all actions. This is a Deep Q-Network (DQN).
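The "state in, Q-values out" idea can be sketched with a tiny two-layer network written in plain Python. This is only an illustration of the input and output shapes: the weights are random and untrained, the layer sizes and state features are made up, and it is not DeepMind's architecture, which processed raw game frames with a much larger network.

```python
import random

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(layer, inputs, relu=False):
    """Compute weights @ inputs + biases, optionally followed by ReLU."""
    weights, biases = layer
    outputs = [sum(w * x for w, x in zip(row, inputs)) + b
               for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in outputs] if relu else outputs

def q_values(network, state):
    """Map a state vector to one Q-value per action."""
    hidden = forward(network[0], state, relu=True)
    return forward(network[1], hidden)

rng = random.Random(0)
STATE_SIZE, HIDDEN_SIZE, N_ACTIONS = 4, 8, 3   # illustrative sizes only
network = [init_layer(STATE_SIZE, HIDDEN_SIZE, rng),
           init_layer(HIDDEN_SIZE, N_ACTIONS, rng)]

state = [0.1, -0.2, 0.0, 0.5]                  # made-up state features
qs = q_values(network, state)
greedy_action = max(range(N_ACTIONS), key=lambda a: qs[a])
print(len(qs), greedy_action)
```

The network plays the same role as a row lookup in the Q-table: given a state, it returns one number per action, and the agent still picks the largest. The difference is that similar states share weights, so the agent does not need a separate entry for every possible screen.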
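DQN famously learned to play Atari games directly from raw pixels, with no game-specific programming. In the 2015 Nature paper it was evaluated on 49 games and performed at a level comparable to or better than a professional human games tester on more than half of them. The same algorithm learned Pong, Breakout, and Space Invaders just by playing each game over and over and learning which actions led to high scores.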
Today, deep reinforcement learning methods that grew out of this line of work are used in robotics, recommendation systems, and even chip design at Google.