Q-Learning

Build a cheat sheet of how good each move is
Scope: Core Concept · Difficulty: Intermediate

The Cheat Sheet

Imagine you're playing a video game for the first time. You stumble through rooms, fall into pits, and occasionally find treasure. After dozens of attempts, you start writing notes:

  • "In room A, going left gives +10 points"
  • "In room A, going right leads to a pit (-5 points)"
  • "In room B, going up reaches the treasure (+100 points!)"

Over time, your notebook becomes a cheat sheet — for every room (state) and every direction (action), you know roughly how much reward to expect. That cheat sheet is a Q-table, and the process of building it is Q-learning.

What is Q?

The letter Q stands for "quality." Q(state, action) represents the expected total future reward (with later rewards discounted) of taking that action in that state, and then following the best strategy from there on.

A high Q-value means: "this is a great move." A low Q-value means: "this move leads to bad outcomes." A perfect Q-table tells you the optimal move in every situation — just pick the action with the highest Q-value.
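To make the "pick the highest Q-value" idea concrete, here is a minimal sketch. It treats the notebook entries from the video game example as if they were Q-values; the table contents, the room names, and the helper `best_action` are illustrative assumptions, not part of any library.

```python
# Hypothetical Q-table: state -> {action: estimated Q-value}.
# The numbers are made up for illustration.
q_table = {
    "room_A": {"left": 10.0, "right": -5.0},
    "room_B": {"up": 100.0, "down": 0.0},
}

def best_action(q_table, state):
    """Return the action with the highest Q-value in the given state."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)

print(best_action(q_table, "room_A"))  # left
print(best_action(q_table, "room_B"))  # up
```

Reading a policy off the table is just this lookup repeated for every state.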

The Q-Learning Update Rule

Here's the magic formula that makes Q-learning work:

Q(s, a) ← Q(s, a) + α * [reward + γ * max_a' Q(s', a') - Q(s, a)]

Let's decode this:

  • Q(s, a): current estimate of how good action a is in state s
  • α (alpha): learning rate. How much to trust new information vs old estimates. A common starting value is 0.1.
  • reward: the immediate reward received after taking action a
  • γ (gamma): discount factor. How much to value future rewards vs immediate ones. Values around 0.9 to 0.99 are common.
  • s': the next state, reached by taking action a in state s
  • max_a' Q(s', a'): the best Q-value available from the next state s', taken over all actions a'

In English: "Update my estimate of this move based on the reward I just got, plus the best possible future from where I ended up."
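As a worked instance of the update rule, the sketch below applies a single update with made-up numbers. The function name `q_update` and the example values are assumptions chosen for illustration; α = 0.1 and γ = 0.9 match the values mentioned above.

```python
def q_update(q_sa, reward, next_q_values, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update and return the new estimate of Q(s, a)."""
    target = reward + gamma * max(next_q_values)  # reward plus best discounted future
    td_error = target - q_sa                      # how wrong the old estimate was
    return q_sa + alpha * td_error                # move a fraction alpha toward the target

# Old estimate 2.0, reward 1.0, and the next state offers Q-values 0.0, 5.0, 3.0.
# target = 1.0 + 0.9 * 5.0 = 5.5, error = 3.5, new estimate = 2.0 + 0.1 * 3.5
new_q = q_update(q_sa=2.0, reward=1.0, next_q_values=[0.0, 5.0, 3.0])
print(round(new_q, 3))  # 2.35
```

The estimate moves only a tenth of the way toward the target, which is why many repeated visits are needed before the table settles.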

The Discount Factor: Patience vs Impatience

Gamma (γ) controls how far ahead the agent thinks:

  • γ = 0: completely short-sighted. Only cares about immediate reward. Like a child who always wants candy now.
  • γ = 0.99: very patient. Values future rewards almost as much as immediate ones. Like investing for retirement.
  • γ = 0.9: a balanced middle ground, and a common default for small problems like the grid world below.
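The effect of γ is easiest to see numerically. This small sketch, with a hypothetical helper `discounted_value`, shows how much a reward of 100 is worth today when it arrives 0, 1, 10, or 50 steps in the future.

```python
def discounted_value(reward, steps_away, gamma):
    """Present value of a reward received `steps_away` steps in the future."""
    return (gamma ** steps_away) * reward

for gamma in (0.0, 0.9, 0.99):
    values = [round(discounted_value(100, k, gamma), 2) for k in (0, 1, 10, 50)]
    print(f"gamma={gamma}: {values}")
```

With γ = 0 everything beyond the immediate reward is worth nothing, while with γ = 0.99 a reward 50 steps away still keeps roughly 60% of its value.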

Q-Learning: From Scratch to Optimal Policy

import random


class TreasureHunt:
    """Simple environment: find treasure, avoid traps."""
    # Grid: . = empty, T = treasure (+10), X = trap (-10)
    grid = [
        ['.', '.', 'X', 'T'],
        ['.', 'X', '.', '.'],
        ['.', '.', '.', 'X'],
        ['.', '.', '.', '.']
    ]

    def __init__(self):
        self.reset()

    def reset(self):
        self.pos = (0, 3)  # (x, y): bottom-left, because row 0 is the top
        return self.pos

    def step(self, action):
        x, y = self.pos
        # Row 0 is the top of the grid, so "up" decreases y
        moves = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}  # up, down, left, right
        dx, dy = moves[action]
        # Clamp to the grid: moving into a wall leaves the agent in place
        nx, ny = max(0, min(3, x + dx)), max(0, min(3, y + dy))
        self.pos = (nx, ny)
        cell = self.grid[ny][nx]
        if cell == 'T':
            return self.pos, 10, True
        if cell == 'X':
            return self.pos, -10, True
        return self.pos, -0.1, False  # Small step penalty


def q_learning(episodes=1000):
    env = TreasureHunt()
    actions = [0, 1, 2, 3]  # up, down, left, right
    Q = {}
    alpha, gamma, epsilon = 0.2, 0.9, 0.2
    for ep in range(episodes):
        state = env.reset()
        for _ in range(20):  # At most 20 steps per episode
            if state not in Q:
                Q[state] = [0.0] * 4
            # Epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[state][a])
            next_state, reward, done = env.step(action)
            if next_state not in Q:
                Q[next_state] = [0.0] * 4
            # Q-learning update (terminal states keep all-zero Q-values)
            best_next = max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
            if done:
                break
    return Q


random.seed(42)
Q = q_learning()
arrows = ['^', 'v', '<', '>']
print("Learned Q-values and best actions:")
print("Grid: . = safe, X = trap, T = treasure\n")
for y in range(4):
    row_str = ""
    for x in range(4):
        cell = TreasureHunt.grid[y][x]
        if cell in ('T', 'X'):
            row_str += f"  {cell}  "
        elif (x, y) in Q:
            best = arrows[max(range(4), key=lambda a: Q[(x, y)][a])]
            row_str += f"  {best}  "
        else:
            row_str += "  ?  "  # Cell never visited during training
    print(row_str)
Output (the arrows depend on the random seed, and are unreliable in cells the agent rarely visits)
Learned Q-values and best actions:
Grid: . = safe, X = trap, T = treasure

  >    >    X    T  
  ^    X    ^    <  
  >    >    ^    X  
  ^    >    >    ^  
Note: Q-learning is off-policy. It learns about the optimal (greedy) policy regardless of the policy the agent actually follows while exploring, because the update always uses the best next action rather than the action the agent takes next. Even if the agent acts randomly, the Q-table converges to the optimal values, provided every state-action pair keeps being visited and the learning rate is reduced appropriately over time. This is what lets Q-learning explore freely and still learn the best strategy.
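The off-policy property can be illustrated with a small experiment. The sketch below is an assumption-laden toy, separate from the grid world: a hypothetical 5-state corridor where only reaching the right end pays +1. The agent never acts greedily and always picks a random action, yet the greedy policy read from the learned table can still be inspected afterward.

```python
import random

def learn_from_random_behavior(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Q-learning on a 5-state corridor while acting uniformly at random."""
    rng = random.Random(seed)
    goal = 4  # states are 0..4; reaching state 4 ends the episode
    actions = ("left", "right")
    Q = {s: {a: 0.0 for a in actions} for s in range(goal + 1)}
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # cap the episode length
            action = rng.choice(actions)  # behavior policy: purely random
            if action == "right":
                next_state = state + 1
            else:
                next_state = max(0, state - 1)  # bump into the left wall
            reward = 1.0 if next_state == goal else 0.0
            # The target uses the best next action, not the action taken next
            target = reward + gamma * max(Q[next_state].values())
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if state == goal:
                break
    return Q

Q_corridor = learn_from_random_behavior()
greedy = [max(Q_corridor[s], key=Q_corridor[s].get) for s in range(4)]
print(greedy)
```

Because the environment is deterministic and every state-action pair is visited many times, the learned values should approach the optimal ones, so the greedy policy heads right in every state even though the agent itself wandered at random.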

From Q-Tables to Deep Q-Networks (DQN)

Q-tables work great for small problems (a 4x4 grid has just 16 states). But what about a video game with millions of possible screen configurations? You can't have a table that big.

In 2013, DeepMind had a breakthrough: replace the Q-table with a neural network. The network takes a state as input and outputs Q-values for all actions. This is a Deep Q-Network (DQN).
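The "state in, one Q-value per action out" shape can be sketched in a few lines of plain Python. This is not DeepMind's architecture (which used convolutional layers over game frames) and it includes no training; the layer sizes, the weight initialization, and the function names `make_q_network` and `q_values` are all assumptions for illustration.

```python
import math
import random

def make_q_network(state_size, hidden_size, n_actions, seed=0):
    """Create random weights for a tiny two-layer network (untrained, no biases)."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

    return {"hidden": layer(state_size, hidden_size),
            "output": layer(hidden_size, n_actions)}

def q_values(network, state):
    """Forward pass: map a state vector to one Q-value estimate per action."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, state)))
              for row in network["hidden"]]
    return [sum(w * h for w, h in zip(row, hidden)) for row in network["output"]]

net = make_q_network(state_size=4, hidden_size=8, n_actions=3)
qs = q_values(net, [0.1, -0.3, 0.7, 0.0])
greedy_action = max(range(len(qs)), key=lambda a: qs[a])
print(len(qs), greedy_action)
```

The network plays the role of the table lookup: instead of storing a row per state, it computes the row on demand, which is what makes very large state spaces tractable.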

In a 2015 follow-up, DQN was evaluated on 49 Atari games, learning directly from raw pixels with no game-specific programming, and reached a level comparable to or better than a professional human games tester on more than half of them. The same algorithm learned Pong, Breakout, and Space Invaders just by playing for millions of frames and learning which actions led to high scores.

Today, deep reinforcement learning, the family of methods that DQN helped launch, has been applied to robotics, recommendation systems, and even chip design at Google.

Quick check

What does Q(s, a) represent?