Design a Task Queue
Understanding the Problem
Every production system has work that shouldn't block the user β sending emails, processing images, generating reports, running ETL pipelines. A task queue decouples producers (who submit work) from consumers (who execute it).
Functional Requirements:
- Submit a task with a payload and priority level.
- Workers pick up tasks and execute them asynchronously.
- Retry failed tasks with configurable backoff.
- Support task prioritization (high, normal, low).
- Query task status (pending, processing, completed, failed).
Non-Functional Requirements:
- At-least-once delivery: Every task must be executed at least once β losing tasks is unacceptable.
- Horizontal scalability: Add more workers as load increases, no code changes needed.
- Fault tolerance: If a worker crashes mid-task, the task must be retried automatically.
- Ordering: FIFO within the same priority level (best-effort).
Estimation
Let's size this system:
- 10M tasks/day β ~115 tasks/second average, ~350/s peak
- Average task processing time: 5 seconds
- Workers needed at steady state: 115 Γ 5 = ~575 concurrent tasks β ~60 workers (each handling ~10 concurrent tasks)
- Task payload size: ~2 KB average (JSON with params)
- Daily storage: 10M Γ 2 KB = ~20 GB/day for task metadata
- Queue depth at peak: If workers can't keep up for 5 minutes: 350 Γ 300 = 105K queued tasks
The bottleneck is worker throughput, not storage. Auto-scaling workers based on queue depth is critical.
API Design
Submit a Task
| Endpoint | POST /api/v1/tasks |
| Request | {"type": "send_email", "payload": {"to": "[email protected]", ...}, "priority": "high", "idempotency_key": "abc-123"} |
| Response | {"task_id": "t-7f3a2", "status": "queued", "created_at": "..."} |
| Status | 202 Accepted |
Check Task Status
| Endpoint | GET /api/v1/tasks/:id |
| Response | {"task_id": "t-7f3a2", "status": "completed", "result": {...}, "attempts": 1} |
Cancel a Task
| Endpoint | DELETE /api/v1/tasks/:id |
| Response | {"task_id": "t-7f3a2", "status": "cancelled"} |
202 Accepted is the right status code β the request has been accepted for processing, but the processing has not been completed. This signals the asynchronous nature of the system.
Visibility Timeout and Failure Handling
The visibility timeout is the core reliability mechanism. When a worker picks up a task, the task becomes invisible to other workers for a set duration (e.g., 30 seconds). This prevents duplicate processing.
What happens if a worker crashes?
- Worker picks task, visibility timeout starts (30s).
- Worker crashes at second 10 β no ACK sent.
- After 30s, the task becomes visible again in the queue.
- Another worker picks it up and processes it.
Dead Letter Queue (DLQ): After a configurable number of retries (e.g., 3), the task is moved to a separate DLQ. This prevents "poison messages" β tasks that always fail β from blocking the queue forever. Engineers can inspect the DLQ and either fix and replay the tasks, or discard them.
Idempotency: Since tasks may be delivered more than once (at-least-once), task handlers must be idempotent. Use an idempotency key β before processing, check if this key was already processed. If so, skip or return the cached result. This is critical for tasks like "charge payment" or "send email."