Message Queues
Why Message Queues?
Imagine a restaurant. When you order, the waiter doesn't stand at the grill waiting for your burger. They write your order on a ticket and clip it to the line. The cook grabs tickets in order and works through them. The waiter is free to take more orders.
That ticket line is a message queue. It decouples the sender (waiter) from the receiver (cook) so they can work at their own pace.
In software, message queues sit between services. Instead of Service A directly calling Service B and waiting for a response, Service A drops a message in the queue and moves on. Service B picks it up whenever it's ready.
This pattern solves three critical problems:
- Decoupling — Services don't need to know about each other. Service A just sends a message; it doesn't care who processes it.
- Buffering — If traffic spikes, the queue absorbs the burst (complementing load balancing). Consumers process at their own pace without being overwhelmed.
- Reliability — If Service B crashes, messages wait safely in the queue until it recovers. No data is lost.
Messaging Patterns
Point-to-Point (Queue)
One producer sends a message, and exactly one consumer receives it. Like a task queue — once a worker picks up a task, no other worker gets it. Perfect for job processing, order handling, or any work that should happen exactly once.
Publish/Subscribe (Pub/Sub)
One producer publishes a message to a topic, and all subscribers receive a copy. Like a radio broadcast — everyone tuned to the channel hears the message. Perfect for event notifications, real-time updates, or fan-out processing.
For example, when a user uploads a photo:
- The image service publishes a "photo-uploaded" event to a topic
- The thumbnail service subscribes and generates thumbnails
- The notification service subscribes and alerts followers
- The analytics service subscribes and logs the event
Each service works independently. Adding a new subscriber (like a moderation service) doesn't require changing the publisher at all.
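The fan-out above can be sketched with a minimal in-process pub/sub bus. This is illustrative only (a real system would use a broker like Kafka or RabbitMQ), and all the names here are made up:

```python
# Minimal in-process pub/sub sketch: each topic keeps a list of
# subscriber callbacks, and publish() delivers a copy to every one.
from collections import defaultdict

class PubSub:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber receives the message independently.
        for callback in self._subscribers[topic]:
            callback(message)

bus = PubSub()
received = []
bus.subscribe("photo-uploaded", lambda msg: received.append(("thumbnail", msg)))
bus.subscribe("photo-uploaded", lambda msg: received.append(("notify", msg)))

bus.publish("photo-uploaded", {"photo_id": 42})
print(received)
```

Note how adding another subscriber is just one more `subscribe()` call; the publisher's code never changes.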
Producer/Consumer Pattern with a Simple Queue
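Here's a small sketch of the point-to-point pattern using Python's thread-safe `queue.Queue`. The producer drops messages and moves on; a single consumer pulls them at its own pace. A `None` sentinel signals that no more work is coming:

```python
# Producer/consumer sketch: the queue decouples the two threads.
import queue
import threading

q = queue.Queue()
results = []

def producer():
    for i in range(5):
        q.put(f"task-{i}")   # drop a message and move on
    q.put(None)              # sentinel: no more work

def consumer():
    while True:
        task = q.get()       # blocks until a message is available
        if task is None:
            break
        results.append(task.upper())  # "process" the task

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
print(results)
```

With multiple consumer threads, each message would still go to exactly one of them, because `get()` removes it from the queue.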
Popular Message Brokers
Apache Kafka — A distributed streaming platform built for high throughput. Messages are stored in ordered, immutable logs organized into topics and partitions. Consumers track their position (offset) and can replay messages. Kafka handles millions of messages per second and retains data for days or weeks. Best for: event streaming, log aggregation, real-time analytics.
RabbitMQ — A traditional message broker that excels at complex routing. Supports multiple messaging patterns out of the box: direct, topic, fanout, and headers exchanges. Messages are typically deleted after consumption. Best for: task queues, request-reply patterns, complex routing logic.
Amazon SQS — A fully managed queue service from AWS. No infrastructure to manage. Two flavors: Standard (best-effort ordering, at-least-once delivery) and FIFO (strict ordering, exactly-once processing). Best for: simple cloud workloads, serverless architectures, teams that don't want to manage infrastructure.
Quick comparison:
- Throughput: Kafka >> RabbitMQ > SQS
- Message replay: Kafka (yes) vs RabbitMQ/SQS (no, messages deleted after consumption)
- Complexity: Kafka (high) vs RabbitMQ (medium) vs SQS (low)
- Managed option: SQS (fully), Kafka (Confluent Cloud, AWS MSK), RabbitMQ (Amazon MQ)
Delivery Guarantees
How many times does a consumer receive each message? This is one of the trickiest problems in distributed systems.
At-most-once: The message is delivered zero or one times. If something goes wrong, the message is lost. Fast but unreliable. Like sending a postcard — it might get there, it might not, you'll never know.
At-least-once: The message is delivered one or more times. If the consumer crashes before acknowledging, the message is redelivered. You might process it twice, so your consumer must be idempotent (processing the same message twice has the same result as once). This is the most common guarantee.
Exactly-once: The holy grail — every message is processed exactly one time. Extremely hard to achieve in distributed systems. Kafka achieves it through a combination of idempotent producers and transactions (consumers configured to read only committed messages), but it comes with a performance cost.
In practice, most systems use at-least-once delivery with idempotent consumers. For example, when processing a payment, check if the payment ID has already been processed before charging the card again.
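That payment check can be sketched like this. The handler and message shape are hypothetical; the point is the dedup check before the side effect:

```python
# Idempotent consumer sketch for at-least-once delivery: record each
# processed message ID, and skip duplicates before charging again.
processed_ids = set()   # in production: a database table with a unique key
charges = []

def charge_card(payment):
    charges.append(payment["amount"])  # stand-in for the real side effect

def handle_payment(payment):
    if payment["id"] in processed_ids:
        return          # duplicate delivery: safely ignore
    charge_card(payment)
    processed_ids.add(payment["id"])

msg = {"id": "pay-123", "amount": 50}
handle_payment(msg)
handle_payment(msg)     # redelivered by the broker: processed only once
print(charges)
```

In a real system the "check and record" step should be atomic with the side effect (e.g. in the same database transaction), or a crash between the two can still cause a double charge.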
Dead Letter Queues
What happens when a message can't be processed? Maybe the data is malformed, or a dependent service is permanently down. If you keep retrying forever, the bad message blocks the entire queue — a poison pill.
The solution: a Dead Letter Queue (DLQ). After a message fails N times (say, 3 retries), it's moved to a separate queue for inspection. The main queue keeps flowing, and an engineer can later examine the DLQ to fix the issue.
DLQs are essential for production systems. They prevent one bad message from bringing your entire pipeline to a halt.
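The retry-then-divert logic can be sketched in a few lines. The handler here is hypothetical, and a real broker (SQS, RabbitMQ) would do this routing for you via configuration:

```python
# DLQ sketch: retry each message up to MAX_RETRIES, then divert it to
# a dead letter list so the main queue keeps flowing.
MAX_RETRIES = 3

def process_with_dlq(messages, handler):
    dead_letters = []
    for msg in messages:
        for attempt in range(MAX_RETRIES):
            try:
                handler(msg)
                break
            except Exception:
                continue
        else:  # all retries failed: divert the poison pill, don't block
            dead_letters.append(msg)
    return dead_letters

def handler(msg):
    if msg == "poison":
        raise ValueError("malformed message")

dlq = process_with_dlq(["ok-1", "poison", "ok-2"], handler)
print(dlq)  # ['poison'] — the good messages still got through
```

Notice that `ok-2` is processed even though `poison` came before it; without the DLQ, retrying forever would have blocked it.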
Backpressure
What if producers generate messages faster than consumers can process them? The queue grows and grows until you run out of memory or disk. This is a backpressure problem.
Strategies to handle it:
- Drop messages: When the queue is full, reject new messages. Simple but you lose data. OK for metrics or logs, not OK for orders.
- Block the producer: Make the producer wait until there's space. This naturally slows down the system but can cause cascading slowdowns.
- Scale consumers: Automatically add more consumers when the queue depth exceeds a threshold. The most common cloud approach.
- Set queue limits: Configure a max queue size with an overflow policy (dead letter, oldest-first eviction, etc.).
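Two of these strategies fall out of a bounded queue directly. With Python's `queue.Queue`, a blocking `put()` implements "block the producer," while `put_nowait()` raises `queue.Full` and lets you implement "drop messages":

```python
# Backpressure sketch with a bounded queue.
import queue

q = queue.Queue(maxsize=2)   # set a queue limit

q.put("a")
q.put("b")                   # queue is now full

dropped = False
try:
    q.put_nowait("c")        # non-blocking put: fails instead of waiting
except queue.Full:
    dropped = True           # drop strategy: lose the message, stay alive
print(dropped)
```

Calling the plain `q.put("c")` here would instead block the producer until a consumer made room — the other side of the trade-off.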
Monitoring queue depth is critical. If it keeps growing, you either need more consumers or fewer producers. A steadily increasing queue is a ticking time bomb.