
Design a Notification System

The right message, to the right person, at the right time
Scope: Real-World System · Difficulty: Intermediate

Understanding the Problem

Notifications are everywhere. Your phone buzzes with a text, your email dings with a shipping update, a push notification reminds you about a sale. Behind all of this is a notification system — a service that decides what to send, who to send it to, and how to deliver it.

Functional Requirements:

  • Support multiple channels: push notifications (iOS/Android), SMS, and email.
  • Notifications are triggered by events ("Your order shipped") or sent on a schedule ("Weekly digest").
  • Users can manage their preferences — opt out of marketing emails, choose quiet hours, etc.
  • Support templates with personalization ("Hi {name}, your order #{order_id} has shipped").

Non-Functional Requirements:

  • Reliability: Critical notifications (password reset, 2FA codes) must not be lost.
  • At-least-once delivery: Better to send twice than not at all (for most notifications).
  • Scale: 10M+ notifications per day across all channels.
  • Low latency: Time-sensitive notifications (2FA codes) should arrive within seconds.

Figure: Notification types (push, email, and SMS)

Notification Types

Each channel has its own delivery mechanism and constraints:

Push Notifications (Mobile)

  • Delivered through APNs (Apple Push Notification service) for iOS and FCM (Firebase Cloud Messaging) for Android.
  • You send a payload (title, body, data) to the provider along with the device token.
  • The provider delivers it to the device. You have limited control over timing: providers expose hints such as priority and expiry, but the OS ultimately decides when to show it.
  • Device tokens can become invalid (user uninstalls the app, gets a new phone).
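
As a rough illustration of the payload mentioned above, here is how the same title, body, and data might be shaped for each provider. The helper names `build_apns_payload` and `build_fcm_message` are hypothetical. The JSON layouts follow the documented APNs `aps` dictionary and the FCM HTTP v1 message format as I understand them, so check the current provider docs before relying on this sketch.

```python
def build_apns_payload(title: str, body: str, data: dict) -> dict:
    """Shape a notification for APNs.

    Display fields live under the reserved "aps" key; custom keys sit
    beside it at the top level. The device token is not part of the
    payload: it goes in the request path when calling APNs.
    """
    payload = dict(data)  # custom keys first ...
    payload["aps"] = {"alert": {"title": title, "body": body}}  # ... "aps" always wins
    return payload


def build_fcm_message(device_token: str, title: str, body: str, data: dict) -> dict:
    """Shape a notification for the FCM HTTP v1 API.

    FCM expects every value in "data" to be a string, so values are
    converted here.
    """
    return {
        "message": {
            "token": device_token,
            "notification": {"title": title, "body": body},
            "data": {key: str(value) for key, value in data.items()},
        }
    }
```

Keeping the provider-specific shaping in small functions like these lets the push worker stay the same when a provider changes its format.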

SMS

  • Sent through providers like Twilio, Vonage, or AWS SNS.
  • Expensive: roughly $0.01-0.05 per message, varying by destination country and provider. Use sparingly.
  • Great for critical alerts (2FA codes, fraud alerts) because nearly everyone has a phone.
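  • Character limits: 160 characters for a single message in the standard GSM-7 alphabet, or 70 if the text needs Unicode (emoji, non-Latin scripts). Longer texts are split into multiple billed segments.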
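
Because SMS is billed per segment, it can help to estimate the segment count before sending. The sketch below is an approximation under a stated assumption: any pure-ASCII text is treated as GSM-7, which ignores the small differences between ASCII and the real GSM-7 alphabet (a few characters cost two slots, and some accented letters are GSM-7 but not ASCII). The function name `sms_segments` is hypothetical.

```python
import math

# Characters per message when it fits in one SMS, and per segment when
# the text has to be split (each segment loses room to a joining header).
GSM7_SINGLE, GSM7_MULTI = 160, 153
UCS2_SINGLE, UCS2_MULTI = 70, 67


def sms_segments(text: str) -> int:
    """Estimate how many segments `text` would be sent as."""
    if not text:
        return 0
    if text.isascii():
        # Assumption: ASCII-only text is encodable as GSM-7.
        length, single, multi = len(text), GSM7_SINGLE, GSM7_MULTI
    else:
        # Count UTF-16 code units, since emoji take two units each.
        length = len(text.encode("utf-16-be")) // 2
        single, multi = UCS2_SINGLE, UCS2_MULTI
    if length <= single:
        return 1
    return math.ceil(length / multi)
```

A check like this could run in the notification service to reject or shorten templates that would silently double the SMS bill.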

Email

  • Sent through providers like SendGrid, Amazon SES, or Mailgun.
  • Cheap and rich — you can include HTML, images, links, attachments.
  • But deliverability is a challenge: spam filters, bounces, rate limits from providers.
  • Best for non-urgent, content-rich notifications (order summaries, newsletters).

High-Level Architecture

Here's the system broken into components:

1. Notification Triggers — The source of events. Could be:

  • An internal service ("order service says order #123 shipped")
  • A cron job ("send weekly digest every Monday")
  • A user action ("Alice mentioned you in a comment")

2. Notification Service (API) — Receives notification requests, validates them, looks up user preferences, applies rate limits, and enqueues the notification.

3. User Preferences Store — A database (or cache) with each user's notification settings: which channels they've enabled, quiet hours, frequency preferences.

4. Template Engine — Stores templates and renders them with personalized data. Template: "Hi {name}, your order #{order_id} is on its way!" → "Hi Alice, your order #7892 is on its way!"

5. Message Queues — Separate queues for each channel (push queue, SMS queue, email queue). This decouples the notification service from delivery and allows independent scaling.

6. Channel Workers — Consumers that pull from queues and deliver via the appropriate provider (APNs, Twilio, SendGrid).

7. Delivery Log — Records every notification attempt and its result (delivered, bounced, failed). Used for analytics, debugging, and retry logic.
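
The template engine (component 4 above) can be sketched with Python's built-in string formatting. The in-memory `TEMPLATES` dictionary and the `render` function are hypothetical stand-ins for a real template store; the point is only to show placeholder substitution and a loud failure when a variable is missing.

```python
import string

# Hypothetical in-memory template store; a real system would likely
# keep templates in a database with versioning and localization.
TEMPLATES = {
    "order_shipped": "Hi {name}, your order #{order_id} is on its way!",
}


def render(template_id: str, data: dict) -> str:
    """Fill a stored template with per-user data."""
    template = TEMPLATES[template_id]
    # Collect the placeholder names so a missing variable is reported
    # clearly instead of surfacing as a bare KeyError.
    needed = {
        field
        for _, field, _, _ in string.Formatter().parse(template)
        if field
    }
    missing = needed - set(data)
    if missing:
        raise ValueError(f"missing template variables: {sorted(missing)}")
    return template.format(**data)


print(render("order_shipped", {"name": "Alice", "order_id": "7892"}))
# prints: Hi Alice, your order #7892 is on its way!
```

Validating variables at render time keeps a half-filled message such as "Hi {name}" from ever reaching a queue.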

Figure: The pipeline (Trigger → Process → Deliver)
Figure: Notification dispatch flow. An event triggers a preference check, then routes to enabled channels in parallel.

Notification Service Core Logic

from dataclasses import dataclass
from enum import Enum
import hashlib
import json


class Channel(Enum):
    PUSH = "push"
    SMS = "sms"
    EMAIL = "email"


@dataclass
class NotificationRequest:
    user_id: str
    template_id: str
    channel: Channel
    data: dict  # Template variables: {"name": "Alice", "order_id": "7892"}
    priority: str = "normal"  # "critical", "normal", "low"


class NotificationService:
    def __init__(self, preference_store, template_engine,
                 rate_limiter, queue_client, dedup_cache):
        self.preferences = preference_store
        self.templates = template_engine
        self.rate_limiter = rate_limiter
        self.queue = queue_client
        self.dedup = dedup_cache

    def send(self, request: NotificationRequest) -> dict:
        # Step 1: Deduplicate to avoid sending the same notification twice.
        # Use a stable digest: the built-in hash() of a string changes between
        # processes, so it cannot serve as a key in a shared dedup cache.
        payload = json.dumps(request.data, sort_keys=True, default=str)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # The channel is part of the key so the same event can still go out
        # on several channels.
        dedup_key = (f"{request.user_id}:{request.template_id}:"
                     f"{request.channel.value}:{digest}")
        if self.dedup.exists(dedup_key):
            return {"status": "skipped", "reason": "duplicate"}

        # Step 2: Check user preferences
        prefs = self.preferences.get(request.user_id)
        if not prefs.is_channel_enabled(request.channel):
            return {"status": "skipped", "reason": "user_opted_out"}
        if prefs.is_quiet_hours():
            if request.priority != "critical":  # Critical bypasses quiet hours
                # Nothing is enqueued here; rescheduling is left to the caller.
                return {"status": "deferred", "reason": "quiet_hours"}

        # Step 3: Rate limit so users are not spammed
        if not self.rate_limiter.allow(request.user_id, request.channel):
            return {"status": "rate_limited"}

        # Step 4: Render template
        content = self.templates.render(request.template_id, request.data)

        # Step 5: Enqueue to the right channel queue
        message = {
            "user_id": request.user_id,
            "channel": request.channel.value,
            "content": content,
            "priority": request.priority,
            "attempt": 1,
        }
        queue_name = f"notifications-{request.channel.value}"
        self.queue.publish(queue_name, json.dumps(message))

        # Step 6: Mark as sent for dedup (expires after one hour)
        self.dedup.set(dedup_key, ttl=3600)
        return {"status": "queued", "channel": request.channel.value}


# Usage
# service.send(NotificationRequest(
#     user_id="user_42",
#     template_id="order_shipped",
#     channel=Channel.PUSH,
#     data={"name": "Alice", "order_id": "7892"}
# ))

# Output
# Returns: {"status": "queued", "channel": "push"}
# Message flows: API → Queue → Push Worker → APNs → User's phone

Reliability and Retry Logic

Notifications can fail for many reasons: the phone is off, the email bounces, the SMS provider is having an outage. A reliable system handles failures gracefully.

Retry with exponential backoff: If delivery fails, retry after 1s, then 4s, then 16s, then 64s. This avoids hammering a struggling provider.

Max retries: After N failures (say, 5), move the notification to a dead letter queue for manual investigation. Don't retry forever.
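
The retry schedule and the dead letter queue described above can be sketched as follows. This is a minimal in-process version under stated assumptions: `send` is any callable that raises on failure, `dead_letters` is a plain list standing in for a real dead letter queue, and the names `backoff_delay` and `deliver_with_retry` are hypothetical.

```python
import time
from typing import Callable

BASE_DELAY_S = 1
BACKOFF_FACTOR = 4  # yields 1s, 4s, 16s, 64s
MAX_ATTEMPTS = 5


def backoff_delay(failed_attempts: int) -> int:
    """Seconds to wait after the Nth consecutive failure (1-based)."""
    return BASE_DELAY_S * BACKOFF_FACTOR ** (failed_attempts - 1)


def deliver_with_retry(message: dict,
                       send: Callable[[dict], None],
                       dead_letters: list,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try to deliver `message`; park it in `dead_letters` if every attempt fails."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            send(message)
            return True
        except Exception as exc:
            if attempt == MAX_ATTEMPTS:
                # Give up: keep the message and the last error for investigation.
                dead_letters.append(
                    {**message, "attempts": attempt, "error": str(exc)}
                )
                return False
            sleep(backoff_delay(attempt))
    return False
```

A production worker would more likely re-enqueue the message with a delay than sleep inside the worker, and would usually add random jitter to the delay so that many failed messages do not retry at the same instant.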

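Idempotency: Retries might deliver the same notification twice. For non-critical notifications ("someone liked your post"), this is annoying but acceptable. For critical ones ("your 2FA code is 123456"), attach a unique notification ID to every message so that a client you control, such as your mobile app, can discard duplicates. Channels like SMS and email give you no client-side hook, so there the deduplication has to happen before the message is handed to the provider.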
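
Deduplicating by notification ID can be as small as a bounded "seen" set. The class below is a sketch of what a client or worker might keep in memory; `SeenNotifications` and its `max_size` default are assumptions for illustration, not part of any provider SDK.

```python
from collections import OrderedDict


class SeenNotifications:
    """Remember recently seen notification IDs, evicting the oldest first."""

    def __init__(self, max_size: int = 1000):
        self.max_size = max_size
        self._seen = OrderedDict()

    def first_time(self, notification_id: str) -> bool:
        """Return True only the first time a given ID is presented."""
        if notification_id in self._seen:
            return False
        self._seen[notification_id] = None
        if len(self._seen) > self.max_size:
            self._seen.popitem(last=False)  # drop the oldest ID
        return True


seen = SeenNotifications()
for incoming_id in ["n-1", "n-2", "n-1"]:
    if seen.first_time(incoming_id):
        print("show", incoming_id)
    else:
        print("drop duplicate", incoming_id)
```

The bound keeps memory use fixed, at the cost of letting through a duplicate that arrives after its ID has been evicted.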

Fallback channels: If push notification fails (maybe the user uninstalled the app), fall back to SMS or email. Define a priority chain: Push → SMS → Email.
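
One way to express the Push → SMS → Email chain is a loop over channels in priority order. In this sketch, `senders` maps a channel name to a hypothetical delivery function that returns True on success, and `enabled` is the set of channels the user has opted into; both are assumptions about how the surrounding system is wired.

```python
from typing import Callable, Optional

FALLBACK_CHAIN = ["push", "sms", "email"]


def send_with_fallback(message: dict,
                       senders: dict[str, Callable[[dict], bool]],
                       enabled: set[str]) -> Optional[str]:
    """Try each channel in order; return the one that succeeded, or None."""
    for channel in FALLBACK_CHAIN:
        # Never fall back onto a channel the user has opted out of.
        if channel not in enabled or channel not in senders:
            continue
        if senders[channel](message):
            return channel
    return None
```

Since SMS costs money per message, a real system would probably restrict the fallback to high-priority notifications.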

Figure: User preferences (filtering by channel)
Figure: Retry and reliability with exponential backoff

User Preferences and Rate Limiting

Nobody likes being spammed. A good notification system respects user preferences:

  • Channel opt-in/opt-out: "Send me push notifications but not emails."
  • Category preferences: "I want order updates but not marketing."
  • Quiet hours: "Don't bother me between 10 PM and 8 AM" (except critical alerts like security).
  • Frequency caps: "No more than 5 notifications per hour." This prevents a burst of activity (50 likes on a popular post) from buzzing someone's phone 50 times.
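
Two of the preferences above have small but easy-to-miss details: quiet hours usually wrap past midnight, and a frequency cap needs a moving window rather than a fixed counter. The sketch below shows one way to handle both. The names `in_quiet_hours` and `FrequencyCap` are hypothetical, and `now` is assumed to already be in the user's local time zone.

```python
from collections import defaultdict, deque
from datetime import time


def in_quiet_hours(now: time,
                   start: time = time(22, 0),
                   end: time = time(8, 0)) -> bool:
    """True if `now` falls in the quiet window, including windows past midnight."""
    if start <= end:
        return start <= now < end
    # Window wraps midnight, e.g. 22:00 to 08:00.
    return now >= start or now < end


class FrequencyCap:
    """Allow at most `limit` notifications per user in a sliding window."""

    def __init__(self, limit: int = 5, window_s: int = 3600):
        self.limit = limit
        self.window_s = window_s
        self._sent = defaultdict(deque)  # user_id -> timestamps of recent sends

    def allow(self, user_id: str, now_s: float) -> bool:
        recent = self._sent[user_id]
        # Forget sends that have aged out of the window.
        while recent and now_s - recent[0] >= self.window_s:
            recent.popleft()
        if len(recent) >= self.limit:
            return False
        recent.append(now_s)
        return True
```

In a multi-server deployment the timestamps would live in a shared store such as Redis instead of process memory.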

Rate limiting notifications per user is different from API rate limiting. Here, you're protecting the user experience, not the server. Aggregate notifications when possible: "Alice and 47 others liked your post" instead of 48 separate notifications.
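
Aggregation itself is mostly string building once the pending events are grouped. A minimal sketch, assuming the caller has already collected the names of everyone who liked a given post (the function name `summarize_likes` is hypothetical):

```python
def summarize_likes(actors: list[str]) -> str:
    """Collapse many 'like' events into one notification line."""
    unique = list(dict.fromkeys(actors))  # de-duplicate, keep first-seen order
    if not unique:
        raise ValueError("at least one actor is required")
    first, others = unique[0], len(unique) - 1
    if others == 0:
        return f"{first} liked your post"
    if others == 1:
        return f"{first} and 1 other liked your post"
    return f"{first} and {others} others liked your post"


likers = ["Alice"] + [f"user_{i}" for i in range(47)]
print(summarize_likes(likers))
# prints: Alice and 47 others liked your post
```

The harder design question is how long to hold events before flushing the summary; a short batching window trades a little latency for far fewer buzzes.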

Note: Think of a notification system like a postal service. The notification service is the post office: it sorts mail. Message queues are the delivery trucks: one for packages (push), one for letters (email), one for telegrams (SMS). User preferences are the "No Junk Mail" signs on mailboxes. And retry logic is the "return to sender, try again tomorrow" process.

Figure: Full architecture (scalable multi-channel system)

Analytics and Tracking

You need to know if notifications are working. Track these metrics:

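  • Delivery rate: What percentage of notifications actually reach the user? (A success response from APNs or FCM means the provider accepted the message, not that the device displayed it, so device-level delivery often has to be reported by the app itself.)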
  • Open rate: What percentage of push notifications are tapped? What percentage of emails are opened? (Use tracking pixels for emails.)
  • Click-through rate: How many users clicked a link in the notification?
  • Unsubscribe rate: Are notifications causing users to opt out? High unsubscribe rates signal you're sending too much.
  • Latency: Time from trigger to delivery. Critical for time-sensitive alerts like 2FA codes.

Store delivery events in a data warehouse (like BigQuery or Snowflake) for analytics. Use dashboards to monitor health in near real time.
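
To make the first two metrics concrete, here is a sketch that derives them from raw delivery log events. The event shape (a dict with a `status` of "sent", "delivered", or "opened") is an assumption for illustration; a real delivery log would carry more fields, and in practice this would be a warehouse query instead of Python.

```python
from collections import Counter


def delivery_metrics(events: list[dict]) -> dict:
    """Compute delivery and open rates from delivery log events."""
    counts = Counter(event["status"] for event in events)
    sent = counts["sent"]
    delivered = counts["delivered"]
    opened = counts["opened"]
    return {
        # Share of sent notifications confirmed as delivered.
        "delivery_rate": delivered / sent if sent else 0.0,
        # Share of delivered notifications that were opened or tapped.
        "open_rate": opened / delivered if delivered else 0.0,
    }


log = (
    [{"status": "sent"}] * 4
    + [{"status": "delivered"}] * 3
    + [{"status": "opened"}]
)
print(delivery_metrics(log))
```

Computing the rates per channel and per template, instead of globally, usually makes problems such as one broken template much easier to spot.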

Note: Interview tip: The three things interviewers look for in a notification system design are: (1) multi-channel support with separate queues per channel, (2) user preferences and rate limiting to prevent spam, and (3) reliability — retry logic, dead letter queues, and fallback channels. Hit all three and you're in great shape.

Key Metrics

  • Notification API → Queue (fast enqueue): ~5-20 ms, O(1)
  • Push delivery via APNs/FCM (provider latency): ~100-500 ms, O(1)
  • SMS delivery via Twilio (carrier routing): ~1-5 sec, O(1)
  • Email delivery via SES (SMTP handshake + delivery): ~1-30 sec, O(1)
  • Template rendering (n = template size): ~1-5 ms, O(n)
  • Preference lookup, cached (Redis cache): < 1 ms, O(1)
  • Daily throughput target: 10M+ notifications/day (~115/sec average)
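
The average rate follows directly from the daily target, and the same arithmetic gives a first guess at worker counts. In this sketch the 10x peak-to-average ratio and the 50 messages per second per worker are illustrative assumptions, not measurements.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(notifications_per_day: int) -> float:
    """Average notifications per second for a given daily volume."""
    return notifications_per_day / SECONDS_PER_DAY


def workers_needed(rate_per_sec: float, per_worker_per_sec: float) -> int:
    """Workers required to keep up with a given rate, rounded up."""
    return math.ceil(rate_per_sec / per_worker_per_sec)


avg = average_rate(10_000_000)  # roughly 115.7 per second
peak = avg * 10                 # assumed peak-to-average ratio of 10x
print(f"average: {avg:.1f}/sec, assumed peak: {peak:.0f}/sec")
print("workers at assumed peak:", workers_needed(peak, per_worker_per_sec=50))
```

Traffic is rarely flat across the day, so capacity should be planned against the peak estimate, with the queues absorbing short bursts above it.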

Quick check

Why use separate message queues for each notification channel (push, SMS, email)?
