Design a Notification System
Understanding the Problem
Notifications are everywhere. Your phone buzzes with a text, your email dings with a shipping update, a push notification reminds you about a sale. Behind all of this is a notification system — a service that decides what to send, who to send it to, and how to deliver it.
Functional Requirements:
- Support multiple channels: push notifications (iOS/Android), SMS, and email.
- Notifications are triggered by events ("Your order shipped") or sent on a schedule ("Weekly digest").
- Users can manage their preferences — opt out of marketing emails, choose quiet hours, etc.
- Support templates with personalization ("Hi {name}, your order #{order_id} has shipped").
Non-Functional Requirements:
- Reliability: Critical notifications (password reset, 2FA codes) must not be lost.
- At-least-once delivery: Better to send twice than not at all (for most notifications).
- Scale: 10M+ notifications per day across all channels. That averages about 116 per second, with peaks well above the average.
- Low latency: Time-sensitive notifications (2FA codes) should arrive within seconds.
Notification Types
Each channel has its own delivery mechanism and constraints:
Push Notifications (Mobile)
- Delivered through APNs (Apple Push Notification service) for iOS and FCM (Firebase Cloud Messaging) for Android.
- You send a payload (title, body, data) to the provider along with the device token.
- The provider delivers it to the device. You have limited control over timing: the provider and the OS decide when it is actually delivered and shown.
- Device tokens can become invalid (user uninstalls the app, gets a new phone).
SMS
- Sent through providers like Twilio, Vonage, or AWS SNS.
- Expensive — $0.01-0.05 per message. Use sparingly.
- Great for critical alerts (2FA codes, fraud alerts) because nearly everyone has a phone.
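- Character limits: 160 characters for a standard (GSM-7) SMS, or 70 if the message contains characters outside that alphabet, such as emoji. Longer messages are split into multiple segments, each billed separately.
Email
- Sent through providers like SendGrid, Amazon SES, or Mailgun.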
- Cheap and rich — you can include HTML, images, links, attachments.
- But deliverability is a challenge: spam filters, bounces, rate limits from providers.
- Best for non-urgent, content-rich notifications (order summaries, newsletters).
High-Level Architecture
Here's the system broken into components:
1. Notification Triggers — The source of events. Could be:
- An internal service ("order service says order #123 shipped")
- A cron job ("send weekly digest every Monday")
- A user action ("Alice mentioned you in a comment")
2. Notification Service (API) — Receives notification requests, validates them, looks up user preferences, applies rate limits, and enqueues the notification.
3. User Preferences Store — A database (or cache) with each user's notification settings: which channels they've enabled, quiet hours, frequency preferences.
4. Template Engine — Stores templates and renders them with personalized data. Template: "Hi {name}, your order #{order_id} is on its way!" → "Hi Alice, your order #7892 is on its way!"
5. Message Queues — Separate queues for each channel (push queue, SMS queue, email queue). This decouples the notification service from delivery and allows independent scaling.
6. Channel Workers — Consumers that pull from queues and deliver via the appropriate provider (APNs, Twilio, SendGrid).
7. Delivery Log — Records every notification attempt and its result (delivered, bounced, failed). Used for analytics, debugging, and retry logic.
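To make the template engine and per-channel queues concrete, here is a minimal sketch. It assumes templates use Python-style `{field}` placeholders (which matches the examples above) and uses in-memory deques as stand-ins for a real message broker. The names `TEMPLATES`, `QUEUES`, `render`, and `enqueue` are hypothetical.

```python
from collections import deque

# Hypothetical template store; a real system would load these from a database.
TEMPLATES = {
    "order_shipped": "Hi {name}, your order #{order_id} is on its way!",
}

# One queue per channel so each channel can be scaled and drained independently.
# In-memory deques stand in for a real broker here.
QUEUES = {"push": deque(), "sms": deque(), "email": deque()}


def render(template_id: str, data: dict) -> str:
    """Fill a stored template with personalized data.

    A missing field raises KeyError, so a half-rendered message is never sent.
    """
    return TEMPLATES[template_id].format_map(data)


def enqueue(channel: str, user_id: str, template_id: str, data: dict) -> dict:
    """Render the message and place it on the queue for the given channel."""
    message = {"user_id": user_id, "body": render(template_id, data)}
    QUEUES[channel].append(message)
    return message
```

Calling `enqueue("email", "u1", "order_shipped", {"name": "Alice", "order_id": 7892})` puts the rendered text "Hi Alice, your order #7892 is on its way!" on the email queue and leaves the other queues untouched.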
Notification Service Core Logic
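The request path described in the architecture (validate, look up preferences, apply rate limits, enqueue) might look roughly like the sketch below. Everything here is an assumption made for illustration: the `Request` and `Preferences` shapes, the in-memory `PREFERENCES` and `QUEUES` stores, and the `rate_limit_ok` hook, which defaults to allowing everything.

```python
from dataclasses import dataclass, field


@dataclass
class Preferences:
    enabled_channels: set = field(default_factory=lambda: {"push", "email"})
    muted_categories: set = field(default_factory=set)


@dataclass
class Request:
    user_id: str
    category: str   # e.g. "order_update" or "marketing"
    channels: list  # channels the caller would like to use
    body: str


# Hypothetical in-memory stand-ins for the preferences store and the queues.
PREFERENCES = {
    "alice": Preferences(enabled_channels={"push"}, muted_categories={"marketing"}),
}
QUEUES = {"push": [], "sms": [], "email": []}


def handle(request: Request, rate_limit_ok=lambda user_id: True) -> list:
    """Validate, filter by preferences and rate limit, then enqueue.

    Returns the list of channels the notification was enqueued on.
    """
    # 1. Validate the request.
    if not request.body or not request.channels:
        raise ValueError("body and at least one channel are required")
    unknown = [c for c in request.channels if c not in QUEUES]
    if unknown:
        raise ValueError(f"unknown channels: {unknown}")

    # 2. Look up preferences (defaults apply to unknown users).
    prefs = PREFERENCES.get(request.user_id, Preferences())
    if request.category in prefs.muted_categories:
        return []

    # 3. Apply the per-user rate limit.
    if not rate_limit_ok(request.user_id):
        return []

    # 4. Enqueue on every requested channel the user has enabled.
    used = []
    for channel in request.channels:
        if channel in prefs.enabled_channels:
            QUEUES[channel].append({"user_id": request.user_id, "body": request.body})
            used.append(channel)
    return used
```

The service only decides and enqueues. Delivery happens later in the channel workers, which keeps this path fast even when a provider is slow.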
Reliability and Retry Logic
Notifications can fail for many reasons: the phone is off, the email bounces, the SMS provider is having an outage. A reliable system handles failures gracefully.
Retry with exponential backoff: If delivery fails, retry after 1s, then 4s, then 16s, then 64s. This avoids hammering a struggling provider.
Max retries: After N failures (say, 5), move the notification to a dead letter queue for manual investigation. Don't retry forever.
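The retry schedule and dead letter queue can be sketched as follows. This is a simplified illustration, not a production worker: `send` is a hypothetical callable that raises on failure, the dead letter queue is a plain list, and the worker sleeps in-process, whereas a real system would more likely re-enqueue the message with a delay.

```python
import time


def backoff_delays(max_attempts=5, base=1.0, factor=4.0):
    """Delays between attempts. The defaults give [1.0, 4.0, 16.0, 64.0]."""
    return [base * factor ** i for i in range(max_attempts - 1)]


def deliver_with_retry(send, message, dead_letters, max_attempts=5, sleep=time.sleep):
    """Try to deliver, backing off between attempts.

    Returns True on success. After max_attempts failures the message is
    appended to dead_letters and False is returned.
    """
    delays = backoff_delays(max_attempts)
    for attempt in range(max_attempts):
        try:
            send(message)
            return True
        except Exception:
            # Wait before the next attempt; there is no wait after the last one.
            if attempt < len(delays):
                sleep(delays[attempt])
    dead_letters.append(message)
    return False
```

With the defaults, a message gets five attempts in total, separated by waits of 1s, 4s, 16s, and 64s, before it lands in the dead letter queue.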
Idempotency: Retries might deliver the same notification twice. For non-critical notifications ("someone liked your post"), this is annoying but acceptable. For critical ones ("your 2FA code is 123456"), attach a unique notification ID to every message and deduplicate on it, for example on the client side, so that a retried message is shown only once.
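Deduplicating on a notification ID can be as small as a "have I seen this ID recently?" check. The sketch below is one possible shape, assuming every notification carries a unique ID. The `Deduplicator` class and its in-memory dictionary are hypothetical; a shared store with key expiry would play the same role across multiple workers.

```python
import time


class Deduplicator:
    """Remembers notification IDs for a limited time to suppress duplicates."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._seen = {}  # notification_id -> time first seen
        self._ttl = ttl_seconds
        self._clock = clock

    def first_time(self, notification_id: str) -> bool:
        """Return True if this ID has not been seen within the TTL."""
        now = self._clock()
        # Forget IDs older than the TTL so memory does not grow without bound.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        if notification_id in self._seen:
            return False
        self._seen[notification_id] = now
        return True
```

The receiver calls `first_time(notification_id)` before showing or sending a message and drops it when the result is False.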
Fallback channels: If a push notification fails (maybe the user uninstalled the app), fall back to SMS or email. Define a priority chain: Push → SMS → Email.
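A priority chain like this is essentially a loop over channels. In the sketch below, `senders` is a hypothetical mapping from channel name to a function that raises when delivery fails. In practice each sender would wrap a provider client and have its own retry policy.

```python
def send_with_fallback(message, senders, chain=("push", "sms", "email")):
    """Try each channel in priority order.

    Returns the name of the channel that succeeded, or None if all failed.
    """
    for channel in chain:
        send = senders.get(channel)
        if send is None:
            continue  # channel not configured for this user
        try:
            send(message)
            return channel
        except Exception:
            continue  # fall through to the next channel in the chain
    return None
```

Because SMS costs money per message, it may be worth reserving the SMS step of the chain for critical notifications only.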
User Preferences and Rate Limiting
Nobody likes being spammed. A good notification system respects user preferences:
- Channel opt-in/opt-out: "Send me push notifications but not emails."
- Category preferences: "I want order updates but not marketing."
- Quiet hours: "Don't bother me between 10 PM and 8 AM" (except critical alerts like security).
- Frequency caps: "No more than 5 notifications per hour." This prevents a burst of activity (50 likes on a popular post) from buzzing someone's phone 50 times.
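Quiet hours and frequency caps are easy to get subtly wrong, mainly because a quiet window such as 10 PM to 8 AM wraps past midnight. The sketch below shows one way to check both. The names are hypothetical, the cap is kept in memory, and letting critical notifications bypass the frequency cap as well as quiet hours is an assumption rather than a rule.

```python
from collections import deque
from datetime import time as clock_time


def in_quiet_hours(now, start=clock_time(22, 0), end=clock_time(8, 0)):
    """True if the user's local time falls inside the quiet window."""
    if start <= end:
        return start <= now < end
    # The window wraps past midnight (e.g. 22:00 to 08:00).
    return now >= start or now < end


class FrequencyCap:
    """Sliding-window cap: at most `limit` notifications per `window_seconds`."""

    def __init__(self, limit=5, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self._sent = {}  # user_id -> deque of send timestamps (seconds)

    def allow(self, user_id, now):
        sent = self._sent.setdefault(user_id, deque())
        # Drop timestamps that have aged out of the window.
        while sent and now - sent[0] >= self.window:
            sent.popleft()
        if len(sent) >= self.limit:
            return False
        sent.append(now)
        return True


def should_send(user_id, local_time, now, cap, critical=False):
    """Critical alerts bypass both checks (an assumption of this sketch)."""
    if critical:
        return True
    if in_quiet_hours(local_time):
        return False
    return cap.allow(user_id, now)
```

Notifications suppressed by quiet hours do not have to be dropped. They can be held and delivered when the window ends.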
Rate limiting notifications per user is different from API rate limiting. Here, you're protecting the user experience, not the server. Aggregate notifications when possible: "Alice and 47 others liked your post" instead of 48 separate notifications.
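Aggregation can be sketched as collapsing a batch of pending events into a single message. The function below is a hypothetical example for "like" events and assumes the actors are collected over a short window before anything is sent.

```python
def aggregate_likes(actors, target="your post"):
    """Collapse many 'like' events into one notification text.

    Returns None when there is nothing to send.
    """
    unique = list(dict.fromkeys(actors))  # drop repeats, keep first-seen order
    if not unique:
        return None
    first, others = unique[0], len(unique) - 1
    if others == 0:
        return f"{first} liked {target}"
    if others == 1:
        return f"{first} and 1 other liked {target}"
    return f"{first} and {others} others liked {target}"
```

Given Alice followed by 47 other distinct users, this returns "Alice and 47 others liked your post", which is one notification instead of 48.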
Analytics and Tracking
You need to know if notifications are working. Track these metrics:
- Delivery rate: What percentage of notifications actually reach the user? (APNs and FCM report whether they accepted a message, which is a useful proxy but not proof that the device displayed it.)
- Open rate: What percentage of push notifications are tapped? What percentage of emails are opened? (Use tracking pixels for emails.)
- Click-through rate: How many users clicked a link in the notification?
- Unsubscribe rate: Are notifications causing users to opt out? High unsubscribe rates signal you're sending too much.
- Latency: Time from trigger to delivery. Critical for time-sensitive alerts like 2FA codes.
Store delivery events in a data warehouse (like BigQuery or Snowflake) for analytics. Use dashboards to monitor health in near real time.
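As a rough illustration of how these metrics fall out of the delivery log, the sketch below computes them from a list of events. The event shape (`id`, `type`, `ts` in seconds) and the event type names are assumptions. In a warehouse the same logic would normally be a SQL query.

```python
from collections import Counter
from statistics import median


def summarize(events):
    """Compute headline metrics from delivery-log events.

    Each event is a dict with "id" (notification ID), "type", and "ts".
    """
    counts = Counter(e["type"] for e in events)

    # Latency: time from trigger to delivery, per notification.
    triggered = {e["id"]: e["ts"] for e in events if e["type"] == "triggered"}
    latencies = [
        e["ts"] - triggered[e["id"]]
        for e in events
        if e["type"] == "delivered" and e["id"] in triggered
    ]

    def rate(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "delivery_rate": rate(counts["delivered"], counts["triggered"]),
        "open_rate": rate(counts["opened"], counts["delivered"]),
        "click_through_rate": rate(counts["clicked"], counts["delivered"]),
        "unsubscribe_rate": rate(counts["unsubscribed"], counts["delivered"]),
        "median_latency_s": median(latencies) if latencies else None,
    }
```

Using delivered notifications as the denominator for the open, click-through, and unsubscribe rates is one common convention. Whichever convention you pick, apply it consistently across channels so the numbers stay comparable.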