Design an Image Hosting Service
Understanding the Problem
Image hosting services let users upload images, generate shareable links, and serve those images fast to millions of viewers. Think Imgur, Flickr, or the image backends behind social media platforms.
Let's define what we need:
Functional Requirements:
- Upload: Users can upload images (JPEG, PNG, GIF, WebP) up to 10 MB.
- View: Anyone with the link can view the image; no authentication is required for public images.
- Delete: Image owners can delete their uploads.
- Resize/Thumbnail: Automatically generate thumbnails (150×150, 300×300, 600×600) for different contexts (previews, embeds, galleries).
Non-Functional Requirements:
- Low-latency serving: Images should load in under 50 ms for most users worldwide. A CDN is essential.
- High availability: The service must be up 99.99% of the time. A broken image link is a terrible user experience.
- Handle large files: Uploads can be up to 10 MB. The system must handle multipart uploads and not time out on slow connections.
- Durability: Once uploaded, images must never be lost. We need redundant storage.
Estimation
Let's size this system:
- 10M uploads/day (~115 uploads/second, peak ~350/s)
- Average image size: 2 MB
- Read:write ratio of 10:1: ~1,150 reads/second on average, with peaks of ~1.2M reads/second when images go viral
- Daily storage: 10M × 2 MB = 20 TB/day of new images
- 5-year storage: 20 TB × 365 × 5 = ~36.5 PB (originals only; thumbnails add ~30% more)
- Bandwidth: At peak, 1.2M reads/s × 200 KB average served size = ~240 GB/s outbound, which is why a CDN is non-negotiable
This is a storage-heavy, read-heavy system. The main challenges are efficient storage, fast serving via CDN, and an async image processing pipeline.
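The arithmetic behind these estimates can be reproduced in a few lines. This is only a sketch: the inputs are the figures quoted above, the variable names are made up for illustration, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
# Back-of-envelope sizing. Inputs are the figures quoted in the text;
# decimal units are assumed (1 TB = 1,000,000 MB, 1 GB = 1,000,000 KB).
SECONDS_PER_DAY = 24 * 60 * 60

uploads_per_day = 10_000_000
avg_image_mb = 2
read_write_ratio = 10
peak_reads_per_s = 1_200_000   # viral-traffic peak assumed in the text
avg_served_kb = 200

uploads_per_s = uploads_per_day / SECONDS_PER_DAY
avg_reads_per_s = uploads_per_s * read_write_ratio
daily_storage_tb = uploads_per_day * avg_image_mb / 1_000_000
five_year_pb = daily_storage_tb * 365 * 5 / 1_000
peak_egress_gb_s = peak_reads_per_s * avg_served_kb / 1_000_000

print(f"uploads/s (avg):   {uploads_per_s:.1f}")      # 115.7
print(f"reads/s (avg):     {avg_reads_per_s:.1f}")    # 1157.4
print(f"storage per day:   {daily_storage_tb:.1f} TB")  # 20.0 TB
print(f"storage, 5 years:  {five_year_pb:.1f} PB")    # 36.5 PB
print(f"peak egress:       {peak_egress_gb_s:.1f} GB/s")  # 240.0 GB/s
```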
API Design
Upload Image
| Endpoint | POST /api/v1/images |
| Content-Type | multipart/form-data |
| Body | file (binary), title (optional), is_public (boolean) |
| Response | {"id": "img_abc123", "url": "https://cdn.imghost.com/abc123.jpg", "thumbnails": {...}} |
| Status | 201 Created |
View Image
| Endpoint | GET /api/v1/images/:id |
| Query params | ?size= (one of: thumb, medium, large, original) |
| Response | 302 redirect to CDN URL, or image binary |
Delete Image
| Endpoint | DELETE /api/v1/images/:id |
| Auth | Bearer token (owner only) |
| Response | 204 No Content |
In practice, most image reads bypass the API entirely: the client hits the CDN URL directly. The API is mainly for upload, metadata, and deletion.
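As a rough illustration of the redirect variant of the View endpoint, the sketch below maps an image ID and a size value to a CDN URL. The host and the `img_` prefix come from the example upload response; the `_{size}` filename suffix, the function names, and the returned tuple shape are assumptions made for this example, not a prescribed URL scheme.

```python
CDN_BASE = "https://cdn.imghost.com"  # host from the example upload response
VALID_SIZES = ("thumb", "medium", "large", "original")


def cdn_url(image_id: str, size: str = "original", ext: str = "jpg") -> str:
    """Build the CDN URL for one variant of an image.

    The "_<size>" suffix for resized variants is a hypothetical naming
    convention chosen for this sketch.
    """
    if size not in VALID_SIZES:
        raise ValueError(f"unknown size: {size!r}")
    key = image_id.removeprefix("img_")  # requires Python 3.9+
    suffix = "" if size == "original" else f"_{size}"
    return f"{CDN_BASE}/{key}{suffix}.{ext}"


def view_image(image_id: str, size: str = "original"):
    """Return (status, headers) for GET /api/v1/images/:id as a redirect."""
    return 302, {"Location": cdn_url(image_id, size)}


print(view_image("img_abc123", "thumb"))
```

Because the URL is derived purely from the ID and size, clients can also construct it themselves and skip the API round trip.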
Image Processing Pipeline
When an image is uploaded, it doesn't just get stored; it goes through a processing pipeline:
- Validation: Check file type, size (≤10 MB), and dimensions. Reject malformed files.
- Deduplication: Compute a content hash (SHA-256) of the file. If the same hash already exists, reuse the stored file instead of writing a duplicate. This can save 20-30% of storage.
- Thumbnail generation: Create multiple sizes: 150×150 (avatar/preview), 300×300 (gallery), 600×600 (medium). This runs asynchronously via a worker queue.
- Format conversion: Convert to WebP for browsers that support it (30-50% smaller than JPEG at same quality). Store both formats.
- EXIF stripping: Remove metadata (GPS location, camera info) for privacy. Users don't expect their location to be embedded in shared images.
- Content moderation: Run through an ML model or third-party API to detect inappropriate content. Flag or reject as needed.
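The validation step can be sketched by checking the size limit and sniffing the leading "magic" bytes instead of trusting the file extension or the client-supplied Content-Type. This is a minimal sketch: the function names are hypothetical, 10 MB is assumed to mean 10 × 1024 × 1024 bytes, and the dimension check is left out because it needs an image decoder.

```python
from typing import Optional

MAX_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB


def sniff_format(data: bytes) -> Optional[str]:
    """Identify JPEG, PNG, GIF, or WebP from the file's leading bytes."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # WebP is a RIFF container: "RIFF", 4-byte length, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None


def validate_upload(data: bytes) -> str:
    """Return the detected format, or raise ValueError to reject the upload."""
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the 10 MB limit")
    fmt = sniff_format(data)
    if fmt is None:
        raise ValueError("unsupported or malformed image")
    return fmt
```

A matching signature only shows that the file claims to be an image; fully decoding it in a worker is what actually catches truncated or corrupt files.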
The key insight: only the original upload is synchronous. Everything else (thumbnails, format conversion, moderation) happens asynchronously via a message queue. This keeps upload latency low (~200ms).
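The split between the synchronous and asynchronous work might look like the sketch below. It uses in-process stand-ins: a `queue.Queue` in place of a real message broker and a dict in place of the object store. The handler name, the job fields, and the ID format are illustrative assumptions.

```python
import hashlib
import queue
import uuid

jobs = queue.Queue()   # stand-in for a message queue
blobs = {}             # stand-in for the object store: content hash -> bytes

# The steps the text names as asynchronous.
ASYNC_STEPS = ("thumbnails", "format_conversion", "moderation")


def handle_upload(data: bytes) -> dict:
    """Synchronous path: store the original, enqueue the rest, return fast."""
    content_hash = hashlib.sha256(data).hexdigest()
    blobs.setdefault(content_hash, data)          # write the original once
    image_id = f"img_{uuid.uuid4().hex[:12]}"     # hypothetical ID format
    for step in ASYNC_STEPS:
        jobs.put({"image_id": image_id, "hash": content_hash, "step": step})
    return {"status": 201, "id": image_id}


def run_worker_once() -> dict:
    """Asynchronous path: a worker pulls one job off the queue."""
    job = jobs.get()
    # Real processing (resize, convert, moderate) would happen here.
    jobs.task_done()
    return job
```

The upload response does not wait for any worker, so its latency depends only on hashing and the single write of the original.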
Storage Strategy
Storage is the biggest cost and design challenge:
Object Store (S3): Store all image files, both originals and thumbnails. S3 gives us 11 nines of durability, automatic replication, and virtually unlimited capacity. Organize by content hash: s3://images/{hash_prefix}/{hash}.{ext}
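A content-addressed key following the s3://images/{hash_prefix}/{hash}.{ext} layout could be derived as in this sketch. The prefix length is not specified in the text; two hex characters (256 prefixes) is an assumption here, and `object_key` is a hypothetical helper name.

```python
import hashlib


def object_key(data: bytes, ext: str, prefix_len: int = 2) -> str:
    """Derive the object-store location from the file's SHA-256 hash.

    prefix_len=2 is an assumed value; the text only says "hash_prefix".
    """
    digest = hashlib.sha256(data).hexdigest()
    return f"s3://images/{digest[:prefix_len]}/{digest}.{ext}"


print(object_key(b"example image bytes", "jpg"))
```

Identical bytes always map to the same key, which is what makes the hash usable for deduplication as well as for layout.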
Metadata Database: Store image metadata: ID, owner, upload time, dimensions, content hash, thumbnail URLs, view count. A relational database (PostgreSQL) works well here since the data is structured and we need indexes on owner, hash, and creation time.
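One possible shape for the metadata table is sketched below. It uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so the example runs anywhere; the table name, column names, and types are assumptions, and thumbnail URLs are left out on the assumption that they can be derived from the content hash.

```python
import sqlite3

# Hypothetical schema; SQLite types stand in for their PostgreSQL equivalents.
SCHEMA = """
CREATE TABLE images (
    id           TEXT PRIMARY KEY,
    owner_id     TEXT NOT NULL,
    content_hash TEXT NOT NULL,
    width        INTEGER,
    height       INTEGER,
    is_public    INTEGER NOT NULL DEFAULT 1,
    view_count   INTEGER NOT NULL DEFAULT 0,
    created_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- The three indexes named in the text: owner, hash, creation time.
CREATE INDEX idx_images_owner   ON images (owner_id);
CREATE INDEX idx_images_hash    ON images (content_hash);
CREATE INDEX idx_images_created ON images (created_at);
"""


def open_metadata_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


conn = open_metadata_db()
conn.execute(
    "INSERT INTO images (id, owner_id, content_hash) VALUES (?, ?, ?)",
    ("img_abc123", "user_1", "deadbeef"),
)
print(conn.execute("SELECT id FROM images WHERE owner_id = ?", ("user_1",)).fetchall())
```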
CDN: All image serving goes through a CDN (CloudFront, Fastly). The CDN caches images at edge locations worldwide, so users get images from the nearest server. Cache TTL of 30 days for images (they rarely change).
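The 30-day TTL is usually communicated to the CDN through response headers on the origin. The sketch below builds those headers; the helper name is hypothetical, and using the content hash as the ETag is an assumption that fits the hash-based storage layout.

```python
CACHE_TTL_SECONDS = 30 * 24 * 60 * 60  # 30 days = 2,592,000 seconds


def image_cache_headers(content_hash: str) -> dict:
    """Origin response headers telling the CDN to cache an image for 30 days."""
    return {
        "Cache-Control": f"public, max-age={CACHE_TTL_SECONDS}",
        # Assumption: the content hash doubles as a strong validator.
        "ETag": f'"{content_hash}"',
    }


print(image_cache_headers("abc123"))
```

Non-public images would need a different policy (for example `private` or signed URLs), and a long TTL means deletions have to be paired with an explicit CDN invalidation.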
Deduplication: Before storing, compute SHA-256 hash of the image content. Check the metadata DB β if the hash exists, point the new image record to the existing S3 object. This saves enormous storage when the same meme or image gets uploaded thousands of times.
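The lookup-then-point flow can be sketched with in-memory dicts standing in for S3 and the metadata DB. The reference count is an assumption added for this example: because owners can delete their uploads, a shared object should only be removed once no image record points at it.

```python
import hashlib


class DedupStore:
    """In-memory stand-in for the object store plus metadata DB."""

    def __init__(self):
        self.objects = {}  # content hash -> bytes (the object store)
        self.refs = {}     # content hash -> number of image records using it
        self.images = {}   # image id -> content hash (the metadata DB)

    def upload(self, image_id: str, data: bytes) -> bool:
        """Record an image; return True only if new bytes were stored."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.objects
        if is_new:
            self.objects[digest] = data
        self.refs[digest] = self.refs.get(digest, 0) + 1
        self.images[image_id] = digest
        return is_new

    def delete(self, image_id: str) -> None:
        """Remove an image record; drop the bytes when the last reference goes."""
        digest = self.images.pop(image_id)
        self.refs[digest] -= 1
        if self.refs[digest] == 0:
            del self.refs[digest]
            del self.objects[digest]


store = DedupStore()
store.upload("img_1", b"popular meme")
store.upload("img_2", b"popular meme")   # same bytes: nothing new is stored
print(len(store.images), "records,", len(store.objects), "stored object")
```

In a real deployment the check-and-insert would need to be atomic (for example a unique constraint on the hash column) so that two concurrent uploads of the same file do not race.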