Databases: SQL vs NoSQL
The Big Picture
Every application needs to store data somewhere. Your choice of database shapes everything — how fast your app runs, how well it scales, and how much pain you'll feel at 3 AM when things break.
There are two major families: SQL (relational) databases and NoSQL (non-relational) databases. Think of it like organizing your stuff: SQL is a perfectly organized filing cabinet with labeled folders. NoSQL is a collection of different storage solutions — a box for random items, a shelf for documents, a rolodex for contacts.
Neither is "better." They solve different problems. Let's understand both.
SQL / Relational Databases
SQL databases (MySQL, PostgreSQL, Oracle, SQL Server) store data in tables with rows and columns — like spreadsheets. Each table has a fixed schema that defines exactly what columns exist and what type of data they hold.
A users table might have columns: id (int), name (varchar), email (varchar), created_at (timestamp). Every row follows this exact structure. Try to insert a row without an email? The database says no.
Tables relate to each other through foreign keys. A posts table has a user_id column that points back to the users table. This is why they're called relational databases — the power is in the relationships.
Normalization is the process of organizing data to reduce duplication. Instead of storing the user's name on every post, you store the user_id and JOIN the tables when you need both. Less duplication = less inconsistency.
ACID Properties
SQL databases guarantee ACID properties — four promises about how your data behaves:
- Atomicity — A transaction is all-or-nothing. If you're transferring $100 from Alice to Bob, either both the debit and credit happen, or neither does. No "money disappeared" bugs.
- Consistency — The database moves from one valid state to another. All rules (constraints, foreign keys) are enforced. You can't end up with a post pointing to a user that doesn't exist.
- Isolation — Concurrent transactions don't interfere with each other. If two people buy the last concert ticket at the same time, only one succeeds.
- Durability — Once a transaction commits, it's permanent. Even if the server loses power one millisecond later, your data is safe.
ACID is crucial for financial systems, e-commerce, and anything where data correctness matters more than speed.
SQL vs NoSQL Data Modeling
NoSQL Database Types
NoSQL is an umbrella term for "anything that's not a traditional relational database." There are four main flavors:
1. Key-Value Stores (Redis, DynamoDB, Memcached)
- The simplest model. It's literally a dictionary: key in, value out.
- Blazing fast. Perfect for caching, session storage, user preferences.
- Limited querying — you can only look up by key.
2. Document Stores (MongoDB, CouchDB, Firestore)
- Store data as JSON-like documents. Each document can have a different structure.
- Great for content management, user profiles, catalogs.
- More flexible than SQL — add new fields anytime without schema migration.
3. Column-Family Stores (Cassandra, HBase, ScyllaDB)
- Store data in columns instead of rows. Great for analytical queries on huge datasets.
- Excellent write throughput. Scales horizontally like a champ.
- Used for time-series data, IoT, event logging.
4. Graph Databases (Neo4j, Amazon Neptune, ArangoDB)
- Store data as nodes and edges. Perfect for highly connected data.
- Social networks (who follows whom), recommendation engines, fraud detection.
- Queries like "find all friends-of-friends" are natural and fast.
Sharding: Splitting Data Across Servers
When your database gets too big for one server, you shard it — split the data across multiple servers. Each server holds a portion (shard) of the total data.
Common sharding strategies:
- Range-based: Users A-M go to Shard 1, N-Z go to Shard 2. Simple but can create uneven distribution (hotspots).
- Hash-based: Hash the key (e.g., user_id) and mod by the number of shards. Distributes evenly but makes range queries hard.
- Geography-based: US users go to the US shard, EU users go to the EU shard. Great for compliance (GDPR) and latency.
Sharding adds serious complexity: cross-shard queries are expensive, rebalancing shards is painful, and some operations (like JOINs across shards) become nearly impossible. That's why most teams delay sharding as long as possible and try vertical scaling or read replicas first.
Replication
Replication means keeping copies of your data on multiple servers. There are two main approaches:
Leader-Follower (Primary-Replica): One server (the leader) handles all writes. It replicates changes to one or more followers, which handle reads. This is great for read-heavy workloads — scale reads by adding more followers.
Leader-Leader (Multi-Master): Multiple servers can handle writes. More complex because you need to handle write conflicts (two servers update the same row at the same time). Used when you need writes in multiple regions.
Replication provides:
- High availability — If the leader dies, a follower can be promoted.
- Better read performance — Spread reads across many replicas.
- Geographic distribution — Put replicas near users for lower latency.
The trade-off is replication lag. Followers might be a few milliseconds (or seconds) behind the leader. A user writes data, then reads it from a follower that hasn't caught up yet — they see stale data. This is the consistency vs availability trade-off we explore in the CAP Theorem lesson.