Fundamentals12 min read

Databases: SQL vs NoSQL

Choosing the right home for your data

scope:Building Blockdifficulty:Beginner-Intermediate

The Big Picture

Every application needs to store data somewhere. Your choice of database shapes everything — how fast your app runs, how well it scales, and how much pain you'll feel at 3 AM when things break.

There are two major families: SQL (relational) databases and NoSQL (non-relational) databases. Think of it like organizing your stuff: SQL is a perfectly organized filing cabinet with labeled folders. NoSQL is a collection of different storage solutions — a box for random items, a shelf for documents, a rolodex for contacts.

Neither is "better." They solve different problems. Let's understand both.

▸ What is a database? — persistent, organized storage

SQL / Relational Databases

SQL databases (MySQL, PostgreSQL, Oracle, SQL Server) store data in tables with rows and columns — like spreadsheets. Each table has a fixed schema that defines exactly what columns exist and what type of data they hold.

A users table might have columns: id (int), name (varchar), email (varchar), created_at (timestamp). Every row follows this exact structure. Try to insert a row without an email? The database says no.

Tables relate to each other through foreign keys. A posts table has a user_id column that points back to the users table. This is why they're called relational databases — the power is in the relationships.

Normalization is the process of organizing data to reduce duplication. Instead of storing the user's name on every post, you store the user_id and JOIN the tables when you need both. Less duplication = less inconsistency.

▸ SQL — structured tables with foreign keys

ACID Properties

SQL databases guarantee ACID properties — four promises about how your data behaves:

Atomicity — A transaction is all-or-nothing. If you're transferring $100 from Alice to Bob, either both the debit and credit happen, or neither does. No "money disappeared" bugs.
Consistency — The database moves from one valid state to another. All rules (constraints, foreign keys) are enforced. You can't end up with a post pointing to a user that doesn't exist.
Isolation — Concurrent transactions don't interfere with each other. If two people buy the last concert ticket at the same time, only one succeeds.
Durability — Once a transaction commits, it's permanent. Even if the server loses power one millisecond later, your data is safe.

ACID is crucial for financial systems, e-commerce, and anything where data correctness matters more than speed.

SQL vs NoSQL Data Modeling

# === SQL Approach (using SQLAlchemy-style pseudocode) ===
# Normalized: data split across related tables

# Table: users
# | id | name    | email           |
# | 1  | Alice   | [email protected]  |

# Table: posts
# | id | user_id | content         | created_at |
# | 1  | 1       | "Hello world!"  | 2024-01-15 |

# To get a user's posts, you JOIN:
# SELECT users.name, posts.content
# FROM posts JOIN users ON posts.user_id = users.id
# WHERE users.id = 1;


# === NoSQL Approach (Document store like MongoDB) ===
# Denormalized: everything about a user in one document

user_document = {
    "_id": "user_1",
    "name": "Alice",
    "email": "[email protected]",
    "posts": [
        {
            "id": "post_1",
            "content": "Hello world!",
            "created_at": "2024-01-15"
        },
        {
            "id": "post_2",
            "content": "NoSQL is flexible!",
            "created_at": "2024-01-16"
        }
    ]
}

# No JOIN needed — all data is in one place
# But if you update Alice's name, you only update it here
# (in SQL, you'd update one row; in NoSQL, the name
#  might be duplicated in other documents too)

Output

# SQL: Normalized, JOINs needed, strong consistency
# NoSQL: Denormalized, fast reads, flexible schema

NoSQL Database Types

NoSQL is an umbrella term for "anything that's not a traditional relational database." There are four main flavors:

1. Key-Value Stores (Redis, DynamoDB, Memcached)

The simplest model. It's literally a dictionary: key in, value out.
Blazing fast. Perfect for caching, session storage, user preferences.
Limited querying — you can only look up by key.

2. Document Stores (MongoDB, CouchDB, Firestore)

Store data as JSON-like documents. Each document can have a different structure.
Great for content management, user profiles, catalogs.
More flexible than SQL — add new fields anytime without schema migration.

3. Column-Family Stores (Cassandra, HBase, ScyllaDB)

Store data in columns instead of rows. Great for analytical queries on huge datasets.
Excellent write throughput. Scales horizontally like a champ.
Used for time-series data, IoT, event logging.

4. Graph Databases (Neo4j, Amazon Neptune, ArangoDB)

Store data as nodes and edges. Perfect for highly connected data.
Social networks (who follows whom), recommendation engines, fraud detection.
Queries like "find all friends-of-friends" are natural and fast.

▸ NoSQL — flexible, schema-less documents

Note: Here's a simple rule of thumb: If your data has lots of relationships and you need transactions, go SQL. If your data is flexible, you need massive scale, and eventual consistency is OK, go NoSQL. Many real systems use BOTH — SQL for the core business data (orders, payments) and NoSQL for everything else (sessions, analytics, caching).

Sharding: Splitting Data Across Servers

When your database gets too big for one server, you shard it — split the data across multiple servers. Each server holds a portion (shard) of the total data.

Common sharding strategies:

Range-based: Users A-M go to Shard 1, N-Z go to Shard 2. Simple but can create uneven distribution (hotspots).
Hash-based: Hash the key (e.g., user_id) and mod by the number of shards. Distributes evenly but makes range queries hard.
Geography-based: US users go to the US shard, EU users go to the EU shard. Great for compliance (GDPR) and latency.

Sharding adds serious complexity: cross-shard queries are expensive, rebalancing shards is painful, and some operations (like JOINs across shards) become nearly impossible. That's why most teams delay sharding as long as possible and try vertical scaling or read replicas first.

▸ Sharding — split data across multiple databases

Replication

Replication means keeping copies of your data on multiple servers. There are two main approaches:

Leader-Follower (Primary-Replica): One server (the leader) handles all writes. It replicates changes to one or more followers, which handle reads. This is great for read-heavy workloads — scale reads by adding more followers.

Leader-Leader (Multi-Master): Multiple servers can handle writes. More complex because you need to handle write conflicts (two servers update the same row at the same time). Used when you need writes in multiple regions.

Replication provides:

High availability — If the leader dies, a follower can be promoted.
Better read performance — Spread reads across many replicas.
Geographic distribution — Put replicas near users for lower latency.

The trade-off is replication lag. Followers might be a few milliseconds (or seconds) behind the leader. A user writes data, then reads it from a follower that hasn't caught up yet — they see stale data. This is the consistency vs availability trade-off we explore in the CAP Theorem lesson.

▸ Replication — copies for reliability and read performance

Key Metrics

SQL indexed query

B-tree index

1-10 ms $O(\log n)$

SQL full table scan

Avoid in production

100-10000 ms $O(n)$

SQL JOIN (indexed)

Depends on table sizes

5-50 ms $O(n \log m)$

Key-value GET (Redis)

In-memory

< 1 ms $O(1)$

Document read (MongoDB)

With index

1-5 ms $O(\log n)$

Cassandra write

Append-only, very fast

1-5 ms $O(1)$

Graph traversal (Neo4j)

Depends on depth

1-50 ms $O(V+E)$

Note: Interview tip: When asked 'which database would you use?', never say just 'SQL' or 'NoSQL.' Explain your reasoning: 'The data is highly relational with complex queries, so I'd choose PostgreSQL. For the session cache layer, I'd use Redis since it's key-value lookups with no relationships needed.' Showing you understand the trade-offs is what impresses interviewers.

Quick check

Which ACID property ensures that a bank transfer either fully completes or fully rolls back?

CAP Theorem

You can't have it all — pick two out of three

→

Consistent Hashing

Adding a server shouldn't reshuffle everything

→

Caching

Keep the good stuff close — skip the slow trip

→

Binary Search TreeData Structure

Insert, delete, search

→