Distributed Systems: When One Machine Is Not Enough

This is Post 9 in the Computer Science Series. The previous post covered building applications. Now we look at what happens when your app grows beyond one machine — distributed systems.

YouTube serves billions of videos. WhatsApp delivers messages to 2 billion users. Google processes hundreds of thousands of search queries every second. No single machine can do this. The solution is many machines working together — and that introduces a whole new class of hard problems.

Distributed Systems Overview

The Big Picture

╔══════════════════════════════════════════════════════════════════════════════╗
║                      Distributed Systems                                     ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  Why distribute?                                                             ║
║   → More throughput than one machine                                         ║
║   → Survive machine failures (fault tolerance)                               ║
║   → Serve users globally with low latency                                    ║
║                                                                              ║
║  The hard problems:                                                          ║
║   → Partial failures: some machines die, others keep running                 ║
║   → Network delays: messages arrive late or out of order                     ║
║   → Consistency: different machines have different views of data             ║
║   → Coordination: getting machines to agree on anything                      ║
║                                                                              ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  Key Concepts                                                                ║
║                                                                              ║
║  CAP theorem    → Consistency, Availability, Partition-tolerance             ║
║  Consensus      → Raft algorithm, used in etcd, CockroachDB                  ║
║  Replication    → leader-follower, multi-leader, leaderless                  ║
║  Sharding       → partition data across machines                             ║
║  Caching        → Redis, CDN — serve fast from nearby                        ║
╚══════════════════════════════════════════════════════════════════════════════╝

1. Why Distributed Is Hard

On a single machine, everything is easy:

One memory space — no question about who sees what
If the machine crashes, everything stops — no partial failure
Clock is shared — no ambiguity about order of events

On multiple machines:

Partial failures: Server A crashes, B and C keep running. From B’s perspective, is A dead? Or just slow? You can’t tell.
Network unreliability: a message from A to B might arrive late, arrive twice, or not arrive at all.
No global clock: A’s clock might be 100ms ahead of B’s. Which event happened first?

These aren’t implementation bugs — they’re fundamental properties of distributed systems. Every design decision is navigating these constraints.

2. The CAP Theorem

In 2000, Eric Brewer stated what became the CAP theorem: in a distributed system, you can have at most two of three properties:

C — Consistency:         every read sees the most recent write
A — Availability:        every request gets a response (not an error)
P — Partition tolerance: system works even when network splits

In practice, network partitions always happen (cables fail, routers restart). So you must choose between C and A when a partition occurs.

CP System (Consistent + Partition-tolerant)

When a partition happens, refuse to serve requests rather than return stale data.

Network partition occurs:
  Server A (primary):  has latest data
  Server B (replica):  can't reach A

  Client asks B for user balance
  B: "I can't guarantee I have the latest data → I refuse to answer (error)"

Examples: HBase, ZooKeeper, most relational databases. Good for: banking, inventory systems where stale data causes real harm.

AP System (Available + Partition-tolerant)

When a partition happens, still serve requests but might return stale data.

Network partition occurs:
  Server A: updates user's balance to $150
  Server B: can't reach A, still thinks balance is $100

  Client asks B for balance
  B: "I'll return $100. It might be stale, but at least you get an answer."

Examples: DynamoDB (Eventual mode), Cassandra, DNS. Good for: shopping carts, social media likes, analytics — where small staleness is acceptable.

Eventual Consistency

Most AP systems promise eventual consistency: if you stop writing, all replicas will eventually converge to the same value.

Time 0:  A=100 on all nodes
Time 1:  Write $50 to Node 1 → Node 1 has A=150
Time 2:  (propagation) → Node 2 gets the update → A=150
Time 3:  All nodes consistent: A=150

The window of inconsistency might be milliseconds or seconds. For most applications (social media, shopping recommendations), this is fine.

3. Replication — Copies for Safety

Replication keeps copies of data on multiple machines, so if one fails, others have the data.

Leader-Follower (Primary-Replica)

One leader accepts all writes. Followers replicate the leader and serve reads.

Leader (primary):
  Accepts writes
  Replicates to followers asynchronously

Followers (replicas):
  Serve reads (possibly slightly stale)
  Promote to leader if leader fails

Used by PostgreSQL, MySQL, MongoDB, Redis.

Synchronous replication: wait for follower to confirm before acknowledging the write.

Pro: no data loss if leader dies
Con: every write is slower (must wait for network round-trip to follower)

Asynchronous replication: acknowledge immediately, replicate in background.

Pro: fast writes
Con: if leader dies before replication, the last few writes are lost

Multi-Leader

Multiple nodes accept writes. Used when writes come from multiple geographic regions.

US datacenter (Leader 1) ←──sync──→ Asia datacenter (Leader 2)
   ↑ accepts US writes                    ↑ accepts Asia writes

Problem: what if both leaders accept a different write to the same record?

Leader 1: user changes name to "Alice"
Leader 2: user changes name to "Alicia"
(at same time, before sync)
→ conflict!

Conflicts must be resolved somehow: last-write-wins, merge, or show both to the user.

Leaderless (Dynamo-style)

Any node can accept writes. A write succeeds if W nodes confirm. A read succeeds if R nodes respond. If W + R > N (total nodes), reads will always see at least one up-to-date copy.

Used by Cassandra, DynamoDB, Riak. Very available — no single point of failure.

4. Consensus — Agreeing on Anything

If multiple nodes can receive writes, how do they agree on who’s the leader? Who wins a conflict? What’s the definitive order of events?

This is the consensus problem: getting a group of nodes to agree on a value, even when some nodes fail.

Consensus turns out to be hard. The FLP impossibility theorem (1985) proved it’s impossible to achieve consensus in an asynchronous system if even one node can fail. In practice, systems add timeouts and assumptions to work around this.

Raft — How Most Systems Do It

Raft is the consensus algorithm used in etcd (Kubernetes’ backbone), CockroachDB, TiKV, and Consul.

Leader Election

All nodes start as followers.
If a follower hears nothing from a leader for ~150ms, it becomes a candidate.
Candidate asks others for votes.
If it gets a majority, it becomes the leader.
Leader sends heartbeats to prevent re-elections.

Election example (3 nodes: A, B, C):
  1. Leader C crashes
  2. A and B both wait a random timeout (A waits 150ms, B waits 200ms)
  3. A times out first → becomes candidate, votes for itself (1/3)
  4. A requests votes from B → B votes for A (2/3) → A wins!
  5. A becomes new leader, sends heartbeats

The random timeout prevents both A and B from starting an election simultaneously.

Log Replication

Once a leader is elected, it replicates every write to followers before committing:

Client: "set x = 5"
Leader: 1. Append to own log: [set x = 5]
        2. Send to followers
        3. Wait for majority to confirm (2/3 nodes)
        4. Commit → apply to state machine
        5. Reply to client: "done"

If the leader crashes after commit but before telling the client, the client retries and gets “already done”. Because the write is in the majority’s logs, it won’t be lost.

Raft guarantees: any committed write will survive any single machine failure (in a 3-node cluster). For tolerance of 2 failures: 5 nodes. For k failures: 2k+1 nodes.

5. Sharding — Splitting Data Across Machines

When data grows beyond what one machine can hold or one machine can write fast enough, you shard — partition data across multiple machines.

Shard by user_id:
  Shard 1: user_id % 3 == 0  (users 0, 3, 6, 9...)
  Shard 2: user_id % 3 == 1  (users 1, 4, 7, 10...)
  Shard 3: user_id % 3 == 2  (users 2, 5, 8, 11...)

Queries for one user go to one shard. Each shard handles 1/3 of the load.

Problems with sharding:

Cross-shard queries (JOINs across shards) are very expensive — must query all shards
Rebalancing: adding a new shard means moving lots of data
Hot spots: if user_id 1 is 10× more active than others, Shard 1 is overloaded

Consistent hashing solves the rebalancing problem — when you add a shard, only a fraction of keys need to move.

6. Caching — Fast Data from Nearby

Not all load needs to hit the database. Caching stores frequently accessed data in a fast store (like Redis, in RAM).

Request: GET /api/product/123
  → Check Redis cache
     Hit:  return in < 1ms
     Miss: query database (50ms) → store in Redis → return

Cache invalidation is the hard part: when data changes in the database, you must update or remove the cached copy. Stale cache = users see old data.

# After updating a product:
db.execute("UPDATE products SET price=99 WHERE id=123")
redis.delete("product:123")   # invalidate cache
# Next request will miss cache → fetch fresh from DB → re-cache

CDN (Content Delivery Network): a global cache for static files (images, CSS, JS). User in Vietnam gets files from a nearby CDN edge in Singapore, not from a server in the US. 10ms vs 200ms.

7. The Reality: Trade-offs Everywhere

Distributed systems are a trade-off negotiation:

Want	Trade-off
Strong consistency	Higher latency (wait for majority), lower availability
High availability	Possible stale reads, conflict resolution needed
Low latency	Async replication → possible data loss
Horizontal scale	No cross-shard JOINs, complex rebalancing
Global distribution	Higher replication lag, conflict probability

The right design depends on what your system actually needs. A payment system needs strong consistency — stale data causes real financial harm. A social media “likes” counter doesn’t — being off by a few likes for a second is fine.

Famous Distributed System Disasters

Real incidents that show these problems are real:

GitHub, 2012: MySQL failover left a replica as leader. Both old and new leader accepted writes. Conflicting data corrupted the database.
Amazon DynamoDB, 2011: Metadata service overloaded → cascading failures across multiple AWS regions.
Cloudflare, 2019: A BGP routing misconfiguration caused 15 minutes of partial internet outage globally.

Every large distributed system has failure stories. The goal isn’t to prevent all failures — it’s to design systems that fail gracefully and recover quickly.

Summary

Distributed systems are hard because physical reality is hard:

Partial failures    → some nodes die; design for it
Network unreliability → messages are lost/delayed; handle it
No global clock    → order of events is ambiguous; track causality
CAP theorem         → choose consistency or availability during partitions
Raft consensus      → elect a leader, replicate a log, achieve agreement
Replication         → copies for fault tolerance and scale
Sharding            → split data for write scale
Caching             → serve fast from nearby; invalidate carefully

Building correct distributed systems requires thinking carefully about failure modes, not just the happy path. The best engineers design for failure — they ask “what happens when this network link goes down?” before shipping.

In the next post, we’ll look at Machine Learning — how computers learn patterns from data instead of being explicitly programmed.

Back to the series: Welcome to the Computer Science Series