Computer Architecture: Inside the CPU

This is Post 2 in the Computer Science Series. The previous post covered how bits encode everything — numbers, text, audio, video. Now we look at the machine that actually processes those bits: the CPU.

Every game you play, every website that loads, every AI response you get — it all runs inside a CPU. Understanding how it works explains why some code is fast, why some is slow, and how computers went from slow vacuum tubes to processing billions of instructions per second.

The Big Picture

╔══════════════════════════════════════════════════════════════════════════════╗
║                      Inside a Modern CPU                                      ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  ┌─────────────────────────────────────────────────────────────────────┐      ║
║  │                        CPU Core                                     │      ║
║  │                                                                     │      ║
║  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │        ║
║  │  │  Fetch   │→ │  Decode  │→ │ Execute  │→ │   Write Back     │   │        ║
║  │  │          │  │          │  │          │  │                  │   │        ║
║  │  │ get next │  │ what does│  │  do the  │  │ save the result  │   │        ║
║  │  │ instr.   │  │ it mean? │  │  work    │  │ to register/mem  │   │        ║
║  │  └──────────┘  └──────────┘  └──────────┘  └──────────────────┘   │        ║
║  │                                                                     │      ║
║  │  ┌──────────────────────┐  ┌──────────────────────────────────┐    │       ║
║  │  │   Branch Predictor   │  │        ALU / FPU / SIMD          │    │       ║
║  │  │  guesses if/else     │  │  integer · float · vector math   │    │       ║
║  │  └──────────────────────┘  └──────────────────────────────────┘    │       ║
║  │                                                                     │      ║
║  │  ┌──────────────────────────────────────────────────────────────┐  │       ║
║  │  │  Registers  (< 1 KB, < 1 ns)  — things currently in use     │  │        ║
║  │  └──────────────────────────────────────────────────────────────┘  │       ║
║  └─────────────────────────────────────────────────────────────────────┘      ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  Memory Hierarchy (speed decreases, size increases going down)                ║
║                                                                               ║
║  L1 Cache   32–64 KB    ~1 ns    ← per core, closest to execution             ║
║  L2 Cache   256 KB–1 MB ~4 ns    ← per core                                   ║
║  L3 Cache   8–64 MB     ~10 ns   ← shared across cores                        ║
║  RAM        8–128 GB    ~60 ns   ← your computer's main memory                ║
║  SSD        1–4 TB      ~100 µs  ← 100,000× slower than L1 cache              ║
╠══════════════════════════════════════════════════════════════════════════════╣
║  Multiple Cores                                                               ║
║                                                                               ║
║  Core 0 │ Core 1 │ Core 2 │ Core 3  ← each runs independently                 ║
║  ────────┼────────┼────────┼────────                                          ║
║              Shared L3 Cache                                                  ║
║              Shared RAM                                                       ║
╚══════════════════════════════════════════════════════════════════════════════╝

1. What a CPU Actually Does

A CPU is a chip the size of a postage stamp that runs instructions — tiny commands like “add these two numbers”, “compare these values”, “jump to this memory address”.

Your Python script, your game, your browser — they all get translated into millions of these simple instructions before the CPU runs them.

A typical instruction looks like this (in human-readable form):

ADD  R1, R2, R3     # R1 = R2 + R3
CMP  R1, 100        # compare R1 with 100
JGT  loop_start     # if R1 > 100, jump to loop_start
MOV  R4, [0x1000]   # load value from memory address 0x1000 into R4

The CPU executes billions of these per second.

2. The Fetch-Decode-Execute Cycle

Every CPU, from the cheapest microcontroller to the fastest server chip, follows the same basic loop:

┌─────────────────────────────────────────────────────────┐
│                  The CPU Loop                           │
│                                                         │
│   1. FETCH   → read the next instruction from memory   │
│   2. DECODE  → figure out what the instruction means   │
│   3. EXECUTE → do the actual work (add, compare, jump) │
│   4. WRITE   → save the result back                    │
│                                                         │
│   Repeat. Billions of times per second.                 │
└─────────────────────────────────────────────────────────┘

Analogy: Imagine you’re following a recipe.

Fetch: read step 5 from the page
Decode: understand “whisk the eggs”
Execute: actually whisk
Write: the eggs are now whisked (result saved)

Simple enough. But modern CPUs do something much cleverer.

3. Pipelining — The Assembly Line

A single instruction takes several steps (fetch, decode, execute, write). Early CPUs waited for one instruction to fully finish before starting the next. That’s slow.

Modern CPUs use pipelining: while step 1 is executing, step 2 is decoding, step 3 is fetching — like an assembly line in a factory.

Time:        1    2    3    4    5    6    7
─────────────────────────────────────────────
Instruction 1: F  → D  → E  → W
Instruction 2:      F  → D  → E  → W
Instruction 3:           F  → D  → E  → W
Instruction 4:                F  → D  → E  → W

F=Fetch  D=Decode  E=Execute  W=Write

Without pipelining: 4 instructions × 4 steps = 16 cycles. With pipelining: 4 instructions finish in 7 cycles.

Modern CPUs have 10–20 pipeline stages. At 3 GHz, each stage takes about 0.3 nanoseconds.

4. Branch Prediction — Guessing the Future

Pipelining has a problem: what if the instruction you’re pre-fetching depends on a condition that hasn’t been calculated yet?

if score > 100:
    give_bonus()
else:
    try_again()

While the CPU is executing the score > 100 comparison, it’s already trying to fetch the next instruction. But which branch — give_bonus() or try_again()? It doesn’t know yet!

Branch prediction is the CPU’s educated guess. Modern CPUs track the history of every branch and predict with ~95% accuracy.

Prediction correct: no slowdown, pipeline keeps flowing
Prediction wrong: must flush the pipeline and start over — ~15 cycle penalty

This is why sorting data before processing can make code faster: sorted data has predictable patterns, so the branch predictor gets it right almost every time.

5. Out-of-Order Execution

Even within a single program, instructions don’t have to run in the order you wrote them — as long as the results are the same.

A = load from memory    # takes 60 ns (slow!)
B = 5 + 3              # only needs registers (0.3 ns)
C = A + B              # depends on A, must wait

A simple CPU would stall waiting for A. A smart CPU executes B while waiting for A to load. It finishes faster even though the order changed.

Modern CPUs can hold 200–400 in-flight instructions, reordering them on the fly to keep all execution units busy.

6. The Memory Hierarchy — Why Location Matters

The CPU can calculate at incredible speed. The problem is waiting for data. Memory access times differ enormously:

Level	Size	Speed	Analogy
Registers	< 1 KB	< 1 ns	Things in your hands
L1 Cache	32–64 KB	~1 ns	Things on your desk
L2 Cache	256 KB–1 MB	~4 ns	Shelf next to you
L3 Cache	8–64 MB	~10–40 ns	Filing cabinet across the room
RAM	8–128 GB	~60 ns	Library in another building
SSD	1–4 TB	~100 µs	Library in another city
HDD	1–20 TB	~10 ms	Library on another planet

L1 cache is about 60× faster than RAM. SSD is 100,000× slower than L1 cache.

This is why data locality — keeping data close together in memory — matters so much:

# SLOW: jumps randomly through memory (many cache misses)
for i in range(1000):
    total += matrix[random_row[i]][random_col[i]]

# FAST: reads memory in order (hits cache almost every time)
for row in matrix:
    for value in row:
        total += value

Both loops do the same math. The second can be 10–100× faster just because it reads memory in order.

How Caching Works

When you access memory address X, the CPU doesn’t just fetch X. It fetches a whole cache line — typically 64 bytes around X. If your next access is close to X, it’s already in cache (a cache hit). If not, it must fetch again (a cache miss).

Cache hit:   data already in L1/L2/L3   →  lightning fast
Cache miss:  must go to RAM or SSD      →  very slow stall

7. Registers — The CPU’s Workspace

Registers are tiny storage locations inside the CPU itself — the fastest memory that exists. A modern 64-bit CPU has a few dozen general-purpose registers, each 8 bytes wide.

Register R1:  00000000 00000000 00000000 00000101  (value = 5)
Register R2:  00000000 00000000 00000000 00000011  (value = 3)
              ↓  ADD R3, R1, R2
Register R3:  00000000 00000000 00000000 00001000  (value = 8)

Everything the CPU computes must pass through registers. When your code has too many variables at once, the compiler “spills” some to memory — which is much slower.

8. Multiple Cores — Parallel Work

One core doing one instruction at a time is fast. But what about doing two things at once?

Modern CPUs have multiple cores — 4, 8, 16, even 128 on server chips. Each core has its own registers, L1, and L2 cache. They share L3 cache and RAM.

Core 0: running your Chrome tab
Core 1: running your music player
Core 2: running a background virus scan
Core 3: idle (saving power)

A program must be written to use multiple cores — it doesn’t happen automatically. Threads are how a single program splits work across cores.

But multiple cores sharing memory creates a new problem: if Core 0 and Core 1 both try to update the same variable at the same time, the result is unpredictable. This is a race condition — a topic we’ll explore deeply in the Operating Systems post.

9. The Instruction Set — What a CPU “Speaks”

Every CPU has an instruction set architecture (ISA) — the set of instructions it understands.

ISA	Used by	Notes
x86-64	Intel, AMD desktop/server CPUs	Most PCs and servers
ARM	Apple M-series, phones, tablets	Power-efficient, now very fast
RISC-V	Embedded systems, research	Open standard, growing fast

Software compiled for x86-64 won’t run on ARM directly — it’s a different language. This is why Apple had to develop a translation layer (Rosetta) when they switched Macs from Intel to their own ARM chips.

10. Special Execution Units

The main CPU isn’t the only compute unit on the chip:

ALU (Arithmetic Logic Unit): integer math, comparisons, bitwise operations.

FPU (Floating-Point Unit): decimal math (the kind used in games, scientific computing, ML). Separate hardware because floating-point operations are more complex.

SIMD Units (Single Instruction, Multiple Data): apply one instruction to multiple values at once.

Normal (scalar):  add R1, R2  →  1 addition
SIMD (AVX-256):   VADD YMM0, YMM1  →  4 additions simultaneously

Video codecs, image filters, and ML inference all use SIMD heavily. It’s like having 4–16 extra calculators working in parallel.

GPU (Graphics Processing Unit): technically a separate chip, but works alongside the CPU. A GPU has thousands of tiny cores — slower than CPU cores, but great for doing the same thing to millions of values at once (like applying a filter to every pixel in an image).

How It All Fits Together

When you run a Python function like sum([1, 2, 3, 4, 5]):

The Python interpreter translates it into machine instructions
The CPU fetches the first instruction from memory
The branch predictor starts guessing what comes next
The out-of-order engine lines up instructions to keep all units busy
The memory hierarchy tries to keep data in L1 cache
SIMD units crunch multiple numbers at once
Results flow back through write-back into registers

All of this happens in a few microseconds.

Your Python code
      ↓
Python bytecode (intermediate language)
      ↓
Machine instructions (x86-64 / ARM)
      ↓
CPU pipeline: Fetch → Decode → Execute → Write
      ↓
Result in register → written to memory → returned to Python

The next time you wonder why one loop runs 10× faster than another, the answer is almost always in this pipeline: a cache miss, a branch misprediction, or a missed chance to use SIMD.

In the next post, we’ll go up one layer: the Operating System — the software that manages the CPU, memory, and all running programs at once.

Back to the series: Welcome to the Computer Science Series