01Key Concepts
B-Tree Internals
- Balanced tree of pages (8KB typical)
- Internal nodes: keys + child pointers
- Leaf nodes: keys + heap tuple pointers (or clustered data)
- Insert: find leaf, split if full (O(log n))
- Read-optimized: good for range queries
- PostgreSQL, MySQL InnoDB use B+Tree variant
LSM Tree Internals
- Write โ MemTable (sorted, in-memory) + WAL
- MemTable full โ flush to SSTable (immutable file)
- SSTables sorted, bloom filter per file
- Read: check MemTable โ L0 SSTables โ L1+ (compaction levels)
- Compaction: merge SSTables, remove tombstones
- Used by: Cassandra, RocksDB, LevelDB, BigTable
MVCC
- Each row has xmin (created by txn) + xmax (deleted by txn)
- Reader sees: rows where xmin committed AND xmax not yet committed
- Writers don't block readers, readers don't block writers
- Dead tuples: old versions accumulate โ VACUUM needed
- Snapshot isolation: each txn sees consistent snapshot at start
Write-Ahead Log (WAL)
- All changes written to log BEFORE data pages
- Crash recovery: replay WAL from last checkpoint
- Durability: fsync WAL entry = transaction committed
- Replication: ship WAL segments to standby (streaming)
- Logical replication: WAL decoded to row-level changes
Indexing Types
- B-Tree: equality + range, default in Postgres
- Hash: equality only, faster for =
- GiST/GIN: full-text, geometric, JSONB
- Composite: leftmost prefix must be used
- Covering (INCLUDE): store extra cols, index-only scan
- Partial: WHERE active=true โ smaller, faster
Redis Internals
- String: SDS (Simple Dynamic String) โ not C string
- Hash: ziplist (small) โ hashtable (large)
- ZSet: ziplist โ skip list + hashtable
- Skip list: O(log n) avg, probabilistic balancing
- Single-threaded event loop (I/O multi-plexed)
- Persistence: RDB snapshots + AOF (fsync options)
02Isolation Levels โ Know the Anomalies
Isolation Level Matrix
| Level | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew |
|---|---|---|---|---|
| READ UNCOMMITTED | Possible | Possible | Possible | Possible |
| READ COMMITTED (default PG) | Prevented | Possible | Possible | Possible |
| REPEATABLE READ | Prevented | Prevented | Possible | Possible |
| SERIALIZABLE (SSI in PG) | Prevented | Prevented | Prevented | Prevented |
Write Skew is the subtle one: two transactions read same data, both decide to write based on what they read, both write โ but their combined effect violates a constraint. Classic example: doctor on-call scheduling (both doctors check "is someone on-call?" โ both see yes โ both go off-call โ nobody on-call). SERIALIZABLE prevents this. Your CBDC double-spend prevention needed SERIALIZABLE or application-level locking.
03Must-Know Deep Dives
๐ฅ The Cost of Storage: Amplification
Every database choice is a trade-off between three types of Amplification. Mastering this allows you to choose storage without external docs:
- Write Amplification: Ratio of data written to disk vs data requested by the app. High in B-Trees due to page splits.
- Read Amplification: Number of disk reads required to find one record. High in LSM Trees due to scanning multiple levels.
- Space Amplification: Ratio of physical disk space used vs logical data size. High in MVCC systems due to dead row versions.
๐ฅ Index-Only Scans & The Visibility Map
If you have an index on (user_id, status), and run SELECT status FROM users WHERE user_id = 1, Postgres can often return the data without touching the table at all. This is an Index-Only Scan.
- The Catch: Postgres needs to know if the version in the index is actually "visible" to your transaction (MVCC).
- The Solution: The Visibility Map. It tracks which pages contain only committed, non-dead tuples.
- Performance Tip: If your Index-Only Scans are slow, it's usually because the table needs a
VACUUMto update the visibility map.
๐ฅ Query Planning โ What EXPLAIN ANALYZE Tells You
Key metrics in EXPLAIN: cost (planner estimate), actual time (real), rows (actual vs estimated โ if way off, run ANALYZE). Nested Loop vs Hash Join vs Merge Join: planner chooses based on row estimates. Miscalibrated stats โ wrong plan.
๐ฅ Connection Pooling โ pgBouncer Modes
PostgreSQL: each connection = OS process (~5-10MB RAM). 500 connections = 2.5GB just for connection overhead. Solution: pgBouncer in front of Postgres.
Session mode: one backend per client for session duration. Same as direct connection but reuses after disconnect. Transaction mode: backend held only during transaction. Most efficient. Statement mode: backend per statement โ breaks multi-statement transactions.
For your Go services: use database/sql with SetMaxOpenConns(25) + SetMaxIdleConns(25) + pgBouncer in transaction mode. Total DB connections = (go services ร 25) / pgBouncer workers โ 50-100 to Postgres.
๐ฅ Consistent Hashing for Sharding
When node added: only keys between previous node and new node migrate. Average: k/n keys move (k=total keys, n=nodes). Without consistent hashing: all keys rehash โ full data migration. Redis Cluster uses 16384 hash slots instead.
04Resources
60 min. Best DB internals lecture available online. Dense, no fluff. Watch this one.
Canonical source. Chapter 13. Read the isolation levels section carefully.
Best indexing resource on the internet. Free. Covers composite index traps, covering indexes, everything.
LSM trees in production. How Meta tunes RocksDB for different workloads.
Understand the internal encoding (ziplist vs skip list). Interview gold.
Chapter 3 (Storage Engines) + Chapter 7 (Transactions) are mandatory reads. Rest is optional.