Databases in a Day — Senior Level

01Key Concepts

B-Tree Internals

Balanced tree of pages (8KB typical)
Internal nodes: keys + child pointers
Leaf nodes: keys + heap tuple pointers (or clustered data)
Insert: find leaf, split if full (O(log n))
Read-optimized: good for range queries
PostgreSQL, MySQL InnoDB use B+Tree variant

LSM Tree Internals

Write → MemTable (sorted, in-memory) + WAL
MemTable full → flush to SSTable (immutable file)
SSTables sorted, bloom filter per file
Read: check MemTable → L0 SSTables → L1+ (compaction levels)
Compaction: merge SSTables, remove tombstones
Used by: Cassandra, RocksDB, LevelDB, BigTable

MVCC

Each row has xmin (created by txn) + xmax (deleted by txn)
Reader sees: rows where xmin committed AND xmax not yet committed
Writers don't block readers, readers don't block writers
Dead tuples: old versions accumulate → VACUUM needed
Snapshot isolation: each txn sees consistent snapshot at start

Write-Ahead Log (WAL)

All changes written to log BEFORE data pages
Crash recovery: replay WAL from last checkpoint
Durability: fsync WAL entry = transaction committed
Replication: ship WAL segments to standby (streaming)
Logical replication: WAL decoded to row-level changes

Indexing Types

B-Tree: equality + range, default in Postgres
Hash: equality only, faster for =
GiST/GIN: full-text, geometric, JSONB
Composite: leftmost prefix must be used
Covering (INCLUDE): store extra cols, index-only scan
Partial: WHERE active=true — smaller, faster

Redis Internals

String: SDS (Simple Dynamic String) — not C string
Hash: ziplist (small) → hashtable (large)
ZSet: ziplist → skip list + hashtable
Skip list: O(log n) avg, probabilistic balancing
Single-threaded event loop (I/O multi-plexed)
Persistence: RDB snapshots + AOF (fsync options)

02Isolation Levels — Know the Anomalies

Isolation Level Matrix

Level	Dirty Read	Non-Repeatable Read	Phantom Read	Write Skew
READ UNCOMMITTED	Possible	Possible	Possible	Possible
READ COMMITTED (default PG)	Prevented	Possible	Possible	Possible
REPEATABLE READ	Prevented	Prevented	Possible	Possible
SERIALIZABLE (SSI in PG)	Prevented	Prevented	Prevented	Prevented

Write Skew is the subtle one: two transactions read same data, both decide to write based on what they read, both write — but their combined effect violates a constraint. Classic example: doctor on-call scheduling (both doctors check "is someone on-call?" → both see yes → both go off-call → nobody on-call). SERIALIZABLE prevents this. Your CBDC double-spend prevention needed SERIALIZABLE or application-level locking.

03Must-Know Deep Dives

🔥 The Cost of Storage: Amplification

Every database choice is a trade-off between three types of Amplification. Mastering this allows you to choose storage without external docs:

Write Amplification: Ratio of data written to disk vs data requested by the app. High in B-Trees due to page splits.
Read Amplification: Number of disk reads required to find one record. High in LSM Trees due to scanning multiple levels.
Space Amplification: Ratio of physical disk space used vs logical data size. High in MVCC systems due to dead row versions.

// Trade-off Summary B-Tree (Postgres): Low Read Amp, High Write Amp. LSM Tree (Cassandra): Low Write Amp, High Read Amp.

🔥 Index-Only Scans & The Visibility Map

If you have an index on (user_id, status), and run SELECT status FROM users WHERE user_id = 1, Postgres can often return the data without touching the table at all. This is an Index-Only Scan.

The Catch: Postgres needs to know if the version in the index is actually "visible" to your transaction (MVCC).
The Solution: The Visibility Map. It tracks which pages contain only committed, non-dead tuples.
Performance Tip: If your Index-Only Scans are slow, it's usually because the table needs a VACUUM to update the visibility map.

-- Check if your index is truly effective
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
-- Look for "Heap Fetches: 0". 
-- If Heap Fetches > 0, the visibility map is out of date.

🔥 Query Planning — What EXPLAIN ANALYZE Tells You

-- Seq Scan: reading entire table. Index missing or not selective enough.
EXPLAIN ANALYZE SELECT * FROM transactions WHERE user_id = 123;
-- Seq Scan on transactions (cost=0.00..50000 rows=1)
-- Add index: CREATE INDEX ON transactions(user_id);

-- Index Scan vs Index-Only Scan
-- Index-Only: all columns in query are in the index (covering index)
-- Much faster: no heap fetch needed

-- N+1 Detection:
-- 1 query for orders + N queries for each order's items
-- Fix: JOIN or eager loading with IN clause
SELECT * FROM orders WHERE user_id = ?;  -- 1 query
SELECT * FROM items WHERE order_id IN (?...?);  -- 1 query with IN

Key metrics in EXPLAIN: cost (planner estimate), actual time (real), rows (actual vs estimated — if way off, run ANALYZE). Nested Loop vs Hash Join vs Merge Join: planner chooses based on row estimates. Miscalibrated stats → wrong plan.

🔥 Connection Pooling — pgBouncer Modes

PostgreSQL: each connection = OS process (~5-10MB RAM). 500 connections = 2.5GB just for connection overhead. Solution: pgBouncer in front of Postgres.

Session mode: one backend per client for session duration. Same as direct connection but reuses after disconnect. Transaction mode: backend held only during transaction. Most efficient. Statement mode: backend per statement — breaks multi-statement transactions.

For your Go services: use database/sql with SetMaxOpenConns(25) + SetMaxIdleConns(25) + pgBouncer in transaction mode. Total DB connections = (go services × 25) / pgBouncer workers ≈ 50-100 to Postgres.

🔥 Consistent Hashing for Sharding

// Virtual node consistent hashing (simplified)
ring := sortedMap{}  // position → node

// Add 150 virtual nodes per physical node (better distribution)
for _, node := range nodes {
    for i := 0; i < 150; i++ {
        hash := crc32(fmt.Sprintf("%s-%d", node.ID, i))
        ring[hash] = node
    }
}

// Route key to node: find next node clockwise on ring
func getNode(key string) Node {
    hash := crc32(key)
    return ring.nextAfter(hash)  // Binary search on sorted ring
}

When node added: only keys between previous node and new node migrate. Average: k/n keys move (k=total keys, n=nodes). Without consistent hashing: all keys rehash — full data migration. Redis Cluster uses 16384 hash slots instead.

04Resources

VIDEO

CMU 15-445 — Database Storage (Andy Pavlo)

60 min. Best DB internals lecture available online. Dense, no fluff. Watch this one.

BLOG

PostgreSQL Docs — MVCC

Canonical source. Chapter 13. Read the isolation levels section carefully.

BLOG

Use The Index, Luke — SQL Indexing Guide

Best indexing resource on the internet. Free. Covers composite index traps, covering indexes, everything.

BLOG

Meta Engineering — RocksDB at Scale

LSM trees in production. How Meta tunes RocksDB for different workloads.

BLOG

Redis Data Structures — Official Docs

Understand the internal encoding (ziplist vs skip list). Interview gold.

BOOK

Designing Data-Intensive Applications — Kleppmann (Chapters 3, 7)

Chapter 3 (Storage Engines) + Chapter 7 (Transactions) are mandatory reads. Rest is optional.

05Quick Revision Notes

DB INTERVIEW FLASH CARDS

B-Tree vs LSM for write-heavy?

LSM: sequential writes to MemTable, no in-place update. Better write throughput. B-Tree: random writes to pages, but better read latency.

Why does VACUUM matter?

MVCC leaves dead tuple versions. VACUUM clears them. Without VACUUM: table bloat, transaction ID wraparound (XID). Autovacuum in Postgres handles this.

Write skew anomaly?

Two txns read same row, each decides to write based on read. Combined effect violates constraint. Prevented by SERIALIZABLE isolation or SELECT FOR UPDATE.

Covering index?

Index contains all columns needed by query. No heap fetch = Index-Only Scan. Faster. CREATE INDEX ON t(a,b) INCLUDE (c) for SELECT c WHERE a=? ORDER BY b.

Redis ZSet internals?

Skip list (range queries by score) + hashtable (O(1) score lookup by member). ziplist for small sets (<128 members, <64 bytes per member).

Consistent hashing benefit?

Adding/removing node: only k/n keys remapped. Regular hashing: all keys remapped. Critical for live resharding without full data migration.