DATABASES · INTERNALS + INTERVIEWS · SENIOR

Databases in a Day

From your MariaDB + Redis production experience to understanding what actually happens inside. B-trees, MVCC, isolation anomalies, sharding strategy.

H 0–1.5
Storage Engines: B-Tree, LSM
H 1.5–3
MVCC, WAL, Transactions
H 3–4.5
Isolation Levels + Anomalies
H 4.5–6
Indexing + Query Planning
H 6–7.5
Sharding + Replication
H 7.5–8
Redis Internals

01 Key Concepts

B-Tree Internals

  • Balanced tree of pages (8KB typical)
  • Internal nodes: keys + child pointers
  • Leaf nodes: keys + heap tuple pointers (or clustered data)
  • Insert: find leaf, split if full (O(log n))
  • Read-optimized: good for range queries
  • PostgreSQL, MySQL InnoDB use B+Tree variant
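The descent described above can be sketched in a few lines of Go. This is a minimal illustration, not a real storage engine: `node` and `search` are hypothetical names, pages are slices rather than 8KB disk blocks, and splitting is omitted.

```go
package main

import (
	"fmt"
	"sort"
)

// node is a simplified B+Tree page. Internal nodes carry routing keys and
// child pointers; leaves carry keys plus values (standing in for heap
// tuple pointers or clustered row data).
type node struct {
	keys     []int
	children []*node  // nil for leaf pages
	values   []string // set only on leaf pages
}

// search descends one page per level, so a lookup costs O(log n) page
// reads. Convention here: keys >= keys[i] route to children[i+1].
func search(n *node, key int) (string, bool) {
	for n.children != nil {
		i := sort.SearchInts(n.keys, key+1) // first routing key > target
		n = n.children[i]
	}
	i := sort.SearchInts(n.keys, key)
	if i < len(n.keys) && n.keys[i] == key {
		return n.values[i], true
	}
	return "", false
}

func main() {
	root := &node{
		keys: []int{10},
		children: []*node{
			{keys: []int{3, 7}, values: []string{"a", "b"}},
			{keys: []int{10, 15}, values: []string{"c", "d"}},
		},
	}
	v, ok := search(root, 7)
	fmt.Println(v, ok) // b true
}
```

The same binary-search-per-page shape is why range queries are cheap: once at a leaf, a real B+Tree just walks sibling pointers.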

LSM Tree Internals

  • Write → MemTable (sorted, in-memory) + WAL
  • MemTable full → flush to SSTable (immutable file)
  • SSTables sorted, bloom filter per file
  • Read: check MemTable → L0 SSTables → L1+ (compaction levels)
  • Compaction: merge SSTables, remove tombstones
  • Used by: Cassandra, RocksDB, LevelDB, BigTable
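The read path above can be sketched in Go. Everything here is a toy stand-in: `lsm` and `sstable` are hypothetical types, maps play the role of sorted files, and the Bloom filter uses a single hash function (a real one uses several for a lower false-positive rate).

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// sstable is an immutable sorted run; bits is a toy Bloom filter.
type sstable struct {
	data map[string]string
	bits [256]bool
}

func newSSTable(data map[string]string) *sstable {
	t := &sstable{data: data}
	for k := range data {
		t.bits[crc32.ChecksumIEEE([]byte(k))%256] = true
	}
	return t
}

func (t *sstable) mayContain(key string) bool {
	return t.bits[crc32.ChecksumIEEE([]byte(key))%256]
}

// lsm holds the mutable MemTable plus flushed SSTables, newest first.
type lsm struct {
	memtable map[string]string
	sstables []*sstable // index 0 is the most recently flushed
}

// get checks the MemTable, then each SSTable newest-to-oldest, skipping
// files whose Bloom filter rules the key out (avoiding a disk read).
func (db *lsm) get(key string) (string, bool) {
	if v, ok := db.memtable[key]; ok {
		return v, ok
	}
	for _, t := range db.sstables {
		if !t.mayContain(key) {
			continue
		}
		if v, ok := t.data[key]; ok {
			return v, ok
		}
	}
	return "", false
}

func main() {
	old := newSSTable(map[string]string{"k1": "old", "k2": "v2"})
	db := &lsm{memtable: map[string]string{"k1": "new"}, sstables: []*sstable{old}}
	v, _ := db.get("k1")
	fmt.Println(v) // new: the MemTable shadows older SSTable versions
}
```

Note the newest-first order is what makes updates work without in-place writes: a newer version simply shadows older ones until compaction merges them away.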

MVCC

  • Each row version has xmin (txn that created it) + xmax (txn that deleted it, unset if live)
  • Reader sees: versions whose xmin committed before its snapshot AND whose xmax is unset or uncommitted
  • Writers don't block readers, readers don't block writers
  • Dead tuples: old versions accumulate → VACUUM needed
  • Snapshot isolation: each txn sees a consistent snapshot taken at its start
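The visibility rule condenses to a few lines. A minimal sketch in Go, with `visible` as an illustrative name; real PostgreSQL adds cases for in-progress, aborted, and own-transaction changes:

```go
package main

import "fmt"

// visible applies the simplified MVCC rule above. committed is the set of
// transaction IDs the reader's snapshot considers committed; xmax == 0
// means the version was never deleted.
func visible(xmin, xmax uint64, committed map[uint64]bool) bool {
	if !committed[xmin] {
		return false // creator not committed: version not yet visible
	}
	if xmax != 0 && committed[xmax] {
		return false // deleter committed: this is a dead tuple
	}
	return true
}

func main() {
	committed := map[uint64]bool{100: true, 101: true}
	fmt.Println(visible(100, 0, committed))   // true: created, never deleted
	fmt.Println(visible(100, 101, committed)) // false: deletion committed
	fmt.Println(visible(102, 0, committed))   // false: creator still in flight
}
```

The second case is exactly where dead tuples come from: the old version fails the check for everyone, but still occupies page space until VACUUM removes it.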

Write-Ahead Log (WAL)

  • All changes written to log BEFORE data pages
  • Crash recovery: replay WAL from last checkpoint
  • Durability: fsync WAL entry = transaction committed
  • Replication: ship WAL segments to standby (streaming)
  • Logical replication: WAL decoded to row-level changes
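Crash recovery from the bullets above reduces to "apply the log in order". A minimal sketch in Go, with `record` and `replay` as illustrative names, ignoring checkpoints and fsync:

```go
package main

import "fmt"

// record is one WAL entry: the change is appended (and fsynced) here
// before any data page is modified.
type record struct{ key, value string }

// replay rebuilds table state by applying WAL records in order — the
// core of crash recovery from the last checkpoint.
func replay(wal []record) map[string]string {
	table := make(map[string]string)
	for _, r := range wal {
		table[r.key] = r.value
	}
	return table
}

func main() {
	wal := []record{{"acct:1", "100"}, {"acct:2", "50"}, {"acct:1", "75"}}
	// Simulate a crash: in-memory pages are gone, but the WAL survived.
	recovered := replay(wal)
	fmt.Println(recovered["acct:1"]) // 75: later records win
}
```

Streaming replication is the same loop on another machine: the standby continuously replays shipped WAL segments instead of waiting for a crash.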

Indexing Types

  • B-Tree: equality + range, default in Postgres
  • Hash: equality only, faster for =
  • GiST/GIN: full-text, geometric, JSONB
  • Composite: leftmost prefix must be used
  • Covering (INCLUDE): store extra cols, index-only scan
  • Partial: WHERE active=true — smaller, faster

Redis Internals

  • String: SDS (Simple Dynamic String) — not a C string
  • Hash: ziplist (small) → hashtable (large)
  • ZSet: ziplist → skip list + hashtable
  • Skip list: O(log n) avg, probabilistic balancing
  • Single-threaded event loop (I/O multiplexing)
  • Persistence: RDB snapshots + AOF (fsync options)
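The ZSet's dual structure is worth seeing concretely. A minimal sketch in Go (`zset`, `add`, `rangeByScore` are illustrative names): a re-sorted slice stands in for the skip list, which in real Redis also gives O(log n) inserts.

```go
package main

import (
	"fmt"
	"sort"
)

// zset pairs a hashtable (O(1) score lookup by member) with an ordering
// by (score, member) for range queries — the two structures Redis keeps
// in sync for every sorted set.
type zset struct {
	scores  map[string]float64
	members []string // kept sorted by (score, member)
}

func newZSet() *zset { return &zset{scores: make(map[string]float64)} }

func (z *zset) add(member string, score float64) {
	if _, ok := z.scores[member]; !ok {
		z.members = append(z.members, member)
	}
	z.scores[member] = score
	sort.Slice(z.members, func(i, j int) bool {
		a, b := z.members[i], z.members[j]
		if z.scores[a] != z.scores[b] {
			return z.scores[a] < z.scores[b]
		}
		return a < b
	})
}

// rangeByScore mirrors ZRANGEBYSCORE min max.
func (z *zset) rangeByScore(min, max float64) []string {
	var out []string
	for _, m := range z.members {
		if s := z.scores[m]; s >= min && s <= max {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	z := newZSet()
	z.add("alice", 300)
	z.add("bob", 150)
	z.add("carol", 200)
	fmt.Println(z.rangeByScore(150, 250)) // [bob carol]
	fmt.Println(z.scores["alice"])        // 300: O(1) via the hashtable
}
```

Neither structure alone suffices: the hashtable can't answer range-by-score, and the ordered structure can't answer ZSCORE in O(1). That pairing is the interview answer.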

02 Isolation Levels — Know the Anomalies

Isolation Level Matrix

Level                       | Dirty Read | Non-Repeatable Read | Phantom Read | Write Skew
READ UNCOMMITTED            | Possible   | Possible            | Possible     | Possible
READ COMMITTED (default PG) | Prevented  | Possible            | Possible     | Possible
REPEATABLE READ             | Prevented  | Prevented           | Possible     | Possible
SERIALIZABLE (SSI in PG)    | Prevented  | Prevented           | Prevented    | Prevented

Write Skew is the subtle one: two transactions read the same data, both decide to write based on what they read, both write — but their combined effect violates a constraint. Classic example: doctor on-call scheduling (both doctors check "is someone on-call?" → both see yes → both go off-call → nobody on-call). SERIALIZABLE prevents this. Your CBDC double-spend prevention needed SERIALIZABLE or application-level locking.
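The doctor example can be made concrete with a deterministic simulation. A sketch in Go (`simulateWriteSkew` is an illustrative name) where each "transaction" reads from a snapshot taken before either commits — exactly the condition snapshot isolation permits:

```go
package main

import "fmt"

// simulateWriteSkew: both transactions read stale snapshots, each
// concludes "someone else is on call", and both write. The combined
// result breaks the "at least one doctor on call" invariant that each
// transaction, checked alone, preserved.
func simulateWriteSkew() map[string]bool {
	onCall := map[string]bool{"alice": true, "bob": true}

	countOnCall := func(s map[string]bool) int {
		n := 0
		for _, v := range s {
			if v {
				n++
			}
		}
		return n
	}

	// Both snapshots are taken before either transaction commits.
	snapA := map[string]bool{"alice": true, "bob": true}
	snapB := map[string]bool{"alice": true, "bob": true}

	if countOnCall(snapA) >= 2 { // txn A: "safe to leave"
		onCall["alice"] = false
	}
	if countOnCall(snapB) >= 2 { // txn B: same stale conclusion
		onCall["bob"] = false
	}
	return onCall
}

func main() {
	final := simulateWriteSkew()
	fmt.Println(final["alice"], final["bob"]) // false false: nobody on call
}
```

Under SERIALIZABLE, PostgreSQL's SSI would detect the dangerous read/write dependency and abort one transaction; SELECT FOR UPDATE achieves the same by forcing the reads to conflict.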

03 Must-Know Deep Dives

🔥 The Cost of Storage: Amplification

Every storage-engine choice is a trade-off between three types of amplification. Knowing them lets you reason about storage choices from first principles:

  • Write Amplification: Ratio of data written to disk vs data requested by the app. High in B-Trees due to page splits.
  • Read Amplification: Number of disk reads required to find one record. High in LSM Trees due to scanning multiple levels.
  • Space Amplification: Ratio of physical disk space used vs logical data size. High in MVCC systems due to dead row versions.
// Trade-off summary:
//   B-Tree (Postgres):    low read amp,  high write amp
//   LSM Tree (Cassandra): low write amp, high read amp
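To put a number on write amplification: a sketch of the arithmetic in Go, assuming the 8KB page size mentioned above and ignoring WAL and checkpoint writes (which only make the ratio worse). `writeAmplification` is an illustrative name.

```go
package main

import "fmt"

// writeAmplification = bytes physically written / bytes logically changed.
// Updating one 100-byte row on a B-Tree dirties a whole 8KB page, so the
// ratio is large even before counting the WAL record.
func writeAmplification(pageBytes, rowBytes float64) float64 {
	return pageBytes / rowBytes
}

func main() {
	fmt.Println(writeAmplification(8192, 100)) // 81.92
}
```

An LSM write goes to the MemTable and a sequential WAL append instead, so its up-front write amplification is near 1; the cost is deferred to compaction, which rewrites data once per level.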

🔥 Index-Only Scans & The Visibility Map

If you have an index on (user_id, status), and run SELECT status FROM users WHERE user_id = 1, Postgres can often return the data without touching the table at all. This is an Index-Only Scan.

  • The Catch: Postgres needs to know if the version in the index is actually "visible" to your transaction (MVCC).
  • The Solution: The Visibility Map. It tracks which pages contain only committed, non-dead tuples.
  • Performance Tip: If your Index-Only Scans are slow, it's usually because the table needs a VACUUM to update the visibility map.
-- Check if your index is truly effective
EXPLAIN (ANALYZE, BUFFERS) SELECT ...
-- Look for "Heap Fetches: 0".
-- If Heap Fetches > 0, the visibility map is out of date.

🔥 Query Planning — What EXPLAIN ANALYZE Tells You

-- Seq Scan: reading entire table. Index missing or not selective enough.
EXPLAIN ANALYZE SELECT * FROM transactions WHERE user_id = 123;
-- Seq Scan on transactions  (cost=0.00..50000 rows=1)
-- Add index:
CREATE INDEX ON transactions(user_id);

-- Index Scan vs Index-Only Scan
-- Index-Only: all columns in query are in the index (covering index)
-- Much faster: no heap fetch needed

-- N+1 Detection:
-- 1 query for orders + N queries for each order's items
-- Fix: JOIN or eager loading with IN clause
SELECT * FROM orders WHERE user_id = ?;            -- 1 query
SELECT * FROM items WHERE order_id IN (?...?);     -- 1 query with IN

Key metrics in EXPLAIN: cost (planner estimate), actual time (real), rows (actual vs estimated — if way off, run ANALYZE). Nested Loop vs Hash Join vs Merge Join: planner chooses based on row estimates. Miscalibrated stats → wrong plan.

🔥 Connection Pooling — pgBouncer Modes

PostgreSQL: each connection = OS process (~5-10MB RAM). 500 connections = 2.5GB just for connection overhead. Solution: pgBouncer in front of Postgres.

Session mode: one backend per client for session duration. Same as direct connection but reuses after disconnect. Transaction mode: backend held only during transaction. Most efficient. Statement mode: backend per statement — breaks multi-statement transactions.

For your Go services: use database/sql with SetMaxOpenConns(25) + SetMaxIdleConns(25) + pgBouncer in transaction mode. Total DB connections = (Go services × 25) / pgBouncer workers ≈ 50-100 to Postgres.
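A minimal sketch of that pool configuration. The stub driver exists only so the snippet runs without a database; in a real service you would import an actual Postgres driver (e.g. pgx or lib/pq) and point the DSN at pgBouncer. The DSN shown is a hypothetical example.

```go
package main

import (
	"database/sql"
	"database/sql/driver"
	"errors"
	"fmt"
)

// stubDriver satisfies driver.Driver without ever connecting.
type stubDriver struct{}

func (stubDriver) Open(name string) (driver.Conn, error) {
	return nil, errors.New("stub: no real connections")
}

func init() { sql.Register("stub", stubDriver{}) }

// newPool applies the limits discussed above. sql.Open is lazy: it
// validates arguments but opens no connection until the pool is used.
func newPool() *sql.DB {
	db, err := sql.Open("stub", "host=pgbouncer port=6432")
	if err != nil {
		panic(err)
	}
	db.SetMaxOpenConns(25) // hard cap per service instance
	db.SetMaxIdleConns(25) // keep connections warm instead of reopening
	return db
}

func main() {
	db := newPool()
	fmt.Println(db.Stats().MaxOpenConnections) // 25
}
```

Setting MaxIdleConns equal to MaxOpenConns avoids churn: a returned connection stays in the pool rather than being closed and reopened under steady load.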

🔥 Consistent Hashing for Sharding

// Virtual node consistent hashing (simplified)
ring := sortedMap{} // position → node

// Add 150 virtual nodes per physical node (better distribution)
for _, node := range nodes {
    for i := 0; i < 150; i++ {
        hash := crc32(fmt.Sprintf("%s-%d", node.ID, i))
        ring[hash] = node
    }
}

// Route key to node: find next node clockwise on ring
func getNode(key string) Node {
    hash := crc32(key)
    return ring.nextAfter(hash) // binary search on sorted ring
}

When a node is added, only the keys between the previous node and the new node migrate. Average: k/n keys move (k = total keys, n = nodes). Without consistent hashing, all keys rehash — a full data migration. Redis Cluster uses 16384 hash slots instead.
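The pseudocode above can be made runnable with the standard library. A sketch under the same assumptions (crc32 positions, 150 virtual nodes per physical node); `ring`, `newRing`, and `getNode` are illustrative names, and hash collisions between virtual nodes are ignored as negligible.

```go
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

// ring maps crc32 positions to node names; many virtual nodes per
// physical node smooth out the key distribution.
type ring struct {
	hashes []uint32
	owner  map[uint32]string
}

func newRing(nodes []string, vnodes int) *ring {
	r := &ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			h := crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s-%d", n, i)))
			r.hashes = append(r.hashes, h)
			r.owner[h] = n
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// getNode binary-searches for the first virtual node at or after the
// key's hash, wrapping past the highest position back to the start.
func (r *ring) getNode(key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.hashes[i]]
}

func main() {
	r := newRing([]string{"db1", "db2", "db3"}, 150)
	fmt.Println(r.getNode("user:42")) // deterministic: same key, same node
}
```

Adding a fourth node to this ring only claims the arc segments its 150 virtual nodes land on — the k/n migration property from the paragraph above.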

04 Resources

VIDEO
CMU 15-445 — Database Storage (Andy Pavlo)

60 min. Best DB internals lecture available online. Dense, no fluff. Watch this one.

BLOG
PostgreSQL Docs — MVCC

Canonical source. Chapter 13. Read the isolation levels section carefully.

BLOG
Use The Index, Luke — SQL Indexing Guide

Best indexing resource on the internet. Free. Covers composite index traps, covering indexes, everything.

BLOG
Meta Engineering — RocksDB at Scale

LSM trees in production. How Meta tunes RocksDB for different workloads.

BLOG
Redis Data Structures — Official Docs

Understand the internal encoding (ziplist vs skip list). Interview gold.

BOOK
Designing Data-Intensive Applications — Kleppmann (Chapters 3, 7)

Chapter 3 (Storage Engines) + Chapter 7 (Transactions) are mandatory reads. Rest is optional.

05 Quick Revision Notes

DB INTERVIEW FLASH CARDS

B-Tree vs LSM for write-heavy?
LSM: sequential writes to MemTable, no in-place update. Better write throughput. B-Tree: random writes to pages, but better read latency.
Why does VACUUM matter?
MVCC leaves dead tuple versions. VACUUM clears them. Without VACUUM: table bloat, transaction ID (XID) wraparound. Autovacuum in Postgres handles this.
Write skew anomaly?
Two txns read same row, each decides to write based on read. Combined effect violates constraint. Prevented by SERIALIZABLE isolation or SELECT FOR UPDATE.
Covering index?
Index contains all columns needed by query. No heap fetch = Index-Only Scan. Faster. CREATE INDEX ON t(a,b) INCLUDE (c) for SELECT c WHERE a=? ORDER BY b.
Redis ZSet internals?
Skip list (range queries by score) + hashtable (O(1) score lookup by member). ziplist for small sets (<128 members, <64 bytes per member).
Consistent hashing benefit?
Adding/removing node: only k/n keys remapped. Regular hashing: all keys remapped. Critical for live resharding without full data migration.