CONSENSUS · NETWORKING · DATA STRUCTURES · DISTRIBUTED SYSTEMS THEORY

CS Fundamentals in a Day

The theoretical foundations that explain why distributed systems behave the way they do. Raft consensus, TCP/TLS internals, probabilistic data structures, encoding formats, and distributed clocks – all at interview depth.

Hour 0–1 · Complexity + Amortization
Hour 1–2 · Consensus: Raft / Paxos
Hour 2–3 · Networking: TCP/TLS/HTTP
Hour 3–4.5 · Probabilistic Data Structures
Hour 4.5–5.5 · Encoding + Serialization
Hour 5.5–6 · Distributed Clocks

01 · Complexity & Amortized Analysis

Big-O Reference

  • O(1) – array index, hash map get/set
  • O(log n) – binary search, balanced BST insert/lookup
  • O(n) – linear scan, single pass
  • O(n log n) – merge sort, heap sort, TreeMap ops
  • O(n²) – nested loops, bubble sort
  • O(2ⁿ) – all subsets, naive recursive Fibonacci
  • O(n!) – all permutations

Amortized Analysis

  • Average cost per operation over a sequence – not worst-case single op
  • ArrayList/slice append: O(1) amortized (doubling strategy)
  • Doubling: total copies = 1+2+4+...+n ≈ 2n → O(1) per append
  • HashMap resize: O(n) per resize, O(1) amortized over n inserts
  • Splay tree: O(log n) amortized, O(n) worst single op
  • Key insight: occasional expensive ops are "paid for" by cheap ones

Space Complexity

  • O(1) – iterative algorithm with fixed variables
  • O(log n) – recursive binary search (call stack)
  • O(n) – storing all input elements (hash map, array copy)
  • O(n²) – 2D matrix, adjacency matrix for dense graph
  • In-place algorithms: O(1) extra space (quicksort partition)
  • Auxiliary space vs total space – distinguish in interviews

NP-Complete Problems (Know These)

  • Travelling Salesman Problem (TSP)
  • 0-1 Knapsack (NP-hard, DP gives pseudo-polynomial)
  • Graph coloring (k ≥ 3)
  • Boolean satisfiability (SAT)
  • Subset sum
  • These have no known polynomial-time solution – use approximations or DP for small n
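Subset sum is the easiest of these to see in code. A minimal Python sketch of the pseudo-polynomial DP (illustrative – runtime is O(n · target), which is polynomial in the numeric value of the target but exponential in its bit length, which is why the problem is still NP-complete):

```python
def subset_sum(nums, target):
    """Pseudo-polynomial DP: track every sum reachable with a subset so far."""
    reachable = {0}
    for x in nums:
        # Each element either joins an existing subset sum or is skipped
        reachable |= {s + x for s in reachable if s + x <= target}
    return target in reachable

print(subset_sum([3, 34, 4, 12, 5, 2], 9))   # True  (4 + 5)
print(subset_sum([3, 34, 4, 12, 5, 2], 30))  # False
```

This is the same trick that makes 0-1 Knapsack tractable for small capacities: the DP is fast when the target is small, useless when it is astronomically large.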

Amortized Analysis – Dynamic Array Append

// Dynamic array doubles when full – Go slices, Java ArrayList
Capacity: 1 → 2 → 4 → 8 → 16 → 32 → ... → n
// Total copy operations for n appends: 1 + 2 + 4 + ... + n/2 = n - 1 ≈ O(n) total
// Per-append amortized cost: O(n) total / n appends = O(1) amortized
// Worst case single append: O(n) (when doubling occurs)
// But doubling is rare – only ~log₂(n) resizes happen across n appends
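The copy count above can be checked empirically. A minimal Python model (illustrative – it counts copies on resize rather than actually storing elements):

```python
class DynamicArray:
    """Doubling growth: append is O(1) amortized even though a single
    append costs O(n) when a resize copies every existing element."""
    def __init__(self):
        self.capacity = 1
        self.size = 0
        self.copies = 0  # total elements copied during all resizes

    def append(self, x):
        if self.size == self.capacity:
            self.copies += self.size  # resize: copy everything over
            self.capacity *= 2
        self.size += 1

arr = DynamicArray()
for i in range(16):
    arr.append(i)
print(arr.copies)  # 15 -> total copies = n - 1, so O(1) amortized per append
```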

Quiz – Complexity

Q1. ArrayList.append() in Java has O(1) amortized time. What is its worst-case single-operation time?

Q2. Binary search on a sorted array of 1 million elements – how many comparisons worst case?

02 · Distributed Consensus – Raft & Paxos

The CAP Theorem & PACELC

To study distributed systems without external resources, you must understand the fundamental constraints:

  • CAP Theorem: In the presence of a Partition (network failure), you must choose between Consistency (all nodes see same data) or Availability (every request gets a response).
  • PACELC: An extension of CAP. Partition ? (A vs C) : Else (Latency vs Consistency).
// Tradeoff Examples
Raft (etcd/CockroachDB): chooses Consistency over Availability (CP).
  If a majority can't be reached, the system stops accepting writes to prevent data corruption.
DynamoDB/Cassandra: chooses Availability over Consistency (AP).
  Allows writes during a partition, resolving conflicts later using vector clocks or LWW.

Linearizability: This is the "Gold Standard" of consistency. It makes the distributed system appear as if there is only one copy of the data, and all operations are atomic.

Raft Consensus Algorithm

Raft is designed for understandability. It decomposes consensus into: leader election, log replication, and safety. Used by etcd (Kubernetes state store), CockroachDB, Consul, and TiKV.

// LEADER ELECTION
All nodes start as Followers with an election timeout (150–300ms, randomized)
Follower hasn't heard from leader → becomes Candidate → increments term → requests votes
Candidate needs majority (n/2 + 1) votes to become Leader
Split vote → timeout expires → new election with higher term

// LOG REPLICATION
Client → Leader: write request
Leader appends entry to its log (uncommitted)
Leader → Followers: AppendEntries RPC (heartbeat + log entries)
Majority ACKs → Leader commits entry → responds to client
Leader notifies Followers of commit on next heartbeat

// SAFETY GUARANTEE
Committed entry = present in logs of a majority
Leader always has the most up-to-date committed log (voting restriction)
Term number ensures no split-brain: higher term = newer leader

Key properties: Election safety (at most one leader per term). Log matching (same index+term → same entry). Leader completeness (committed entries survive leadership changes).
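Two of Raft's core rules fit in a few lines each. A Python sketch of the commit rule and the voting restriction (illustrative only – function names are mine, and real Raft tracks much more state):

```python
def majority(cluster_size):
    return cluster_size // 2 + 1

def can_commit(acks, cluster_size):
    """Leader commits an entry once a majority of the cluster
    (counting the leader itself) has it in their logs."""
    return acks >= majority(cluster_size)

def grant_vote(my_term, my_last_log, candidate_term, candidate_last_log):
    """Voting restriction: only vote for a candidate whose log is at least
    as up-to-date, comparing (last entry term, last entry index)."""
    if candidate_term < my_term:
        return False  # stale candidate from an old term
    return candidate_last_log >= my_last_log  # lexicographic tuple compare

print(can_commit(acks=3, cluster_size=5))  # True
print(can_commit(acks=2, cluster_size=5))  # False
```

The voting restriction is what guarantees leader completeness: a candidate missing committed entries cannot gather a majority, because some member of every majority has those entries.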

Raft vs Paxos

  • Both achieve the same safety guarantees under the same failure model
  • Paxos: foundational theory, flexible, hard to implement correctly
  • Raft: leader-based, sequential log, much easier to build and reason about
  • Multi-Paxos: optimized Paxos for a continuous log – closer to Raft
  • Paxos lineage in practice: Google Chubby (Multi-Paxos); ZooKeeper's ZAB is a closely related but distinct protocol
  • Raft used by: etcd, CockroachDB, Consul, TiKV, RethinkDB

Byzantine Fault Tolerance (BFT)

  • Raft/Paxos assume crash failures – nodes stop working
  • BFT: nodes can behave maliciously (send wrong data)
  • PBFT: needs 3f+1 nodes to tolerate f Byzantine failures
  • Blockchain (Bitcoin/Ethereum): Nakamoto consensus – probabilistic BFT via PoW/PoS
  • Hyperledger Fabric: PBFT variant for permissioned chains
  • BFT is much more expensive – only use when adversarial actors are possible

Consensus in Practice

  • Leader bottleneck: all writes go through the leader – use partitioning
  • Network partition: minority partition becomes unavailable (CP choice)
  • Pre-vote optimization: prevents unnecessary elections (etcd)
  • Joint consensus: safe cluster membership changes (add/remove nodes)
  • Read scalability: followers can serve reads with linearizability via read index
  • Lease reads: leader serves reads without log round-trip (clock-based)

FLP Impossibility

  • Fischer-Lynch-Paterson (1985): no deterministic algorithm can achieve consensus in an async system with even one crash failure
  • Why Raft works: uses randomized timeouts (randomness breaks the impossibility)
  • Practical insight: consensus requires either synchrony assumptions or randomness
  • Raft's randomized election timeouts prevent election livelock – and that randomness is exactly what places it outside FLP's deterministic-algorithm assumption

Quiz – Consensus

Q1. In Raft, when does a leader commit a log entry?

Q2. Why do Raft election timeouts use randomization?

Q3. How many nodes does PBFT need to tolerate 1 Byzantine failure?

03 · Networking – TCP, TLS, HTTP

TCP Deep Dive

To study networking without external resources, you must master congestion control – it is the reason internet speed fluctuates.

  • Slow Start: Start with a small window. Double size every RTT. (Exponential growth).
  • Congestion Avoidance: After reaching a threshold, increase window by 1 per RTT. (Linear growth).
  • Multiplicative Decrease: On packet loss, cut the window by half immediately.
Senior Insight: This is AIMD (Additive Increase, Multiplicative Decrease). It is designed to be "polite": it probes for bandwidth quickly but backs off sharply on loss to avoid congestion collapse.
// TCP 3-Way Handshake
Client → Server: SYN (seq=x)
Server → Client: SYN-ACK (seq=y, ack=x+1)
Client → Server: ACK (ack=y+1)
// Client can send data with the final ACK, i.e. after 1 RTT

// TCP 4-Way Close
Client → Server: FIN
Server → Client: ACK
Server → Client: FIN
Client → Server: ACK → TIME_WAIT (2×MSL ≈ 2-4 min)
// TIME_WAIT: ensures late packets don't corrupt new connections on the same port

// Congestion Control
Slow Start: cwnd doubles each RTT until ssthresh
Congestion Avoidance: cwnd += 1 per RTT (additive increase)
Packet loss detected: ssthresh = cwnd/2, restart (multiplicative decrease)
// AIMD = Additive Increase, Multiplicative Decrease
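The AIMD sawtooth can be simulated in a few lines. A Python sketch of the congestion-window trace (a toy model of the Tahoe-style variant described above, where loss resets cwnd to 1 – Reno and modern CUBIC behave differently):

```python
def aimd(events, cwnd=1.0, ssthresh=16.0):
    """Toy congestion-window trace: slow start doubles cwnd per RTT until
    ssthresh, then grows +1 per RTT; loss halves ssthresh and restarts."""
    trace = []
    for ev in events:  # 'rtt' = a round trip of ACKs, 'loss' = packet loss
        if ev == 'loss':
            ssthresh = cwnd / 2              # multiplicative decrease
            cwnd = 1.0                       # restart slow start (Tahoe-style)
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)   # slow start: exponential growth
        else:
            cwnd += 1                        # congestion avoidance: additive
        trace.append(cwnd)
    return trace

print(aimd(['rtt'] * 6))  # [2, 4, 8, 16, 17, 18]
```

Note the shape: exponential up to the threshold, then the slow linear climb – the sawtooth you see in real TCP throughput graphs.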

TLS 1.3 Handshake

// TLS 1.3: 1-RTT handshake (vs TLS 1.2's 2-RTT)
Client → Server: ClientHello (TLS version, cipher suites, key_share: DH public key)
Server → Client: ServerHello (chosen cipher, key_share: server DH public key)
                 + {Certificate, CertificateVerify, Finished} (all encrypted now!)
// Encryption begins here – much earlier than TLS 1.2
Client → Server: {Finished}
Client → Server: {HTTP Request} (already encrypted)

// 0-RTT Session Resumption (TLS 1.3)
Client → Server: ClientHello + {Early Data} (pre_shared_key from previous session)
// Warning: 0-RTT is vulnerable to replay attacks
// Only safe for idempotent requests (GET, HEAD) – NEVER for POST/payments

// Key Exchange: ephemeral Diffie-Hellman (forward secrecy)
// Even if the server's private key is compromised later, past sessions remain secure

Certificate validation: OCSP stapling (server includes signed proof the cert is not revoked), Certificate Transparency (CT) logs detect rogue CA issuance. mTLS: both client and server present certificates – used in service meshes (Istio), gRPC, Hyperledger Fabric.
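The key_share exchange above is Diffie-Hellman. A toy finite-field DH in Python with deliberately tiny numbers (illustration only – real TLS 1.3 uses X25519 or NIST curves and authenticates the exchange with the certificate; unauthenticated DH like this is vulnerable to man-in-the-middle):

```python
def dh_shared_secret(p, g, a_priv, b_priv):
    """Both sides derive g^(a*b) mod p from the other's public value
    without the private exponents ever crossing the wire."""
    A = pow(g, a_priv, p)          # client's key_share (public)
    B = pow(g, b_priv, p)          # server's key_share (public)
    secret_a = pow(B, a_priv, p)   # client computes the shared secret
    secret_b = pow(A, b_priv, p)   # server computes the same secret
    assert secret_a == secret_b
    return secret_a

# Fresh (ephemeral) exponents per session = forward secrecy: compromising
# a long-term key later reveals nothing about past session secrets.
print(dh_shared_secret(p=23, g=5, a_priv=6, b_priv=15))  # 2
```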

HTTP/1.1 vs HTTP/2 vs HTTP/3

  • HTTP/1.1: one outstanding request per connection (pipelining exists but is rarely usable). Keep-alive reuses the connection.
  • HTTP/2: multiplexing – multiple streams over one TCP connection. HPACK header compression. Server push. But: a single TCP stream means HOL blocking on packet loss.
  • HTTP/3: runs on QUIC (UDP). Per-stream loss recovery – losing a packet only blocks that stream, not all streams. 0-RTT connection setup.
  • QUIC: connection migration (an IP change doesn't break the session), built-in TLS 1.3, designed for mobile networks.

DNS Internals

  • Recursive resolver: asks on your behalf (ISP DNS, 8.8.8.8)
  • Authoritative server: knows the definitive answer for a zone
  • Resolution chain: Root → TLD (.com) → Authoritative → Answer
  • TTL: how long to cache. Low TTL = more lookups but fast propagation. High TTL = fewer lookups but slow propagation.
  • GeoDNS: return different IPs based on client location (Route53 latency-based routing)
  • DNSSEC: cryptographic signatures prevent DNS spoofing/cache poisoning

Load Balancer Layers

  • L4 (Transport): routes by IP + TCP port, no packet inspection. Faster.
  • L7 (Application): routes by HTTP headers, path, cookie. More flexible.
  • L7 enables: path-based routing, A/B testing, auth at edge, sticky sessions by cookie
  • AWS ALB: L7. NLB: L4. HAProxy: both modes.
  • Health checks: L4 (TCP connect) or L7 (HTTP /health endpoint)
  • Connection draining: allow in-flight requests to finish before removing instance

WebSockets vs SSE vs Polling

  • Long Polling: client waits, server holds until event. Simple, higher latency.
  • SSE: server → client only. HTTP/1.1 compatible. Good for feeds, notifications.
  • WebSocket: full-duplex over persistent TCP. Chat, gaming, real-time dashboards.
  • WebTransport: QUIC-based, future. Datagrams + streams.
  • WebSocket uses HTTP Upgrade header to switch protocol.
  • Go handles 10K+ concurrent WS connections well via goroutines.

Quiz – Networking

Q1. HTTP/2's main limitation compared to HTTP/3 is:

Q2. TLS 1.3 achieves forward secrecy by using:

Q3. Why is 0-RTT TLS resumption dangerous for POST requests?

04 · Probabilistic Data Structures

Bloom Filter – Probabilistic Set Membership

A Bloom filter answers "is this element in the set?" with no false negatives (if it says NO, definitely not in the set) and a tunable false positive rate (if it says YES, probably in the set).

// How it works
Data structure: bit array of m bits, all initialized to 0
Insertion: hash element with k independent hash functions → set k bits to 1
Lookup: hash element with the same k functions →
  if ANY bit is 0 → definitely not in set
  if ALL bits are 1 → probably in set (false positive possible)

// False Positive Rate formula
FPR ≈ (1 - e^(-kn/m))^k
where: k = number of hash functions, n = number of elements, m = number of bits
// Optimal k = (m/n) × ln(2) ≈ 0.693 × (m/n)
// For 1% FPR: need ~10 bits per element with 7 hash functions
// For 0.1% FPR: need ~15 bits per element

// Key limitation: cannot delete elements (a 1 bit may belong to multiple items)
// Solution: Counting Bloom Filter (increment/decrement counters instead of bits)

Where it's used: Cassandra (skip SSTable disk reads for non-existent keys), Google Chrome (safe browsing URL check), databases (avoid unnecessary disk lookups), CDNs (cache negative results), Bitcoin (SPV wallet transaction lookup).
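The mechanics above fit in a short class. A minimal Python sketch (illustrative – it derives k "independent" hashes by salting SHA-256, and uses a big integer as the bit array; production filters use faster non-cryptographic hashes):

```python
import hashlib

class BloomFilter:
    """k salted hash functions over an m-bit array.
    No false negatives; tunable false-positive rate."""
    def __init__(self, m=1024, k=7):
        self.m, self.k = m, k
        self.bits = 0  # Python int used as an m-bit array

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos       # set k bits to 1

    def might_contain(self, item):
        # ANY zero bit -> definitely absent; all ones -> probably present
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # True (always – no false negatives)
print(bf.might_contain("bob"))    # probably False (false positives possible)
```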

HyperLogLog – Approximate Cardinality

Count distinct elements (like COUNT DISTINCT in SQL) using a tiny fixed amount of memory – regardless of how many unique items you've seen.

// Problem: COUNT DISTINCT on 1 billion IDs requires storing all IDs – ~8GB
// HyperLogLog: ~1.5KB of memory, ~1.04/√m standard error
// Core insight: the maximum number of leading zeros in hashed values
// estimates log₂(cardinality)
For each element:
  hash(element) → binary string
  count leading zeros (position of first 1 bit)
  track the maximum leading zeros seen: M
Estimate ≈ 2^M

// HyperLogLog uses m registers (buckets) and takes the harmonic mean
// m=16384 registers → standard error ~0.8%, memory ~12KB
// Redis: PFADD key element, PFCOUNT key
// Mergeable: PFMERGE → combine multiple HLLs without double-counting

Where it's used: Redis PFADD/PFCOUNT, counting unique visitors at scale, distinct queries in analytics (Presto, BigQuery), network traffic analysis, streaming distinct count.
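The register-and-harmonic-mean idea can be sketched directly. A Python toy HLL (illustrative – it uses the raw estimator without the small/large-range corrections a real implementation needs, with m = 2^b registers):

```python
import hashlib

def hll_estimate(items, b=10):
    """Toy HyperLogLog: the first b hash bits pick a register; each register
    keeps the max 'position of the first 1 bit' in the remaining bits."""
    m = 1 << b
    registers = [0] * m
    for item in items:
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        bucket = h >> (64 - b)                   # top b bits choose the register
        rest = h & ((1 << (64 - b)) - 1)         # remaining 64-b bits
        rho = (64 - b) - rest.bit_length() + 1   # position of leftmost 1 bit
        registers[bucket] = max(registers[bucket], rho)
    alpha = 0.7213 / (1 + 1.079 / m)             # bias-correction constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)

est = hll_estimate(range(50_000))
print(round(est))  # close to 50,000 – typical error ~3% with 1024 registers
```

50,000 distinct items are summarized by 1024 small integers, and two such sketches could be merged by taking the per-register max.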

Skip List

  • Probabilistic alternative to balanced BST
  • Multiple levels of sorted linked lists
  • Each element promoted to next level with probability p (usually 0.5)
  • O(log n) average search, insert, delete
  • Used by: Redis sorted sets (ZSETs), LevelDB, MemTable in Cassandra
  • Advantage over BST: simpler to implement, lock-free concurrent versions exist

Count-Min Sketch

  • Approximate frequency count for items in a stream
  • 2D array: d rows of hash functions × w counters per row
  • Update: increment d cells (one per row, chosen by that row's hash function)
  • Query: take the minimum of the d cells → upper bound on the true count
  • Never underestimates, may slightly overestimate
  • Used for: top-k heavy hitters, rate limiting by user, network flow analysis
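The update/query rules above are a few lines each. A minimal Python sketch (illustrative – salted SHA-256 stands in for the d hash functions; real sketches size d and w from desired error bounds):

```python
import hashlib

class CountMinSketch:
    """d rows × w counters; query returns the minimum across rows –
    an upper bound on the true count, never an underestimate."""
    def __init__(self, d=4, w=256):
        self.d, self.w = d, w
        self.table = [[0] * w for _ in range(d)]

    def _cols(self, item):
        for row in range(self.d):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count          # one cell per row

    def estimate(self, item):
        # Collisions only inflate counters, so min over rows bounds the truth
        return min(self.table[row][col]
                   for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for word in ["a"] * 5 + ["b"] * 3:
    cms.add(word)
print(cms.estimate("a") >= 5)  # True – estimates never undercount
```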

Consistent Hashing

  • Map keys and nodes onto a ring (0 to 2³² − 1)
  • Key → hash → find nearest node clockwise on the ring
  • Add/remove a node: only adjacent keys move (not all keys)
  • Virtual nodes (vnodes): each physical node = 100-200 points on the ring → better distribution
  • Adding 1 node to an N-node cluster: only ~1/N of the keys move (not all)
  • Used by: DynamoDB, Cassandra, Redis Cluster, CDN cache routing
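The "only ~1/N of keys move" claim is easy to verify. A Python ring with virtual nodes (a minimal sketch – MD5 truncated to 32 bits as the ring hash, a sorted list plus bisect for the clockwise lookup):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes: each physical node owns
    many points, so load spreads evenly and rebalancing stays local."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted((self._hash(f"{node}#{v}"), node)
                           for node in nodes for v in range(vnodes))

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

    def lookup(self, key):
        # Nearest ring point clockwise from the key's hash (wrap around)
        i = bisect.bisect(self.ring, (self._hash(key), chr(0x10FFFF)))
        return self.ring[i % len(self.ring)][1]

ring3 = HashRing(["a", "b", "c"])
ring4 = HashRing(["a", "b", "c", "d"])  # same cluster plus one new node
keys = [f"key{i}" for i in range(1000)]
moved = sum(ring3.lookup(k) != ring4.lookup(k) for k in keys)
print(moved)  # far fewer than 1000 – only keys adjacent to "d"'s points move
```

With naive modulo hashing (`hash(key) % N`), going from 3 to 4 nodes would remap roughly 3 out of every 4 keys; here only the arcs claimed by the new node move.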

Merkle Trees

  • Binary tree where each leaf = hash(data), each node = hash(left+right)
  • Root hash = fingerprint of entire dataset
  • Verify any subset of data without downloading all data
  • O(log n) proof size for single element membership
  • Used by: Git (tree objects), Bitcoin (transaction Merkle root in block header), Cassandra (anti-entropy repair), IPFS
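Computing the root fingerprint is a short recursion over levels. A Python sketch (illustrative – it pads odd levels by duplicating the last hash, one common convention; Git and Bitcoin each have their own serialization details):

```python
import hashlib

def merkle_root(leaves):
    """Hash each leaf, then repeatedly hash adjacent pairs together
    until a single root fingerprint remains."""
    level = [hashlib.sha256(leaf.encode()).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad an odd level by duplicating the last
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

txs = ["tx1", "tx2", "tx3", "tx4"]
root = merkle_root(txs)
# Any change to any leaf changes the root:
print(root != merkle_root(["tx1", "tx2", "tx3", "TAMPERED"]))  # True
```

This is what makes O(log n) membership proofs possible: to prove "tx3" is in the tree, you only need its sibling hashes along the path to the root, not the other transactions.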

Quiz – Probabilistic Data Structures

Q1. A Bloom filter says an element IS in the set. What can you conclude?

Q2. HyperLogLog is designed to solve which problem efficiently?

Q3. When consistent hashing adds 1 node to an N-node cluster, approximately what fraction of keys need to move?

05 · Encoding & Serialization Formats

JSON vs Protobuf vs Avro vs MessagePack

// JSON – text, human-readable
{"user_id": 12345, "name": "Alice", "active": true}
Pros: universal, debuggable, schema-optional
Cons: ~5-10x larger than binary, slow parse, type coercion issues (int vs float)
Use for: public APIs, config files, logs, debugging

// Protocol Buffers (Protobuf) – binary, schema-required
message User { uint64 user_id = 1; string name = 2; bool active = 3; }
Pros: 5-10x smaller than JSON, fast encode/decode, strong typing, backward compatible
Cons: not human-readable, schema required, field renaming is safe but type changes are not
// Backward compat rules: ADD fields (OK), REMOVE fields (OK if not required)
// CHANGE field type: NEVER. CHANGE field number: NEVER.
Use for: internal microservice communication, gRPC, mobile → backend

// Apache Avro – binary, schema in file header or registry
Pros: schema evolution via schema registry, Kafka-native format
Cons: schema must be available at read time (use a Schema Registry)
Use for: Kafka event streaming, Hadoop, data pipelines needing schema evolution

// MessagePack – binary JSON, no schema required
Pros: ~50% smaller than JSON, drop-in replacement for JSON, schema-optional
Use for: cache serialization where JSON is too slow

Protobuf Schema Evolution Rules

  • ✅ Add new optional fields (old code ignores unknown fields)
  • ✅ Remove fields (mark deprecated, don't reuse the field number)
  • ✅ Rename fields (wire format uses the field number, not the name)
  • ❌ Change field type (int32 → string breaks decoding)
  • ❌ Change field number (changes the binary wire format)
  • ❌ Change a repeated field to optional or vice versa

Data Compression

  • Snappy: fast compression/decompression, moderate ratio (~2x). Google default for Hadoop, Kafka.
  • LZ4: fastest decompression, lower ratio. Great for real-time streaming.
  • Zstd: best ratio for text/JSON (~3-4x), still fast. Modern default.
  • Gzip: slower, good ratio. Standard for HTTP responses.
  • Kafka: set compression.type=zstd for best throughput/storage balance

Base64 & Encoding Schemes

  • Base64: binary → printable ASCII (3 bytes → 4 chars, ~33% overhead)
  • Used: email attachments, JWT tokens, image data in JSON/HTML
  • URL-safe Base64: replace + and / with - and _ (no URL encoding needed)
  • Base62: URL shortener encoding – uses 0-9, a-z, A-Z, no special chars
  • Hex encoding: 1 byte → 2 hex chars (100% overhead). Used for hashes, UUIDs.

Character Encoding

  • ASCII: 128 chars, 7-bit, English only
  • UTF-8: variable width (1-4 bytes), backward compatible with ASCII, universal
  • UTF-16: 2-4 bytes, used in Java/JavaScript strings internally
  • UTF-32: fixed 4 bytes, simple but wasteful
  • Always use UTF-8 for network protocols and storage
  • Emoji: 4 bytes in UTF-8 – breaks naive string length counting
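The "length" of a string depends on what you count. A quick Python demonstration (Python's `len()` counts code points; Java and JavaScript count UTF-16 units instead, so the same emoji reports length 2 there):

```python
s = "héllo 😀"
print(len(s))                     # 7  code points
print(len(s.encode("utf-8")))     # 11 bytes (h=1, é=2, llo=3, space=1, emoji=4)
print(len("😀".encode("utf-8")))  # 4  – naive byte-length math breaks on emoji
print("é".encode("utf-8").hex())  # c3a9 – 2 bytes; plain ASCII stays 1 byte
```

This is why column widths, truncation, and byte-offset slicing must all be defined against a specific encoding, not "characters".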

Quiz – Encoding

Q1. In Protobuf, you need to rename a field. What must you NOT change?

Q2. For Kafka event streaming that needs schema evolution guarantees, which format is most appropriate?

06 · Distributed Clocks & Ordering

Lamport Timestamps & Vector Clocks

In a distributed system, there is no global clock. We need logical ordering to reason about causality.

// Lamport Timestamp
Each process maintains a counter L
Rules:
  Before any local event: L += 1
  Before sending a message: L += 1, attach L to the message
  On receiving message(m_time): L = max(L, m_time) + 1
Establishes "happens-before" (→) but cannot detect concurrency
If A → B then Lamport(A) < Lamport(B) – but NOT the reverse!
// Lamport(A) < Lamport(B) does NOT mean A happened before B
// Use vector clocks if you need to detect concurrency

// Vector Clock
Each process i maintains vector V[n] (one counter per process)
Send from i: V[i] += 1, attach V to the message
Receive at j: V[j][k] = max(V[j][k], V_msg[k]) for all k, then V[j][j] += 1
Comparison: VC_A < VC_B iff every element of A ≤ B (and at least one strictly <)
Concurrent: neither A ≤ B nor B ≤ A → concurrent events!
// Used in: DynamoDB (conflict detection), CRDTs, Riak
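The vector-clock comparison rule above is where conflicts get detected. A minimal Python comparator (illustrative – assumes both clocks are same-length lists indexed by process id):

```python
def vc_compare(a, b):
    """Classify two vector clocks: 'before', 'after', 'equal', or 'concurrent'."""
    le = all(x <= y for x, y in zip(a, b))  # a ≤ b element-wise
    ge = all(x >= y for x, y in zip(a, b))  # a ≥ b element-wise
    if le and ge:
        return "equal"
    if le:
        return "before"      # a causally precedes b
    if ge:
        return "after"       # b causally precedes a
    return "concurrent"      # neither dominates: a real conflict

print(vc_compare([1, 0, 0], [2, 1, 0]))  # before – causally ordered
print(vc_compare([2, 0, 0], [0, 0, 1]))  # concurrent – conflict to resolve
```

"Concurrent" is exactly the case a Lamport timestamp cannot distinguish, and it is the signal Dynamo-style stores use to keep both versions (siblings) for the application to reconcile.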

Physical Clocks vs Logical Clocks

  • Physical clocks (NTP): drift up to ms. Can go backwards after sync. Can't order events in same millisecond.
  • Logical clocks: no drift, purely causal ordering, no real-time information
  • Hybrid Logical Clocks (HLC): combines physical + logical. Tracks causality, close to wall clock. Used by CockroachDB.
  • TrueTime (Google Spanner): GPS + atomic clocks. Bounded uncertainty interval – waits out the uncertainty for strict serializability.

CRDTs – Conflict-Free Replicated Data Types

  • Data structures that can be merged without conflicts – always converge
  • G-Counter: only increment (no decrement). Merge = take max per node.
  • PN-Counter: two G-Counters (positive + negative). Net = P - N.
  • G-Set: add-only set. Merge = union.
  • LWW-Register: last-write-wins by timestamp.
  • Used in: Redis (some types), Riak, collaborative editing (Google Docs)

Distributed Transactions – Ordering

  • Serializability: transactions appear to execute in some serial order
  • Linearizability: real-time consistent reads (stronger than serializability)
  • Snapshot isolation: each transaction sees a consistent snapshot – allows write skew
  • Snapshot isolation + write skew: doctors on-call example (both check, both go off)
  • Spanner: external consistency (linearizability) across datacenters via TrueTime

Fencing Tokens

  • Problem: a distributed lock holder pauses (GC), the lock expires, another client acquires the lock, and the original resumes with a stale lock
  • Solution: fencing token – a monotonically increasing number issued with each lock grant
  • The storage service rejects writes carrying a token lower than one it has already seen
  • Even if the old client resumes, its stale token is rejected
  • Concept from Martin Kleppmann's DDIA – essential for safe distributed locking
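The whole mechanism is a single comparison on the storage side. A Python sketch (illustrative – class and method names are mine; in practice the token comes from the lock service, e.g. ZooKeeper's zxid):

```python
class FencedStorage:
    """Storage that rejects writes carrying a stale fencing token – protecting
    against a paused client that resumes after its lock has expired."""
    def __init__(self):
        self.max_token = -1
        self.data = {}

    def write(self, token, key, value):
        if token < self.max_token:
            return False            # stale lock holder: reject the write
        self.max_token = token
        self.data[key] = value
        return True

store = FencedStorage()
print(store.write(token=33, key="file", value="new"))  # True – current holder
print(store.write(token=32, key="file", value="old"))  # False – paused client rejected
```

Note the expiry of the lock alone is not enough: without the token check, the storage service has no way to tell the resumed zombie client from the legitimate holder.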

Quiz – Distributed Clocks

Q1. Vector clocks detect concurrent events by:

Q2. A CRDT G-Counter merge operation takes:

07 · Best Resources

BOOK
Designing Data-Intensive Applications – Martin Kleppmann

The bible for distributed systems. Chapters 5-9 are essential: replication, partitioning, transactions, consistency. Read this cover to cover.

INTERACTIVE
Raft Visualization – raft.github.io

Interactive animation of Raft leader election and log replication. 15 minutes here is worth more than any blog post.

VIDEO
Martin Kleppmann – Distributed Systems Lectures (Cambridge)

University lecture series. Lectures 5-10 cover vector clocks, consensus, consistency, transactions. Dense and rigorous.

VIDEO
MIT 6.824 – Distributed Systems (Raft lecture)

MIT course lecture on Raft. Best academic explanation of the algorithm and its safety proofs.

INTERACTIVE
Bloom Filter Tutorial – Interactive

Visual + mathematical explanation. Build one from scratch – 30 minutes, permanently understand it.

BLOG
Cloudflare – QUIC is Now RFC 9000

Best practical QUIC/HTTP3 explanation with real production numbers. Understand WHY QUIC fixes HTTP/2's problems.

BLOG
Meta Engineering – HyperLogLog in Practice

How Meta counts distinct users at scale. Production accuracy numbers and memory savings.