Core Concepts
- Vertical scaling: add CPU/RAM to one machine. Simple, no code change. Hard hardware ceiling (even the largest cloud instances top out at a few hundred vCPUs). SPOF.
- Horizontal scaling: add more machines. Requires stateless design, service discovery, load balancing.
- Stateless services: no local session — store state in Redis/DB so any instance can serve any request.
- Scale Cube (AKF): X=clone (horizontal), Y=functional decompose (microservices), Z=data partition (sharding)
- Bottleneck first: before scaling, profile — CPU, memory, I/O, or network? Scale the actual bottleneck.
When to Scale What
- CPU bound → vertical first, then horizontal with load balancer
- Memory bound → vertical or offload to Redis/external cache
- I/O bound → read replicas, caching, async processing
- Network bound → CDN, edge caching, compression
- DB bound → connection pooling, read replicas, sharding
- Stateful service → externalize state first, then scale horizontally
Key Trade-offs
Vertical: operationally simple, no distributed system complexity, but hard ceiling and SPOF. Horizontal: theoretically unlimited, fault tolerant, but requires distributed consensus, session externalization, consistent hashing for data locality. Most production systems do both: vertical for the database, horizontal for stateless API tiers.
Interview Approach
When asked "how do you scale X?": first identify the bottleneck. Don't say "add more servers" — say "I'd first profile to identify whether we're CPU, memory, or I/O bound. For the stateless API tier I'd scale horizontally behind an ALB. For the database I'd start with read replicas and connection pooling, escalating to sharding only when reads on replicas saturate."
L4 vs L7 Load Balancing
- L4 (Transport): routes by IP + TCP port, no payload inspection — fastest; typically passes TLS through rather than terminating it
- L7 (Application): routes by HTTP headers, path, cookie — enables path-based routing, A/B testing, auth
- AWS NLB = L4. AWS ALB = L7. HAProxy does both. Nginx = L7 (can do L4 with stream module).
- TLS termination at LB: offloads crypto from app servers, enables header inspection
Algorithms
- Round Robin: simple, equal distribution. Bad if requests vary in cost.
- Least Connections: send to server with fewest active connections. Best for variable-cost requests.
- IP Hash: same client → same server (sticky). Breaks with proxies/NAT.
- Consistent Hashing: minimize reshuffling when servers added/removed. Best for cache locality.
- Weighted: canary deployments — send 5% to new version.
Health Checks + Connection Draining
Active health checks: LB polls /health endpoint every 5s. Passive: detect failures from response codes. Connection draining: when removing a server, LB stops sending new connections but lets in-flight requests finish (30–60s grace period). Critical for zero-downtime deployments.
Common Interview Gotcha
Sticky sessions (session affinity) break horizontal scaling — if a server dies, all its sessions are lost. The correct answer: externalize sessions to Redis, make the service stateless, and use round-robin or least-connections. Only use sticky sessions when you can't externalize state (legacy systems).
Caching Strategies
- Cache-aside (Lazy): app checks cache first → miss → read DB → populate cache. Most common.
- Read-through: cache fetches from DB on miss automatically. App only talks to cache.
- Write-through: write to cache AND DB synchronously. Consistent, but doubles write latency.
- Write-behind (Write-back): write to cache only, flush to DB async. Fast writes, risk of data loss on crash.
- Refresh-ahead: proactively refresh cache before TTL expires on popular keys.
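The cache-aside flow above can be sketched in a few lines; a plain dict stands in for Redis here, and `db` / `get_user` are illustrative names, not a real client API:

```python
import time

# In-memory stand-ins for Redis and the database (illustrative only)
cache: dict[str, tuple[object, float]] = {}   # key -> (value, expiry_ts)
db = {"user:42": {"name": "Ada"}}
TTL_SECONDS = 300

def get_user(key: str):
    """Cache-aside: check cache first; on miss, read DB and populate cache."""
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                                   # cache hit
    value = db.get(key)                                   # miss -> read DB
    if value is not None:
        cache[key] = (value, time.time() + TTL_SECONDS)   # populate cache
    return value
```

The second call for the same key is served from the dict without touching `db` — the same shape applies with a real Redis GET/SETEX pair.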
Redis Data Structures
- String: counters, session tokens, simple cache values
- Hash: user profile fields — partial updates without fetching all
- List: message queues, activity feeds (LPUSH/RPOP)
- Set: unique visitors, tags, friend lists (union/intersection)
- Sorted Set (ZSET): leaderboards, rate limiting, priority queues
- HyperLogLog: approximate unique count — 12KB for billions of items
- Streams: durable event log — lighter-weight Kafka alternative
Cache Invalidation — "Two Hard Problems"
TTL-based: simple, eventual consistency acceptable. Event-driven: write → Kafka event → cache invalidation consumers. More consistent but complex. Versioned keys: user:42:v3 — change version on update, old key orphans and expires. Cache stampede (thundering herd): many requests miss simultaneously and hammer DB. Solutions: mutex lock (only one request rebuilds), probabilistic early expiry (PER algorithm), request coalescing.
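The mutex-lock fix for a stampede can be sketched with a process-local lock standing in for a distributed lock (a real deployment would use e.g. Redis SET NX with a TTL); `load_from_db` is an illustrative stand-in for the expensive query:

```python
import threading, time

cache: dict[str, object] = {}
rebuild_lock = threading.Lock()
db_calls = 0

def load_from_db(key: str) -> str:
    global db_calls
    db_calls += 1            # count DB hits to show the stampede is collapsed
    time.sleep(0.05)         # simulate a slow query
    return f"value-for-{key}"

def get(key: str):
    """Stampede protection: only one caller rebuilds on a miss."""
    if key in cache:
        return cache[key]
    with rebuild_lock:       # single rebuilder; the rest block briefly
        if key in cache:     # double-check: another thread may have filled it
            return cache[key]
        cache[key] = load_from_db(key)
        return cache[key]

# 20 concurrent misses on the same hot key -> exactly one DB query
threads = [threading.Thread(target=get, args=("hot",)) for _ in range(20)]
for t in threads: t.start()
for t in threads: t.join()
```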
Multilevel Caching
L1 = in-process memory (microseconds, limited size, per-instance). L2 = Redis cluster (milliseconds, shared, bounded). L3 = DB read replica. Mention hot keys: a single extremely popular key (celebrity post) can overwhelm one Redis shard — solutions: local replica per app server, key sharding by appending suffix.
CAP Theorem
- Consistency: every read sees the most recent write
- Availability: every request gets a (non-error) response
- Partition Tolerance: system continues despite network splits
- Reality: P always occurs in distributed systems → choose C or A during partition
- CP systems: HBase, ZooKeeper, etcd, MongoDB (strong mode)
- AP systems: Cassandra, DynamoDB, CouchDB, Redis (default)
Consistency Models (Spectrum)
- Linearizability: single-node illusion — reads always see latest write
- Sequential: all ops appear in some consistent order
- Causal: causally related operations appear in order
- Read-your-writes: you always see your own writes (session)
- Monotonic reads: once you read a value, you won't see older values
- Eventual: all replicas converge if no new writes — lowest consistency
PACELC — More Practical Than CAP
PACELC extends CAP: "If Partition (P), choose between Availability (A) and Consistency (C). Else (E) — even without partition — choose between Latency (L) and Consistency (C)." Most distributed systems trade off latency vs consistency ALL the time, not just during partitions. DynamoDB: PA/EL (available, low latency). Spanner: PC/EC (consistent, higher latency). Understanding PACELC shows deeper distributed systems thinking.
Interview Signal
Never just say "I'd use CAP theorem." Say: "For a payment system, I need CP — a failed transaction is better than a duplicate credit. I'd accept reduced availability during network partitions and use circuit breakers to fail fast. For a social media feed, AP is fine — showing a slightly stale post is acceptable, and I'd use eventual consistency with short TTLs."
Algorithms Compared
- Fixed Window: count resets at interval boundary. Simple. Burst attack possible at boundary edges.
- Sliding Window Log: log timestamps of each request in Redis sorted set. Precise. Memory = O(requests in window).
- Sliding Window Counter: interpolate between current and previous window counts. Memory efficient, ~0.003% error rate.
- Token Bucket: bucket refills at rate R, max burst = bucket size. Allows controlled bursting. AWS API Gateway uses this.
- Leaky Bucket: requests queued, processed at constant rate. Smooths traffic. Queue fills under burst — requests dropped.
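The token bucket above fits in a small class; the clock is injected as a parameter so refill behavior is deterministic and testable (a sketch, not AWS's implementation):

```python
class TokenBucket:
    """Refills at `rate` tokens/sec up to `capacity` (capacity = max burst)."""
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # spend one token for this request
            return True
        return False
```

With rate=1 and capacity=2, two requests at t=0 pass (the burst), the third is rejected, and by t=1.5 a refilled token lets the next one through.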
Distributed Rate Limiting
- Single Redis node: atomic Lua script — INCR + EXPIRE in one round-trip
- Redis Cluster: consistent hash user_id to shard — same user always hits same node
- Sliding window with ZSET: ZADD + ZREMRANGEBYSCORE + ZCARD atomically
- Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After
- Return HTTP 429 Too Many Requests with Retry-After header
- Different limits by tier: free (100/min), pro (1000/min), enterprise (unlimited)
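The ZSET sliding-window recipe above can be simulated with a sorted Python list standing in for the sorted set (in production the three steps run atomically in a Lua script; `WINDOW`/`LIMIT` values are illustrative):

```python
import bisect

WINDOW = 60.0   # seconds
LIMIT = 100

# Per-user timestamp logs; a sorted list stands in for a Redis ZSET
logs: dict[str, list[float]] = {}

def allow(user_id: str, now: float) -> bool:
    log = logs.setdefault(user_id, [])
    # ZREMRANGEBYSCORE: drop timestamps that fell out of the window
    cutoff = now - WINDOW
    del log[:bisect.bisect_right(log, cutoff)]
    # ZCARD: count requests still inside the window
    if len(log) >= LIMIT:
        return False
    # ZADD: record this request
    bisect.insort(log, now)
    return True
```

The memory cost matches the note above: one entry per request in the window, per user.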
CDN Internals
- Edge PoPs: globally distributed cache nodes — serve content from nearest geography
- Cache-Control headers: public, max-age=86400 tells CDN to cache for 1 day
- Pull CDN: first request goes to origin, CDN caches response. Most common.
- Push CDN: upload assets proactively to CDN edge nodes. Better for static files that don't change.
- Origin Shield: intermediate layer between CDN edges and origin — collapses cache misses from multiple PoPs into one origin request. Reduces origin load ~90%.
- Cache invalidation: URL versioning (app.v3.js) or API purge (CDN purge API on deploy).
DNS + Proxies
- DNS resolution chain: Browser cache → OS cache → Recursive resolver → Root NS → TLD NS → Authoritative NS
- TTL strategy: low TTL (60s) before blue-green switch, high TTL (3600s) normal operation
- GeoDNS: return different IP based on client region — Route53 latency-based routing
- Forward proxy: client-side (corp proxy, VPN) — client knows about it
- Reverse proxy: server-side (Nginx, Cloudflare) — client doesn't know
- API Gateway: reverse proxy + auth + rate limiting + routing + observability
Full Request Flow — Memorize This
Client → DNS lookup (~30ms first time, cached after) → CDN edge (cache HIT → ~5ms, done). If MISS → Load Balancer (~1ms) → API Gateway (auth, rate limit, ~5ms) → Service → Cache check (Redis HIT ~1ms, done). If MISS → Database query (~5–20ms) → populate cache → respond. Total p50: 50–100ms. p99 bottleneck is almost always the DB or a missing cache layer.
REST — Principles & Best Practices
- Resource-based URLs: nouns not verbs — /users/42/orders not /getUserOrders
- HTTP verbs: GET (read), POST (create), PUT (replace), PATCH (partial update), DELETE (remove)
- Status codes: 200 OK, 201 Created, 204 No Content, 400 Bad Request, 401 Unauth, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable, 429 Too Many, 500 Server Error
- Idempotency: GET/PUT/DELETE are idempotent. POST is not. Use idempotency keys for POST payments.
- HATEOAS: responses include links to related actions — reduces client coupling
GraphQL vs REST vs gRPC
- REST: simple, cacheable, universal client support. Overfetch/underfetch problem.
- GraphQL: client specifies exact fields needed — eliminates over/under-fetch. Single endpoint. Complex queries, no HTTP caching by default. Best for: frontend-heavy apps, mobile (save bandwidth), multiple clients with different data needs.
- gRPC: binary protocol (Protobuf), strongly typed, bidirectional streaming, HTTP/2. 5-10x faster than REST/JSON. Best for: internal microservice-to-microservice, streaming data, polyglot services.
- WebSocket: persistent full-duplex. Use when server must push to client in real time.
API Versioning Strategies
- URL versioning: /api/v1/users — most visible, easy to route. Industry default.
- Header versioning: Accept: application/vnd.api+json;version=2 — cleaner URLs, harder to test in browser
- Query param: /api/users?version=2 — simple but pollutes query space
- Never break existing clients — add fields, never remove/rename
- Deprecation: return Sunset header with deprecation date
- Maintain N-1 versions minimum. Remove after migration period.
Idempotency & Safety
- Idempotency key: client sends unique key with POST request (UUID)
- Server caches response keyed by idempotency key for 24h
- On retry: return cached response — no duplicate charge/order
- Stripe, PayPal, Braintree all use this pattern for payments
- Key header: Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
- Store key → response in Redis with TTL. Check before processing.
API Authentication & Authorization
- API Key: simple, server-side secret. No expiry mechanism. Best for server-to-server.
- JWT (Bearer token): stateless, self-contained claims. Verify signature without DB call. Short expiry (15min) + refresh token (7 days).
- OAuth 2.0: delegated authorization. User grants app access to their data on another service. Authorization Code flow for web, PKCE for mobile/SPA.
- mTLS: mutual TLS — both client and server present certificates. For internal service-to-service in zero-trust networks.
- HMAC signatures: sign request body with shared secret — webhooks, S3 presigned URLs.
API Error Design
- Return structured errors — not just HTTP status code
- Include: error code (machine-readable), message (human), details array, request_id for tracing
- Never expose stack traces or internal DB errors to clients
- Use the RFC 7807 Problem Details standard (superseded by RFC 9457, same format) for REST APIs
- Consistent envelope: {"error":{"code":"RATE_LIMITED","message":"...","retry_after":47}}
- Distinguish client errors (4xx) from server errors (5xx) — client errors are caller's fault
API Design Decision Framework
Public API or 3rd party developers? → REST. Universally understood, easy to test with curl, HTTP caching works naturally.
Mobile app with bandwidth concerns? → GraphQL. Client specifies exactly what fields it needs — no over-fetching.
Internal microservices, high throughput? → gRPC. Binary protocol, streaming support, strongly typed contracts catch breaking changes at compile time.
Real-time bidirectional? → WebSocket (full-duplex persistent) or SSE (server-push only, simpler).
Multiple frontend clients (web, mobile, IoT) with different data needs? → GraphQL with a BFF (Backend for Frontend) layer.
API Design Interview Tips
Always define the API before drawing the architecture. Walk through: (1) What are the key resources? (2) What operations does each resource support? (3) What are the access patterns — who calls what, how often? (4) What are the consistency requirements for each endpoint? Showing API-first thinking is a strong signal — it means you think about contracts and client usability, not just internal implementation.
🎯 Week 1 Checkpoint
- Draw the full request flow (Client → CDN → LB → Gateway → Service → Cache → DB) from memory with latency at each hop
- Design a REST API for a ride-booking app — define 5 endpoints with correct verbs, status codes, and pagination
- Explain when you'd choose gRPC over REST and vice versa
- Design a distributed rate limiter for 50K RPS — which algorithm and why?
- Given a payment system, which CAP trade-off do you make and why?
SQL Strengths
- ACID transactions — critical for financial systems, inventory, reservations
- Complex joins and aggregations across normalized data
- Schema enforcement = data integrity at DB level
- Mature tooling: EXPLAIN plans, index advisors, migrations
- Strong consistency by default
- Best for: payments, e-commerce orders, user accounts, ERP, anything relational
NoSQL Categories
- Key-Value (Redis, DynamoDB): O(1) get/set, simple access patterns, no joins. Best for: sessions, cache, counters.
- Document (MongoDB, Firestore): flexible schema, nested JSON. Best for: catalogs, CMS, user profiles.
- Wide-Column (Cassandra, HBase): write-optimized LSM tree, massive scale, tunable consistency. Best for: time-series, IoT, audit logs.
- Graph (Neo4j, Amazon Neptune): relationship traversal. Best for: social networks, fraud detection, recommendations.
- Search (Elasticsearch, OpenSearch): inverted index, full-text + faceted search.
Decision Framework — Never "I'd use MongoDB because it's flexible"
Ask: (1) Do I need multi-entity ACID transactions? → PostgreSQL. (2) Do I need complex multi-table joins? → SQL. (3) Is write throughput > 50K/s with simple access patterns? → Cassandra/DynamoDB. (4) Flexible schema, document-centric reads, no complex joins? → MongoDB. (5) Cache or session? → Redis. (6) Full-text search? → Elasticsearch alongside primary DB. (7) Graph traversal? → Neo4j. The key signal: which queries will this DB serve, and can it serve them efficiently?
B-Tree (PostgreSQL, MySQL)
- Self-balancing tree, O(log n) reads and writes
- Pages: typically 8KB, tuned for disk block size
- Good for mixed read/write workloads
- Random writes cause write amplification (update internal nodes)
- In-place updates — good for read-heavy workloads
LSM Tree (Cassandra, RocksDB, LevelDB)
- Write to in-memory MemTable first — O(1) sequential write
- MemTable flushed to immutable SSTable on disk when full
- Compaction merges SSTables — removes tombstones, sorts keys
- Bloom filter per SSTable: skip SSTables that don't contain key
- O(1) writes, O(log n) reads. Better for write-heavy workloads.
- Write amplification during compaction. Read amplification without bloom filter.
MVCC + WAL — How PostgreSQL Works
MVCC (Multi-Version Concurrency Control): each write creates a new row version with a transaction timestamp. Readers see a snapshot at their transaction start time — no blocking. Old versions cleaned up by VACUUM. Enables non-blocking reads alongside concurrent writes.
WAL (Write-Ahead Log): every change is written to the WAL log before it's applied to data pages. On crash: replay WAL to recover. Also enables streaming replication — ship WAL segments to replicas. PostgreSQL logical replication: replays WAL as SQL changes, enabling cross-version replication and CDC (Debezium).
Sharding Strategies
- Hash sharding: hash(key) % N → shard. Uniform distribution, no range queries across shards.
- Range sharding: shard by key range (A-M, N-Z). Range queries efficient. Hotspot risk if data is skewed.
- Directory sharding: lookup table maps key → shard. Most flexible, single point of failure.
- Geo sharding: shard by region/country. Data locality, GDPR compliance.
- Consistent hashing: hash ring — adding/removing nodes minimizes key movement. Used by DynamoDB, Cassandra.
Consistent Hashing Deep Dive
- Place nodes on hash ring (0 to 2³² − 1, wrapping around)
- Key → hash → find nearest node clockwise on ring
- Add node: only neighbors' keys move (≈ k/n keys)
- Remove node: only that node's keys move to next node
- Virtual nodes (vnodes): each physical node = 100-200 points on ring → better, even distribution even with heterogeneous hardware
- Used by: DynamoDB, Cassandra, Redis Cluster, CDN cache routing
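The ring with vnodes can be sketched in ~25 lines using MD5 as an illustrative hash (real systems pick their own hash function; 100 vnodes per node matches the range noted above):

```python
import bisect, hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (vnodes)."""
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.vnodes = vnodes
        self.ring: list[tuple[int, str]] = []   # sorted (hash, node) points
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        # Each physical node contributes `vnodes` points on the ring
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def lookup(self, key: str) -> str:
        # Nearest vnode clockwise; modulo wraps around at the end of the ring
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

Adding a fourth node to a three-node ring moves only roughly a quarter of the keys, which is the whole point versus hash(key) % N.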
Shard Key Selection — The Most Critical Decision
Bad shard key = hotspots = one shard does all work while others are idle. For social network: user_id hash works for regular users but celebrities cause hot shards. Solution: celebrity detection + separate handling (fan-out on read for celebrities). For time-series: time-based shard key causes write hotspot on current time shard. Solution: composite key (time + device_id). Rule: choose a key with high cardinality AND even distribution across your access patterns. Avoid sequential keys (auto-increment IDs) — they create monotonic hotspots.
Replication Models
- Single-leader: all writes to primary, replicated to read replicas. Simple. Primary is write bottleneck.
- Multi-leader: writes accepted at multiple nodes. Conflict resolution required. Useful for multi-datacenter active-active.
- Leaderless (Dynamo-style): any node accepts writes. Quorum-based consistency.
- Sync replication: write confirmed only after replica ACKs. Durable but adds latency.
- Async replication: write confirmed immediately, replica catches up. Low latency but possible data loss on primary crash.
- Semi-sync: at least one replica must ACK. Balance between durability and latency. Available in MySQL via the semisync plugin (async is the default).
Quorum Reads/Writes
- n = total replicas, w = write quorum, r = read quorum
- Strong consistency: w + r > n (overlap guarantees you read the latest write)
- Typical: n=3, w=2, r=2 — tolerates 1 failure, strongly consistent
- Write-heavy: w=1, r=3 (fast writes, consistent reads)
- Read-heavy: w=3, r=1 (durable writes, fast reads)
- Availability-first: w=1, r=1 (AP system — eventual consistency)
- Cassandra: tunable per query — LOCAL_QUORUM for multi-DC consistency
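The overlap rule above reduces to a one-line check, which covers every configuration in the list:

```python
def strongly_consistent(n: int, w: int, r: int) -> bool:
    """Quorum overlap: every read set intersects every write set iff w + r > n."""
    return w + r > n
```

n=3, w=2, r=2 passes; the availability-first w=1, r=1 config does not, which is exactly the eventual-consistency trade-off noted above.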
Leader Election — How etcd/ZooKeeper Handle It
Raft consensus: nodes time out waiting for heartbeat → become candidate → request votes → win majority → become leader. Election timeout is randomized (150–300ms) to prevent split votes. Fencing tokens prevent split-brain: new leader issues higher token; storage layer rejects writes with lower token. PostgreSQL failover (Patroni): uses etcd/Consul for leader election, switches DNS/VIP to new primary. Zero-downtime failover in ~30s.
Index Types
- B-Tree (default): equality + range queries. Works with LIKE 'prefix%'.
- Hash: equality only — slightly faster for exact match, no range support
- Composite: (col_a, col_b, col_c). Leftmost-prefix rule: must include leading columns in query to use index.
- Covering index: query fully served from index without table fetch (index includes all selected columns)
- Partial index: index on subset of rows. WHERE active = true — smaller, faster for frequent filtered queries.
- Full-text: inverted index with TF-IDF ranking. Use Elasticsearch for serious full-text search.
- GIN / GiST: PostgreSQL indexes for arrays, JSONB, full-text, geometric types.
Query Optimization Signals
- EXPLAIN ANALYZE: look for Seq Scan on large tables → missing index
- Index selectivity: high cardinality columns make better indexes (email > gender)
- N+1 query problem: 1 query for list + N queries for each item. Fix with JOIN or batch load.
- Connection pooling: PgBouncer in transaction mode — 10K app connections → 100 real DB connections
- Read replicas for analytics/reporting — separate from OLTP primary
- Avoid SELECT * — select only needed columns (reduces I/O, enables covering indexes)
- Pagination: use keyset pagination (WHERE id > last_id) not OFFSET for large tables
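Keyset pagination can be demonstrated end to end with stdlib sqlite3; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1, 101)])

def page(last_id: int, size: int = 10):
    """Keyset pagination: seek past last_id via the PK index, no OFFSET scan."""
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, size)).fetchall()

first = page(0)                 # ids 1..10
second = page(first[-1][0])     # client passes back the last id it saw
```

Unlike OFFSET 10, the second page is an index seek, so cost stays flat no matter how deep the client paginates.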
🎯 Week 2 Checkpoint
- When would you choose LSM tree over B-Tree? Articulate write amplification trade-off.
- Design a sharding strategy for a payment system with high-value accounts — what's your shard key and why?
- Explain MVCC and why it enables non-blocking reads without SELECT FOR SHARE locks
- w=2, r=2, n=3 — is this strongly consistent? What about w=1, r=1, n=3?
- What's the N+1 problem? Show the fix with a query example.
Kafka Architecture
- Topic → Partitions (ordered, immutable append-only log)
- Offset: per-partition monotonic counter. Consumer tracks its position.
- Consumer Group: each partition consumed by exactly one consumer in the group
- Replication factor: ISR (In-Sync Replicas) — leader + N replicas
- Leader election via KRaft (Kafka Raft — replaces ZooKeeper since Kafka 3.3)
- Retention: time-based (7 days default) or size-based
- Log compaction: keep only latest value per key — good for state stores
Performance Tuning Knobs
- Partition count: parallelism ceiling for consumers — can't have more consumers than partitions
- acks=0: fire and forget (fastest, data loss). acks=1: leader ack. acks=all: all ISR ack (slowest, most durable).
- linger.ms + batch.size: batch small messages — increases throughput, adds latency
- compression.type: lz4 (speed) or zstd (ratio). Apply at producer level.
- max.poll.interval.ms: if consumer doesn't poll in time → rebalance triggered
- Partition key: determines which partition a message lands on — use user_id for ordering per user
Exactly-Once Semantics (EOS)
Kafka guarantees: at-least-once by default (consumer may process duplicate on crash). For exactly-once: (1) Idempotent producer: enable.idempotence=true — sequence numbers deduplicate retries at broker. (2) Transactional API: atomic produce to multiple topics + consumer offset commit in one transaction. In practice: idempotent consumers (dedup on DB side using unique constraint on idempotency key) are simpler and more resilient than Kafka EOS transactions.
Ordering Nuance
Kafka guarantees ordering per partition only — not across partitions. For strict ordering of all events for a user: partition by user_id → all events for that user land in same partition in order. Cross-user ordering is not guaranteed. For globally ordered events (rare): use a single partition (loses parallelism) or use application-level sequence numbers.
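Partition-by-key is just a deterministic hash mod partition count; Kafka's default partitioner uses murmur2, so the MD5 here is an illustrative stand-in that shows the property (same key, same partition):

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(user_id: str) -> int:
    """Keyed partitioning: one user's events always land on one partition,
    so they are consumed in order relative to each other."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

Two events for user-42 are guaranteed to share a partition; events for different users may land anywhere, which is why cross-user ordering is not guaranteed.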
2-Phase Commit (2PC)
- Coordinator: send PREPARE → all participants vote YES/NO
- If all YES → send COMMIT to all. If any NO → send ABORT.
- Blocking protocol: if coordinator crashes after PREPARE, participants are stuck (holding locks)
- Used in distributed SQL (Spanner uses 2PC + Paxos). Avoid in microservices — too much coupling.
- 3-Phase Commit (3PC) adds pre-commit phase to reduce blocking — rarely used in practice
Saga Pattern
- Break distributed transaction into sequence of local transactions
- Each step publishes an event that triggers the next service
- On failure: compensating transactions undo previous steps in reverse order
- Choreography: services react to events autonomously — no central coordinator. Simple, but harder to track overall state.
- Orchestration: central saga orchestrator sends commands to each service. Easier to monitor and debug. Coupling to orchestrator.
Outbox Pattern — The Production-Grade Dual-Write Fix
Problem: write to DB and publish to Kafka — if Kafka publish fails after DB commit, event is lost. If you roll back on Kafka failure, you lose the DB write. This is the dual-write problem.
Solution — Outbox pattern: write to DB + Outbox table in same transaction (atomic). Debezium watches the WAL for Outbox table inserts and publishes to Kafka. Atomicity guaranteed by DB transaction. Eventual delivery guaranteed by CDC. This is the correct production approach for any service that writes to DB AND publishes events.
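The atomic half of the pattern can be shown with stdlib sqlite3 (table names are illustrative; in production a CDC tool like Debezium would tail the WAL and publish the outbox rows):

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: int, total_cents: int):
    """Business row and its event written in ONE transaction: no dual-write gap."""
    with conn:   # sqlite3 context manager commits both inserts or neither
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("orders.created", json.dumps({"id": order_id})))

place_order(1, 4999)
```

Either both rows commit or neither does; the event can never be lost after the order exists, and a crashed publisher simply re-reads unpublished outbox rows.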
Protocol Comparison
- Long Polling: client sends request → server holds until event or timeout → client immediately re-requests. Latency = server hold time. High connection overhead. Use only for legacy clients.
- SSE (Server-Sent Events): persistent HTTP connection, server streams events to client. One-way (server → client only). HTTP/1.1 compatible. Good for: live feeds, notifications, progress bars.
- WebSocket: full-duplex persistent TCP connection. Both sides can send at any time. Best for: chat, collaborative editing, gaming, live dashboards. Uses HTTP Upgrade handshake.
- WebTransport (QUIC): emerging standard. Datagrams + streams over HTTP/3. Better for real-time gaming, video conferencing.
WebSocket at Scale
- Go handles 100K+ concurrent WS connections per server with goroutines
- Connection server is stateful — can't naively load balance (sticky sessions needed at WS tier)
- Pub-sub backend: connection servers subscribe to Redis Pub-Sub or Kafka topics
- Fan-out: message → Redis Pub-Sub → all connection servers for recipient's room/channel → push to client
- Presence tracking: heartbeat every 15s → Redis key with 30s TTL. Key expiry = user offline.
- Reconnect: client tracks last received sequence number → reconnect with offset → server replays missed messages
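The reconnect-and-replay step can be sketched as a bounded per-channel buffer keyed by sequence number (sizes and names are illustrative):

```python
from collections import deque

class ReplayBuffer:
    """Per-channel ring buffer: replays messages after last_seq on reconnect."""
    def __init__(self, capacity: int = 1000):
        self.buf: deque[tuple[int, str]] = deque(maxlen=capacity)
        self.seq = 0

    def publish(self, payload: str) -> int:
        self.seq += 1                      # monotonic per-channel sequence
        self.buf.append((self.seq, payload))
        return self.seq

    def replay_after(self, last_seq: int) -> list[tuple[int, str]]:
        # Client reconnects with the last sequence number it received
        return [(s, p) for s, p in self.buf if s > last_seq]
```

If the client was disconnected after seq 1, it reconnects with last_seq=1 and receives everything published since; a client that missed more than `capacity` messages would need a full resync from the store of record.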
Metrics (Prometheus + Grafana)
- RED method: Rate (requests/sec), Errors (error rate %), Duration (latency p50/p99)
- USE method: Utilization, Saturation, Errors — for infrastructure resources
- Histogram: p50, p90, p99 latency. Never use averages — they hide outliers.
- Cardinality trap: avoid high-cardinality labels (user_id, request_id) — explodes Prometheus memory
- SLI/SLO/SLA: SLI=measure (p99 latency), SLO=target (p99 < 200ms 99.9% of time), SLA=contractual guarantee
Distributed Tracing (Jaeger, Tempo)
- Trace: tree of spans across services for one request
- Span: single operation with start/end time, service name, tags
- Propagate: W3C TraceContext header (traceparent) across service calls
- Sampling: head-based (random %) vs tail-based (on error/slow). Tail-based catches all slow/error requests.
- OpenTelemetry: vendor-neutral instrumentation standard. Collect once, export to Jaeger/Tempo/Datadog.
Structured Logging Best Practices
Always log JSON in production. Include: timestamp, level, service, trace_id, span_id, user_id (when applicable), message, and relevant context fields. Never log PII (emails, passwords, payment cards). Use log levels correctly: DEBUG (dev only), INFO (normal ops), WARN (unexpected but handled), ERROR (needs attention), FATAL (crash). Centralize with ELK (Elasticsearch + Logstash + Kibana) or Loki + Grafana. Correlate logs with traces via trace_id for end-to-end debugging.
🎯 Week 3 Checkpoint
- Explain the Outbox pattern to a junior engineer — why is it needed and how does it work?
- When does Kafka guarantee ordering? When does it not?
- Design a presence system for a chat app with 10M concurrent users — connection tracking, heartbeat, fan-out
- Saga choreography vs orchestration — what are the debugging trade-offs?
- What's the difference between p50 and p99 latency? Why does p99 matter more for user experience?
Before Week 4 — How To Read Any Full System Design
- Step 1: identify the user action. "Send message", "watch video", "request ride", "open feed".
- Step 2: draw the write path first. This is usually where durability, ordering, and fan-out decisions live.
- Step 3: draw the read path separately. Reads often use different stores, caches, or denormalized views.
- Step 4: find the state owner. For each entity, say which component is the source of truth.
- Step 5: ask what is synchronous vs asynchronous. This reveals latency and coupling trade-offs.
- Step 6: find the hottest key, shard, queue, or celebrity-case bottleneck.
- Step 7: end with failure handling: retries, dedup, replay, reconciliation, observability.
Key Design Decisions
- Cassandra: partition by chat_id, cluster by timestamp DESC — efficient time-ordered reads per chat
- Message ID: Snowflake (timestamp + server + seq) — globally unique, sortable, no coordination
- E2E encryption: Signal Protocol — keys exchanged out-of-band, server stores only ciphertext
- WebSocket server → stateful, use consistent hashing to route user's connections to same server
- Message state machine: SENT → DELIVERED (ACK from recipient device) → READ (read receipt)
- Offline delivery: store in Cassandra + push notification. On reconnect, fetch unread since last_seen_timestamp.
Group Messages — Fan-Out Strategy
- Small groups (<100 members): fan-out on write — write message to each member's inbox
- Large groups: fan-out on read — members pull from shared group inbox on open
- Bounded fan-out: limit group size (WhatsApp: 1024) to cap write amplification
- Group membership: cached in Redis, backed by PostgreSQL
- Delivery receipt for groups: message delivered when ALL members receive (complex)
Scale Numbers to Reason With
- 500 hours of video uploaded per minute
- 1 hour of video → ~5GB raw → ~1.5GB after compression across qualities
- Transcoding: parallelize per video segment (5s chunks) — 1 hour video → 720 chunks → 720 parallel jobs
- CDN: 95%+ of views served from edge — origin only serves cache misses
- View count: approximate with Redis INCR per shard, aggregate periodically. Exact count: HyperLogLog for unique viewers.
Recommendation System (Offline ML)
- Watch history + clicks → feature store → ML training pipeline (Spark)
- Pre-compute recommendations per user nightly → store in Redis/DynamoDB
- Real-time signals: last 10 watch actions → re-score with lightweight model
- Cold start: new users get popular videos by category until enough signal
- Candidate generation → ranking → filtering (already watched, policy) → serve
Location Service
- Driver sends location every 4 seconds via WebSocket
- Geohash: encode lat/lng as string prefix (shorter string = larger cell)
- Redis: GEOADD key lng lat driver_id. GEOSEARCH for radius queries.
- Or: geohash prefix → Redis SET of driver IDs in that cell. Query neighboring cells.
- Location history: write to Cassandra for trip replay and billing
Matching Service
- Rider requests → query nearby available drivers (geosearch)
- Score drivers: distance + rating + vehicle type + surge zone
- Dispatch to top N drivers simultaneously (first to accept wins)
- Distributed lock on driver during dispatch window (Redis SET NX PX 30000)
- ETA: precomputed road graph (OSRM or Google Maps Platform)
- Surge pricing: supply/demand ratio per geohash cell, multiplier applied at pricing service
URL Shortener
- Short code generation: base62(auto-increment ID) — simple, unique, sortable
- Alternative: MD5(url)[0:7] — collision check required, no guarantee of uniqueness
- Redirect: 301 (permanent — browser caches, saves future requests) vs 302 (temporary — allows analytics tracking)
- Storage: DynamoDB (short_code → long_url + metadata). 100M URLs/day → Cassandra for write throughput.
- Cache hot redirects in Redis: top 10% of links → 90%+ of traffic
- Analytics: log redirect events to Kafka → Spark/Flink pipeline → ClickHouse for querying
- Custom domains: DNS CNAME to CDN → CDN routes to redirect service
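The base62(auto-increment ID) scheme from the first bullet is a few lines; 7 characters of base62 cover 62⁷ ≈ 3.5 trillion IDs:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    """Encode an auto-increment ID as a short code (digits, then a-z, then A-Z)."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))
```

Because the input is a unique DB-assigned ID, codes are unique by construction, with none of the collision checking the MD5-prefix alternative needs.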
Redis Cluster Architecture
- Hash slots: 16384 slots distributed across nodes. CRC16(key) % 16384 → slot → node.
- Each master has N replicas. On master failure a replica is auto-promoted (Sentinel for standalone deployments; failover is built into Cluster mode).
- Hot key problem: single key too popular → one slot hot. Solution: replicate hot key across shards with suffix (user:42:0, user:42:1).
- CLUSTER KEYSLOT cmd: check which slot a key maps to
- Cross-slot transactions: not supported — use hash tags ({user:42}.profile and {user:42}.orders hash to the same slot) to co-locate related keys, or run a Lua script against keys in one slot
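The hot-key suffix trick from above, sketched against a plain dict (the fan-out factor is an assumed tuning knob; in Redis Cluster each suffixed key hashes to a different slot, so reads spread across nodes):

```python
import random

FANOUT = 4  # number of replicas of the hot key — assumed tuning knob

def hot_key_write(cache: dict, key: str, value) -> None:
    """Write the same value under N suffixed keys (user:42:0 .. user:42:3)
    so the read load lands on N slots instead of one."""
    for i in range(FANOUT):
        cache[f"{key}:{i}"] = value

def hot_key_read(cache: dict, key: str):
    """Each reader picks a random replica — load spreads roughly evenly."""
    return cache[f"{key}:{random.randrange(FANOUT)}"]
```

The cost is write amplification (N writes per update) and a brief window where replicas can disagree, so this suits read-heavy, slightly-stale-tolerant data like profiles or counters.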
Fan-Out on Write (Push)
- On tweet: push tweet_id to all followers' feed caches (Redis ZSET, score=timestamp)
- O(followers) work at write time. Read is O(1) — just fetch pre-built feed.
- Celebrity problem: user with 50M followers = 50M Redis writes per tweet
- Good for: regular users (<1M followers) — fast feed reads
Fan-Out on Read (Pull)
- On feed load: fetch recent tweets from all followed users, merge + sort
- Heavy reads but no write amplification for celebrities
- N followed users → N DB queries → merge sort → paginate. Slow for users following many accounts.
- Good for: celebrities — no write amplification; each follower pays a small cost at read time instead
Hybrid Approach — The Actual Twitter Answer
Classify users by follower count. Regular users (<1M followers): fan-out on write to Redis ZSET (score=timestamp). Celebrity users (>1M): on read, merge their recent tweets with the user's pre-built feed from Redis. This is what Twitter actually did. When you mention this unprompted, it signals you understand real production trade-offs, not just textbook patterns. Also discuss: feed truncation (keep only last 800 tweets in cache), backfill on first load, and how promoted tweets are injected.
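The hybrid merge is a simple k-way merge over timestamp-sorted streams. A sketch, with the pre-built feed and celebrity pulls modeled as (timestamp, tweet_id) lists:

```python
import heapq

def hybrid_feed(prebuilt: list[tuple[int, str]],
                celebrity_tweets: list[tuple[int, str]],
                limit: int = 20) -> list[str]:
    """prebuilt: entries fanned out to the user's Redis ZSET at write time.
    celebrity_tweets: recent tweets pulled at read time for followed
    >1M-follower accounts. Merge both, newest first, truncate to limit."""
    merged = heapq.merge(sorted(prebuilt, reverse=True),
                         sorted(celebrity_tweets, reverse=True),
                         reverse=True)
    return [tweet_id for _, tweet_id in list(merged)[:limit]]
```

In production the prebuilt side is already sorted (ZREVRANGE), so the only real work is pulling each celebrity's recent tweets from a per-author cache and merging, which is why the read stays cheap even with a few celebrity follows.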
LLD: Parking Lot System
- Entities: ParkingLot, Level, Slot, Ticket, Vehicle (abstract), Car/Bike/Truck (concrete)
- Strategy pattern: PricingStrategy interface — HourlyPricing, FlatRatePricing. Swap without changing Ticket.
- Factory pattern: VehicleFactory.create(type) — returns correct Vehicle subclass
- Observer pattern: SlotAvailabilityBoard subscribes to slot state changes
- Concurrency: booking a slot → check availability + assign must be atomic. Use optimistic locking: slot has version field, CAS update fails if version changed.
- State machine: AVAILABLE → RESERVED → OCCUPIED → AVAILABLE
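The optimistic-locking bullet can be shown in a few lines. This in-memory sketch mirrors what a SQL backend does with `UPDATE slot SET state='RESERVED', version=version+1 WHERE id=? AND version=?` (zero rows updated = conflict):

```python
from dataclasses import dataclass

@dataclass
class Slot:
    slot_id: str
    state: str = "AVAILABLE"
    version: int = 0

class ConflictError(Exception):
    """Raised when the slot changed between our read and our write."""

def reserve(slot: Slot, expected_version: int) -> None:
    """Compare-and-set: succeeds only if nobody touched the slot since we
    read it — this closes the TOCTOU window between check and reserve."""
    if slot.version != expected_version:
        raise ConflictError("slot changed since read — retry with fresh read")
    if slot.state != "AVAILABLE":
        raise ConflictError("slot not available")
    slot.state = "RESERVED"
    slot.version += 1
```

On `ConflictError` the caller re-reads the slot and retries (or picks another slot); under low contention this beats pessimistic locks because the happy path takes no lock at all.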
LLD: Notification System
- Entities: Notification, NotificationChannel (abstract), EmailChannel, SMSChannel, PushChannel
- Strategy pattern: each channel implements send(notification). Plug in new channels without changing core.
- Template method: BaseNotification.send() calls format() + deliver() — subclasses override format()
- Queue-based delivery: write to Kafka topic per channel → consumers retry with exponential backoff
- Idempotency: idempotency_key per notification → dedup on delivery side
- Rate limiting per user per channel: prevent spam — max 10 emails/day per user
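The template-method bullet, sketched: `send()` fixes the skeleton, channels only override `format()`. Class and method names follow the entities listed above; `deliver()` here just echoes instead of calling a real provider API:

```python
from abc import ABC, abstractmethod

class BaseNotification(ABC):
    def send(self, user: str, message: str) -> str:
        """Template method: the skeleton (format then deliver) is fixed;
        subclasses customize only the formatting step."""
        payload = self.format(message)
        return self.deliver(user, payload)

    @abstractmethod
    def format(self, message: str) -> str: ...

    def deliver(self, user: str, payload: str) -> str:
        # Real implementation would call the channel's provider API.
        return f"to={user} payload={payload}"

class EmailChannel(BaseNotification):
    def format(self, message: str) -> str:
        return f"<html><body>{message}</body></html>"

class SMSChannel(BaseNotification):
    def format(self, message: str) -> str:
        return message[:160]  # single SMS segment limit
```

Adding a PushChannel is one new subclass with one method — the queue consumers, retry logic, and idempotency handling around `send()` stay untouched.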
LLD Interview Signal
The gap between senior and mid-level in LLD is concurrency awareness. Don't just draw classes — say "this slot booking has a TOCTOU (time-of-check-time-of-use) race condition between the availability check and the reservation write. I'd handle this with optimistic locking — add a version field to the Slot entity. The update only succeeds if version matches what we read. On conflict, retry with fresh read." Proactively naming race conditions and their solutions is the signal interviewers are looking for.
🏆 Day 29–30: Full Mock System Design Sessions
- Day 29: 45-min mock — Design a real-time gaming leaderboard for 10M players. Cover: storage (Redis ZSET), real-time updates, global ranking, weekly/monthly resets, top-K queries. Time yourself strictly.
- Day 30: 45-min mock — Design a distributed job scheduler (cron at scale). Cover: exactly-once execution guarantee, failure recovery, priority queues, worker assignment, monitoring.
- Standard: HLD complete in first 10 min, deep dive on hardest component, trade-offs raised unprompted before interviewer asks
- Signal: You should be driving the conversation — proposing alternatives, naming limitations, asking clarifying questions proactively. Not answering reactively.
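The Day 29 leaderboard maps almost one-to-one onto a Redis sorted set. An in-memory sketch of the three operations it needs, with the real Redis commands noted in comments (resets are handled by keying the board per period, e.g. board:2025-W19):

```python
class Leaderboard:
    """In-memory sketch of the ZSET operations a leaderboard needs."""

    def __init__(self):
        self.scores: dict[str, float] = {}

    def incr(self, player: str, delta: float) -> float:
        # Redis: ZINCRBY board delta player
        self.scores[player] = self.scores.get(player, 0.0) + delta
        return self.scores[player]

    def top_k(self, k: int) -> list[tuple[str, float]]:
        # Redis: ZREVRANGE board 0 k-1 WITHSCORES — O(log N + k)
        return sorted(self.scores.items(), key=lambda kv: -kv[1])[:k]

    def rank(self, player: str) -> int:
        # Redis: ZREVRANK board player (0-based, highest score first)
        return sorted(self.scores.values(), reverse=True).index(
            self.scores[player])
```

At 10M players a single ZSET is ~hundreds of MB and every operation stays logarithmic; the harder parts of the mock are the global-vs-regional split and real-time update fan-out, not the data structure.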
This week is bonus material with no time pressure — every concept here builds on the system design foundation from the prior weeks, which stands complete on its own. AI agent design is now appearing in interviews at companies building AI products.
ReAct Pattern (Reason + Act)
- LLM alternates: Thought → Action (tool call) → Observation → repeat until done
- Handles ambiguity well — can discover information dynamically
- Risk: error compounds over long chains. Add max_steps hard limit.
- Best for: research tasks, open-ended exploration, web navigation
- Frameworks: LangGraph, LlamaIndex Agents
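The ReAct loop is short enough to sketch whole. Here the `llm` callable is a stand-in for a real model call that returns either a tool action or a final answer (the tuple protocol is an illustrative assumption, not any framework's API); note the `max_steps` hard limit the bullets call for:

```python
def react_loop(llm, tools: dict, task: str, max_steps: int = 8) -> str:
    """Minimal ReAct skeleton. `llm(prompt)` returns either
    ("act", tool_name, tool_input) or ("finish", answer).
    max_steps is the hard stop against runaway loops."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm("\n".join(transcript))  # Thought happens inside
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, tool_input = decision
        observation = tools[tool_name](tool_input)   # Action
        transcript.append(f"Action: {tool_name}({tool_input})")
        transcript.append(f"Observation: {observation}")  # fed back in
    return "max_steps reached without an answer"
```

Everything the bullets warn about lives in this loop: a bad Observation at step 1 poisons every later prompt (compounding error), and without `max_steps` a confused model can cycle forever.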
Plan-and-Execute Pattern
- Planner LLM: generates full step-by-step task list upfront
- Executor: smaller/cheaper model runs each step one at a time
- Dramatically reduces cost — expensive Planner called once
- More auditable — plan is a visible, loggable artifact
- Best for: financial workflows, structured data pipelines, known task shapes
- Add re-planning step if a step fails unexpectedly
ReAct vs Plan-Execute Trade-off
ReAct: flexible, handles unknown task shapes, discovers information dynamically. But error compounds — step 3 failure based on wrong step 1 reasoning can cascade. Plan-Execute: auditable, cost-efficient (Executor = small/cheap model), predictable — but brittle if plan is wrong at step 1. Production guideline (Anthropic): start with the simplest approach. Most production agent failures come from over-engineering, not under-engineering. Add complexity only when needed.
Memory Architecture
- In-context (working): current conversation + tool results in prompt window — fast but bounded (~128K tokens)
- External long-term: vector DB (pgvector, Pinecone) — embed past interactions, retrieve relevant on each turn
- Episodic: summarize older conversation chunks → write summary back to context. Google ADK compaction approach.
- Semantic cache: embed user query → cosine match recent Q&A pairs → return cached if similarity > threshold. Can cut LLM spend 30–60% on repetitive traffic.
Tool Design for Agents
- Every tool = JSON schema: name, description, parameters, return type
- Description quality matters enormously — LLM uses it to decide when and how to call
- Tools must be idempotent where possible — agents retry on failure
- MCP (Model Context Protocol): Anthropic's standard for plug-and-play tool integrations
- Always validate + sanitize tool outputs before feeding back to LLM — prevent prompt injection from tool results
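What "every tool = JSON schema" looks like in practice. The tool definition below is hypothetical (name, fields, and description are illustrative, mirroring common function-calling formats), plus a cheap structural check to run before executing whatever the agent asked for:

```python
# Hypothetical tool definition — shape mirrors common function-calling
# schemas; the name, parameters, and description are illustrative.
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Get current weather for a city. Use when the user asks about "
        "temperature, rain, or conditions. Returns metric units by default."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

def validate_call(tool: dict, args: dict) -> list[str]:
    """Structural check on the agent's proposed call — reject before
    executing rather than after. Returns a list of problems (empty = ok)."""
    errors = []
    props = tool["parameters"]["properties"]
    for req in tool["parameters"]["required"]:
        if req not in args:
            errors.append(f"missing required arg: {req}")
    for k in args:
        if k not in props:
            errors.append(f"unknown arg: {k}")
    return errors
```

Note how much of the description is *when* to call, not just what it does — that sentence is what the LLM actually routes on.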
Context Engineering — The Real Production Bottleneck
Long-running agents exhaust their context window. Naive solution: bigger context window. Real solution: treat context as a compiled artifact. Separate durable state (DB) from per-call working context (prompt). Apply compaction: summarize old events, keep recent N turns verbatim, inject only relevant memory chunks retrieved from vector store. This is the difference between an agent that works for 3 turns and one that works for 300 turns in production.
Single vs Multi-Agent — When to Escalate
Start single agent + many tools. Add multi-agent only when: (1) tasks are genuinely parallel and independent, (2) specialist isolation prevents error contamination between domains, (3) different agents need different tool permissions (security boundary). Don't add multi-agent for complexity's sake. Production lesson from Anthropic: "The most successful implementations consistently used simple, composable patterns rather than complex frameworks." Debugging multi-agent race conditions and message passing failures is brutal.
AI Gateway Key Decisions
- Rate limit in tokens (TPM), not requests: one request can cost 100 or 100K tokens
- Model routing: classify query complexity → route simple to small/cheap model, complex to GPT-4o. 60–80% cost reduction is achievable, depending on traffic mix.
- Semantic cache: embed prompt → cosine search → return cached if similarity > 0.95. Massive savings on repeated queries.
- Fallback chain: primary model timeout → retry secondary → hardcoded fallback response
- Streaming: SSE for token-by-token output — reduces perceived latency dramatically
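The "rate limit in tokens, not requests" bullet is just a token bucket denominated in LLM tokens. A sketch (the injectable clock is for testability; capacity is the per-key TPM quota):

```python
import time

class TokenBudgetLimiter:
    """Token bucket denominated in LLM tokens: a 100K-token request
    drains the bucket 1000x faster than a 100-token one, which a
    request-count limiter would treat identically."""

    def __init__(self, tpm: int, clock=time.monotonic):
        self.capacity = float(tpm)
        self.tokens = float(tpm)      # start full
        self.rate = tpm / 60.0        # tokens replenished per second
        self.clock = clock
        self.last = clock()

    def allow(self, estimated_tokens: int) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if estimated_tokens <= self.tokens:
            self.tokens -= estimated_tokens
            return True
        return False  # caller queues, sheds, or routes to another key
```

One wrinkle worth raising: you only *estimate* tokens up front (prompt tokens are known, completion tokens are not), so gateways typically reserve an estimate and reconcile against the actual usage reported after the call.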
RAG Production Trade-offs
- Chunking: fixed-size (fast) vs semantic (better recall). Parent-child: retrieve small chunk, return larger parent for context.
- Hybrid search: ANN (semantic) + BM25 (keyword) combined with RRF scoring — typically beats either retriever alone on precision and recall
- Reranker: cross-encoder re-scores top-20 → return top-3. Expensive but dramatically improves answer quality.
- Context budget: dynamically allocate tokens: system prompt + examples + retrieved chunks + answer reserve. Hard cap on chunk injection.
- Eval: RAGAS metrics: faithfulness, answer relevance, context recall. Run on golden dataset per deploy.
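Reciprocal Rank Fusion, the RRF mentioned in the hybrid-search bullet, is a few lines: each document scores the sum of 1/(k + rank) across the result lists it appears in, so documents ranked well by *both* BM25 and ANN float to the top. k=60 is the commonly used default:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked result lists (e.g. BM25 and ANN) without
    needing their scores to be on comparable scales."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The reason RRF is popular over weighted score sums: it only consumes rank positions, so you never have to normalize a BM25 score against a cosine similarity.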
The Three Hard Problems
1. Context assembly: what code do you put in the prompt and how do you rank relevance? AST-based chunking preserves function/class boundaries. Retrieve by semantic similarity + recency + call-graph proximity. Budget: 8K tokens for context, 4K for completion.
2. Latency: inline completions need <200ms p50. Use smaller/cached model for inline autocomplete; larger for chat and complex refactoring. Pre-warm model for active files.
3. Safety: LLM-generated code runs on your machine. Sandbox every tool call (file write, shell exec) in a container. Never auto-apply diffs — always require user approval. Cost cap per session: abort if token spend exceeds budget.
Observability for Agents
- Trace every tool call: span ID, tool name, input, output, latency, token cost
- Agent run trace: full Thought → Action → Observation tree per run_id — stored in persistent log
- Token cost per step: attribute cost to each LLM call — find expensive steps to optimize
- Eval pipeline: RAGAS / custom golden dataset — run on every deploy, catch regressions
- Tools: LangSmith, Langfuse, Helicone, Arize Phoenix — purpose-built for LLM tracing
Safety Layers
- Input validation: prompt injection detection, PII scrubbing before LLM
- Output validation: structured output schemas (JSON Schema/Pydantic) — reject malformed
- Tool permission scoping: each agent gets only tools it needs — least privilege
- Sandboxed execution: code-running agents in network-isolated containers with CPU/mem limits
- Human-in-the-loop gates: for irreversible actions (send email, deploy, delete data) — explicit approval required
- Max iterations + cost cap: hard stop prevents runaway loops
Real Production Failure Modes
(1) Compounding hallucination: wrong tool call → wrong observation → wrong next step → cascades. Fix: validate tool output schema, add reflection on unexpected results. (2) Context poisoning / prompt injection: malicious content in RAG retrieval injects instructions. Fix: separate retrieved content from instruction sections structurally. (3) Cost explosion: agent loops without progress. Fix: max_steps + cost budget hard cap + loop detection (hash of recent states). (4) No audit trail: impossible to debug what the agent did. Fix: persistent trace log from day one — non-negotiable.
🤖 Week 5 Checkpoint — AI Agent Design
- Mock 1: "Design an AI research agent that browses the web and summarizes findings." Draw the ReAct loop, tool registry, memory strategy. What breaks at 100 concurrent users?
- Mock 2: "Design a RAG system for a 10M document legal corpus." Cover chunking strategy, hybrid search, reranking, context budget, latency SLA, eval pipeline.
- Mock 3: "Design GitHub Copilot." Cover AST indexing, context assembly, latency budget, sandboxed execution, cost control. Time yourself: 45 mins, drive the hard problems proactively.
- The signal: In every agent design, proactively raise observability, cost caps, safety guardrails, and what happens when the agent takes a wrong turn on iteration 3 of 10. Raising these unprompted is the interview signal.