Core Concepts
- Vertical scaling: add CPU/RAM to one machine. Simple, no code change. Hard hardware ceiling (even the largest cloud instances top out at a few hundred vCPUs). SPOF.
- Horizontal scaling: add more machines. Requires stateless design, service discovery, load balancing.
- Stateless services: no local session — store state in Redis/DB so any instance can serve any request.
- Scale Cube (AKF): X=clone (horizontal), Y=functional decompose (microservices), Z=data partition (sharding)
- Bottleneck first: before scaling, profile — CPU, memory, I/O, or network? Scale the actual bottleneck.
When to Scale What
- CPU bound → vertical first, then horizontal with load balancer
- Memory bound → vertical or offload to Redis/external cache
- I/O bound → read replicas, caching, async processing
- Network bound → CDN, edge caching, compression
- DB bound → connection pooling, read replicas, sharding
- Stateful service → externalize state first, then scale horizontally
Key Trade-offs
Vertical: operationally simple, no distributed system complexity, but hard ceiling and SPOF. Horizontal: theoretically unlimited, fault tolerant, but requires distributed consensus, session externalization, consistent hashing for data locality. Most production systems do both: vertical for the database, horizontal for stateless API tiers.
Interview Approach
When asked "how do you scale X?": first identify the bottleneck. Don't say "add more servers" — say "I'd first profile to identify whether we're CPU, memory, or I/O bound. For the stateless API tier I'd scale horizontally behind an ALB. For the database I'd start with read replicas and connection pooling, escalating to sharding only when reads on replicas saturate."
L4 vs L7 Load Balancing
- L4 (Transport): routes by IP + TCP port, no payload inspection — fastest; typically passes TLS through rather than terminating it
- L7 (Application): routes by HTTP headers, path, cookie — enables path-based routing, A/B testing, auth
- AWS NLB = L4. AWS ALB = L7. HAProxy does both. Nginx = L7 (can do L4 with stream module).
- TLS termination at LB: offloads crypto from app servers, enables header inspection
Algorithms
- Round Robin: simple, equal distribution. Bad if requests vary in cost.
- Least Connections: send to server with fewest active connections. Best for variable-cost requests.
- IP Hash: same client → same server (sticky). Breaks with proxies/NAT.
- Consistent Hashing: minimize reshuffling when servers added/removed. Best for cache locality.
- Weighted: canary deployments — send 5% to new version.
Health Checks + Connection Draining
Active health checks: LB polls /health endpoint every 5s. Passive: detect failures from response codes. Connection draining: when removing a server, LB stops sending new connections but lets in-flight requests finish (30–60s grace period). Critical for zero-downtime deployments.
Common Interview Gotcha
Sticky sessions (session affinity) break horizontal scaling — if a server dies, all its sessions are lost. The correct answer: externalize sessions to Redis, make the service stateless, and use round-robin or least-connections. Only use sticky sessions when you can't externalize state (legacy systems).
Caching Strategies
- Cache-aside (Lazy): app checks cache first → miss → read DB → populate cache. Most common.
- Read-through: cache fetches from DB on miss automatically. App only talks to cache.
- Write-through: write to cache AND DB synchronously. Consistent, but doubles write latency.
- Write-behind (Write-back): write to cache only, flush to DB async. Fast writes, risk of data loss on crash.
- Refresh-ahead: proactively refresh cache before TTL expires on popular keys.
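The cache-aside flow above can be sketched in a few lines; a plain dict stands in for Redis here, and `db` / `get_user` are illustrative names, not a real client API:

```python
import time

# In-memory stand-ins for Redis and the database (illustrative only)
cache: dict[str, tuple[object, float]] = {}   # key -> (value, expiry_ts)
db = {"user:42": {"name": "Ada"}}
TTL_SECONDS = 300

def get_user(key: str):
    """Cache-aside: check cache first; on miss, read DB and populate cache."""
    entry = cache.get(key)
    if entry is not None and entry[1] > time.time():
        return entry[0]                                   # cache hit
    value = db.get(key)                                   # miss -> read DB
    if value is not None:
        cache[key] = (value, time.time() + TTL_SECONDS)   # populate cache
    return value
```

The second call for the same key is served from the dict without touching `db` — the same shape applies with a real Redis GET/SETEX pair.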
Redis Data Structures
- String: counters, session tokens, simple cache values
- Hash: user profile fields — partial updates without fetching all
- List: message queues, activity feeds (LPUSH/RPOP)
- Set: unique visitors, tags, friend lists (union/intersection)
- Sorted Set (ZSET): leaderboards, rate limiting, priority queues
- HyperLogLog: approximate unique count — 12KB for billions of items
- Streams: durable event log — lighter-weight Kafka alternative
Cache Invalidation — "Two Hard Problems"
TTL-based: simple, eventual consistency acceptable. Event-driven: write → Kafka event → cache invalidation consumers. More consistent but complex. Versioned keys: user:42:v3 — change version on update, old key orphans and expires. Cache stampede (thundering herd): many requests miss simultaneously and hammer DB. Solutions: mutex lock (only one request rebuilds), probabilistic early expiry (PER algorithm), request coalescing.
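The mutex-lock fix for a stampede can be sketched with a process-local lock standing in for a distributed lock (a real deployment would use e.g. Redis SET NX with a TTL); `load_from_db` is an illustrative stand-in for the expensive query:

```python
import threading, time

cache: dict[str, object] = {}
rebuild_lock = threading.Lock()
db_calls = 0

def load_from_db(key: str) -> str:
    global db_calls
    db_calls += 1            # count DB hits to show the stampede is collapsed
    time.sleep(0.05)         # simulate a slow query
    return f"value-for-{key}"

def get(key: str):
    """Stampede protection: only one caller rebuilds on a miss."""
    if key in cache:
        return cache[key]
    with rebuild_lock:       # single rebuilder; the rest block briefly
        if key in cache:     # double-check: another thread may have filled it
            return cache[key]
        cache[key] = load_from_db(key)
        return cache[key]

# 20 concurrent misses on the same hot key -> exactly one DB query
threads = [threading.Thread(target=get, args=("hot",)) for _ in range(20)]
for t in threads: t.start()
for t in threads: t.join()
```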
Multilevel Caching
L1 = in-process memory (microseconds, limited size, per-instance). L2 = Redis cluster (milliseconds, shared, bounded). L3 = DB read replica. Mention hot keys: a single extremely popular key (celebrity post) can overwhelm one Redis shard — solutions: local replica per app server, key sharding by appending suffix.
CAP Theorem
- Consistency: every read sees the most recent write
- Availability: every request gets a (non-error) response
- Partition Tolerance: system continues despite network splits
- Reality: P always occurs in distributed systems → choose C or A during partition
- CP systems: HBase, ZooKeeper, etcd, MongoDB (strong mode)
- AP systems: Cassandra, DynamoDB, CouchDB, Redis (default)
Consistency Models (Spectrum)
- Linearizability: single-node illusion — reads always see latest write
- Sequential: all ops appear in some consistent order
- Causal: causally related operations appear in order
- Read-your-writes: you always see your own writes (session)
- Monotonic reads: once you read a value, you won't see older values
- Eventual: all replicas converge if no new writes — lowest consistency
PACELC — More Practical Than CAP
PACELC extends CAP: "If Partition (P), choose between Availability (A) and Consistency (C). Else (E) — even without partition — choose between Latency (L) and Consistency (C)." Most distributed systems trade off latency vs consistency ALL the time, not just during partitions. DynamoDB: PA/EL (available, low latency). Spanner: PC/EC (consistent, higher latency). Understanding PACELC shows deeper distributed systems thinking.
Interview Signal
Never just say "I'd use CAP theorem." Say: "For a payment system, I need CP — a failed transaction is better than a duplicate credit. I'd accept reduced availability during network partitions and use circuit breakers to fail fast. For a social media feed, AP is fine — showing a slightly stale post is acceptable, and I'd use eventual consistency with short TTLs."
Algorithms Compared
- Fixed Window: count resets at interval boundary. Simple. Burst attack possible at boundary edges.
- Sliding Window Log: log timestamps of each request in Redis sorted set. Precise. Memory = O(requests in window).
- Sliding Window Counter: interpolate between current and previous window counts. Memory efficient, ~0.003% error rate.
- Token Bucket: bucket refills at rate R, max burst = bucket size. Allows controlled bursting. AWS API Gateway uses this.
- Leaky Bucket: requests queued, processed at constant rate. Smooths traffic. Queue fills under burst — requests dropped.
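The token bucket above fits in a small class; the clock is injected as a parameter so refill behavior is deterministic and testable (a sketch, not AWS's implementation):

```python
class TokenBucket:
    """Refills at `rate` tokens/sec up to `capacity` (capacity = max burst)."""
    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, now

    def allow(self, now: float) -> bool:
        # Refill based on elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1   # spend one token for this request
            return True
        return False
```

With rate=1 and capacity=2, two requests at t=0 pass (the burst), the third is rejected, and by t=1.5 a refilled token lets the next one through.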
Distributed Rate Limiting
- Single Redis node: atomic Lua script — INCR + EXPIRE in one round-trip
- Redis Cluster: consistent hash user_id to shard — same user always hits same node
- Sliding window with ZSET: ZADD + ZREMRANGEBYSCORE + ZCARD atomically
- Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After
- Return HTTP 429 Too Many Requests with Retry-After header
- Different limits by tier: free (100/min), pro (1000/min), enterprise (unlimited)
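The ZSET sliding-window recipe above can be simulated with a sorted Python list standing in for the sorted set (in production the three steps run atomically in a Lua script; `WINDOW`/`LIMIT` values are illustrative):

```python
import bisect

WINDOW = 60.0   # seconds
LIMIT = 100

# Per-user timestamp logs; a sorted list stands in for a Redis ZSET
logs: dict[str, list[float]] = {}

def allow(user_id: str, now: float) -> bool:
    log = logs.setdefault(user_id, [])
    # ZREMRANGEBYSCORE: drop timestamps that fell out of the window
    cutoff = now - WINDOW
    del log[:bisect.bisect_right(log, cutoff)]
    # ZCARD: count requests still inside the window
    if len(log) >= LIMIT:
        return False
    # ZADD: record this request
    bisect.insort(log, now)
    return True
```

The memory cost matches the note above: one entry per request in the window, per user.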
CDN Internals
- Edge PoPs: globally distributed cache nodes — serve content from nearest geography
- Cache-Control headers: public, max-age=86400 tells CDN to cache for 1 day
- Pull CDN: first request goes to origin, CDN caches response. Most common.
- Push CDN: upload assets proactively to CDN edge nodes. Better for static files that don't change.
- Origin Shield: intermediate layer between CDN edges and origin — collapses cache misses from multiple PoPs into one origin request. Reduces origin load ~90%.
- Cache invalidation: URL versioning (app.v3.js) or API purge (CDN purge API on deploy).
DNS + Proxies
- DNS resolution chain: Browser cache → OS cache → Recursive resolver → Root NS → TLD NS → Authoritative NS
- TTL strategy: low TTL (60s) before blue-green switch, high TTL (3600s) normal operation
- GeoDNS: return different IP based on client region — Route53 latency-based routing
- Forward proxy: client-side (corp proxy, VPN) — client knows about it
- Reverse proxy: server-side (Nginx, Cloudflare) — client doesn't know
- API Gateway: reverse proxy + auth + rate limiting + routing + observability
Full Request Flow — Memorize This
Client → DNS lookup (~30ms first time, cached after) → CDN edge (cache HIT → ~5ms, done). If MISS → Load Balancer (~1ms) → API Gateway (auth, rate limit, ~5ms) → Service → Cache check (Redis HIT ~1ms, done). If MISS → Database query (~5–20ms) → populate cache → respond. Total p50: 50–100ms. p99 bottleneck is almost always the DB or a missing cache layer.
REST — Principles & Best Practices
- Resource-based URLs: nouns not verbs — /users/42/orders not /getUserOrders
- HTTP verbs: GET (read), POST (create), PUT (replace), PATCH (partial update), DELETE (remove)
- Status codes: 200 OK, 201 Created, 204 No Content, 400 Bad Request, 401 Unauth, 403 Forbidden, 404 Not Found, 409 Conflict, 422 Unprocessable, 429 Too Many, 500 Server Error
- Idempotency: GET/PUT/DELETE are idempotent. POST is not. Use idempotency keys for POST payments.
- HATEOAS: responses include links to related actions — reduces client coupling
GraphQL vs REST vs gRPC
- REST: simple, cacheable, universal client support. Overfetch/underfetch problem.
- GraphQL: client specifies exact fields needed — eliminates over/under-fetch. Single endpoint. Complex queries, no HTTP caching by default. Best for: frontend-heavy apps, mobile (save bandwidth), multiple clients with different data needs.
- gRPC: binary protocol (Protobuf), strongly typed, bidirectional streaming, HTTP/2. 5-10x faster than REST/JSON. Best for: internal microservice-to-microservice, streaming data, polyglot services.
- WebSocket: persistent full-duplex. Use when server must push to client in real time.
API Versioning Strategies
- URL versioning: /api/v1/users — most visible, easy to route. Industry default.
- Header versioning: Accept: application/vnd.api+json;version=2 — cleaner URLs, harder to test in browser
- Query param: /api/users?version=2 — simple but pollutes query space
- Never break existing clients — add fields, never remove/rename
- Deprecation: return Sunset header with deprecation date
- Maintain N-1 versions minimum. Remove after migration period.
Idempotency & Safety
- Idempotency key: client sends unique key with POST request (UUID)
- Server caches response keyed by idempotency key for 24h
- On retry: return cached response — no duplicate charge/order
- Stripe, PayPal, Braintree all use this pattern for payments
- Key header: Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
- Store key → response in Redis with TTL. Check before processing.
API Authentication & Authorization
- API Key: simple, server-side secret. No expiry mechanism. Best for server-to-server.
- JWT (Bearer token): stateless, self-contained claims. Verify signature without DB call. Short expiry (15min) + refresh token (7 days).
- OAuth 2.0: delegated authorization. User grants app access to their data on another service. Authorization Code flow for web, PKCE for mobile/SPA.
- mTLS: mutual TLS — both client and server present certificates. For internal service-to-service in zero-trust networks.
- HMAC signatures: sign request body with shared secret — webhooks, S3 presigned URLs.
API Error Design
- Return structured errors — not just HTTP status code
- Include: error code (machine-readable), message (human), details array, request_id for tracing
- Never expose stack traces or internal DB errors to clients
- Use the RFC 7807 Problem Details standard (superseded by RFC 9457, same format) for REST APIs
- Consistent envelope: {"error":{"code":"RATE_LIMITED","message":"...","retry_after":47}}
- Distinguish client errors (4xx) from server errors (5xx) — client errors are caller's fault
API Design Decision Framework
Public API or 3rd party developers? → REST. Universally understood, easy to test with curl, HTTP caching works naturally.
Mobile app with bandwidth concerns? → GraphQL. Client specifies exactly what fields it needs — no over-fetching.
Internal microservices, high throughput? → gRPC. Binary protocol, streaming support, strongly typed contracts catch breaking changes at compile time.
Real-time bidirectional? → WebSocket (full-duplex persistent) or SSE (server-push only, simpler).
Multiple frontend clients (web, mobile, IoT) with different data needs? → GraphQL with a BFF (Backend for Frontend) layer.
API Design Interview Tips
Always define the API before drawing the architecture. Walk through: (1) What are the key resources? (2) What operations does each resource support? (3) What are the access patterns — who calls what, how often? (4) What are the consistency requirements for each endpoint? Showing API-first thinking is a strong signal — it means you think about contracts and client usability, not just internal implementation.
🎯 Week 1 Checkpoint
- Draw the full request flow (Client → CDN → LB → Gateway → Service → Cache → DB) from memory with latency at each hop
- Design a REST API for a ride-booking app — define 5 endpoints with correct verbs, status codes, and pagination
- Explain when you'd choose gRPC over REST and vice versa
- Design a distributed rate limiter for 50K RPS — which algorithm and why?
- Given a payment system, which CAP trade-off do you make and why?
SQL Strengths
- ACID transactions — critical for financial systems, inventory, reservations
- Complex joins and aggregations across normalized data
- Schema enforcement = data integrity at DB level
- Mature tooling: EXPLAIN plans, index advisors, migrations
- Strong consistency by default
- Best for: payments, e-commerce orders, user accounts, ERP, anything relational
NoSQL Categories
- Key-Value (Redis, DynamoDB): O(1) get/set, simple access patterns, no joins. Best for: sessions, cache, counters.
- Document (MongoDB, Firestore): flexible schema, nested JSON. Best for: catalogs, CMS, user profiles.
- Wide-Column (Cassandra, HBase): write-optimized LSM tree, massive scale, tunable consistency. Best for: time-series, IoT, audit logs.
- Graph (Neo4j, Amazon Neptune): relationship traversal. Best for: social networks, fraud detection, recommendations.
- Search (Elasticsearch, OpenSearch): inverted index, full-text + faceted search.
Decision Framework — Never "I'd use MongoDB because it's flexible"
Ask: (1) Do I need multi-entity ACID transactions? → PostgreSQL. (2) Do I need complex multi-table joins? → SQL. (3) Is write throughput > 50K/s with simple access patterns? → Cassandra/DynamoDB. (4) Flexible schema, document-centric reads, no complex joins? → MongoDB. (5) Cache or session? → Redis. (6) Full-text search? → Elasticsearch alongside primary DB. (7) Graph traversal? → Neo4j. The key signal: which queries will this DB serve, and can it serve them efficiently?
B-Tree (PostgreSQL, MySQL)
- Self-balancing tree, O(log n) reads and writes
- Pages: typically 8KB, tuned for disk block size
- Good for mixed read/write workloads
- Random writes cause write amplification (update internal nodes)
- In-place updates — good for read-heavy workloads
LSM Tree (Cassandra, RocksDB, LevelDB)
- Write to in-memory MemTable first — O(1) sequential write
- MemTable flushed to immutable SSTable on disk when full
- Compaction merges SSTables — removes tombstones, sorts keys
- Bloom filter per SSTable: skip SSTables that don't contain key
- O(1) writes, O(log n) reads. Better for write-heavy workloads.
- Write amplification during compaction. Read amplification without bloom filter.
MVCC + WAL — How PostgreSQL Works
MVCC (Multi-Version Concurrency Control): each write creates a new row version with a transaction timestamp. Readers see a snapshot at their transaction start time — no blocking. Old versions cleaned up by VACUUM. Enables non-blocking reads alongside concurrent writes.
WAL (Write-Ahead Log): every change is written to the WAL log before it's applied to data pages. On crash: replay WAL to recover. Also enables streaming replication — ship WAL segments to replicas. PostgreSQL logical replication: replays WAL as SQL changes, enabling cross-version replication and CDC (Debezium).
Sharding Strategies
- Hash sharding: hash(key) % N → shard. Uniform distribution, no range queries across shards.
- Range sharding: shard by key range (A-M, N-Z). Range queries efficient. Hotspot risk if data is skewed.
- Directory sharding: lookup table maps key → shard. Most flexible, single point of failure.
- Geo sharding: shard by region/country. Data locality, GDPR compliance.
- Consistent hashing: hash ring — adding/removing nodes minimizes key movement. Used by DynamoDB, Cassandra.
Consistent Hashing Deep Dive
- Place nodes on hash ring (0 to 2³² − 1, wrapping around)
- Key → hash → find nearest node clockwise on ring
- Add node: only neighbors' keys move (≈ k/n keys)
- Remove node: only that node's keys move to next node
- Virtual nodes (vnodes): each physical node = 100-200 points on ring → better, even distribution even with heterogeneous hardware
- Used by: DynamoDB, Cassandra, Redis Cluster, CDN cache routing
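The ring with vnodes can be sketched in ~25 lines using MD5 as an illustrative hash (real systems pick their own hash function; 100 vnodes per node matches the range noted above):

```python
import bisect, hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (vnodes)."""
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.vnodes = vnodes
        self.ring: list[tuple[int, str]] = []   # sorted (hash, node) points
        for node in nodes:
            self.add(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str):
        # Each physical node contributes `vnodes` points on the ring
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def lookup(self, key: str) -> str:
        # Nearest vnode clockwise; modulo wraps around at the end of the ring
        h = self._hash(key)
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

Adding a fourth node to a three-node ring moves only roughly a quarter of the keys, which is the whole point versus hash(key) % N.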
Shard Key Selection — The Most Critical Decision
Bad shard key = hotspots = one shard does all work while others are idle. For social network: user_id hash works for regular users but celebrities cause hot shards. Solution: celebrity detection + separate handling (fan-out on read for celebrities). For time-series: time-based shard key causes write hotspot on current time shard. Solution: composite key (time + device_id). Rule: choose a key with high cardinality AND even distribution across your access patterns. Avoid sequential keys (auto-increment IDs) — they create monotonic hotspots.
Replication Models
- Single-leader: all writes to primary, replicated to read replicas. Simple. Primary is write bottleneck.
- Multi-leader: writes accepted at multiple nodes. Conflict resolution required. Useful for multi-datacenter active-active.
- Leaderless (Dynamo-style): any node accepts writes. Quorum-based consistency.
- Sync replication: write confirmed only after replica ACKs. Durable but adds latency.
- Async replication: write confirmed immediately, replica catches up. Low latency but possible data loss on primary crash.
- Semi-sync: at least one replica must ACK. Balance between durability and latency. Available in MySQL via the semisync plugin (async is the default).
Quorum Reads/Writes
- n = total replicas, w = write quorum, r = read quorum
- Strong consistency: w + r > n (overlap guarantees you read the latest write)
- Typical: n=3, w=2, r=2 — tolerates 1 failure, strongly consistent
- Write-heavy: w=1, r=3 (fast writes, consistent reads)
- Read-heavy: w=3, r=1 (durable writes, fast reads)
- Availability-first: w=1, r=1 (AP system — eventual consistency)
- Cassandra: tunable per query — LOCAL_QUORUM for multi-DC consistency
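The overlap rule above reduces to a one-line check, which covers every configuration in the list:

```python
def strongly_consistent(n: int, w: int, r: int) -> bool:
    """Quorum overlap: every read set intersects every write set iff w + r > n."""
    return w + r > n
```

n=3, w=2, r=2 passes; the availability-first w=1, r=1 config does not, which is exactly the eventual-consistency trade-off noted above.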
Leader Election — How etcd/ZooKeeper Handle It
Raft consensus: nodes time out waiting for heartbeat → become candidate → request votes → win majority → become leader. Election timeout is randomized (150–300ms) to prevent split votes. Fencing tokens prevent split-brain: new leader issues higher token; storage layer rejects writes with lower token. PostgreSQL failover (Patroni): uses etcd/Consul for leader election, switches DNS/VIP to new primary. Zero-downtime failover in ~30s.
Index Types
- B-Tree (default): equality + range queries. Works with LIKE 'prefix%'.
- Hash: equality only — slightly faster for exact match, no range support
- Composite: (col_a, col_b, col_c). Leftmost-prefix rule: must include leading columns in query to use index.
- Covering index: query fully served from index without table fetch (index includes all selected columns)
- Partial index: index on subset of rows. WHERE active = true — smaller, faster for frequent filtered queries.
- Full-text: inverted index with TF-IDF ranking. Use Elasticsearch for serious full-text search.
- GIN / GiST: PostgreSQL indexes for arrays, JSONB, full-text, geometric types.
Query Optimization Signals
- EXPLAIN ANALYZE: look for Seq Scan on large tables → missing index
- Index selectivity: high cardinality columns make better indexes (email > gender)
- N+1 query problem: 1 query for list + N queries for each item. Fix with JOIN or batch load.
- Connection pooling: PgBouncer in transaction mode — 10K app connections → 100 real DB connections
- Read replicas for analytics/reporting — separate from OLTP primary
- Avoid SELECT * — select only needed columns (reduces I/O, enables covering indexes)
- Pagination: use keyset pagination (WHERE id > last_id) not OFFSET for large tables
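Keyset pagination can be demonstrated end to end with stdlib sqlite3; the table and column names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"item{i}") for i in range(1, 101)])

def page(last_id: int, size: int = 10):
    """Keyset pagination: seek past last_id via the PK index, no OFFSET scan."""
    return conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, size)).fetchall()

first = page(0)                 # ids 1..10
second = page(first[-1][0])     # client passes back the last id it saw
```

Unlike OFFSET 10, the second page is an index seek, so cost stays flat no matter how deep the client paginates.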
🎯 Week 2 Checkpoint
- When would you choose LSM tree over B-Tree? Articulate write amplification trade-off.
- Design a sharding strategy for a payment system with high-value accounts — what's your shard key and why?
- Explain MVCC and why it enables non-blocking reads without SELECT FOR SHARE locks
- w=2, r=2, n=3 — is this strongly consistent? What about w=1, r=1, n=3?
- What's the N+1 problem? Show the fix with a query example.
Kafka Architecture
- Topic → Partitions (ordered, immutable append-only log)
- Offset: per-partition monotonic counter. Consumer tracks its position.
- Consumer Group: each partition consumed by exactly one consumer in the group
- Replication factor: ISR (In-Sync Replicas) — leader + N replicas
- Leader election via KRaft (Kafka Raft — replaces ZooKeeper since Kafka 3.3)
- Retention: time-based (7 days default) or size-based
- Log compaction: keep only latest value per key — good for state stores
Performance Tuning Knobs
- Partition count: parallelism ceiling for consumers — can't have more consumers than partitions
- acks=0: fire and forget (fastest, data loss). acks=1: leader ack. acks=all: all ISR ack (slowest, most durable).
- linger.ms + batch.size: batch small messages — increases throughput, adds latency
- compression.type: lz4 (speed) or zstd (ratio). Apply at producer level.
- max.poll.interval.ms: if consumer doesn't poll in time → rebalance triggered
- Partition key: determines which partition a message lands on — use user_id for ordering per user
Exactly-Once Semantics (EOS)
Kafka guarantees: at-least-once by default (consumer may process duplicate on crash). For exactly-once: (1) Idempotent producer: enable.idempotence=true — sequence numbers deduplicate retries at broker. (2) Transactional API: atomic produce to multiple topics + consumer offset commit in one transaction. In practice: idempotent consumers (dedup on DB side using unique constraint on idempotency key) are simpler and more resilient than Kafka EOS transactions.
Ordering Nuance
Kafka guarantees ordering per partition only — not across partitions. For strict ordering of all events for a user: partition by user_id → all events for that user land in same partition in order. Cross-user ordering is not guaranteed. For globally ordered events (rare): use a single partition (loses parallelism) or use application-level sequence numbers.
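Partition-by-key is just a deterministic hash mod partition count; Kafka's default partitioner uses murmur2, so the MD5 here is an illustrative stand-in that shows the property (same key, same partition):

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(user_id: str) -> int:
    """Keyed partitioning: one user's events always land on one partition,
    so they are consumed in order relative to each other."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
```

Two events for user-42 are guaranteed to share a partition; events for different users may land anywhere, which is why cross-user ordering is not guaranteed.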
2-Phase Commit (2PC)
- Coordinator: send PREPARE → all participants vote YES/NO
- If all YES → send COMMIT to all. If any NO → send ABORT.
- Blocking protocol: if coordinator crashes after PREPARE, participants are stuck (holding locks)
- Used in distributed SQL (Spanner uses 2PC + Paxos). Avoid in microservices — too much coupling.
- 3-Phase Commit (3PC) adds pre-commit phase to reduce blocking — rarely used in practice
Saga Pattern
- Break distributed transaction into sequence of local transactions
- Each step publishes an event that triggers the next service
- On failure: compensating transactions undo previous steps in reverse order
- Choreography: services react to events autonomously — no central coordinator. Simple, but harder to track overall state.
- Orchestration: central saga orchestrator sends commands to each service. Easier to monitor and debug. Coupling to orchestrator.
Outbox Pattern — The Production-Grade Dual-Write Fix
Problem: write to DB and publish to Kafka — if Kafka publish fails after DB commit, event is lost. If you roll back on Kafka failure, you lose the DB write. This is the dual-write problem.
Solution — Outbox pattern: write to DB + Outbox table in same transaction (atomic). Debezium watches the WAL for Outbox table inserts and publishes to Kafka. Atomicity guaranteed by DB transaction. Eventual delivery guaranteed by CDC. This is the correct production approach for any service that writes to DB AND publishes events.
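The atomic half of the pattern can be shown with stdlib sqlite3 (table names are illustrative; in production a CDC tool like Debezium would tail the WAL and publish the outbox rows):

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     topic TEXT, payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id: int, total_cents: int):
    """Business row and its event written in ONE transaction: no dual-write gap."""
    with conn:   # sqlite3 context manager commits both inserts or neither
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_cents))
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)",
                     ("orders.created", json.dumps({"id": order_id})))

place_order(1, 4999)
```

Either both rows commit or neither does; the event can never be lost after the order exists, and a crashed publisher simply re-reads unpublished outbox rows.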
Protocol Comparison
- Long Polling: client sends request → server holds until event or timeout → client immediately re-requests. Latency = server hold time. High connection overhead. Use only for legacy clients.
- SSE (Server-Sent Events): persistent HTTP connection, server streams events to client. One-way (server → client only). HTTP/1.1 compatible. Good for: live feeds, notifications, progress bars.
- WebSocket: full-duplex persistent TCP connection. Both sides can send at any time. Best for: chat, collaborative editing, gaming, live dashboards. Uses HTTP Upgrade handshake.
- WebTransport (QUIC): emerging standard. Datagrams + streams over HTTP/3. Better for real-time gaming, video conferencing.
WebSocket at Scale
- Go handles 100K+ concurrent WS connections per server with goroutines
- Connection server is stateful — can't naively load balance (sticky sessions needed at WS tier)
- Pub-sub backend: connection servers subscribe to Redis Pub-Sub or Kafka topics
- Fan-out: message → Redis Pub-Sub → all connection servers for recipient's room/channel → push to client
- Presence tracking: heartbeat every 15s → Redis key with 30s TTL. Key expiry = user offline.
- Reconnect: client tracks last received sequence number → reconnect with offset → server replays missed messages
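The reconnect-and-replay step can be sketched as a bounded per-channel buffer keyed by sequence number (sizes and names are illustrative):

```python
from collections import deque

class ReplayBuffer:
    """Per-channel ring buffer: replays messages after last_seq on reconnect."""
    def __init__(self, capacity: int = 1000):
        self.buf: deque[tuple[int, str]] = deque(maxlen=capacity)
        self.seq = 0

    def publish(self, payload: str) -> int:
        self.seq += 1                      # monotonic per-channel sequence
        self.buf.append((self.seq, payload))
        return self.seq

    def replay_after(self, last_seq: int) -> list[tuple[int, str]]:
        # Client reconnects with the last sequence number it received
        return [(s, p) for s, p in self.buf if s > last_seq]
```

If the client was disconnected after seq 1, it reconnects with last_seq=1 and receives everything published since; a client that missed more than `capacity` messages would need a full resync from the store of record.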
Metrics (Prometheus + Grafana)
- RED method: Rate (requests/sec), Errors (error rate %), Duration (latency p50/p99)
- USE method: Utilization, Saturation, Errors — for infrastructure resources
- Histogram: p50, p90, p99 latency. Never use averages — they hide outliers.
- Cardinality trap: avoid high-cardinality labels (user_id, request_id) — explodes Prometheus memory
- SLI/SLO/SLA: SLI=measure (p99 latency), SLO=target (p99 < 200ms 99.9% of time), SLA=contractual guarantee
Distributed Tracing (Jaeger, Tempo)
- Trace: tree of spans across services for one request
- Span: single operation with start/end time, service name, tags
- Propagate: W3C TraceContext header (traceparent) across service calls
- Sampling: head-based (random %) vs tail-based (on error/slow). Tail-based catches all slow/error requests.
- OpenTelemetry: vendor-neutral instrumentation standard. Collect once, export to Jaeger/Tempo/Datadog.
Structured Logging Best Practices
Always log JSON in production. Include: timestamp, level, service, trace_id, span_id, user_id (when applicable), message, and relevant context fields. Never log PII (emails, passwords, payment cards). Use log levels correctly: DEBUG (dev only), INFO (normal ops), WARN (unexpected but handled), ERROR (needs attention), FATAL (crash). Centralize with ELK (Elasticsearch + Logstash + Kibana) or Loki + Grafana. Correlate logs with traces via trace_id for end-to-end debugging.
🎯 Week 3 Checkpoint
- Explain the Outbox pattern to a junior engineer — why is it needed and how does it work?
- When does Kafka guarantee ordering? When does it not?
- Design a presence system for a chat app with 10M concurrent users — connection tracking, heartbeat, fan-out
- Saga choreography vs orchestration — what are the debugging trade-offs?
- What's the difference between p50 and p99 latency? Why does p99 matter more for user experience?
Before Week 4 — How To Read Any Full System Design
- Step 1: identify the user action. "Send message", "watch video", "request ride", "open feed".
- Step 2: draw the write path first. This is usually where durability, ordering, and fan-out decisions live.
- Step 3: draw the read path separately. Reads often use different stores, caches, or denormalized views.
- Step 4: find the state owner. For each entity, say which component is the source of truth.
- Step 5: ask what is synchronous vs asynchronous. This reveals latency and coupling trade-offs.
- Step 6: find the hottest key, shard, queue, or celebrity-case bottleneck.
- Step 7: end with failure handling: retries, dedup, replay, reconciliation, observability.
Key Design Decisions
- Cassandra: partition by chat_id, cluster by timestamp DESC — efficient time-ordered reads per chat
- Message ID: Snowflake (timestamp + server + seq) — globally unique, sortable, no coordination
- E2E encryption: Signal Protocol — keys exchanged out-of-band, server stores only ciphertext
- WebSocket server → stateful, use consistent hashing to route user's connections to same server
- Message state machine: SENT → DELIVERED (ACK from recipient device) → READ (read receipt)
- Offline delivery: store in Cassandra + push notification. On reconnect, fetch unread since last_seen_timestamp.
Group Messages — Fan-Out Strategy
- Small groups (<100 members): fan-out on write — write message to each member's inbox
- Large groups: fan-out on read — members pull from shared group inbox on open
- Bounded fan-out: limit group size (WhatsApp: 1024) to cap write amplification
- Group membership: cached in Redis, backed by PostgreSQL
- Delivery receipt for groups: message delivered when ALL members receive (complex)
Scale Numbers to Reason With
- 500 hours of video uploaded per minute
- 1 hour of video → ~5GB raw → ~1.5GB after compression across qualities
- Transcoding: parallelize per video segment (5s chunks) — 1 hour video → 720 chunks → 720 parallel jobs
- CDN: 95%+ of views served from edge — origin only serves cache misses
- View count: approximate with Redis INCR per shard, aggregate periodically. Exact count: HyperLogLog for unique viewers.
Recommendation System (Offline ML)
- Watch history + clicks → feature store → ML training pipeline (Spark)
- Pre-compute recommendations per user nightly → store in Redis/DynamoDB
- Real-time signals: last 10 watch actions → re-score with lightweight model
- Cold start: new users get popular videos by category until enough signal
- Candidate generation → ranking → filtering (already watched, policy) → serve
Location Service
- Driver sends location every 4 seconds via WebSocket
- Geohash: encode lat/lng as string prefix (shorter string = larger cell)
- Redis: GEOADD key lng lat driver_id. GEOSEARCH for radius queries.
- Or: geohash prefix → Redis SET of driver IDs in that cell. Query neighboring cells.
- Location history: write to Cassandra for trip replay and billing
Matching Service
- Rider requests → query nearby available drivers (geosearch)
- Score drivers: distance + rating + vehicle type + surge zone
- Dispatch to top N drivers simultaneously (first to accept wins)
- Distributed lock on driver during dispatch window (Redis SET NX PX 30000)
- ETA: precomputed road graph (OSRM or Google Maps Platform)
- Surge pricing: supply/demand ratio per geohash cell, multiplier applied at pricing service
URL Shortener
- Short code generation: base62(auto-increment ID) — simple, unique, sortable
- Alternative: MD5(url)[0:7] — collision check required, no guarantee of uniqueness
- Redirect: 301 (permanent — browser caches, saves future requests) vs 302 (temporary — allows analytics tracking)
- Storage: DynamoDB (short_code → long_url + metadata). 100M URLs/day → Cassandra for write throughput.
- Cache hot redirects in Redis: top 10% of links → 90%+ of traffic
- Analytics: log redirect events to Kafka → Spark/Flink pipeline → ClickHouse for querying
- Custom domains: DNS CNAME to CDN → CDN routes to redirect service
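The base62(auto-increment ID) scheme from the first bullet is a few lines; 7 characters of base62 cover 62⁷ ≈ 3.5 trillion IDs:

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def base62(n: int) -> str:
    """Encode an auto-increment ID as a short code (digits, then a-z, then A-Z)."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))
```

Because the input is a unique DB-assigned ID, codes are unique by construction, with none of the collision checking the MD5-prefix alternative needs.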
Redis Cluster Architecture
- Hash slots: 16384 slots distributed across nodes. CRC16(key) % 16384 → slot → node.
- Each master has N replicas. On master failure a replica is auto-promoted (Sentinel for standalone deployments; failover is built into Cluster mode).
- Hot key problem: single key too popular → one slot hot. Solution: replicate hot key across shards with suffix (user:42:0, user:42:1).
- CLUSTER KEYSLOT cmd: check which slot a key maps to
- Cross-slot transactions: not supported — use hash tags ({user:42}.profile and {user:42}.orders hash to the same slot) to co-locate related keys, or run a Lua script against keys in one slot
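The hot-key suffix trick from above, sketched against a plain dict (the fan-out factor is an assumed tuning knob; in Redis Cluster each suffixed key hashes to a different slot, so reads spread across nodes):

```python
import random

FANOUT = 4  # number of replicas of the hot key — assumed tuning knob

def hot_key_write(cache: dict, key: str, value) -> None:
    """Write the same value under N suffixed keys (user:42:0 .. user:42:3)
    so the read load lands on N slots instead of one."""
    for i in range(FANOUT):
        cache[f"{key}:{i}"] = value

def hot_key_read(cache: dict, key: str):
    """Each reader picks a random replica — load spreads roughly evenly."""
    return cache[f"{key}:{random.randrange(FANOUT)}"]
```

The cost is write amplification (N writes per update) and a brief window where replicas can disagree, so this suits read-heavy, slightly-stale-tolerant data like profiles or counters.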
Fan-Out on Write (Push)
- On tweet: push tweet_id to all followers' feed caches (Redis ZSET, score=timestamp)
- O(followers) work at write time. Read is O(1) — just fetch pre-built feed.
- Celebrity problem: user with 50M followers = 50M Redis writes per tweet
- Good for: regular users (<1M followers) — fast feed reads
Fan-Out on Read (Pull)
- On feed load: fetch recent tweets from all followed users, merge + sort
- Heavy reads but no write amplification for celebrities
- N followed users → N DB queries → merge sort → paginate. Slow for users following many accounts.
- Good for: celebrities — no write amplification; each follower pays a small cost at read time instead
Hybrid Approach — The Actual Twitter Answer
Classify users by follower count. Regular users (<1M followers): fan-out on write to Redis ZSET (score=timestamp). Celebrity users (>1M): on read, merge their recent tweets with the user's pre-built feed from Redis. This is what Twitter actually did. When you mention this unprompted, it signals you understand real production trade-offs, not just textbook patterns. Also discuss: feed truncation (keep only last 800 tweets in cache), backfill on first load, and how promoted tweets are injected.
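The hybrid merge is a simple k-way merge over timestamp-sorted streams. A sketch, with the pre-built feed and celebrity pulls modeled as (timestamp, tweet_id) lists:

```python
import heapq

def hybrid_feed(prebuilt: list[tuple[int, str]],
                celebrity_tweets: list[tuple[int, str]],
                limit: int = 20) -> list[str]:
    """prebuilt: entries fanned out to the user's Redis ZSET at write time.
    celebrity_tweets: recent tweets pulled at read time for followed
    >1M-follower accounts. Merge both, newest first, truncate to limit."""
    merged = heapq.merge(sorted(prebuilt, reverse=True),
                         sorted(celebrity_tweets, reverse=True),
                         reverse=True)
    return [tweet_id for _, tweet_id in list(merged)[:limit]]
```

In production the prebuilt side is already sorted (ZREVRANGE), so the only real work is pulling each celebrity's recent tweets from a per-author cache and merging, which is why the read stays cheap even with a few celebrity follows.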
LLD: Parking Lot System
- Entities: ParkingLot, Level, Slot, Ticket, Vehicle (abstract), Car/Bike/Truck (concrete)
- Strategy pattern: PricingStrategy interface — HourlyPricing, FlatRatePricing. Swap without changing Ticket.
- Factory pattern: VehicleFactory.create(type) — returns correct Vehicle subclass
- Observer pattern: SlotAvailabilityBoard subscribes to slot state changes
- Concurrency: booking a slot → check availability + assign must be atomic. Use optimistic locking: slot has version field, CAS update fails if version changed.
- State machine: AVAILABLE → RESERVED → OCCUPIED → AVAILABLE
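The optimistic-locking bullet can be shown in a few lines. This in-memory sketch mirrors what a SQL backend does with `UPDATE slot SET state='RESERVED', version=version+1 WHERE id=? AND version=?` (zero rows updated = conflict):

```python
from dataclasses import dataclass

@dataclass
class Slot:
    slot_id: str
    state: str = "AVAILABLE"
    version: int = 0

class ConflictError(Exception):
    """Raised when the slot changed between our read and our write."""

def reserve(slot: Slot, expected_version: int) -> None:
    """Compare-and-set: succeeds only if nobody touched the slot since we
    read it — this closes the TOCTOU window between check and reserve."""
    if slot.version != expected_version:
        raise ConflictError("slot changed since read — retry with fresh read")
    if slot.state != "AVAILABLE":
        raise ConflictError("slot not available")
    slot.state = "RESERVED"
    slot.version += 1
```

On `ConflictError` the caller re-reads the slot and retries (or picks another slot); under low contention this beats pessimistic locks because the happy path takes no lock at all.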
LLD: Notification System
- Entities: Notification, NotificationChannel (abstract), EmailChannel, SMSChannel, PushChannel
- Strategy pattern: each channel implements send(notification). Plug in new channels without changing core.
- Template method: BaseNotification.send() calls format() + deliver() — subclasses override format()
- Queue-based delivery: write to Kafka topic per channel → consumers retry with exponential backoff
- Idempotency: idempotency_key per notification → dedup on delivery side
- Rate limiting per user per channel: prevent spam — max 10 emails/day per user
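The template-method bullet, sketched: `send()` fixes the skeleton, channels only override `format()`. Class and method names follow the entities listed above; `deliver()` here just echoes instead of calling a real provider API:

```python
from abc import ABC, abstractmethod

class BaseNotification(ABC):
    def send(self, user: str, message: str) -> str:
        """Template method: the skeleton (format then deliver) is fixed;
        subclasses customize only the formatting step."""
        payload = self.format(message)
        return self.deliver(user, payload)

    @abstractmethod
    def format(self, message: str) -> str: ...

    def deliver(self, user: str, payload: str) -> str:
        # Real implementation would call the channel's provider API.
        return f"to={user} payload={payload}"

class EmailChannel(BaseNotification):
    def format(self, message: str) -> str:
        return f"<html><body>{message}</body></html>"

class SMSChannel(BaseNotification):
    def format(self, message: str) -> str:
        return message[:160]  # single SMS segment limit
```

Adding a PushChannel is one new subclass with one method — the queue consumers, retry logic, and idempotency handling around `send()` stay untouched.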
LLD Interview Signal
The gap between senior and mid-level in LLD is concurrency awareness. Don't just draw classes — say "this slot booking has a TOCTOU (time-of-check-time-of-use) race condition between the availability check and the reservation write. I'd handle this with optimistic locking — add a version field to the Slot entity. The update only succeeds if version matches what we read. On conflict, retry with fresh read." Proactively naming race conditions and their solutions is the signal interviewers are looking for.
🏆 Day 29–30: Full Mock System Design Sessions
- Day 29: 45-min mock — Design a real-time gaming leaderboard for 10M players. Cover: storage (Redis ZSET), real-time updates, global ranking, weekly/monthly resets, top-K queries. Time yourself strictly.
- Day 30: 45-min mock — Design a distributed job scheduler (cron at scale). Cover: exactly-once execution guarantee, failure recovery, priority queues, worker assignment, monitoring.
- Standard: HLD complete in first 10 min, deep dive on hardest component, trade-offs raised unprompted before interviewer asks
- Signal: You should be driving the conversation — proposing alternatives, naming limitations, asking clarifying questions proactively. Not answering reactively.
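The Day 29 leaderboard maps almost one-to-one onto a Redis sorted set. An in-memory sketch of the three operations it needs, with the real Redis commands noted in comments (resets are handled by keying the board per period, e.g. board:2025-W19):

```python
class Leaderboard:
    """In-memory sketch of the ZSET operations a leaderboard needs."""

    def __init__(self):
        self.scores: dict[str, float] = {}

    def incr(self, player: str, delta: float) -> float:
        # Redis: ZINCRBY board delta player
        self.scores[player] = self.scores.get(player, 0.0) + delta
        return self.scores[player]

    def top_k(self, k: int) -> list[tuple[str, float]]:
        # Redis: ZREVRANGE board 0 k-1 WITHSCORES — O(log N + k)
        return sorted(self.scores.items(), key=lambda kv: -kv[1])[:k]

    def rank(self, player: str) -> int:
        # Redis: ZREVRANK board player (0-based, highest score first)
        return sorted(self.scores.values(), reverse=True).index(
            self.scores[player])
```

At 10M players a single ZSET is ~hundreds of MB and every operation stays logarithmic; the harder parts of the mock are the global-vs-regional split and real-time update fan-out, not the data structure.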
This week is bonus material with no time pressure — every concept here builds on the system design foundation from the prior weeks, which stands complete on its own. AI agent design is now appearing in interviews at companies building AI products.
ReAct Pattern (Reason + Act)
- LLM alternates: Thought → Action (tool call) → Observation → repeat until done
- Handles ambiguity well — can discover information dynamically
- Risk: error compounds over long chains. Add max_steps hard limit.
- Best for: research tasks, open-ended exploration, web navigation
- Frameworks: LangGraph, LlamaIndex Agents
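The ReAct loop is short enough to sketch whole. Here the `llm` callable is a stand-in for a real model call that returns either a tool action or a final answer (the tuple protocol is an illustrative assumption, not any framework's API); note the `max_steps` hard limit the bullets call for:

```python
def react_loop(llm, tools: dict, task: str, max_steps: int = 8) -> str:
    """Minimal ReAct skeleton. `llm(prompt)` returns either
    ("act", tool_name, tool_input) or ("finish", answer).
    max_steps is the hard stop against runaway loops."""
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm("\n".join(transcript))  # Thought happens inside
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, tool_input = decision
        observation = tools[tool_name](tool_input)   # Action
        transcript.append(f"Action: {tool_name}({tool_input})")
        transcript.append(f"Observation: {observation}")  # fed back in
    return "max_steps reached without an answer"
```

Everything the bullets warn about lives in this loop: a bad Observation at step 1 poisons every later prompt (compounding error), and without `max_steps` a confused model can cycle forever.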
Plan-and-Execute Pattern
- Planner LLM: generates full step-by-step task list upfront
- Executor: smaller/cheaper model runs each step one at a time
- Dramatically reduces cost — expensive Planner called once
- More auditable — plan is a visible, loggable artifact
- Best for: financial workflows, structured data pipelines, known task shapes
- Add re-planning step if a step fails unexpectedly
ReAct vs Plan-Execute Trade-off
ReAct: flexible, handles unknown task shapes, discovers information dynamically. But error compounds — step 3 failure based on wrong step 1 reasoning can cascade. Plan-Execute: auditable, cost-efficient (Executor = small/cheap model), predictable — but brittle if plan is wrong at step 1. Production guideline (Anthropic): start with the simplest approach. Most production agent failures come from over-engineering, not under-engineering. Add complexity only when needed.
Memory Architecture
- In-context (working): current conversation + tool results in prompt window — fast but bounded (~128K tokens)
- External long-term: vector DB (pgvector, Pinecone) — embed past interactions, retrieve relevant on each turn
- Episodic: summarize older conversation chunks → write summary back to context. Google ADK compaction approach.
- Semantic cache: embed user query → cosine match recent Q&A pairs → return cached if similarity > threshold. Can cut LLM spend 30–60% on repetitive traffic.
Tool Design for Agents
- Every tool = JSON schema: name, description, parameters, return type
- Description quality matters enormously — LLM uses it to decide when and how to call
- Tools must be idempotent where possible — agents retry on failure
- MCP (Model Context Protocol): Anthropic's standard for plug-and-play tool integrations
- Always validate + sanitize tool outputs before feeding back to LLM — prevent prompt injection from tool results
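What "every tool = JSON schema" looks like in practice. The tool definition below is hypothetical (name, fields, and description are illustrative, mirroring common function-calling formats), plus a cheap structural check to run before executing whatever the agent asked for:

```python
# Hypothetical tool definition — shape mirrors common function-calling
# schemas; the name, parameters, and description are illustrative.
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Get current weather for a city. Use when the user asks about "
        "temperature, rain, or conditions. Returns metric units by default."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'Berlin'"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["city"],
    },
}

def validate_call(tool: dict, args: dict) -> list[str]:
    """Structural check on the agent's proposed call — reject before
    executing rather than after. Returns a list of problems (empty = ok)."""
    errors = []
    props = tool["parameters"]["properties"]
    for req in tool["parameters"]["required"]:
        if req not in args:
            errors.append(f"missing required arg: {req}")
    for k in args:
        if k not in props:
            errors.append(f"unknown arg: {k}")
    return errors
```

Note how much of the description is *when* to call, not just what it does — that sentence is what the LLM actually routes on.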
Context Engineering — The Real Production Bottleneck
Long-running agents exhaust their context window. Naive solution: bigger context window. Real solution: treat context as a compiled artifact. Separate durable state (DB) from per-call working context (prompt). Apply compaction: summarize old events, keep recent N turns verbatim, inject only relevant memory chunks retrieved from vector store. This is the difference between an agent that works for 3 turns and one that works for 300 turns in production.
Single vs Multi-Agent — When to Escalate
Start single agent + many tools. Add multi-agent only when: (1) tasks are genuinely parallel and independent, (2) specialist isolation prevents error contamination between domains, (3) different agents need different tool permissions (security boundary). Don't add multi-agent for complexity's sake. Production lesson from Anthropic: "The most successful implementations consistently used simple, composable patterns rather than complex frameworks." Debugging multi-agent race conditions and message passing failures is brutal.
AI Gateway Key Decisions
- Rate limit in tokens (TPM), not requests: one request can cost 100 or 100K tokens
- Model routing: classify query complexity → route simple to small/cheap model, complex to GPT-4o. 60–80% cost reduction is achievable, depending on traffic mix.
- Semantic cache: embed prompt → cosine search → return cached if similarity > 0.95. Massive savings on repeated queries.
- Fallback chain: primary model timeout → retry secondary → hardcoded fallback response
- Streaming: SSE for token-by-token output — reduces perceived latency dramatically
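The "rate limit in tokens, not requests" bullet is just a token bucket denominated in LLM tokens. A sketch (the injectable clock is for testability; capacity is the per-key TPM quota):

```python
import time

class TokenBudgetLimiter:
    """Token bucket denominated in LLM tokens: a 100K-token request
    drains the bucket 1000x faster than a 100-token one, which a
    request-count limiter would treat identically."""

    def __init__(self, tpm: int, clock=time.monotonic):
        self.capacity = float(tpm)
        self.tokens = float(tpm)      # start full
        self.rate = tpm / 60.0        # tokens replenished per second
        self.clock = clock
        self.last = clock()

    def allow(self, estimated_tokens: int) -> bool:
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if estimated_tokens <= self.tokens:
            self.tokens -= estimated_tokens
            return True
        return False  # caller queues, sheds, or routes to another key
```

One wrinkle worth raising: you only *estimate* tokens up front (prompt tokens are known, completion tokens are not), so gateways typically reserve an estimate and reconcile against the actual usage reported after the call.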
RAG Production Trade-offs
- Chunking: fixed-size (fast) vs semantic (better recall). Parent-child: retrieve small chunk, return larger parent for context.
- Hybrid search: ANN (semantic) + BM25 (keyword) combined with RRF scoring — typically beats either retriever alone on precision and recall
- Reranker: cross-encoder re-scores top-20 → return top-3. Expensive but dramatically improves answer quality.
- Context budget: dynamically allocate tokens: system prompt + examples + retrieved chunks + answer reserve. Hard cap on chunk injection.
- Eval: RAGAS metrics: faithfulness, answer relevance, context recall. Run on golden dataset per deploy.
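Reciprocal Rank Fusion, the RRF mentioned in the hybrid-search bullet, is a few lines: each document scores the sum of 1/(k + rank) across the result lists it appears in, so documents ranked well by *both* BM25 and ANN float to the top. k=60 is the commonly used default:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked result lists (e.g. BM25 and ANN) without
    needing their scores to be on comparable scales."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The reason RRF is popular over weighted score sums: it only consumes rank positions, so you never have to normalize a BM25 score against a cosine similarity.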
The Three Hard Problems
1. Context assembly: what code do you put in the prompt and how do you rank relevance? AST-based chunking preserves function/class boundaries. Retrieve by semantic similarity + recency + call-graph proximity. Budget: 8K tokens for context, 4K for completion.
2. Latency: inline completions need <200ms p50. Use smaller/cached model for inline autocomplete; larger for chat and complex refactoring. Pre-warm model for active files.
3. Safety: LLM-generated code runs on your machine. Sandbox every tool call (file write, shell exec) in a container. Never auto-apply diffs — always require user approval. Cost cap per session: abort if token spend exceeds budget.
Observability for Agents
- Trace every tool call: span ID, tool name, input, output, latency, token cost
- Agent run trace: full Thought → Action → Observation tree per run_id — stored in persistent log
- Token cost per step: attribute cost to each LLM call — find expensive steps to optimize
- Eval pipeline: RAGAS / custom golden dataset — run on every deploy, catch regressions
- Tools: LangSmith, Langfuse, Helicone, Arize Phoenix — purpose-built for LLM tracing
Safety Layers
- Input validation: prompt injection detection, PII scrubbing before LLM
- Output validation: structured output schemas (JSON Schema/Pydantic) — reject malformed
- Tool permission scoping: each agent gets only tools it needs — least privilege
- Sandboxed execution: code-running agents in network-isolated containers with CPU/mem limits
- Human-in-the-loop gates: for irreversible actions (send email, deploy, delete data) — explicit approval required
- Max iterations + cost cap: hard stop prevents runaway loops
Real Production Failure Modes
(1) Compounding hallucination: wrong tool call → wrong observation → wrong next step → cascades. Fix: validate tool output schema, add reflection on unexpected results. (2) Context poisoning / prompt injection: malicious content in RAG retrieval injects instructions. Fix: separate retrieved content from instruction sections structurally. (3) Cost explosion: agent loops without progress. Fix: max_steps + cost budget hard cap + loop detection (hash of recent states). (4) No audit trail: impossible to debug what the agent did. Fix: persistent trace log from day one — non-negotiable.
🤖 Week 5 Checkpoint — AI Agent Design
- Mock 1: "Design an AI research agent that browses the web and summarizes findings." Draw the ReAct loop, tool registry, memory strategy. What breaks at 100 concurrent users?
- Mock 2: "Design a RAG system for a 10M document legal corpus." Cover chunking strategy, hybrid search, reranking, context budget, latency SLA, eval pipeline.
- Mock 3: "Design GitHub Copilot." Cover AST indexing, context assembly, latency budget, sandboxed execution, cost control. Time yourself: 45 mins, drive the hard problems proactively.
- The signal: In every agent design, proactively raise observability, cost caps, safety guardrails, and what happens when the agent takes a wrong turn on iteration 3 of 10. Raising these unprompted is the interview signal.