01 System Design Interview Framework
Ask before designing. "Who are the users? What scale: DAU, requests/sec? Read vs write ratio? Latency SLA? Consistency requirements? Global or single region?" Interviewers reward engineers who discover requirements rather than wait for them.
Back-of-envelope math shows engineering judgment. "100M users, 10% DAU = 10M/day. 1 post each ≈ 115 writes/sec (10M / 86,400 s). 20x read:write ≈ 2,300 reads/sec. 1KB/post × 10M = 10GB/day." Order of magnitude is the point, so round aggressively.
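The arithmetic above can be reproduced in a few lines. This is a minimal sketch using the numbers from the quoted example; the function name `estimate` and its parameter names are illustrative, not part of any standard tool.

```python
SECONDS_PER_DAY = 86_400


def estimate(total_users, dau_fraction, posts_per_user, read_write_ratio, bytes_per_post):
    """Return rough daily actives, write/read rates, and storage growth per day."""
    dau = total_users * dau_fraction
    writes_per_sec = dau * posts_per_user / SECONDS_PER_DAY
    reads_per_sec = writes_per_sec * read_write_ratio
    storage_gb_per_day = dau * posts_per_user * bytes_per_post / 1e9
    return {
        "dau": dau,
        "writes_per_sec": writes_per_sec,
        "reads_per_sec": reads_per_sec,
        "storage_gb_per_day": storage_gb_per_day,
    }


# 100M users, 10% DAU, 1 post each, 20x reads, 1 KB per post
for name, value in estimate(100_000_000, 0.10, 1, 20, 1_000).items():
    print(f"{name}: {value:,.0f}")
```

In the interview itself you would round these to roughly 100 writes/sec and a couple of thousand reads/sec; the code only confirms the order of magnitude.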
Define the APIs first: what does the client call? Then draw the main flow: client → load balancer → service → cache → DB. Stay on one level of abstraction. Explicitly state what you're skipping ("I'll skip auth, focusing on the write path").
Ask "Where should I go deep?" or pick it yourself. Show internals: why Cassandra for messages (write-optimized LSM), why Redis ZSET for feeds (sorted by score, O(log n) insert), why Kafka for fan-out (durable, replay, backpressure handling).
Proactively raise issues before they're asked. "One bottleneck here is celebrity fan-out. I'd handle this with hybrid push/pull." Discuss trade-offs. End with: "What aspect would you like me to go deeper on?"
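To make "hybrid push/pull" concrete, here is a toy in-memory sketch. It assumes a follower-count cutoff decides who counts as a celebrity: ordinary authors push posts into follower inboxes at write time, while celebrity posts are pulled and merged at read time. The class `FeedService`, its fields, and the threshold value are all hypothetical; a real system would back these with a cache and a queue.

```python
from collections import defaultdict


class FeedService:
    def __init__(self, celebrity_threshold=10_000):  # assumed cutoff, tune per system
        self.threshold = celebrity_threshold
        self.followers = defaultdict(set)   # author -> follower ids
        self.following = defaultdict(set)   # user -> authors they follow
        self.inbox = defaultdict(list)      # user -> posts pushed at write time
        self.outbox = defaultdict(list)     # author -> that author's own posts

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def is_celebrity(self, author):
        return len(self.followers[author]) >= self.threshold

    def post(self, author, ts, text):
        item = (ts, author, text)
        self.outbox[author].append(item)
        if not self.is_celebrity(author):
            # Push path: fan out on write to every follower's inbox.
            for follower in self.followers[author]:
                self.inbox[follower].append(item)

    def feed(self, user, limit=10):
        # A set de-duplicates posts that were pushed before an author
        # crossed the celebrity threshold.
        items = set(self.inbox[user])
        for author in self.following[user]:
            if self.is_celebrity(author):
                # Pull path: merge celebrity posts at read time.
                items.update(self.outbox[author])
        return sorted(items, reverse=True)[:limit]
```

The trade-off to mention aloud: push keeps reads cheap but makes a celebrity post cost one write per follower, while pull moves that cost to read time for the few accounts where it matters.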
Latency Numbers to Know
- L1 cache hit: ~1ns · L2: ~4ns · RAM: ~100ns
- SSD read: ~16μs · HDD seek: ~10ms
- Network same datacenter: ~500μs
- Network cross-region: ~150ms
- Read 1MB from RAM: ~250μs · SSD: ~1ms
- 1M req/day ≈ 12 req/sec · 1B req/day ≈ 11.5K/sec
Pick the Right Database
- ACID transactions, complex joins → PostgreSQL
- Write-heavy, 100K+ writes/sec → Cassandra
- Flexible schema, JSON nesting → MongoDB
- Cache, session, leaderboard, pub-sub → Redis
- Graph traversal (social, fraud) → Neo4j
- Object storage (files, images, video) → S3 + CDN
Top 8 Design Questions
- WhatsApp / Slack: real-time messaging
- YouTube / Netflix: video at scale
- Twitter / Instagram: feed & fan-out
- Uber / Lyft: real-time geo matching
- Google Drive: file sync & storage
- Rate Limiter: distributed token bucket
- Notification System: multi-channel delivery
- URL Shortener: hash + redirect at scale
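For the rate limiter question, it helps to have the core token bucket in your head. The sketch below is a single-process version only; a distributed variant would keep the token count and timestamp in a shared store and update them atomically, which is not shown here. The class name `TokenBucket` and the injectable `clock` parameter are illustrative choices.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate                # tokens refilled per second
        self.capacity = capacity        # maximum burst size
        self.tokens = float(capacity)   # start full
        self.clock = clock              # injectable so tests can control time
        self.last = clock()

    def allow(self, cost=1):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=5, capacity=10)  # 5 req/sec sustained, bursts of 10
print(bucket.allow())
```

Refilling lazily on each call avoids a background timer, which is usually the design point worth stating in the interview.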
CAP Theorem in Practice
- Partitions (P) will happen in any distributed system, so during one you choose between C and A
- CP: strong consistency, less availability (payments, banking)
- AP: high availability, eventual consistency (social feeds, DNS)
- PACELC: even without a partition, you trade latency against consistency
- Redis: AP by default (async replication)
- PostgreSQL sync replication: CP
Quiz: System Design Judgment
Q1. 10M daily active users, each makes 3 API calls/day. Roughly how many requests/second?
Q2. A social feed that can show slightly stale posts is tolerating which trade-off?
Q3. When should you start drawing the architecture in a system design interview?
02 Behavioral Interviews: STAR Method
The STAR Framework
Situation: Context in 1-2 sentences. Project, team size, stakes.
Task: Your specific responsibility. Distinguish "I" from "the team".
Action: 70% of your answer. What did YOU specifically do, decide, build? Name the technical choices and why.
Result: Always quantify. "P99 latency from 250ms to 18ms", "60% reduction in deployment time", "zero incidents in 6 months". Vague results signal weak answers.
Production DB couldn't handle projected write load for an upcoming launch in 3 weeks. Risk of customer-facing downtime.
I was lead engineer responsible for choosing the storage solution and migrating 50M rows without downtime.
Compared Cassandra vs sharded PostgreSQL vs DynamoDB. Chose Cassandra for LSM-tree write performance. Built dual-write migration: new writes to both stores, backfilled old data, shifted read traffic 10% at a time, monitored error rates at each step with Grafana dashboards.
Zero-downtime migration in 10 days. Write throughput from 5K to 80K/sec. Feature launched on time, no production incidents.
Team planning to build a custom message queue to save infra cost, despite unclear requirements and 3-month timeline.
I was scoping the implementation but believed we were solving the wrong problem.
Prepared a data-driven TCO comparison: custom build (2 engineers × 6 months) vs managed Kafka ($800/month). Presented to tech lead with risk analysis. Proposed Kafka with a 2-week integration spike instead. Used data, not opinion.
Team adopted Kafka. Saved ~4 months of engineering. I led the integration. System has run reliably for 18 months with zero major incidents.
Service hitting 8% error rate under load. On-call alerts every other night. Customer complaints escalating.
I was the primary owner of this service and responsible for diagnosing and fixing it.
Used distributed tracing (Jaeger) to isolate the bottleneck: a DB query missing a composite index on a high-frequency path. Added index, implemented PgBouncer connection pooling, added circuit breakers on downstream dependencies.
Error rate: 8% → 0.02%. P99 latency: 1.2s → 90ms. Zero on-call pages for 4 months post-fix.
Common Behavioral Questions
- "Tell me about yourself" (practice: 90 seconds)
- "What's your biggest technical achievement?"
- "Tell me about a production incident you owned"
- "How do you handle competing priorities?"
- "Why do you want to leave your current role?"
- "Describe a time you failed. What did you learn?"
- "How do you keep up with new technologies?"
Leadership & Influence Questions
- "How do you resolve technical disagreements?"
- "How do you influence decisions without authority?"
- "How do you balance tech debt vs features?"
- "Tell me about a time a project went off-track"
- "How do you help junior engineers grow?"
- "Describe your approach to code review"
Quiz: Behavioral Interview
Q1. In STAR, where should you spend the most time?
Q2. Which result statement is strongest?
Q3. You disagree with your team's technical choice. Best approach?
03 Coding Interview Strategy
Read aloud. Ask: "Can the array be empty? Duplicates allowed? Input size?" Confirm examples. Never assume constraints.
State the naive approach first. "The brute force is O(n²): nested loops. It works but won't scale. Let me optimize." Interviewers want your thinking, not just code.
What pattern fits? Sliding window for subarrays, two pointers for sorted arrays, hash map for O(1) lookup, stack for nested structures, BFS for shortest path, DP for overlapping subproblems.
Meaningful variable names. Comment the key insight. Handle edge cases as you go and say so aloud. "I'm checking the empty case here." Don't code silently.
Trace your example by hand. Then test one edge case. Finally: "This is O(n) time: one pass. O(n) space: the hash map stores at most n elements."
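The steps above can be seen on a small example. This sketch uses two-sum (an illustrative choice, not prescribed by the steps) with the key insight and edge cases called out in comments, the way you would narrate them aloud.

```python
def two_sum(nums, target):
    """Return indices (i, j), i < j, with nums[i] + nums[j] == target, else None."""
    # Key insight: for each x, we only need to know whether target - x
    # has already been seen, which a hash map answers in O(1) on average.
    seen = {}  # value -> index where it first appeared
    for i, x in enumerate(nums):
        need = target - x
        if need in seen:
            return (seen[need], i)
        # Insert after the check so an element is never paired with itself.
        seen.setdefault(x, i)
    # Covers the empty list and the "no valid pair" case.
    return None


print(two_sum([2, 7, 11, 15], 9))
```

Time is O(n) for the single pass and space is O(n) for the map, which matches the complexity statement you would close with.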
Pattern → When to Use
- Sliding Window → max/min subarray of size k, substrings
- Two Pointers → sorted array, pair sum, palindrome check
- Hash Map → frequency count, two-sum, O(1) lookup
- Stack → matching brackets, next greater element
- BFS → shortest path, level-order, unweighted graph
- DFS → all paths, tree traversal, island count
- Binary Search → sorted array, find first/last position
- DP → overlapping subproblems: knapsack, LCS, edit distance
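Two of the patterns in this list are short enough to sketch side by side: two pointers for a pair sum in a sorted array, and a fixed-size sliding window for the maximum subarray sum of size k. The function names are illustrative.

```python
def pair_sum_sorted(nums, target):
    """Two pointers: nums must be sorted ascending. Returns (i, j) or None."""
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        total = nums[lo] + nums[hi]
        if total == target:
            return (lo, hi)
        if total < target:
            lo += 1   # need a larger sum, so move the left pointer right
        else:
            hi -= 1   # need a smaller sum, so move the right pointer left
    return None


def max_window_sum(nums, k):
    """Sliding window: largest sum of any k consecutive elements, or None."""
    if k <= 0 or k > len(nums):
        return None
    window = sum(nums[:k])
    best = window
    for i in range(k, len(nums)):
        # Slide by one: add the entering element, drop the leaving one.
        window += nums[i] - nums[i - k]
        best = max(best, window)
    return best
```

Both run in O(n) time with O(1) extra space, which is the improvement over the O(n²) brute force in each case.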
Complexity Reference
- O(1): HashMap get/set (average case), array index access
- O(log n): binary search, balanced BST lookup, a single TreeMap operation
- O(n): single pass, linear scan
- O(n log n): merge sort, heapsort, n TreeMap operations
- O(n²): nested loops, bubble sort
- O(2ⁿ): all subsets, naive recursive Fibonacci
- O(n!): all permutations
- Space O(n): storing an input-sized data structure
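As one concrete instance of the O(log n) row, here is a sketch of binary search for the first position of a target in a sorted array. The function name is illustrative; Python's standard `bisect.bisect_left` computes the same insertion index.

```python
def first_position(nums, target):
    """Index of the first occurrence of target in sorted nums, or -1."""
    lo, hi = 0, len(nums)          # search the half-open range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if nums[mid] < target:
            lo = mid + 1           # first occurrence must be right of mid
        else:
            hi = mid               # mid could still be the first occurrence
    if lo < len(nums) and nums[lo] == target:
        return lo
    return -1
```

Each iteration halves the range, so the loop runs at most about log₂(n) + 1 times.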
Edge Cases to Always Test
- Empty input / null
- Single element array
- All duplicate values
- Already sorted / reverse sorted
- Negative numbers
- Integer overflow (large sums)
- Disconnected graph (multiple components)
- Cycle in linked list or graph
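The last edge case is worth knowing how to check directly. A minimal sketch of Floyd's slow/fast pointer cycle detection for a singly linked list follows; the `Node` class is a hypothetical stand-in for whatever node type the problem provides.

```python
class Node:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next


def has_cycle(head):
    """Floyd's algorithm: the fast pointer laps the slow one only if a cycle exists."""
    slow = fast = head
    while fast is not None and fast.next is not None:
        slow = slow.next
        fast = fast.next.next
        if slow is fast:
            return True
    # fast ran off the end, so the list terminates.
    return False
```

This uses O(1) extra space, compared with O(n) for the alternative of storing visited nodes in a set.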
Quiz: Coding Strategy
Q1. "Find two numbers in a sorted array that sum to a target." Most efficient approach?
Q2. When should you use BFS over DFS for graph problems?
Q3. Which algorithm has O(n log n) time complexity?
04 Hard Technical Questions
Distributed Systems
- "How does Kafka guarantee exactly-once delivery?"
- "Explain split-brain and how to prevent it"
- "2PC vs Saga pattern: when to use each?"
- "How do you handle DB failover with zero data loss?"
- "What are vector clocks and when do you need them?"
- "Explain strong consistency with a concrete example"
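For the vector clock question, a small sketch can anchor the explanation. It assumes clocks are represented as dicts mapping a node id to a counter, with missing entries treated as 0; the function names `compare` and `merge` are illustrative.

```python
def compare(a, b):
    """Classify vector clock a relative to b: 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened-before b
    if b_le_a:
        return "after"       # b happened-before a
    return "concurrent"      # neither dominates: a conflict to resolve


def merge(a, b):
    """Element-wise maximum, used when a node receives another node's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

The "concurrent" result is the reason to use vector clocks at all: it identifies writes that a single timestamp would silently order.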
Concurrency & Performance
- "How do you detect and fix a deadlock?"
- "Mutex vs semaphore: difference and use cases?"
- "How do you debug a service with rising P99 latency?"
- "What is false sharing in CPU caches?"
- "Explain the thundering herd problem and solutions"
- "How does connection pooling improve throughput?"
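For the thundering herd question, two commonly cited mitigations are jittered TTLs and letting only one caller recompute an expired key. The sketch below combines both in a toy in-process cache; the class name `Cache`, its fields, and the ±10% jitter default are assumptions for illustration, not any particular library's API.

```python
import random
import threading
import time


class Cache:
    def __init__(self, base_ttl, jitter=0.1, clock=time.monotonic):
        self.base_ttl = base_ttl
        self.jitter = jitter
        self.clock = clock
        self.data = {}                  # key -> (value, expires_at)
        self.locks = {}                 # key -> lock guarding recomputation
        self.guard = threading.Lock()   # protects the locks dict itself

    def _ttl(self):
        # Spread expiries over +/- jitter so keys written together
        # do not all expire at the same instant.
        return self.base_ttl * (1 + random.uniform(-self.jitter, self.jitter))

    def _fresh(self, key):
        entry = self.data.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have refilled while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = loader(key)         # only one thread per key gets here
            self.data[key] = (value, self.clock() + self._ttl())
            return value
```

With many threads missing on the same key at once, `loader` runs once and the rest reuse its result, which is the behaviour to describe when the backing store is a database.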
Databases & Storage
- "How does MVCC enable non-blocking reads?"
- "Explain WAL and why it enables point-in-time recovery"
- "B-tree vs LSM-tree: which would you choose for a write-heavy workload?"
- "How does consistent hashing minimize resharding?"
- "Explain Bloom filters. What is the false positive rate formula?"
- "What is the N+1 query problem and how to fix it?"
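For the Bloom filter question, the standard approximation is p ≈ (1 − e^(−kn/m))^k for m bits, n inserted items, and k hash functions, with the optimal k ≈ (m/n)·ln 2. A minimal sketch of both follows; the function names are illustrative.

```python
import math


def false_positive_rate(m_bits, n_items, k_hashes):
    """Approximate Bloom filter false positive probability."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_hashes(m_bits, n_items):
    """Number of hash functions that minimizes the false positive rate."""
    return max(1, round(m_bits / n_items * math.log(2)))


# 10 bits per item is a common rule of thumb.
k = optimal_hashes(10_000, 1_000)
print(k, f"{false_positive_rate(10_000, 1_000, k):.4f}")
```

At 10 bits per item and 7 hash functions the rate comes out a little under 1%, which is a useful number to quote from memory.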
Common Interview Traps
- ✗ Jumping to design before clarifying requirements
- ✗ "I'd use microservices" without justifying the split
- ✗ Proposing a perfect system with no trade-offs
- ✗ Behavioral answers with no quantified results
- ✗ Coding in silence: always narrate your thinking
- ✗ Waiting to be asked about trade-offs: raise them yourself
How to Answer "Explain X" Technical Questions
Always use a 3-layer structure:
1. What: One sentence definition. "MVCC is a concurrency control mechanism that lets readers and writers operate at the same time without blocking each other."
2. How: Explain the mechanism. "Each write creates a new row version with a transaction timestamp. Readers see a snapshot as of their transaction start time. Old versions are kept until no active transaction needs them, then VACUUM cleans them up."
3. Trade-off / When: Show depth. "The downside is storage bloat from old row versions and the cost of VACUUM. You'd use MVCC in any OLTP system that needs high read throughput: PostgreSQL, MySQL InnoDB, and CockroachDB all use it."
Quiz: Technical Depth
Q1. Kafka exactly-once delivery requires which producer configuration?
Q2. Best solution to prevent thundering herd on cache expiry?
Q3. WAL (Write-Ahead Log) primarily enables which capability?
05 Compensation Strategy
5 Rules for Every Negotiation
1. Never state your number first. "I'd prefer to understand the full scope and compensation structure first. Can you share the range?" If pressed: "I'm flexible and looking at the full picture. What's the band?"
2. Never accept on the spot. "Thank you, I'm very excited about this. Can I take 24 hours to review the full package?" Always. Even if you're certain.
3. Negotiate base and equity separately. Get the highest base. Then negotiate equity. These come from different budget pools, and companies often have more flexibility on one.
4. Counter 20-30% above the offer. The worst they say is no. "I'm excited about the role. The offer is below my target. I was expecting X based on market data and the scope. Is there flexibility?" They almost always come back with something.
5. Use competing offers as leverage. Even a process elsewhere works. "I'm also in process with Y. I'd love to make this work, but need the comp to be competitive."
| Component | What to Ask | Negotiation Leverage |
|---|---|---|
| Base Salary | "What is the band for this level?" | High: most flexible lever |
| Equity / RSUs | "Grant size, vesting schedule, cliff?" | High at growth companies |
| Signing Bonus | "Is there a joining or retention bonus?" | Often used to close comp gaps |
| Performance Bonus | "Target %? Is first year guaranteed?" | Medium: often fixed by policy |
| Equity Refreshes | "Do you grant annual refresh RSUs?" | Critical for multi-year value |
| Remote Policy | "Remote-first or in-office requirement?" | Can offset lower base significantly |
Questions to Ask the Interviewer
Always ask 2-3. These signal strategic thinking and genuine interest:
- "What does success look like in this role at 6 months and 1 year?"
- "What's the biggest technical challenge the team is facing right now?"
- "How does engineering influence product decisions here?"
- "What does the on-call rotation look like? How many incidents per month?"
- "How does the team handle technical debt? Is there dedicated time for it?"
- "What does career growth look like from this role?"
06 Best Resources
Best visual system design walkthroughs on YouTube. Watch at least 5 full design videos before your interview.
Pattern-recognition approach to LeetCode. Start with NeetCode 150. Understand patterns, not just solutions.
Intuition-first. Watch how trade-offs are articulated verbally โ that's the real interview skill.
Built by FAANG hiring managers. Most practical written guide for system design at any level.
Free, comprehensive. Behavioral, coding, system design. The most complete single free resource.
Crowdsourced comp by company and level. Use this to anchor every salary negotiation.
Organized by pattern. Do 2-3 problems per pattern. Focus on understanding the approach, not memorizing answers.