01 Key Concepts
Processes vs Threads
- Process: own virtual address space, PCB, file descriptors
- Thread: shares address space, cheaper context switch
- Context switch cost: ~1–10μs (register save/restore + TLB flush)
- Go goroutines: M:N threading, GMP scheduler
- fork() + exec(): how your shell spawns Go binaries
CPU Scheduling
- CFS (Linux): completely fair scheduler, virtual runtime
- Run queue per CPU (NUMA-aware)
- Priority: nice value -20 to 19
- Real-time: SCHED_FIFO, SCHED_RR (Kafka latency tuning)
- CPU affinity: pin OS threads to cores (goroutines can't be pinned directly — lock the goroutine to a thread with runtime.LockOSThread, then pin that thread via sched_setaffinity/taskset)
Virtual Memory
- Each process: private virtual address space (48-bit on x86-64)
- Page table: VA → PA translation
- TLB: cache for page table entries (32–1024 entries)
- TLB miss: page table walk → expensive (100+ cycles)
- Huge pages (2MB): reduce TLB misses for large data
Memory Segments
- Text: executable code (read-only, shared)
- Stack: per-thread, grows down, limited (~8MB)
- Heap: malloc/new, grows up, fragmentation risk
- mmap: file-backed or anonymous (shared libs, Redis AOF)
- Go goroutine stack: starts 2KB, grows dynamically
I/O Models
- Blocking I/O: thread sleeps until data ready
- Non-blocking: EAGAIN if no data, spin polling
- Select/Poll: O(n) per call, kernel copies FD set
- epoll: O(1) event notification, edge vs level trigger
- io_uring: async I/O, zero-copy, submission queue rings
Concurrency Primitives
- Mutex: kernel futex → userspace fast path
- Semaphore: counting primitive, N permits (resource pools)
- Condition variable: wait/signal (producer-consumer)
- Spinlock: busy-wait, never sleep (interrupt context)
- RWMutex: concurrent reads, exclusive write
02 Must-Know Deep Dives
🔥 The Life of a Syscall — Crossing the Boundary
Understanding how your code talks to hardware is critical for backend performance. When you call read() in Go or Rust, it doesn't just run code; it triggers a CPU context switch.
- 1. Trap: The app puts the syscall ID in a register and executes a SYSCALL instruction.
- 2. Mode Switch: CPU switches from User Mode (Ring 3) to Kernel Mode (Ring 0).
- 3. Jump: Kernel looks at the Syscall Table and jumps to the function handling that ID.
- 4. Execution: Kernel performs the privileged action (reading disk, writing to socket).
- 5. Return: Kernel puts result in a register and switches back to User Mode.
🔥 epoll — How Go's Net Package Works
Go's net package uses epoll under the hood via the netpoller. When you do conn.Read(), Go registers the fd with epoll, parks the goroutine (not the OS thread), and wakes it when data arrives. This is how Go handles 100K+ concurrent connections with just hundreds of OS threads.
Level-triggered (LT) vs Edge-triggered (ET): LT fires as long as data available (default, safer). ET fires only on state change (zero-copy trick, more complex). Nginx uses ET mode for performance. Go uses LT for simplicity.
🔥 CPU Cache Hierarchy — Why Cache Misses Kill Performance
L1 cache: 32KB, ~1ns latency. L2: 256KB, ~5ns. L3: 8-32MB, ~30ns. Main memory: ~100ns. A single L3 cache miss costs 100× an L1 hit. This matters for your Go code.
Cache line: 64 bytes. When you access element [0] of a slice, elements [1]-[7] (for int64) are loaded into cache for free. Traversing an array is fast; pointer chasing (linked list) is slow because each pointer dereference is a potential cache miss.
False sharing: Two goroutines updating different fields in the same struct → same cache line → cache invalidation ping-pong between CPUs. Fix: pad struct fields to 64-byte boundaries. Go's runtime uses this trick internally (e.g., sync.Pool's per-P poolLocal is padded to keep Ps off each other's cache lines).
🔥 io_uring — The Future of Linux I/O
io_uring (Linux 5.1+) uses two shared memory ring buffers between kernel and user space: submission queue (SQE) and completion queue (CQE). No syscall needed per I/O — batch many ops, one io_uring_enter(). In kernel-poll mode: zero syscalls, pure ring operations.
Real impact: Cloudflare uses io_uring for their proxy. PostgreSQL 14+ has io_uring option. For Go: the github.com/pawelgaczynski/giouring library. Your distributed Rust service would benefit significantly — Tokio's io_uring backend is in progress.
🔥 OOM Killer + Memory Overcommit
Linux overcommits memory by default — malloc() reserves virtual address space, but physical pages are only allocated when first touched (demand paging; fork() adds copy-on-write on top of this). When physical memory runs out, the OOM killer picks a process to kill based on oom_score.
In Kubernetes: containers with memory limit hit OOMKill (SIGKILL). This is why your K8s pods with memory pressure crash hard with no logs. Set memory.request == memory.limit (Guaranteed QoS) for critical services to avoid this. Go's GC doesn't help once you hit the cgroup limit.
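The request == limit pattern from the paragraph above, as a manifest sketch (pod name, image, and sizes are hypothetical placeholders):

```yaml
# Guaranteed QoS: requests == limits for every resource, so the pod
# is deprioritized by the OOM killer and its memory ceiling is fixed.
apiVersion: v1
kind: Pod
metadata:
  name: critical-service   # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
      resources:
        requests:
          memory: "512Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"
          cpu: "500m"
```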
03 Resources
- 40-min lecture. Dense, no handholding. xv6 examples that map directly to Linux concepts.
- 30 min by the person who built it. Pure signal, no beginner explanations.
- The canonical reference for Linux performance tools. CPU, memory, I/O flame graphs. Everything you need to debug production perf issues.
- Read the schedule() function. 500 lines that explain how Go goroutines are scheduled onto OS threads. More valuable than any blog post.
- Canonical source for paging, TLB, huge pages. Dry but authoritative.
- Read: Chapters 13-23 (Memory), 26-33 (Concurrency), 36-44 (I/O). Skip the rest. Best OS book available, free online.
04 Quick Revision Notes
FLASH CARDS — Review These Before Sleeping
- False sharing fix: `_ [56]byte` padding after the hot field (fills the 64-byte cache line), or use separate cache-line-aligned structs.