OPERATING SYSTEMS · BACKEND-RELEVANT

Operating Systems

Comprehensive guide to processes, memory management, concurrency, I/O models, and system internals that impact production software performance.

01 · Key Concepts

Processes vs Threads

  • Process: own virtual address space, PCB, file descriptors
  • Thread: shares address space, cheaper context switch
  • Context switch cost: ~1–10μs (register save/restore + TLB flush)
  • Go goroutines: M:N threading, GMP scheduler
  • fork() + exec(): how your shell spawns Go binaries

CPU Scheduling

  • CFS (Linux): completely fair scheduler, virtual runtime
  • Run queue per CPU (NUMA-aware)
  • Priority: nice value -20 to 19
  • Real-time: SCHED_FIFO, SCHED_RR (Kafka latency tuning)
  • CPU affinity: pin OS threads to cores (Go workers: runtime.LockOSThread + sched_setaffinity)

Virtual Memory

  • Each process: private virtual address space (48-bit on x86-64)
  • Page table: VA → PA translation
  • TLB: cache for page table entries (32–1024 entries)
  • TLB miss: page table walk → expensive (100+ cycles)
  • Huge pages (2MB): reduce TLB misses for large data

Memory Segments

  • Text: executable code (read-only, shared)
  • Stack: per-thread, grows down, limited (~8MB)
  • Heap: malloc/new, grows up, fragmentation risk
  • mmap: file-backed or anonymous (shared libraries, memory-mapped data files)
  • Go goroutine stack: starts 2KB, grows dynamically

I/O Models

  • Blocking I/O: thread sleeps until data ready
  • Non-blocking: read returns EAGAIN when no data is ready; caller retries (polling)
  • Select/Poll: O(n) per call, kernel copies FD set
  • epoll: O(1) event notification, edge vs level trigger
  • io_uring: async I/O, zero-copy, submission queue rings

Concurrency Primitives

  • Mutex: kernel futex → userspace fast path
  • Semaphore: counting primitive, admits up to N holders (resource pools)
  • Condition variable: wait/signal (producer-consumer)
  • Spinlock: busy-wait, never sleep (interrupt context)
  • RWMutex: concurrent reads, exclusive write

02 · Must-Know Deep Dives

🔥 The Life of a Syscall — Crossing the Boundary

Understanding how your code talks to hardware is critical for backend performance. When you call read() in Go or Rust, it doesn't just run code; it traps into the kernel, a privilege-mode switch on the CPU.

  • 1. Trap: The app puts the syscall ID in a register and executes a SYSCALL instruction.
  • 2. Mode Switch: CPU switches from User Mode (Ring 3) to Kernel Mode (Ring 0).
  • 3. Jump: Kernel looks at the Syscall Table and jumps to the function handling that ID.
  • 4. Execution: Kernel performs the privileged action (reading disk, writing to socket).
  • 5. Return: Kernel puts result in a register and switches back to User Mode.
Optimization insight: syscalls are expensive because they flush CPU pipelines and can cause TLB misses. This is why io_uring is a game changer: it batches thousands of I/O operations into a single syscall, or even zero syscalls in kernel-polling mode.

🔥 epoll — How Go's Net Package Works

Go's net package uses epoll under the hood via the netpoller. When you do conn.Read(), Go registers the fd with epoll, parks the goroutine (not the OS thread), and wakes it when data arrives. This is how Go handles 100K+ concurrent connections with just hundreds of OS threads.

Level-triggered (LT) vs edge-triggered (ET): LT fires as long as data remains available (the default, and safer). ET fires only on a state change, so you must drain the fd until EAGAIN, which is more complex but means fewer wakeups. Nginx uses ET for performance, and Go's netpoller also registers fds with EPOLLET.

Conceptual epoll loop (the Go netpoller does the equivalent in runtime/netpoll_epoll.go):

    epfd := epoll_create1(0)
    epoll_ctl(epfd, EPOLL_CTL_ADD, connFd, &event)        // register interest
    for {
        n := epoll_wait(epfd, events, maxEvents, timeout) // block until events
        for i := 0; i < n; i++ {
            // wake the goroutine parked on this fd
            goready(goroutineWaitingOn(events[i].data.fd))
        }
    }

🔥 CPU Cache Hierarchy — Why Cache Misses Kill Performance

L1 cache: 32KB, ~1ns latency. L2: 256KB, ~5ns. L3: 8-32MB, ~30ns. Main memory: ~100ns. A single L3 cache miss costs 100× an L1 hit. This matters for your Go code.

Cache line: 64 bytes. When you access element [0] of a slice, elements [1]-[7] (for int64) are loaded into cache for free. Traversing an array is fast; pointer chasing (linked list) is slow because each pointer dereference is a potential cache miss.

False sharing: two goroutines updating different fields in the same struct → same cache line → cache invalidation ping-pong between CPUs. Fix: pad struct fields to 64-byte boundaries. This is why Go's sync.Pool pads its per-P poolLocal slots out to a cache-line multiple.

    // Bad: false sharing, both hot counters on the same cache line
    type Counters struct {
        A, B int64
    }

    // Good: pad the hot field out to a full 64-byte cache line
    type Counter struct {
        val int64
        _   [56]byte // 8 + 56 = 64 bytes
    }

🔥 io_uring — The Future of Linux I/O

io_uring (Linux 5.1+) uses two shared memory ring buffers between kernel and user space: submission queue (SQE) and completion queue (CQE). No syscall needed per I/O — batch many ops, one io_uring_enter(). In kernel-poll mode: zero syscalls, pure ring operations.

Real impact: Cloudflare uses io_uring in parts of its proxy stack. PostgreSQL has been adding io_uring-based async I/O (the io_method setting in recent releases). For Go there is the github.com/pawelgaczynski/giouring library; on the Rust side, Tokio's io_uring support (tokio-uring) is still maturing.

🔥 OOM Killer + Memory Overcommit

Linux overcommits memory by default — malloc() doesn't actually allocate physical pages until touched (copy-on-write after fork). When physical memory runs out, OOM killer picks a process to kill based on oom_score.

In Kubernetes: a container that exceeds its memory limit is OOMKilled (SIGKILL). This is why K8s pods under memory pressure crash hard with no logs. Set requests == limits for memory (Guaranteed QoS) on critical services so they get the lowest OOM priority and are never evicted first. Go's GC can't help once you hit the cgroup limit.
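An illustrative manifest fragment for the Guaranteed QoS setup (pod and image names are hypothetical):

```yaml
# Guaranteed QoS: requests == limits for every container, so the
# scheduler reserves exactly what the cgroup enforces and the pod
# gets the lowest OOM-kill priority on the node.
apiVersion: v1
kind: Pod
metadata:
  name: critical-api        # hypothetical name
spec:
  containers:
    - name: server
      image: example/server:latest   # hypothetical image
      resources:
        requests:
          memory: "512Mi"
          cpu: "500m"
        limits:
          memory: "512Mi"
          cpu: "500m"
```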

03 · Resources

VIDEO
MIT 6.S081: Operating System Organization (Short, Dense)

40-min lecture. Dense, no handholding. xv6 examples that map directly to Linux concepts.

VIDEO
io_uring Explained — Jens Axboe (Creator)

30 min by the person who built it. Pure signal, no beginner explanations.

BLOG
Brendan Gregg — Linux Performance

The canonical reference for Linux performance tools. CPU, memory, I/O flame graphs. Everything you need to debug production perf issues.

BLOG
Go Runtime Source: proc.go (GMP Scheduler)

Read the schedule() function. 500 lines that explain how Go goroutines are scheduled onto OS threads. More valuable than any blog post.

DOCS
Linux Kernel Docs: Memory Management

Canonical source for paging, TLB, huge pages. Dry but authoritative.

BOOK
OSTEP — Operating Systems: Three Easy Pieces (Free)

Read: Chapters 13-23 (Memory), 26-33 (Concurrency), 36-44 (I/O). Skip the rest. Best OS book available, free online.

04 · Quick Revision Notes

FLASH CARDS — Review These Before Sleeping

Context switch cost?
1–10μs. Register save, TLB flush, cache thrashing. Goroutines cheaper: ~0.2μs, no TLB flush (shared address space)
TLB miss penalty?
~100 cycles = ~30ns. Page table walk in hardware (x86 has hardware PTW). Huge pages reduce TLB pressure.
epoll complexity?
epoll_wait: O(1) — returns only ready fds. select/poll: O(n) every call — copies all fds to kernel.
What is a futex?
Fast userspace mutex. Uncontended: atomic CAS in userspace (no syscall). Contended: kernel syscall to sleep. Go's sync.Mutex uses this.
False sharing fix?
Pad struct fields to 64-byte cache line boundaries. _ [56]byte after hot field. Or use separate cacheline-aligned structs.
mmap vs read()?
mmap: zero-copy, shares the page cache. read(): copies kernel→userspace. The Go runtime uses mmap to allocate its heap from the OS.
OOM Killer in K8s?
Hits when container exceeds memory.limit. SIGKILL — no cleanup. Fix: set requests==limits (Guaranteed QoS) for critical pods.
io_uring advantage?
Zero-copy ring buffers shared with the kernel; batched or zero syscalls per I/O. Kernel-poll mode eliminates syscalls entirely. Can be several times faster than epoll for syscall-bound, I/O-heavy workloads.