πŸ‹ INFRA Β· 5 HOURS Β· CONTAINERS + SECURITY

Docker in a Day

Namespace and cgroup internals, OverlayFS, multi-stage builds, BuildKit, container security. Understanding the "why" behind every docker command you already use.

5h
Total Time
6
Core Topics
4
Quizzes
5h
Deep Dives
00 · Best Video Resources · WATCH FIRST
TechWorld with Nana
Docker Tutorial for Beginners → Advanced
Best structured Docker course on YouTube. Covers basics through docker-compose and networking. Watch at 1.5x if you know the basics.
~3h · Full Course
Liz Rice · Container Camp
Containers from Scratch (Go)
Build a container from zero Go code using namespaces and cgroups. The definitive "what is a container really" talk. Essential for internals understanding.
~40 min · Must Watch
Hussein Nasser
Docker Networking Deep Dive
How bridge networks, iptables, and veth pairs work. What actually happens when containers communicate. Direct relevance to K8s networking.
~35 min · Internals
Docker Official
BuildKit and Multi-stage Builds
BuildKit internals: parallel build graph, cache mounts, secret mounts. Multi-stage patterns for production Go images.
~25 min · BuildKit
LiveOverflow
Container Escape — Security Deep Dive
How container breakouts work: privileged containers, /proc mounts, seccomp bypass. Essential for understanding container security boundaries.
~20 min · Security
ByteByteGo
How Docker Works Internally
Visual overview of the full Docker architecture from daemon to container runtime. Good mental model refresh.
~12 min · Visual
01 · Container Internals — Namespaces & cgroups · LINUX PRIMITIVES

Namespaces — Isolation Primitives

  • pid: process sees its own PID tree. PID 1 in container ≠ PID 1 on host. /proc is namespace-scoped.
  • net: own network stack — interfaces, routes, iptables. Container gets eth0 (one end of a veth pair; the other end attaches to the host's docker0 bridge).
  • mnt: own filesystem mount tree. pivot_root() changes root to container filesystem layer.
  • uts: own hostname and domain name. hostname inside container ≠ host.
  • ipc: isolated System V IPC, POSIX message queues. Containers can't interfere with host IPC.
  • user: map user IDs. UID 0 in container → non-root UID on host (rootless containers). Critical for security.
  • cgroup (v2): own cgroup subtree view. Container sees its own resource limits as the "root".
  • Clone: clone(CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS...) — one syscall, all namespaces.
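The namespace list above can be poked at from any Linux shell; a minimal sketch (the demo hostname is illustrative, and the unshare call is guarded because some distros disable unprivileged user namespaces):

```shell
# Every process's namespaces are visible as symlinks under /proc.
# Two processes share a namespace iff the inode numbers match.
ls -l /proc/self/ns/
# e.g. pid -> 'pid:[4026531836]', net -> 'net:[4026531840]', ...

# Create new user + UTS namespaces and set a hostname inside them.
# --map-root-user maps our UID to 0 inside the new user namespace,
# so no real root is needed.
unshare --user --map-root-user --uts \
  sh -c 'hostname demo-container && hostname' \
  || echo "unprivileged user namespaces not available here"
```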

cgroups v2 — Resource Control

  • Control Groups: kernel mechanism to limit, account, and isolate resource usage of process groups.
  • cpu.max: "500000 1000000" = 500ms CPU per 1s period = 0.5 CPU. K8s CPU limits write this file.
  • memory.max: hard memory limit. Exceed = OOM kill (SIGKILL). memory.high: soft limit, triggers GC/throttling before OOM.
  • io.max: limit read/write IOPS and bandwidth per device.
  • cgroup v2 unified hierarchy: single tree, all resource controllers. v1 was multiple trees per controller (complex). K8s uses v2 on modern distros.
  • Container = namespace (isolation) + cgroup (resource limits) + filesystem layer (OverlayFS). That's it. No magic.
# What a container IS under the hood:
# 1. New namespaces via clone()
# 2. cgroup limits applied
# 3. OverlayFS mounted as rootfs
# 4. chroot/pivot_root into that filesystem

# See container's cgroup limits:
cat /sys/fs/cgroup/system.slice/docker-CONTAINERID.scope/cpu.max
# Output: 500000 1000000 (0.5 CPU)
cat /sys/fs/cgroup/system.slice/docker-CONTAINERID.scope/memory.max
# Output: 536870912 (512MB)

# See namespaces of a container process:
ls -la /proc/$(docker inspect --format='{{.State.Pid}}' CONTAINER)/ns/
02 · OverlayFS — Container Filesystem Layers · STORAGE

Layer Architecture

  • Docker images = stack of immutable layers. Each Dockerfile instruction creates a layer.
  • OverlayFS: presents multiple directory trees as one unified view. lowerdir (read-only layers) + upperdir (read-write, container-specific) = merged view.
  • Copy-on-Write (CoW): reading a file from lowerdir = zero cost. Writing: file copied to upperdir, modification happens there. Only touched files are in upperdir.
  • Container's writable layer (upperdir): all runtime changes live here. Gone when container deleted (unless mounted volume).
  • Layer sharing: 10 containers from same image share same lowerdir layers in memory and on disk. Massive efficiency win.
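The lowerdir/upperdir mechanics can be reproduced by hand with a plain mount call. A sketch using hypothetical /tmp/ovl paths, mirroring what Docker sets up under /var/lib/docker/overlay2 (the mount itself needs root, so it is guarded):

```shell
# One read-only "image" layer, one writable "container" layer.
mkdir -p /tmp/ovl/lower /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
echo "from the image" > /tmp/ovl/lower/app.conf

# Requires root: present lower (RO) + upper (RW) as one merged view.
if mount -t overlay overlay \
     -o lowerdir=/tmp/ovl/lower,upperdir=/tmp/ovl/upper,workdir=/tmp/ovl/work \
     /tmp/ovl/merged 2>/dev/null; then
  cat /tmp/ovl/merged/app.conf      # read served straight from lowerdir
  echo "runtime change" > /tmp/ovl/merged/app.conf
  ls /tmp/ovl/upper/                # copy-up: app.conf now in the writable layer
  cat /tmp/ovl/lower/app.conf       # lowerdir untouched: still "from the image"
  umount /tmp/ovl/merged
else
  echo "need root (and the overlay module) to mount"
fi
```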

Dockerfile Layer Optimization

  • Layer caching: each instruction is a layer. Layer cached if: instruction unchanged AND all previous layers unchanged. Cache invalidated = all subsequent layers rebuilt.
  • Order matters: put rarely-changing instructions first (base image, system deps). Put frequently-changing instructions last (COPY source code).
  • Layer squashing: combine RUN commands with &&. Each RUN = one layer. apt-get update && apt-get install -y curl && rm -rf /var/lib/apt/lists/* in ONE RUN.
  • Multi-stage: builder stage installs build deps (gcc, go tools) → final stage copies only binary. Tiny final image with no build tools.
  • .dockerignore: exclude node_modules, .git, test files from build context. Build context sent to daemon — large context = slow build start.
# Production Go multi-stage Dockerfile
FROM golang:1.22-alpine AS builder
WORKDIR /app
# Copy go.mod first — cached until dependencies change
COPY go.mod go.sum ./
RUN go mod download
# Source code changes frequently — copy last
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -ldflags="-s -w" -o /app/server ./cmd/server

FROM gcr.io/distroless/static-debian12 AS final
# distroless: no shell, no package manager, no OS cruft
# Attack surface: near zero. Image size: ~10MB vs ~300MB
COPY --from=builder /app/server /server
USER nonroot:nonroot
ENTRYPOINT ["/server"]

Quiz — Layers & OverlayFS

1. You have 20 containers running from the same base image (500MB). How much disk space do the 20 containers' base layers consume?

2. Your Dockerfile has: COPY . . on line 3, then RUN go mod download on line 4. What is the problem?

03 · Networking — Bridge, veth, iptables · NETWORKING

Bridge Network (Default)

  • Docker creates docker0 Linux bridge (virtual switch) on host.
  • Each container gets: veth pair. One end in container (eth0), one end attached to docker0 bridge on host.
  • Container IP: assigned from bridge subnet (172.17.0.0/16 default). Host routes container traffic through bridge.
  • Container-to-container: same bridge = L2 communication directly. Different bridges = need routing or --link (deprecated).
  • User-defined bridge: docker network create mynet. Containers on same user bridge: DNS resolution by name (container1 can reach container2 by hostname). Default bridge: no DNS.
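All of these pieces are inspectable from the host. A sketch assuming a running Docker daemon (container and network names are placeholders):

```shell
# The docker0 bridge and its subnet (172.17.0.0/16 by default):
ip addr show docker0
docker network inspect bridge --format '{{json .IPAM.Config}}'

# Host-side ends of the veth pairs, attached to docker0:
ip link show type veth

# Start a container and read the IP it was assigned from the subnet:
docker run -d --name net-demo nginx:alpine
docker inspect --format '{{.NetworkSettings.IPAddress}}' net-demo

# User-defined bridge: containers resolve each other by name.
docker network create mynet
docker run -d --network mynet --name web nginx:alpine
docker run --rm --network mynet alpine ping -c1 web
```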

Port Publishing & iptables

  • -p 8080:80: Docker adds iptables DNAT rule: traffic to host:8080 → container:80.
  • iptables FORWARD chain: Docker adds rules to allow inter-container and containerβ†’internet traffic.
  • --network=host: no network namespace. Container uses host's network stack directly. No NAT overhead. Port conflicts possible. K8s hostNetwork: true uses this.
  • --network=none: no network. Air-gapped container. Mount data via volumes only.
  • Overlay network (Swarm/K8s): VXLAN tunnels between hosts. Containers on different hosts communicate as if on same L2 network. K8s uses this via CNI plugins (Flannel, Calico).
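The DNAT rule behind -p is visible in the nat table. A sketch assuming root and a running daemon (the container IP shown is illustrative):

```shell
docker run -d -p 8080:80 --name web nginx:alpine

# Docker's DNAT rule lives in the nat table's DOCKER chain:
iptables -t nat -nL DOCKER
# Typical entry (container IP varies):
#   DNAT  tcp dpt:8080 to:172.17.0.2:80

# FORWARD chain rules allowing container traffic:
iptables -nL FORWARD

# End to end: packet to host:8080 -> DNAT rewrites destination to
# 172.17.0.2:80 -> routed over docker0 -> veth -> container eth0.
curl -s localhost:8080 >/dev/null && echo "reached the container"
```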
04 · Container Security · SECURITY

seccomp — Syscall Filtering

  • Kernel syscall filter (seccomp-BPF, distinct from LSMs like AppArmor): restricts which syscalls a container can make.
  • Docker default profile: blocks ~44 dangerous syscalls (ptrace, reboot, kexec_load, etc.)
  • Custom profile: JSON allowlist of permitted syscalls. Principle of least privilege.
  • --security-opt seccomp=profile.json
  • Breaking seccomp = container escape vector. Use --security-opt seccomp=unconfined only for debugging, never production.
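A custom profile can be tried in a few lines. This sketch uses a blocklist (defaultAction allow, two syscalls denied) purely for demonstration; real least-privilege profiles are allowlists as noted above:

```shell
# Deny only mkdir/mkdirat; everything else allowed (demo only).
cat > /tmp/no-mkdir.json <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["mkdir", "mkdirat"], "action": "SCMP_ACT_ERRNO" }
  ]
}
EOF

docker run --rm --security-opt seccomp=/tmp/no-mkdir.json \
  alpine mkdir /tmp/x
# expected: mkdir: can't create directory '/tmp/x': Operation not permitted
```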

Capabilities & Rootless

  • Linux capabilities: divide root privileges into ~40 granular permissions. Containers drop all except needed.
  • --cap-drop=ALL --cap-add=NET_BIND_SERVICE: only allow binding to ports below 1024.
  • Rootless containers: Docker daemon and containers run as non-root user. Uses user namespace. No root on host = massive security win.
  • USER instruction in Dockerfile: always set. Never run as root in production.
  • Read-only rootfs: --read-only with tmpfs for /tmp. Container can't write to its own filesystem. Immutable runtime.
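Combined, the bullets above yield one hardened run command; a sketch (the image name, UID, and PID cap are placeholders):

```shell
# Drop every capability, add back only low-port binding;
# immutable rootfs with tmpfs scratch space; non-root UID;
# block setuid privilege escalation; cap the process count.
docker run -d --name hardened \
  --cap-drop=ALL \
  --cap-add=NET_BIND_SERVICE \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges:true \
  --user 65532:65532 \
  --pids-limit 100 \
  myorg/service:1.0
```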

Image Security

  • Minimal base images: distroless, scratch, alpine. Fewer packages = fewer CVEs. Attack surface minimized.
  • Image scanning: Trivy, Snyk, Docker Scout. Scan for CVEs in OS packages and application dependencies.
  • No secrets in images: never ENV API_KEY=xxx or COPY .env. Use runtime secret injection (K8s Secrets, Vault agent, --secret BuildKit mount).
  • BuildKit --secret: mount secret file during build, not baked into layer. RUN --mount=type=secret,id=npmrc cat /run/secrets/npmrc
  • Image signing: cosign, Docker Content Trust. Verify provenance before deploy.
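A sketch of how these fit into a pipeline; the flags are Trivy, BuildKit, and cosign CLI options, while the image name and key paths are placeholders:

```shell
# Scan for HIGH/CRITICAL CVEs; nonzero exit fails the pipeline:
trivy image --severity HIGH,CRITICAL --exit-code 1 myorg/service:1.0

# BuildKit secret mount: .npmrc is available during the RUN step only,
# never written into a layer:
docker build --secret id=npmrc,src=.npmrc -t myorg/service:1.0 .

# Sign after push, verify before deploy:
cosign sign --key cosign.key myorg/service:1.0
cosign verify --key cosign.pub myorg/service:1.0
```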

Quiz — Security

1. Your Dockerfile has: ENV DATABASE_PASSWORD=secret123. What is the security problem?

05 · BuildKit — Parallel, Cached, Efficient Builds · BUILD

BuildKit Architecture

  • BuildKit: next-gen Docker build engine. Replaces the legacy builder (enabled by default since Docker 23.0).
  • Parallel execution: BuildKit parses Dockerfile as DAG. Independent stages built in parallel. Multi-stage builds: builder and asset stages run simultaneously.
  • Build cache: content-addressable cache. Layer cached by: instruction + inputs hash. Remote cache: --cache-from registry image. CI/CD: pull cache from registry before build.
  • Cache mounts: RUN --mount=type=cache,target=/root/.cache/go-build. Persist build cache between builds without baking into image. Go module cache, npm cache, pip cache.

BuildKit Features

  • Secret mounts: RUN --mount=type=secret,id=mysecret. Secret available during build only. Not in layer history.
  • SSH mounts: RUN --mount=type=ssh. Forward SSH agent. Clone private repos without copying keys.
  • Heredoc syntax (Dockerfile syntax 1.4+): write multi-line scripts inline with RUN <<EOF ... EOF. Cleaner than chained &&.
  • TARGETPLATFORM: cross-platform builds. BuildKit builds for linux/amd64, linux/arm64 simultaneously via QEMU or native builders.
  • docker buildx: BuildKit frontend. buildx bake: multi-target build matrix from HCL/JSON config.
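Cross-platform builds from the bullets above boil down to a few buildx commands; a sketch (builder name and registry path are placeholders):

```shell
# One-time: create a BuildKit builder that can emulate via QEMU.
docker buildx create --name multi --use

# Build for both architectures in one invocation; --push ships a
# multi-arch manifest list to the registry.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/app:1.0 \
  --push .

# Inspect the resulting manifest list:
docker buildx imagetools inspect registry.example.com/app:1.0
```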
# BuildKit with Go cache mount — dramatic speedup in CI
FROM golang:1.22-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN --mount=type=cache,target=/go/pkg/mod \
    go mod download
COPY . .
RUN --mount=type=cache,target=/go/pkg/mod \
    --mount=type=cache,target=/root/.cache/go-build \
    CGO_ENABLED=0 go build -o /app/server ./cmd/server
# /go/pkg/mod: module download cache (survives between builds)
# /root/.cache/go-build: compiled package cache
# Result: 30s build → 3s on cache hit
06 · Production Patterns & Troubleshooting · PRODUCTION

Container Resource Management

  • Always set limits: --memory and --cpus. Without limits: one runaway container can OOM the entire host.
  • Requests vs limits (K8s): request = guaranteed resource. Limit = maximum. CPU limit = throttling (not kill). Memory limit = OOM kill when exceeded.
  • PID limit: --pids-limit. Prevents fork bomb from exhausting host PIDs.
  • Health checks: HEALTHCHECK instruction or K8s liveness/readiness probes. Restart unhealthy containers. Remove from load balancer while starting.
  • Graceful shutdown: handle SIGTERM in your app. Docker stop sends SIGTERM (10s timeout), then SIGKILL. Go: context cancellation on SIGTERM, drain in-flight requests.
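The limits above map directly to run flags; a sketch (image name and values are illustrative):

```shell
# Hard memory cap (OOM kill past 512m), half a CPU, fork-bomb guard:
docker run -d --name svc \
  --memory=512m \
  --cpus=0.5 \
  --pids-limit=100 \
  myorg/service:1.0

# Graceful shutdown: SIGTERM first, SIGKILL only after 30s instead of
# the 10s default, so in-flight requests can drain:
docker stop -t 30 svc
```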

Debugging Containers

  • docker exec -it container sh: enter running container. Distroless has no shell — use ephemeral debug container.
  • kubectl debug: attach debug container (with tools) to running pod sharing its namespace. Non-invasive production debugging.
  • docker stats: real-time CPU, memory, network, block I/O per container.
  • docker events: stream of container lifecycle events. Detect crashes, restarts, OOM kills.
  • nsenter: enter container namespaces from host. nsenter -t PID -n -- tcpdump -i eth0. Network-capture inside container from host.
  • OOM debugging: dmesg | grep -i oom-kill. Shows which process was killed and how much memory it used.
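Put together, an OOM-kill diagnosis might look like this sketch (container name is a placeholder):

```shell
# 1. Did the kernel OOM-kill the container's process?
docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' svc
# an OOM kill shows: true 137   (137 = 128 + SIGKILL)

# 2. Kernel log: which process died, and at what memory usage:
dmesg | grep -i oom

# 3. What limit was it running against?
docker inspect --format '{{.HostConfig.Memory}}' svc

# 4. Watch for repeats:
docker events --filter event=oom
```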

✓ Docker Day Checkpoint

  • Explain what a container is at the Linux level — which two kernel features make isolation and resource limits work
  • Walk through what happens when you run docker run -p 8080:80 nginx — from daemon to iptables rule to first request
  • Build a production Dockerfile for a Go service: multi-stage, distroless, correct layer ordering, BuildKit cache mounts
  • Debug: container OOM-killed in production. Walk through your diagnosis steps.
  • Explain how BuildKit cache mounts work and why they don't affect final image size
  • Security: what are the three most impactful container security configurations for a production service?