IBM

3 interview questions · docker

How do rootless containers, Seccomp, and AppArmor harden Docker, and what breaks when you use all three?

architect · security · docker

Quick Answer

Rootless mode runs the Docker daemon and its containers without root privileges on the host. Seccomp restricts which system calls a container can make. AppArmor limits file, network, and capability access per container. Together they provide layered security, but conflicts happen when a strict Seccomp profile blocks system calls that container startup in a user namespace needs, or when AppArmor denies paths that rootless UID mapping requires.

Detailed Answer

Think of a building with three independent security systems: a keycard system that controls who can enter which floor, a phone system that blocks certain outgoing call types, and a room-access system that restricts which rooms each employee can enter. Each system works on its own, but their rules can collide: a new employee's valid keycard still trips the room-access alarm because that system was never told about them. Container security layers interact the same way.

Rootless containers tackle the fundamental risk that the Docker daemon traditionally runs as root on the host. In rootless mode, dockerd runs under a regular user's UID, and user namespaces remap container UIDs so that root inside the container maps to an unprivileged UID on the host. Even if an attacker escapes the container, they land as a non-root user on the host. Seccomp (Secure Computing Mode) is a Linux kernel feature that filters system calls. Docker applies a default Seccomp profile that blocks about 44 dangerous syscalls, including mount, reboot, and kexec_load. AppArmor provides Mandatory Access Control that restricts which files, network sockets, and capabilities a container process can access, regardless of its UID.

Internally, these three mechanisms operate at different kernel layers. User namespaces (rootless) work at the namespace level, remapping UIDs and GIDs using the subordinate ID ranges defined in /etc/subuid and /etc/subgid. Seccomp operates at the system call interface, using BPF filters loaded before the container process starts. AppArmor operates at the LSM (Linux Security Module) layer, enforcing path-based access control policies loaded into the kernel. Docker applies all three during container creation: it sets up user namespace mapping, loads the Seccomp BPF filter via the OCI runtime spec, and assigns the AppArmor profile through the container's security options.

At production scale, the combination provides real defense in depth. A container running the checkout-api might have rootless mode preventing host root access, a custom Seccomp profile allowing only the 150 or so system calls the Go binary actually uses, and an AppArmor profile restricting file writes to /tmp and /var/log/checkout only. Security teams should test Seccomp in audit mode before enforcing, generate AppArmor profiles using tools like bane or aa-logprof, and test rootless networking thoroughly, because rootless mode uses slirp4netns or pasta for network isolation, which adds latency compared to native bridge networking.

The tricky part is the interaction between the layers. Rootless setup relies on the newuidmap and newgidmap helper binaries (setuid programs, not system calls) and on namespace-related system calls, and an overly strict Seccomp profile that omits calls the runtime needs while starting the container makes the container fail before the application ever runs. Similarly, AppArmor may deny access to /proc/self/uid_map or to the /etc/subuid and /etc/subgid paths that rootless mode needs. Docker's rootless mode documentation also lists AppArmor among its unsupported features, so confirm on your Docker version that the profile is actually loaded and enforced instead of assuming it.

When turning on all three, architects must test the specific combination: create a Seccomp profile that includes the system calls container startup under rootless mode requires, make sure the AppArmor profile allows the filesystem paths that rootless networking and UID mapping use, and verify that slirp4netns or pasta networking works under both Seccomp and AppArmor constraints. Roll these out one at a time (rootless first, then Seccomp in audit mode, then AppArmor in complain mode) to avoid silent breakage in production.
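The UID remapping described above can be made concrete with a small calculation. The sketch below assumes the usual rootless layout, in which container UID 0 maps to the invoking user's own UID and container UIDs 1 and up map into that user's /etc/subuid range. The helper name host_uid_for and the sample values (host UID 1001, subordinate range 100000:65536) are illustrative and not taken from any real host.

```shell
#!/bin/sh
# host_uid_for USER_UID SUBUID_START SUBUID_COUNT CONTAINER_UID
# Hypothetical helper: compute the host UID behind a container UID,
# assuming the common rootless mapping:
#   container 0        -> the invoking user's UID
#   container 1..COUNT -> SUBUID_START .. SUBUID_START+COUNT-1
host_uid_for() {
  user_uid=$1
  start=$2
  count=$3
  cuid=$4
  if [ "$cuid" -eq 0 ]; then
    echo "$user_uid"
  elif [ "$cuid" -le "$count" ]; then
    echo $((start + cuid - 1))
  else
    echo "container UID $cuid is outside the mapped range" >&2
    return 1
  fi
}

# Container root is the unprivileged host user itself.
host_uid_for 1001 100000 65536 0      # prints 1001
# The --user 1000 process runs as a subordinate UID on the host.
host_uid_for 1001 100000 65536 1000   # prints 100999
```

On a real host, the actual mapping can be compared against this by reading /proc/self/uid_map from inside the container.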

Code Example

# Run the payments-api container with custom Seccomp and AppArmor profiles
# (the rootless layer comes from running dockerd itself in rootless mode).
# A comment between backslash-continued lines ends the command early,
# so the flags are described here instead:
#   --name                      name the container for operational identification
#   --security-opt seccomp=...  custom Seccomp profile allowing only necessary syscalls
#   --security-opt apparmor=... custom AppArmor profile restricting file and network access
#   --user                      run the container process as a non-root user inside the container
#   --cap-drop ALL              drop all Linux capabilities; port 8080 is unprivileged, so
#                               NET_BIND_SERVICE (needed only below port 1024) is not added back
#   -v ...:ro                   mount the application config read-only
#   -p                          publish the API port
docker run -d \
  --name payments-api \
  --security-opt seccomp=/etc/docker/seccomp/payments-api.json \
  --security-opt apparmor=payments-api-profile \
  --user 1000:1000 \
  --cap-drop ALL \
  -v /etc/payments/config.yaml:/app/config.yaml:ro \
  -p 8080:8080 \
  registry.company.com/payments-api:3.4.1

# Generate a Seccomp profile by tracing the syscalls the container actually uses
# Run the container under strace to capture syscall usage
# (assumes the image ships strace; otherwise trace from a debug image built on top of it)
docker run --rm --security-opt seccomp=unconfined \
  registry.company.com/payments-api:3.4.1 \
  strace -c -f -S name /app/payments-api 2>&1 | tail -40

# Verify which security options are applied to a running container
docker inspect payments-api --format '{{.HostConfig.SecurityOpt}}'

# Check AppArmor profile status on the host (aa-status needs root)
sudo aa-status | grep payments-api-profile

# Test Seccomp in audit mode before enforcing (log syscalls without blocking them)
# In the Seccomp profile JSON, set defaultAction to SCMP_ACT_LOG
# Then check the kernel log for the logged syscalls; without auditd they show up
# as audit records of type 1326, with auditd use: ausearch -m SECCOMP
dmesg | grep -Ei 'type=1326|seccomp' | tail -20
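
For the audit-mode step, a minimal sketch of what such a profile could look like. It follows Docker's seccomp profile JSON layout (defaultAction, architectures, syscalls), but the syscall list is illustrative and far too short for a real service, the x86_64 architecture is an assumption, and the output path is hypothetical.

```shell
#!/bin/sh
# Write an "audit mode" Seccomp profile: listed syscalls are allowed silently,
# everything else is logged (SCMP_ACT_LOG) instead of being blocked.
# The path and the syscall names below are illustrative assumptions.
profile="${TMPDIR:-/tmp}/payments-api-audit.json"

cat > "$profile" <<'EOF'
{
  "defaultAction": "SCMP_ACT_LOG",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "futex", "epoll_pwait", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
EOF

echo "wrote $profile"
# To try it: docker run --security-opt seccomp="$profile" <image>
```

Syscalls that appear in the log during a representative test run are candidates for the allow list; once the log stays quiet, defaultAction can be switched to SCMP_ACT_ERRNO to enforce.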

◈ Architecture Diagram

┌──────────────────────┐
│   Container Process  │
└──────┬───────────────┘
       │
┌──────┴───────────────┐
│ AppArmor (LSM layer) │
│ file + net policy    │
└──────┬───────────────┘
       │
┌──────┴───────────────┐
│ Seccomp (BPF filter) │
│ syscall whitelist    │
└──────┬───────────────┘
       │
┌──────┴───────────────┐
│ User Namespace       │
│ UID remap (rootless) │
└──────────────────────┘

How do containerd and CRI-O differ as Kubernetes container runtimes, and what does runc actually do?

architect · general · docker

Quick Answer

containerd and CRI-O both implement the Kubernetes Container Runtime Interface but differ in scope. containerd is a general-purpose daemon that works with Docker, Kubernetes, and standalone use. CRI-O is built exclusively for Kubernetes with a smaller footprint. Both hand off actual container creation to runc, which sets up Linux namespaces, cgroups, and the root filesystem per the OCI spec.

Detailed Answer

Think of two car engines designed for the same chassis. containerd is a versatile engine that powers sedans, trucks, and race cars. It does more than any single car needs, but it works everywhere. CRI-O is an engine designed for one car model only. It is lighter, has fewer parts, and is tuned for exactly that chassis. Both engines use the same fuel injectors (runc) to actually combust the fuel (run the container).

In Kubernetes, the kubelet talks to the container runtime through the Container Runtime Interface (CRI), a gRPC API that handles image pulling, container lifecycle, and sandbox management. After Kubernetes dropped dockershim in v1.24, clusters moved to either containerd (with its built-in CRI plugin) or CRI-O. containerd is a CNCF project used by Docker, AWS EKS, Google GKE, and many self-managed clusters. CRI-O grew out of the Kubernetes community (it began as a Kubernetes incubator project) and is used mainly by Red Hat OpenShift and Fedora CoreOS. Both are production-proven and CNCF graduated projects.

Under the hood, when the kubelet sends a CreateContainer CRI request, the high-level runtime (containerd or CRI-O) pulls the container image if needed, unpacks it into a root filesystem using a snapshotter or storage driver (overlayfs is most common), generates an OCI runtime spec (config.json), and calls the low-level OCI runtime. runc, the reference OCI runtime, reads that config.json and creates the container: it sets up Linux namespaces (pid, net, mnt, uts, ipc, user, cgroup), configures cgroups for resource limits, pivots the root filesystem, applies Seccomp and AppArmor profiles, drops capabilities, and finally execs the container's entrypoint process. The OCI runtime spec standardizes this interface so alternatives like crun (written in C for faster startup), gVisor's runsc (sandboxing through a user-space kernel that intercepts system calls), or Kata Containers' kata-runtime (VM-based isolation) can replace runc without changing the high-level runtime.

At production scale, the choice between containerd and CRI-O depends on your organization. containerd is the safer default for most teams because it has broader ecosystem support, more documentation, and works across managed Kubernetes providers. CRI-O offers a smaller attack surface and tighter Kubernetes alignment since it matches Kubernetes release cycles and skips features irrelevant to Kubernetes. Performance differences are negligible for most workloads, but CRI-O's smaller codebase can be an advantage in security-audited environments. Teams should monitor runtime metrics including container start latency, image pull duration, pod sandbox creation time, and runtime daemon memory use.

The non-obvious gotcha is that switching runtimes on a running cluster requires draining nodes, reconfiguring the kubelet's --container-runtime-endpoint, and potentially clearing and re-pulling the container storage directory, because containerd and CRI-O use different on-disk layouts. Another trap is assuming OCI compatibility means feature parity: gVisor's runsc intercepts system calls for sandboxing but has compatibility gaps with some applications, and Kata Containers add VM startup latency. Architects must test their specific workloads against alternative runtimes instead of assuming drop-in replacement.
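
Because switching runtimes happens node by node, a fleet is often mixed during a migration. A minimal sketch for splitting the containerRuntimeVersion string that each node reports (for example containerd://1.7.13 or cri-o://1.29.1) into runtime name and version. The helper names runtime_name and runtime_version and the sample version strings are illustrative.

```shell
#!/bin/sh
# Hypothetical helpers: split "<runtime>://<version>" as reported in
# .status.nodeInfo.containerRuntimeVersion.
runtime_name() { printf '%s\n' "${1%%://*}"; }    # text before "://"
runtime_version() { printf '%s\n' "${1#*://}"; }  # text after "://"

# Sample values standing in for real node output, which could come from:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
for v in "containerd://1.7.13" "cri-o://1.29.1"; do
  printf '%s %s\n' "$(runtime_name "$v")" "$(runtime_version "$v")"
done
```

Grouping nodes by the first field gives a quick view of how far a runtime migration has progressed.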

Code Example

# Check which container runtime the kubelet is using on a node
kubectl get node worker-payments-01 -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'

# Check the containerd client and server versions, then list loaded plugins
ctr --address /run/containerd/containerd.sock version
ctr --address /run/containerd/containerd.sock plugins ls

# List containers managed by containerd in the Kubernetes namespace
ctr --address /run/containerd/containerd.sock \
  --namespace k8s.io containers list | grep payments-api

# For CRI-O: check runtime status and configured OCI runtimes
crictl info | jq '.config.runtimes'

# Inspect the OCI runtime spec generated for a running container
# Find the container ID first
crictl ps | grep checkout-worker
# Inspect the container's OCI bundle directory
ls /run/containerd/io.containerd.runtime.v2.task/k8s.io/<container-id>/

# containerd config snippet showing runc as the default OCI runtime
# /etc/containerd/config.toml (containerd 1.x layout; containerd 2.x uses config version 3 with different plugin IDs)
# version = 2 # containerd config file version
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
#   runtime_type = "io.containerd.runc.v2" # Use runc via the v2 shim
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
#     SystemdCgroup = true # Use systemd cgroup driver matching kubelet

# Configure a RuntimeClass for workloads needing gVisor sandboxing
# (Kubernetes manifest, not a shell command: save to a file and apply with kubectl apply -f)
apiVersion: node.k8s.io/v1 # RuntimeClass API for selecting OCI runtimes
kind: RuntimeClass # Allows Pods to specify their low-level runtime
metadata:
  name: gvisor # Name referenced by Pod spec
handler: runsc # Maps to the containerd runtime configuration name
overhead:
  podFixed:
    cpu: 50m # Account for gVisor kernel CPU overhead in scheduling
    memory: 64Mi # Account for gVisor memory overhead in scheduling
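
A RuntimeClass only takes effect when a Pod references it through runtimeClassName. A minimal sketch of such a Pod, written from a shell heredoc to stay consistent with the commands above. The Pod name sandboxed-checkout, the image tag, and the output path are hypothetical.

```shell
#!/bin/sh
# Write a Pod manifest that opts into the "gvisor" RuntimeClass.
# Names, image tag, and path are illustrative assumptions.
manifest="${TMPDIR:-/tmp}/sandboxed-pod.yaml"

cat > "$manifest" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-checkout
spec:
  runtimeClassName: gvisor   # must match the RuntimeClass metadata.name
  containers:
    - name: checkout-worker
      image: registry.company.com/checkout-worker:1.0.0
EOF

echo "wrote $manifest"
# To deploy against a cluster whose nodes have the runsc handler configured:
#   kubectl apply -f "$manifest"
```

Pods that omit runtimeClassName keep using the node's default runtime, so sandboxing can be adopted one workload at a time.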

◈ Architecture Diagram

┌──────────┐
│ kubelet  │
└────┬─────┘
     │ CRI gRPC
┌────┴─────┐
│containerd│
│ or CRI-O │
└────┬─────┘
     │ OCI spec
┌────┴─────┐
│   runc   │
│(or runsc)│
└────┬─────┘
     │
┌────┴─────┐
│namespaces│
│ cgroups  │
│ rootfs   │
└──────────┘

What is the difference between Docker and Podman, and why do enterprises prefer Podman for production?

intermediate · general · docker

Quick Answer

Docker by default uses a centralized daemon running as root that manages all containers, while Podman is daemonless and runs containers as the invoking user without requiring root privileges. Enterprises often prefer Podman for production because it removes the daemon as a single point of failure, supports rootless containers natively, integrates well with systemd, and follows the OCI standards without vendor lock-in.

Detailed Answer

Think of Docker like a traditional bank with a single central manager who handles every transaction, account opening, and customer request. Every teller must route their work through this manager, and if the manager calls in sick, the entire branch shuts down. Podman is like a modern bank where each teller is fully empowered to complete transactions independently. There is no single manager whose absence stops operations. Each teller operates within their own authority, and the overall system is more resilient because no single failure brings everything down. This architectural difference has significant implications for security, reliability, and operations in regulated banking environments.

Docker's architecture centers on the Docker daemon, dockerd, a long-running background process that by default runs as root. Every docker CLI command communicates with this daemon through a Unix socket. The daemon manages image pulls, container lifecycle, networking, and storage. This design means that anyone with access to the Docker socket effectively has root access to the host, because they can mount any host directory, run privileged containers, or escape container isolation. In banking environments, this creates an unacceptable security risk: a compromised application or a developer with Docker socket access could potentially access the entire host system, read secrets from other containers, or modify the host filesystem.

Podman eliminates the daemon entirely. When you run podman run, Podman forks the container directly instead of sending a request to a centralized service; a small per-container monitor process (conmon) supervises it, and the container is an ordinary process under the standard Linux process model. This means containers can run as regular unprivileged users without any root daemon. A developer running podman run as their own user account creates a container that runs with that user's permissions and cannot access resources beyond what the user could access directly. Podman achieves this through Linux user namespaces, which map container UIDs to unprivileged host UIDs. The security benefit is substantial: even if an attacker breaks out of a rootless Podman container, they land in an unprivileged user context and would need a separate privilege-escalation exploit to reach root.

For enterprise operations, Podman offers several additional advantages. Systemd integration allows containers to be managed as standard systemd services with dependency ordering, automatic restart, and logging through journald. Podman can generate systemd unit files from running containers (recent releases favor declarative Quadlet unit files for the same purpose), making production deployment consistent with how enterprises manage all other services. Podman also supports pods, groups of containers that share network and IPC namespaces, directly mirroring Kubernetes pod semantics. This means teams can develop and test pod configurations locally with Podman before deploying to Kubernetes, reducing the gap between development and production environments. Podman is also fully OCI-compliant and uses the same image format as Docker, so existing Dockerfiles and images work without modification.

The production gotcha that enterprises discover during migration is that Docker Compose workflows do not directly translate to Podman. While podman-compose exists as a compatibility layer, it does not support all Docker Compose features. Enterprise teams often move to Podman's native pod YAML support or use Kubernetes manifests directly for orchestration. Another consideration is that rootless Podman containers use slirp4netns or pasta for networking, which adds slight latency compared to Docker's rootful bridge networking. For most banking applications the difference is negligible, but high-frequency trading or ultra-low-latency services may need to benchmark both approaches.

Build-time tooling is also different: Podman uses Buildah as its image building engine, which supports building images without a Dockerfile using scripted commands, providing additional flexibility for CI/CD pipelines in regulated environments where every build step must be auditable.
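
The systemd integration described above can also be expressed declaratively. A minimal sketch of a Quadlet unit, which Podman 4.4 and later translate into a systemd service. It is written to a scratch directory here, whereas a real rootless deployment would place the file under ~/.config/containers/systemd/. The published port and the scratch path are assumptions.

```shell
#!/bin/sh
# Write a Quadlet .container unit for the payments-api image.
# The scratch directory and the published port are illustrative assumptions.
unit_dir="${TMPDIR:-/tmp}/quadlet-demo"
mkdir -p "$unit_dir"

cat > "$unit_dir/payments-api.container" <<'EOF'
[Unit]
Description=payments-api container (rootless Podman)

[Container]
Image=ecr.bank.com/payments-api:v2.3.1
ContainerName=payments-api
PublishPort=8080:8080

[Service]
Restart=always

[Install]
WantedBy=default.target
EOF

echo "wrote $unit_dir/payments-api.container"
# After copying the file to ~/.config/containers/systemd/:
#   systemctl --user daemon-reload
#   systemctl --user start payments-api.service
```

Because the unit is a plain file, it can be reviewed and versioned like any other service definition, which suits environments where deployment changes must be auditable.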

Code Example

# Docker: requires root daemon running — single point of failure
sudo systemctl start docker
docker run -d --name payments-api ecr.bank.com/payments-api:v2.3.1
# docker.sock access = root access to host (security risk)

# Podman: no daemon, runs as unprivileged user
podman run -d --name payments-api ecr.bank.com/payments-api:v2.3.1
# No root required, no daemon socket to protect

# Podman rootless: container runs with user namespace mapping
podman run --rm --user 10001:10001 \
  ecr.bank.com/payments-api:v2.3.1
# Container UID 10001 maps into the invoking user's subordinate UID range on the host,
# so a breakout lands in an unprivileged account, not root

# Create a pod (mirrors Kubernetes pod concept)
podman pod create --name fraud-detection-pod -p 8080:8080
podman run -d --pod fraud-detection-pod \
  --name fraud-detector ecr.bank.com/fraud-detector:v1.5.0
podman run -d --pod fraud-detection-pod \
  --name fraud-sidecar ecr.bank.com/fraud-sidecar:v1.2.0

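# Generate a systemd unit file for production service management
# (rootless: user units live under ~/.config/systemd/user, and /etc/systemd/system is not writable without root)
mkdir -p ~/.config/systemd/user
podman generate systemd --new --name payments-api \
  --restart-policy=always > ~/.config/systemd/user/payments-api.service
systemctl --user daemon-reload
systemctl --user enable --now payments-api.service
# Now managed like any Linux service: start, stop, logs via journalctl --user
# Run "loginctl enable-linger $USER" so the service keeps running after logout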

# Build images with Buildah (Podman's build engine)
buildah bud -t ecr.bank.com/settlements-processor:v3.1.0 \
  -f Dockerfile.settlements .

# Security comparison
# Docker: any user in 'docker' group has root-equivalent access
ls -la /var/run/docker.sock
# srw-rw---- 1 root docker 0 Jun 21 /var/run/docker.sock

# Podman: no socket, no daemon, no root-equivalent group
podman info --format '{{.Host.Security.Rootless}}'
# true

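# Alias for migration compatibility in interactive shells
alias docker=podman  # Most docker CLI commands work unchanged
# Aliases are not expanded inside scripts; the podman-docker package provides a docker command for that case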

◈ Architecture Diagram

┌─────────────────────────┐    ┌─────────────────────────┐
│       Docker            │    │       Podman            │
├─────────────────────────┤    ├─────────────────────────┤
│                         │    │                         │
│  ┌─────────────────┐    │    │ ┌─────────┐ ┌─────────┐ │
│  │  docker CLI     │    │    │ │ podman  │ │ podman  │ │
│  └────────┬────────┘    │    │ │ run     │ │ run     │ │
│           │ socket      │    │ └────┬────┘ └────┬────┘ │
│           ↓             │    │      ↓           ↓      │
│  ┌─────────────────┐    │    │ ┌─────────┐ ┌─────────┐ │
│  │  dockerd        │    │    │ │container│ │container│ │
│  │  (ROOT daemon)  │    │    │ └─────────┘ └─────────┘ │
│  │  SPOF           │    │    │                         │
│  └────────┬────────┘    │    │  No daemon              │
│           ↓             │    │  No root                │
│ ┌─────────┐ ┌─────────┐ │    │  No socket              │
│ │container│ │container│ │    │  No SPOF                │
│ └─────────┘ └─────────┘ │    │                         │
│                         │    │                         │
│  Risk: socket = root    │    │  Rootless by default    │
└─────────────────────────┘    └─────────────────────────┘