27 interview questions · docker, kubernetes, prometheus
Quick Answer
SLIs are measurable signals like latency and error rate. SLOs set targets around those SLIs (e.g., 99.95% of payment transactions complete under 500ms). Error budgets are the allowed failure margin — when the budget is exhausted, teams shift from feature work to reliability improvements.
Detailed Answer
Think of a bank account for reliability. Your SLO is like your minimum balance requirement, your SLI is the actual balance, and your error budget is the amount you can spend before hitting that minimum. When you overspend, the bank restricts your account; similarly, when your error budget is exhausted, engineering shifts priorities from new features to reliability work.
In a banking context, SLIs (Service Level Indicators) are concrete measurements taken from your Kubernetes-hosted payments-api service. Common SLIs include request latency at the 99th percentile, error rate as a percentage of total requests, and availability measured as successful health checks over time. For a payments service handling wire transfers and ACH transactions, you might track the percentage of transactions that complete end-to-end within 2 seconds, the rate of 5xx errors returned by the settlements-processor, and the availability of the fraud-detector service during business hours. These metrics are collected via Prometheus ServiceMonitors scraping /metrics endpoints on each pod, and they feed into Grafana dashboards that the platform team monitors.
SLOs (Service Level Objectives) are targets set around SLIs. For a regulated payments service, you might define: 99.95% of payment API requests return successfully within 500ms, 99.99% of settlement batch jobs complete within the processing window, and the fraud-detector must be available 99.97% of the time during trading hours. These SLOs are negotiated between the platform SRE team, product owners, and compliance officers. In banking, SLOs often need to align with regulatory and contractual requirements. Note that PCI-DSS governs how cardholder data is protected rather than prescribing an uptime percentage, so availability floors usually come from regulators, payment schemes, or customer contracts. Wherever such a floor exists, set your SLO stricter than it to give yourself breathing room.
Error budgets are calculated as 100% minus the SLO target over a rolling window. If your payments-api has a 99.95% availability SLO over 30 days, your error budget is 0.05%, which translates to 21.6 minutes of allowed downtime per 30-day window (0.0005 x 43,200 minutes). The platform team tracks error budget consumption in real time using tools like Sloth or custom Prometheus recording rules. When the budget is more than 50% consumed, alerts fire and the team reviews recent deployments and changes. When the budget is fully consumed, the team enacts a reliability freeze: no new feature deployments to the payments namespace, and all engineering effort shifts to reducing toil, fixing flaky tests, improving observability, and hardening the deployment pipeline.
In production at a bank, the SRE team typically runs weekly error budget review meetings. These meetings examine which incidents consumed budget, whether the consumption came from planned maintenance or unexpected failures, and what systemic improvements would prevent recurrence. The payments-api team might discover that 80% of their error budget was consumed by a single database failover event, leading them to invest in connection pool tuning and read replica routing. The error budget model also drives architectural decisions: if the settlements-processor consistently burns through its budget, the team might propose moving from synchronous to asynchronous processing with Kafka queues, giving the service more resilience against downstream latency spikes.
A critical gotcha in banking environments is that not all SLO violations are equal from a regulatory perspective. A brief latency spike on a read-only account balance endpoint is very different from a data integrity issue on a funds transfer. Teams should implement tiered SLOs, where critical payment paths get stricter targets and error budgets that are tracked separately from those of informational endpoints.
Another common mistake is setting SLOs too aggressively early on (like 99.99% when your infrastructure can only realistically deliver 99.9%), which results in a permanently exhausted error budget and teams that ignore the SLO entirely. Start with achievable targets based on historical data, then tighten them as reliability improves.
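The error budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not part of any SLO tool: the function names are hypothetical, and the 50% and 0% thresholds simply mirror the policy tiers described in the answer.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a rolling window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


def budget_remaining(slo_percent: float, observed_success_ratio: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, 0.0 = exhausted)."""
    allowed_failure = 1 - slo_percent / 100
    actual_failure = 1 - observed_success_ratio
    return 1 - actual_failure / allowed_failure


def policy_action(remaining: float) -> str:
    """Map the remaining budget to the policy tiers described in the answer."""
    if remaining <= 0:
        return "reliability freeze"
    if remaining < 0.5:
        return "reliability review"
    return "feature development continues"


if __name__ == "__main__":
    for slo in (99.95, 99.99, 99.97):
        print(f"{slo}% over 30 days: {error_budget_minutes(slo):.1f} minutes of budget")
    left = budget_remaining(99.95, observed_success_ratio=0.9997)
    print(f"remaining budget {left:.0%}: {policy_action(left)}")
```

With a 99.95% target and an observed success ratio of 99.97%, 60% of the budget is spent, so the sketch lands in the review tier rather than the freeze tier.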
Code Example
# Prometheus recording rules for SLI tracking on payments-api
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slos
  namespace: banking-prod
spec:
  groups:
    - name: payments-api.slos
      interval: 30s
      rules:
        # SLI: Request success rate for payments-api (any non-2xx response counts as a failure)
        - record: payments_api:sli:success_rate:5m
          expr: |
            sum(rate(http_requests_total{job="payments-api",code=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{job="payments-api"}[5m]))
        # SLI: Same success rate over the 30-day SLO window (used by the error budget rule below)
        - record: payments_api:sli:success_rate:30d
          expr: |
            sum(rate(http_requests_total{job="payments-api",code=~"2.."}[30d]))
            /
            sum(rate(http_requests_total{job="payments-api"}[30d]))
        # SLI: Latency P99 for settlements-processor
        - record: settlements:sli:latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="settlements-processor"}[5m])) by (le)
            )
        # Error budget remaining (30-day rolling window): 1 = untouched, 0 = exhausted
        - record: payments_api:error_budget:remaining
          expr: |
            1 - (
              (1 - payments_api:sli:success_rate:30d)
              /
              (1 - 0.9995) # SLO target: 99.95%
            )
---
# Sloth SLO definition for fraud-detector
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: fraud-detector-slos
  namespace: banking-prod
spec:
  service: fraud-detector
  labels:
    team: platform-sre
    tier: critical
  slos:
    - name: availability
      objective: 99.97 # Regulatory minimum is 99.9%
      sli:
        events:
          errorQuery: sum(rate(grpc_server_handled_total{grpc_service="FraudDetector",grpc_code!="OK"}[{{.window}}]))
          totalQuery: sum(rate(grpc_server_handled_total{grpc_service="FraudDetector"}[{{.window}}]))
      alerting:
        name: FraudDetectorAvailability
        pageAlert:
          labels:
            severity: critical
            routing: banking-oncall
        ticketAlert:
          labels:
            severity: warning
◈ Architecture Diagram
┌─────────────────────────────────────────────────────────┐
│                      SLO Framework                      │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │ SLI:     │    │ SLO:     │    │ Error Budget:    │   │
│  │ Actual   │───→│ Target   │───→│ 100% - SLO       │   │
│  │ Metrics  │    │ 99.95%   │    │ = 0.05%          │   │
│  └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                           │             │
│                                           ▼             │
│  ┌────────────────────────────────────────────────────┐ │
│  │ Error Budget Policy                                │ │
│  │                                                    │ │
│  │ Budget > 50% → Feature development continues       │ │
│  │ Budget < 50% → Reliability review triggered        │ │
│  │ Budget = 0%  → Reliability freeze enacted          │ │
│  └────────────────────────────────────────────────────┘ │
│                                                         │
│  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐   │
│  │ payments-api│  │ settlements-  │  │ fraud-       │   │
│  │ SLO: 99.95% │  │ processor     │  │ detector     │   │
│  │ Budget: 21m │  │ SLO: 99.99%   │  │ SLO: 99.97%  │   │
│  │ /month      │  │ Budget: 4.3m  │  │ Budget: 13m  │   │
│  └─────────────┘  └───────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
Quick Answer
The kube-scheduler uses a two-phase approach: first it filters out nodes that cannot run the Pod (filter plugins, historically called predicates), then it scores the remaining feasible nodes with score plugins (historically called priority functions). The Pod is bound to the node with the highest aggregate weighted score.
Detailed Answer
Think of the Kubernetes scheduler like a hiring manager filling a position. First, you eliminate candidates who do not meet the minimum qualifications (no relevant degree, wrong location, missing certifications): that is the filtering phase. Then, among the qualified candidates, you rank them by how well they fit the role (years of experience, cultural fit, salary expectations): that is the scoring phase. The best-ranked candidate gets the offer, and in Kubernetes, the best-ranked node gets the Pod.
In Kubernetes, the kube-scheduler is a control-plane component that watches the API server for newly created Pods that have no node assignment (spec.nodeName is empty). When it detects an unscheduled Pod, it begins the scheduling cycle. The scheduler maintains an internal scheduling queue that prioritizes Pods based on their priority class, creation timestamp, and other factors. The entire process happens in two distinct phases: filtering (historically called predicates) and scoring (historically called priorities).
During the filtering phase, the scheduler evaluates each node against a set of filter plugins. These include NodeResourcesFit (checking CPU and memory requests against allocatable capacity; the legacy predicate name was PodFitsResources), NodePorts (ensuring requested host ports are available; formerly PodFitsHostPorts), NodeAffinity (matching node labels against affinity rules), TaintToleration (verifying the Pod tolerates the node's taints), PodTopologySpread (enforcing topology spread constraints), and VolumeBinding (checking that required persistent volumes can be provisioned or are available on that node). Any node that fails even one filter is eliminated. If no nodes pass filtering, the Pod remains Pending and the scheduler may trigger preemption if the Pod has sufficient priority to evict lower-priority Pods.
In the scoring phase, each surviving node is evaluated by scoring plugins that assign a value between 0 and 100. The NodeResourcesBalancedAllocation plugin favors nodes that would have balanced CPU and memory use after placing the Pod. The ImageLocality plugin gives higher scores to nodes that already have the container image cached, reducing pull time. InterPodAffinity scores nodes based on whether co-locating the Pod with other Pods matches affinity or anti-affinity preferences. Within NodeResourcesFit, the LeastAllocated scoring strategy prefers nodes with the most free resources, while MostAllocated does the opposite for bin-packing. Each plugin score is multiplied by a configurable weight, and the weighted scores are summed. The node with the highest total score is selected, and the scheduler creates a Binding object to assign the Pod to that node.
At production scale with thousands of nodes, the scheduler uses a percentageOfNodesToScore parameter (defaulting to a formula based on cluster size) to avoid evaluating every single node, which would be too slow. For a 5000-node cluster, the default works out to about 10%: the scheduler stops searching once it has found that many feasible nodes and scores only those. The scheduler also supports scheduling profiles, allowing you to customize the plugin chain per schedulerName, and you can run multiple schedulers side by side. The scheduling framework has extension points like PreFilter, Filter, PreScore, Score, Reserve, Permit, PreBind, Bind, and PostBind, making it highly extensible.
A non-obvious gotcha is that the scheduler makes decisions based on a snapshot of the cluster state, which can become stale in highly dynamic environments. A single scheduler places Pods one at a time and accounts for its own pending placements, but if two schedulers (or a scheduler and a component that sets spec.nodeName directly) target the same node at the same time, the second Pod may fail to bind or be rejected by the kubelet because the first already consumed the resources. Additionally, the percentageOfNodesToScore optimization means the scheduler might not always find the globally optimal node; it finds a good-enough node quickly. Resource requests (not limits) drive scheduling decisions, so Pods without requests are treated as requesting zero resources, which can lead to node overcommitment.
Finally, since Kubernetes 1.12 DaemonSet Pods are scheduled by the default scheduler rather than by the DaemonSet controller: the controller creates each Pod with a node affinity that pins it to its target node, and the scheduler performs the binding.
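The filter-then-score flow can be illustrated with a toy model in Python. This is a sketch under loud assumptions: the `Node` and `Pod` classes, the two filters, and the two scoring functions are simplified stand-ins invented for illustration, not the real kube-scheduler plugins (the real balanced-allocation score, for instance, is computed differently).

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alloc_cpu_m: int            # allocatable CPU in millicores
    alloc_mem_mi: int           # allocatable memory in MiB
    used_cpu_m: int = 0         # sum of requests already placed on the node
    used_mem_mi: int = 0
    taints: frozenset = frozenset()


@dataclass
class Pod:
    cpu_m: int                  # CPU request in millicores
    mem_mi: int                 # memory request in MiB
    tolerations: frozenset = frozenset()


# --- Filter phase: any failing filter eliminates the node ---
def fits_resources(pod: Pod, node: Node) -> bool:
    return (node.used_cpu_m + pod.cpu_m <= node.alloc_cpu_m
            and node.used_mem_mi + pod.mem_mi <= node.alloc_mem_mi)


def tolerates_taints(pod: Pod, node: Node) -> bool:
    return node.taints <= pod.tolerations


# --- Score phase: each scorer returns 0..100 ---
def least_allocated(pod: Pod, node: Node) -> float:
    cpu_free = (node.alloc_cpu_m - node.used_cpu_m - pod.cpu_m) / node.alloc_cpu_m
    mem_free = (node.alloc_mem_mi - node.used_mem_mi - pod.mem_mi) / node.alloc_mem_mi
    return 100 * (cpu_free + mem_free) / 2


def balanced_allocation(pod: Pod, node: Node) -> float:
    cpu_frac = (node.used_cpu_m + pod.cpu_m) / node.alloc_cpu_m
    mem_frac = (node.used_mem_mi + pod.mem_mi) / node.alloc_mem_mi
    return 100 * (1 - abs(cpu_frac - mem_frac))


FILTERS = [fits_resources, tolerates_taints]
SCORERS = [(least_allocated, 1), (balanced_allocation, 2)]  # (plugin, weight)


def schedule(pod: Pod, nodes: list):
    """Return the name of the winning node, or None if the Pod would stay Pending."""
    feasible = [n for n in nodes if all(f(pod, n) for f in FILTERS)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in SCORERS)).name


nodes = [
    Node("a", 4000, 8192, used_cpu_m=3800, used_mem_mi=1024),            # too little CPU left
    Node("b", 4000, 8192, used_cpu_m=1000, used_mem_mi=2048,
         taints=frozenset({"gpu"})),                                     # untolerated taint
    Node("c", 4000, 8192, used_cpu_m=2000, used_mem_mi=2048),
    Node("d", 8000, 16384, used_cpu_m=1000, used_mem_mi=2048),
]
pod = Pod(cpu_m=500, mem_mi=512)

if __name__ == "__main__":
    print(schedule(pod, nodes))
```

Nodes a and b are removed by the filters; c and d are scored, and d wins because it is both emptier and more balanced after placement.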
Code Example
# Custom scheduler profile with specific plugins enabled
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: payments-scheduler # Custom scheduler name for payments workloads
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation # Prefer nodes with balanced CPU/memory
            weight: 2 # Double the weight for balanced allocation
          - name: ImageLocality # Prefer nodes that already have the image cached
            weight: 1 # Standard weight for image locality
    pluginConfig:
      - name: NodeResourcesFit # Bin-packing is a scoring strategy of this plugin in the v1 API
        args:
          scoringStrategy:
            type: LeastAllocated # Spread load; use MostAllocated for bin-packing
      - name: PodTopologySpread # Configure topology spread constraints
        args:
          defaultingType: List # Use list-based defaulting
          defaultConstraints: # Spread across zones by default
            - maxSkew: 1 # Allow at most 1 Pod difference between zones
              topologyKey: topology.kubernetes.io/zone # Spread across AZs
              whenUnsatisfiable: ScheduleAnyway # Soft constraint - still schedule if skew exceeded
---
# Pod with resource requests that drive scheduling decisions
apiVersion: v1
kind: Pod
metadata:
  name: payments-api-7f8d9c # Realistic Pod name with hash suffix
  namespace: payments # Namespace for the payments service
  labels:
    app: payments-api # Label for service discovery and the spread selector below
    tier: backend # Tier label for grouping
spec:
  schedulerName: payments-scheduler # Use the custom scheduler defined above
  topologySpreadConstraints: # Spread Pods across zones for HA
    - maxSkew: 1 # Maximum difference in Pod count between zones
      topologyKey: topology.kubernetes.io/zone # Spread across AZs
      whenUnsatisfiable: DoNotSchedule # Hard constraint - block if cannot satisfy
      labelSelector: # Match Pods with the same app label
        matchLabels:
          app: payments-api # Select all payments-api Pods
  containers:
    - name: payments-api # Main container name
      image: registry.internal.io/payments-api:v2.4.1 # Internal registry image
      resources:
        requests: # These values drive the scheduler filtering phase
          cpu: 500m # Request half a CPU core
          memory: 512Mi # Request 512 MiB of memory
        limits: # Limits enforce runtime cgroups constraints
          cpu: "1" # Limit to 1 full CPU core
          memory: 1Gi # Limit to 1 GiB of memory
◈ Architecture Diagram
┌─────────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│ Unscheduled │    │  Filter  │    │  Score   │    │   Bind   │
│ Pod         │───→│  Phase   │───→│  Phase   │───→│ to Node  │
└─────────────┘    └────┬─────┘    └────┬─────┘    └──────────┘
                        │               │
                        ↓               ↓
                  ┌────────────┐   ┌──────────┐
                  │ Eliminate  │   │ Rank by  │
                  │ Infeasible │   │ Weighted │
                  │ Nodes      │   │ Scores   │
                  └────────────┘   └──────────┘
Quick Answer
When etcd loses quorum (a majority of members are down), it can no longer commit writes, so the cluster is effectively read-only: no new Pods can be scheduled and no state changes can be persisted. Recovery involves either restoring enough members to regain quorum or rebuilding from a snapshot backup.
Detailed Answer
Imagine a board of directors that requires a majority vote to approve any decision. If the company has five board members and three resign suddenly, the remaining two cannot approve anything, even if they agree, because they lack the required majority. The company is paralyzed: no new hires, no budget changes, nothing. That is exactly what happens when etcd loses quorum: the remaining members know the current state but cannot authorize any changes.
In Kubernetes, etcd is the single source of truth for all cluster state: every Pod definition, Service, ConfigMap, Secret, and controller state lives in etcd. The kube-apiserver reads from and writes to etcd exclusively. Etcd uses the Raft consensus algorithm, which requires a strict majority of members, floor(N/2) + 1, to agree on writes. For a 3-member etcd cluster, quorum requires 2 members; for 5 members, it requires 3. When quorum is lost, etcd can still serve stale reads (depending on consistency settings) but rejects all write operations.
When quorum is lost, the failure propagates quickly. The kube-apiserver begins returning errors for any mutating request (POST, PUT, PATCH, DELETE) because etcd refuses writes. Controllers in the kube-controller-manager that rely on leader election through the apiserver may lose their leases. The scheduler cannot bind Pods to nodes. Existing workloads continue running because kubelets cache their Pod specs locally and container runtimes are independent of the control plane. However, no new Pods can be created, no scaling can occur, node heartbeats cannot be persisted (so node status goes stale), and self-healing stops entirely. The cluster is alive but brain-dead.
Recovery depends on the failure scenario. If etcd members are down due to transient issues (network partition, disk pressure, or crashed processes), the fastest path is to bring enough members back online to restore quorum. Check each member with etcdctl endpoint status and etcdctl member list. If a member's data is corrupted, remove it from the cluster with etcdctl member remove, then re-add it as a new member with etcdctl member add and let it rejoin and replicate. Membership changes are themselves writes, so this path only works while the remaining members still hold quorum.
For catastrophic failure where quorum cannot be regained or all members are lost, you must restore from an etcd snapshot. Take regular snapshots with etcdctl snapshot save, then restore with etcdctl snapshot restore (in etcd 3.5 and later the restore subcommand is also provided by etcdutl, and the etcdctl form is deprecated) to a new data directory on each member, updating the initial-cluster and initial-advertise-peer-urls flags. After restoration, restart etcd and verify the kube-apiserver reconnects.
In production, etcd failures at scale are often caused by slow disks, large key-value sizes from too many Kubernetes objects, or aggressive compaction settings. The write-ahead log (WAL) is sensitive to disk latency; etcd recommends dedicated SSDs with sub-10ms p99 fsync latency. A non-obvious gotcha is that etcd v3 has a default storage quota of 2GB (8GB is the commonly suggested maximum), and if the database exceeds this quota, etcd raises a NOSPACE alarm and rejects writes until space is reclaimed and the alarm is disarmed, which from the outside looks much like quorum loss. Another trap: during recovery, if restored members start with stale peer URLs or the old cluster token, they can mix with members of the old cluster and end up in a split-brain-like state. Always restore every member from the same snapshot and use a fresh cluster token to prevent old members from rejoining.
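The majority rule is easy to verify numerically. The following Python sketch uses hypothetical helper names; it only encodes the floor(N/2) + 1 rule described above and is not etcd code.

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a Raft cluster with `members` voters."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be down while writes are still accepted."""
    return members - quorum(members)


def can_write(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(f"{n} members: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
    # The boardroom example: five members, three resign, two remain.
    print("5 members, 2 healthy, writes allowed:", can_write(5, 2))
```

Note that a 4-member cluster tolerates no more failures than a 3-member one, which is why etcd clusters are normally sized with an odd number of members.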
Code Example
# Check etcd cluster health and member status
# (flags: first etcd endpoint, CA certificate, client certificate, and client key for TLS)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
# List all etcd members and their status from a surviving member, in table format
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table
# Create a timestamped snapshot backup (run this as a CronJob in production)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
# Restore from snapshot on each etcd member (adjust --name and the peer URL per member).
# --initial-cluster lists all members; the new cluster token prevents old members rejoining;
# the new data directory avoids conflicts with the old one.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20260615-030000.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380 \
  --initial-cluster-token=etcd-cluster-recovery-1 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored
◈ Architecture Diagram
┌──────────┐    ┌──────────┐    ┌──────────┐
│  etcd-0  │    │  etcd-1  │    │  etcd-2  │
│ HEALTHY  │    │   DOWN   │    │   DOWN   │
└────┬─────┘    └──────────┘    └──────────┘
     │
     ↓
┌──────────┐    ┌──────────┐
│  Quorum  │───→│ API Srvr │
│   LOST   │    │   Read   │
│ No Write │    │   Only   │
└──────────┘    └────┬─────┘
                     │
          ┌──────────┴──────────┐
          ↓                     ↓
     ┌──────────┐          ┌──────────┐
     │ Scheduler│          │Controller│
     │ Blocked  │          │ Blocked  │
     └──────────┘          └──────────┘
Quick Answer
Cilium loads eBPF programs into the Linux kernel to handle packet forwarding, service load balancing, network policy, and L7 observability without iptables rules or per-pod sidecar proxies. Architects must evaluate kernel version requirements, observability maturity via Hubble, CNI migration complexity, and the loss of fine-grained L7 control that a full sidecar proxy provides.
Detailed Answer
Think of a highway toll system. Traditional kube-proxy is like a toll booth where every car stops, gets checked, and is directed to its lane. eBPF with Cilium is like an electronic pass reader embedded in the road surface: the car never stops, the toll is processed at wire speed, and the road itself knows which lane to direct traffic into without a booth.
Cilium replaces the iptables-based kube-proxy and the per-pod user-space proxy model used by traditional service meshes. Instead of maintaining thousands of iptables rules that the kernel evaluates linearly, Cilium attaches eBPF programs to network hooks inside the kernel. These programs handle service IP translation, load balancing across endpoints, and L3/L4 network policy enforcement without packets leaving kernel space. This eliminates the context switches between kernel and user space that Envoy-based sidecars require for every connection. When an L7 rule applies, Cilium redirects only the matching traffic to a node-local Envoy proxy rather than to a sidecar in each Pod.
Internally, Cilium uses several eBPF map types to store service endpoints, identity labels, policy rules, and connection tracking state. When a packet arrives, the eBPF program attached to the network interface or socket looks up the destination service, selects a backend Pod (random selection by default, or Maglev consistent hashing), rewrites headers, and forwards the packet, all within a single kernel function call chain. Hubble, the observability layer built on top of Cilium, taps into these eBPF data paths to provide flow logs and DNS visibility, plus HTTP metrics where L7 visibility is enabled, without injecting a sidecar proxy into any Pod.
At production scale, Cilium is reported to run in over 5,000 production deployments as of 2025, including platforms at Adobe, Bell Canada, and multiple hyperscalers. Teams should monitor eBPF program load errors, map memory usage, endpoint synchronization latency, dropped flow events in Hubble, and kernel version compatibility. Plan for Linux kernel 5.10 or later for full feature support; some advanced features, such as BBR congestion control in the bandwidth manager, need newer kernels still.
The non-obvious gotcha is that Cilium does not fully replicate every L7 feature of Envoy-based meshes. While it handles mutual authentication via SPIFFE identities, basic HTTP routing, and L7 policy, complex traffic management like retries with budgets, circuit breaking with outlier detection, or gRPC-aware load balancing may still require a sidecar or gateway proxy. Architects should map their actual L7 requirements before declaring a full service mesh unnecessary, because removing sidecars and then re-adding them later is a painful migration.
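The difference between walking a rule chain and doing a map lookup can be shown with a toy comparison in Python. This is only a conceptual sketch: the rule list and the dictionary are hypothetical stand-ins for an iptables chain and an eBPF hash map, and no real kernel data structure is modeled.

```python
def build_rules(n_services: int) -> list:
    """Ordered (service VIP, backend) rules, evaluated top to bottom like a chain."""
    return [(f"10.96.{i // 256}.{i % 256}", f"pod-{i}") for i in range(n_services)]


def lookup_linear(rules: list, vip: str):
    """Walk the rules in order; return (backend, number of comparisons made)."""
    for comparisons, (rule_vip, backend) in enumerate(rules, start=1):
        if rule_vip == vip:
            return backend, comparisons
    return None, len(rules)


def lookup_map(table: dict, vip: str):
    """Single hash lookup, independent of how many services exist."""
    return table.get(vip)


rules = build_rules(5000)
table = dict(rules)          # same data, keyed by VIP
last_vip = rules[-1][0]

if __name__ == "__main__":
    backend, comparisons = lookup_linear(rules, last_vip)
    print(f"linear scan: {backend} after {comparisons} comparisons")
    print(f"map lookup:  {lookup_map(table, last_vip)} after one hash lookup")
```

The linear cost grows with the number of services, while the map lookup stays roughly constant, which is the scaling argument the answer makes for eBPF maps.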
Code Example
# Install Cilium with kube-proxy replacement enabled on a fresh cluster
helm install cilium cilium/cilium --version 1.16.4 \
  --namespace kube-system \
  --set kubeProxyReplacement=true \
  --set k8sServiceHost=api.payments-cluster.internal \
  --set k8sServicePort=6443 \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true
# Verify Cilium replaced kube-proxy and is handling service translation
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep KubeProxyReplacement
# View real-time HTTP flows for the payments namespace with the Hubble CLI in a Cilium agent Pod
# (this shows flows seen by that node's agent; use a local hubble CLI against hubble-relay for cluster-wide flows)
kubectl -n kube-system exec ds/cilium -- hubble observe --namespace payments --protocol http
# List the service load-balancing entries held in eBPF maps on one node
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list
# Apply an L7 network policy that restricts HTTP methods on the checkout API
apiVersion: cilium.io/v2 # Cilium-specific CRD for extended network policy
kind: CiliumNetworkPolicy # Extends Kubernetes NetworkPolicy with L7 rules
metadata:
  name: checkout-api-l7-policy # Policy name describing its scope
  namespace: payments # Applies to the payments namespace
spec:
  endpointSelector:
    matchLabels:
      app: checkout-api # Targets the checkout API pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: web-frontend # Allows traffic only from the frontend
      toPorts:
        - ports:
            - port: "8080" # The checkout API listening port
              protocol: TCP # HTTP runs over TCP
          rules:
            http:
              - method: POST # Allows POST for creating orders
                path: /api/v2/orders # Restricts to the orders endpoint
              - method: GET # Allows GET for reading order status
                path: /api/v2/orders/.* # Permits path parameters for order lookups
◈ Architecture Diagram
┌──────────┐       ┌──────────┐
│  Pod A   │       │  Pod B   │
└────┬─────┘       └────┬─────┘
     │                  │
     ↓                  ↓
┌─────────────────────────────┐
│        eBPF (kernel)        │
│  ┌────────┐ ┌────────────┐  │
│  │Svc LB  │ │L7 Policy   │  │
│  └────────┘ └────────────┘  │
│  ┌────────┐ ┌────────────┐  │
│  │ConnTrk │ │Hubble Tap  │  │
│  └────────┘ └────────────┘  │
└─────────────────────────────┘
Quick Answer
Architects tune etcd by sizing disks for low-latency IOPS, adjusting compaction and defragmentation schedules, monitoring database size and peer latency, and separating the events store. Sharding the main etcd or using virtual clusters becomes necessary when a single etcd instance approaches 8 GB or 30,000-40,000 objects and API server latency degrades.
Detailed Answer
Think of a library card catalog. When the library has a few thousand books, one cabinet handles lookups fine. But when the library grows to millions of books and hundreds of librarians are searching simultaneously, you either need a faster cabinet, multiple cabinets organized by subject, or a way to archive old cards. Etcd is that card catalog for Kubernetes: every resource definition, status update, and event is a card in the catalog.
Etcd is the sole persistent store for Kubernetes cluster state. Every API server read and write flows through etcd, making its performance the ceiling for cluster responsiveness. For large clusters (those with tens of thousands of Pods, thousands of Services, or high churn from controllers and operators), etcd often becomes the bottleneck before CPU, memory, or network do. The key metrics are fsync latency (which depends on disk IOPS), database size, number of keys, leader election frequency, and peer round-trip time between etcd members.
Internally, etcd uses a B-tree index with multi-version concurrency control, or MVCC, keeping every revision of every key until compacted. Compaction removes old revisions, and defragmentation reclaims disk space after compaction. Without regular compaction, the database grows unboundedly. Kubernetes runs automatic compaction every five minutes by default, but operators must also schedule defragmentation because compaction alone does not free physical disk space. On cloud providers, using SSD volumes with provisioned IOPS (like gp3 with 6000+ IOPS on AWS) is critical because etcd performance degrades sharply when fsync latency exceeds 10 milliseconds.
At production scale, the first architectural decision is separating the events store. Kubernetes Events are high-volume, short-lived objects that create write pressure without carrying critical state. Running a dedicated etcd instance for Events reduces load on the main etcd cluster significantly. AWS EKS offers provisioned control plane tiers (XL, 2XL, 4XL) that scale etcd database limits up to 16 GB for clusters running AI and ML workloads with many custom resources. When even separated events and tuned compaction are insufficient, the next scaling levers are true etcd sharding (distributing different resource types to separate etcd clusters) or virtual clusters that maintain independent etcd instances per tenant.
The non-obvious gotcha is that etcd performance problems often manifest as API server timeouts or slow kubectl responses, and teams blame the API server rather than looking at etcd disk latency. A single slow etcd member in a three-node cluster can drag down the entire quorum because the leader waits for a majority of members to acknowledge writes. Architects should alert on p99 fsync duration, database size approaching 8 GB, and any leader changes, because a leader election storm during high write load can cascade into control-plane unavailability.
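To make the revision growth and compaction behavior concrete, here is a toy multi-version store in Python. It is a sketch under stated assumptions: `ToyMVCCStore` is a hypothetical class invented for illustration, it keeps history in a plain dictionary, and it does not model etcd's on-disk layout (so it cannot show why defragmentation is a separate step).

```python
class ToyMVCCStore:
    """Keeps every revision of every key until compact() is called."""

    def __init__(self):
        self.revision = 0      # global, monotonically increasing revision counter
        self.history = {}      # key -> list of (revision, value), oldest first

    def put(self, key, value):
        self.revision += 1
        self.history.setdefault(key, []).append((self.revision, value))
        return self.revision

    def get(self, key):
        """Latest value for the key, or None if it was never written."""
        versions = self.history.get(key)
        return versions[-1][1] if versions else None

    def compact(self, revision):
        """Drop versions superseded at or before `revision`.

        For each key, the newest version at or before `revision` is kept so the
        state as of that revision is still readable; newer versions are untouched.
        """
        for key, versions in self.history.items():
            kept_old = [rv for rv in versions if rv[0] <= revision][-1:]
            newer = [rv for rv in versions if rv[0] > revision]
            self.history[key] = kept_old + newer

    def stored_revisions(self):
        return sum(len(v) for v in self.history.values())


if __name__ == "__main__":
    store = ToyMVCCStore()
    for i in range(100):                       # a chatty controller updating one object
        store.put("pods/payments-api", f"status-{i}")
    store.put("configmaps/app", "v1")
    print("before compaction:", store.stored_revisions())
    store.compact(store.revision)
    print("after compaction: ", store.stored_revisions())
    print("latest value kept:", store.get("pods/payments-api"))
```

A single frequently updated object accounts for almost all stored revisions here, which mirrors why high-churn resources dominate etcd database growth between compactions.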
Code Example
# Check etcd database size and leader status for a member
ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table
# Inspect the WAL fsync latency histogram from etcd's Prometheus metrics
# (client certificates are needed when the client port enforces TLS client auth)
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  https://etcd-0.etcd.kube-system:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
# Trigger a manual defragmentation on a specific member during a maintenance window
ETCDCTL_API=3 etcdctl defrag --endpoints=https://etcd-1.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Configure API server to use a separate etcd instance for Event objects
# In kube-apiserver manifest or startup flags:
# --etcd-servers=https://etcd-main:2379 # Main etcd for all resources
# --etcd-servers-overrides=/events#https://etcd-events:2379 # Separate etcd for Events
# Check Kubernetes object counts by resource type to identify growth
kubectl get --raw='/metrics' | grep '^apiserver_storage_objects' | sort -t' ' -k2 -rn | head -20
◈ Architecture Diagram
        ┌──────────┐
        │API Server│
        └──┬───┬───┘
           │   │
           ↓   ↓
      ┌─────┐ ┌─────────┐
      │Main │ │Events   │
      │etcd │ │etcd     │
      │< 8GB│ │separate │
      └──┬──┘ └─────────┘
         │
      ┌──┴───────────┐
      │Compact+Defrag│
      └──────────────┘
Quick Answer
Istio ambient mesh replaces per-pod Envoy sidecars with two shared components: ztunnel, a per-node L4 proxy handling mTLS and basic routing, and optional waypoint proxies for L7 policy. Architects must evaluate the migration path for existing sidecar workloads, L7 feature parity, multi-cluster ambient support maturity, and the operational tradeoff of shared node-level proxies versus isolated per-pod proxies.
Detailed Answer
Think of an apartment building with two security options. The sidecar model gives every apartment its own security guard who checks visitors at the apartment door: effective but expensive. The ambient model puts a guard at the building entrance who checks IDs for everyone, and only apartments that need advanced screening get a shared floor-level inspector. You get security everywhere with far fewer guards.
Istio ambient mesh reached Beta in Istio 1.22 (May 2024) and general availability with Istio 1.24 in November 2024, and has become production-stable through 2025 and into 2026. It fundamentally changes how the data plane is deployed. Traditional Istio injects an Envoy sidecar into every Pod, which adds memory overhead (typically 50-100 MB per Pod), increases startup latency, and creates operational complexity around sidecar injection, upgrade ordering, and resource accounting. Ambient mesh removes all of this by separating L4 and L7 concerns into shared infrastructure.
The architecture has two layers. Ztunnel is a lightweight Rust-based proxy that runs as a DaemonSet on every node. It handles all L4 concerns: mTLS encryption and identity using SPIFFE certificates, TCP-level authorization policy, and basic connection routing. The Istio project reports that ztunnel performance has improved by about 75 percent over recent releases, and its added latency is small. For workloads that need L7 features (HTTP routing, retries, header-based authorization, traffic splitting), architects deploy waypoint proxies, which are shared Envoy instances scoped to a namespace or service rather than injected per Pod.
In production migration, teams should start by enabling ambient mode on a namespace using the label istio.io/dataplane-mode=ambient. Existing sidecar workloads can coexist with ambient workloads during migration. The key evaluation points are: L7 feature gaps between sidecar and waypoint proxy configurations, whether multi-cluster ambient mesh is mature enough (it is alpha as of Istio 1.27), how existing Istio AuthorizationPolicy and VirtualService resources translate, and whether shared ztunnel on a node creates a blast radius concern where a ztunnel crash affects all Pods on that node.
The non-obvious gotcha is that ambient mesh changes the failure domain. In sidecar mode, a proxy crash affects one Pod. In ambient mode, a ztunnel crash can disrupt networking for every Pod on that node. This makes ztunnel reliability, resource limits, and upgrade strategy (rolling DaemonSet updates) critical. Architects should also verify that their observability stack captures ztunnel metrics and waypoint proxy metrics in the same dashboards, because the telemetry surface shifts from per-pod to per-node and per-namespace.
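The tradeoff between proxy count and failure domain can be sketched with simple arithmetic in Python. The function names and the example cluster layout are hypothetical; the 50-100 MB per-sidecar range comes from the answer above, and no ztunnel or waypoint memory figure is assumed because the source does not give one.

```python
def sidecar_footprint(pods_per_node: dict, mb_per_sidecar=(50, 100)) -> dict:
    """Proxy count, memory range, and crash blast radius with one sidecar per Pod."""
    pods = sum(pods_per_node.values())
    low, high = mb_per_sidecar
    return {
        "proxies": pods,
        "sidecar_memory_mb": (pods * low, pods * high),
        "pods_hit_by_one_proxy_crash": 1,
    }


def ambient_footprint(pods_per_node: dict, waypoints: int) -> dict:
    """Proxy count and worst-case blast radius with one ztunnel per node."""
    return {
        "proxies": len(pods_per_node) + waypoints,
        "pods_hit_by_one_ztunnel_crash": max(pods_per_node.values()),
    }


# Hypothetical cluster: three nodes with uneven Pod density, two waypoint proxies.
cluster = {"node-a": 40, "node-b": 55, "node-c": 25}

if __name__ == "__main__":
    print("sidecar:", sidecar_footprint(cluster))
    print("ambient:", ambient_footprint(cluster, waypoints=2))
```

In this example the mesh drops from 120 proxies to 5, but the worst-case crash now touches 55 Pods instead of 1, which is the failure-domain shift the answer warns about.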
Code Example
# Enable ambient mesh mode on the payments namespace
kubectl label namespace payments istio.io/dataplane-mode=ambient
# Verify ztunnel is running on every node in the mesh
kubectl get pods -n istio-system -l app=ztunnel -o wide
# Deploy a waypoint proxy for L7 policy and enroll the payments namespace to use it
istioctl waypoint apply --namespace payments --name payments-waypoint --enroll-namespace
# Verify the waypoint proxy is ready and accepting traffic
kubectl get gateway payments-waypoint -n payments
# Apply an L7 AuthorizationPolicy that requires the waypoint proxy
apiVersion: security.istio.io/v1 # Istio security API for authorization
kind: AuthorizationPolicy # Controls which requests are allowed
metadata:
  name: checkout-api-auth # Policy name describing its scope
  namespace: payments # Namespace where the waypoint proxy runs
spec:
  targetRefs:
  - kind: Service # Targets a specific Kubernetes Service
    group: "" # Core API group
    name: checkout-api # The service to protect
  action: ALLOW # Permits matching requests
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/payments/sa/web-frontend"] # SPIFFE identity of the caller
    to:
    - operation:
        methods: ["POST"] # Allows only POST requests
        paths: ["/api/v2/orders"] # Restricts to the orders endpoint
# Check ztunnel connection metrics on a specific node
# (assumes curl is available in the ztunnel container; exact metric names vary by Istio release)
kubectl -n istio-system exec ds/ztunnel -- curl -s localhost:15020/metrics | grep tcp_connections
◈ Architecture Diagram
┌───── Node ─────────────────────┐
│ ┌────────────┐  ┌────────────┐ │
│ │   Pod A    │  │   Pod B    │ │
│ │(no sidecar)│  │(no sidecar)│ │
│ └─────┬──────┘  └─────┬──────┘ │
│       └───────┬───────┘        │
│         ┌─────┴─────┐          │
│         │  ztunnel  │ (L4)     │
│         │ mTLS+auth │          │
│         └─────┬─────┘          │
└───────────────┼────────────────┘
                ↓
           ┌──────────┐
           │ Waypoint │ (L7)
           │  Proxy   │
           └──────────┘
Quick Answer
Topology spread constraints tell the scheduler to distribute Pods across failure domains defined by node labels such as zone or hostname, using maxSkew to control imbalance. When combined with cluster autoscaling, problems arise if a zone has zero nodes — the autoscaler may not know about the zone, causing the scheduler to leave Pods pending indefinitely.
Detailed Answer
Think of seating guests at a wedding reception. You want to spread friends evenly across tables so no table is overcrowded and no group is isolated. The wedding planner checks how many people are at each table and seats the next guest at the most empty one, but if a table does not exist yet (no physical table has been set up), the planner cannot seat anyone there even if the venue has room. Topology spread constraints in Kubernetes work the same way. Kubernetes topology spread constraints are declared in the Pod spec under topologySpreadConstraints. Each constraint specifies a topologyKey (a node label like topology.kubernetes.io/zone or kubernetes.io/hostname), a maxSkew (the maximum allowed difference in Pod count between the most-populated and least-populated domain), a whenUnsatisfiable behavior (DoNotSchedule or ScheduleAnyway), and a labelSelector to identify which Pods count toward the spread calculation. Internally, the scheduler evaluates topology spread during the Filter and Score phases. In the Filter phase, it eliminates nodes where placing the Pod would violate the maxSkew when whenUnsatisfiable is DoNotSchedule. In the Score phase, it ranks remaining nodes by how well they balance the distribution. The scheduler considers the topologyKey label on existing nodes to define domains — a domain only exists if at least one node carries that label value. It then counts matching Pods per domain and calculates whether the new Pod can land in each domain without exceeding maxSkew. At production scale, the interaction with cluster autoscaling creates subtle failures. If a node pool in one availability zone scales to zero, that zone disappears from the scheduler's topology map. The scheduler only sees zones with active nodes, so it may consider a two-zone spread sufficient even when three zones are available. 
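When maxSkew is 1 and whenUnsatisfiable is DoNotSchedule but the constraint does not say how many zones it expects, the scheduler simply balances Pods across the zones it can see. No Pod goes pending, so the autoscaler gets no signal to create a node in the missing zone, and zone redundancy is lost silently. In stricter setups (for example with minDomains, or node affinity that names the missing zone) the Pods stay Pending instead, and they remain stuck if the autoscaler cannot bring up a node carrying the expected zone label. This chicken-and-egg problem is one of the most common production issues with topology spread constraints.
The non-obvious gotcha involves rolling updates. The labelSelector usually matches Pods from both the old and the new ReplicaSet, so skew is computed across both revisions. Pods that are already terminating are skipped, but old Pods that are still running count toward the spread. Once the old Pods are removed, the new revision can end up unevenly distributed, and nothing rebalances it, because the constraint is only evaluated at scheduling time. In releases that support it, adding matchLabelKeys with pod-template-hash makes the scheduler compute skew per revision. Architects should set minDomains to explicitly declare how many zones the spread should consider, use node affinity in combination with spread constraints to ensure the autoscaler knows about expected zones, and monitor for unschedulable Pods with topology spread violation events.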
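The skew check and the effect of minDomains can be sketched in a few lines of Python. This is a simplified model of the Filter-phase rule (matching Pods in the candidate zone, plus the new Pod, minus the global minimum, must not exceed maxSkew), not the scheduler's actual code. The zone names and the `schedule` helper are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple


def fits(counts: Dict[str, int], zone: str, max_skew: int,
         min_domains: Optional[int] = None) -> bool:
    """Would placing one more Pod in `zone` keep the skew within max_skew?"""
    global_min = min(counts.values())
    # With minDomains set and fewer visible domains, the minimum is treated as 0.
    if min_domains is not None and len(counts) < min_domains:
        global_min = 0
    return counts[zone] + 1 - global_min <= max_skew


def schedule(replicas: int, zones: List[str], max_skew: int,
             min_domains: Optional[int] = None) -> Tuple[Dict[str, int], int]:
    """Place replicas one by one; return per-zone counts and the pending count."""
    counts = {z: 0 for z in zones}
    pending = 0
    for _ in range(replicas):
        candidates = [z for z in zones if fits(counts, z, max_skew, min_domains)]
        if not candidates:
            pending += 1  # DoNotSchedule: the Pod stays Pending
            continue
        # Stand-in for the Score phase: prefer the emptiest eligible zone.
        target = min(candidates, key=lambda z: counts[z])
        counts[target] += 1
    return counts, pending


# Zone C has scaled to zero, so the scheduler only sees two zones.
two_zones_plain = schedule(6, ["zone-a", "zone-b"], max_skew=1)
two_zones_min3 = schedule(6, ["zone-a", "zone-b"], max_skew=1, min_domains=3)
three_zones_min3 = schedule(6, ["zone-a", "zone-b", "zone-c"], max_skew=1, min_domains=3)

print(two_zones_plain)   # everything lands in two zones, nothing pending
print(two_zones_min3)    # one Pod per zone, the rest pending
print(three_zones_min3)  # two Pods per zone
```

Without minDomains the two-zone case looks healthy to the scheduler. With minDomains set to 3, the extra Pods go pending, which is the signal an autoscaler needs to add capacity in the missing zone.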
Code Example
# Apply a Deployment with zone and node spread constraints
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages replicated Pods
metadata:
  name: checkout-api # Production checkout service
  namespace: payments # Team namespace
spec:
  replicas: 6 # Six replicas to spread across three zones with two per zone
  selector:
    matchLabels:
      app: checkout-api # Pod selector
  template:
    metadata:
      labels:
        app: checkout-api # Label used by spread constraint selector
    spec:
      topologySpreadConstraints:
      - maxSkew: 1 # Allows at most one Pod difference between zones
        topologyKey: topology.kubernetes.io/zone # Spreads across availability zones
        whenUnsatisfiable: DoNotSchedule # Strictly enforces zone balance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
        minDomains: 3 # Expects three zones even if some have zero nodes
      - maxSkew: 1 # Allows at most one Pod difference between nodes within a zone
        topologyKey: kubernetes.io/hostname # Spreads across individual nodes
        whenUnsatisfiable: ScheduleAnyway # Prefers balance but allows imbalance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
      containers:
      - name: api # Application container
        image: registry.company.com/checkout-api:3.7.2 # Versioned production image
        resources:
          requests:
            cpu: 250m # Minimum CPU for scheduling
            memory: 512Mi # Minimum memory for scheduling
# Check Pod distribution across nodes, then map nodes to zones
# (the zone label lives on Node objects, not on Pods)
kubectl get pods -n payments -l app=checkout-api -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'
kubectl get nodes -L topology.kubernetes.io/zone
# Identify Pods pending due to topology spread violations
kubectl get events -n payments --field-selector reason=FailedScheduling | grep topology
◈ Architecture Diagram
┌─── Zone A ──┐ ┌─── Zone B ──┐ ┌─── Zone C ──┐
│ ┌────┐┌────┐│ │ ┌────┐┌────┐│ │ ┌────┐┌────┐│
│ │Pod1││Pod2││ │ │Pod3││Pod4││ │ │Pod5││Pod6││
│ └────┘└────┘│ │ └────┘└────┘│ │ └────┘└────┘│
│  maxSkew=1  │ │  maxSkew=1  │ │  maxSkew=1  │
└─────────────┘ └─────────────┘ └─────────────┘
Quick Answer
Scheduler plugins hook into the scheduling framework's extension points (PreFilter, Filter, PreScore, Score, Reserve, Permit, PreBind, Bind) to add custom logic like gang scheduling, co-scheduling, or capacity reservation. Scheduling profiles allow running multiple schedulers with different plugin configurations. Risks include increased scheduling latency, unintended Pod starvation, and complex debugging when plugins interact.
Detailed Answer
Think of a wedding seating planner with very specific rules. The basic planner checks table capacity and guest preferences. But this wedding also requires that certain groups of guests must all be seated simultaneously (gang scheduling), some tables are reserved for VIPs until the last minute (capacity reservation), and guests from rival families must never share an aisle (anti-affinity). Standard rules cannot express all of this, so the planner adds specialized checkers at different stages of the seating process. Kubernetes scheduler plugins work exactly this way. The Kubernetes scheduling framework replaced the old policy-based scheduler configuration with a plugin architecture. The scheduler processes each Pod through a pipeline of extension points: PreFilter (validate and preprocess), Filter (eliminate ineligible nodes), PostFilter (handle unschedulable Pods), PreScore (prepare scoring data), Score (rank eligible nodes), Reserve (tentatively claim resources), Permit (wait or approve), PreBind (prepare external resources), and Bind (commit the Pod to a node). Each extension point can have multiple plugins that run in order. Scheduling profiles allow a single kube-scheduler binary to expose multiple scheduler personalities. Each profile has a name and its own set of enabled, disabled, and configured plugins. A Pod selects its scheduler by setting spec.schedulerName. This means architects can run a default profile for general workloads and a specialized profile for GPU workloads, batch jobs, or latency-sensitive services without deploying separate scheduler binaries. The scheduler-plugins project from Kubernetes SIGs provides production-grade plugins like Coscheduling (gang scheduling for batch workloads that need all Pods scheduled together), Capacity Scheduling (enforcing elastic quotas across namespaces), and Trimaran (scoring based on real-time node use from metrics server). 
At production scale, custom scheduler plugins require careful testing because they affect every Pod placement decision. A slow PreFilter or Score plugin increases scheduling latency for all Pods using that profile. A buggy Filter plugin can make nodes ineligible when they should be available, causing Pods to remain pending. Plugin ordering matters because earlier plugins in the chain can mask or override later ones. Architects should measure scheduler latency percentiles (scheduler_scheduling_attempt_duration_seconds and, per plugin, scheduler_plugin_execution_duration_seconds), unschedulable Pod counts (scheduler_pending_pods), and plugin-specific metrics before and after enabling custom plugins. The non-obvious gotcha is debugging scheduling failures with custom plugins. When a Pod is unschedulable, the FailedScheduling event summarizes why nodes were rejected, but the interaction between multiple plugins can create emergent behavior that is hard to trace. For example, a topology spread constraint combined with a capacity reservation plugin can create scenarios where Pods are pending not because of resource shortage but because the combination of constraints has no feasible solution. Architects should use the scheduler's verbose logging, the scheduling-queue metrics, and dry-run scheduling tools to validate plugin interactions before production deployment.
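The Filter and Score stages, and the way combined constraints can leave no feasible node, can be illustrated with a toy pipeline in Python. This is a conceptual sketch only: the plugin functions, node fields, and weights are invented for illustration and do not mirror the real scheduling framework interfaces (which are Go).

```python
from typing import Callable, Dict, List, Optional, Tuple

Node = Dict[str, object]
Pod = Dict[str, object]
FilterPlugin = Callable[[Pod, Node], bool]
ScorePlugin = Callable[[Pod, Node], int]


def fits_cpu(pod: Pod, node: Node) -> bool:
    """Filter: the node must have enough free CPU."""
    return node["free_cpu"] >= pod["cpu"]


def in_allowed_zone(pod: Pod, node: Node) -> bool:
    """Filter: the node must be in one of the zones the Pod allows."""
    return node["zone"] in pod.get("zones", [node["zone"]])


def most_free_cpu(pod: Pod, node: Node) -> int:
    """Score: prefer nodes with more free CPU (capped at 100)."""
    return min(100, int(node["free_cpu"]) * 10)


def schedule(pod: Pod, nodes: List[Node], filters: List[FilterPlugin],
             scorers: List[Tuple[ScorePlugin, int]]
             ) -> Tuple[Optional[str], Dict[str, List[str]]]:
    """Run every filter, then pick the feasible node with the best weighted score."""
    rejected: Dict[str, List[str]] = {}
    feasible: List[Node] = []
    for node in nodes:
        failed = [f.__name__ for f in filters if not f(pod, node)]
        if failed:
            rejected[str(node["name"])] = failed  # which plugins said no
        else:
            feasible.append(node)
    if not feasible:
        return None, rejected  # unschedulable: no node passed all filters
    best = max(feasible, key=lambda n: sum(w * s(pod, n) for s, w in scorers))
    return str(best["name"]), rejected


NODES = [
    {"name": "n1", "free_cpu": 2, "zone": "a"},
    {"name": "n2", "free_cpu": 8, "zone": "b"},
    {"name": "n3", "free_cpu": 6, "zone": "a"},
]
FILTERS = [fits_cpu, in_allowed_zone]
SCORERS = [(most_free_cpu, 1)]

placed = schedule({"cpu": 4, "zones": ["a"]}, NODES, FILTERS, SCORERS)
stuck = schedule({"cpu": 4, "zones": ["c"]}, NODES, FILTERS, SCORERS)
print(placed)
print(stuck)
```

In the second call every node is rejected, and the per-node list of failing plugins is the kind of breakdown worth collecting when debugging a pending Pod.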
Code Example
# KubeSchedulerConfiguration with two profiles: default and batch-coscheduling
# (Coscheduling is not part of the stock kube-scheduler; this assumes a scheduler built with the scheduler-plugins project)
apiVersion: kubescheduler.config.k8s.io/v1 # Scheduler configuration API
kind: KubeSchedulerConfiguration # Configures the kube-scheduler binary
profiles:
- schedulerName: default-scheduler # Default profile for general workloads
  plugins:
    queueSort: # All profiles must use the same queueSort plugin (one shared queue)
      enabled:
      - name: Coscheduling # Same queue ordering as the batch profile
      disabled:
      - name: "*" # Turns off the default queue sort
    score:
      enabled:
      - name: NodeResourcesFit # Scores nodes by resource availability
        weight: 1 # Standard weight
      - name: InterPodAffinity # Scores based on Pod affinity preferences
        weight: 1 # Standard weight
- schedulerName: batch-scheduler # Specialized profile for ML training jobs
  plugins:
    queueSort:
      enabled:
      - name: Coscheduling # Sorts Pods so gang members are scheduled together
      disabled:
      - name: "*" # Turns off the default queue sort
    preFilter:
      enabled:
      - name: Coscheduling # Validates that all gang members exist
    postFilter:
      enabled:
      - name: Coscheduling # Preempts to make room for complete gangs
    permit:
      enabled:
      - name: Coscheduling # Holds Pods until all gang members are schedulable
    reserve:
      enabled:
      - name: Coscheduling # Reserves resources for the complete gang
# A batch training job that uses gang scheduling via the batch-scheduler profile
# (gang label keys depend on the scheduler-plugins release; newer releases may use a PodGroup resource instead)
apiVersion: batch/v1 # Standard Job API
kind: Job # Batch workload requiring all workers to start together
metadata:
  name: fraud-model-training # Distributed training job
  namespace: ml-platform # ML team namespace
  labels:
    pod-group.scheduling.sigs.k8s.io/name: fraud-training-gang # Gang scheduling group name
    pod-group.scheduling.sigs.k8s.io/min-available: "4" # All four workers must be scheduled
spec:
  parallelism: 4 # Four parallel training workers
  completions: 4 # Job completes when all four finish
  template:
    metadata:
      labels:
        pod-group.scheduling.sigs.k8s.io/name: fraud-training-gang # Same gang group label
    spec:
      schedulerName: batch-scheduler # Uses the coscheduling profile
      restartPolicy: Never # Job Pods must set Never or OnFailure
      containers:
      - name: trainer # Distributed training worker container
        image: registry.company.com/fraud-trainer:4.3.1 # ML training image
        resources:
          requests:
            cpu: 4 # Four CPU cores per worker
            memory: 16Gi # 16GB memory per worker
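# Monitor scheduler performance metrics for the batch profile
# (scheduler metrics are served by kube-scheduler itself, HTTPS port 10259 by default, not by the API Server's /metrics)
# Run on a control-plane node; TOKEN is assumed to hold a bearer token authorized to read /metrics
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10259/metrics | grep scheduler_scheduling_attempt_duration_seconds | grep batch-scheduler
◈ Architecture Diagram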
┌──────────────────────────────────┐
│       Scheduling Pipeline        │
│                                  │
│ PreFilter → Filter → PostFilter  │
│               ↓                  │
│ PreScore → Score → Reserve       │
│               ↓                  │
│ Permit → PreBind → Bind          │
│                                  │
│ ┌──────────┐ ┌────────────────┐  │
│ │ Default  │ │ Batch Profile  │  │
│ │ Profile  │ │ +Coscheduling  │  │
│ └──────────┘ └────────────────┘  │
└──────────────────────────────────┘
Quick Answer
Gateway API separates infrastructure ownership from route ownership through resources such as GatewayClass, Gateway, and HTTPRoute. Platform teams can own listener and infrastructure policy, while application teams attach routes within allowed namespaces and hostnames.
Detailed Answer
Think of a shopping mall. The mall operator controls entrances, fire doors, and security rules, while each store controls its own sign and product layout. Traditional Ingress often turns the mall operator into the person editing every store sign. Gateway API gives the building owner and store owner separate but connected responsibilities. Gateway API was designed to make Kubernetes traffic management more expressive and role-oriented than the original Ingress API. Ingress is useful but often overloaded with controller-specific annotations, which blur ownership and make portability harder. Gateway API introduces clearer resources for infrastructure providers, cluster operators, and application teams so each group can manage the layer it actually owns. The flow starts with GatewayClass, which identifies the implementation or controller. A Gateway represents a deployed data-plane entry point with listeners such as HTTPS on port 443. HTTPRoute, GRPCRoute, or other route resources define application-level routing rules and attach to a Gateway when allowed by namespace and listener policy. The controller reconciles those resources into load balancer, proxy, or service mesh configuration. In production, Gateway API helps with multi-team environments. Platform engineers can standardize TLS, listener ports, allowed namespaces, and shared infrastructure. App teams can publish or update service routes without editing a central Ingress file. Operators should monitor route attachment status, accepted conditions, listener conflicts, certificate readiness, and controller reconciliation errors. This shifts troubleshooting from annotation archaeology to explicit status fields. The gotcha is that Gateway API does not magically remove governance. If allowedRoutes is too permissive, one team can still attach unexpected hostnames or paths. If it is too strict, teams see routes that never attach and traffic silently stays on the old path. 
Architects need namespace policy, hostname ownership, certificate automation, and clear dashboards showing which routes are accepted by which Gateway.
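The attachment decision described above (namespace opt-in plus hostname ownership) can be modeled in a few lines of Python. This is a simplified sketch of the rules, not a controller implementation: it only handles matchLabels selectors and a single listener, and the dictionary layout is an assumption made for illustration. The reason strings follow the condition reasons that Gateway API implementations commonly report.

```python
from typing import Dict, List, Tuple


def hostname_matches(listener_host: str, route_host: str) -> bool:
    """A '*.example.com' listener matches any subdomain, but not 'example.com' itself."""
    if listener_host.startswith("*."):
        suffix = listener_host[1:]  # e.g. ".interviewcatalog.example"
        return route_host.endswith(suffix) and len(route_host) > len(suffix)
    return listener_host == route_host


def namespace_allowed(selector: Dict[str, str], ns_labels: Dict[str, str]) -> bool:
    """allowedRoutes with from: Selector and matchLabels: every label must match."""
    return all(ns_labels.get(key) == value for key, value in selector.items())


def can_attach(listener: Dict[str, object], ns_labels: Dict[str, str],
               route_hostnames: List[str]) -> Tuple[bool, str]:
    """Return (accepted, reason) for a route trying to attach to a listener."""
    if not namespace_allowed(listener["namespace_selector"], ns_labels):
        return False, "NotAllowedByListeners"
    if not any(hostname_matches(str(listener["hostname"]), h) for h in route_hostnames):
        return False, "NoMatchingListenerHostname"
    return True, "Accepted"


listener = {
    "hostname": "*.interviewcatalog.example",
    "namespace_selector": {"expose-public": "true"},
}

ok = can_attach(listener, {"expose-public": "true"}, ["payments.interviewcatalog.example"])
no_label = can_attach(listener, {}, ["payments.interviewcatalog.example"])
wrong_host = can_attach(listener, {"expose-public": "true"}, ["payments.other.example"])
print(ok, no_label, wrong_host)
```

A route that fails either check is never attached, which is why the route's status conditions are the first place to look when traffic does not move.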
Code Example
# Install Gateway API CRDs for clusters that do not include them yet
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
# Apply the shared production Gateway owned by the platform team
kubectl apply -f platform-gateway.yaml
# Apply an application route owned by the payments team
kubectl apply -f payments-route.yaml
# Verify whether the route was accepted by the Gateway controller
kubectl get httproute payments-api -n payments -o jsonpath='{.status.parents[*].conditions[*].type}'
# platform-gateway.yaml
apiVersion: gateway.networking.k8s.io/v1 # Uses the stable Gateway API
kind: Gateway # Declares a shared traffic entry point
metadata:
  name: public-web # Platform-owned public web Gateway
  namespace: platform-ingress # Keeps infrastructure config in a platform namespace
spec:
  gatewayClassName: prod-nginx # Selects the installed Gateway controller implementation
  listeners:
  - name: https # Listener name used by routes and status
    protocol: HTTPS # Accepts encrypted HTTP traffic
    port: 443 # Standard public HTTPS port
    hostname: '*.interviewcatalog.example' # Restricts accepted hostnames to the platform domain
    tls:
      mode: Terminate # HTTPS listeners need TLS settings
      certificateRefs:
      - name: wildcard-interviewcatalog-tls # Assumed Secret name; replace with your certificate Secret
    allowedRoutes:
      namespaces:
        from: Selector # Allows only selected namespaces to attach routes
        selector:
          matchLabels:
            expose-public: 'true' # Namespace opt-in controlled by platform policy
---
# payments-route.yaml
apiVersion: gateway.networking.k8s.io/v1 # Uses the stable route API
kind: HTTPRoute # Defines app-owned HTTP routing rules
metadata:
  name: payments-api # Route for the payments service
  namespace: payments # Owned by the payments application team
spec:
  parentRefs:
  - name: public-web # Attaches to the platform Gateway
    namespace: platform-ingress # References the Gateway namespace explicitly
  hostnames:
  - payments.interviewcatalog.example # Hostname this team is allowed to serve
  rules:
  - backendRefs:
    - name: payments-api # Kubernetes Service receiving traffic
      port: 8080 # Service port for the API backend
◈ Architecture Diagram
┌──────────┐
│GatewayCls│
└────┬─────┘
     ↓
┌──────────┐
│ Gateway  │
└────┬─────┘
     ↓
┌──────────┐
│HTTPRoute │
└────┬─────┘
     ↓
┌──────────┐
│ Service  │
└──────────┘
Quick Answer
The control plane consists of the API Server (REST frontend that validates all requests), etcd (distributed key-value store holding all cluster state), Scheduler (assigns Pods to nodes based on constraints), and Controller Manager (runs reconciliation loops that drive actual state toward desired state).
Detailed Answer
Think of the control plane like the management floor of a large warehouse. The API Server is the front desk receptionist who handles every single request coming in — whether it's a customer placing an order, a supervisor checking inventory, or a new employee asking for directions. Every interaction with the warehouse goes through this one desk, no exceptions. etcd is the filing cabinet behind the desk that holds the single source of truth: every order, every employee record, every inventory count. The Scheduler is the floor manager who decides which aisle worker handles which incoming package based on who has capacity. The Controller Manager is the quality inspector who constantly walks the floor comparing what should be happening (orders to fulfill) with what is actually happening (packages on shelves), and files corrective actions when they don't match. The API Server (kube-apiserver) is the only component that talks directly to etcd. Every kubectl command, every internal component communication, and every webhook goes through the API Server as HTTPS REST calls. It performs authentication (who are you?), authorization via RBAC (are you allowed to do this?), admission control (should this request be modified or rejected?), and validation (is this YAML well-formed?) before persisting anything to etcd. It also serves the watch API, which lets other components subscribe to changes in real-time rather than polling — this is how the entire system stays reactive. etcd is a distributed, strongly-consistent key-value store built on the Raft consensus algorithm. It stores every object in the cluster: every Pod spec, every Service definition, every Secret, every ConfigMap. In production, etcd runs as a 3 or 5 node cluster (always odd numbers for quorum) and is often the first component to cause cluster-wide outages when it becomes unhealthy. 
etcd performance directly determines API Server response time — slow disk I/O on etcd nodes is the number one silent killer of Kubernetes clusters. Production teams typically dedicate SSD-backed nodes exclusively for etcd and monitor fsync latency religiously. The Scheduler (kube-scheduler) watches the API Server for newly created Pods that have no node assigned (spec.nodeName is empty). For each unscheduled Pod, it runs a two-phase algorithm: filtering (eliminate nodes that don't meet hard requirements like resource requests, nodeSelector, taints/tolerations, and affinity rules) and scoring (rank remaining nodes by soft preferences like spreading Pods across failure domains, preferring nodes with the image already cached, or balancing resource utilization). The highest-scoring node wins, and the Scheduler writes the node assignment back to the API Server. The Controller Manager (kube-controller-manager) is actually dozens of separate control loops compiled into a single binary for simplicity. Each controller watches a specific resource type and reconciles actual state with desired state. The ReplicaSet controller ensures the right number of Pods exist. The Deployment controller manages ReplicaSets during rollouts. The Node controller detects when nodes go offline. The Endpoint controller populates Service endpoints. The Job controller manages batch workloads. If the Controller Manager crashes, no reconciliation happens — Pods keep running but nothing self-heals until it's back. A critical production gotcha: many teams monitor Pod health but forget to monitor control plane health. If the API Server is overloaded (common with too many custom controllers or misconfigured HPA polling intervals), the entire cluster becomes unresponsive — you can't deploy, can't scale, can't even see what's broken. 
Production clusters should have dedicated monitoring for API Server request latency (apiserver_request_duration_seconds), etcd fsync duration (etcd_disk_wal_fsync_duration_seconds), and scheduler queue depth (scheduler_pending_pods).
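The "3 or 5 members, always odd" guidance for etcd follows directly from majority-quorum arithmetic, which a short Python sketch makes visible. The function names are illustrative; the formula is the standard majority rule used by Raft-style consensus.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with this many members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority remains."""
    return members - quorum(members)


print("members  quorum  tolerated failures")
for n in range(1, 8):
    print(f"{n:>7}  {quorum(n):>6}  {fault_tolerance(n):>18}")
```

Going from 3 to 4 members raises the quorum without tolerating any extra failure, so an even-sized cluster adds write cost and risk with no availability gain.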
Code Example
# Check control plane component health
# (componentstatuses is deprecated since v1.19; prefer the API Server health endpoints)
kubectl get componentstatuses
kubectl get --raw='/healthz?verbose'
# View control plane Pods (self-hosted clusters like kubeadm)
kubectl get pods -n kube-system -l tier=control-plane
# NAME                                READY   STATUS
# etcd-master-01                      1/1     Running
# kube-apiserver-master-01            1/1     Running
# kube-controller-manager-master-01   1/1     Running
# kube-scheduler-master-01            1/1     Running
# Check API Server response time (latency issues?)
kubectl get --raw='/metrics' | grep apiserver_request_duration
# Verify etcd cluster health
kubectl exec -n kube-system etcd-master-01 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# Check scheduler is making decisions
kubectl get events --field-selector reason=Scheduled -A
# View controller-manager logs for reconciliation errors
kubectl logs -n kube-system kube-controller-manager-master-01 \
  --tail=50 | grep -i error
# Monitor API Server audit logs for troubleshooting
# (configured via --audit-policy-file on API Server)
kubectl logs -n kube-system kube-apiserver-master-01 \
  | grep payments-api
◈ Architecture Diagram
┌─────────────── Control Plane ──────────────────────────┐
│                                                        │
│   ┌────────────────────────────────────────┐           │
│   │             kube-apiserver             │           │
│   │    REST frontend + auth + admission    │           │
│   └──────────┬────────────┬────────────────┘           │
│              │            │                            │
│        watch │            │ read/write                 │
│              │            ▼                            │
│              │     ┌──────────────┐                    │
│              │     │     etcd     │                    │
│              │     │  key-value   │                    │
│              │     │ cluster state│                    │
│              │     └──────────────┘                    │
│              │                                         │
│     ┌────────┴──────────────────────┐                  │
│     │                               │                  │
│     ▼                               ▼                  │
│ ┌──────────────┐  ┌──────────────────────────┐         │
│ │kube-scheduler│  │ kube-controller-manager  │         │
│ │              │  │                          │         │
│ │ filter →     │  │  ReplicaSet controller   │         │
│ │ score →      │  │  Deployment controller   │         │
│ │ bind Pod to  │  │  Node controller         │         │
│ │ best node    │  │  Endpoint controller     │         │
│ └──────────────┘  └──────────────────────────┘         │
│                                                        │
└────────────────────────────────────────────────────────┘
                            │
                            │ API Server communicates with
                            ▼
              ┌─────── Worker Nodes ──────┐
              │   kubelet   │  kube-proxy │
              └───────────────────────────┘
Quick Answer
Each worker node runs kubelet (agent that starts and monitors Pods), kube-proxy (manages networking rules for Service routing), and a container runtime (containerd or CRI-O that actually runs containers). The kubelet communicates with the API Server via HTTPS to receive Pod assignments and report node status.
Detailed Answer
Think of a worker node like a restaurant kitchen station. The kubelet is the line cook who receives tickets (Pod specs) from the head chef (API Server) and actually prepares the dishes (starts containers). The container runtime (containerd or CRI-O) is the set of pots, pans, and ovens that the cook uses: the actual tools that do the cooking. And kube-proxy is the waiter who knows which table ordered which dish, making sure the right plate gets to the right customer via the correct route. Without any one of these three, the kitchen cannot function.
The kubelet is the primary node agent. It registers the node with the API Server, then watches (via the API Server's watch mechanism) for PodSpecs assigned to its node. When a new Pod is scheduled to its node, the kubelet calls the container runtime through the Container Runtime Interface (CRI) to pull images and start containers. It then continuously monitors container health by executing liveness and readiness probes at configured intervals. If a liveness probe fails, the kubelet restarts the container. It reports Pod status and node conditions (memory pressure, disk pressure, PID pressure, network unavailable) back to the API Server, sending a heartbeat about every 10 seconds by default (the node-status-update-frequency). If the node controller in the Controller Manager sees no heartbeat for the node-monitor-grace-period (historically 40 seconds by default; check the default for your version), the node is marked NotReady.
The container runtime is the software that actually creates and runs containers using Linux kernel features like namespaces (isolation) and cgroups (resource limits). Kubernetes removed its built-in Docker integration (dockershim) in v1.24, so it now requires a CRI-compatible runtime. The two production choices are containerd (lightweight, used by EKS, GKE, and most managed platforms) and CRI-O (purpose-built for Kubernetes, used by OpenShift). The kubelet communicates with the runtime via a Unix socket using the gRPC-based CRI protocol. The runtime handles image pulling, container lifecycle, and log management.
kube-proxy runs on every node and implements the Service abstraction. When you create a Service, kube-proxy watches the API Server for Service and Endpoint objects, then programs the node's networking stack to route traffic correctly. In the default iptables mode, it creates iptables rules that perform DNAT (destination NAT) to translate the virtual Service IP to a real Pod IP, with random selection for load balancing. In IPVS mode (better for clusters with thousands of Services), it uses the Linux kernel's IPVS load balancer, which supports multiple algorithms (round-robin, least-connections, source-hash). In these modes kube-proxy does NOT proxy traffic through itself. It only configures networking rules, and actual packets flow directly from source to destination Pod.
For normal operation, the kubelet initiates connections to the API Server rather than the reverse. This suits worker nodes on less trusted networks, because the API Server does not need to push state to them: the kubelet holds a persistent watch connection and receives updates as they happen (Pod scheduled, Pod deleted, ConfigMap changed) without polling. The exceptions are features like `kubectl exec`, `kubectl logs`, and port-forwarding, where the API Server connects to the kubelet's HTTPS endpoint (port 10250), which is why the kubelet has its own TLS certificate.
A common production gotcha: if the container runtime's socket becomes unresponsive (containerd hung, disk full preventing image pulls), the kubelet cannot start new Pods or report accurate status. The node might show Ready because the kubelet process itself is fine, but Pods scheduled there will be stuck in ContainerCreating forever. Monitoring containerd/CRI-O process health separately from kubelet health is essential for catching this early.
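Returning to the "random selection" that iptables mode uses for load balancing: rules are evaluated in order, so a uniform split across N backends is typically built from a chain of probabilistic rules with decreasing odds (1/N, then 1/(N-1), and so on, with the last rule always matching). The Python sketch below checks that arithmetic with exact fractions. It models the idea only and is an assumption about typical rule layout, not a dump of real kube-proxy rules.

```python
from fractions import Fraction
from typing import List


def rule_probabilities(backends: int) -> List[Fraction]:
    """Match probability written on each rule: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, backends - i) for i in range(backends)]


def effective_probabilities(backends: int) -> List[Fraction]:
    """Chance that a connection actually ends up on each backend."""
    remaining = Fraction(1)  # probability that no earlier rule matched
    result = []
    for p in rule_probabilities(backends):
        result.append(remaining * p)
        remaining *= 1 - p
    return result


for n in (2, 3, 5):
    rules = ", ".join(str(p) for p in rule_probabilities(n))
    effective = ", ".join(str(p) for p in effective_probabilities(n))
    print(f"{n} backends: rule odds [{rules}] -> effective [{effective}]")
```

Each backend ends up with the same overall share even though the per-rule odds differ, because later rules only see the traffic that earlier rules passed over.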
Code Example
# Check node status and conditions
kubectl get nodes -o wide
kubectl describe node worker-node-01
# Look for: Conditions section (MemoryPressure, DiskPressure, PIDPressure)
# View kubelet logs on the node (SSH required)
sudo journalctl -u kubelet --since "10 minutes ago" | tail -50
# Check kubelet's view of Pods on this node
kubectl get pods --field-selector spec.nodeName=worker-node-01 -A
# Verify container runtime is healthy
sudo crictl info   # Runtime status
sudo crictl ps     # Running containers
sudo crictl pods   # Running Pod sandboxes
# Check kube-proxy is programming iptables rules
sudo iptables -t nat -L KUBE-SERVICES | head -20
# View kube-proxy mode and configuration
kubectl get configmap kube-proxy -n kube-system -o yaml
# Check that the API Server can reach the kubelet (proxies to the kubelet's healthz endpoint)
kubectl get --raw='/api/v1/nodes/worker-node-01/proxy/healthz'
# Debug networking by checking kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy \
  --tail=30
# Check node resource capacity vs allocatable
kubectl describe node worker-node-01 | grep -A5 'Capacity\|Allocatable'
# Capacity:
#   cpu: 8
#   memory: 32Gi
# Allocatable:   ← what's available for Pods (after system reserved)
#   cpu: 7600m
#   memory: 30Gi
◈ Architecture Diagram
┌────────────── Worker Node ──────────────────────────────┐
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │                     kubelet                     │    │
│  │  • Registers node with API Server               │    │
│  │  • Watches for Pod assignments                  │    │
│  │  • Executes liveness/readiness probes           │    │
│  │  • Reports node status every 10s                │    │
│  └────────┬──────────────────────┬─────────────────┘    │
│           │ CRI (gRPC)           │ HTTPS                │
│           ▼                      │                      │
│  ┌──────────────────┐            │                      │
│  │ Container Runtime│            │                      │
│  │  (containerd)    │            │                      │
│  │                  │            │                      │
│  │ ┌────┐ ┌────┐    │            │                      │
│  │ │Pod │ │Pod │    │            │                      │
│  │ │ A  │ │ B  │    │            │                      │
│  │ └────┘ └────┘    │            │                      │
│  └──────────────────┘            │                      │
│                                  │                      │
│  ┌─────────────────┐             │                      │
│  │   kube-proxy    │             │                      │
│  │                 │             │                      │
│  │  iptables/IPVS  │             │                      │
│  │  rules for      │             │                      │
│  │ Service routing │             │                      │
│  └────────┬────────┘             │                      │
│           │                      │                      │
└───────────┼──────────────────────┼──────────────────────┘
            │                      │
            │ watch Services       │ watch PodSpecs
            │ & Endpoints          │ report status
            ▼                      ▼
      ┌──────────────────────────────────┐
      │          kube-apiserver          │
      │         (Control Plane)          │
      └──────────────────────────────────┘
Quick Answer
The control plane is the brain of the cluster: the API Server is the single point of communication, etcd stores all cluster state, the Scheduler assigns Pods to nodes, and the Controller Manager runs reconciliation loops that maintain desired state. On worker nodes, the kubelet manages Pods and kube-proxy handles networking.
Detailed Answer
Think of a Kubernetes cluster like an airport. The control plane is the airport operations center — the people and systems that coordinate everything. The API server is the main radio tower: every communication between pilots (kubectl), ground crew (kubelets), and air traffic control (controllers) goes through it. Nobody talks directly to anyone else. Etcd is the flight database — every flight plan, gate assignment, and schedule is recorded there, and if this database goes down, the airport can't function. The Scheduler is the gate assignment officer who decides which arriving plane goes to which gate based on gate size, availability, and terminal capacity. The Controller Manager is the operations team that constantly walks the airport comparing the schedule to reality: 'Gate 3 should have a plane — it doesn't — redirect one there.' The API Server (kube-apiserver) is the only component that talks to etcd. When you run `kubectl get pods`, kubectl sends an HTTPS request to the API server, which authenticates you, checks your RBAC permissions, retrieves the data from etcd, and returns it. When you create a Deployment, the API server validates the manifest, stores it in etcd, and notifies watching controllers via its built-in watch mechanism. Every component in the cluster — scheduler, controllers, kubelets — communicates exclusively through the API server. Etcd is a distributed key-value store that holds the entire state of the cluster: every Pod, Service, Secret, ConfigMap, and node registration. It uses the Raft consensus algorithm to maintain consistency across multiple replicas (production clusters run 3 or 5 etcd members). Etcd is the most critical component — if etcd data is lost and there's no backup, the cluster is unrecoverable. This is why etcd backup is a non-negotiable operational requirement. The Scheduler (kube-scheduler) watches for newly created Pods that have no node assigned. 
For each unscheduled Pod, it runs a two-phase process: filtering (which nodes CAN run this Pod — enough CPU? right architecture? matching tolerations?) and scoring (which node is BEST — least loaded? closest to existing Pods? has the image cached?). The winning node is written to the Pod's spec.nodeName field, and the kubelet on that node picks it up. The Controller Manager (kube-controller-manager) runs dozens of control loops, each responsible for one type of resource. The Deployment controller watches Deployments and manages ReplicaSets. The ReplicaSet controller watches ReplicaSets and manages Pods. The Node controller monitors node heartbeats and marks nodes as unhealthy. The Endpoints controller updates Service endpoints when Pods change. Each controller follows the same pattern: observe current state → compare to desired state → take action to converge. This reconciliation loop is the fundamental operating principle of Kubernetes. On each worker node, the kubelet is the agent that actually runs Pods. It watches the API server for Pods assigned to its node, pulls container images, starts containers via the container runtime (containerd), and reports status back. Kube-proxy runs on every node and maintains network rules (iptables or IPVS) that implement Service routing. A common misconception is that kube-proxy proxies traffic — in iptables mode, it doesn't. It just programs the kernel's packet filtering rules and gets out of the way.
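The observe, compare, act pattern described above can be sketched in a few lines of Python. This is a toy model, not how kube-controller-manager is implemented: `ReplicaSetState` and `reconcile` are hypothetical names introduced only for illustration, and real controllers react to watch events from the API server rather than mutating a local list.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaSetState:
    """Hypothetical stand-in for a ReplicaSet: desired count plus observed Pods."""
    desired_replicas: int
    running_pods: list = field(default_factory=list)
    next_id: int = 0  # counter used to give each new Pod a unique name


def reconcile(state):
    """Run one reconciliation pass and return the actions that were taken."""
    actions = []
    # Too few Pods: create until the observed count matches the desired count.
    while len(state.running_pods) < state.desired_replicas:
        name = f"pod-{state.next_id}"
        state.next_id += 1
        state.running_pods.append(name)
        actions.append(f"create {name}")
    # Too many Pods: delete the newest until the counts match.
    while len(state.running_pods) > state.desired_replicas:
        name = state.running_pods.pop()
        actions.append(f"delete {name}")
    return actions


state = ReplicaSetState(desired_replicas=3)
print(reconcile(state))             # ['create pod-0', 'create pod-1', 'create pod-2']
state.running_pods.remove("pod-1")  # simulate a Pod crashing
print(reconcile(state))             # ['create pod-3']
print(reconcile(state))             # [] (already converged, nothing to do)
```

The last call illustrates why the loop is safe to run repeatedly: once observed state equals desired state, a pass produces no actions.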
Code Example
# Check control plane component health
# (componentstatuses has been deprecated since Kubernetes v1.19; the API server's /readyz endpoint is the supported check)
kubectl get componentstatuses
kubectl get --raw='/readyz?verbose'
# View all control plane Pods (on kubeadm clusters they run as static Pods on control plane nodes)
kubectl get pods -n kube-system
# Check API server endpoint
kubectl cluster-info
# Kubernetes control plane is running at https://10.0.0.1:6443
# View etcd members (requires access to a control plane node)
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Take an etcd backup (CRITICAL for disaster recovery)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Check node status and kubelet version
kubectl get nodes -o wide
# View kubelet logs on a node (SSH required)
journalctl -u kubelet -f
# Check kube-proxy mode (iptables vs IPVS)
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
◈ Architecture Diagram
┌──────────── Control Plane ──────────────┐
│                                         │
│  ┌──────────┐   watches   ┌──────────┐  │
│  │Scheduler │◄───────────►│   API    │  │
│  │          │             │  Server  │  │
│  └──────────┘      ┌─────►│          │  │
│                    │      │ only     │  │
│  ┌──────────┐      │      │ component│  │
│  │Controller│──────┘      │ that     │  │
│  │ Manager  │   watches   │ talks to │  │
│  └──────────┘             │ etcd     │  │
│                           └────┬─────┘  │
│                                │        │
│                          ┌─────▼─────┐  │
│                          │   etcd    │  │
│                          │ (cluster  │  │
│                          │  state)   │  │
│                          └───────────┘  │
└─────────────────────────────────────────┘
                    │
               API server
             watches/updates
                    │
     ┌──────────────┼──────────────┐
     │              │              │
     ▼              ▼              ▼
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Node 1   │   │ Node 2   │   │ Node 3   │
│          │   │          │   │          │
│ kubelet  │   │ kubelet  │   │ kubelet  │
│ kube-    │   │ kube-    │   │ kube-    │
│ proxy    │   │ proxy    │   │ proxy    │
│          │   │          │   │          │
│ Pod Pod  │   │ Pod Pod  │   │ Pod Pod  │
└──────────┘   └──────────┘   └──────────┘
Quick Answer
Check the Ingress resource rules (host, path, backend service/port), verify the controller is running and has processed the Ingress, confirm the backend Service has healthy endpoints, validate TLS secrets exist, and inspect controller logs for routing errors. Work from outside in: DNS, load balancer, Ingress controller, Service, Endpoints, Pods.
Detailed Answer
Think of a mall directory kiosk. Visitors look up the store name (host), follow the floor and section (path), and expect to reach the store (backend). If the directory has a typo, the store moved, the hallway is blocked, or the store is closed, visitors cannot get there. Ingress troubleshooting follows the same logic: verify every link in the chain from the external request to the running pod. An Ingress resource is a routing declaration, not a router itself. It tells an Ingress controller (NGINX, ALB, Traefik, Istio Gateway) how to route external traffic based on hostname and URL path. The controller watches Ingress objects, updates its routing table (nginx.conf, ALB rules, envoy routes), and directs traffic to the backend Service and port specified. If any part of this chain is misconfigured, traffic either returns 404, 503, or times out with no obvious error. The troubleshooting sequence starts at the DNS and load balancer layer. Verify that the domain resolves to the correct IP or ALB hostname. Check whether the load balancer health checks pass for the Ingress controller pods. Then inspect the Ingress resource: does the host field match the exact domain being requested? Does the path match the URL pattern (prefix vs exact)? Is the backend service name and port correct? A common mistake is specifying the container port instead of the Service port, or using a service name that does not exist in the same namespace. Next, check the Ingress controller itself. Is the controller pod running and ready? Has it synced the Ingress resource (describe the Ingress and look for events or Address field population)? Check controller logs for configuration reload errors, upstream connection failures, or TLS certificate problems. For NGINX Ingress Controller, the logs show every routing decision and upstream selection. 
An empty Address field on the Ingress means the controller has not processed it, often because the ingressClassName does not match or the controller is filtering by namespace. Behind the Ingress, the Service must have healthy endpoints. Run kubectl get endpoints to confirm the Service has pod IPs listed. Empty endpoints mean either no pods match the Service selector, pods exist but fail readiness probes, or the selector labels are mismatched. Even with endpoints populated, the target port must match what the container listens on. A Service targeting port 8080 when the app listens on 3000 will show connection refused in controller logs. The non-obvious gotcha is TLS misconfiguration. If the Ingress specifies a TLS section but the Secret does not exist, contains an expired certificate, or has a subject name that does not match the host field, some controllers serve a default fake certificate while others reject the connection entirely. Another common issue is path precedence: if you have both /api and /api/v2 paths, the ordering and pathType (Prefix vs Exact vs ImplementationSpecific) determine which rule matches. Some controllers require a trailing slash; others do not.
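The path-precedence point can be made concrete with a small Python model of the documented `Exact` and `Prefix` pathType semantics: Prefix matches element by element on `/`-separated segments, the longest matching path wins, and Exact beats Prefix on a tie. This is a simplified sketch; `matches` and `select_backend` are hypothetical helpers, `ImplementationSpecific` is deliberately left out because its behavior depends on the controller, and the backend names are made up for the example.

```python
def _elements(path):
    """Split a URL path into its non-empty '/'-separated elements."""
    return [part for part in path.split("/") if part]


def matches(rule_path, path_type, request_path):
    """Return True if request_path matches the rule under the given pathType."""
    if path_type == "Exact":
        # Exact is a case-sensitive, character-for-character comparison.
        return request_path == rule_path
    if path_type == "Prefix":
        # Prefix compares whole path elements, so /api does not match /apiv2.
        rule, request = _elements(rule_path), _elements(request_path)
        return request[:len(rule)] == rule
    raise ValueError(f"unsupported pathType in this sketch: {path_type}")


def select_backend(rules, request_path):
    """Pick the backend of the longest matching rule; Exact wins a length tie."""
    candidates = [r for r in rules if matches(r[0], r[1], request_path)]
    if not candidates:
        return None
    best = max(candidates, key=lambda r: (len(r[0]), r[1] == "Exact"))
    return best[2]


# (path, pathType, backend) tuples; the backend names are illustrative only.
rules = [
    ("/api", "Prefix", "payments-api"),
    ("/api/v2", "Prefix", "payments-api-v2"),
]
print(select_backend(rules, "/api/v2/transfers"))  # payments-api-v2
print(select_backend(rules, "/api/transfers"))     # payments-api
print(select_backend(rules, "/apiv2"))             # None
```

Running a request path through a model like this before digging into controller logs is a quick way to tell a rule-ordering problem apart from a backend problem.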
Code Example
# Inspect the Ingress resource for host, path, backend, and TLS configuration
kubectl describe ingress payments-api -n payments
# Check if the Ingress has an Address assigned (empty = controller has not processed it)
kubectl get ingress payments-api -n payments
# Verify the backend Service exists and has the correct port
kubectl get svc payments-api -n payments
# Check if the Service has endpoints (empty = no ready pods match the selector)
kubectl get endpoints payments-api -n payments
# Verify pod readiness — unready pods are excluded from endpoints
kubectl get pods -n payments -l app=payments-api -o wide
# Check the Ingress controller logs for routing errors or upstream failures
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100 | grep payments
# Verify the TLS secret exists and is not expired
kubectl get secret payments-tls -n payments
kubectl get secret payments-tls -n payments -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -dates
# Test connectivity from inside the cluster to the Service directly (bypass Ingress)
kubectl exec -n payments payments-api-7f8d9c-x4k -- curl -s http://payments-api.payments.svc:8080/health
◈ Architecture Diagram
┌──────────┐
│  Client  │
└────┬─────┘
     ↓ DNS
┌──────────┐
│    LB    │
└────┬─────┘
     ↓ health check
┌────────────────────┐
│ Ingress Controller │
│ (nginx/alb)        │
└────┬───────────────┘
     ↓ host + path match
┌──────────┐
│ Service  │
└────┬─────┘
     ↓ endpoints
┌──────────┐
│   Pods   │
│ (ready?) │
└──────────┘
Quick Answer
Three control plane nodes provide high availability through etcd's Raft consensus, which requires a majority quorum. With 3 members, quorum is 2 — so the cluster survives one node failure. With 2 members, losing one loses quorum.
Detailed Answer
Think of it like a committee that makes decisions by majority vote. With 3 committee members, 2 must agree (a majority) to pass any decision. If one member is sick, the remaining 2 can still vote and make decisions. With only 2 members, one absence leaves 1 of 2, which is not a majority, so no decisions can be made and everything stops.

The reason is etcd, the distributed key-value store that holds all Kubernetes cluster state. Etcd uses the Raft consensus algorithm, which requires a strict majority of members to agree on any write. This majority is called quorum, and for n members it is floor(n/2) + 1. For 3 members, quorum = 2 (you can lose 1). For 5 members, quorum = 3 (you can lose 2). For 2 members, quorum = 2 (you can lose 0, which makes 2 members WORSE than 1 for availability, because there are now two machines whose failure halts the cluster).

When one control plane node fails in a 3-node setup, etcd continues operating because 2 of 3 members still form quorum. The API servers on the remaining 2 nodes handle all requests (the load balancer in front of them routes around the failed node). The scheduler and controller manager use leader election: one instance is active and the others are on standby. If the active leader was on the failed node, a standby takes over once the leader's lease expires, typically within seconds. From the user's perspective, kubectl commands might have a brief hiccup (~5-10 seconds) during leader re-election, but the cluster continues operating normally.

However, losing TWO of three control plane nodes is catastrophic: etcd loses quorum (only 1 of 3 remaining), and all writes fail. The API server cannot process any creates, updates, or deletes, and because it reads from etcd with quorum (linearizable) reads by default, many reads fail as well; at best it can answer some requests from its watch cache. Existing workloads on worker nodes keep running (the kubelet continues managing its pods independently), but you cannot deploy anything new, scale, or reschedule pods from failed nodes. The cluster is in a degraded, at best read-only state until quorum is restored.

Why not 5 or 7 control plane nodes? Each etcd write must be acknowledged by a majority before it's committed. 
More members means more network round trips and higher write latency. For most clusters, the trade-off of 3 nodes (survive 1 failure, fast writes) is optimal. Large enterprise clusters sometimes use 5 nodes for extra resilience, but 7+ is almost never justified because the write performance penalty outweighs the marginal availability gain.
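The quorum arithmetic above follows from a single formula, quorum = floor(n/2) + 1. A minimal Python sketch (the function names are chosen here for illustration) reproduces the numbers quoted in the answer and shows why even member counts add no fault tolerance:

```python
def quorum(members):
    """Smallest strict majority of an etcd cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members):
    """How many members can fail while the rest still form quorum."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5, 7):
    print(f"{n} members: quorum = {quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")

# 1 members: quorum = 1, tolerates 0 failure(s)
# 2 members: quorum = 2, tolerates 0 failure(s)
# 3 members: quorum = 2, tolerates 1 failure(s)
# 4 members: quorum = 3, tolerates 1 failure(s)
# 5 members: quorum = 3, tolerates 2 failure(s)
# 7 members: quorum = 4, tolerates 3 failure(s)
```

Going from 3 to 4 members raises quorum without raising the number of tolerated failures, which is why odd cluster sizes are the norm.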
Code Example
# Check etcd member health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://master-0:2379,https://master-1:2379,https://master-2:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Illustrative output (timings will differ):
# https://master-0:2379 is healthy: successfully committed proposal: took = 9.1ms
# https://master-1:2379 is healthy: successfully committed proposal: took = 9.8ms
# https://master-2:2379 is healthy: successfully committed proposal: took = 10.2ms
# Check etcd member list (pass the same --endpoints and TLS flags as above)
ETCDCTL_API=3 etcdctl member list --write-out=table
# Check which controller-manager and scheduler instance is the leader
# (current Kubernetes versions record leader election in Lease objects; see spec.holderIdentity)
kubectl get lease kube-scheduler -n kube-system -o yaml
kubectl get lease kube-controller-manager -n kube-system -o yaml
# Check control plane node status
kubectl get nodes -l node-role.kubernetes.io/control-plane
# NAME       STATUS     ROLES           AGE
# master-0   Ready      control-plane   365d
# master-1   Ready      control-plane   365d
# master-2   NotReady   control-plane   365d   ← failed
# Backup etcd (CRITICAL: do this before any maintenance)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
◈ Architecture Diagram
3-Node Control Plane:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│          │ │          │ │          │
│ API Srvr │ │ API Srvr │ │ API Srvr │
│ etcd     │ │ etcd     │ │ etcd     │
│ Sched    │ │ Sched    │ │ Sched    │
│ CtrlMgr  │ │ CtrlMgr  │ │ CtrlMgr  │
└──────────┘ └──────────┘ └──────────┘
  leader ★     standby      standby

Quorum math:
3 members → quorum = 2 → tolerate 1 failure ✓
5 members → quorum = 3 → tolerate 2 failures ✓
2 members → quorum = 2 → tolerate 0 failures ✗

Master-2 fails:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│ etcd ✓   │ │ etcd ✓   │ │ etcd ✗   │
│ leader ★ │ │ standby  │ │ DOWN     │
└──────────┘ └──────────┘ └──────────┘
2/3 = quorum ✓ → cluster operational

Master-0 ALSO fails:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│ etcd ✗   │ │ etcd ✓   │ │ etcd ✗   │
│ DOWN     │ │ alone!   │ │ DOWN     │
└──────────┘ └──────────┘ └──────────┘
1/3 = NO quorum ✗ → read-only mode
Quick Answer
Multi-stage builds keep build tools out of runtime images. Cache ordering speeds up builds. Pinning base images helps reproducibility, but you must rebuild often and scan for vulnerabilities so old layers don't hide security holes.
Detailed Answer
Think of a restaurant kitchen that uses mixers, cutting boards, and raw ingredients to make a meal, but only sends the finished plate to the customer. A bad container image ships the entire kitchen to the table. A good multi-stage Docker build uses one stage as the kitchen and a second stage as the clean plate. The final image has only what the app needs to run, which cuts size, attack surface, and surprises in production. Docker's build best practices call for multi-stage builds, smart base image choices, a solid .dockerignore file, skipping unnecessary packages, using the build cache wisely, and rebuilding images regularly. The real concern is not just size. Every extra package, shell, compiler, credential, or leftover file in the final image gives attackers something to inspect or exploit. Smaller runtime images are easier to scan, faster to transfer, quicker to start, and simpler to reason about when things go wrong. The build process runs Dockerfile instructions in order and can reuse cached layers when an instruction and its inputs have not changed. This is why you copy stable dependency files before frequently changing source code. With BuildKit, teams can also use cache mounts and secret mounts so dependency downloads go faster and credentials never become image layers. Multi-stage builds then copy only selected files from builder stages into the final runtime stage. In production pipelines, engineers pin base images by digest for repeatable builds, scan images for known vulnerabilities, generate SBOMs (software bills of materials), sign artifacts, and rebuild even when app code has not changed. Rebuilds matter because pinning a tag or digest freezes the base layer, while security patches land in newer versions. Good teams track image size, critical CVE counts, startup time, pull latency, and rollback success rates as key metrics. The tricky part is that reproducibility and freshness pull in opposite directions. Pinning python:3.12-slim@sha256:... 
makes builds predictable, but it also locks in vulnerabilities until someone bumps the digest. Floating tags pick up patches automatically, but they can change under you and create builds you cannot reproduce. Senior engineers solve this with automated dependency-update PRs, signed digests, scheduled CI rebuilds, and policy gates. The goal is to make image supply-chain safety boring and routine rather than heroic and manual.
Code Example
# Build with a freshly pulled base image and a traceable release tag
docker buildx build --pull --tag registry.internal/payments-api:2026.06.18 --file Dockerfile .
# Inspect the final layers to confirm build tools and secrets were not copied into the runtime image
docker history registry.internal/payments-api:2026.06.18
# Scan the image for known vulnerabilities before promotion
docker scout cves registry.internal/payments-api:2026.06.18
# Publish the reviewed image to the internal registry
docker push registry.internal/payments-api:2026.06.18
# Capture the immutable digest for deployment manifests (RepoDigests is populated only after a push or pull)
docker image inspect registry.internal/payments-api:2026.06.18 --format '{{json .RepoDigests}}'
◈ Architecture Diagram
┌──────────┐
│  Source  │
└────┬─────┘
     ↓
┌──────────┐
│ Builder  │
└────┬─────┘
     ↓ copy
┌──────────┐
│ Runtime  │
└────┬─────┘
     ↓ scan
┌──────────┐
│ Registry │
└────┬─────┘
     ↓
┌──────────┐
│  Deploy  │
└──────────┘
Quick Answer
Multi-stage builds use multiple FROM lines to separate build tools from runtime artifacts, so the final image has only the compiled binary and minimal OS libraries. Ordering dependency installs before source code copies maximizes cache hits and avoids full rebuilds when only app code changes.
Detailed Answer
Think of a woodworking shop. You need saws, clamps, sandpaper, and a workbench to build a cabinet, but the customer only gets the finished cabinet. They do not take home the saw. A multi-stage Docker build works the same way: one stage has all the build tools, and the final stage holds only the finished product. In Docker, a multi-stage build uses multiple FROM instructions in a single Dockerfile. Each FROM begins a new stage with its own base image and filesystem. Intermediate stages can install compilers, download dependencies, run tests, and produce artifacts. The final stage starts from a tiny base like distroless or alpine and copies only the compiled binaries or bundled assets from earlier stages using COPY --from. This means the production image never contains gcc, npm, pip, or any build toolchain, eliminating hundreds of megabytes and thousands of CVE-carrying packages from the runtime image. Under the hood, Docker and BuildKit process each stage as an independent node in the build graph. BuildKit can run independent stages in parallel, which is a major speed advantage over the legacy builder. When the Dockerfile is ordered correctly (base image first, dependency manifest copy second, dependency install third, source code copy fourth, build fifth), BuildKit reuses cached layers for everything up to the point where content changes. Since dependency manifests like package.json or go.sum change far less often than source code, this ordering means most CI builds only rebuild the final compilation step instead of re-downloading all dependencies. At production scale, teams running 50 or more microservices through CI see dramatic results. A payments-api image that was 1.2 GB with a single-stage node build drops to 85 MB with a multi-stage build using distroless as the final base. CI time drops from 8 minutes to 2 minutes because dependency layers are cached. 
Security scanners report 90 percent fewer vulnerabilities because the final image has no compilers, shells, or package managers. Teams should also use .dockerignore to exclude test fixtures, documentation, and local configs from the build context, which prevents unnecessary cache busting and reduces context transfer time. The tricky gotcha is that COPY --from accepts a numeric stage index (--from=0, --from=1) as well as a name, and index references break silently when someone inserts a new stage above them. Always name stages with AS and reference them by name. Another trap is copying an entire directory from the build stage instead of specific artifacts, which can accidentally include build caches, test output, or sensitive files in the production image. Architects should also know that multi-stage builds do not automatically clean up intermediate layers in CI. BuildKit's garbage collection handles this, but disk pressure on CI runners can still build up if its storage limits are not configured (for example, defaultKeepStorage under builder.gc in the Docker daemon configuration) or if runners never run docker builder prune.
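The effect of instruction ordering on cache reuse can be illustrated with a deliberately simplified Python model: a step is rebuilt if one of its inputs changed, and so is every step after it. The real BuildKit cache is content-addressed and more nuanced, so treat `rebuilt_steps` and the step lists below as hypothetical teaching aids, not a description of Docker internals.

```python
def rebuilt_steps(steps, changed_inputs):
    """Return the steps that must re-run when the given inputs change.

    `steps` is an ordered list of (instruction, inputs) pairs. The first step
    that reads a changed input invalidates itself and everything after it.
    """
    for index, (_, inputs) in enumerate(steps):
        if changed_inputs & set(inputs):
            return [instruction for instruction, _ in steps[index:]]
    return []  # nothing changed, every layer comes from cache


# Dependency manifests are copied before the source code.
good_order = [
    ("FROM node:22-bookworm", ["base"]),
    ("COPY package.json package-lock.json ./", ["manifests"]),
    ("RUN npm ci", []),
    ("COPY src/ ./src/", ["source"]),
    ("RUN npm run build", []),
]

# Everything is copied in one step before installing dependencies.
bad_order = [
    ("FROM node:22-bookworm", ["base"]),
    ("COPY . .", ["manifests", "source"]),
    ("RUN npm ci", []),
    ("RUN npm run build", []),
]

print(rebuilt_steps(good_order, {"source"}))
# ['COPY src/ ./src/', 'RUN npm run build']
print(rebuilt_steps(bad_order, {"source"}))
# ['COPY . .', 'RUN npm ci', 'RUN npm run build']
```

With the good ordering, a source-only change skips `npm ci` entirely, which is where most of the CI time savings described above come from.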
Code Example
# Dockerfile for payments-api using multi-stage build
# Stage 1: Install dependencies in a full Node image
FROM node:22-bookworm AS deps
# Set the working directory for dependency installation
WORKDIR /build
# Copy only the dependency manifests first to maximize cache hits
COPY package.json package-lock.json ./
# Install production dependencies with exact versions from the lockfile
# (--omit=dev replaces the deprecated --production flag)
RUN npm ci --omit=dev
# Stage 2: Build the application with dev dependencies
FROM node:22-bookworm AS builder
# Set the working directory for the build process
WORKDIR /build
# Copy all dependency manifests for full install including dev deps
COPY package.json package-lock.json ./
# Install all dependencies including TypeScript compiler and test tools
RUN npm ci
# Copy source code after dependencies to preserve layer cache
COPY src/ ./src/
# Copy TypeScript config for compilation
COPY tsconfig.json ./
# Compile TypeScript to JavaScript in the dist directory
RUN npm run build
# Stage 3: Production image with only runtime artifacts
FROM gcr.io/distroless/nodejs22-debian12 AS production
# Set a non-root user for security hardening
USER 1000
# Set the working directory for the application
WORKDIR /app
# Copy only production node_modules from the deps stage
COPY --from=deps /build/node_modules ./node_modules/
# Copy only the compiled JavaScript from the builder stage
COPY --from=builder /build/dist ./dist/
# Expose the API port for documentation and container networking
EXPOSE 8080
# Run the compiled application entry point (the distroless image's entrypoint is node)
CMD ["dist/server.js"]
◈ Architecture Diagram
┌──────────┐
│ deps     │
│ npm ci   │
└────┬─────┘
     │
┌────┴─────┐
│ builder  │
│ compile  │
└────┬─────┘
     │ COPY --from
┌────┴─────┐
│production│
│distroless│
└──────────┘
Quick Answer
Secure Docker images use multi-stage builds to exclude build tools from the final image, run as non-root users with explicit UIDs, start from minimal base images like Distroless or Alpine, pin dependencies to digests, and drop all unnecessary capabilities. This reduces the attack surface from hundreds of exploitable packages to a minimal runtime footprint.
Detailed Answer
Think of building a secure Docker image like constructing a bank vault room. During construction, workers bring in welding equipment, power tools, scaffolding, and raw materials. Once the vault is complete, every construction tool is removed from the room. The vault door is keyed to specific authorized personnel, not a master key. The room contains only what is needed for its purpose: reinforced walls, a locking mechanism, and a ventilation system. If a thief breaks in, they find no tools to use against the vault itself. Multi-stage Docker builds follow the same principle: build tools exist only during construction and never ship to production. A multi-stage Dockerfile separates the build environment from the runtime environment using multiple FROM statements. The first stage installs compilers, package managers, testing frameworks, and build dependencies needed to compile the application. The second stage starts from a minimal base image and copies only the compiled binary or application artifacts from the build stage. For a Java banking application, the build stage might use a full JDK image with Maven, while the runtime stage uses a Distroless Java image that contains only the JRE and no shell, package manager, or system utilities. This dramatically reduces the number of packages that vulnerability scanners flag and eliminates tools that attackers could use for post-exploitation activities like installing malware or pivoting to other services. Running containers as non-root is a fundamental security control that prevents container breakout exploits from gaining host-level root access. The Dockerfile creates a dedicated application user with a specific numeric UID and GID, changes ownership of application files to that user, and switches to that user with the USER directive before the ENTRYPOINT. 
In banking environments, the specific UID matters because it must match file permissions on mounted volumes and satisfy Pod Security Standards that require runAsNonRoot in Kubernetes. Using numeric UIDs instead of usernames avoids dependency on /etc/passwd, which may not exist in Distroless images. The non-root user should have no shell assigned and no home directory beyond what the application needs. Minimal base images are the foundation of attack surface reduction. A standard Ubuntu base image contains over 100 installed packages including shells, text editors, network utilities, and package managers. An Alpine image reduces this to roughly 15 packages. A Google Distroless image contains only the application runtime and its direct dependencies, with no shell at all. For banking applications, Distroless is preferred for production because if an attacker gains code execution inside the container, they cannot open a shell, install tools, or inspect the filesystem interactively. When debugging is needed, teams use ephemeral debug containers through kubectl debug rather than shipping debug tools in production images. The production gotcha that catches many teams is the interaction between read-only root filesystems and application behavior. Many frameworks write temporary files, session data, or compilation caches to the filesystem at runtime. When the root filesystem is read-only, these writes fail and the application crashes. Teams must identify every path the application writes to and mount emptyDir volumes at those paths. Log files should go to stdout and stderr rather than filesystem paths. Another subtle issue is layer ordering in the Dockerfile: placing frequently changing instructions like COPY of application code after rarely changing instructions like dependency installation maximizes build cache utilization and reduces build times from minutes to seconds. 
In regulated environments, every base image must also be scanned and approved through the organization's software supply chain process before it can be used as a FROM source.
Code Example
# Secure multi-stage Dockerfile for payments-api (Spring Boot)
# Stage 1: Build with a JDK 17 image that bundles Maven
# (the plain eclipse-temurin JDK image does not ship mvn)
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /build
# Cache dependencies separately from application code
COPY pom.xml .
RUN mvn dependency:go-offline -B
# Copy source and build, then extract the layered Spring Boot JAR for optimal Docker layers
COPY src/ src/
RUN mvn package -DskipTests -B && \
    java -Djarmode=layertools -jar target/payments-api.jar extract --destination extracted
# Stage 2: Runtime on a minimal Distroless image (no shell, no pkg manager)
FROM gcr.io/distroless/java17-debian12:nonroot
# Labels for audit and compliance tracking
LABEL maintainer="[email protected]" \
      app="payments-api" \
      compliance="sox-pci" \
      base-image="distroless-java17"
WORKDIR /app
# Copy Spring Boot layers in dependency order for cache efficiency
COPY --from=builder /build/extracted/dependencies/ ./
COPY --from=builder /build/extracted/spring-boot-loader/ ./
COPY --from=builder /build/extracted/snapshot-dependencies/ ./
COPY --from=builder /build/extracted/application/ ./
# Run as non-root user (UID 65532 is the nonroot user in Distroless)
USER 65532:65532
# Document the port the application listens on (Kubernetes probes are defined in the Pod spec, not here)
EXPOSE 8080
# JarLauncher lives in this package from Spring Boot 3.2 onward
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", \
            "-Djava.security.egd=file:/dev/./urandom", \
            "org.springframework.boot.loader.launch.JarLauncher"]
# Compare image sizes to prove attack surface reduction
# docker images
# REPOSITORY          TAG      SIZE
# payments-api-full   latest   580MB (JDK + Maven + OS tools)
# payments-api        latest   210MB (Distroless JRE only)
# Verify no shell exists in the production image
# (override the entrypoint; otherwise /bin/sh is passed to java as an argument)
# docker run --rm --entrypoint /bin/sh payments-api
# exec: "/bin/sh": stat /bin/sh: no such file or directory
# Scan the final image for vulnerabilities
trivy image --severity CRITICAL,HIGH ecr.bank.com/payments-api:v2.3.1
◈ Architecture Diagram
┌─────────────────────────────────────────────┐
│              Multi-Stage Build              │
│                                             │
│  Stage 1: Builder                           │
│  ┌────────────────────────────┐             │
│  │ JDK 17 + Maven             │             │
│  │ Source Code                │             │
│  │ Test Frameworks            │ ← DISCARDED │
│  │ Build Tools                │             │
│  │ OS Packages (580MB)        │             │
│  └─────────────┬──────────────┘             │
│                │ COPY --from=builder        │
│                ↓ (JAR only)                 │
│  Stage 2: Runtime                           │
│  ┌────────────────────────────┐             │
│  │ Distroless Java 17         │             │
│  │ payments-api.jar           │ ← SHIPPED   │
│  │ USER 65532 (non-root)      │             │
│  │ No shell, no pkg mgr       │             │
│  │ Read-only rootFS (210MB)   │             │
│  └────────────────────────────┘             │
└─────────────────────────────────────────────┘
Quick Answer
A multi-stage build uses multiple FROM lines in one Dockerfile. You compile code in one stage and copy only the finished artifact into a tiny runtime image. This slashes image size and attack surface.
Detailed Answer
Think of a multi-stage build like a factory assembly line. In a car factory, welding robots, paint booths, and heavy machinery stay on the factory floor. Only the finished car rolls out and into the showroom. A multi-stage Docker build works the same way: all the bulky compilers, build tools, and source code stay in the build stage, while only the lean, finished binary ships in the final image. The key idea is separating what you need to build from what you need to run.

A multi-stage Docker build is a Dockerfile pattern where you write multiple FROM instructions, each starting a fresh stage. The first stage usually pulls a full SDK or compiler image, installs dependencies, compiles source code, runs tests, and produces a deployable file. Later stages start from a tiny base image like alpine or distroless and use COPY --from to grab only the necessary files from earlier stages. The result is a final image that contains nothing except what the app needs to run, with no leftover build tools, package caches, or temporary files.

Under the hood, Docker treats each stage as a separate image build with its own filesystem and layer history. When Docker hits a second FROM instruction, it starts again from that stage's base image while keeping the earlier stages available as sources for COPY --from. The COPY --from=builder command reaches into the filesystem of the named stage and pulls out specific paths. Each stage can use a completely different base image. For example, you might build with golang:1.21 and run with gcr.io/distroless/static-debian12. The build cache works per-stage, so changes to later stages do not force earlier stages to rebuild, which makes development iterations faster.

In production, multi-stage builds are considered a must-have for several reasons. First, they shrink image size dramatically. A Go app built in a standard golang image might weigh 900MB, but the final distroless image with just the static binary can be under 15MB. 
Smaller images pull faster across registries, scale quicker in Kubernetes, cost less to store, and give security scanners far less to audit. Second, multi-stage builds fit cleanly into CI/CD pipelines because the entire build process lives inside the Dockerfile. No external build scripts or Makefiles needed. Third, they enable reproducible builds since every developer and every CI runner uses the exact same build environment defined in that first stage. A common mistake is copying too many files from the build stage. Using COPY --from=builder / /app/ instead of targeting a specific directory can accidentally pull in source code, credential files, or package caches that bloat the image and create security risks. Always copy only the exact artifact you need. Another subtle issue: build arguments (ARG) defined in one stage are not available in later stages. You must redeclare ARG after each FROM if you need the same value. Finally, intermediate stages are not always cleaned up automatically, so running docker image prune regularly is important to free disk space on build servers.
Code Example
# Stage 1: Build the payments-api Go binary
FROM golang:1.21-alpine AS builder
# Set the working directory inside the build container
WORKDIR /src
# Copy go.mod and go.sum first to leverage layer caching
COPY go.mod go.sum ./
# Download dependencies (cached if go.mod/go.sum unchanged)
RUN go mod download
# Copy the entire source code into the build container
COPY . .
# Compile the payments-api binary with CGO disabled for static linking
RUN CGO_ENABLED=0 GOOS=linux go build -o /payments-api ./cmd/server
# Stage 2: Create a minimal production image
FROM gcr.io/distroless/static-debian12
# Copy only the compiled binary from the builder stage
COPY --from=builder /payments-api /payments-api
# Copy the config file needed at runtime
COPY --from=builder /src/config/production.yaml /config/production.yaml
# Expose the port the payments-api listens on
EXPOSE 8080
# Set the entrypoint to run the payments-api binary
ENTRYPOINT ["/payments-api"]
◈ Architecture Diagram
┌─────────────────────────────┐
│ Stage 1: builder │
│ FROM golang:1.21-alpine │
│ │
│ ┌───────────────────────┐ │
│ │ Source Code + go.mod │ │
│ └───────────┬───────────┘ │
│ ↓ │
│ ┌───────────────────────┐ │
│ │ go build → /payments │ │
│ └───────────┬───────────┘ │
└──────────────┼──────────────┘
↓ COPY --from=builder
┌──────────────┼──────────────┐
│ Stage 2: runtime │
│ FROM distroless │
│ ↓ │
│ ┌───────────────────────┐ │
│ │ /payments (binary) │ │
│ │ /config/prod.yaml │ │
│ └───────────────────────┘ │
│ │
│ EXPOSE 8080 │
└─────────────────────────────┘
Quick Answer
Multi-window burn rate alerting fires when the error rate burns through the error budget faster than expected across both a long window (1h) and a short window (5m). This reduces alert noise compared to static thresholds by only alerting when the burn rate is sustained enough to exhaust the budget within the SLO period.
Detailed Answer
Think of a car's fuel gauge. A static threshold alert says 'warn at 25% fuel', but that ignores whether you are on a highway burning fuel fast or parked with the engine off. Multi-window burn rate is like saying 'warn when fuel consumption over the last hour would empty the tank before you reach the next gas station, AND you are still burning fast right now.' This catches real problems while ignoring brief spikes. SLO-based alerting starts with defining an error budget. If your SLO is 99.9% availability over 30 days, your error budget is 0.1%, or about 43 minutes of downtime. The burn rate is how fast you are consuming this budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the period. A burn rate of 14.4x means you will exhaust the 30-day budget in about 2 days (30 / 14.4 ≈ 2.1 days, or 50 hours). Multi-window burn rate uses two windows to reduce false positives. The long window (typically 1 hour) detects sustained error rates that threaten the budget. The short window (typically 5 minutes) confirms the problem is still happening right now. Both conditions must be true for the alert to fire. This prevents alerting on brief spikes that self-resolve (averaged over the long window, most of them stay below the threshold) and on past errors that have already been fixed (the long window still shows them, but the short window confirms they have stopped). Google's SRE Workbook recommends multiple severity tiers: 14.4x burn rate over 1h/5m and 6x over 6h/30m for pages, and 1x over 3d/6h for a ticket. At production scale, teams define recording rules that pre-compute error ratios for each SLI at multiple windows. The error ratio is calculated as rate(http_requests_total{status=~"5.."}[window]) / rate(http_requests_total[window]). Recording rules at 5m, 30m, 1h, and 6h windows avoid expensive queries at alert evaluation time. Grafana dashboards show the remaining error budget as a percentage, making it easy to see whether the team can ship features or must focus on reliability. 
The non-obvious gotcha is that burn rate alerts assume a uniform error distribution, which rarely matches reality. A 5-minute outage that burns 10% of the monthly budget followed by 29 days of perfect operation is very different from a constant 0.1% error rate. Teams should complement burn rate alerts with absolute threshold alerts for catastrophic failures (error rate > 50% for 1 minute) that would cause immediate user impact regardless of the monthly budget.
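The numbers above are easy to verify by hand. The following is a minimal Python sketch of that arithmetic, assuming a 30-day SLO period and a 99.9% target as in the text; the function names are illustrative and not part of any Prometheus API.

```python
from datetime import timedelta

def error_budget(slo: float, period: timedelta) -> timedelta:
    """Allowed 'bad' time within the SLO period (e.g. 0.1% of 30 days)."""
    return period * (1 - slo)

def time_to_exhaustion(burn_rate: float, period: timedelta) -> timedelta:
    """How long the full budget lasts at a constant burn rate."""
    return period / burn_rate

def alert_threshold(burn_rate: float, slo: float) -> float:
    """Error ratio that corresponds to the given burn rate."""
    return burn_rate * (1 - slo)

period = timedelta(days=30)
slo = 0.999

budget_minutes = error_budget(slo, period).total_seconds() / 60
hours_left = time_to_exhaustion(14.4, period).total_seconds() / 3600
threshold = alert_threshold(14.4, slo)

print(f"budget: {budget_minutes:.1f} min")           # about 43.2 minutes
print(f"14.4x burn lasts: {hours_left:.1f} h")       # about 50 hours
print(f"alert when error ratio > {threshold:.4f}")   # 14.4 * 0.001
```

The threshold printed last is the same `14.4 * 0.001` expression used in the alert rule, which is why the rule compares a plain error ratio against it rather than computing a burn rate explicitly.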
Code Example
# Recording rules for multi-window error ratios
# prometheus-rules.yaml
groups:
  - name: slo-payments-api
    rules:
      # 5-minute error ratio (short window)
      - record: payments_api:error_ratio:5m
        expr: |
          sum(rate(http_requests_total{service="payments-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payments-api"}[5m]))
      # 1-hour error ratio (long window)
      - record: payments_api:error_ratio:1h
        expr: |
          sum(rate(http_requests_total{service="payments-api",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{service="payments-api"}[1h]))
      # Multi-window burn rate alert (14.4x = exhausts 30-day budget in 2 days)
      - alert: PaymentsAPIHighBurnRate
        expr: |
          payments_api:error_ratio:1h > (14.4 * 0.001)
          and
          payments_api:error_ratio:5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "payments-api burning error budget at 14.4x rate"
Quick Answer
Prometheus reuses the newest sample only if it falls within the lookback window, which defaults to 5 minutes. When a target or metric disappears, Prometheus writes a staleness marker so queries stop returning the old value instead of silently carrying it forever.
Detailed Answer
Think of a train station display board. If the 8:10 train reported its location two minutes ago, the board can still show a useful last-known position. If the train has not reported for an hour, showing that old position would mislead passengers. Prometheus has the same problem with metrics: a recent sample is fine to use at query time, but an old sample should eventually disappear so graphs and alerts do not pretend the system is healthy. PromQL, the Prometheus query language, evaluates instant queries at a single timestamp and range queries at many evenly spaced timestamps. For each evaluation timestamp, Prometheus looks backward for the newest sample inside the lookback window. The default lookback is 5 minutes, and it is configurable. This lets queries work even when scrapes do not land exactly on graph step boundaries. Without this behavior, normal scrape timing jitter would create broken graphs and unreliable aggregations. Staleness adds another layer. If a target scrape no longer returns a series that previously existed, or if service discovery removes a target entirely, Prometheus can write a staleness marker for that time series. After that marker, instant queries no longer return the old value for that series. This prevents stale readings from being treated as current values in aggregations like sum, avg, or alert expressions. If fresh samples later arrive for the same label set, the series simply reappears. Production alerting gets subtle here. An alert like `up == 0` catches failed scrapes where the target is still known but unreachable. However, it may not catch a target that vanished from service discovery, because there may be no `up` series left to evaluate. For detecting missing services, `absent()` or inventory-based alerts are usually needed. Engineers also tune scrape_interval, scrape_timeout, evaluation_interval, and the alert `for` duration so brief network hiccups do not page people while true disappearances still get caught quickly. 
The experienced gotcha is that a graph can look flat or empty for different reasons. A flat line may mean Prometheus is carrying a recent last sample inside the lookback window, while an empty graph may mean the series went stale, not that the value became zero. Exporters that attach their own timestamps can behave differently and may keep the last value visible until lookback expires. A common band-aid is using `or vector(0)` everywhere, which makes dashboards look tidy but hides missing telemetry. Senior engineers learn to distinguish between zero, missing, stale, and failed-scrape states explicitly rather than papering over the differences.
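The lookback and staleness behavior can be modeled in a few lines. This is a simplified Python sketch, not Prometheus's actual implementation: `STALE` is a hypothetical stand-in for the staleness marker, and the exact boundary handling of the lookback window is deliberately not modeled.

```python
LOOKBACK_SECONDS = 300  # Prometheus default lookback of 5 minutes
STALE = object()        # illustrative stand-in for a staleness marker

def value_at(samples, eval_ts, lookback=LOOKBACK_SECONDS):
    """Return the value an instant query would see at eval_ts, or None.

    samples is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    newest = None
    for ts, value in samples:
        if ts > eval_ts:
            break
        newest = (ts, value)
    if newest is None:
        return None
    ts, value = newest
    # A staleness marker or a sample older than the lookback yields no result.
    if value is STALE or eval_ts - ts > lookback:
        return None
    return value

series = [(0, 1.0), (15, 1.0), (30, 1.0)]      # scraped every 15s, then silence
print(value_at(series, 100))                   # recent sample is reused
print(value_at(series, 400))                   # older than lookback: no result
print(value_at(series + [(45, STALE)], 100))   # marker hides the old value early
```

Note that the "no result" cases return `None` rather than 0, which mirrors the point above: missing and zero are different states and should not be collapsed.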
Code Example
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=up{job="payments-api"}' # Checks whether Prometheus still sees the payments-api target and whether the latest scrape succeeded.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=absent(up{job="payments-api"})' # Detects the case where the target disappeared from service discovery and no up series exists.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=max_over_time(up{job="payments-api"}[10m])' # Shows whether the target was present at any point during the last 10 minutes.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=time() - timestamp(up{job="payments-api"})' # Measures how old the newest up sample is for the target.
promtool check rules /etc/prometheus/rules/payments-availability.yml # Validates alert rules before reloading them into Prometheus.
◈ Architecture Diagram
┌──────────┐
│ Target │
└────┬─────┘
↓ scrape
┌──────────┐
│ Sample │
└────┬─────┘
↓ query
┌──────────┐
│ Lookback │
└────┬─────┘
↓
┌──────────┐ ┌──────────┐
│ Present │ │ Stale │
└────┬─────┘ └────┬─────┘
↓ ↓
┌──────────┐ ┌──────────┐
│ Alert │ │ Absent │
└──────────┘ └──────────┘
Quick Answer
Use histograms when you need to aggregate percentiles across many instances and tie them to SLOs. Classic histograms need explicit bucket boundaries, native histograms reduce that manual work, and summaries calculate percentiles inside the app but cannot be safely combined across replicas.
Detailed Answer
Think of measuring checkout wait times by placing customers into labeled bins: under 100 ms, under 300 ms, under 1 second, and so on. If the bins are chosen around the thresholds the business actually cares about, the data is useful. If every bin is too wide, too narrow, or missing the SLO boundary, the final percentile looks precise but answers the wrong question. Prometheus histograms are that binning system for measurements like request duration. A classic Prometheus histogram exposes cumulative bucket counters using the `le` label, which stands for less than or equal, plus `_sum` and `_count` series. Prometheus calculates percentiles using the `histogram_quantile()` function over rates of those buckets. The big advantage of this design is that you can aggregate across pods, nodes, clusters, or jobs before calculating the percentile, which is why histograms are the go-to for distributed services. The cost is extra time series: each bucket boundary creates another series for every label combination. Native histograms change the storage model by representing many bucket spans more compactly and letting Prometheus handle histogram samples directly. They reduce some of the pain of choosing bucket boundaries manually and support more flexible percentile exploration. However, they require compatible Prometheus settings, client libraries, remote write backends, and query paths, so you need to check the full chain before adopting them. Summaries are a different animal: they compute selected quantiles inside each application process. That can be useful for a single process, but averaging p95 values across replicas is statistically wrong because each process saw a different number and shape of requests. The query path matters for getting correct results. For classic histograms, you typically apply `rate()` to `_bucket` counters, aggregate with `sum by (le, service)` or similar, then call `histogram_quantile()`. 
The `le` label must survive until the quantile function runs because it represents the bucket boundary. For SLO checks like seeing what fraction of requests finish under 300 ms, having an exact bucket at that boundary makes the calculation simple and reliable. For Apdex-style scores, you need buckets at both the satisfied and tolerated thresholds. The gotcha is that histogram percentiles are estimates, and how good the estimate is depends entirely on where you place the buckets. A p99 alert built on buckets of 100 ms, 1 second, and 10 seconds cannot accurately tell the difference between 1.2 seconds and 8 seconds. Another common mistake is averaging per-pod p95 values in Grafana, which gives equal weight to quiet pods and busy pods. Experienced engineers pick bucket boundaries around the SLO thresholds users care about, keep labels low-cardinality, aggregate buckets before computing quantiles, and verify that the remote storage path preserves the histogram type they depend on.
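To make the "aggregate buckets before computing quantiles" advice concrete, here is a rough Python sketch of quantile estimation over cumulative buckets using linear interpolation inside the matching bucket. It is a simplified model of what classic histogram quantile estimation does, not the Prometheus source; the pod data and function names are made up for illustration.

```python
INF = float("inf")

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets [(upper_bound, count), ...].

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == INF:
                return lower_bound  # cannot interpolate into +Inf
            span = upper_bound - lower_bound
            return lower_bound + span * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return lower_bound

def merge(*histograms):
    """Sum cumulative counts per bucket boundary (like sum by (le))."""
    return [(bounds[0][0], sum(b[1] for b in bounds)) for bounds in zip(*histograms)]

# Hypothetical pods sharing the coarse buckets 0.1s, 1s, 10s, +Inf.
busy_pod = [(0.1, 900), (1.0, 990), (10.0, 1000), (INF, 1000)]
quiet_pod = [(0.1, 0), (1.0, 2), (10.0, 10), (INF, 10)]

fleet_p95 = histogram_quantile(0.95, merge(busy_pod, quiet_pod))
naive_avg = (histogram_quantile(0.95, busy_pod) + histogram_quantile(0.95, quiet_pod)) / 2

print(f"fleet p95 (buckets merged first): {fleet_p95:.3f}s")
print(f"average of per-pod p95 values:    {naive_avg:.3f}s")
```

In this made-up data the quiet pod handles 10 of 1,010 requests, yet averaging per-pod p95 values lets it drag the result from under 1 second to over 5 seconds. The sketch also shows the bucket-placement problem: everything between 1 and 10 seconds is interpolated along a straight line.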
Code Example
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="payments-api"}[5m])))' # Computes fleet-wide p95 latency from classic histogram buckets after aggregating by bucket boundary.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_request_duration_seconds_bucket{job="payments-api",le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count{job="payments-api"}[5m]))' # Calculates the fraction of payments-api requests completed within the 300 ms SLO bucket.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_request_duration_seconds_sum{job="payments-api"}[5m])) / sum(rate(http_request_duration_seconds_count{job="payments-api"}[5m]))' # Calculates average latency from histogram sum and count without using quantile math.
promtool check rules /etc/prometheus/rules/payments-latency-slo.yml # Validates histogram-based recording and alerting rules before deployment.
◈ Architecture Diagram
┌──────────┐
│ Request │
└────┬─────┘
↓
┌──────────┐
│ Buckets │
└────┬─────┘
↓
┌──────────┐
│ Rate │
└────┬─────┘
↓
┌──────────┐
│ Sum by le│
└────┬─────┘
↓
┌──────────┐
│ p95 │
└──────────┘
Quick Answer
Remote write tails the Prometheus WAL into per-destination queues, shards the work across parallel senders, batches samples, and retries on failure. If queues fill up, Prometheus stops reading from the WAL for that destination. If the receiver stays down too long, unsent samples can be lost as WAL data gets compacted away.
Detailed Answer
Think of a shipping dock that sends packages from a factory to a central warehouse. The factory keeps making boxes, workers load them into several truck lanes, and sometimes the warehouse slows down. If the lanes fill up, boxes pile up at the dock. If the warehouse stays closed for hours, the factory has to choose between halting intake, using more dock space, or eventually throwing away boxes it can no longer hold. Prometheus remote write is that shipping dock for metrics. Prometheus first ingests samples locally through its normal scrape and WAL path. A remote write component then reads from the WAL, maps internal series IDs to their label sets, queues samples, and sends compressed HTTP requests to the configured remote endpoint. That endpoint might be Grafana Mimir, Thanos Receive, Cortex, VictoriaMetrics, or a managed cloud service. Remote write is not a magic way to backfill historical data; it is mainly a streaming replication path from the local ingestion flow. Backpressure shows up when the remote endpoint is slow, returning errors, rate-limiting, or totally unreachable. Prometheus uses shards, which are parallel sending workers, to improve throughput. Each shard has an in-memory queue with a capacity limit and a maximum batch size. Failed requests get retried with exponential backoff. Prometheus can automatically adjust the shard count based on the incoming sample rate and how long sends are taking. But once queues fill up, reading from the WAL for that remote write target is blocked, and pending samples start piling up. In production, the key metrics to watch are pending samples, failed samples, retried samples, send batch duration, current shard count, and queue capacity. Tuning usually starts with the receiver side: confirm it is healthy, not throttling, and not rejecting samples due to tenant limits or bad labels. Then tune Prometheus settings like `max_samples_per_send`, `capacity`, `max_shards`, and backoff values. 
Capacity should generally be several times the batch size, but setting it too high increases Prometheus memory usage. Write relabeling can drop expensive or unnecessary samples before they even leave Prometheus. The gotcha is that cranking every knob up can make the outage worse. More shards can overwhelm a backend that is trying to recover. More queue capacity can cause Prometheus memory pressure, especially during high series churn because remote write caches series labels. Another gotcha involves the two-hour WAL window: if remote write stays blocked longer than the WAL can hold unsent data, samples get lost when the WAL is compacted. Senior engineers treat remote write tuning as end-to-end flow control, not just a matter of making the queue bigger.
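A back-of-the-envelope calculation helps when choosing these values. The Python sketch below is a deliberately simplified model: it treats queue capacity as per-shard (as in the config comments) and assumes a steady ingest rate, and the 50,000 samples/s figure is an assumption for illustration, not a measured number.

```python
def queue_buffer_seconds(shards, capacity_per_shard, samples_per_second):
    """Seconds of ingest the in-memory queues can absorb before they fill."""
    return shards * capacity_per_shard / samples_per_second

def capacity_ratio(capacity_per_shard, max_samples_per_send):
    """How many full batches fit into one shard's queue."""
    return capacity_per_shard / max_samples_per_send

def outage_within_wal(outage_seconds, wal_window_seconds=2 * 3600):
    """Rough check: unsent data is at risk once an outage outlasts the WAL window."""
    return outage_seconds <= wal_window_seconds

# Values mirror a queue_config with max_shards=10, capacity=30000,
# max_samples_per_send=5000; the ingest rate is hypothetical.
buffer_s = queue_buffer_seconds(shards=10, capacity_per_shard=30000, samples_per_second=50000)
print(f"in-memory buffer at max shards: {buffer_s:.1f}s")
print(f"batches per shard queue: {capacity_ratio(30000, 5000):.0f}")
print(f"30 min outage recoverable from WAL: {outage_within_wal(30 * 60)}")
print(f"3 h outage recoverable from WAL:   {outage_within_wal(3 * 3600)}")
```

The point of the exercise is that the in-memory queues only cover seconds of slowdown. Anything longer relies on the WAL, which is why raising `capacity` is not a substitute for a healthy receiver.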
Code Example
remote_write: # Sends locally ingested samples to a central backend such as Mimir or Thanos Receive.
  - url: https://mimir-write.monitoring.svc/api/v1/push # Points Prometheus at the remote write receiver endpoint.
    name: payments-mimir # Gives this remote write queue a stable name in metrics and logs.
    remote_timeout: 30s # Bounds each send request so slow receivers do not hang workers forever.
    queue_config: # Controls memory queues and parallel send workers for this remote write target.
      max_samples_per_send: 5000 # Sends larger batches to improve throughput when the receiver supports them.
      capacity: 30000 # Keeps per-shard capacity about six times the batch size to absorb short slowdowns.
      max_shards: 10 # Caps parallelism so Prometheus does not overload the central backend during recovery.
      min_shards: 2 # Starts with two workers so the queue can drain promptly after restart.
      min_backoff: 1s # Waits at least one second before retrying a failed send.
      max_backoff: 30s # Prevents retry storms by backing off repeated failures.
    write_relabel_configs: # Drops samples before remote write to reduce bandwidth and receiver load.
      - source_labels: [__name__] # Selects samples by metric name before deciding whether to send them.
        regex: 'go_.*' # Matches noisy runtime metrics that the central backend does not need.
        action: drop # Drops matching samples from remote write while keeping local scrape data.
◈ Architecture Diagram
┌──────────┐
│ WAL │
└────┬─────┘
↓
┌──────────┐
│ Queue │
└────┬─────┘
↓
┌──────────┐
│ Shards │
└────┬─────┘
↓
┌──────────┐
│ Receiver │
└────┬─────┘
↓
┌──────────┐
│ Object │
└──────────┘
Quick Answer
Alertmanager groups related alerts, deduplicates notifications, routes them to the right receiver, silences planned noise, and inhibits lower-level alerts when a parent alert explains them. In HA mode, Prometheus should send alerts directly to every Alertmanager peer, not through a load balancer.
Detailed Answer
Think of a hospital emergency department during a city-wide power failure. Thousands of alarms pour in from buildings, traffic lights, and elevators. Operators do not want a separate phone call for each alarm. They want one grouped incident per affected area, with enough detail to know which buildings still need help. Alertmanager is that dispatch layer for Prometheus alerts. Prometheus evaluates alerting rules and sends firing or resolved alerts to Alertmanager over HTTP. Alertmanager then groups alerts by chosen labels, routes each group through a routing tree to the right receiver (Slack, PagerDuty, email), deduplicates repeated notifications, applies silences for planned maintenance, and applies inhibitions when one alert makes another redundant. For example, if a ClusterDown alert is firing, an inhibition rule can suppress thousands of pod-level alerts from that same cluster because they are all symptoms of the same root cause. Grouping is label-driven. The group_by setting picks which labels define a notification group. group_wait delays the first notification briefly so related alerts can arrive together. group_interval controls how long Alertmanager waits before sending a notification about new alerts added to a group that has already been notified. repeat_interval controls how frequently unresolved alerts are re-sent. Inhibition rules compare a source alert against target alerts using matchers and equality labels. Silences use matchers and time windows, and are usually created through the Alertmanager UI or API during maintenance windows. For high availability, multiple Alertmanager instances form a cluster and share notification state through a gossip protocol. Prometheus should be configured with all Alertmanager peers listed as targets. The Prometheus docs warn against putting a load balancer between Prometheus and Alertmanager because each Prometheus instance needs to deliver alerts to the full cluster so deduplication and state replication work correctly. 
Teams also set external labels like cluster, region, and replica carefully so Alertmanager can tell independent environments apart while still deduplicating HA Prometheus replicas. The gotcha is that label design can either flood your team or hide a real outage. If group_by includes pod, every single pod failure during a deployment becomes a separate page. If it only groups by alertname, unrelated production and staging incidents might collapse into one notification. Inhibition can be dangerous too -- if the source alert is too broad or fires too easily, it can silence real alerts. Senior engineers test alert routes with sample payloads, keep grouping labels tied to ownership and blast radius, and regularly review active silences to make sure planned maintenance windows have not turned into black holes for real incidents.
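The effect of label choices on grouping and inhibition is easier to see with a toy model. The following Python sketch imitates the two mechanisms on plain dictionaries. It is an illustration under simplifying assumptions (exact-match matchers only, no timing), and the alert data is invented.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

def is_inhibited(target, firing, source_match, target_match, equal):
    """True if a firing source alert suppresses the target alert."""
    if not all(target.get(k) == v for k, v in target_match.items()):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        if not all(source.get(k) == v for k, v in source_match.items()):
            continue
        if all(source.get(label) == target.get(label) for label in equal):
            return True
    return False

alerts = [
    {"alertname": "ClusterDown", "cluster": "prod-eu", "severity": "critical"},
    {"alertname": "PodCrashLooping", "cluster": "prod-eu", "namespace": "payments", "pod": "api-1", "severity": "warning"},
    {"alertname": "PodCrashLooping", "cluster": "prod-eu", "namespace": "payments", "pod": "api-2", "severity": "warning"},
    {"alertname": "PodCrashLooping", "cluster": "prod-us", "namespace": "payments", "pod": "api-1", "severity": "warning"},
]

by_blast_radius = group_alerts(alerts, ["cluster", "namespace", "alertname"])
by_pod = group_alerts(alerts, ["cluster", "namespace", "alertname", "pod"])
inhibited = [
    is_inhibited(a, alerts, {"alertname": "ClusterDown"}, {"severity": "warning"}, ["cluster"])
    for a in alerts
]

print(len(by_blast_radius), "groups without pod in group_by")
print(len(by_pod), "groups with pod in group_by")
print(inhibited)
```

Adding `pod` to the grouping labels gives every pod alert its own notification group, while the `equal: ['cluster']` condition keeps the prod-us alert visible even though ClusterDown is firing in prod-eu.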
Code Example
# prometheus.yml
alerting: # Configures where Prometheus sends evaluated alerts.
  alertmanagers: # Lists Alertmanager targets for alert delivery.
    - static_configs: # Uses explicit peer targets instead of a load-balanced single endpoint.
        - targets: ['alertmanager-0:9093','alertmanager-1:9093','alertmanager-2:9093'] # Sends alerts directly to every HA Alertmanager peer.
# alertmanager.yml
route: # Defines the root Alertmanager routing tree.
  receiver: sre-pager # Sends unmatched production alerts to the SRE paging receiver.
  group_by: ['cluster','namespace','alertname'] # Groups by blast radius without grouping unrelated clusters together.
  group_wait: 30s # Waits briefly so related alerts from the same incident can arrive together.
  group_interval: 5m # Waits this long before notifying about new alerts added to an already-notified group.
  repeat_interval: 4h # Prevents repeated pages for the same unresolved alert group.
inhibit_rules: # Suppresses noisy child alerts when a parent outage alert is already firing.
  - source_matchers: ['alertname="ClusterDown"'] # Uses the cluster-level outage alert as the inhibition source.
    target_matchers: ['severity="warning"'] # Suppresses lower-severity warning alerts during the parent outage.
    equal: ['cluster'] # Applies inhibition only inside the same cluster label value.
◈ Architecture Diagram
┌──────────┐
│ Rules │
└────┬─────┘
↓
┌──────────┐
│ Alerts │
└────┬─────┘
↓
┌──────────┐
│ Group │
└────┬─────┘
↓
┌──────────┐ ┌──────────┐
│ Inhibit │←────│ Silence │
└────┬─────┘ └──────────┘
↓
┌──────────┐
│ Route │
└────┬─────┘
↓
┌──────────┐
│ Pager │
└──────────┘
Quick Answer
Store dashboard JSON and alert rule YAML in Git. Use Grafana provisioning, Grafonnet (a Jsonnet library), or Terraform's Grafana provider to define dashboards as code. Changes go through PR review, CI validates syntax, and CD applies them automatically. Updating 10 dashboards means changing one template and pushing a single commit.
Detailed Answer
Think of it like managing a chain of restaurants where every location has to serve the same menu. Instead of calling each manager and dictating changes over the phone, which is like clicking around in Grafana's UI, you update the master menu in a shared drive, the managers review it, and an automated system prints and ships the new menus to all locations at once. The GitOps workflow for Grafana has three main approaches, from simple to powerful. The simplest is Grafana's built-in provisioning: you put dashboard JSON files and alert rule YAML files in a directory that Grafana watches, usually mounted via a ConfigMap in Kubernetes. When the files change, Grafana reloads them. You store these files in Git, and your CI/CD pipeline updates the ConfigMap every time a change merges to main. The second approach uses Grafonnet, a Jsonnet library for generating Grafana dashboard JSON programmatically. Instead of writing raw 500-line JSON files by hand, you write concise Jsonnet code that generates them. This is where updating 10 dashboards at once becomes easy: if all 10 share a common template, say a service dashboard with CPU, memory, error rate, and latency panels, you define the template once and pass in parameters per service. Changing the template changes all 10 dashboards in one commit. Jsonnet compiles down to JSON, which then gets provisioned into Grafana. The third approach uses Terraform with the Grafana provider. You define dashboards, folders, alert rules, and notification channels as Terraform resources. The CI pipeline runs `terraform plan` on pull requests to show what would change and `terraform apply` on merge. This gives you state management, drift detection, and the full Terraform workflow. For large organizations managing hundreds of dashboards across multiple Grafana instances, this is the most maintainable path. For alerts, Grafana's alerting rules and notification policies can also be defined in YAML and provisioned alongside dashboards. 
The entire alerting chain -- rules, routing policies, contact points, and message templates -- lives in Git, version-controlled and reviewable. The day-to-day workflow looks like this: a developer creates a branch, modifies dashboard Jsonnet or Terraform files, opens a pull request, CI runs syntax checks like jsonnet lint or terraform validate and optionally renders a preview, a reviewer approves, the PR merges to main, and the CD pipeline applies changes to Grafana. The big win is that every change is code-reviewed, version-controlled, and reversible with a simple `git revert`.
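The "CI validates syntax" step can be as small as a script that parses every dashboard file and checks a few invariants. Below is a minimal Python sketch of such a check. The rules it enforces (valid JSON, a non-empty `uid` and `title`, unique UIDs) are an assumed baseline for illustration and not an official Grafana validation.

```python
import json

def validate_dashboards(files):
    """Return a list of problems found in {filename: json_text}."""
    problems = []
    seen_uids = {}
    for filename, text in sorted(files.items()):
        try:
            dashboard = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append(f"{filename}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            problems.append(f"{filename}: top level must be an object")
            continue
        uid = dashboard.get("uid")
        if not uid:
            problems.append(f"{filename}: missing uid")
        elif uid in seen_uids:
            problems.append(f"{filename}: uid {uid!r} already used by {seen_uids[uid]}")
        else:
            seen_uids[uid] = filename
        if not dashboard.get("title"):
            problems.append(f"{filename}: missing title")
    return problems

# In CI the dict would be built by reading files from the dashboards directory.
sample = {
    "payments-api.json": '{"uid": "payments-api-svc", "title": "payments-api"}',
    "checkout-svc.json": '{"uid": "payments-api-svc", "title": "checkout-svc"}',
    "user-auth.json": '{"uid": "user-auth-svc", "title": ',
}
for problem in validate_dashboards(sample):
    print(problem)
```

Duplicate UIDs are worth catching before deployment because a copied dashboard file that keeps the original UID will collide with the dashboard it was copied from.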
Code Example
# ─── Approach 1: Grafana Provisioning via ConfigMap ───
# dashboards.yaml (Grafana provisioning config)
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards # Watch this directory
      foldersFromFilesStructure: true
# Mount dashboards from ConfigMap in Kubernetes
# kubectl create configmap grafana-dashboards \
#   --from-file=dashboards/ -n monitoring
# ─── Approach 2: Grafonnet (Jsonnet) ───
# service-dashboard.jsonnet: one template, many dashboards
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
# Datasource name is an assumption; use the one configured in your Grafana
local datasource = 'prometheus';
# Template function, reused for all services
local serviceDashboard(name, namespace) =
  g.dashboard.new(name + ' Service Dashboard')
  + g.dashboard.withUid(name + '-svc')
  + g.dashboard.withPanels([
    # CPU panel
    g.panel.timeSeries.new(name + ' CPU')
    + g.panel.timeSeries.queryOptions.withTargets([
      g.query.prometheus.new(
        datasource,
        'sum(rate(container_cpu_usage_seconds_total{namespace="' + namespace + '", pod=~"' + name + '.*"}[5m]))'
      ),
    ]),
    # Error rate panel (assumes http_requests_total carries a pod label)
    g.panel.timeSeries.new(name + ' Error Rate')
    + g.panel.timeSeries.queryOptions.withTargets([
      g.query.prometheus.new(
        datasource,
        'sum(rate(http_requests_total{namespace="' + namespace + '", pod=~"' + name + '.*", status=~"5.."}[5m]))'
      ),
    ]),
  ]);
# Generate 10 dashboards from one template
{
  'payments-api.json': serviceDashboard('payments-api', 'production'),
  'checkout-svc.json': serviceDashboard('checkout-svc', 'production'),
  'user-auth.json': serviceDashboard('user-auth', 'production'),
  # ... 7 more services
}
# Build: jsonnet -J vendor/ -m output/ service-dashboard.jsonnet
# ─── Approach 3: Terraform ───
resource "grafana_dashboard" "payments" {
  config_json = file("dashboards/payments-api.json")
  folder      = grafana_folder.production.id
}
# CI Pipeline (.github/workflows/grafana.yml)
# on PR: terraform plan → post diff as comment
# on merge: terraform apply → dashboards updated
◈ Architecture Diagram
GitOps Workflow for Grafana:
┌──────────┐ PR ┌──────────┐
│Developer │───────────►│ Git │
│ │ │ (main) │
│ edit │ review │ │
│ .jsonnet │◄───────────│ CI runs: │
│ or .tf │ approve │ lint │
└──────────┘ │ plan │
└────┬─────┘
│ merge
▼
┌─────────────────┐
│ CD Pipeline │
│ │
│ jsonnet build │
│ OR │
│ terraform apply │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Grafana │
│ │
│ 10 dashboards │
│ updated from │
│ 1 template │
└─────────────────┘
Template → 10 dashboards:
┌──────────────┐
│ svc-template │
│ .jsonnet │──► payments-api.json
│ │──► checkout-svc.json
│ │──► user-auth.json
│ │──► ... 7 more
│ 1 change = │
│ 10 updates │
└──────────────┘
Quick Answer
Symptom-based alerting fires on things users actually feel, like high error rates, slow responses, or SLO budget burn, instead of internal causes like high CPU or disk at 80%. It cuts alert noise dramatically because many internal causes map to just a few user-facing symptoms. You implement it with SLO-based burn rate alerts in Prometheus using multi-window, multi-burn-rate rules.
Detailed Answer
Think of it like a car dashboard. Cause-based alerting would mean separate warning lights for every internal part: fuel injector pressure, alternator voltage, coolant thermostat position, oxygen sensor reading. You would have 200 lights and no idea which ones matter. Symptom-based alerting gives you one light that says 'engine temperature high,' and the mechanic investigates the cause from there. Traditional monitoring creates alerts for every possible internal state: CPU above 80%, disk above 85%, memory above 90%, Pod restarts above 3, queue depth above 1000. This leads to massive alert fatigue. A team with 50 microservices might have 500-plus alert rules, most of which fire for brief spikes that fix themselves. Engineers start ignoring alerts, and when a real outage happens, the critical signal is buried in noise. Symptom-based alerting flips this around. You alert on what users experience: the error rate is burning through the SLO budget faster than sustainable, latency has crossed the SLO target, or availability has dropped below the threshold. These are called SLI-based alerts, where SLI stands for Service Level Indicator. If CPU is at 95% but the error rate is 0% and latency is normal, there is no user impact, so no alert is needed. If CPU is at 40% but the error rate is 5%, users are hurting, so you alert right away. The best way to implement this is Google's multi-window, multi-burn-rate approach from the SRE Workbook. You define an SLO such as 99.9% availability over 30 days, which gives you an error budget of 43.2 minutes of allowed downtime. Then you create burn rate alerts. A fast-burn alert fires when the error rate is consuming budget at 14.4 times the sustainable rate, meaning 2% of the monthly budget is spent in one hour and the entire budget would be gone in about 2 days (50 hours). This catches acute incidents. A slow-burn alert fires at 1 times the sustainable rate held over 3 days, catching gradual degradation. 
Each alert uses two time windows, a short one like 5 minutes and a long one like 1 hour, so a single brief spike does not trigger a false alarm. In Prometheus, this translates to recording rules that calculate error ratios over multiple windows, plus alert rules that compare burn rates against thresholds. Grafana displays an SLO dashboard showing remaining error budget, burn rate trends, and alert status. The result: instead of 500 noisy alerts, you might have 10 to 20 SLO-based alerts across all services, each one actionable and tied to real user impact.
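The burn-rate factors are not arbitrary: each follows from deciding what fraction of the budget may be spent within a given window. Here is a short Python sketch of that derivation, assuming the tiers described in Google's SRE Workbook (2% of the budget in 1 hour, 5% in 6 hours, 10% in 3 days) and a 30-day period. The helper name is hypothetical.

```python
PERIOD_HOURS = 30 * 24  # 30-day SLO period
ERROR_BUDGET = 0.001    # 99.9% SLO leaves a 0.1% error budget

def burn_rate_for(budget_fraction, window_hours, period_hours=PERIOD_HOURS):
    """Burn rate that spends budget_fraction of the budget in window_hours."""
    return budget_fraction * period_hours / window_hours

tiers = {
    "page, 1h long window": burn_rate_for(0.02, 1),
    "page, 6h long window": burn_rate_for(0.05, 6),
    "ticket, 3d long window": burn_rate_for(0.10, 72),
}

for name, rate in tiers.items():
    # The alert threshold is the burn rate multiplied by the error budget.
    print(f"{name}: {rate:.1f}x burn, error ratio > {rate * ERROR_BUDGET:.4f}")
```

Read this way, a 14.4x threshold is simply "2% of the monthly budget gone within an hour," which is a more intuitive statement to negotiate with product owners than the multiplier itself.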
Code Example
# ─── SLO Definition ───
# Service: payments-api
# SLO: 99.9% availability (error budget: 0.1% or 43.2 min/month)
# ─── Recording Rules (prometheus-rules.yaml) ───
groups:
  - name: payments-slo
    rules:
      # Error ratio over different windows
      - record: payments:error_ratio:5m
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payments-api"}[5m]))
      - record: payments:error_ratio:1h
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="payments-api"}[1h]))
      - record: payments:error_ratio:6h
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="payments-api"}[6h]))
      # 3-day ratio, needed by the slow-burn alert below
      - record: payments:error_ratio:3d
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="payments-api"}[3d]))
      # ─── Alert Rules (burn rate) ───
      # Fast burn: 14.4x budget consumption → page immediately
      - alert: PaymentsSLOFastBurn
        expr: |
          payments:error_ratio:5m > (14.4 * 0.001)
          and
          payments:error_ratio:1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payments API burning error budget 14.4x too fast"
          description: "At this rate, the 30-day budget is exhausted in about 2 days"
      # Slow burn: 1x sustained → ticket (not page)
      - alert: PaymentsSLOSlowBurn
        expr: |
          payments:error_ratio:6h > (1 * 0.001)
          and
          payments:error_ratio:3d > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Payments API slowly burning error budget"
          description: "Gradual degradation, investigate this week"
◈ Architecture Diagram
Cause-Based (noisy):          Symptom-Based (actionable):
┌────────────────────┐        ┌────────────────────┐
│ CPU > 80%     PAGE │        │                    │
│ Disk > 85%    PAGE │        │ Error rate > SLO   │
│ Memory > 90%  PAGE │        │ burn rate?         │
│ Restarts > 3  PAGE │        │                    │
│ Queue > 1000  PAGE │        │ YES → PAGE         │
│ Latency spike PAGE │        │ NO  → silence      │
└────────────────────┘        └────────────────────┘
500+ alerts, most noise       10-20 alerts, all real

Multi-Window Burn Rate:
Error Budget: 43.2 min/month (99.9% SLO)
┌─── Fast Burn ──────────────────────┐
│ 14.4x burn rate                    │
│ 5min window AND 1hr window         │
│ → exhausts budget in ~50 hours     │
│ → PAGE immediately                 │
└────────────────────────────────────┘
┌─── Slow Burn ──────────────────────┐
│ 1x burn rate                       │
│ 6hr window AND 3day window         │
│ → exhausts budget in 30 days       │
│ → ticket, investigate this week    │
└────────────────────────────────────┘
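The two-window condition in the burn-rate boxes can be sketched as a small decision function. This is an illustrative Python model, not Prometheus code; the `BurnRateAlert` class, the `firing` helper, and the sample error ratios are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateAlert:
    name: str
    burn_rate: float
    short_window: str
    long_window: str


# Burn rates and window pairs taken from the fast-burn and slow-burn alerts above.
BURN_ALERTS = [
    BurnRateAlert("fast_burn", 14.4, "5m", "1h"),
    BurnRateAlert("slow_burn", 1.0, "6h", "3d"),
]


def firing(error_ratios: dict, slo: float = 0.999) -> list:
    """Return names of alerts whose short AND long windows both exceed the threshold."""
    budget = 1 - slo
    result = []
    for alert in BURN_ALERTS:
        threshold = alert.burn_rate * budget
        if (error_ratios[alert.short_window] > threshold
                and error_ratios[alert.long_window] > threshold):
            result.append(alert.name)
    return result


# A brief spike: the 5m window is hot, but the 1h window is not, so nothing fires.
spike = {"5m": 0.05, "1h": 0.002, "6h": 0.0005, "3d": 0.0002}
# A sustained incident: both fast-burn windows are above 1.44% errors.
incident = {"5m": 0.05, "1h": 0.03, "6h": 0.006, "3d": 0.0008}

if __name__ == "__main__":
    print(firing(spike))
    print(firing(incident))
```

Requiring both windows is what keeps a short spike from paging: the short window shows the problem is happening now, and the long window shows it has lasted long enough to matter.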
Quick Answer
Recording rules pre-compute expensive PromQL queries and save the results as new time series, making dashboards load faster. Alerting rules check PromQL conditions at regular intervals and fire alerts to Alertmanager when conditions stay true for a set duration.
Detailed Answer
Think of recording rules like a restaurant that preps ingredients before the dinner rush. Instead of chopping vegetables from scratch for every order, the kitchen pre-chops during quiet hours. Recording rules pre-compute expensive PromQL queries on a schedule so dashboards load instantly. Alerting rules are like a smoke detector: they continuously check a condition and sound the alarm when something crosses a threshold for long enough to be a real problem. Both types of rules are defined in YAML files and loaded by Prometheus through the rule_files config. They are organized into rule groups, where each group has a name and an optional evaluation interval. Recording rules have a record field (the name of the new metric to create) and an expr field (the PromQL expression to evaluate). The naming convention follows the pattern level:metric:operations -- for example, namespace:http_requests_total:rate5m tells you the aggregation level, the base metric, and what operation was applied. Alerting rules have an alert field (the alert name), an expr field, an optional for duration, labels to attach, and annotations for human-readable descriptions. Under the hood, Prometheus evaluates rules within each group sequentially but can run multiple groups in parallel. The evaluation interval defaults to the global setting but can be overridden per group. For recording rules, each evaluation writes a new sample to the TSDB with the current timestamp. For alerting rules, the evaluation produces one of three states: inactive (the expression returned nothing), pending (the expression matched but the for duration has not passed yet), or firing (the expression has been true for at least the for duration). When an alert hits firing state, Prometheus sends it to all configured Alertmanagers. In production, recording rules are essential for scaling dashboards. 
Without them, 50 engineers opening the same Grafana dashboard during an incident would trigger the same expensive aggregation 50 times per refresh. Recording rules compute it once and store the result. A common pattern is building a pyramid: raw metrics get aggregated into per-service rates, then those rates get aggregated into per-team totals. For alerting, the for clause is critical -- it prevents false alarms from momentary spikes. A for: 5m clause means the condition must be continuously true for 5 minutes before the alert fires. A key gotcha with recording rules is dependency ordering. If rule A depends on the output of rule B, both should be in the same group with B listed first, because rules within a group run sequentially. Across groups, evaluation order is not guaranteed, so A may read a value of B that is one evaluation old. For alerting rules, a common mistake is leaving out the for clause entirely, which causes alerts to fire on every brief spike. Another pitfall is hardcoding values in annotations instead of using template variables. Always include {{ $labels.instance }} and {{ $value }} in your annotation templates so on-call engineers can immediately see which target is affected and how bad it is.
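The inactive, pending, and firing states described above can be modelled in a few lines. This is a simplified Python sketch of the behavior of the for clause, not Prometheus source code; `alert_states` and its parameters are names invented for the illustration.

```python
def alert_states(condition_by_eval, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, one per evaluation
    (True means the alert expression returned a result).
    """
    states = []
    active_since = None  # time the expression first matched in the current streak
    for i, matched in enumerate(condition_by_eval):
        now = i * interval_seconds
        if not matched:
            active_since = None  # any gap resets the timer
            states.append("inactive")
        else:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # for: 5m with a 60s evaluation interval
    print(alert_states([True] * 6, for_seconds=300, interval_seconds=60))
    # a brief spike that clears never gets past pending
    print(alert_states([True, True, False, True], for_seconds=300, interval_seconds=60))
```

With a 60-second interval and for: 5m, the alert stays pending for the first five evaluations and fires on the sixth. Leaving out the for clause corresponds to `for_seconds=0`, where the first matching evaluation fires immediately.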
Code Example
# prometheus-rules.yml - Recording and Alerting Rules
# Loaded via: rule_files: ['prometheus-rules.yml'] in prometheus.yml
groups:
  # Recording rules for payments-api performance metrics
  - name: payments_api_recording_rules  # Group name for organization
    interval: 30s  # Evaluate every 30 seconds
    rules:
      # Pre-compute per-service request rate
      - record: service:http_requests_total:rate5m  # New time series name (level:metric:operation)
        expr: |  # PromQL expression to evaluate
          sum by (service, environment) (
            rate(http_requests_total[5m])  # Rate of requests over 5 minutes
          )
        labels:
          aggregated_by: "recording_rule"  # Custom label to identify pre-computed metrics
      # Pre-compute error rate percentage
      # (environment is kept so alert annotations can reference it)
      - record: service:http_error_rate:ratio_rate5m  # Error ratio as a recording rule
        expr: |  # Avoids expensive division in dashboards
          sum by (service, environment) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (service, environment) (rate(http_requests_total[5m]))
      # Pre-compute p99 latency per service
      - record: service:http_request_duration:p99_5m  # 99th percentile latency
        expr: |  # histogram_quantile is expensive at query time
          histogram_quantile(0.99,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
  # Alerting rules for checkout-service SLOs
  - name: checkout_service_alerts  # Alerting rule group
    rules:
      # Alert when error rate exceeds 1% for 5 minutes
      - alert: CheckoutHighErrorRate  # Alert name (PascalCase convention)
        expr: |  # PromQL condition to evaluate
          service:http_error_rate:ratio_rate5m{service="checkout-service"} > 0.01
        for: 5m  # Must be true for 5 min before firing
        labels:
          severity: critical  # Routing label for Alertmanager
          team: checkout  # Team responsible for this alert
        annotations:
          summary: "High error rate on checkout-service"  # Short description
          description: >  # Detailed description with templates
            Error rate is {{ $value | humanizePercentage }}
            for {{ $labels.service }} in {{ $labels.environment }}.
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"  # Link to remediation steps
      # Alert when p99 latency exceeds 2 seconds
      - alert: CheckoutHighLatency  # Latency SLO violation alert
        expr: |  # Use pre-computed recording rule
          service:http_request_duration:p99_5m{service="checkout-service"} > 2.0
        for: 10m  # Longer for-clause to reduce noise
        labels:
          severity: warning  # Warning severity, not critical
          team: checkout  # Ownership label
        annotations:
          summary: "P99 latency exceeds 2s on checkout-service"
          description: >  # Include actual value for quick triage
            P99 latency is {{ $value | humanizeDuration }}
            for {{ $labels.service }}.
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│                Recording Rules vs Alerting Rules                 │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────────────────────────────────────┐            │
│  │                 Recording Rules                  │            │
│  │                                                  │            │
│  │  Expensive PromQL ──→ Evaluate every 30s         │            │
│  │  expr             ──→ Store as new metric        │            │
│  │                       in TSDB                    │            │
│  │                                                  │            │
│  │  sum(rate(http_total[5m])) → service:http:rate5m │            │
│  │  [complex query]             [pre-computed]      │            │
│  └──────────────────────────────────────────────────┘            │
│                                                                  │
│  ┌──────────────────────────────────────────────────┐            │
│  │                  Alerting Rules                  │            │
│  │                                                  │            │
│  │  PromQL expr ──→ Evaluate ──→ State Machine      │            │
│  │                                                  │            │
│  │  ┌──────────┐   ┌──────────┐   ┌──────────┐      │            │
│  │  │ INACTIVE │──→│ PENDING  │──→│  FIRING  │      │            │
│  │  │ expr=∅   │   │ expr=true│   │ for:5m   │      │            │
│  │  └──────────┘   │ timer<5m │   │ elapsed  │      │            │
│  │        ↑        └────┬─────┘   └────┬─────┘      │            │
│  │        │             │              │            │            │
│  │        └──expr=false─┘              ↓            │            │
│  │                               ┌────────────┐     │            │
│  │                               │Alertmanager│     │            │
│  │                               │  routing   │     │            │
│  │                               └────────────┘     │            │
│  └──────────────────────────────────────────────────┘            │
└──────────────────────────────────────────────────────────────────┘
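Why rule order inside a group matters can be shown with a toy evaluator. This is a hedged Python sketch that only models sequential evaluation within one group; `evaluate_group`, the rule names, and the sample counters are hypothetical and do not correspond to any Prometheus API.

```python
def evaluate_group(rules, store):
    """Run rules in listed order; each result is stored before the next rule runs."""
    for name, expr in rules:
        store[name] = expr(store)
    return store


# Rule B: error ratio computed from raw counters (hypothetical names).
rule_b = ("service:errors:ratio", lambda s: s["errors"] / s["requests"])
# Rule A: depends on rule B's output; a missing value is treated as "no data".
rule_a = ("service:slo_breach", lambda s: s.get("service:errors:ratio", 0.0) > 0.01)

raw = {"errors": 10.0, "requests": 200.0}

# B listed first: A sees the fresh ratio of 0.05 in the same evaluation cycle.
good = evaluate_group([rule_b, rule_a], dict(raw))
# A listed first: on the first cycle B's output does not exist yet, so A misses the breach.
bad = evaluate_group([rule_a, rule_b], dict(raw))

if __name__ == "__main__":
    print(good["service:slo_breach"], bad["service:slo_breach"])
```

In the second ordering the dependent rule only catches up on the next evaluation, which is the stale-by-one-interval effect that listing dependencies first avoids.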
Quick Answer
Alertmanager receives alerts from Prometheus, groups related ones by labels, routes them to the right receiver (Slack, PagerDuty, email) using a routing tree, and supports silences to temporarily mute notifications. Grouping reduces noise by batching alerts that share the same labels into one notification.
Detailed Answer
Think of Alertmanager as a hospital triage system. Patients (alerts) arrive and are grouped by condition. Cardiac cases go to cardiology, broken bones go to orthopedics (routing rules). If the hospital is doing planned maintenance on radiology machines, they put up a sign saying ignore false alarms from 2am to 4am (silences). The triage nurse does not page the doctor twice for the same patient (deduplication), and waits a few minutes to batch patients arriving together (group_wait). Alertmanager is a separate process that receives alerts from one or more Prometheus servers through its /api/v2/alerts endpoint. Its configuration defines receivers (notification channels like Slack or PagerDuty), a routing tree (which alerts go where), inhibition rules (suppress certain alerts when others are already firing), and templates for formatting notifications. The routing tree starts with a root route that has a default receiver. Child routes match on alert labels using matchers. Routes are evaluated top to bottom, and the first matching child wins unless continue: true is set, which lets evaluation continue to the next sibling. Grouping is Alertmanager's most important noise reduction feature. When group_by is set to something like [service, environment], all alerts with the same service and environment label values get bundled into a single notification. Three timing settings control notification behavior: group_wait is how long to wait for more alerts before sending the first notification for a new group (default 30 seconds), group_interval is the minimum time between updates to an existing group when new alerts arrive (default 5 minutes), and repeat_interval is how long to wait before resending an unresolved alert (default 4 hours). Getting these right is the difference between a useful alert system and one that either floods your phone or misses real problems. In production, a well-designed routing tree mirrors your organization's on-call structure. 
Critical payment alerts go to PagerDuty for immediate paging. Warning-level alerts for batch jobs go to a Slack channel. Silences are created through the Alertmanager UI or API and match alerts by label matchers -- they are essential during deployments and maintenance windows. Inhibition rules automatically suppress downstream alerts: when the entire cluster is unreachable (KubeAPIDown), you do not want 500 pod alerts flooding the channel. The inhibition rule says if KubeAPIDown is firing, suppress all alerts with the same cluster label. A common mistake is setting group_by to too many labels, like [service, pod, instance]. This creates one notification per pod, which defeats the purpose of grouping. On the other hand, the special value group_by: ['...'] groups nothing -- every alert becomes its own group. Another pitfall is setting repeat_interval too low, causing alert fatigue from constant re-notifications for chronic issues. The sweet spot is usually 4 to 12 hours. Engineers also forget that silences expire. If you create a 1-hour silence for a deployment that takes 2 hours, alerts will resume halfway through. Always add generous padding to silence durations.
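The first-match and continue behavior of the routing tree can be sketched as a recursive walk. This is a simplified Python model under stated assumptions: it supports only exact label equality, every route names its receiver explicitly, and the `Route` class and `resolve` function are illustrative names, not Alertmanager's API. The receiver names are placeholders in the style of this answer.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict = field(default_factory=dict)   # label -> required value
    routes: list = field(default_factory=list)  # child routes, in config order
    continue_: bool = False                     # stands in for `continue: true`


def resolve(route, labels):
    """Return the receivers an alert with these labels is delivered to."""
    if any(labels.get(k) != v for k, v in route.match.items()):
        return []
    receivers = []
    for child in route.routes:
        hits = resolve(child, labels)
        receivers.extend(hits)
        if hits and not child.continue_:
            break  # first matching child wins
    # If no child matched, the current route handles the alert itself.
    return receivers or [route.receiver]


TREE = Route("slack-default", routes=[
    Route("pagerduty-payments", {"severity": "critical", "team": "payments"}),
    Route("pagerduty-general", {"severity": "critical"}),
    Route("slack-default", {"severity": "warning"}, routes=[
        Route("slack-checkout", {"team": "checkout"}),
        Route("slack-payments", {"team": "payments"}),
    ]),
])

if __name__ == "__main__":
    print(resolve(TREE, {"severity": "critical", "team": "payments"}))
    print(resolve(TREE, {"severity": "warning", "team": "search"}))
```

Because sibling order decides the winner, the more specific payments route has to come before the general critical route; swapping them would send every critical alert to the general receiver.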
Code Example
# alertmanager.yml - Alertmanager configuration
global:
  resolve_timeout: 5m  # Mark alert as resolved if not re-received in 5m
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxx'  # Default Slack webhook
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'  # PagerDuty events API
# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'  # Path to custom notification templates
# Routing tree - determines which alerts go to which receivers
route:
  receiver: 'slack-default'  # Default receiver if no child route matches
  group_by: ['alertname', 'service']  # Group alerts by these labels
  group_wait: 30s  # Wait 30s for more alerts before first notification
  group_interval: 5m  # Wait 5m between updates to existing groups
  repeat_interval: 4h  # Re-send unresolved alerts every 4 hours
  routes:
    # Critical payment alerts go to PagerDuty immediately
    - matchers:  # `matchers` replaces the deprecated `match` (Alertmanager 0.22+)
        - severity="critical"  # Match alerts with severity=critical
        - team="payments"  # AND team=payments
      receiver: 'pagerduty-payments'  # Route to PagerDuty
      group_wait: 10s  # Shorter wait for critical alerts
      repeat_interval: 1h  # Re-page every hour if unresolved
      continue: false  # Stop matching after this route
    # All critical alerts (non-payments) go to PagerDuty general
    - matchers:
        - severity="critical"  # Match any critical alert
      receiver: 'pagerduty-general'  # General on-call PagerDuty
      group_wait: 15s  # Quick notification for critical
    # Warning alerts go to team-specific Slack channels
    - matchers:
        - severity="warning"  # Match warning-level alerts
      receiver: 'slack-default'  # Default Slack channel
      routes:
        - matchers:
            - team="checkout"  # Checkout team warnings
          receiver: 'slack-checkout'  # Team-specific Slack channel
        - matchers:
            - team="payments"  # Payments team warnings
          receiver: 'slack-payments'  # Payments Slack channel
# Inhibition rules - suppress alerts when others are firing
inhibit_rules:
  - source_matchers:  # When this alert is firing...
      - alertname="KubeAPIDown"  # Kubernetes API server is down
    target_matchers:  # ...suppress these alerts
      - alertname=~"Kube.*"  # All Kubernetes-related alerts
    equal: ['cluster']  # Only if cluster label matches
  - source_matchers:  # When critical alert is firing...
      - severity="critical"  # For a specific service
    target_matchers:
      - severity="warning"  # Suppress warning alerts
    equal: ['alertname', 'service']  # For the same alert and service
# Receivers - notification channel configurations
receivers:
  - name: 'slack-default'  # Default Slack receiver
    slack_configs:
      - channel: '#alerts-general'  # Slack channel name
        send_resolved: true  # Notify when alert resolves
        title: '{{ .GroupLabels.alertname }}'  # Alert name as title
        text: >-  # Notification body template
          {{ range .Alerts }}
          *{{ .Labels.service }}* - {{ .Annotations.summary }}
          {{ end }}
  - name: 'pagerduty-payments'  # PagerDuty for payments team
    pagerduty_configs:
      - routing_key: 'payments-service-key-xxx'  # Events API v2 integration key (placeholder)
        # severity is not in group_by, so read it from CommonLabels, not GroupLabels
        severity: '{{ .CommonLabels.severity }}'  # Map to PD severity
        description: '{{ .CommonAnnotations.summary }}'  # Alert summary
  - name: 'slack-checkout'  # Checkout team Slack channel
    slack_configs:
      - channel: '#checkout-alerts'  # Team-specific channel
        send_resolved: true  # Send resolution notifications
  - name: 'slack-payments'  # Payments team Slack channel
    slack_configs:
      - channel: '#payments-alerts'  # Team-specific channel
        send_resolved: true  # Send resolution notifications
  - name: 'pagerduty-general'  # General on-call PagerDuty
    pagerduty_configs:
      - routing_key: 'general-oncall-key-xxx'  # General integration key (placeholder)
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│                    Alertmanager Routing Tree                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Prometheus ──→ POST /api/v2/alerts ──→ Alertmanager             │
│                                                                  │
│  ┌──────────────────────────────────────────────────────┐        │
│  │                      Root Route                      │        │
│  │  receiver: slack-default                             │        │
│  │  group_by: [alertname, service]                      │        │
│  │                                                      │        │
│  │  ├── severity=critical AND team=payments             │        │
│  │  │   └── receiver: pagerduty-payments                │        │
│  │  │                                                   │        │
│  │  ├── severity=critical                               │        │
│  │  │   └── receiver: pagerduty-general                 │        │
│  │  │                                                   │        │
│  │  └── severity=warning                                │        │
│  │      ├── team=checkout                               │        │
│  │      │   └── receiver: slack-checkout                │        │
│  │      └── team=payments                               │        │
│  │          └── receiver: slack-payments                │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Grouping Timeline:                                              │
│  ┌────────┐  ┌──────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │ Alert  │→ │group_wait│→ │ 1st Notify   │→ │group_interval│    │
│  │ Arrives│  │  (30s)   │  │ (batch sent) │  │     (5m)     │    │
│  └────────┘  └──────────┘  └──────────────┘  └──────┬───────┘    │
│                                                     │            │
│                                             ┌───────↓───────┐    │
│                                             │repeat_interval│    │
│                                             │     (4h)      │    │
│                                             │  re-send if   │    │
│                                             │  unresolved   │    │
│                                             └───────────────┘    │
└──────────────────────────────────────────────────────────────────┘
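How the choice of group_by labels changes the number of notifications can be illustrated with a small bucketing function. This is a Python sketch of the grouping idea only; `group_alerts` and the sample alerts are invented for the example, and timing settings such as group_wait are not modelled.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alert label sets the way a `group_by` list would."""
    groups = defaultdict(list)
    for labels in alerts:
        if group_by == ["..."]:
            # Special value: group by every label, so each distinct alert is its own group.
            key = tuple(sorted(labels.items()))
        else:
            key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Three pods of the same service firing the same alert (hypothetical data).
POD_ALERTS = [
    {"alertname": "HighErrorRate", "service": "checkout", "pod": f"checkout-{i}"}
    for i in range(3)
]

if __name__ == "__main__":
    for group_by in (["alertname", "service"], ["service", "pod"], ["..."]):
        print(group_by, "->", len(group_alerts(POD_ALERTS, group_by)), "group(s)")
```

Grouping by alertname and service collapses the three pod alerts into one notification, while adding pod to the list, or using the special '...' value, produces one group per pod.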