Netflix

28 interview questions · docker, kubernetes, prometheus

How do you define SLOs, SLIs, and error budgets for a Kubernetes-hosted payments service, and how do they drive engineering decisions?

advanced · monitoring · kubernetes

Quick Answer

SLIs are measurable signals like latency and error rate. SLOs set targets around those SLIs (e.g., 99.95% of payment transactions complete under 500ms). Error budgets are the allowed failure margin — when the budget is exhausted, teams shift from feature work to reliability improvements.

Detailed Answer

Think of a bank account for reliability. Your SLO is like your minimum balance requirement, your SLI is the actual balance, and your error budget is the amount you can spend before hitting that minimum. When you overspend, the bank restricts your account; similarly, when your error budget is exhausted, engineering shifts priorities from new features to reliability work.

SLIs (Service Level Indicators) are concrete measurements taken from your Kubernetes-hosted payments-api service. Common SLIs include request latency at the 99th percentile, error rate as a percentage of total requests, and availability measured as successful health checks over time. For a payments service handling wire transfers and ACH transactions, you might track the percentage of transactions that complete end-to-end within 2 seconds, the rate of 5xx errors returned by the settlements-processor, and the availability of the fraud-detector service during business hours. These metrics are collected via Prometheus ServiceMonitors scraping /metrics endpoints on each pod, and they feed into Grafana dashboards that the platform team monitors.

SLOs (Service Level Objectives) are targets set around SLIs. For a regulated payments service, you might define: 99.95% of payment API requests return successfully within 500ms, 99.99% of settlement batch jobs complete within the processing window, and the fraud-detector must be available 99.97% of the time during trading hours. These SLOs are negotiated between the platform SRE team, product owners, and compliance officers. In banking, SLOs often need to align with regulatory and contractual requirements. PCI DSS governs how cardholder data is protected rather than prescribing uptime percentages, so availability floors typically come from contractual SLAs or regulator guidance; set your SLOs stricter than any such floor to give you breathing room.

Error budgets are calculated as 100% minus the SLO target over a rolling window. If your payments-api has a 99.95% availability SLO over 30 days, your error budget is 0.05%, which translates to roughly 21.6 minutes of allowed downtime per month. The platform team tracks error budget consumption in real time using tools like Sloth or custom Prometheus recording rules. When the budget is more than 50% consumed, alerts fire and the team reviews recent deployments and changes. When the budget is fully consumed, the team enacts a reliability freeze: no new feature deployments to the payments namespace, and all engineering effort shifts to reducing toil, fixing flaky tests, improving observability, and hardening the deployment pipeline.

In production at a bank, the SRE team typically runs weekly error budget review meetings. These meetings examine which incidents consumed budget, whether the consumption came from planned maintenance or unexpected failures, and what systemic improvements would prevent recurrence. The payments-api team might discover that 80% of their error budget was consumed by a single database failover event, leading them to invest in connection pool tuning and read replica routing. The error budget model also drives architectural decisions: if the settlements-processor consistently burns through its budget, the team might propose moving from synchronous to asynchronous processing with Kafka queues, giving the service more resilience against downstream latency spikes.

A critical gotcha in banking environments is that not all SLO violations are equal from a regulatory perspective. A brief latency spike on a read-only account balance endpoint is very different from a data integrity issue on a funds transfer. Teams should implement tiered SLOs, where critical payment paths get stricter targets and separate error budgets from informational endpoints. Another common mistake is setting SLOs too aggressively early on (like 99.99% when your infrastructure can only realistically deliver 99.9%), which results in a permanently exhausted error budget and teams ignoring the system entirely. Start with achievable targets based on historical data, then tighten them as reliability improves.
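The error budget arithmetic described above can be checked with a few lines of Python. This is a minimal sketch, not part of any SLO tool: the function names `error_budget_minutes` and `budget_remaining` are hypothetical, and the 30-day window is taken from the text.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, observed_success_percent: float) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched, <= 0 = exhausted)."""
    allowed_failure = 100.0 - slo_percent
    actual_failure = 100.0 - observed_success_percent
    return 1.0 - actual_failure / allowed_failure


if __name__ == "__main__":
    # The three SLO targets used as examples in this answer.
    for slo in (99.95, 99.97, 99.99):
        print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.1f} min")
    # Hypothetical observation: 99.98% success against a 99.95% SLO.
    print(f"budget remaining: {budget_remaining(99.95, 99.98):.0%}")
```

The remaining-budget function mirrors the shape of the recording rule in the code example: one minus the ratio of actual failure to allowed failure.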

Code Example

# Prometheus recording rules for SLI tracking on payments-api
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: payments-api-slos
  namespace: banking-prod
spec:
  groups:
    - name: payments-api.slos
      interval: 30s
      rules:
        # SLI: Request success rate for payments-api
        - record: payments_api:sli:success_rate:5m
          expr: |
            sum(rate(http_requests_total{job="payments-api",code=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{job="payments-api"}[5m]))
        # SLI: Latency P99 for settlements-processor
        - record: settlements:sli:latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="settlements-processor"}[5m])) by (le)
            )
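        # SLI: 30-day success rate, averaged from the 5m series above
        # (time-weighted rather than traffic-weighted, so an approximation)
        - record: payments_api:sli:success_rate:30d
          expr: avg_over_time(payments_api:sli:success_rate:5m[30d])
        # Error budget remaining (30-day rolling window)
        - record: payments_api:error_budget:remaining
          expr: |
            1 - (
              (1 - payments_api:sli:success_rate:30d)
              /
              (1 - 0.9995)  # SLO target: 99.95%
            )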
---
# Sloth SLO definition for fraud-detector
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: fraud-detector-slos
  namespace: banking-prod
spec:
  service: fraud-detector
  labels:
    team: platform-sre
    tier: critical
  slos:
    - name: availability
      objective: 99.97  # Regulatory minimum is 99.9%
      sli:
        events:
          errorQuery: sum(rate(grpc_server_handled_total{grpc_service="FraudDetector",grpc_code!="OK"}[{{.window}}]))
          totalQuery: sum(rate(grpc_server_handled_total{grpc_service="FraudDetector"}[{{.window}}]))
      alerting:
        name: FraudDetectorAvailability
        pageAlert:
          labels:
            severity: critical
            routing: banking-oncall
        ticketAlert:
          labels:
            severity: warning

◈ Architecture Diagram

┌─────────────────────────────────────────────────────────┐
│                    SLO Framework                        │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │  SLI:    │    │  SLO:    │    │  Error Budget:   │   │
│  │ Actual   │───→│ Target   │───→│ 100% - SLO      │   │
│  │ Metrics  │    │ 99.95%   │    │ = 0.05%          │   │
│  └──────────┘    └──────────┘    └────────┬─────────┘   │
│                                           │             │
│                                           ▼             │
│  ┌────────────────────────────────────────────────────┐  │
│  │              Error Budget Policy                   │  │
│  │                                                    │  │
│  │  Budget > 50%  → Feature development continues     │  │
│  │  Budget < 50%  → Reliability review triggered      │  │
│  │  Budget = 0%   → Reliability freeze enacted        │  │
│  └────────────────────────────────────────────────────┘  │
│                                                         │
│  ┌─────────────┐  ┌───────────────┐  ┌──────────────┐   │
│  │ payments-api│  │ settlements-  │  │ fraud-       │   │
│  │ SLO: 99.95% │  │ processor     │  │ detector     │   │
│  │ Budget: 21m │  │ SLO: 99.99%   │  │ SLO: 99.97%  │   │
│  │ /month      │  │ Budget: 4.3m  │  │ Budget: 13m  │   │
│  └─────────────┘  └───────────────┘  └──────────────┘   │
└─────────────────────────────────────────────────────────┘
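The error budget policy in the diagram can be expressed as a small decision function. This is a sketch under assumptions: `budget_policy` is a hypothetical name, the input is the fraction of budget remaining, and treating exactly 50% as "continue" is a choice made here because the diagram leaves that boundary open.

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map remaining error budget (1.0 = full, 0.0 = exhausted) to a policy action."""
    if remaining_fraction <= 0.0:
        return "reliability-freeze"      # budget exhausted: stop feature deploys
    if remaining_fraction < 0.5:
        return "reliability-review"      # more than half consumed: review recent changes
    return "feature-development"         # healthy budget: ship as normal


if __name__ == "__main__":
    for remaining in (0.8, 0.3, 0.0):
        print(f"{remaining:.0%} remaining -> {budget_policy(remaining)}")
```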

How do you design CI/CD pipelines that deploy to Kubernetes with approval gates, canary analysis, and automatic rollback?

advanced · cicd · kubernetes

Quick Answer

Build a multi-stage pipeline with automated testing gates, manual approval for production, progressive canary deployment using Argo Rollouts or Flagger, metric-based canary analysis from Prometheus, and automatic rollback when error rates or latency exceed thresholds.

Detailed Answer

Think of deploying software to a bank like introducing a new procedure at a branch. You would not roll it out to all 500 branches simultaneously. You would train one branch first (canary), monitor customer satisfaction and error rates for a few days, get management approval (gate), then gradually roll it out to more branches while watching for problems. If complaints spike, you immediately revert to the old procedure. CI/CD pipelines for Kubernetes follow this exact pattern with automation replacing the manual monitoring.

A production-grade CI/CD pipeline for banking has distinct stages. The CI portion runs on every pull request: code checkout, dependency vulnerability scanning (Snyk or Trivy), unit tests, static analysis (SonarQube), container image build, image vulnerability scanning, and push to ECR with a git-SHA tag. The CD portion triggers when code merges to main: deploy to staging, run integration tests against staging, wait for manual approval from a tech lead or release manager, deploy canary to production (5% traffic), run automated canary analysis for 30 minutes, gradually shift traffic (5% → 25% → 50% → 100%), and verify post-deployment health checks. In a regulated bank, the approval gate is not optional: SOX compliance requires documented approval for production changes, and the pipeline must log who approved, when, and what was deployed.

Canary analysis is where the pipeline becomes intelligent. Instead of a human watching dashboards during canary, tools like Argo Rollouts with the Prometheus metrics provider or Flagger with its canary analysis engine automatically compare the canary's metrics against the baseline (stable version). You define success criteria: error rate must be below 1%, P99 latency must be below 500ms, and no new error log patterns. The tool queries Prometheus every 60 seconds during the canary window, compares canary metrics against the stable version's metrics, and makes a pass/fail decision. If a metric fails more checks than its configured failure limit allows, the canary is automatically aborted and traffic returns to the stable version, with no human intervention needed. This is critical for banking because a bad deployment to the payments-api could cause failed transactions, and automatic rollback limits the blast radius to the 5% canary traffic.

Argo Rollouts replaces the standard Kubernetes Deployment with a Rollout resource that supports canary and blue-green strategies natively. The Rollout resource defines the canary steps (traffic weight, pause duration, analysis run), and an AnalysisTemplate defines the Prometheus queries and thresholds. When a new image is pushed, the Rollout controller creates a canary ReplicaSet, configures the Istio VirtualService (or NGINX Ingress) to split traffic, runs the analysis, and either promotes or aborts. The entire process is declarative and version-controlled, so auditors can review the Git history to see exactly what canary criteria were in place for each deployment.

In production at a bank, the pipeline must also handle database migrations, feature flags, and compliance artifacts. Database migrations run before the canary deployment using a Kubernetes Job with a migration container. Feature flags (via LaunchDarkly or Unleash) allow code to be deployed but not activated until the canary is promoted. Compliance artifacts (SBOM, or Software Bill of Materials; vulnerability scan results; approval records; and deployment timestamps) are stored in an immutable artifact store (JFrog Artifactory or AWS CodeArtifact) and linked to the deployment for audit trails. The pipeline also enforces branch protection rules: only code that has passed peer review (minimum two approvals), all CI checks, and security scanning can reach the production deployment stage.

The biggest gotcha is canary analysis that gives false confidence. If your canary only receives 5% of traffic and you are analyzing error rate, low traffic volume means a single error can swing your error rate from 0% to 10%, causing false rollbacks. Use absolute error counts alongside percentages for low-traffic services. Another gotcha is not testing the rollback path: if your canary deployment includes a database migration that is not backward-compatible, rolling back the application while the database has already migrated forward causes data issues. Always make database migrations backward-compatible (add columns but do not remove them until the next release). Finally, approval gates must have timeouts, because a deployment waiting for approval for 48 hours in a banking context creates risk if the codebase has moved on.
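The low-traffic gotcha above can be illustrated with a check that combines the error-rate threshold with absolute counts. This is a minimal sketch and not how Argo Rollouts or Flagger evaluate metrics: `CanaryWindow`, `canary_check`, and the `min_requests` and `min_errors` defaults are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CanaryWindow:
    requests: int  # total requests the canary served in the analysis window
    errors: int    # requests that ended in a 5xx response


def canary_check(window: CanaryWindow,
                 max_error_rate: float = 0.01,
                 min_requests: int = 200,
                 min_errors: int = 3) -> str:
    """Return 'pass', 'fail', or 'inconclusive' for one analysis window."""
    # Too little traffic: a rate computed from a handful of requests is noise.
    if window.requests < min_requests:
        return "inconclusive"
    rate = window.errors / window.requests
    # Fail only when BOTH the rate and the absolute count look bad.
    if rate > max_error_rate and window.errors >= min_errors:
        return "fail"
    return "pass"


if __name__ == "__main__":
    for w in (CanaryWindow(10, 1), CanaryWindow(1000, 5), CanaryWindow(1000, 20)):
        print(w, "->", canary_check(w))
```

Returning "inconclusive" rather than "fail" for thin traffic lets the pipeline extend the analysis window instead of rolling back on a single stray error.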

Code Example

# Argo Rollouts - Canary deployment for payments-api
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
  namespace: banking-prod
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: payments-api
  strategy:
    canary:
      canaryService: payments-api-canary
      stableService: payments-api-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: payments-api-vsvc
              routes:
                - primary
      steps:
        # Step 1: 5% canary traffic + analysis
        - setWeight: 5
        - analysis:
            templates:
              - templateName: payments-api-canary-analysis
            args:
              - name: service-name
                value: payments-api-canary
        # Step 2: Manual approval gate (SOX compliance)
        - pause: {}  # Requires manual promotion
        # Step 3: Gradual rollout
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
      # On abort, keep the canary pods for 30s before scaling them down.
      # The abort itself is triggered by a failed analysis run.
      abortScaleDownDelaySeconds: 30
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payments-api:v2.5.1
          ports:
            - containerPort: 8080
---
# AnalysisTemplate - Prometheus-based canary validation
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-api-canary-analysis
  namespace: banking-prod
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 10        # 10 checks, one per minute
      successCondition: result[0] < 0.01  # < 1% error rate
      failureLimit: 2  # Tolerate up to 2 failed checks in total (not consecutive)
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="{{args.service-name}}",code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="{{args.service-name}}"}[2m]))
    - name: latency-p99
      interval: 60s
      count: 10
      successCondition: result[0] < 0.5  # < 500ms P99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{app="{{args.service-name}}"}[2m])) by (le)
            )
---
# GitHub Actions CI pipeline with security gates
# .github/workflows/payments-api-cicd.yaml
name: payments-api CI/CD
on:
  push:
    branches: [main]
    paths: ['services/payments-api/**']
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: go test ./... -coverprofile=coverage.out
      - name: SonarQube analysis
        uses: sonarsource/sonarqube-scan-action@v2
      - name: Build container image
        run: |
          docker build -t payments-api:${{ github.sha }} \
            -f services/payments-api/Dockerfile .
      - name: Trivy vulnerability scan
        # Pin to a release tag or commit SHA rather than a moving branch in production
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: payments-api:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1  # Fail the pipeline on CRITICAL or HIGH findings
      - name: Generate SBOM for compliance audit trail
        run: syft payments-api:${{ github.sha }} -o spdx-json > sbom.json
      - name: Push to ECR
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
          docker tag payments-api:${{ github.sha }} $ECR_REGISTRY/payments-api:${{ github.sha }}
          docker push $ECR_REGISTRY/payments-api:${{ github.sha }}

◈ Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              CI/CD Pipeline with Canary Analysis                │
│                                                                 │
│  ┌──────┐  ┌──────┐  ┌───────┐  ┌──────┐  ┌──────────────────┐ │
│  │ Code │→ │ Unit │→ │ SAST  │→ │Image │→ │ Trivy Scan +     │ │
│  │Commit│  │ Test │  │Sonar  │  │Build │  │ SBOM Generation  │ │
│  └──────┘  └──────┘  └───────┘  └──────┘  └────────┬─────────┘ │
│                                                     │           │
│                          ┌──────────────────────────▼────────┐  │
│                          │         Staging Deploy            │  │
│                          │    Integration Tests + E2E        │  │
│                          └──────────────┬────────────────────┘  │
│                                         │                       │
│                          ┌──────────────▼────────────────────┐  │
│                          │    Manual Approval Gate           │  │
│                          │    (SOX: who + when + what)       │  │
│                          └──────────────┬────────────────────┘  │
│                                         │                       │
│  ┌──────────────────────────────────────▼─────────────────────┐ │
│  │                  Canary Deployment                         │ │
│  │                                                            │ │
│  │   5% ──→ Analysis ──→ 25% ──→ 50% ──→ 100%               │ │
│  │           (Prometheus)                                     │ │
│  │           error rate < 1%                                  │ │
│  │           P99 < 500ms         ┌──────────────────┐        │ │
│  │                               │ Auto Rollback    │        │ │
│  │   If analysis fails ─────────→│ (abort canary,   │        │ │
│  │                               │  restore stable) │        │ │
│  │                               └──────────────────┘        │ │
│  └────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘

How do you implement chaos engineering in Kubernetes using LitmusChaos or Gremlin, and how do you control blast radius?

advanced · general · kubernetes

Quick Answer

Chaos engineering systematically injects failures (pod kills, network latency, disk stress) into Kubernetes to validate resilience. Blast radius is controlled through namespace targeting, label selectors, percentage-based injection, mandatory rollback plans, and SLO-based abort conditions.

Detailed Answer

Think of chaos engineering like a fire drill in a bank's headquarters. You don't set the entire building on fire to test the evacuation plan. You trigger a controlled alarm in one wing, observe how people respond, time the evacuation, identify bottlenecks at stairwells, and then expand to more wings only after the first drill succeeds. The 'blast radius' is which wing you pick and how many floors are involved. Chaos engineering in Kubernetes follows the same principle: start small, observe carefully, expand gradually.

Before running any experiment, you must define the steady state hypothesis tied to your SLOs. For a banking payments platform, steady state might be: p99 latency for /api/payments is below 200ms, error rate is below 0.1%, and transaction throughput is above 500 TPS. The experiment asks: 'If we kill 30% of payments-api pods, does the system maintain steady state?' If it does, your horizontal scaling and readiness probes work. If it doesn't, you've found a resilience gap before a real incident exposes it at 3 AM.

LitmusChaos is a CNCF project that runs as a Kubernetes operator. You install the ChaosCenter (control plane) and deploy ChaosEngine resources that reference ChaosExperiment templates. LitmusChaos provides a library of pre-built experiments: pod-delete, pod-network-latency, pod-cpu-hog, node-drain, disk-fill, and more. Each experiment has configurable parameters for blast radius: you specify the target namespace, label selectors, number of pods to affect, and duration. The ChaosEngine runs the experiment as a Kubernetes Job, injects the failure, observes the results via probes (HTTP health checks, Prometheus queries, custom scripts), and reports pass/fail. Gremlin is a commercial alternative that provides a SaaS control plane with a richer UI, team-based RBAC, and pre-built attack scenarios. Gremlin deploys a DaemonSet on your cluster that receives attack commands from the Gremlin cloud API.

Blast radius control is the most critical aspect and requires multiple layers. First, always start in non-production environments; your staging cluster should mirror production topology (same number of replicas, same resource limits, same network policies). Second, use namespace isolation: target only the payments namespace, and never run experiments against kube-system or cert-manager namespaces that affect the entire cluster. Third, use percentage-based targeting: affect 1 out of 5 pods first, then 2, then 3, and never jump to 100%. Fourth, set experiment duration limits: a 30-second pod kill is recoverable; a 30-minute network partition might trigger cascading failures in downstream services. Fifth, define abort conditions: if error rate exceeds 1% or latency exceeds 500ms, automatically halt the experiment.

In a banking environment, chaos engineering requires additional governance. Every experiment needs a formal approval process, a documented rollback plan, and on-call engineers explicitly notified before execution. Compliance frameworks like SOC 2 and PCI DSS require evidence that resilience testing was controlled and documented. Run experiments during business hours when the team is fully staffed, never on Fridays or before holidays. Use GameDay scheduling in your chaos platform to coordinate experiments across teams and ensure only one experiment runs at a time to prevent compounding failures.

The gotcha that catches most teams: chaos experiments expose not just application weaknesses but observability gaps. Your first pod-kill experiment might pass because the app recovers in 5 seconds. But if your alerting didn't fire, your dashboards didn't show the event, and your on-call engineer had no idea an experiment was running, you've discovered that your monitoring is blind to real failures. The most valuable outcome of chaos engineering is often improving your observability stack, not your application code.
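Two of the blast radius layers above, percentage-based targeting and abort conditions, can be sketched as plain functions. This is an illustration only: `pods_to_target` and `should_abort` are hypothetical helpers, the rounding rule is an assumption and not a statement of how LitmusChaos or Gremlin select pods, and the 1% and 500ms thresholds are the ones named in the text.

```python
def pods_to_target(total_pods: int, percent: int) -> int:
    """Pods to disrupt for a given percentage, always leaving one pod untouched."""
    if total_pods <= 1 or percent <= 0:
        return 0
    wanted = (total_pods * percent + 99) // 100  # integer ceiling of total * percent / 100
    return min(wanted, total_pods - 1)


def should_abort(error_rate_pct: float, p99_latency_ms: float,
                 max_error_rate_pct: float = 1.0,
                 max_p99_ms: float = 500.0) -> bool:
    """Halt the experiment when either abort condition is breached."""
    return error_rate_pct > max_error_rate_pct or p99_latency_ms > max_p99_ms


if __name__ == "__main__":
    # Staged expansion on a 5-pod deployment: 20%, then 40%, then 60%.
    for pct in (20, 40, 60):
        print(f"{pct}% of 5 pods -> disrupt {pods_to_target(5, pct)}")
    print("abort?", should_abort(error_rate_pct=1.4, p99_latency_ms=180.0))
```

Capping the result at one less than the total is a deliberately conservative choice so that a misconfigured percentage can never take out every replica at once.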

Code Example

# Install LitmusChaos operator on the cluster
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP

# Install chaos experiments for the payments namespace
# (URL is quoted so the shell does not treat "?" as a glob character)
kubectl apply -n payments \
  -f "https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/experiments.yaml"

# Define a ChaosEngine targeting payments-api pods
# with strict blast radius controls
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payments-pod-kill
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=payments-api         # Only target payments-api pods
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "30"                  # Only 30 seconds of chaos
        - name: CHAOS_INTERVAL
          value: "10"                  # Kill a pod every 10 seconds
        - name: PODS_AFFECTED_PERC
          value: "30"                  # Only affect 30% of pods
        - name: FORCE
          value: "false"               # Graceful termination first
      probe:
      - name: payments-health-check
        type: httpProbe
        httpProbe/inputs:
          url: http://payments-api.payments.svc:8080/health
          insecureSkipVerify: false
          responseTimeout: 3000        # 3s timeout
          method:
            get:
              criteria: ==
              responseCode: "200"
        mode: Continuous
        runProperties:
          probeTimeout: 5
          retry: 3
          interval: 5
      - name: slo-error-rate
        type: promProbe
        promProbe/inputs:
          endpoint: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(http_requests_total{service="payments-api",code=~"5.."}[1m]))
            / sum(rate(http_requests_total{service="payments-api"}[1m])) * 100
          comparator:
            type: float
            criteria: "<="
            value: "1.0"               # Probe fails if error rate > 1%
        mode: Continuous

# Monitor chaos experiment status
kubectl get chaosresult payments-pod-kill-pod-delete -n payments -o yaml

# Check if experiment passed or failed
kubectl get chaosengine payments-pod-kill -n payments \
  -o jsonpath='{.status.experiments[0].status}'

◈ Architecture Diagram

┌─────────── Chaos Engineering Workflow ──────────────────┐
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │ 1. Define Steady State (SLO-based)               │   │
│  │    • p99 latency < 200ms                         │   │
│  │    • Error rate < 0.1%                           │   │
│  │    • Throughput > 500 TPS                        │   │
│  └──────────────────────┬───────────────────────────┘   │
│                         ▼                               │
│  ┌──────────────────────────────────────────────────┐   │
│  │ 2. Blast Radius Controls                         │   │
│  │    ┌────────────┐  ┌──────────┐  ┌───────────┐  │   │
│  │    │ Namespace  │  │ Label    │  │ % Pods    │  │   │
│  │    │ isolation  │  │ selector │  │ affected  │  │   │
│  │    │ (payments) │  │ (app=    │  │ (30%)     │  │   │
│  │    │            │  │  pay-api)│  │           │  │   │
│  │    └────────────┘  └──────────┘  └───────────┘  │   │
│  └──────────────────────┬───────────────────────────┘   │
│                         ▼                               │
│  ┌──────────────────────────────────────────────────┐   │
│  │ 3. Run Experiment                                │   │
│  │                                                  │   │
│  │  ChaosEngine → Pod Kill Job → Inject Failure     │   │
│  │       │                            │             │   │
│  │       │    ┌────────────────────────┤             │   │
│  │       ▼    ▼                       ▼             │   │
│  │    HTTP Probe    Prom Probe    Abort Condition    │   │
│  │    (health OK?)  (SLO met?)   (error > 1%? STOP) │   │
│  └──────────────────────┬───────────────────────────┘   │
│                         ▼                               │
│  ┌──────────────────────────────────────────────────┐   │
│  │ 4. Analyze Results                               │   │
│  │    PASS → expand scope in next iteration         │   │
│  │    FAIL → document gap → create remediation      │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

How do you identify whether a pod restart is caused by OOMKilled, a connectivity failure, or an application-level bug?

advanced · pods · kubernetes

Quick Answer

Check the container's last termination reason and exit code: OOMKilled shows reason=OOMKilled with exit 137, connectivity failures show timeout errors in application logs with exit 1, and application bugs show stack traces or panic messages in logs with exit 1 or 2. The distinction comes from correlating exit codes, termination reasons, and log content.

Detailed Answer

Think of a car that keeps stalling. A mechanic checks three things in order: is it out of fuel (OOMKilled, out of memory), is the road blocked (connectivity, cannot reach a dependency), or is the engine itself broken (application bug). Each has a different diagnostic signature, and checking them in the right order saves time. In Kubernetes, every container termination has metadata that points to the cause. The container status records the exit code, the termination reason, and the termination message.

OOMKilled is the clearest: Kubernetes sets the reason field to OOMKilled and the exit code to 137 (128 + 9, the signal number of SIGKILL). This means the kernel's Out-Of-Memory killer terminated the process because it exceeded its cgroup memory limit. The container did not choose to exit; it was killed by the kernel.

For connectivity failures, the exit code is typically 1 (generic application error) and the logs show timeout or connection refused messages when trying to reach a database, cache, or external API. The key diagnostic is checking the application logs for patterns like 'connection refused,' 'timeout,' 'no such host,' or 'TLS handshake failed.' You can verify by execing into the pod and testing connectivity manually with nc, curl, or nslookup to isolate whether it is a DNS, network policy, or service availability issue.

For application bugs, the exit code is 1 or sometimes 2 (by shell convention a usage error, and also the status of an unrecovered Go panic), and the logs show stack traces, null pointer exceptions, panic messages, or assertion failures. These crashes are usually deterministic: the same input or configuration triggers the same crash. You can distinguish them from connectivity issues because the error occurs during request processing or startup logic, not during a connection attempt.

The non-obvious gotcha is that OOMKilled can masquerade as an application bug if you only check logs. When the OOM killer strikes, the process is terminated immediately, so there may be no log line because the application never got a chance to write one. If you see a container with exit code 137, zero log output, and high restart count, check the termination reason field directly. Also, a JVM application may show exit code 1 with a java.lang.OutOfMemoryError in logs if it hits the JVM heap limit before hitting the cgroup limit. This is an application-level OOM, not a kernel OOMKill, and the fix is different (increase JVM heap, not container memory limit).
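The triage order described above can be captured in a small classifier. This is a sketch under assumptions: `classify_restart` and its return labels are hypothetical, the inputs are the termination reason, exit code, and previous-container logs you would fetch with kubectl, and the log patterns are just the examples named in the text, so real applications will need their own.

```python
import re

# Log patterns taken from the examples in the text; extend for your own stack.
CONNECTIVITY = re.compile(
    r"connection refused|timeout|timed out|no such host|tls handshake", re.IGNORECASE)
APP_BUG = re.compile(r"traceback|panic:|exception|assertion", re.IGNORECASE)


def classify_restart(reason: str, exit_code: int, logs: str) -> str:
    """Classify a container restart from its termination metadata and logs."""
    if reason == "OOMKilled":
        return "oom-killed"            # kernel OOM killer; logs may be empty
    if "OutOfMemoryError" in logs:
        return "jvm-heap-oom"          # application-level OOM, not a kernel kill
    if CONNECTIVITY.search(logs):
        return "connectivity"
    if APP_BUG.search(logs):
        return "application-bug"
    if exit_code == 137:
        return "sigkill-unknown"       # SIGKILL without an OOMKilled reason
    return "unknown"


if __name__ == "__main__":
    print(classify_restart("OOMKilled", 137, ""))
    print(classify_restart("Error", 1, "dial tcp 10.0.0.5:5432: connection refused"))
    print(classify_restart("Error", 2, "panic: runtime error: nil pointer dereference"))
```

Checking the termination reason before the logs matters because an OOM-killed container often leaves no log output at all.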

Code Example

# Step 1: Check termination reason and exit code
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason, exitCode, startedAt, finishedAt

# Step 2: If exit code 137, confirm OOMKilled
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -i 'oom\|killed\|reason' # Confirms OOMKilled

# Step 3: Check memory usage vs limits for OOMKilled
kubectl top pod payments-api-7d9f8b6c4-abc12 -n payments # Current memory usage
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}' # Configured memory limit

# Step 4: If exit code 1, check logs for connectivity vs application error
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=100 # Check for timeout/connection vs stack trace

# Step 5: Test connectivity from inside the pod
kubectl exec -n payments deploy/payments-api -- nc -zv payments-db.internal 5432 # Test database connectivity
kubectl exec -n payments deploy/payments-api -- nslookup redis-cache.payments.svc # Test DNS resolution

# Quick reference for exit codes:
# Exit 0   = Normal termination (container completed successfully)
# Exit 1   = Application error (check logs for stack trace or connection error)
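# Exit 137 = 128 + 9, SIGKILL (OOMKilled by the kernel, or force-killed by the kubelet after the grace period expired)
# Exit 143 = 128 + 15, SIGTERM (termination was requested, e.g. Pod deletion, eviction, or a liveness probe failure, and the process exited on the signal)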

◈ Architecture Diagram

┌──────────────┐
│ Pod Restart  │
└──────┬───────┘
       ↓
┌──────────────┐
│ Exit Code?   │
├──────┬───────┤
│ 137  │  1    │
│ OOM  │ Logs? │
└──┬───┴───┬───┘
   ↓       ↓
┌─────┐ ┌────────┐
│ OOM │ │Timeout?│
│Kill │ ├────┬───┤
└─────┘ │Yes │No │
        ↓    ↓
     ┌────┐┌────┐
     │Conn││ Bug│
     └────┘└────┘

How do HPA and VPA work together for autoscaling in production?

advanced · scheduling · kubernetes

Quick Answer

HPA (Horizontal Pod Autoscaler) scales the number of Pod replicas based on metrics like CPU or custom metrics, while VPA (Vertical Pod Autoscaler) adjusts individual Pod resource requests and limits. Using them together requires careful configuration to avoid conflicts where both try to respond to the same metric.

Detailed Answer

Imagine a restaurant kitchen during peak hours. Horizontal scaling is hiring more cooks to handle more orders in parallel: each cook handles a portion of the workload. Vertical scaling is upgrading your existing cooks to faster, more skilled chefs who can each handle more complex dishes. In practice, you need both strategies: more cooks for sheer volume, and better-equipped cooks so each one operates efficiently. That is the relationship between HPA and VPA in Kubernetes. HPA watches specified metrics (CPU utilization, memory, or custom/external metrics) and adjusts the replica count of a Deployment, ReplicaSet, or StatefulSet. It runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period) that calculates the desired replicas using the formula: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). VPA, on the other hand, monitors actual resource consumption of Pods over time and recommends or automatically updates the resource requests and limits in the Pod spec. VPA's commonly used update modes are Off (recommendations only), Initial (sets requests only at Pod creation), Recreate (evicts and recreates Pods with updated requests), and Auto (currently equivalent to Recreate). Internally, HPA queries the metrics API (metrics.k8s.io for resource metrics, custom.metrics.k8s.io for custom metrics, or external.metrics.k8s.io for external sources like Prometheus). The metrics-server or a Prometheus adapter populates these APIs. HPA calculates the ratio of current to desired metric values across all Pods, applies a tolerance (default 10%) to prevent flapping, and issues a scale request to the API server. VPA consists of three components: the Recommender (analyzes historical usage and computes recommendations), the Updater (evicts Pods whose requests deviate significantly from recommendations), and the Admission Controller (mutates Pod specs at creation time to inject recommended requests). 
The Recommender uses a decaying histogram of resource usage to generate its recommendations. In production, running HPA and VPA together on the same metric (like CPU) creates a conflict. HPA sees high CPU and adds replicas; VPA sees high CPU and increases requests per Pod. Both react to the same signal, leading to over-provisioning or oscillation. The recommended pattern is to use HPA for CPU-based scaling and VPA in recommendation-only mode (mode: Off) so operators can manually adjust requests based on VPA suggestions. Alternatively, use HPA with custom metrics (like requests-per-second from Prometheus) and let VPA manage CPU and memory requests in Auto mode, since they are responding to different signals. Multidimensional Pod Autoscaler (MPA), available in some managed Kubernetes distributions, attempts to coordinate both axes natively. A non-obvious gotcha is that VPA in Auto mode evicts Pods to apply new resource requests, which means it causes rolling restarts that can impact availability if your PodDisruptionBudget is not configured correctly. Another trap: HPA uses resource requests as the baseline for percentage calculations (e.g., 80% CPU target means 80% of the CPU request), so if VPA increases the request, the same absolute CPU usage now represents a lower percentage, potentially causing HPA to scale in and reduce replicas. This feedback loop can destabilize your scaling behavior. Always set VPA minAllowed and maxAllowed bounds to prevent runaway resource allocation, and use HPA stabilization windows (behavior.scaleDown.stabilizationWindowSeconds) to dampen rapid fluctuations.
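
The replica formula and the request-based feedback loop described above are easier to see with numbers. The sketch below is a simplified Python model, assuming a single CPU metric and the default 10% tolerance; the function names are illustrative, and the real controller additionally handles missing metrics, unready Pods, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.10) -> int:
    """desiredReplicas = ceil(currentReplicas * current / target), with tolerance."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """HPA utilization is measured relative to the CPU request, not the limit."""
    return 100.0 * usage_millicores / request_millicores


# 10 replicas, each using 400m against a 500m request: 80% utilization, on target.
before = cpu_utilization_percent(400, 500)
print(desired_replicas(10, before, target_value=80))

# VPA raises the request to 1000m; usage is unchanged but utilization halves,
# so HPA now wants to scale in even though the load did not drop.
after = cpu_utilization_percent(400, 1000)
print(desired_replicas(10, after, target_value=80))
```

The second call is the trap from the paragraph above: nothing about the traffic changed, yet the larger request makes the same usage look like 40% utilization and the desired replica count drops from 10 to 5.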

Code Example

# HPA scaling on custom metric (requests-per-second) to avoid conflict with VPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa  # HPA for the payments API service
  namespace: payments  # Same namespace as the target Deployment
spec:
  scaleTargetRef:  # Reference to the workload being scaled
    apiVersion: apps/v1  # API version of the target
    kind: Deployment  # Scale a Deployment
    name: payments-api  # Name of the Deployment
  minReplicas: 3  # Never go below 3 replicas for HA
  maxReplicas: 25  # Cap at 25 to control costs
  metrics:  # Use custom metric to avoid conflict with VPA on CPU
    - type: Pods  # Per-pod custom metric
      pods:
        metric:
          name: http_requests_per_second  # Custom metric from Prometheus adapter
        target:
          type: AverageValue  # Target average across all Pods
          averageValue: "100"  # Scale up when RPS exceeds 100 per Pod
  behavior:  # Fine-tune scaling behavior to prevent flapping
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Percent  # Scale down by percentage
          value: 10  # Remove at most 10% of Pods per period
          periodSeconds: 60  # Evaluate every 60 seconds
    scaleUp:
      stabilizationWindowSeconds: 30  # React quickly to traffic spikes
      policies:
        - type: Pods  # Scale up by fixed number
          value: 4  # Add at most 4 Pods per period
          periodSeconds: 60  # Evaluate every 60 seconds
---
# VPA managing CPU and memory requests (Auto mode safe since HPA uses custom metric)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa  # VPA for the same payments API
  namespace: payments  # Same namespace
spec:
  targetRef:  # Reference to the workload
    apiVersion: apps/v1  # API version of the target
    kind: Deployment  # Target Deployment
    name: payments-api  # Same Deployment as HPA targets
  updatePolicy:
    updateMode: Auto  # Automatically evict and recreate Pods with new requests
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api  # Apply to the main container
        minAllowed:  # Floor to prevent under-provisioning
          cpu: 250m  # Minimum 250 millicores
          memory: 256Mi  # Minimum 256MB
        maxAllowed:  # Ceiling to prevent runaway costs
          cpu: "2"  # Maximum 2 CPU cores
          memory: 2Gi  # Maximum 2GB memory
        controlledResources:  # Only manage these resources
          - cpu  # VPA manages CPU requests
          - memory  # VPA manages memory requests

◈ Architecture Diagram

┌──────────┐         ┌──────────┐
│  Metrics │         │  Metrics │
│  Server  │         │Prometheus│
└────┬─────┘         └────┬─────┘
     │ cpu/mem             │ rps
     ↓                     ↓
┌──────────┐         ┌──────────┐
│   VPA    │         │   HPA    │
│ Adjusts  │         │ Adjusts  │
│ Requests │         │ Replicas │
└────┬─────┘         └────┬─────┘
     │                     │
     └──────────┬──────────┘
                ↓
         ┌──────────┐
         │ Payments │
         │   API    │
         │ Deploy   │
         └──────────┘

How do pod topology spread constraints work internally in the Kubernetes scheduler, and what production failures can occur when they interact with cluster autoscaling?

architect · scheduling · kubernetes

Quick Answer

Topology spread constraints tell the scheduler to distribute Pods across failure domains defined by node labels such as zone or hostname, using maxSkew to control imbalance. When combined with cluster autoscaling, problems arise if a zone has zero nodes — the autoscaler may not know about the zone, causing the scheduler to leave Pods pending indefinitely.

Detailed Answer

Think of seating guests at a wedding reception. You want to spread friends evenly across tables so no table is overcrowded and no group is isolated. The wedding planner checks how many people are at each table and seats the next guest at the most empty one, but if a table does not exist yet (no physical table has been set up), the planner cannot seat anyone there even if the venue has room. Topology spread constraints in Kubernetes work the same way. Kubernetes topology spread constraints are declared in the Pod spec under topologySpreadConstraints. Each constraint specifies a topologyKey (a node label like topology.kubernetes.io/zone or kubernetes.io/hostname), a maxSkew (the maximum allowed difference in Pod count between the most-populated and least-populated domain), a whenUnsatisfiable behavior (DoNotSchedule or ScheduleAnyway), and a labelSelector to identify which Pods count toward the spread calculation. Internally, the scheduler evaluates topology spread during the Filter and Score phases. In the Filter phase, it eliminates nodes where placing the Pod would violate the maxSkew when whenUnsatisfiable is DoNotSchedule. In the Score phase, it ranks remaining nodes by how well they balance the distribution. The scheduler considers the topologyKey label on existing nodes to define domains — a domain only exists if at least one node carries that label value. It then counts matching Pods per domain and calculates whether the new Pod can land in each domain without exceeding maxSkew. At production scale, the interaction with cluster autoscaling creates subtle failures. If a node pool in one availability zone scales to zero, that zone disappears from the scheduler's topology map. The scheduler only sees zones with active nodes, so it may consider a two-zone spread sufficient even when three zones are available. 
When maxSkew is 1 and whenUnsatisfiable is DoNotSchedule, the scheduler can leave Pods pending because it cannot place them in a zone that has no nodes, and the autoscaler may not create a node in the missing zone because it does not see pending Pods that specifically require it. This chicken-and-egg problem is one of the most common production issues with topology spread constraints. The non-obvious gotcha is that the spread calculation counts every Pod that matches the labelSelector, whether or not it is Ready, and during a rolling update that includes Pods from the old ReplicaSet (recent scheduler versions skip Pods that are already terminating). Because old and new revisions share the same labels, new Pods are placed around old ones that are about to disappear, so a rollout can leave new Pods temporarily unschedulable or finish with an uneven distribution. Where the cluster version supports it, adding pod-template-hash to matchLabelKeys restricts the count to the current revision. Architects should set minDomains to explicitly declare how many zones the spread should consider, use node affinity in combination with spread constraints to ensure the autoscaler knows about expected zones, and monitor for unschedulable Pods with topology spread violation events.
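
The skew check for a DoNotSchedule constraint can be sketched in a few lines. This is a simplified Python model, assuming one constraint and a precomputed count of matching Pods per visible domain; the function and parameter names are illustrative and do not mirror scheduler internals.

```python
from typing import Dict, Optional


def can_schedule(pods_per_domain: Dict[str, int], candidate: str,
                 max_skew: int, min_domains: Optional[int] = None) -> bool:
    """Return True if placing one more Pod in `candidate` keeps skew <= max_skew."""
    # Only domains that currently have nodes appear in pods_per_domain.
    global_min = min(pods_per_domain.values())
    # With minDomains set and too few visible domains, the minimum is treated as 0.
    if min_domains is not None and len(pods_per_domain) < min_domains:
        global_min = 0
    skew_after_placement = pods_per_domain[candidate] + 1 - global_min
    return skew_after_placement <= max_skew


# Zone C has scaled to zero nodes, so the scheduler only sees zones A and B.
visible = {"zone-a": 2, "zone-b": 2}
print(can_schedule(visible, "zone-a", max_skew=1))                 # two zones look balanced
print(can_schedule(visible, "zone-a", max_skew=1, min_domains=3))  # third zone is expected
```

Without minDomains the two visible zones look balanced and the Pod is admitted, which is the silent two-zone spread described above. With minDomains set to 3 the Pod stays Pending, which surfaces the missing zone instead of hiding it.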

Code Example

# Apply a Deployment with zone and node spread constraints
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages replicated Pods
metadata:
  name: checkout-api # Production checkout service
  namespace: payments # Team namespace
spec:
  replicas: 6 # Six replicas to spread across three zones with two per zone
  selector:
    matchLabels:
      app: checkout-api # Pod selector
  template:
    metadata:
      labels:
        app: checkout-api # Label used by spread constraint selector
    spec:
      topologySpreadConstraints:
      - maxSkew: 1 # Allows at most one Pod difference between zones
        topologyKey: topology.kubernetes.io/zone # Spreads across availability zones
        whenUnsatisfiable: DoNotSchedule # Strictly enforces zone balance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
        minDomains: 3 # Expects three zones even if some have zero nodes
      - maxSkew: 1 # Allows at most one Pod difference between nodes within a zone
        topologyKey: kubernetes.io/hostname # Spreads across individual nodes
        whenUnsatisfiable: ScheduleAnyway # Prefers balance but allows imbalance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
      containers:
      - name: api # Application container
        image: registry.company.com/checkout-api:3.7.2 # Versioned production image
        resources:
          requests:
            cpu: 250m # Minimum CPU for scheduling
            memory: 512Mi # Minimum memory for scheduling

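# Check Pod distribution across zones: the zone label is set on nodes, not Pods,
# so list Pod-to-node placement first, then look up each node's zone
kubectl get pods -n payments -l app=checkout-api -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'
kubectl get nodes -L topology.kubernetes.io/zone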

# Identify Pods pending due to topology spread violations
kubectl get events -n payments --field-selector reason=FailedScheduling | grep topology

◈ Architecture Diagram

┌─── Zone A ──┐ ┌─── Zone B ──┐ ┌─── Zone C ──┐
│ ┌────┐┌────┐│ │ ┌────┐┌────┐│ │ ┌────┐┌────┐│
│ │Pod1││Pod2││ │ │Pod3││Pod4││ │ │Pod5││Pod6││
│ └────┘└────┘│ │ └────┘└────┘│ │ └────┘└────┘│
│  maxSkew=1  │ │  maxSkew=1  │ │  maxSkew=1  │
└─────────────┘ └─────────────┘ └─────────────┘

How should architects combine Vertical Pod Autoscaler and Horizontal Pod Autoscaler in the same cluster without creating scaling conflicts, and when does KEDA fit better than either?

architect · general · kubernetes

Quick Answer

VPA and HPA conflict when both scale on the same metric because VPA resizes Pods while HPA changes Pod count, creating feedback loops. The safe pattern is VPA on memory and HPA on CPU or custom metrics, or VPA in recommendation-only mode. KEDA fits better for event-driven workloads that scale to zero or react to external queue depth rather than Pod-level CPU or memory.

Detailed Answer

Think of a restaurant kitchen during dinner rush. The horizontal approach is adding more cooks to handle more orders. The vertical approach is giving each cook a bigger stove and more counter space so they can cook faster. If you try both approaches based on the same signal (how backed up the order queue is), the kitchen oscillates between adding cooks and upgrading stoves in a confusing loop. You need different signals for each scaling axis. The Horizontal Pod Autoscaler watches metrics like CPU utilization, memory utilization, or custom metrics, and adjusts the number of Pod replicas to meet a target value. The Vertical Pod Autoscaler observes actual resource consumption over time and recommends or applies changes to the resource requests and limits of individual Pods. When both use CPU as their metric, VPA may increase a Pod's CPU request, which lowers the Pod's utilization percentage, which then triggers HPA to scale in, which raises per-Pod utilization again, creating an oscillation loop. The recommended production pattern separates their concerns. Run VPA in recommendation mode (updateMode: Off) or restrict it to memory through controlledResources, while HPA scales on CPU or custom metrics. Alternatively, use VPA in Auto mode for stateful workloads where horizontal scaling is impractical, such as databases, caches, or single-instance controllers, and reserve HPA for stateless services that benefit from replica scaling. Some teams use VPA recommendations to set initial resource requests in CI/CD pipelines rather than letting VPA mutate Pods at runtime, which avoids the Pod eviction and restart that VPA triggers when it applies new requests. At production scale, KEDA (Kubernetes Event-driven Autoscaling) fills a gap that neither HPA nor VPA addresses well. KEDA scales based on external event sources: message queue depth in Kafka or SQS, pending items in a Redis stream, HTTP request rate from Prometheus, or custom metrics from any source with a KEDA scaler. 
Critically, KEDA can scale Deployments to zero replicas when there is no work, which standard HPA cannot do (HPA's minimum is one). This makes KEDA the right choice for batch processors, event consumers, and asynchronous workers where idle cost matters. KEDA works by deploying ScaledObject resources that create HPA objects under the hood, so it integrates with existing Kubernetes autoscaling infrastructure. The non-obvious gotcha is that VPA in Auto mode restarts Pods when it changes resource requests, which can cause brief service disruption and interact badly with PodDisruptionBudget limits. Teams often discover this during peak traffic when VPA decides to right-size all replicas simultaneously. Architects should set VPA update policies with minReplicas and eviction requirements, and test VPA behavior during high-traffic scenarios before enabling Auto mode on critical services.
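
The scale-to-zero behaviour that distinguishes KEDA from a plain HPA can be sketched as follows. This is a deliberately simplified Python model, assuming a single queue-depth trigger; the function and parameter names are illustrative, and polling and cooldown timing are ignored.

```python
import math


def desired_workers(queue_depth: int, target_per_replica: int,
                    max_replicas: int, activation_threshold: int = 0) -> int:
    """Replica count for a queue consumer that may idle at zero replicas."""
    # At or below the activation threshold there is no work: scale to zero.
    if queue_depth <= activation_threshold:
        return 0
    # Once active, size the fleet like an AverageValue target, with at least one Pod.
    wanted = math.ceil(queue_depth / target_per_replica)
    return min(max_replicas, max(1, wanted))


for depth in (0, 1, 120, 5000):
    print(depth, desired_workers(depth, target_per_replica=50, max_replicas=20))
```

The zero case is the part a standard HPA does not cover by default. Everything above zero is the same proportional arithmetic HPA already performs, which is why KEDA can delegate it to a generated HPA object.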

Code Example

# Deploy VPA in recommendation mode for a service — no automatic Pod restarts
apiVersion: autoscaling.k8s.io/v1 # VPA API group
kind: VerticalPodAutoscaler # Recommends or applies resource changes
metadata:
  name: checkout-api-vpa # VPA for the checkout service
  namespace: payments # Same namespace as the target
spec:
  targetRef:
    apiVersion: apps/v1 # References a Deployment
    kind: Deployment # Target workload type
    name: checkout-api # The Deployment to analyze
  updatePolicy:
    updateMode: "Off" # Generates recommendations without applying them
  resourcePolicy:
    containerPolicies:
    - containerName: api # Target the main container
      minAllowed:
        cpu: 100m # Never recommend below 100m CPU
        memory: 256Mi # Never recommend below 256Mi memory
      maxAllowed:
        cpu: 2 # Cap recommendations at 2 CPU cores
        memory: 4Gi # Cap recommendations at 4Gi memory

---
# Deploy HPA scaling on CPU utilization (safe alongside the VPA above, which only recommends)
apiVersion: autoscaling/v2 # HPA v2 API for custom metrics support
kind: HorizontalPodAutoscaler # Scales replica count
metadata:
  name: checkout-api-hpa # HPA for the checkout service
  namespace: payments # Same namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1 # References the same Deployment
    kind: Deployment # Target workload type
    name: checkout-api # The Deployment to scale
  minReplicas: 3 # Never scale below three replicas
  maxReplicas: 20 # Cap at twenty replicas
  metrics:
  - type: Resource # Uses built-in resource metrics
    resource:
      name: cpu # Scales on CPU utilization only
      target:
        type: Utilization # Target a percentage of requests
        averageUtilization: 70 # Scale up when average CPU exceeds 70 percent

# Read VPA recommendations without applying them
kubectl get vpa checkout-api-vpa -n payments -o jsonpath='{.status.recommendation.containerRecommendations[*]}'

◈ Architecture Diagram

┌─────────────────────────────────┐
│        Scaling Decision         │
├───────────┬───────────┬─────────┤
│  HPA      │  VPA      │  KEDA   │
│ ─ ─ ─ ─  │ ─ ─ ─ ─   │ ─ ─ ─  │
│ CPU/custom│ Memory    │ Queue   │
│ → replicas│ → sizing  │ → 0..N │
│ stateless │ stateful  │ events  │
└───────────┴───────────┴─────────┘

How does Karpenter differ from Cluster Autoscaler in node provisioning strategy, and when should architects choose one over the other for production workloads?

architect · scheduling · kubernetes

Quick Answer

Karpenter provisions nodes directly from cloud provider APIs based on pending Pod requirements, selecting instance types dynamically without predefined node groups. Cluster Autoscaler adjusts the size of existing node groups. Karpenter is faster and more flexible for heterogeneous workloads, while Cluster Autoscaler is simpler for teams with well-defined node group templates and multi-cloud portability needs.

Detailed Answer

Think of hiring staff for a catering company. Cluster Autoscaler is like posting a job ad for a specific role: you have predefined job descriptions (node groups), and when you need more people, you hire from those templates. Karpenter is like a staffing agency that looks at the actual tasks on the board, finds a person with exactly the right skills and availability, and places them immediately. The agency is faster and more flexible, but you need to trust it with your hiring criteria. Cluster Autoscaler has been the standard Kubernetes node scaling solution since early Kubernetes versions. It works by monitoring pending Pods that cannot be scheduled due to insufficient resources, then scaling up one of the configured node groups (Auto Scaling Groups on AWS, Managed Instance Groups on GCP, or VM Scale Sets on Azure). It also scales down by identifying underutilized nodes and draining them. The key limitation is that node groups must be pre-configured with specific instance types, and the autoscaler chooses among existing groups rather than selecting the optimal instance type for each workload. Karpenter, originally created by AWS and since donated to the Kubernetes project under SIG Autoscaling, takes a fundamentally different approach. Instead of managing node groups, Karpenter watches for unschedulable Pods and directly provisions compute from the cloud provider API, choosing the best instance type based on Pod resource requirements, node selectors, affinity rules, and topology spread constraints. NodePool resources define constraints like allowed instance families, availability zones, capacity types (on-demand or spot), and expiration policies. Karpenter evaluates all pending Pods together and can bin-pack them onto a single optimally-sized instance rather than scaling a node group one unit at a time. 
At production scale, Karpenter typically provisions nodes in under 60 seconds compared to 2-5 minutes for Cluster Autoscaler, because it skips the Auto Scaling Group scaling process and calls the EC2 API directly. Karpenter also handles node disruption proactively through consolidation, which replaces underutilized nodes with cheaper or better-fitting ones, and expiration, which rotates nodes to pick up AMI updates. However, Karpenter is currently most mature on AWS, with Azure support in development and GCP support community-driven. Teams needing multi-cloud portability or running on GKE or AKS may still prefer Cluster Autoscaler. The non-obvious gotcha is that Karpenter's flexibility requires careful constraint definition. Without proper NodePool limits on instance families, maximum Pods per node, or total cluster capacity, Karpenter can provision very large or very expensive instances. It can also create infrastructure drift if the team's Terraform or IaC does not account for Karpenter-managed nodes. Architects should set explicit NodePool constraints, integrate Karpenter's provisioned nodes into their monitoring and cost dashboards, and understand that Karpenter manages node lifecycle independently of any external node group definition.
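
To make the "evaluate all pending Pods together, then pick a fitting instance within the pool limits" idea concrete, here is a toy Python model. It is a sketch under strong assumptions, not Karpenter's algorithm: the catalogue entries and prices are made up, only a single node is chosen, and node overhead, daemonsets, and spot availability are ignored.

```python
from typing import List, NamedTuple, Optional


class InstanceType(NamedTuple):
    name: str
    cpu_millicores: int
    memory_mib: int
    hourly_price: float  # hypothetical price, for illustration only


class PodRequest(NamedTuple):
    cpu_millicores: int
    memory_mib: int


def cheapest_fit(pending: List[PodRequest], catalogue: List[InstanceType],
                 pool_cpu_limit: int, pool_cpu_in_use: int = 0) -> Optional[InstanceType]:
    """Pick the cheapest single instance that fits every pending Pod and the pool limit."""
    need_cpu = sum(p.cpu_millicores for p in pending)
    need_mem = sum(p.memory_mib for p in pending)
    candidates = [
        it for it in catalogue
        if it.cpu_millicores >= need_cpu
        and it.memory_mib >= need_mem
        and pool_cpu_in_use + it.cpu_millicores <= pool_cpu_limit
    ]
    return min(candidates, key=lambda it: it.hourly_price) if candidates else None


CATALOGUE = [  # hypothetical instance shapes
    InstanceType("small", 2000, 8192, 0.10),
    InstanceType("medium", 4000, 16384, 0.20),
    InstanceType("large", 8000, 32768, 0.40),
]
pending_pods = [PodRequest(1000, 2048)] * 3
print(cheapest_fit(pending_pods, CATALOGUE, pool_cpu_limit=200_000))
print(cheapest_fit(pending_pods, CATALOGUE, pool_cpu_limit=3000))
```

The second call returns None because the only instance large enough would push the pool past its CPU limit. That is the role the NodePool `limits` block plays as a guard against runaway provisioning.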

Code Example

# Karpenter NodePool that provisions cost-optimized compute for general workloads
apiVersion: karpenter.sh/v1 # Karpenter API group
kind: NodePool # Defines provisioning constraints and behavior
metadata:
  name: general-workloads # Pool name for general-purpose services
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type # Defines instance purchasing model
        operator: In
        values: [on-demand, spot] # Allows both on-demand and spot instances
      - key: node.kubernetes.io/instance-type # Limits instance types (use karpenter.k8s.aws/instance-family to allow whole families)
        operator: In
        values: [m6i.large, m6i.xlarge, m7i.large, m7i.xlarge, c6i.large, c6i.xlarge] # Curated list of right-sized instances
      - key: topology.kubernetes.io/zone # Constrains to specific zones
        operator: In
        values: [us-east-1a, us-east-1b, us-east-1c] # All three availability zones
      nodeClassRef:
        group: karpenter.k8s.aws # AWS-specific node configuration
        kind: EC2NodeClass # References the EC2 node template
        name: general-al2023 # Node class with AL2023 AMI and security groups
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized # Replaces wasteful nodes automatically
    consolidateAfter: 30s # Waits 30 seconds before consolidating
  limits:
    cpu: 200 # Maximum total CPU across all nodes in this pool
    memory: 800Gi # Maximum total memory across all nodes in this pool
---
apiVersion: karpenter.k8s.aws/v1 # AWS-specific Karpenter API
kind: EC2NodeClass # Configures the EC2 instance template
metadata:
  name: general-al2023 # Referenced by the NodePool above
spec:
  amiSelectorTerms:
  - alias: al2023@latest # Uses the latest Amazon Linux 2023 EKS-optimized AMI
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: payments-cluster # Discovers subnets by tag
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: payments-cluster # Discovers security groups by tag

# Check which instance types, zones, and capacity types Karpenter provisioned (read from NodeClaim labels)
kubectl get nodeclaims -o custom-columns='NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,ZONE:.metadata.labels.topology\.kubernetes\.io/zone,CAPACITY:.metadata.labels.karpenter\.sh/capacity-type'

◈ Architecture Diagram

┌──────────────────────────────────────┐
│         Cluster Autoscaler           │
│  Pending → ASG Scale → Node Ready    │
│  (2-5 min, fixed instance types)     │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│         Karpenter                    │
│  Pending → EC2 API → Node Ready      │
│  (<60s, dynamic instance selection)  │
└──────────────────────────────────────┘

What happens during a Kubernetes rolling update, and how do you rollback a bad deployment?

beginner · pods · kubernetes

Quick Answer

During a rolling update, the Deployment creates a new ReplicaSet and gradually scales it up while scaling the old one down, controlled by maxSurge and maxUnavailable. Pods must pass readiness probes before old Pods are terminated. To rollback, run kubectl rollout undo which scales the previous ReplicaSet back up.

Detailed Answer

Think of a rolling update like replacing light bulbs in a theater while the show is still running. You can't turn off all the lights at once (that's a Recreate strategy — total blackout). Instead, you unscrew one old bulb, screw in a new one, verify it lights up (readiness probe), and only then move to the next one. At every moment during the replacement, the audience still has enough light to see the show. If a new bulb turns out to be defective (bad deployment), you stop the replacement and screw the old bulbs back in (rollback). When you update a Deployment (change the container image, environment variables, or any field in the Pod template), the Deployment controller creates a brand new ReplicaSet with the updated Pod template. It then orchestrates a carefully choreographed transition between the old and new ReplicaSets. The two parameters that control this dance are maxSurge (how many extra Pods above the desired count are allowed during the update — controls speed) and maxUnavailable (how many Pods can be offline simultaneously — controls safety margin). With replicas=4, maxSurge=1, maxUnavailable=1: Kubernetes can have up to 5 Pods running (4+1 surge) and at minimum 3 Pods available (4-1 unavailable) at any point. The update proceeds in cycles. In each cycle: (1) the new ReplicaSet is scaled up by maxSurge number of Pods, (2) Kubernetes waits for those new Pods to pass their readiness probe (this is the critical gate — without a readiness probe, Kubernetes considers Pods ready immediately, potentially sending traffic to uninitialized applications), (3) once new Pods are Ready, the old ReplicaSet is scaled down by maxUnavailable Pods. This cycle repeats until all old Pods are terminated and all new Pods are running. The entire process is recorded as a new revision in the Deployment's rollout history. Rollback is elegantly simple because Kubernetes keeps old ReplicaSets around (scaled to 0 replicas). 
When you run `kubectl rollout undo deployment/payments-api`, the Deployment controller doesn't create anything new: it simply scales up the previous ReplicaSet (which still has the old Pod template with the known-good image) and scales down the current one. This means rollback is typically faster than the original deployment because the old image may still be cached on nodes (no pull needed) and the old ReplicaSet already exists (no creation delay). You can also roll back to a specific revision with `kubectl rollout undo deployment/payments-api --to-revision=3`. By default, Kubernetes keeps the last 10 old ReplicaSets (controlled by revisionHistoryLimit). Setting this too low (like 1) means you can only undo one step back. Setting it too high clutters etcd and the API server with stale ReplicaSet objects. For most teams, 5 revisions is the sweet spot. The most critical production gotcha: if you don't have a readiness probe, Kubernetes considers Pods ready the instant the container process starts, even if your Spring Boot app needs 45 seconds to initialize. During a rolling update, traffic gets routed to these half-started Pods, causing 500 errors for real users. The second gotcha: if your readiness probe never passes (bug in health endpoint, wrong port, misconfigured path), the rollout stalls. New Pods stay in a NotReady state, the remaining old Pods are never terminated, and `kubectl rollout status` keeps reporting that it is waiting for the rollout to finish. Once progressDeadlineSeconds (default 600s) elapses, the Deployment's Progressing condition turns False with reason ProgressDeadlineExceeded and `kubectl rollout status` exits with an error, but Kubernetes does not roll back automatically: you or your CD tooling still have to run the undo.
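
The maxSurge and maxUnavailable arithmetic from the walkthrough above can be checked with a short calculation. This Python sketch assumes the documented rounding rules for percentage values (maxSurge rounds up, maxUnavailable rounds down); the function names are illustrative.

```python
import math
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a 'NN%' string into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge: IntOrPercent,
                   max_unavailable: IntOrPercent) -> Tuple[int, int]:
    """Return (maximum total Pods, minimum available Pods) during the rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(4, 1, 1))           # the example from the text
print(rollout_bounds(4, 1, 0))           # the "safest" spec shown in the code example
print(rollout_bounds(10, "25%", "25%"))  # the default percentages
```

With maxUnavailable set to 0, the minimum available count equals the desired replica count, so capacity never dips during the update. The cost is that the rollout needs room in the cluster for the surge Pod.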

Code Example

# Current state: payments-api running v2.1.0 with 4 replicas
kubectl get deployment payments-api
# NAME           READY   UP-TO-DATE   AVAILABLE   AGE
# payments-api   4/4     4            4           30d

# Trigger rolling update to v2.2.0
kubectl set image deployment/payments-api \
  api=registry.company.io/payments-api:2.2.0

# Watch the rollout in real time
kubectl rollout status deployment/payments-api
# Waiting for deployment "payments-api" rollout to finish:
# 1 out of 4 new replicas have been updated...
# 2 out of 4 new replicas have been updated...
# 4 out of 4 new replicas have been updated...
# deployment "payments-api" successfully rolled out

# See both ReplicaSets during the transition
kubectl get replicasets -l app=payments-api
# NAME                      DESIRED   CURRENT   READY
# payments-api-7d5f8b6c4    4         4         4      ← new (v2.2.0)
# payments-api-6b4a9c1e2    0         0         0      ← old (v2.1.0)

# v2.2.0 has a bug! Rollback immediately
kubectl rollout undo deployment/payments-api
# deployment.apps/payments-api rolled back

# Confirm rollback succeeded
kubectl rollout status deployment/payments-api
kubectl describe deployment payments-api | grep Image
# Image: registry.company.io/payments-api:2.1.0  ← back to previous

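# View full rollout history
kubectl rollout history deployment/payments-api
# REVISION  CHANGE-CAUSE
# 2         image update to v2.2.0
# 3         initial deploy   ← former revision 1, renumbered to 3 by the rollback

# Roll back to a specific revision (it must still be listed in the history;
# here revision 2 would re-apply v2.2.0, so this line only shows the syntax)
kubectl rollout undo deployment/payments-api --to-revision=2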

# Deployment spec for controlled rollouts
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1              # 1 extra Pod allowed during update
      maxUnavailable: 0        # Never drop below desired count (safest)
  progressDeadlineSeconds: 300 # Fail rollout if stuck for 5 min
  minReadySeconds: 30          # Wait 30s after Ready before continuing

◈ Architecture Diagram

┌─── Rolling Update Timeline (replicas=4, maxSurge=1) ───┐
│                                                          │
│  Step 0: Stable on v2.1.0                                │
│  ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│  │v2.1.0││v2.1.0││v2.1.0││v2.1.0│  Available: 4         │
│  └──────┘└──────┘└──────┘└──────┘                       │
│                                                          │
│  Step 1: Create 1 new Pod (surge)                        │
│  ┌──────┐┌──────┐┌──────┐┌──────┐┌──────┐              │
│  │v2.1.0││v2.1.0││v2.1.0││v2.1.0││v2.2.0│  Total: 5    │
│  └──────┘└──────┘└──────┘└──────┘└──┬───┘              │
│                                      │                   │
│                              wait for readiness probe    │
│                                      ▼                   │
│  Step 2: New Pod Ready → terminate 1 old                 │
│  ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│  │v2.1.0││v2.1.0││v2.1.0││v2.2.0│  Available: 4         │
│  └──────┘└──────┘└──────┘└──────┘                       │
│                                                          │
│  ... repeat until ...                                    │
│                                                          │
│  Step Final: All replaced                                │
│  ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│  │v2.2.0││v2.2.0││v2.2.0││v2.2.0│  Available: 4         │
│  └──────┘└──────┘└──────┘└──────┘                       │
│                                                          │
│  Rollback: scale old RS back up (instant, no new pull)  │
│  ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│  │v2.1.0││v2.1.0││v2.1.0││v2.1.0│  ← old RS restored   │
│  └──────┘└──────┘└──────┘└──────┘                       │
└──────────────────────────────────────────────────────────┘

What is a game day in chaos engineering, and how do you run one for a banking payments platform?

intermediate · general · kubernetes

Quick Answer

A game day is a planned resilience exercise where teams deliberately inject failures into systems while observing how services, monitoring, and people respond. It validates that SLOs are maintained under stress, runbooks are effective, and on-call engineers can diagnose and recover from realistic failure scenarios.

Detailed Answer

Think of a game day like a military field exercise. Instead of sending troops into actual combat to find out if their training works, you create a realistic simulation — complete with communications failures, supply chain disruptions, and adversary movements — in a controlled environment where nobody actually gets hurt. The goal isn't to prove everything works perfectly; it's to find the gaps in training, communication, and procedures before a real crisis exposes them. The most valuable game days are the ones where things go wrong, because each failure becomes a training opportunity. A game day for a banking payments platform begins weeks before the actual event with planning and scoping. The game day lead defines 3-5 failure scenarios relevant to the platform's risk profile: primary database failover, Kafka broker loss during peak transaction volume, AZ failure affecting half the payments-api pods, certificate expiry on a critical internal service, and a sudden traffic spike of 10x normal volume. Each scenario has a defined hypothesis: 'When we kill 2 of 3 Kafka brokers, consumer lag will spike but recover within 5 minutes, and no transactions will be lost.' The scope explicitly lists what will and will not be tested — you never inject chaos into systems that process live customer transactions without extensive safeguards. During the game day, the facilitator runs scenarios one at a time while observers monitor dashboards, alerting, and team communication channels. The key participants are: the facilitator who injects failures and controls the timeline, the development team who respond to incidents as if they were real, the SRE team who monitor SLIs and system health, and observers who document everything — how long it took to detect the issue, what runbook was followed, what communication happened, and where confusion arose. 
Each scenario follows a structured flow: inject the failure, start a timer, observe whether alerts fire within the expected window, watch the team's response, measure time to detection and time to recovery, and document all observations. The most critical aspect of a banking game day is defining clear safety rails. You need a kill switch — a way to immediately reverse any injected failure if it threatens to cause real customer impact. The game day should run in a pre-production environment that mirrors production topology, or in production during a low-traffic maintenance window with explicit leadership approval. For PCI-DSS and SOC2 compliance, every game day must be formally documented with approvals, scope definitions, results, and remediation actions. Some regulators specifically require evidence of resilience testing, making game days not just a best practice but a compliance requirement. After all scenarios complete, the team conducts an immediate retrospective. This is where the real value emerges. Common findings include: alerts didn't fire because the threshold was wrong, the runbook had an outdated command that no longer works, the on-call engineer didn't know how to access the Kafka admin tools, DNS failover took 12 minutes instead of the expected 2 minutes, and the team communicated over three different Slack channels causing information fragmentation. Each finding becomes a tracked action item with an owner and deadline. The gotcha that makes game days fail: treating them as a one-time event rather than a regular practice. The first game day is always rough — it exposes dozens of issues and the team feels demoralized. The value comes from running them quarterly, tracking improvement on previously identified issues, and gradually increasing the complexity and realism of scenarios. 
A team that runs game days quarterly will eventually handle real incidents with the calm confidence of a practiced response, while a team that ran one game day 18 months ago will fumble through the next real outage.
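
The per-scenario measurements described above (time to detection and time to recovery, compared against the hypothesis) can be sketched in a few lines of Python. This is a minimal illustration, not part of any chaos tool: the `ScenarioRun` class, the `evaluate` helper, the scenario name, and the timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ScenarioRun:
    """Timestamps an observer records for one game day scenario (hypothetical structure)."""
    name: str
    injected_at: datetime
    alert_fired_at: datetime
    recovered_at: datetime

    @property
    def time_to_detect(self) -> timedelta:
        return self.alert_fired_at - self.injected_at

    @property
    def time_to_recover(self) -> timedelta:
        return self.recovered_at - self.injected_at


def evaluate(run: ScenarioRun, max_detect: timedelta, max_recover: timedelta) -> list[str]:
    """Return one finding for each way the observed run missed the hypothesis."""
    findings = []
    if run.time_to_detect > max_detect:
        findings.append(f"{run.name}: detection took {run.time_to_detect}, expected <= {max_detect}")
    if run.time_to_recover > max_recover:
        findings.append(f"{run.name}: recovery took {run.time_to_recover}, expected <= {max_recover}")
    return findings


# Illustrative numbers only: the alert fired after 8 minutes, recovery took 11.
start = datetime(2026, 7, 14, 10, 0, 0)
kafka = ScenarioRun(
    name="kafka-broker-loss",
    injected_at=start,
    alert_fired_at=start + timedelta(minutes=8),
    recovered_at=start + timedelta(minutes=11),
)
findings = evaluate(kafka, max_detect=timedelta(minutes=2), max_recover=timedelta(minutes=5))
for finding in findings:
    print(finding)
```

Each returned finding maps naturally onto a retrospective action item with an owner and a deadline.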

Code Example

# Game Day Plan: Banking Payments Platform
# Date: 2026-Q3 Game Day
# Duration: 4 hours (10 AM - 2 PM, business hours)

# Pre-game: Verify monitoring baseline
# Record steady-state metrics for comparison
kubectl exec -n monitoring prometheus-0 -- \
  promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{service="payments-api"}[5m]))'
# Expected: ~500 req/s baseline

# ──── Scenario 1: Pod Failure (30 min) ────
# Hypothesis: Killing 50% of payments-api pods maintains p99 < 200ms

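# Inject: Kill 3 of 6 payments-api pods
# (select the pods first; piping `kubectl delete` through head would still delete every match)
kubectl get pods -n payments -l app=payments-api \
  --field-selector status.phase=Running -o name | head -3 \
  | xargs kubectl delete -n payments --grace-period=0 --force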

# Observe: Do alerts fire within 2 minutes?
# Observe: Do replacement pods start within 30 seconds?
# Observe: Does p99 latency stay below SLO threshold?
kubectl get pods -n payments -l app=payments-api -w

# Measure recovery
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -20

# ──── Scenario 2: Kafka Broker Failure (45 min) ────
# Hypothesis: Losing 1 of 3 Kafka brokers doesn't lose transactions

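# Inject: Cordon and drain the node running kafka-1
# (drain evicts every pod on the node, not only kafka-1)
kubectl cordon worker-node-03
kubectl drain worker-node-03 --delete-emptydir-data --force --ignore-daemonsets
# After the scenario, restore scheduling with: kubectl uncordon worker-node-03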

# Monitor consumer lag — should spike then recover
kubectl exec -n kafka kafka-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group payments-processor --describe

# Verify no messages lost — check dead letter queue
kubectl exec -n kafka kafka-0 -- \
  kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic payments.dlq --from-beginning --timeout-ms 5000

# ──── Scenario 3: Network Latency Injection (30 min) ────
# Hypothesis: 500ms latency to fraud-detector triggers circuit breaker

# Using LitmusChaos to inject network latency
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-network-latency
  namespace: payments
spec:
  engineState: active
  chaosServiceAccount: pod-network-latency-sa   # assumed name; must exist with chaos RBAC
  appinfo:
    appns: payments
    applabel: app=fraud-detector
    appkind: deployment
  experiments:
  - name: pod-network-latency
    spec:
      components:
        env:
        - name: NETWORK_LATENCY
          value: "500"                # 500ms added latency
        - name: TOTAL_CHAOS_DURATION
          value: "300"                # 5 minutes
        - name: DESTINATION_IPS
          value: "10.96.0.0/12"       # Only internal traffic

# ──── Post-Game Day Retrospective ────
# Document findings in structured format
# Finding 1: Alert for Kafka consumer lag fired after 8 minutes
#            (expected: 2 minutes). Action: reduce evaluation window
# Finding 2: Runbook for broker recovery referenced a deprecated CLI
#            Action: update runbook with KRaft-based commands
# Finding 3: Circuit breaker on fraud-detector opened at 1s timeout
#            but SLO requires 500ms. Action: tune threshold

# Track action items
# kubectl create configmap gameday-q3-actions -n platform \
#   --from-literal=finding1="Reduce Kafka lag alert window to 2m" \
#   --from-literal=finding2="Update broker recovery runbook" \
#   --from-literal=finding3="Tune circuit breaker to 500ms"

◈ Architecture Diagram

┌──────────── Game Day Timeline ──────────────────────────┐
│                                                         │
│  ┌─── Preparation (2-3 weeks before) ──────────────┐   │
│  │ • Define 3-5 failure scenarios                   │   │
│  │ • Write hypotheses tied to SLOs                  │   │
│  │ • Get leadership approval                        │   │
│  │ • Notify on-call and dependent teams             │   │
│  │ • Verify kill switch / rollback procedures       │   │
│  └──────────────────────┬──────────────────────────┘   │
│                         ▼                               │
│  ┌─── Execution (4 hours) ─────────────────────────┐   │
│  │                                                  │   │
│  │  Roles:                                          │   │
│  │  ┌───────────┐  ┌───────────┐  ┌────────────┐   │   │
│  │  │Facilitator│  │Responders │  │ Observers  │   │   │
│  │  │(injects   │  │(dev + SRE │  │(document   │   │   │
│  │  │ failures) │  │ team)     │  │ everything)│   │   │
│  │  └─────┬─────┘  └─────┬─────┘  └─────┬──────┘   │   │
│  │        │              │              │           │   │
│  │        ▼              ▼              ▼           │   │
│  │  Scenario 1 → Detect → Respond → Recover → Log │   │
│  │  Scenario 2 → Detect → Respond → Recover → Log │   │
│  │  Scenario 3 → Detect → Respond → Recover → Log │   │
│  │                                                  │   │
│  │  Metrics tracked per scenario:                   │   │
│  │  • Time to detect (TTD)                          │   │
│  │  • Time to mitigate (TTM)                        │   │
│  │  • SLI impact during failure                     │   │
│  │  • Alert accuracy (fired? correct severity?)     │   │
│  └──────────────────────┬──────────────────────────┘   │
│                         ▼                               │
│  ┌─── Retrospective (immediately after) ───────────┐   │
│  │ • Review each scenario: hypothesis vs reality    │   │
│  │ • Document findings and surprises                │   │
│  │ • Create action items with owners + deadlines    │   │
│  │ • Schedule follow-up game day (quarterly)        │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘

How do HPA, VPA, and Cluster Autoscaler work together to handle traffic spikes?

intermediate · scheduling · kubernetes

Quick Answer

HPA scales pods horizontally based on CPU, memory, or custom metrics. VPA adjusts pod resource requests and limits vertically to right-size containers. Cluster Autoscaler adds or removes nodes when pods cannot be scheduled due to insufficient cluster capacity. Together, they form a three-layer scaling system: right-size pods, scale pod count, then scale infrastructure.

Detailed Answer

Think of a restaurant handling a dinner rush. VPA is like giving each chef a bigger workstation when they are cramped (vertical scaling of individual workers). HPA is like calling in more chefs when orders pile up (horizontal scaling of worker count). Cluster Autoscaler is like opening additional kitchen rooms when there is no space for more chefs (infrastructure scaling). Each addresses a different bottleneck, and they work best when coordinated. The Horizontal Pod Autoscaler (HPA) watches metrics and adjusts the replica count of a Deployment or StatefulSet. It most commonly targets CPU utilization, but it can also target memory, custom metrics (requests per second from Prometheus), or external metrics (SQS queue depth from CloudWatch). HPA evaluates every 15 seconds (configurable), calculates the desired replica count using the formula desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric)), and scales accordingly. It skips scaling when that ratio is within a small tolerance of 1.0 (10% by default), respects minReplicas and maxReplicas boundaries, and has stabilization windows to prevent flapping. The Vertical Pod Autoscaler (VPA) analyzes historical resource usage and recommends or automatically adjusts the CPU and memory requests and limits for containers. Its main update modes are Off (only recommends), Initial (sets resources only at pod creation), Recreate (evicts running pods whose requests drift from the recommendation), and Auto (which has historically behaved like Recreate). VPA solves the problem of developers guessing resource requests, either setting them too high (wasting cluster capacity) or too low (causing throttling and OOMKills). It uses a recommender component that analyzes metrics over time, an updater that evicts pods needing adjustment, and an admission controller that applies the new requests when replacement pods are created. The Cluster Autoscaler watches for pods stuck in Pending state because no node has sufficient resources. When it detects unschedulable pods, it evaluates which node group can accommodate them and triggers the cloud provider to add nodes. 
Conversely, it removes underutilized nodes (below 50 percent utilization by default) after a cooldown period, draining pods while respecting PodDisruptionBudgets. It works with AWS Auto Scaling Groups, GCP Managed Instance Groups, or Azure VM Scale Sets. In production, coordination between these three autoscalers requires careful planning. The critical rule is never to use HPA and VPA on the same metric for the same workload, because they will fight: HPA tries to add replicas while VPA tries to resize existing ones, creating oscillation. The recommended pattern is HPA on CPU with VPA on memory in recommendation-only mode, or HPA on custom metrics while VPA handles resource right-sizing in Initial mode. Cluster Autoscaler usually cannot add a node within 30-60 seconds, so a spike that needs new capacity that quickly will drop traffic unless headroom already exists. Teams configure node pool warm-up strategies or use Karpenter for faster, more flexible node provisioning. The non-obvious gotcha is scaling lag. HPA reacts within tens of seconds, but new pods need time to start (image pull, readiness probe). Cluster Autoscaler typically takes 1-3 minutes to provision new nodes. During a sudden traffic spike, the system can drop requests for several minutes. Mitigation strategies include setting higher minReplicas during known peak windows, using PriorityClasses so critical pods can preempt less critical workloads, over-provisioning with pause pods (low-priority placeholder pods that are preempted as soon as real pods need the capacity), and combining with KEDA for event-driven scaling that responds to queue depth before CPU rises.
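
The HPA replica calculation described above can be worked through with a short Python sketch. It is a simplification under stated assumptions: the function name and parameters are illustrative, and it ignores stabilization windows, not-yet-ready pods, and multiple metrics, all of which the real controller also considers.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation: ceil(current * ratio), clamped to min/max."""
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# CPU at 130% of requests against a 65% target doubles the replica count.
print(desired_replicas(4, 130, 65, min_replicas=4, max_replicas=40))   # 8
# A small deviation (68% vs 65%) stays inside the tolerance band.
print(desired_replicas(4, 68, 65, min_replicas=4, max_replicas=40))    # 4
# Large spikes are capped at maxReplicas.
print(desired_replicas(30, 130, 65, min_replicas=4, max_replicas=40))  # 40
```

The clamp is why maxReplicas needs to be sized against the largest expected spike: once it is reached, extra load shows up as latency rather than more pods.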

Code Example

# Check current HPA status including current vs target metrics and replica count
kubectl get hpa -n payments

# Describe HPA to see scaling events, conditions, and metric sources
kubectl describe hpa payments-api -n payments

# Example HPA manifest scaling on CPU utilization (a custom requests-per-second metric would be a second entry under metrics)
# ---
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: payments-api
#   namespace: payments
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: payments-api
#   minReplicas: 4
#   maxReplicas: 40
#   metrics:
#   - type: Resource
#     resource:
#       name: cpu
#       target:
#         type: Utilization
#         averageUtilization: 65

# Check VPA recommendations without applying them (Off mode)
kubectl get vpa payments-api -n payments -o yaml | grep -A10 recommendation

# Check Cluster Autoscaler status and recent scaling decisions
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Look for pods stuck in Pending that might trigger Cluster Autoscaler
kubectl get pods -A --field-selector=status.phase=Pending

◈ Architecture Diagram

┌─────────────────────────────────────────────────┐
│                Traffic Spike                      │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│  Layer 1: HPA (seconds)                          │
│  Scale pods 4 → 12 based on CPU/custom metrics   │
└────────────────────┬────────────────────────────┘
                     ↓ pods Pending?
┌─────────────────────────────────────────────────┐
│  Layer 2: Cluster Autoscaler (1-3 minutes)       │
│  Add nodes to fit unschedulable pods             │
└────────────────────┬────────────────────────────┘
                     ↓ right-size over time
┌─────────────────────────────────────────────────┐
│  Layer 3: VPA (background)                       │
│  Adjust resource requests based on actual usage  │
└─────────────────────────────────────────────────┘

What is the difference between rolling update, blue-green, and canary deployments in Kubernetes?

intermediate · cicd · kubernetes

Quick Answer

Rolling update gradually replaces old pods with new ones (Kubernetes native). Blue-green maintains two full environments and switches traffic instantly. Canary sends a small percentage of traffic to the new version first, then gradually increases. Each trades off speed, resource cost, and risk differently.

Detailed Answer

Think of upgrading a restaurant's menu. A rolling update is like changing one table's menu at a time until all tables have the new menu — gradual with no extra cost but hard to undo if customers complain. Blue-green is like printing a complete new set of menus and swapping them all at once while keeping the old ones ready — instant switchback but double the printing cost. Canary is like giving the new menu to one table first, watching their reaction, and only rolling it out if they order successfully — safest but slowest. The rolling update is Kubernetes' native deployment strategy. When you update a Deployment's pod template, the controller creates a new ReplicaSet and gradually scales it up while scaling the old one down. The maxSurge parameter controls how many extra pods can exist during the transition (e.g., 25% means 12 becomes 15 temporarily), and maxUnavailable controls how many pods can be missing (e.g., 25% means at least 9 of 12 must always be ready). Traffic flows to both old and new pods simultaneously during the rollout. If the new pods fail readiness probes, the rollout pauses. Rollback is a single command: kubectl rollout undo. Blue-green deployment runs two complete environments: blue (current live version) and green (new version). Both are fully deployed and warmed up before any traffic switch. Once the green environment passes health checks and smoke tests, the Service selector or Ingress routing is updated to point all traffic to green. If problems emerge, switching back to blue is instantaneous because it is still running. The cost is double the resources during the transition. In Kubernetes, this is implemented by having two Deployments (payments-api-blue and payments-api-green) and changing the Service selector or using Ingress annotations to switch. Canary deployment sends a small fraction of production traffic to the new version (typically 1-5% initially) while the majority continues on the stable version. 
If metrics (error rate, latency, business KPIs) look healthy over a defined period, traffic is gradually shifted: 5%, 25%, 50%, 100%. If any metric degrades, the tooling can revert traffic to the stable version automatically. This provides the highest safety because real production traffic validates the new version with minimal blast radius. In Kubernetes, canary deployments are typically managed by tools like Argo Rollouts, Flagger, or Istio traffic splitting rather than native Kubernetes primitives. In production, the choice depends on your risk tolerance, infrastructure budget, and rollback speed requirements. Rolling updates are cheap (only maxSurge extra pods during the rollout) and native, but you cannot easily test the new version in isolation before it receives traffic. Blue-green is excellent for major version changes that might need instant rollback, but it doubles resource cost. Canary is ideal for high-traffic services where even a 1% error rate affects thousands of users and you need metric-driven progressive delivery. Many teams use rolling updates for internal services and canary for user-facing APIs. The non-obvious gotcha is database compatibility. All three strategies may have old and new application versions running simultaneously (during the rollout window). If the new version changes the database schema, both versions must work with both the old and new schema. This requires backward-compatible migrations: add columns as nullable, never rename in place, deploy the new schema before the new code, and remove old columns only after the old code is fully gone. Teams that forget this end up with both versions hitting the same database and one of them crashing on a schema mismatch.
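
The canary progression described above (shift traffic step by step, abort when a metric degrades) can be sketched as a small decision loop. This is a hypothetical illustration rather than how Argo Rollouts or Flagger are implemented: `run_canary`, the `error_rate_at` callback, and the thresholds are assumptions, and a real controller would query a metrics provider instead of a lambda.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Walk through traffic weights (percent sent to the new version).

    Returns ("promoted", 100) if every step stays healthy, or
    ("aborted", weight) naming the step at which the error rate degraded.
    """
    for weight in steps:
        observed = error_rate_at(weight)
        print(f"canary at {weight}%: error rate {observed:.2%}")
        if observed > max_error_rate:
            return ("aborted", weight)
    return ("promoted", 100)


# A healthy release stays at a 0.1% error rate at every step.
healthy = run_canary([5, 25, 50, 100], lambda weight: 0.001, max_error_rate=0.01)

# A release that degrades under load is stopped at the 50% step.
degraded = run_canary(
    [5, 25, 50, 100],
    lambda weight: 0.001 if weight < 50 else 0.03,
    max_error_rate=0.01,
)
print(healthy, degraded)
```

The point of the step list is blast radius: a problem that only appears at 50% traffic is caught before the remaining half of users ever reach the new version.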

Code Example

# Check current rolling update strategy settings
kubectl get deployment payments-api -n payments -o jsonpath='{.spec.strategy}'

# Watch a rolling update in progress (reports how many replicas have been updated)
kubectl rollout status deployment/payments-api -n payments

# View both ReplicaSets during a rollout (old scaling down, new scaling up)
kubectl get replicaset -n payments -l app=payments-api

# Rollback a failed rolling update to the previous version
kubectl rollout undo deployment/payments-api -n payments

# Blue-green: switch traffic by updating the Service selector
kubectl patch svc payments-api -n payments -p '{"spec":{"selector":{"version":"green"}}}'

# Blue-green: rollback by switching back to blue
kubectl patch svc payments-api -n payments -p '{"spec":{"selector":{"version":"blue"}}}'

# Canary with Argo Rollouts: check canary step progress
kubectl argo rollouts get rollout payments-api -n payments

# Canary: manually promote after verifying metrics
kubectl argo rollouts promote payments-api -n payments

# Canary: abort if metrics degrade and roll back to stable
kubectl argo rollouts abort payments-api -n payments

◈ Architecture Diagram

┌─────────────────────────────────────────────────┐
│  Rolling Update (native)                         │
│  ┌───┐ ┌───┐ ┌───┐ ┌───┐                        │
│  │v1 │ │v1 │ │v2 │ │v2 │  gradual replacement   │
│  └───┘ └───┘ └───┘ └───┘                        │
├─────────────────────────────────────────────────┤
│  Blue-Green                                      │
│  ┌───┐ ┌───┐ ┌───┐ ┌───┐  blue (live)           │
│  │v1 │ │v1 │ │v1 │ │v1 │                        │
│  └───┘ └───┘ └───┘ └───┘                        │
│  ┌───┐ ┌───┐ ┌───┐ ┌───┐  green (standby)       │
│  │v2 │ │v2 │ │v2 │ │v2 │  ← switch traffic      │
│  └───┘ └───┘ └───┘ └───┘                        │
├─────────────────────────────────────────────────┤
│  Canary                                          │
│  ┌───┐ ┌───┐ ┌───┐        95% traffic           │
│  │v1 │ │v1 │ │v1 │                              │
│  └───┘ └───┘ └───┘                              │
│  ┌───┐                     5% traffic            │
│  │v2 │  ← monitor metrics before promoting       │
│  └───┘                                           │
└─────────────────────────────────────────────────┘

What happens during a rolling update, and how do maxSurge and maxUnavailable control it?

intermediate · general · kubernetes

Quick Answer

A rolling update gradually replaces old pods with new ones by creating new ReplicaSet pods while terminating old ones. maxSurge controls how many extra pods can exist above the desired count during the update, while maxUnavailable controls how many pods can be down simultaneously.

Detailed Answer

Think of maxSurge and maxUnavailable like staffing rules during a shift change at a hospital. maxSurge is how many extra nurses you can have on the floor simultaneously (overtime budget), and maxUnavailable is how many nurse positions can be empty at once (minimum staffing). A hospital might say 'we can have 1 extra nurse on overtime (maxSurge=1) but no positions can be vacant (maxUnavailable=0)' meaning the new shift arrives before the old shift leaves. When you update a Deployment's pod template (new image, env vars, etc.), the Deployment controller creates a new ReplicaSet with the updated spec and begins scaling it up while scaling the old ReplicaSet down. The speed and safety of this transition is controlled by the `strategy.rollingUpdate.maxSurge` and `strategy.rollingUpdate.maxUnavailable` parameters. Both can be absolute numbers or percentages of the desired replica count (a percentage maxSurge is rounded up, a percentage maxUnavailable is rounded down). Here is the sequence with replicas=4, maxSurge=1, maxUnavailable=1: The controller can have at most 5 total pods (4 + maxSurge=1) and must keep at least 3 available pods (4 - maxUnavailable=1). Because one pod may be unavailable, it does not wait: it creates 1 new pod and terminates 1 old pod right away, leaving 3 old pods Ready (the minimum) while the new pod starts. Once the new pod is Ready, availability is back to 4, so the controller terminates another old pod and creates another new one. It repeats this until all 4 pods are running the new version. The entire process respects both constraints at every step. In production, the two most common configurations are: (1) maxSurge=25%, maxUnavailable=25% (the default), which balances speed and safety by replacing roughly a quarter of the pods at a time; (2) maxSurge=1, maxUnavailable=0, which is the safest option because no old pod is terminated until its replacement is proven Ready. 
The second option means your cluster temporarily runs more pods than desired, so you need spare capacity, but it guarantees zero capacity reduction during the update. The critical gotcha: availability is counted from pods that are Ready, so a new pod that fails its readiness probe counts as unavailable and can block the rollout from proceeding. A common failure mode is a broken readiness probe on the new version that never passes: the rollout creates maxSurge new pods, they all fail readiness, and the rollout stalls with both old and new pods running but no progress being made. The Deployment does not roll back on its own; once progressDeadlineSeconds passes it is only marked as failed, so this is when `kubectl rollout undo` becomes necessary.
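
The two constraints can be computed directly. The sketch below is an illustration with hypothetical helper names (`resolve`, `rollout_bounds`); it assumes the rounding rules Kubernetes documents for percentages (maxSurge rounds up, maxUnavailable rounds down) and does not model the controller's step-by-step scheduling.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string like '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        if round_up:
            return -(-percent * replicas // 100)  # integer ceiling
        return percent * replicas // 100          # integer floor
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


print(rollout_bounds(4, 1, 1))   # (5, 3)  the worked example above
print(rollout_bounds(4, 1, 0))   # (5, 4)  safest: never below the desired count
print(rollout_bounds(12))        # (15, 9) defaults of 25% / 25%
print(rollout_bounds(10))        # (13, 8) 2.5 rounds up for surge, down for unavailable
```

The last case shows why the rounding direction matters: with 10 replicas and the defaults, the update may add 3 pods but may only take 2 out of service.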

Code Example

# Deployment with explicit rolling update strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2              # Allow 2 extra pods (total can be 8)
      maxUnavailable: 1        # At most 1 pod can be unavailable (min 5 available)
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
      - name: checkout
        image: checkout:3.4.1   # Change this to trigger rolling update
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5

# Trigger a rolling update by changing the image
kubectl set image deploy/checkout-service checkout=checkout:3.5.0 -n production

# Watch the rollout progress
kubectl rollout status deploy/checkout-service -n production
# Waiting for deployment "checkout-service" rollout to finish:
# 2 out of 6 new replicas have been updated...
# 4 out of 6 new replicas have been updated...
# 6 out of 6 new replicas have been updated...
# deployment "checkout-service" successfully rolled out

# See both ReplicaSets during rollout
kubectl get rs -n production -l app=checkout
# NAME                       DESIRED   CURRENT   READY
# checkout-service-6d4f7b    6         6         6     ← new (complete)
# checkout-service-5c3e8a    0         0         0     ← old (scaled down)

# Rollback if the new version has issues
kubectl rollout undo deploy/checkout-service -n production

# Safe config: no capacity loss during update
# maxSurge: 1, maxUnavailable: 0
# Means: always at least 6 pods available, create 1 new before removing 1 old

◈ Architecture Diagram

Rolling Update with replicas=4, maxSurge=1, maxUnavailable=1:

  Step 0 (before update):
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 ✓] [Pod4 ✓]  = 4 Ready
  New RS: (empty)
  Total: 4 pods, 4 available

  Step 1 (create new, terminate old):
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 ✓] [Pod4 terminating]
  New RS: [Pod5 ✓]
  Total: 5 pods, 4 available (within maxSurge=1, maxUnavail=1)

  Step 2:
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 terminating]
  New RS: [Pod5 ✓] [Pod6 ✓]
  Total: 5 pods, 4 available

  Step 3:
  Old RS: [Pod1 ✓] [Pod2 terminating]
  New RS: [Pod5 ✓] [Pod6 ✓] [Pod7 ✓]
  Total: 5 pods, 4 available

  Step 4 (complete):
  Old RS: (empty)
  New RS: [Pod5 ✓] [Pod6 ✓] [Pod7 ✓] [Pod8 ✓]  = 4 Ready
  Total: 4 pods, 4 available

  Constraint at every step:
  max total pods = replicas + maxSurge = 4 + 1 = 5
  min available  = replicas - maxUnavail = 4 - 1 = 3

Why do Kubernetes pods get evicted, and what commands and steps do you use to diagnose and resolve pod eviction?

intermediate · pods · kubernetes

Quick Answer

Pods are evicted when a node is under resource pressure — disk (DiskPressure), memory (MemoryPressure), or PID exhaustion (PIDPressure). The kubelet evicts pods based on QoS class priority: BestEffort first, then Burstable, then Guaranteed last. Diagnosis starts with kubectl describe node to check conditions and kubectl get events to find eviction reasons.

Detailed Answer

Think of an overcrowded bus. When the bus exceeds its weight limit, the driver must ask some passengers to leave. Passengers without tickets (BestEffort pods) are asked first, then those with partial tickets (Burstable pods), and finally full-fare passengers (Guaranteed pods) are the last to go. The bus driver does not choose randomly; there is a clear priority order based on who has the strongest claim to stay. In Kubernetes, pod eviction is the kubelet's mechanism for protecting node stability. When a node runs low on a critical resource (memory, ephemeral storage, or process IDs), the kubelet begins evicting pods to reclaim that resource. This is different from preemption (which is the scheduler removing lower-priority pods to make room for higher-priority ones) and different from API-initiated eviction (which is used during node drain for maintenance). Internally, the kubelet monitors resource usage against configurable eviction thresholds. The default hard eviction thresholds on Linux include memory.available < 100Mi and nodefs.available < 10%; soft thresholds have no defaults and must be configured explicitly. When a threshold is breached, the kubelet sets the corresponding node condition (MemoryPressure, DiskPressure, PIDPressure) and begins ranking pods for eviction. The ranking considers whether a pod's usage exceeds its requests, then pod priority, then how far usage exceeds requests. In practice this tracks QoS class: BestEffort pods (no resource requests or limits) are evicted first, Burstable pods (requests set but lower than limits) are evicted next based on how much they exceed their requests, and Guaranteed pods (requests equal limits for all containers) are evicted last. At production scale, the most common eviction cause is ephemeral storage exhaustion from container logs, emptyDir volumes, or container writable layers growing unbounded. Memory-based evictions happen when applications have memory leaks or when memory requests are set well below what the workload actually uses. 
Teams should monitor node conditions, set appropriate resource requests and limits to ensure critical pods get Guaranteed QoS, configure log rotation to prevent disk pressure, and run enough replicas spread across nodes to absorb evictions. PodDisruptionBudgets protect against API-initiated evictions such as node drains, but the kubelet does not honor them during node-pressure eviction. The non-obvious gotcha is that eviction thresholds have both soft and hard variants. Soft evictions give pods a grace period to terminate cleanly, while hard evictions kill pods immediately. If the hard eviction threshold is hit (e.g., memory.available < 100Mi), the kubelet kills pods without waiting for graceful shutdown, which can cause data loss or incomplete request processing. Architects should ensure hard thresholds are rarely reached by setting soft thresholds with enough buffer.
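
The QoS classification rules above can be expressed as a small function. This is a simplified sketch with a hypothetical `qos_class` helper: it takes plain dictionaries rather than real Pod objects, and it compares quantity strings literally (so "1" and "1000m" would be treated as different), which the real API server does not do.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' resources (simplified).

    Each container is a dict like {"requests": {...}, "limits": {...}}.
    """
    any_resources_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_resources_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request defaults to the limit, as the API server does.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_resources_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


print(qos_class([{}]))  # BestEffort
print(qos_class([{"requests": {"cpu": "250m", "memory": "512Mi"},
                  "limits": {"cpu": "250m", "memory": "512Mi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "100m", "memory": "256Mi"},
                  "limits": {"cpu": "250m", "memory": "512Mi"}}]))  # Burstable
```

Note that a single sidecar without limits is enough to drop the whole pod from Guaranteed to Burstable, because the rule applies to every container.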

Code Example

# Check node conditions for resource pressure
kubectl describe node ip-10-0-1-42.ec2.internal | grep -A8 'Conditions' # Shows MemoryPressure, DiskPressure, PIDPressure, and Ready status

# Find eviction events in the namespace
kubectl get events -n payments --field-selector reason=Evicted --sort-by='.lastTimestamp' # Lists evicted pods with reasons

# Check which pod was evicted and why
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.reason}' # Shows 'Evicted'
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.message}' # Shows the resource that triggered eviction

# Check node resource usage
kubectl top node ip-10-0-1-42.ec2.internal # Shows current CPU and memory usage

# Check disk usage on the node (requires node access)
kubectl debug node/ip-10-0-1-42.ec2.internal -it --image=busybox -- df -h # Shows filesystem usage on the node

# Check QoS class of pods to understand eviction priority
kubectl get pods -n payments -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass' # Shows BestEffort, Burstable, or Guaranteed

# Set proper resource requests equal to limits for Guaranteed QoS
# resources:
#   requests:
#     cpu: 250m      # Request equals limit for Guaranteed QoS
#     memory: 512Mi  # Request equals limit for Guaranteed QoS
#   limits:
#     cpu: 250m      # Matches request
#     memory: 512Mi  # Matches request

◈ Architecture Diagram

┌──────────────────────────┐
│ Node Resource Pressure   │
│ Memory < 100Mi           │
└────────────┬─────────────┘
             ↓
┌──────────────────────────┐
│ Eviction Priority        │
│ 1. BestEffort  (first)   │
│ 2. Burstable   (next)    │
│ 3. Guaranteed  (last)    │
└──────────────────────────┘

How do you troubleshoot a pod stuck in CrashLoopBackOff, and what are the most common root causes?

intermediate · pods · kubernetes

Quick Answer

CrashLoopBackOff means the container starts, crashes, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5 minutes). Common causes are application startup errors, missing environment variables or secrets, misconfigured commands or entrypoints, failed health probes, and OOMKilled. Diagnosis uses kubectl logs --previous, kubectl describe pod, and checking exit codes.

Detailed Answer

Think of a light switch connected to a circuit breaker. You flip the switch (container starts), the circuit overloads (container crashes), and the breaker trips (Kubernetes waits before retrying). Each time you try again, the breaker waits longer before allowing another attempt. CrashLoopBackOff is Kubernetes telling you that the container keeps failing and the wait time between restarts is increasing. In Kubernetes, CrashLoopBackOff is not a separate error state — it is the backoff delay that kubelet applies after repeated container crashes. The container exits with a non-zero code, kubelet restarts it after 10 seconds, it crashes again, kubelet waits 20 seconds, then 40, then 80, capping at 300 seconds (5 minutes). The pod status shows CrashLoopBackOff during these waiting periods and Error or Completed when the container actually exits. The most common root causes fall into categories. Application errors: the application throws an unhandled exception during startup because a required database is unreachable, a configuration file is malformed, or a required API key is missing. Configuration errors: the container command or args field is wrong (pointing to a script that does not exist in the image), the image tag points to a version with a different entrypoint, or a required environment variable is not set. Resource errors: the container is OOMKilled immediately on startup because the memory limit is too low for the JVM heap or the application's baseline memory footprint. Probe errors: an aggressive liveness probe kills the container before it finishes starting up, especially for Java applications with long startup times. At production scale, the diagnostic sequence is: first check exit code with kubectl describe pod (exit code 1 = application error, 137 = OOMKilled/SIGKILL, 143 = SIGTERM). Then check previous container logs with kubectl logs --previous since the current container may have already crashed. 
Check whether the container image recently changed with kubectl rollout history. Verify that ConfigMaps, Secrets, and PersistentVolumeClaims referenced by the pod actually exist in the namespace. The non-obvious gotcha is that CrashLoopBackOff can be caused by a liveness probe that is too aggressive during startup. If the liveness probe starts checking before the application is ready and the initialDelaySeconds is too short, the probe fails, kubelet kills the container, it restarts, and the cycle continues. The fix is to use a startup probe with a longer timeout to protect the liveness probe during application initialization, or to increase the liveness probe's initialDelaySeconds and failureThreshold.
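The restart delays described above follow a doubling schedule with a cap. A minimal sketch of that arithmetic, assuming the 10-second base and 300-second cap quoted in the text (the function name is illustrative):

```python
def backoff_delays(restarts, base_seconds=10, cap_seconds=300):
    """Delay before each restart: doubles from the base until it reaches the cap."""
    return [min(base_seconds * 2 ** attempt, cap_seconds) for attempt in range(restarts)]


delays = backoff_delays(7)
total_wait = sum(delays)
```

After the sixth restart every further attempt waits the full five minutes, so a pod that has been crash-looping for a while can take several minutes to pick up a fix such as a corrected ConfigMap.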

Code Example

# Check pod status and restart count
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments # Shows status CrashLoopBackOff and restart count

# Get the exit code to categorize the failure
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -A10 'Last State' # Exit code 1=app error, 137=OOMKilled

# Check logs from the PREVIOUS crashed container (critical — current container may already be dead)
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=50 # Shows why the last container died

# Check if required ConfigMaps and Secrets exist
kubectl get configmap payments-config -n payments # Verify ConfigMap exists
kubectl get secret payments-db-credentials -n payments # Verify Secret exists

# Check the command configured in the pod spec (empty output means the image's default entrypoint is used)
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.spec.containers[0].command}' # Shows configured command

# Check if OOMKilled is the cause
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason and exit code

# Add a startup probe so the liveness probe cannot kill slow-starting apps
# startupProbe:
#   httpGet:
#     path: /health        # Startup health endpoint
#     port: 8080           # Application port
#   failureThreshold: 30   # Allow 30 x 10s = 5 minutes to start
#   periodSeconds: 10      # Check every 10 seconds during startup

◈ Architecture Diagram

┌──────────┐
│ Start    │
└────┬─────┘
     ↓
┌──────────┐
│ Crash    │←─── Exit 1: App Error
│ (exit≠0) │←─── Exit 137: OOMKill
└────┬─────┘←─── Exit 143: Probe
     ↓
┌──────────┐
│ Backoff  │
│10→20→40s │
└────┬─────┘
     ↓
┌──────────┐
│ Restart  │
└──────────┘

How do multi-stage builds, caching, and base image pinning affect production Docker security?

advanced · security · docker

Quick Answer

Multi-stage builds keep build tools out of runtime images. Cache ordering speeds up builds. Pinning base images helps reproducibility, but you must rebuild often and scan for vulnerabilities so old layers don't hide security holes.

Detailed Answer

Think of a restaurant kitchen that uses mixers, cutting boards, and raw ingredients to make a meal, but only sends the finished plate to the customer. A bad container image ships the entire kitchen to the table. A good multi-stage Docker build uses one stage as the kitchen and a second stage as the clean plate. The final image has only what the app needs to run, which cuts size, attack surface, and surprises in production. Docker's build best practices call for multi-stage builds, smart base image choices, a solid .dockerignore file, skipping unnecessary packages, using the build cache wisely, and rebuilding images regularly. The real concern is not just size. Every extra package, shell, compiler, credential, or leftover file in the final image gives attackers something to inspect or exploit. Smaller runtime images are easier to scan, faster to transfer, quicker to start, and simpler to reason about when things go wrong. The build process runs Dockerfile instructions in order and can reuse cached layers when an instruction and its inputs have not changed. This is why you copy stable dependency files before frequently changing source code. With BuildKit, teams can also use cache mounts and secret mounts so dependency downloads go faster and credentials never become image layers. Multi-stage builds then copy only selected files from builder stages into the final runtime stage. In production pipelines, engineers pin base images by digest for repeatable builds, scan images for known vulnerabilities, generate SBOMs (software bills of materials), sign artifacts, and rebuild even when app code has not changed. Rebuilds matter because a pinned digest freezes the base layer, while security patches land only in newer image versions. Good teams track image size, critical CVE counts, startup time, pull latency, and rollback success rates as key metrics. The tricky part is that reproducibility and freshness pull in opposite directions. Pinning python:3.12-slim@sha256:... makes builds predictable, but it also locks in vulnerabilities until someone bumps the digest. Floating tags pick up patches automatically, but they can change under you and create builds you cannot reproduce. Senior engineers solve this with automated dependency-update PRs, signed digests, scheduled CI rebuilds, and policy gates. The goal is to make image supply-chain safety boring and routine rather than heroic and manual.
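One of the policy gates mentioned above could be a CI check that every external base image is pinned by digest. The sketch below is a hypothetical linter, not part of Docker or any scanner; the function name and the sample Dockerfile are assumptions, and it only handles plain `FROM` lines (no `ARG` substitution).

```python
import re

# Matches "FROM [--platform=...] <image> ..." and captures the image reference.
FROM_RE = re.compile(r"^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)", re.IGNORECASE)


def unpinned_base_images(dockerfile_text):
    """Return external base images that are not pinned with an @sha256 digest."""
    stage_names = set()
    unpinned = []
    for line in dockerfile_text.splitlines():
        match = FROM_RE.match(line)
        if not match:
            continue
        image = match.group(1)
        tokens = line.split()
        alias = tokens[-1].lower() if len(tokens) >= 2 and tokens[-2].upper() == "AS" else None
        refers_to_stage = image.lower() in stage_names  # e.g. "FROM deps AS builder"
        if image != "scratch" and not refers_to_stage and "@sha256:" not in image:
            unpinned.append(image)
        if alias:
            stage_names.add(alias)
    return unpinned


sample = "\n".join([
    "FROM node:22-bookworm AS deps",
    "RUN npm ci",
    "FROM deps AS builder",
    "FROM gcr.io/distroless/nodejs22-debian12@sha256:" + "a" * 64 + " AS production",
])
findings = unpinned_base_images(sample)
```

A gate like this only enforces reproducibility. It needs to be paired with the automated digest-bump PRs and scheduled rebuilds described above, or pinned images will quietly age.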

Code Example

docker buildx build --pull --tag registry.internal/payments-api:2026.06.18 --file Dockerfile . # Builds with a refreshed base image tag and a traceable release tag.
docker history registry.internal/payments-api:2026.06.18 # Inspects final layers to confirm build tools and secrets were not copied into runtime.
docker scout cves registry.internal/payments-api:2026.06.18 # Scans the image for known vulnerabilities before promotion.
docker push registry.internal/payments-api:2026.06.18 # Publishes the reviewed image to the internal registry.
docker image inspect registry.internal/payments-api:2026.06.18 --format '{{json .RepoDigests}}' # Captures the immutable digest for deployment manifests (populated only after a push or pull).

◈ Architecture Diagram

┌──────────┐
│ Source   │
└────┬─────┘
     ↓
┌──────────┐
│ Builder  │
└────┬─────┘
     ↓ copy
┌──────────┐
│ Runtime  │
└────┬─────┘
     ↓ scan
┌──────────┐
│ Registry │
└────┬─────┘
     ↓
┌──────────┐
│ Deploy   │
└──────────┘

How do multi-stage builds shrink images and how do you order Dockerfiles to avoid slow CI builds?

architect · security · docker

Quick Answer

Multi-stage builds use multiple FROM lines to separate build tools from runtime artifacts, so the final image has only the compiled binary and minimal OS libraries. Ordering dependency installs before source code copies maximizes cache hits and avoids full rebuilds when only app code changes.

Detailed Answer

Think of a woodworking shop. You need saws, clamps, sandpaper, and a workbench to build a cabinet, but the customer only gets the finished cabinet. They do not take home the saw. A multi-stage Docker build works the same way: one stage has all the build tools, and the final stage holds only the finished product. In Docker, a multi-stage build uses multiple FROM instructions in a single Dockerfile. Each FROM begins a new stage with its own base image and filesystem. Intermediate stages can install compilers, download dependencies, run tests, and produce artifacts. The final stage starts from a tiny base like distroless or alpine and copies only the compiled binaries or bundled assets from earlier stages using COPY --from. This means the production image never contains gcc, npm, pip, or any build toolchain, eliminating hundreds of megabytes and thousands of CVE-carrying packages from the runtime image. Under the hood, Docker and BuildKit process each stage as an independent node in the build graph. BuildKit can run independent stages in parallel, which is a major speed advantage over the legacy builder. When the Dockerfile is ordered correctly (base image first, dependency manifest copy second, dependency install third, source code copy fourth, build fifth), BuildKit reuses cached layers for everything up to the point where content changes. Since dependency manifests like package.json or go.sum change far less often than source code, this ordering means most CI builds only rebuild the final compilation step instead of re-downloading all dependencies. At production scale, teams running 50 or more microservices through CI see dramatic results. A payments-api image that was 1.2 GB with a single-stage node build drops to 85 MB with a multi-stage build using distroless as the final base. CI time drops from 8 minutes to 2 minutes because dependency layers are cached. 
Security scanners report 90 percent fewer vulnerabilities because the final image has no compilers, shells, or package managers. Teams should also use .dockerignore to exclude test fixtures, documentation, and local configs from the build context, which prevents unnecessary cache busting and reduces context transfer time. The tricky gotcha is that unnamed stages can only be referenced by position (COPY --from=0, COPY --from=1), which breaks silently when someone adds a new stage. Always name stages with AS and reference them by name. Another trap is copying an entire directory from the build stage instead of specific artifacts, which can accidentally include build caches, test output, or sensitive files in the production image. Architects should also know that multi-stage builds do not automatically clean up intermediate layers in CI. BuildKit's garbage collection handles this, but disk pressure on CI runners can still build up if no storage limit is configured for the build cache.
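The effect of instruction ordering on cache reuse can be illustrated with a toy model. This is a deliberate simplification of how layer caching behaves, not BuildKit's actual algorithm: each step lists the files it reads, and the first step whose inputs changed invalidates itself and everything after it. The step lists and file names are assumptions chosen to mirror a typical Node project.

```python
def steps_to_rebuild(steps, changed_files):
    """steps: ordered list of (instruction, set of input files).

    Returns the instructions that must re-run: the first step touching a
    changed file, plus every step after it.
    """
    for index, (_, inputs) in enumerate(steps):
        if inputs & changed_files:
            return [instruction for instruction, _ in steps[index:]]
    return []


manifests = {"package.json", "package-lock.json"}

well_ordered = [
    ("COPY package.json package-lock.json ./", manifests),
    ("RUN npm ci", set()),
    ("COPY src/ ./src/", {"src"}),
    ("RUN npm run build", set()),
]
poorly_ordered = [
    ("COPY . .", manifests | {"src"}),
    ("RUN npm ci", set()),
    ("RUN npm run build", set()),
]

source_change = {"src"}
fast = steps_to_rebuild(well_ordered, source_change)
slow = steps_to_rebuild(poorly_ordered, source_change)
```

With the well-ordered file, a source-only change skips `npm ci` entirely. With `COPY . .` first, the same change forces the dependency install to run again.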

Code Example

# Dockerfile for payments-api using multi-stage build
# Stage 1: Install dependencies in a full Node image
FROM node:22-bookworm AS deps
# Set the working directory for dependency installation
WORKDIR /build
# Copy only the dependency manifests first to maximize cache hits
COPY package.json package-lock.json ./
# Install production dependencies with exact versions from lockfile
RUN npm ci --omit=dev

# Stage 2: Build the application with dev dependencies
FROM node:22-bookworm AS builder
# Set the working directory for the build process
WORKDIR /build
# Copy all dependency manifests for full install including dev deps
COPY package.json package-lock.json ./
# Install all dependencies including TypeScript compiler and test tools
RUN npm ci
# Copy source code after dependencies to preserve layer cache
COPY src/ ./src/
# Copy TypeScript config for compilation
COPY tsconfig.json ./
# Compile TypeScript to JavaScript in the dist directory
RUN npm run build

# Stage 3: Production image with only runtime artifacts
FROM gcr.io/distroless/nodejs22-debian12 AS production
# Set a non-root user for security hardening
USER 1000
# Set the working directory for the application
WORKDIR /app
# Copy only production node_modules from the deps stage
COPY --from=deps /build/node_modules ./node_modules/
# Copy only the compiled JavaScript from the builder stage
COPY --from=builder /build/dist ./dist/
# Expose the API port for documentation and container networking
EXPOSE 8080
# Run the compiled application entry point
CMD ["dist/server.js"]

◈ Architecture Diagram

┌──────────┐
│ deps     │
│ npm ci   │
└────┬─────┘
     │
┌────┴─────┐
│ builder  │
│ compile  │
└────┬─────┘
     │ COPY --from
┌────┴─────┐
│production│
│distroless│
└──────────┘

How do you build secure Docker images with multi-stage builds, non-root users, and minimal attack surface?

intermediate · security · docker

Quick Answer

Secure Docker images use multi-stage builds to exclude build tools from the final image, run as non-root users with explicit UIDs, start from minimal base images like Distroless or Alpine, pin dependencies to digests, and drop all unnecessary capabilities. This reduces the attack surface from hundreds of exploitable packages to a minimal runtime footprint.

Detailed Answer

Think of building a secure Docker image like constructing a bank vault room. During construction, workers bring in welding equipment, power tools, scaffolding, and raw materials. Once the vault is complete, every construction tool is removed from the room. The vault door is keyed to specific authorized personnel, not a master key. The room contains only what is needed for its purpose: reinforced walls, a locking mechanism, and a ventilation system. If a thief breaks in, they find no tools to use against the vault itself. Multi-stage Docker builds follow the same principle: build tools exist only during construction and never ship to production. A multi-stage Dockerfile separates the build environment from the runtime environment using multiple FROM statements. The first stage installs compilers, package managers, testing frameworks, and build dependencies needed to compile the application. The second stage starts from a minimal base image and copies only the compiled binary or application artifacts from the build stage. For a Java banking application, the build stage might use a full JDK image with Maven, while the runtime stage uses a Distroless Java image that contains only the JRE and no shell, package manager, or system utilities. This dramatically reduces the number of packages that vulnerability scanners flag and eliminates tools that attackers could use for post-exploitation activities like installing malware or pivoting to other services. Running containers as non-root is a fundamental security control that prevents container breakout exploits from gaining host-level root access. The Dockerfile creates a dedicated application user with a specific numeric UID and GID, changes ownership of application files to that user, and switches to that user with the USER directive before the ENTRYPOINT. 
In banking environments, the specific UID matters because it must match file permissions on mounted volumes and satisfy Pod Security Standards that require runAsNonRoot in Kubernetes. Using numeric UIDs instead of usernames avoids dependency on /etc/passwd, which may not exist in Distroless images. The non-root user should have no shell assigned and no home directory beyond what the application needs. Minimal base images are the foundation of attack surface reduction. A standard Ubuntu base image contains over 100 installed packages including shells, text editors, network utilities, and package managers. An Alpine image reduces this to roughly 15 packages. A Google Distroless image contains only the application runtime and its direct dependencies, with no shell at all. For banking applications, Distroless is preferred for production because if an attacker gains code execution inside the container, they cannot open a shell, install tools, or inspect the filesystem interactively. When debugging is needed, teams use ephemeral debug containers through kubectl debug rather than shipping debug tools in production images. The production gotcha that catches many teams is the interaction between read-only root filesystems and application behavior. Many frameworks write temporary files, session data, or compilation caches to the filesystem at runtime. When the root filesystem is read-only, these writes fail and the application crashes. Teams must identify every path the application writes to and mount emptyDir volumes at those paths. Log files should go to stdout and stderr rather than filesystem paths. Another subtle issue is layer ordering in the Dockerfile: placing frequently changing instructions like COPY of application code after rarely changing instructions like dependency installation maximizes build cache utilization and reduces build times from minutes to seconds. 
In regulated environments, every base image must also be scanned and approved through the organization's software supply chain process before it can be used as a FROM source.
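The numeric non-root requirement described above is easy to check mechanically before an image reaches the cluster. The following is a hypothetical pre-merge check, not a Kubernetes or Docker API; the function names are illustrative. It is deliberately conservative: a final stage with no USER line is reported as failing, even though a base image such as a `:nonroot` Distroless tag may already default to a non-root user.

```python
def final_stage_user(dockerfile_text):
    """Return the last USER value in the final build stage, or None if unset."""
    user = None
    for line in dockerfile_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        keyword = tokens[0].upper()
        if keyword == "FROM":
            user = None  # each stage starts from its base image's default user
        elif keyword == "USER" and len(tokens) > 1:
            user = tokens[1]
    return user


def runs_as_numeric_non_root(dockerfile_text):
    """True only when the final stage sets an explicit numeric, non-zero UID."""
    user = final_stage_user(dockerfile_text)
    if user is None:
        return False
    uid = user.split(":")[0]  # USER may be "uid" or "uid:gid"
    return uid.isdigit() and int(uid) != 0


secure = "FROM maven AS builder\nUSER root\nFROM distroless\nUSER 65532:65532\n"
named = "FROM distroless\nUSER appuser\n"
ok = runs_as_numeric_non_root(secure)
```

Rejecting named users mirrors the point above: with `runAsNonRoot`, Kubernetes cannot verify that a username is non-root without resolving it, and `/etc/passwd` may not exist in the image.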

Code Example

# Secure multi-stage Dockerfile for payments-api (Spring Boot)
# Stage 1: Build — full JDK with Maven for compilation
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /build

# Cache dependencies separately from application code
COPY pom.xml .
RUN mvn dependency:go-offline -B

# Copy source, build, then extract the layered Spring Boot JAR for optimal Docker layers
COPY src/ src/
RUN mvn package -DskipTests -B && \
    java -Djarmode=layertools -jar target/payments-api.jar extract --destination extracted

# Stage 2: Runtime — minimal Distroless image (no shell, no pkg manager)
FROM gcr.io/distroless/java17-debian12:nonroot

# Labels for audit and compliance tracking
LABEL maintainer="[email protected]" \
      app="payments-api" \
      compliance="sox-pci" \
      base-image="distroless-java17"

WORKDIR /app

# Copy Spring Boot layers in dependency order for cache efficiency
COPY --from=builder /build/extracted/dependencies/ ./
COPY --from=builder /build/extracted/spring-boot-loader/ ./
COPY --from=builder /build/extracted/snapshot-dependencies/ ./
COPY --from=builder /build/extracted/application/ ./

# Run as non-root user (UID 65532 is the nonroot user in Distroless)
USER 65532:65532

# Document the listening port targeted by Kubernetes probes and Services
EXPOSE 8080
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", \
            "-Djava.security.egd=file:/dev/./urandom", \
            "org.springframework.boot.loader.launch.JarLauncher"]

# Compare image sizes to prove attack surface reduction
# docker images
# REPOSITORY          TAG         SIZE
# payments-api-full   latest      580MB  (JDK + Maven + OS tools)
# payments-api        latest      210MB  (Distroless JRE only)

# Verify no shell exists in the production image
# docker run --rm payments-api /bin/sh
# exec: "/bin/sh": stat /bin/sh: no such file or directory

# Scan the final image for vulnerabilities
trivy image --severity CRITICAL,HIGH ecr.bank.com/payments-api:v2.3.1

◈ Architecture Diagram

┌─────────────────────────────────────────────┐
│           Multi-Stage Build                 │
│                                             │
│  Stage 1: Builder                           │
│  ┌────────────────────────────┐              │
│  │ JDK 17 + Maven            │              │
│  │ Source Code                │              │
│  │ Test Frameworks           │  ← DISCARDED │
│  │ Build Tools               │              │
│  │ OS Packages (580MB)       │              │
│  └─────────────┬─────────────┘              │
│                │ COPY --from=builder        │
│                ↓ (JAR only)                 │
│  Stage 2: Runtime                           │
│  ┌────────────────────────────┐              │
│  │ Distroless Java 17        │              │
│  │ payments-api.jar           │  ← SHIPPED  │
│  │ USER 65532 (non-root)     │              │
│  │ No shell, no pkg mgr      │              │
│  │ Read-only rootFS (210MB)  │              │
│  └────────────────────────────┘              │
└─────────────────────────────────────────────┘

What is a multi-stage Docker build and why does it matter for production?

intermediate · general · docker

Quick Answer

A multi-stage build uses multiple FROM lines in one Dockerfile. You compile code in one stage and copy only the finished artifact into a tiny runtime image. This slashes image size and attack surface.

Detailed Answer

Think of a multi-stage build like a factory assembly line. In a car factory, welding robots, paint booths, and heavy machinery stay on the factory floor. Only the finished car rolls out and into the showroom. A multi-stage Docker build works the same way: all the bulky compilers, build tools, and source code stay in the build stage, while only the lean, finished binary ships in the final image. The key idea is separating what you need to build from what you need to run. A multi-stage Docker build is a Dockerfile pattern where you write multiple FROM instructions, each starting a fresh stage. The first stage usually pulls a full SDK or compiler image, installs dependencies, compiles source code, runs tests, and produces a deployable file. Later stages start from a tiny base image like alpine or distroless and use COPY --from to grab only the necessary files from earlier stages. The result is a final image that contains nothing except what the app needs to run, with no leftover build tools, package caches, or temporary files. Under the hood, Docker treats each stage as a separate image with its own filesystem and layer history. When Docker hits a second FROM instruction, it starts from a clean base while keeping the earlier stages available for reference. The COPY --from=builder command reaches into the filesystem of the named stage and pulls out specific paths. Each stage can use a completely different base image. For example, you might build with golang:1.21 and run with gcr.io/distroless/static-debian12. The build cache works per-stage, so changes to later stages do not force earlier stages to rebuild, which makes development iterations faster. In production, multi-stage builds are considered a must-have for several reasons. First, they shrink image size dramatically. A Go app built in a standard golang image might weigh 900MB, but the final distroless image with just the static binary can be under 15MB. 
Smaller images pull faster across registries, scale quicker in Kubernetes, cost less to store, and give security scanners far less to audit. Second, multi-stage builds fit cleanly into CI/CD pipelines because the entire build process lives inside the Dockerfile, with no external build scripts or Makefiles needed. Third, they support reproducible builds, since every developer and every CI runner uses the same build environment defined in that first stage (provided the base image is pinned). A common mistake is copying too many files from the build stage. Using COPY --from=builder / /app/ instead of targeting a specific directory can accidentally pull in source code, credential files, or package caches that bloat the image and create security risks. Always copy only the exact artifact you need. Another subtle issue: build arguments (ARG) defined in one stage are not available in later stages. You must redeclare ARG after each FROM if you need the same value. Finally, intermediate stages are not always cleaned up automatically, so pruning regularly (docker builder prune for the BuildKit cache, docker image prune for dangling images) is important to free disk space on build servers.

Code Example

# Stage 1: Build the payments-api Go binary
FROM golang:1.21-alpine AS builder
# Set the working directory inside the build container
WORKDIR /src
# Copy go.mod and go.sum first to leverage layer caching
COPY go.mod go.sum ./
# Download dependencies (cached if go.mod/go.sum unchanged)
RUN go mod download
# Copy the entire source code into the build container
COPY . .
# Compile the payments-api binary with CGO disabled for static linking
RUN CGO_ENABLED=0 GOOS=linux go build -o /payments-api ./cmd/server
# Stage 2: Create a minimal production image
FROM gcr.io/distroless/static-debian12
# Copy only the compiled binary from the builder stage
COPY --from=builder /payments-api /payments-api
# Copy the config file needed at runtime
COPY --from=builder /src/config/production.yaml /config/production.yaml
# Expose the port the payments-api listens on
EXPOSE 8080
# Set the entrypoint to run the payments-api binary
ENTRYPOINT ["/payments-api"]

◈ Architecture Diagram

┌─────────────────────────────┐
│  Stage 1: builder           │
│  FROM golang:1.21-alpine    │
│                             │
│  ┌───────────────────────┐  │
│  │ Source Code + go.mod  │  │
│  └───────────┬───────────┘  │
│              ↓              │
│  ┌───────────────────────┐  │
│  │ go build → /payments  │  │
│  └───────────┬───────────┘  │
└──────────────┼──────────────┘
               ↓ COPY --from=builder
┌──────────────┼──────────────┐
│  Stage 2: runtime           │
│  FROM distroless            │
│              ↓              │
│  ┌───────────────────────┐  │
│  │ /payments (binary)    │  │
│  │ /config/prod.yaml     │  │
│  └───────────────────────┘  │
│                             │
│  EXPOSE 8080                │
└─────────────────────────────┘

How do you implement SLO-based alerting with Prometheus using multi-window burn rate?

advanced · monitoring · prometheus

Quick Answer

Multi-window burn rate alerting fires when the error rate burns through the error budget faster than expected across both a long window (1h) and a short window (5m). This reduces alert noise compared to static thresholds by only alerting when the burn rate is sustained enough to exhaust the budget within the SLO period.

Detailed Answer

Think of a car's fuel gauge. A static threshold alert says 'warn at 25% fuel', but that ignores whether you are on a highway burning fuel fast or parked with the engine off. Multi-window burn rate is like saying 'warn when fuel consumption over the last hour would empty the tank before you reach the next gas station, AND you are still burning fast right now.' This catches real problems while ignoring brief spikes. SLO-based alerting starts with defining an error budget. If your SLO is 99.9% availability over 30 days, your error budget is 0.1%, or about 43 minutes of downtime. The burn rate is how fast you are consuming this budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the period. A burn rate of 14.4x means you will exhaust the 30-day budget in about 2 days (50 hours), which is 2% of the budget per hour. Multi-window burn rate uses two windows to reduce false positives. The long window (typically 1 hour) detects sustained error rates that threaten the budget. The short window (typically 5 minutes) confirms the problem is still happening right now. Both conditions must be true for the alert to fire. This prevents alerting on brief spikes that self-resolve (the long window would not fire) and on historical errors that have already been fixed (the long window shows the past, the short window confirms the present). Google's Site Reliability Workbook recommends multiple tiers: 14.4x burn rate over 1h/5m and 6x over 6h/30m for pages, and 1x over 3d/6h for tickets. At production scale, teams define recording rules that pre-compute error ratios for each SLI at multiple windows. The error ratio is calculated as rate(http_requests_total{status=~"5.."}[window]) / rate(http_requests_total[window]). Recording rules at 5m, 30m, 1h, and 6h windows avoid expensive queries at alert evaluation time. Grafana dashboards show the remaining error budget as a percentage, making it visible whether the team can ship features or must focus on reliability. 
The non-obvious gotcha is that burn rate alerts assume a uniform error distribution, which rarely matches reality. A 5-minute outage that burns 10% of the monthly budget followed by 29 days of perfect operation is very different from a constant 0.1% error rate. Teams should complement burn rate alerts with absolute threshold alerts for catastrophic failures (error rate > 50% for 1 minute) that would cause immediate user impact regardless of the monthly budget.
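The budget and burn-rate arithmetic above can be worked through in a few lines. This is a sketch of the calculations only, not a Prometheus client; the function names are illustrative, and the 99.9% SLO and 30-day window are the figures used in the text.

```python
def error_budget_minutes(slo, window_days):
    """Total allowed 'bad' minutes over the SLO window."""
    return (1 - slo) * window_days * 24 * 60


def hours_to_exhaustion(burn_rate, window_days):
    """How long the whole budget lasts at a constant burn rate."""
    return window_days * 24 / burn_rate


def burn_rate_threshold(slo, burn_rate):
    """Error ratio that corresponds to a given burn rate."""
    return burn_rate * (1 - slo)


def should_page(long_window_ratio, short_window_ratio, slo, burn_rate=14.4):
    """Multi-window condition: both windows must exceed the threshold."""
    threshold = burn_rate_threshold(slo, burn_rate)
    return long_window_ratio > threshold and short_window_ratio > threshold


budget = error_budget_minutes(0.999, 30)      # roughly 43.2 minutes
lifetime = hours_to_exhaustion(14.4, 30)      # roughly 50 hours
threshold = burn_rate_threshold(0.999, 14.4)  # roughly 0.0144, i.e. 1.44% errors
```

The threshold matches the `14.4 * 0.001` expression used in the alert rule: a sustained 1.44% error ratio on a 99.9% SLO is a 14.4x burn. An incident that has already recovered leaves the long window high and the short window low, so `should_page` returns False.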

Code Example

# Recording rules for multi-window error ratios
# prometheus-rules.yaml
groups:
- name: slo-payments-api
  rules:
  # 5-minute error ratio (short window)
  - record: payments_api:error_ratio:5m
    expr: |
      sum(rate(http_requests_total{service="payments-api",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{service="payments-api"}[5m]))

  # 1-hour error ratio (long window)
  - record: payments_api:error_ratio:1h
    expr: |
      sum(rate(http_requests_total{service="payments-api",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{service="payments-api"}[1h]))

  # Multi-window burn rate alert (14.4x = exhausts 30-day budget in 2 days)
  - alert: PaymentsAPIHighBurnRate
    expr: |
      payments_api:error_ratio:1h > (14.4 * 0.001)
      and
      payments_api:error_ratio:5m > (14.4 * 0.001)
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "payments-api burning error budget at 14.4x rate"

How do Prometheus staleness markers and the lookback window affect alerts when a target disappears?

advanced, monitoring, prometheus

Quick Answer

Prometheus reuses the newest sample only if it falls within the lookback window, which defaults to 5 minutes. When a target or metric disappears, Prometheus writes a staleness marker so queries stop returning the old value instead of silently carrying it forever.

Detailed Answer

Think of a train station display board. If the 8:10 train reported its location two minutes ago, the board can still show a useful last-known position. If the train has not reported for an hour, showing that old position would mislead passengers. Prometheus has the same problem with metrics: a recent sample is fine to use at query time, but an old sample should eventually disappear so graphs and alerts do not pretend the system is healthy.
PromQL, the Prometheus query language, evaluates instant queries at a single timestamp and range queries at many evenly spaced timestamps. For each evaluation timestamp, Prometheus looks backward for the newest sample inside the lookback window. The default lookback is 5 minutes, and it is configurable. This lets queries work even when scrapes do not land exactly on graph step boundaries. Without this behavior, normal scrape timing jitter would create broken graphs and unreliable aggregations.
Staleness adds another layer. If a target scrape no longer returns a series that previously existed, or if service discovery removes a target entirely, Prometheus can write a staleness marker for that time series. After that marker, instant queries no longer return the old value for that series. This prevents stale readings from being treated as current values in aggregations like sum, avg, or alert expressions. If fresh samples later arrive for the same label set, the series simply reappears.
Production alerting gets subtle here. An alert like `up == 0` catches failed scrapes where the target is still known but unreachable. However, it may not catch a target that vanished from service discovery, because there may be no `up` series left to evaluate. For detecting missing services, `absent()` or inventory-based alerts are usually needed. Engineers also tune scrape_interval, scrape_timeout, evaluation_interval, and the alert `for` duration so brief network hiccups do not page people while true disappearances still get caught quickly.
The experienced gotcha is that a graph can look flat or empty for different reasons. A flat line may mean Prometheus is carrying a recent last sample inside the lookback window, while an empty graph may mean the series went stale, not that the value became zero. Exporters that attach their own timestamps can behave differently and may keep the last value visible until lookback expires. A common band-aid is using `or vector(0)` everywhere, which makes dashboards look tidy but hides missing telemetry. Senior engineers learn to distinguish between zero, missing, stale, and failed-scrape states explicitly rather than papering over the differences.
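To make the lookback and staleness rules concrete, here is a simplified Python model of what an instant selector returns at a given evaluation time. It is a sketch of the behavior described above, not Prometheus code: `instant_value` and the `STALE` sentinel are hypothetical names, and the exact window boundary handling is an assumption.

```python
from typing import List, Optional, Tuple

LOOKBACK_SECONDS = 300.0   # Prometheus default lookback (5 minutes)
STALE = object()           # stand-in for a staleness marker sample

Sample = Tuple[float, object]  # (timestamp in seconds, value or STALE)


def instant_value(samples: List[Sample], eval_time: float,
                  lookback: float = LOOKBACK_SECONDS) -> Optional[object]:
    """Value an instant selector would return at eval_time, or None if absent."""
    # Only samples inside the lookback window are candidates.
    in_window = [s for s in samples if eval_time - lookback < s[0] <= eval_time]
    if not in_window:
        return None  # newest sample is older than the lookback window
    _, value = max(in_window, key=lambda s: s[0])
    if value is STALE:
        return None  # series was explicitly marked stale
    return value


if __name__ == "__main__":
    scrapes = [(0.0, 1.0), (15.0, 1.0), (30.0, 1.0)]
    print(instant_value(scrapes, 120.0))                    # recent sample reused
    print(instant_value(scrapes, 400.0))                    # lookback expired
    print(instant_value(scrapes + [(45.0, STALE)], 120.0))  # staleness marker wins
```

The three calls cover the states the answer distinguishes: a value carried forward inside the lookback window, a series that drops out once the window expires, and a series that disappears immediately because a staleness marker was written.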

Code Example

curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=up{job="payments-api"}' # Checks whether Prometheus still sees the payments-api target and whether the latest scrape succeeded.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=absent(up{job="payments-api"})' # Detects the case where the target disappeared from service discovery and no up series exists.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=max_over_time(up{job="payments-api"}[10m])' # Shows whether any scrape of the target succeeded (up == 1) during the last 10 minutes.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=time() - timestamp(up{job="payments-api"})' # Measures how old the newest up sample is for the target.
promtool check rules /etc/prometheus/rules/payments-availability.yml # Validates alert rules before reloading them into Prometheus.

◈ Architecture Diagram

┌──────────┐
│ Target   │
└────┬─────┘
     ↓ scrape
┌──────────┐
│ Sample   │
└────┬─────┘
     ↓ query
┌──────────┐
│ Lookback │
└────┬─────┘
     ↓
┌──────────┐      ┌──────────┐
│ Present  │      │ Stale    │
└────┬─────┘      └────┬─────┘
     ↓                 ↓
┌──────────┐      ┌──────────┐
│ Alert    │      │ Absent   │
└──────────┘      └──────────┘

How should you design Prometheus histograms for latency SLOs, and when would you use a summary instead?

advanced, monitoring, prometheus

Quick Answer

Use histograms when you need to aggregate percentiles across many instances and tie them to SLOs. Classic histograms need explicit bucket boundaries, native histograms reduce that manual work, and summaries calculate percentiles inside the app but cannot be safely combined across replicas.

Detailed Answer

Think of measuring checkout wait times by placing customers into labeled bins: under 100 ms, under 300 ms, under 1 second, and so on. If the bins are chosen around the thresholds the business actually cares about, the data is useful. If every bin is too wide, too narrow, or missing the SLO boundary, the final percentile looks precise but answers the wrong question. Prometheus histograms are that binning system for measurements like request duration.
A classic Prometheus histogram exposes cumulative bucket counters using the `le` label, which stands for less than or equal, plus `_sum` and `_count` series. Prometheus calculates percentiles using the `histogram_quantile()` function over rates of those buckets. The big advantage of this design is that you can aggregate across pods, nodes, clusters, or jobs before calculating the percentile, which is why histograms are the go-to for distributed services. The cost is extra time series: each bucket boundary creates another series for every label combination.
Native histograms change the storage model by representing many bucket spans more compactly and letting Prometheus handle histogram samples directly. They reduce some of the pain of choosing bucket boundaries manually and support more flexible percentile exploration. However, they require compatible Prometheus settings, client libraries, remote write backends, and query paths, so you need to check the full chain before adopting them. Summaries are a different animal: they compute selected quantiles inside each application process. That can be useful for a single process, but averaging p95 values across replicas is statistically wrong because each process saw a different number and shape of requests.
The query path matters for getting correct results. For classic histograms, you typically apply `rate()` to `_bucket` counters, aggregate with `sum by (le, service)` or similar, then call `histogram_quantile()`. The `le` label must survive until the quantile function runs because it represents the bucket boundary. For SLO checks like seeing what fraction of requests finish under 300 ms, having an exact bucket at that boundary makes the calculation simple and reliable. For Apdex-style scores, you need buckets at both the satisfied and tolerated thresholds.
The gotcha is that histogram percentiles are estimates, and how good the estimate is depends entirely on where you place the buckets. A p99 alert built on buckets of 100 ms, 1 second, and 10 seconds cannot accurately tell the difference between 1.2 seconds and 8 seconds. Another common mistake is averaging per-pod p95 values in Grafana, which gives equal weight to quiet pods and busy pods. Experienced engineers pick bucket boundaries around the SLO thresholds users care about, keep labels low-cardinality, aggregate buckets before computing quantiles, and verify that the remote storage path preserves the histogram type they depend on.
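The bucket-placement gotcha is easier to see with numbers. The sketch below is a simplified Python version of the linear interpolation that `histogram_quantile()` applies to classic cumulative buckets. It is an illustration under stated assumptions (a `+Inf` bucket is present, a lower bound of 0 for the first bucket), and `classic_quantile` is a hypothetical name, not a Prometheus API.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound `le`, cumulative count)


def classic_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls past the last finite bucket: report its upper bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # 100 requests; the two slowest land somewhere between 1 s and 10 s.
    coarse = [(0.1, 50), (1.0, 98), (10.0, 100), (math.inf, 100)]
    print(classic_quantile(0.99, coarse))
```

With these coarse buckets the p99 estimate comes out at 5.5 seconds whether the two slow requests actually took 1.2 seconds or 8 seconds, because the function only knows they fell somewhere in the 1 s to 10 s bucket.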

Code Example

curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="payments-api"}[5m])))' # Computes fleet-wide p95 latency from classic histogram buckets after aggregating by bucket boundary.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_request_duration_seconds_bucket{job="payments-api",le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count{job="payments-api"}[5m]))' # Calculates the fraction of payments-api requests completed within the 300 ms SLO bucket.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_request_duration_seconds_sum{job="payments-api"}[5m])) / sum(rate(http_request_duration_seconds_count{job="payments-api"}[5m]))' # Calculates average latency from histogram sum and count without using quantile math.
promtool check rules /etc/prometheus/rules/payments-latency-slo.yml # Validates histogram-based recording and alerting rules before deployment.

◈ Architecture Diagram

┌──────────┐
│ Request  │
└────┬─────┘
     ↓
┌──────────┐
│ Buckets  │
└────┬─────┘
     ↓
┌──────────┐
│ Rate     │
└────┬─────┘
     ↓
┌──────────┐
│ Sum by le│
└────┬─────┘
     ↓
┌──────────┐
│ p95      │
└──────────┘

How does Prometheus remote write handle backpressure when the receiving backend is slow or down?

advanced, monitoring, prometheus

Quick Answer

Remote write tails the Prometheus WAL into per-destination queues, shards the work across parallel senders, batches samples, and retries on failure. If queues fill up, Prometheus stops reading from the WAL for that destination. If the receiver stays down too long, unsent samples can be lost as WAL data gets compacted away.

Detailed Answer

Think of a shipping dock that sends packages from a factory to a central warehouse. The factory keeps making boxes, workers load them into several truck lanes, and sometimes the warehouse slows down. If the lanes fill up, boxes pile up at the dock. If the warehouse stays closed for hours, the factory has to choose between halting intake, using more dock space, or eventually throwing away boxes it can no longer hold. Prometheus remote write is that shipping dock for metrics.
Prometheus first ingests samples locally through its normal scrape and WAL path. A remote write component then reads from the WAL, maps internal series IDs to their label sets, queues samples, and sends compressed HTTP requests to the configured remote endpoint. That endpoint might be Grafana Mimir, Thanos Receive, Cortex, VictoriaMetrics, or a managed cloud service. Remote write is not a magic way to backfill historical data; it is mainly a streaming replication path from the local ingestion flow.
Backpressure shows up when the remote endpoint is slow, returning errors, rate-limiting, or totally unreachable. Prometheus uses shards, which are parallel sending workers, to improve throughput. Each shard has an in-memory queue with a capacity limit and a maximum batch size. Failed requests get retried with exponential backoff. Prometheus can automatically adjust the shard count based on the incoming sample rate and how long sends are taking. But once queues fill up, reading from the WAL for that remote write target is blocked, and pending samples start piling up.
In production, the key metrics to watch are pending samples, failed samples, retried samples, send batch duration, current shard count, and queue capacity. Tuning usually starts with the receiver side: confirm it is healthy, not throttling, and not rejecting samples due to tenant limits or bad labels. Then tune Prometheus settings like `max_samples_per_send`, `capacity`, `max_shards`, and backoff values. Capacity should generally be several times the batch size, but setting it too high increases Prometheus memory usage. Write relabeling can drop expensive or unnecessary samples before they even leave Prometheus.
The gotcha is that cranking every knob up can make the outage worse. More shards can overwhelm a backend that is trying to recover. More queue capacity can cause Prometheus memory pressure, especially during high series churn because remote write caches series labels. Another gotcha involves the two-hour WAL window: if remote write stays blocked longer than the WAL can hold unsent data, samples get lost when the WAL is compacted. Senior engineers treat remote write tuning as end-to-end flow control, not just a matter of making the queue bigger.
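A quick back-of-the-envelope calculation shows why the in-memory queues only absorb short slowdowns. The Python sketch below assumes `capacity` is a per-shard limit, queues start empty, and rates are constant. The 100,000 samples per second ingest rate is a made-up figure, and the function names are hypothetical.

```python
import math


def queue_capacity(shards: int, capacity_per_shard: int) -> int:
    """Total samples the in-memory queues can hold across all shards."""
    return shards * capacity_per_shard


def seconds_until_blocked(shards: int, capacity_per_shard: int,
                          ingest_rate: float, send_rate: float) -> float:
    """Seconds until the queues fill and WAL reading blocks for this target."""
    growth = ingest_rate - send_rate  # samples per second piling up
    if growth <= 0:
        return math.inf  # the receiver keeps up, so the queues never fill
    return queue_capacity(shards, capacity_per_shard) / growth


if __name__ == "__main__":
    # 10 shards x 30000 capacity, receiver completely down.
    print(queue_capacity(10, 30000))
    print(seconds_until_blocked(10, 30000, ingest_rate=100_000, send_rate=0))
```

Under those assumptions, 10 shards with a capacity of 30000 each hold 300,000 samples, which fills in about 3 seconds if the receiver is completely down. After that, the WAL is the only buffer left, which is why the WAL retention gotcha matters more than queue size during long outages.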

Code Example

remote_write: # Sends locally ingested samples to a central backend such as Mimir or Thanos Receive.
  - url: https://mimir-write.monitoring.svc/api/v1/push # Points Prometheus at the remote write receiver endpoint.
    name: payments-mimir # Gives this remote write queue a stable name in metrics and logs.
    remote_timeout: 30s # Bounds each send request so slow receivers do not hang workers forever.
    queue_config: # Controls memory queues and parallel send workers for this remote write target.
      max_samples_per_send: 5000 # Sends larger batches to improve throughput when the receiver supports them.
      capacity: 30000 # Keeps per-shard capacity about six times the batch size to absorb short slowdowns.
      max_shards: 10 # Caps parallelism so Prometheus does not overload the central backend during recovery.
      min_shards: 2 # Starts with two workers so the queue can drain promptly after restart.
      min_backoff: 1s # Waits at least one second before retrying a failed send.
      max_backoff: 30s # Prevents retry storms by backing off repeated failures.
    write_relabel_configs: # Drops samples before remote write to reduce bandwidth and receiver load.
      - source_labels: [__name__] # Selects samples by metric name before deciding whether to send them.
        regex: 'go_.*' # Matches noisy runtime metrics that the central backend does not need.
        action: drop # Drops matching samples from remote write while keeping local scrape data.

◈ Architecture Diagram

┌──────────┐
│ WAL      │
└────┬─────┘
     ↓
┌──────────┐
│ Queue    │
└────┬─────┘
     ↓
┌──────────┐
│ Shards   │
└────┬─────┘
     ↓
┌──────────┐
│ Receiver │
└────┬─────┘
     ↓
┌──────────┐
│ Object   │
└──────────┘

How should you set up Alertmanager grouping, inhibition, silences, and HA to avoid noise without hiding real incidents?

advanced, monitoring, prometheus

Quick Answer

Alertmanager groups related alerts, deduplicates notifications, routes them to the right receiver, silences planned noise, and inhibits lower-level alerts when a parent alert explains them. In HA mode, Prometheus should send alerts directly to every Alertmanager peer, not through a load balancer.

Detailed Answer

Think of a hospital emergency department during a city-wide power failure. Thousands of alarms pour in from buildings, traffic lights, and elevators. Operators do not want a separate phone call for each alarm. They want one grouped incident per affected area, with enough detail to know which buildings still need help. Alertmanager is that dispatch layer for Prometheus alerts.
Prometheus evaluates alerting rules and sends firing or resolved alerts to Alertmanager over HTTP. Alertmanager then groups alerts by chosen labels, routes each group through a routing tree to the right receiver (Slack, PagerDuty, email), deduplicates repeated notifications, applies silences for planned maintenance, and applies inhibitions when one alert makes another redundant. For example, if a ClusterDown alert is firing, an inhibition rule can suppress thousands of pod-level alerts from that same cluster because they are all symptoms of the same root cause.
Grouping is label-driven. The group_by setting picks which labels define a notification group. group_wait delays the first notification briefly so related alerts can arrive together. group_interval controls how long Alertmanager waits before sending a notification about new alerts that joined a group which has already been notified. repeat_interval controls how frequently unresolved alerts are re-sent. Inhibition rules compare a source alert against target alerts using matchers and equality labels. Silences use matchers and time windows, and are usually created through the Alertmanager UI or API during maintenance windows.
For high availability, multiple Alertmanager instances form a cluster and share notification state through a gossip protocol. Prometheus should be configured with all Alertmanager peers listed as targets. The Alertmanager documentation warns against putting a load balancer between Prometheus and Alertmanager because each Prometheus instance needs to deliver alerts to the full cluster so deduplication and state replication work correctly. Teams also set external labels like cluster, region, and replica carefully so Alertmanager can tell independent environments apart while still deduplicating HA Prometheus replicas.
The gotcha is that label design can either flood your team or hide a real outage. If group_by includes pod, every single pod failure during a deployment becomes a separate page. If it only groups by alertname, unrelated production and staging incidents might collapse into one notification. Inhibition can be dangerous too: if the source alert is too broad or fires too easily, it can silence real alerts. Senior engineers test alert routes with sample payloads, keep grouping labels tied to ownership and blast radius, and regularly review active silences to make sure planned maintenance windows have not turned into black holes for real incidents.
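The effect of `group_by` and inhibition is easy to see on a handful of sample alerts. This Python sketch models both in a simplified way: labels are compared by exact equality only (real Alertmanager matchers also support regex and negation), and the function names and sample alerts are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is modeled as its label set


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def is_inhibited(target: Alert, firing: List[Alert], source_match: Dict[str, str],
                 target_match: Dict[str, str], equal: List[str]) -> bool:
    """True if some firing source alert suppresses the target alert."""
    if any(target.get(k) != v for k, v in target_match.items()):
        return False  # rule does not apply to this alert at all
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        matches_source = all(source.get(k) == v for k, v in source_match.items())
        same_scope = all(source.get(l) == target.get(l) for l in equal)
        if matches_source and same_scope:
            return True
    return False


if __name__ == "__main__":
    pods = [
        {"alertname": "PodCrashLooping", "cluster": c, "namespace": "payments",
         "pod": p, "severity": "warning"}
        for c, p in [("a", "api-1"), ("a", "api-2"), ("a", "api-3"), ("b", "api-1")]
    ]
    print(len(group_alerts(pods, ["cluster", "namespace", "alertname"])))
    print(len(group_alerts(pods, ["cluster", "namespace", "alertname", "pod"])))
```

Grouping the four sample alerts by cluster, namespace, and alertname yields two notifications, while adding pod to `group_by` yields four, one per pod. That is the flood the gotcha warns about.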

Code Example

# prometheus.yml (Prometheus side)
alerting: # Configures where Prometheus sends evaluated alerts.
  alertmanagers: # Lists Alertmanager targets for alert delivery.
    - static_configs: # Uses explicit peer targets instead of a load-balanced single endpoint.
        - targets: ['alertmanager-0:9093','alertmanager-1:9093','alertmanager-2:9093'] # Sends alerts directly to every HA Alertmanager peer.

# alertmanager.yml (Alertmanager side, a separate file)
route: # Defines the root Alertmanager routing tree.
  receiver: sre-pager # Sends unmatched production alerts to the SRE paging receiver.
  group_by: ['cluster','namespace','alertname'] # Groups by blast radius without grouping unrelated clusters together.
  group_wait: 30s # Waits briefly so related alerts from the same incident can arrive together.
  group_interval: 5m # Waits this long before notifying about new alerts added to a group that has already been notified.
  repeat_interval: 4h # Prevents repeated pages for the same unresolved alert group.
inhibit_rules: # Suppresses noisy child alerts when a parent outage alert is already firing.
  - source_matchers: ['alertname="ClusterDown"'] # Uses the cluster-level outage alert as the inhibition source.
    target_matchers: ['severity="warning"'] # Suppresses lower-severity warning alerts during the parent outage.
    equal: ['cluster'] # Applies inhibition only inside the same cluster label value.

◈ Architecture Diagram

┌──────────┐
│ Rules    │
└────┬─────┘
     ↓
┌──────────┐
│ Alerts   │
└────┬─────┘
     ↓
┌──────────┐
│ Group    │
└────┬─────┘
     ↓
┌──────────┐     ┌──────────┐
│ Inhibit  │←────│ Silence  │
└────┬─────┘     └──────────┘
     ↓
┌──────────┐
│ Route    │
└────┬─────┘
     ↓
┌──────────┐
│ Pager    │
└──────────┘

How do you manage Grafana dashboards and alerts as code with GitOps?

advanced, monitoring, prometheus

Quick Answer

Store dashboard JSON and alert rule YAML in Git. Use Grafana provisioning, Grafonnet (a Jsonnet library), or Terraform's Grafana provider to define dashboards as code. Changes go through PR review, CI validates syntax, and CD applies them automatically. Updating 10 dashboards means changing one template and pushing a single commit.

Detailed Answer

Think of it like managing a chain of restaurants where every location has to serve the same menu. Instead of calling each manager and dictating changes over the phone, which is like clicking around in Grafana's UI, you update the master menu in a shared drive, the managers review it, and an automated system prints and ships the new menus to all locations at once.
The GitOps workflow for Grafana has three main approaches, from simple to powerful. The simplest is Grafana's built-in provisioning: you put dashboard JSON files and alert rule YAML files in a directory that Grafana watches, usually mounted via a ConfigMap in Kubernetes. When the files change, Grafana reloads them. You store these files in Git, and your CI/CD pipeline updates the ConfigMap every time a change merges to main.
The second approach uses Grafonnet, a Jsonnet library for generating Grafana dashboard JSON programmatically. Instead of writing raw 500-line JSON files by hand, you write concise Jsonnet code that generates them. This is where updating 10 dashboards at once becomes easy: if all 10 share a common template, say a service dashboard with CPU, memory, error rate, and latency panels, you define the template once and pass in parameters per service. Changing the template changes all 10 dashboards in one commit. Jsonnet compiles down to JSON, which then gets provisioned into Grafana.
The third approach uses Terraform with the Grafana provider. You define dashboards, folders, alert rules, and contact points as Terraform resources. The CI pipeline runs `terraform plan` on pull requests to show what would change and `terraform apply` on merge. This gives you state management, drift detection, and the full Terraform workflow. For large organizations managing hundreds of dashboards across multiple Grafana instances, this is the most maintainable path.
For alerts, Grafana's alerting rules and notification policies can also be defined in YAML and provisioned alongside dashboards. The entire alerting chain (rules, routing policies, contact points, and message templates) lives in Git, version-controlled and reviewable.
The day-to-day workflow looks like this: a developer creates a branch, modifies dashboard Jsonnet or Terraform files, and opens a pull request. CI runs syntax checks like `jsonnet-lint` or `terraform validate` and optionally renders a preview. A reviewer approves, the PR merges to main, and the CD pipeline applies changes to Grafana. The big win is that every change is code-reviewed, version-controlled, and reversible with a simple `git revert`.
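As one example of the "CI validates syntax" step, here is a small Python sketch of a pre-merge check over rendered dashboard JSON. The specific checks (valid JSON, a `uid` and `title` on every dashboard, no duplicate UIDs across files) are an assumed house policy for illustration, and `validate_dashboards` is a hypothetical helper, not part of Grafana or any tool.

```python
import json
from typing import Dict, List

REQUIRED_FIELDS = ("uid", "title")  # assumed house rule for this sketch


def validate_dashboards(files: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the check passes."""
    problems: List[str] = []
    seen_uids: Dict[str, str] = {}  # uid -> first filename that used it
    for filename, text in sorted(files.items()):
        try:
            dashboard = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append(f"{filename}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            problems.append(f"{filename}: top level must be a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if not dashboard.get(field)]
        if missing:
            problems.append(f"{filename}: missing {', '.join(missing)}")
            continue
        uid = dashboard["uid"]
        if uid in seen_uids:
            problems.append(f"{filename}: uid {uid!r} already used by {seen_uids[uid]}")
        else:
            seen_uids[uid] = filename
    return problems


if __name__ == "__main__":
    rendered = {
        "payments-api.json": '{"uid": "payments-api-svc", "title": "payments-api"}',
        "checkout-svc.json": '{"uid": "payments-api-svc", "title": "checkout-svc"}',
    }
    for problem in validate_dashboards(rendered):
        print(problem)
```

In a pipeline, the `files` mapping would be filled by reading the generated JSON from the output directory, and a non-empty result would fail the job. Catching a copy-pasted UID at review time is much cheaper than discovering that two dashboards collide after deployment.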

Code Example

# ─── Approach 1: Grafana Provisioning via ConfigMap ───
# dashboards.yaml (Grafana provisioning config)
apiVersion: 1
providers:
- name: default
  type: file
  options:
    path: /var/lib/grafana/dashboards    # Watch this directory
    foldersFromFilesStructure: true

# Mount dashboards from ConfigMap in Kubernetes
# kubectl create configmap grafana-dashboards \
#   --from-file=dashboards/ -n monitoring

# ─── Approach 2: Grafonnet (Jsonnet) ───
# service-dashboard.jsonnet — one template, many dashboards
# Uses the current Grafonnet library (install with jsonnet-bundler into vendor/).
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

# Assumptions: 'prometheus' is the datasource name, and each service's
# job label matches its name.
local datasource = 'prometheus';

# Template function — reused for all services
local serviceDashboard(name, namespace) =
  g.dashboard.new(name + ' Service Dashboard')
  + g.dashboard.withUid(name + '-svc')
  + g.dashboard.withPanels([
    # CPU panel
    g.panel.timeSeries.new(name + ' CPU')
    + g.panel.timeSeries.queryOptions.withTargets([
      g.query.prometheus.new(
        datasource,
        'sum(rate(container_cpu_usage_seconds_total{namespace="' + namespace + '", pod=~"' + name + '.*"}[5m]))'
      ),
    ]),
    # Error rate panel (filtered to this service, not the whole namespace)
    g.panel.timeSeries.new(name + ' Error Rate')
    + g.panel.timeSeries.queryOptions.withTargets([
      g.query.prometheus.new(
        datasource,
        'sum(rate(http_requests_total{namespace="' + namespace + '", job="' + name + '", status=~"5.."}[5m]))'
      ),
    ]),
  ]);

# Generate 10 dashboards from one template
{
  'payments-api.json': serviceDashboard('payments-api', 'production'),
  'checkout-svc.json': serviceDashboard('checkout-svc', 'production'),
  'user-auth.json': serviceDashboard('user-auth', 'production'),
  # ... 7 more services
}

# Build: jsonnet -J vendor/ -m output/ service-dashboard.jsonnet

# ─── Approach 3: Terraform ───
resource "grafana_dashboard" "payments" {
  config_json = file("dashboards/payments-api.json")
  folder      = grafana_folder.production.id
}

# CI Pipeline (.github/workflows/grafana.yml)
# on PR:    terraform plan → post diff as comment
# on merge: terraform apply → dashboards updated

◈ Architecture Diagram

GitOps Workflow for Grafana:

  ┌──────────┐    PR      ┌──────────┐
  │Developer │───────────►│   Git    │
  │          │            │ (main)   │
  │ edit     │  review    │          │
  │ .jsonnet │◄───────────│ CI runs: │
  │ or .tf   │  approve   │ lint     │
  └──────────┘            │ plan     │
                          └────┬─────┘
                               │ merge
                               ▼
                     ┌─────────────────┐
                     │   CD Pipeline   │
                     │                 │
                     │ jsonnet build   │
                     │ OR              │
                     │ terraform apply │
                     └────────┬────────┘
                              │
                              ▼
                     ┌─────────────────┐
                     │    Grafana      │
                     │                 │
                     │ 10 dashboards   │
                     │ updated from    │
                     │ 1 template      │
                     └─────────────────┘

  Template → 10 dashboards:
  ┌──────────────┐
  │ svc-template │
  │  .jsonnet    │──► payments-api.json
  │              │──► checkout-svc.json
  │              │──► user-auth.json
  │              │──► ... 7 more
  │  1 change =  │
  │  10 updates  │
  └──────────────┘

What is symptom-based alerting and how is it different from alerting on CPU, disk, etc.?

advanced, monitoring, prometheus

Quick Answer

Symptom-based alerting fires on things users actually feel, like high error rates, slow responses, or SLO budget burn, instead of internal causes like high CPU or disk at 80%. It cuts alert noise dramatically because many internal causes map to just a few user-facing symptoms. You implement it with SLO-based burn rate alerts in Prometheus using multi-window, multi-burn-rate rules.

Detailed Answer

Think of it like a car dashboard. Cause-based alerting would mean separate warning lights for every internal part: fuel injector pressure, alternator voltage, coolant thermostat position, oxygen sensor reading. You would have 200 lights and no idea which ones matter. Symptom-based alerting gives you one light that says 'engine temperature high,' and the mechanic investigates the cause from there.
Traditional monitoring creates alerts for every possible internal state: CPU above 80%, disk above 85%, memory above 90%, Pod restarts above 3, queue depth above 1000. This leads to massive alert fatigue. A team with 50 microservices might have 500-plus alert rules, most of which fire for brief spikes that fix themselves. Engineers start ignoring alerts, and when a real outage happens, the critical signal is buried in noise.
Symptom-based alerting flips this around. You alert on what users experience: the error rate is burning through the SLO budget faster than sustainable, latency has crossed the SLO target, or availability has dropped below the threshold. These are called SLI-based alerts, where SLI stands for Service Level Indicator. If CPU is at 95% but the error rate is 0% and latency is normal, there is no user impact, so no alert is needed. If CPU is at 40% but the error rate is 5%, users are hurting, so you alert right away.
The best way to implement this is Google's multi-window, multi-burn-rate approach from the SRE Workbook. You define an SLO such as 99.9% availability over 30 days, which gives you an error budget of 43.2 minutes of allowed downtime. Then you create burn rate alerts. A fast-burn alert fires when the error rate is consuming budget at 14.4 times the sustainable rate, meaning the entire 30-day budget would be gone in about 2 days (720 hours / 14.4 = 50 hours). This catches acute incidents. A slow-burn alert fires at 1 times the sustainable rate held over 3 days, catching gradual degradation. Each alert uses two time windows, a short one and a long one (5 minutes and 1 hour for the fast burn, 6 hours and 3 days for the slow burn), so a single brief spike does not trigger a false alarm.
In Prometheus, this translates to recording rules that calculate error ratios over multiple windows, plus alert rules that compare burn rates against thresholds. Grafana displays an SLO dashboard showing remaining error budget, burn rate trends, and alert status. The result: instead of 500 noisy alerts, you might have 10 to 20 SLO-based alerts across all services, each one actionable and tied to real user impact.
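The decision logic behind multi-window, multi-burn-rate alerting fits in a few lines. The Python sketch below mirrors the two tiers described above (14.4x over 1h and 5m, 1x over 3d and 6h). The `Tier` structure, the `evaluate` function, and the sample error ratios are hypothetical, and in practice the ratios would come from recording rules.

```python
from typing import Dict, List, NamedTuple, Optional


class Tier(NamedTuple):
    name: str
    burn_rate: float
    long_window: str
    short_window: str
    action: str


TIERS: List[Tier] = [
    Tier("fast-burn", 14.4, "1h", "5m", "page"),
    Tier("slow-burn", 1.0, "3d", "6h", "ticket"),
]


def evaluate(error_ratios: Dict[str, float], slo: float,
             tiers: List[Tier] = TIERS) -> Optional[str]:
    """Return the action of the first tier whose two windows both exceed its threshold."""
    budget = 1.0 - slo
    for tier in tiers:
        threshold = tier.burn_rate * budget
        long_hot = error_ratios[tier.long_window] > threshold
        short_hot = error_ratios[tier.short_window] > threshold
        if long_hot and short_hot:  # both windows must agree
            return tier.action
    return None


if __name__ == "__main__":
    outage = {"5m": 0.05, "1h": 0.02, "6h": 0.004, "3d": 0.0005}
    spike = {"5m": 0.05, "1h": 0.002, "6h": 0.0005, "3d": 0.0002}
    drift = {"5m": 0.002, "1h": 0.002, "6h": 0.002, "3d": 0.0015}
    for name, ratios in [("outage", outage), ("spike", spike), ("drift", drift)]:
        print(name, evaluate(ratios, slo=0.999))
```

The sustained outage pages, the brief spike produces nothing because the 1-hour window stays under 0.0144, and the slow drift opens a ticket because both the 6-hour and 3-day ratios sit above the 0.001 budget.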

Code Example

# ─── SLO Definition ───
# Service: payments-api
# SLO: 99.9% availability (error budget: 0.1% or 43.2 min/month)

# ─── Recording Rules (prometheus-rules.yaml) ───
groups:
- name: payments-slo
  rules:
  # Error ratio over different windows
  - record: payments:error_ratio:5m
    expr: |
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="payments-api"}[5m]))

  - record: payments:error_ratio:1h
    expr: |
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[1h]))
      /
      sum(rate(http_requests_total{job="payments-api"}[1h]))

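  - record: payments:error_ratio:6h
    expr: |
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[6h]))
      /
      sum(rate(http_requests_total{job="payments-api"}[6h]))

  # 3-day window, required by the slow-burn alert's long window
  - record: payments:error_ratio:3d
    expr: |
      sum(rate(http_requests_total{job="payments-api",status=~"5.."}[3d]))
      /
      sum(rate(http_requests_total{job="payments-api"}[3d]))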

  # ─── Alert Rules (burn rate) ───
  # Fast burn: 14.4x budget consumption → page immediately
  - alert: PaymentsSLOFastBurn
    expr: |
      payments:error_ratio:5m > (14.4 * 0.001)
      and
      payments:error_ratio:1h > (14.4 * 0.001)
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Payments API burning error budget 14.4x too fast"
      description: "At this rate, the 30-day budget is exhausted in about 2 days"

  # Slow burn: 1x sustained → ticket (not page)
  - alert: PaymentsSLOSlowBurn
    expr: |
      payments:error_ratio:6h > (1 * 0.001)
      and
      payments:error_ratio:3d > (1 * 0.001)
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Payments API slowly burning error budget"
      description: "Gradual degradation — investigate this week"

◈ Architecture Diagram

Cause-Based (noisy):        Symptom-Based (actionable):

  ┌────────────────────┐      ┌────────────────────┐
  │ CPU > 80%     PAGE │      │                    │
  │ Disk > 85%    PAGE │      │ Error rate > SLO   │
  │ Memory > 90%  PAGE │      │ burn rate?         │
  │ Restarts > 3  PAGE │      │                    │
  │ Queue > 1000  PAGE │      │ YES → PAGE         │
  │ Latency spike PAGE │      │ NO  → silence      │
  └────────────────────┘      └────────────────────┘
  500+ alerts, most noise      10-20 alerts, all real

  Multi-Window Burn Rate:

  Error Budget: 43.2 min/month (99.9% SLO)

  ┌─── Fast Burn ──────────────────────┐
  │ 14.4x burn rate                    │
  │ 5min window AND 1hr window         │
  │ → exhausts budget in ~2 days       │
  │ → PAGE immediately                 │
  └────────────────────────────────────┘

  ┌─── Slow Burn ──────────────────────┐
  │ 1x burn rate                       │
  │ 6hr window AND 3day window         │
  │ → exhausts budget in 30 days       │
  │ → ticket, investigate this week    │
  └────────────────────────────────────┘

What are recording rules and alerting rules in Prometheus, and how do they differ?

intermediate, general, prometheus

Quick Answer

Recording rules pre-compute expensive PromQL queries and save the results as new time series, making dashboards load faster. Alerting rules check PromQL conditions at regular intervals and fire alerts to Alertmanager when conditions stay true for a set duration.

Detailed Answer

Think of recording rules like a restaurant that preps ingredients before the dinner rush. Instead of chopping vegetables from scratch for every order, the kitchen pre-chops during quiet hours. Recording rules pre-compute expensive PromQL queries on a schedule so dashboards read a stored series instead of re-running the aggregation on every refresh. Alerting rules are like a smoke detector: they continuously check a condition and sound the alarm when something crosses a threshold for long enough to be a real problem.

Both types of rules are defined in YAML files and loaded by Prometheus through the rule_files config. They are organized into rule groups, where each group has a name and an optional evaluation interval. Recording rules have a record field (the name of the new metric to create) and an expr field (the PromQL expression to evaluate). The naming convention follows the pattern level:metric:operations -- for example, namespace:http_requests_total:rate5m tells you the aggregation level, the base metric, and what operation was applied. Alerting rules have an alert field (the alert name), an expr field, an optional for duration, labels to attach, and annotations for human-readable descriptions.

Under the hood, Prometheus evaluates rules within each group sequentially but can run multiple groups in parallel. The evaluation interval defaults to the global setting but can be overridden per group. For recording rules, each evaluation writes a new sample to the TSDB with the current timestamp. For alerting rules, the evaluation produces one of three states: inactive (the expression returned nothing), pending (the expression matched but the for duration has not passed yet), or firing (the expression has been true for at least the for duration). When an alert hits firing state, Prometheus sends it to all configured Alertmanagers.

In production, recording rules are essential for scaling dashboards. Without them, 50 engineers opening the same Grafana dashboard during an incident would each trigger the same expensive aggregation, so it runs 50 times per refresh. Recording rules compute it once and store the result. A common pattern is building a pyramid: raw metrics get aggregated into per-service rates, then those rates get aggregated into per-team totals. For alerting, the for clause is critical -- it prevents false alarms from momentary spikes. A for: 5m clause means the condition must hold at every evaluation for 5 minutes before the alert fires; a single evaluation where the expression returns nothing resets the timer.

A key gotcha with recording rules is dependency ordering. If rule A depends on the output of rule B, both should be in the same group with B listed first, because rules within a group run sequentially. Across groups, evaluation order is not guaranteed, so a rule that reads another group's output may see a value that is one evaluation behind. For alerting rules, a common mistake is leaving out the for clause entirely, which makes the alert fire on the first evaluation that matches, including brief spikes. Another pitfall is hardcoding values in annotations instead of using template variables. Include the identifying labels (for example {{ $labels.instance }} or {{ $labels.service }}) and {{ $value }} in your annotation templates so on-call engineers can immediately see which target is affected and how bad it is.
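The ordering point above can be illustrated with a toy model. The following is a minimal Python sketch, not Prometheus code: `evaluate_group`, the `SERVICE_TEAM` mapping, the `raw:http_requests:rate5m` input, and the sample values are all hypothetical. It only shows why a rule that reads another rule's output has to come later in the same group.

```python
from typing import Callable, Dict, List, Tuple

# A "store" maps a series name to {label_value: sample}; a stand-in for the TSDB.
Store = Dict[str, Dict[str, float]]
Rule = Tuple[str, Callable[[Store], Dict[str, float]]]

# Hypothetical service-to-team ownership, used for the second pyramid level.
SERVICE_TEAM = {
    "checkout-service": "checkout",
    "cart-service": "checkout",
    "payments-api": "payments",
}


def per_service_rate(store: Store) -> Dict[str, float]:
    # Level 1: pretend the raw counters were already converted to rates.
    return dict(store["raw:http_requests:rate5m"])


def per_team_rate(store: Store) -> Dict[str, float]:
    # Level 2: reads the OUTPUT of the per-service rule, so it must run after it.
    totals: Dict[str, float] = {}
    per_service = store.get("service:http_requests_total:rate5m", {})
    for service, rate in per_service.items():
        team = SERVICE_TEAM[service]
        totals[team] = totals.get(team, 0.0) + rate
    return totals


def evaluate_group(rules: List[Rule], store: Store) -> Store:
    # Rules run one after another; each result is visible to the rules after it.
    for name, fn in rules:
        store[name] = fn(store)
    return store


def fresh_store() -> Store:
    return {"raw:http_requests:rate5m": {
        "checkout-service": 120.0, "cart-service": 30.0, "payments-api": 80.0}}


good = evaluate_group(
    [("service:http_requests_total:rate5m", per_service_rate),
     ("team:http_requests_total:rate5m", per_team_rate)],
    fresh_store())
bad = evaluate_group(
    [("team:http_requests_total:rate5m", per_team_rate),
     ("service:http_requests_total:rate5m", per_service_rate)],
    fresh_store())

print(good["team:http_requests_total:rate5m"])  # {'checkout': 150.0, 'payments': 80.0}
print(bad["team:http_requests_total:rate5m"])   # {} on the first cycle: its input did not exist yet
```

In the wrong order the team-level rule is always working from the previous cycle's per-service values (or nothing at all on the first cycle), which is the staleness the ordering advice is meant to avoid.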

Code Example

# prometheus-rules.yml - Recording and Alerting Rules
# Loaded via: rule_files: ['prometheus-rules.yml'] in prometheus.yml

groups:
  # Recording rules for payments-api performance metrics
  - name: payments_api_recording_rules              # Group name for organization
    interval: 30s                                    # Evaluate every 30 seconds
    rules:
      # Pre-compute per-service request rate
      - record: service:http_requests_total:rate5m   # New time series name (level:metric:operation)
        expr: >                                      # PromQL expression: request rate over 5 minutes
          sum by (service, environment) (
            rate(http_requests_total[5m])
          )
        labels:
          aggregated_by: "recording_rule"            # Custom label to identify pre-computed metrics

      # Pre-compute error ratio (0 to 1) per service and environment
      - record: service:http_error_rate:ratio_rate5m # Error ratio as a recording rule
        expr: >                                      # Keeps the environment label used in alert annotations
          sum by (service, environment) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (service, environment) (rate(http_requests_total[5m]))

      # Pre-compute p99 latency per service
      - record: service:http_request_duration:p99_5m # 99th percentile latency
        expr: >                                      # histogram_quantile is expensive at query time
          histogram_quantile(0.99,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  # Alerting rules for checkout-service SLOs
  - name: checkout_service_alerts                    # Alerting rule group
    rules:
      # Alert when error rate exceeds 1% for 5 minutes
      - alert: CheckoutHighErrorRate                 # Alert name (PascalCase convention)
        expr: >                                      # PromQL condition to evaluate
          service:http_error_rate:ratio_rate5m{service="checkout-service"} > 0.01
        for: 5m                                      # Must be true for 5 min before firing
        labels:
          severity: critical                         # Routing label for Alertmanager
          team: checkout                             # Team responsible for this alert
        annotations:
          summary: "High error rate on checkout-service"          # Short description
          description: >                                           # Detailed description with templates
            Error rate is {{ $value | humanizePercentage }}
            for {{ $labels.service }} in {{ $labels.environment }}.
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"  # Link to remediation steps

      # Alert when p99 latency exceeds 2 seconds
      - alert: CheckoutHighLatency                   # Latency SLO violation alert
        expr: >                                      # Use pre-computed recording rule
          service:http_request_duration:p99_5m{service="checkout-service"} > 2.0
        for: 10m                                     # Longer for-clause to reduce noise
        labels:
          severity: warning                          # Warning severity, not critical
          team: checkout                             # Ownership label
        annotations:
          summary: "P99 latency exceeds 2s on checkout-service"
          description: >                             # Include actual value for quick triage
            P99 latency is {{ $value | humanizeDuration }}
            for {{ $labels.service }}.

◈ Architecture Diagram

┌──────────────────────────────────────────────────────────────────┐
│               Recording Rules vs Alerting Rules                  │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌──────────────────────────────────────────────────┐            │
│  │              Recording Rules                      │            │
│  │                                                   │            │
│  │  Expensive PromQL ──→ Evaluate every 30s          │            │
│  │       expr            ──→ Store as new metric     │            │
│  │                            in TSDB                │            │
│  │                                                   │            │
│  │  sum(rate(http_total[5m])) → service:http:rate5m  │            │
│  │       [complex query]         [pre-computed]      │            │
│  └──────────────────────────────────────────────────┘            │
│                                                                  │
│  ┌──────────────────────────────────────────────────┐            │
│  │              Alerting Rules                       │            │
│  │                                                   │            │
│  │  PromQL expr ──→ Evaluate ──→ State Machine       │            │
│  │                                                   │            │
│  │  ┌──────────┐   ┌──────────┐   ┌──────────┐      │            │
│  │  │ INACTIVE │──→│ PENDING  │──→│ FIRING   │      │            │
│  │  │ expr=∅   │   │ expr=true│   │ for:5m   │      │            │
│  │  └──────────┘   │ timer<5m │   │ elapsed  │      │            │
│  │       ↑         └────┬─────┘   └────┬─────┘      │            │
│  │       │              │              │             │            │
│  │       └──expr=false──┘              ↓             │            │
│  │                              ┌────────────┐      │            │
│  │                              │Alertmanager│      │            │
│  │                              │  routing   │      │            │
│  │                              └────────────┘      │            │
│  └──────────────────────────────────────────────────┘            │
└──────────────────────────────────────────────────────────────────┘
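The inactive, pending, and firing states in the diagram can be traced with a small simulation. This is a hedged Python sketch of the behaviour described above, not the Prometheus implementation; the `AlertState` class and the sample timeline are hypothetical, and a 60-second evaluation interval is assumed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    for_seconds: float                    # the rule's `for` duration
    active_since: Optional[float] = None  # when the expression first matched
    state: str = "inactive"

    def evaluate(self, now: float, expr_matched: bool) -> str:
        if not expr_matched:
            # Any evaluation where the expression returns nothing resets the timer.
            self.active_since = None
            self.state = "inactive"
        else:
            if self.active_since is None:
                self.active_since = now
            held = now - self.active_since
            self.state = "firing" if held >= self.for_seconds else "pending"
        return self.state


alert = AlertState(for_seconds=300)  # equivalent to `for: 5m`

# One evaluation per minute; the short spike at t=60..120 clears at t=180.
samples = [(0, False), (60, True), (120, True), (180, False),
           (240, True), (300, True), (360, True), (420, True),
           (480, True), (540, True)]
timeline = [(t, alert.evaluate(t, matched)) for t, matched in samples]

for t, state in timeline:
    print(f"t={t:>3}s {state}")

# Without a `for` clause (modelled as 0 seconds) the first match fires at once.
no_for = AlertState(for_seconds=0).evaluate(0, True)
print(no_for)  # firing
```

The two-minute spike never leaves pending, while the sustained condition starting at t=240 fires at t=540, once it has held for the full 300 seconds.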

How do you set up Alertmanager routing, grouping, and silences with Prometheus?

intermediate · general · prometheus

Quick Answer

Alertmanager receives alerts from Prometheus, groups related ones by labels, routes them to the right receiver (Slack, PagerDuty, email) using a routing tree, and supports silences to temporarily mute notifications. Grouping reduces noise by batching alerts that share the same labels into one notification.

Detailed Answer

Think of Alertmanager as a hospital triage system. Patients (alerts) arrive and are grouped by condition. Cardiac cases go to cardiology, broken bones go to orthopedics (routing rules). If the hospital is doing planned maintenance on radiology machines, they put up a sign saying ignore false alarms from 2am to 4am (silences). The triage nurse does not page the doctor twice for the same patient (deduplication), and waits a few minutes to batch patients arriving together (group_wait).

Alertmanager is a separate process that receives alerts from one or more Prometheus servers through its /api/v2/alerts endpoint. Its configuration defines receivers (notification channels like Slack or PagerDuty), a routing tree (which alerts go where), inhibition rules (suppress certain alerts when others are already firing), and templates for formatting notifications. The routing tree starts with a root route that has a default receiver. Child routes match on alert labels using matchers. Routes are evaluated top to bottom, and the first matching child wins unless continue: true is set, which lets evaluation continue to the next sibling.

Grouping is Alertmanager's most important noise reduction feature. When group_by is set to something like [service, environment], all alerts with the same service and environment label values get bundled into a single notification. Three timing settings control notification behavior: group_wait is how long to wait for more alerts before sending the first notification for a new group (default 30 seconds), group_interval is the minimum time between updates to an existing group when new alerts arrive (default 5 minutes), and repeat_interval is how long to wait before resending an unresolved alert (default 4 hours). Getting these right is the difference between a useful alert system and one that either floods your phone or misses real problems.

In production, a well-designed routing tree mirrors your organization's on-call structure. Critical payment alerts go to PagerDuty for immediate paging. Warning-level alerts for batch jobs go to a Slack channel. Silences are created through the Alertmanager UI or API and match alerts by label matchers -- they are essential during deployments and maintenance windows. Inhibition rules automatically suppress downstream alerts: when the entire cluster is unreachable (KubeAPIDown), you do not want 500 pod alerts flooding the channel. The inhibition rule says that while KubeAPIDown is firing, the matching target alerts that carry the same cluster label are suppressed.

A common mistake is setting group_by to too many labels, like [service, pod, instance]. This creates one notification per pod, which defeats the purpose of grouping. Taken to the extreme, the special value group_by: ['...'] groups by every label, which effectively disables grouping -- every alert becomes its own group. Another pitfall is setting repeat_interval too low, causing alert fatigue from constant re-notifications for chronic issues. A range of 4 to 12 hours is a common starting point for alerts that do not page. Engineers also forget that silences expire. If you create a 1-hour silence for a deployment that takes 2 hours, alerts will resume halfway through. Always add generous padding to silence durations.
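The first-match behaviour of the routing tree, and the effect of continue: true, can be traced in a few lines. The sketch below is a simplified Python model under stated assumptions: `Route` and `resolve` are hypothetical names, only exact-equality matching is modelled (no regex matchers), and the tree mirrors the receivers used in this answer's configuration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)  # empty = matches everything
    continue_: bool = False                               # mirrors `continue: true|false`
    routes: List["Route"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        return all(labels.get(k) == v for k, v in self.match.items())


def resolve(route: Route, labels: Dict[str, str]) -> List[str]:
    """Return the receivers an alert with `labels` is delivered to."""
    if not route.matches(labels):
        return []
    receivers: List[str] = []
    for child in route.routes:
        found = resolve(child, labels)
        receivers.extend(found)
        if found and not child.continue_:
            break  # first matching child wins
    # If no child matched, the alert is handled by this node itself.
    return receivers or [route.receiver]


tree = Route("slack-default", routes=[
    Route("pagerduty-payments", {"severity": "critical", "team": "payments"}),
    Route("pagerduty-general", {"severity": "critical"}),
    Route("slack-default", {"severity": "warning"}, routes=[
        Route("slack-checkout", {"team": "checkout"}),
        Route("slack-payments", {"team": "payments"}),
    ]),
])

print(resolve(tree, {"severity": "critical", "team": "payments"}))  # ['pagerduty-payments']
print(resolve(tree, {"severity": "critical", "team": "checkout"}))  # ['pagerduty-general']
print(resolve(tree, {"severity": "warning", "team": "checkout"}))   # ['slack-checkout']
print(resolve(tree, {"severity": "info"}))                          # ['slack-default']

# With continue: true on the first child, evaluation carries on to the next sibling.
tree.routes[0].continue_ = True
both = resolve(tree, {"severity": "critical", "team": "payments"})
print(both)  # ['pagerduty-payments', 'pagerduty-general']
```

Ordering matters here: if the general critical route were listed before the payments-specific one, payments alerts would never reach their dedicated receiver.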

Code Example

# alertmanager.yml - Alertmanager configuration
global:
  resolve_timeout: 5m                             # Applies only to alerts sent without an end time; Prometheus always sets one
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxx'  # Default Slack webhook
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'      # PagerDuty events API

# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'           # Path to custom notification templates

# Routing tree - determines which alerts go to which receivers
route:
  receiver: 'slack-default'                        # Default receiver if no child route matches
  group_by: ['alertname', 'service']               # Group alerts by these labels
  group_wait: 30s                                  # Wait 30s for more alerts before first notification
  group_interval: 5m                               # Wait 5m between updates to existing groups
  repeat_interval: 4h                              # Re-send unresolved alerts every 4 hours
  routes:
    # Critical payment alerts go to PagerDuty immediately
    - match:
        severity: critical                         # Match alerts with severity=critical
        team: payments                             # AND team=payments
      receiver: 'pagerduty-payments'               # Route to PagerDuty
      group_wait: 10s                              # Shorter wait for critical alerts
      repeat_interval: 1h                          # Re-page every hour if unresolved
      continue: false                              # Stop matching after this route

    # All critical alerts (non-payments) go to PagerDuty general
    - match:
        severity: critical                         # Match any critical alert
      receiver: 'pagerduty-general'                # General on-call PagerDuty
      group_wait: 15s                              # Quick notification for critical

    # Warning alerts go to team-specific Slack channels
    - match:
        severity: warning                          # Match warning-level alerts
      receiver: 'slack-default'                    # Default Slack channel
      routes:
        - match:
            team: checkout                         # Checkout team warnings
          receiver: 'slack-checkout'               # Team-specific Slack channel
        - match:
            team: payments                         # Payments team warnings
          receiver: 'slack-payments'               # Payments Slack channel

# Inhibition rules - suppress alerts when others are firing
inhibit_rules:
  - source_match:                                  # When this alert is firing...
      alertname: 'KubeAPIDown'                     # Kubernetes API server is down
    target_match_re:                               # ...suppress these alerts
      alertname: 'Kube.*'                          # All Kubernetes-related alerts
    equal: ['cluster']                             # Only if cluster label matches

  - source_match:                                  # When a critical alert is firing...
      severity: 'critical'
    target_match:                                  # ...suppress the warning-level version of it
      severity: 'warning'
    equal: ['alertname', 'service']                # Only when alertname and service both match

# Receivers - notification channel configurations
receivers:
  - name: 'slack-default'                          # Default Slack receiver
    slack_configs:
      - channel: '#alerts-general'                 # Slack channel name
        send_resolved: true                        # Notify when alert resolves
        title: '{{ .GroupLabels.alertname }}'       # Alert name as title
        text: |-                                   # Literal block keeps each alert on its own line
          {{ range .Alerts }}
          *{{ .Labels.service }}* - {{ .Annotations.summary }}
          {{ end }}

  - name: 'pagerduty-payments'                     # PagerDuty for payments team
    pagerduty_configs:
      - routing_key: 'payments-service-key-xxx'    # Events API v2 integration key (service_key is for the older Prometheus integration type)
        severity: '{{ .CommonLabels.severity }}'    # severity is not in group_by, so read it from CommonLabels
        description: '{{ .CommonAnnotations.summary }}'  # Alert summary

  - name: 'slack-checkout'                         # Checkout team Slack channel
    slack_configs:
      - channel: '#checkout-alerts'                # Team-specific channel
        send_resolved: true                        # Send resolution notifications

  - name: 'slack-payments'                         # Payments team Slack channel
    slack_configs:
      - channel: '#payments-alerts'                # Team-specific channel
        send_resolved: true                        # Send resolution notifications

  - name: 'pagerduty-general'                      # General on-call PagerDuty
    pagerduty_configs:
      - routing_key: 'general-oncall-key-xxx'      # Events API v2 integration key
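To see what the second inhibit rule in the configuration above does and does not suppress, the matching logic can be sketched in Python. This is a simplified model, not Alertmanager's code: `is_inhibited` is a hypothetical helper, only equality matching is modelled, and a missing label is treated the same as an empty one.

```python
from typing import Dict, List

Labels = Dict[str, str]


def has(labels: Labels, want: Labels) -> bool:
    return all(labels.get(k) == v for k, v in want.items())


def is_inhibited(target: Labels, firing: List[Labels], source_match: Labels,
                 target_match: Labels, equal: List[str]) -> bool:
    """True if some firing source alert suppresses `target` under one rule."""
    if not has(target, target_match):
        return False
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        same = all(source.get(l, "") == target.get(l, "") for l in equal)
        if has(source, source_match) and same:
            return True
    return False


# Hypothetical alerts: the same alert name raised at two severities,
# plus an unrelated latency warning on the same service.
critical = {"alertname": "CheckoutHighErrorRate", "service": "checkout-service",
            "severity": "critical"}
warning_same = {"alertname": "CheckoutHighErrorRate", "service": "checkout-service",
                "severity": "warning"}
warning_other = {"alertname": "CheckoutHighLatency", "service": "checkout-service",
                 "severity": "warning"}

rule = dict(source_match={"severity": "critical"},
            target_match={"severity": "warning"},
            equal=["alertname", "service"])

print(is_inhibited(warning_same, [critical], **rule))   # True
print(is_inhibited(warning_other, [critical], **rule))  # False: alertname differs
```

Because alertname is part of equal, a critical error-rate alert does not silence a latency warning on the same service; dropping alertname from equal would widen the suppression to every warning for that service.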

◈ Architecture Diagram

┌──────────────────────────────────────────────────────────────────┐
│                    Alertmanager Routing Tree                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Prometheus ──→ POST /api/v2/alerts ──→ Alertmanager             │
│                                                                  │
│  ┌──────────────────────────────────────────────────────┐        │
│  │  Root Route                                          │        │
│  │  receiver: slack-default                             │        │
│  │  group_by: [alertname, service]                      │        │
│  │                                                      │        │
│  │  ├── severity=critical AND team=payments             │        │
│  │  │   └── receiver: pagerduty-payments                │        │
│  │  │                                                   │        │
│  │  ├── severity=critical                               │        │
│  │  │   └── receiver: pagerduty-general                 │        │
│  │  │                                                   │        │
│  │  └── severity=warning                                │        │
│  │      ├── team=checkout                               │        │
│  │      │   └── receiver: slack-checkout                 │        │
│  │      └── team=payments                               │        │
│  │          └── receiver: slack-payments                 │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Grouping Timeline:                                              │
│  ┌────────┐  ┌──────────┐  ┌──────────────┐  ┌──────────────┐   │
│  │ Alert  │→ │group_wait│→ │ 1st Notify   │→ │group_interval│   │
│  │ Arrives│  │  (30s)   │  │ (batch sent) │  │   (5m)       │   │
│  └────────┘  └──────────┘  └──────────────┘  └──────┬───────┘   │
│                                                      │           │
│                                              ┌───────↓───────┐   │
│                                              │repeat_interval│   │
│                                              │    (4h)       │   │
│                                              │ re-send if    │   │
│                                              │ unresolved    │   │
│                                              └───────────────┘   │
└──────────────────────────────────────────────────────────────────┘
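The effect of group_by on notification volume can be checked with a quick bucketing exercise. The Python below is an illustrative sketch only: `group_alerts` is a hypothetical helper that keys alerts the way grouping is described above, and the `PodCrashLooping` alerts are made-up sample data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]


def group_alerts(alerts: List[Labels], group_by: List[str]) -> Dict[Tuple, List[Labels]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple, List[Labels]] = defaultdict(list)
    for labels in alerts:
        if group_by == ["..."]:
            # Special value: group by every label, so each distinct alert stands alone.
            key = tuple(sorted(labels.items()))
        else:
            key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Six pods of one service crash at the same time.
alerts = [
    {"alertname": "PodCrashLooping", "service": "checkout-service",
     "pod": f"checkout-{i}"}
    for i in range(6)
]

coarse = group_alerts(alerts, ["alertname", "service"])
per_pod = group_alerts(alerts, ["service", "pod"])
ungrouped = group_alerts(alerts, ["..."])

print(len(coarse))     # 1 notification covering all six pods
print(len(per_pod))    # 6 notifications, one per pod
print(len(ungrouped))  # 6 notifications, grouping effectively disabled
```

Each group gets its own group_wait, group_interval, and repeat_interval timers, so six groups also means six independent repeat cycles for the same underlying incident.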