15 interview questions · kubernetes
Quick Answer
Build a multi-stage pipeline with automated testing gates, manual approval for production, progressive canary deployment using Argo Rollouts or Flagger, metric-based canary analysis from Prometheus, and automatic rollback when error rates or latency exceed thresholds.
Detailed Answer
Think of deploying software to a bank like introducing a new procedure at a branch. You would not roll it out to all 500 branches simultaneously. You would train one branch first (canary), monitor customer satisfaction and error rates for a few days, get management approval (gate), then gradually roll it out to more branches while watching for problems. If complaints spike, you immediately revert to the old procedure. CI/CD pipelines for Kubernetes follow the same pattern, with automation replacing the manual monitoring.
A production-grade CI/CD pipeline for banking has distinct stages. The CI portion runs on every pull request: code checkout, dependency vulnerability scanning (Snyk or Trivy), unit tests, static analysis (SonarQube), container image build, image vulnerability scanning, and push to ECR with a git-SHA tag. The CD portion triggers when code merges to main: deploy to staging, run integration tests against staging, wait for manual approval from a tech lead or release manager, deploy canary to production (5% traffic), run automated canary analysis for 30 minutes, gradually shift traffic (5% → 25% → 50% → 100%), and verify post-deployment health checks. In a regulated bank, the approval gate is not optional: SOX compliance requires documented approval for production changes, and the pipeline must log who approved, when, and what was deployed.
Canary analysis is where the pipeline becomes intelligent. Instead of a human watching dashboards during the canary, tools like Argo Rollouts with the Prometheus metrics provider or Flagger with its canary analysis engine automatically compare the canary's metrics against the baseline (stable version). You define success criteria: error rate must be below 1%, P99 latency must be below 500ms, and no new error log patterns. The tool queries Prometheus every 60 seconds during the canary window, evaluates the canary's metrics against those criteria, and makes a pass/fail decision. If a metric fails more measurements than the configured failure limit allows, the canary is automatically rolled back with no human intervention. This is critical for banking because a bad deployment to the payments-api could cause failed transactions, and automatic rollback limits the blast radius to the 5% canary traffic.
Argo Rollouts replaces the standard Kubernetes Deployment with a Rollout resource that supports canary and blue-green strategies natively. The Rollout resource defines the canary steps (traffic weight, pause duration, analysis run), and an AnalysisTemplate defines the Prometheus queries and thresholds. When the Rollout's pod template changes (for example, a new image tag), the Rollout controller creates a canary ReplicaSet, configures the Istio VirtualService (or nginx ingress) to split traffic, runs the analysis, and either promotes or aborts. The entire process is declarative and version-controlled, so auditors can review the Git history to see exactly what canary criteria were in place for each deployment.
In production at a bank, the pipeline must also handle database migrations, feature flags, and compliance artifacts. Database migrations run before the canary deployment using a Kubernetes Job with a migration container. Feature flags (via LaunchDarkly or Unleash) allow code to be deployed but not activated until the canary is promoted. Compliance artifacts (SBOM, or Software Bill of Materials, vulnerability scan results, approval records, and deployment timestamps) are stored in an immutable artifact store (JFrog Artifactory or AWS CodeArtifact) and linked to the deployment for audit trails. The pipeline also enforces branch protection rules: only code that has passed peer review (minimum two approvals), all CI checks, and security scanning can reach the production deployment stage.
The biggest gotcha is canary analysis that gives false confidence. If your canary only receives 5% of traffic and you are analyzing error rate, low traffic volume means a single error can swing your error rate from 0% to 10%, causing false rollbacks. Use absolute error counts alongside percentages for low-traffic services. Another gotcha is not testing the rollback path: if your canary deployment includes a database migration that is not backward-compatible, rolling back the application while the database has already migrated forward causes data issues. Always make database migrations backward-compatible (add columns but do not remove them until the next release). Finally, approval gates must have timeouts, because a deployment that waits 48 hours for approval in a banking context creates risk if the codebase has moved on.
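The low-traffic gotcha above can be sketched in a few lines of Python. This is a minimal illustration, not how Argo Rollouts or Flagger decide internally: the function name `canary_verdict` and the thresholds `min_requests` and `max_abs_errors` are hypothetical values chosen for the example.

```python
def canary_verdict(errors: int, requests: int,
                   max_error_rate: float = 0.01,  # 1% threshold from the text
                   min_requests: int = 500,       # hypothetical: below this, percentages are too noisy
                   max_abs_errors: int = 3) -> str:  # hypothetical absolute error budget
    """Return 'pass', 'fail', or 'inconclusive' for one canary measurement."""
    if requests == 0:
        return "inconclusive"
    if requests < min_requests:
        # Low volume: one error in ten requests is a 10% rate, so judge on
        # the absolute error count instead of the percentage.
        return "fail" if errors > max_abs_errors else "inconclusive"
    rate = errors / requests
    return "fail" if rate >= max_error_rate else "pass"


if __name__ == "__main__":
    print(canary_verdict(1, 10))     # 10% error rate on tiny volume
    print(canary_verdict(5, 10))     # too many absolute errors even at low volume
    print(canary_verdict(3, 1000))   # 0.3% at healthy volume
    print(canary_verdict(20, 1000))  # 2% at healthy volume
```

Treating low-volume windows as inconclusive rather than failed avoids rolling back on a single stray error, while the absolute error budget still catches a canary that is clearly broken.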
Code Example
# Argo Rollouts - Canary deployment for payments-api
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
  namespace: banking-prod
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: payments-api
  strategy:
    canary:
      canaryService: payments-api-canary
      stableService: payments-api-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: payments-api-vsvc
              routes:
                - primary
      steps:
        # Step 1: 5% canary traffic + analysis
        - setWeight: 5
        - analysis:
            templates:
              - templateName: payments-api-canary-analysis
            args:
              - name: service-name
                value: payments-api-canary
        # Step 2: Manual approval gate (SOX compliance)
        - pause: {} # Requires manual promotion
        # Step 3: Gradual rollout
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
      # After an abort (automatic on failed analysis), keep canary pods
      # for 30s before scaling them down
      abortScaleDownDelaySeconds: 30
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payments-api:v2.5.1
          ports:
            - containerPort: 8080
---
# AnalysisTemplate - Prometheus-based canary validation
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-api-canary-analysis
  namespace: banking-prod
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 10 # 10 measurements, one per minute
      successCondition: result[0] < 0.01 # < 1% error rate
      failureLimit: 2 # Tolerate up to 2 failed measurements; a 3rd fails the analysis
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="{{args.service-name}}",code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="{{args.service-name}}"}[2m]))
    - name: latency-p99
      interval: 60s
      count: 10
      successCondition: result[0] < 0.5 # < 500ms P99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{app="{{args.service-name}}"}[2m])) by (le)
            )
---
# GitHub Actions CI pipeline with security gates
# .github/workflows/payments-api-cicd.yaml
name: payments-api CI/CD
on:
  push:
    branches: [main]
    paths: ['services/payments-api/**']
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: go test ./... -coverprofile=coverage.out
      - name: SonarQube analysis
        uses: sonarsource/sonarqube-scan-action@v2
      - name: Build container image
        run: |
          docker build -t payments-api:${{ github.sha }} \
            -f services/payments-api/Dockerfile .
      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: payments-api:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1 # Fail pipeline on critical or high vulns
      - name: Generate SBOM for compliance audit trail
        run: syft payments-api:${{ github.sha }} -o spdx-json > sbom.json
      - name: Push to ECR
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
          docker tag payments-api:${{ github.sha }} $ECR_REGISTRY/payments-api:${{ github.sha }}
          docker push $ECR_REGISTRY/payments-api:${{ github.sha }}
◈ Architecture Diagram
CI/CD Pipeline with Canary Analysis
┌──────┐   ┌──────┐   ┌───────┐   ┌───────┐   ┌──────────────────┐
│ Code │ → │ Unit │ → │ SAST  │ → │ Image │ → │ Trivy Scan +     │
│Commit│   │ Test │   │ Sonar │   │ Build │   │ SBOM Generation  │
└──────┘   └──────┘   └───────┘   └───────┘   └────────┬─────────┘
                                                       │
                        ┌──────────────────────────────▼────┐
                        │ Staging Deploy                    │
                        │ Integration Tests + E2E           │
                        └─────────────────┬─────────────────┘
                                          │
                        ┌─────────────────▼─────────────────┐
                        │ Manual Approval Gate              │
                        │ (SOX: who + when + what)          │
                        └─────────────────┬─────────────────┘
                                          │
┌─────────────────────────────────────────▼──────────────────────┐
│ Canary Deployment                                              │
│                                                                │
│ 5% ──→ Analysis ──→ 25% ──→ 50% ──→ 100%                       │
│        (Prometheus)                                            │
│        error rate < 1%                                         │
│        P99 < 500ms                 ┌──────────────────┐        │
│                                    │ Auto Rollback    │        │
│ If analysis fails ───────────────→ │ (abort canary,   │        │
│                                    │ restore stable)  │        │
│                                    └──────────────────┘        │
└────────────────────────────────────────────────────────────────┘
Quick Answer
Chaos engineering systematically injects failures (pod kills, network latency, disk stress) into Kubernetes to validate resilience. Blast radius is controlled through namespace targeting, label selectors, percentage-based injection, mandatory rollback plans, and SLO-based abort conditions.
Detailed Answer
Think of chaos engineering like a fire drill in a bank's headquarters. You don't set the entire building on fire to test the evacuation plan. You trigger a controlled alarm in one wing, observe how people respond, time the evacuation, identify bottlenecks at stairwells, and then expand to more wings only after the first drill succeeds. The 'blast radius' is which wing you pick and how many floors are involved. Chaos engineering in Kubernetes follows the same principle: start small, observe carefully, expand gradually.
Before running any experiment, you must define the steady state hypothesis tied to your SLOs. For a banking payments platform, steady state might be: p99 latency for /api/payments is below 200ms, error rate is below 0.1%, and transaction throughput is above 500 TPS. The experiment asks: 'If we kill 30% of payments-api pods, does the system maintain steady state?' If it does, your horizontal scaling and readiness probes work. If it doesn't, you've found a resilience gap before a real incident exposes it at 3 AM.
LitmusChaos is a CNCF project that runs as a Kubernetes operator. You install the ChaosCenter (control plane) and deploy ChaosEngine resources that reference ChaosExperiment templates. LitmusChaos provides a library of pre-built experiments: pod-delete, pod-network-latency, pod-cpu-hog, node-drain, disk-fill, and more. Each experiment has configurable parameters for blast radius: you specify the target namespace, label selectors, number of pods to affect, and duration. The ChaosEngine runs the experiment as a Kubernetes Job, injects the failure, observes the results via probes (HTTP health checks, Prometheus queries, custom scripts), and reports pass/fail. Gremlin is a commercial alternative that provides a SaaS control plane with a richer UI, team-based RBAC, and pre-built attack scenarios. Gremlin deploys a DaemonSet on your cluster that receives attack commands from the Gremlin cloud API.
Blast radius control is the most critical aspect and requires multiple layers. First, always start in non-production environments: your staging cluster should mirror production topology (same number of replicas, same resource limits, same network policies). Second, use namespace isolation: target only the payments namespace, and never run experiments against kube-system or cert-manager namespaces that affect the entire cluster. Third, use percentage-based targeting: affect 1 out of 5 pods first, then 2, then 3, and never jump to 100%. Fourth, set experiment duration limits: a 30-second pod kill is recoverable; a 30-minute network partition might trigger cascading failures in downstream services. Fifth, define abort conditions: if error rate exceeds 1% or latency exceeds 500ms, automatically halt the experiment.
In a banking environment, chaos engineering requires additional governance. Every experiment needs a formal approval process, a documented rollback plan, and on-call engineers explicitly notified before execution. Auditors working under frameworks such as SOC 2 and PCI DSS will expect evidence that resilience testing was controlled and documented. Run experiments during business hours when the team is fully staffed, never on Fridays or before holidays. Use GameDay scheduling in your chaos platform to coordinate experiments across teams and ensure only one experiment runs at a time to prevent compounding failures.
The gotcha that catches most teams: chaos experiments expose not just application weaknesses but observability gaps. Your first pod-kill experiment might pass because the app recovers in 5 seconds. But if your alerting didn't fire, your dashboards didn't show the event, and your on-call engineer had no idea an experiment was running, you've discovered that your monitoring is blind to real failures. The most valuable outcome of chaos engineering is often improving your observability stack, not your application code.
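Two of the blast radius layers above, percentage-based targeting and abort conditions, reduce to simple arithmetic. The sketch below is illustrative only and does not reproduce how LitmusChaos or Gremlin compute targets: `pods_to_target`, `should_abort`, and the `max_percent` cap are hypothetical names and values; the 1% and 500ms limits come from the text.

```python
def pods_to_target(total_pods: int, percent: int, max_percent: int = 50) -> int:
    """Number of pods an experiment may affect for a given percentage.

    Assumption: round down, but always affect at least one pod, and refuse
    any percentage above max_percent so an experiment never hits 100%.
    """
    if total_pods <= 0:
        raise ValueError("total_pods must be positive")
    if not 0 < percent <= max_percent:
        raise ValueError(f"percent must be in (0, {max_percent}]")
    return max(1, total_pods * percent // 100)


def should_abort(error_rate_pct: float, p99_latency_ms: float,
                 max_error_rate_pct: float = 1.0,
                 max_p99_ms: float = 500.0) -> bool:
    """Abort when either SLO guardrail is breached."""
    return error_rate_pct > max_error_rate_pct or p99_latency_ms > max_p99_ms


if __name__ == "__main__":
    print(pods_to_target(10, 30))    # 30% of 10 pods
    print(should_abort(0.4, 180.0))  # within both guardrails
    print(should_abort(1.2, 180.0))  # error rate above 1%
```

Capping the percentage in code, rather than relying on reviewers to notice a bad value, is one way to make "never jump to 100%" an enforced rule instead of a convention.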
Code Example
# Install LitmusChaos operator on the cluster
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP
# Install chaos experiments for the payments namespace
# (the URL is quoted so the shell does not treat '?' as a glob character)
kubectl apply -f "https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/experiments.yaml" \
  -n payments
# Define a ChaosEngine targeting payments-api pods
# with strict blast radius controls
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payments-pod-kill
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=payments-api # Only target payments-api pods
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30" # Only 30 seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10" # Kill a pod every 10 seconds
            - name: PODS_AFFECTED_PERC
              value: "30" # Only affect 30% of pods
            - name: FORCE
              value: "false" # Graceful termination first
        probe:
          - name: payments-health-check
            type: httpProbe
            httpProbe/inputs:
              url: http://payments-api.payments.svc:8080/health
              insecureSkipVerify: false
              responseTimeout: 3000 # 3s timeout
              method:
                get:
                  criteria: ==
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 3
              interval: 5
          - name: slo-error-rate
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus.monitoring.svc:9090
              query: |
                sum(rate(http_requests_total{service="payments-api",code=~"5.."}[1m]))
                / sum(rate(http_requests_total{service="payments-api"}[1m])) * 100
              comparator:
                type: float
                criteria: "<="
                value: "1.0" # Probe fails if error rate exceeds 1%
            mode: Continuous
# Monitor chaos experiment status
kubectl get chaosresult payments-pod-kill-pod-delete -n payments -o yaml
# Check if experiment passed or failed (verdict: Pass / Fail)
kubectl get chaosengine payments-pod-kill -n payments \
  -o jsonpath='{.status.experiments[0].verdict}'
◈ Architecture Diagram
Chaos Engineering Workflow
┌──────────────────────────────────────────────────┐
│ 1. Define Steady State (SLO-based)               │
│    • p99 latency < 200ms                         │
│    • Error rate < 0.1%                           │
│    • Throughput > 500 TPS                        │
└──────────────────────┬───────────────────────────┘
                       ▼
┌──────────────────────────────────────────────────┐
│ 2. Blast Radius Controls                         │
│  ┌────────────┐  ┌──────────┐  ┌───────────┐     │
│  │ Namespace  │  │ Label    │  │ % Pods    │     │
│  │ isolation  │  │ selector │  │ affected  │     │
│  │ (payments) │  │ (app=    │  │ (30%)     │     │
│  │            │  │ pay-api) │  │           │     │
│  └────────────┘  └──────────┘  └───────────┘     │
└──────────────────────┬───────────────────────────┘
                       ▼
┌──────────────────────────────────────────────────┐
│ 3. Run Experiment                                │
│                                                  │
│ ChaosEngine → Pod Kill Job → Inject Failure      │
│                                    │             │
│      ┌────────────┬────────────────┤             │
│      ▼            ▼                ▼             │
│ HTTP Probe    Prom Probe    Abort Condition      │
│ (health OK?)  (SLO met?)   (error > 1%? STOP)    │
└──────────────────────┬───────────────────────────┘
                       ▼
┌──────────────────────────────────────────────────┐
│ 4. Analyze Results                               │
│ PASS → expand scope in next iteration            │
│ FAIL → document gap → create remediation         │
└──────────────────────────────────────────────────┘
Quick Answer
The kube-scheduler uses a two-phase approach: first it filters out nodes that cannot run the Pod (predicates), then it scores the remaining feasible nodes using priority functions. The node with the highest aggregate score wins the binding.
Detailed Answer
Think of the Kubernetes scheduler like a hiring manager filling a position. First, you eliminate candidates who do not meet the minimum qualifications (no relevant degree, wrong location, missing certifications): that is the filtering phase. Then, among the qualified candidates, you rank them by how well they fit the role (years of experience, cultural fit, salary expectations): that is the scoring phase. The best-ranked candidate gets the offer, and in Kubernetes, the best-ranked node gets the Pod.
In Kubernetes, the kube-scheduler is a control-plane component that watches the API server for newly created Pods that have no node assignment (spec.nodeName is empty). When it detects an unscheduled Pod, it begins the scheduling cycle. The scheduler maintains an internal scheduling queue that prioritizes Pods based on their priority class, creation timestamp, and other factors. The entire process happens in two distinct phases: filtering (also called predicates) and scoring (also called priorities).
During the filtering phase, the scheduler evaluates each node against a set of filter plugins. These include NodeResourcesFit (checking CPU and memory requests against allocatable capacity; formerly the PodFitsResources predicate), NodePorts (ensuring requested host ports are available; formerly PodFitsHostPorts), NodeAffinity (matching node labels against affinity rules), TaintToleration (verifying the Pod tolerates all node taints), PodTopologySpread (enforcing topology spread constraints), and VolumeBinding (checking that required persistent volumes can be provisioned or are available on that node). Any node that fails even one filter is eliminated. If no nodes pass filtering, the Pod remains Pending and the scheduler may trigger preemption if the Pod has sufficient priority to evict lower-priority Pods.
In the scoring phase, each surviving node is evaluated by scoring plugins that assign a value typically between 0 and 100. The NodeResourcesBalancedAllocation plugin favors nodes that would have balanced CPU and memory use after placing the Pod. The ImageLocality plugin gives higher scores to nodes that already have the container image cached, reducing pull time. InterPodAffinity scores nodes based on whether co-locating the Pod with other Pods matches affinity or anti-affinity preferences. The LeastAllocated scoring strategy of NodeResourcesFit prefers nodes with the most free resources, while MostAllocated does the opposite for bin-packing. Each plugin score is multiplied by a configurable weight, and the weighted scores are summed. The node with the highest total score is selected, and the scheduler creates a Binding object to assign the Pod to that node.
At production scale with thousands of nodes, the scheduler uses a percentageOfNodesToScore parameter (defaulting to a formula based on cluster size) to avoid evaluating every single node, which would be too slow. For a 5000-node cluster, it might only score 10% of feasible nodes once it has found enough candidates. The scheduler also supports scheduling profiles, allowing you to run multiple schedulers or customize the plugin chain. The scheduling framework has extension points like PreFilter, Filter, PreScore, Score, Reserve, Permit, PreBind, Bind, and PostBind, making it highly extensible.
A non-obvious gotcha is that the scheduler makes decisions based on a snapshot of the cluster state, which can become stale in highly dynamic environments. This matters most when more than one scheduler is running: two Pods can be placed on the same node at nearly the same time, and the second may be rejected because the first already consumed the resources. Additionally, the percentageOfNodesToScore optimization means the scheduler might not always find the globally optimal node; it finds a good-enough node quickly. Resource requests (not limits) drive scheduling decisions, so Pods without requests are treated as requesting zero resources, which can lead to node overcommitment. Finally, since Kubernetes 1.12 DaemonSet Pods are scheduled by the default scheduler: the DaemonSet controller creates each Pod with a node affinity for its target node instead of setting spec.nodeName directly, so taints, resource requests, and priority apply to them as well.
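The two-phase flow described above (filter, then weighted scoring) can be sketched as a toy model. This is not the kube-scheduler's implementation: the `Node` fields, the simplified `least_allocated` and `image_locality` formulas, and the sample nodes are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    free_cpu_m: int    # unrequested CPU in millicores
    free_mem_mi: int   # unrequested memory in MiB
    total_cpu_m: int
    total_mem_mi: int
    has_image: bool    # container image already cached on the node


def fits(node: Node, cpu_m: int, mem_mi: int) -> bool:
    """Filter phase: the Pod's requests must fit in the node's free capacity."""
    return node.free_cpu_m >= cpu_m and node.free_mem_mi >= mem_mi


def least_allocated(node: Node, cpu_m: int, mem_mi: int) -> float:
    """Score 0-100: more free capacity left after placement scores higher."""
    cpu = (node.free_cpu_m - cpu_m) / node.total_cpu_m
    mem = (node.free_mem_mi - mem_mi) / node.total_mem_mi
    return 100 * (cpu + mem) / 2


def image_locality(node: Node) -> float:
    """Score 0 or 100 depending on whether the image is cached (simplified)."""
    return 100.0 if node.has_image else 0.0


def schedule(nodes: List[Node], cpu_m: int, mem_mi: int,
             weights: Dict[str, int]) -> Optional[str]:
    """Return the winning node name, or None if the Pod would stay Pending."""
    feasible = [n for n in nodes if fits(n, cpu_m, mem_mi)]
    if not feasible:
        return None

    def total(n: Node) -> float:
        # Each plugin score is multiplied by its weight, then summed.
        return (weights["least_allocated"] * least_allocated(n, cpu_m, mem_mi)
                + weights["image_locality"] * image_locality(n))

    return max(feasible, key=total).name


NODES = [
    Node("node-a", 400, 4096, 4000, 8192, True),    # too little free CPU
    Node("node-b", 2000, 4096, 4000, 8192, False),  # roomy, image not cached
    Node("node-c", 1000, 2048, 4000, 8192, True),   # tighter, image cached
]

if __name__ == "__main__":
    print(schedule(NODES, 500, 512, {"least_allocated": 1, "image_locality": 1}))
    print(schedule(NODES, 500, 512, {"least_allocated": 1, "image_locality": 0}))
```

Changing only the weights flips the winner between node-c and node-b, which is the same lever the `weight` fields in a scheduler profile expose.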
Code Example
# Custom scheduler profile with specific plugins enabled
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: payments-scheduler # Custom scheduler name for payments workloads
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation # Prefer nodes with balanced CPU/memory
            weight: 2 # Double the weight for balanced allocation
          - name: ImageLocality # Prefer nodes that already have the image cached
            weight: 1 # Standard weight for image locality
    pluginConfig:
      - name: NodeResourcesFit # In the v1 API, bin-packing is a scoring strategy of this plugin
        args:
          scoringStrategy:
            type: LeastAllocated # Spread load instead of bin-packing (MostAllocated)
      - name: PodTopologySpread # Configure topology spread constraints
        args:
          defaultingType: List # Use list-based defaulting
          defaultConstraints: # Spread across zones by default
            - maxSkew: 1 # Allow at most 1 Pod difference between zones
              topologyKey: topology.kubernetes.io/zone # Spread across AZs
              whenUnsatisfiable: ScheduleAnyway # Soft constraint - still schedule if skew exceeded
---
# Pod with resource requests that drive scheduling decisions
apiVersion: v1
kind: Pod
metadata:
  name: payments-api-7f8d9c # Realistic Pod name with hash suffix
  namespace: payments # Namespace for the payments service
  labels:
    app: payments-api # Label for service discovery
    tier: backend # Label for topology spread
spec:
  schedulerName: payments-scheduler # Use the custom scheduler defined above
  topologySpreadConstraints: # Spread Pods across zones for HA
    - maxSkew: 1 # Maximum difference in Pod count between zones
      topologyKey: topology.kubernetes.io/zone # Spread across AZs
      whenUnsatisfiable: DoNotSchedule # Hard constraint - block if cannot satisfy
      labelSelector: # Match Pods with the same app label
        matchLabels:
          app: payments-api # Select all payments-api Pods
  containers:
    - name: payments-api # Main container name
      image: registry.internal.io/payments-api:v2.4.1 # Internal registry image
      resources:
        requests: # These values drive the scheduler filtering phase
          cpu: 500m # Request half a CPU core
          memory: 512Mi # Request 512 MiB of memory
        limits: # Limits enforce runtime cgroups constraints
          cpu: "1" # Limit to 1 full CPU core
          memory: 1Gi # Limit to 1 GiB of memory
◈ Architecture Diagram
┌───────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│Unscheduled│    │  Filter  │    │  Score   │    │   Bind   │
│    Pod    │───→│  Phase   │───→│  Phase   │───→│ to Node  │
└───────────┘    └──────────┘    └──────────┘    └──────────┘
                      │               │
                      ↓               ↓
                 ┌──────────┐    ┌──────────┐
                 │Eliminate │    │ Rank by  │
                 │Infeasible│    │ Weighted │
                 │  Nodes   │    │  Scores  │
                 └──────────┘    └──────────┘
Quick Answer
When etcd loses quorum (majority of members are down), the cluster becomes read-only and cannot process writes, meaning no new Pods can be scheduled and no state changes can be persisted. Recovery involves either restoring enough members to regain quorum or rebuilding from a snapshot backup.
Detailed Answer
Imagine a board of directors that requires a majority vote to approve any decision. If the company has five board members and three resign suddenly, the remaining two cannot approve anything, even if they agree, because they lack the required majority. The company is paralyzed: no new hires, no budget changes, nothing. That is exactly what happens when etcd loses quorum: the remaining members know the current state but cannot authorize any changes.
In Kubernetes, etcd is the single source of truth for all cluster state: every Pod definition, Service, ConfigMap, Secret, and controller state lives in etcd. The kube-apiserver reads from and writes to etcd exclusively. Etcd uses the Raft consensus algorithm, which requires a strict majority of members, floor(N/2) + 1, to agree on writes. For a 3-member etcd cluster, quorum requires 2 members; for 5 members, it requires 3. When quorum is lost, etcd can still serve stale reads (depending on consistency settings) but rejects all write operations.
When quorum is lost, the chain of failure propagates quickly. The kube-apiserver begins returning errors for any mutating request (POST, PUT, DELETE) because etcd refuses writes. Controllers in the kube-controller-manager that rely on leader election through the apiserver may lose their leases. The scheduler cannot bind Pods to nodes. Existing workloads continue running because kubelets cache their Pod specs locally and container runtimes are independent of the control plane. However, no new Pods can be created, no scaling can occur, node heartbeats cannot be persisted (which can eventually surface as node NotReady conditions), and self-healing stops entirely. The cluster is alive but brain-dead.
Recovery depends on the failure scenario. If etcd members are down due to transient issues (network partition, disk pressure, or crashed processes), the fastest path is to bring enough members back online to restore quorum. Check each member with etcdctl endpoint status and etcdctl member list. If a member's data is corrupted, remove it from the cluster with etcdctl member remove, then re-add it as a new member with etcdctl member add and let it rejoin and replicate. Membership changes are themselves writes, so this path works only while a majority of members is still healthy. For catastrophic failure where all members are lost, you must restore from an etcd snapshot. Take regular snapshots with etcdctl snapshot save, then restore with etcdctl snapshot restore to a new data directory on each member, updating the initial-cluster and initial-advertise-peer-urls flags. After restoration, restart etcd and verify the kube-apiserver reconnects.
In production, etcd failures at scale are often caused by slow disks, large key-value sizes from too many Kubernetes objects, or aggressive compaction settings. The write-ahead log (WAL) is sensitive to disk latency; etcd recommends dedicated SSDs with sub-10ms p99 fsync latency. A non-obvious gotcha is that etcd v3 has a default storage limit of 2GB (configurable; 8GB is the suggested maximum), and if the database exceeds this limit, etcd raises a NOSPACE alarm and accepts only reads and deletes, which effectively looks like quorum loss. Another trap: during recovery, if you restore a snapshot to an odd number of members but start them with stale peer URLs, they may form split-brain scenarios. Always restore all members from the same snapshot simultaneously and use a fresh cluster token to prevent old members from rejoining.
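The quorum arithmetic above is easy to check by hand. The following sketch assumes nothing beyond the majority rule stated in the text; the function names are illustrative.

```python
def quorum(members: int) -> int:
    """Smallest strict majority of a cluster: floor(N/2) + 1."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while the cluster still accepts writes."""
    return members - quorum(members)


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members form a majority of the configured cluster."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5, 7):
        print(f"members={n} quorum={quorum(n)} tolerates={fault_tolerance(n)} failures")
```

Note that a 4-member cluster tolerates the same single failure as a 3-member one, which is why odd member counts are the usual recommendation.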
Code Example
# Check etcd cluster health and member status
# (--cacert: CA certificate; --cert/--key: client certificate and key for TLS auth)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
# List all etcd members and their status (connect to a surviving member)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table
# Create a timestamped snapshot backup (run this as a CronJob in production)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
# Restore from snapshot on each etcd member, changing --name and
# --initial-advertise-peer-urls per member. The new cluster token prevents old
# members from rejoining; the new data directory avoids conflicts.
# Note: etcd 3.5+ deprecates this subcommand in favor of `etcdutl snapshot restore`.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20260615-030000.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380 \
  --initial-cluster-token=etcd-cluster-recovery-1 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored
◈ Architecture Diagram
┌──────────┐    ┌──────────┐    ┌──────────┐
│  etcd-0  │    │  etcd-1  │    │  etcd-2  │
│ HEALTHY  │    │   DOWN   │    │   DOWN   │
└────┬─────┘    └──────────┘    └──────────┘
     │
     ↓
┌──────────┐    ┌──────────┐
│  Quorum  │───→│ API Srvr │
│   LOST   │    │   Read   │
│ No Write │    │   Only   │
└──────────┘    └────┬─────┘
                     │
          ┌──────────┴──────────┐
          ↓                     ↓
     ┌──────────┐          ┌──────────┐
     │Scheduler │          │Controller│
     │ Blocked  │          │ Blocked  │
     └──────────┘          └──────────┘
Quick Answer
Architects tune etcd by sizing disks for low-latency IOPS, adjusting compaction and defragmentation schedules, monitoring database size and peer latency, and separating the events store. Sharding the main etcd or using virtual clusters becomes necessary when a single etcd instance approaches 8 GB or 30,000-40,000 objects and API server latency degrades.
Detailed Answer
Think of a library card catalog. When the library has a few thousand books, one cabinet handles lookups fine. But when the library grows to millions of books and hundreds of librarians are searching simultaneously, you either need a faster cabinet, multiple cabinets organized by subject, or a way to archive old cards. Etcd is that card catalog for Kubernetes: every resource definition, status update, and event is a card in the catalog.
Etcd is the sole persistent store for Kubernetes cluster state. Every API server read and write flows through etcd, making its performance the ceiling for cluster responsiveness. For large clusters (those with tens of thousands of Pods, thousands of Services, or high churn from controllers and operators), etcd becomes the bottleneck before CPU, memory, or network do. The key metrics are fsync latency (which depends on disk IOPS), database size, number of keys, leader election frequency, and peer round-trip time between etcd members.
Internally, etcd uses a B-tree index with multi-version concurrency control, or MVCC, keeping every revision of every key until compacted. Compaction removes old revisions, and defragmentation reclaims disk space after compaction. Without regular compaction, the database grows unboundedly. Kubernetes runs automatic compaction every five minutes by default, but operators must also schedule defragmentation because compaction alone does not free physical disk space. On cloud providers, using provisioned IOPS SSD volumes (like gp3 with 6000+ IOPS on AWS) is critical because etcd performance degrades sharply when fsync latency exceeds 10 milliseconds.
At production scale, the first architectural decision is separating the events store. Kubernetes Events are high-volume, short-lived objects that create write pressure without carrying critical state. Running a dedicated etcd instance for Events reduces load on the main etcd cluster significantly. AWS EKS offers provisioned control plane tiers (XL, 2XL, 4XL) that scale etcd database limits up to 16 GB for clusters running AI and ML workloads with many custom resources. When even separated events and tuned compaction are insufficient, the next scaling lever is either true etcd sharding (distributing different API groups to separate etcd clusters) or virtual clusters that maintain independent etcd instances per tenant.
The non-obvious gotcha is that etcd performance problems often manifest as API server timeouts or slow kubectl responses, and teams blame the API server rather than looking at etcd disk latency. Every write must be persisted by a majority of members before it commits, so a single slow member in a three-node cluster hurts most when it is the leader, or when another member is already unavailable and the slow one is needed to form the majority. Architects should alert on p99 fsync duration, database size approaching 8 GB, and any leader changes, because a leader election storm during high write load can cascade into control-plane unavailability.
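The gap between compaction and defragmentation, and the "approaching 8 GB" alert, can be expressed as a small check. This is a sketch under stated assumptions: it takes the physical database size and the logically in-use size as plain byte counts (for example, read from an etcd status report), and the function names plus the 80% and 50% thresholds are hypothetical.

```python
from typing import List

GIB = 1024 ** 3


def reclaimable_fraction(db_size: int, db_size_in_use: int) -> float:
    """Share of the on-disk file that compaction freed but defrag has not reclaimed."""
    if db_size <= 0:
        return 0.0
    return max(0, db_size - db_size_in_use) / db_size


def etcd_size_alerts(db_size: int, db_size_in_use: int,
                     quota_bytes: int = 8 * GIB,   # size limit discussed in the text
                     quota_warn: float = 0.8,      # hypothetical: warn at 80% of quota
                     defrag_at: float = 0.5) -> List[str]:  # hypothetical: defrag at 50% reclaimable
    """Return the alerts that apply to one etcd member."""
    alerts = []
    if db_size >= quota_warn * quota_bytes:
        alerts.append("near-quota")
    if reclaimable_fraction(db_size, db_size_in_use) >= defrag_at:
        alerts.append("defrag")
    return alerts


if __name__ == "__main__":
    # 7 GiB on disk, only 3 GiB live after compaction: close to quota, worth defragmenting.
    print(etcd_size_alerts(7 * GIB, 3 * GIB))
    # 2 GiB on disk, almost all of it live: nothing to do.
    print(etcd_size_alerts(2 * GIB, int(1.9 * GIB)))
```

A member that trips "near-quota" mostly because of reclaimable space needs a defragmentation, not a larger quota, which is why the two signals are reported separately.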
Code Example
# Check etcd database size, leader status, and Raft index for a member
ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table
# Inspect the WAL fsync latency histogram exposed as Prometheus metrics
# (client certificates are required when etcd enforces client-cert auth on this port)
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  https://etcd-0.etcd.kube-system:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds
# Trigger a manual defragmentation on a specific member during a maintenance window
ETCDCTL_API=3 etcdctl defrag --endpoints=https://etcd-1.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Configure API server to use a separate etcd instance for Event objects
# In kube-apiserver manifest or startup flags:
# --etcd-servers=https://etcd-main:2379 # Main etcd for all resources
# --etcd-servers-overrides=/events#https://etcd-events:2379 # Separate etcd for Events
# Check Kubernetes object counts by resource type to identify growth
kubectl get --raw='/metrics' | grep '^apiserver_storage_objects' | sort -t' ' -k2 -rn | head -20
◈ Architecture Diagram
  ┌──────────┐
  │API Server│
  └──┬───┬───┘
     │   │
     ↓   ↓
┌─────┐ ┌─────────┐
│Main │ │Events   │
│etcd │ │etcd     │
│< 8GB│ │separate │
└─────┘ └─────────┘
   │
┌──┴───────────┐
│Compact+Defrag│
└──────────────┘
Quick Answer
Topology spread constraints tell the scheduler to distribute Pods across failure domains defined by node labels such as zone or hostname, using maxSkew to control imbalance. When combined with cluster autoscaling, problems arise if a zone has zero nodes — the autoscaler may not know about the zone, causing the scheduler to leave Pods pending indefinitely.
Detailed Answer
Think of seating guests at a wedding reception. You want to spread friends evenly across tables so no table is overcrowded and no group is isolated. The wedding planner checks how many people are at each table and seats the next guest at the most empty one, but if a table does not exist yet (no physical table has been set up), the planner cannot seat anyone there even if the venue has room. Topology spread constraints in Kubernetes work the same way. Kubernetes topology spread constraints are declared in the Pod spec under topologySpreadConstraints. Each constraint specifies a topologyKey (a node label like topology.kubernetes.io/zone or kubernetes.io/hostname), a maxSkew (the maximum allowed difference in Pod count between the most-populated and least-populated domain), a whenUnsatisfiable behavior (DoNotSchedule or ScheduleAnyway), and a labelSelector to identify which Pods count toward the spread calculation. Internally, the scheduler evaluates topology spread during the Filter and Score phases. In the Filter phase, it eliminates nodes where placing the Pod would violate the maxSkew when whenUnsatisfiable is DoNotSchedule. In the Score phase, it ranks remaining nodes by how well they balance the distribution. The scheduler considers the topologyKey label on existing nodes to define domains — a domain only exists if at least one node carries that label value. It then counts matching Pods per domain and calculates whether the new Pod can land in each domain without exceeding maxSkew. At production scale, the interaction with cluster autoscaling creates subtle failures. If a node pool in one availability zone scales to zero, that zone disappears from the scheduler's topology map. The scheduler only sees zones with active nodes, so it may consider a two-zone spread sufficient even when three zones are available. 
When maxSkew is 1 and whenUnsatisfiable is DoNotSchedule, the scheduler can leave Pods pending because it cannot place them in a zone that has no nodes, and the autoscaler may not create a node in the missing zone because it does not see pending Pods that specifically require it. This chicken-and-egg problem is one of the most common production issues with topology spread constraints. The non-obvious gotcha is that the spread calculation counts every Pod that matches the labelSelector, whether or not it is Ready and regardless of which rollout revision it belongs to; the scheduler skips only Pods that already carry a deletion timestamp. During a rolling update, Pods from the old ReplicaSet therefore keep counting toward the spread until they are deleted, which can push new Pods onto unbalanced placements or leave them unschedulable. Listing pod-template-hash under matchLabelKeys restricts the calculation to Pods of the same revision. Architects should set minDomains to explicitly declare how many zones the spread should consider, use node affinity in combination with spread constraints to ensure the autoscaler knows about expected zones, and monitor for unschedulable Pods with topology spread violation events.
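The skew check described above can be sketched in a few lines. This is a simplified model rather than the scheduler's implementation: it assumes Pod counts per domain are already known, and the function name `eligible_domains` is hypothetical. It follows the documented rule that when fewer domains exist than minDomains, the global minimum is treated as zero.

```python
from typing import Dict, List, Optional


def eligible_domains(
    pods_per_domain: Dict[str, int],
    max_skew: int,
    min_domains: Optional[int] = None,
) -> List[str]:
    """Return the domains where one more Pod keeps the skew within max_skew.

    Skew for a candidate domain = (matching Pods there + 1) - global minimum.
    """
    if not pods_per_domain:
        return []
    global_min = min(pods_per_domain.values())
    # With minDomains set and too few visible domains, treat the minimum as 0.
    if min_domains is not None and len(pods_per_domain) < min_domains:
        global_min = 0
    return sorted(
        domain
        for domain, count in pods_per_domain.items()
        if count + 1 - global_min <= max_skew
    )


if __name__ == "__main__":
    print(eligible_domains({"zone-a": 2, "zone-b": 2, "zone-c": 1}, max_skew=1))
    print(eligible_domains({"zone-a": 1, "zone-b": 1}, max_skew=1, min_domains=3))
```

In the second call only two zones are visible but three are declared, so no zone is eligible and the Pod stays pending. That pending Pod is the signal an autoscaler needs to add capacity in the missing zone.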
Code Example
# Apply a Deployment with zone and node spread constraints
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages replicated Pods
metadata:
  name: checkout-api # Production checkout service
  namespace: payments # Team namespace
spec:
  replicas: 6 # Six replicas to spread across three zones with two per zone
  selector:
    matchLabels:
      app: checkout-api # Pod selector
  template:
    metadata:
      labels:
        app: checkout-api # Label used by spread constraint selector
    spec:
      topologySpreadConstraints:
        - maxSkew: 1 # Allows at most one Pod difference between zones
          topologyKey: topology.kubernetes.io/zone # Spreads across availability zones
          whenUnsatisfiable: DoNotSchedule # Strictly enforces zone balance
          labelSelector:
            matchLabels:
              app: checkout-api # Counts only checkout-api Pods
          minDomains: 3 # Expects three zones even if some have zero nodes
        - maxSkew: 1 # Allows at most one Pod difference between nodes within a zone
          topologyKey: kubernetes.io/hostname # Spreads across individual nodes
          whenUnsatisfiable: ScheduleAnyway # Prefers balance but allows imbalance
          labelSelector:
            matchLabels:
              app: checkout-api # Counts only checkout-api Pods
      containers:
        - name: api # Application container
          image: registry.company.com/checkout-api:3.7.2 # Versioned production image
          resources:
            requests:
              cpu: 250m # Minimum CPU for scheduling
              memory: 512Mi # Minimum memory for scheduling
# Check which node each Pod landed on
kubectl get pods -n payments -l app=checkout-api -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'
# Map nodes to zones (the zone label lives on the Node, not on the Pod)
kubectl get nodes -L topology.kubernetes.io/zone
# Identify Pods pending due to topology spread violations
kubectl get events -n payments --field-selector reason=FailedScheduling | grep topology
◈ Architecture Diagram
┌─── Zone A ──┐ ┌─── Zone B ──┐ ┌─── Zone C ──┐
│ ┌────┐┌────┐│ │ ┌────┐┌────┐│ │ ┌────┐┌────┐│
│ │Pod1││Pod2││ │ │Pod3││Pod4││ │ │Pod5││Pod6││
│ └────┘└────┘│ │ └────┘└────┘│ │ └────┘└────┘│
│ maxSkew=1   │ │ maxSkew=1   │ │ maxSkew=1   │
└─────────────┘ └─────────────┘ └─────────────┘
Quick Answer
Scheduler plugins hook into the scheduling framework's extension points (PreFilter, Filter, PreScore, Score, Reserve, Permit, PreBind, Bind) to add custom logic like gang scheduling, co-scheduling, or capacity reservation. Scheduling profiles allow running multiple schedulers with different plugin configurations. Risks include increased scheduling latency, unintended Pod starvation, and complex debugging when plugins interact.
Detailed Answer
Think of a wedding seating planner with very specific rules. The basic planner checks table capacity and guest preferences. But this wedding also requires that certain groups of guests must all be seated simultaneously (gang scheduling), some tables are reserved for VIPs until the last minute (capacity reservation), and guests from rival families must never share an aisle (anti-affinity). Standard rules cannot express all of this, so the planner adds specialized checkers at different stages of the seating process. Kubernetes scheduler plugins work exactly this way. The Kubernetes scheduling framework replaced the old policy-based scheduler configuration with a plugin architecture. The scheduler processes each Pod through a pipeline of extension points: PreFilter (validate and preprocess), Filter (eliminate ineligible nodes), PostFilter (handle unschedulable Pods), PreScore (prepare scoring data), Score (rank eligible nodes), Reserve (tentatively claim resources), Permit (wait or approve), PreBind (prepare external resources), and Bind (commit the Pod to a node). Each extension point can have multiple plugins that run in order. Scheduling profiles allow a single kube-scheduler binary to expose multiple scheduler personalities. Each profile has a name and its own set of enabled, disabled, and configured plugins. A Pod selects its scheduler by setting spec.schedulerName. This means architects can run a default profile for general workloads and a specialized profile for GPU workloads, batch jobs, or latency-sensitive services without deploying separate scheduler binaries. The scheduler-plugins project from Kubernetes SIGs provides production-grade plugins like Coscheduling (gang scheduling for batch workloads that need all Pods scheduled together), Capacity Scheduling (enforcing elastic quotas across namespaces), and Trimaran (scoring based on real-time node use from metrics server). 
At production scale, custom scheduler plugins require careful testing because they affect every Pod placement decision. A slow PreFilter or Score plugin increases scheduling latency for all Pods using that profile. A buggy Filter plugin can make nodes ineligible when they should be available, causing Pods to remain pending. Plugin ordering matters because earlier plugins in the chain can mask or override later ones. Architects should measure scheduler latency percentiles (for example scheduler_scheduling_attempt_duration_seconds and scheduler_plugin_execution_duration_seconds), unschedulable Pod counts, and plugin-specific metrics before and after enabling custom plugins. The non-obvious gotcha is debugging scheduling failures with custom plugins. When a Pod is unschedulable, the FailedScheduling event summarizes why nodes were rejected, but the interaction between multiple plugins can create emergent behavior that is hard to trace. For example, a topology spread constraint combined with a capacity reservation plugin can create scenarios where Pods are pending not because of resource shortage but because the combination of constraints has no feasible solution. Architects should use the scheduler's verbose logging, the scheduling-queue metrics, and dry-run scheduling tools to validate plugin interactions before production deployment.
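To make the Filter and Score stages concrete, here is a toy model of the two phases. It is a sketch under simplifying assumptions, not the kube-scheduler framework API: `Node`, `Pod`, `fits_cpu`, `prefer_most_free_cpu`, and `schedule` are hypothetical names, only CPU is modelled, and ties are broken by node name rather than randomly.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    free_cpu_millis: int


@dataclass(frozen=True)
class Pod:
    name: str
    cpu_request_millis: int


FilterPlugin = Callable[[Pod, Node], bool]
ScorePlugin = Callable[[Pod, Node], int]


def fits_cpu(pod: Pod, node: Node) -> bool:
    """Filter: reject nodes that cannot hold the Pod's CPU request."""
    return node.free_cpu_millis >= pod.cpu_request_millis


def prefer_most_free_cpu(pod: Pod, node: Node) -> int:
    """Score: more CPU left after placement means a higher score."""
    return node.free_cpu_millis - pod.cpu_request_millis


def schedule(
    pod: Pod,
    nodes: Sequence[Node],
    filters: Sequence[FilterPlugin],
    scorers: Sequence[Tuple[ScorePlugin, int]],
) -> Optional[str]:
    """Run every filter, then pick the node with the best weighted score."""
    feasible = [n for n in nodes if all(f(pod, n) for f in filters)]
    if not feasible:
        return None  # The Pod would stay pending.

    def weighted_score(node: Node) -> int:
        return sum(weight * scorer(pod, node) for scorer, weight in scorers)

    # Highest score wins; node name is a deterministic tie-breaker.
    best = min(feasible, key=lambda n: (-weighted_score(n), n.name))
    return best.name


if __name__ == "__main__":
    cluster = [Node("node-a", 500), Node("node-b", 2000), Node("node-c", 1000)]
    print(schedule(Pod("api", 750), cluster, [fits_cpu], [(prefer_most_free_cpu, 1)]))
```

Even in this toy version, one overly strict filter empties the feasible list and leaves the Pod pending, which is the failure mode described above.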
Code Example
# KubeSchedulerConfiguration with two profiles: default and batch-coscheduling
# (assumes a kube-scheduler build that includes the scheduler-plugins Coscheduling plugin)
apiVersion: kubescheduler.config.k8s.io/v1 # Scheduler configuration API
kind: KubeSchedulerConfiguration # Configures the kube-scheduler binary
profiles:
  - schedulerName: default-scheduler # Default profile for general workloads
    plugins:
      queueSort: # Every profile must use the same queueSort plugin
        enabled:
          - name: Coscheduling # Shared queue ordering that keeps gang members together
        disabled:
          - name: "*" # Replaces the default PrioritySort
      score:
        enabled:
          - name: NodeResourcesFit # Scores nodes by resource availability
            weight: 1 # Standard weight
          - name: InterPodAffinity # Scores based on Pod affinity preferences
            weight: 1 # Standard weight
  - schedulerName: batch-scheduler # Specialized profile for ML training jobs
    plugins:
      queueSort:
        enabled:
          - name: Coscheduling # Sorts Pods so gang members are scheduled together
        disabled:
          - name: "*" # Replaces the default PrioritySort
      preFilter:
        enabled:
          - name: Coscheduling # Validates that all gang members exist
      postFilter:
        enabled:
          - name: Coscheduling # Rejects the waiting gang when a member cannot be scheduled
      permit:
        enabled:
          - name: Coscheduling # Holds Pods until all gang members are schedulable
      reserve:
        enabled:
          - name: Coscheduling # Reserves resources for the complete gang
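# A batch training job that uses gang scheduling via the batch-scheduler profile
apiVersion: batch/v1 # Standard Job API
kind: Job # Batch workload requiring all workers to start together
metadata:
  name: fraud-model-training # Distributed training job
  namespace: ml-platform # ML team namespace
  labels:
    pod-group.scheduling.sigs.k8s.io/name: fraud-training-gang # Gang scheduling group name
    pod-group.scheduling.sigs.k8s.io/min-available: "4" # All four workers must be scheduled
spec:
  parallelism: 4 # Four parallel training workers
  completions: 4 # Job completes when all four finish
  template:
    metadata:
      labels:
        pod-group.scheduling.sigs.k8s.io/name: fraud-training-gang # Same gang group label
        # The plugin reads gang labels from the Pods themselves; newer scheduler-plugins
        # releases define gangs with a PodGroup resource instead, so check the release you run
        pod-group.scheduling.sigs.k8s.io/min-available: "4"
    spec:
      schedulerName: batch-scheduler # Uses the coscheduling profile
      restartPolicy: Never # Job Pods must use Never or OnFailure
      containers:
        - name: trainer # Distributed training worker container
          image: registry.company.com/fraud-trainer:4.3.1 # ML training image
          resources:
            requests:
              cpu: 4 # Four CPU cores per worker
              memory: 16Gi # 16GB memory per worker
# Monitor scheduling latency for the batch profile
# These metrics are served by kube-scheduler itself (secure port 10259 by default), not by the
# API Server's /metrics endpoint; scrape them with Prometheus or query the scheduler directly
# ($TOKEN and $SCHEDULER_HOST are placeholders for an authorized token and the scheduler address)
curl -sk -H "Authorization: Bearer $TOKEN" "https://$SCHEDULER_HOST:10259/metrics" | grep scheduler_scheduling_attempt_duration_seconds | grep batch-scheduler
◈ Architecture Diagram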
┌──────────────────────────────────┐
│       Scheduling Pipeline        │
│                                  │
│ PreFilter → Filter → PostFilter  │
│              ↓                   │
│ PreScore → Score → Reserve       │
│              ↓                   │
│ Permit → PreBind → Bind          │
│                                  │
│ ┌──────────┐ ┌────────────────┐  │
│ │ Default  │ │ Batch Profile  │  │
│ │ Profile  │ │ +Coscheduling  │  │
│ └──────────┘ └────────────────┘  │
└──────────────────────────────────┘
Quick Answer
The control plane consists of the API Server (REST frontend that validates all requests), etcd (distributed key-value store holding all cluster state), Scheduler (assigns Pods to nodes based on constraints), and Controller Manager (runs reconciliation loops that drive actual state toward desired state).
Detailed Answer
Think of the control plane like the management floor of a large warehouse. The API Server is the front desk receptionist who handles every single request coming in — whether it's a customer placing an order, a supervisor checking inventory, or a new employee asking for directions. Every interaction with the warehouse goes through this one desk, no exceptions. etcd is the filing cabinet behind the desk that holds the single source of truth: every order, every employee record, every inventory count. The Scheduler is the floor manager who decides which aisle worker handles which incoming package based on who has capacity. The Controller Manager is the quality inspector who constantly walks the floor comparing what should be happening (orders to fulfill) with what is actually happening (packages on shelves), and files corrective actions when they don't match. The API Server (kube-apiserver) is the only component that talks directly to etcd. Every kubectl command, every internal component communication, and every webhook goes through the API Server as HTTPS REST calls. It performs authentication (who are you?), authorization via RBAC (are you allowed to do this?), admission control (should this request be modified or rejected?), and validation (is this YAML well-formed?) before persisting anything to etcd. It also serves the watch API, which lets other components subscribe to changes in real-time rather than polling — this is how the entire system stays reactive. etcd is a distributed, strongly-consistent key-value store built on the Raft consensus algorithm. It stores every object in the cluster: every Pod spec, every Service definition, every Secret, every ConfigMap. In production, etcd runs as a 3 or 5 node cluster (always odd numbers for quorum) and is often the first component to cause cluster-wide outages when it becomes unhealthy. 
etcd performance directly determines API Server response time — slow disk I/O on etcd nodes is the number one silent killer of Kubernetes clusters. Production teams typically dedicate SSD-backed nodes exclusively for etcd and monitor fsync latency religiously. The Scheduler (kube-scheduler) watches the API Server for newly created Pods that have no node assigned (spec.nodeName is empty). For each unscheduled Pod, it runs a two-phase algorithm: filtering (eliminate nodes that don't meet hard requirements like resource requests, nodeSelector, taints/tolerations, and affinity rules) and scoring (rank remaining nodes by soft preferences like spreading Pods across failure domains, preferring nodes with the image already cached, or balancing resource utilization). The highest-scoring node wins, and the Scheduler writes the node assignment back to the API Server. The Controller Manager (kube-controller-manager) is actually dozens of separate control loops compiled into a single binary for simplicity. Each controller watches a specific resource type and reconciles actual state with desired state. The ReplicaSet controller ensures the right number of Pods exist. The Deployment controller manages ReplicaSets during rollouts. The Node controller detects when nodes go offline. The Endpoint controller populates Service endpoints. The Job controller manages batch workloads. If the Controller Manager crashes, no reconciliation happens — Pods keep running but nothing self-heals until it's back. A critical production gotcha: many teams monitor Pod health but forget to monitor control plane health. If the API Server is overloaded (common with too many custom controllers or misconfigured HPA polling intervals), the entire cluster becomes unresponsive — you can't deploy, can't scale, can't even see what's broken. 
Production clusters should have dedicated monitoring for API Server request latency (apiserver_request_duration_seconds), etcd fsync duration (etcd_disk_wal_fsync_duration_seconds), and scheduler queue depth (scheduler_pending_pods).
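The reconciliation idea behind every controller (compare desired state with actual state, then act on the difference) can be sketched as a single pure function. This is an illustration only, not the ReplicaSet controller's code; `reconcile` and the Pod naming scheme are assumptions made for the example.

```python
from typing import List, Tuple


def reconcile(
    desired_replicas: int,
    running_pods: List[str],
    name_prefix: str = "pod",
) -> Tuple[List[str], List[str]]:
    """Return (pods_to_create, pods_to_delete) that close the gap to desired state."""
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        existing = set(running_pods)
        to_create: List[str] = []
        index = 0
        while len(to_create) < diff:
            candidate = f"{name_prefix}-{index}"
            if candidate not in existing:
                to_create.append(candidate)
            index += 1
        return to_create, []
    if diff < 0:
        # Too many Pods: remove the surplus (here, the last names in sorted order).
        return [], sorted(running_pods)[diff:]
    return [], []


if __name__ == "__main__":
    print(reconcile(3, ["pod-0"]))
    print(reconcile(1, ["pod-0", "pod-1", "pod-2"]))
```

A real controller runs this comparison every time a watch event arrives. If the Controller Manager is down, nothing calls the function and the gap simply stays open.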
Code Example
# Check control plane component health
# (kubectl get componentstatuses is deprecated since v1.19; the livez/readyz endpoints are the supported checks)
kubectl get --raw='/livez?verbose'
kubectl get --raw='/readyz?verbose'
# View control plane Pods (self-hosted clusters like kubeadm)
kubectl get pods -n kube-system -l tier=control-plane
# NAME                                READY   STATUS
# etcd-master-01                      1/1     Running
# kube-apiserver-master-01            1/1     Running
# kube-controller-manager-master-01   1/1     Running
# kube-scheduler-master-01            1/1     Running
# Check API Server response time (latency issues?)
kubectl get --raw='/metrics' | grep apiserver_request_duration
# Verify etcd cluster health
kubectl exec -n kube-system etcd-master-01 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# Check scheduler is making decisions
kubectl get events --field-selector reason=Scheduled -A
# View controller-manager logs for reconciliation errors
kubectl logs -n kube-system kube-controller-manager-master-01 \
  --tail=50 | grep -i error
# Search API Server logs for a workload name when troubleshooting
# (audit events go to the destination set by --audit-log-path, configured alongside --audit-policy-file)
kubectl logs -n kube-system kube-apiserver-master-01 \
  | grep payments-api
◈ Architecture Diagram
┌─────────────── Control Plane ──────────────────────────┐
│                                                        │
│   ┌────────────────────────────────────────┐           │
│   │             kube-apiserver             │           │
│   │    REST frontend + auth + admission    │           │
│   └──────────┬────────────┬────────────────┘           │
│              │            │                            │
│        watch │            │ read/write                 │
│              │            ▼                            │
│              │     ┌──────────────┐                    │
│              │     │     etcd     │                    │
│              │     │  key-value   │                    │
│              │     │ cluster state│                    │
│              │     └──────────────┘                    │
│              │                                         │
│     ┌────────┴──────────────────────┐                  │
│     │                               │                  │
│     ▼                               ▼                  │
│ ┌──────────────┐ ┌──────────────────────────┐          │
│ │kube-scheduler│ │ kube-controller-manager  │          │
│ │              │ │                          │          │
│ │ filter →     │ │ ReplicaSet controller    │          │
│ │ score →      │ │ Deployment controller    │          │
│ │ bind Pod to  │ │ Node controller          │          │
│ │ best node    │ │ Endpoint controller      │          │
│ └──────────────┘ └──────────────────────────┘          │
│                                                        │
└────────────────────────────────────────────────────────┘
                             │
                             │ API Server communicates with
                             ▼
               ┌─────── Worker Nodes ──────┐
               │   kubelet   │ kube-proxy  │
               └───────────────────────────┘
Quick Answer
Each worker node runs kubelet (agent that starts and monitors Pods), kube-proxy (manages networking rules for Service routing), and a container runtime (containerd or CRI-O that actually runs containers). The kubelet communicates with the API Server via HTTPS to receive Pod assignments and report node status.
Detailed Answer
Think of a worker node like a restaurant kitchen station. The kubelet is the line cook who receives tickets (Pod specs) from the head chef (API Server) and actually prepares the dishes (starts containers). The container runtime (containerd or CRI-O) is the set of pots, pans, and ovens that the cook uses — the actual tools that do the cooking. And kube-proxy is the waiter who knows which table ordered which dish, making sure the right plate gets to the right customer via the correct route. Without any one of these three, the kitchen cannot function. The kubelet is the primary node agent. It registers the node with the API Server, then watches (via the API Server's watch mechanism) for PodSpecs assigned to its node. When a new Pod is scheduled to its node, the kubelet calls the container runtime through the Container Runtime Interface (CRI) to pull images and start containers. It then continuously monitors container health by executing liveness and readiness probes at configured intervals. If a liveness probe fails, the kubelet restarts the container. It reports Pod status and node conditions (memory pressure, disk pressure, PID pressure, network unavailable) back to the API Server every 10 seconds by default (the node-status-update-frequency). If the API Server doesn't receive heartbeats for 40 seconds (the node-monitor-grace-period), the node is marked NotReady. The container runtime is the software that actually creates and runs containers using Linux kernel features like namespaces (isolation) and cgroups (resource limits). Kubernetes removed direct Docker support in v1.24 — it now requires a CRI-compatible runtime. The two production choices are containerd (lightweight, used by EKS, GKE, and most managed platforms) and CRI-O (purpose-built for Kubernetes, used by OpenShift). The kubelet communicates with the runtime via a Unix socket using the gRPC-based CRI protocol. The runtime handles image pulling, container lifecycle, and log management. 
kube-proxy runs on every node and implements the Service abstraction. When you create a Service, kube-proxy watches the API Server for Service and Endpoint objects (EndpointSlices in current releases), then programs the node's networking stack to route traffic correctly. In the default iptables mode, it creates iptables rules that perform DNAT (destination NAT) to translate the virtual Service IP to a real Pod IP, with random selection for load balancing. In IPVS mode (better for clusters with thousands of Services), it uses the Linux kernel's IPVS load balancer, which supports multiple algorithms (round-robin, least-connections, source-hash). kube-proxy does NOT proxy traffic through itself; it only configures networking rules, and actual packets flow directly from source to destination Pod. For normal operation, communication between worker nodes and the control plane is initiated from the node side: the kubelet opens the connections to the API Server. This is a security design: worker nodes can be in untrusted networks, and the API Server does not need to push Pod assignments or configuration to them. The kubelet establishes a persistent watch connection to the API Server, which means it receives updates the instant they happen (Pod scheduled, Pod deleted, ConfigMap changed) without polling. The exception is features like `kubectl exec` and `kubectl logs`, where the API Server opens a connection to the kubelet's HTTPS endpoint (port 10250), which is why the kubelet has its own TLS serving certificate. A common production gotcha: if the container runtime's socket becomes unresponsive (containerd hung, disk full preventing image pulls), the kubelet cannot start new Pods or report accurate status. The node might still show Ready because the kubelet process itself is fine, but Pods scheduled there will be stuck in ContainerCreating. Monitoring containerd/CRI-O process health separately from kubelet health is essential for catching this early.
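The "random selection" in iptables mode relies on a chain of rules that are tried in order, each matching with a probability chosen so every backend ends up with an equal share. The sketch below only checks that arithmetic and touches no iptables state; the function names are hypothetical.

```python
from fractions import Fraction
from typing import List


def rule_probabilities(endpoints: int) -> List[Fraction]:
    """Match probability per rule, in evaluation order: 1/n, 1/(n-1), ..., 1."""
    if endpoints < 1:
        raise ValueError("a Service needs at least one endpoint")
    return [Fraction(1, endpoints - i) for i in range(endpoints)]


def effective_share(endpoints: int) -> List[Fraction]:
    """Overall chance that each endpoint receives a new connection."""
    shares: List[Fraction] = []
    remaining = Fraction(1)  # Probability that no earlier rule matched.
    for probability in rule_probabilities(endpoints):
        shares.append(remaining * probability)
        remaining *= 1 - probability
    return shares


if __name__ == "__main__":
    print(rule_probabilities(3))
    print(effective_share(3))
```

With three endpoints the rules match with probability 1/3, 1/2, and 1, yet each Pod receives one third of new connections. The selection is per connection, so long-lived connections can still leave backends unevenly loaded.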
Code Example
# Check node status and conditions
kubectl get nodes -o wide
kubectl describe node worker-node-01
# Look for: Conditions section (MemoryPressure, DiskPressure, PIDPressure)
# View kubelet logs on the node (SSH required)
sudo journalctl -u kubelet --since "10 minutes ago" | tail -50
# Check kubelet's view of Pods on this node
kubectl get pods --field-selector spec.nodeName=worker-node-01 -A
# Verify container runtime is healthy
sudo crictl info # Runtime status
sudo crictl ps # Running containers
sudo crictl pods # Running Pod sandboxes
# Check kube-proxy is programming iptables rules
sudo iptables -t nat -L KUBE-SERVICES | head -20
# View kube-proxy mode and configuration
kubectl get configmap kube-proxy -n kube-system -o yaml
# Check that the API Server can reach the kubelet (healthz proxied through the API Server)
kubectl get --raw='/api/v1/nodes/worker-node-01/proxy/healthz'
# Debug networking by checking kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy \
  --tail=30
# Check node resource capacity vs allocatable
kubectl describe node worker-node-01 | grep -A5 'Capacity\|Allocatable'
# Capacity:
#   cpu: 8
#   memory: 32Gi
# Allocatable: ← what's available for Pods (after system reserved)
#   cpu: 7600m
#   memory: 30Gi
◈ Architecture Diagram
┌────────────── Worker Node ──────────────────────────────┐
│                                                         │
│  ┌─────────────────────────────────────────────────┐    │
│  │                     kubelet                     │    │
│  │  • Registers node with API Server               │    │
│  │  • Watches for Pod assignments                  │    │
│  │  • Executes liveness/readiness probes           │    │
│  │  • Reports node status every 10s                │    │
│  └────────┬──────────────────────┬─────────────────┘    │
│           │ CRI (gRPC)           │ HTTPS                │
│           ▼                      │                      │
│  ┌──────────────────┐            │                      │
│  │ Container Runtime│            │                      │
│  │  (containerd)    │            │                      │
│  │                  │            │                      │
│  │ ┌────┐ ┌────┐    │            │                      │
│  │ │Pod │ │Pod │    │            │                      │
│  │ │ A  │ │ B  │    │            │                      │
│  │ └────┘ └────┘    │            │                      │
│  └──────────────────┘            │                      │
│                                  │                      │
│  ┌──────────────────┐            │                      │
│  │    kube-proxy    │            │                      │
│  │                  │            │                      │
│  │  iptables/IPVS   │            │                      │
│  │  rules for       │            │                      │
│  │  Service routing │            │                      │
│  └────────┬─────────┘            │                      │
│           │                      │                      │
└───────────┼──────────────────────┼──────────────────────┘
            │                      │
            │ watch Services       │ watch PodSpecs
            │ & Endpoints          │ report status
            ▼                      ▼
      ┌──────────────────────────────────┐
      │          kube-apiserver          │
      │         (Control Plane)          │
      └──────────────────────────────────┘
Quick Answer
During a rolling update, the Deployment creates a new ReplicaSet and gradually scales it up while scaling the old one down, controlled by maxSurge and maxUnavailable. Pods must pass readiness probes before old Pods are terminated. To rollback, run kubectl rollout undo which scales the previous ReplicaSet back up.
Detailed Answer
Think of a rolling update like replacing light bulbs in a theater while the show is still running. You can't turn off all the lights at once (that's a Recreate strategy — total blackout). Instead, you unscrew one old bulb, screw in a new one, verify it lights up (readiness probe), and only then move to the next one. At every moment during the replacement, the audience still has enough light to see the show. If a new bulb turns out to be defective (bad deployment), you stop the replacement and screw the old bulbs back in (rollback). When you update a Deployment (change the container image, environment variables, or any field in the Pod template), the Deployment controller creates a brand new ReplicaSet with the updated Pod template. It then orchestrates a carefully choreographed transition between the old and new ReplicaSets. The two parameters that control this dance are maxSurge (how many extra Pods above the desired count are allowed during the update — controls speed) and maxUnavailable (how many Pods can be offline simultaneously — controls safety margin). With replicas=4, maxSurge=1, maxUnavailable=1: Kubernetes can have up to 5 Pods running (4+1 surge) and at minimum 3 Pods available (4-1 unavailable) at any point. The update proceeds in cycles. In each cycle: (1) the new ReplicaSet is scaled up by maxSurge number of Pods, (2) Kubernetes waits for those new Pods to pass their readiness probe (this is the critical gate — without a readiness probe, Kubernetes considers Pods ready immediately, potentially sending traffic to uninitialized applications), (3) once new Pods are Ready, the old ReplicaSet is scaled down by maxUnavailable Pods. This cycle repeats until all old Pods are terminated and all new Pods are running. The entire process is recorded as a new revision in the Deployment's rollout history. Rollback is elegantly simple because Kubernetes keeps old ReplicaSets around (scaled to 0 replicas). 
When you run `kubectl rollout undo deployment/payments-api`, the Deployment controller doesn't create anything new. It simply scales up the previous ReplicaSet (which still has the old Pod template with the known-good image) and scales down the current one. This means rollback is typically faster than the original deployment because the old image may still be cached on nodes (no pull needed) and the old ReplicaSet already exists (no creation delay). You can also roll back to a specific revision with `kubectl rollout undo --to-revision=3`. By default, Kubernetes keeps the last 10 old ReplicaSets (controlled by revisionHistoryLimit). Setting this too low (like 1) means you can only undo one step back. Setting it too high leaves stale ReplicaSet objects cluttering etcd and the API Server's caches. For most teams, 5 revisions is the sweet spot. The most critical production gotcha: if you don't have a readiness probe, Kubernetes considers Pods ready the instant the container process starts, even if your Spring Boot app needs 45 seconds to initialize. During a rolling update, traffic gets routed to these half-started Pods, causing 500 errors for real users. The second gotcha: if your readiness probe never passes (bug in health endpoint, wrong port, misconfigured path), the rollout stalls: new Pods stay NotReady, old Pods never get terminated, and `kubectl rollout status` keeps reporting that it is waiting for the rollout to finish. progressDeadlineSeconds (default 600s) bounds that wait by setting the Deployment's Progressing condition to False with reason ProgressDeadlineExceeded, but Kubernetes does not roll back automatically; the pipeline or an operator still has to run the undo.
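The maxSurge and maxUnavailable arithmetic can be checked with a short helper. This is a sketch with hypothetical function names. It uses the documented rounding rules (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down) and the 25% defaults, and it simply rejects the case where both values resolve to zero.

```python
from typing import Dict, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string like "25%" into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer ceiling or floor of scaled / 100, avoiding float rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(
    replicas: int,
    max_surge: IntOrPercent = "25%",
    max_unavailable: IntOrPercent = "25%",
) -> Dict[str, int]:
    """Upper bound on total Pods and lower bound on available Pods during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    if surge == 0 and unavailable == 0:
        raise ValueError("the rollout could never make progress")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }


if __name__ == "__main__":
    print(rollout_bounds(4, max_surge=1, max_unavailable=1))
    print(rollout_bounds(10))
```

The first call reproduces the worked example above (at most 5 Pods, at least 3 available). The second shows how the 25% defaults round in opposite directions for 10 replicas.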
Code Example
# Current state: payments-api running v2.1.0 with 4 replicas
kubectl get deployment payments-api
# NAME READY UP-TO-DATE AVAILABLE AGE
# payments-api 4/4 4 4 30d
# Trigger rolling update to v2.2.0
kubectl set image deployment/payments-api \
api=registry.company.io/payments-api:2.2.0
# Watch the rollout in real time
kubectl rollout status deployment/payments-api
# Waiting for deployment "payments-api" rollout to finish:
# 1 out of 4 new replicas have been updated...
# 2 out of 4 new replicas have been updated...
# 4 out of 4 new replicas have been updated...
# deployment "payments-api" successfully rolled out
# See both ReplicaSets during the transition
kubectl get replicasets -l app=payments-api
# NAME DESIRED CURRENT READY
# payments-api-7d5f8b6c4 4 4 4 ← new (v2.2.0)
# payments-api-6b4a9c1e2 0 0 0 ← old (v2.1.0)
# v2.2.0 has a bug! Rollback immediately
kubectl rollout undo deployment/payments-api
# deployment.apps/payments-api rolled back
# Confirm rollback succeeded
kubectl rollout status deployment/payments-api
kubectl describe deployment payments-api | grep Image
# Image: registry.company.io/payments-api:2.1.0 ← back to previous
# View full rollout history (the undo re-applied revision 1 as revision 3, so revision 1 no longer appears)
kubectl rollout history deployment/payments-api
# REVISION  CHANGE-CAUSE
# 2         image update to v2.2.0
# 3         initial deploy
# Move to a specific revision that is still in the history
kubectl rollout undo deployment/payments-api --to-revision=2
# Deployment spec for controlled rollouts
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # 1 extra Pod allowed during update
      maxUnavailable: 0 # Never drop below desired count (safest)
  progressDeadlineSeconds: 300 # Fail rollout if stuck for 5 min
  minReadySeconds: 30 # Wait 30s after Ready before continuing
◈ Architecture Diagram
┌─── Rolling Update Timeline (replicas=4, maxSurge=1) ───┐
│                                                        │
│ Step 0: Stable on v2.1.0                               │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│ │v2.1.0││v2.1.0││v2.1.0││v2.1.0│ Available: 4          │
│ └──────┘└──────┘└──────┘└──────┘                       │
│                                                        │
│ Step 1: Create 1 new Pod (surge)                       │
│ ┌──────┐┌──────┐┌──────┐┌──────┐┌──────┐               │
│ │v2.1.0││v2.1.0││v2.1.0││v2.1.0││v2.2.0│ Total: 5      │
│ └──────┘└──────┘└──────┘└──────┘└──┬───┘               │
│                                    │                   │
│                        wait for readiness probe        │
│                                    ▼                   │
│ Step 2: New Pod Ready → terminate 1 old                │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│ │v2.1.0││v2.1.0││v2.1.0││v2.2.0│ Available: 4          │
│ └──────┘└──────┘└──────┘└──────┘                       │
│                                                        │
│ ... repeat until ...                                   │
│                                                        │
│ Step Final: All replaced                               │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│ │v2.2.0││v2.2.0││v2.2.0││v2.2.0│ Available: 4          │
│ └──────┘└──────┘└──────┘└──────┘                       │
│                                                        │
│ Rollback: scale old RS back up (instant, no new pull)  │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                       │
│ │v2.1.0││v2.1.0││v2.1.0││v2.1.0│ ← old RS restored     │
│ └──────┘└──────┘└──────┘└──────┘                       │
└────────────────────────────────────────────────────────┘
Quick Answer
The control plane is the brain of the cluster: the API Server is the single point of communication, etcd stores all cluster state, the Scheduler assigns Pods to nodes, and the Controller Manager runs reconciliation loops that maintain desired state. On worker nodes, the kubelet manages Pods and kube-proxy handles networking.
Detailed Answer
Think of a Kubernetes cluster like an airport. The control plane is the airport operations center — the people and systems that coordinate everything. The API server is the main radio tower: every communication between pilots (kubectl), ground crew (kubelets), and air traffic control (controllers) goes through it. Nobody talks directly to anyone else. Etcd is the flight database — every flight plan, gate assignment, and schedule is recorded there, and if this database goes down, the airport can't function. The Scheduler is the gate assignment officer who decides which arriving plane goes to which gate based on gate size, availability, and terminal capacity. The Controller Manager is the operations team that constantly walks the airport comparing the schedule to reality: 'Gate 3 should have a plane — it doesn't — redirect one there.' The API Server (kube-apiserver) is the only component that talks to etcd. When you run `kubectl get pods`, kubectl sends an HTTPS request to the API server, which authenticates you, checks your RBAC permissions, retrieves the data from etcd, and returns it. When you create a Deployment, the API server validates the manifest, stores it in etcd, and notifies watching controllers via its built-in watch mechanism. Every component in the cluster — scheduler, controllers, kubelets — communicates exclusively through the API server. Etcd is a distributed key-value store that holds the entire state of the cluster: every Pod, Service, Secret, ConfigMap, and node registration. It uses the Raft consensus algorithm to maintain consistency across multiple replicas (production clusters run 3 or 5 etcd members). Etcd is the most critical component — if etcd data is lost and there's no backup, the cluster is unrecoverable. This is why etcd backup is a non-negotiable operational requirement. The Scheduler (kube-scheduler) watches for newly created Pods that have no node assigned. 
For each unscheduled Pod, it runs a two-phase process: filtering (which nodes CAN run this Pod — enough CPU? right architecture? matching tolerations?) and scoring (which node is BEST — least loaded? closest to existing Pods? has the image cached?). The winning node is written to the Pod's spec.nodeName field, and the kubelet on that node picks it up. The Controller Manager (kube-controller-manager) runs dozens of control loops, each responsible for one type of resource. The Deployment controller watches Deployments and manages ReplicaSets. The ReplicaSet controller watches ReplicaSets and manages Pods. The Node controller monitors node heartbeats and marks nodes as unhealthy. The Endpoints controller updates Service endpoints when Pods change. Each controller follows the same pattern: observe current state → compare to desired state → take action to converge. This reconciliation loop is the fundamental operating principle of Kubernetes. On each worker node, the kubelet is the agent that actually runs Pods. It watches the API server for Pods assigned to its node, pulls container images, starts containers via the container runtime (containerd), and reports status back. Kube-proxy runs on every node and maintains network rules (iptables or IPVS) that implement Service routing. A common misconception is that kube-proxy proxies traffic — in iptables mode, it doesn't. It just programs the kernel's packet filtering rules and gets out of the way.
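The observe, compare, act pattern described above can be sketched in a few lines of shell. This is only an illustration of the control-loop idea, not how any real controller is implemented: the `reconcile` function and the integer counters `want` and `have` are hypothetical stand-ins for API objects and watches.

```shell
# Hypothetical sketch of a reconciliation loop (not real controller code).
# reconcile DESIRED CURRENT -> prints the single action needed to converge.
reconcile() {
  desired=$1
  current=$2
  if [ "$current" -lt "$desired" ]; then
    echo "create"
  elif [ "$current" -gt "$desired" ]; then
    echo "delete"
  else
    echo "noop"
  fi
}

# Simulate a ReplicaSet that wants 4 Pods but currently has 1.
want=4
have=1
while :; do
  action=$(reconcile "$want" "$have")       # observe + compare
  echo "desired=$want current=$have action=$action"
  case $action in
    create) have=$((have + 1)) ;;           # act: start one more Pod
    delete) have=$((have - 1)) ;;           # act: remove one surplus Pod
    noop)   break ;;                        # converged: nothing to do
  esac
done
```

The loop never stores a plan; it recomputes the next action from the current state on every pass, which is why a controller can crash and restart without losing its place.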
Code Example
# Check control plane component health
# (componentstatuses has been deprecated since Kubernetes v1.19; treat its output as a hint only)
kubectl get componentstatuses
# View all control plane Pods (they run as static Pods on master nodes)
kubectl get pods -n kube-system
# Check API server endpoint
kubectl cluster-info
# Kubernetes control plane is running at https://10.0.0.1:6443
# View etcd members (if you have access to the master node)
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Take an etcd backup (CRITICAL for disaster recovery)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Check node status and kubelet version
kubectl get nodes -o wide
# View kubelet logs on a node (SSH required)
journalctl -u kubelet -f
# Check kube-proxy mode (iptables vs IPVS)
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
◈ Architecture Diagram
┌──────────── Control Plane ──────────────┐
│                                         │
│  ┌──────────┐   watches     ┌────────┐  │
│  │Scheduler │◄─────────────►│  API   │  │
│  │          │               │ Server │  │
│  └──────────┘    ┌─────────►│        │  │
│                  │          │ only   │  │
│  ┌──────────┐    │          │component│ │
│  │Controller│────┘          │that    │  │
│  │ Manager  │   watches     │talks to │ │
│  └──────────┘               │ etcd   │  │
│                             └───┬────┘  │
│                                 │       │
│                           ┌─────▼─────┐ │
│                           │   etcd    │ │
│                           │ (cluster  │ │
│                           │  state)   │ │
│                           └───────────┘ │
└─────────────────────────────────────────┘
                   │
              API server
            watches/updates
                   │
      ┌────────────┼────────────────┐
      │            │                │
      ▼            ▼                ▼
┌──────────┐┌──────────┐      ┌──────────┐
│  Node 1  ││  Node 2  │      │  Node 3  │
│          ││          │      │          │
│ kubelet  ││ kubelet  │      │ kubelet  │
│ kube-    ││ kube-    │      │ kube-    │
│ proxy    ││ proxy    │      │ proxy    │
│          ││          │      │          │
│ Pod Pod  ││ Pod Pod  │      │ Pod Pod  │
└──────────┘└──────────┘      └──────────┘
Quick Answer
A game day is a planned resilience exercise where teams deliberately inject failures into systems while observing how services, monitoring, and people respond. It validates that SLOs are maintained under stress, runbooks are effective, and on-call engineers can diagnose and recover from realistic failure scenarios.
Detailed Answer
Think of a game day like a military field exercise. Instead of sending troops into actual combat to find out if their training works, you create a realistic simulation — complete with communications failures, supply chain disruptions, and adversary movements — in a controlled environment where nobody actually gets hurt. The goal isn't to prove everything works perfectly; it's to find the gaps in training, communication, and procedures before a real crisis exposes them. The most valuable game days are the ones where things go wrong, because each failure becomes a training opportunity. A game day for a banking payments platform begins weeks before the actual event with planning and scoping. The game day lead defines 3-5 failure scenarios relevant to the platform's risk profile: primary database failover, Kafka broker loss during peak transaction volume, AZ failure affecting half the payments-api pods, certificate expiry on a critical internal service, and a sudden traffic spike of 10x normal volume. Each scenario has a defined hypothesis: 'When we kill 2 of 3 Kafka brokers, consumer lag will spike but recover within 5 minutes, and no transactions will be lost.' The scope explicitly lists what will and will not be tested — you never inject chaos into systems that process live customer transactions without extensive safeguards. During the game day, the facilitator runs scenarios one at a time while observers monitor dashboards, alerting, and team communication channels. The key participants are: the facilitator who injects failures and controls the timeline, the development team who respond to incidents as if they were real, the SRE team who monitor SLIs and system health, and observers who document everything — how long it took to detect the issue, what runbook was followed, what communication happened, and where confusion arose. 
Each scenario follows a structured flow: inject the failure, start a timer, observe whether alerts fire within the expected window, watch the team's response, measure time to detection and time to recovery, and document all observations. The most critical aspect of a banking game day is defining clear safety rails. You need a kill switch — a way to immediately reverse any injected failure if it threatens to cause real customer impact. The game day should run in a pre-production environment that mirrors production topology, or in production during a low-traffic maintenance window with explicit leadership approval. For PCI-DSS and SOC2 compliance, every game day must be formally documented with approvals, scope definitions, results, and remediation actions. Some regulators specifically require evidence of resilience testing, making game days not just a best practice but a compliance requirement. After all scenarios complete, the team conducts an immediate retrospective. This is where the real value emerges. Common findings include: alerts didn't fire because the threshold was wrong, the runbook had an outdated command that no longer works, the on-call engineer didn't know how to access the Kafka admin tools, DNS failover took 12 minutes instead of the expected 2 minutes, and the team communicated over three different Slack channels causing information fragmentation. Each finding becomes a tracked action item with an owner and deadline. The gotcha that makes game days fail: treating them as a one-time event rather than a regular practice. The first game day is always rough — it exposes dozens of issues and the team feels demoralized. The value comes from running them quarterly, tracking improvement on previously identified issues, and gradually increasing the complexity and realism of scenarios. 
A team that runs game days quarterly will eventually handle real incidents with the calm confidence of a practiced response, while a team that ran one game day 18 months ago will fumble through the next real outage.
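One way to make the "inject, time, measure" flow concrete is to score each scenario against its hypothesis. The sketch below is illustrative only: `evaluate` is a hypothetical helper, timestamps are plain seconds since injection, and the limits (alert within 120 s, recovery within 300 s) are assumed values a team would replace with its own SLO-derived targets.

```shell
# Hypothetical game day scorer (illustrative; all limits are assumptions).
# evaluate NAME INJECTED DETECTED RECOVERED MAX_TTD MAX_TTR   (all values in seconds)
evaluate() {
  ttd=$(( $3 - $2 ))   # time to detect: failure injected -> first alert fired
  ttr=$(( $4 - $2 ))   # time to recover: failure injected -> service healthy again
  if [ "$ttd" -le "$5" ] && [ "$ttr" -le "$6" ]; then
    verdict=PASS
  else
    verdict=FAIL
  fi
  echo "$1 ttd=${ttd}s ttr=${ttr}s $verdict"
}

# Pod failure: alert after 45 s, recovered after 90 s.
evaluate pod-failure 0 45 90 120 300
# Kafka broker loss: alert only after 8 minutes, so the hypothesis fails.
evaluate kafka-broker-loss 0 480 560 120 300
```

A FAIL line is not a bad outcome for the exercise; it is exactly the kind of finding that becomes a tracked action item in the retrospective.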
Code Example
# Game Day Plan: Banking Payments Platform
# Date: 2026-Q3 Game Day
# Duration: 4 hours (10 AM - 2 PM, business hours)
# Pre-game: Verify monitoring baseline
# Record steady-state metrics for comparison
kubectl exec -n monitoring prometheus-0 -- \
promtool query instant http://localhost:9090 \
'sum(rate(http_requests_total{service="payments-api"}[5m]))'
# Expected: ~500 req/s baseline
# ──── Scenario 1: Pod Failure (30 min) ────
# Hypothesis: Killing 50% of payments-api pods maintains p99 < 200ms
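# Inject: Kill 3 of 6 payments-api pods
# (pick 3 pod names first; piping `kubectl delete` into head only trims the output
# and would still delete every matching pod)
kubectl get pods -n payments -l app=payments-api \
  --field-selector status.phase=Running -o name | head -3 \
  | xargs kubectl delete -n payments --grace-period=0 --force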
# Observe: Do alerts fire within 2 minutes?
# Observe: Do replacement pods start within 30 seconds?
# Observe: Does p99 latency stay below SLO threshold?
kubectl get pods -n payments -l app=payments-api -w
# Measure recovery
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -20
# ──── Scenario 2: Kafka Broker Failure (45 min) ────
# Hypothesis: Losing 1 of 3 Kafka brokers doesn't lose transactions
# Inject: Cordon and drain the node running kafka-1
kubectl cordon worker-node-03
kubectl drain worker-node-03 --delete-emptydir-data --force --ignore-daemonsets
# Monitor consumer lag — should spike then recover
kubectl exec -n kafka kafka-0 -- \
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--group payments-processor --describe
# Verify no messages lost — check dead letter queue
kubectl exec -n kafka kafka-0 -- \
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
--topic payments.dlq --from-beginning --timeout-ms 5000
# ──── Scenario 3: Network Latency Injection (30 min) ────
# Hypothesis: 500ms latency to fraud-detector triggers circuit breaker
# Using LitmusChaos to inject network latency
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-network-latency
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=fraud-detector
    appkind: deployment
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: "500" # 500ms added latency
            - name: TOTAL_CHAOS_DURATION
              value: "300" # 5 minutes
            - name: DESTINATION_IPS
              value: "10.96.0.0/12" # Only internal traffic
# ──── Post-Game Day Retrospective ────
# Document findings in structured format
# Finding 1: Alert for Kafka consumer lag fired after 8 minutes
# (expected: 2 minutes). Action: reduce evaluation window
# Finding 2: Runbook for broker recovery referenced deprecated CLI
# Action: update runbook with KRaft-based commands
# Finding 3: Circuit breaker on fraud-detector opened at 1s timeout
# but SLO requires 500ms. Action: tune threshold
# Track action items
# kubectl create configmap gameday-q3-actions -n platform \
# --from-literal=finding1="Reduce Kafka lag alert window to 2m" \
# --from-literal=finding2="Update broker recovery runbook" \
# --from-literal=finding3="Tune circuit breaker to 500ms"
◈ Architecture Diagram
┌──────────── Game Day Timeline ──────────────────────────┐
│                                                         │
│   ┌─── Preparation (2-3 weeks before) ──────────────┐   │
│   │ • Define 3-5 failure scenarios                  │   │
│   │ • Write hypotheses tied to SLOs                 │   │
│   │ • Get leadership approval                       │   │
│   │ • Notify on-call and dependent teams            │   │
│   │ • Verify kill switch / rollback procedures      │   │
│   └──────────────────────┬──────────────────────────┘   │
│                          ▼                              │
│   ┌─── Execution (4 hours) ─────────────────────────┐   │
│   │                                                 │   │
│   │ Roles:                                          │   │
│   │ ┌───────────┐ ┌───────────┐ ┌────────────┐      │   │
│   │ │Facilitator│ │Responders │ │ Observers  │      │   │
│   │ │(injects   │ │(dev + SRE │ │(document   │      │   │
│   │ │ failures) │ │ team)     │ │ everything)│      │   │
│   │ └─────┬─────┘ └─────┬─────┘ └─────┬──────┘      │   │
│   │       │             │             │             │   │
│   │       ▼             ▼             ▼             │   │
│   │ Scenario 1 → Detect → Respond → Recover → Log   │   │
│   │ Scenario 2 → Detect → Respond → Recover → Log   │   │
│   │ Scenario 3 → Detect → Respond → Recover → Log   │   │
│   │                                                 │   │
│   │ Metrics tracked per scenario:                   │   │
│   │ • Time to detect (TTD)                          │   │
│   │ • Time to mitigate (TTM)                        │   │
│   │ • SLI impact during failure                     │   │
│   │ • Alert accuracy (fired? correct severity?)     │   │
│   └──────────────────────┬──────────────────────────┘   │
│                          ▼                              │
│   ┌─── Retrospective (immediately after) ───────────┐   │
│   │ • Review each scenario: hypothesis vs reality   │   │
│   │ • Document findings and surprises               │   │
│   │ • Create action items with owners + deadlines   │   │
│   │ • Schedule follow-up game day (quarterly)       │   │
│   └─────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
Quick Answer
Rolling update gradually replaces old pods with new ones (Kubernetes native). Blue-green maintains two full environments and switches traffic instantly. Canary sends a small percentage of traffic to the new version first, then gradually increases. Each trades off speed, resource cost, and risk differently.
Detailed Answer
Think of upgrading a restaurant's menu. A rolling update is like changing one table's menu at a time until all tables have the new menu — gradual with no extra cost but hard to undo if customers complain. Blue-green is like printing a complete new set of menus and swapping them all at once while keeping the old ones ready — instant switchback but double the printing cost. Canary is like giving the new menu to one table first, watching their reaction, and only rolling it out if they order successfully — safest but slowest. The rolling update is Kubernetes' native deployment strategy. When you update a Deployment's pod template, the controller creates a new ReplicaSet and gradually scales it up while scaling the old one down. The maxSurge parameter controls how many extra pods can exist during the transition (e.g., 25% means 12 becomes 15 temporarily), and maxUnavailable controls how many pods can be missing (e.g., 25% means at least 9 of 12 must always be ready). Traffic flows to both old and new pods simultaneously during the rollout. If the new pods fail readiness probes, the rollout pauses. Rollback is a single command: kubectl rollout undo. Blue-green deployment runs two complete environments: blue (current live version) and green (new version). Both are fully deployed and warmed up before any traffic switch. Once the green environment passes health checks and smoke tests, the Service selector or Ingress routing is updated to point all traffic to green. If problems emerge, switching back to blue is instantaneous because it is still running. The cost is double the resources during the transition. In Kubernetes, this is implemented by having two Deployments (payments-api-blue and payments-api-green) and changing the Service selector or using Ingress annotations to switch. Canary deployment sends a small fraction of production traffic to the new version (typically 1-5% initially) while the majority continues on the stable version. 
If metrics (error rate, latency, business KPIs) look healthy over a defined period, traffic is gradually shifted: 5%, 25%, 50%, 100%. If any metric degrades, traffic reverts to the stable version automatically. This provides the highest safety because real production traffic validates the new version with minimal blast radius. In Kubernetes, canary deployments are typically managed by tools like Argo Rollouts, Flagger, or Istio traffic splitting rather than native Kubernetes primitives. In production, the choice depends on your risk tolerance, infrastructure budget, and rollback speed requirements. Rolling updates are free (no extra resources) and native, but you cannot easily test the new version in isolation before it receives traffic. Blue-green is excellent for major version changes that might need instant rollback, but it doubles resource cost. Canary is ideal for high-traffic services where even a 1% error rate affects thousands of users and you need metric-driven progressive delivery. Many teams use rolling updates for internal services and canary for user-facing APIs. The non-obvious gotcha is database compatibility. All three strategies may have old and new application versions running simultaneously (during the rollout window). If the new version changes the database schema, both versions must work with both the old and new schema. This requires backward-compatible migrations: add columns as nullable, never rename in place, deploy the new schema before the new code, and remove old columns only after the old code is fully gone. Teams that forget this have both versions hitting the database and one version crashes on schema mismatch.
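The canary progression described above (shift traffic in steps, revert as soon as a metric degrades) can be sketched as a small gate function. This is a hypothetical illustration, not how Argo Rollouts or Flagger are implemented: the `gate` and `run_canary` names, the sample format, and the thresholds (error rate under 1%, p99 under 500 ms) are assumptions chosen for the example.

```shell
# Hypothetical canary gate; thresholds are illustrative, not any tool's defaults.
MAX_ERROR_BP=100   # error-rate ceiling in basis points (100 bp = 1%)
MAX_P99_MS=500     # p99 latency ceiling in milliseconds

# gate ERROR_BP P99_MS -> prints "promote" or "rollback"
gate() {
  if [ "$1" -lt "$MAX_ERROR_BP" ] && [ "$2" -lt "$MAX_P99_MS" ]; then
    echo "promote"
  else
    echo "rollback"
  fi
}

# run_canary SAMPLE...  where each SAMPLE is "ERROR_BP:P99_MS" observed at one step.
# Walks the 5% -> 25% -> 50% -> 100% schedule and stops at the first failed gate.
run_canary() {
  for weight in 5 25 50 100; do
    [ "$#" -gt 0 ] || { echo "weight=${weight}% no metrics, holding"; return 2; }
    sample=$1
    shift
    verdict=$(gate "${sample%%:*}" "${sample##*:}")
    echo "weight=${weight}% error_bp=${sample%%:*} p99_ms=${sample##*:} verdict=$verdict"
    [ "$verdict" = "promote" ] || return 1
  done
}

# Healthy release: every step passes and traffic reaches 100%.
run_canary 20:180 35:210 40:230 30:220
# Degraded release: the 25% step breaches both limits, so the walk stops there.
run_canary 20:180 250:640 || echo "traffic shifted back to stable"
```

Basis points are used only because POSIX shell arithmetic is integer-only; a real analysis would query the metrics backend and compare the canary against the stable baseline rather than against fixed numbers.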
Code Example
# Check current rolling update strategy settings
kubectl get deployment payments-api -n payments -o jsonpath='{.spec.strategy}'
# Watch a rolling update in progress — shows old and new ReplicaSets
kubectl rollout status deployment/payments-api -n payments
# View both ReplicaSets during a rollout (old scaling down, new scaling up)
kubectl get replicaset -n payments -l app=payments-api
# Rollback a failed rolling update to the previous version
kubectl rollout undo deployment/payments-api -n payments
# Blue-green: switch traffic by updating the Service selector
kubectl patch svc payments-api -n payments -p '{"spec":{"selector":{"version":"green"}}}'
# Blue-green: rollback by switching back to blue
kubectl patch svc payments-api -n payments -p '{"spec":{"selector":{"version":"blue"}}}'
# Canary with Argo Rollouts: check canary step progress
kubectl argo rollouts get rollout payments-api -n payments
# Canary: manually promote after verifying metrics
kubectl argo rollouts promote payments-api -n payments
# Canary: abort if metrics degrade and roll back to stable
kubectl argo rollouts abort payments-api -n payments
◈ Architecture Diagram
┌─────────────────────────────────────────────────┐
│  Rolling Update (native)                        │
│  ┌───┐ ┌───┐ ┌───┐ ┌───┐                        │
│  │v1 │ │v1 │ │v2 │ │v2 │  gradual replacement   │
│  └───┘ └───┘ └───┘ └───┘                        │
├─────────────────────────────────────────────────┤
│  Blue-Green                                     │
│  ┌───┐ ┌───┐ ┌───┐ ┌───┐  blue (live)           │
│  │v1 │ │v1 │ │v1 │ │v1 │                        │
│  └───┘ └───┘ └───┘ └───┘                        │
│  ┌───┐ ┌───┐ ┌───┐ ┌───┐  green (standby)       │
│  │v2 │ │v2 │ │v2 │ │v2 │  ← switch traffic      │
│  └───┘ └───┘ └───┘ └───┘                        │
├─────────────────────────────────────────────────┤
│  Canary                                         │
│  ┌───┐ ┌───┐ ┌───┐  95% traffic                 │
│  │v1 │ │v1 │ │v1 │                              │
│  └───┘ └───┘ └───┘                              │
│  ┌───┐  5% traffic                              │
│  │v2 │  ← monitor metrics before promoting      │
│  └───┘                                          │
└─────────────────────────────────────────────────┘
Quick Answer
A rolling update gradually replaces old pods with new ones by creating new ReplicaSet pods while terminating old ones. maxSurge controls how many extra pods can exist above the desired count during the update, while maxUnavailable controls how many pods can be down simultaneously.
Detailed Answer
Think of maxSurge and maxUnavailable like staffing rules during a shift change at a hospital. maxSurge is how many extra nurses you can have on the floor simultaneously (overtime budget), and maxUnavailable is how many nurse positions can be empty at once (minimum staffing). A hospital might say 'we can have 1 extra nurse on overtime (maxSurge=1) but no positions can be vacant (maxUnavailable=0)', meaning the new shift arrives before the old shift leaves. When you update a Deployment's pod template (new image, env vars, etc.), the Deployment controller creates a new ReplicaSet with the updated spec and begins scaling it up while scaling the old ReplicaSet down. The speed and safety of this transition are controlled by the `strategy.rollingUpdate.maxSurge` and `strategy.rollingUpdate.maxUnavailable` parameters. Both can be absolute numbers or percentages of the desired replica count (a maxSurge percentage is rounded up, a maxUnavailable percentage is rounded down). Here is the sequence with replicas=4, maxSurge=1, maxUnavailable=1: The controller can have at most 5 total pods (4 + maxSurge=1) and must keep at least 3 available pods (4 - maxUnavailable=1). It starts by creating 1 new pod (now 5 total: 4 old + 1 new). Because 1 pod is allowed to be unavailable, it can terminate 1 old pod without waiting for the new pod to become Ready (3 old pods stay available, which satisfies the minimum), and that frees room under the 5-pod ceiling for another new pod. From then on, each time a new pod becomes Ready the controller terminates another old pod and creates another new one, repeating until all 4 pods are running the new version. The entire process respects both constraints at every step. In production, the two most common configurations are: (1) maxSurge=25%, maxUnavailable=25% (the default) which balances speed and safety, allowing the update to happen in about 4 rounds for most replica counts; (2) maxSurge=1, maxUnavailable=0 which is the safest option because no old pod is terminated until its replacement is proven Ready. 
The second option means your cluster temporarily runs more pods than desired, so you need spare capacity, but it guarantees zero capacity reduction during the update. The critical gotcha: maxUnavailable counts pods that are not Ready, not pods that are Terminating. If a new pod fails its readiness probe, it counts as unavailable, which may block the rollout from proceeding. A common failure mode is a broken readiness probe on the new version that never passes: the rollout creates maxSurge new pods, they all fail readiness, and the rollout stalls with both old and new pods running but no progress being made. This is when `kubectl rollout undo` becomes necessary.
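The two limits are easy to work out by hand, but a small helper makes the rounding rule explicit. The sketch below is illustrative: `resolve` and `rollout_bounds` are hypothetical names, and the only Kubernetes behavior relied on is that a maxSurge percentage rounds up while a maxUnavailable percentage rounds down.

```shell
# Hypothetical helper (not part of kubectl) that derives the rollout limits.
# resolve VALUE REPLICAS MODE: VALUE is a count ("1") or a percentage ("25%"),
# MODE is "up" (maxSurge rounds up) or "down" (maxUnavailable rounds down).
resolve() {
  value=$1
  replicas=$2
  mode=$3
  case $value in
    *%)
      pct=${value%\%}                                # strip the trailing percent sign
      if [ "$mode" = "up" ]; then
        echo $(( (replicas * pct + 99) / 100 ))      # integer ceiling
      else
        echo $(( replicas * pct / 100 ))             # integer floor
      fi
      ;;
    *)
      echo "$value"                                  # absolute count, used as-is
      ;;
  esac
}

# rollout_bounds REPLICAS MAX_SURGE MAX_UNAVAILABLE
rollout_bounds() {
  surge=$(resolve "$2" "$1" up)
  unavailable=$(resolve "$3" "$1" down)
  echo "max_total=$(( $1 + surge )) min_available=$(( $1 - unavailable ))"
}

rollout_bounds 4 1 1        # the walkthrough above: at most 5 pods, at least 3 available
rollout_bounds 12 25% 25%   # the defaults on 12 replicas: at most 15, at least 9
rollout_bounds 6 1 0        # safest setting: one extra pod, never below 6
```

The asymmetric rounding matters for small replica counts: with 10 replicas and 25% for both fields, up to 3 extra pods may be created but only 2 may be unavailable.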
Code Example
# Deployment with explicit rolling update strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2 # Allow 2 extra pods (total can be 8)
      maxUnavailable: 1 # At most 1 pod can be unavailable (min 5 available)
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: checkout:3.4.1 # Change this to trigger rolling update
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
# Trigger a rolling update by changing the image
kubectl set image deploy/checkout-service checkout=checkout:3.5.0 -n production
# Watch the rollout progress
kubectl rollout status deploy/checkout-service -n production
# Waiting for deployment "checkout-service" rollout to finish:
# 2 out of 6 new replicas have been updated...
# 4 out of 6 new replicas have been updated...
# 6 out of 6 new replicas have been updated...
# deployment "checkout-service" successfully rolled out
# See both ReplicaSets during rollout
kubectl get rs -n production -l app=checkout
# NAME DESIRED CURRENT READY
# checkout-service-6d4f7b 6 6 6 ← new (complete)
# checkout-service-5c3e8a 0 0 0 ← old (scaled down)
# Rollback if the new version has issues
kubectl rollout undo deploy/checkout-service -n production
# Safe config: no capacity loss during update
# maxSurge: 1, maxUnavailable: 0
# Means: always at least 6 pods available, create 1 new before removing 1 old
◈ Architecture Diagram
Rolling Update with replicas=4, maxSurge=1, maxUnavailable=1:

Step 0 (before update):
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 ✓] [Pod4 ✓] = 4 Ready
  New RS: (empty)
  Total: 4 pods, 4 available

Step 1 (create new, terminate old):
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 ✓] [Pod4 terminating]
  New RS: [Pod5 ✓]
  Total: 5 pods, 4 available (within maxSurge=1, maxUnavail=1)

Step 2:
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 terminating]
  New RS: [Pod5 ✓] [Pod6 ✓]
  Total: 5 pods, 4 available

Step 3:
  Old RS: [Pod1 ✓] [Pod2 terminating]
  New RS: [Pod5 ✓] [Pod6 ✓] [Pod7 ✓]
  Total: 5 pods, 4 available

Step 4 (complete):
  Old RS: (empty)
  New RS: [Pod5 ✓] [Pod6 ✓] [Pod7 ✓] [Pod8 ✓] = 4 Ready
  Total: 4 pods, 4 available

Constraint at every step:
  max total pods = replicas + maxSurge = 4 + 1 = 5
  min available = replicas - maxUnavail = 4 - 1 = 3
Quick Answer
Three control plane nodes provide high availability through etcd's Raft consensus, which requires a majority quorum. With 3 members, quorum is 2 — so the cluster survives one node failure. With 2 members, losing one loses quorum.
Detailed Answer
Think of it like a committee that makes decisions by majority vote. If you have 3 committee members, you need 2 to agree (majority) to pass any decision. If one member is sick, the remaining 2 can still vote and make decisions. But if you only had 2 members and one got sick, you'd have 1 out of 2, which is not a majority, so no decisions can be made and everything stops. The reason is etcd, the distributed key-value store that holds all Kubernetes cluster state. Etcd uses the Raft consensus algorithm, which requires a strict majority of members to agree on any write. This majority is called quorum. For 3 members, quorum = 2 (you can lose 1). For 5 members, quorum = 3 (you can lose 2). For 2 members, quorum = 2 (you can lose 0, which makes 2 members WORSE than 1 for availability because there are now two nodes whose failure stops the cluster). When one control plane node fails in a 3-node setup, here's what happens: etcd continues operating because 2 of 3 members still form quorum. The API server pods on the remaining 2 nodes handle all requests (the load balancer in front of them routes around the failed node). The scheduler and controller manager use leader election: one instance is active and the others are on standby. If the active leader was on the failed node, a standby takes over once the old leader's lease expires, usually within seconds to a few tens of seconds depending on the lease settings. From the user's perspective, kubectl commands might have a brief hiccup while connections and leadership move over, but the cluster continues operating normally. However, losing TWO of three control plane nodes is catastrophic: etcd loses quorum (only 1 of 3 remaining), and all writes fail. The API server may still answer some reads from its cache, but reads that require quorum fail and it cannot process any creates, updates, or deletes. Existing workloads on worker nodes keep running (the kubelet continues managing pods independently), but you cannot deploy anything new, scale, or replace pods lost with a failed node. The cluster is in an effectively read-only degraded state until quorum is restored. Why not 5 or 7 control plane nodes? Each etcd write must be acknowledged by a majority before it's committed. 
More members means more network round trips and higher write latency. For most clusters, the trade-off of 3 nodes (survive 1 failure, fast writes) is optimal. Large enterprise clusters sometimes use 5 nodes for extra resilience, but 7+ is almost never justified because the write performance penalty outweighs the marginal availability gain.
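The quorum arithmetic above is simple enough to check with integer math. The function names in this sketch are hypothetical; the formula itself (a strict majority, floor(n/2) + 1) is the standard Raft majority rule.

```shell
# quorum MEMBERS -> smallest strict majority of MEMBERS
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# tolerated_failures MEMBERS -> how many members can be lost while keeping quorum
tolerated_failures() {
  echo $(( $1 - $(quorum "$1") ))
}

for members in 1 2 3 4 5 7; do
  echo "members=$members quorum=$(quorum "$members") tolerated_failures=$(tolerated_failures "$members")"
done
```

The output shows why odd sizes are preferred: 4 members tolerate the same single failure as 3, while adding one more node that every write may have to wait on.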
Code Example
# Check etcd member health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://master-0:2379,https://master-1:2379,https://master-2:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# master-0:2379 is healthy: committed index = 458923
# master-1:2379 is healthy: committed index = 458923
# master-2:2379 is healthy: committed index = 458923
# Check etcd member list (pass the same --endpoints and TLS flags as above)
ETCDCTL_API=3 etcdctl member list --write-out=table
# Check which controller-manager and scheduler instance is the leader
# (current Kubernetes releases record leadership in Lease objects rather than Endpoints)
kubectl get lease kube-scheduler -n kube-system -o yaml
kubectl get lease kube-controller-manager -n kube-system -o yaml
# Check control plane node status
kubectl get nodes -l node-role.kubernetes.io/control-plane
# NAME      STATUS    ROLES          AGE
# master-0  Ready     control-plane  365d
# master-1  Ready     control-plane  365d
# master-2  NotReady  control-plane  365d  ← failed
# Backup etcd (CRITICAL: do this before any maintenance)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
◈ Architecture Diagram
3-Node Control Plane:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│          │ │          │ │          │
│ API Srvr │ │ API Srvr │ │ API Srvr │
│ etcd     │ │ etcd     │ │ etcd     │
│ Sched    │ │ Sched    │ │ Sched    │
│ CtrlMgr  │ │ CtrlMgr  │ │ CtrlMgr  │
└──────────┘ └──────────┘ └──────────┘
  leader ★     standby      standby

Quorum math:
  3 members → quorum = 2 → tolerate 1 failure ✓
  5 members → quorum = 3 → tolerate 2 failures ✓
  2 members → quorum = 2 → tolerate 0 failures ✗

Master-2 fails:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│ etcd ✓   │ │ etcd ✓   │ │ etcd ✗   │
│ leader ★ │ │ standby  │ │ DOWN     │
└──────────┘ └──────────┘ └──────────┘
2/3 = quorum ✓ → cluster operational

Master-0 ALSO fails:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│ etcd ✗   │ │ etcd ✓   │ │ etcd ✗   │
│ DOWN     │ │ alone!   │ │ DOWN     │
└──────────┘ └──────────┘ └──────────┘
1/3 = NO quorum ✗ → read-only mode