Quick Answer
Chaos engineering systematically injects failures (pod kills, network latency, disk stress) into Kubernetes to validate resilience. Blast radius is controlled through namespace targeting, label selectors, percentage-based injection, mandatory rollback plans, and SLO-based abort conditions.
Detailed Answer
Think of chaos engineering like a fire drill in a bank's headquarters. You don't set the entire building on fire to test the evacuation plan. You trigger a controlled alarm in one wing, observe how people respond, time the evacuation, identify bottlenecks at stairwells, and expand to more wings only after the first drill succeeds. The 'blast radius' is which wing you pick and how many floors are involved. Chaos engineering in Kubernetes follows the same principle: start small, observe carefully, expand gradually.

Before running any experiment, you must define the steady state hypothesis tied to your SLOs. For a banking payments platform, steady state might be: p99 latency for /api/payments is below 200ms, error rate is below 0.1%, and transaction throughput is above 500 TPS. The experiment asks: 'If we kill 30% of payments-api pods, does the system maintain steady state?' If it does, your horizontal scaling and readiness probes work. If it doesn't, you've found a resilience gap before a real incident exposes it at 3 AM.

LitmusChaos is a CNCF project that runs as a Kubernetes operator. You install the ChaosCenter (control plane) and deploy ChaosEngine resources that reference ChaosExperiment templates. LitmusChaos provides a library of pre-built experiments: pod-delete, pod-network-latency, pod-cpu-hog, node-drain, disk-fill, and more. Each experiment has configurable parameters for blast radius: you specify the target namespace, label selectors, number of pods to affect, and duration. When a ChaosEngine is created, the operator launches the experiment as a Kubernetes Job that injects the failure, evaluates the configured probes (HTTP health checks, Prometheus queries, custom commands), and records a pass/fail verdict in a ChaosResult.

Gremlin is a commercial alternative that provides a SaaS control plane with a richer UI, team-based RBAC, and pre-built attack scenarios. Gremlin deploys a DaemonSet agent on your cluster that receives attack commands from the Gremlin cloud API.
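The steady state hypothesis described above can be written down as data plus a check, so that "did the system maintain steady state?" has an unambiguous answer. The following is a minimal sketch, not part of LitmusChaos or Gremlin: the `SteadyState` class, the `violations` function, and the sample readings are hypothetical, and only the three thresholds come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SteadyState:
    """SLO thresholds from the text; class and field names are hypothetical."""
    max_p99_latency_ms: float = 200.0
    max_error_rate_pct: float = 0.1
    min_throughput_tps: float = 500.0


def violations(slo: SteadyState, p99_latency_ms: float,
               error_rate_pct: float, throughput_tps: float) -> List[str]:
    """Return every violated threshold; an empty list means steady state holds."""
    found = []
    if p99_latency_ms >= slo.max_p99_latency_ms:
        found.append(f"p99 latency {p99_latency_ms}ms is not below {slo.max_p99_latency_ms}ms")
    if error_rate_pct >= slo.max_error_rate_pct:
        found.append(f"error rate {error_rate_pct}% is not below {slo.max_error_rate_pct}%")
    if throughput_tps <= slo.min_throughput_tps:
        found.append(f"throughput {throughput_tps} TPS is not above {slo.min_throughput_tps} TPS")
    return found


# Illustrative readings taken while 30% of payments-api pods are being killed.
# The numbers are made up; in practice they would come from your metrics system.
during_chaos = violations(SteadyState(), p99_latency_ms=240.0,
                          error_rate_pct=0.05, throughput_tps=510.0)
print(during_chaos or "steady state maintained")
```

Returning the list of violated thresholds, rather than a single boolean, tells the retrospective which SLO broke first.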
Blast radius control is the most critical aspect and requires multiple layers. First, always start in non-production environments: your staging cluster should mirror production topology (same number of replicas, same resource limits, same network policies). Second, use namespace isolation: target only the payments namespace, and never run experiments against namespaces such as kube-system or cert-manager, whose failure affects the entire cluster. Third, use percentage-based targeting: affect 1 out of 5 pods first, then 2, then 3, and never jump to 100%. Fourth, set experiment duration limits: a 30-second pod kill is recoverable; a 30-minute network partition might trigger cascading failures in downstream services. Fifth, define abort conditions: if error rate exceeds 1% or latency exceeds 500ms, automatically halt the experiment.

In a banking environment, chaos engineering requires additional governance. Every experiment needs a formal approval process, a documented rollback plan, and on-call engineers explicitly notified before execution. Compliance frameworks such as SOC 2 and PCI DSS expect evidence that resilience testing was controlled and documented. Run experiments during business hours when the team is fully staffed, never on Fridays or before holidays. Use GameDay scheduling in your chaos platform to coordinate experiments across teams and ensure only one experiment runs at a time to prevent compounding failures.

The gotcha that catches most teams: chaos experiments expose not just application weaknesses but observability gaps. Your first pod-kill experiment might pass because the app recovers in 5 seconds. But if your alerting didn't fire, your dashboards didn't show the event, and your on-call engineer had no idea an experiment was running, you've discovered that your monitoring is blind to real failures. The most valuable outcome of chaos engineering is often improving your observability stack, not your application code.
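Two of the layers above, percentage-based targeting and abort conditions, are simple enough to sketch in code. The example below is a hedged illustration rather than how any chaos tool implements them: the function names, the round-up rule, and the choice to always leave one pod untouched are assumptions of this sketch. The 1% error rate and 500ms latency limits are the abort conditions from the text.

```python
from typing import Callable, List, Tuple


def pods_to_target(total_pods: int, percentage: int) -> int:
    """Pods affected at a given percentage: at least one, never all of them.

    Rounding up and keeping one survivor are assumptions of this sketch.
    """
    if total_pods < 2:
        raise ValueError("need at least 2 replicas so one pod always survives")
    if not 1 <= percentage <= 100:
        raise ValueError("percentage must be between 1 and 100")
    count = (total_pods * percentage + 99) // 100  # integer ceiling
    return max(1, min(count, total_pods - 1))


def should_abort(error_rate_pct: float, p99_latency_ms: float,
                 max_error_rate_pct: float = 1.0,
                 max_latency_ms: float = 500.0) -> bool:
    """Abort condition from the text: error rate above 1% or latency above 500ms."""
    return error_rate_pct > max_error_rate_pct or p99_latency_ms > max_latency_ms


def run_stages(total_pods: int, stages: List[int],
               observe: Callable[[int], Tuple[float, float]]) -> List[int]:
    """Walk through increasing percentages and stop at the first abort.

    `observe` is a hypothetical callback that injects the failure into the
    given number of pods and returns (error_rate_pct, p99_latency_ms).
    """
    completed = []
    for pct in stages:
        error_rate, latency = observe(pods_to_target(total_pods, pct))
        if should_abort(error_rate, latency):
            break
        completed.append(pct)
    return completed


# Stand-in for a real experiment: the system copes with 1 or 2 lost pods
# but degrades badly once 3 of 5 are gone. The numbers are invented.
def fake_observe(pods_killed: int) -> Tuple[float, float]:
    return (0.2, 180.0) if pods_killed < 3 else (2.5, 650.0)


print(run_stages(5, [20, 40, 60], fake_observe))
```

Keeping the abort check separate from the staging loop means the same thresholds can be reused by whatever halts a running experiment.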
Code Example
# Install LitmusChaos operator on the cluster
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP

# Install chaos experiments for the payments namespace
# (the URL is quoted so the shell does not interpret the "?")
kubectl apply -n payments \
  -f "https://hub.litmuschaos.io/api/chaos/3.0.0?file=charts/generic/experiments.yaml"

# Define a ChaosEngine targeting payments-api pods
# with strict blast radius controls
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payments-pod-kill
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=payments-api    # Only target payments-api pods
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"         # Only 30 seconds of chaos
            - name: CHAOS_INTERVAL
              value: "10"         # Kill a pod every 10 seconds
            - name: PODS_AFFECTED_PERC
              value: "30"         # Only affect 30% of pods
            - name: FORCE
              value: "false"      # Graceful termination first
        probe:
          # Probe field values below use integer seconds; newer Litmus
          # releases may expect duration strings such as "5s".
          - name: payments-health-check
            type: httpProbe
            httpProbe/inputs:
              url: http://payments-api.payments.svc:8080/health
              insecureSkipVerify: false
              responseTimeout: 3000   # 3s timeout
              method:
                get:
                  criteria: "=="
                  responseCode: "200"
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 3
              interval: 5
          - name: slo-error-rate
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus.monitoring.svc:9090
              query: |
                sum(rate(http_requests_total{service="payments-api",code=~"5.."}[1m]))
                / sum(rate(http_requests_total{service="payments-api"}[1m])) * 100
              comparator:
                type: float
                criteria: "<="
                value: "1.0"      # Probe fails if error rate > 1%
            mode: Continuous
            runProperties:
              probeTimeout: 5
              retry: 1
              interval: 5
              stopOnFailure: true   # Abort the experiment when this probe fails

# Monitor chaos experiment status
kubectl get chaosresult payments-pod-kill-pod-delete -n payments -o yaml

# Check if experiment passed or failed
kubectl get chaosengine payments-pod-kill -n payments \
  -o jsonpath='{.status.experiments[0].verdict}'
Architecture Diagram
┌─────────── Chaos Engineering Workflow ──────────────────┐
│                                                         │
│  ┌──────────────────────────────────────────────────┐   │
│  │  1. Define Steady State (SLO-based)              │   │
│  │     • p99 latency < 200ms                        │   │
│  │     • Error rate < 0.1%                          │   │
│  │     • Throughput > 500 TPS                       │   │
│  └──────────────────────┬───────────────────────────┘   │
│                         ▼                               │
│  ┌──────────────────────────────────────────────────┐   │
│  │  2. Blast Radius Controls                        │   │
│  │  ┌────────────┐  ┌──────────┐  ┌───────────┐     │   │
│  │  │ Namespace  │  │ Label    │  │ % Pods    │     │   │
│  │  │ isolation  │  │ selector │  │ affected  │     │   │
│  │  │ (payments) │  │ (app=    │  │ (30%)     │     │   │
│  │  │            │  │  pay-api)│  │           │     │   │
│  │  └────────────┘  └──────────┘  └───────────┘     │   │
│  └──────────────────────┬───────────────────────────┘   │
│                         ▼                               │
│  ┌──────────────────────────────────────────────────┐   │
│  │  3. Run Experiment                               │   │
│  │                                                  │   │
│  │  ChaosEngine → Pod Kill Job → Inject Failure     │   │
│  │                               │                  │   │
│  │      ┌────────────────────────┤                  │   │
│  │      ▼             ▼          ▼                  │   │
│  │  HTTP Probe    Prom Probe   Abort Condition      │   │
│  │  (health OK?)  (SLO met?)   (error > 1%? STOP)   │   │
│  └──────────────────────┬───────────────────────────┘   │
│                         ▼                               │
│  ┌──────────────────────────────────────────────────┐   │
│  │  4. Analyze Results                              │   │
│  │     PASS → expand scope in next iteration        │   │
│  │     FAIL → document gap → create remediation     │   │
│  └──────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────┘
Quick Answer
A game day is a planned resilience exercise where teams deliberately inject failures into systems while observing how services, monitoring, and people respond. It validates that SLOs are maintained under stress, runbooks are effective, and on-call engineers can diagnose and recover from realistic failure scenarios.
Detailed Answer
Think of a game day like a military field exercise. Instead of sending troops into actual combat to find out if their training works, you create a realistic simulation, complete with communications failures, supply chain disruptions, and adversary movements, in a controlled environment where nobody actually gets hurt. The goal isn't to prove everything works perfectly; it's to find the gaps in training, communication, and procedures before a real crisis exposes them. The most valuable game days are the ones where things go wrong, because each failure becomes a training opportunity.

A game day for a banking payments platform begins weeks before the actual event with planning and scoping. The game day lead defines 3-5 failure scenarios relevant to the platform's risk profile, for example: primary database failover, Kafka broker loss during peak transaction volume, AZ failure affecting half the payments-api pods, certificate expiry on a critical internal service, and a sudden traffic spike of 10x normal volume. Each scenario has a defined hypothesis: 'When we kill 2 of 3 Kafka brokers, consumer lag will spike but recover within 5 minutes, and no transactions will be lost.' The scope explicitly lists what will and will not be tested: you never inject chaos into systems that process live customer transactions without extensive safeguards.

During the game day, the facilitator runs scenarios one at a time while observers monitor dashboards, alerting, and team communication channels. The key participants are: the facilitator, who injects failures and controls the timeline; the development team, who respond to incidents as if they were real; the SRE team, who monitor SLIs and system health; and observers, who document everything: how long it took to detect the issue, what runbook was followed, what communication happened, and where confusion arose.

Each scenario follows a structured flow: inject the failure, start a timer, observe whether alerts fire within the expected window, watch the team's response, measure time to detection and time to recovery, and document all observations.

The most critical aspect of a banking game day is defining clear safety rails. You need a kill switch: a way to immediately reverse any injected failure if it threatens to cause real customer impact. The game day should run in a pre-production environment that mirrors production topology, or in production during a low-traffic maintenance window with explicit leadership approval. To support PCI DSS and SOC 2 audits, every game day must be formally documented with approvals, scope definitions, results, and remediation actions. Some regulators specifically require evidence of resilience testing, making game days not just a best practice but a compliance requirement.

After all scenarios complete, the team conducts an immediate retrospective. This is where the real value emerges. Common findings include: alerts didn't fire because the threshold was wrong, the runbook had an outdated command that no longer works, the on-call engineer didn't know how to access the Kafka admin tools, DNS failover took 12 minutes instead of the expected 2 minutes, and the team communicated over three different Slack channels, causing information fragmentation. Each finding becomes a tracked action item with an owner and deadline.

The gotcha that makes game days fail: treating them as a one-time event rather than a regular practice. The first game day is always rough: it exposes dozens of issues and the team feels demoralized. The value comes from running them quarterly, tracking improvement on previously identified issues, and gradually increasing the complexity and realism of scenarios. A team that runs game days quarterly will eventually handle real incidents with the calm confidence of a practiced response, while a team that ran one game day 18 months ago will fumble through the next real outage.
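The per-scenario flow described above (inject, start a timer, check that alerts fire in the expected window, measure time to detection and time to recovery) comes down to a few timestamp subtractions. Below is a minimal sketch of how an observer's notes could be turned into findings. The `ScenarioTimeline` structure, the `findings` function, the date, and the timestamps are hypothetical; the 2-minute alert window and 5-minute recovery window echo the examples in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class ScenarioTimeline:
    """Timestamps an observer records for one scenario (hypothetical structure)."""
    name: str
    injected_at: datetime
    alert_fired_at: Optional[datetime] = None  # None: no alert ever fired
    recovered_at: Optional[datetime] = None    # None: no recovery during the session

    def time_to_detect(self) -> Optional[timedelta]:
        if self.alert_fired_at is None:
            return None
        return self.alert_fired_at - self.injected_at

    def time_to_recover(self) -> Optional[timedelta]:
        if self.recovered_at is None:
            return None
        return self.recovered_at - self.injected_at


def findings(t: ScenarioTimeline, max_detect: timedelta,
             max_recover: timedelta) -> List[str]:
    """Compare measured times with the hypothesis and list every gap."""
    out = []
    ttd, ttr = t.time_to_detect(), t.time_to_recover()
    if ttd is None:
        out.append(f"{t.name}: no alert fired")
    elif ttd > max_detect:
        out.append(f"{t.name}: detection took {ttd}, expected within {max_detect}")
    if ttr is None:
        out.append(f"{t.name}: system did not recover")
    elif ttr > max_recover:
        out.append(f"{t.name}: recovery took {ttr}, expected within {max_recover}")
    return out


# Illustrative timestamps only: alert after 8 minutes, recovery after 12.
start = datetime(2026, 7, 14, 10, 0)
kafka = ScenarioTimeline(
    name="Kafka broker failure",
    injected_at=start,
    alert_fired_at=start + timedelta(minutes=8),
    recovered_at=start + timedelta(minutes=12),
)
for line in findings(kafka, max_detect=timedelta(minutes=2),
                     max_recover=timedelta(minutes=5)):
    print(line)
```

Treating a missing alert or a missing recovery as its own finding matters, because "the alert never fired" is one of the most common results of a first game day.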
Code Example
# Game Day Plan: Banking Payments Platform
# Date: 2026-Q3 Game Day
# Duration: 4 hours (10 AM - 2 PM, business hours)

# Pre-game: Verify monitoring baseline
# Record steady-state metrics for comparison
kubectl exec -n monitoring prometheus-0 -- \
  promtool query instant http://localhost:9090 \
  'sum(rate(http_requests_total{service="payments-api"}[5m]))'
# Expected: ~500 req/s baseline

# ──── Scenario 1: Pod Failure (30 min) ────
# Hypothesis: Killing 50% of payments-api pods maintains p99 < 200ms
# Inject: Kill 3 of 6 payments-api pods.
# Select three pod names first: piping "kubectl delete" into head would
# delete every matching pod and only truncate the output.
kubectl get pods -n payments -l app=payments-api \
  --field-selector status.phase=Running -o name | head -3 | \
  xargs kubectl delete -n payments --grace-period=0 --force

# Observe: Do alerts fire within 2 minutes?
# Observe: Do replacement pods start within 30 seconds?
# Observe: Does p99 latency stay below SLO threshold?
kubectl get pods -n payments -l app=payments-api -w

# Measure recovery
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -20

# ──── Scenario 2: Kafka Broker Failure (45 min) ────
# Hypothesis: Losing 1 of 3 Kafka brokers doesn't lose transactions
# Inject: Cordon and drain the node running kafka-1
kubectl cordon worker-node-03
kubectl drain worker-node-03 --delete-emptydir-data --force --ignore-daemonsets

# Monitor consumer lag: it should spike, then recover
kubectl exec -n kafka kafka-0 -- \
  kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --group payments-processor --describe

# Verify no messages lost: check the dead letter queue
kubectl exec -n kafka kafka-0 -- \
  kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic payments.dlq --from-beginning --timeout-ms 5000

# Roll back: make the node schedulable again
kubectl uncordon worker-node-03

# ──── Scenario 3: Network Latency Injection (30 min) ────
# Hypothesis: 500ms latency to fraud-detector triggers circuit breaker
# Using LitmusChaos to inject network latency
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: gameday-network-latency
  namespace: payments
spec:
  engineState: active
  appinfo:
    appns: payments
    applabel: app=fraud-detector
    appkind: deployment
  chaosServiceAccount: litmus-admin   # Same service account as the pod-delete engine
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_LATENCY
              value: "500"            # 500ms added latency
            - name: TOTAL_CHAOS_DURATION
              value: "300"            # 5 minutes
            - name: DESTINATION_IPS
              value: "10.96.0.0/12"   # Only internal traffic; adjust to your cluster's Service CIDR

# ──── Post-Game Day Retrospective ────
# Document findings in structured format
# Finding 1: Alert for Kafka consumer lag fired after 8 minutes
#            (expected: 2 minutes). Action: reduce evaluation window
# Finding 2: Runbook for broker recovery referenced deprecated CLI
#            Action: update runbook with KRaft-based commands
# Finding 3: Circuit breaker on fraud-detector opened at 1s timeout
#            but SLO requires 500ms. Action: tune threshold

# Track action items
# kubectl create configmap gameday-q3-actions -n platform \
#   --from-literal=finding1="Reduce Kafka lag alert window to 2m" \
#   --from-literal=finding2="Update broker recovery runbook" \
#   --from-literal=finding3="Tune circuit breaker to 500ms"
Architecture Diagram
┌──────────── Game Day Timeline ──────────────────────────┐
│                                                         │
│  ┌─── Preparation (2-3 weeks before) ──────────────┐    │
│  │  • Define 3-5 failure scenarios                 │    │
│  │  • Write hypotheses tied to SLOs                │    │
│  │  • Get leadership approval                      │    │
│  │  • Notify on-call and dependent teams           │    │
│  │  • Verify kill switch / rollback procedures     │    │
│  └──────────────────────┬──────────────────────────┘    │
│                         ▼                               │
│  ┌─── Execution (4 hours) ─────────────────────────┐    │
│  │                                                 │    │
│  │  Roles:                                         │    │
│  │  ┌───────────┐  ┌───────────┐  ┌────────────┐   │    │
│  │  │Facilitator│  │Responders │  │ Observers  │   │    │
│  │  │(injects   │  │(dev + SRE │  │(document   │   │    │
│  │  │ failures) │  │ team)     │  │ everything)│   │    │
│  │  └─────┬─────┘  └─────┬─────┘  └─────┬──────┘   │    │
│  │        │              │              │          │    │
│  │        ▼              ▼              ▼          │    │
│  │  Scenario 1 → Detect → Respond → Recover → Log  │    │
│  │  Scenario 2 → Detect → Respond → Recover → Log  │    │
│  │  Scenario 3 → Detect → Respond → Recover → Log  │    │
│  │                                                 │    │
│  │  Metrics tracked per scenario:                  │    │
│  │  • Time to detect (TTD)                         │    │
│  │  • Time to mitigate (TTM)                       │    │
│  │  • SLI impact during failure                    │    │
│  │  • Alert accuracy (fired? correct severity?)    │    │
│  └──────────────────────┬──────────────────────────┘    │
│                         ▼                               │
│  ┌─── Retrospective (immediately after) ───────────┐    │
│  │  • Review each scenario: hypothesis vs reality  │    │
│  │  • Document findings and surprises              │    │
│  │  • Create action items with owners + deadlines  │    │
│  │  • Schedule follow-up game day (quarterly)      │    │
│  └─────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────┘