Quick Answer
Build a multi-stage pipeline with automated testing gates, manual approval for production, progressive canary deployment using Argo Rollouts or Flagger, metric-based canary analysis from Prometheus, and automatic rollback when error rates or latency exceed thresholds.
Detailed Answer
Think of deploying software to a bank like introducing a new procedure at a branch. You would not roll it out to all 500 branches simultaneously. You would train one branch first (canary), monitor customer satisfaction and error rates for a few days, get management approval (gate), then gradually roll it out to more branches while watching for problems. If complaints spike, you immediately revert to the old procedure. CI/CD pipelines for Kubernetes follow the same pattern, with automation replacing the manual monitoring.

A production-grade CI/CD pipeline for banking has distinct stages. The CI portion runs on every pull request: code checkout, dependency vulnerability scanning (Snyk or Trivy), unit tests, static analysis (SonarQube), container image build, image vulnerability scanning, and push to ECR with a git-SHA tag. The CD portion triggers when code merges to main: deploy to staging, run integration tests against staging, wait for manual approval from a tech lead or release manager, deploy canary to production (5% traffic), run automated canary analysis for 30 minutes, gradually shift traffic (5% → 25% → 50% → 100%), and verify post-deployment health checks. In a regulated bank, the approval gate is not optional: SOX compliance requires documented approval for production changes, and the pipeline must log who approved, when, and what was deployed.

Canary analysis is where the pipeline becomes intelligent. Instead of a human watching dashboards during canary, tools like Argo Rollouts with the Prometheus metrics provider or Flagger with its canary analysis engine automatically evaluate the canary's metrics, either against fixed thresholds or against the baseline (stable version). You define success criteria: error rate must be below 1%, P99 latency must be below 500ms, and no new error log patterns. The tool queries Prometheus every 60 seconds during the canary window and makes a pass/fail decision for each measurement. If a metric fails more measurements than its configured failure limit allows, the canary is aborted and traffic returns to the stable version with no human intervention. This is critical for banking because a bad deployment to the payments-api could cause failed transactions, and automatic rollback limits the blast radius to the 5% canary traffic.

Argo Rollouts replaces the standard Kubernetes Deployment with a Rollout resource that supports canary and blue-green strategies natively. The Rollout resource defines the canary steps (traffic weight, pause duration, analysis run), and an AnalysisTemplate defines the Prometheus queries and thresholds. When the Rollout's pod template changes (for example, a new image tag), the Rollout controller creates a canary ReplicaSet, configures the Istio VirtualService (or nginx ingress) to split traffic, runs the analysis, and either promotes or aborts. The entire process is declarative and version-controlled, so auditors can review the Git history to see exactly what canary criteria were in place for each deployment.

In production at a bank, the pipeline must also handle database migrations, feature flags, and compliance artifacts. Database migrations run before the canary deployment using a Kubernetes Job with a migration container. Feature flags (via LaunchDarkly or Unleash) allow code to be deployed but not activated until the canary is promoted. Compliance artifacts (SBOM, or Software Bill of Materials, plus vulnerability scan results, approval records, and deployment timestamps) are stored in an immutable artifact store (JFrog Artifactory or AWS CodeArtifact) and linked to the deployment for audit trails. The pipeline also enforces branch protection rules: only code that has passed peer review (minimum two approvals), all CI checks, and security scanning can reach the production deployment stage.

The biggest gotcha is canary analysis that gives false confidence or false alarms. If your canary only receives 5% of traffic and you are analyzing error rate, low traffic volume means a single error can swing your error rate from 0% to 10% (1 failure in 10 requests), causing false rollbacks. Use absolute error counts alongside percentages for low-traffic services. Another gotcha is not testing the rollback path: if your canary deployment includes a database migration that is not backward-compatible, rolling back the application while the database has already migrated forward causes data issues. Always make database migrations backward-compatible (add columns but do not remove them until the next release). Finally, approval gates must have timeouts: a deployment that waits 48 hours for approval in a banking context creates risk because the codebase may have moved on.
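The low-traffic gotcha above can be made concrete with a small calculation. The following is only a sketch of the idea of pairing a percentage threshold with an absolute error count; `canary_failed`, `max_error_rate`, and `min_error_count` are hypothetical names and values chosen for illustration, not settings of Argo Rollouts or Flagger.

```python
def canary_failed(errors: int, requests: int,
                  max_error_rate: float = 0.01,
                  min_error_count: int = 5) -> bool:
    """Fail the canary only when the error rate AND the absolute
    number of errors are both above their (hypothetical) limits."""
    if requests == 0:
        # No traffic yet: there is nothing to judge.
        return False
    error_rate = errors / requests
    return error_rate >= max_error_rate and errors >= min_error_count


# One failed request out of ten is a 10% error rate, but a single error
# is not enough evidence to roll back a low-traffic canary.
low_traffic = canary_failed(errors=1, requests=10)

# Fifty failures out of a thousand requests is a 5% error rate backed by
# a meaningful sample, so the canary should fail.
high_traffic = canary_failed(errors=50, requests=1000)
```

With these assumed thresholds, `low_traffic` is `False` and `high_traffic` is `True`. In a real pipeline the same effect is usually achieved inside the metric query or by adding a second metric, rather than in application code.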
Code Example
# Argo Rollouts - Canary deployment for payments-api
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
  namespace: banking-prod
spec:
  replicas: 10
  revisionHistoryLimit: 5
  selector:
    matchLabels:
      app: payments-api
  strategy:
    canary:
      canaryService: payments-api-canary
      stableService: payments-api-stable
      trafficRouting:
        istio:
          virtualServices:
            - name: payments-api-vsvc
              routes:
                - primary
      steps:
        # Step 1: 5% canary traffic + analysis
        - setWeight: 5
        - analysis:
            templates:
              - templateName: payments-api-canary-analysis
            args:
              - name: service-name
                value: payments-api-canary
        # Step 2: Manual approval gate (SOX compliance)
        - pause: {} # Requires manual promotion
        # Step 3: Gradual rollout
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 100
      # After an abort, keep canary Pods for 30s before scaling them down
      # (the rollback itself is triggered by a failed analysis run)
      abortScaleDownDelaySeconds: 30
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payments-api:v2.5.1
          ports:
            - containerPort: 8080
---
# AnalysisTemplate - Prometheus-based canary validation
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: payments-api-canary-analysis
  namespace: banking-prod
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 60s
      count: 10 # 10 measurements, one per minute
      successCondition: result[0] < 0.01 # < 1% error rate
      failureLimit: 2 # Tolerate up to 2 failed measurements; a 3rd fails the analysis and aborts the rollout
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="{{args.service-name}}",code=~"5.."}[2m]))
            /
            sum(rate(http_requests_total{app="{{args.service-name}}"}[2m]))
    - name: latency-p99
      interval: 60s
      count: 10
      successCondition: result[0] < 0.5 # < 500ms P99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{app="{{args.service-name}}"}[2m])) by (le)
            )
---
# GitHub Actions CI pipeline with security gates
# .github/workflows/payments-api-cicd.yaml
name: payments-api CI/CD
on:
  push:
    branches: [main]
    paths: ['services/payments-api/**']
jobs:
  ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: go test ./... -coverprofile=coverage.out
      - name: SonarQube analysis
        uses: sonarsource/sonarqube-scan-action@v2
      - name: Build container image
        run: |
          docker build -t payments-api:${{ github.sha }} \
            -f services/payments-api/Dockerfile .
      - name: Trivy vulnerability scan
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: payments-api:${{ github.sha }}
          severity: CRITICAL,HIGH
          exit-code: 1 # Fail pipeline on CRITICAL or HIGH vulns
      - name: Generate SBOM for compliance audit trail
        run: syft payments-api:${{ github.sha }} -o spdx-json > sbom.json
      - name: Push to ECR
        run: |
          aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
          docker tag payments-api:${{ github.sha }} $ECR_REGISTRY/payments-api:${{ github.sha }}
          docker push $ECR_REGISTRY/payments-api:${{ github.sha }}
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│               CI/CD Pipeline with Canary Analysis               │
│                                                                 │
│  ┌──────┐  ┌──────┐  ┌───────┐  ┌──────┐  ┌──────────────────┐  │
│  │ Code │→ │ Unit │→ │ SAST  │→ │Image │→ │ Trivy Scan +     │  │
│  │Commit│  │ Test │  │ Sonar │  │Build │  │ SBOM Generation  │  │
│  └──────┘  └──────┘  └───────┘  └──────┘  └────────┬─────────┘  │
│                                                    │            │
│                         ┌──────────────────────────▼────────┐   │
│                         │ Staging Deploy                    │   │
│                         │ Integration Tests + E2E           │   │
│                         └──────────────┬────────────────────┘   │
│                                        │                        │
│                         ┌──────────────▼────────────────────┐   │
│                         │ Manual Approval Gate              │   │
│                         │ (SOX: who + when + what)          │   │
│                         └──────────────┬────────────────────┘   │
│                                        │                        │
│  ┌─────────────────────────────────────▼─────────────────────┐  │
│  │ Canary Deployment                                         │  │
│  │                                                           │  │
│  │  5% ──→ Analysis ──→ 25% ──→ 50% ──→ 100%                 │  │
│  │         (Prometheus)                                      │  │
│  │         error rate < 1%                                   │  │
│  │         P99 < 500ms             ┌──────────────────┐      │  │
│  │                                 │ Auto Rollback    │      │  │
│  │  If analysis fails ─────────→   │ (abort canary,   │      │  │
│  │                                 │  restore stable) │      │  │
│  │                                 └──────────────────┘      │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Quick Answer
During a rolling update, the Deployment creates a new ReplicaSet and gradually scales it up while scaling the old one down, within the limits set by maxSurge and maxUnavailable. New Pods must pass their readiness probes before they count as available, which is what allows further old Pods to be terminated. To roll back, run kubectl rollout undo, which scales the previous ReplicaSet back up.
Detailed Answer
Think of a rolling update like replacing light bulbs in a theater while the show is still running. You can't turn off all the lights at once (that's a Recreate strategy: total blackout). Instead, you unscrew one old bulb, screw in a new one, verify it lights up (readiness probe), and only then move to the next one. At every moment during the replacement, the audience still has enough light to see the show. If a new bulb turns out to be defective (bad deployment), you stop the replacement and screw the old bulbs back in (rollback).

When you update a Deployment (change the container image, environment variables, or any field in the Pod template), the Deployment controller creates a brand new ReplicaSet with the updated Pod template. It then orchestrates the transition between the old and new ReplicaSets. Two parameters control it: maxSurge (how many extra Pods above the desired count are allowed during the update, which controls speed) and maxUnavailable (how many Pods may be unavailable at once, which controls the safety margin). With replicas=4, maxSurge=1, maxUnavailable=1, Kubernetes can have up to 5 Pods running (4+1 surge) and must keep at least 3 Pods available (4-1 unavailable) at any point.

The update proceeds in cycles. In each cycle: (1) the new ReplicaSet is scaled up as far as the maxSurge ceiling allows, (2) Kubernetes waits for those new Pods to pass their readiness probe (this is the critical gate: without a readiness probe, Kubernetes considers Pods ready as soon as their containers start, potentially sending traffic to uninitialized applications), (3) the old ReplicaSet is scaled down as far as the maxUnavailable floor allows. If maxUnavailable is greater than 0, some old Pods can be terminated right away without waiting for a replacement; if it is 0, an old Pod is removed only after a new Pod is Ready. This cycle repeats until all old Pods are terminated and all new Pods are running. The entire process is recorded as a new revision in the Deployment's rollout history.

Rollback is simple because Kubernetes keeps old ReplicaSets around (scaled to 0 replicas). When you run `kubectl rollout undo deployment/payments-api`, the Deployment controller doesn't create a new ReplicaSet. It scales up the previous ReplicaSet (which still has the old Pod template with the known-good image) and scales down the current one, following the same rolling-update limits. Rollback is typically faster than the original deployment because the old image may still be cached on nodes (no pull needed) and the old ReplicaSet already exists. You can also roll back to a specific revision with `kubectl rollout undo --to-revision=3`. By default, Kubernetes keeps the last 10 old ReplicaSets (controlled by revisionHistoryLimit). Setting this too low (like 1) means you can only undo one step back. Setting it too high clutters the cluster with stale ReplicaSet objects. For most teams, 5 revisions is the sweet spot.

The most critical production gotcha: if you don't have a readiness probe, Kubernetes considers Pods ready the instant the container process starts, even if your Spring Boot app needs 45 seconds to initialize. During a rolling update, traffic gets routed to these half-started Pods, causing 500 errors for real users. The second gotcha: if your readiness probe never passes (bug in health endpoint, wrong port, misconfigured path), the rollout stalls. New Pods stay NotReady, old Pods are not terminated beyond what maxUnavailable allows, and `kubectl rollout status` keeps reporting that it is waiting for the rollout to finish. After progressDeadlineSeconds (default 600s) the Deployment's Progressing condition turns False with reason ProgressDeadlineExceeded, which marks the rollout as failed. Kubernetes does not roll back on its own at that point; you or your pipeline must run `kubectl rollout undo`.
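A pipeline can detect the stuck-rollout case described above by inspecting the Deployment's conditions. The sketch below assumes the Deployment has already been fetched as JSON (for example with `kubectl get deployment payments-api -o json`) and parsed into a dict; `rollout_state` and its return labels are hypothetical names, while the condition type and reasons are the ones Kubernetes sets on Deployments.

```python
def rollout_state(deployment: dict) -> str:
    """Classify a Deployment's rollout from its status.conditions list."""
    conditions = deployment.get("status", {}).get("conditions", [])
    progressing = next(
        (c for c in conditions if c.get("type") == "Progressing"), None
    )
    if progressing is None:
        return "unknown"
    # Set by the controller once progressDeadlineSeconds has elapsed
    # without the rollout making progress.
    if (progressing.get("status") == "False"
            and progressing.get("reason") == "ProgressDeadlineExceeded"):
        return "stuck"
    # Set once the new ReplicaSet is fully rolled out and available.
    if progressing.get("reason") == "NewReplicaSetAvailable":
        return "complete"
    return "in-progress"


# Example input shaped like the status of a Deployment whose new Pods
# never became Ready (values are illustrative).
stuck_deployment = {
    "status": {
        "conditions": [
            {"type": "Available", "status": "True",
             "reason": "MinimumReplicasAvailable"},
            {"type": "Progressing", "status": "False",
             "reason": "ProgressDeadlineExceeded"},
        ]
    }
}
state = rollout_state(stuck_deployment)
```

Here `state` is `"stuck"`, which is the signal a pipeline could use to trigger `kubectl rollout undo`.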
Code Example
# Current state: payments-api running v2.1.0 with 4 replicas
kubectl get deployment payments-api
# NAME           READY   UP-TO-DATE   AVAILABLE   AGE
# payments-api   4/4     4            4           30d
# Trigger rolling update to v2.2.0
kubectl set image deployment/payments-api \
  api=registry.company.io/payments-api:2.2.0
# Watch the rollout in real time
kubectl rollout status deployment/payments-api
# Waiting for deployment "payments-api" rollout to finish:
# 1 out of 4 new replicas have been updated...
# 2 out of 4 new replicas have been updated...
# 4 out of 4 new replicas have been updated...
# deployment "payments-api" successfully rolled out
# See both ReplicaSets during the transition
kubectl get replicasets -l app=payments-api
# NAME                     DESIRED   CURRENT   READY
# payments-api-7d5f8b6c4   4         4         4       ← new (v2.2.0)
# payments-api-6b4a9c1e2   0         0         0       ← old (v2.1.0)
# v2.2.0 has a bug! Rollback immediately
kubectl rollout undo deployment/payments-api
# deployment.apps/payments-api rolled back
# Confirm rollback succeeded
kubectl rollout status deployment/payments-api
kubectl describe deployment payments-api | grep Image
# Image: registry.company.io/payments-api:2.1.0 ← back to previous
# View full rollout history
# (the undo re-applies revision 1 as a new revision 3, so revision 1 no longer appears)
kubectl rollout history deployment/payments-api
# REVISION   CHANGE-CAUSE
# 2          image update to v2.2.0
# 3          initial deploy
# Roll back to a specific revision that is still listed in the history
# (syntax illustration only: revision 2 here is the buggy v2.2.0)
kubectl rollout undo deployment/payments-api --to-revision=2
# Deployment spec for controlled rollouts
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1 # 1 extra Pod allowed during update
      maxUnavailable: 0 # Never drop below desired count (safest)
  progressDeadlineSeconds: 300 # Fail rollout if stuck for 5 min
  minReadySeconds: 30 # Wait 30s after Ready before continuing
Architecture Diagram
┌──── Rolling Update Timeline (replicas=4, maxSurge=1) ────┐
│                                                          │
│ Step 0: Stable on v2.1.0                                 │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                         │
│ │v2.1.0││v2.1.0││v2.1.0││v2.1.0│  Available: 4           │
│ └──────┘└──────┘└──────┘└──────┘                         │
│                                                          │
│ Step 1: Create 1 new Pod (surge)                         │
│ ┌──────┐┌──────┐┌──────┐┌──────┐┌──────┐                 │
│ │v2.1.0││v2.1.0││v2.1.0││v2.1.0││v2.2.0│  Total: 5       │
│ └──────┘└──────┘└──────┘└──────┘└──┬───┘                 │
│                                    │                     │
│          wait for readiness probe  │                     │
│                                    ▼                     │
│ Step 2: New Pod Ready → terminate 1 old                  │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                         │
│ │v2.1.0││v2.1.0││v2.1.0││v2.2.0│  Available: 4           │
│ └──────┘└──────┘└──────┘└──────┘                         │
│                                                          │
│ ... repeat until ...                                     │
│                                                          │
│ Step Final: All replaced                                 │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                         │
│ │v2.2.0││v2.2.0││v2.2.0││v2.2.0│  Available: 4           │
│ └──────┘└──────┘└──────┘└──────┘                         │
│                                                          │
│ Rollback: scale old RS back up (fast, cached image)      │
│ ┌──────┐┌──────┐┌──────┐┌──────┐                         │
│ │v2.1.0││v2.1.0││v2.1.0││v2.1.0│  ← old RS restored      │
│ └──────┘└──────┘└──────┘└──────┘                         │
└──────────────────────────────────────────────────────────┘
Quick Answer
Helm charts are like blueprints that package application configurations and dependencies, making it easy to deploy complex applications in Kubernetes with consistent results.
Detailed Answer
Imagine you're building a house. You have all the materials (containers) but need a blueprint (chart) to organize them into the correct structure (deployments). Helm charts provide this blueprint for Kubernetes, allowing developers and operators to package an application's configuration, dependencies, and resources into reusable templates that can be deployed consistently across different environments.

Helm is a package manager for Kubernetes that deploys complex applications from templated manifests. A chart contains metadata, default values, and multiple YAML templates defining the desired state of your application, including deployments, services, and configurations. A Helm chart is structured with a `Chart.yaml` file containing metadata about the package, a `values.yaml` for configurable parameters, and various templates (e.g., `templates/deployment.yaml`) that define Kubernetes resources. When you install a chart, Helm renders these templates with the provided values, replacing each placeholder with data from `values.yaml` or from command-line overrides, and sends the resulting manifests to the API Server.

At scale, engineers use Helm to manage and automate deployments across multiple clusters. They use Helm repositories (like ChartMuseum) for versioning and sharing charts. Monitoring can track release status and surface problems during upgrades or rollbacks, provided something exports that status as metrics for a tool like Prometheus. Common challenges include managing dependencies between different charts, ensuring proper resource allocation, and handling secret management securely.

A critical gotcha is dependency resolution when multiple charts define overlapping resources (like Services or Deployments with the same name). Getting every dependency resolved to a compatible version is hard in large environments. Another issue is keeping secrets and sensitive data out of chart values and rendered manifests, where they are easy to leak.
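The rendering step described above can be illustrated with a toy model. Helm itself uses Go templates, so the following is not Helm's implementation: it is a minimal Python sketch that substitutes `{{ .Values.* }}` placeholders from a nested dict, with a hypothetical `render` helper and a made-up two-line template.

```python
import re


def render(template: str, values: dict) -> str:
    """Replace {{ .Values.a.b }} placeholders with entries from `values`."""
    def lookup(match: re.Match) -> str:
        node = values
        # Walk the dotted path, e.g. "image.tag" -> values["image"]["tag"].
        for key in match.group(1).split("."):
            node = node[key]
        return str(node)

    return re.sub(r"\{\{\s*\.Values\.([\w.]+)\s*\}\}", lookup, template)


# A fragment in the style of templates/deployment.yaml (illustrative only).
template = (
    "image: {{ .Values.image.repository }}:{{ .Values.image.tag }}\n"
    "replicas: {{ .Values.replicaCount }}"
)

# The equivalent of values.yaml, already parsed into a dict.
values = {
    "replicaCount": 3,
    "image": {"repository": "registry.company.com/payments-api", "tag": "2.8.4"},
}

rendered = render(template, values)
```

`rendered` then contains `image: registry.company.com/payments-api:2.8.4` on the first line and `replicas: 3` on the second. Real charts also use conditionals, loops, and helper templates, which this sketch deliberately ignores.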
Code Example
# Create a new Helm chart
helm create payments-api
# Install a chart into the payments namespace
helm install payments-api ./payments-api \
  --namespace payments \
  --set replicaCount=3 \
  --set image.repository=registry.company.com/payments-api \
  --set image.tag=2.8.4
# List installed releases
helm list -n payments
# Upgrade a release with new values
# (--reuse-values keeps the replicaCount and image.repository set at install time)
helm upgrade payments-api ./payments-api \
  --namespace payments \
  --reuse-values \
  --set image.tag=2.9.0
# Rollback to release revision 1 (the original install) if upgrade fails
helm rollback payments-api 1 -n payments
# View release history
helm history payments-api -n payments
Quick Answer
Rolling update gradually replaces old pods with new ones (Kubernetes native). Blue-green maintains two full environments and switches traffic instantly. Canary sends a small percentage of traffic to the new version first, then gradually increases. Each trades off speed, resource cost, and risk differently.
Detailed Answer
Think of upgrading a restaurant's menu. A rolling update is like changing one table's menu at a time until all tables have the new menu: gradual and cheap, but some customers see the new menu before you know whether it works. Blue-green is like printing a complete new set of menus and swapping them all at once while keeping the old ones ready: instant switchback but double the printing cost. Canary is like giving the new menu to one table first, watching their reaction, and only rolling it out if they order successfully: safest but slowest.

The rolling update is Kubernetes' native deployment strategy. When you update a Deployment's pod template, the controller creates a new ReplicaSet and gradually scales it up while scaling the old one down. The maxSurge parameter controls how many extra pods can exist during the transition (e.g., 25% means 12 becomes 15 temporarily), and maxUnavailable controls how many pods can be missing (e.g., 25% means at least 9 of 12 must always be ready). Traffic flows to both old and new pods simultaneously during the rollout. If the new pods fail readiness probes, the rollout stalls instead of replacing more old pods. Rollback is a single command: kubectl rollout undo.

Blue-green deployment runs two complete environments: blue (current live version) and green (new version). Both are fully deployed and warmed up before any traffic switch. Once the green environment passes health checks and smoke tests, the Service selector or Ingress routing is updated to point all traffic to green. If problems emerge, switching back to blue is near-instant because it is still running. The cost is double the resources during the transition. In Kubernetes, this is implemented by having two Deployments (payments-api-blue and payments-api-green) and changing the Service selector or using Ingress annotations to switch.

Canary deployment sends a small fraction of production traffic to the new version (typically 1-5% initially) while the majority continues on the stable version. If metrics (error rate, latency, business KPIs) look healthy over a defined period, traffic is gradually shifted: 5%, 25%, 50%, 100%. If any metric degrades, traffic reverts to the stable version automatically. This provides the highest safety because real production traffic validates the new version with minimal blast radius. In Kubernetes, canary deployments are typically managed by tools like Argo Rollouts, Flagger, or Istio traffic splitting rather than native Kubernetes primitives.

In production, the choice depends on your risk tolerance, infrastructure budget, and rollback speed requirements. Rolling updates are native and cheap (only the surge pods are extra), but you cannot easily test the new version in isolation before it receives traffic. Blue-green is excellent for major version changes that might need instant rollback, but it doubles resource cost. Canary is ideal for high-traffic services where even a 1% error rate affects thousands of users and you need metric-driven progressive delivery. Many teams use rolling updates for internal services and canary for user-facing APIs.

The non-obvious gotcha is database compatibility. All three strategies can have old and new application versions running at the same time (during the rollout window, or after a switch back). If the new version changes the database schema, both versions must work with both the old and new schema. This requires backward-compatible migrations: add columns as nullable, never rename in place, deploy the new schema before the new code, and remove old columns only after the old code is fully gone. Teams that forget this end up with both versions hitting the database and one of them crashing on a schema mismatch.
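The percentage examples for rolling updates above (12 pods becoming 15, at least 9 of 12 ready) follow from how Kubernetes converts percentages into pod counts: maxSurge is rounded up and maxUnavailable is rounded down. The sketch below reproduces that arithmetic; `resolve` and `rollout_bounds` are hypothetical helper names, and the special case where both values resolve to zero is left out.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute number or a percentage string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%"):
    """Return (maximum total pods, minimum available pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)               # rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    return replicas + surge, replicas - unavailable


bounds_12 = rollout_bounds(12)        # defaults of 25% / 25%
bounds_10 = rollout_bounds(10)        # 2.5 pods rounds to 3 surge, 2 unavailable
bounds_4 = rollout_bounds(4, 1, 1)    # absolute values pass through unchanged
```

With 12 replicas the bounds are `(15, 9)`, matching the figures in the text; with 10 replicas they are `(13, 8)` because the two roundings go in opposite directions.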
Code Example
# Check current rolling update strategy settings
kubectl get deployment payments-api -n payments -o jsonpath='{.spec.strategy}'
# Watch a rolling update in progress — shows old and new ReplicaSets
kubectl rollout status deployment/payments-api -n payments
# View both ReplicaSets during a rollout (old scaling down, new scaling up)
kubectl get replicaset -n payments -l app=payments-api
# Rollback a failed rolling update to the previous version
kubectl rollout undo deployment/payments-api -n payments
# Blue-green: switch traffic by updating the Service selector
kubectl patch svc payments-api -n payments -p '{"spec":{"selector":{"version":"green"}}}'
# Blue-green: rollback by switching back to blue
kubectl patch svc payments-api -n payments -p '{"spec":{"selector":{"version":"blue"}}}'
# Canary with Argo Rollouts: check canary step progress
kubectl argo rollouts get rollout payments-api -n payments
# Canary: manually promote after verifying metrics
kubectl argo rollouts promote payments-api -n payments
# Canary: abort if metrics degrade and roll back to stable
kubectl argo rollouts abort payments-api -n payments
Architecture Diagram
┌─────────────────────────────────────────────────┐
│ Rolling Update (native)                         │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐                         │
│ │v1 │ │v1 │ │v2 │ │v2 │  gradual replacement    │
│ └───┘ └───┘ └───┘ └───┘                         │
├─────────────────────────────────────────────────┤
│ Blue-Green                                      │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐  blue (live)            │
│ │v1 │ │v1 │ │v1 │ │v1 │                         │
│ └───┘ └───┘ └───┘ └───┘                         │
│ ┌───┐ ┌───┐ ┌───┐ ┌───┐  green (standby)        │
│ │v2 │ │v2 │ │v2 │ │v2 │  ← switch traffic       │
│ └───┘ └───┘ └───┘ └───┘                         │
├─────────────────────────────────────────────────┤
│ Canary                                          │
│ ┌───┐ ┌───┐ ┌───┐  95% traffic                  │
│ │v1 │ │v1 │ │v1 │                               │
│ └───┘ └───┘ └───┘                               │
│ ┌───┐  5% traffic                               │
│ │v2 │  ← monitor metrics before promoting       │
│ └───┘                                           │
└─────────────────────────────────────────────────┘
Quick Answer
A rolling update gradually replaces old pods with new ones by creating new ReplicaSet pods while terminating old ones. maxSurge controls how many extra pods can exist above the desired count during the update, while maxUnavailable controls how many pods can be down simultaneously.
Detailed Answer
Think of maxSurge and maxUnavailable like staffing rules during a shift change at a hospital. maxSurge is how many extra nurses you can have on the floor simultaneously (overtime budget), and maxUnavailable is how many nurse positions can be empty at once (minimum staffing). A hospital might say "we can have 1 extra nurse on overtime (maxSurge=1) but no positions can be vacant (maxUnavailable=0)", meaning the new shift arrives before the old shift leaves.

When you update a Deployment's pod template (new image, env vars, etc.), the Deployment controller creates a new ReplicaSet with the updated spec and begins scaling it up while scaling the old ReplicaSet down. The speed and safety of this transition are controlled by the `strategy.rollingUpdate.maxSurge` and `strategy.rollingUpdate.maxUnavailable` parameters. Both can be absolute numbers or percentages of the desired replica count.

Here is the sequence with replicas=4, maxSurge=1, maxUnavailable=1. The controller can have at most 5 total pods (4 + maxSurge=1) and must keep at least 3 available pods (4 - maxUnavailable=1). It starts by creating 1 new pod (now 5 total: 4 old + 1 new). Because one pod is allowed to be unavailable, it can also terminate 1 old pod without waiting, which leaves 3 old pods available (the minimum) and frees room under the ceiling for another new pod. Each time a new pod becomes Ready, availability rises above the minimum and another old pod can be terminated, which in turn makes room for the next new pod. This repeats until all 4 pods are running the new version. The exact interleaving depends on timing, but both constraints are respected at every step.

In production, the two most common configurations are: (1) maxSurge=25%, maxUnavailable=25% (the default), which balances speed and safety and completes the update in a few rounds for most replica counts; (2) maxSurge=1, maxUnavailable=0, which is the safest option because no old pod is terminated until its replacement is proven Ready. The second option means your cluster temporarily runs more pods than desired, so you need spare capacity, but it guarantees zero capacity reduction during the update.

The critical gotcha: availability is measured by readiness, so a new pod that is running but failing its readiness probe counts against maxUnavailable and can block the rollout from proceeding. A common failure mode is a broken readiness probe on the new version that never passes: the rollout creates as many new pods as the limits allow, they all fail readiness, and the rollout stalls with both old and new pods running but no progress being made. This is when `kubectl rollout undo` becomes necessary.
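The sequence described above can be traced with a simplified model. This is a sketch, not the Deployment controller's actual algorithm: it works in discrete rounds, assumes every new pod becomes Ready before the next round starts, and uses a hypothetical `simulate_rollout` function that records `(old Ready pods, new Ready pods)` after each round.

```python
def simulate_rollout(replicas: int, max_surge: int, max_unavailable: int):
    """Return the (old, new) pod counts after each round of a rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    min_available = replicas - max_unavailable   # availability floor
    max_total = replicas + max_surge             # surge ceiling
    old, new = replicas, 0
    history = [(old, new)]
    while old > 0:
        # Terminate old pods only down to the availability floor.
        removable = max(0, (old + new) - min_available)
        old -= min(old, removable)
        # Create new pods only up to the surge ceiling.
        creatable = min(replicas - new, max_total - (old + new))
        new += creatable  # assumed Ready by the end of the round
        history.append((old, new))
    return history


# The example from the text: one old pod can go immediately.
fast = simulate_rollout(replicas=4, max_surge=1, max_unavailable=1)

# The safest setting: an old pod is removed only after a new one is Ready.
safe = simulate_rollout(replicas=4, max_surge=1, max_unavailable=0)
```

In this model `fast` is `[(4, 0), (3, 2), (1, 4), (0, 4)]` and `safe` is `[(4, 0), (4, 1), (3, 2), (2, 3), (1, 4), (0, 4)]`, which shows the trade-off: allowing one unavailable pod roughly halves the number of rounds, while maxUnavailable=0 never drops below four Ready pods.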
Code Example
# Deployment with explicit rolling update strategy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  namespace: production
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2 # Allow 2 extra pods (total can be 8)
      maxUnavailable: 1 # At most 1 pod can be unavailable (min 5 available)
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: checkout
          image: checkout:3.4.1 # Change this to trigger rolling update
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
# Trigger a rolling update by changing the image
kubectl set image deploy/checkout-service checkout=checkout:3.5.0 -n production
# Watch the rollout progress
kubectl rollout status deploy/checkout-service -n production
# Waiting for deployment "checkout-service" rollout to finish:
# 2 out of 6 new replicas have been updated...
# 4 out of 6 new replicas have been updated...
# 6 out of 6 new replicas have been updated...
# deployment "checkout-service" successfully rolled out
# See both ReplicaSets during rollout
kubectl get rs -n production -l app=checkout
# NAME DESIRED CURRENT READY
# checkout-service-6d4f7b 6 6 6 ← new (complete)
# checkout-service-5c3e8a 0 0 0 ← old (scaled down)
# Rollback if the new version has issues
kubectl rollout undo deploy/checkout-service -n production
# Safe config: no capacity loss during update
# maxSurge: 1, maxUnavailable: 0
# Means: always at least 6 pods available, create 1 new before removing 1 old
Architecture Diagram
Rolling Update with replicas=4, maxSurge=1, maxUnavailable=1:

Step 0 (before update):
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 ✓] [Pod4 ✓] = 4 Ready
  New RS: (empty)
  Total: 4 pods, 4 available

Step 1 (create new, terminate old):
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 ✓] [Pod4 terminating]
  New RS: [Pod5 ✓]
  Total: 5 pods, 4 available (within maxSurge=1, maxUnavail=1)

Step 2:
  Old RS: [Pod1 ✓] [Pod2 ✓] [Pod3 terminating]
  New RS: [Pod5 ✓] [Pod6 ✓]
  Total: 5 pods, 4 available

Step 3:
  Old RS: [Pod1 ✓] [Pod2 terminating]
  New RS: [Pod5 ✓] [Pod6 ✓] [Pod7 ✓]
  Total: 5 pods, 4 available

Step 4 (complete):
  Old RS: (empty)
  New RS: [Pod5 ✓] [Pod6 ✓] [Pod7 ✓] [Pod8 ✓] = 4 Ready
  Total: 4 pods, 4 available

Constraint at every step:
  max total pods = replicas + maxSurge = 4 + 1 = 5
  min available = replicas - maxUnavail = 4 - 1 = 3