Shopify

15 interview questions · kubernetes, terraform

How do you identify whether a pod restart is caused by OOMKilled, a connectivity failure, or an application-level bug?

advanced · pods · kubernetes

Quick Answer

Check the container's last termination reason and exit code: OOMKilled shows reason=OOMKilled with exit 137, connectivity failures show timeout errors in application logs with exit 1, and application bugs show stack traces or panic messages in logs with exit 1 or 2. The distinction comes from correlating exit codes, termination reasons, and log content.

Detailed Answer

Think of a car that keeps stalling. A mechanic checks three things in order: is it out of fuel (OOMKilled: out of memory), is the road blocked (connectivity: a dependency is unreachable), or is the engine itself broken (application bug)? Each has a different diagnostic signature, and checking them in that order saves time.

In Kubernetes, every container termination leaves metadata that points to the cause. The container status records the exit code, the termination reason, and the termination message.

OOMKilled is the clearest case: Kubernetes sets the reason field to OOMKilled and the exit code to 137 (128 + 9, the signal number of SIGKILL). The kernel's out-of-memory killer terminated the process because the container exceeded its cgroup memory limit. The container did not choose to exit; the kernel killed it.

For connectivity failures, the exit code is typically 1 (generic application error) and the logs show timeout or connection-refused messages from attempts to reach a database, cache, or external API. The key diagnostic is to search the application logs for patterns like 'connection refused,' 'timeout,' 'no such host,' or 'TLS handshake failed.' You can verify by execing into the pod and testing connectivity manually with nc, curl, or nslookup to isolate whether the problem is DNS, a network policy, or service availability.

For application bugs, the exit code is 1 or sometimes 2 (misuse), and the logs show stack traces, null pointer exceptions, panic messages, or assertion failures. These crashes are usually deterministic: the same input or configuration triggers the same crash. You can distinguish them from connectivity issues because the error occurs during request processing or startup logic, not during a connection attempt.

The non-obvious gotcha is that OOMKilled can masquerade as an application bug if you only check logs. When the OOM killer strikes, the process is terminated immediately, so there may be no log line at all because the application never got a chance to write one. If you see a container with exit code 137, zero log output, and a high restart count, check the termination reason field directly. Also, a JVM application may show exit code 1 with a java.lang.OutOfMemoryError in its logs if it hits the JVM heap limit before it hits the cgroup limit. That is an application-level OOM, not a kernel OOMKill, and the fix is different (tune the JVM heap rather than simply raising the container memory limit).
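The decision order described above can be sketched as a small triage function. This is only an illustration: the function name, the category labels, and the log patterns are assumptions chosen for the example, and real logs will need patterns tuned to the application.

```python
from typing import Optional

# Hypothetical substrings; extend them to match your application's log format.
CONNECTIVITY_PATTERNS = ("connection refused", "timeout", "no such host", "tls handshake failed")
BUG_PATTERNS = ("traceback", "panic:", "nullpointerexception", "assertion")


def classify_restart(reason: Optional[str], exit_code: int, logs: str) -> str:
    """Classify a container termination from its reason, exit code, and previous logs."""
    text = logs.lower()
    # The termination reason is authoritative for kernel OOM kills, even with empty logs.
    if reason == "OOMKilled":
        return "kernel-oom"
    # A JVM heap exhaustion is an application-level OOM, not a kernel OOMKill.
    if "java.lang.outofmemoryerror" in text:
        return "application-oom"
    if any(pattern in text for pattern in CONNECTIVITY_PATTERNS):
        return "connectivity"
    if any(pattern in text for pattern in BUG_PATTERNS):
        return "application-bug"
    # SIGKILL without an OOMKilled reason: inspect pod events for the cause.
    if exit_code == 137:
        return "sigkill-unknown"
    return "unknown"


if __name__ == "__main__":
    print(classify_restart("OOMKilled", 137, ""))
    print(classify_restart("Error", 1, "dial tcp 10.0.0.5:5432: connection refused"))
```

The termination reason is checked before the logs on purpose, because an OOM-killed container often leaves no log output to match against.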

Code Example

# Step 1: Check termination reason and exit code
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason, exitCode, startedAt, finishedAt

# Step 2: If exit code 137, confirm OOMKilled
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -i 'oom\|killed\|reason' # Confirms OOMKilled

# Step 3: Check memory usage vs limits for OOMKilled
kubectl top pod payments-api-7d9f8b6c4-abc12 -n payments # Current memory usage
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}' # Configured memory limit

# Step 4: If exit code 1, check logs for connectivity vs application error
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=100 # Check for timeout/connection vs stack trace

# Step 5: Test connectivity from inside the pod
kubectl exec -n payments deploy/payments-api -- nc -zv payments-db.internal 5432 # Test database connectivity
kubectl exec -n payments deploy/payments-api -- nslookup redis-cache.payments.svc # Test DNS resolution

# Quick reference for exit codes:
# Exit 0   = Normal termination (container completed successfully)
# Exit 1   = Application error (check logs for stack trace or connection error)
# Exit 2   = Misuse or invalid arguments (treat as an application error and check logs)
# Exit 137 = 128 + 9, SIGKILL (kernel OOM kill, or kubelet kill after the grace period expires)
# Exit 143 = 128 + 15, SIGTERM (process exited on SIGTERM, e.g. pod deletion or a liveness-probe restart)

◈ Architecture Diagram

┌──────────────┐
│ Pod Restart  │
└──────┬───────┘
       ↓
┌──────────────┐
│ Exit Code?   │
├──────┬───────┤
│ 137  │  1    │
│ OOM  │ Logs? │
└──┬───┴───┬───┘
   ↓       ↓
┌─────┐ ┌────────┐
│ OOM │ │Timeout?│
│Kill │ ├────┬───┤
└─────┘ │Yes │No │
        ↓    ↓
     ┌────┐┌────┐
     │Conn││ Bug│
     └────┘└────┘

How do HPA and VPA work together for autoscaling in production?

advanced · scheduling · kubernetes

Quick Answer

HPA (Horizontal Pod Autoscaler) scales the number of Pod replicas based on metrics like CPU or custom metrics, while VPA (Vertical Pod Autoscaler) adjusts individual Pod resource requests and limits. Using them together requires careful configuration to avoid conflicts where both try to respond to the same metric.

Detailed Answer

Imagine a restaurant kitchen during peak hours. Horizontal scaling is hiring more cooks to handle more orders in parallel, with each cook taking a portion of the workload. Vertical scaling is upgrading your existing cooks to faster, more skilled chefs who can each handle more complex dishes. In practice you need both: more cooks for sheer volume, and better-equipped cooks so each one operates efficiently. That is the relationship between HPA and VPA in Kubernetes.

HPA watches specified metrics (CPU utilization, memory, or custom/external metrics) and adjusts the replica count of a Deployment, ReplicaSet, or StatefulSet. It runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period) that calculates the desired replicas using the formula: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). VPA, on the other hand, monitors the actual resource consumption of Pods over time and recommends or automatically updates the resource requests and limits in the Pod spec. VPA's update modes are Off (recommendations only), Initial (sets requests only at Pod creation), Recreate (evicts running Pods so they are recreated with updated requests), and Auto (currently equivalent to Recreate).

Internally, HPA queries the metrics API (metrics.k8s.io for resource metrics, custom.metrics.k8s.io for custom metrics, or external.metrics.k8s.io for metrics not tied to a Kubernetes object, such as a message queue's depth). The metrics-server or an adapter such as the Prometheus adapter populates these APIs. HPA calculates the ratio of current to desired metric values across all Pods, applies a tolerance (default 10%) to prevent flapping, and issues a scale request to the API server.

VPA consists of three components: the Recommender (analyzes historical usage and computes recommendations), the Updater (evicts Pods whose requests deviate significantly from recommendations), and the Admission Controller (mutates Pod specs at creation time to inject recommended requests). The Recommender uses a decaying histogram of resource usage to generate its recommendations.

In production, running HPA and VPA together on the same metric (like CPU) creates a conflict. HPA sees high CPU and adds replicas; VPA sees high CPU and increases requests per Pod. Both react to the same signal, leading to over-provisioning or oscillation. The recommended pattern is to use HPA for CPU-based scaling and VPA in recommendation-only mode (updateMode: Off) so operators can manually adjust requests based on VPA suggestions. Alternatively, use HPA with custom metrics (like requests-per-second from Prometheus) and let VPA manage CPU and memory requests in Auto mode, since the two then respond to different signals. Multidimensional Pod Autoscaler (MPA), available in some managed Kubernetes distributions, attempts to coordinate both axes natively.

A non-obvious gotcha is that VPA in Auto mode evicts Pods to apply new resource requests, so it causes rolling restarts that can impact availability if your PodDisruptionBudget is not configured correctly. Another trap: HPA uses resource requests as the baseline for percentage calculations (for example, an 80% CPU target means 80% of the CPU request), so if VPA increases the request, the same absolute CPU usage now represents a lower percentage, potentially causing HPA to scale in and reduce replicas. This feedback loop can destabilize your scaling behavior. Always set VPA minAllowed and maxAllowed bounds to prevent runaway resource allocation, and use HPA stabilization windows (behavior.scaleDown.stabilizationWindowSeconds) to dampen rapid fluctuations.
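The replica formula, the 10% tolerance, and the request-as-baseline trap can be worked through numerically. The sketch below is a simplified model, not the controller's implementation: the function names are hypothetical, and it ignores min/max replica bounds, stabilization windows, and unready Pods.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.10) -> int:
    """Apply desiredReplicas = ceil(currentReplicas * current / target) with a tolerance band."""
    ratio = current_value / target_value
    # Within the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def cpu_utilization_percent(usage_millicores: float, request_millicores: float) -> float:
    """Utilization is measured against the CPU request, not the node or the limit."""
    return 100.0 * usage_millicores / request_millicores


if __name__ == "__main__":
    # 4 replicas at 90% CPU against a 60% target scale out.
    print(desired_replicas(4, 90, 60))
    # Same absolute usage (400m), but the request is doubled from 500m to 1000m.
    before = cpu_utilization_percent(400, 500)
    after = cpu_utilization_percent(400, 1000)
    # With an 80% target, the lower percentage makes a 10-replica workload scale in.
    print(before, after, desired_replicas(10, after, 80))
```

The last example shows why a VPA-driven request increase can trigger an HPA scale-in even though actual CPU consumption did not change.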

Code Example

# HPA scaling on custom metric (requests-per-second) to avoid conflict with VPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa  # HPA for the payments API service
  namespace: payments  # Same namespace as the target Deployment
spec:
  scaleTargetRef:  # Reference to the workload being scaled
    apiVersion: apps/v1  # API version of the target
    kind: Deployment  # Scale a Deployment
    name: payments-api  # Name of the Deployment
  minReplicas: 3  # Never go below 3 replicas for HA
  maxReplicas: 25  # Cap at 25 to control costs
  metrics:  # Use custom metric to avoid conflict with VPA on CPU
    - type: Pods  # Per-pod custom metric
      pods:
        metric:
          name: http_requests_per_second  # Custom metric from Prometheus adapter
        target:
          type: AverageValue  # Target average across all Pods
          averageValue: "100"  # Scale up when RPS exceeds 100 per Pod
  behavior:  # Fine-tune scaling behavior to prevent flapping
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Percent  # Scale down by percentage
          value: 10  # Remove at most 10% of Pods per period
          periodSeconds: 60  # Evaluate every 60 seconds
    scaleUp:
      stabilizationWindowSeconds: 30  # React quickly to traffic spikes
      policies:
        - type: Pods  # Scale up by fixed number
          value: 4  # Add at most 4 Pods per period
          periodSeconds: 60  # Evaluate every 60 seconds
---
# VPA managing CPU and memory requests (Auto mode safe since HPA uses custom metric)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa  # VPA for the same payments API
  namespace: payments  # Same namespace
spec:
  targetRef:  # Reference to the workload
    apiVersion: apps/v1  # API version of the target
    kind: Deployment  # Target Deployment
    name: payments-api  # Same Deployment as HPA targets
  updatePolicy:
    updateMode: Auto  # Automatically evict and recreate Pods with new requests
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api  # Apply to the main container
        minAllowed:  # Floor to prevent under-provisioning
          cpu: 250m  # Minimum 250 millicores
          memory: 256Mi  # Minimum 256MB
        maxAllowed:  # Ceiling to prevent runaway costs
          cpu: "2"  # Maximum 2 CPU cores
          memory: 2Gi  # Maximum 2GB memory
        controlledResources:  # Only manage these resources
          - cpu  # VPA manages CPU requests
          - memory  # VPA manages memory requests

◈ Architecture Diagram

┌──────────┐         ┌──────────┐
│  Metrics │         │  Metrics │
│  Server  │         │Prometheus│
└────┬─────┘         └────┬─────┘
     │ cpu/mem             │ rps
     ↓                     ↓
┌──────────┐         ┌──────────┐
│   VPA    │         │   HPA    │
│ Adjusts  │         │ Adjusts  │
│ Requests │         │ Replicas │
└────┬─────┘         └────┬─────┘
     │                     │
     └──────────┬──────────┘
                ↓
         ┌──────────┐
         │ Payments │
         │   API    │
         │ Deploy   │
         └──────────┘

How do pod topology spread constraints work internally in the Kubernetes scheduler, and what production failures can occur when they interact with cluster autoscaling?

architect · scheduling · kubernetes

Quick Answer

Topology spread constraints tell the scheduler to distribute Pods across failure domains defined by node labels such as zone or hostname, using maxSkew to control imbalance. When combined with cluster autoscaling, problems arise if a zone has zero nodes — the autoscaler may not know about the zone, causing the scheduler to leave Pods pending indefinitely.

Detailed Answer

Think of seating guests at a wedding reception. You want to spread friends evenly across tables so no table is overcrowded and no group is isolated. The wedding planner checks how many people are at each table and seats the next guest at the emptiest one, but if a table does not exist yet (no physical table has been set up), the planner cannot seat anyone there even if the venue has room. Topology spread constraints in Kubernetes work the same way.

Topology spread constraints are declared in the Pod spec under topologySpreadConstraints. Each constraint specifies a topologyKey (a node label like topology.kubernetes.io/zone or kubernetes.io/hostname), a maxSkew (the maximum allowed difference in Pod count between the most-populated and least-populated domain), a whenUnsatisfiable behavior (DoNotSchedule or ScheduleAnyway), and a labelSelector to identify which Pods count toward the spread calculation.

Internally, the scheduler evaluates topology spread during the Filter and Score phases. In the Filter phase, it eliminates nodes where placing the Pod would violate maxSkew when whenUnsatisfiable is DoNotSchedule. In the Score phase, it ranks the remaining nodes by how well they balance the distribution. The scheduler uses the topologyKey label on existing nodes to define domains, so a domain only exists if at least one node carries that label value. It then counts matching Pods per domain and calculates whether the new Pod can land in each domain without exceeding maxSkew.

At production scale, the interaction with cluster autoscaling creates subtle failures. If a node pool in one availability zone scales to zero, that zone disappears from the scheduler's topology map. The scheduler only sees zones with active nodes, so it may consider a two-zone spread sufficient even when three zones are available. When maxSkew is 1 and whenUnsatisfiable is DoNotSchedule, the scheduler can leave Pods pending because it cannot place them in a zone that has no nodes, and the autoscaler may not create a node in the missing zone because it does not see pending Pods that specifically require it. This chicken-and-egg problem is one of the most common production issues with topology spread constraints.

The non-obvious gotcha is that the spread calculation counts every Pod that matches the labelSelector, whether or not it is ready or healthy and regardless of which ReplicaSet owns it. During a rolling update, Pods from the old ReplicaSet therefore still count toward the spread (recent scheduler versions skip only Pods that already carry a deletion timestamp), which can leave the new revision unevenly spread, or block new Pods, until the old ones are removed. Listing pod-template-hash under matchLabelKeys scopes the count to the current revision.

Architects should set minDomains to explicitly declare how many zones the spread should consider, use node affinity in combination with spread constraints to ensure the autoscaler knows about expected zones, and monitor for unschedulable Pods with topology spread violation events.
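The Filter-phase arithmetic and the effect of minDomains can be illustrated with a small model. This is a sketch under stated assumptions: the function name is hypothetical, the incoming Pod is assumed to match its own labelSelector, and node-level filtering such as taints and affinity is ignored.

```python
from typing import Dict, List


def schedulable_domains(pods_per_domain: Dict[str, int], max_skew: int,
                        min_domains: int = 1) -> List[str]:
    """Return the domains where one more matching Pod keeps skew within max_skew.

    Only domains that currently have at least one node appear in pods_per_domain,
    which mirrors how a zone with zero nodes is invisible to the scheduler.
    """
    if not pods_per_domain:
        return []
    global_min = min(pods_per_domain.values())
    # With fewer visible domains than minDomains, the global minimum is treated as 0.
    if len(pods_per_domain) < min_domains:
        global_min = 0
    return sorted(
        domain for domain, count in pods_per_domain.items()
        if count + 1 - global_min <= max_skew
    )


if __name__ == "__main__":
    two_zones = {"zone-a": 2, "zone-b": 2}  # zone-c has scaled to zero nodes
    print(schedulable_domains(two_zones, max_skew=1))                 # both zones accepted
    print(schedulable_domains(two_zones, max_skew=1, min_domains=3))  # nothing fits
```

Without minDomains the scheduler keeps packing the two visible zones. With minDomains set to 3 the Pod stays pending instead, which is the signal a node autoscaler needs to add capacity.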

Code Example

# Apply a Deployment with zone and node spread constraints
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages replicated Pods
metadata:
  name: checkout-api # Production checkout service
  namespace: payments # Team namespace
spec:
  replicas: 6 # Six replicas to spread across three zones with two per zone
  selector:
    matchLabels:
      app: checkout-api # Pod selector
  template:
    metadata:
      labels:
        app: checkout-api # Label used by spread constraint selector
    spec:
      topologySpreadConstraints:
      - maxSkew: 1 # Allows at most one Pod difference between zones
        topologyKey: topology.kubernetes.io/zone # Spreads across availability zones
        whenUnsatisfiable: DoNotSchedule # Strictly enforces zone balance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
        minDomains: 3 # Expects three zones even if some have zero nodes
      - maxSkew: 1 # Allows at most one Pod difference between nodes within a zone
        topologyKey: kubernetes.io/hostname # Spreads across individual nodes
        whenUnsatisfiable: ScheduleAnyway # Prefers balance but allows imbalance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
      containers:
      - name: api # Application container
        image: registry.company.com/checkout-api:3.7.2 # Versioned production image
        resources:
          requests:
            cpu: 250m # Minimum CPU for scheduling
            memory: 512Mi # Minimum memory for scheduling

# Check Pod distribution across zones (the zone label lives on nodes, not on Pods)
kubectl get pods -n payments -l app=checkout-api -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' # Pod-to-node mapping
kubectl get nodes -L topology.kubernetes.io/zone # Node-to-zone mapping

# Identify Pods pending due to topology spread violations
kubectl get events -n payments --field-selector reason=FailedScheduling | grep topology

◈ Architecture Diagram

┌─── Zone A ──┐ ┌─── Zone B ──┐ ┌─── Zone C ──┐
│ ┌────┐┌────┐│ │ ┌────┐┌────┐│ │ ┌────┐┌────┐│
│ │Pod1││Pod2││ │ │Pod3││Pod4││ │ │Pod5││Pod6││
│ └────┘└────┘│ │ └────┘└────┘│ │ └────┘└────┘│
│  maxSkew=1  │ │  maxSkew=1  │ │  maxSkew=1  │
└─────────────┘ └─────────────┘ └─────────────┘

How should architects combine Vertical Pod Autoscaler and Horizontal Pod Autoscaler in the same cluster without creating scaling conflicts, and when does KEDA fit better than either?

architect · general · kubernetes

Quick Answer

VPA and HPA conflict when both scale on the same metric because VPA resizes Pods while HPA changes Pod count, creating feedback loops. The safe pattern is VPA on memory and HPA on CPU or custom metrics, or VPA in recommendation-only mode. KEDA fits better for event-driven workloads that scale to zero or react to external queue depth rather than Pod-level CPU or memory.

Detailed Answer

Think of a restaurant kitchen during dinner rush. The horizontal approach is adding more cooks to handle more orders. The vertical approach is giving each cook a bigger stove and more counter space so they can cook faster. If you try both approaches based on the same signal (how backed up the order queue is), the kitchen oscillates between adding cooks and upgrading stoves in a confusing loop. You need different signals for each scaling axis.

The Horizontal Pod Autoscaler watches metrics like CPU utilization, memory utilization, or custom metrics, and adjusts the number of Pod replicas to meet a target value. The Vertical Pod Autoscaler observes actual resource consumption over time and recommends or applies changes to the resource requests and limits of individual Pods. When both use CPU as their metric, VPA may increase a Pod's CPU request, which lowers the Pod's utilization percentage, which then triggers HPA to scale down replicas, which raises utilization again, creating an oscillation loop.

The recommended production pattern separates their concerns. Run VPA in recommendation mode (updateMode: Off) or have it target only memory, while HPA scales on CPU or custom metrics. Alternatively, use VPA in Auto mode for stateful workloads where horizontal scaling is impractical (databases, caches, or single-instance controllers) and reserve HPA for stateless services that benefit from replica scaling. Some teams use VPA recommendations to set initial resource requests in CI/CD pipelines rather than letting VPA mutate Pods at runtime, which avoids the eviction and restart that VPA triggers when it applies a new recommendation.

At production scale, KEDA (Kubernetes Event-Driven Autoscaler) fills a gap that neither HPA nor VPA addresses well. KEDA scales based on external event sources: message queue depth in Kafka or SQS, pending items in a Redis stream, HTTP request rate from Prometheus, or custom metrics from any source with a KEDA scaler. Critically, KEDA can scale Deployments to zero replicas when there is no work, which standard HPA cannot do by default (its minimum is one replica). This makes KEDA the right choice for batch processors, event consumers, and asynchronous workers where idle cost matters. KEDA works through ScaledObject resources that create HPA objects under the hood, so it integrates with existing Kubernetes autoscaling infrastructure.

The non-obvious gotcha is that VPA in Auto mode restarts Pods when it changes resource requests, which can cause brief service disruption and interact badly with PodDisruptionBudget limits. Teams often discover this during peak traffic when VPA decides to right-size all replicas at once. Architects should set VPA update policies with minReplicas and eviction requirements, and test VPA behavior during high-traffic scenarios before enabling Auto mode on critical services.
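The difference between CPU-driven and queue-driven scaling is easiest to see as arithmetic. The sketch below models queue-depth scaling with scale-to-zero in a simplified way: the function name and parameters are assumptions, and real KEDA delegates scaling between 1 and N replicas to an HPA while handling the 0-to-1 transition itself, with cooldown settings that are not modeled here.

```python
import math


def queue_replicas(queue_length: int, target_per_replica: int,
                   min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Size a consumer Deployment from queue depth, allowing scale to zero when idle."""
    if queue_length <= 0:
        return min_replicas
    wanted = math.ceil(queue_length / target_per_replica)
    # Any pending work needs at least one replica, even when min_replicas is 0.
    floor = max(min_replicas, 1)
    return max(floor, min(wanted, max_replicas))


if __name__ == "__main__":
    for depth in (0, 1, 250, 10_000):
        print(depth, queue_replicas(depth, target_per_replica=100))
```

Because the signal is backlog rather than CPU, replicas are added before the consumers become CPU-bound, and the workload costs nothing while the queue is empty.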

Code Example

# Deploy VPA in recommendation mode for a service — no automatic Pod restarts
apiVersion: autoscaling.k8s.io/v1 # VPA API group
kind: VerticalPodAutoscaler # Recommends or applies resource changes
metadata:
  name: checkout-api-vpa # VPA for the checkout service
  namespace: payments # Same namespace as the target
spec:
  targetRef:
    apiVersion: apps/v1 # References a Deployment
    kind: Deployment # Target workload type
    name: checkout-api # The Deployment to analyze
  updatePolicy:
    updateMode: "Off" # Generates recommendations without applying them
  resourcePolicy:
    containerPolicies:
    - containerName: api # Target the main container
      minAllowed:
        cpu: 100m # Never recommend below 100m CPU
        memory: 256Mi # Never recommend below 256Mi memory
      maxAllowed:
        cpu: 2 # Cap recommendations at 2 CPU cores
        memory: 4Gi # Cap recommendations at 4Gi memory

---
# Deploy HPA scaling on CPU utilization (safe alongside the VPA above, which only recommends in Off mode)
apiVersion: autoscaling/v2 # HPA v2 API for custom metrics support
kind: HorizontalPodAutoscaler # Scales replica count
metadata:
  name: checkout-api-hpa # HPA for the checkout service
  namespace: payments # Same namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1 # References the same Deployment
    kind: Deployment # Target workload type
    name: checkout-api # The Deployment to scale
  minReplicas: 3 # Never scale below three replicas
  maxReplicas: 20 # Cap at twenty replicas
  metrics:
  - type: Resource # Uses built-in resource metrics
    resource:
      name: cpu # Scales on CPU utilization only
      target:
        type: Utilization # Target a percentage of requests
        averageUtilization: 70 # Scale up when average CPU exceeds 70 percent

# Read VPA recommendations without applying them
kubectl get vpa checkout-api-vpa -n payments -o jsonpath='{.status.recommendation.containerRecommendations[*]}'

◈ Architecture Diagram

┌─────────────────────────────────┐
│        Scaling Decision         │
├───────────┬───────────┬─────────┤
│  HPA      │  VPA      │  KEDA   │
│ ─ ─ ─ ─   │ ─ ─ ─ ─   │ ─ ─ ─   │
│ CPU/custom│ Memory    │ Queue   │
│ → replicas│ → sizing  │ → 0..N  │
│ stateless │ stateful  │ events  │
└───────────┴───────────┴─────────┘

How does Karpenter differ from Cluster Autoscaler in node provisioning strategy, and when should architects choose one over the other for production workloads?

architect · scheduling · kubernetes

Quick Answer

Karpenter provisions nodes directly from cloud provider APIs based on pending Pod requirements, selecting instance types dynamically without predefined node groups. Cluster Autoscaler adjusts the size of existing node groups. Karpenter is faster and more flexible for heterogeneous workloads, while Cluster Autoscaler is simpler for teams with well-defined node group templates and multi-cloud portability needs.

Detailed Answer

Think of hiring staff for a catering company. Cluster Autoscaler is like posting a job ad for a specific role: you have predefined job descriptions (node groups), and when you need more people, you hire from those templates. Karpenter is like a staffing agency that looks at the actual tasks on the board, finds a person with exactly the right skills and availability, and places them immediately. The agency is faster and more flexible, but you need to trust it with your hiring criteria.

Cluster Autoscaler has been the standard Kubernetes node scaling solution since early Kubernetes versions. It works by monitoring pending Pods that cannot be scheduled due to insufficient resources, then scaling up one of the configured node groups (Auto Scaling Groups on AWS, Managed Instance Groups on GCP, or VM Scale Sets on Azure). It also scales down by identifying underutilized nodes and draining them. The key limitation is that node groups must be pre-configured with specific instance types, and the autoscaler chooses among existing groups rather than selecting the optimal instance type for each workload.

Karpenter, originally created by AWS and since donated to the CNCF through Kubernetes SIG Autoscaling, takes a fundamentally different approach. Instead of managing node groups, Karpenter watches for unschedulable Pods and directly provisions compute from the cloud provider API, choosing the best instance type based on Pod resource requirements, node selectors, affinity rules, and topology spread constraints. NodePool resources define constraints like allowed instance families, availability zones, capacity types (on-demand or spot), and expiration policies. Karpenter evaluates all pending Pods together and can bin-pack them onto a single optimally sized instance rather than scaling a node group one unit at a time.

At production scale, Karpenter typically provisions nodes in under 60 seconds compared to 2-5 minutes for Cluster Autoscaler, because it skips the Auto Scaling Group scaling process and calls the EC2 API directly. Karpenter also handles node disruption proactively through consolidation, which replaces underutilized nodes with cheaper or better-fitting ones, and expiration, which rotates nodes to pick up AMI updates. However, at the time of writing Karpenter is most mature on AWS, with Azure support in development and GCP support community-driven. Teams needing multi-cloud portability or running on GKE or AKS may still prefer Cluster Autoscaler.

The non-obvious gotcha is that Karpenter's flexibility requires careful constraint definition. Without proper NodePool limits on instance families, maximum Pods per node, or total cluster capacity, Karpenter can provision very large or very expensive instances. It can also create infrastructure drift if the team's Terraform or other IaC does not account for Karpenter-managed nodes. Architects should set explicit NodePool constraints, integrate Karpenter-provisioned nodes into their monitoring and cost dashboards, and understand that Karpenter manages node lifecycle independently of any external node group definition.
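The idea of evaluating all pending Pods together and picking one right-sized instance can be sketched as follows. This is a toy model and not Karpenter's actual algorithm: the catalog entries, names, and prices are hypothetical, and it ignores daemonsets, system-reserved capacity, spot availability, and scheduling constraints.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceType:
    name: str
    cpu_millicores: int
    memory_mib: int
    hourly_price: float  # hypothetical price, for illustration only


def cheapest_fit(pending: List[Tuple[int, int]],
                 catalog: List[InstanceType]) -> Optional[InstanceType]:
    """Pick the cheapest single instance that fits every pending Pod's (cpu, memory) request."""
    need_cpu = sum(cpu for cpu, _ in pending)
    need_mem = sum(mem for _, mem in pending)
    fits = [i for i in catalog
            if i.cpu_millicores >= need_cpu and i.memory_mib >= need_mem]
    return min(fits, key=lambda i: i.hourly_price, default=None)


# Hypothetical catalog; a NodePool's requirements play the role of this allow-list.
CATALOG = [
    InstanceType("small", 2000, 4096, 0.10),
    InstanceType("medium", 4000, 8192, 0.20),
    InstanceType("large", 8000, 16384, 0.40),
]

if __name__ == "__main__":
    six_pods = [(250, 512)] * 6  # six Pods requesting 250m CPU and 512Mi each
    print(cheapest_fit(six_pods, CATALOG))
```

A node-group autoscaler would instead add one more node of whatever type the group is pinned to, which is the contrast the passage above draws.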

Code Example

# Karpenter NodePool that provisions cost-optimized compute for general workloads
apiVersion: karpenter.sh/v1 # Karpenter API group
kind: NodePool # Defines provisioning constraints and behavior
metadata:
  name: general-workloads # Pool name for general-purpose services
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type # Defines instance purchasing model
        operator: In
        values: [on-demand, spot] # Allows both on-demand and spot instances
      - key: node.kubernetes.io/instance-type # Limits the allowed instance types
        operator: In
        values: [m6i.large, m6i.xlarge, m7i.large, m7i.xlarge, c6i.large, c6i.xlarge] # Curated list of right-sized instances
      - key: topology.kubernetes.io/zone # Constrains to specific zones
        operator: In
        values: [us-east-1a, us-east-1b, us-east-1c] # All three availability zones
      nodeClassRef:
        group: karpenter.k8s.aws # AWS-specific node configuration
        kind: EC2NodeClass # References the EC2 node template
        name: general-al2023 # Node class with AL2023 AMI and security groups
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized # Replaces wasteful nodes automatically
    consolidateAfter: 30s # Waits 30 seconds before consolidating
  limits:
    cpu: 200 # Maximum total CPU across all nodes in this pool
    memory: 800Gi # Maximum total memory across all nodes in this pool
---
apiVersion: karpenter.k8s.aws/v1 # AWS-specific Karpenter API
kind: EC2NodeClass # Configures the EC2 instance template
metadata:
  name: general-al2023 # Referenced by the NodePool above
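spec:
  role: KarpenterNodeRole-payments-cluster # Hypothetical IAM role name; the v1 API expects role or instanceProfile
  amiSelectorTerms:
  - alias: al2023@latest # Uses the latest Amazon Linux 2023 EKS-optimized AMI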
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: payments-cluster # Discovers subnets by tag
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: payments-cluster # Discovers security groups by tag

# Check which instances Karpenter provisioned (instance type, zone, and capacity type are exposed as labels)
kubectl get nodeclaims -o custom-columns='NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type,ZONE:.metadata.labels.topology\.kubernetes\.io/zone,CAPACITY:.metadata.labels.karpenter\.sh/capacity-type'

◈ Architecture Diagram

┌──────────────────────────────────────┐
│         Cluster Autoscaler           │
│  Pending → ASG Scale → Node Ready    │
│  (2-5 min, fixed instance types)     │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│         Karpenter                    │
│  Pending → EC2 API → Node Ready      │
│  (<60s, dynamic instance selection)  │
└──────────────────────────────────────┘

How do HPA, VPA, and Cluster Autoscaler work together to handle traffic spikes?

intermediate · scheduling · kubernetes

Quick Answer

HPA scales pods horizontally based on CPU, memory, or custom metrics. VPA adjusts pod resource requests and limits vertically to right-size containers. Cluster Autoscaler adds or removes nodes when pods cannot be scheduled due to insufficient cluster capacity. Together, they form a three-layer scaling system: right-size pods, scale pod count, then scale infrastructure.

Detailed Answer

Think of a restaurant handling a dinner rush. VPA is like giving each chef a bigger workstation when they are cramped (vertical scaling of individual workers). HPA is like calling in more chefs when orders pile up (horizontal scaling of worker count). Cluster Autoscaler is like opening additional kitchen rooms when there is no space for more chefs (infrastructure scaling). Each addresses a different bottleneck, and they work best when coordinated.

The Horizontal Pod Autoscaler (HPA) watches metrics and adjusts the replica count of a Deployment or StatefulSet. CPU utilization is the most common target, but it can also target memory, custom metrics (requests per second from Prometheus), or external metrics (SQS queue depth from CloudWatch). HPA evaluates every 15 seconds by default (configurable), calculates the desired replica count using the formula desiredReplicas = ceil(currentReplicas * (currentMetric / targetMetric)), and scales accordingly. It skips scaling when the ratio is within a small tolerance of 1.0 (10 percent by default), respects minReplicas and maxReplicas boundaries, and has stabilization windows to prevent flapping.

The Vertical Pod Autoscaler (VPA) analyzes historical resource usage and recommends or automatically adjusts the CPU and memory requests and limits for containers. Its main update modes are Off (only recommends), Initial (sets resources only at pod creation), and Auto or Recreate (evicts and recreates pods with updated resources). VPA solves the problem of developers guessing resource requests, either setting them too high (wasting cluster capacity) or too low (causing throttling and OOMKills). It uses a recommender component that analyzes metrics over time and an updater that evicts pods needing adjustment.

The Cluster Autoscaler watches for pods stuck in Pending state because no node has sufficient resources. When it detects unschedulable pods, it evaluates which node group can accommodate them and triggers the cloud provider to add nodes. Conversely, it removes underutilized nodes (below 50 percent utilization by default) after a cooldown period, draining pods while respecting PodDisruptionBudgets. It works with AWS Auto Scaling Groups, GCP Managed Instance Groups, or Azure VM Scale Sets.

In production, coordination between these three autoscalers requires careful planning. The critical rule is never to use HPA and VPA on the same metric for the same pod, because they will fight: HPA tries to add replicas while VPA tries to resize existing ones, creating oscillation. The recommended pattern is HPA on CPU with VPA on memory in recommendation-only mode, or HPA on custom metrics while VPA handles resource right-sizing in Initial mode. Cluster Autoscaler cannot add capacity instantly: a new node has to be provisioned, boot, and join the cluster before pods can land on it, so spikes that need new nodes are at risk of dropped traffic. Teams configure node pool warm-up strategies or use Karpenter for faster, more flexible node provisioning.

The non-obvious gotcha is scaling lag. HPA reacts in seconds, but new pods need time to start (image pull, readiness probe). Cluster Autoscaler typically takes 1-3 minutes to provision new nodes. During a sudden traffic spike, the system can drop requests for several minutes. Mitigation strategies include setting higher minReplicas during known peak windows, using pod priority to preempt less critical workloads, over-provisioning with pause pods (low-priority placeholder pods that get evicted instantly to free capacity), and combining with KEDA for event-driven scaling that responds to queue depth before CPU rises.
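
The replica calculation described above can be sketched in a few lines of Python. This is a simplified illustration, not the controller's actual code: the function name and parameters are hypothetical, the 10 percent tolerance is assumed to be the default, and stabilization windows and per-pod readiness handling are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the replica count an HPA-style controller would aim for."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count
        desired = current_replicas
    else:
        # Round up so the per-pod metric ends at or below the target
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured boundaries
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 90% average CPU against a 65% target
print(desired_replicas(4, 90, 65, min_replicas=4, max_replicas=40))  # 6
```

Rounding up rather than to the nearest integer is what keeps the post-scale average at or below the target; the clamp is why a very large spike still stops at maxReplicas.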

Code Example

# Check current HPA status including current vs target metrics and replica count
kubectl get hpa -n payments

# Describe HPA to see scaling events, conditions, and metric sources
kubectl describe hpa payments-api -n payments

# Example HPA manifest scaling on CPU utilization (a requests-per-second metric would be an additional entry under metrics)
# ---
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: payments-api
#   namespace: payments
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: payments-api
#   minReplicas: 4
#   maxReplicas: 40
#   metrics:
#   - type: Resource
#     resource:
#       name: cpu
#       target:
#         type: Utilization
#         averageUtilization: 65

# Check VPA recommendations without applying them (Off mode)
kubectl get vpa payments-api -n payments -o yaml | grep -A10 recommendation

# Check Cluster Autoscaler status and recent scaling decisions
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Look for pods stuck in Pending that might trigger Cluster Autoscaler
kubectl get pods -A --field-selector=status.phase=Pending

◈ Architecture Diagram

┌─────────────────────────────────────────────────┐
│                Traffic Spike                      │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│  Layer 1: HPA (seconds)                          │
│  Scale pods 4 → 12 based on CPU/custom metrics   │
└────────────────────┬────────────────────────────┘
                     ↓ pods Pending?
┌─────────────────────────────────────────────────┐
│  Layer 2: Cluster Autoscaler (1-3 minutes)       │
│  Add nodes to fit unschedulable pods             │
└────────────────────┬────────────────────────────┘
                     ↓ right-size over time
┌─────────────────────────────────────────────────┐
│  Layer 3: VPA (background)                       │
│  Adjust resource requests based on actual usage  │
└─────────────────────────────────────────────────┘

Why do Kubernetes pods get evicted, and what commands and steps do you use to diagnose and resolve pod eviction?

intermediate · pods · kubernetes

Quick Answer

Pods are evicted when a node is under resource pressure — disk (DiskPressure), memory (MemoryPressure), or PID exhaustion (PIDPressure). The kubelet evicts pods based on QoS class priority: BestEffort first, then Burstable, then Guaranteed last. Diagnosis starts with kubectl describe node to check conditions and kubectl get events to find eviction reasons.

Detailed Answer

Think of an overcrowded bus. When the bus exceeds its weight limit, the driver must ask some passengers to leave. Passengers without tickets (BestEffort pods) are asked first, then those with partial tickets (Burstable pods), and finally full-fare passengers (Guaranteed pods) are the last to go. The bus driver does not choose randomly; there is a clear priority order based on who has the strongest claim to stay.

In Kubernetes, pod eviction (node-pressure eviction) is the kubelet's mechanism for protecting node stability. When a node runs low on a critical resource such as memory, ephemeral storage, or process IDs, the kubelet begins evicting pods to reclaim that resource. This is different from preemption (which is the scheduler removing lower-priority pods to make room for higher-priority ones) and different from API-initiated eviction (which is used during node drain for maintenance).

Internally, the kubelet monitors resource usage against configurable eviction thresholds. The default hard eviction thresholds include memory.available < 100Mi and nodefs.available < 10%; soft thresholds have no defaults and must be configured explicitly together with a grace period. When a threshold is breached, the kubelet sets the corresponding node condition (MemoryPressure, DiskPressure, PIDPressure) and begins ranking pods for eviction. The ranking considers whether a pod's usage exceeds its requests, then pod priority, then how far usage exceeds requests. In practice this tracks QoS class: BestEffort pods (no resource requests or limits) are evicted first, Burstable pods (requests set but lower than limits) are evicted next based on how much they exceed their requests, and Guaranteed pods (requests equal limits for all containers) are evicted last.

At production scale, the most common eviction cause is ephemeral storage exhaustion from container logs, emptyDir volumes, or container writable layers growing unbounded. Memory-based evictions happen when applications have memory leaks or when resource limits are set too low for actual workload requirements. Teams should monitor node conditions, set appropriate resource requests and limits to ensure critical pods get Guaranteed QoS, configure log rotation to prevent disk pressure, and use PodDisruptionBudgets to keep enough replicas available during planned disruptions such as node drains (node-pressure eviction itself does not honor PodDisruptionBudgets).

The non-obvious gotcha is that eviction thresholds have both soft and hard variants. Soft evictions give pods a grace period to terminate cleanly, while hard evictions kill pods immediately. If a hard eviction threshold is hit (by default, memory.available < 100Mi), the kubelet kills pods without waiting for graceful shutdown, which can cause data loss or incomplete request processing. Architects should configure soft thresholds with enough buffer above the hard ones so that hard thresholds are rarely reached.
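
To make the QoS rules above concrete, here is a minimal Python sketch that derives a pod's QoS class from its containers' resources. It is an illustration only: the function name and the dict-based input shape are hypothetical, quantities are compared as plain strings (so "0.25" and "250m" would not match), and only cpu and memory are considered.

```python
def qos_class(containers):
    """Return the QoS class for a list of container resource dicts.

    Each container is a dict with optional 'requests' and 'limits' maps,
    for example {'requests': {'cpu': '250m'}, 'limits': {'cpu': '250m'}}.
    """
    any_resources_set = False
    guaranteed = True
    for container in containers:
        requests = container.get('requests', {})
        limits = container.get('limits', {})
        if requests or limits:
            any_resources_set = True
        for resource in ('cpu', 'memory'):
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_resources_set:
        return 'BestEffort'
    return 'Guaranteed' if guaranteed else 'Burstable'


pod = [{'requests': {'cpu': '250m', 'memory': '512Mi'},
        'limits': {'cpu': '250m', 'memory': '512Mi'}}]
print(qos_class(pod))  # Guaranteed
```

A single container without a memory limit is enough to drop the whole pod from Guaranteed to Burstable, which is a common surprise when sidecars are injected automatically.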

Code Example

# Check node conditions for resource pressure
kubectl describe node ip-10-0-1-42.ec2.internal | grep -A5 'Conditions' # Shows MemoryPressure, DiskPressure status

# Find eviction events in the namespace
kubectl get events -n payments --field-selector reason=Evicted --sort-by='.lastTimestamp' # Lists evicted pods with reasons

# Check which pod was evicted and why
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.reason}' # Shows 'Evicted'
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.message}' # Shows the resource that triggered eviction

# Check node resource usage
kubectl top node ip-10-0-1-42.ec2.internal # Shows current CPU and memory usage

# Check disk usage on the node (requires node access)
kubectl debug node/ip-10-0-1-42.ec2.internal -it --image=busybox -- chroot /host df -h # Node filesystem is mounted at /host in the debug pod; chroot shows the node's own filesystem usage

# Check QoS class of pods to understand eviction priority
kubectl get pods -n payments -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass' # Shows BestEffort, Burstable, or Guaranteed

# Set proper resource requests equal to limits for Guaranteed QoS
# resources:
#   requests:
#     cpu: 250m      # Request equals limit for Guaranteed QoS
#     memory: 512Mi  # Request equals limit for Guaranteed QoS
#   limits:
#     cpu: 250m      # Matches request
#     memory: 512Mi  # Matches request

◈ Architecture Diagram

┌──────────────────────────┐
│ Node Resource Pressure   │
│ Memory < 100Mi           │
└────────────┬─────────────┘
             ↓
┌──────────────────────────┐
│ Eviction Priority        │
│ 1. BestEffort  (first)   │
│ 2. Burstable   (next)    │
│ 3. Guaranteed  (last)    │
└──────────────────────────┘

How do you troubleshoot a pod stuck in CrashLoopBackOff, and what are the most common root causes?

intermediate · pods · kubernetes

Quick Answer

CrashLoopBackOff means the container starts, crashes, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5 minutes). Common causes are application startup errors, missing environment variables or secrets, misconfigured commands or entrypoints, failed health probes, and OOMKilled. Diagnosis uses kubectl logs --previous, kubectl describe pod, and checking exit codes.

Detailed Answer

Think of a light switch connected to a circuit breaker. You flip the switch (container starts), the circuit overloads (container crashes), and the breaker trips (Kubernetes waits before retrying). Each time you try again, the breaker waits longer before allowing another attempt. CrashLoopBackOff is Kubernetes telling you that the container keeps failing and the wait time between restarts is increasing.

In Kubernetes, CrashLoopBackOff is not a separate error state; it is the backoff delay that the kubelet applies after repeated container crashes. The container exits with a non-zero code, the kubelet restarts it after 10 seconds, it crashes again, the kubelet waits 20 seconds, then 40, then 80, capping at 300 seconds (5 minutes). The pod status shows CrashLoopBackOff during these waiting periods and Error or Completed when the container actually exits.

The most common root causes fall into four categories. Application errors: the application throws an unhandled exception during startup because a required database is unreachable, a configuration file is malformed, or a required API key is missing. Configuration errors: the container command or args field is wrong (pointing to a script that does not exist in the image), the image tag points to a version with a different entrypoint, or a required environment variable is not set. Resource errors: the container is OOMKilled immediately on startup because the memory limit is too low for the JVM heap or the application's baseline memory footprint. Probe errors: an aggressive liveness probe kills the container before it finishes starting up, especially for Java applications with long startup times.

At production scale, the diagnostic sequence is: first check the exit code with kubectl describe pod (exit code 1 = application error, 137 = OOMKilled/SIGKILL, 143 = SIGTERM). Then check the previous container's logs with kubectl logs --previous, since the current container may have already crashed. Check whether the container image recently changed with kubectl rollout history. Verify that ConfigMaps, Secrets, and PersistentVolumeClaims referenced by the pod actually exist in the namespace.

The non-obvious gotcha is that CrashLoopBackOff can be caused by a liveness probe that is too aggressive during startup. If the liveness probe starts checking before the application is ready and the initialDelaySeconds is too short, the probe fails, the kubelet kills the container, it restarts, and the cycle continues. The fix is to use a startup probe with a longer timeout to protect the liveness probe during application initialization, or to increase the liveness probe's initialDelaySeconds and failureThreshold.
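
The backoff schedule and the exit-code triage described above can be sketched in Python. This is an illustrative model, not kubelet code: the function names are hypothetical, and the 10-second base and 300-second cap are taken from the figures in this answer.

```python
def restart_delays(restarts, base=10, cap=300):
    """Seconds waited before each of the first `restarts` restarts."""
    return [min(base * 2 ** attempt, cap) for attempt in range(restarts)]


# Exit codes above 128 mean "terminated by signal (code - 128)"
EXIT_CODE_HINTS = {
    0: 'clean exit (check restartPolicy and whether the process should stay running)',
    1: 'application error (read the previous container logs)',
    137: 'SIGKILL (128 + 9): OOMKilled or a forced kill',
    143: 'SIGTERM (128 + 15): asked to stop, for example after a failed liveness probe',
}


def describe_exit(code):
    """Map a container exit code to a first diagnostic hint."""
    return EXIT_CODE_HINTS.get(code, f'exit code {code}: check application logs')


print(restart_delays(7))   # [10, 20, 40, 80, 160, 300, 300]
print(describe_exit(137))
```

The doubling means a pod that has crashed six or more times is retried only once every five minutes, which is why a fix can appear not to work until the next restart window arrives.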

Code Example

# Check pod status and restart count
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments # Shows status CrashLoopBackOff and restart count

# Get the exit code to categorize the failure
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -A10 'Last State' # Exit code 1=app error, 137=OOMKilled

# Check logs from the PREVIOUS crashed container (critical — current container may already be dead)
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=50 # Shows why the last container died

# Check if required ConfigMaps and Secrets exist
kubectl get configmap payments-config -n payments # Verify ConfigMap exists
kubectl get secret payments-db-credentials -n payments # Verify Secret exists

# Check the command configured in the pod spec (empty output means the image's default ENTRYPOINT is used)
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.spec.containers[0].command}' # Shows configured command

# Check if OOMKilled is the cause
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason and exit code

# Fix startup probe to prevent liveness probe from killing slow-starting apps
# startupProbe:
#   httpGet:
#     path: /health        # Startup health endpoint
#     port: 8080           # Application port
#   failureThreshold: 30   # Allow 30 x 10s = 5 minutes to start
#   periodSeconds: 10      # Check every 10 seconds during startup

◈ Architecture Diagram

┌──────────┐
│ Start    │
└────┬─────┘
     ↓
┌──────────┐
│ Crash    │←─── Exit 1: App Error
│ (exit≠0) │←─── Exit 137: OOMKill
└────┬─────┘←─── Exit 143: Probe
     ↓
┌──────────┐
│ Backoff  │
│10→20→40s │
└────┬─────┘
     ↓
┌──────────┐
│ Restart  │
└──────────┘

How do you manage infrastructure drift detection at enterprise scale with scheduled plan runs?

advanced · state · terraform

Quick Answer

Configure Terraform Enterprise workspaces with scheduled plan-only runs (e.g., nightly) that detect differences between actual infrastructure and the Terraform state. Alert on drift via webhook notifications, categorize drift by severity, and either auto-remediate safe drifts or create tickets for manual review.

Detailed Answer

Think of drift detection like a nightly security guard doing rounds. The guard has a checklist of how every door, window, and safe should look. If something has changed (a window left open, a safe combination altered, a new lock installed), the guard reports it. Drift detection in Terraform works the same way: scheduled plan runs compare what actually exists in AWS against what Terraform expects, and any discrepancy is flagged for investigation.

Infrastructure drift occurs when the actual state of cloud resources diverges from the Terraform-declared state. This happens through manual console changes (someone modifies a security group via the AWS console), changes by other tools (an automation script modifies a resource that Terraform also manages), auto-scaling events that modify resource attributes, and AWS service updates that change default behaviors. In a banking environment, drift is a compliance risk: if your Terraform code declares that an RDS instance has encryption enabled but someone disables it through the console, your compliance posture is degraded and your Terraform state does not reflect reality.

Terraform Enterprise can run plan-only checks on workspaces on a schedule, either through its built-in health assessments (drift detection) or by having an external scheduler queue plan-only runs through the API. The check runs terraform plan automatically at a set interval, typically nightly for production workspaces and weekly for non-production. The plan compares the current Terraform configuration and state against the actual infrastructure via provider API calls. If the plan detects changes (resources to update, create, or destroy), it means drift has occurred. TFE marks the run as 'planned and finished' with a non-empty plan, and you can configure webhook notifications to alert your team via Slack, PagerDuty, or a custom drift-tracking system.

At enterprise scale, not all drift is equal. A changed tag is low-severity drift that might be auto-remediated. A modified security group rule is high-severity drift that requires immediate investigation, because someone may have opened a port that violates PCI-DSS. A deleted resource is critical drift that needs urgent attention. Build a drift classification system: the webhook from TFE sends the plan summary to a Lambda function or custom service that parses the plan output, categorizes each change by resource type and attribute, assigns a severity level, and routes the notification appropriately. Low-severity drift creates a Jira ticket for the next sprint. High-severity drift alerts the security team. Critical drift triggers an incident response.

For drift remediation, there are two approaches. Auto-remediation configures TFE to automatically apply the plan when drift is detected, restoring infrastructure to the declared state. This is appropriate for low-risk drifts like tag changes or description updates, but dangerous for high-risk resources: auto-applying a plan that wants to recreate an RDS instance would cause downtime. Selective auto-remediation uses Sentinel policies to evaluate the drift plan: if the only changes are to tags and descriptions, auto-apply; if the plan includes any destroy or replace actions, block and alert. Manual remediation requires a human to review the drift, determine whether the Terraform code or the infrastructure should be updated, and either apply the plan or update the code to match the new reality.

The biggest gotcha is drift detection generating noise that teams ignore. If your scheduled plans consistently show drift from resources that Terraform partially manages (like ASG instance counts that change with auto-scaling), the team learns to dismiss all drift alerts. Use lifecycle ignore_changes blocks in Terraform for attributes that are expected to drift (like the ASG desired_capacity), and ensure your scheduled plans only flag genuine unauthorized changes.

Another gotcha is API rate limiting: running terraform plan across 200 workspaces simultaneously hammers the AWS API. Stagger your scheduled plans across the night, and use workspace tags to group and schedule them in batches. Finally, drift detection only catches drift in resources Terraform manages; resources created manually outside of Terraform are invisible. Complement TFE drift detection with AWS Config rules that detect unmanaged resources.
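
The batching advice above can be sketched as a small Python helper that spreads workspaces across a nightly window. This is one possible approach under stated assumptions: the function name, the window (starting at 01:00 for four hours), and the batch size of 25 are all hypothetical, and how the resulting cron expressions get registered depends on whatever scheduler you use.

```python
def stagger_schedules(workspaces, start_hour=1, window_hours=4, batch_size=25):
    """Assign each workspace a daily cron expression inside the window."""
    batches = [workspaces[i:i + batch_size]
               for i in range(0, len(workspaces), batch_size)]
    # Evenly spaced start times, one slot per batch
    slot_minutes = (window_hours * 60) // max(len(batches), 1)
    schedules = {}
    for index, batch in enumerate(batches):
        offset = index * slot_minutes
        hour = (start_hour + offset // 60) % 24
        minute = offset % 60
        for name in batch:
            schedules[name] = f"{minute} {hour} * * *"
    return schedules


workspaces = [f"ws-{i}" for i in range(200)]
schedules = stagger_schedules(workspaces)
print(schedules["ws-0"], "|", schedules["ws-199"])  # 0 1 * * * | 30 4 * * *
```

With 200 workspaces this yields eight batches, 30 minutes apart, so at most 25 plans hit the provider APIs at the same time. Grouping by workspace tag instead of list position would follow the same pattern.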

Code Example

# TFE workspace with scheduled drift detection
resource "tfe_workspace" "payments_infra_prod" {
  name              = "payments-infra-production"
  organization      = "bank-platform"
  terraform_version = "1.7.0"
  auto_apply        = false

  vcs_repo {
    identifier     = "bank/payments-infrastructure"
    branch         = "main"
    oauth_token_id = var.github_oauth_id
  }
}

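# Scheduled plan-only run for nightly drift detection
# NOTE: "tfe_workspace_run_schedule" is illustrative; it does not appear in the
# hashicorp/tfe provider documentation. Built-in drift detection is enabled with
# assessments_enabled = true on tfe_workspace; a cron-style schedule needs an
# external scheduler that queues plan-only runs through the Runs API.
resource "tfe_workspace_run_schedule" "payments_drift_check" {
  workspace_id = tfe_workspace.payments_infra_prod.id

  # Run plan every night at 2 AM EST (7 AM UTC; 3 AM during daylight saving time)
  cron_schedule = "0 7 * * *"

  # Plan only; do not auto-apply
  plan_only = true
}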

# Webhook notification for drift alerts
resource "tfe_notification_configuration" "drift_alert" {
  name             = "drift-detection-alert"
  enabled          = true
  workspace_id     = tfe_workspace.payments_infra_prod.id
  destination_type = "generic"  # Custom webhook
  url              = "https://drift-handler.bank.internal/webhook"

  triggers = [
    "run:needs_attention",  # Plan with changes detected
    "run:errored",          # Plan failed (possible API issue)
  ]
}
---
# Drift classification Lambda (triggered by TFE webhook)
# drift-handler/handler.py
import json
import boto3

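# Severity levels in ascending order; compare by rank, never by string value
# (max('low', 'high') on strings returns 'low' and would also downgrade 'critical')
SEVERITY_RANK = {'low': 0, 'high': 1, 'critical': 2}

# Resource types whose drift is treated as security-relevant
SECURITY_RESOURCE_TYPES = {
    'aws_security_group_rule',
    'aws_iam_policy',
    'aws_iam_role_policy',
    'aws_kms_key',
    'aws_s3_bucket_policy',
}

def classify_drift(event):
    """Classify drift severity based on resource type and change type."""
    plan_summary = event.get('plan_summary', {})
    changes = plan_summary.get('resource_changes', [])

    severity = 'low'
    findings = []

    for change in changes:
        resource_type = change['type']
        actions = change['actions']

        # Critical: any destroy or replace action
        # (Terraform plan JSON reports a replace as both 'delete' and 'create')
        if 'delete' in actions or 'replace' in actions:
            level = 'critical'
            findings.append(f"CRITICAL: {resource_type} will be {actions}")

        # High: security-related resources modified
        elif resource_type in SECURITY_RESOURCE_TYPES:
            level = 'high'
            findings.append(f"HIGH: {resource_type} drifted")

        # Low: tags, descriptions, non-functional changes
        else:
            level = 'low'
            findings.append(f"LOW: {resource_type} drifted")

        # Keep the highest severity seen so far
        if SEVERITY_RANK[level] > SEVERITY_RANK[severity]:
            severity = level

    return severity, findings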

def route_alert(severity, findings, workspace_name):
    """Route drift alerts based on severity."""
    if severity == 'critical':
        # Page security team immediately
        pagerduty_alert(f"CRITICAL drift in {workspace_name}", findings)
        create_jira_incident(workspace_name, findings)
    elif severity == 'high':
        # Slack alert to security channel
        slack_alert('#security-ops', workspace_name, findings)
        create_jira_ticket('HIGH', workspace_name, findings)
    else:
        # Low priority — create ticket for next sprint
        create_jira_ticket('LOW', workspace_name, findings)
---
# Terraform lifecycle blocks to reduce drift noise
# Ignore expected drift from auto-scaling
resource "aws_autoscaling_group" "payments_api" {
  # ... configuration ...

  lifecycle {
    ignore_changes = [
      desired_capacity,  # Changes with auto-scaling
      target_group_arns, # Changes with blue-green deploys
    ]
  }
}

# Ignore expected drift from external secret rotation
resource "aws_db_instance" "settlements_db" {
  # ... configuration ...

  lifecycle {
    ignore_changes = [
      password,  # Rotated by Vault, not managed by Terraform
    ]
  }
}

◈ Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│              Infrastructure Drift Detection Pipeline             │
│                                                                 │
│  ┌──────────────────┐                                           │
│  │  TFE Workspace   │  Scheduled: Nightly at 2 AM              │
│  │  Plan-Only Run   │──────────────────────────────┐            │
│  └──────────────────┘                              │            │
│                                                    ▼            │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                terraform plan (read-only)                  │  │
│  │                                                            │  │
│  │   Declared State ←──compare──→ Actual Infrastructure      │  │
│  │   (Terraform code)              (AWS API responses)        │  │
│  └──────────────────────┬────────────────────────────────────┘  │
│                         │                                       │
│              ┌──────────▼──────────┐                            │
│              │ Changes Detected?   │                            │
│              └──┬──────────────┬───┘                            │
│          No     │              │  Yes                           │
│          ┌──────▼────┐  ┌──────▼───────────────────────────┐   │
│          │ No drift  │  │ Webhook → Drift Classifier       │   │
│          │ All good  │  │                                   │   │
│          └───────────┘  │  ┌─────────┐ ┌────────┐ ┌─────┐ │   │
│                         │  │CRITICAL │ │ HIGH   │ │ LOW │ │   │
│                         │  │Delete/  │ │SecGroup│ │Tags │ │   │
│                         │  │Replace  │ │IAM/KMS │ │Desc │ │   │
│                         │  │→ Page   │ │→ Slack │ │→Jira│ │   │
│                         │  │  SecOps │ │  Alert │ │ Tkt │ │   │
│                         │  └─────────┘ └────────┘ └─────┘ │   │
│                         └──────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

How do you implement Terraform CI/CD pipelines — what approval gates and plan/apply workflows do you use for production changes?

advanced · general · terraform

Quick Answer

Implement a multi-stage pipeline: PR triggers terraform plan with output posted as a PR comment, OPA/Sentinel policy checks validate compliance, manual approval gates (GitHub Environments with required reviewers) protect production, and merge-to-main triggers terraform apply using the saved plan file. Use OIDC for keyless authentication and concurrency controls to prevent parallel applies on the same stack.

Detailed Answer

A Terraform CI/CD pipeline is like an air traffic control system: every infrastructure change (flight) must file a plan (flight plan), get reviewed by controllers (PR reviewers), receive clearance (approval gate), and land on the correct runway (target environment), all while preventing two planes from using the same runway simultaneously (state locking and concurrency control).

The pipeline begins with authentication. Modern pipelines use OIDC federation instead of stored AWS credentials. GitHub Actions requests a JWT from GitHub's OIDC provider, presents it to AWS STS via AssumeRoleWithWebIdentity, and receives short-lived credentials scoped to the Terraform execution role. The OIDC trust policy restricts which repositories, branches, and environments can assume the role: production apply roles should only be assumable by the main branch, while plan roles can be assumed by any branch. This eliminates long-lived access keys that could be exfiltrated from CI secrets.

The plan stage runs on every pull request. It executes terraform init, terraform validate, terraform fmt -check, and terraform plan -out=plan.tfplan. The plan output is captured and posted as a PR comment using a tool like tfcmt or a plan-comment GitHub Action. Reviewers see exactly what resources will be created, modified, or destroyed, including sensitive changes like security group rule modifications or IAM policy updates. The saved plan file is uploaded as a CI artifact for use in the apply stage.

Policy-as-code gates run between plan and approval. Open Policy Agent (OPA) evaluates the plan JSON (terraform show -json plan.tfplan) against organizational policies: no S3 buckets without encryption, no security groups with 0.0.0.0/0 ingress on port 22, all RDS instances must have deletion protection in production. These checks are non-negotiable: a policy violation fails the pipeline regardless of who approves the PR. Sentinel serves the same purpose in Terraform Cloud/Enterprise environments.

The approval gate differs by environment. Dev and QA may auto-apply on merge, since the PR review itself is sufficient approval. UAT requires team lead approval via a GitHub Environment with one required reviewer. Production requires two approvals from the platform-admins team, with a 15-minute wait timer to prevent hasty approvals. These are configured as GitHub Environments with protection rules, which the apply job references via the environment keyword. Note that a GitHub Environment proceeds once any one of its listed required reviewers approves, so a strict two-person rule is usually enforced through required PR reviews in branch protection, with the Environment approval as the final gate.

The apply stage triggers after merge to main. Ideally it applies the saved plan file from the plan stage rather than re-running plan, so that what gets applied is exactly what was reviewed, even if the infrastructure changed in between. If the saved plan is stale (state serial mismatch), Terraform rejects it and the pipeline must re-plan. After a successful apply, the pipeline posts results to a Slack channel (#infra-changes-prod) and creates a GitHub deployment record for the audit trail.

Concurrency control prevents two merged PRs from applying simultaneously to the same stack. GitHub Actions concurrency groups scoped to the stack name (concurrency: group: terraform-payments-prod) ensure only one apply runs at a time. Queued runs wait for the current apply to complete. Combined with DynamoDB state locking, this provides two layers of protection against concurrent modification.
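
To show what a policy gate inspects, here is a minimal Python stand-in for one of the rules above (no 0.0.0.0/0 ingress on port 22). It is not OPA or Rego; it only mirrors the idea of walking the plan JSON produced by terraform show -json. The function name is hypothetical, and the sketch assumes inline ingress blocks on aws_security_group resources, so separately defined rule resources are not covered.

```python
def deny_open_ssh(plan):
    """Return violation messages for security groups exposing SSH to the world."""
    violations = []
    for resource_change in plan.get('resource_changes', []):
        if resource_change.get('type') != 'aws_security_group':
            continue
        # 'after' is null for resources that are being destroyed
        after = (resource_change.get('change') or {}).get('after') or {}
        for rule in after.get('ingress') or []:
            open_to_world = '0.0.0.0/0' in (rule.get('cidr_blocks') or [])
            from_port = rule.get('from_port') or 0
            to_port = rule.get('to_port') or 0
            if open_to_world and from_port <= 22 <= to_port:
                address = resource_change.get('address', '<unknown>')
                violations.append(f"{address}: port 22 open to 0.0.0.0/0")
    return violations


sample_plan = {'resource_changes': [{
    'address': 'aws_security_group.bastion',
    'type': 'aws_security_group',
    'change': {'actions': ['create'], 'after': {'ingress': [
        {'from_port': 22, 'to_port': 22, 'cidr_blocks': ['0.0.0.0/0']},
    ]}},
}]}
print(deny_open_ssh(sample_plan))  # ['aws_security_group.bastion: port 22 open to 0.0.0.0/0']
```

In a pipeline, a non-empty result would fail the job before any approval is requested, which is the same contract the OPA step enforces.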

Code Example

# .github/workflows/terraform-payments.yml
# Multi-stage Terraform pipeline with OIDC and approval gates
name: Payments Infrastructure Pipeline

# Trigger on PRs and pushes to main affecting payments infra
on:
  pull_request:
    paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']
  push:
    branches: [main]
    paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']

# OIDC permissions for keyless AWS authentication
permissions:
  id-token: write
  contents: read
  pull-requests: write

# Prevent concurrent applies on the same stack
concurrency:
  group: terraform-payments-prod-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
  # Stage 1: Validate and plan on every PR
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      # Checkout the infrastructure code
      - uses: actions/checkout@v4
      # OIDC authentication — plan role (read-only)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformPlan
          aws-region: us-east-1
      # Install pinned Terraform version
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
      # Initialize the backend and download providers
      - name: Init
        run: terraform -chdir=infrastructure/envs/prod init -input=false
      # Validate syntax and configuration
      - name: Validate
        run: terraform -chdir=infrastructure/envs/prod validate
      # Format check to enforce style standards
      - name: Format Check
        run: terraform fmt -check -recursive infrastructure/
      # Generate execution plan and save to file
      - name: Plan
        run: terraform -chdir=infrastructure/envs/prod plan -input=false -out=prod.tfplan
      # Export plan as JSON for OPA policy evaluation
      - name: Export Plan JSON
        run: terraform -chdir=infrastructure/envs/prod show -json prod.tfplan > plan.json
      # Run OPA policy checks against the plan
      - name: OPA Policy Check
        run: |
          opa eval --data policies/ --input plan.json "data.terraform.deny[msg]" --fail-defined
      # Post plan output as a PR comment for reviewers
      - name: Comment Plan on PR
        uses: borchero/terraform-plan-comment@v2
        with:
          working-directory: infrastructure/envs/prod
      # Upload plan artifact for the apply stage
      - uses: actions/upload-artifact@v4
        with:
          name: prod-tfplan
          path: infrastructure/envs/prod/prod.tfplan
          retention-days: 5

  # Stage 2: Apply after merge with manual approval
  apply:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    # Production environment with required approvers and wait timer
    environment:
      name: production-payments
      url: https://console.aws.amazon.com/eks
    steps:
      - uses: actions/checkout@v4
      # OIDC authentication — apply role (read-write)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformApply
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
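      # Re-init and apply. This re-plans at apply time, so the applied change can
      # differ from the plan reviewed on the PR. To apply exactly the reviewed plan,
      # download the prod-tfplan artifact and pass it to terraform apply instead.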
      - name: Init and Apply
        run: |
          terraform -chdir=infrastructure/envs/prod init -input=false
          terraform -chdir=infrastructure/envs/prod apply -input=false -auto-approve
      # Notify team of successful deployment
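      - name: Slack Notification
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text": "Prod payments infra deployed by ${{ github.actor }}"}'
        # The v1 action reads the webhook from the environment; the secret name is an assumption
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK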

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│       Terraform CI/CD Pipeline with Approval Gates             │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  PR Opened                                                    │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Stage 1: Plan (on every PR)                      │        │
│  │                                                    │        │
│  │  OIDC → AssumeRole (Plan Role, read-only)         │        │
│  │  ┌──────┐ ┌────────┐ ┌────┐ ┌──────┐             │        │
│  │  │ init  │→│validate│→│fmt │→│ plan  │             │        │
│  │  └──────┘ └────────┘ └────┘ └──┬───┘             │        │
│  │                                 │                  │        │
│  │                          ┌──────┴──────┐          │        │
│  │                          │ plan.json   │          │        │
│  │                          └──────┬──────┘          │        │
│  │                                 ↓                  │        │
│  │                    ┌────────────────────┐          │        │
│  │                    │  OPA Policy Check   │          │        │
│  │                    │  - no public S3     │          │        │
│  │                    │  - encryption on    │          │        │
│  │                    │  - tags required    │          │        │
│  │                    └────────┬───────────┘          │        │
│  │                             ↓                      │        │
│  │                    ┌────────────────────┐          │        │
│  │                    │  PR Comment with    │          │        │
│  │                    │  plan output        │          │        │
│  │                    └────────────────────┘          │        │
│  └──────────────────────────────────────────────────┘        │
│                                                               │
│  PR Approved + Merged to main                                 │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Stage 2: Manual Approval Gate                    │        │
│  │  GitHub Environment: production-payments          │        │
│  │  Required reviewers: 2 from platform-admins       │        │
│  │  Wait timer: 15 minutes                           │        │
│  └──────────────────────────────────────────────────┘        │
│                          ↓                                    │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Stage 3: Apply (after approval)                  │        │
│  │                                                    │        │
│  │  OIDC → AssumeRole (Apply Role, read-write)       │        │
│  │  ┌──────┐ ┌────────────────┐                      │        │
│  │  │ init  │→│ apply          │                      │        │
│  │  └──────┘ │ -auto-approve  │                      │        │
│  │           └───────┬────────┘                      │        │
│  │                   ↓                                │        │
│  │           ┌──────────────┐                        │        │
│  │           │ Slack notify  │                        │        │
│  │           │ #infra-changes│                        │        │
│  │           └──────────────┘                        │        │
│  └──────────────────────────────────────────────────┘        │
│                                                               │
│  Concurrency: group=terraform-payments-prod (1 at a time)     │
└───────────────────────────────────────────────────────────────┘

How do you handle Terraform state drift when someone makes manual changes in the AWS console, and how do you detect and remediate it?

advanced · state · terraform

Quick Answer

Detect drift by running terraform plan -refresh-only, which compares actual infrastructure against state without proposing configuration changes. Remediate in one of three ways: revert an accidental change by running terraform apply to converge back to the declared configuration, update the Terraform code to match an intentional change and then confirm the plan is empty, or bring an untracked resource under management with terraform import or import blocks.

Detailed Answer

Terraform state drift is like someone rearranging furniture in a room that has a blueprint: the blueprint (state file) says the couch is by the window, but someone physically moved it to the center of the room (console change). Terraform detects this discrepancy during the refresh phase of plan and proposes moving the couch back to the window (converging to declared state). The question is whether the move was intentional or accidental — and that determines whether you update the blueprint or move the couch back. Drift detection happens during the refresh phase of terraform plan. For every resource tracked in state, Terraform calls the cloud provider's API to read the current configuration. If the API response differs from what state records, Terraform updates its in-memory state and then diffs that against your HCL configuration. The -refresh-only flag runs only the refresh phase without proposing configuration-driven changes, making it a pure drift detection scan. The output shows which attributes have drifted and their before/after values. There are three categories of drift, each requiring a different remediation strategy. The first is accidental drift: an engineer manually opened port 443 on a security group to debug a connectivity issue and forgot to revert it. The fix is to run terraform apply, which converges the security group back to the declared configuration, removing the manually added rule. This is Terraform's self-healing property — the declared state is the source of truth. The second is intentional drift: an operations engineer manually scaled up an RDS instance from db.r6g.xlarge to db.r6g.2xlarge during a traffic incident. The change was correct and should be preserved. The fix is to update the Terraform code to reflect the new instance class, then run terraform plan to verify the plan shows no changes (the code now matches reality). 
If you run apply without updating the code, Terraform downgrades the instance back to the original size, potentially causing another outage. The third is untracked resource creation: someone created a new S3 bucket via the console that Terraform knows nothing about. Because Terraform only tracks resources in its state, it cannot detect untracked resources. Tools like AWS Config, Driftctl (now part of Snyk), or CloudQuery inventory the entire account so the results can be compared against Terraform state to find resources that exist but are not managed. Once identified, you either import the resource into Terraform using import blocks (Terraform 1.5+) or the terraform import command, or you delete the resource if it should not exist. Proactive drift prevention is better than reactive detection. Implement AWS Config rules that alert on configuration changes not made by the Terraform execution role. Set up CloudTrail-based alarms that trigger when console users modify resources tagged with ManagedBy=terraform. Use IAM policies that restrict console users to read-only access for Terraform-managed resource types. Schedule a daily terraform plan -refresh-only in CI that posts drift reports to a Slack channel; this catches drift within 24 hours instead of discovering it during the next deployment. The lifecycle meta-argument ignore_changes is the escape hatch for expected drift. Auto Scaling groups change desired_capacity based on scaling policies, ECS services change desired_count, and some resources have attributes that are set once and then managed externally. Adding these attributes to ignore_changes tells Terraform to disregard them when planning updates, which prevents noisy diffs and accidental reverts of legitimate operational changes.
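
A drift report can also be consumed by a script instead of being read by eye. The following is a minimal sketch, assuming the refresh-only plan was saved and exported with terraform show -json, whose JSON plan representation includes a resource_drift list. The helper name summarize_drift and the sample payload are hypothetical and only illustrate the shape of the data.

```python
import json


def summarize_drift(plan_json: str) -> list[str]:
    """Return one line per drifted attribute found in a plan JSON document."""
    plan = json.loads(plan_json)
    lines = []
    # "resource_drift" lists changes Terraform detected outside its own runs.
    for entry in plan.get("resource_drift", []):
        change = entry.get("change", {})
        # "before"/"after" are null when a resource was created or deleted.
        before = change.get("before") or {}
        after = change.get("after") or {}
        for key in sorted(set(before) | set(after)):
            if before.get(key) != after.get(key):
                lines.append(
                    f"{entry['address']}: {key} "
                    f"{before.get(key)!r} -> {after.get(key)!r}"
                )
    return lines


# Hypothetical sample shaped like `terraform show -json drift-check.tfplan`.
SAMPLE = json.dumps({
    "resource_drift": [{
        "address": "aws_security_group_rule.payments_api_ingress",
        "change": {
            "actions": ["update"],
            "before": {"from_port": 443, "to_port": 443},
            "after": {"from_port": 8080, "to_port": 443},
        },
    }]
})

for line in summarize_drift(SAMPLE):
    print(line)
# prints: aws_security_group_rule.payments_api_ingress: from_port 443 -> 8080
```

A report like this still needs a human decision: the script shows what drifted, not whether the change was accidental or intentional.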

Code Example

# Drift detection and remediation workflow
# Step 1: Run refresh-only plan to detect drift without proposing changes
# terraform plan -refresh-only -out=drift-check.tfplan
# This shows which resources have drifted from their recorded state

# Step 2: Review the drift report
# terraform show drift-check.tfplan
# Example output:
# ~ aws_security_group_rule.payments_api_ingress
#     from_port: 443 → 8080  (someone changed the port manually)

# Step 3a: Revert accidental drift — apply converges back to declared state
# terraform apply
# This restores the security group rule to port 443 as declared in code

# Step 3b: Adopt intentional drift — update code to match reality
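resource "aws_rds_cluster" "payments_db" {
  # Cluster identifier for the payments transaction database
  cluster_identifier      = "payments-db-prod"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  deletion_protection     = true
  backup_retention_period = 30
}

# For Aurora, the instance class is set on the cluster instance resource
# (the resource name and identifier below are illustrative assumptions)
resource "aws_rds_cluster_instance" "payments_db_writer" {
  identifier         = "payments-db-prod-writer"
  cluster_identifier = aws_rds_cluster.payments_db.id
  engine             = aws_rds_cluster.payments_db.engine
  engine_version     = aws_rds_cluster.payments_db.engine_version
  # Updated instance class to match the manual scaling during incident
  # Previously: db.r6g.xlarge (changed during traffic spike on 2026-06-15)
  instance_class     = "db.r6g.2xlarge"
}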

# Step 3c: Import untracked resources using import blocks (TF 1.5+)
import {
  # S3 bucket created manually via console during incident response
  to = aws_s3_bucket.payments_audit_logs
  # The actual bucket name to import from AWS
  id = "valuemomentum-payments-audit-logs-prod"
}

# Resource block to match the imported bucket's configuration
resource "aws_s3_bucket" "payments_audit_logs" {
  # Bucket name matching the manually created bucket
  bucket = "valuemomentum-payments-audit-logs-prod"
  tags = {
    Purpose     = "audit-log-storage"
    Environment = "prod"
    ManagedBy   = "terraform"
    ImportedOn  = "2026-06-20"
  }
}

# Lifecycle ignore_changes for expected drift patterns
resource "aws_autoscaling_group" "payments_api_fleet" {
  # ASG name following the naming convention
  name             = "payments-api-fleet-prod-use1"
  # Baseline desired capacity — autoscaler adjusts this
  desired_capacity = 6
  # Minimum instances for SLA compliance
  min_size         = 3
  # Maximum instances during peak events
  max_size         = 24
  launch_template {
    id      = aws_launch_template.payments_api.id
    version = "$Latest"
  }
  lifecycle {
    # Ignore desired_capacity — managed by cluster autoscaler
    # Ignore target_group_arns — managed by EKS ingress controller
    ignore_changes = [desired_capacity, target_group_arns]
  }
}

# Scheduled drift detection in CI (runs daily at 6 AM UTC)
# .github/workflows/drift-detection.yml
# name: Daily Drift Detection
# on:
#   schedule:
#     - cron: '0 6 * * *'
# jobs:
#   detect-drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - run: terraform -chdir=infrastructure/envs/prod init
#       - run: terraform -chdir=infrastructure/envs/prod plan -refresh-only -detailed-exitcode
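#       # Exit code 2 means drift detected; exit code 1 means the plan itself failed
#       # failure() fires on both, so this alert can also indicate a broken run
#       - if: failure()
#         run: |
#           curl -X POST -H 'Content-type: application/json' "$SLACK_WEBHOOK" -d '{"text": "Drift detected or drift check failed in prod payments infra"}'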
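
The workflow above treats every non-zero exit as drift. A minimal sketch of separating the cases, assuming the documented -detailed-exitcode semantics (0 = no changes, 1 = error, 2 = changes present); the function names and alert wording are hypothetical:

```python
from typing import Optional


def classify_plan_exit(code: int) -> str:
    """Map a `terraform plan -detailed-exitcode` exit status to an outcome."""
    if code == 0:
        return "clean"  # plan succeeded and found nothing to change
    if code == 2:
        return "drift"  # plan succeeded and found differences
    return "error"      # 1 (or anything unexpected): the plan itself failed


def alert_text(code: int) -> Optional[str]:
    """Return the message to post, or None when no alert is needed."""
    outcome = classify_plan_exit(code)
    if outcome == "drift":
        return "DRIFT DETECTED in prod payments infra"
    if outcome == "error":
        return "Drift check could not run (terraform plan failed)"
    return None


for code in (0, 1, 2):
    print(code, classify_plan_exit(code), alert_text(code))
```

Keeping the error case distinct avoids paging the team about drift when the real problem is expired credentials or a backend outage.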

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│           Terraform State Drift Detection & Remediation        │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌───────────────┐      Manual       ┌──────────────────┐    │
│  │  Terraform     │      Change       │  AWS Console      │    │
│  │  State File    │                   │  or CLI           │    │
│  │                │                   │                   │    │
│  │  sg port: 443  │                   │  sg port: 8080    │    │
│  │  (declared)    │                   │  (actual)         │    │
│  └───────┬───────┘                   └──────────────────┘    │
│          │                                    │               │
│          └──────────────┬─────────────────────┘               │
│                         ↓                                     │
│              ┌──────────────────────┐                        │
│              │ terraform plan        │                        │
│              │ -refresh-only         │                        │
│              │                      │                        │
│              │ DRIFT DETECTED:      │                        │
│              │ sg port: 443 → 8080  │                        │
│              └──────────┬───────────┘                        │
│                         │                                     │
│          ┌──────────────┼──────────────┐                     │
│          ↓              ↓              ↓                     │
│  ┌──────────────┐┌─────────────┐┌──────────────────┐        │
│  │ Accidental   ││ Intentional ││ Untracked        │        │
│  │ Drift        ││ Drift       ││ Resource         │        │
│  │              ││             ││                  │        │
│  │ terraform    ││ Update HCL  ││ terraform import │        │
│  │ apply        ││ to match    ││ or delete the    │        │
│  │ (revert to   ││ reality,    ││ resource         │        │
│  │  declared)   ││ then apply  ││                  │        │
│  └──────────────┘└─────────────┘└──────────────────┘        │
│                                                               │
│  Proactive Detection:                                         │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Daily CI Job (cron: 0 6 * * *)                   │        │
│  │  terraform plan -refresh-only -detailed-exitcode  │        │
│  │                                                    │        │
│  │  Exit 0 → No drift → all clear                    │        │
│  │  Exit 2 → Drift detected → Slack alert            │        │
│  └──────────────────────────────────────────────────┘        │
│                                                               │
│  Expected Drift (ignore_changes):                             │
│  ┌──────────────────────────────────────────────────┐        │
│  │  ASG desired_capacity → managed by autoscaler     │        │
│  │  ECS desired_count → managed by scaling policy    │        │
│  │  ignore_changes = [desired_capacity]              │        │
│  └──────────────────────────────────────────────────┘        │
└───────────────────────────────────────────────────────────────┘

How does terraform plan detect drift and what are its limitations?

advanced · general · terraform

Quick Answer

Terraform plan detects drift by reading the current state of every managed resource via provider API calls and comparing it against the state file; any difference is drift. Its limits: it only checks attributes the provider reads back, it cannot see resources created out of band that are not in state, and some providers do not report all attributes accurately.

Detailed Answer

Terraform plan's drift detection works through a refresh-then-diff process. Think of it like an inventory audit: Terraform reads the last known inventory (state file), physically checks every item in the warehouse (API calls to cloud providers), updates the inventory with actual findings (state refresh), and then compares the updated inventory against the blueprint (configuration). Any discrepancies between the refreshed state and the desired configuration become the plan. The refresh phase is where drift detection happens. For every resource tracked in the state file, Terraform calls the provider's ReadResource RPC method, which translates to cloud API calls. For an aws_rds_cluster.payments_db, this triggers a DescribeDBClusters API call. The provider compares the API response against the state file's recorded attributes. If the production database's backup_retention_period was changed from 30 to 7 via the AWS console, the refresh detects this as drift and updates the in-memory state. After refresh, Terraform diffs the refreshed state against the configuration. If your configuration says backup_retention_period = 30 but the refreshed state shows 7, the plan proposes changing it back to 30. This is Terraform's self-healing property: it converges actual infrastructure toward the declared configuration. However, the limitations are significant and often misunderstood in production. First, Terraform only detects drift on resources it manages. If someone creates an additional security group rule via the AWS console that is not in Terraform's state, Terraform has no knowledge of it. This is the 'unknown unknowns' problem: Terraform cannot detect resources it does not track. Second, not all providers report all attributes during refresh. Some cloud APIs return partial data, or certain attributes are write-only (like passwords). 
The AWS provider illustrates this with IAM policy documents: the API returns a canonicalized document whose ordering may not match what was originally written, so comparisons on those attributes are unreliable. Third, the refresh phase can be slow and expensive. In a large infrastructure with thousands of resources, the refresh makes thousands of API calls, which can hit rate limits and take tens of minutes. The -refresh=false flag skips the refresh for faster plans, but this trades drift detection for speed. Fourth, eventual consistency in cloud APIs can cause false drift detection. After an AWS resource is created, the API may return stale data for seconds or minutes. Running plan immediately after apply can show phantom drift that resolves itself. Fifth, Terraform cannot detect drift on resource dependencies that are not explicitly modeled. If a VPC peering connection's route table was modified outside Terraform but the peering resource itself was not, Terraform might not detect the functional impact. Tools like AWS Config, CloudTrail-based drift detection, or Driftctl (now part of Snyk) fill these gaps by inventorying accounts and auditing changes to surface unmanaged resources.
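
The "unknown unknowns" gap comes down to a set difference between what exists in the account and what state tracks. A minimal sketch, assuming state has been exported with terraform show -json (which nests resources under values.root_module and child_modules) and that the account inventory comes from some other source such as a bucket listing. The function names, the sample state, and the bucket payments-debug-scratch are hypothetical.

```python
import json


def managed_ids(state_json: str, resource_type: str) -> set[str]:
    """Collect the id of every managed resource of one type from state JSON."""
    state = json.loads(state_json)
    found: set[str] = set()

    def walk(module: dict) -> None:
        for res in module.get("resources", []):
            if res.get("mode") == "managed" and res.get("type") == resource_type:
                res_id = res.get("values", {}).get("id")
                if res_id is not None:
                    found.add(res_id)
        # Resources declared inside modules live under child_modules.
        for child in module.get("child_modules", []):
            walk(child)

    walk(state.get("values", {}).get("root_module", {}))
    return found


def find_unmanaged(account_ids: list[str], state_json: str, resource_type: str) -> list[str]:
    """Return ids that exist in the account but are absent from state."""
    return sorted(set(account_ids) - managed_ids(state_json, resource_type))


# Hypothetical state export and account inventory for illustration.
SAMPLE_STATE = json.dumps({
    "values": {"root_module": {
        "resources": [{
            "address": "aws_s3_bucket.payments_audit_logs",
            "mode": "managed", "type": "aws_s3_bucket",
            "values": {"id": "valuemomentum-payments-audit-logs-prod"},
        }],
        "child_modules": [{
            "resources": [{
                "address": "module.logs.aws_s3_bucket.access",
                "mode": "managed", "type": "aws_s3_bucket",
                "values": {"id": "payments-access-logs-prod"},
            }],
        }],
    }}
})
ACCOUNT_BUCKETS = [
    "valuemomentum-payments-audit-logs-prod",
    "payments-access-logs-prod",
    "payments-debug-scratch",
]

print(find_unmanaged(ACCOUNT_BUCKETS, SAMPLE_STATE, "aws_s3_bucket"))
# prints: ['payments-debug-scratch']
```

This only covers one state file and one resource type; dedicated tools do the same comparison across many states and resource types.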

Code Example

# Demonstrating drift detection behavior with refresh configuration
# Backend configuration for the payments infrastructure state
terraform {
  # Minimum Terraform version (-refresh-only needs 0.15.4+; import blocks need 1.5+)
  required_version = ">= 1.5.0"
  # S3 backend with state locking for the payments platform
  backend "s3" {
    # State bucket for the production payments infrastructure
    bucket         = "fintech-corp-terraform-state-prod"
    # State file path for the payments database workspace
    key            = "payments-database/terraform.tfstate"
    # Primary region for state storage
    region         = "us-east-1"
    # Lock table to prevent concurrent modifications
    dynamodb_table = "terraform-state-locks-prod"
  }
}

# RDS cluster that we want to detect drift on
resource "aws_rds_cluster" "payments_db" {
  # Cluster identifier for the payments transaction database
  cluster_identifier = "payments-db-production"
  # Aurora PostgreSQL engine for transaction processing
  engine             = "aurora-postgresql"
  # Engine version validated by the DBA team
  engine_version     = "15.4"
  # Backup retention: 30 days for PCI compliance
  # If someone changes this via console, plan will detect drift
  backup_retention_period = 30
  # Deletion protection must stay enabled in production
  deletion_protection = true
  # Preferred maintenance window outside peak transaction hours
  preferred_maintenance_window = "sun:03:00-sun:04:00"
}

# Lifecycle rule to ignore drift on specific attributes
resource "aws_autoscaling_group" "payments_api_fleet" {
  # ASG name following the organization convention
  name                = "payments-api-fleet-production"
  # Desired capacity managed by autoscaling policies, not Terraform
  desired_capacity    = 6
  # Minimum instances for baseline transaction processing
  min_size            = 3
  # Maximum instances during peak shopping events
  max_size            = 24
  # Launch template for the payments API container hosts
  launch_template {
    # Reference the payments API launch template
    id      = aws_launch_template.payments_api.id
    # Always use the latest validated AMI version
    version = "$Latest"
  }
  # Ignore drift on desired_capacity because autoscaling changes it
  lifecycle {
    # Prevent Terraform from reverting autoscaler decisions
    ignore_changes = [desired_capacity]
  }
}

# Refresh-only plan command to detect drift without proposing changes
# terraform plan -refresh-only -out=drift-report.tfplan

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│                Terraform Plan Drift Detection Flow             │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Phase 1: Refresh (Drift Detection)                           │
│  ┌──────────────┐    ReadResource    ┌──────────────────┐    │
│  │  State File    │    RPC calls       │  Cloud Provider   │    │
│  │               │──────────────────→│  APIs             │    │
│  │  payments_db: │                    │                   │    │
│  │  retention=30 │←──────────────────│  Actual: ret=7    │    │
│  │               │   API Response     │  (console change) │    │
│  └──────┬───────┘                    └──────────────────┘    │
│         │                                                     │
│         │ Update in-memory state                              │
│         ↓                                                     │
│  ┌──────────────┐                                             │
│  │  Refreshed     │                                             │
│  │  State         │  payments_db: retention=7 (drift detected) │
│  └──────┬───────┘                                             │
│         │                                                     │
│  Phase 2: Diff (Plan Generation)                              │
│         │                                                     │
│         ↓                                                     │
│  ┌──────────────┐    Compare    ┌──────────────────┐         │
│  │  Refreshed     │─────────────→│  Configuration    │         │
│  │  State         │              │  (main.tf)        │         │
│  │  retention=7  │              │  retention=30    │         │
│  └──────────────┘              └──────────────────┘         │
│         │                                                     │
│         ↓                                                     │
│  ┌───────────────────────────────────────────────────┐       │
│  │  Plan Output:                                      │       │
│  │  ~ aws_rds_cluster.payments_db                     │       │
│  │    ~ backup_retention_period: 7 → 30               │       │
│  │    (drift will be corrected on apply)              │       │
│  └───────────────────────────────────────────────────┘       │
│                                                               │
│  Limitations:                                                 │
│  ┌────────────────────────────────────────────────┐          │
│  │  ✗ Cannot detect unmanaged resources            │          │
│  │  ✗ Write-only attributes invisible to refresh   │          │
│  │  ✗ Eventual consistency → false drift           │          │
│  │  ✗ Rate limits slow large-scale refresh         │          │
│  └────────────────────────────────────────────────┘          │
└───────────────────────────────────────────────────────────────┘

How do you implement a Terraform CI/CD pipeline with plan approval workflow?

advanced · general · terraform

Quick Answer

Implement a CI/CD pipeline that runs terraform plan on pull requests, posts the plan output as a PR comment for review, requires manual approval before apply, and uses remote state locking to prevent concurrent operations. Use OIDC authentication, separate plan and apply stages, and implement policy-as-code gates with Sentinel or OPA.

Detailed Answer

A production-grade Terraform CI/CD pipeline must solve five problems: authentication without long-lived credentials, plan visibility for reviewers, approval gates before destructive changes, concurrency control to prevent conflicting applies, and policy enforcement to catch compliance violations before they reach infrastructure. Think of it like a surgical operation workflow: the surgeon (engineer) proposes an operation (plan), it gets reviewed by a board (PR review), approved by an authority (manual approval gate), and only then executed in a controlled environment (apply) with safeguards (state locking). Authentication should use OIDC federation. GitHub Actions, GitLab CI, and CircleCI all support OIDC tokens that can be exchanged for short-lived AWS credentials via STS AssumeRoleWithWebIdentity. This eliminates the need to store AWS access keys as CI secrets, which is a common audit finding. The OIDC trust policy should be scoped to specific repositories and branches to prevent unauthorized access. The pipeline structure typically has three stages. The first stage runs on every pull request: terraform init, terraform validate, terraform fmt -check, and terraform plan -out=plan.tfplan. The plan output is captured and posted as a PR comment using a tool like tfcmt or a custom script that parses the plan JSON output. This gives reviewers visibility into exactly what will change. The second stage is the approval gate. For non-production environments, this might be automatic after PR merge. For production, it requires explicit manual approval. In GitHub Actions, this is implemented using environments with required reviewers. In GitLab, it is a manual job gate. The approval should be from someone other than the PR author (four-eyes principle) and ideally from a platform engineering team member who understands the blast radius. The third stage runs terraform apply using the saved plan file. 
This is critical: never re-run plan during apply, because infrastructure may have changed between the plan and apply stages. The saved plan file ensures exactly what was reviewed gets applied. After apply, the pipeline should post the apply output back to the PR or a Slack channel for visibility. Policy-as-code adds guardrails. HashiCorp Sentinel (Terraform Cloud/Enterprise) or Open Policy Agent (open source) evaluate the plan against organizational policies: no public S3 buckets, all RDS instances must have encryption, all security groups must have descriptions. These checks run after plan but before approval, catching violations early. Production gotchas include handling plan file expiration (plan files reference specific provider plugin versions and state serial numbers, so they expire when state changes), managing workspace-level parallelism (only one pipeline should operate on a workspace at a time), and dealing with long-running applies that exceed CI timeout limits. Some teams implement a Terraform-specific lock in Redis or DynamoDB beyond the state lock, to queue pipeline runs at the workspace level.
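
To make the policy gate concrete, here is a minimal sketch of the kind of check a policy engine performs, written as a plain script over the JSON produced by terraform show -json plan.tfplan. It is a stand-in for illustration, not OPA or Sentinel. The function name, the two rules chosen, and the sample plan are assumptions.

```python
import json

PUBLIC_ACLS = {"public-read", "public-read-write"}
ENCRYPTED_TYPES = {"aws_rds_cluster", "aws_db_instance"}


def policy_violations(plan_json: str) -> list[str]:
    """Return human-readable violations found in a Terraform plan JSON document."""
    plan = json.loads(plan_json)
    violations = []
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        if after is None:
            continue  # resource is being destroyed; nothing to check
        # Values unknown until apply are absent from "after", so they fail closed.
        if rc["type"] in ENCRYPTED_TYPES and after.get("storage_encrypted") is not True:
            violations.append(f"{rc['address']}: storage_encrypted must be true")
        if rc["type"] == "aws_s3_bucket_acl" and after.get("acl") in PUBLIC_ACLS:
            violations.append(f"{rc['address']}: public ACL {after['acl']!r} is not allowed")
    return violations


# Hypothetical plan excerpt for illustration.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_rds_cluster.payments_db", "type": "aws_rds_cluster",
         "change": {"actions": ["create"], "after": {"storage_encrypted": False}}},
        {"address": "aws_s3_bucket_acl.audit", "type": "aws_s3_bucket_acl",
         "change": {"actions": ["create"], "after": {"acl": "private"}}},
        {"address": "aws_s3_bucket_acl.reports", "type": "aws_s3_bucket_acl",
         "change": {"actions": ["update"], "after": {"acl": "public-read"}}},
    ]
})

for violation in policy_violations(SAMPLE_PLAN):
    print(violation)
```

In a pipeline, a non-empty result would fail the job before the approval gate, so reviewers only see plans that already meet the baseline rules.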

Code Example

# GitHub Actions workflow for Terraform CI/CD with approval gates
# File: .github/workflows/terraform-payments-infra.yml
name: Payments Infrastructure Terraform Pipeline

# Trigger on pull requests targeting the main branch
on:
  pull_request:
    # Only run when infrastructure code changes
    paths:
      - 'infrastructure/payments/**'
  push:
    branches:
      - main
    paths:
      - 'infrastructure/payments/**'

# OIDC token permissions for AWS authentication
permissions:
  # Allow requesting OIDC JWT tokens from GitHub
  id-token: write
  # Allow posting plan output as PR comments
  pull-requests: write
  # Allow reading repository contents
  contents: read

# Prevent concurrent runs on the same branch/PR
concurrency:
  # Group by workflow name and PR number or branch
  group: terraform-payments-${{ github.event.pull_request.number || github.ref }}
  # Cancel in-progress plan runs but never cancel apply
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

jobs:
  # Plan stage: runs on every pull request
  terraform-plan:
    # Use the latest Ubuntu runner for consistency
    runs-on: ubuntu-latest
    # Only run plan on pull requests, not on merge
    if: github.event_name == 'pull_request'
    steps:
      # Checkout the payments infrastructure code
      - uses: actions/checkout@v4
      # Configure AWS credentials via OIDC federation
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # OIDC role scoped to this repository and branch
          role-to-assume: arn:aws:iam::111111111111:role/GitHubActions-TerraformPlan
          # Region for API calls and state backend
          aws-region: us-east-1
      # Install the pinned Terraform version
      - uses: hashicorp/setup-terraform@v3
        with:
          # Version locked to match team standard
          terraform_version: 1.7.4
      # Initialize Terraform with backend configuration
      - name: Terraform Init
        # Run init in the payments infrastructure directory
        run: terraform init -input=false
        working-directory: infrastructure/payments
      # Validate configuration syntax and internal consistency
      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/payments
      # Run format check to enforce code style
      - name: Terraform Format Check
        # Fail the pipeline if code is not formatted
        run: terraform fmt -check -recursive
        working-directory: infrastructure/payments
      # Generate the execution plan and save to file
      - name: Terraform Plan
        # Save plan to file for use in apply stage
        run: terraform plan -input=false -out=payments.tfplan
        working-directory: infrastructure/payments
      # Post plan output as a PR comment for reviewers
      - name: Post Plan to PR
        # Use tfcmt for formatted plan comments
        run: tfcmt plan -- terraform show payments.tfplan
        working-directory: infrastructure/payments

  # Apply stage: runs after merge with manual approval
  terraform-apply:
    # Use the latest Ubuntu runner for consistency
    runs-on: ubuntu-latest
    # Only run apply on push to main (after PR merge)
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    # Require manual approval from the platform-admins team
    environment: production-payments
    steps:
      # Checkout the merged infrastructure code
      - uses: actions/checkout@v4
      # Configure AWS credentials with apply permissions
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Apply role has write permissions to production
          role-to-assume: arn:aws:iam::111111111111:role/GitHubActions-TerraformApply
          # Same region as the plan stage
          aws-region: us-east-1
      # Install the same Terraform version as plan stage
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
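      # Initialize and apply the configuration
      # Note: this applies a fresh plan, not the payments.tfplan reviewed on the PR;
      # applying the reviewed plan requires passing the plan file to this job (for example as an artifact)
      - name: Terraform Init and Apply
        # Auto-approve because approval happened via GitHub environment
        run: |
          terraform init -input=false
          terraform apply -input=false -auto-approve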
        working-directory: infrastructure/payments
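
Where a custom script is preferred over tfcmt for the PR comment, the plan JSON can be reduced to a short summary. A minimal sketch, assuming input from terraform show -json payments.tfplan; the function name, the summary wording, and the sample data are hypothetical, and the counts follow this script's own convention rather than Terraform's built-in summary line.

```python
import json
from collections import Counter


def summarize_plan(plan_json: str) -> str:
    """Build a one-line summary of planned actions from plan JSON."""
    plan = json.loads(plan_json)
    counts: Counter = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions in (["no-op"], ["read"]):
            continue  # nothing will change for this resource
        # A replacement is reported as a delete/create pair in either order.
        label = "replace" if set(actions) == {"create", "delete"} else actions[0]
        counts[label] += 1
    return (
        f"Plan summary: {counts['create']} to create, {counts['update']} to update, "
        f"{counts['replace']} to replace, {counts['delete']} to delete"
    )


# Hypothetical plan excerpt for illustration.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.a", "change": {"actions": ["create"]}},
        {"address": "aws_rds_cluster.b", "change": {"actions": ["update"]}},
        {"address": "aws_instance.c", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_vpc.d", "change": {"actions": ["no-op"]}},
    ]
})

print(summarize_plan(SAMPLE_PLAN))
# prints: Plan summary: 1 to create, 1 to update, 1 to replace, 0 to delete
```

Surfacing replace and delete counts at the top of the comment helps reviewers spot destructive changes before reading the full plan.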

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│          Terraform CI/CD Pipeline with Approval Gates         │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌──────────────┐                                             │
│  │  Engineer     │                                             │
│  │  Opens PR     │                                             │
│  └──────┬───────┘                                             │
│         │                                                     │
│         ↓                                                     │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Stage 1: Plan (on PR)                            │        │
│  │  ┌────────┐  ┌────────┐  ┌────────┐  ┌────────┐ │        │
│  │  │  init   │→│validate│→│fmt chk │→│  plan   │ │        │
│  │  └────────┘  └────────┘  └────────┘  └───┬────┘ │        │
│  └──────────────────────────────────────────┬───────┘        │
│                                              │                │
│         ┌────────────────────────────────────┘                │
│         ↓                                                     │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Policy Check (OPA / Sentinel)                    │        │
│  │  ┌────────────────────────────────────────┐      │        │
│  │  │  No public S3 buckets                   │      │        │
│  │  │  All RDS encrypted                      │      │        │
│  │  │  Security groups have descriptions      │      │        │
│  │  └────────────────────────────────────────┘      │        │
│  └──────────────────────────────────────────┬───────┘        │
│                                              │                │
│         ┌────────────────────────────────────┘                │
│         ↓                                                     │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Plan posted as PR comment (tfcmt)                │        │
│  │  Reviewer examines changes                        │        │
│  └──────────────────────────────────────────┬───────┘        │
│                                              │                │
│         ┌────────────────────────────────────┘                │
│         ↓                                                     │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Stage 2: Manual Approval                         │        │
│  │  (GitHub Environment: production-payments)        │        │
│  │  Required reviewers: platform-admins team         │        │
│  └──────────────────────────────────────────┬───────┘        │
│                                              │                │
│         ┌────────────────────────────────────┘                │
│         ↓                                                     │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Stage 3: Apply (on merge to main)                │        │
│  │  ┌────────┐  ┌─────────────────────────────────┐ │        │
│  │  │  init   │→│  apply -auto-approve              │ │        │
│  │  └────────┘  └─────────────────────────────────┘ │        │
│  └──────────────────────────────────────────┬───────┘        │
│                                              │                │
│         ┌────────────────────────────────────┘                │
│         ↓                                                     │
│  ┌──────────────────────────────────────────────────┐        │
│  │  Notify: Slack #payments-infra-changes            │        │
│  └──────────────────────────────────────────────────┘        │
└───────────────────────────────────────────────────────────────┘

How do terraform plan -refresh-only, import blocks, and moved blocks help architects detect and remediate infrastructure drift without destroying resources?

architectstateterraform

▼

Quick Answer

Terraform plan -refresh-only detects drift by comparing actual cloud state against the stored state file without proposing configuration changes. Import blocks bring unmanaged resources under Terraform control declaratively. Moved blocks refactor resource addresses in state without destroying and recreating infrastructure. Together they let architects reconcile drift, adopt existing resources, and restructure code safely.

Detailed Answer

Think of a warehouse inventory system. Drift detection is like a stock audit: you compare what the computer says is on the shelf to what is actually there. Import is like scanning a product that was placed on the shelf without being logged into the system. Move is like changing the shelf label without physically moving the product. All three keep the inventory accurate without throwing anything away. Infrastructure drift occurs when cloud resources are modified outside Terraform, whether through the console, the CLI, another IaC tool, or automated processes like auto-scaling. terraform plan -refresh-only reads the current state of every managed resource from the cloud provider APIs and compares it to the stored state file. It shows what has changed in the real world without proposing any changes to the infrastructure itself. This is distinct from a regular terraform plan, which both refreshes state and compares it to the desired configuration. Running refresh-only plans on a schedule helps teams detect unauthorized changes before they cause incidents. Internally, refresh-only mode calls the same provider Read functions that a normal plan uses, but it stops after updating the in-memory state representation. It shows a diff between the previously stored state and the freshly read state, highlighting attributes that changed externally. If the operator approves the refresh with terraform apply -refresh-only, the state file is updated to match reality without making any infrastructure changes. Import blocks, introduced in Terraform 1.5, allow declarative imports in configuration files rather than the imperative terraform import CLI command. An import block specifies the cloud resource ID (id) and the resource address to bind it to (to). If a matching resource block already exists, terraform plan shows the import together with any attribute differences; if it does not exist yet, terraform plan -generate-config-out=<file> can write a starting resource block for it.
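
As a rough sketch of the import workflow described above, the shell function below wraps terraform plan -generate-config-out, which asks Terraform to draft resource blocks for import targets that have no configuration yet. The function name generate_import_config and the default file name generated.tf are illustrative assumptions. The guard mirrors Terraform's requirement that the output path be a new file.

```shell
#!/usr/bin/env bash
# Sketch only: generate_import_config and generated.tf are hypothetical names.

# Ask Terraform to draft resource blocks for import targets lacking configuration.
generate_import_config() {
  local out_file="${1:-generated.tf}"

  # Terraform expects the output path to be a new file, so stop early if it exists.
  if [ -e "$out_file" ]; then
    echo "refusing to overwrite existing file: $out_file" >&2
    return 1
  fi

  # Requires Terraform 1.5 or later and at least one import block in the configuration.
  terraform plan -input=false -generate-config-out="$out_file"
}
```

The generated file is a starting point: review it and fold it into the team's modules and naming conventions before applying.
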
Moved blocks tell Terraform that a resource has been renamed or restructured in the configuration (for example, moving from a flat resource to a module or changing a resource's for_each key), so it updates the state address rather than planning a destroy and create. At production scale, drift detection should be automated. Teams run terraform plan -refresh-only in CI on a daily schedule and alert on any detected drift. The plan output is stored as an artifact for the audit trail. Import blocks are essential during brownfield adoption, when a company has existing infrastructure created manually or by CloudFormation and wants to manage it with Terraform. Without import blocks, the alternative is terraform import commands that must be run manually for each resource, which is error-prone and not version-controlled. Moved blocks are critical during refactoring: when a team restructures modules, renames resources for clarity, or converts single resources to for_each collections, moved blocks prevent Terraform from destroying the production database and recreating it. The non-obvious gotcha with refresh-only is that it only detects drift in resources Terraform already manages; it cannot find resources created outside Terraform. Teams need cloud-native tools like AWS Config or Azure Policy for complete drift coverage. With import blocks, any generated configuration may not match the team's coding standards and needs manual cleanup. With moved blocks, the from address must exactly match the current state address, including module paths and index keys. If it does not, the moved block matches nothing and is silently ignored, so the plan falls back to destroying the resource at the old address and creating a new one at the new address. Architects should always run terraform plan after adding moved blocks and verify that no destroy/create actions appear.
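
The scheduled drift check described above could be sketched roughly as follows. It leans on the -detailed-exitcode flag of terraform plan (0 for no changes, 1 for an error, 2 for changes present). The sketch assumes that in refresh-only mode exit code 2 means the state would change. The function names and file names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names and file names are hypothetical.

# Translate the exit code of `terraform plan -detailed-exitcode` into a label.
# 0 = no changes, 2 = changes present, anything else = error.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a refresh-only plan in an already initialized directory and print the label.
# Nothing is applied; the plan and log are left behind as audit artifacts.
check_drift() {
  local rc=0
  terraform plan -refresh-only -detailed-exitcode -input=false -no-color \
    -out=drift-report.tfplan >drift-report.log 2>&1 || rc=$?
  classify_plan_exit "$rc"
}
```

A CI job could call check_drift on a daily schedule and raise an alert whenever it prints drift or error.
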

Code Example

# Detect drift on the payments infrastructure without changing anything
terraform plan -refresh-only -out=drift-report.tfplan

# Review the drift report to see what changed externally
terraform show drift-report.tfplan

# Apply the saved refresh-only plan to update state to match reality (no infra changes)
# The saved plan already records refresh-only mode, so no extra flag is needed
terraform apply drift-report.tfplan

# Import an existing RDS instance that was created manually in the console
# payments-data/main.tf
import {
  # Specify the AWS resource ID of the existing database
  id = "payments-orders-prod"
  # Map it to this Terraform resource address
  to = aws_db_instance.orders
}

resource "aws_db_instance" "orders" {
  # Identifier matching the existing RDS instance name
  identifier     = "payments-orders-prod"
  # Instance class matching the existing configuration
  instance_class = "db.r6g.large"
  # Engine matching the existing database
  engine         = "postgres"
  # Engine version matching the existing database
  engine_version = "16.3"
  # Storage matching the existing allocation
  allocated_storage = 500
  # Prevent accidental deletion of the production database
  deletion_protection = true
  # Take a final snapshot if the instance is ever deleted
  # (final_snapshot_identifier must then be set before a destroy)
  skip_final_snapshot = false
  # Tag for operational identification
  tags = {
    Team        = "payments"
    Environment = "prod"
    ManagedBy   = "terraform"
  }
}

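# Later refactor: move the resource into a module without destroying it
# The resource block above must first be removed from the root module,
# because the old address may no longer be declared once it is a move source
# Use a moved block to update the state address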
moved {
  # Old address before modularization
  from = aws_db_instance.orders
  # New address inside the database module
  to   = module.orders_database.aws_db_instance.this
}
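
To act on the advice about verifying moved blocks, a plan can be scanned for destructive actions before anyone approves it. The helper below is a minimal sketch, not part of Terraform: find_destructive_actions is a hypothetical function that reads human-readable plan text on standard input and looks for the phrases Terraform prints for destroys and replacements. That text format is not a stable interface, so the JSON from terraform show -json is a sturdier basis for real tooling.

```shell
#!/usr/bin/env bash
# Sketch only: find_destructive_actions is a hypothetical helper name.

# Read plan text on stdin, print any destroy/replace headers, and
# return 1 when at least one is found (0 when the plan is non-destructive).
find_destructive_actions() {
  if grep -E 'will be destroyed|must be replaced'; then
    return 1
  fi
  return 0
}

# Example wiring (assumes a plan saved as refactor.tfplan):
#   terraform show -no-color refactor.tfplan | find_destructive_actions
```

A correctly matched moved block shows up in the plan as a "has moved to" line with no destroy or create, so the helper stays silent and returns 0.
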

◈ Architecture Diagram

┌──────────┐
│ Cloud    │
│ (actual) │
└────┬─────┘
     │ refresh
┌────┴─────┐
│ State    │
│ (stored) │
└────┬─────┘
     │ compare
┌────┴─────┐
│ Config   │
│ (desired)│
└────┬─────┘
     │
┌────┴─────┐
│ Plan     │
│ (action) │
└──────────┘

What is the difference between terraform plan and terraform apply?

intermediategeneralterraform

▼

Quick Answer

Terraform plan is a dry-run that shows what changes Terraform would make without modifying any infrastructure, while terraform apply actually executes those changes. Plan reads state and config, computes a diff, and outputs it; apply performs that diff against real cloud APIs.

Detailed Answer

Understanding the difference between plan and apply is fundamental, but the internals reveal much more than just 'one previews, the other executes.' Think of terraform plan like a restaurant printing a receipt before charging your card — it shows exactly what will happen so you can review it. terraform apply is when the charge actually goes through. When you run terraform plan, Terraform performs several steps internally. First, it loads the configuration by parsing all .tf files in the current directory and resolving module sources. Second, it reads the current state file to understand what resources already exist. Third, it performs a state refresh — making API calls to every cloud provider to check the actual status of each managed resource and updating the in-memory state with real attributes. This refresh step is crucial because someone might have manually changed a security group rule outside of Terraform. Fourth, Terraform builds a dependency graph of all resources, computes the diff between desired state (your HCL) and actual state (refreshed), and produces an execution plan showing creates, updates, and destroys with specific attribute changes. The plan output uses a clear notation: + for create, ~ for update in-place, - for destroy, and -/+ for destroy-then-recreate (also called a forced replacement). When you see -/+ next to your production database, that is the moment you should stop and investigate — it means Terraform wants to destroy and recreate that resource, which could mean data loss. terraform apply by default runs a plan first and asks for confirmation before proceeding. Once confirmed, Terraform walks the dependency graph and makes real API calls to create, update, or destroy resources. It processes independent resources in parallel (up to 10 by default, configurable with -parallelism) and sequential resources in dependency order. 
While apply runs, Terraform records each completed resource operation in state and persists it as the run progresses, so that even if apply is interrupted midway, the state reflects what was actually created. A critical production practice is using saved plan files. You run terraform plan -out=tfplan to save the plan to a binary file, review it, and then run terraform apply tfplan. This guarantees that what you reviewed is exactly what gets applied: no re-computation, and no changes from someone else's commit sneaking in between plan and apply. If the state has changed since the plan was saved, Terraform rejects the plan as stale instead of applying it, and you must plan again. In CI/CD pipelines, this two-stage approach is essential. The plan stage runs in a pull request for review, and the apply stage runs only after merge using the exact saved plan. One subtle gotcha: terraform apply without a saved plan will re-compute the plan at apply time, meaning the infrastructure could have changed between when you reviewed the plan output and when apply runs. In fast-moving environments with multiple teams, this gap can cause surprises. Always use saved plans in production workflows.

Code Example

# Step 1: Run plan and save the output to a binary plan file
# The -out flag saves the computed plan for exact replay during apply
# terraform plan -out=payments-deploy-2024-03-15.tfplan

# Step 2: Review the plan output carefully before applying
# Look for any -/+ (destroy and recreate) on stateful resources
# terraform show payments-deploy-2024-03-15.tfplan

# Step 3: Apply the exact saved plan without re-computation
# terraform apply payments-deploy-2024-03-15.tfplan

# Example configuration that the saved plan above would evaluate
# (the plan/show/apply commands themselves belong in a Makefile or CI script)

# Variable definitions for the payments infrastructure deployment
variable "db_instance_class" {
  # Instance size for the payments database; for Aurora this is set on the
  # aws_rds_cluster_instance resources (not shown), not on the cluster itself
  description = "Instance class for the payments RDS cluster instances"
  # Enforce string type to prevent accidental numeric input
  type = string
  # Default to a production-grade instance size
  default = "db.r6g.xlarge"
}

# RDS cluster that plan will evaluate and apply will create/update
resource "aws_rds_cluster" "payments_db" {
  # Unique cluster identifier following the naming convention
  cluster_identifier = "payments-db-prod-us-east-1"
  # Use Aurora PostgreSQL for the payments database engine
  engine = "aurora-postgresql"
  # Pin to a specific engine version to avoid surprise upgrades
  engine_version = "15.4"
  # Place the database in the payments VPC private subnets
  db_subnet_group_name = aws_db_subnet_group.payments_db_subnets.name
  # Use the payments database security group for network access control
  vpc_security_group_ids = [aws_security_group.payments_db_sg.id]
  # Master username for the database administrator account
  master_username = "payments_admin"
  # Pull the password from AWS Secrets Manager, never hardcode
  master_password = data.aws_secretsmanager_secret_version.db_password.secret_string
  # Enable deletion protection to prevent accidental terraform destroy
  deletion_protection = true
  # Skip final snapshot only in dev; always snapshot in prod
  skip_final_snapshot = false
  # Use a fixed final snapshot name; embedding timestamp() here would change
  # the value on every run and make each plan show a spurious update
  final_snapshot_identifier = "payments-db-prod-final"
  # Tags for cost allocation and ownership tracking
  tags = {
    Service     = "payments-processing"
    Environment = "production"
    BackupTier  = "critical"
  }
}
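
One way to make the review step harder to skip is a small gate that reads the summary line of a saved plan and stops the pipeline when anything is slated for destruction. The sketch below assumes the usual "Plan: N to add, N to change, N to destroy." summary format. planned_destroy_count and gate_plan are hypothetical helper names. A replacement (-/+) counts toward both the add and the destroy totals, so it trips the gate as well.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the plan file name are hypothetical.

# Extract the destroy count from plan text on stdin; prints 0 when the
# summary line is absent (for example, when the plan has no changes).
planned_destroy_count() {
  local count
  count="$(sed -n 's/.*Plan:.* \([0-9][0-9]*\) to destroy.*/\1/p' | head -n 1)"
  echo "${count:-0}"
}

# Return non-zero when the plan text on stdin includes any destroy.
gate_plan() {
  local destroys
  destroys="$(planned_destroy_count)"
  if [ "$destroys" -gt 0 ]; then
    echo "plan would destroy $destroys resource(s); manual review required" >&2
    return 1
  fi
  return 0
}

# Example wiring (assumes a plan saved as tfplan):
#   terraform show -no-color tfplan | gate_plan && terraform apply -input=false tfplan
```

The gate is deliberately coarse: it does not distinguish a throwaway resource from a production database, so a blocked run should lead to a human reading the full plan, not to an automatic override.
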

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────┐
│                    terraform plan                         │
│                                                           │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐             │
│  │ Load HCL │──→│ Read     │──→│ Refresh  │             │
│  │ Config   │   │ State    │   │ via API  │             │
│  └──────────┘   └──────────┘   └────┬─────┘             │
│                                     │                    │
│                              ┌──────▼──────┐             │
│                              │ Compute     │             │
│                              │ Diff        │             │
│                              └──────┬──────┘             │
│                                     │                    │
│                              ┌──────▼──────┐             │
│                              │ Plan Output │             │
│                              │ + create    │             │
│                              │ ~ update    │             │
│                              │ - destroy   │             │
│                              └──────┬──────┘             │
└─────────────────────────────────────┼─────────────────────┘
                                      │
                            ┌─────────▼──────────┐
                            │ Saved Plan File    │
                            │ (.tfplan binary)   │
                            └─────────┬──────────┘
                                      │
┌─────────────────────────────────────┼─────────────────────┐
│                    terraform apply  │                     │
│                              ┌──────▼──────┐             │
│                              │ Walk Dep    │             │
│                              │ Graph       │             │
│                              └──────┬──────┘             │
│                                     │                    │
│                    ┌────────────────┬┴────────────────┐   │
│              ┌─────▼─────┐  ┌──────▼─────┐  ┌────────▼┐ │
│              │ Create    │  │ Update     │  │ Destroy │ │
│              │ Resources │  │ Resources  │  │ Removed │ │
│              └─────┬─────┘  └──────┬─────┘  └────────┬┘ │
│                    └────────────────┼─────────────────┘   │
│                              ┌──────▼──────┐             │
│                              │ Write State │             │
│                              └─────────────┘             │
└───────────────────────────────────────────────────────────┘