24 interview questions · docker, kubernetes, terraform
Quick Answer
Least-privilege RBAC in EKS maps IAM roles to Kubernetes Roles and ClusterRoles through the aws-auth ConfigMap or EKS access entries. Dev teams get namespace-scoped Roles, CI/CD pipelines use dedicated ServiceAccounts with deploy-only permissions, and production admin access uses break-glass procedures with time-bound credentials and full audit logging.
Detailed Answer
Think of RBAC in EKS like a bank's physical access control system. A teller can access their counter and the shared vault during business hours. A security guard can access the camera room and patrol areas but cannot open customer safe deposit boxes. The branch manager has a master key stored in a sealed envelope that requires two signatures to open and is used only during emergencies. Each person has exactly the permissions they need for their daily job, and any elevation requires approval, logging, and time limits. EKS RBAC follows the same principle across three distinct personas: developers, pipelines, and production administrators.
For development teams, the foundation is namespace isolation. Each team or application gets its own Kubernetes namespace, and a Role scoped to that namespace grants only the verbs and resources the team needs. A developer working on the payments-api service needs permission to view pods, logs, events, and ConfigMaps in the payments namespace but should never modify Deployments directly or access Secrets in the fraud-detection namespace. EKS maps these permissions through IAM role assumption: developers authenticate via AWS SSO, assume an IAM role like payments-dev-role, and the aws-auth ConfigMap or EKS access entries map that IAM role to a Kubernetes Group bound to the appropriate Role. This separation means revoking a developer's access is a single IAM change, not a cluster-level operation.
CI/CD pipelines require a different permission model because they are non-interactive, automated, and high-privilege by nature. A Jenkins or GitHub Actions pipeline that deploys to Kubernetes needs permission to create and update Deployments, Services, and ConfigMaps, but it should never read Secrets directly, modify RBAC bindings, or access other namespaces. Each pipeline gets a dedicated Kubernetes ServiceAccount with a Role that permits only the resources it deploys. Pipeline credentials should be short-lived rather than static long-lived tokens: a runner inside the cluster uses a projected ServiceAccount token that the kubelet rotates automatically, and IAM Roles for Service Accounts (IRSA) supplies any AWS credentials it needs, so AWS API calls can be audited through CloudTrail and Kubernetes API calls through the cluster audit log. Pipeline permissions are further restricted by resource names where possible: a payments-api pipeline can only update the payments-api Deployment, not the fraud-detector Deployment in the same namespace.
Production administrator access in a banking environment follows the break-glass model. Day-to-day operations should not require cluster-admin access. Instead, SRE teams have read-only ClusterRoles that allow viewing resources across all namespaces for monitoring and troubleshooting. When a genuine emergency requires elevated access, engineers request temporary credentials through a privileged access management system like CyberArk or AWS IAM Identity Center with a defined session duration. The break-glass IAM role maps to a cluster-admin ClusterRoleBinding, and every action taken with that role is logged to CloudTrail, Kubernetes audit logs, and the organization's SIEM. After the incident window closes, the session expires automatically.
The production gotcha that catches many teams is RBAC drift and stale bindings. Over months, teams accumulate RoleBindings for departed employees, decommissioned pipelines, and experimental namespaces. Without periodic access reviews, the RBAC surface area grows silently. Banking regulators such as the OCC expect periodic access certifications (quarterly is a common cadence), so mature teams automate RBAC auditing by exporting all RoleBindings and ClusterRoleBindings, mapping them to active IAM identities, and flagging orphaned or overly broad bindings. Another common mistake is granting wildcard permissions on resources or verbs during initial setup and never narrowing them. A ClusterRole with apiGroups: ['*'], resources: ['*'], and verbs: ['*'], bound through a ClusterRoleBinding, grants essentially the same power as cluster-admin and defeats the entire purpose of RBAC.
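The audit step described above (export bindings, compare subjects with active identities, flag orphaned or overly broad bindings) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive tool: the sample export and the `ACTIVE_SUBJECTS` set are hypothetical, and a real run would load the `items` list from something like `kubectl get rolebindings,clusterrolebindings -A -o json` and build the active set from your identity source.

```python
def audit_bindings(bindings, active_subjects):
    """Return (orphaned, cluster_admin) lists of binding names."""
    orphaned, cluster_admin = [], []
    for binding in bindings:
        name = binding["metadata"]["name"]
        subjects = binding.get("subjects") or []
        # Orphaned: none of the binding's subjects is still an active identity.
        if not any((s["kind"], s["name"]) in active_subjects for s in subjects):
            orphaned.append(name)
        # Overly broad: anything that hands out cluster-admin deserves review.
        if binding["roleRef"]["name"] == "cluster-admin":
            cluster_admin.append(name)
    return orphaned, cluster_admin


# Hypothetical export, shaped like the "items" array of a kubectl JSON listing.
SAMPLE_BINDINGS = [
    {"metadata": {"name": "payments-dev-binding", "namespace": "payments"},
     "subjects": [{"kind": "Group", "name": "payments-dev-team"}],
     "roleRef": {"kind": "Role", "name": "payments-dev-readonly"}},
    {"metadata": {"name": "old-pipeline-binding", "namespace": "payments"},
     "subjects": [{"kind": "ServiceAccount", "name": "retired-deployer"}],
     "roleRef": {"kind": "Role", "name": "cicd-deployer"}},
    {"metadata": {"name": "breakglass-admin"},
     "subjects": [{"kind": "Group", "name": "sre-breakglass"}],
     "roleRef": {"kind": "ClusterRole", "name": "cluster-admin"}},
]

# Hypothetical set of identities that are still active.
ACTIVE_SUBJECTS = {("Group", "payments-dev-team"), ("Group", "sre-breakglass")}

orphaned, cluster_admin = audit_bindings(SAMPLE_BINDINGS, ACTIVE_SUBJECTS)
print("orphaned:", orphaned)
print("cluster-admin:", cluster_admin)
```

Matching on (kind, name) pairs keeps a Group and a ServiceAccount with the same name from masking each other; a production version would probably also compare ServiceAccount namespaces.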
Code Example
# Namespace-scoped Role for payments dev team (read-only, no secrets)
# rbac-payments-dev.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-dev-readonly
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
# Bind IAM-mapped group to the Role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-dev-binding
  namespace: payments
subjects:
  - kind: Group
    name: payments-dev-team # Mapped from IAM role via aws-auth
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: payments-dev-readonly
  apiGroup: rbac.authorization.k8s.io
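---
# CI/CD pipeline ServiceAccount with deploy-only permissions
# rbac-cicd-payments.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cicd-payments-deployer
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/cicd-payments-deployer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cicd-deployer
  namespace: payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    resourceNames: ["payments-api"] # Restrict to specific deployment
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create", "update"]
---
# Bind the Role to the pipeline ServiceAccount (a Role grants nothing until bound)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cicd-deployer-binding # Illustrative name
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: cicd-payments-deployer
    namespace: payments
roleRef:
  kind: Role
  name: cicd-deployer
  apiGroup: rbac.authorization.k8s.io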
---
# Break-glass ClusterRoleBinding for production emergencies
# rbac-breakglass.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: breakglass-admin
  annotations:
    audit.bank.com/justification: "emergency-access-only"
    audit.bank.com/max-duration: "2h"
subjects:
  - kind: Group
    name: sre-breakglass # Time-bound IAM role via AWS SSO
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
# Audit: find all ClusterRoleBindings granting cluster-admin
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | .metadata.name'
◈ Architecture Diagram
┌─────────────────────────────────────────────────────┐
│                     EKS Cluster                     │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────┐  aws-auth   ┌───────────────────┐     │
│  │ AWS SSO  │────────────→│ K8s Group Mapping │     │
│  │ IAM Role │             └─────────┬─────────┘     │
│  └──────────┘                       │               │
│                                     ↓               │
│  ┌────────────────┐  ┌────────────────────────┐     │
│  │ Dev Team       │  │ payments namespace     │     │
│  │ (read-only)    │→ │ Role: pods,logs,events │     │
│  └────────────────┘  └────────────────────────┘     │
│                                                     │
│  ┌────────────────┐  ┌────────────────────────┐     │
│  │ CI/CD Pipeline │  │ payments namespace     │     │
│  │ (IRSA token)   │→ │ Role: deploy only      │     │
│  └────────────────┘  └────────────────────────┘     │
│                                                     │
│  ┌────────────────┐  ┌────────────────────────┐     │
│  │ SRE Break-Glass│  │ cluster-admin          │     │
│  │ (time-bound)   │→ │ 2hr session + audit    │     │
│  └────────────────┘  └────────────────────────┘     │
└─────────────────────────────────────────────────────┘
Quick Answer
EKS uses IAM for authentication and Kubernetes RBAC for authorization. IAM roles or SSO identities are mapped to Kubernetes groups via the aws-auth ConfigMap or EKS access entries, then RoleBindings and ClusterRoleBindings grant permissions to those groups. L1 gets read-only, L2 gets namespace-scoped edit, Developers get deploy permissions, and Admins get cluster-admin.
Detailed Answer
Think of a hospital with badge-based access. Your employee badge (IAM identity) gets you through the front door (authentication), but which rooms you can enter depends on your department and clearance level (RBAC authorization). A nurse can access patient rooms but not the pharmacy vault. A doctor can access both. The badge system and the room access system are separate but connected. In EKS, authentication and authorization are separate layers. Authentication answers 'who are you?' using AWS IAM — users, roles, or SSO identities present AWS credentials to the EKS API server. Authorization answers 'what can you do?' using Kubernetes RBAC — Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings define permissions. The bridge between them is the EKS access entry system (or the legacy aws-auth ConfigMap) which maps IAM principals to Kubernetes usernames and groups. The implementation involves three steps. First, create IAM roles for each access level: eks-l1-readonly, eks-l2-support, eks-developer, eks-admin. Second, map each IAM role to a Kubernetes group using EKS access entries (preferred) or the aws-auth ConfigMap. Third, create Kubernetes ClusterRoles and RoleBindings that grant appropriate permissions to each group. L1 support gets a ClusterRoleBinding to the built-in view ClusterRole (read-only across all namespaces). L2 support gets RoleBindings to the edit ClusterRole in specific namespaces. Developers get custom Roles with deploy permissions (create/update Deployments, Services, ConfigMaps) in their team namespaces. Admins get ClusterRoleBinding to cluster-admin. At production scale, teams integrate EKS with AWS SSO (IAM Identity Center) so users authenticate through their corporate identity provider. Permission sets in AWS SSO map to IAM roles, which map to Kubernetes groups. This creates a chain: corporate identity → SSO permission set → IAM role → Kubernetes group → RBAC permissions. 
Monitoring should include audit logs for who accessed what, periodic access reviews, and alerts on cluster-admin usage. The non-obvious gotcha is that the aws-auth ConfigMap is a single point of failure. If someone deletes or corrupts it, all IAM-based access to the cluster is lost (except the cluster creator's IAM principal). EKS access entries, the newer mechanism, are managed through the EKS API and are more resilient. Teams should also be aware that IAM permissions and Kubernetes RBAC are evaluated independently — having IAM access to the EKS API does not automatically grant Kubernetes permissions, and vice versa.
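The third step above, giving L2 support RoleBindings to the built-in edit ClusterRole in specific namespaces, is repetitive enough that it is often generated rather than hand-written. The following is a minimal Python sketch of that idea, not a definitive implementation: the `L2_NAMESPACES` list and the `<group>-edit` naming scheme are assumptions for illustration, while the `l2-support` group and the `edit` ClusterRole come from the text.

```python
import json

RBAC_API_GROUP = "rbac.authorization.k8s.io"


def edit_binding(group, namespace):
    """Build a RoleBinding that scopes the built-in edit ClusterRole to one namespace."""
    return {
        "apiVersion": f"{RBAC_API_GROUP}/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{group}-edit", "namespace": namespace},
        "subjects": [{"kind": "Group", "name": group, "apiGroup": RBAC_API_GROUP}],
        "roleRef": {"kind": "ClusterRole", "name": "edit", "apiGroup": RBAC_API_GROUP},
    }


# Hypothetical list of namespaces that L2 support is allowed to edit.
L2_NAMESPACES = ["payments", "checkout"]

manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [edit_binding("l2-support", ns) for ns in L2_NAMESPACES],
}

# kubectl accepts JSON as well as YAML, so this output can be piped to
# `kubectl apply -f -` after review.
print(json.dumps(manifest, indent=2))
```

Because each item is a RoleBinding rather than a ClusterRoleBinding, the edit permissions stay confined to the listed namespaces even though the referenced role is cluster-scoped.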
Code Example
# Create EKS access entry for L1 read-only support team
aws eks create-access-entry \
  --cluster-name payments-cluster \
  --principal-arn arn:aws:iam::123456789012:role/eks-l1-readonly \
  --kubernetes-groups l1-support # Maps IAM role to Kubernetes group
# Associate read-only access policy for L1
aws eks associate-access-policy \
  --cluster-name payments-cluster \
  --principal-arn arn:aws:iam::123456789012:role/eks-l1-readonly \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
  --access-scope type=cluster # Grants read-only across all namespaces
# Create Kubernetes RBAC for L2 support with edit access in payments namespace
apiVersion: rbac.authorization.k8s.io/v1 # RBAC API group
kind: RoleBinding # Namespace-scoped permission binding
metadata:
  name: l2-support-edit # Descriptive binding name
  namespace: payments # L2 can edit resources in payments namespace only
subjects:
  - kind: Group # References a Kubernetes group
    name: l2-support # Group name mapped from IAM role
    apiGroup: rbac.authorization.k8s.io # RBAC API group
roleRef:
  kind: ClusterRole # References the built-in edit role
  name: edit # Allows create, update, delete of most resources
  apiGroup: rbac.authorization.k8s.io # RBAC API group
# Verify what permissions a group has
# (--as-group only works together with --as; l1-test-user is a placeholder user name)
kubectl auth can-i list pods -n payments --as=l1-test-user --as-group=l1-support # Returns yes if l1-support is bound to the view ClusterRole
kubectl auth can-i delete pods -n payments --as=l1-test-user --as-group=l1-support # Should return no (read-only cannot delete)
◈ Architecture Diagram
┌──────────┐
│ IAM Role │
└────┬─────┘
↓
┌──────────┐
│ Access │
│ Entry │
└────┬─────┘
↓
┌──────────┐
│ K8s Group│
└────┬─────┘
↓
┌──────────────────────┐
│ RBAC Bindings │
│ L1 → view │
│ L2 → edit (ns) │
│ Dev → deploy (ns) │
│ Admin → cluster-admin│
└──────────────────────┘
Quick Answer
RBAC (Role-Based Access Control) in Kubernetes controls who can perform what actions on which resources. Roles and ClusterRoles define permissions (verbs on resources), RoleBindings and ClusterRoleBindings attach those permissions to subjects (Users, Groups, or ServiceAccounts), with Roles scoped to a namespace and ClusterRoles scoped cluster-wide.
Detailed Answer
Think of RBAC like the security system of a large office building. A Role is like a keycard that opens specific doors on a specific floor (namespace). A ClusterRole is like a master keycard that works across all floors. A RoleBinding is the act of issuing a keycard to a specific employee for a specific floor. A ServiceAccount is an employee badge for an automated system (like the mail robot) that needs to move through certain areas. Without RBAC, every employee would have a master key, which is the equivalent of running everything as cluster-admin. In Kubernetes, RBAC is one of several authorization modules (others include ABAC, Webhook, and Node authorization). It is enabled by default in most distributions and is the standard mechanism for controlling access to the API server. RBAC operates on four object types: Role (namespaced permissions), ClusterRole (cluster-wide permissions), RoleBinding (grants a Role or ClusterRole to subjects in a specific namespace), and ClusterRoleBinding (grants a ClusterRole to subjects across all namespaces). The API server evaluates RBAC rules on every request by checking if any binding grants the requesting subject the required verb on the requested resource. Internally, when a request hits the kube-apiserver, it passes through three stages: Authentication (who are you?), Authorization (are you allowed?), and Admission Control (any mutations or validations?). During the Authorization stage, the RBAC authorizer retrieves all RoleBindings and ClusterRoleBindings that reference the requesting subject. For each binding, it checks if the associated Role or ClusterRole contains a rule that matches the request's verb (get, list, create, update, patch, delete, watch), resource (pods, services, deployments), API group (apps, batch, networking.k8s.io), and optionally the specific resource name. If any rule matches, the request is allowed; if no rule matches across all bindings, the request is denied. 
Rules are purely additive; RBAC has no deny rules.
At scale, RBAC management becomes complex. Large organizations use ClusterRoles as templates bound via RoleBindings in specific namespaces, allowing a single ClusterRole like 'namespace-admin' to be reused across hundreds of namespaces. Aggregated ClusterRoles (using aggregationRule with label selectors) allow CRD operators to automatically extend existing roles. ServiceAccounts are the primary identity for Pods: each namespace has a 'default' ServiceAccount, and Pods that do not specify a ServiceAccount use it. Since Kubernetes 1.24, ServiceAccounts no longer get an auto-generated long-lived token Secret; Pods instead receive short-lived, audience-bound tokens issued by the TokenRequest API and mounted through projected volumes.
A non-obvious gotcha is that RoleBindings can reference ClusterRoles, which is actually a powerful pattern. You define the ClusterRole once and bind it in specific namespaces, which limits its permissions to each bound namespace. Without this pattern, you would need to duplicate Role definitions in every namespace. Another trap: the default ServiceAccount in each namespace usually has no permissions (good), but many teams add permissions to the default ServiceAccount instead of creating dedicated ServiceAccounts per workload. Every Pod in the namespace that does not name its own ServiceAccount then inherits those permissions, violating least privilege. The automountServiceAccountToken: false setting should be applied to the default ServiceAccount, and workload-specific ServiceAccounts should be created for Pods that actually need API access.
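The authorization stage described above boils down to "allow if any rule matches, otherwise deny". The sketch below models that additive matching in Python to make the logic concrete. It is a simplified illustration, not the kube-apiserver implementation: it skips binding lookup, subresources, and non-resource URLs, and the `Rule` class and function names are invented for this example. The sample rules mirror a deploy-only pipeline Role restricted to one Deployment.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    api_groups: list
    resources: list
    verbs: list
    resource_names: list = field(default_factory=list)  # empty list = any name


def _matches(allowed, value):
    """A field matches when it lists the value or the '*' wildcard."""
    return "*" in allowed or value in allowed


def rule_allows(rule, verb, api_group, resource, name=""):
    if not (_matches(rule.verbs, verb)
            and _matches(rule.api_groups, api_group)
            and _matches(rule.resources, resource)):
        return False
    # resourceNames narrows the rule to specific objects when present.
    return not rule.resource_names or name in rule.resource_names


def authorize(rules, verb, api_group, resource, name=""):
    """Additive evaluation: one matching rule allows; no match means deny."""
    return any(rule_allows(r, verb, api_group, resource, name) for r in rules)


# Rules collected from every binding that references the requesting subject.
pipeline_rules = [
    Rule(["apps"], ["deployments"], ["get", "patch", "update"], ["payments-api"]),
    Rule([""], ["configmaps"], ["get", "create", "update"]),
]

print(authorize(pipeline_rules, "patch", "apps", "deployments", "payments-api"))    # True
print(authorize(pipeline_rules, "patch", "apps", "deployments", "fraud-detector"))  # False
print(authorize(pipeline_rules, "get", "", "secrets", "db-password"))               # False
```

Since there is no deny rule to write, the only way to reduce a subject's access is to remove or narrow the rules and bindings that grant it.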
Code Example
# ServiceAccount for the payments processing service
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-processor-sa # Dedicated ServiceAccount for this workload
  namespace: payments # Scoped to payments namespace
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/payments-s3 # IAM role for AWS access
automountServiceAccountToken: true # This SA needs API access
---
# ClusterRole defining permissions for reading secrets and configmaps
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: config-reader # Reusable ClusterRole for config reading
rules:
  - apiGroups: [""] # Core API group (empty string)
    resources: ["configmaps", "secrets"] # Can access ConfigMaps and Secrets
    verbs: ["get", "list", "watch"] # Read-only operations
  - apiGroups: [""] # Core API group
    resources: ["pods"] # Can view Pod status
    verbs: ["get", "list"] # Read-only, no watch needed
  - apiGroups: ["apps"] # Apps API group
    resources: ["deployments"] # Can view Deployment status
    verbs: ["get", "list"] # Read-only access
---
# RoleBinding scoping the ClusterRole to the payments namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-config-reader # Binding name
  namespace: payments # Scoped to payments namespace only
subjects:
  - kind: ServiceAccount # Bind to a ServiceAccount
    name: payments-processor-sa # The dedicated SA created above
    namespace: payments # SA's namespace
roleRef:
  kind: ClusterRole # Reference a ClusterRole (not a Role)
  name: config-reader # The ClusterRole defined above
  apiGroup: rbac.authorization.k8s.io # RBAC API group
---
# Deployment using the dedicated ServiceAccount
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-processor # Payments processor Deployment
  namespace: payments # In payments namespace
spec:
  replicas: 3 # Three replicas for availability
  selector:
    matchLabels:
      app: payments-processor # Pod selector
  template:
    metadata:
      labels:
        app: payments-processor # Pod label
    spec:
      serviceAccountName: payments-processor-sa # Use the dedicated SA, not default
      automountServiceAccountToken: true # Mount the token for API access
      containers:
        - name: payments-processor # Main container
          image: registry.internal.io/payments-processor:v2.0.3 # App image
---
# Lock down the default ServiceAccount to prevent accidental API access
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default # Override the default SA
  namespace: payments # In payments namespace
automountServiceAccountToken: false # Do NOT auto-mount token for default SA
◈ Architecture Diagram
┌──────────┐ ┌──────────┐ ┌──────────┐
│ User / │ │ Role │ │ Resources│
│ Service │ │ Binding │ │ pods,svc │
│ Account │───→│ │───→│ secrets │
└──────────┘ └────┬─────┘ └──────────┘
│
↓
┌──────────┐
│ Role / │
│ Cluster │
│ Role │
│ │
│ verbs: │
│ get,list │
│ create │
└──────────┘
Scope:
┌──────────┐ ┌──────────┐
│ Role │ │ Cluster │
│Namespace │ │ Role │
│ Scoped │ │ Global │
└──────────┘ └──────────┘
Quick Answer
RBAC defines who can perform which actions on which resources, within a namespace or across the cluster, so that only authorized users can read or modify them.
Detailed Answer
Imagine you're managing a company where different departments have different levels of access to sensitive information. For example, HR has access to employee records, while IT controls the network infrastructure. RBAC in Kubernetes is like setting up these rules: defining roles based on job functions (e.g., editor, viewer) and then assigning those roles to specific users or groups (like department heads). This makes sure only authorized personnel can make changes or view sensitive data.
Role-Based Access Control (RBAC) in Kubernetes allows administrators to define roles with specific permissions and bind these roles to users, groups, or ServiceAccounts. This mechanism restricts what actions a subject can perform, such as creating, reading, updating, or deleting resources within a namespace. Role and ClusterRole objects hold rules that list the allowed verbs on resource types; RoleBindings and ClusterRoleBindings link those roles to specific subjects. When a user makes an API request, the API server's authorization stage checks whether any of the user's bindings grants the requested verb on the requested resource.
At scale, engineers need to configure RBAC policies carefully to balance security and usability. They use namespace-specific roles for better isolation between teams and projects. Policy engines like Open Policy Agent (typically deployed as an admission controller through Gatekeeper) can enforce additional guardrails by checking requests against defined rules. Common issues include overly permissive policies leading to accidental or malicious modifications, or complex permission hierarchies that are difficult to manage.
A critical gotcha is the difference between namespace-scoped and cluster-wide roles. A Role applies only within a single namespace, while a ClusterRole applies across all namespaces when bound with a ClusterRoleBinding and only within one namespace when bound with a RoleBinding. Binding a role at the wrong scope can grant far more, or far less, access than intended.
Code Example
# Create a Role that allows reading pods in the payments namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader # Role name
  namespace: payments # Scoped to payments namespace
rules:
  - apiGroups: [""] # Core API group
    resources: ["pods", "pods/log"] # Can read pods and their logs
    verbs: ["get", "list", "watch"] # Read-only operations
---
# Bind the role to the dev-team group
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-pod-reader
  namespace: payments
subjects:
  - kind: Group
    name: dev-team # Kubernetes group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader # References the Role above
  apiGroup: rbac.authorization.k8s.io
# Test what a member of the group can do
# (--as-group only works together with --as; dev-test-user is a placeholder user name)
kubectl auth can-i list pods -n payments --as=dev-test-user --as-group=dev-team
Quick Answer
IRSA (IAM Roles for Service Accounts) uses an OIDC identity provider to authenticate Kubernetes service account tokens with AWS IAM. The pod receives a projected service account token, presents it to AWS STS, and receives short-lived credentials scoped to a specific IAM role. No access keys are stored in secrets or environment variables.
Detailed Answer
Think of a hotel key card system. Instead of giving every guest a master key (hardcoded credentials), the front desk verifies your identity (OIDC provider), issues a card (short-lived token) that only opens your specific room (IAM role), and the card expires at checkout. If someone steals the card, it stops working soon and only ever opened one room. IRSA works the same way for pods accessing AWS services. Without IRSA, teams typically use one of three insecure patterns: storing AWS access keys in Kubernetes Secrets (which can leak through RBAC, etcd backups, or misconfigured pod access), assigning an IAM instance profile to the entire node (which gives every pod on that node the same permissions), or using tools like kube2iam that intercept the metadata endpoint (which adds complexity and latency). IRSA eliminates all three by giving each pod its own identity that AWS trusts directly. The mechanism works through several coordinated components. First, the EKS cluster has an OIDC provider registered with AWS IAM. This tells AWS to trust tokens issued by the Kubernetes API server. Second, an IAM role is created with a trust policy that specifies which Kubernetes service account in which namespace can assume it. The trust policy condition checks the OIDC issuer, the audience, and the subject (system:serviceaccount:namespace:sa-name). Third, the Kubernetes ServiceAccount is annotated with eks.amazonaws.com/role-arn pointing to the IAM role. Fourth, when a pod using that ServiceAccount starts, the EKS pod identity webhook injects a projected service account token volume and sets AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables. The AWS SDK in the application reads these, calls STS AssumeRoleWithWebIdentity, and receives temporary credentials. In production, IRSA provides least-privilege access at the pod level. 
The payments-api can access only the S3 bucket it needs and the SQS queue it consumes from, while the checkout-worker in the same namespace can access DynamoDB but not S3. If one pod is compromised, the blast radius is limited to its specific IAM role permissions. Credentials rotate automatically: the kubelet refreshes the projected token before it expires, and the STS credentials obtained with it are temporary (one hour by default, with a 15-minute minimum and a ceiling set by the role's maximum session duration, up to 12 hours). Every AssumeRoleWithWebIdentity call is recorded in CloudTrail, which makes misuse of a stolen token traceable.
The non-obvious gotcha is that IRSA requires the application to use an AWS SDK version that supports web identity token authentication (SDK v2 or AWS SDK for Go v1.25+, Python boto3 1.9.220+). Legacy applications that only read static keys from environment variables will not work without code changes. Another common issue is trust policy misconfiguration: if the namespace or service account name in the condition does not match exactly, the AssumeRoleWithWebIdentity call is denied, and depending on the SDK's credential chain the pod either falls back to node-level permissions or surfaces an AccessDenied error. EKS Pod Identity is the newer alternative that simplifies the trust policy setup but requires the EKS Pod Identity Agent DaemonSet.
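Because the trust policy condition has to match the issuer, audience, and subject exactly, it helps to see the document written out. The Python sketch below builds one for a given ServiceAccount. Treat it as an illustration under stated assumptions rather than the definitive policy for any cluster: the issuer URL is a made-up placeholder (the real value comes from `aws eks describe-cluster`), and the function name is invented for this example.

```python
import json


def irsa_trust_policy(account_id, oidc_issuer_url, namespace, service_account):
    """Build an IAM trust policy that lets one ServiceAccount assume the role."""
    # IAM identifies the provider by the issuer URL without the scheme.
    provider = oidc_issuer_url.split("://", 1)[-1]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Audience the projected token is issued for.
                f"{provider}:aud": "sts.amazonaws.com",
                # Must match the namespace and ServiceAccount name exactly.
                f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
            }},
        }],
    }


# Hypothetical issuer URL used only for illustration.
ISSUER = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789"

policy = irsa_trust_policy("123456789012", ISSUER, "payments", "payments-api")
print(json.dumps(policy, indent=2))
```

Using StringEquals on the `sub` claim pins the role to a single ServiceAccount; a wildcard match would let any ServiceAccount that fits the pattern assume the role, which widens the blast radius the text warns about.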
Code Example
# Check if the EKS cluster has an OIDC provider configured
aws eks describe-cluster --name production --query 'cluster.identity.oidc.issuer'
# Verify the ServiceAccount has the IAM role annotation
kubectl get sa payments-api -n payments -o yaml | grep eks.amazonaws.com/role-arn
# Inspect a running pod to confirm IRSA environment variables are injected
kubectl exec -n payments payments-api-7f8d9c-x4k -- env | grep AWS
# Expected output:
# AWS_ROLE_ARN=arn:aws:iam::123456789012:role/payments-api-s3-access
# AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
# Check the projected token is mounted in the pod
kubectl exec -n payments payments-api-7f8d9c-x4k -- ls /var/run/secrets/eks.amazonaws.com/serviceaccount/
# Verify the IAM role trust policy allows the correct service account
aws iam get-role --role-name payments-api-s3-access --query 'Role.AssumeRolePolicyDocument'
# Test that the pod can actually assume the role (requires the AWS CLI in the container image)
kubectl exec -n payments payments-api-7f8d9c-x4k -- aws sts get-caller-identity
# If IRSA fails, check the OIDC provider is registered in IAM
aws iam list-open-id-connect-providers
◈ Architecture Diagram
┌──────────────┐ 1. Token issued ┌──────────────────┐
│ K8s API │─────────────────────────▶│ Pod (payments) │
│ Server │ │ ServiceAccount │
└──────┬───────┘ └────────┬──────────┘
│ │
│ 2. OIDC validates │ 3. AssumeRoleWithWebIdentity
↓ ↓
┌──────────────┐ ┌──────────────────┐
│ AWS IAM │◀─────────────────────────│ AWS STS │
│ OIDC Provider│ trust policy check │ │
└──────────────┘ └────────┬──────────┘
│ 4. Temporary credentials
↓
┌──────────────────┐
│ AWS S3 / SQS │
└──────────────────┘
Quick Answer
NetworkPolicies control pod-to-pod network traffic. RBAC controls who can perform what actions on which Kubernetes API resources. Pod Security Standards restrict what pods can do at runtime (privileged containers, host access, capabilities). Together, they form three layers: API access control, runtime restrictions, and network segmentation.
Detailed Answer
Think of securing a building. RBAC is the badge system that controls who can enter which floors and rooms (API permissions). Pod Security Standards are the building codes that prevent tenants from doing dangerous things like removing fire exits or storing explosives (runtime restrictions). NetworkPolicies are the internal walls and locked corridors that prevent someone on one floor from accessing another floor without authorization (network segmentation). Each layer addresses a different attack vector, and all three are needed for comprehensive security. RBAC (Role-Based Access Control) governs who can interact with the Kubernetes API and what operations they can perform. A Role defines permissions (verbs like get, list, create, delete on resources like pods, secrets, deployments) within a namespace. A ClusterRole defines permissions cluster-wide. RoleBindings and ClusterRoleBindings associate roles with users, groups, or service accounts. Without RBAC, a compromised service account could read Secrets from other namespaces, create privileged pods, or delete critical workloads. Properly scoped RBAC ensures the payments-api ServiceAccount can only read its own ConfigMaps and Secrets, not those belonging to other teams. Pod Security Standards (the replacement for the deprecated PodSecurityPolicy) define three levels: Privileged (unrestricted), Baseline (prevents known privilege escalations), and Restricted (heavily hardened). These are enforced through the Pod Security Admission controller using namespace labels. Restricted mode prevents running as root, using host networking, mounting hostPath volumes, adding Linux capabilities, and running privileged containers. This matters because a container escape from a privileged pod gives full root access to the host node, which compromises all pods on that node and potentially the entire cluster. NetworkPolicies, as the third layer, restrict which pods can communicate with which other pods and external systems. 
Even if an attacker compromises a pod, NetworkPolicies prevent lateral movement to the database, secrets store, or other microservices. Combined with RBAC preventing the compromised pod's ServiceAccount from reading other Secrets, and Pod Security preventing privilege escalation to the host, the blast radius of a single compromised container is contained to that container's existing data and network connections. In production, these three controls must be deployed together because each has blind spots. RBAC alone cannot prevent a pod from connecting to a database it should not access (that is a network concern). NetworkPolicies alone cannot prevent a pod from running as root and escaping to the host. Pod Security alone cannot prevent a compromised pod from calling the Kubernetes API to read secrets. Defense in depth means that bypassing one control does not grant full access. The non-obvious gotcha is that Pod Security Admission only warns or denies at pod creation time — it does not retroactively affect running pods. If you add Restricted enforcement to a namespace with existing non-compliant pods, those pods continue running until they are recreated. Another common gap is that RBAC for ServiceAccounts often starts too permissive (using default ServiceAccount with broad permissions) and is never tightened. Teams should create dedicated ServiceAccounts per workload with minimal permissions and disable token automounting for pods that do not need API access.
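Pod Security Admission is driven entirely by namespace labels, and since enforcement only applies when pods are created, a common approach is to warn and audit at a stricter level than the one being enforced before tightening further. The Python sketch below generates such a Namespace manifest. It is a minimal example under stated assumptions: the function name and the staged-rollout choice (enforce baseline, warn and audit at restricted) are illustrative and not taken from the text, while the label keys are the standard `pod-security.kubernetes.io` ones.

```python
import json

PSA_LEVELS = ("privileged", "baseline", "restricted")
PSA_PREFIX = "pod-security.kubernetes.io"


def psa_namespace(name, enforce, warn_and_audit):
    """Build a Namespace manifest carrying Pod Security Admission labels."""
    for level in (enforce, warn_and_audit):
        if level not in PSA_LEVELS:
            raise ValueError(f"unknown Pod Security level: {level}")
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {
            "name": name,
            "labels": {
                # Pods violating this level are rejected at creation time.
                f"{PSA_PREFIX}/enforce": enforce,
                # Violations of this level are allowed but surfaced to users and audit logs.
                f"{PSA_PREFIX}/warn": warn_and_audit,
                f"{PSA_PREFIX}/audit": warn_and_audit,
            },
        },
    }


ns = psa_namespace("payments", enforce="baseline", warn_and_audit="restricted")
print(json.dumps(ns, indent=2))
```

Once the warnings and audit entries for the namespace are clean, switching `enforce` to restricted is a one-label change, and existing pods are only affected when they are next recreated.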
Code Example
# Check RBAC: what can the payments-api ServiceAccount do?
kubectl auth can-i --list --as=system:serviceaccount:payments:payments-api -n payments
# Verify Pod Security Admission labels on the namespace
kubectl get namespace payments -o yaml | grep pod-security
# Check if any pods are running as root (security concern)
kubectl get pods -n payments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].securityContext}{"\n"}{end}'
# List NetworkPolicies to verify segmentation exists
kubectl get networkpolicy -n payments
# Verify the ServiceAccount does NOT automount API tokens unnecessarily
kubectl get sa payments-api -n payments -o yaml | grep automount
# Check if any pod can reach a service it should not
kubectl exec -n checkout checkout-worker-5d7f -- curl -s --connect-timeout 3 http://payments-db.payments.svc:5432
# Create a restricted Role for the payments-api ServiceAccount
# ---
# apiVersion: rbac.authorization.k8s.io/v1
# kind: Role
# metadata:
# name: payments-api-role
# namespace: payments
# rules:
# - apiGroups: [""]
# resources: ["configmaps", "secrets"]
# verbs: ["get"]
# resourceNames: ["payments-api-config", "payments-api-secrets"]
◈ Architecture Diagram
┌─────────────────────────────────────────────────┐
│             Defense in Depth Layers             │
├─────────────────────────────────────────────────┤
│ Layer 1: RBAC                                   │
│ ┌──────────┐    ┌──────────┐                    │
│ │ User     │───▶│ K8s API  │  who can do what?  │
│ │ / SA     │    │          │                    │
│ └──────────┘    └──────────┘                    │
├─────────────────────────────────────────────────┤
│ Layer 2: Pod Security Standards                 │
│ ┌──────────┐                                    │
│ │ Pod      │  no root, no hostPath,             │
│ │ Runtime  │  no privileged, drop caps          │
│ └──────────┘                                    │
├─────────────────────────────────────────────────┤
│ Layer 3: NetworkPolicy                          │
│ ┌──────┐  allowed  ┌──────┐  blocked  ┌──────┐  │
│ │Pod A │──────────▶│Pod B │     X     │Pod C │  │
│ └──────┘           └──────┘           └──────┘  │
└─────────────────────────────────────────────────┘
Quick Answer
Configure Terraform Enterprise workspaces with scheduled plan-only runs (e.g., nightly) that detect differences between actual infrastructure and the Terraform state. Alert on drift via webhook notifications, categorize drift by severity, and either auto-remediate safe drifts or create tickets for manual review.
Detailed Answer
Think of drift detection like a nightly security guard doing rounds. The guard has a checklist of how every door, window, and safe should look. If something has changed — a window left open, a safe combination altered, a new lock installed — the guard reports it. Drift detection in Terraform works the same way: scheduled plan runs compare what actually exists in AWS against what Terraform expects, and any discrepancy is flagged for investigation. Infrastructure drift occurs when the actual state of cloud resources diverges from the Terraform-declared state. This happens through manual console changes (someone modifies a security group via the AWS console), changes by other tools (an automation script modifies a resource that Terraform also manages), auto-scaling events that modify resource attributes, and AWS service updates that change default behaviors. In a banking environment, drift is a compliance risk — if your Terraform code declares that an RDS instance has encryption enabled but someone disables it through the console, your compliance posture is degraded and your Terraform state does not reflect reality. Terraform Enterprise enables scheduled plan-only runs on workspaces. You configure a workspace to run terraform plan automatically at a set interval — typically nightly for production workspaces and weekly for non-production. The plan compares the current Terraform configuration and state against the actual infrastructure via provider API calls. If the plan detects changes (resources to update, create, or destroy), it means drift has occurred. TFE marks the run as 'planned and finished' with a non-empty plan, and you can configure webhook notifications to alert your team via Slack, PagerDuty, or a custom drift-tracking system. At enterprise scale, not all drift is equal. A changed tag is low-severity drift that might be auto-remediated. 
A modified security group rule is high-severity drift that requires immediate investigation, because someone may have opened a port that violates PCI-DSS. A deleted resource is critical drift that needs urgent attention. Build a drift classification system: the webhook from TFE sends the plan summary to a Lambda function or custom service that parses the plan output, categorizes each change by resource type and attribute, assigns a severity level, and routes the notification appropriately. Low-severity drift creates a Jira ticket for the next sprint. High-severity drift alerts the security team in Slack and opens a ticket. Critical drift pages the security team and triggers an incident response. For drift remediation, there are three approaches. Auto-remediation configures TFE to automatically apply the plan when drift is detected, restoring infrastructure to the declared state. This is appropriate for low-risk drifts like tag changes or description updates, but dangerous for high-risk resources: auto-applying a plan that wants to recreate an RDS instance would cause downtime. Selective auto-remediation uses Sentinel policies to evaluate the drift plan: if the only changes are to tags and descriptions, auto-apply; if the plan includes any destroy or replace actions, block and alert. Manual remediation requires a human to review the drift, determine whether the Terraform code or the infrastructure should be updated, and either apply the plan or update the code to match the new reality. The biggest gotcha is drift detection generating noise that teams ignore. If your scheduled plans consistently show drift from resources that Terraform partially manages (like ASG instance counts that change with auto-scaling), the team learns to dismiss all drift alerts. Use lifecycle ignore_changes blocks in Terraform for attributes that are expected to drift (like the ASG desired_capacity), and ensure your scheduled plans only flag genuine unauthorized changes. 
Another gotcha is API rate limiting: running terraform plan across 200 workspaces simultaneously hammers the AWS API. Stagger your scheduled plans across the night, and use workspace tags to group and schedule them in batches. Finally, drift detection only catches drift in resources Terraform manages; resources created manually outside of Terraform are invisible to it. Complement TFE drift detection with AWS Config rules that detect unmanaged resources.
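The staggering advice can be sketched as a small scheduling helper. This is only an illustration under stated assumptions: the function name `staggered_cron`, the batch size, and the window length are hypothetical, and the resulting cron strings would be handed to whatever scheduler actually triggers the plan runs.

```python
def staggered_cron(workspaces: list[str], start_hour: int,
                   window_minutes: int, batch_size: int) -> dict[str, str]:
    """Assign each workspace a daily cron expression (UTC), spreading
    batches evenly across a window that begins at start_hour."""
    if batch_size < 1 or window_minutes < 1:
        raise ValueError("batch_size and window_minutes must be positive")

    # Split the workspace list into consecutive batches.
    batches = [workspaces[i:i + batch_size]
               for i in range(0, len(workspaces), batch_size)]
    # Minutes between the start of one batch and the next.
    step = window_minutes / max(len(batches), 1)

    schedule: dict[str, str] = {}
    for index, batch in enumerate(batches):
        offset = int(index * step)
        hour = (start_hour + offset // 60) % 24
        minute = offset % 60
        for name in batch:
            schedule[name] = f"{minute} {hour} * * *"
    return schedule
```

With 200 workspaces, a batch size of 20, and a four-hour window starting at 07:00 UTC, this yields ten batches spaced 24 minutes apart, so no more than 20 plans start at the same moment.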
Code Example
# TFE workspace with scheduled drift detection
resource "tfe_workspace" "payments_infra_prod" {
  name              = "payments-infra-production"
  organization      = "bank-platform"
  terraform_version = "1.7.0"
  auto_apply        = false

  vcs_repo {
    identifier     = "bank/payments-infrastructure"
    branch         = "main"
    oauth_token_id = var.github_oauth_id
  }
}

# Scheduled plan-only run for nightly drift detection
# NOTE: tfe_workspace_run_schedule is shown for illustration and may not exist
# in your version of the tfe provider; check the provider documentation.
# Alternatives are health assessments (assessments_enabled = true on the
# workspace) or plan-only runs triggered by an external scheduler via the Runs API.
resource "tfe_workspace_run_schedule" "payments_drift_check" {
  workspace_id = tfe_workspace.payments_infra_prod.id
  # Run plan every night at 2 AM ET (7 AM UTC)
  cron_schedule = "0 7 * * *"
  # Plan only: do not auto-apply
  plan_only = true
}

# Webhook notification for drift alerts
resource "tfe_notification_configuration" "drift_alert" {
  name             = "drift-detection-alert"
  enabled          = true
  workspace_id     = tfe_workspace.payments_infra_prod.id
  destination_type = "generic" # Custom webhook
  url              = "https://drift-handler.bank.internal/webhook"
  triggers = [
    "run:needs_attention", # Plan with changes detected
    "run:errored",         # Plan failed (possible API issue)
  ]
}
---
# Drift classification Lambda (triggered by TFE webhook)
# drift-handler/handler.py
import json
import boto3
# Rank severities explicitly; max() on the raw strings would compare them alphabetically
SEVERITY_RANK = {'low': 0, 'high': 1, 'critical': 2}

# Resource types whose modification is treated as high severity
SECURITY_RESOURCE_TYPES = {
    'aws_security_group_rule',
    'aws_iam_policy',
    'aws_iam_role_policy',
    'aws_kms_key',
    'aws_s3_bucket_policy',
}


def classify_drift(event):
    """Classify drift severity based on resource type and change type."""
    plan_summary = event.get('plan_summary', {})
    changes = plan_summary.get('resource_changes', [])
    severity = 'low'
    findings = []
    for change in changes:
        resource_type = change['type']
        actions = change['actions']
        # Critical: any destroy or replace action
        if 'delete' in actions or 'replace' in actions:
            level = 'critical'
            findings.append(f"CRITICAL: {resource_type} will be {actions}")
        # High: security-related resources modified
        elif resource_type in SECURITY_RESOURCE_TYPES:
            level = 'high'
            findings.append(f"HIGH: {resource_type} drifted")
        # Low: tags, descriptions, non-functional changes
        else:
            level = 'low'
            findings.append(f"LOW: {resource_type} drifted")
        # Keep the highest severity seen so far
        severity = max(severity, level, key=SEVERITY_RANK.get)
    return severity, findings


# pagerduty_alert, slack_alert, create_jira_incident, and create_jira_ticket
# are assumed to be defined elsewhere in the handler.
def route_alert(severity, findings, workspace_name):
    """Route drift alerts based on severity."""
    if severity == 'critical':
        # Page security team immediately
        pagerduty_alert(f"CRITICAL drift in {workspace_name}", findings)
        create_jira_incident(workspace_name, findings)
    elif severity == 'high':
        # Slack alert to security channel
        slack_alert('#security-ops', workspace_name, findings)
        create_jira_ticket('HIGH', workspace_name, findings)
    else:
        # Low priority: create ticket for next sprint
        create_jira_ticket('LOW', workspace_name, findings)
---
# Terraform lifecycle blocks to reduce drift noise
# Ignore expected drift from auto-scaling
resource "aws_autoscaling_group" "payments_api" {
  # ... configuration ...
  lifecycle {
    ignore_changes = [
      desired_capacity,  # Changes with auto-scaling
      target_group_arns, # Changes with blue-green deploys
    ]
  }
}

# Ignore expected drift from external secret rotation
resource "aws_db_instance" "settlements_db" {
  # ... configuration ...
  lifecycle {
    ignore_changes = [
      password, # Rotated by Vault, not managed by Terraform
    ]
  }
}
◈ Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐ │ Infrastructure Drift Detection Pipeline │ │ │ │ ┌──────────────────┐ │ │ │ TFE Workspace │ Scheduled: Nightly at 2 AM │ │ │ Plan-Only Run │──────────────────────────────┐ │ │ └──────────────────┘ │ │ │ ▼ │ │ ┌───────────────────────────────────────────────────────────┐ │ │ │ terraform plan (read-only) │ │ │ │ │ │ │ │ Declared State ←──compare──→ Actual Infrastructure │ │ │ │ (Terraform code) (AWS API responses) │ │ │ └──────────────────────┬────────────────────────────────────┘ │ │ │ │ │ ┌──────────▼──────────┐ │ │ │ Changes Detected? │ │ │ └──┬──────────────┬───┘ │ │ No │ │ Yes │ │ ┌──────▼────┐ ┌──────▼───────────────────────────┐ │ │ │ No drift │ │ Webhook → Drift Classifier │ │ │ │ All good │ │ │ │ │ └───────────┘ │ ┌─────────┐ ┌────────┐ ┌─────┐ │ │ │ │ │CRITICAL │ │ HIGH │ │ LOW │ │ │ │ │ │Delete/ │ │SecGroup│ │Tags │ │ │ │ │ │Replace │ │IAM/KMS │ │Desc │ │ │ │ │ │→ Page │ │→ Slack │ │→Jira│ │ │ │ │ │ SecOps │ │ Alert │ │ Tkt │ │ │ │ │ └─────────┘ └────────┘ └─────┘ │ │ │ └──────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────────┘
Quick Answer
Store state in an S3 bucket with versioning enabled, server-side encryption (SSE-KMS), and a DynamoDB table for state locking. Structure the bucket with environment-prefixed keys (prod/networking/terraform.tfstate) and restrict access using IAM policies scoped to each environment's prefix. Prevent corruption through locking, versioning for rollback, and CI/CD-only access patterns.
Detailed Answer
Terraform state file management is like managing a bank vault's ledger: the ledger (state file) records what is in every safe deposit box (cloud resource), and if the ledger is corrupted or leaked, you either lose track of assets or expose their locations to unauthorized parties. The storage, encryption, access control, and corruption prevention for state files must be treated with the same rigor as production database backups. The S3 backend is the standard for AWS-centric teams. The bucket itself requires several hardening measures: versioning enabled so you can recover previous state versions if an apply corrupts the current state, server-side encryption with a dedicated KMS key (not the default aws/s3 key) so you can audit and rotate encryption independently, public access blocked via the S3 Block Public Access settings, and a bucket policy that explicitly denies unencrypted uploads. The bucket should be in a dedicated SharedServices or Management account, separate from any workload account, so that workload account compromises cannot directly access state. The key structure within the bucket follows a hierarchy: {account-id-or-env}/{stack-name}/terraform.tfstate. For example: prod/networking/terraform.tfstate, prod/eks-cluster/terraform.tfstate, prod/payments-database/terraform.tfstate. This structure enables per-stack state isolation and per-environment IAM scoping. The Prod account's TerraformExecutionRole gets an S3 policy allowing s3:GetObject and s3:PutObject only on keys prefixed with prod/, while the Dev role can only access dev/. This prevents a Dev pipeline misconfiguration from overwriting Prod state. DynamoDB state locking prevents concurrent modifications. Create a single DynamoDB table (PAY_PER_REQUEST billing) with a partition key named LockID of type String. Every Terraform operation acquires a lock before modifying state by writing a conditional item to this table. 
If two engineers run terraform apply simultaneously on the same stack, the second operation's conditional write fails (DynamoDB returns a ConditionalCheckFailedException) and Terraform exits with a state lock error; if -lock-timeout is set, it retries until the timeout expires instead of failing immediately. The lock record contains the operator's hostname, the operation type, and a timestamp, which helps diagnose stale locks from crashed CI pipelines. Once you have confirmed that no run is still active, a stale lock can be cleared with terraform force-unlock. Corruption prevention goes beyond locking. S3 versioning provides a recovery path: if an apply fails midway and leaves state inconsistent, you can restore a previous version using aws s3api list-object-versions and aws s3api get-object with the desired VersionId. Terraform also writes a backup of the previous state locally before modifying it (terraform.tfstate.backup), though this is less useful in CI/CD where runners are ephemeral. For critical production stacks, enable S3 Replication to copy state to a bucket in another region for disaster recovery. The most dangerous corruption scenario is partial apply failure: Terraform creates some resources but crashes before writing updated state. The created resources become orphans: they exist in AWS but are not tracked by Terraform. Recovery requires manually importing the orphaned resources using terraform import or, in Terraform 1.5+, using import blocks. To reduce this risk, break large configurations into smaller stacks so each apply touches fewer resources, and use the -target flag only as a last resort since it creates partial state updates by design.
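The version-based recovery path can be sketched as a helper that picks which VersionId to restore. This is a minimal illustration, assuming the input is the parsed JSON output of `aws s3api list-object-versions` (a dictionary with a `Versions` list); the function name `previous_state_version` is hypothetical.

```python
from typing import Any, Optional


def previous_state_version(listing: dict[str, Any], key: str) -> Optional[str]:
    """Return the VersionId of the newest non-current version of `key`,
    or None if no earlier version exists."""
    candidates = [
        version for version in listing.get("Versions", [])
        if version.get("Key") == key and not version.get("IsLatest", False)
    ]
    if not candidates:
        return None
    # LastModified values from the CLI are ISO-8601 strings in one format,
    # so comparing them as strings orders them chronologically.
    newest = max(candidates, key=lambda version: version["LastModified"])
    return newest["VersionId"]
```

The returned ID is what you would pass as the version ID to `aws s3api get-object` when retrieving the earlier state file for inspection before restoring it.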
Code Example
# State backend configuration with full security hardening
terraform {
  # S3 backend for remote state storage
  backend "s3" {
    # Dedicated state bucket in the SharedServices account
    bucket = "valuemomentum-terraform-state-prod"
    # Environment-prefixed key for access control scoping
    key = "prod/payments-platform/networking/terraform.tfstate"
    # Primary region for state storage
    region = "us-east-1"
    # DynamoDB table for state locking
    dynamodb_table = "terraform-state-locks"
    # Enable SSE-KMS encryption with a dedicated key
    encrypt = true
    # KMS key ARN for state file encryption
    kms_key_id = "arn:aws:kms:us-east-1:555555555555:key/mrk-abc123"
    # Use the SharedServices account profile for state access
    profile = "valuemomentum-shared-services"
  }
}
# S3 bucket for Terraform state (provisioned once by bootstrap)
resource "aws_s3_bucket" "terraform_state" {
  # Bucket name following organization naming convention
  bucket = "valuemomentum-terraform-state-prod"
  # Prevent accidental deletion of the state bucket
  force_destroy = false
  tags = {
    Purpose   = "terraform-state-storage"
    ManagedBy = "bootstrap-terraform"
  }
}

# Enable versioning for state file recovery
resource "aws_s3_bucket_versioning" "state_versioning" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id
  # Enable versioning to recover from corruption
  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption with dedicated KMS key
resource "aws_s3_bucket_server_side_encryption_configuration" "state_encryption" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      # Use KMS encryption instead of AES-256
      sse_algorithm = "aws:kms"
      # Dedicated KMS key for independent rotation and audit
      kms_master_key_id = aws_kms_key.terraform_state_key.arn
    }
    # Use an S3 Bucket Key to reduce the number of KMS requests and their cost
    bucket_key_enabled = true
  }
}

# Block all public access to the state bucket
resource "aws_s3_bucket_public_access_block" "state_public_block" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id
  # Block all forms of public access
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
  # Table name matching backend configuration
  name = "terraform-state-locks"
  # Pay-per-request to avoid capacity planning
  billing_mode = "PAY_PER_REQUEST"
  # LockID is the required partition key for Terraform
  hash_key = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
  # Enable point-in-time recovery for lock table safety
  point_in_time_recovery {
    enabled = true
  }
}
# IAM policy scoping Prod role to only prod/ state prefix
# data "aws_iam_policy_document" "prod_state_access" {
#   statement {
#     effect    = "Allow"
#     actions   = ["s3:GetObject", "s3:PutObject"]
#     resources = ["arn:aws:s3:::valuemomentum-terraform-state-prod/prod/*"]
#   }
#   statement {
#     effect    = "Deny"
#     actions   = ["s3:*"]
#     resources = ["arn:aws:s3:::valuemomentum-terraform-state-prod/dev/*"]
#   }
# }
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐ │ Terraform State Security Architecture │ ├───────────────────────────────────────────────────────────────┤ │ │ │ SharedServices Account │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ S3: valuemomentum-terraform-state-prod │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ Versioning: Enabled │ │ │ │ │ │ Encryption: SSE-KMS (dedicated key) │ │ │ │ │ │ Public Access: Blocked │ │ │ │ │ │ Replication: us-east-1 → us-west-2 (DR) │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ │ │ │ │ │ Key Structure: │ │ │ │ ├── dev/ │ │ │ │ │ ├── networking/terraform.tfstate │ │ │ │ │ ├── eks-cluster/terraform.tfstate │ │ │ │ │ └── payments-db/terraform.tfstate │ │ │ │ ├── qa/ │ │ │ │ │ └── ... │ │ │ │ ├── uat/ │ │ │ │ │ └── ... │ │ │ │ └── prod/ ← Prod role can ONLY access this │ │ │ │ ├── networking/terraform.tfstate │ │ │ │ ├── eks-cluster/terraform.tfstate │ │ │ │ └── payments-db/terraform.tfstate │ │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ DynamoDB: terraform-state-locks │ │ │ │ ┌──────────────────────────────────────────────┐ │ │ │ │ │ LockID (PK) │ Info │ Who │ Operation │ │ │ │ │ │ prod/net/... │ ... │ ci │ apply │ │ │ │ │ └──────────────────────────────────────────────┘ │ │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ Access Pattern: │ │ CI/CD Runner → OIDC → AssumeRole → Scoped S3 Access │ │ ┌──────────┐ ┌────────────┐ ┌──────────────┐ │ │ │ Pipeline │───→│ Prod Role │───→│ prod/* only │ │ │ │ (OIDC) │ │ (IAM) │ │ (S3 policy) │ │ │ └──────────┘ └────────────┘ └──────────────┘ │ └───────────────────────────────────────────────────────────────┘
Quick Answer
Prevent cross-environment contamination through four layers: separate state files per environment with IAM-scoped access, provider configurations locked to specific AWS accounts via assume_role, module versioning with pinned tags so an untested module change cannot propagate, and CI/CD pipeline guardrails that validate the target environment before apply.
Detailed Answer
Preventing cross-environment contamination in Terraform is like building firewalls between apartments in a building: you need physical separation (state isolation), locked doors (IAM boundaries), independent utilities (provider configurations), and a building code (CI/CD guardrails) that prevents shortcuts through shared walls. The first layer is state file isolation. Each environment must have its own state file with its own backend configuration. Never share a state file between environments, even with workspaces, if the blast radius of corruption is unacceptable. The state file contains sensitive data including resource IDs, IP addresses, and sometimes plaintext outputs. An S3 bucket policy should restrict each environment's Terraform role to only its own key prefix: the prod role can access s3://state-bucket/prod/* but is explicitly denied s3://state-bucket/dev/*. This prevents a misconfigured prod pipeline from reading or overwriting dev state. The second layer is provider-level isolation. Each environment's provider block must assume a role in its specific AWS account. Even if someone accidentally passes the wrong tfvars file, the provider configuration ensures Terraform operates in the correct account. Add a validation check using the aws_caller_identity data source: compare the actual account ID against the expected one and fail early if they do not match. This catches the scenario where an engineer runs terraform apply with prod credentials but dev configuration, or vice versa. The third layer is module versioning. When environments share modules from a private registry or Git repository, use pinned version tags. Dev might use module version 2.3.0-rc1 while Prod uses 2.2.0 (the last stable release). Without version pinning, a module change pushed to the main branch immediately affects every environment that references source = "git::...?ref=main". 
This is the most common cause of accidental cross-environment impact: someone fixes a bug in a shared VPC module, the fix has a typo, and every environment that references the module head picks up the broken code on next apply. The fourth layer is CI/CD pipeline guardrails. The pipeline should validate environment consistency before plan: check that the workspace name matches the tfvars file, verify the AWS account ID matches the target environment, and confirm the Git branch is allowed to deploy to that environment (only main can deploy to prod). Implement a pre-plan script that runs aws sts get-caller-identity and compares the account against an expected value from the pipeline configuration. Remote state data sources are a particularly dangerous vector for cross-environment bleed. When a production EKS module reads the networking module's state via terraform_remote_state, it must reference the production networking state, not dev. Parameterize the remote state data source's backend configuration using the environment variable: data.terraform_remote_state.networking.config.key should resolve to prod/networking/terraform.tfstate, not a hardcoded path. A common gotcha is using terraform_remote_state with a hardcoded key that works in dev but points to prod state when someone copies the configuration without updating the key. The ultimate safeguard is defense in depth: even if one layer fails, the others prevent damage. If the IAM policy has a bug that allows dev access to prod state, the provider's assume_role still locks operations to the dev account. If the provider configuration is wrong, the account ID validation check fails before any resources are touched.
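The pre-plan guardrail described above can be sketched as a single validation function. This is an illustration only: the account IDs are the placeholder values used in the code example for this answer, while the branch mapping, the `<environment>.tfvars` naming convention, and the function name `validate_target` are assumptions made for the sketch.

```python
import os

# Placeholder account IDs, one per environment.
EXPECTED_ACCOUNTS = {
    "dev": "111111111111",
    "qa": "222222222222",
    "uat": "333333333333",
    "prod": "444444444444",
}

# Hypothetical rule: only main may deploy to prod; unlisted environments accept any branch.
ALLOWED_BRANCHES = {"prod": {"main"}}


def validate_target(environment: str, actual_account_id: str,
                    branch: str, tfvars_path: str) -> list[str]:
    """Return a list of guardrail violations; an empty list means the run may proceed."""
    if environment not in EXPECTED_ACCOUNTS:
        return [f"unknown environment '{environment}'"]

    errors: list[str] = []
    expected = EXPECTED_ACCOUNTS[environment]
    if actual_account_id != expected:
        errors.append(
            f"expected account {expected} but authenticated to {actual_account_id}")

    allowed = ALLOWED_BRANCHES.get(environment)
    if allowed is not None and branch not in allowed:
        errors.append(f"branch '{branch}' may not deploy to {environment}")

    # Assumed convention: the variables file is named after the environment.
    if os.path.basename(tfvars_path) != f"{environment}.tfvars":
        errors.append(f"tfvars file '{tfvars_path}' does not match {environment}")
    return errors
```

In a pipeline, the account ID argument would come from `aws sts get-caller-identity`, and any non-empty result would abort the job before terraform plan runs.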
Code Example
# Account identity validation — fail fast on wrong account
# Fetch the actual AWS account identity
data "aws_caller_identity" "current" {}
# Validate the account ID matches the expected environment
locals {
  # Map of expected account IDs per environment
  expected_accounts = {
    dev  = "111111111111"
    qa   = "222222222222"
    uat  = "333333333333"
    prod = "444444444444"
  }
  # Check if current account matches the target environment
  account_validated = (
    data.aws_caller_identity.current.account_id ==
    local.expected_accounts[var.environment]
  )
}

# Validation resource that fails the plan if accounts mismatch
# (uses a precondition, available in Terraform 1.2+, instead of the older count trick)
resource "terraform_data" "account_validation" {
  lifecycle {
    precondition {
      condition     = local.account_validated
      error_message = "Running in the wrong AWS account for this environment."
    }
  }
}
# Remote state data source — parameterized per environment
data "terraform_remote_state" "networking" {
  # S3 backend for reading the networking layer state
  backend = "s3"
  config = {
    # Same state bucket as all other stacks
    bucket = "valuemomentum-terraform-state-prod"
    # Key parameterized by environment to prevent cross-env reads
    key = "${var.environment}/networking/terraform.tfstate"
    # Same region as the backend
    region = "us-east-1"
  }
}

# Use networking outputs safely scoped to the correct environment
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name scoped to the environment
  name     = "payments-eks-${var.environment}"
  version  = "1.29"
  role_arn = aws_iam_role.eks_cluster_role.arn
  vpc_config {
    # Subnet IDs from the SAME environment's networking state
    subnet_ids              = data.terraform_remote_state.networking.outputs.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = var.environment == "prod" ? false : true
  }
}

# Module versioning: pinned per environment
module "payments_vpc" {
  # Pinned Git tag prevents untested changes from propagating
  source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.2.0"
  # In dev, you might test a release candidate:
  # source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.3.0-rc1"
  vpc_name    = "payments-vpc-${var.environment}"
  vpc_cidr    = var.vpc_cidr
  environment = var.environment
}
# CI/CD pre-plan validation script (run before terraform plan)
# #!/bin/bash
# EXPECTED_ACCOUNT=$(jq -r ".${ENVIRONMENT}" accounts.json)
# ACTUAL_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
# if [ "$EXPECTED_ACCOUNT" != "$ACTUAL_ACCOUNT" ]; then
#   echo "FATAL: Expected account $EXPECTED_ACCOUNT but authenticated to $ACTUAL_ACCOUNT"
#   exit 1
# fi
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│ Cross-Environment Protection Layers │
├───────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: State Isolation (IAM-Scoped) │
│ ┌──────────────┐ DENY ┌──────────────┐ │
│ │ Dev Role │─────X─────│ prod/* │ │
│ │ (IAM) │ │ state keys │ │
│ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ALLOW ┌──────────────┐ │
│ │ Dev Role │───────────│ dev/* │ │
│ │ (IAM) │ │ state keys │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Layer 2: Provider Account Lock │
│ ┌──────────────────────────────────────────┐ │
│ │ provider "aws" { │ │
│ │ assume_role { │ │
│ │ role_arn = ".../${var.env}/Role" │ │
│ │ } │ │
│ │ } │ │
│ │ → Operations locked to target account │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Layer 3: Account ID Validation │
│ ┌──────────────────────────────────────────┐ │
│ │ aws_caller_identity.account_id │ │
│ │ == expected_accounts[var.environment] │ │
│ │ → FAIL FAST if wrong account │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Layer 4: Module Version Pinning │
│ ┌──────────────────────────────────────────┐ │
│ │ Dev: source = "...?ref=v2.3.0-rc1" │ │
│ │ Prod: source = "...?ref=v2.2.0" │ │
│ │ → Untested changes cannot reach prod │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Layer 5: CI/CD Pipeline Guardrails │
│ ┌──────────────────────────────────────────┐ │
│ │ Branch → Environment mapping │ │
│ │ main → prod (requires approval) │ │
│ │ develop → dev (auto-apply) │ │
│ │ Pre-plan: sts get-caller-identity check │ │
│ └──────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Implement a multi-stage pipeline: PR triggers terraform plan with output posted as a PR comment, OPA/Sentinel policy checks validate compliance, manual approval gates (GitHub Environments with required reviewers) protect production, and merge-to-main triggers terraform apply using the saved plan file. Use OIDC for keyless authentication and concurrency controls to prevent parallel applies on the same stack.
Detailed Answer
A Terraform CI/CD pipeline is like an air traffic control system: every infrastructure change (flight) must file a plan (flight plan), get reviewed by controllers (PR reviewers), receive clearance (approval gate), and land on the correct runway (target environment) — all while preventing two planes from using the same runway simultaneously (state locking and concurrency control). The pipeline begins with authentication. Modern pipelines use OIDC federation instead of stored AWS credentials. GitHub Actions requests a JWT token from GitHub's OIDC provider, presents it to AWS STS via AssumeRoleWithWebIdentity, and receives short-lived credentials scoped to the Terraform execution role. The OIDC trust policy restricts which repositories, branches, and environments can assume the role: production apply roles should only be assumable by the main branch, while plan roles can be assumed by any branch. This eliminates long-lived access keys that could be exfiltrated from CI secrets. The plan stage runs on every pull request. It executes terraform init, terraform validate, terraform fmt -check, and terraform plan -out=plan.tfplan. The plan output is captured and posted as a PR comment using tools like tfcmt or the native GitHub Actions Terraform setup action. Reviewers see exactly what resources will be created, modified, or destroyed — including sensitive changes like security group rule modifications or IAM policy updates. The saved plan file is uploaded as a CI artifact for use in the apply stage. Policy-as-code gates run between plan and approval. Open Policy Agent (OPA) evaluates the plan JSON (terraform show -json plan.tfplan) against organizational policies: no S3 buckets without encryption, no security groups with 0.0.0.0/0 ingress on port 22, all RDS instances must have deletion protection in production. These checks are non-negotiable — a policy violation fails the pipeline regardless of who approves the PR. 
Sentinel serves the same purpose in Terraform Cloud/Enterprise environments. The approval gate differs by environment. Dev and QA may auto-apply on merge, since the PR review itself is sufficient approval. UAT requires team lead approval via a GitHub Environment with one required reviewer. Production requires two approvals from the platform-admins team, with a 15-minute wait timer to prevent hasty approvals. These are configured as GitHub Environments with protection rules, which the apply job references via the environment keyword. The apply stage triggers after merge to main. Where possible, it should apply the saved plan file from the plan stage rather than re-running plan, so that exactly the changes the reviewers approved are applied even if the configuration or infrastructure has changed since the review. If the saved plan is stale (state serial mismatch), Terraform rejects it and the pipeline must re-plan. After successful apply, the pipeline posts results to a Slack channel (#infra-changes-prod) and creates a GitHub deployment record for audit trail. Concurrency control prevents two merged PRs from applying simultaneously to the same stack. GitHub Actions concurrency groups scoped to the stack name (concurrency: group: terraform-payments-prod) ensure only one apply runs at a time. Queued runs wait for the current apply to complete. Combined with DynamoDB state locking, this provides two layers of concurrent modification prevention.
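To make the policy gate concrete, here is a minimal sketch of one of the rules above (no 0.0.0.0/0 ingress on port 22), written in Python as a stand-in for a Rego or Sentinel policy. It assumes the input is the parsed output of `terraform show -json plan.tfplan` and reads only the `resource_changes` entries for `aws_security_group` and `aws_security_group_rule`; the function names are hypothetical.

```python
from typing import Any


def _covers_port(rule: dict[str, Any], port: int) -> bool:
    """True if the rule's port range includes `port` (protocol -1 means all ports)."""
    if str(rule.get("protocol")) in ("-1", "all"):
        return True
    from_port, to_port = rule.get("from_port"), rule.get("to_port")
    if from_port is None or to_port is None:
        return False
    return from_port <= port <= to_port


def ssh_open_to_world(plan: dict[str, Any]) -> list[str]:
    """Return addresses of planned resources that allow 0.0.0.0/0 ingress on port 22."""
    denied: list[str] = []
    for rc in plan.get("resource_changes", []):
        # "after" is null for resources being deleted, so they are skipped naturally.
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_security_group":
            rules = after.get("ingress") or []
        elif rc.get("type") == "aws_security_group_rule" and after.get("type") == "ingress":
            rules = [after]
        else:
            continue
        if any(_covers_port(rule, 22) and "0.0.0.0/0" in (rule.get("cidr_blocks") or [])
               for rule in rules):
            denied.append(rc.get("address", "<unknown>"))
    return denied
```

A pipeline step would fail the job whenever the returned list is non-empty, which mirrors how a deny rule with a non-empty result fails the OPA check.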
Code Example
# .github/workflows/terraform-payments.yml
# Multi-stage Terraform pipeline with OIDC and approval gates
name: Payments Infrastructure Pipeline
# Trigger on PRs and pushes to main affecting payments infra
on:
  pull_request:
    paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']
  push:
    branches: [main]
    paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']
# OIDC permissions for keyless AWS authentication
permissions:
  id-token: write
  contents: read
  pull-requests: write
# Prevent concurrent applies on the same stack
concurrency:
  group: terraform-payments-prod-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
  # Stage 1: Validate and plan on every PR
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      # Checkout the infrastructure code
      - uses: actions/checkout@v4
      # OIDC authentication: plan role (read-only)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformPlan
          aws-region: us-east-1
      # Install pinned Terraform version
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
      # Initialize the backend and download providers
      - name: Init
        run: terraform -chdir=infrastructure/envs/prod init -input=false
      # Validate syntax and configuration
      - name: Validate
        run: terraform -chdir=infrastructure/envs/prod validate
      # Format check to enforce style standards
      - name: Format Check
        run: terraform fmt -check -recursive infrastructure/
      # Generate execution plan and save to file
      - name: Plan
        run: terraform -chdir=infrastructure/envs/prod plan -input=false -out=prod.tfplan
      # Export plan as JSON for OPA policy evaluation
      - name: Export Plan JSON
        run: terraform -chdir=infrastructure/envs/prod show -json prod.tfplan > plan.json
      # Run OPA policy checks against the plan
      - name: OPA Policy Check
        run: |
          opa eval --data policies/ --input plan.json "data.terraform.deny[msg]" --fail-defined
      # Post plan output as a PR comment for reviewers
      - name: Comment Plan on PR
        uses: borchero/terraform-plan-comment@v2
        with:
          working-directory: infrastructure/envs/prod
      # Upload plan artifact for the apply stage
      - uses: actions/upload-artifact@v4
        with:
          name: prod-tfplan
          path: infrastructure/envs/prod/prod.tfplan
          retention-days: 5
  # Stage 2: Apply after merge with manual approval
  apply:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    # Production environment with required approvers and wait timer
    environment:
      name: production-payments
      url: https://console.aws.amazon.com/eks
    steps:
      - uses: actions/checkout@v4
      # OIDC authentication: apply role (read-write)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformApply
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
      # Re-init and apply. This sketch re-plans at apply time because the saved
      # plan may be stale after merge; to apply exactly what was reviewed,
      # retrieve the prod-tfplan artifact from the PR run and pass it to terraform apply.
      - name: Init and Apply
        run: |
          terraform -chdir=infrastructure/envs/prod init -input=false
          terraform -chdir=infrastructure/envs/prod apply -input=false -auto-approve
      # Notify team of successful deployment
      - name: Slack Notification
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text": "Prod payments infra deployed by ${{ github.actor }}"}'
◈ Architecture Diagram
Terraform CI/CD Pipeline with Approval Gates

PR opened
  ↓
Stage 1: Plan (on every PR)
  OIDC → AssumeRole (Plan Role, read-only)
  init → validate → fmt → plan → plan.json
  OPA Policy Check
    - no public S3
    - encryption on
    - tags required
  PR comment with plan output
  ↓
PR approved + merged to main
  ↓
Stage 2: Manual Approval Gate
  GitHub Environment: production-payments
  Required reviewers: 2 from platform-admins
  Wait timer: 15 minutes
  ↓
Stage 3: Apply (after approval)
  OIDC → AssumeRole (Apply Role, read-write)
  init → apply -auto-approve
  Slack notify #infra-changes

Concurrency: group=terraform-payments-prod (1 at a time)
Quick Answer
Detect drift by running terraform plan -refresh-only to compare actual infrastructure against state without proposing changes. Remediate by either importing the manual change into Terraform (terraform import or import blocks), reverting the manual change by running terraform apply to converge back to the declared configuration, or updating the Terraform code to reflect the intentional change and then applying.
Detailed Answer
Terraform state drift is like someone rearranging furniture in a room that has a blueprint. The blueprint (your configuration, with the state file as the last recorded survey of the room) says the couch is by the window, but someone has moved it to the center of the room (a console change). Terraform detects the discrepancy during the refresh phase of plan and proposes moving the couch back to the window, converging on the declared configuration. The question is whether the move was intentional or accidental, because that determines whether you update the blueprint or move the couch back.
Drift detection happens during the refresh phase of terraform plan. For every resource tracked in state, Terraform calls the cloud provider's API to read the current configuration. If the API response differs from what state records, Terraform updates its in-memory state and then diffs that against your HCL configuration. The -refresh-only flag runs only the refresh phase without proposing configuration-driven changes, making it a pure drift detection scan. The output shows which attributes have drifted and their before and after values.
There are three categories of drift, each requiring a different remediation strategy.
The first is accidental drift: an engineer manually opened port 443 on a security group to debug a connectivity issue and forgot to revert it. The fix is to run terraform apply, which converges the security group back to the declared configuration and removes the manually added rule. This is Terraform's self-healing property: the declared configuration is the source of truth.
The second is intentional drift: an operations engineer manually scaled up an RDS instance from db.r6g.xlarge to db.r6g.2xlarge during a traffic incident. The change was correct and should be preserved. The fix is to update the Terraform code to reflect the new instance class, then run terraform plan to verify that the plan shows no changes (the code now matches reality). If you run apply without updating the code, Terraform would downgrade the instance back to the original size, potentially causing another outage.
The third is untracked resource creation: someone created a new S3 bucket via the console that Terraform knows nothing about. Because Terraform only reads resources recorded in its state, it cannot detect untracked resources. Tools such as driftctl (acquired by Snyk) compare the resources in an account against Terraform state, and inventory tools such as AWS Config or CloudQuery can surface resources that were never brought under management. Once identified, you either import the resource into Terraform using import blocks (Terraform 1.5+) or the terraform import command, or you delete the resource if it should not exist.
Proactive drift prevention is better than reactive detection. Implement AWS Config rules that alert on configuration changes not made by the Terraform execution role. Set up CloudTrail-based alarms that trigger when console users modify resources tagged with ManagedBy=terraform. Use IAM policies that restrict console users to read-only access for Terraform-managed resource types. Schedule a daily terraform plan -refresh-only in CI that posts drift reports to a Slack channel; this catches drift within 24 hours instead of discovering it during the next deployment.
The lifecycle meta-argument ignore_changes is the escape hatch for expected drift. Auto Scaling groups change desired_capacity based on scaling policies, ECS services change desired_count when service auto scaling is active, and some resources have attributes that are set once and then managed externally. Adding these attributes to ignore_changes tells Terraform to ignore differences in them when planning, preventing false positives and accidental reverts of legitimate operational changes.
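The scheduled drift check described above depends on how the exit status of `terraform plan -detailed-exitcode` is interpreted. The following is a minimal shell sketch rather than a definitive implementation: the function names `classify_plan_exit` and `check_drift` are hypothetical, and it assumes `terraform` is on the PATH and the target directory has already been initialized.

```shell
# Map a `terraform plan -detailed-exitcode` exit status to a status word.
# 0 = no changes, 2 = changes (drift) present, anything else = the plan itself failed.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a refresh-only plan in directory $1 and print the classification.
check_drift() {
  rc=0
  terraform -chdir="$1" plan -refresh-only -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  classify_plan_exit "$rc"
}

# Usage (hypothetical path):
#   status="$(check_drift infrastructure/envs/prod)"
#   [ "$status" = "drift" ] && echo "post a drift alert"
```

Separating "drift" from "error" matters in CI: a failed plan (expired credentials, provider outage) should not be reported to the team as drift.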
Code Example
# Drift detection and remediation workflow
# Step 1: Run refresh-only plan to detect drift without proposing changes
# terraform plan -refresh-only -out=drift-check.tfplan
# This shows which resources have drifted from their recorded state
# Step 2: Review the drift report
# terraform show drift-check.tfplan
# Example output:
# ~ aws_security_group_rule.payments_api_ingress
#     from_port: 443 → 8080 (someone changed the port manually)
# Step 3a: Revert accidental drift: apply converges back to declared state
# terraform apply
# This restores the security group rule to port 443 as declared in code
# Step 3b: Adopt intentional drift: update code to match reality
resource "aws_rds_cluster" "payments_db" {
  # Cluster identifier for the payments transaction database
  cluster_identifier      = "payments-db-prod"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  deletion_protection     = true
  backup_retention_period = 30
}
# For Aurora, the instance class is set on the cluster instance, not on the cluster
# (the resource name "payments_db_writer" is illustrative)
resource "aws_rds_cluster_instance" "payments_db_writer" {
  cluster_identifier = aws_rds_cluster.payments_db.id
  engine             = aws_rds_cluster.payments_db.engine
  # Updated instance class to match the manual scaling during incident
  # Previously: db.r6g.xlarge, changed during traffic spike on 2026-06-15
  instance_class = "db.r6g.2xlarge"
}
# Step 3c: Import untracked resources using import blocks (TF 1.5+)
import {
  # S3 bucket created manually via console during incident response
  to = aws_s3_bucket.payments_audit_logs
  # The actual bucket name to import from AWS
  id = "valuemomentum-payments-audit-logs-prod"
}
# Resource block to match the imported bucket's configuration
resource "aws_s3_bucket" "payments_audit_logs" {
  # Bucket name matching the manually created bucket
  bucket = "valuemomentum-payments-audit-logs-prod"
  tags = {
    Purpose     = "audit-log-storage"
    Environment = "prod"
    ManagedBy   = "terraform"
    ImportedOn  = "2026-06-20"
  }
}
# Lifecycle ignore_changes for expected drift patterns
resource "aws_autoscaling_group" "payments_api_fleet" {
  # ASG name following the naming convention
  name = "payments-api-fleet-prod-use1"
  # Baseline desired capacity; the autoscaler adjusts this
  desired_capacity = 6
  # Minimum instances for SLA compliance
  min_size = 3
  # Maximum instances during peak events
  max_size = 24
  launch_template {
    id      = aws_launch_template.payments_api.id
    version = "$Latest"
  }
  lifecycle {
    # Ignore desired_capacity: managed by cluster autoscaler
    # Ignore target_group_arns: managed by EKS ingress controller
    ignore_changes = [desired_capacity, target_group_arns]
  }
}
# Scheduled drift detection in CI (runs daily at 6 AM UTC)
# .github/workflows/drift-detection.yml
# name: Daily Drift Detection
# on:
#   schedule:
#     - cron: '0 6 * * *'
# jobs:
#   detect-drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - run: terraform -chdir=infrastructure/envs/prod init
#       - run: terraform -chdir=infrastructure/envs/prod plan -refresh-only -detailed-exitcode
#       # Exit code 2 means drift detected; exit code 1 means the plan itself failed.
#       # Both fail the step, so the alert covers either case.
#       - if: failure()
#         run: |
#           curl -X POST "$SLACK_WEBHOOK" -d '{"text": "Drift detected or plan failed in prod payments infra"}'
Architecture Diagram
Terraform State Drift Detection & Remediation

Terraform state file: sg port 443 (declared)
AWS Console or CLI (manual change): sg port 8080 (actual)
  ↓
terraform plan -refresh-only
  DRIFT DETECTED: sg port 443 → 8080
  ↓
Remediation by drift type:
  Accidental drift   → terraform apply (revert to declared)
  Intentional drift  → update HCL to match reality, then apply
  Untracked resource → terraform import, or delete the resource

Proactive detection:
  Daily CI job (cron: 0 6 * * *)
  terraform plan -refresh-only -detailed-exitcode
    Exit 0 → no drift → all clear
    Exit 1 → plan failed → investigate
    Exit 2 → drift detected → Slack alert

Expected drift (ignore_changes):
  ASG desired_capacity → managed by autoscaler
  ECS desired_count    → managed by scaling policy
  ignore_changes = [desired_capacity]
Quick Answer
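State locking prevents concurrent modifications by acquiring a lock on the state before any operation that could write to it. With the S3 backend and a DynamoDB lock table, Terraform creates the lock item with a conditional write. If a second operator starts an operation while the lock is held, DynamoDB rejects the write with a ConditionalCheckFailedException and Terraform fails with a state lock error that identifies the current holder. The second run waits and retries only if -lock-timeout is set, and a stale lock left by a crashed run has to be cleared with terraform force-unlock.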
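Terraform's `-lock-timeout` flag makes a blocked run retry for a bounded period instead of failing on the first attempt. A minimal shell sketch of a wrapper follows; the function name `apply_with_lock_wait` and the paths in the usage comment are hypothetical.

```shell
# Apply in directory $1, waiting up to $2 (for example 5m) for the state lock.
# Any remaining arguments (such as a saved plan file) are passed through to apply.
apply_with_lock_wait() {
  dir="$1"
  lock_wait="$2"
  shift 2
  terraform -chdir="$dir" apply -input=false -lock-timeout="$lock_wait" "$@"
}

# Usage (hypothetical path and plan file):
#   apply_with_lock_wait infrastructure/envs/prod 5m prod.tfplan
```

A bounded wait suits CI, where a short overlap between two runs is common and an unbounded wait would hold a runner indefinitely.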
Detailed Answer
Terraform state locking is a concurrency control mechanism that prevents two or more operators from writing to the same state file simultaneously, which would cause state corruption and potentially orphaned cloud resources. Think of it like a database row-level lock: before Terraform can modify state, it must first acquire an exclusive lock, and any other process attempting to modify the same state must wait or fail.
When you run a command that may write state, such as terraform plan or terraform apply, Terraform first sends a lock request to the backend. With the S3 backend and a DynamoDB lock table, this is a conditional write of an item whose LockID partition key is the path of the state object (bucket plus key). The item carries an Info attribute with the lock metadata: a unique lock ID, the operation type, who holds the lock (user and hostname), the Terraform version, and the creation time. A separate item with an -md5 suffix stores a digest of the state for consistency checks. DynamoDB's conditional write makes acquisition atomic: if a lock item already exists for that key, the write fails with a ConditionalCheckFailedException, and Terraform reports that the state is locked by another process.
In production, lock conflicts arise in several scenarios. The most common is when two engineers run terraform apply on the same workspace concurrently. Terraform displays the lock holder's information, including their username, the operation type, and when the lock was acquired. By default the blocked command fails immediately; with -lock-timeout it retries for the given duration while the first operation completes. Another common scenario is a CI/CD pipeline crash: if a pipeline runner dies mid-apply, the lock remains in DynamoDB as a stale lock. This requires manual intervention using terraform force-unlock with the lock ID.
Force-unlock is dangerous in production because you cannot guarantee the previous operation completed cleanly. Before force-unlocking, verify the state of the actual infrastructure using the AWS console or CLI, check the DynamoDB table directly to confirm the lock metadata, and review CloudTrail logs for any API calls made by the crashed process. A safer pattern is to bound runs in your CI/CD pipeline: wrap terraform apply in a timeout command and have the pipeline run terraform force-unlock only after confirming that no infrastructure changes are in progress.
Different backends handle locking differently. Consul uses its built-in session-based locking with TTL. Azure Blob Storage uses native blob leases. Google Cloud Storage creates a separate lock object guarded by a precondition, so only one writer can create it. Not all backends support locking; the local backend does via filesystem locks, but NFS-mounted local backends have notoriously unreliable locking, which is why teams migrate to remote backends. The etcd backends relied on etcd's compare-and-swap primitive and were removed in Terraform 1.3. Understanding your backend's locking semantics is critical for disaster recovery planning, because a corrupted lock table can block all infrastructure changes across your organization.
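Checking the lock metadata in DynamoDB before a force-unlock can be sketched in shell as follows. This assumes the AWS CLI is installed and the state is in the default workspace, where the lock item's `LockID` is the bucket name and state key joined by a slash; the function names `lock_item_key` and `show_lock_info` are hypothetical.

```shell
# Build the DynamoDB key JSON for the lock item of a state object (default workspace).
# $1 = state bucket, $2 = state key
lock_item_key() {
  printf '{"LockID":{"S":"%s/%s"}}' "$1" "$2"
}

# Print the lock metadata recorded in the item's Info attribute (holder, operation, created).
# $1 = lock table, $2 = state bucket, $3 = state key
show_lock_info() {
  aws dynamodb get-item \
    --table-name "$1" \
    --key "$(lock_item_key "$2" "$3")" \
    --consistent-read \
    --query 'Item.Info.S' \
    --output text
}

# Usage with the names from the backend example in this section:
#   show_lock_info terraform-state-locks-prod \
#     fintech-corp-terraform-state-prod infrastructure/payments-vpc/terraform.tfstate
```

If the output shows a holder and creation time that match the crashed pipeline run, and no related API activity is still in flight, `terraform force-unlock` with the reported lock ID is the next step.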
Code Example
# Backend configuration with S3 state locking via DynamoDB
terraform {
  # Define the S3 backend for remote state storage
  backend "s3" {
    # S3 bucket storing the payments platform state files
    bucket = "fintech-corp-terraform-state-prod"
    # State file path scoped to the payments VPC workspace
    key = "infrastructure/payments-vpc/terraform.tfstate"
    # AWS region where the state bucket resides
    region = "us-east-1"
    # DynamoDB table that manages state locks
    dynamodb_table = "terraform-state-locks-prod"
    # Enable server-side encryption for state at rest
    encrypt = true
    # Use the shared infrastructure AWS profile
    profile = "fintech-infra-admin"
  }
}
# DynamoDB table resource for state locking (provisioned separately)
resource "aws_dynamodb_table" "terraform_locks" {
  # Table name matching the backend configuration reference
  name = "terraform-state-locks-prod"
  # Pay-per-request to avoid capacity planning for lock operations
  billing_mode = "PAY_PER_REQUEST"
  # LockID is the required partition key for Terraform state locks
  hash_key = "LockID"
  # Define the LockID attribute as a string type
  attribute {
    name = "LockID"
    type = "S"
  }
  # Tag for cost allocation and ownership tracking
  tags = {
    Team        = "platform-engineering"
    Environment = "production"
    ManagedBy   = "terraform-bootstrap"
  }
}
# Example: force-unlock command when a CI pipeline crashes
# terraform force-unlock 2b6a6738-5ef0-7c20-a036-48eb6273784f
Architecture Diagram
Terraform State Locking Flow

1. Operator A (terraform apply) → lock request → DynamoDB table terraform-state-locks-prod
   Lock acquired (lock ID: abc123)
2. Operator A writes state → S3 bucket fintech-corp-terraform-state-prod
3. Operator B (terraform apply) → lock request → BLOCKED
   (fails immediately by default; waits only if -lock-timeout is set)
4. Operator A apply complete → unlock request (delete lock item) → lock released
5. Operator B acquires the lock and proceeds
Quick Answer
Terraform plan detects drift by reading the current state of every managed resource via provider API calls and comparing the result against the state file. Differences are reported as drift. However, it only checks attributes it manages, cannot see resources created out-of-band that are not in state, and depends on providers that do not always report every attribute accurately.
Detailed Answer
Terraform plan's drift detection works through a refresh-then-diff process. Think of it like an inventory audit: Terraform reads the last known inventory (state file), physically checks every item in the warehouse (API calls to cloud providers), updates the inventory with actual findings (state refresh), and then compares the updated inventory against the blueprint (configuration). Any discrepancies between the refreshed state and the desired configuration become the plan.
The refresh phase is where drift detection happens. For every resource tracked in the state file, Terraform calls the provider's ReadResource RPC method, which translates to cloud API calls. For an aws_rds_cluster.payments_db, this triggers a DescribeDBClusters API call. Terraform compares the response against the attributes recorded in state. If the production database's backup_retention_period was changed from 30 to 7 via the AWS console, the refresh records this as drift and updates the in-memory state.
After refresh, Terraform diffs the refreshed state against the configuration. If your configuration says backup_retention_period = 30 but the refreshed state shows 7, the plan proposes changing it back to 30. This is Terraform's self-healing property: it converges actual infrastructure toward the declared configuration.
However, the limitations are significant and often misunderstood in production.
First, Terraform only detects drift on resources it manages. If someone creates an additional security group rule via the AWS console and that rule is not in Terraform's state (as happens when rules are managed as separate resources), Terraform has no knowledge of it. This is the 'unknown unknowns' problem: Terraform cannot detect resources it does not track.
Second, not all providers report all attributes during refresh. Some cloud APIs return partial data, and certain attributes are write-only (like passwords). The AWS provider, for example, cannot reliably detect changes in the ordering of some IAM policy documents because the API returns a normalized version that may not match the original.
Third, the refresh phase can be slow and expensive. In a large infrastructure with thousands of resources, the refresh makes thousands of API calls, which can hit rate limits and take tens of minutes. The -refresh=false flag skips the refresh for faster plans, but this trades drift detection for speed.
Fourth, eventual consistency in cloud APIs can cause false drift detection. After an AWS resource is created, the API may return stale data for seconds or minutes. Running plan immediately after apply can show phantom drift that resolves itself.
Fifth, Terraform cannot detect drift on resource dependencies that are not explicitly modeled. If a VPC peering connection's route table was modified outside Terraform but the peering resource itself was not, Terraform might not detect the functional impact.
Tools like AWS Config, CloudTrail-based drift detection, or Driftctl (now part of Snyk) fill these gaps by scanning entire accounts for unmanaged resources.
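The result of the refresh phase can also be read programmatically: the JSON rendering of a plan (`terraform show -json <planfile>`) contains a `resource_drift` array listing resources that changed outside Terraform. The sketch below assumes `jq` is installed; the function name `list_drifted` and the file names in the usage comment are hypothetical.

```shell
# Print the address of every resource that refresh found changed outside Terraform.
# $1 = path to JSON produced by `terraform show -json <planfile>`
list_drifted() {
  jq -r '.resource_drift // [] | .[].address' "$1"
}

# Usage (hypothetical file names):
#   terraform plan -refresh-only -out=drift-report.tfplan
#   terraform show -json drift-report.tfplan > drift-report.json
#   list_drifted drift-report.json
```

The `// []` fallback keeps the function quiet when the plan contains no drift, since the key is then absent from the JSON.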
Code Example
# Demonstrating drift detection behavior with refresh configuration
# Backend configuration for the payments infrastructure state
terraform {
  # Minimum Terraform version for this configuration
  # (-refresh-only plans themselves need 0.15.4 or later)
  required_version = ">= 1.5.0"
  # S3 backend with state locking for the payments platform
  backend "s3" {
    # State bucket for the production payments infrastructure
    bucket = "fintech-corp-terraform-state-prod"
    # State file path for the payments database workspace
    key = "payments-database/terraform.tfstate"
    # Primary region for state storage
    region = "us-east-1"
    # Lock table to prevent concurrent modifications
    dynamodb_table = "terraform-state-locks-prod"
  }
}
# RDS cluster that we want to detect drift on
resource "aws_rds_cluster" "payments_db" {
  # Cluster identifier for the payments transaction database
  cluster_identifier = "payments-db-production"
  # Aurora PostgreSQL engine for transaction processing
  engine = "aurora-postgresql"
  # Engine version validated by the DBA team
  engine_version = "15.4"
  # Backup retention: 30 days for PCI compliance
  # If someone changes this via console, plan will detect drift
  backup_retention_period = 30
  # Deletion protection must stay enabled in production
  deletion_protection = true
  # Preferred maintenance window outside peak transaction hours
  preferred_maintenance_window = "sun:03:00-sun:04:00"
}
# Lifecycle rule to ignore drift on specific attributes
resource "aws_autoscaling_group" "payments_api_fleet" {
  # ASG name following the organization convention
  name = "payments-api-fleet-production"
  # Desired capacity managed by autoscaling policies, not Terraform
  desired_capacity = 6
  # Minimum instances for baseline transaction processing
  min_size = 3
  # Maximum instances during peak shopping events
  max_size = 24
  # Launch template for the payments API container hosts
  launch_template {
    # Reference the payments API launch template
    id = aws_launch_template.payments_api.id
    # Always use the latest launch template version
    version = "$Latest"
  }
  # Ignore drift on desired_capacity because autoscaling changes it
  lifecycle {
    # Prevent Terraform from reverting autoscaler decisions
    ignore_changes = [desired_capacity]
  }
}
# Refresh-only plan command to detect drift without proposing changes
# terraform plan -refresh-only -out=drift-report.tfplan
Architecture Diagram
Terraform Plan Drift Detection Flow

Phase 1: Refresh (drift detection)
  State file (payments_db: retention=30) → ReadResource RPC calls → Cloud provider APIs
  API response: actual retention=7 (console change)
  ↓ update in-memory state
  Refreshed state: payments_db retention=7 (drift detected)

Phase 2: Diff (plan generation)
  Refreshed state (retention=7) compared with configuration in main.tf (retention=30)
  ↓
  Plan output:
    ~ aws_rds_cluster.payments_db
      ~ backup_retention_period: 7 → 30
    (drift will be corrected on apply)

Limitations:
  ✗ Cannot detect unmanaged resources
  ✗ Write-only attributes invisible to refresh
  ✗ Eventual consistency → false drift
  ✗ Rate limits slow large-scale refresh
Quick Answer
Implement a CI/CD pipeline that runs terraform plan on pull requests, posts the plan output as a PR comment for review, requires manual approval before apply, and uses remote state locking to prevent concurrent operations. Use OIDC authentication, separate plan and apply stages, and implement policy-as-code gates with Sentinel or OPA.
Detailed Answer
A production-grade Terraform CI/CD pipeline must solve five problems: authentication without long-lived credentials, plan visibility for reviewers, approval gates before destructive changes, concurrency control to prevent conflicting applies, and policy enforcement to catch compliance violations before they reach infrastructure. Think of it like a surgical operation workflow: the surgeon (engineer) proposes an operation (plan), it gets reviewed by a board (PR review), approved by an authority (manual approval gate), and only then executed in a controlled environment (apply) with safeguards (state locking).
Authentication should use OIDC federation. GitHub Actions, GitLab CI, and CircleCI all support OIDC tokens that can be exchanged for short-lived AWS credentials via STS AssumeRoleWithWebIdentity. This eliminates the need to store AWS access keys as CI secrets, which is a common audit finding. The OIDC trust policy should be scoped to specific repositories and branches to prevent unauthorized access.
The pipeline structure typically has three stages. The first stage runs on every pull request: terraform init, terraform validate, terraform fmt -check, and terraform plan -out=plan.tfplan. The plan output is captured and posted as a PR comment using a tool like tfcmt or a custom script that parses the plan JSON output. This gives reviewers visibility into exactly what will change.
The second stage is the approval gate. For non-production environments, this might be automatic after PR merge. For production, it requires explicit manual approval. In GitHub Actions, this is implemented using environments with required reviewers. In GitLab, it is a manual job gate. The approval should come from someone other than the PR author (four-eyes principle) and ideally from a platform engineering team member who understands the blast radius.
The third stage runs terraform apply using the saved plan file. Applying the saved plan guarantees that exactly what was reviewed is what gets applied; a fresh plan at apply time may differ because infrastructure or state changed in between. Terraform rejects a saved plan as stale if the state has changed since the plan was created, in which case the change must be re-planned and re-reviewed. Pipelines that cannot carry the plan file from the PR run to the post-merge run have to re-plan at apply time, which weakens this guarantee. After apply, the pipeline should post the apply output back to the PR or a Slack channel for visibility.
Policy-as-code adds guardrails. HashiCorp Sentinel (Terraform Cloud/Enterprise) or Open Policy Agent (open source) evaluate the plan against organizational policies: no public S3 buckets, all RDS instances must have encryption, all security groups must have descriptions. These checks run after plan but before approval, catching violations early.
Production gotchas include handling plan file expiration (a plan file is tied to the state it was created from and to specific provider versions, so it becomes unusable once the state changes), managing workspace-level parallelism (only one pipeline should operate on a workspace at a time), and dealing with long-running applies that exceed CI timeout limits. Some teams implement a Terraform-specific lock in Redis or DynamoDB beyond the state lock, to queue pipeline runs at the workspace level.
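The plan-then-apply-the-same-file flow described above can be sketched as two shell functions, one per stage. This is an illustration under stated assumptions rather than a complete pipeline: the function names and the file names `review.tfplan` and `review.json` are hypothetical, and moving the plan file between CI jobs (for example as a build artifact) is left out.

```shell
# Stage 1: write a plan file and a JSON rendering of it for policy checks and review.
# $1 = Terraform root directory
plan_for_review() {
  terraform -chdir="$1" plan -input=false -out=review.tfplan &&
    terraform -chdir="$1" show -json review.tfplan > "$1/review.json"
}

# Stage 3: apply exactly the reviewed plan. A saved plan is applied without an
# interactive prompt, and Terraform rejects it as stale if state changed since planning.
apply_reviewed() {
  terraform -chdir="$1" apply -input=false review.tfplan
}

# Usage (hypothetical path):
#   plan_for_review infrastructure/payments    # on the pull request
#   apply_reviewed infrastructure/payments     # after approval
```

Because `-chdir` switches the working directory before the subcommand runs, `review.tfplan` is resolved inside the root directory, while the shell redirect for `review.json` needs the explicit directory prefix.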
Code Example
# GitHub Actions workflow for Terraform CI/CD with approval gates
# File: .github/workflows/terraform-payments-infra.yml
name: Payments Infrastructure Terraform Pipeline
# Trigger on pull requests and on pushes to main
on:
  pull_request:
    # Only run when infrastructure code changes
    paths:
      - 'infrastructure/payments/**'
  push:
    branches:
      - main
    paths:
      - 'infrastructure/payments/**'
# OIDC token permissions for AWS authentication
permissions:
  # Allow requesting OIDC JWT tokens from GitHub
  id-token: write
  # Allow posting plan output as PR comments
  pull-requests: write
  # Allow reading repository contents
  contents: read
# Prevent concurrent runs on the same branch/PR
concurrency:
  # Group by PR number or branch
  group: terraform-payments-${{ github.event.pull_request.number || github.ref }}
  # Cancel in-progress plan runs but never cancel apply
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
  # Plan stage: runs on every pull request
  terraform-plan:
    # Use the latest Ubuntu runner for consistency
    runs-on: ubuntu-latest
    # Only run plan on pull requests, not on merge
    if: github.event_name == 'pull_request'
    steps:
      # Checkout the payments infrastructure code
      - uses: actions/checkout@v4
      # Configure AWS credentials via OIDC federation
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # OIDC role scoped to this repository and branch
          role-to-assume: arn:aws:iam::111111111111:role/GitHubActions-TerraformPlan
          # Region for API calls and state backend
          aws-region: us-east-1
      # Install the pinned Terraform version
      - uses: hashicorp/setup-terraform@v3
        with:
          # Version locked to match team standard
          terraform_version: 1.7.4
      # Initialize Terraform with backend configuration
      - name: Terraform Init
        # Run init in the payments infrastructure directory
        run: terraform init -input=false
        working-directory: infrastructure/payments
      # Validate syntax and internal consistency
      - name: Terraform Validate
        run: terraform validate
        working-directory: infrastructure/payments
      # Run format check to enforce code style
      - name: Terraform Format Check
        # Fail the pipeline if code is not formatted
        run: terraform fmt -check -recursive
        working-directory: infrastructure/payments
      # Generate the execution plan and save to file
      - name: Terraform Plan
        # Save plan to file so the reviewed plan can be rendered for the PR
        run: terraform plan -input=false -out=payments.tfplan
        working-directory: infrastructure/payments
      # Post plan output as a PR comment for reviewers
      - name: Post Plan to PR
        # Use tfcmt for formatted plan comments
        # (tfcmt must be installed on the runner and needs a GitHub token in its environment)
        run: tfcmt plan -- terraform show payments.tfplan
        working-directory: infrastructure/payments
  # Apply stage: runs after merge with manual approval
  terraform-apply:
    # Use the latest Ubuntu runner for consistency
    runs-on: ubuntu-latest
    # Only run apply on push to main (after PR merge)
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    # Require manual approval from the platform-admins team
    environment: production-payments
    steps:
      # Checkout the merged infrastructure code
      - uses: actions/checkout@v4
      # Configure AWS credentials with apply permissions
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Apply role has write permissions to production
          role-to-assume: arn:aws:iam::111111111111:role/GitHubActions-TerraformApply
          # Same region as the plan stage
          aws-region: us-east-1
      # Install the same Terraform version as plan stage
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
      # Initialize and apply the configuration
      - name: Terraform Init and Apply
        # Auto-approve because approval happened via GitHub environment.
        # Note: this simplified job re-plans against current state; the plan file
        # from the PR job is not reused here.
        run: |
          terraform init -input=false
          terraform apply -input=false -auto-approve
        working-directory: infrastructure/payments
Architecture Diagram
Terraform CI/CD Pipeline with Approval Gates

Engineer opens PR
  ↓
Stage 1: Plan (on PR)
  init → validate → fmt check → plan
  ↓
Policy Check (OPA / Sentinel)
  No public S3 buckets
  All RDS encrypted
  Security groups have descriptions
  ↓
Plan posted as PR comment (tfcmt)
  Reviewer examines changes
  ↓
Stage 2: Manual Approval
  GitHub Environment: production-payments
  Required reviewers: platform-admins team
  ↓
Stage 3: Apply (on merge to main)
  init → apply -auto-approve
  ↓
Notify: Slack #payments-infra-changes
Quick Answer
Handle state corruption by first enabling S3 versioning for automatic backups, then recovering using terraform state pull to inspect the corrupted state, restoring from a previous version, or as a last resort using terraform import to rebuild state from existing infrastructure. Always maintain state backups and test recovery procedures regularly.
Detailed Answer
Terraform state corruption is one of the most dangerous operational incidents because the state file is the single source of truth mapping your configuration to real infrastructure. Think of it like a hospital's patient registry: if the registry is corrupted, you do not know which patient is in which room, and any action you take risks harming the wrong patient. State corruption can cause Terraform to destroy resources it thinks do not exist, create duplicates of resources it cannot find, or fail entirely with cryptic deserialization errors. State corruption occurs in several ways. The most common is a partial write during terraform apply: if the process is killed mid-apply (OOM killer, network interruption, CI timeout), the state file may contain a partially updated resource. Another common cause is manual state file editing with JSON syntax errors. Concurrent writes without proper locking can interleave state updates. Provider bugs can write malformed attribute values. And rarely, S3 eventual consistency (in older S3 behavior) could serve a stale state version. The first line of defense is prevention. Enable S3 versioning on your state bucket so every state write creates a new version, giving you point-in-time recovery. Enable MFA delete on the bucket to prevent accidental or malicious version deletion. Use DynamoDB state locking to prevent concurrent writes. Run terraform plan before apply to catch state inconsistencies early. Implement CI/CD pipelines that prevent direct state manipulation. When corruption occurs, follow a structured recovery procedure. First, immediately stop all Terraform operations on the affected workspace. If your CI/CD pipeline has queued runs, cancel them. This prevents compounding the corruption. Second, assess the damage. Run terraform state pull to download the current state file and inspect it. Check the JSON structure: is it valid JSON? Are the serial number and lineage fields present? Use jq to examine specific resources. 
Compare the state against your actual infrastructure using AWS CLI or console. Third, attempt recovery from backup. If S3 versioning is enabled, list previous versions with aws s3api list-object-versions, download a known-good version, and push it back using terraform state push. The state push command does not rewrite the serial for you: it compares the lineage and serial of the file you push against the remote state and rejects the push if the lineage differs or the remote serial is higher. A restored older version normally carries a lower serial than the corrupted state, so a common approach is to raise the serial field in the recovered file above the current remote value before pushing. Be careful with the -force flag: it skips both the lineage and serial checks, which are the safety mechanisms that prevent pushing state from the wrong workspace or overwriting newer state. Fourth, if no backup is available, perform selective state surgery. Use terraform state rm to remove the corrupted resource entries, then terraform import to re-import the existing infrastructure resources. This is tedious for large configurations but preserves the non-corrupted portions of state. For each imported resource, run terraform plan to verify the import produced a state entry that matches your configuration. Fifth, as a last resort for total state loss, you can rebuild the entire state from scratch using terraform import for every resource. This is the infrastructure equivalent of a bare-metal restore. Tools like terraformer and former2 can help by scanning your AWS account and generating configuration or import data, but their output requires careful validation. After recovery, conduct a post-mortem. Implement additional safeguards: S3 bucket replication to a separate account for disaster recovery, automated state backup jobs that copy state to a different storage system, monitoring on state file size and serial number for anomaly detection, and regular recovery drills where you practice restoring state from backup in a non-production workspace.
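The prevention steps above include DynamoDB state locking. A minimal sketch of such a lock table follows; the table name `fintech-corp-terraform-state-locks` and the tags are assumptions chosen to resemble the bucket naming in this answer, while the string hash key named `LockID` is what the S3 backend expects.

```hcl
# Sketch only: lock table used by the S3 backend to serialize state writes.
# The table name and tags are hypothetical placeholders.
resource "aws_dynamodb_table" "terraform_state_locks" {
  name         = "fintech-corp-terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST"

  # The S3 backend looks up lock records by a string attribute named LockID.
  hash_key = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  # Guard the table itself against an accidental terraform destroy.
  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Team    = "platform-engineering"
    Purpose = "terraform-state-locking"
  }
}
```

A table like this is usually created from a small bootstrap configuration, since the backend that depends on it cannot create it for itself.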
Code Example
# State corruption recovery playbook
# Step 1: Pull and inspect the corrupted state
# Download the current state file for local inspection
# terraform state pull > corrupted-state-backup.json
# Step 2: Check S3 versioning for previous good state
# List all versions of the payments state file
# aws s3api list-object-versions \
# --bucket fintech-corp-terraform-state-prod \
# --prefix payments-database/terraform.tfstate \
# --query 'Versions[*].{VersionId:VersionId,Modified:LastModified,Size:Size}'
# Step 3: Download a known-good state version
# aws s3api get-object \
# --bucket fintech-corp-terraform-state-prod \
# --key payments-database/terraform.tfstate \
# --version-id "abc123def456" \
# recovered-state.json
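# Step 4: Push the recovered state back
# If the push is rejected because the remote serial is higher, raise "serial" in
# recovered-state.json above the current remote value and retry
# terraform state push recovered-state.json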
# Prevention: S3 bucket with versioning and replication
resource "aws_s3_bucket" "terraform_state" {
# Bucket name following the organization naming convention
bucket = "fintech-corp-terraform-state-prod"
# Prevent accidental deletion of the state bucket
force_destroy = false
# Tags for ownership and cost allocation
tags = {
Team = "platform-engineering"
Purpose = "terraform-state-storage"
Criticality = "critical"
}
}
# Enable versioning for point-in-time state recovery
resource "aws_s3_bucket_versioning" "terraform_state" {
# Reference the state storage bucket
bucket = aws_s3_bucket.terraform_state.id
# Enable versioning to retain all state file versions
versioning_configuration {
# Enabled status ensures every write creates a new version
status = "Enabled"
}
}
# Server-side encryption for state files at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
# Reference the state storage bucket
bucket = aws_s3_bucket.terraform_state.id
# Encryption rule using AWS KMS for audit trail
rule {
# Apply encryption to all new state file versions
apply_server_side_encryption_by_default {
# Use a dedicated KMS key for state encryption
sse_algorithm = "aws:kms"
# KMS key managed by the platform engineering team
kms_master_key_id = aws_kms_key.terraform_state_encryption.arn
}
# Use an S3 Bucket Key to reduce the number of KMS requests and their cost
bucket_key_enabled = true
}
}
# Cross-region replication for disaster recovery
resource "aws_s3_bucket_replication_configuration" "terraform_state_dr" {
# Source bucket for replication
bucket = aws_s3_bucket.terraform_state.id
# IAM role with permissions to replicate objects
role = aws_iam_role.terraform_state_replication.arn
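# Replication rule for all state files
# Note: SSE-KMS encrypted objects replicate only when source_selection_criteria and a
# destination encryption_configuration are also configured (omitted in this example)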
rule {
# Unique identifier for the replication rule
id = "state-dr-replication"
# Enable the replication rule
status = "Enabled"
# Replicate to a bucket in a different region
destination {
# DR bucket in us-west-2 for geographic redundancy
bucket = aws_s3_bucket.terraform_state_dr.arn
# Store replicas in Standard-IA to reduce DR storage cost
storage_class = "STANDARD_IA"
}
}
}
# Lifecycle policy to manage state version retention
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
# Reference the state storage bucket
bucket = aws_s3_bucket.terraform_state.id
# Rule to manage old state file versions
rule {
# Unique identifier for the lifecycle rule
id = "state-version-retention"
# Enable the lifecycle rule
status = "Enabled"
# Transition old versions to cheaper storage after 30 days
noncurrent_version_transition {
# Move to Glacier after 30 days for cost savings
noncurrent_days = 30
# Glacier storage for long-term state version retention
storage_class = "GLACIER"
}
# Delete very old versions after 365 days
noncurrent_version_expiration {
# Retain versions for one year for compliance audits
noncurrent_days = 365
}
}
}
# Import command for rebuilding state (example)
# terraform import aws_rds_cluster.payments_db payments-db-production
# terraform import aws_vpc.payments_network vpc-0a1b2c3d4e5f67890
# terraform import aws_security_group.payments_api_ingress sg-0a1b2c3d4e5f67890
◈ Architecture Diagram
Terraform State Recovery Workflow
┌──────────────────────────────────────────────────┐
│ Corruption detected                              │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 1: Stop All Operations                      │
│ Cancel CI/CD queued runs                         │
│ Notify platform-engineering team                 │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 2: Assess Damage                            │
│ terraform state pull > corrupted-backup.json     │
│ jq '.resources | length' corrupted-backup.json   │
│ Compare against AWS CLI inventory                │
└────────────────────────┬─────────────────────────┘
            ┌────────────┴─────────────┐
            ↓                          ↓
┌───────────────────────┐  ┌───────────────────────┐
│ S3 version available  │  │ No backup available   │
│ Step 3A: Restore      │  │ Step 3B: Rebuild      │
│ from S3 version       │  │ from scratch          │
│ aws s3api get-object  │  │ terraform state rm    │
│   --version-id        │  │   (corrupted entries) │
│ terraform state push  │  │ terraform import      │
│                       │  │   (each resource)     │
└───────────┬───────────┘  └───────────┬───────────┘
            └────────────┬─────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 4: Validate Recovery                        │
│ terraform plan (expect no changes)               │
│ Verify resource count matches cloud inventory    │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 5: Post-Mortem                              │
│ Document root cause                              │
│ Implement additional safeguards                  │
│ Schedule recovery drill for next quarter         │
└──────────────────────────────────────────────────┘
Quick Answer
Remote state gives a team a shared, durable view of managed infrastructure, while state locking prevents two Terraform runs from changing the same state at the same time. If state is split around convenience instead of ownership and dependency boundaries, teams create hidden coupling, stale outputs, and unsafe apply ordering.
Detailed Answer
Think of Terraform state like the official city property registry. If two clerks update the same parcel records at the same time, one may overwrite the other's change and the registry becomes untrustworthy. Remote state puts the registry in a shared office instead of on one clerk's laptop, and locking is the sign on the desk that says one clerk is actively editing this registry right now. Without that sign, production changes become a race. In Terraform, state maps configuration resources to real infrastructure objects and stores attributes Terraform needs for planning. Local state is simple for one person, but it breaks down for teams because every operator needs the newest file and must coordinate manually. Remote backends such as HCP Terraform, S3 with locking, Azure Blob, or GCS move state to a shared system. Locking prevents concurrent plans or applies from corrupting the same state or producing plans from stale assumptions. The internal flow is: Terraform loads configuration, initializes the backend, reads the latest state, attempts to acquire a lock, refreshes resource data through providers, builds a dependency graph, produces a plan, and applies changes if approved. During apply, state is updated as resources change. If the process crashes, the lock may need cleanup, but force-unlocking should be treated like removing a warning tag from dangerous machinery: only do it after confirming no run is active. At scale, teams split state to reduce blast radius, improve performance, and align ownership. A network platform state might own VPCs, subnets, and transit gateways, while an application state consumes published subnet IDs. The dangerous split is by file size or folder convenience rather than lifecycle. If an app state creates security groups that the network state mutates, or two states manage different arguments on the same resource, Terraform cannot reason globally and drift becomes normal. 
The non-obvious gotcha is that remote state outputs are not a service discovery system. They expose snapshots from another state, and consumers may act on outputs that are valid syntactically but operationally stale. Senior engineers prefer stable contracts, narrow outputs, versioned modules, explicit ownership, and separate data stores for values consumed by non-Terraform systems. They also design emergency unlock procedures, backend access controls, and audit trails before the first production incident.
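The backend initialization step described above is often driven by a partial backend configuration file passed to `terraform init -backend-config=...`. The sketch below shows what such a file might contain; the file name, bucket, key, and table values are all hypothetical placeholders rather than values from a real environment.

```hcl
# Hypothetical contents of a partial backend file such as env/prod.s3backend.
# The root module still needs an empty `backend "s3" {}` block inside
# `terraform {}`; these key/value pairs fill it in at `terraform init` time.

# Bucket and object key that hold this state (placeholder names).
bucket = "example-terraform-state-prod"
key    = "network/terraform.tfstate"
region = "us-east-1"

# Encrypt the state object at rest.
encrypt = true

# Lock table so only one run at a time can write this state.
# On Terraform 1.10 and later the S3 backend can also lock natively
# with `use_lockfile = true`; treat that as an alternative, not a requirement.
dynamodb_table = "example-terraform-locks"
```

Keeping one such file per environment lets the same configuration target different state locations without editing the backend block.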
Code Example
# Initializes the shared production backend instead of using local state.
terraform init -backend-config=env/prod.s3backend
# Creates an auditable plan from the latest locked remote state.
terraform plan -out=prod.tfplan
# Applies exactly the reviewed plan while Terraform holds the backend lock.
terraform apply prod.tfplan
# Removes a stale lock only after confirming no apply is still running.
terraform force-unlock LOCK_ID
# Lists resources owned by this state so ownership boundaries can be reviewed.
terraform state list
◈ Architecture Diagram
┌──────────┐
│ Engineer │
└────┬─────┘
↓
┌──────────┐
│ Backend │
└────┬─────┘
↓ lock
┌──────────┐
│ State │
└────┬─────┘
↓ graph
┌──────────┐
│ Plan │
└────┬─────┘
↓ apply
┌──────────┐
│ Cloud │
└──────────┘
Quick Answer
At scale, Terraform state must be stored in remote backends like S3 with DynamoDB locking or Terraform Cloud, split into small blast-radius units by domain or environment, and isolated via workspaces or directory structure. State locking prevents concurrent applies from corrupting state, and state splitting ensures a single terraform apply cannot accidentally destroy unrelated infrastructure.
Detailed Answer
Think of a hospital records system. If every department writes to one giant patient file simultaneously, records get corrupted and the wrong medication gets administered. Splitting records by department, locking each file during edits, and storing everything in a central secure archive prevents these disasters. Terraform state management works the same way: it is the record of what infrastructure exists, and mismanaging it causes outages. Terraform state is a JSON file that maps every resource in your configuration to a real cloud object. When terraform plan runs, it reads the state to determine what exists, compares it to the desired configuration, and calculates the diff. If two engineers run terraform apply simultaneously against the same state and nothing locks it, one overwrites the other's changes, causing state corruption where Terraform's view of the world no longer matches reality. Remote backends solve storage and collaboration: S3 stores the state file durably, DynamoDB provides a lock table so only one operation can modify state at a time, and versioning on the S3 bucket enables recovery from bad applies. Internally, when terraform apply starts, it sends a Lock request to the backend. For S3+DynamoDB, this writes a lock record to the DynamoDB table with a unique ID, the user's identity, and a timestamp. If another process already holds the lock, Terraform exits with an error by default, or keeps retrying for a while if -lock-timeout is set. After the apply completes, Terraform writes the updated state to S3 and releases the lock. If a process crashes mid-apply, the lock record does not expire on its own; it remains until someone releases it manually with terraform force-unlock. Terraform Cloud handles locking internally and adds run queues so multiple plans can exist but only one apply executes at a time per workspace. At production scale, the critical architectural decision is state splitting. 
A monolithic state file containing the VPC, databases, Kubernetes clusters, DNS records, and application services means a single terraform apply can accidentally destroy the database while updating a DNS record. The recommended pattern is splitting state by blast radius: network foundations in one state, data layer in another, compute in another, and application configurations in their own states. Each state has its own backend configuration and can use terraform_remote_state data sources or outputs stored in SSM Parameter Store to share values. Workspaces can further separate environments (dev, staging, production) within the same configuration, but they should not be used as a substitute for proper state splitting, because all workspaces in a configuration share the same codebase, backend, and permissions. The non-obvious gotcha is that terraform_remote_state creates a hard coupling between states, and if the upstream state is corrupted or the output names change, downstream plans break. Many mature teams replace terraform_remote_state with data sources that look up infrastructure by tags or names, or they store shared values in AWS SSM Parameter Store or HashiCorp Consul, which removes the direct dependency between state files. Another trap is that S3 bucket versioning alone does not protect against permanent deletion: a plain delete only adds a delete marker, but anyone allowed to delete specific object versions can still remove the history, so teams should also enable MFA Delete or use S3 Object Lock in regulated environments.
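The tag-based lookup mentioned above can be sketched as follows. This is an illustration under assumptions: the tag keys and values (`Name = "payments-prod"`, `Tier = "private"`) and the resource names are hypothetical and depend on how the network team actually tags its resources.

```hcl
# Sketch: find shared network objects by tag instead of reading another state.
# Tag keys and values below are assumptions about the network team's conventions.
data "aws_vpc" "shared" {
  tags = {
    Name = "payments-prod"
  }
}

# All subnets in that VPC that carry the (assumed) private-tier tag.
data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.shared.id]
  }

  tags = {
    Tier = "private"
  }
}

# Consume the looked-up subnet IDs; no terraform_remote_state is involved.
resource "aws_db_subnet_group" "payments_by_tag" {
  name       = "payments-db-subnets-by-tag"
  subnet_ids = data.aws_subnets.private.ids
}
```

The trade-off is that the tags become the contract: if the network team renames or removes them, the lookup fails at plan time, so the tagging scheme needs the same care as published outputs.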
Code Example
# backend.tf — Remote backend configuration for the payments data layer
terraform {
# Use S3 as the remote state storage backend
backend "s3" {
# S3 bucket dedicated to Terraform state files
bucket = "company-terraform-state-prod"
# State file path scoped to team and layer
key = "payments/data-layer/terraform.tfstate"
# AWS region for the state bucket
region = "us-east-1"
# DynamoDB table for state locking and consistency checking
dynamodb_table = "terraform-state-locks"
# Enable server-side encryption for state at rest
encrypt = true
# Use a specific KMS key for encryption
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/payments-tf-state-key"
}
}
# Reference outputs from the network layer state without tight coupling
# Using SSM Parameter Store instead of terraform_remote_state
data "aws_ssm_parameter" "vpc_id" {
# Parameter path set by the network layer's terraform apply
name = "/infrastructure/network/vpc-id"
}
data "aws_ssm_parameter" "private_subnet_ids" {
# Comma-separated subnet IDs stored by the network team
name = "/infrastructure/network/private-subnet-ids"
}
# Use the decoupled values in resource configuration
resource "aws_db_subnet_group" "payments" {
# Subnet group name for the payments database
name = "payments-db-subnets"
# Split the comma-separated parameter value into a list
subnet_ids = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
# Tag for operational identification
tags = {
Team = "payments"
Layer = "data"
}
}
# DynamoDB lock table must exist before backend configuration
# This is typically created by a bootstrap state or manually
# aws dynamodb create-table \
# --table-name terraform-state-locks \
# --attribute-definitions AttributeName=LockID,AttributeType=S \
# --key-schema AttributeName=LockID,KeyType=HASH \
# --billing-mode PAY_PER_REQUEST
◈ Architecture Diagram
┌──────────┐
│ tf apply │
└────┬─────┘
│
┌────┴─────┐
│ Lock │
│ DynamoDB │
└────┬─────┘
│
┌────┴─────┐
│ Read │
│ S3 State │
└────┬─────┘
│
┌────┴─────┐
│ Apply │
│ Changes │
└────┬─────┘
│
┌────┴─────┐
│ Write │
│ S3 State │
└────┬─────┘
│
┌────┴─────┐
│ Unlock │
└──────────┘
Quick Answer
Terraform plan -refresh-only detects drift by comparing actual cloud state against the stored state file without proposing configuration changes. Import blocks bring unmanaged resources under Terraform control declaratively. Moved blocks refactor resource addresses in state without destroying and recreating infrastructure. Together they let architects reconcile drift, adopt existing resources, and restructure code safely.
Detailed Answer
Think of a warehouse inventory system. Drift detection is like a stock audit: you compare what the computer says is on the shelf to what is actually there. Import is like scanning a product that was placed on the shelf without being logged into the system. Move is like changing the shelf label without physically moving the product. All three keep the inventory accurate without throwing anything away. Infrastructure drift occurs when cloud resources are modified outside Terraform, whether through the console, CLI, another IaC tool, or automated processes like auto-scaling. terraform plan -refresh-only reads the current state of every managed resource from the cloud provider APIs and compares it to the stored state file. It shows what has changed in the real world without proposing any configuration-level changes. This is distinct from a regular terraform plan, which both refreshes state and compares it to the desired configuration. Running refresh-only plans on a schedule helps teams detect unauthorized changes before they cause incidents. Internally, refresh-only mode calls the same provider Read functions that a normal plan uses, but it stops after updating the in-memory state representation. It shows a diff between the previously stored state and the freshly read state, highlighting attributes that changed externally. If the operator approves the refresh with terraform apply -refresh-only, the state file is updated to match reality without making any infrastructure changes. Import blocks, introduced in Terraform 1.5, allow declarative imports in configuration files rather than the imperative terraform import CLI command. An import block specifies the cloud resource ID and the target resource address; you either write the matching resource block yourself or run terraform plan with -generate-config-out to have Terraform draft it for you. 
Moved blocks tell Terraform that a resource has been renamed or restructured in the configuration (for example, moving from a flat resource to a module or changing a resource's for_each key) so it updates the state address rather than planning a destroy and create. At production scale, drift detection should be automated. Teams run terraform plan -refresh-only in CI on a daily schedule and alert on any detected drift. The plan output is stored as an artifact for audit trail. Import blocks are essential during brownfield adoption, when a company has existing infrastructure created manually or by CloudFormation and wants to manage it with Terraform. Without import blocks, the alternative is terraform import commands that must be run manually for each resource, which is error-prone and not version-controlled. Moved blocks are critical during refactoring: when a team restructures modules, renames resources for clarity, or converts single resources to for_each collections, moved blocks prevent Terraform from destroying the production database and recreating it. The non-obvious gotcha with refresh-only is that it only detects drift in resources Terraform already manages; it cannot find resources created outside Terraform. Teams need cloud-native tools like AWS Config or Azure Policy for complete drift coverage. With import blocks, generated configuration may not match the team's coding standards and needs manual cleanup. With moved blocks, the from address must exactly match the current state address, including module paths and index keys; if it does not, the block matches nothing and Terraform plans to destroy the existing object and create a new one at the new address. Architects should always run terraform plan after adding moved blocks and verify that no destroy/create actions appear.
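The for_each conversion case mentioned above can be sketched like this. The queue resource, its names, and the key `"orders"` are hypothetical; the sketch assumes the original single queue was already named `payments-orders-events`, so the moved instance keeps the same name.

```hcl
# Sketch: a resource that used to be a single instance is now a for_each collection.
# Resource and queue names are hypothetical.
resource "aws_sqs_queue" "events" {
  for_each = toset(["orders", "refunds"])

  # For the "orders" key this must equal the existing queue's name,
  # otherwise the move succeeds but the name change forces a replacement.
  name = "payments-${each.key}-events"
}

# Tell Terraform the old un-keyed address now lives under the "orders" key.
moved {
  from = aws_sqs_queue.events
  to   = aws_sqs_queue.events["orders"]
}
```

With this in place, a plan should report the address move and only the additional `refunds` queue as new; any destroy on the existing queue is a sign that the `from` address or the derived name does not match.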
Code Example
# Detect drift on the payments infrastructure without changing anything
terraform plan -refresh-only -out=drift-report.tfplan
# Review the drift report to see what changed externally
terraform show drift-report.tfplan
# Apply the saved refresh-only plan to update state to match reality (no infra changes)
# The saved plan already records refresh-only mode, so no extra flag is passed here
terraform apply drift-report.tfplan
# Import an existing RDS instance that was created manually in the console
# payments-data/main.tf
import {
# Specify the AWS resource ID of the existing database
id = "payments-orders-prod"
# Map it to this Terraform resource address
to = aws_db_instance.orders
}
resource "aws_db_instance" "orders" {
# Identifier matching the existing RDS instance name
identifier = "payments-orders-prod"
# Instance class matching the existing configuration
instance_class = "db.r6g.large"
# Engine matching the existing database
engine = "postgres"
# Engine version matching the existing database
engine_version = "16.3"
# Storage matching the existing allocation
allocated_storage = 500
# Prevent accidental deletion of the production database
deletion_protection = true
# Take a final snapshot if the instance is ever deleted
skip_final_snapshot = false
# Tag for operational identification
tags = {
Team = "payments"
Environment = "prod"
ManagedBy = "terraform"
}
}
# Refactor a resource into a module without destroying it
# Use moved block to update the state address
moved {
# Old address before modularization
from = aws_db_instance.orders
# New address inside the database module
to = module.orders_database.aws_db_instance.this
}
◈ Architecture Diagram
┌──────────┐
│ Cloud │
│ (actual) │
└────┬─────┘
│ refresh
┌────┴─────┐
│ State │
│ (stored) │
└────┬─────┘
│ compare
┌────┴─────┐
│ Config │
│ (desired)│
└────┬─────┘
│
┌────┴─────┐
│ Plan │
│ (action) │
└──────────┘
Quick Answer
Terraform state is a JSON file that maps your HCL configuration to real-world infrastructure resources. Remote state stores this file in a shared backend like S3 or Terraform Cloud, enabling team collaboration, state locking, and disaster recovery.
Detailed Answer
Terraform state is the backbone of how Terraform understands what infrastructure it manages. Think of it like a warehouse inventory ledger: without it, workers would walk into the warehouse every day not knowing what is already on the shelves, what was ordered, or what needs restocking. The state file (terraform.tfstate) is that ledger: it records every resource Terraform has created, its current attributes, metadata about dependencies, and the mapping between your HCL resource blocks and the actual cloud API objects. Internally, the state file is a JSON document containing a version number, a serial counter that increments on every write, a lineage UUID that uniquely identifies a state chain, and an array of resource objects. Each resource entry stores the provider, the resource type, the resource name, the mode (managed or data), and the full set of attributes returned by the provider API after creation. When you run terraform plan, Terraform reads this state, calls the cloud APIs to refresh the actual status of each resource, and then computes the diff between desired (your HCL) and actual (the refreshed state). Without state, Terraform would have no way to know that aws_db_instance.payments_db already exists and would try to create a duplicate every time. Local state works fine for a solo developer experimenting, but it becomes dangerous in production for several reasons. First, if two engineers run terraform apply simultaneously against local state files, they can create conflicting resources or corrupt state entirely, because each workstation holds its own copy and nothing locks across machines. Second, if your laptop dies or someone accidentally deletes the state file, you lose the mapping between code and infrastructure, making it extremely difficult to recover. Third, local state may contain sensitive outputs like database passwords or API keys stored in plaintext on a developer workstation. Remote state solves all of these problems. 
When you configure a backend like S3 with DynamoDB locking, the state file lives in a durable, versioned object store. DynamoDB provides a distributed lock so that only one terraform apply can run at a time, preventing race conditions. S3 versioning gives you automatic backup of every state revision, so you can roll back if something goes wrong. Additionally, remote state enables the terraform_remote_state data source, which lets one Terraform project read outputs from another — for example, a networking project can export VPC IDs that an application project consumes. In production, teams typically enforce remote state from day one using a backend configuration block, require encryption at rest and in transit, restrict access via IAM policies, and enable state locking. Ignoring remote state is one of the most common causes of Terraform disasters in growing organizations.
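Restricting state access via IAM, as recommended above, can be sketched as a narrowly scoped policy. The bucket, key, table, and account ID reuse the placeholder values from this answer's backend example; the policy name is hypothetical, and the KMS permissions needed for the encryption key are deliberately left out.

```hcl
# Sketch: least-privilege access to one state object and its lock table.
# ARNs reuse the placeholder names from the backend example; adjust to your account.
data "aws_iam_policy_document" "payments_state_access" {
  # Allow listing the bucket so the backend can locate the state object.
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::fintech-terraform-state-prod"]
  }

  # Read and write only this project's state file, not the whole bucket.
  statement {
    sid       = "ReadWritePaymentsState"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::fintech-terraform-state-prod/payments-platform/us-east-1/terraform.tfstate"]
  }

  # Acquire and release locks in the DynamoDB lock table.
  statement {
    sid = "StateLocking"
    actions = [
      "dynamodb:DescribeTable",
      "dynamodb:GetItem",
      "dynamodb:PutItem",
      "dynamodb:DeleteItem",
    ]
    resources = ["arn:aws:dynamodb:us-east-1:123456789012:table/fintech-terraform-locks"]
  }
}

# Hypothetical policy name; attach it to the role your pipeline assumes.
resource "aws_iam_policy" "payments_state_access" {
  name   = "payments-terraform-state-access"
  policy = data.aws_iam_policy_document.payments_state_access.json
}
```

Scoping the object ARN to a single key means the payments pipeline cannot read or overwrite the networking team's state, even though both live in the same bucket.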
Code Example
# Configure S3 backend with DynamoDB locking for the payments infrastructure
terraform {
# Use the S3 backend to store state remotely
backend "s3" {
# S3 bucket dedicated to Terraform state files
bucket = "fintech-terraform-state-prod"
# Path within the bucket for this specific project's state
key = "payments-platform/us-east-1/terraform.tfstate"
# AWS region where the S3 bucket lives
region = "us-east-1"
# DynamoDB table used for state locking to prevent concurrent applies
dynamodb_table = "fintech-terraform-locks"
# Enable server-side encryption for the state file at rest
encrypt = true
# Use a specific KMS key for encryption instead of default S3 key
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/payments-state-key"
}
}
# Read remote state from the networking project to get VPC details
data "terraform_remote_state" "networking" {
# Use the S3 backend to read another project's state
backend = "s3"
config = {
# Same state bucket but different key path for the networking project
bucket = "fintech-terraform-state-prod"
# The networking team's state file location
key = "networking/us-east-1/terraform.tfstate"
# Region must match the bucket's region
region = "us-east-1"
}
}
# Use the private subnet IDs from the networking project's remote state
resource "aws_db_subnet_group" "payments_db_subnets" {
# Name the subnet group after the payments database
name = "payments-db-subnet-group"
# Pull private subnet IDs from the networking project's outputs
subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
# Tag for cost tracking and ownership
tags = {
Team = "payments-backend"
Service = "payments-db"
}
}
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│                  Developer Workstation                   │
│  ┌──────────────┐    ┌──────────────┐                    │
│  │ main.tf      │    │ variables.tf │                    │
│  │ (HCL Config) │    │ (Inputs)     │                    │
│  └──────┬───────┘    └──────┬───────┘                    │
│         │                   │                            │
│         └─────────┬─────────┘                            │
│                   │                                      │
│          ┌────────▼────────┐                             │
│          │ terraform plan  │                             │
│          │ terraform apply │                             │
│          └────────┬────────┘                             │
└───────────────────┼──────────────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │     S3 Backend      │
         │  ┌───────────────┐  │
         │  │ .tfstate file │  │
         │  │ (encrypted)   │  │
         │  └───────────────┘  │
         └──────────┬──────────┘
                    │
         ┌──────────▼──────────┐
         │    DynamoDB Lock    │
         │  ┌───────────────┐  │
         │  │ LockID: hash  │  │
         │  │ Who: engineer │  │
         │  │ Created: ts   │  │
         │  └───────────────┘  │
         └─────────────────────┘
Quick Answer
Terraform plan is a dry-run that shows what changes Terraform would make without modifying any infrastructure, while terraform apply actually executes those changes. Plan reads state and config, computes a diff, and outputs it; apply performs that diff against real cloud APIs.
Detailed Answer
Understanding the difference between plan and apply is fundamental, but the internals reveal much more than just 'one previews, the other executes.' Think of terraform plan as a restaurant showing you the itemized bill before charging your card: you can review exactly what will happen. terraform apply is when the charge actually goes through.

When you run terraform plan, Terraform performs several steps internally. First, it loads the configuration by parsing all .tf files in the current directory and resolving module sources. Second, it reads the current state file to understand what resources already exist. Third, it performs a state refresh, calling each provider's APIs to check the actual status of every managed resource and updating the in-memory state with the real attributes. This refresh step is crucial because someone might have manually changed a security group rule outside of Terraform. Fourth, Terraform builds a dependency graph of all resources, computes the diff between desired state (your HCL) and actual state (refreshed), and produces an execution plan showing creates, updates, and destroys with specific attribute changes.

The plan output uses a clear notation: + for create, ~ for update in-place, - for destroy, and -/+ for destroy-then-recreate (also called a forced replacement). When you see -/+ next to your production database, stop and investigate: Terraform wants to destroy and recreate that resource, which could mean data loss.

terraform apply by default runs a plan first and asks for confirmation before proceeding. Once confirmed, Terraform walks the dependency graph and makes real API calls to create, update, or destroy resources. It processes independent resources in parallel (up to 10 by default, configurable with -parallelism) and dependent resources in dependency order. As each resource operation completes, Terraform records the result in state, so even if apply is interrupted midway, the state reflects what was actually created.

A critical production practice is using saved plan files. You run terraform plan -out=tfplan to save the plan to a binary file, review it, and then run terraform apply tfplan. This guarantees that what you reviewed is exactly what gets applied: no re-computation, and no changes from someone else's commit sneaking in between plan and apply. If the state has changed since the plan was saved, Terraform rejects the plan as stale instead of applying it. In CI/CD pipelines, this two-stage approach is essential. The plan stage runs in a pull request for review, and the apply stage runs only after merge using the exact saved plan.

One subtle gotcha: terraform apply without a saved plan re-computes the plan at apply time, so the infrastructure could have changed between when you reviewed the plan output and when apply runs. In fast-moving environments with multiple teams, this gap can cause surprises. Always use saved plans in production workflows.
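The advice to stop on a -/+ can be partly automated. The following is a minimal sketch, assuming the reviewed plan has been rendered to text with terraform show -no-color and that Terraform labels replacements with a "must be replaced" header line. The function names list_replacements and check_plan_text are hypothetical helpers, not part of Terraform.

```shell
#!/usr/bin/env bash
# Sketch only: gate a pipeline on forced replacements found in rendered plan text.
# Assumes stdin is the human-readable output of `terraform show -no-color <planfile>`.

# Print every "must be replaced" header line found on stdin (hypothetical helper).
list_replacements() {
  grep -E '^[[:space:]]*# .+ must be replaced' || true
}

# Return 0 when no replacement is present, 1 otherwise; offenders go to stderr.
check_plan_text() {
  local found
  found="$(list_replacements)"
  if [ -n "$found" ]; then
    printf 'Forced replacement detected:\n%s\n' "$found" >&2
    return 1
  fi
  return 0
}
```

A pipeline step could then run terraform show -no-color payments-deploy-2024-03-15.tfplan | check_plan_text and fail the job on a non-zero exit. Matching on rendered text is brittle across Terraform versions; the JSON form from terraform show -json is likely the more stable input for a real gate.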
Code Example
# Step 1: Run plan and save the output to a binary plan file
# The -out flag saves the computed plan for exact replay during apply
# terraform plan -out=payments-deploy-2024-03-15.tfplan
# Step 2: Review the plan output carefully before applying
# Look for any -/+ (destroy and recreate) on stateful resources
# terraform show payments-deploy-2024-03-15.tfplan
# Step 3: Apply the exact saved plan without re-computation
# terraform apply payments-deploy-2024-03-15.tfplan
# In a production CI/CD pipeline, the three commands above typically live in a Makefile or CI script

# Variable definitions for the payments infrastructure deployment
variable "db_instance_class" {
  # The RDS instance size for the payments database
  description = "Instance class for the payments RDS cluster"
  # Enforce string type to prevent accidental numeric input
  type        = string
  # Default to a production-grade instance size
  default     = "db.r6g.xlarge"
}

# RDS cluster that plan will evaluate and apply will create/update
# (the subnet group, security group, and secret it references are defined elsewhere)
resource "aws_rds_cluster" "payments_db" {
  # Unique cluster identifier following the naming convention
  cluster_identifier = "payments-db-prod-us-east-1"
  # Use Aurora PostgreSQL for the payments database engine
  engine = "aurora-postgresql"
  # Pin to a specific engine version to avoid surprise upgrades
  engine_version = "15.4"
  # Place the database in the payments VPC private subnets
  db_subnet_group_name = aws_db_subnet_group.payments_db_subnets.name
  # Use the payments database security group for network access control
  vpc_security_group_ids = [aws_security_group.payments_db_sg.id]
  # Master username for the database administrator account
  master_username = "payments_admin"
  # Pull the password from AWS Secrets Manager, never hardcode
  master_password = data.aws_secretsmanager_secret_version.db_password.secret_string
  # Enable deletion protection to prevent accidental terraform destroy
  deletion_protection = true
  # Skip final snapshot only in dev; always snapshot in prod
  skip_final_snapshot = false
  # Use a fixed snapshot name: calling timestamp() here would yield a new value
  # on every run, so each plan would report a spurious in-place update
  final_snapshot_identifier = "payments-db-final-snapshot"
  # Tags for cost allocation and ownership tracking
  tags = {
    Service     = "payments-processing"
    Environment = "production"
    BackupTier  = "critical"
  }
}
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────┐
│ terraform plan                                            │
│                                                           │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐               │
│  │ Load HCL │──→│ Read     │──→│ Refresh  │               │
│  │ Config   │   │ State    │   │ via API  │               │
│  └──────────┘   └──────────┘   └────┬─────┘               │
│                                     │                     │
│                              ┌──────▼──────┐              │
│                              │ Compute     │              │
│                              │ Diff        │              │
│                              └──────┬──────┘              │
│                                     │                     │
│                              ┌──────▼──────┐              │
│                              │ Plan Output │              │
│                              │ + create    │              │
│                              │ ~ update    │              │
│                              │ - destroy   │              │
│                              └──────┬──────┘              │
└─────────────────────────────────────┼─────────────────────┘
                                      │
                            ┌─────────▼──────────┐
                            │  Saved Plan File   │
                            │  (.tfplan binary)  │
                            └─────────┬──────────┘
                                      │
┌─────────────────────────────────────┼─────────────────────┐
│ terraform apply                     │                     │
│                              ┌──────▼──────┐              │
│                              │ Walk Dep    │              │
│                              │ Graph       │              │
│                              └──────┬──────┘              │
│                                     │                     │
│                       ┌─────────────┼─────────────┐       │
│                 ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐ │
│                 │ Create    │ │ Update    │ │ Destroy   │ │
│                 │ Resources │ │ Resources │ │ Removed   │ │
│                 └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │
│                       └─────────────┼─────────────┘       │
│                              ┌──────▼──────┐              │
│                              │ Write State │              │
│                              └─────────────┘              │
└───────────────────────────────────────────────────────────┘
Quick Answer
Multi-stage builds keep build tools out of runtime images. Cache ordering speeds up builds. Pinning base images helps reproducibility, but you must rebuild often and scan for vulnerabilities so old layers don't hide security holes.
Detailed Answer
Think of a restaurant kitchen that uses mixers, cutting boards, and raw ingredients to make a meal, but only sends the finished plate to the customer. A bad container image ships the entire kitchen to the table. A good multi-stage Docker build uses one stage as the kitchen and a second stage as the clean plate. The final image has only what the app needs to run, which cuts size, attack surface, and surprises in production.

Docker's build best practices call for multi-stage builds, smart base image choices, a solid .dockerignore file, skipping unnecessary packages, using the build cache wisely, and rebuilding images regularly. The real concern is not just size. Every extra package, shell, compiler, credential, or leftover file in the final image gives attackers something to inspect or exploit. Smaller runtime images are easier to scan, faster to transfer, quicker to start, and simpler to reason about when things go wrong.

The build process runs Dockerfile instructions in order and can reuse cached layers when an instruction and its inputs have not changed. This is why you copy stable dependency files before frequently changing source code. With BuildKit, teams can also use cache mounts and secret mounts so dependency downloads go faster and credentials never become image layers. Multi-stage builds then copy only selected files from builder stages into the final runtime stage.

In production pipelines, engineers pin base images by digest for repeatable builds, scan images for known vulnerabilities, generate SBOMs (software bills of materials), sign artifacts, and rebuild even when app code has not changed. Rebuilds matter because a pinned digest freezes the base layer, while security patches land only in newer image versions. Good teams track image size, critical CVE counts, startup time, pull latency, and rollback success rates as key metrics.

The tricky part is that reproducibility and freshness pull in opposite directions. Pinning python:3.12-slim@sha256:... makes builds predictable, but it also locks in vulnerabilities until someone bumps the digest. Floating tags pick up patches automatically, but they can change under you and create builds you cannot reproduce. Senior engineers solve this with automated dependency-update PRs, signed digests, scheduled CI rebuilds, and policy gates. The goal is to make image supply-chain safety boring and routine rather than heroic and manual.
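One of the policy gates mentioned above could be a check that every external base image is pinned by digest. This is a minimal sketch under stated assumptions: the Dockerfile arrives on stdin, each FROM instruction sits on a single line, and the function name unpinned_base_images is a hypothetical helper. It does not resolve ARG-based image names.

```shell
#!/usr/bin/env bash
# Sketch only: list FROM lines whose image is not pinned with an @sha256 digest.
# Earlier stage names (FROM builder) and "scratch" are skipped because they are
# not pulled from a registry.
unpinned_base_images() {
  awk '
    toupper($1) == "FROM" {
      i = 2
      if ($i ~ /^--/) i++                      # skip a flag such as --platform=...
      image = $i
      alias = (toupper($(i + 1)) == "AS") ? tolower($(i + 2)) : ""
      is_stage = (tolower(image) in stages)    # refers to an earlier named stage
      if (alias != "") stages[alias] = 1
      if (image == "scratch" || is_stage) next
      if (image !~ /@sha256:[0-9a-f]+$/) print NR ": " image
    }'
}
```

Running unpinned_base_images < Dockerfile in CI and failing when the output is non-empty gives the reproducibility half of the trade-off; the scheduled rebuilds and digest-bump PRs described above still have to supply the freshness half.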
Code Example
docker buildx build --pull --tag registry.internal/payments-api:2026.06.18 --file Dockerfile . # Builds with a refreshed base image tag and a traceable release tag.
docker history registry.internal/payments-api:2026.06.18 # Inspects final layers to confirm build tools and secrets were not copied into runtime.
docker scout cves registry.internal/payments-api:2026.06.18 # Scans the image for known vulnerabilities before promotion.
docker push registry.internal/payments-api:2026.06.18 # Publishes the reviewed image to the internal registry.
docker image inspect registry.internal/payments-api:2026.06.18 --format '{{json .RepoDigests}}' # Captures the immutable registry digest for deployment manifests (RepoDigests is populated only after a push or pull).
◈ Architecture Diagram
┌──────────┐
│ Source │
└────┬─────┘
↓
┌──────────┐
│ Builder │
└────┬─────┘
↓ copy
┌──────────┐
│ Runtime │
└────┬─────┘
↓ scan
┌──────────┐
│ Registry │
└────┬─────┘
↓
┌──────────┐
│ Deploy │
└──────────┘
Quick Answer
Multi-stage builds use multiple FROM lines to separate build tools from runtime artifacts, so the final image has only the compiled binary and minimal OS libraries. Ordering dependency installs before source code copies maximizes cache hits and avoids full rebuilds when only app code changes.
Detailed Answer
Think of a woodworking shop. You need saws, clamps, sandpaper, and a workbench to build a cabinet, but the customer only gets the finished cabinet. They do not take home the saw. A multi-stage Docker build works the same way: one stage has all the build tools, and the final stage holds only the finished product.

In Docker, a multi-stage build uses multiple FROM instructions in a single Dockerfile. Each FROM begins a new stage with its own base image and filesystem. Intermediate stages can install compilers, download dependencies, run tests, and produce artifacts. The final stage starts from a tiny base like distroless or alpine and copies only the compiled binaries or bundled assets from earlier stages using COPY --from. This means the production image never contains gcc, npm, pip, or any build toolchain, which removes hundreds of megabytes and a large number of potentially vulnerable packages from the runtime image.

Under the hood, Docker and BuildKit process each stage as an independent node in the build graph. BuildKit can run independent stages in parallel, which is a major speed advantage over the legacy builder. When the Dockerfile is ordered correctly (base image first, dependency manifest copy second, dependency install third, source code copy fourth, build fifth), BuildKit reuses cached layers for everything up to the point where content changes. Since dependency manifests like package.json or go.sum change far less often than source code, this ordering means most CI builds only rebuild the final compilation step instead of re-downloading all dependencies.

At production scale, teams running 50 or more microservices through CI can see dramatic results. As an illustration, a payments-api image that was 1.2 GB with a single-stage node build can drop to around 85 MB with a multi-stage build using distroless as the final base. CI time can drop from 8 minutes to 2 minutes because dependency layers are cached. Security scanners can report on the order of 90 percent fewer vulnerabilities because the final image has no compilers, shells, or package managers. Teams should also use .dockerignore to exclude test fixtures, documentation, and local configs from the build context, which prevents unnecessary cache busting and reduces context transfer time.

The tricky gotcha is that COPY --from can reference stages by numeric index (stage 0, stage 1), and those references break silently when someone inserts a new stage. Always name stages with AS and reference them by name. Another trap is copying an entire directory from the build stage instead of specific artifacts, which can accidentally include build caches, test output, or sensitive files in the production image. Architects should also know that multi-stage builds do not automatically clean up intermediate images in CI. BuildKit's garbage collection handles this, but disk pressure on CI runners can still build up if the builder's cache size limits are not configured.
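The index-based COPY --from trap is easy to lint for. This is a minimal sketch, assuming the Dockerfile is passed on stdin and each COPY instruction starts on its own line; numeric_stage_refs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: print "line:instruction" for every COPY that references a build
# stage by numeric index (COPY --from=0 ...) instead of by an AS name.
numeric_stage_refs() {
  grep -nEi '^[[:space:]]*COPY[[:space:]].*--from=[0-9]+([[:space:]]|$)' || true
}
```

A pre-merge check could run numeric_stage_refs < Dockerfile and reject the change when anything is printed, which keeps stage references stable when new stages are added.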
Code Example
# Dockerfile for payments-api using multi-stage build
# Stage 1: Install dependencies in a full Node image
FROM node:22-bookworm AS deps
# Set the working directory for dependency installation
WORKDIR /build
# Copy only the dependency manifests first to maximize cache hits
COPY package.json package-lock.json ./
# Install production dependencies with exact versions from lockfile
# (--omit=dev replaces the deprecated --production flag)
RUN npm ci --omit=dev

# Stage 2: Build the application with dev dependencies
FROM node:22-bookworm AS builder
# Set the working directory for the build process
WORKDIR /build
# Copy all dependency manifests for full install including dev deps
COPY package.json package-lock.json ./
# Install all dependencies including TypeScript compiler and test tools
RUN npm ci
# Copy source code after dependencies to preserve layer cache
COPY src/ ./src/
# Copy TypeScript config for compilation
COPY tsconfig.json ./
# Compile TypeScript to JavaScript in the dist directory
RUN npm run build

# Stage 3: Production image with only runtime artifacts
FROM gcr.io/distroless/nodejs22-debian12 AS production
# Set a non-root user for security hardening
USER 1000
# Set the working directory for the application
WORKDIR /app
# Copy only production node_modules from the deps stage
COPY --from=deps /build/node_modules ./node_modules/
# Copy only the compiled JavaScript from the builder stage
COPY --from=builder /build/dist ./dist/
# Expose the API port for documentation and container networking
EXPOSE 8080
# Run the compiled application entry point (the distroless image's entrypoint is node)
CMD ["dist/server.js"]
◈ Architecture Diagram
┌──────────┐
│ deps │
│ npm ci │
└────┬─────┘
│
┌────┴─────┐
│ builder │
│ compile │
└────┬─────┘
│ COPY --from
┌────┴─────┐
│production│
│distroless│
└──────────┘
Quick Answer
Secure Docker images use multi-stage builds to exclude build tools from the final image, run as non-root users with explicit UIDs, start from minimal base images like Distroless or Alpine, pin dependencies to digests, and drop all unnecessary capabilities. This reduces the attack surface from hundreds of exploitable packages to a minimal runtime footprint.
Detailed Answer
Think of building a secure Docker image like constructing a bank vault room. During construction, workers bring in welding equipment, power tools, scaffolding, and raw materials. Once the vault is complete, every construction tool is removed from the room. The vault door is keyed to specific authorized personnel, not a master key. The room contains only what is needed for its purpose: reinforced walls, a locking mechanism, and a ventilation system. If a thief breaks in, they find no tools to use against the vault itself. Multi-stage Docker builds follow the same principle: build tools exist only during construction and never ship to production.

A multi-stage Dockerfile separates the build environment from the runtime environment using multiple FROM statements. The first stage installs compilers, package managers, testing frameworks, and build dependencies needed to compile the application. The second stage starts from a minimal base image and copies only the compiled binary or application artifacts from the build stage. For a Java banking application, the build stage might use a full JDK image with Maven, while the runtime stage uses a Distroless Java image that contains only the JRE and no shell, package manager, or system utilities. This dramatically reduces the number of packages that vulnerability scanners flag and eliminates tools that attackers could use for post-exploitation activities like installing malware or pivoting to other services.

Running containers as non-root is a fundamental security control. It limits the damage of a container breakout, because a compromised process does not start out with root privileges that could carry over to the host. The Dockerfile creates a dedicated application user with a specific numeric UID and GID, changes ownership of application files to that user, and switches to that user with the USER directive before the ENTRYPOINT. In banking environments, the specific UID matters because it must match file permissions on mounted volumes and satisfy Pod Security Standards that require runAsNonRoot in Kubernetes. Using numeric UIDs instead of usernames avoids dependency on /etc/passwd, which may not exist in Distroless images. The non-root user should have no shell assigned and no home directory beyond what the application needs.

Minimal base images are the foundation of attack surface reduction. A standard Ubuntu base image contains roughly 100 installed packages, including a shell, core utilities, and a package manager. An Alpine image reduces this to roughly 15 packages. A Google Distroless image contains only the application runtime and its direct dependencies, with no shell at all. For banking applications, Distroless is preferred for production because if an attacker gains code execution inside the container, they cannot open a shell, install tools, or inspect the filesystem interactively. When debugging is needed, teams use ephemeral debug containers through kubectl debug rather than shipping debug tools in production images.

The production gotcha that catches many teams is the interaction between read-only root filesystems and application behavior. Many frameworks write temporary files, session data, or compilation caches to the filesystem at runtime. When the root filesystem is read-only, these writes fail and the application crashes. Teams must identify every path the application writes to and mount emptyDir volumes at those paths. Log files should go to stdout and stderr rather than filesystem paths. Another subtle issue is layer ordering in the Dockerfile: placing frequently changing instructions like COPY of application code after rarely changing instructions like dependency installation maximizes build cache utilization and can cut build times from minutes to seconds. In regulated environments, every base image must also be scanned and approved through the organization's software supply chain process before it can be used as a FROM source.
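The numeric, non-root USER requirement described above can be checked before an image is built. This is a minimal sketch, assuming the Dockerfile is supplied on stdin; final_stage_runs_as_nonroot is a hypothetical helper name. It is deliberately conservative: a final stage with no USER instruction fails the check even when the base image (for example a :nonroot Distroless tag) already defaults to a non-root user.

```shell
#!/usr/bin/env bash
# Sketch only: succeed when the last stage of a Dockerfile sets a numeric,
# non-zero UID with USER. Each FROM resets the tracked user because a new
# stage starts from its base image's default user.
final_stage_runs_as_nonroot() {
  awk '
    toupper($1) == "FROM" { user = "" }
    toupper($1) == "USER" { user = $2 }
    END {
      split(user, parts, ":")                 # accept both UID and UID:GID forms
      if (parts[1] ~ /^[0-9]+$/ && parts[1] + 0 != 0) exit 0
      exit 1
    }'
}
```

Used as final_stage_runs_as_nonroot < Dockerfile || exit 1 in a pipeline, this complements the runAsNonRoot enforcement in Kubernetes by catching the problem before deployment.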
Code Example
# Secure multi-stage Dockerfile for payments-api (Spring Boot)
# Stage 1: Build with a Maven image that bundles JDK 17 (the plain Temurin JDK image does not ship mvn)
FROM maven:3.9-eclipse-temurin-17 AS builder
WORKDIR /build
# Cache dependencies separately from application code
COPY pom.xml .
RUN mvn dependency:go-offline -B
# Copy source and build
COPY src/ src/
# Package the app, then extract the layered Spring Boot JAR for optimal Docker layers
RUN mvn package -DskipTests -B && \
    java -Djarmode=layertools -jar target/payments-api.jar extract --destination extracted
# Stage 2: Runtime — minimal Distroless image (no shell, no pkg manager)
FROM gcr.io/distroless/java17-debian12:nonroot
# Labels for audit and compliance tracking
LABEL maintainer="[email protected]" \
app="payments-api" \
compliance="sox-pci" \
base-image="distroless-java17"
WORKDIR /app
# Copy Spring Boot layers in dependency order for cache efficiency
COPY --from=builder /build/extracted/dependencies/ ./
COPY --from=builder /build/extracted/spring-boot-loader/ ./
COPY --from=builder /build/extracted/snapshot-dependencies/ ./
COPY --from=builder /build/extracted/application/ ./
# Run as non-root user (UID 65532 is the nonroot user in Distroless)
USER 65532:65532
# Document the port the API listens on (Kubernetes probes are configured in the pod spec, not here)
EXPOSE 8080
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", \
"-Djava.security.egd=file:/dev/./urandom", \
"org.springframework.boot.loader.launch.JarLauncher"]
# Compare image sizes to prove attack surface reduction
# docker images
# REPOSITORY TAG SIZE
# payments-api-full latest 580MB (JDK + Maven + OS tools)
# payments-api latest 210MB (Distroless JRE only)
# Verify no shell exists in the production image
# (override the entrypoint; otherwise /bin/sh would be passed to java as an argument)
# docker run --rm --entrypoint /bin/sh payments-api
# exec: "/bin/sh": stat /bin/sh: no such file or directory
# Scan the final image for vulnerabilities
trivy image --severity CRITICAL,HIGH ecr.bank.com/payments-api:v2.3.1
◈ Architecture Diagram
┌─────────────────────────────────────────────┐
│ Multi-Stage Build                           │
│                                             │
│ Stage 1: Builder                            │
│ ┌───────────────────────────┐               │
│ │ JDK 17 + Maven            │               │
│ │ Source Code               │               │
│ │ Test Frameworks           │ ← DISCARDED   │
│ │ Build Tools               │               │
│ │ OS Packages (580MB)       │               │
│ └─────────────┬─────────────┘               │
│               │ COPY --from=builder         │
│               ↓ (JAR only)                  │
│ Stage 2: Runtime                            │
│ ┌───────────────────────────┐               │
│ │ Distroless Java 17        │               │
│ │ payments-api.jar          │ ← SHIPPED     │
│ │ USER 65532 (non-root)     │               │
│ │ No shell, no pkg mgr      │               │
│ │ Read-only rootFS (210MB)  │               │
│ └───────────────────────────┘               │
└─────────────────────────────────────────────┘
Quick Answer
A multi-stage build uses multiple FROM lines in one Dockerfile. You compile code in one stage and copy only the finished artifact into a tiny runtime image. This slashes image size and attack surface.
Detailed Answer
Think of a multi-stage build like a factory assembly line. In a car factory, welding robots, paint booths, and heavy machinery stay on the factory floor. Only the finished car rolls out and into the showroom. A multi-stage Docker build works the same way: all the bulky compilers, build tools, and source code stay in the build stage, while only the lean, finished binary ships in the final image. The key idea is separating what you need to build from what you need to run.

A multi-stage Docker build is a Dockerfile pattern where you write multiple FROM instructions, each starting a fresh stage. The first stage usually pulls a full SDK or compiler image, installs dependencies, compiles source code, runs tests, and produces a deployable file. Later stages start from a tiny base image like alpine or distroless and use COPY --from to grab only the necessary files from earlier stages. The result is a final image that contains nothing except what the app needs to run, with no leftover build tools, package caches, or temporary files.

Under the hood, Docker treats each stage as a separate image with its own filesystem and layer history. When Docker hits a second FROM instruction, it starts from a clean base while keeping the earlier stages available for reference. The COPY --from=builder command reaches into the filesystem of the named stage and pulls out specific paths. Each stage can use a completely different base image. For example, you might build with golang:1.21 and run with gcr.io/distroless/static-debian12. The build cache is tracked per instruction within each stage, so a change that only affects a later stage does not force earlier stages to rebuild, which makes development iterations faster.

In production, multi-stage builds are considered a must-have for several reasons. First, they shrink image size dramatically. A Go app built in a standard golang image might weigh 900MB, but the final distroless image with just the static binary can be under 15MB. Smaller images pull faster across registries, scale quicker in Kubernetes, cost less to store, and give security scanners far less to audit. Second, multi-stage builds fit cleanly into CI/CD pipelines because the entire build process lives inside the Dockerfile, with no external build scripts or Makefiles needed. Third, they make builds more consistent, since every developer and every CI runner uses the same build environment defined in that first stage.

A common mistake is copying too many files from the build stage. Using COPY --from=builder / /app/ instead of targeting a specific directory can accidentally pull in source code, credential files, or package caches that bloat the image and create security risks. Always copy only the exact artifact you need. Another subtle issue: build arguments (ARG) defined in one stage are not available in later stages. You must redeclare ARG after each FROM if you need the same value. Finally, intermediate images and build cache are not always cleaned up automatically, so running docker image prune (and docker builder prune for the BuildKit cache) regularly is important to free disk space on build servers.
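The over-broad copy mistake described above can be flagged with a small lint. This is a minimal sketch, assuming a Dockerfile on stdin and shell-form COPY instructions with a single source path; broad_stage_copies is a hypothetical helper name, and it does not handle the JSON array form of COPY or multi-source copies.

```shell
#!/usr/bin/env bash
# Sketch only: report COPY --from=<stage> instructions whose source is the whole
# filesystem ("/") or the whole working directory ("." or "./").
broad_stage_copies() {
  awk '
    toupper($1) == "COPY" {
      from_stage = 0
      for (i = 2; i <= NF; i++) if ($i ~ /^--from=/) from_stage = 1
      src = $(NF - 1)                          # second-to-last field is the source
      if (from_stage && NF >= 4 && (src == "/" || src == "." || src == "./"))
        print NR ": " $0
    }'
}
```

Running broad_stage_copies < Dockerfile prints the offending lines with their line numbers; copies within a stage from the build context, such as COPY . ., are intentionally ignored because they never reach the final image unless a later COPY --from pulls them in.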
Code Example
# Stage 1: Build the payments-api Go binary
FROM golang:1.21-alpine AS builder
# Set the working directory inside the build container
WORKDIR /src
# Copy go.mod and go.sum first to leverage layer caching
COPY go.mod go.sum ./
# Download dependencies (cached if go.mod/go.sum unchanged)
RUN go mod download
# Copy the entire source code into the build container
COPY . .
# Compile the payments-api binary with CGO disabled for static linking
RUN CGO_ENABLED=0 GOOS=linux go build -o /payments-api ./cmd/server

# Stage 2: Create a minimal production image
FROM gcr.io/distroless/static-debian12
# Copy only the compiled binary from the builder stage
COPY --from=builder /payments-api /payments-api
# Copy the config file needed at runtime
COPY --from=builder /src/config/production.yaml /config/production.yaml
# Expose the port the payments-api listens on
EXPOSE 8080
# Set the entrypoint to run the payments-api binary
ENTRYPOINT ["/payments-api"]
◈ Architecture Diagram
┌─────────────────────────────┐
│  Stage 1: builder           │
│  FROM golang:1.21-alpine    │
│                             │
│  ┌───────────────────────┐  │
│  │ Source Code + go.mod  │  │
│  └───────────┬───────────┘  │
│              ↓              │
│  ┌───────────────────────┐  │
│  │ go build → /payments  │  │
│  └───────────┬───────────┘  │
└──────────────┼──────────────┘
               ↓ COPY --from=builder
┌──────────────┼──────────────┐
│  Stage 2: runtime           │
│  FROM distroless            │
│              ↓              │
│  ┌───────────────────────┐  │
│  │ /payments (binary)    │  │
│  │ /config/prod.yaml     │  │
│  └───────────────────────┘  │
│                             │
│  EXPOSE 8080                │
└─────────────────────────────┘