34 interview questions · azure, kubernetes, terraform
Quick Answer
Least-privilege RBAC in EKS maps IAM roles to Kubernetes Roles and ClusterRoles through aws-auth ConfigMap or EKS access entries. Dev teams get namespace-scoped Roles, CI/CD pipelines use dedicated ServiceAccounts with deploy-only permissions, and production admin access uses break-glass procedures with time-bound credentials and full audit logging.
Detailed Answer
Think of RBAC in EKS like a bank's physical access control system. A teller can access their counter and the shared vault during business hours. A security guard can access the camera room and patrol areas but cannot open customer safe deposit boxes. The branch manager has a master key stored in a sealed envelope that requires two signatures to open and is only used during emergencies. Each person has exactly the permissions they need for their daily job, and any elevation requires approval, logging, and time limits. EKS RBAC follows the same principle across three distinct personas: developers, pipelines, and production administrators.
For development teams, the foundation is namespace isolation. Each team or application gets its own Kubernetes namespace, and a Role scoped to that namespace grants only the verbs and resources the team needs. A developer working on the payments-api service needs permission to view pods, logs, events, and ConfigMaps in the payments namespace but should never modify Deployments directly or access secrets in the fraud-detection namespace. EKS maps these permissions through IAM role assumption: developers authenticate via AWS SSO, assume an IAM role like payments-dev-role, and the aws-auth ConfigMap or EKS access entries map that IAM role to a Kubernetes Group bound to the appropriate Role. This separation means revoking a developer's access is a single IAM policy change, not a cluster-level operation.
CI/CD pipelines require a different permission model because they are non-interactive, automated, and high-privilege by nature. A Jenkins or GitHub Actions pipeline that deploys to Kubernetes needs permission to create and update Deployments, Services, and ConfigMaps, but it should never read Secrets directly, modify RBAC bindings, or access other namespaces. Each pipeline gets a dedicated Kubernetes ServiceAccount with a Role that permits only the resources it deploys. Pipeline credentials should be short-lived rather than static long-lived tokens. A runner inside the cluster authenticates with its projected ServiceAccount token, which rotates automatically, while IAM Roles for Service Accounts (IRSA) covers the AWS API calls the pipeline makes and leaves an audit trail in CloudTrail. A pipeline running outside the cluster can assume a dedicated IAM role through OIDC federation and reach the cluster through the aws-auth ConfigMap or an access entry. Pipeline permissions are further restricted by resource names where possible: a payments-api pipeline can only update the payments-api Deployment, not the fraud-detector Deployment in the same namespace. Note that resourceNames cannot restrict create requests, because the object name may not be known when the request is authorized.
Production administrator access in a banking environment follows the break-glass model. Day-to-day operations should not require cluster-admin access. Instead, SRE teams have read-only ClusterRoles that allow viewing resources across all namespaces for monitoring and troubleshooting. When a genuine emergency requires elevated access, engineers request temporary credentials through a privileged access management system like CyberArk or AWS IAM Identity Center with a defined session duration. The break-glass IAM role maps to a cluster-admin ClusterRoleBinding, and every action taken with that role is logged to CloudTrail, Kubernetes audit logs, and the organization's SIEM. After the incident window closes, the session expires automatically.
The production gotcha that catches many teams is RBAC drift and stale bindings. Over months, teams accumulate RoleBindings for departed employees, decommissioned pipelines, and experimental namespaces. Without periodic access reviews, the RBAC surface area grows silently. Banking regulators such as the OCC expect periodic access certifications (often run quarterly), so mature teams automate RBAC auditing by exporting all RoleBindings and ClusterRoleBindings, mapping them to active IAM identities, and flagging orphaned or overly broad bindings. Another common mistake is granting wildcard permissions on resources or verbs during initial setup and never narrowing them. A ClusterRole with apiGroups: ['*'], resources: ['*'], and verbs: ['*'] that is bound cluster-wide is functionally equivalent to cluster-admin and defeats the entire purpose of RBAC.
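The access-review step described above (export the bindings, compare them with active identities, flag the leftovers) can be sketched as a small shell helper. This is a minimal illustration rather than a complete audit tool: the function name flag_orphaned_bindings and both input file formats are assumptions made for the example.

```shell
# Hypothetical helper for an RBAC access review.
# $1: file with one "<binding-name> <subject-name>" pair per line
# $2: file with one active identity (group, user, or ServiceAccount) per line
# Prints the name of every binding whose subject is not in the active list.
flag_orphaned_bindings() {
  awk '
    FILENAME == ARGV[1] { active[$1] = 1; next }  # first file: active identities
    !($2 in active)     { print $1 }              # second file: binding/subject pairs
  ' "$2" "$1"
}
```

One possible way to produce the first file, assuming jq is installed, is kubectl get rolebindings,clusterrolebindings -A -o json | jq -r '.items[] | .metadata.name as $n | .subjects[]? | "\($n) \(.name)"'. The list of active identities would come from whatever IAM or identity-provider export the organization already has.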
Code Example
# Namespace-scoped Role for payments dev team (read-only, no secrets)
# rbac-payments-dev.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-dev-readonly
  namespace: payments
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps", "events"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
---
# Bind IAM-mapped group to the Role
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-dev-binding
  namespace: payments
subjects:
  - kind: Group
    name: payments-dev-team # Mapped from IAM role via aws-auth
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: payments-dev-readonly
  apiGroup: rbac.authorization.k8s.io
---
# CI/CD pipeline ServiceAccount with deploy-only permissions
# rbac-cicd-payments.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cicd-payments-deployer
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/cicd-payments-deployer
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cicd-deployer
  namespace: payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    resourceNames: ["payments-api"] # Restrict to specific deployment
    verbs: ["get", "patch", "update"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "create", "update"]
---
# Bind the deploy-only Role to the pipeline ServiceAccount
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cicd-deployer-binding
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: cicd-payments-deployer
    namespace: payments
roleRef:
  kind: Role
  name: cicd-deployer
  apiGroup: rbac.authorization.k8s.io
---
# Break-glass ClusterRoleBinding for production emergencies
# rbac-breakglass.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: breakglass-admin
  annotations:
    audit.bank.com/justification: "emergency-access-only"
    audit.bank.com/max-duration: "2h"
subjects:
  - kind: Group
    name: sre-breakglass # Time-bound IAM role via AWS SSO
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
# Audit (shell): find all ClusterRoleBindings granting cluster-admin
kubectl get clusterrolebindings -o json | \
  jq '.items[] | select(.roleRef.name=="cluster-admin") | .metadata.name'
◈ Architecture Diagram
┌─────────────────────────────────────────────────────┐
│                     EKS Cluster                     │
├─────────────────────────────────────────────────────┤
│                                                     │
│  ┌──────────┐  aws-auth   ┌───────────────────┐     │
│  │ AWS SSO  │────────────→│ K8s Group Mapping │     │
│  │ IAM Role │             └─────────┬─────────┘     │
│  └──────────┘                       │               │
│                                     ↓               │
│  ┌────────────────┐  ┌────────────────────────┐     │
│  │ Dev Team       │  │ payments namespace     │     │
│  │ (read-only)    │→ │ Role: pods,logs,events │     │
│  └────────────────┘  └────────────────────────┘     │
│                                                     │
│  ┌────────────────┐  ┌────────────────────────┐     │
│  │ CI/CD Pipeline │  │ payments namespace     │     │
│  │ (IRSA token)   │→ │ Role: deploy only      │     │
│  └────────────────┘  └────────────────────────┘     │
│                                                     │
│  ┌────────────────┐  ┌────────────────────────┐     │
│  │ SRE Break-Glass│  │ cluster-admin          │     │
│  │ (time-bound)   │→ │ 2hr session + audit    │     │
│  └────────────────┘  └────────────────────────┘     │
└─────────────────────────────────────────────────────┘
Quick Answer
Production-only failures are caused by environment differences: tighter resource quotas, stricter network policies, different secrets or certificates, higher traffic load, or missing IAM permissions. Prevention requires environment parity through identical Helm charts with per-env values, pre-production load testing, and promotion gates that verify production-specific dependencies.
Detailed Answer
Think of a car that runs perfectly on a test track but breaks down on a real highway. The test track has smooth roads, no traffic, and perfect weather. The highway has potholes, rush hour, and rain. The car itself did not change; the environment did. Production failures after Dev/UAT success follow the same pattern: the application code is identical, but the surrounding infrastructure, load, and security boundaries are different.
The most common causes of production-only failures are resource constraints (production has stricter quotas or different instance types), network restrictions (production network policies block connections that were open in UAT), secret and certificate differences (production uses different database endpoints, API keys, or TLS certificates), external dependency behavior (third-party APIs rate-limit production traffic differently), and traffic volume (production handles 100x the requests that UAT sees, exposing concurrency bugs or connection pool exhaustion).
To diagnose a production-only failure, compare the environment configurations side by side. kubectl diff compares local manifests against one cluster's live state, so run it once per context to see how the same manifests differ from what is running in UAT and in production. Check resource quotas, network policies, service mesh rules, and ingress configurations. Examine the application logs for connection errors to databases, caches, or external APIs. Use kubectl top to compare actual resource usage between environments. Check whether the production node pool has different instance types, kernel parameters, or container runtime versions.
Prevention requires several practices. Use the same Helm chart or Kustomize base across all environments, varying only through values files. Implement a staging environment that mirrors production networking, security, and scale. Run automated smoke tests and integration tests after every deployment to every environment. Use canary deployments in production so that only a small percentage of traffic hits the new version initially. Maintain a deployment checklist that verifies production-specific prerequisites: database migrations completed, feature flags configured, secrets rotated, and external dependencies reachable.
The non-obvious gotcha is that even with perfect environment parity, production can still fail due to data differences. A database migration that works on a small Dev dataset can lock tables for minutes on a production dataset with millions of rows. Connection pools that are adequate for UAT traffic can be exhausted under production load. Architects should include data volume and traffic simulation in pre-production testing, not just functional correctness.
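The deployment checklist mentioned above can be turned into an automated promotion gate. The following is a minimal sketch, assuming each prerequisite can be expressed as a shell command that exits 0 on success; the function name promotion_gate and the "name|command" input format are hypothetical.

```shell
# Hypothetical promotion gate: reads "name|command" lines on stdin,
# runs each command, prints PASS or FAIL per check, and returns
# non-zero if any check failed.
promotion_gate() {
  failed=0
  while IFS='|' read -r name cmd; do
    [ -n "$name" ] || continue                  # skip blank lines
    # Detach stdin so a check cannot consume the remaining checklist.
    if sh -c "$cmd" </dev/null >/dev/null 2>&1; then
      printf 'PASS %s\n' "$name"
    else
      printf 'FAIL %s\n' "$name"
      failed=1
    fi
  done
  return "$failed"
}
```

A checklist file could then hold lines such as db-reachable|nc -z payments-db.internal.company.com 5432, and a pipeline would run promotion_gate < checklist.txt before promoting a release. The check names and commands are placeholders for whatever prerequisites apply to a given service.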
Code Example
# Compare Helm values between UAT and Production to find differences
diff <(helm get values payments-api -n payments --output yaml --kube-context uat-cluster) \
<(helm get values payments-api -n payments --output yaml --kube-context prod-cluster) # Shows value differences
# Check resource quotas in production namespace
kubectl describe resourcequota -n payments --context prod-cluster # Shows CPU/memory limits vs used
# Compare network policies between environments
kubectl get networkpolicy -n payments --context uat-cluster -o yaml > /tmp/uat-netpol.yaml # Export UAT policies
kubectl get networkpolicy -n payments --context prod-cluster -o yaml > /tmp/prod-netpol.yaml # Export prod policies
diff /tmp/uat-netpol.yaml /tmp/prod-netpol.yaml # Highlights network policy differences
# Check if production pods can reach the database
kubectl exec -n payments deploy/payments-api --context prod-cluster -- \
nc -zv payments-db.internal.company.com 5432 # Tests database connectivity from the pod
# Compare actual resource usage between environments
kubectl top pods -n payments --context prod-cluster --sort-by=memory # Shows memory consumption in production
◈ Architecture Diagram
┌─────┐  ┌─────┐  ┌─────┐
│ Dev │─→│ UAT │─→│ Prod│
│  ✓  │  │  ✓  │  │  ✗  │
└─────┘  └─────┘  └──┬──┘
                     ↓
               ┌────────────┐
               │ Compare    │
               │ ─ Quotas   │
               │ ─ NetPol   │
               │ ─ Secrets  │
               │ ─ Scale    │
               └────────────┘
Quick Answer
Production incident troubleshooting follows a triage sequence: check pod status and events, examine container logs, verify resource usage and node health, inspect network connectivity, and correlate with monitoring dashboards. RCA documentation should capture timeline, root cause, contributing factors, resolution steps, and preventive actions in a blameless post-mortem format.
Detailed Answer
Think of a hospital emergency room. When a patient arrives, the team follows a triage protocol: check vitals first, then run diagnostics, then treat. They do not start with the most complex test; they start with the quickest indicators and narrow down. After the patient recovers, the team documents what happened, why, and how to improve response for future cases. Kubernetes incident response works the same way.
In Kubernetes, production incident triage starts with the broadest view and narrows. First, check cluster health: are nodes Ready, is the control plane responsive? Then check the affected namespace: are pods Running, are services reachable? Then drill into the specific failing component: what do the pod events say, what do the container logs show, what are the resource usage numbers? This top-down approach prevents wasting time on application logs when the real problem is a node running out of disk space.
The diagnostic sequence uses specific commands at each level. For cluster health: kubectl get nodes and kubectl top nodes. For namespace health: kubectl get pods -n payments and kubectl get events -n payments --sort-by=lastTimestamp. For pod-level diagnosis: kubectl describe pod, kubectl logs with --previous for crashed containers, and kubectl exec for connectivity tests. For resource issues: kubectl top pods and checking against resource requests and limits. For networking: checking Service endpoints, NetworkPolicy rules, and DNS resolution from inside the pod.
At production scale, RCA documentation follows a blameless post-mortem template. The document captures: incident summary (what happened, when, impact), timeline (when detected, when triaged, when resolved), root cause (the specific technical failure), contributing factors (what made detection or resolution slower), resolution steps (what was done to fix it), and preventive actions (what changes will prevent recurrence). Each preventive action should be a concrete ticket with an owner and deadline, not a vague statement like 'improve monitoring.' Examples: 'Add PDB for payments-api with minAvailable=2' or 'Create OOMKilled alert with threshold at 80% memory use.'
The non-obvious gotcha is that teams often fix the immediate symptom without addressing the root cause. A pod restarting due to OOMKilled might be fixed by increasing memory limits, but the real cause might be a memory leak in the application, an unbounded cache, or a goroutine leak. The RCA should distinguish between the immediate fix (increase limits) and the long-term fix (patch the memory leak). Without this distinction, the same incident recurs at a larger scale.
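As an illustration of a concrete preventive action, the 'Add PDB for payments-api with minAvailable=2' ticket could be implemented roughly as follows. This is a sketch under assumptions: the helper name render_pdb, the "-pdb" name suffix, and the app label used in the selector are all hypothetical and must match the labels on the real Pods.

```shell
# Hypothetical helper: prints a PodDisruptionBudget manifest.
# $1: workload name (also assumed to be the value of the "app" label)
# $2: minimum number of Pods that must stay available during disruptions
render_pdb() {
  cat <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: $1-pdb
spec:
  minAvailable: $2
  selector:
    matchLabels:
      app: $1
EOF
}
```

The output can be reviewed first and then applied with render_pdb payments-api 2 | kubectl apply -n payments -f -.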
Code Example
# Step 1: Check cluster and node health
kubectl get nodes -o wide # Verify all nodes are Ready and check kernel/runtime versions
kubectl top nodes # Check CPU and memory pressure across nodes
# Step 2: Check affected namespace
kubectl get pods -n payments -o wide # See pod status, restarts, node placement
kubectl get events -n payments --sort-by='.lastTimestamp' | tail -30 # Recent events showing failures
# Step 3: Drill into the failing pod
kubectl describe pod payments-api-7d9f8b6c4-x2k9m -n payments # Shows events, conditions, configured requests/limits
kubectl logs payments-api-7d9f8b6c4-x2k9m -n payments --previous --tail=100 # Logs from the crashed container
# Step 4: Check resource usage vs limits
kubectl top pod payments-api-7d9f8b6c4-x2k9m -n payments # Current CPU and memory usage
kubectl get pod payments-api-7d9f8b6c4-x2k9m -n payments -o jsonpath='{.spec.containers[0].resources}' # Configured limits
# Step 5: Check if OOMKilled was the termination reason
kubectl get pod payments-api-7d9f8b6c4-x2k9m -n payments -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}' # Shows OOMKilled if applicable
# Step 6: Document the incident in the RCA template
# Timeline: 14:23 alert fired → 14:25 on-call acknowledged → 14:31 root cause identified → 14:35 fix applied
# Root cause: payments-api memory leak in connection pool causing OOMKilled after 4 hours of operation
# Fix: increased memory limit from 512Mi to 1Gi (immediate), patched connection pool cleanup (permanent)
◈ Architecture Diagram
┌──────────┐
│ Incident │
└────┬─────┘
     ↓
┌──────────┐
│ Triage   │
│ Node→Pod │
└────┬─────┘
     ↓
┌──────────┐
│ Diagnose │
│ Logs+Evt │
└────┬─────┘
     ↓
┌──────────┐
│ Fix+RCA  │
│ Prevent  │
└──────────┘
Quick Answer
Check the container's last termination reason and exit code: OOMKilled shows reason=OOMKilled with exit 137, connectivity failures show timeout errors in application logs with exit 1, and application bugs show stack traces or panic messages in logs with exit 1 or 2. The distinction comes from correlating exit codes, termination reasons, and log content.
Detailed Answer
Think of a car that keeps stalling. A mechanic checks three things in order: is it out of fuel (OOMKilled, out of memory), is the road blocked (connectivity, cannot reach a dependency), or is the engine itself broken (application bug). Each has a different diagnostic signature, and checking them in the right order saves time.
In Kubernetes, every container termination has metadata that points to the cause. The container status records the exit code, the termination reason, and the termination message. OOMKilled is the clearest: Kubernetes sets the reason field to OOMKilled and the exit code to 137 (128 + 9, the signal number of SIGKILL). This means the kernel's Out-Of-Memory killer terminated the process because it exceeded its cgroup memory limit. The container did not choose to exit; it was killed by the kernel.
For connectivity failures, the exit code is typically 1 (generic application error) and the logs show timeout or connection refused messages when trying to reach a database, cache, or external API. The key diagnostic is checking the application logs for patterns like 'connection refused,' 'timeout,' 'no such host,' or 'TLS handshake failed.' You can verify by execing into the pod and testing connectivity manually with nc, curl, or nslookup to isolate whether it is a DNS, network policy, or service availability issue.
For application bugs, the exit code is 1 or sometimes 2 (misuse), and the logs show stack traces, null pointer exceptions, panic messages, or assertion failures. These failures are usually deterministic: the same input or configuration triggers the same crash. You can distinguish them from connectivity issues because the error occurs during request processing or startup logic, not during a connection attempt.
The non-obvious gotcha is that OOMKilled can masquerade as an application bug if you only check logs. When the OOM killer strikes, the process is terminated immediately, so there may be no log line because the application never got a chance to write one. If you see a container with exit code 137, zero log output, and high restart count, check the termination reason field directly. Also, a JVM application may show exit code 1 with a java.lang.OutOfMemoryError in logs if it hits the JVM heap limit before hitting the cgroup limit. This is an application-level OOM, not a kernel OOMKill, and the fix is different: adjust the JVM heap settings rather than the container memory limit.
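The decision logic described above can be condensed into a short classifier. This is a rough sketch, not a definitive diagnostic: the function name classify_termination and its category labels are invented for the example, and the log patterns are only the ones quoted in the text.

```shell
# Hypothetical helper: guesses why a container terminated.
# $1: termination reason (.lastState.terminated.reason)
# $2: exit code          (.lastState.terminated.exitCode)
# $3: tail of the previous container logs (may be empty)
classify_termination() {
  if [ "$1" = "OOMKilled" ]; then
    echo "kernel-oom"        # killed at the cgroup memory limit
  elif [ "$2" = "0" ]; then
    echo "clean-exit"
  elif printf '%s\n' "$3" | grep -q 'java.lang.OutOfMemoryError'; then
    echo "jvm-heap-oom"      # application-level OOM, not a kernel OOMKill
  elif printf '%s\n' "$3" | grep -Eiq 'connection refused|timeout|no such host|TLS handshake failed'; then
    echo "connectivity"
  elif [ "$2" = "137" ]; then
    echo "sigkill-unknown"   # SIGKILL without an OOMKilled reason
  else
    echo "application-bug"   # inspect logs for a stack trace or panic
  fi
}
```

The three arguments would typically come from kubectl get pod with a jsonpath query on .status.containerStatuses[0].lastState.terminated and from kubectl logs --previous. The termination reason is checked first because an OOM-killed container may have produced no logs at all.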
Code Example
# Step 1: Check termination reason and exit code
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason, exitCode, startedAt, finishedAt
# Step 2: If exit code 137, confirm OOMKilled
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -i 'oom\|killed\|reason' # Confirms OOMKilled
# Step 3: Check memory usage vs limits for OOMKilled
kubectl top pod payments-api-7d9f8b6c4-abc12 -n payments # Current memory usage
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
-o jsonpath='{.spec.containers[0].resources.limits.memory}' # Configured memory limit
# Step 4: If exit code 1, check logs for connectivity vs application error
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=100 # Check for timeout/connection vs stack trace
# Step 5: Test connectivity from inside the pod
kubectl exec -n payments deploy/payments-api -- nc -zv payments-db.internal 5432 # Test database connectivity
kubectl exec -n payments deploy/payments-api -- nslookup redis-cache.payments.svc # Test DNS resolution
# Quick reference for exit codes:
# Exit 0 = Normal termination (container completed successfully)
# Exit 1 = Application error (check logs for stack trace or connection error)
# Exit 137 = SIGKILL (OOMKilled by kernel or killed by kubelet)
# Exit 143 = SIGTERM (graceful shutdown, often from liveness probe failure)
◈ Architecture Diagram
┌──────────────┐
│ Pod Restart  │
└──────┬───────┘
       ↓
┌──────────────┐
│ Exit Code?   │
├──────┬───────┤
│ 137  │ 1     │
│ OOM  │ Logs? │
└──┬───┴───┬───┘
   ↓       ↓
┌─────┐ ┌────────┐
│ OOM │ │Timeout?│
│Kill │ ├────┬───┤
└─────┘ │Yes │No │
          ↓    ↓
        ┌────┐┌────┐
        │Conn││ Bug│
        └────┘└────┘
Quick Answer
EKS uses IAM for authentication and Kubernetes RBAC for authorization. IAM roles or SSO identities are mapped to Kubernetes groups via the aws-auth ConfigMap or EKS access entries, then RoleBindings and ClusterRoleBindings grant permissions to those groups. L1 gets read-only, L2 gets namespace-scoped edit, Developers get deploy permissions, and Admins get cluster-admin.
Detailed Answer
Think of a hospital with badge-based access. Your employee badge (IAM identity) gets you through the front door (authentication), but which rooms you can enter depends on your department and clearance level (RBAC authorization). A nurse can access patient rooms but not the pharmacy vault. A doctor can access both. The badge system and the room access system are separate but connected.
In EKS, authentication and authorization are separate layers. Authentication answers 'who are you?' using AWS IAM: users, roles, or SSO identities present AWS credentials to the EKS API server. Authorization answers 'what can you do?' using Kubernetes RBAC: Roles, ClusterRoles, RoleBindings, and ClusterRoleBindings define permissions. The bridge between them is the EKS access entry system (or the legacy aws-auth ConfigMap) which maps IAM principals to Kubernetes usernames and groups.
The implementation involves three steps. First, create IAM roles for each access level: eks-l1-readonly, eks-l2-support, eks-developer, eks-admin. Second, map each IAM role to a Kubernetes group using EKS access entries (preferred) or the aws-auth ConfigMap. Third, create Kubernetes ClusterRoles and RoleBindings that grant appropriate permissions to each group. L1 support gets a ClusterRoleBinding to the built-in view ClusterRole (read-only across all namespaces). L2 support gets RoleBindings to the edit ClusterRole in specific namespaces. Developers get custom Roles with deploy permissions (create/update Deployments, Services, ConfigMaps) in their team namespaces. Admins get ClusterRoleBinding to cluster-admin.
At production scale, teams integrate EKS with AWS SSO (IAM Identity Center) so users authenticate through their corporate identity provider. Permission sets in AWS SSO map to IAM roles, which map to Kubernetes groups. This creates a chain: corporate identity → SSO permission set → IAM role → Kubernetes group → RBAC permissions. Monitoring should include audit logs for who accessed what, periodic access reviews, and alerts on cluster-admin usage.
The non-obvious gotcha is that the aws-auth ConfigMap is a single point of failure. If someone deletes or corrupts it, all IAM-based access to the cluster is lost (except the cluster creator's IAM principal). EKS access entries, the newer mechanism, are managed through the EKS API and are more resilient. Teams should also be aware that IAM permissions and Kubernetes RBAC are evaluated independently: having IAM access to the EKS API does not automatically grant Kubernetes permissions, and vice versa.
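Because the two layers are evaluated independently, it helps to verify the RBAC side explicitly after mapping a group. The sketch below wraps kubectl auth can-i with impersonation. The function name expect_access and the impersonated user name rbac-check are placeholders, and the caller is assumed to have permission to impersonate users and groups.

```shell
# Hypothetical helper: asserts what a Kubernetes group may do.
# Usage: expect_access <group> <verb> <resource> <namespace> <yes|no>
expect_access() {
  # kubectl requires --as whenever --as-group is used; it prints "yes" or "no"
  # and exits non-zero on "no", hence the "|| true".
  actual=$(kubectl auth can-i "$2" "$3" -n "$4" \
    --as=rbac-check --as-group="$1" 2>/dev/null) || true
  if [ "$actual" = "$5" ]; then
    printf 'OK   %s %s %s in %s: %s\n' "$1" "$2" "$3" "$4" "$actual"
  else
    printf 'FAIL %s %s %s in %s: got %s, expected %s\n' \
      "$1" "$2" "$3" "$4" "$actual" "$5"
    return 1
  fi
}
```

For the tiers described above, a check run might include expect_access l1-support delete pods payments no and expect_access l2-support update deployments payments yes. Impersonation exercises only the RBAC bindings for the group; it does not test the IAM-to-group mapping itself.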
Code Example
# Create EKS access entry for L1 read-only support team
aws eks create-access-entry \
  --cluster-name payments-cluster \
  --principal-arn arn:aws:iam::123456789012:role/eks-l1-readonly \
  --kubernetes-groups l1-support # Maps IAM role to Kubernetes group
# Associate read-only access policy for L1
aws eks associate-access-policy \
  --cluster-name payments-cluster \
  --principal-arn arn:aws:iam::123456789012:role/eks-l1-readonly \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
  --access-scope type=cluster # Grants read-only across all namespaces
# Create Kubernetes RBAC for L2 support with edit access in payments namespace
apiVersion: rbac.authorization.k8s.io/v1 # RBAC API group
kind: RoleBinding # Namespace-scoped permission binding
metadata:
  name: l2-support-edit # Descriptive binding name
  namespace: payments # L2 can edit resources in payments namespace only
subjects:
  - kind: Group # References a Kubernetes group
    name: l2-support # Group name mapped from IAM role
    apiGroup: rbac.authorization.k8s.io # RBAC API group
roleRef:
  kind: ClusterRole # References the built-in edit role
  name: edit # Allows create, update, delete of most resources
  apiGroup: rbac.authorization.k8s.io # RBAC API group
# Verify what permissions a group has (--as-group must be combined with --as; the user name is a placeholder)
kubectl auth can-i list pods -n payments --as=l1-test-user --as-group=l1-support # Expect yes if l1-support is bound to a read-only role
kubectl auth can-i delete pods -n payments --as=l1-test-user --as-group=l1-support # Expect no (read-only cannot delete)
◈ Architecture Diagram
      ┌──────────┐
      │ IAM Role │
      └────┬─────┘
           ↓
      ┌──────────┐
      │ Access   │
      │ Entry    │
      └────┬─────┘
           ↓
      ┌──────────┐
      │ K8s Group│
      └────┬─────┘
           ↓
┌──────────────────────┐
│ RBAC Bindings        │
│ L1    → view         │
│ L2    → edit (ns)    │
│ Dev   → deploy (ns)  │
│ Admin → cluster-admin│
└──────────────────────┘
Quick Answer
RBAC (Role-Based Access Control) in Kubernetes controls who can perform what actions on which resources. Roles and ClusterRoles define permissions (verbs on resources), RoleBindings and ClusterRoleBindings attach those permissions to subjects (Users, Groups, or ServiceAccounts), with Roles scoped to a namespace and ClusterRoles scoped cluster-wide.
Detailed Answer
Think of RBAC like the security system of a large office building. A Role is like a keycard that opens specific doors on a specific floor (namespace). A ClusterRole is like a master keycard that works across all floors. A RoleBinding is the act of issuing a keycard to a specific employee for a specific floor. A ServiceAccount is an employee badge for an automated system (like the mail robot) that needs to move through certain areas. Without RBAC, every employee would have a master key, which is the equivalent of running everything as cluster-admin.
In Kubernetes, RBAC is one of several authorization modules (others include ABAC, Webhook, and Node authorization). It is enabled by default in most distributions and is the standard mechanism for controlling access to the API server. RBAC operates on four object types: Role (namespaced permissions), ClusterRole (cluster-wide permissions), RoleBinding (grants a Role or ClusterRole to subjects in a specific namespace), and ClusterRoleBinding (grants a ClusterRole to subjects across all namespaces). The API server evaluates RBAC rules on every request by checking if any binding grants the requesting subject the required verb on the requested resource.
Internally, when a request hits the kube-apiserver, it passes through three stages: Authentication (who are you?), Authorization (are you allowed?), and Admission Control (any mutations or validations?). During the Authorization stage, the RBAC authorizer retrieves all RoleBindings and ClusterRoleBindings that reference the requesting subject. For each binding, it checks if the associated Role or ClusterRole contains a rule that matches the request's verb (get, list, create, update, patch, delete, watch), resource (pods, services, deployments), API group (apps, batch, networking.k8s.io), and optionally the specific resource name. If any rule matches, the request is allowed; if no rule matches across all bindings, the request is denied. Rules are additive only: there are no deny rules in RBAC.
At scale, RBAC management becomes complex. Large organizations use ClusterRoles as templates bound via RoleBindings in specific namespaces, allowing a single ClusterRole like 'namespace-admin' to be reused across hundreds of namespaces. Aggregated ClusterRoles (using aggregationRule with label selectors) allow CRD operators to automatically extend existing roles. ServiceAccounts are the primary identity for Pods: each namespace has a 'default' ServiceAccount, and Pods that do not specify a ServiceAccount use it. Since Kubernetes 1.24, a long-lived token Secret is no longer created automatically for each ServiceAccount; instead, the TokenRequest API issues short-lived, audience-bound tokens that are mounted into Pods through projected volumes.
A non-obvious gotcha is that RoleBindings can reference ClusterRoles, which is actually a powerful pattern. You define the ClusterRole once and bind it in specific namespaces, scoping its permissions to that namespace. Without this pattern, you would need to duplicate Role definitions in every namespace. Another trap: the default ServiceAccount in each namespace often has no permissions (good), but many teams add permissions to the default ServiceAccount instead of creating dedicated ServiceAccounts per workload. This means any Pod in the namespace that uses the default ServiceAccount inherits those permissions, violating least privilege. The automountServiceAccountToken: false setting should be applied to the default ServiceAccount, and workload-specific ServiceAccounts should be created for Pods that actually need API access.
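The additive, allow-only evaluation described above can be modeled in a few lines. This is a deliberately simplified sketch, not the API server's implementation: it ignores resourceNames, subresources, and nonResourceURLs, the function name rbac_allows is hypothetical, and the rules are assumed to be flattened into "<apiGroup> <resource> <verb>" lines with "core" standing in for the empty core group.

```shell
# Hypothetical model of RBAC rule matching.
# $1: file with one "<apiGroup> <resource> <verb>" rule per line ("*" is a wildcard)
# $2 $3 $4: apiGroup, resource, and verb of the incoming request
# Returns 0 if at least one rule matches, 1 otherwise. There is no deny rule:
# adding rules can only widen access.
rbac_allows() {
  awk -v g="$2" -v r="$3" -v v="$4" '
    ($1 == "*" || $1 == g) &&
    ($2 == "*" || $2 == r) &&
    ($3 == "*" || $3 == v) { found = 1 }
    END { exit !found }
  ' "$1"
}
```

With a rules file that holds the read-only rules of a role such as config-reader, rbac_allows rules.txt core pods get succeeds while rbac_allows rules.txt core pods delete fails, simply because no rule grants delete.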
Code Example
# ServiceAccount for the payments processing service
apiVersion: v1
kind: ServiceAccount
metadata:
name: payments-processor-sa # Dedicated ServiceAccount for this workload
namespace: payments # Scoped to payments namespace
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789:role/payments-s3 # IAM role for AWS access
automountServiceAccountToken: true # This SA needs API access
---
# ClusterRole defining permissions for reading secrets and configmaps
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: config-reader # Reusable ClusterRole for config reading
rules:
- apiGroups: [""] # Core API group (empty string)
resources: ["configmaps", "secrets"] # Can access ConfigMaps and Secrets
verbs: ["get", "list", "watch"] # Read-only operations
- apiGroups: [""] # Core API group
resources: ["pods"] # Can view Pod status
verbs: ["get", "list"] # Read-only, no watch needed
- apiGroups: ["apps"] # Apps API group
resources: ["deployments"] # Can view Deployment status
verbs: ["get", "list"] # Read-only access
---
# RoleBinding scoping the ClusterRole to the payments namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: payments-config-reader # Binding name
namespace: payments # Scoped to payments namespace only
subjects:
- kind: ServiceAccount # Bind to a ServiceAccount
name: payments-processor-sa # The dedicated SA created above
namespace: payments # SA's namespace
roleRef:
kind: ClusterRole # Reference a ClusterRole (not a Role)
name: config-reader # The ClusterRole defined above
apiGroup: rbac.authorization.k8s.io # RBAC API group
---
# Deployment using the dedicated ServiceAccount
apiVersion: apps/v1
kind: Deployment
metadata:
name: payments-processor # Payments processor Deployment
namespace: payments # In payments namespace
spec:
replicas: 3 # Three replicas for availability
selector:
matchLabels:
app: payments-processor # Pod selector
template:
metadata:
labels:
app: payments-processor # Pod label
spec:
serviceAccountName: payments-processor-sa # Use the dedicated SA, not default
automountServiceAccountToken: true # Mount the token for API access
containers:
- name: payments-processor # Main container
image: registry.internal.io/payments-processor:v2.0.3 # App image
---
# Lock down the default ServiceAccount to prevent accidental API access
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default # Override the default SA
  namespace: payments # In payments namespace
automountServiceAccountToken: false # Do NOT auto-mount token for default SA
◈ Architecture Diagram
┌──────────┐    ┌──────────┐    ┌──────────┐
│ User /   │    │ Role     │    │ Resources│
│ Service  │    │ Binding  │    │ pods,svc │
│ Account  │───→│          │───→│ secrets  │
└──────────┘    └────┬─────┘    └──────────┘
                     │
                     ↓
                ┌──────────┐
                │ Role /   │
                │ Cluster  │
                │ Role     │
                │          │
                │ verbs:   │
                │ get,list │
                │ create   │
                └──────────┘
Scope:
┌──────────┐    ┌──────────┐
│ Role     │    │ Cluster  │
│Namespace │    │ Role     │
│ Scoped   │    │ Global   │
└──────────┘    └──────────┘
Quick Answer
RBAC defines who can perform which actions on which resources, either within a namespace or cluster-wide, so that only authorized users have access and unauthorized modifications are prevented.
Detailed Answer
Imagine you're managing a company where different departments have different levels of access to sensitive information. For example, HR has access to employee records, while IT controls the network infrastructure. RBAC in Kubernetes is like setting up these rules: defining roles based on job functions (e.g., editor, viewer) and then assigning those roles to specific users or groups (like department heads). This makes sure only authorized personnel can make changes or view sensitive data. Role-Based Access Control (RBAC) in Kubernetes allows administrators to define roles with specific permissions and bind these roles to users, groups, or ServiceAccounts. This mechanism restricts what actions a subject can perform, such as creating, reading, updating, or deleting resources. Role and ClusterRole objects hold the rules (allowed verbs on resource types), and RoleBindings and ClusterRoleBindings link those roles to subjects. RBAC is purely additive: there are no deny rules, so a request is allowed only if some binding grants it. When a user makes an API request, the API server's authorization step checks whether any of the user's role bindings grants the requested verb on the requested resource. At scale, engineers need to configure RBAC policies carefully to balance security and usability. They use namespace-specific roles for better isolation between teams and projects. Policy engines like Open Policy Agent (typically deployed as the Gatekeeper admission controller) can complement RBAC by validating requests against additional rules at admission time. Common issues include overly permissive policies leading to accidental or malicious modifications, or complex permission hierarchies that are difficult to manage. A critical gotcha is the difference between namespace-scoped and cluster-wide grants. A Role and a RoleBinding apply only within a single namespace. A ClusterRole is defined cluster-wide, but its effective scope depends on the binding: referenced from a RoleBinding it grants access only in that binding's namespace, while a ClusterRoleBinding grants it across all namespaces. Binding a role at the wrong scope can silently grant far more access than intended.
Code Example
# Create a Role that allows reading pods in the payments namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader # Role name
  namespace: payments # Scoped to payments namespace
rules:
  - apiGroups: [""] # Core API group
    resources: ["pods", "pods/log"] # Can read pods and their logs
    verbs: ["get", "list", "watch"] # Read-only operations
---
# Bind the role to the dev-team group
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dev-pod-reader
  namespace: payments
subjects:
  - kind: Group
    name: dev-team # Kubernetes group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader # References the Role above
  apiGroup: rbac.authorization.k8s.io

# Test what a member of the group can do
# (--as-group must be combined with --as; "dev-user" is a placeholder user name)
kubectl auth can-i list pods -n payments --as=dev-user --as-group=dev-team
Quick Answer
IRSA (IAM Roles for Service Accounts) uses an OIDC identity provider to authenticate Kubernetes service account tokens with AWS IAM. The pod receives a projected service account token, presents it to AWS STS, and receives short-lived credentials scoped to a specific IAM role. No access keys are stored in secrets or environment variables.
Detailed Answer
Think of a hotel key card system. Instead of giving every guest a master key (hardcoded credentials), the front desk verifies your identity (OIDC provider), issues a card (short-lived token) that only opens your specific room (IAM role), and the card expires at checkout. If someone steals the card, it stops working soon and only ever opened one room. IRSA works the same way for pods accessing AWS services. Without IRSA, teams typically use one of three insecure patterns: storing AWS access keys in Kubernetes Secrets (which can leak through RBAC, etcd backups, or misconfigured pod access), assigning an IAM instance profile to the entire node (which gives every pod on that node the same permissions), or using tools like kube2iam that intercept the metadata endpoint (which adds complexity and latency). IRSA eliminates all three by giving each pod its own identity that AWS trusts directly. The mechanism works through several coordinated components. First, the EKS cluster has an OIDC provider registered with AWS IAM. This tells AWS to trust tokens issued by the Kubernetes API server. Second, an IAM role is created with a trust policy that specifies which Kubernetes service account in which namespace can assume it. The trust policy condition checks the OIDC issuer, the audience, and the subject (system:serviceaccount:namespace:sa-name). Third, the Kubernetes ServiceAccount is annotated with eks.amazonaws.com/role-arn pointing to the IAM role. Fourth, when a pod using that ServiceAccount starts, the EKS pod identity webhook injects a projected service account token volume and sets AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables. The AWS SDK in the application reads these, calls STS AssumeRoleWithWebIdentity, and receives temporary credentials. In production, IRSA provides least-privilege access at the pod level. 
The payments-api can access only the S3 bucket it needs and the SQS queue it consumes from, while the checkout-worker in the same namespace can access DynamoDB but not S3. If one pod is compromised, the blast radius is limited to its specific IAM role permissions. The kubelet rotates the projected token automatically before it expires (EKS defaults to a 24-hour token lifetime), the STS credentials obtained from it are short-lived (one hour by default, configurable from 15 minutes up to the role's maximum session duration), and every AssumeRoleWithWebIdentity call is recorded in CloudTrail, which makes credential misuse easier to detect. The non-obvious gotcha is that IRSA requires the application to use an AWS SDK version that supports web identity token authentication (SDK v2 or AWS SDK for Go v1.25+, Python boto3 1.9.220+). Legacy applications that only read static keys from environment variables will not work without an SDK upgrade or code changes. Another common issue is trust policy misconfiguration: if the namespace or service account name in the condition does not match exactly, STS rejects the AssumeRoleWithWebIdentity call with AccessDenied. Depending on the SDK's credential provider chain, the pod may then fall back to the node's instance profile, which can mask the failure. EKS Pod Identity is the newer alternative that simplifies the trust policy setup but requires the EKS Pod Identity Agent DaemonSet.
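The trust relationship described above can be sketched as two declarations. This is a minimal illustration under stated assumptions, not a definitive setup: it assumes the IAM role is defined through CloudFormation (any infrastructure-as-code tool would do), reuses the account ID and role name from the commands in this section, and uses a placeholder region (`us-east-1`) and OIDC provider ID (`EXAMPLE0123456789`) that must be replaced with the cluster's real values.

```yaml
# File 1 (CloudFormation template, applied to AWS): IAM role with an IRSA trust policy.
# The region and OIDC provider ID below are placeholders.
Resources:
  PaymentsApiRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: payments-api-s3-access
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              # The cluster's OIDC provider as registered in IAM
              Federated: "arn:aws:iam::123456789012:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789"
            Action: "sts:AssumeRoleWithWebIdentity"
            Condition:
              StringEquals:
                # Audience must be STS
                "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789:aud": "sts.amazonaws.com"
                # Subject pins the role to one ServiceAccount in one namespace
                "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE0123456789:sub": "system:serviceaccount:payments:payments-api"
---
# File 2 (Kubernetes manifest, applied to the cluster): ServiceAccount annotated with the role ARN.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payments-api
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/payments-api-s3-access"
```

The `sub` condition is the part that must match the namespace and ServiceAccount name exactly; a typo there is the usual cause of an AccessDenied response from STS.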
Code Example
# Check if the EKS cluster has an OIDC provider configured
aws eks describe-cluster --name production --query 'cluster.identity.oidc.issuer'
# Verify the ServiceAccount has the IAM role annotation
kubectl get sa payments-api -n payments -o yaml | grep eks.amazonaws.com/role-arn
# Inspect a running pod to confirm IRSA environment variables are injected
kubectl exec -n payments payments-api-7f8d9c-x4k -- env | grep AWS
# Expected output:
# AWS_ROLE_ARN=arn:aws:iam::123456789012:role/payments-api-s3-access
# AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
# Check the projected token is mounted in the pod
kubectl exec -n payments payments-api-7f8d9c-x4k -- ls /var/run/secrets/eks.amazonaws.com/serviceaccount/
# Verify the IAM role trust policy allows the correct service account
aws iam get-role --role-name payments-api-s3-access --query 'Role.AssumeRolePolicyDocument'
# Test that the pod can actually assume the role
kubectl exec -n payments payments-api-7f8d9c-x4k -- aws sts get-caller-identity
# If IRSA fails, check the OIDC provider is registered in IAM
aws iam list-open-id-connect-providers
◈ Architecture Diagram
┌──────────────┐     1. Token issued      ┌──────────────────┐
│   K8s API    │─────────────────────────▶│  Pod (payments)  │
│   Server     │                          │  ServiceAccount  │
└──────┬───────┘                          └────────┬─────────┘
       │                                           │
       │ 2. OIDC validates                         │ 3. AssumeRoleWithWebIdentity
       ↓                                           ↓
┌──────────────┐                          ┌──────────────────┐
│   AWS IAM    │◀─────────────────────────│     AWS STS      │
│ OIDC Provider│    trust policy check    │                  │
└──────────────┘                          └────────┬─────────┘
                                                   │ 4. Temporary credentials
                                                   ↓
                                          ┌──────────────────┐
                                          │   AWS S3 / SQS   │
                                          └──────────────────┘
Quick Answer
NetworkPolicies control pod-to-pod network traffic. RBAC controls who can perform what actions on which Kubernetes API resources. Pod Security Standards restrict what pods can do at runtime (privileged containers, host access, capabilities). Together, they form three layers: API access control, runtime restrictions, and network segmentation.
Detailed Answer
Think of securing a building. RBAC is the badge system that controls who can enter which floors and rooms (API permissions). Pod Security Standards are the building codes that prevent tenants from doing dangerous things like removing fire exits or storing explosives (runtime restrictions). NetworkPolicies are the internal walls and locked corridors that prevent someone on one floor from accessing another floor without authorization (network segmentation). Each layer addresses a different attack vector, and all three are needed for comprehensive security. RBAC (Role-Based Access Control) governs who can interact with the Kubernetes API and what operations they can perform. A Role defines permissions (verbs like get, list, create, delete on resources like pods, secrets, deployments) within a namespace. A ClusterRole defines permissions cluster-wide. RoleBindings and ClusterRoleBindings associate roles with users, groups, or service accounts. Without RBAC, a compromised service account could read Secrets from other namespaces, create privileged pods, or delete critical workloads. Properly scoped RBAC ensures the payments-api ServiceAccount can only read its own ConfigMaps and Secrets, not those belonging to other teams. Pod Security Standards (the replacement for the deprecated PodSecurityPolicy) define three levels: Privileged (unrestricted), Baseline (prevents known privilege escalations), and Restricted (heavily hardened). These are enforced through the Pod Security Admission controller using namespace labels. Restricted mode prevents running as root, using host networking, mounting hostPath volumes, adding Linux capabilities, and running privileged containers. This matters because a container escape from a privileged pod gives full root access to the host node, which compromises all pods on that node and potentially the entire cluster. NetworkPolicies, as the third layer, restrict which pods can communicate with which other pods and external systems. 
Even if an attacker compromises a pod, NetworkPolicies prevent lateral movement to the database, secrets store, or other microservices. Combined with RBAC preventing the compromised pod's ServiceAccount from reading other Secrets, and Pod Security preventing privilege escalation to the host, the blast radius of a single compromised container is contained to that container's existing data and network connections. In production, these three controls must be deployed together because each has blind spots. RBAC alone cannot prevent a pod from connecting to a database it should not access (that is a network concern). NetworkPolicies alone cannot prevent a pod from running as root and escaping to the host. Pod Security alone cannot prevent a compromised pod from calling the Kubernetes API to read secrets. Defense in depth means that bypassing one control does not grant full access. The non-obvious gotcha is that Pod Security Admission only warns or denies at pod creation time — it does not retroactively affect running pods. If you add Restricted enforcement to a namespace with existing non-compliant pods, those pods continue running until they are recreated. Another common gap is that RBAC for ServiceAccounts often starts too permissive (using default ServiceAccount with broad permissions) and is never tightened. Teams should create dedicated ServiceAccounts per workload with minimal permissions and disable token automounting for pods that do not need API access.
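The runtime and network layers described above can each be switched on with a few lines of YAML. The following is a minimal sketch, assuming the `payments` namespace used throughout this section and a CNI plugin that actually enforces NetworkPolicy (without one, the policy is accepted by the API server but has no effect). The policy name `default-deny-all` is illustrative.

```yaml
# Layer 2: enforce the Restricted Pod Security Standard via namespace labels.
# "warn" additionally surfaces violations to the client on kubectl apply.
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
---
# Layer 3: default-deny NetworkPolicy. The empty podSelector matches every pod
# in the namespace; with no ingress or egress rules listed, all traffic is blocked
# until more specific allow policies are added.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: payments
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```

Starting from default-deny and adding narrow allow rules per workload mirrors the least-privilege approach used for RBAC. Note that denying all egress also blocks DNS, so an allow rule for the cluster DNS service is usually the first one added.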
Code Example
# Check RBAC: what can the payments-api ServiceAccount do?
kubectl auth can-i --list --as=system:serviceaccount:payments:payments-api -n payments
# Verify Pod Security Admission labels on the namespace
kubectl get namespace payments -o yaml | grep pod-security
# Check if any pods are running as root (security concern)
kubectl get pods -n payments -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].securityContext}{"\n"}{end}'
# List NetworkPolicies to verify segmentation exists
kubectl get networkpolicy -n payments
# Verify the ServiceAccount does NOT automount API tokens unnecessarily
kubectl get sa payments-api -n payments -o yaml | grep automount
# Check if any pod can reach a service it should not
kubectl exec -n checkout checkout-worker-5d7f -- curl -s --connect-timeout 3 http://payments-db.payments.svc:5432
# Create a restricted Role for the payments-api ServiceAccount
# ---
# apiVersion: rbac.authorization.k8s.io/v1
# kind: Role
# metadata:
#   name: payments-api-role
#   namespace: payments
# rules:
#   - apiGroups: [""]
#     resources: ["configmaps", "secrets"]
#     verbs: ["get"]
#     resourceNames: ["payments-api-config", "payments-api-secrets"]
◈ Architecture Diagram
┌─────────────────────────────────────────────────┐
│             Defense in Depth Layers             │
├─────────────────────────────────────────────────┤
│ Layer 1: RBAC                                   │
│ ┌──────────┐    ┌──────────┐                    │
│ │   User   │───▶│ K8s API  │  who can do what?  │
│ │   / SA   │    │          │                    │
│ └──────────┘    └──────────┘                    │
├─────────────────────────────────────────────────┤
│ Layer 2: Pod Security Standards                 │
│ ┌──────────┐                                    │
│ │   Pod    │  no root, no hostPath,             │
│ │ Runtime  │  no privileged, drop caps          │
│ └──────────┘                                    │
├─────────────────────────────────────────────────┤
│ Layer 3: NetworkPolicy                          │
│ ┌──────┐  allowed  ┌──────┐  blocked  ┌──────┐  │
│ │Pod A │──────────▶│Pod B │     X     │Pod C │  │
│ └──────┘           └──────┘           └──────┘  │
└─────────────────────────────────────────────────┘
Quick Answer
Most production Kubernetes environments use a CI/CD pipeline (Jenkins, GitHub Actions, ArgoCD) that builds container images, pushes them to a registry, and either applies manifests directly or uses GitOps to reconcile desired state. The choice between push-based (kubectl apply in pipeline) and pull-based (ArgoCD watching git) defines the deployment model.
Detailed Answer
Think of a restaurant kitchen with a ticket system. In a push model, the waiter walks the order directly to the chef. In a pull model, the chef watches a ticket board and picks up new orders as they appear. Both get the food made, but the pull model means the chef always knows the current state of all orders without anyone having to push each one individually. In Kubernetes, deployments typically follow one of two patterns. Push-based CI/CD tools like Jenkins or GitHub Actions run kubectl apply or helm upgrade as a pipeline step, directly sending manifests to the cluster API server. Pull-based GitOps tools like ArgoCD or Flux watch a Git repository and automatically reconcile the cluster state with what is declared in the repo. Many teams use a hybrid: CI builds and pushes images, then updates a Git repo, which triggers ArgoCD to deploy. Internally, a typical pipeline has stages: code checkout, unit tests, Docker build with a tagged image (using git SHA or semantic version), image push to ECR or another registry, manifest update (either in-pipeline or via a Git commit to an infra repo), and deployment to the target cluster. For Kubernetes specifically, the Deployment controller handles rolling updates by creating a new ReplicaSet, scaling it up, and scaling the old one down. The pipeline may also run integration tests, smoke tests, or canary analysis after deployment. At production scale, teams separate their CI (build and test) from CD (deploy). The CI pipeline produces a versioned artifact. The CD pipeline or GitOps controller handles promotion across environments: dev, staging, UAT, production. Environment-specific values are managed through Helm values files, Kustomize overlays, or ArgoCD ApplicationSets. Teams monitor deployment success through rollout status checks, readiness probe results, and automated rollback on failure. The non-obvious gotcha is that push-based deployments can drift from the declared state if someone makes manual kubectl changes. 
GitOps tools detect and correct this drift automatically, but they add complexity around secret management and multi-cluster configuration. Teams that start with push-based pipelines often migrate to GitOps as the number of clusters and services grows beyond what manual pipeline management can handle reliably.
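The pull-based model described above can be illustrated with an Argo CD Application manifest. This is a sketch under assumptions: the repository URL is hypothetical, Argo CD is assumed to be installed in the `argocd` namespace, and the path mirrors the overlay directory used in the code example for this question.

```yaml
# Argo CD Application: watches a Git path and reconciles the cluster to match it.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd # Assumed Argo CD installation namespace
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/infra-manifests.git # Hypothetical repo URL
    targetRevision: main
    path: payments-api/overlays/production # Kustomize overlay for production
  destination:
    server: https://kubernetes.default.svc # The cluster Argo CD itself runs in
    namespace: payments
  syncPolicy:
    automated:
      prune: true    # Delete resources that were removed from Git
      selfHeal: true # Revert manual kubectl changes (drift correction)
```

`selfHeal: true` is the setting that provides the drift correction mentioned above; without it, Argo CD reports the drift as OutOfSync but leaves the manual change in place.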
Code Example
# GitHub Actions workflow deploying to EKS via ArgoCD GitOps pattern
# Step 1: Build and push image in CI pipeline
docker build -t registry.company.com/payments-api:${GITHUB_SHA::8} . # Build image tagged with short git SHA
docker push registry.company.com/payments-api:${GITHUB_SHA::8} # Push to private ECR registry
# Step 2: Update the image tag in the GitOps repo
cd infra-manifests/payments-api/overlays/production # Navigate to production overlay
kustomize edit set image payments-api=registry.company.com/payments-api:${GITHUB_SHA::8} # Update image reference
git commit -am "deploy payments-api ${GITHUB_SHA::8}" # Commit the change
git push origin main # Push triggers ArgoCD sync
# Step 3: Monitor the rollout from ArgoCD or kubectl
kubectl rollout status deployment/payments-api -n payments --timeout=300s # Wait up to 5 minutes for rollout
kubectl get pods -n payments -l app=payments-api -o wide # Verify new pods are running on expected nodes
◈ Architecture Diagram
┌──────────┐     ┌──────────┐     ┌──────────┐
│ CI Build │────→│ Registry │────→│ Git Repo │
│ + Test   │     │ (ECR)    │     │ manifests│
└──────────┘     └──────────┘     └────┬─────┘
                                       ↓
                                  ┌──────────┐
                                  │ ArgoCD   │
                                  │ (sync)   │
                                  └────┬─────┘
                                       ↓
                                  ┌──────────┐
                                  │ K8s      │
                                  │ Cluster  │
                                  └──────────┘
Quick Answer
Common CI/CD deployment issues include image pull failures (wrong tag or registry auth), resource quota exhaustion, failing readiness probes blocking rollout, ConfigMap or Secret mismatches between environments, and RBAC permission errors. Systematic resolution involves checking Events, pod logs, describe output, and rollout history.
Detailed Answer
Think of moving into a new apartment. Common problems are the moving truck arriving at the wrong address (image pull errors), the apartment not having enough power outlets (resource limits), the door lock not matching your key (RBAC errors), and the furniture not fitting through the doorway (resource quota). Each problem looks different but follows a predictable troubleshooting pattern. In Kubernetes CI/CD, deployment failures cluster around a few categories. Image-related failures happen when the image tag does not exist in the registry, registry credentials are expired, or the image was pushed to a different repository than the manifest references. Resource failures occur when the namespace has a ResourceQuota and the new deployment exceeds CPU or memory limits. Configuration failures happen when a ConfigMap or Secret referenced by the pod does not exist in the target namespace or has different keys than the application expects. The troubleshooting sequence is predictable. First, check the Deployment rollout status with kubectl rollout status. If the rollout is stuck, describe the Deployment to see the events. Then check the ReplicaSet events to see why new pods are not being created. If pods exist but are not ready, check pod events with kubectl describe pod, then check container logs with kubectl logs. For image pull errors, the events will show ErrImagePull or ImagePullBackOff with a specific error message. For resource issues, the events will show FailedScheduling or quota exceeded messages. At production scale, the most impactful issues are deployments that pass in lower environments but fail in production. This usually happens because of environment-specific differences: different resource quotas, different network policies blocking connectivity, different secrets or certificates, or different node configurations. 
Teams prevent this by making environments as similar as possible, using the same Helm chart with different values, and running integration tests in a staging environment that mirrors production networking and security policies. The non-obvious gotcha is that a deployment can appear successful — kubectl rollout status reports completion — but the application is still broken. This happens when readiness probes are too lenient (checking only that the HTTP port is open, not that the application can actually serve requests). Teams should use deep health checks that verify database connectivity, downstream service availability, and application-specific readiness before marking a pod as ready.
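A deeper readiness check of the kind described above might look like the following container fragment. It is a sketch only: the `/ready` endpoint is a hypothetical application route that is assumed to verify its own dependencies (for example, the database connection) before returning 200, and the port and timing values are illustrative.

```yaml
# Fragment of spec.template.spec.containers[] in a Deployment
- name: payments-api
  image: registry.company.com/payments-api:abc12345 # Placeholder tag
  ports:
    - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /ready # Hypothetical endpoint that checks dependencies, not just the open port
      port: 8080
    periodSeconds: 10   # Probe every 10 seconds
    timeoutSeconds: 3   # Treat slow responses as failures
    failureThreshold: 3 # Remove the pod from Service endpoints after 3 consecutive failures
```

One design consideration: if every replica's readiness depends on the same downstream service, a brief outage there marks all pods unready at once, so the set of dependencies checked by the endpoint is worth keeping small.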
Code Example
# Check deployment rollout status for stuck deployments
kubectl rollout status deployment/payments-api -n payments --timeout=120s # Times out if rollout is stuck
# Describe deployment to see events and conditions
kubectl describe deployment payments-api -n payments | tail -20 # Shows recent events and replica status
# Check why new pods are not scheduling
kubectl get events -n payments --sort-by='.lastTimestamp' | grep -i 'fail\|error\|back' # Filter for failure events
# Check image pull errors on a specific pod
kubectl describe pod payments-api-7d9f8b6c4-x2k9m -n payments | grep -A5 'Events' # Shows ImagePullBackOff details
# Check resource quota usage in the namespace
kubectl describe quota -n payments # Shows used vs hard limits for CPU, memory, pods
# View rollout history to compare with previous working version
kubectl rollout history deployment/payments-api -n payments # Lists revision history
# Rollback to the last known good revision
kubectl rollout undo deployment/payments-api -n payments --to-revision=3 # Reverts to specific revision
◈ Architecture Diagram
┌──────────┐
│ Deploy │
└────┬─────┘
↓
┌──────────┐ ✗ ImagePull
│ Pod Start│──→ ✗ Quota
└────┬─────┘ ✗ ConfigMap
↓
┌──────────┐ ✗ Readiness
│ Probes │──→ ✗ Crash
└────┬─────┘
↓
┌──────────┐
│ Running │ ✓
└──────────┘
Quick Answer
Pods are evicted when a node is under resource pressure — disk (DiskPressure), memory (MemoryPressure), or PID exhaustion (PIDPressure). The kubelet evicts pods based on QoS class priority: BestEffort first, then Burstable, then Guaranteed last. Diagnosis starts with kubectl describe node to check conditions and kubectl get events to find eviction reasons.
Detailed Answer
Think of an overcrowded bus. When the bus exceeds its weight limit, the driver must ask some passengers to leave. Passengers without tickets (BestEffort pods) are asked first, then those with partial tickets (Burstable pods), and finally full-fare passengers (Guaranteed pods) are the last to go. The bus driver does not choose randomly: there is a clear priority order based on who has the strongest claim to stay. In Kubernetes, pod eviction is the kubelet's mechanism for protecting node stability. When a node runs low on a critical resource (memory, ephemeral storage, or process IDs), the kubelet begins evicting pods to reclaim that resource. This is different from preemption (which is the scheduler removing lower-priority pods to make room for higher-priority ones) and different from API-initiated eviction (which is used during node drain for maintenance). Internally, the kubelet monitors resource usage against configurable eviction thresholds. On Linux, the default hard eviction thresholds include memory.available < 100Mi and nodefs.available < 10%; soft thresholds have no defaults and must be configured explicitly. When a threshold is breached, the kubelet sets the corresponding node condition (MemoryPressure, DiskPressure, PIDPressure) and begins ranking pods for eviction. It ranks pods first by whether their usage exceeds their requests, then by pod priority, then by how far usage exceeds requests. In practice this produces the QoS ordering: BestEffort pods (no resource requests or limits) are evicted first, Burstable pods (requests set but not equal to limits for every container) are evicted next based on how much they exceed their requests, and Guaranteed pods (requests equal limits for all containers) are evicted last. At production scale, the most common eviction cause is ephemeral storage exhaustion from container logs, emptyDir volumes, or container writable layers growing unbounded. Memory-based evictions happen when applications have memory leaks or when resource limits are set too low for actual workload requirements. 
Teams should monitor node conditions, set appropriate resource requests and limits to ensure critical pods get Guaranteed QoS, and configure log rotation to prevent disk pressure. PodDisruptionBudgets do not help here: the kubelet ignores them during node-pressure eviction, and they only constrain API-initiated evictions such as node drains. The non-obvious gotcha is that eviction thresholds have both soft and hard variants. Soft evictions give pods a grace period to terminate cleanly, while hard evictions kill pods immediately. If a hard eviction threshold is hit (e.g., memory.available < 100Mi), the kubelet kills pods without waiting for graceful shutdown, which can cause data loss or incomplete request processing. Architects should configure soft thresholds with enough buffer that the hard thresholds are rarely reached.
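The soft and hard thresholds discussed above are set in the kubelet configuration. The fragment below is a sketch with illustrative values, not recommended settings; how the file reaches the node (launch template user data, node bootstrap arguments, and so on) depends on the node setup and is not shown.

```yaml
# KubeletConfiguration fragment: soft thresholds sit above the hard ones,
# so pods get a graceful shutdown window before the hard limit is reached.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi" # Below this, pods are killed immediately
  nodefs.available: "10%"
evictionSoft:
  memory.available: "500Mi" # Illustrative buffer above the hard threshold
  nodefs.available: "15%"
evictionSoftGracePeriod:
  memory.available: "1m30s" # Threshold must be breached this long before eviction starts
  nodefs.available: "2m"
evictionMaxPodGracePeriod: 60 # Max seconds a pod gets to terminate during soft eviction
```

Every signal listed under `evictionSoft` needs a matching entry under `evictionSoftGracePeriod`, which is why the two maps carry the same keys here.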
Code Example
# Check node conditions for resource pressure
kubectl describe node ip-10-0-1-42.ec2.internal | grep -A5 'Conditions' # Shows MemoryPressure, DiskPressure status
# Find eviction events in the namespace
kubectl get events -n payments --field-selector reason=Evicted --sort-by='.lastTimestamp' # Lists evicted pods with reasons
# Check which pod was evicted and why
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.reason}' # Shows 'Evicted'
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.message}' # Shows the resource that triggered eviction
# Check node resource usage
kubectl top node ip-10-0-1-42.ec2.internal # Shows current CPU and memory usage
# Check disk usage on the node (requires node access)
kubectl debug node/ip-10-0-1-42.ec2.internal -it --image=busybox -- df -h # Shows filesystem usage on the node
# Check QoS class of pods to understand eviction priority
kubectl get pods -n payments -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass' # Shows BestEffort, Burstable, or Guaranteed
# Set proper resource requests equal to limits for Guaranteed QoS
# resources:
#   requests:
#     cpu: 250m # Request equals limit for Guaranteed QoS
#     memory: 512Mi # Request equals limit for Guaranteed QoS
#   limits:
#     cpu: 250m # Matches request
#     memory: 512Mi # Matches request
◈ Architecture Diagram
┌──────────────────────────┐
│ Node Resource Pressure │
│ Memory < 100Mi │
└────────────┬─────────────┘
↓
┌──────────────────────────┐
│ Eviction Priority │
│ 1. BestEffort (first) │
│ 2. Burstable (next) │
│ 3. Guaranteed (last) │
└──────────────────────────┘
Quick Answer
CrashLoopBackOff means the container starts, crashes, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5 minutes). Common causes are application startup errors, missing environment variables or secrets, misconfigured commands or entrypoints, failed health probes, and OOMKilled. Diagnosis uses kubectl logs --previous, kubectl describe pod, and checking exit codes.
Detailed Answer
Think of a light switch connected to a circuit breaker. You flip the switch (container starts), the circuit overloads (container crashes), and the breaker trips (Kubernetes waits before retrying). Each time you try again, the breaker waits longer before allowing another attempt. CrashLoopBackOff is Kubernetes telling you that the container keeps failing and the wait time between restarts is increasing. In Kubernetes, CrashLoopBackOff is not a separate error state — it is the backoff delay that kubelet applies after repeated container crashes. The container exits with a non-zero code, kubelet restarts it after 10 seconds, it crashes again, kubelet waits 20 seconds, then 40, then 80, capping at 300 seconds (5 minutes). The pod status shows CrashLoopBackOff during these waiting periods and Error or Completed when the container actually exits. The most common root causes fall into categories. Application errors: the application throws an unhandled exception during startup because a required database is unreachable, a configuration file is malformed, or a required API key is missing. Configuration errors: the container command or args field is wrong (pointing to a script that does not exist in the image), the image tag points to a version with a different entrypoint, or a required environment variable is not set. Resource errors: the container is OOMKilled immediately on startup because the memory limit is too low for the JVM heap or the application's baseline memory footprint. Probe errors: an aggressive liveness probe kills the container before it finishes starting up, especially for Java applications with long startup times. At production scale, the diagnostic sequence is: first check exit code with kubectl describe pod (exit code 1 = application error, 137 = OOMKilled/SIGKILL, 143 = SIGTERM). Then check previous container logs with kubectl logs --previous since the current container may have already crashed. 
Check whether the container image recently changed with kubectl rollout history. Verify that ConfigMaps, Secrets, and PersistentVolumeClaims referenced by the pod actually exist in the namespace. The non-obvious gotcha is that CrashLoopBackOff can be caused by a liveness probe that is too aggressive during startup. If the liveness probe starts checking before the application is ready and the initialDelaySeconds is too short, the probe fails, kubelet kills the container, it restarts, and the cycle continues. The fix is to use a startup probe with a longer timeout to protect the liveness probe during application initialization, or to increase the liveness probe's initialDelaySeconds and failureThreshold.
Code Example
# Check pod status and restart count
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments # Shows status CrashLoopBackOff and restart count
# Get the exit code to categorize the failure
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -A10 'Last State' # Exit code 1=app error, 137=OOMKilled
# Check logs from the PREVIOUS crashed container (critical — current container may already be dead)
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=50 # Shows why the last container died
# Check if required ConfigMaps and Secrets exist
kubectl get configmap payments-config -n payments # Verify ConfigMap exists
kubectl get secret payments-db-credentials -n payments # Verify Secret exists
# Check if the container command is correct by inspecting the image
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.spec.containers[0].command}' # Shows configured command
# Check if OOMKilled is the cause
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason and exit code
# Fix startup probe to prevent liveness probe from killing slow-starting apps
# startupProbe:
#   httpGet:
#     path: /health # Startup health endpoint
#     port: 8080 # Application port
#   failureThreshold: 30 # Allow 30 x 10s = 5 minutes to start
#   periodSeconds: 10 # Check every 10 seconds during startup
◈ Architecture Diagram
┌──────────┐
│ Start │
└────┬─────┘
↓
┌──────────┐
│ Crash │←─── Exit 1: App Error
│ (exit≠0) │←─── Exit 137: OOMKill
└────┬─────┘←─── Exit 143: Probe
↓
┌──────────┐
│ Backoff │
│10→20→40s │
└────┬─────┘
↓
┌──────────┐
│ Restart │
└──────────┘
Quick Answer
EKS provides three node management options: Managed Node Groups (AWS manages EC2 lifecycle, patching, and scaling), self-managed nodes (you manage EC2 instances with custom AMIs), and EKS Auto Mode (AWS fully manages compute, networking, and storage add-ons). Managed Node Groups balance control with operational simplicity, while Auto Mode is the most hands-off approach.
Detailed Answer
Think of getting around on a road trip. Self-managed nodes are like buying a car: you choose the model and handle maintenance, insurance, and repairs. Managed Node Groups are like a long-term lease: the dealer handles servicing, but you choose the model and drive it yourself. EKS Auto Mode is like a ride-sharing service: you say where you want to go, and someone else handles the car, the driving, and the routing.

An EKS deployment typically starts with creating the control plane (API server, etcd, scheduler), which is fully managed by AWS. Then you choose how to run worker nodes. Managed Node Groups are the most common choice: you specify instance types, desired capacity, and AMI family, and AWS creates an Auto Scaling Group, handles AMI updates, and manages node lifecycle. Self-managed nodes give you full control: you create your own ASG with custom AMIs, custom bootstrap scripts, and custom instance configurations, but you manage patching and upgrades yourself.

EKS Auto Mode, introduced in late 2024, takes automation further. Instead of specifying instance types and node groups, you let AWS choose the compute for your workloads. Auto Mode manages the Kubernetes components typically installed as add-ons: kube-proxy, CoreDNS, VPC CNI, EBS CSI driver, and pod identity agent. It also handles GPU scheduling and Karpenter-style intelligent node provisioning. The tradeoff is less control over specific instance types and node configurations in exchange for significantly reduced operational burden.

At production scale, many enterprise teams use Managed Node Groups with specific instance families defined per workload type: compute-optimized (c6i) for API services, memory-optimized (r6i) for caching layers, and GPU instances (p4d, g5) for ML workloads. Teams deploy EKS using Terraform with the terraform-aws-modules/eks module, which handles the security groups, IAM roles, OIDC provider, and node group configurations; the VPC usually comes from a separate module such as terraform-aws-modules/vpc. Add-ons like VPC CNI, CoreDNS, and EBS CSI driver are managed as EKS add-ons with version pinning.

The non-obvious gotcha is that Managed Node Groups handle rolling updates differently than you might expect. When you update the AMI version or launch template, MNG creates new nodes and drains old ones, but it respects PodDisruptionBudgets. If a PDB blocks draining, the update makes no progress: it waits for the eviction timeout and then fails with a PodEvictionFailure error unless the update was started with the force option. Teams should monitor node group update status and set PDBs that leave room for at least one disruption. With Auto Mode, nodes are locked down: you cannot SSH into them or customize the AMI, which can be a blocker for teams with specific compliance or debugging requirements. DaemonSets still run on Auto Mode nodes, but host-level agents that expect direct node access may need rework.
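Because a PodDisruptionBudget that allows zero disruptions is what blocks a node drain, it can help to list such PDBs before starting a node group update. The following is a minimal sketch rather than an official tool: the function name blocking_pdbs is hypothetical, and it assumes the three-column input produced by the kubectl custom-columns call shown in the usage comment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "namespace name allowedDisruptions" rows on stdin
# and prints namespace/name for every PDB that currently allows 0 disruptions.
blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pdb -A --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | blocking_pdbs
#
# Then follow the node group update itself:
#   aws eks list-updates --name payments-cluster --nodegroup-name general-workers
#   aws eks describe-update --name payments-cluster --nodegroup-name general-workers --update-id <update-id>
```

Any PDB printed by the helper is a candidate for blocking the drain; scaling the workload up or relaxing the budget before the update avoids waiting for the eviction timeout.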
Code Example
# Deploy EKS with Managed Node Groups using eksctl
eksctl create cluster \
--name payments-cluster \
--region us-east-1 \
--version 1.31 \
--nodegroup-name general-workers \
--node-type m6i.xlarge \
--nodes 3 \
--nodes-min 2 \
--nodes-max 10 \
--managed # Creates AWS Managed Node Group
# Check node group status and AMI version
aws eks describe-nodegroup \
--cluster-name payments-cluster \
--nodegroup-name general-workers \
--query 'nodegroup.{Status:status,AMI:releaseVersion,InstanceTypes:instanceTypes}' # Shows current AMI and instance types
# Enable EKS Auto Mode on an existing cluster
aws eks update-cluster-config \
--name payments-cluster \
--compute-config enabled=true \
--kubernetes-network-config '{"elasticLoadBalancing":{"enabled":true}}' \
--storage-config '{"blockStorage":{"enabled":true}}' # Enables Auto Mode compute, networking, and storage
# List EKS add-ons and their versions
aws eks list-addons --cluster-name payments-cluster # Shows installed add-ons
aws eks describe-addon --cluster-name payments-cluster --addon-name vpc-cni --query 'addon.addonVersion' # Shows VPC CNI version
◈ Architecture Diagram
┌────────────────────────────────────┐
│         EKS Control Plane          │
│   (API Server, etcd, scheduler)    │
└──────┬──────────┬──────────┬───────┘
       ↓          ↓          ↓
┌──────────┐┌──────────┐┌──────────┐
│ Managed  ││Self-     ││Auto Mode │
│ Node Grp ││Managed   ││(fully    │
│(AWS ASG) ││(your ASG)││ managed) │
└──────────┘└──────────┘└──────────┘
Quick Answer
Deploy EKS using a layered module structure: a networking module for VPC/subnets, a cluster module for the EKS control plane with OIDC provider, and a node-groups module for managed node groups with launch templates. Each module has its own state file and communicates through remote state data sources or SSM parameters.
Detailed Answer
Deploying EKS with Terraform is like building a three-story building: the foundation is networking (VPC, subnets, NAT gateways), the structural frame is the EKS control plane (API server, etcd, OIDC provider), and the floors are the node groups (compute capacity where workloads actually run). Each layer depends on the one below it, and you should be able to rebuild any floor without demolishing the entire building.

The networking module provisions a dedicated VPC with public subnets for load balancers, private subnets for worker nodes, and optionally isolated subnets for databases. Load balancer discovery depends on subnet tags: kubernetes.io/role/elb = 1 on public subnets for internet-facing ALBs and kubernetes.io/role/internal-elb = 1 on private subnets for internal services. The kubernetes.io/cluster/<cluster-name> = shared tag was mandatory on all subnets for older cluster and controller versions and is still recommended when several clusters share a VPC. The module outputs subnet IDs and the VPC ID for consumption by the cluster module. NAT gateways should be deployed per-AZ in production for high availability, meaning three NAT gateways across us-east-1a, us-east-1b, and us-east-1c.

The cluster module creates the EKS control plane using the aws_eks_cluster resource or the terraform-aws-modules/eks/aws community module. Critical configurations include the Kubernetes version (pin to a specific minor version like 1.29), the cluster endpoint access (private-only or public-and-private with CIDR restrictions), envelope encryption for secrets using a dedicated KMS key, and the OIDC provider for IAM Roles for Service Accounts (IRSA). The OIDC provider is frequently missed but essential for IRSA: it enables pods to assume IAM roles without injecting AWS credentials. IRSA, or the newer EKS Pod Identity, is the secure way to grant AWS access to workloads; sharing the node instance role or mounting static keys is not.

The node groups module manages EKS managed node groups with launch templates. Production clusters typically need multiple node groups: a system node group (t3.xlarge, 3 nodes, tainted so that only system workloads such as CoreDNS schedule there), an application node group (m5.2xlarge, 3-15 nodes with cluster autoscaler), and optionally a GPU node group (g4dn.xlarge for ML inference). Each node group uses a custom launch template to specify block device mappings (100 GiB gp3 root volume), instance tags, and user data for kubelet configuration, while the node group's ami_type selects the Amazon EKS-optimized AMI. The node group's update configuration controls how many nodes are replaced at a time when the launch template version changes.

In production, state separation between these modules is critical. If your node group Terraform runs into an error, you do not want it to affect the EKS control plane state. Use separate state files: one for networking, one for the cluster, and one per node group pool. Pass data between modules using terraform_remote_state data sources or aws_ssm_parameter lookups. This blast radius isolation means a botched node group change cannot accidentally destroy the control plane.

A common gotcha is the chicken-and-egg problem with cluster access. The aws-auth ConfigMap (which controls IAM-to-Kubernetes RBAC mapping) requires a running cluster, but node groups need the aws-auth ConfigMap to join the cluster. The solution is to use the EKS access entries API instead of managing aws-auth directly (it requires the cluster authentication mode to include the API, and the minimum platform version depends on the Kubernetes version), or to use the kubernetes provider with the EKS cluster's endpoint and token to manage the ConfigMap in the same apply as the cluster creation.
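To make the access entries alternative concrete, here is a minimal sketch using the AWS CLI. It grants an IAM role read-only access to a single namespace without touching the aws-auth ConfigMap. The function name grant_namespace_view and the example role ARN are hypothetical; the sketch assumes a recent AWS CLI and a cluster whose authentication mode is API or API_AND_CONFIG_MAP.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an IAM role to namespace-scoped, read-only
# Kubernetes access through EKS access entries.
grant_namespace_view() {
  local cluster="$1" principal_arn="$2" namespace="$3"

  # Register the IAM principal with the cluster
  aws eks create-access-entry \
    --cluster-name "$cluster" \
    --principal-arn "$principal_arn" \
    --type STANDARD

  # Attach the AWS-managed view policy, scoped to one namespace
  aws eks associate-access-policy \
    --cluster-name "$cluster" \
    --principal-arn "$principal_arn" \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
    --access-scope "type=namespace,namespaces=$namespace"
}

# Usage (requires AWS credentials with eks:CreateAccessEntry and
# eks:AssociateAccessPolicy; the ARN below is a placeholder):
#   grant_namespace_view payments-cluster arn:aws:iam::111111111111:role/payments-dev-role payments
```

Because access entries are plain AWS API objects, they can be created before any node or pod exists, which is what removes the ordering problem described above.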
Code Example
# modules/eks-cluster/main.tf — EKS control plane module
# Create the EKS cluster with private endpoint and envelope encryption
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name following org naming convention
  name = "payments-eks-${var.environment}"
  # Pin to specific Kubernetes minor version
  version = "1.29"
  # IAM role for the EKS control plane service
  role_arn = aws_iam_role.eks_cluster_role.arn

  # VPC configuration for the control plane ENIs
  vpc_config {
    # Private subnets from the networking module
    subnet_ids = var.private_subnet_ids
    # Enable private API endpoint for in-VPC access
    endpoint_private_access = true
    # Keep the public endpoint, restricted to CI/CD runner CIDRs only
    endpoint_public_access = true
    # Public CIDR blocks allowed to reach the public endpoint.
    # Private (RFC 1918) ranges are not valid here because the public endpoint
    # only sees public source IPs; 203.0.113.0/24 is a documentation
    # placeholder to replace with the runners' egress addresses.
    public_access_cidrs = ["203.0.113.0/24"]
    # Security group for additional control plane access rules
    security_group_ids = [aws_security_group.eks_cluster_sg.id]
  }

  # Envelope encryption for Kubernetes secrets using KMS
  encryption_config {
    # Encrypt the secrets resource type stored in etcd
    resources = ["secrets"]
    provider {
      # Dedicated KMS key for EKS secrets encryption
      key_arn = aws_kms_key.eks_secrets_key.arn
    }
  }

  # Enable all control plane logging for audit and troubleshooting
  enabled_cluster_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

  # Tags for cost allocation and ownership tracking
  tags = {
    Team        = "payments-platform"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Fetch the OIDC issuer's TLS certificate chain (hashicorp/tls provider)
data "tls_certificate" "eks_oidc" {
  url = aws_eks_cluster.payments_cluster.identity[0].oidc[0].issuer
}

# OIDC provider for IAM Roles for Service Accounts (IRSA)
resource "aws_iam_openid_connect_provider" "eks_oidc" {
  # OIDC issuer URL from the EKS cluster
  url = aws_eks_cluster.payments_cluster.identity[0].oidc[0].issuer
  # Audience for the STS AssumeRoleWithWebIdentity call
  client_id_list = ["sts.amazonaws.com"]
  # TLS certificate thumbprint for the OIDC provider
  thumbprint_list = [data.tls_certificate.eks_oidc.certificates[0].sha1_fingerprint]
}
# modules/eks-nodegroups/main.tf — Managed node groups
resource "aws_eks_node_group" "application_nodes" {
  # Reference the payments EKS cluster by name
  cluster_name = var.cluster_name
  # Node group name identifying workload type
  node_group_name = "payments-app-nodes-${var.environment}"
  # IAM role for EC2 instances in this node group
  node_role_arn = aws_iam_role.node_group_role.arn
  # Deploy nodes into private subnets only
  subnet_ids = var.private_subnet_ids
  # Instance types optimized for payment processing workloads
  instance_types = ["m5.2xlarge"]
  # Use AL2023 EKS-optimized AMI
  ami_type = "AL2023_x86_64_STANDARD"

  # Autoscaling configuration for the node group
  scaling_config {
    # Minimum nodes for baseline capacity
    min_size = 3
    # Desired nodes for normal transaction volume
    desired_size = 6
    # Maximum nodes for peak shopping events
    max_size = 15
  }

  # Launch template for custom node configuration
  launch_template {
    # Reference the custom launch template
    id = aws_launch_template.app_nodes.id
    # Use the latest version of the launch template
    version = aws_launch_template.app_nodes.latest_version
  }

  # Rolling update strategy to avoid downtime
  update_config {
    # Update 1 node at a time for safe rolling deploys
    max_unavailable = 1
  }
}

# Launch template for application node group
resource "aws_launch_template" "app_nodes" {
  # Template name matching the node group convention
  name_prefix = "payments-app-nodes-${var.environment}"

  # 100Gi gp3 root volume for container images and logs
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      # 100GB root volume for container runtime storage
      volume_size = 100
      # gp3 for consistent baseline IOPS without cost of io2
      volume_type = "gp3"
      # Encrypt node volumes with the account default KMS key
      encrypted = true
    }
  }

  # Tag instances for cost tracking and identification
  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "payments-app-node-${var.environment}"
      NodeGroup   = "application"
      Environment = var.environment
    }
  }
}
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│                EKS Terraform Module Structure                 │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Layer 1: Networking Module (state: networking.tfstate)       │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  payments-vpc (10.0.0.0/16)                         │      │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐           │      │
│  │  │  Public  │  │  Public  │  │  Public  │           │      │
│  │  │  Subnet  │  │  Subnet  │  │  Subnet  │           │      │
│  │  │ 1a (ALB) │  │ 1b (ALB) │  │ 1c (ALB) │           │      │
│  │  └──────────┘  └──────────┘  └──────────┘           │      │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐           │      │
│  │  │ Private  │  │ Private  │  │ Private  │           │      │
│  │  │  Subnet  │  │  Subnet  │  │  Subnet  │           │      │
│  │  │1a (Nodes)│  │1b (Nodes)│  │1c (Nodes)│           │      │
│  │  └──────────┘  └──────────┘  └──────────┘           │      │
│  │  NAT GW x3 (one per AZ for HA)                      │      │
│  └─────────────────────────────────────────────────────┘      │
│          │ outputs: vpc_id, subnet_ids                        │
│          ↓                                                    │
│  Layer 2: Cluster Module (state: eks-cluster.tfstate)         │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  payments-eks-prod                                  │      │
│  │  ┌───────────────┐  ┌────────────┐  ┌─────────────┐ │      │
│  │  │ Control Plane │  │ OIDC       │  │ KMS Key     │ │      │
│  │  │ K8s 1.29      │  │ Provider   │  │ (secrets    │ │      │
│  │  │ API + etcd    │  │ (for IRSA) │  │ encryption) │ │      │
│  │  └───────────────┘  └────────────┘  └─────────────┘ │      │
│  └─────────────────────────────────────────────────────┘      │
│          │ outputs: cluster_name, oidc_arn, endpoint          │
│          ↓                                                    │
│  Layer 3: Node Groups (state: eks-nodegroups.tfstate)         │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  ┌────────────┐  ┌────────────┐  ┌──────────────┐   │      │
│  │  │ System     │  │ Application│  │ GPU Nodes    │   │      │
│  │  │ Nodes      │  │ Nodes      │  │ (optional)   │   │      │
│  │  │ t3.xlarge  │  │ m5.2xlarge │  │ g4dn.xlarge  │   │      │
│  │  │ 3 fixed    │  │ 3-15 auto  │  │ 0-4 auto     │   │      │
│  │  │ 100Gi gp3  │  │ 100Gi gp3  │  │ 100Gi gp3    │   │      │
│  │  └────────────┘  └────────────┘  └──────────────┘   │      │
│  └─────────────────────────────────────────────────────┘      │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Use a multi-account strategy with AWS Organizations where each environment (Dev, QA, UAT, Prod) gets its own AWS account under an organizational unit. Terraform manages this through assume-role provider configurations, per-account state files in a centralized S3 bucket with prefixed keys, and shared modules versioned in a private registry.
Detailed Answer
Structuring AWS accounts for multiple environments is like managing a hospital: you would never put the emergency room (production), the training lab (dev), the simulation center (QA), and the dress rehearsal ward (UAT) in the same building with shared keys and power circuits. AWS Organizations provides the building-per-department model, giving each environment complete blast radius isolation at the account boundary.

The multi-account structure typically follows an organizational unit (OU) hierarchy. The root OU contains the management account (billing and Organizations API only; never deploy workloads here). Below it, you create OUs for Security (GuardDuty delegated admin, SecurityHub, CloudTrail aggregation), SharedServices (CI/CD runners, ECR registry, Route53 hosted zones, Terraform state bucket), and Workloads. The Workloads OU contains sub-OUs for NonProd (Dev, QA, UAT accounts) and Prod (production account). Service Control Policies (SCPs) at each OU level enforce guardrails: NonProd accounts cannot provision p4d.24xlarge instances or create public-facing resources, while the Prod OU has SCPs preventing deletion of CloudTrail logs or disabling encryption.

In Terraform, each account is targeted through provider assume_role blocks. The CI/CD pipeline authenticates to the SharedServices account via OIDC, then assumes a TerraformExecutionRole in the target account. This role is provisioned by an account-baseline module that every new account receives. The role has permissions scoped to the services that environment needs: Dev gets broad permissions for experimentation, while Prod has tightly scoped permissions with explicit deny on destructive actions like deleting RDS clusters without snapshots.

State file organization follows the account boundary. A single S3 bucket in the SharedServices account stores all state files, with key prefixes per account: s3://org-terraform-state/111111111111/networking/terraform.tfstate for the Dev account networking stack, s3://org-terraform-state/222222222222/networking/terraform.tfstate for Prod. Each account's TerraformExecutionRole has an S3 policy that restricts access to only its own prefix, preventing a Dev pipeline misconfiguration from reading or writing Prod state.

The single-account approach, using naming conventions and tags to separate environments, is tempting for small teams but creates dangerous failure modes at scale. A single IAM policy mistake can grant Dev workloads access to Prod databases. Security groups in a shared VPC can be referenced across environments. Billing attribution becomes guesswork with cost allocation tags instead of per-account bills. Most critically, AWS service quotas are shared: a runaway Dev autoscaling group can exhaust EC2 limits and prevent Prod from scaling during a traffic spike.

The gotcha with multi-account is cross-account resource sharing. VPC peering or Transit Gateway connects environments for legitimate data flows (QA reading anonymized Prod data, shared ECR images). Terraform must manage both sides of a peering connection: the requester in one account and the accepter in another, each with their own provider alias. This requires careful orchestration (apply the requester first, then the accepter) or a two-phase apply with data sources that look up the peering connection ID.
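The per-account key layout described above can be enforced in the pipeline rather than typed by hand. The sketch below is illustrative only: state_key is a hypothetical helper name, and the bucket name is the example from the text. It builds the object key and feeds it to terraform init through partial backend configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the state object key for an account and stack,
# following the <account-id>/<stack>/terraform.tfstate layout.
state_key() {
  local account_id="$1" stack="$2"
  # AWS account IDs are exactly 12 digits; reject anything else early
  case "$account_id" in
    ""|*[!0-9]*) echo "invalid account id: $account_id" >&2; return 1 ;;
  esac
  if [ "${#account_id}" -ne 12 ]; then
    echo "invalid account id: $account_id" >&2
    return 1
  fi
  echo "${account_id}/${stack}/terraform.tfstate"
}

# Usage in a pipeline (partial backend configuration):
#   terraform init \
#     -backend-config="bucket=org-terraform-state" \
#     -backend-config="key=$(state_key 111111111111 networking)"
```

Deriving the key from the account ID means a pipeline that targets the wrong account also points at the wrong prefix, which the prefix-scoped S3 policy then rejects.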
Code Example
# organizations.tf — AWS Organizations account structure
# Create the organizational unit hierarchy
resource "aws_organizations_organizational_unit" "workloads" {
  # Top-level OU for all workload accounts
  name = "Workloads"
  # Attach to the root of the organization
  parent_id = aws_organizations_organization.org.roots[0].id
}

resource "aws_organizations_organizational_unit" "nonprod" {
  # Sub-OU for non-production environments
  name = "NonProd"
  # Parent is the Workloads OU
  parent_id = aws_organizations_organizational_unit.workloads.id
}

resource "aws_organizations_organizational_unit" "prod" {
  # Sub-OU for production environment with stricter SCPs
  name = "Prod"
  # Parent is the Workloads OU
  parent_id = aws_organizations_organizational_unit.workloads.id
}

# Account definitions for each environment
resource "aws_organizations_account" "environments" {
  # Create one account per environment using for_each.
  # The addresses are example.com placeholders; use a real, unique mailbox per account.
  for_each = {
    dev  = { email = "aws-dev@example.com", ou = aws_organizations_organizational_unit.nonprod.id }
    qa   = { email = "aws-qa@example.com", ou = aws_organizations_organizational_unit.nonprod.id }
    uat  = { email = "aws-uat@example.com", ou = aws_organizations_organizational_unit.nonprod.id }
    prod = { email = "aws-prod@example.com", ou = aws_organizations_organizational_unit.prod.id }
  }
  # Account name following organization convention
  name = "valuemomentum-${each.key}"
  # Unique root email per account (AWS requirement)
  email = each.value.email
  # Place account in the correct OU
  parent_id = each.value.ou
  # IAM role created in the new account for cross-account access
  role_name = "TerraformExecutionRole"
  # Prevent accidental account closure
  close_on_deletion = false
}
# Provider configuration for targeting a specific environment
locals {
  # Map environment names to account IDs.
  # A provider block can only use values known before apply, and it cannot
  # depend on aws_organizations_account resources that the same provider
  # creates (that forms a dependency cycle). The map therefore comes from a
  # hypothetical var.account_ids input, for example populated from the
  # outputs of the organizations stack.
  account_map = var.account_ids
}

# Provider block for the target environment account
provider "aws" {
  # Region standardized across all accounts
  region = "us-east-1"

  # Assume the execution role in the target environment account
  assume_role {
    # Construct role ARN from the environment variable
    role_arn = "arn:aws:iam::${local.account_map[var.environment]}:role/TerraformExecutionRole"
    # Session name for CloudTrail traceability
    session_name = "terraform-${var.environment}-pipeline"
  }

  # Default tags applied to every resource in this account
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "valuemomentum-platform"
    }
  }
}
# SCP preventing destructive actions in production
resource "aws_organizations_policy" "prod_guardrails" {
  # Policy name identifying its purpose
  name = "prod-environment-guardrails"
  # Description for audit and compliance documentation
  description = "Prevent destructive actions in production accounts"
  # SCP policy type for organizational guardrails
  type = "SERVICE_CONTROL_POLICY"
  # Policy document denying dangerous operations
  content = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "PreventRDSDeletionWithoutSnapshot"
        Effect   = "Deny"
        Action   = ["rds:DeleteDBCluster", "rds:DeleteDBInstance"]
        Resource = "*"
        # Verify rds:SkipFinalSnapshot against the RDS service authorization
        # reference before relying on it: a condition key that the service
        # does not supply never matches, so this statement would deny nothing.
        Condition = {
          Bool = { "rds:SkipFinalSnapshot" = "true" }
        }
      }
    ]
  })
}
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│           AWS Organizations Multi-Account Structure           │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│   ┌─────────────────────────────────────────┐                 │
│   │ Root OU                                 │                 │
│   │  ┌─────────────────────────────────┐    │                 │
│   │  │ Management Account (billing)    │    │                 │
│   │  └─────────────────────────────────┘    │                 │
│   └────────────────────┬────────────────────┘                 │
│        ┌───────────────┼─────────────────┐                    │
│        ↓               ↓                 ↓                    │
│  ┌───────────┐  ┌──────────────┐  ┌──────────────┐            │
│  │ Security  │  │ SharedSvcs   │  │ Workloads OU │            │
│  │ OU        │  │ OU           │  │              │            │
│  │           │  │              │  │ ┌──────────┐ │            │
│  │ GuardDuty │  │ CI/CD runners│  │ │ NonProd  │ │            │
│  │ SecHub    │  │ ECR registry │  │ │  Dev     │ │            │
│  │ CloudTrail│  │ TF State S3  │  │ │  QA      │ │            │
│  │           │  │ Route53      │  │ │  UAT     │ │            │
│  └───────────┘  └──────────────┘  │ └──────────┘ │            │
│                                   │ ┌──────────┐ │            │
│                                   │ │ Prod     │ │            │
│                                   │ │  Prod    │ │            │
│                                   │ └──────────┘ │            │
│                                   └──────────────┘            │
│                                                               │
│  State Isolation:                                             │
│  s3://org-terraform-state/                                    │
│  ├── 111111111/dev/networking/terraform.tfstate               │
│  ├── 222222222/qa/networking/terraform.tfstate                │
│  ├── 333333333/uat/networking/terraform.tfstate               │
│  └── 444444444/prod/networking/terraform.tfstate              │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Store state in an S3 bucket with versioning enabled, server-side encryption (SSE-KMS), and a DynamoDB table for state locking. Structure the bucket with environment-prefixed keys (prod/networking/terraform.tfstate) and restrict access using IAM policies scoped to each environment's prefix. Prevent corruption through locking, versioning for rollback, and CI/CD-only access patterns.
Detailed Answer
Terraform state file management is like managing a bank vault's ledger: the ledger (state file) records what is in every safe deposit box (cloud resource), and if the ledger is corrupted or leaked, you either lose track of assets or expose their locations to unauthorized parties. The storage, encryption, access control, and corruption prevention for state files must be treated with the same rigor as production database backups.

The S3 backend is the standard for AWS-centric teams. The bucket itself requires several hardening measures: versioning enabled so you can recover previous state versions if an apply corrupts the current state, server-side encryption with a dedicated KMS key (not the default aws/s3 key) so you can audit and rotate encryption independently, public access blocked via the S3 Block Public Access settings, and a bucket policy that explicitly denies unencrypted uploads. The bucket should be in a dedicated SharedServices or Management account, separate from any workload account, so that workload account compromises cannot directly access state.

The key structure within the bucket follows a hierarchy: {account-id-or-env}/{stack-name}/terraform.tfstate. For example: prod/networking/terraform.tfstate, prod/eks-cluster/terraform.tfstate, prod/payments-database/terraform.tfstate. This structure enables per-stack state isolation and per-environment IAM scoping. The Prod account's TerraformExecutionRole gets an S3 policy allowing s3:GetObject and s3:PutObject only on keys prefixed with prod/, while the Dev role can only access dev/. This prevents a Dev pipeline misconfiguration from overwriting Prod state.

DynamoDB state locking prevents concurrent modifications. Create a single DynamoDB table (PAY_PER_REQUEST billing) with a partition key named LockID of type String. Every Terraform operation acquires a lock before modifying state by writing a conditional item to this table. If two engineers run terraform apply simultaneously on the same stack, the second operation's conditional write fails (DynamoDB returns ConditionalCheckFailedException) and Terraform exits with a state lock error; with -lock-timeout set, it retries until the timeout expires instead. The lock record contains the operator's user and hostname, the operation type, and a timestamp, which helps diagnose stale locks from crashed CI pipelines. Recent Terraform releases (1.10 and later) can also lock through a lock file in the bucket itself via the backend's use_lockfile setting, which removes the need for the table.

Corruption prevention goes beyond locking. S3 versioning provides a recovery path: if an apply fails midway and leaves state inconsistent, you can restore a previous version using aws s3api list-object-versions and aws s3api get-object with the desired VersionId. The terraform.tfstate.backup file is written by the local backend, so it does not help with an S3 backend or on ephemeral CI/CD runners; bucket versioning is the backup. For critical production stacks, enable S3 Replication to copy state to a bucket in another region for disaster recovery.

The most dangerous corruption scenario is partial apply failure: Terraform creates some resources but crashes before writing updated state. The created resources become orphans: they exist in AWS but are not tracked by Terraform. Recovery requires manually importing the orphaned resources using terraform import or, in Terraform 1.5+, using import blocks. If Terraform is still running but cannot reach the backend, it saves the new state to errored.tfstate in the working directory, which can be uploaded with terraform state push once the backend is reachable. To reduce this risk, break large configurations into smaller stacks so each apply touches fewer resources, and use the -target flag only as a last resort since it creates partial state updates by design.
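The version-based recovery path can be sketched with two small wrappers around the AWS CLI. This is an illustrative sketch, not a complete runbook: the function names are hypothetical, and the bucket and key in the usage comment are the examples used in this answer.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for recovering an earlier state version from a
# versioned S3 bucket. Bucket and key are supplied by the caller.

# List the stored versions of one state object
list_state_versions() {
  local bucket="$1" key="$2"
  aws s3api list-object-versions \
    --bucket "$bucket" \
    --prefix "$key" \
    --query 'Versions[].{VersionId:VersionId,LastModified:LastModified,IsLatest:IsLatest}' \
    --output table
}

# Download one specific version to a local file for inspection
fetch_state_version() {
  local bucket="$1" key="$2" version_id="$3" outfile="$4"
  aws s3api get-object \
    --bucket "$bucket" \
    --key "$key" \
    --version-id "$version_id" \
    "$outfile"
}

# Usage:
#   list_state_versions valuemomentum-terraform-state-prod prod/networking/terraform.tfstate
#   fetch_state_version valuemomentum-terraform-state-prod prod/networking/terraform.tfstate <version-id> restored.tfstate
#   terraform show restored.tfstate   # inspect the candidate before restoring it
```

After inspection, the restored file can be uploaded with terraform state push from the stack directory. An older version carries a lower serial number, so Terraform refuses the push unless -force is passed; treat that as a deliberate, reviewed step.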
Code Example
# State backend configuration with full security hardening
terraform {
  # S3 backend for remote state storage
  backend "s3" {
    # Dedicated state bucket in the SharedServices account
    bucket = "valuemomentum-terraform-state-prod"
    # Environment-prefixed key for access control scoping
    key = "prod/payments-platform/networking/terraform.tfstate"
    # Primary region for state storage
    region = "us-east-1"
    # DynamoDB table for state locking
    dynamodb_table = "terraform-state-locks"
    # Enable SSE-KMS encryption with a dedicated key
    encrypt = true
    # KMS key ARN for state file encryption
    kms_key_id = "arn:aws:kms:us-east-1:555555555555:key/mrk-abc123"
    # Use the SharedServices account profile for state access
    profile = "valuemomentum-shared-services"
  }
}
# S3 bucket for Terraform state (provisioned once by bootstrap)
resource "aws_s3_bucket" "terraform_state" {
  # Bucket name following organization naming convention
  bucket = "valuemomentum-terraform-state-prod"
  # Prevent accidental deletion of the state bucket
  force_destroy = false
  tags = {
    Purpose   = "terraform-state-storage"
    ManagedBy = "bootstrap-terraform"
  }
}

# Enable versioning for state file recovery
resource "aws_s3_bucket_versioning" "state_versioning" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id
  # Enable versioning to recover from corruption
  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption with dedicated KMS key
resource "aws_s3_bucket_server_side_encryption_configuration" "state_encryption" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      # Use KMS encryption instead of AES-256
      sse_algorithm = "aws:kms"
      # Dedicated KMS key for independent rotation and audit
      kms_master_key_id = aws_kms_key.terraform_state_key.arn
    }
    # Use an S3 Bucket Key to cut the number of KMS requests (and their cost);
    # this setting does not by itself enforce encryption on uploads
    bucket_key_enabled = true
  }
}

# Block all public access to the state bucket
resource "aws_s3_bucket_public_access_block" "state_public_block" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id
  # Block all forms of public access
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
  # Table name matching backend configuration
  name = "terraform-state-locks"
  # Pay-per-request to avoid capacity planning
  billing_mode = "PAY_PER_REQUEST"
  # LockID is the required partition key for Terraform
  hash_key = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
  # Enable point-in-time recovery for lock table safety
  point_in_time_recovery {
    enabled = true
  }
}
# IAM policy scoping Prod role to only prod/ state prefix
# data "aws_iam_policy_document" "prod_state_access" {
#   statement {
#     effect    = "Allow"
#     actions   = ["s3:GetObject", "s3:PutObject"]
#     resources = ["arn:aws:s3:::valuemomentum-terraform-state-prod/prod/*"]
#   }
#   statement {
#     effect    = "Deny"
#     actions   = ["s3:*"]
#     resources = ["arn:aws:s3:::valuemomentum-terraform-state-prod/dev/*"]
#   }
# }
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│             Terraform State Security Architecture             │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  SharedServices Account                                       │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ S3: valuemomentum-terraform-state-prod              │      │
│  │  ┌──────────────────────────────────────────────┐   │      │
│  │  │ Versioning: Enabled                          │   │      │
│  │  │ Encryption: SSE-KMS (dedicated key)          │   │      │
│  │  │ Public Access: Blocked                       │   │      │
│  │  │ Replication: us-east-1 → us-west-2 (DR)      │   │      │
│  │  └──────────────────────────────────────────────┘   │      │
│  │                                                     │      │
│  │ Key Structure:                                      │      │
│  │ ├── dev/                                            │      │
│  │ │   ├── networking/terraform.tfstate                │      │
│  │ │   ├── eks-cluster/terraform.tfstate               │      │
│  │ │   └── payments-db/terraform.tfstate               │      │
│  │ ├── qa/                                             │      │
│  │ │   └── ...                                         │      │
│  │ ├── uat/                                            │      │
│  │ │   └── ...                                         │      │
│  │ └── prod/  ← Prod role can ONLY access this         │      │
│  │     ├── networking/terraform.tfstate                │      │
│  │     ├── eks-cluster/terraform.tfstate               │      │
│  │     └── payments-db/terraform.tfstate               │      │
│  └─────────────────────────────────────────────────────┘      │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ DynamoDB: terraform-state-locks                     │      │
│  │  ┌──────────────────────────────────────────────┐   │      │
│  │  │ LockID (PK)  │ Info │ Who │ Operation        │   │      │
│  │  │ prod/net/... │ ...  │ ci  │ apply            │   │      │
│  │  └──────────────────────────────────────────────┘   │      │
│  └─────────────────────────────────────────────────────┘      │
│                                                               │
│  Access Pattern:                                              │
│  CI/CD Runner → OIDC → AssumeRole → Scoped S3 Access          │
│  ┌──────────┐    ┌────────────┐    ┌──────────────┐           │
│  │ Pipeline │───→│ Prod Role  │───→│ prod/* only  │           │
│  │ (OIDC)   │    │ (IAM)      │    │ (S3 policy)  │           │
│  └──────────┘    └────────────┘    └──────────────┘           │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Prevent cross-environment contamination through four layers: separate state files per environment with IAM-scoped access, provider configurations locked to specific AWS accounts via assume_role, module versioning with pinned tags so an untested module change cannot propagate, and CI/CD pipeline guardrails that validate the target environment before apply.
Detailed Answer
Preventing cross-environment contamination in Terraform is like building firewalls between apartments in a building: you need physical separation (state isolation), locked doors (IAM boundaries), independent utilities (provider configurations), and a building code (CI/CD guardrails) that prevents shortcuts through shared walls.
The first layer is state file isolation. Each environment must have its own state file with its own backend configuration. Do not rely on workspaces alone to separate environments if the blast radius of corruption is unacceptable, because all workspaces in a configuration share one backend. The state file contains sensitive data including resource IDs, IP addresses, and sometimes plaintext outputs. An S3 bucket policy should restrict each environment's Terraform role to only its own key prefix: the prod role can access s3://state-bucket/prod/* but is explicitly denied s3://state-bucket/dev/*. This prevents a misconfigured prod pipeline from reading or overwriting dev state.
The second layer is provider-level isolation. Each environment's provider block must assume a role in its specific AWS account. Even if someone accidentally passes the wrong tfvars file, the provider configuration ensures Terraform operates in the correct account. Add a validation check using the aws_caller_identity data source: compare the actual account ID against the expected one and fail early if they do not match. This catches the scenario where an engineer runs terraform apply with prod credentials but dev configuration, or vice versa.
The third layer is module versioning. When environments share modules from a private registry or Git repository, use pinned version tags. Dev might use module version 2.3.0-rc1 while Prod uses 2.2.0 (the last stable release). Without version pinning, a module change pushed to the main branch immediately affects every environment that references source = "git::...?ref=main". This is the most common cause of accidental cross-environment impact: someone fixes a bug in a shared VPC module, the fix has a typo, and every environment that references the module head picks up the broken code on next apply.
The fourth layer is CI/CD pipeline guardrails. The pipeline should validate environment consistency before plan: check that the workspace name matches the tfvars file, verify the AWS account ID matches the target environment, and confirm the Git branch is allowed to deploy to that environment (only main can deploy to prod). Implement a pre-plan script that runs aws sts get-caller-identity and compares the account against an expected value from the pipeline configuration.
Remote state data sources are a particularly dangerous vector for cross-environment bleed. When a production EKS module reads the networking module's state via terraform_remote_state, it must reference the production networking state, not dev. Parameterize the remote state data source's backend configuration using the environment variable: data.terraform_remote_state.networking.config.key should resolve to prod/networking/terraform.tfstate, not a hardcoded path. A common gotcha is a hardcoded key that works in dev but still points to dev state after someone copies the configuration into prod without updating the key.
The ultimate safeguard is defense in depth: even if one layer fails, the others prevent damage. If the IAM policy has a bug that allows dev access to prod state, the provider's assume_role still locks operations to the dev account. If the provider configuration is wrong, the account ID validation check fails before any resources are touched.
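The key-prefix restriction described for the first layer could be sketched as an IAM policy attached to the dev Terraform role. This is a minimal illustration under stated assumptions, not a complete policy: the bucket name state-bucket is taken from the example paths above, while the policy name and statement IDs are hypothetical.

```hcl
# Sketch only: bucket name follows the s3://state-bucket/... example paths;
# the policy name and sids are hypothetical placeholders.
data "aws_iam_policy_document" "dev_state_access" {
  # Allow listing only the dev/ prefix of the shared state bucket
  statement {
    sid       = "ListDevPrefix"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::state-bucket"]
    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["dev/*"]
    }
  }

  # Allow reading and writing state objects under dev/ only
  statement {
    sid       = "ReadWriteDevState"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::state-bucket/dev/*"]
  }

  # Explicit deny wins over any allow granted elsewhere
  statement {
    sid       = "DenyProdState"
    effect    = "Deny"
    actions   = ["s3:*"]
    resources = ["arn:aws:s3:::state-bucket/prod/*"]
  }
}

resource "aws_iam_policy" "dev_state_access" {
  name   = "terraform-dev-state-access"
  policy = data.aws_iam_policy_document.dev_state_access.json
}
```

The explicit Deny statement is the part that survives a later, overly broad Allow: in IAM evaluation an explicit deny always takes precedence.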
Code Example
# Account identity validation: fail fast on the wrong account
# Fetch the actual AWS account identity
data "aws_caller_identity" "current" {}
# Validate the account ID matches the expected environment
locals {
  # Map of expected account IDs per environment
  expected_accounts = {
    dev  = "111111111111"
    qa   = "222222222222"
    uat  = "333333333333"
    prod = "444444444444"
  }
  # Check if current account matches the target environment
  account_validated = (
    data.aws_caller_identity.current.account_id ==
    local.expected_accounts[var.environment]
  )
}
# Validation resource that fails the plan if the accounts do not match
# (terraform_data with a precondition requires Terraform 1.4 or later)
resource "terraform_data" "account_validation" {
  lifecycle {
    precondition {
      condition     = local.account_validated
      error_message = "ERROR: Running in wrong AWS account"
    }
  }
}
# Remote state data source, parameterized per environment
data "terraform_remote_state" "networking" {
  # S3 backend for reading the networking layer state
  backend = "s3"
  config = {
    # Same state bucket as all other stacks
    bucket = "valuemomentum-terraform-state-prod"
    # Key parameterized by environment to prevent cross-env reads
    key = "${var.environment}/networking/terraform.tfstate"
    # Same region as the backend
    region = "us-east-1"
  }
}
# Use networking outputs safely scoped to the correct environment
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name scoped to the environment
  name     = "payments-eks-${var.environment}"
  version  = "1.29"
  role_arn = aws_iam_role.eks_cluster_role.arn
  vpc_config {
    # Subnet IDs from the SAME environment's networking state
    subnet_ids              = data.terraform_remote_state.networking.outputs.private_subnet_ids
    endpoint_private_access = true
    endpoint_public_access  = var.environment == "prod" ? false : true
  }
}
# Module versioning, pinned per environment
module "payments_vpc" {
  # Pinned Git tag prevents untested changes from propagating
  source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.2.0"
  # In dev, you might test a release candidate:
  # source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.3.0-rc1"
  vpc_name    = "payments-vpc-${var.environment}"
  vpc_cidr    = var.vpc_cidr
  environment = var.environment
}
# CI/CD pre-plan validation script (run before terraform plan)
# #!/bin/bash
# EXPECTED_ACCOUNT=$(jq -r ".${ENVIRONMENT}" accounts.json)
# ACTUAL_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
# if [ "$EXPECTED_ACCOUNT" != "$ACTUAL_ACCOUNT" ]; then
#   echo "FATAL: Expected account $EXPECTED_ACCOUNT but authenticated to $ACTUAL_ACCOUNT"
#   exit 1
# fi
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│              Cross-Environment Protection Layers              │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Layer 1: State Isolation (IAM-Scoped)                        │
│  ┌──────────────┐   DENY    ┌──────────────┐                  │
│  │ Dev Role     │─────X─────│ prod/*       │                  │
│  │ (IAM)        │           │ state keys   │                  │
│  └──────────────┘           └──────────────┘                  │
│  ┌──────────────┐   ALLOW   ┌──────────────┐                  │
│  │ Dev Role     │───────────│ dev/*        │                  │
│  │ (IAM)        │           │ state keys   │                  │
│  └──────────────┘           └──────────────┘                  │
│                                                               │
│  Layer 2: Provider Account Lock                               │
│  ┌──────────────────────────────────────────┐                 │
│  │ provider "aws" {                         │                 │
│  │   assume_role {                          │                 │
│  │     role_arn = ".../${var.env}/Role"     │                 │
│  │   }                                      │                 │
│  │ }                                        │                 │
│  │ → Operations locked to target account    │                 │
│  └──────────────────────────────────────────┘                 │
│                                                               │
│  Layer 3: Account ID Validation                               │
│  ┌──────────────────────────────────────────┐                 │
│  │ aws_caller_identity.account_id           │                 │
│  │ == expected_accounts[var.environment]    │                 │
│  │ → FAIL FAST if wrong account             │                 │
│  └──────────────────────────────────────────┘                 │
│                                                               │
│  Layer 4: Module Version Pinning                              │
│  ┌──────────────────────────────────────────┐                 │
│  │ Dev:  source = "...?ref=v2.3.0-rc1"      │                 │
│  │ Prod: source = "...?ref=v2.2.0"          │                 │
│  │ → Untested changes cannot reach prod     │                 │
│  └──────────────────────────────────────────┘                 │
│                                                               │
│  Layer 5: CI/CD Pipeline Guardrails                           │
│  ┌──────────────────────────────────────────┐                 │
│  │ Branch → Environment mapping             │                 │
│  │ main → prod (requires approval)          │                 │
│  │ develop → dev (auto-apply)               │                 │
│  │ Pre-plan: sts get-caller-identity check  │                 │
│  └──────────────────────────────────────────┘                 │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Implement a multi-stage pipeline: PR triggers terraform plan with output posted as a PR comment, OPA/Sentinel policy checks validate compliance, manual approval gates (GitHub Environments with required reviewers) protect production, and merge-to-main triggers terraform apply using the saved plan file. Use OIDC for keyless authentication and concurrency controls to prevent parallel applies on the same stack.
Detailed Answer
A Terraform CI/CD pipeline is like an air traffic control system: every infrastructure change (flight) must file a plan (flight plan), get reviewed by controllers (PR reviewers), receive clearance (approval gate), and land on the correct runway (target environment), all while preventing two planes from using the same runway simultaneously (state locking and concurrency control).
The pipeline begins with authentication. Modern pipelines use OIDC federation instead of stored AWS credentials. GitHub Actions requests a JWT from GitHub's OIDC provider, presents it to AWS STS via AssumeRoleWithWebIdentity, and receives short-lived credentials scoped to the Terraform execution role. The OIDC trust policy restricts which repositories, branches, and environments can assume the role: production apply roles should only be assumable by the main branch, while plan roles can be assumed by any branch. This eliminates long-lived access keys that could be exfiltrated from CI secrets.
The plan stage runs on every pull request. It executes terraform init, terraform validate, terraform fmt -check, and terraform plan -out=plan.tfplan. The plan output is captured and posted as a PR comment using a tool like tfcmt or a scripted step such as actions/github-script. Reviewers see exactly what resources will be created, modified, or destroyed, including sensitive changes like security group rule modifications or IAM policy updates. The saved plan file is uploaded as a CI artifact for use in the apply stage.
Policy-as-code gates run between plan and approval. Open Policy Agent (OPA) evaluates the plan JSON (terraform show -json plan.tfplan) against organizational policies: no S3 buckets without encryption, no security groups with 0.0.0.0/0 ingress on port 22, all RDS instances must have deletion protection in production. These checks are non-negotiable: a policy violation fails the pipeline regardless of who approves the PR. Sentinel serves the same purpose in Terraform Cloud/Enterprise environments.
The approval gate differs by environment. Dev and QA may auto-apply on merge, since the PR review itself is sufficient approval. UAT requires team lead approval via a GitHub Environment with one required reviewer. Production requires two approvals from the platform-admins team plus a 15-minute wait timer to prevent hasty approvals. Note that a GitHub Environment proceeds as soon as any one of its required reviewers approves, so the two-approval rule is enforced through branch protection on the PR, while the environment adds the final reviewer gate and the wait timer. The environment protection rules apply to any job that references the environment via the environment keyword.
The apply stage triggers after merge to main. Critically, it should use the saved plan file from the plan stage rather than re-running plan, so that what is applied is exactly what reviewers approved; infrastructure may have changed between plan review and apply execution, and a fresh plan could contain unreviewed changes. If the saved plan is stale (state serial mismatch), Terraform rejects it and the pipeline must re-plan. After successful apply, the pipeline posts results to a Slack channel (#infra-changes-prod) and creates a GitHub deployment record for audit trail.
Concurrency control prevents two merged PRs from applying simultaneously to the same stack. GitHub Actions concurrency groups scoped to the stack name (concurrency: group: terraform-payments-prod) ensure only one apply runs at a time. Queued runs wait for the current apply to complete. Combined with DynamoDB state locking, this provides two layers of concurrent modification prevention.
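The OIDC trust restriction described above could look roughly like the following trust policy for the apply role. Treat it as a sketch under assumptions: the repository name example-org/payments-infra is hypothetical, and the account ID, role name, and environment name are borrowed from the workflow example in this answer. When a job references a GitHub Environment, the token's sub claim takes the form repo:ORG/REPO:environment:NAME, which is what the condition matches here.

```hcl
# Sketch only: "example-org/payments-infra" is a hypothetical repository.
data "aws_iam_policy_document" "apply_trust" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRoleWithWebIdentity"]

    # Trust tokens issued by GitHub's OIDC provider registered in this account
    principals {
      type        = "Federated"
      identifiers = ["arn:aws:iam::444444444444:oidc-provider/token.actions.githubusercontent.com"]
    }

    # The token must have been requested for AWS STS
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Only jobs running in the protected environment may assume the role
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/payments-infra:environment:production-payments"]
    }
  }
}

resource "aws_iam_role" "terraform_apply" {
  name               = "GitHubActions-TerraformApply"
  assume_role_policy = data.aws_iam_policy_document.apply_trust.json
}
```

A plan role would use the same structure with a looser sub condition (for example a StringLike match on repo:example-org/payments-infra:*), since plans are read-only and run from any branch.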
Code Example
# .github/workflows/terraform-payments.yml
# Multi-stage Terraform pipeline with OIDC and approval gates
name: Payments Infrastructure Pipeline
# Trigger on PRs and pushes to main affecting payments infra
on:
  pull_request:
    paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']
  push:
    branches: [main]
    paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']
# OIDC permissions for keyless AWS authentication
permissions:
  id-token: write
  contents: read
  pull-requests: write
# Prevent concurrent applies on the same stack: PR runs are grouped per PR
# (superseded runs are cancelled); pushes to main share one group and run one at a time
concurrency:
  group: terraform-payments-prod-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
  # Stage 1: Validate and plan on every PR
  plan:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      # Checkout the infrastructure code
      - uses: actions/checkout@v4
      # OIDC authentication: plan role (read-only)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformPlan
          aws-region: us-east-1
      # Install pinned Terraform version
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
      # Initialize the backend and download providers
      - name: Init
        run: terraform -chdir=infrastructure/envs/prod init -input=false
      # Validate syntax and configuration
      - name: Validate
        run: terraform -chdir=infrastructure/envs/prod validate
      # Format check to enforce style standards
      - name: Format Check
        run: terraform fmt -check -recursive infrastructure/
      # Generate execution plan and save to file
      - name: Plan
        run: terraform -chdir=infrastructure/envs/prod plan -input=false -out=prod.tfplan
      # Export plan as JSON for OPA policy evaluation
      - name: Export Plan JSON
        run: terraform -chdir=infrastructure/envs/prod show -json prod.tfplan > plan.json
      # Run OPA policy checks against the plan
      # (assumes the opa binary is already installed on the runner)
      - name: OPA Policy Check
        run: |
          opa eval --data policies/ --input plan.json "data.terraform.deny[msg]" --fail-defined
      # Post plan output as a PR comment for reviewers
      - name: Comment Plan on PR
        uses: borchero/terraform-plan-comment@v2
        with:
          working-directory: infrastructure/envs/prod
      # Upload plan artifact for the apply stage
      - uses: actions/upload-artifact@v4
        with:
          name: prod-tfplan
          path: infrastructure/envs/prod/prod.tfplan
          retention-days: 5
  # Stage 2: Apply after merge with manual approval
  apply:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    # Production environment with required approvers and wait timer
    environment:
      name: production-payments
      url: https://console.aws.amazon.com/eks
    steps:
      - uses: actions/checkout@v4
      # OIDC authentication: apply role (read-write)
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformApply
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.4
      # Simplified: this re-plans at apply time instead of applying the reviewed plan.
      # To apply exactly what was reviewed, download the prod-tfplan artifact from the
      # PR's workflow run and pass it to terraform apply.
      - name: Init and Apply
        run: |
          terraform -chdir=infrastructure/envs/prod init -input=false
          terraform -chdir=infrastructure/envs/prod apply -input=false -auto-approve
      # Notify team of successful deployment
      # (assumes a repository secret named SLACK_WEBHOOK_URL)
      - name: Slack Notification
        if: success()
        uses: slackapi/slack-github-action@v1
        with:
          payload: '{"text": "Prod payments infra deployed by ${{ github.actor }}"}'
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│         Terraform CI/CD Pipeline with Approval Gates          │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  PR Opened                                                    │
│  ┌──────────────────────────────────────────────────┐         │
│  │ Stage 1: Plan (on every PR)                      │         │
│  │                                                  │         │
│  │ OIDC → AssumeRole (Plan Role, read-only)         │         │
│  │ ┌──────┐ ┌────────┐ ┌────┐ ┌──────┐              │         │
│  │ │ init │→│validate│→│fmt │→│ plan │              │         │
│  │ └──────┘ └────────┘ └────┘ └──┬───┘              │         │
│  │                               │                  │         │
│  │                        ┌──────┴──────┐           │         │
│  │                        │  plan.json  │           │         │
│  │                        └──────┬──────┘           │         │
│  │                               ↓                  │         │
│  │                      ┌────────────────────┐      │         │
│  │                      │ OPA Policy Check   │      │         │
│  │                      │ - no public S3     │      │         │
│  │                      │ - encryption on    │      │         │
│  │                      │ - tags required    │      │         │
│  │                      └────────┬───────────┘      │         │
│  │                               ↓                  │         │
│  │                      ┌────────────────────┐      │         │
│  │                      │ PR Comment with    │      │         │
│  │                      │ plan output        │      │         │
│  │                      └────────────────────┘      │         │
│  └──────────────────────────────────────────────────┘         │
│                                                               │
│  PR Approved + Merged to main                                 │
│  ┌──────────────────────────────────────────────────┐         │
│  │ Stage 2: Manual Approval Gate                    │         │
│  │ GitHub Environment: production-payments          │         │
│  │ Required reviewers: 2 from platform-admins       │         │
│  │ Wait timer: 15 minutes                           │         │
│  └──────────────────────────────────────────────────┘         │
│                           ↓                                   │
│  ┌──────────────────────────────────────────────────┐         │
│  │ Stage 3: Apply (after approval)                  │         │
│  │                                                  │         │
│  │ OIDC → AssumeRole (Apply Role, read-write)       │         │
│  │ ┌──────┐ ┌────────────────┐                      │         │
│  │ │ init │→│ apply          │                      │         │
│  │ └──────┘ │ -auto-approve  │                      │         │
│  │          └───────┬────────┘                      │         │
│  │                  ↓                               │         │
│  │          ┌────────────────┐                      │         │
│  │          │ Slack notify   │                      │         │
│  │          │ #infra-changes │                      │         │
│  │          └────────────────┘                      │         │
│  └──────────────────────────────────────────────────┘         │
│                                                               │
│  Concurrency: group=terraform-payments-prod (1 at a time)     │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Detect drift by running terraform plan -refresh-only to compare actual infrastructure against state without proposing changes. Remediate by either importing the manual change into Terraform (terraform import or import blocks), reverting the manual change by running terraform apply to converge back to the declared configuration, or updating the Terraform code to reflect the intentional change and then applying.
Detailed Answer
Terraform state drift is like someone rearranging furniture in a room that has a blueprint: the blueprint (state file) says the couch is by the window, but someone physically moved it to the center of the room (console change). Terraform detects this discrepancy during the refresh phase of plan and proposes moving the couch back to the window (converging to declared state). The question is whether the move was intentional or accidental, and that determines whether you update the blueprint or move the couch back.
Drift detection happens during the refresh phase of terraform plan. For every resource tracked in state, Terraform calls the cloud provider's API to read the current configuration. If the API response differs from what state records, Terraform updates its in-memory state and then diffs that against your HCL configuration. The -refresh-only flag runs only the refresh phase without proposing configuration-driven changes, making it a pure drift detection scan. The output shows which attributes have drifted and their before/after values.
There are three categories of drift, each requiring a different remediation strategy.
The first is accidental drift: an engineer manually opened port 443 on a security group to debug a connectivity issue and forgot to revert it. The fix is to run terraform apply, which converges the security group back to the declared configuration, removing the manually added rule. This is Terraform's self-healing property: the declared configuration is the source of truth.
The second is intentional drift: an operations engineer manually scaled up an RDS instance from db.r6g.xlarge to db.r6g.2xlarge during a traffic incident. The change was correct and should be preserved. The fix is to update the Terraform code to reflect the new instance class, then run terraform plan to verify the plan shows no changes (the code now matches reality). If you run apply without updating the code, Terraform would downgrade the instance back to the original size, potentially causing another outage.
The third is untracked resource creation: someone created a new S3 bucket via the console that Terraform knows nothing about. Since Terraform only tracks resources in its state, it cannot detect untracked resources. Tools like AWS Config, driftctl (now part of Snyk), or CloudQuery scan the entire account and compare against Terraform state to find resources that exist but are not managed. Once identified, you either import the resource into Terraform using import blocks (Terraform 1.5+) or the terraform import command, or you delete the resource if it should not exist.
Proactive drift prevention is better than reactive detection. Implement AWS Config rules that alert on configuration changes not made by the Terraform execution role. Set up CloudTrail-based alarms that trigger when console users modify resources tagged with ManagedBy=terraform. Use IAM policies that restrict console users to read-only access for Terraform-managed resource types. Schedule a daily terraform plan -refresh-only in CI that posts drift reports to a Slack channel; this catches drift within 24 hours instead of discovering it during the next deployment.
The lifecycle meta-argument ignore_changes is the escape hatch for expected drift. Auto Scaling groups change desired_capacity based on scaling policies, ECS services change desired_count, and some resources have attributes that are set once and then managed externally. Adding these attributes to ignore_changes tells Terraform to skip them during drift comparison, preventing false positives and accidental reverts of legitimate operational changes.
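The expected-drift case for ECS can be sketched in the same way as for an Auto Scaling group. The example below is illustrative only: the service name and both input variables are hypothetical, and in the AWS provider the ECS service's task count is exposed as the desired_count argument.

```hcl
# Sketch only: the variable names and service name are hypothetical.
variable "ecs_cluster_arn" {
  type        = string
  description = "ARN of the ECS cluster that hosts the service"
}

variable "task_definition_arn" {
  type        = string
  description = "ARN of the task definition revision to run"
}

resource "aws_ecs_service" "payments_worker" {
  name            = "payments-worker"
  cluster         = var.ecs_cluster_arn
  task_definition = var.task_definition_arn

  # Baseline task count; Application Auto Scaling adjusts it at runtime
  desired_count = 2

  lifecycle {
    # Do not revert the count chosen by the scaling policy on the next apply
    ignore_changes = [desired_count]
  }
}
```

The trade-off is that Terraform will no longer correct a wrong task count either, so ignore_changes is best limited to attributes that another controller demonstrably owns.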
Code Example
# Drift detection and remediation workflow
# Step 1: Run refresh-only plan to detect drift without proposing changes
# terraform plan -refresh-only -out=drift-check.tfplan
# This shows which resources have drifted from their recorded state
# Step 2: Review the drift report
# terraform show drift-check.tfplan
# Example output:
# ~ aws_security_group_rule.payments_api_ingress
#     from_port: 443 → 8080 (someone changed the port manually)
# Step 3a: Revert accidental drift (apply converges back to declared state)
# terraform apply
# This restores the security group rule to port 443 as declared in code
# Step 3b: Adopt intentional drift (update code to match reality)
resource "aws_rds_cluster" "payments_db" {
  # Cluster identifier for the payments transaction database
  cluster_identifier      = "payments-db-prod"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  deletion_protection     = true
  backup_retention_period = 30
}
# For Aurora, the instance class is set on the cluster instance
# (the resource name and identifier here are illustrative)
resource "aws_rds_cluster_instance" "payments_db_writer" {
  identifier         = "payments-db-prod-writer"
  cluster_identifier = aws_rds_cluster.payments_db.id
  engine             = aws_rds_cluster.payments_db.engine
  engine_version     = aws_rds_cluster.payments_db.engine_version
  # Updated instance class to match the manual scaling during incident
  # Previously: db.r6g.xlarge, changed during traffic spike on 2026-06-15
  instance_class = "db.r6g.2xlarge"
}
# Step 3c: Import untracked resources using import blocks (TF 1.5+)
import {
  # S3 bucket created manually via console during incident response
  to = aws_s3_bucket.payments_audit_logs
  # The actual bucket name to import from AWS
  id = "valuemomentum-payments-audit-logs-prod"
}
# Resource block to match the imported bucket's configuration
resource "aws_s3_bucket" "payments_audit_logs" {
  # Bucket name matching the manually created bucket
  bucket = "valuemomentum-payments-audit-logs-prod"
  tags = {
    Purpose     = "audit-log-storage"
    Environment = "prod"
    ManagedBy   = "terraform"
    ImportedOn  = "2026-06-20"
  }
}
# Lifecycle ignore_changes for expected drift patterns
resource "aws_autoscaling_group" "payments_api_fleet" {
  # ASG name following the naming convention
  name = "payments-api-fleet-prod-use1"
  # Baseline desired capacity; the autoscaler adjusts this
  desired_capacity = 6
  # Minimum instances for SLA compliance
  min_size = 3
  # Maximum instances during peak events
  max_size = 24
  launch_template {
    id      = aws_launch_template.payments_api.id
    version = "$Latest"
  }
  lifecycle {
    # Ignore desired_capacity: managed by cluster autoscaler
    # Ignore target_group_arns: managed by EKS ingress controller
    ignore_changes = [desired_capacity, target_group_arns]
  }
}
# Scheduled drift detection in CI (runs daily at 6 AM UTC)
# .github/workflows/drift-detection.yml
# name: Daily Drift Detection
# on:
#   schedule:
#     - cron: '0 6 * * *'
# jobs:
#   detect-drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       - run: terraform -chdir=infrastructure/envs/prod init
#       - run: terraform -chdir=infrastructure/envs/prod plan -refresh-only -detailed-exitcode
#       # Exit code 2 means drift detected (exit code 1 is an error); both fail the step
#       - if: failure()
#         run: |
#           curl -X POST $SLACK_WEBHOOK -d '{"text": "DRIFT DETECTED in prod payments infra"}'
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│         Terraform State Drift Detection & Remediation         │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌───────────────┐   Manual     ┌──────────────────┐          │
│  │ Terraform     │   Change     │ AWS Console      │          │
│  │ State File    │              │ or CLI           │          │
│  │               │              │                  │          │
│  │ sg port: 443  │              │ sg port: 8080    │          │
│  │ (declared)    │              │ (actual)         │          │
│  └───────┬───────┘              └──────────────────┘          │
│          │                                    │               │
│          └──────────────┬─────────────────────┘               │
│                         ↓                                     │
│              ┌──────────────────────┐                         │
│              │ terraform plan       │                         │
│              │ -refresh-only        │                         │
│              │                      │                         │
│              │ DRIFT DETECTED:      │                         │
│              │ sg port: 443 → 8080  │                         │
│              └──────────┬───────────┘                         │
│                         │                                     │
│          ┌──────────────┼──────────────┐                      │
│          ↓              ↓              ↓                      │
│  ┌──────────────┐┌─────────────┐┌──────────────────┐          │
│  │ Accidental   ││ Intentional ││ Untracked        │          │
│  │ Drift        ││ Drift       ││ Resource         │          │
│  │              ││             ││                  │          │
│  │ terraform    ││ Update HCL  ││ terraform import │          │
│  │ apply        ││ to match    ││ or delete the    │          │
│  │ (revert to   ││ reality,    ││ resource         │          │
│  │ declared)    ││ then apply  ││                  │          │
│  └──────────────┘└─────────────┘└──────────────────┘          │
│                                                               │
│  Proactive Detection:                                         │
│  ┌──────────────────────────────────────────────────┐         │
│  │ Daily CI Job (cron: 0 6 * * *)                   │         │
│  │ terraform plan -refresh-only -detailed-exitcode  │         │
│  │                                                  │         │
│  │ Exit 0 → No drift → all clear                    │         │
│  │ Exit 2 → Drift detected → Slack alert            │         │
│  └──────────────────────────────────────────────────┘         │
│                                                               │
│  Expected Drift (ignore_changes):                             │
│  ┌──────────────────────────────────────────────────┐         │
│  │ ASG desired_capacity → managed by autoscaler     │         │
│  │ ECS desired_count → managed by scaling policy    │         │
│  │ ignore_changes = [desired_capacity]              │         │
│  └──────────────────────────────────────────────────┘         │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Use a hub-and-spoke model with separate state files per account, shared modules for common patterns, and assume-role providers to manage cross-account resources. Structure repositories with account-level directories, environment-level workspaces, and a central module registry for VPC, IAM, and security baseline configurations.
Detailed Answer
Structuring Terraform for a multi-account AWS organization requires solving three interconnected problems: provider authentication across accounts, state isolation between accounts, and code reuse across similar environments. Think of it like managing a franchise: each store (account) runs the same playbook but has its own inventory (state), and headquarters (management account) needs oversight into all of them.
The foundation is AWS Organizations with a well-defined account structure. Typically you have a management account (for Organizations API and billing), a security account (for GuardDuty, SecurityHub, CloudTrail aggregation), a shared-services account (for CI/CD, artifact repositories, DNS), and then workload accounts per environment or per team. Each account gets its own Terraform state file to ensure blast radius isolation: a misconfigured apply in the staging account cannot corrupt production state.
For provider configuration, the recommended pattern is assume-role chaining. Your CI/CD pipeline authenticates to a central deployment account using OIDC (for GitHub Actions) or instance profiles (for Jenkins on EC2), then assumes environment-specific roles in target accounts. Each account has a TerraformExecutionRole with least-privilege permissions. The provider block uses assume_role with the target account's role ARN, and you parameterize the account ID using variables or a map lookup.
The repository structure typically follows one of two patterns. The first is a monorepo with directory-per-account: infrastructure/accounts/production/, infrastructure/accounts/staging/, each with their own backend configuration and tfvars. The second is a module-based approach where a single root configuration uses for_each over a map of accounts to deploy baseline resources; because Terraform provider configurations cannot themselves be generated with for_each, each account still needs its own statically declared provider alias. The monorepo approach is simpler to reason about but leads to code duplication. The module approach is DRY but increases blast radius.
Shared modules are the cornerstone of multi-account management. You create versioned modules for common patterns: a VPC module that enforces CIDR allocation from a central IPAM, a security-baseline module that deploys Config rules and GuardDuty, an IAM-baseline module that creates standard roles. These modules are published to a private Terraform registry or referenced via Git tags.
Production gotchas are numerous. Cross-account resource references require careful handling: you cannot reference a resource in another account's state without remote state data sources or SSM Parameter Store lookups. VPC peering across accounts needs accepter-side resources managed by the accepter account's Terraform. Service Control Policies in the management account can block Terraform operations in child accounts if not carefully scoped. And state backend permissions must be tightly controlled: each account's Terraform role should only access its own state prefix in S3.
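The SSM Parameter Store lookup mentioned among the gotchas can be sketched as a publish/consume pair. This is an assumption-laden illustration: the two halves belong to different stacks, the parameter path /platform/networking/vpc_id and the role name TerraformReadOnlyRole are hypothetical, and the account ID reuses the shared-services placeholder from the code example in this answer.

```hcl
# --- Stack A (shared-services account): publish a value for other stacks ---
# Sketch only: the parameter path and the input variable are hypothetical.
variable "shared_vpc_id" {
  type        = string
  description = "ID of the VPC to expose to other accounts' stacks"
}

resource "aws_ssm_parameter" "vpc_id" {
  name  = "/platform/networking/vpc_id"
  type  = "String"
  value = var.shared_vpc_id
}

# --- Stack B (workload account): read the value through an aliased provider ---
# The role name below is hypothetical; it needs ssm:GetParameter on the path.
provider "aws" {
  alias  = "shared_services"
  region = "us-east-1"
  assume_role {
    role_arn     = "arn:aws:iam::444444444444:role/TerraformReadOnlyRole"
    session_name = "terraform-ssm-lookup"
  }
}

data "aws_ssm_parameter" "vpc_id" {
  provider = aws.shared_services
  name     = "/platform/networking/vpc_id"
}

locals {
  # The provider marks this value as sensitive; it can still be used as an argument
  shared_vpc_id = data.aws_ssm_parameter.vpc_id.value
}
```

Compared with terraform_remote_state, the consumer only needs read access to one parameter rather than to the producer's entire state file.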
Code Example
# Provider configuration for multi-account AWS organization
# Map of account IDs for the fintech organization
locals {
  # Central registry of all AWS account IDs by environment name
  account_ids = {
    production  = "111111111111"
    staging     = "222222222222"
    security    = "333333333333"
    shared-svcs = "444444444444"
  }
  # Construct the IAM role ARN for cross-account access
  target_role_arn = "arn:aws:iam::${local.account_ids[var.environment]}:role/TerraformExecutionRole"
}
# Default provider assumes role into the target workload account
provider "aws" {
  # Region standardized across the organization
  region = "us-east-1"
  # Assume the execution role in the target account
  assume_role {
    # Role ARN constructed from the environment variable
    role_arn = local.target_role_arn
    # Session name for CloudTrail audit traceability
    session_name = "terraform-ci-${var.environment}"
  }
  # Default tags applied to every resource in this account
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      CostCenter  = "platform-engineering"
    }
  }
}
# Aliased provider for the shared-services account (DNS, ECR)
provider "aws" {
  # Alias used when referencing shared-services resources
  alias = "shared_services"
  # Same region as the primary provider
  region = "us-east-1"
  # Assume role into the shared-services account
  assume_role {
    # Shared services account role for DNS and registry management
    role_arn = "arn:aws:iam::${local.account_ids["shared-svcs"]}:role/TerraformDNSRole"
    # Distinct session name for audit separation
    session_name = "terraform-ci-shared-svcs"
  }
}
# VPC module instantiation using the organization's standard module
module "payments_vpc" {
  # Versioned module from the private Terraform registry
  source = "app.terraform.io/fintech-corp/vpc/aws"
  # Pin to an exact version for stability
  version = "3.2.1"
  # VPC name following the organization naming convention
  vpc_name = "payments-vpc-${var.environment}"
  # CIDR allocated from the central IPAM pool
  vpc_cidr = var.vpc_cidr_blocks[var.environment]
  # Enable DNS hostnames for private hosted zone lookups
  enable_dns_hostnames = true
  # Deploy NAT gateways in each AZ for high availability
  single_nat_gateway = false
}
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│              AWS Organization Account Structure               │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│    ┌─────────────────────┐                                    │
│    │ Management Account  │                                    │
│    │ (Organizations API) │                                    │
│    │ SCPs, Billing       │                                    │
│    └──────────┬──────────┘                                    │
│               │                                               │
│       ┌───────┴────────┬──────────────┬──────────────┐        │
│       ↓                ↓              ↓              ↓        │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐      │
│  │ Security  │ │ Shared    │ │ Production│ │ Staging   │      │
│  │ Account   │ │ Services  │ │ Account   │ │ Account   │      │
│  │           │ │ Account   │ │           │ │           │      │
│  │ GuardDuty │ │ ECR, DNS  │ │ payments  │ │ payments  │      │
│  │ SecHub    │ │ CI/CD     │ │ -vpc      │ │ -vpc      │      │
│  │ CloudTrail│ │ Artifacts │ │ user-auth │ │ user-auth │      │
│  └───────────┘ └─────┬─────┘ └───────────┘ └───────────┘      │
│                      │                                        │
│                 AssumeRole                                    │
│                      ↓                                        │
│              ┌──────────────┐                                 │
│              │ CI/CD Runner │                                 │
│              │ (OIDC Auth)  │                                 │
│              │              │→ assume TerraformExecutionRole  │
│              │              │→ per target account             │
│              └──────────────┘                                 │
│                                                               │
│  State Isolation:                                             │
│  ┌────────────────────────────────────────────────────┐       │
│  │ s3://fintech-terraform-state/production/tfstate    │       │
│  │ s3://fintech-terraform-state/staging/tfstate       │       │
│  │ s3://fintech-terraform-state/security/tfstate      │       │
│  └────────────────────────────────────────────────────┘       │
└───────────────────────────────────────────────────────────────┘
Quick Answer
At scale, Terraform state must be stored in remote backends like S3 with DynamoDB locking or Terraform Cloud, split into small blast-radius units by domain or environment, and isolated via workspaces or directory structure. State locking prevents concurrent applies from corrupting state, and state splitting ensures a single terraform apply cannot accidentally destroy unrelated infrastructure.
Detailed Answer
Think of a hospital records system. If every department writes to one giant patient file simultaneously, records get corrupted and the wrong medication gets administered. Splitting records by department, locking each file during edits, and storing everything in a central secure archive prevents these disasters. Terraform state management works the same way — it is the record of what infrastructure exists, and mismanaging it causes outages. Terraform state is a JSON file that maps every resource in your configuration to a real cloud object. When terraform plan runs, it reads the state to determine what exists, compares it to the desired configuration, and calculates the diff. If two engineers run terraform apply simultaneously against the same state, one overwrites the other's changes, causing state corruption where Terraform's view of the world no longer matches reality. Remote backends solve storage and collaboration: S3 stores the state file durably, DynamoDB provides a lock table so only one operation can modify state at a time, and versioning on the S3 bucket enables recovery from bad applies. Internally, when terraform apply starts, it sends a Lock request to the backend. For S3+DynamoDB, this writes a lock record to the DynamoDB table with a unique ID, the user's identity, and a timestamp. If another process already holds the lock, Terraform exits with an error. After the apply completes, Terraform writes the updated state to S3 and releases the lock. If a process crashes mid-apply, the lock remains until it expires or is manually force-unlocked with terraform force-unlock. Terraform Cloud handles locking internally and adds run queues so multiple plans can exist but only one apply executes at a time per workspace. At production scale, the critical architectural decision is state splitting. 
A monolithic state file containing the VPC, databases, Kubernetes clusters, DNS records, and application services means a single terraform apply can accidentally destroy the database while updating a DNS record. The recommended pattern is splitting state by blast radius: network foundations in one state, data layer in another, compute in another, and application configurations in their own states. Each state has its own backend configuration and shares values through terraform_remote_state data sources or through outputs stored in SSM Parameter Store. Workspaces can further separate environments (dev, staging, production) within the same configuration, but they should not be used as a substitute for proper state splitting, because all workspaces in a configuration share the same codebase, backend, and permissions. The non-obvious gotcha is that terraform_remote_state creates a hard coupling between states: if the upstream state is corrupted or its output names change, downstream plans break. It also requires read access to the whole upstream state, not just the outputs being consumed. Many mature teams replace terraform_remote_state with data sources that look up infrastructure by tags or names, or they store shared values in AWS SSM Parameter Store or HashiCorp Consul, which decouples the state files. Another trap is that S3 bucket versioning alone is not a complete safeguard. A plain delete only adds a delete marker and can be undone, but anyone permitted to delete specific object versions can still remove the state history permanently. Teams should also enable MFA Delete, or use S3 Object Lock in regulated environments.
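The SSM-based decoupling described above also needs a producer: the network layer has to publish the values that downstream states read. A minimal sketch of that side, assuming the network configuration defines resources named aws_vpc.main and aws_subnet.private (hypothetical names, not from the original) and publishing to the same parameter paths that the Code Example reads:

```hcl
# Network layer: publish shared values for other states to read.
# aws_vpc.main and aws_subnet.private are assumed to be declared
# elsewhere in this configuration (hypothetical resource names).
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infrastructure/network/vpc-id"
  type  = "String"
  value = aws_vpc.main.id
}

resource "aws_ssm_parameter" "private_subnet_ids" {
  name = "/infrastructure/network/private-subnet-ids"
  type = "StringList"
  # StringList values are comma-separated, which matches the
  # split(",", ...) call on the consumer side
  value = join(",", aws_subnet.private[*].id)
}
```

With this arrangement the consumer needs only read access to two parameters rather than to the network layer's entire state file.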
Code Example
# backend.tf - Remote backend configuration for the payments data layer
terraform {
  # Use S3 as the remote state storage backend
  backend "s3" {
    # S3 bucket dedicated to Terraform state files
    bucket = "company-terraform-state-prod"
    # State file path scoped to team and layer
    key = "payments/data-layer/terraform.tfstate"
    # AWS region for the state bucket
    region = "us-east-1"
    # DynamoDB table for state locking and consistency checking
    dynamodb_table = "terraform-state-locks"
    # Enable server-side encryption for state at rest
    encrypt = true
    # Use a specific KMS key for encryption
    kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/payments-tf-state-key"
  }
}
# Reference outputs from the network layer state without tight coupling
# Using SSM Parameter Store instead of terraform_remote_state
data "aws_ssm_parameter" "vpc_id" {
  # Parameter path set by the network layer's terraform apply
  name = "/infrastructure/network/vpc-id"
}
data "aws_ssm_parameter" "private_subnet_ids" {
  # Comma-separated subnet IDs stored by the network team
  name = "/infrastructure/network/private-subnet-ids"
}
# Use the decoupled values in resource configuration
resource "aws_db_subnet_group" "payments" {
  # Subnet group name for the payments database
  name = "payments-db-subnets"
  # Split the comma-separated parameter value into a list
  subnet_ids = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
  # Tag for operational identification
  tags = {
    Team  = "payments"
    Layer = "data"
  }
}
# DynamoDB lock table must exist before backend configuration
# This is typically created by a bootstrap state or manually
# aws dynamodb create-table \
# --table-name terraform-state-locks \
# --attribute-definitions AttributeName=LockID,AttributeType=S \
# --key-schema AttributeName=LockID,KeyType=HASH \
# --billing-mode PAY_PER_REQUEST
◈ Architecture Diagram
┌──────────┐
│ tf apply │
└────┬─────┘
│
┌────┴─────┐
│ Lock │
│ DynamoDB │
└────┬─────┘
│
┌────┴─────┐
│ Read │
│ S3 State │
└────┬─────┘
│
┌────┴─────┐
│ Apply │
│ Changes │
└────┬─────┘
│
┌────┴─────┐
│ Write │
│ S3 State │
└────┬─────┘
│
┌────┴─────┐
│ Unlock │
└──────────┘
Quick Answer
Provider aliases define multiple AWS provider configurations for different regions or accounts within a single Terraform configuration. Each module block receives the alias it should use through a static providers map, so the same module can be deployed across regions or accounts without duplicating its code. Architects use assume_role in provider blocks to cross account boundaries securely. for_each on a module can multiply instances within one provider configuration, but it cannot select a different provider per instance.
Detailed Answer
Think of a restaurant chain opening locations in different cities. Rather than writing a completely new business plan for each city, the headquarters uses one standard restaurant blueprint and customizes the local supplier list, health department contact, and rental agreement per location. Provider aliases in Terraform work the same way: one infrastructure blueprint is deployed across regions and accounts by swapping the provider configuration. In AWS, multi-account architecture is the recommended pattern for blast-radius isolation: production, staging, shared services, security, and logging each live in separate AWS accounts under an AWS Organization. Multi-region deployment adds resilience and latency optimization by placing infrastructure closer to users. Without careful Terraform design, this creates an explosion of near-identical configuration files, one per account-region combination, that diverge over time and become unmaintainable. Internally, Terraform provider aliases allow declaring multiple instances of the same provider with different configurations. A provider block with alias = "us_west" and region = "us-west-2" coexists with the default provider using region = "us-east-1". Resources reference a specific provider with the provider argument, and module blocks do so with the providers argument. For cross-account access, each aliased provider uses assume_role to temporarily adopt an IAM role in the target account. The for_each meta-argument on a module multiplies instances of that module, but the providers map on a module block is static: every instance created by the block receives the same provider configuration. Deploying one module to several regions or accounts therefore takes one module block per provider alias, and the alias passed to each block determines where its infrastructure is created. At production scale, the pattern involves a locals block that holds per-region inputs such as CIDR ranges, one module block per region or account-region pair, and a static providers assignment that passes the correct aliased provider to each block.
State should be split so that each account-region combination has its own state file; otherwise a single state file becomes a massive blast radius. The deployment pipeline should use separate workspaces or directories per account-region pair, with the provider configuration driven by workspace-specific variables. Teams should also use terraform_remote_state or SSM parameters to share outputs like VPC IDs across account-region boundaries. The non-obvious gotcha is that Terraform does not support for_each on provider blocks themselves: you cannot dynamically generate provider aliases from a map, and a module block cannot choose a different provider for each for_each instance. Each provider alias must be declared statically in the configuration. Adding a new region therefore means adding a new provider alias block and a module block that uses it, a manual step that plain Terraform cannot automate. Some teams work around this by using Terragrunt to generate provider blocks from a configuration file, or by using a code generator that produces the provider declarations. Another trap is that assume_role credentials can expire during long applies. A role's maximum session duration can be raised to 12 hours, but sessions obtained through role chaining are capped at 1 hour, so a large deployment that assumes one role from another can fail midway with an expired token error unless the chain is avoided or the apply is kept short.
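A minimal sketch of how for_each interacts with aliased providers, assuming a provider alias aws.us_west_2 like the one in the Code Example and a local module at ./modules/vpc that accepts vpc_cidr (the module path, the map keys, and the CIDR values are hypothetical). Because the providers argument is static, every instance created by this block lands in the same region and account:

```hcl
# Assumes: provider "aws" with alias = "us_west_2" is declared in the
# root module, and ./modules/vpc is a local module taking vpc_cidr.
locals {
  # Hypothetical VPCs that all belong in us-west-2
  west_vpcs = {
    payments = "10.2.0.0/16"
    fraud    = "10.12.0.0/16"
  }
}

module "vpc_us_west_2" {
  source   = "./modules/vpc"
  for_each = local.west_vpcs

  # Static assignment: shared by every instance of this module block
  providers = {
    aws = aws.us_west_2
  }

  vpc_cidr = each.value
}
```

A second region needs its own module block with its own providers map; for_each only varies the inputs within one provider configuration.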
Code Example
# providers.tf - Static provider aliases for each target region
provider "aws" {
  # Default provider for the primary region
  region = "us-east-1"
  # Assume a role in the production account
  assume_role {
    role_arn     = "arn:aws:iam::111111111111:role/TerraformDeployRole"
    session_name = "terraform-payments-prod"
  }
}
provider "aws" {
  # Aliased provider for the secondary region
  alias  = "us_west_2"
  region = "us-west-2"
  # Same account, different region
  assume_role {
    role_arn     = "arn:aws:iam::111111111111:role/TerraformDeployRole"
    session_name = "terraform-payments-prod-west"
  }
}
provider "aws" {
  # Aliased provider for the EU region in a separate account
  alias  = "eu_west_1"
  region = "eu-west-1"
  # Assume role in the EU production account
  assume_role {
    role_arn     = "arn:aws:iam::222222222222:role/TerraformDeployRole"
    session_name = "terraform-payments-eu-prod"
  }
}
# main.tf - Deploy the same networking module to each region
locals {
  # CIDR allocation per region. Provider references cannot be stored as
  # strings and looked up at runtime, so each module block below names
  # its provider alias statically.
  regions = {
    us_east_1 = { cidr = "10.1.0.0/16" }
    us_west_2 = { cidr = "10.2.0.0/16" }
    eu_west_1 = { cidr = "10.3.0.0/16" }
  }
}
# Deploy VPC module to US East (default provider)
module "vpc_us_east_1" {
  # Source from the internal registry with version pin
  source  = "app.terraform.io/company/vpc/aws"
  version = "~> 3.1"
  # Pass region-specific CIDR
  vpc_cidr    = "10.1.0.0/16"
  environment = "prod"
  region_name = "us-east-1"
}
# Deploy VPC module to US West with aliased provider
module "vpc_us_west_2" {
  source  = "app.terraform.io/company/vpc/aws"
  version = "~> 3.1"
  providers = {
    # Pass the US West provider alias to the module
    aws = aws.us_west_2
  }
  vpc_cidr    = "10.2.0.0/16"
  environment = "prod"
  region_name = "us-west-2"
}
# Deploy VPC module to EU West with cross-account provider
module "vpc_eu_west_1" {
  source  = "app.terraform.io/company/vpc/aws"
  version = "~> 3.1"
  providers = {
    # Pass the EU provider alias (different account + region)
    aws = aws.eu_west_1
  }
  vpc_cidr    = "10.3.0.0/16"
  environment = "prod"
  region_name = "eu-west-1"
}
◈ Architecture Diagram
┌──────────────────────┐
│ Root Config          │
│                      │
│ provider aws         │
│ provider aws.west    │
│ provider aws.eu      │
└──┬───────┬───────┬───┘
   │       │       │
   ↓       ↓       ↓
┌──────┐┌──────┐┌──────┐
│US-E-1││US-W-2││EU-W-1│
│Acct A││Acct A││Acct B│
│VPC   ││VPC   ││VPC   │
└──────┘└──────┘└──────┘
Quick Answer
Terraform Enterprise adds remote state management with RBAC, Sentinel policy-as-code for compliance enforcement, private module registries, team-based workspace access, and audit logging. Organizations need TFE when they require governance, collaboration at scale, and regulatory compliance that open-source cannot provide.
Detailed Answer
Think of open-source Terraform as a skilled carpenter with excellent tools who works alone from blueprints. Terraform Enterprise is a construction firm — it has the same carpenter, but adds project managers (RBAC), building inspectors (Sentinel policies), a parts catalog (private registry), apprentice supervision (workspace permissions), and a complete paper trail of every nail hammered (audit logs). A solo developer building a shed does not need a construction firm. A bank building a skyscraper absolutely does. Open-source Terraform provides the core functionality: HCL language for defining infrastructure, a provider ecosystem for interacting with cloud APIs, state management, plan and apply workflows, and module reuse. When a single engineer or small team manages infrastructure, open-source Terraform with a remote state backend (S3 + DynamoDB) works well. The limitations emerge at scale: who can run terraform apply against production? How do you enforce that all S3 buckets have encryption enabled? How do you share vetted modules across 20 teams without everyone copy-pasting and diverging? How do you prove to auditors that every infrastructure change was reviewed, approved, and logged? Open-source Terraform has no answers to these questions — it is a tool, not a platform. Terraform Enterprise addresses these gaps through several key features. Workspaces provide isolated environments with their own state, variables, and permissions — the payments-api infrastructure can be in one workspace with access limited to the payments team, while the settlements-db workspace is restricted to the database team. Sentinel policies run before every apply and enforce organizational rules as code — 'all RDS instances must have encryption at rest,' 'no IAM policies can use wildcard actions,' 'all resources must have cost-center and owner tags.' These policies are version-controlled, peer-reviewed, and provide automated compliance enforcement that auditors love. 
The private module registry lets the platform team publish vetted, hardened modules (like an approved EKS cluster configuration) that application teams consume — ensuring consistency without restricting autonomy. Remote execution in TFE solves the 'works on my machine' problem. Instead of engineers running terraform apply from their laptops (with their personal AWS credentials and whatever Terraform version they have installed), all plans and applies execute in TFE's managed runners with consistent Terraform versions, standardized provider credentials (injected via workspace variables), and no local state. This eliminates credential sprawl — engineers never need direct AWS access for infrastructure changes, because TFE holds the credentials and applies changes on their behalf. For banking, this is a massive security improvement: credentials are centralized, rotated, and never touch developer machines. In production at a bank, TFE becomes the control plane for all infrastructure changes. The typical workflow is: engineer creates a branch, makes infrastructure changes, opens a PR. TFE runs a speculative plan on the PR (visible as a GitHub check), showing exactly what will change. Team members review the plan diff alongside the code diff. After PR approval and merge, TFE runs the real plan and waits for workspace-level approval (configurable — some workspaces auto-apply, production workspaces require manual confirmation from an authorized approver). Sentinel policies gate the apply, and the entire execution is logged with timestamps, user identity, plan output, and apply results. These logs are retained for compliance and can be exported to SIEM systems for security monitoring. The gotcha is cost and operational overhead. TFE is expensive — it requires either a self-hosted installation (on-premise or in your cloud account) or a Terraform Cloud Business subscription. 
Self-hosted TFE needs its own infrastructure (compute, database, object storage), monitoring, backup, and upgrades. Many organizations start with Terraform Cloud (the SaaS version) for its lower operational burden and migrate to self-hosted TFE only when data residency requirements or network isolation mandates require it. Another common mistake is over-governing — applying strict Sentinel policies and manual approval gates to every workspace, including development and sandbox environments, which slows down experimentation. Use tiered governance: strict policies and approvals for production, advisory-only policies for staging, and minimal controls for development.
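A minimal sketch of the tiered governance idea above, expressed with the tfe provider. The workspace names are hypothetical, and the organization name is borrowed from the Code Example that follows. Policy enforcement levels (advisory, soft-mandatory, hard-mandatory) are not shown here because they are declared per policy in the policy repository's sentinel.hcl file:

```hcl
# Development tier: minimal controls, applies run without confirmation.
# Workspace names are illustrative assumptions.
resource "tfe_workspace" "payments_infra_dev" {
  name         = "payments-infra-development"
  organization = "bank-platform"
  auto_apply   = true
}

# Staging tier: plans are reviewed, but a human still confirms the apply.
resource "tfe_workspace" "payments_infra_staging" {
  name         = "payments-infra-staging"
  organization = "bank-platform"
  auto_apply   = false
}
```

Keeping the tiers as separate workspace resources makes the difference in controls visible in code review rather than hidden in UI settings.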
Code Example
# Terraform Enterprise workspace configuration via Terraform
# (yes, you manage TFE with Terraform itself)
resource "tfe_organization" "bank" {
  name  = "bank-platform"
  email = "[email protected]"
}
# Private module registry - vetted EKS module
resource "tfe_registry_module" "eks_cluster" {
  organization = tfe_organization.bank.name
  vcs_repo {
    display_identifier = "bank/terraform-aws-eks-cluster"
    identifier         = "bank/terraform-aws-eks-cluster"
    oauth_token_id     = var.github_oauth_token_id
  }
}
# Production workspace with approval gate
resource "tfe_workspace" "payments_infra_prod" {
  name              = "payments-infra-production"
  organization      = tfe_organization.bank.name
  terraform_version = "1.7.0" # Pinned version
  auto_apply        = false   # Require manual approval
  queue_all_runs    = true
  working_directory = "environments/production"
  vcs_repo {
    identifier     = "bank/payments-infrastructure"
    branch         = "main"
    oauth_token_id = var.github_oauth_token_id
  }
}
# RBAC - only the admin team can approve production applies
resource "tfe_team" "payments_admins" {
  name         = "payments-infra-admins"
  organization = tfe_organization.bank.name
}
# Developer team referenced below; its declaration was missing,
# so the team name here is an assumption
resource "tfe_team" "payments_developers" {
  name         = "payments-developers"
  organization = tfe_organization.bank.name
}
resource "tfe_team_access" "payments_prod" {
  access       = "write" # Can queue plans and approve applies
  team_id      = tfe_team.payments_admins.id
  workspace_id = tfe_workspace.payments_infra_prod.id
}
resource "tfe_team_access" "payments_dev" {
  access       = "plan" # Can queue and view plans, cannot apply
  team_id      = tfe_team.payments_developers.id
  workspace_id = tfe_workspace.payments_infra_prod.id
}
---
# Sentinel policy set attached to production workspaces
# Enforcement levels such as hard-mandatory (cannot be overridden) are set
# per policy in the sentinel.hcl file of the policy repository, not on the
# policy set resource
resource "tfe_policy_set" "pci_dss_compliance" {
  name         = "pci-dss-compliance"
  organization = tfe_organization.bank.name
  kind         = "sentinel"
  vcs_repo {
    identifier     = "bank/sentinel-policies"
    branch         = "main"
    oauth_token_id = var.github_oauth_token_id
  }
  workspace_ids = [
    tfe_workspace.payments_infra_prod.id,
    # settlements workspace is assumed to be declared elsewhere
    tfe_workspace.settlements_infra_prod.id,
  ]
}
---
# Comparing open-source vs TFE workflow
# Open-source: engineer runs locally
# $ terraform init # Downloads providers to laptop
# $ terraform plan # Uses personal AWS credentials
# $ terraform apply # No approval gate, no audit log
# $ terraform state pull # State in S3, no RBAC on who reads it
# TFE: governed workflow
# 1. Engineer pushes to branch → TFE speculative plan on PR
# 2. Peer reviews plan output in GitHub PR check
# 3. Merge to main → TFE queues real plan
# 4. Sentinel policies evaluate → hard-mandatory must pass
# 5. Authorized team member approves apply
# 6. TFE applies using centralized credentials
# 7. Full audit log: who, when, what changed, plan output
◈ Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│          Open-Source Terraform vs Terraform Enterprise          │
│                                                                 │
│  ┌─────────────────────────┐  ┌──────────────────────────────┐  │
│  │ Open-Source Terraform   │  │ Terraform Enterprise         │  │
│  │                         │  │                              │  │
│  │  ┌───────────────────┐  │  │  ┌────────────────────────┐  │  │
│  │  │ Local execution   │  │  │  │ Remote execution       │  │  │
│  │  │ Personal creds    │  │  │  │ Centralized creds      │  │  │
│  │  │ No approval gate  │  │  │  │ Workspace approval     │  │  │
│  │  │ S3 state (no RBAC)│  │  │  │ RBAC on state          │  │  │
│  │  │ No policy engine  │  │  │  │ Sentinel policies      │  │  │
│  │  │ Copy-paste modules│  │  │  │ Private registry       │  │  │
│  │  │ No audit trail    │  │  │  │ Full audit logging     │  │  │
│  │  └───────────────────┘  │  │  └────────────────────────┘  │  │
│  │                         │  │                              │  │
│  │ Good for:               │  │ Good for:                    │  │
│  │ - Solo/small teams      │  │ - Enterprise teams           │  │
│  │ - Non-regulated envs    │  │ - Regulated industries       │  │
│  │ - Experimentation       │  │ - Multi-team governance      │  │
│  └─────────────────────────┘  └──────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ TFE Workflow: PR → Speculative Plan → Sentinel Check      │  │
│  │ → Merge → Real Plan → Approval → Apply → Audit Log         │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Quick Answer
Create a single set of Terraform modules and root configurations, then use per-environment .tfvars files (dev.tfvars, qa.tfvars, uat.tfvars, prod.tfvars) to inject environment-specific values like instance sizes, replica counts, and CIDR blocks. The terraform plan -var-file flag selects the correct file, keeping code DRY while allowing each environment to diverge on sizing and configuration.
Detailed Answer
Using tfvars files for multi-environment management is like having a single restaurant menu (Terraform code) with portion size options (tfvars): the recipe stays the same whether you order a small (dev), medium (qa), large (uat), or extra-large (prod) — only the quantities change. This pattern eliminates code duplication while giving each environment its own tunable parameters. The variable definition layer starts in variables.tf, where you declare every configurable parameter with a type constraint, description, and optionally a default value. Variables without defaults are mandatory and must be provided by the tfvars file. Use object types for related parameters: instead of separate variables for each RDS setting, create a variable of type object({ instance_class = string, allocated_storage = number, multi_az = bool, backup_retention_days = number }) called database_config. This groups related values and makes tfvars files self-documenting. The tfvars files live in an environments/ directory: environments/dev.tfvars, environments/qa.tfvars, environments/uat.tfvars, environments/prod.tfvars. Each file sets the same variables with environment-appropriate values. Dev might use db.t3.medium with 20GB storage and single-AZ, while Prod uses db.r6g.2xlarge with 500GB and multi-AZ enabled. The key principle is that tfvars files contain only values, never logic. Conditional logic belongs in the Terraform code itself using ternary expressions or lookup maps. The CI/CD pipeline selects the correct tfvars file based on the target environment. A typical command is terraform plan -var-file=environments/prod.tfvars -out=prod.tfplan. The pipeline determines the environment from the Git branch (feature/* uses dev.tfvars, release/* uses uat.tfvars, main uses prod.tfvars) or from a pipeline variable. This pattern integrates cleanly with the approval workflow: the plan output shows exactly what will change in the target environment before anyone approves. 
For complex configurations, layer multiple tfvars files. Use a common.tfvars for values shared across all environments (organization name, DNS domain, tagging standards), then overlay environment-specific files: terraform plan -var-file=common.tfvars -var-file=environments/prod.tfvars. Later files override values from earlier ones, so environment-specific settings take precedence over common defaults. The critical gotcha is tfvars file drift. Over time, someone adds a new variable to dev.tfvars but forgets to add it to prod.tfvars. Without a default value, the prod plan fails. Prevent this by treating tfvars files as a matrix: every variable declared without a default must appear in every tfvars file. Enforce this with a CI check that parses variables.tf for required variables and validates their presence in all tfvars files. Another gotcha is accidentally committing sensitive values (database passwords, API keys) in tfvars files. Sensitive values should come from environment variables (TF_VAR_*), HashiCorp Vault, or AWS Secrets Manager data sources — never from tfvars files in version control.
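Two of the rules above, keeping logic out of tfvars files and keeping secrets out of version control, can be sketched as follows. The variable name db_password, the local names, and the retention figures are illustrative assumptions rather than values from the original:

```hcl
variable "environment" {
  description = "Target deployment environment"
  type        = string
}

# Supplied at run time through the TF_VAR_db_password environment
# variable, never through a committed tfvars file (hypothetical name)
variable "db_password" {
  description = "Master password for the payments database"
  type        = string
  sensitive   = true
}

locals {
  # Conditional logic lives in code as a lookup map keyed by environment;
  # the tfvars files only set var.environment. Figures are placeholders.
  log_retention_days_by_env = {
    dev  = 7
    qa   = 14
    uat  = 30
    prod = 365
  }
  log_retention_days = local.log_retention_days_by_env[var.environment]
}
```

Marking the variable as sensitive redacts it from plan output, but the value is still written to state, so state access needs to be restricted as well.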
Code Example
# variables.tf - Declare all configurable parameters
# Environment identifier used in resource naming
variable "environment" {
  description = "Target deployment environment"
  type        = string
  # Validate against allowed environment names
  validation {
    condition     = contains(["dev", "qa", "uat", "prod"], var.environment)
    error_message = "Environment must be one of: dev, qa, uat, prod."
  }
}
# VPC CIDR block - different per environment to allow peering
variable "vpc_cidr" {
  description = "CIDR block for the payments VPC"
  type        = string
}
# Database configuration object - groups all RDS parameters
variable "database_config" {
  description = "RDS cluster configuration for the payments database"
  type = object({
    instance_class        = string
    instance_count        = number
    allocated_storage     = number
    multi_az              = bool
    backup_retention_days = number
  })
}
# EKS node group scaling parameters
variable "node_scaling" {
  description = "Scaling configuration for EKS application node group"
  type = object({
    min_size      = number
    desired_size  = number
    max_size      = number
    instance_type = string
  })
}
# environments/dev.tfvars — Development environment values
# Environment name used in all resource naming and tagging
# environment = "dev"
# Small VPC for development workloads (non-overlapping with other envs)
# vpc_cidr = "10.1.0.0/16"
# Minimal database for developer testing
# database_config = {
# instance_class = "db.t3.medium"
# instance_count = 1
# allocated_storage = 20
# multi_az = false
# backup_retention_days = 1
# }
# Small node group for dev workloads
# node_scaling = {
# min_size = 1
# desired_size = 2
# max_size = 4
# instance_type = "t3.large"
# }
# environments/prod.tfvars — Production environment values
# environment = "prod"
# Large VPC for production workloads with room for growth
# vpc_cidr = "10.4.0.0/16"
# Production-grade database with HA and compliance retention
# database_config = {
# instance_class = "db.r6g.2xlarge"
# instance_count = 3
# allocated_storage = 500
# multi_az = true
# backup_retention_days = 30
# }
# Production node group sized for payment processing SLA
# node_scaling = {
# min_size = 3
# desired_size = 6
# max_size = 15
# instance_type = "m5.2xlarge"
# }
# CI/CD pipeline usage — selecting the correct tfvars file
# terraform init -backend-config=backends/prod.hcl
# terraform plan -var-file=environments/prod.tfvars -out=prod.tfplan
# terraform apply prod.tfplan
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│              tfvars-Based Environment Management              │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Single Codebase:                                             │
│     ┌─────────────────────────────────────────────────────┐   │
│     │ main.tf │ variables.tf │ outputs.tf │ modules/      │   │
│     │ (same code for all environments)                    │   │
│     └──────────────────────┬──────────────────────────────┘   │
│                            │                                  │
│                ┌───────────┴───────────┐                      │
│                │ terraform plan        │                      │
│                │   -var-file=???       │                      │
│                └───────────┬───────────┘                      │
│                            │                                  │
│         ┌───────────┬──────┴────┬─────────────┐               │
│         ↓           ↓           ↓             ↓               │
│    ┌──────────┐┌──────────┐┌──────────┐┌──────────────┐       │
│    │dev.tfvars││qa.tfvars ││uat.tfvars││prod.tfvars   │       │
│    │          ││          ││          ││              │       │
│    │t3.large  ││t3.xlarge ││m5.xlarge ││m5.2xlarge    │       │
│    │1 replica ││2 replicas││2 replicas││3 replicas    │       │
│    │20GB disk ││50GB disk ││100GB disk││500GB disk    │       │
│    │no multi- ││no multi- ││multi-az  ││multi-az      │       │
│    │az        ││az        ││          ││30-day backup │       │
│    │1-day bkp ││3-day bkp ││7-day bkp ││              │       │
│    └──────────┘└──────────┘└──────────┘└──────────────┘       │
│                                                               │
│  Layered Override:                                            │
│  terraform plan \                                             │
│    -var-file=common.tfvars \             ← shared values      │
│    -var-file=environments/prod.tfvars    ← env overrides      │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Terraform workspaces use a single configuration with separate state files selected by workspace name (terraform workspace select prod), while directory-based separation duplicates the configuration into per-environment directories (envs/dev/, envs/prod/) each with their own state. Workspaces are simpler but share all code paths; directories offer full isolation but require synchronization effort.
Detailed Answer
Choosing between workspaces and directory-based separation is like choosing between a multi-tenant apartment building (workspaces) and separate houses (directories). The apartment building shares plumbing and electrical systems — cheaper and easier to maintain, but a pipe burst affects everyone. Separate houses cost more to build and maintain, but a problem in one house never reaches another. Terraform workspaces create named instances of state within the same backend configuration. When you run terraform workspace new qa, Terraform creates a new state file at env:/qa/terraform.tfstate in your backend (S3 key prefix changes to include the workspace name). The configuration stays identical — same main.tf, same modules — and you use terraform.workspace in conditionals or lookups to vary behavior. For example, locals { instance_type = terraform.workspace == "prod" ? "m5.2xlarge" : "t3.large" }. The advantage is zero code duplication: a module upgrade applies to all environments by running apply in each workspace sequentially. Directory-based separation creates independent root configurations per environment: infrastructure/envs/dev/main.tf, infrastructure/envs/prod/main.tf. Each directory has its own backend configuration, its own provider setup, and its own terraform.tfvars. Shared logic lives in modules referenced by relative path or registry source. The directories might look identical initially, but they can diverge intentionally — Prod might have a WAF module that Dev does not, or Dev might test a newer provider version before promoting it. The workspace approach has several tradeoffs. On the positive side: single source of truth for infrastructure code, atomic module upgrades, and less repository sprawl. On the negative side: a syntax error in main.tf breaks all environments simultaneously, there is no way to run different Terraform or provider versions per workspace, and terraform.workspace conditionals scattered through the code create hidden complexity. 
Most critically, a careless terraform apply in the wrong workspace can destroy production — there is no structural guardrail preventing this, only the workspace prompt. Directory-based separation trades DRY for safety. Each environment is fully independent: you can upgrade Dev to Terraform 1.8 while Prod stays on 1.7, you can test a new module version in QA without touching UAT, and a broken configuration in one directory cannot affect another. The cost is synchronization: when you fix a bug in the Dev VPC module call, you must remember to propagate the fix to QA, UAT, and Prod directories. Tooling like Terragrunt mitigates this by generating directory-based configurations from a DRY template, giving you the isolation of directories with the maintainability of workspaces. The production recommendation for most teams at Value Momentum's scale is a hybrid approach: use directory-based separation for major infrastructure boundaries (networking, EKS cluster, databases) and workspaces within each directory for environment selection only when the configurations are truly identical. Never use workspaces as a substitute for separate AWS accounts — they operate within a single provider configuration and do not provide IAM or network isolation.
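One way to add a structural guardrail against applying in the wrong workspace is a precondition that compares the selected workspace with an explicitly supplied environment name. This is a sketch under assumptions: the variable name environment and the resource name workspace_guard are hypothetical, and terraform_data requires Terraform 1.4 or later:

```hcl
# The pipeline passes the intended target explicitly, for example with
# -var="environment=prod" (hypothetical variable name)
variable "environment" {
  description = "Environment the caller intends to change"
  type        = string
}

# Placeholder resource whose only job is to carry the check
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = terraform.workspace == var.environment
      error_message = "Selected workspace does not match var.environment; refusing to continue."
    }
  }
}
```

If an operator has prod selected but passes environment=dev, the plan fails before any change is proposed. It does not replace account-level isolation, since the same credentials still reach every workspace.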
Code Example
# Workspace-based approach - single configuration, multiple states
# main.tf - same code serves all environments
locals {
  # Map workspace names to environment-specific configurations
  env_config = {
    dev = {
      instance_type = "t3.large"
      min_nodes     = 1
      max_nodes     = 3
      db_instance   = "db.t3.medium"
      multi_az      = false
    }
    qa = {
      instance_type = "t3.xlarge"
      min_nodes     = 2
      max_nodes     = 4
      db_instance   = "db.t3.large"
      multi_az      = false
    }
    prod = {
      instance_type = "m5.2xlarge"
      min_nodes     = 3
      max_nodes     = 15
      db_instance   = "db.r6g.2xlarge"
      multi_az      = true
    }
  }
  # Select the configuration for the current workspace
  current = local.env_config[terraform.workspace]
}
# RDS cluster using workspace-driven configuration
resource "aws_rds_cluster" "payments_db" {
  # Cluster name includes workspace for uniqueness
  cluster_identifier = "payments-db-${terraform.workspace}"
  # Engine and version are environment-agnostic
  engine         = "aurora-postgresql"
  engine_version = "15.4"
  # Instance class from the workspace-specific config map
  # (applied to cluster instances, shown here for clarity)
  # Multi-AZ driven by workspace config
  # Note: Aurora handles AZ distribution via cluster instances
  deletion_protection     = terraform.workspace == "prod"
  backup_retention_period = terraform.workspace == "prod" ? 30 : 7
}
# Workspace CLI commands
# terraform workspace new dev
# terraform workspace new qa
# terraform workspace new prod
# terraform workspace select prod
# terraform plan -out=prod.tfplan
# terraform apply prod.tfplan
# ─────────────────────────────────────────────────────
# Directory-based approach — separate configs per environment
# infrastructure/envs/dev/backend.tf
# terraform {
# backend "s3" {
# bucket = "valuemomentum-terraform-state"
# key = "dev/payments/terraform.tfstate"
# region = "us-east-1"
# dynamodb_table = "terraform-locks"
# encrypt = true
# }
# }
# infrastructure/envs/prod/backend.tf
# terraform {
# backend "s3" {
# bucket = "valuemomentum-terraform-state"
# key = "prod/payments/terraform.tfstate"
# region = "us-east-1"
# dynamodb_table = "terraform-locks"
# encrypt = true
# }
# }
# infrastructure/envs/prod/main.tf
# module "payments_vpc" {
# source = "../../modules/vpc"
# vpc_name = "payments-vpc-prod"
# vpc_cidr = "10.4.0.0/16"
# environment = "prod"
# }
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│     Workspaces vs Directory-Based Environment Separation      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Workspace Approach:                                          │
│  ┌─────────────────────────────────────────────┐              │
│  │ Single Configuration (main.tf)              │              │
│  │  ┌──────────────────────────────────────┐   │              │
│  │  │ terraform.workspace → selects state  │   │              │
│  │  └──────────────────────────────────────┘   │              │
│  │        │          │          │              │              │
│  │        ↓          ↓          ↓              │              │
│  │    ┌───────┐  ┌───────┐  ┌───────┐          │              │
│  │    │  dev  │  │  qa   │  │ prod  │  states  │              │
│  │    │ state │  │ state │  │ state │          │              │
│  │    └───────┘  └───────┘  └───────┘          │              │
│  └─────────────────────────────────────────────┘              │
│  Pros: DRY, single upgrade path                               │
│  Cons: shared code = shared failures, no version isolation    │
│                                                               │
│  Directory Approach:                                          │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐                    │
│  │ envs/dev/ │ │ envs/qa/  │ │ envs/prod/│                    │
│  │           │ │           │ │           │                    │
│  │ main.tf   │ │ main.tf   │ │ main.tf   │                    │
│  │ backend.tf│ │ backend.tf│ │ backend.tf│                    │
│  │ dev.tfvars│ │ qa.tfvars │ │ prod.tfvar│                    │
│  │           │ │           │ │ + WAF mod │                    │
│  │ TF 1.7    │ │ TF 1.7    │ │ TF 1.7    │                    │
│  │ state: dev│ │ state: qa │ │ state:prod│                    │
│  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘                    │
│        │             │             │                          │
│        └─────────────┴─────────────┘                          │
│                      │                                        │
│              ┌───────┴───────┐                                │
│              │ Shared Modules│                                │
│              │ modules/vpc   │                                │
│              │ modules/eks   │                                │
│              │ modules/rds   │                                │
│              └───────────────┘                                │
│  Pros: full isolation, independent versions                   │
│  Cons: code duplication, sync overhead                        │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Use a single monorepo with environment-specific directories (envs/dev, envs/prod) and shared modules, combined with a trunk-based branching strategy where main is always deployable. Feature branches target dev first, promotion happens through directory-level changes, and protected branch rules prevent direct pushes to main without PR approval and passing CI checks.
Detailed Answer
Choosing between a monorepo and a multi-repo layout for Terraform is like choosing between a single warehouse with organized aisles (monorepo) and separate warehouses per product line (multi-repo). The single warehouse is easier to navigate and keeps inventory consistent, but a forklift accident in one aisle can block the whole building. Separate warehouses provide isolation but make it harder to share parts and coordinate shipments.

The monorepo approach stores all Terraform code in a single repository: shared modules in modules/, environment configurations in envs/dev/, envs/qa/, envs/prod/, and CI/CD pipeline definitions alongside the code. The primary advantage is atomic changes: when you update a shared module and its callers, everything is in one PR, reviewable together. Cross-cutting concerns like provider version upgrades, backend configuration changes, or tagging policy updates are a single commit. Module development and consumption happen in the same repository, eliminating the version publishing ceremony.

The multi-repo approach creates separate repositories per environment or per infrastructure domain: infra-networking, infra-eks, infra-databases. Each repository has its own CI/CD pipeline, its own access controls, and its own release cadence. The advantage is strict blast radius isolation: a broken CI pipeline in the networking repo cannot block database deployments. Access control is also more granular: you can give database administrators write access to infra-databases without exposing networking code.

For most teams at the scale of four environments (Dev, QA, UAT, Prod), the monorepo with environment directories is the pragmatic choice. The branching strategy that works best is trunk-based development with short-lived feature branches. Main (or trunk) is always deployable: it represents the current desired state of all environments. Engineers create feature branches from main, make changes in the relevant environment directory, open a PR, and the CI pipeline runs terraform plan against the affected environments.

The promotion flow works through directory-level changes, not branch-based promotion. An engineer develops a new VPC peering configuration in envs/dev/peering.tf, merges to main, and the dev pipeline applies it. After validation in dev, they create a new PR that adds the same configuration to envs/qa/peering.tf (possibly with different tfvars). This continues through UAT and Prod. Each environment change is a separate PR with its own review and approval cycle.

The critical branching anti-pattern to avoid is environment branches (dev branch, qa branch, prod branch) where you promote by merging dev into qa into prod. This creates merge conflicts, diverging code paths, and the nightmare scenario where a hotfix to prod must be cherry-picked back through all environment branches. Trunk-based development with directory separation eliminates this entirely.

For the multi-repo approach, use Git tags or releases to version shared modules. Consumer repositories reference modules via Git source URLs with pinned tags: source = "git::https://github.com/org/tf-modules.git//vpc?ref=v2.2.0". Promotion happens by bumping the version tag in each environment's module source, creating a clear audit trail of what version each environment runs.
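The tag-pinned module reference described for the multi-repo approach can be sketched as follows. This is a minimal illustration, not a prescribed layout: the repository URL and the v2.2.0 tag are the ones used in the text, while the older v2.1.0 tag, the dev CIDR, and the file paths in the comments are assumptions made for the example. The two module blocks live in separate root directories, so they never share a configuration.

```hcl
# envs/dev/main.tf (hypothetical consumer root for dev)
module "payments_vpc" {
  # Dev has already been promoted to the newer module release
  source = "git::https://github.com/org/tf-modules.git//vpc?ref=v2.2.0"

  vpc_name    = "payments-vpc-dev"
  vpc_cidr    = "10.1.0.0/16" # assumed dev range
  environment = "dev"
}

# envs/prod/main.tf (hypothetical consumer root for prod, a separate directory)
module "payments_vpc" {
  # Prod stays on the previous tag until a PR bumps ref to v2.2.0
  source = "git::https://github.com/org/tf-modules.git//vpc?ref=v2.1.0" # assumed older tag

  vpc_name    = "payments-vpc-prod"
  vpc_cidr    = "10.4.0.0/16"
  environment = "prod"
}
```

Promoting the module to prod is then a one-line diff on the ref value, which is what gives the audit trail mentioned above.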
Code Example
# Monorepo directory structure for Value Momentum
# infrastructure/
# ├── modules/ # Shared reusable modules
# │ ├── vpc/
# │ │ ├── main.tf
# │ │ ├── variables.tf
# │ │ └── outputs.tf
# │ ├── eks-cluster/
# │ ├── rds-aurora/
# │ └── security-baseline/
# ├── envs/ # Environment-specific roots
# │ ├── dev/
# │ │ ├── main.tf # Calls shared modules
# │ │ ├── backend.tf # Dev state backend config
# │ │ └── terraform.tfvars # Dev-specific values
# │ ├── qa/
# │ ├── uat/
# │ └── prod/
# │ ├── main.tf # Same modules, prod values
# │ ├── backend.tf # Prod state backend config
# │ └── terraform.tfvars # Prod-specific values
# ├── .github/
# │ └── workflows/
# │ └── terraform.yml # CI/CD pipeline
# └── CODEOWNERS # Require platform team review
# CODEOWNERS — enforce review policies per directory
# Require platform-admins approval for production changes
# /infrastructure/envs/prod/ @valuemomentum/platform-admins
# Require networking team approval for VPC changes
# /infrastructure/modules/vpc/ @valuemomentum/networking-team
# Any engineer can modify dev without special approval
# /infrastructure/envs/dev/ @valuemomentum/developers
# CI/CD pipeline: detect which environments changed
# .github/workflows/terraform.yml
# name: Terraform Multi-Environment Pipeline
# on:
#   pull_request:
#     paths: ['infrastructure/**']
# jobs:
#   detect-changes:
#     runs-on: ubuntu-latest
#     outputs:
#       dev_changed: ${{ steps.changes.outputs.dev }}
#       prod_changed: ${{ steps.changes.outputs.prod }}
#     steps:
#       - uses: dorny/paths-filter@v3
#         id: changes
#         with:
#           filters: |
#             dev:
#               - 'infrastructure/envs/dev/**'
#               - 'infrastructure/modules/**'
#             prod:
#               - 'infrastructure/envs/prod/**'
#               - 'infrastructure/modules/**'
#
#   plan-dev:
#     needs: detect-changes
#     if: needs.detect-changes.outputs.dev_changed == 'true'
#     runs-on: ubuntu-latest
#     steps:
#       # Check out the code and install Terraform before planning
#       # (cloud credential setup omitted here)
#       - uses: actions/checkout@v4
#       - uses: hashicorp/setup-terraform@v3
#       - run: terraform init -input=false && terraform plan -input=false
#         working-directory: infrastructure/envs/dev
#
#   plan-prod:
#     needs: detect-changes
#     if: needs.detect-changes.outputs.prod_changed == 'true'
#     runs-on: ubuntu-latest
#     environment: production-review
#     steps:
#       - uses: actions/checkout@v4
#       - uses: hashicorp/setup-terraform@v3
#       - run: terraform init -input=false && terraform plan -input=false
#         working-directory: infrastructure/envs/prod
# Module reference from environment directory
# infrastructure/envs/prod/main.tf
module "payments_vpc" {
# Relative path to shared module within the monorepo
source = "../../modules/vpc"
# Production VPC configuration
vpc_name = "payments-vpc-prod"
vpc_cidr = "10.4.0.0/16"
environment = "prod"
# Disable the single shared NAT gateway so each AZ gets its own (3 gateways across 3 AZs) for production HA
single_nat_gateway = false
}
module "payments_eks" {
# Relative path to shared EKS module
source = "../../modules/eks-cluster"
# Production cluster configuration
cluster_name = "payments-eks-prod"
vpc_id = module.payments_vpc.vpc_id
subnet_ids = module.payments_vpc.private_subnet_ids
# Production Kubernetes version
k8s_version = "1.29"
}
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│  Monorepo Trunk-Based Branching Strategy                      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  main (always deployable)                                     │
│  ─────────●─────────●─────────●─────────●──────→              │
│           │         │         │         │                     │
│           │         │         │         └── PR: add WAF to    │
│           │         │         │             envs/prod/        │
│           │         │         │                               │
│           │         │         └── PR: add peering to          │
│           │         │             envs/uat/                   │
│           │         │                                         │
│           │         └── PR: add peering to envs/qa/           │
│           │                                                   │
│           └── PR: add peering to envs/dev/                    │
│               (feature/vpc-peering branch)                    │
│                                                               │
│  Promotion Flow (directory-based, NOT branch-based):          │
│  ┌──────────────────────────────────────────────────────┐     │
│  │ 1. PR: modify envs/dev/peering.tf → merge → apply    │     │
│  │ 2. Validate in dev                                   │     │
│  │ 3. PR: modify envs/qa/peering.tf → merge → apply     │     │
│  │ 4. Validate in qa                                    │     │
│  │ 5. PR: modify envs/uat/peering.tf → merge → apply    │     │
│  │ 6. Validate in uat                                   │     │
│  │ 7. PR: modify envs/prod/peering.tf → approve → apply │     │
│  └──────────────────────────────────────────────────────┘     │
│                                                               │
│  Anti-Pattern (environment branches):                         │
│  dev  ──────●──────●──────→                                   │
│                     ╲ merge                                   │
│  qa   ────────────────●──────→  ← merge conflicts             │
│                        ╲ merge                                │
│  prod ───────────────────●──────→  ← cherry-pick hell         │
│                                                               │
│  CI Trigger Matrix:                                           │
│  ┌──────────────────┬────────────────────────┐                │
│  │ Path Changed     │ Environments Planned   │                │
│  ├──────────────────┼────────────────────────┤                │
│  │ envs/dev/**      │ dev only               │                │
│  │ envs/prod/**     │ prod only              │                │
│  │ modules/**       │ ALL environments       │                │
│  └──────────────────┴────────────────────────┘                │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Implement a naming convention module that generates consistent resource names using a pattern like {project}-{component}-{environment}-{region-short}. Pass environment as a variable, embed it in every resource name and tag, and use validation rules to enforce the convention. This prevents name collisions across environments sharing the same AWS account and makes resources identifiable in billing reports and the AWS console.
Detailed Answer
Naming conventions in Terraform are like street addresses in a city: without a consistent scheme, you end up with two buildings called 'Main Office' and no one knows which is production and which is staging until something breaks. A well-designed naming convention makes every resource self-describing: you should be able to look at a resource name in a CloudWatch alarm or a billing report and immediately know its environment, project, and component, with tags carrying ownership.

The naming pattern should follow a hierarchical structure: {project}-{component}-{environment}-{region-short}. For example, payments-api-alb-prod-use1 breaks down as project=payments, component=api-alb, environment=prod, and region=us-east-1 abbreviated to use1. This pattern works across most AWS resource types, though some have character limits (S3 bucket names are capped at 63 characters, IAM role names at 64). Build a naming module that accepts these components and outputs formatted names, handling truncation and character restrictions per resource type.

The naming module centralizes convention enforcement. Every resource in your Terraform code calls this module to generate its name rather than constructing names inline. This ensures consistency: if the team decides to change the naming format (adding a cost center code, for example), you update one module and every resource follows suit. The module also handles AWS-specific constraints: S3 buckets cannot have underscores, security group names cannot start with sg-, and some resources require globally unique names while others only need account-level uniqueness.

Environment embedding is the critical collision-prevention mechanism. When Dev and QA share an AWS account (common in early-stage projects), every resource must include the environment in its name. Without this, a terraform apply in dev that creates a security group called payments-api-sg will conflict with the same apply in qa trying to create the same name. The conflict usually surfaces as a Terraform error (the resource already exists). It is worse for resource types whose create call overwrites in place instead of failing: one environment then silently takes over a resource that another environment also manages.

Tag standardization complements naming. While names are visible in the console and logs, tags power cost allocation, automation, and compliance. Define a standard tag set: Environment, Project, Component, ManagedBy (always 'terraform'), Team, CostCenter. Use the AWS provider's default_tags block to apply these automatically to every taggable resource the provider manages, without repeating them in each resource block. New resources then pick up the standard tags even if an engineer forgets to add a tags argument.

The gotcha most teams hit is naming convention drift over time. Early resources use one pattern, new resources use an updated pattern, and the codebase becomes inconsistent. Prevent this by adding tflint rules or OPA policies that validate resource names against a regex pattern. Run these checks in CI so non-conforming names fail the pipeline before reaching any environment. Another gotcha is renaming existing resources, and there are two distinct cases. Renaming the Terraform resource address (the label in the resource block) causes a destroy-and-recreate cycle unless you use a moved block or terraform state mv to update the state mapping. Changing the name attribute of the cloud resource itself forces replacement for many resource types regardless of the state mapping, so plan those renames as migrations.
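The default_tags and moved mechanisms mentioned above can be sketched like this. It is a minimal example under stated assumptions: the Team and CostCenter values, the security group addresses, and the variable names are hypothetical and chosen only to illustrate the two features.

```hcl
variable "environment" {
  type = string
}

variable "vpc_id" {
  type = string
}

# Provider-level default tags: applied to every taggable resource this
# provider manages, so individual resource blocks need not repeat them.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      Environment = var.environment
      Project     = "payments"
      ManagedBy   = "terraform"
      Team        = "platform" # assumed value
      CostCenter  = "cc-0000"  # placeholder value
    }
  }
}

# The resource address was renamed from aws_security_group.api (hypothetical
# old address) to aws_security_group.payments_api.
resource "aws_security_group" "payments_api" {
  name   = "payments-api-sg-${var.environment}-use1"
  vpc_id = var.vpc_id
}

# The moved block tells Terraform that the old address and the new address
# are the same object, so the plan shows a move, not destroy and create.
moved {
  from = aws_security_group.api
  to   = aws_security_group.payments_api
}
```

Note that the moved block only covers the address rename. If the name argument itself changed, the security group would still be replaced.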
Code Example
# modules/naming/main.tf — Centralized naming convention module
# Input variables for name construction
variable "project" {
description = "Project identifier (e.g., payments, user-auth)"
type = string
# Enforce lowercase alphanumeric with hyphens only
validation {
condition = can(regex("^[a-z][a-z0-9-]{1,20}$", var.project))
error_message = "Project name must be lowercase alphanumeric with hyphens, 2-21 chars."
}
}
variable "component" {
description = "Component identifier (e.g., api-alb, eks-cluster, rds-primary)"
type = string
}
variable "environment" {
description = "Environment identifier (dev, qa, uat, prod)"
type = string
validation {
condition = contains(["dev", "qa", "uat", "prod"], var.environment)
error_message = "Environment must be one of: dev, qa, uat, prod."
}
}
variable "region" {
description = "AWS region for the short code suffix"
type = string
default = "us-east-1"
}
# Region short code mapping for compact names
locals {
# Map full region names to 4-character abbreviations
region_short = {
"us-east-1" = "use1"
"us-east-2" = "use2"
"us-west-2" = "usw2"
"eu-west-1" = "euw1"
}
# Standard name pattern: project-component-env-region
standard_name = "${var.project}-${var.component}-${var.environment}-${local.region_short[var.region]}"
# S3-safe name (no underscores, max 63 chars)
s3_name = substr(replace(local.standard_name, "_", "-"), 0, 63)
# Standard tag set applied to all resources
standard_tags = {
Project = var.project
Component = var.component
Environment = var.environment
ManagedBy = "terraform"
NamingVersion = "v2"
}
}
# Output the generated names and tags
output "standard_name" {
description = "Standard resource name"
value = local.standard_name
}
output "s3_name" {
description = "S3-safe resource name (no underscores, max 63 chars)"
value = local.s3_name
}
output "tags" {
description = "Standard tag set for all resources"
value = local.standard_tags
}
# Usage in root configuration — every resource uses the naming module
module "naming_vpc" {
# Reference the naming module
source = "./modules/naming"
project = "payments"
component = "vpc"
environment = var.environment
}
resource "aws_vpc" "payments_network" {
# Use the naming module output for the VPC name tag
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
tags = merge(module.naming_vpc.tags, {
# Name tag uses the standardized naming output
Name = module.naming_vpc.standard_name
})
}
module "naming_alb" {
source = "./modules/naming"
project = "payments"
component = "api-alb"
environment = var.environment
}
resource "aws_lb" "payments_api" {
# ALB name using the naming convention (max 32 chars for ALB)
name = substr(module.naming_alb.standard_name, 0, 32)
internal = false
load_balancer_type = "application"
subnets = var.public_subnet_ids
tags = module.naming_alb.tags
}
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│  Terraform Naming Convention Architecture                     │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Naming Pattern:                                              │
│  ┌─────────────────────────────────────────────────────┐      │
│  │ {project}-{component}-{environment}-{region-short}  │      │
│  │                                                     │      │
│  │ payments-api-alb-prod-use1                          │      │
│  │ payments-vpc-dev-use1                               │      │
│  │ user-auth-rds-qa-use1                               │      │
│  └─────────────────────────────────────────────────────┘      │
│                                                               │
│  Naming Module Flow:                                          │
│  ┌──────────────┐                                             │
│  │ Inputs:      │                                             │
│  │  project     │    ┌──────────────────┐                     │
│  │  component   │───→│ Naming Module    │                     │
│  │  environment │    │                  │                     │
│  │  region      │    │ Validates input  │                     │
│  └──────────────┘    │ Applies rules    │                     │
│                      │ Handles limits   │                     │
│                      └────────┬─────────┘                     │
│                               │                               │
│            ┌──────────────────┼──────────────────┐            │
│            ↓                  ↓                  ↓            │
│    ┌───────────────┐   ┌──────────────┐   ┌──────────────┐    │
│    │ standard_name │   │ s3_name      │   │ standard_tags│    │
│    │ (general)     │   │ (63 char max)│   │ (Project, Env│    │
│    │               │   │ (no _)       │   │  Component)  │    │
│    └───────────────┘   └──────────────┘   └──────────────┘    │
│                                                               │
│  Collision Prevention:                                        │
│  ┌──────────────────────────────────────────────────┐         │
│  │ Same account, different environments:            │         │
│  │                                                  │         │
│  │ payments-api-sg-dev-use1  ← unique, no collision │         │
│  │ payments-api-sg-qa-use1   ← unique, no collision │         │
│  │ payments-api-sg-prod-use1 ← unique, no collision │         │
│  │                                                  │         │
│  │ Without env in name:                             │         │
│  │ payments-api-sg ← COLLISION across environments  │         │
│  └──────────────────────────────────────────────────┘         │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Terraform workspaces allow you to maintain multiple state files for the same configuration, enabling environment separation (dev/staging/prod) with a single codebase. Separate directories are preferred when environments have significantly different resource compositions or provider configurations.
Detailed Answer
Terraform workspaces are a built-in mechanism for managing multiple instances of the same infrastructure configuration. Think of workspaces like branches in a photo editing app: you have the same source image but can apply different filters and adjustments to each branch independently. Each workspace gets its own state file, so resources in the 'dev' workspace are completely isolated from resources in the 'prod' workspace, even though they share the same Terraform code.

Internally, workspaces work by modifying the state file path. When using the default local backend, Terraform stores state files in a terraform.tfstate.d/ directory with subdirectories for each workspace. With an S3 backend, the default workspace uses the configured key as-is, and every other workspace is stored under a workspace prefix (env: by default, configurable through workspace_key_prefix). For a key of payments-platform/terraform.tfstate, that gives env:/dev/payments-platform/terraform.tfstate and env:/prod/payments-platform/terraform.tfstate. The terraform.workspace value is available in your HCL code, letting you conditionally set values based on the active workspace.

Workspaces shine when your environments are structurally identical but differ in scale or configuration. If dev, staging, and production all need the same VPC, RDS cluster, ECS service, and ALB, but dev uses smaller instance types and fewer replicas, workspaces with conditional expressions or workspace-specific tfvars files work well. You can use terraform.workspace in locals to set instance sizes, replica counts, and CIDR ranges per environment.

However, workspaces have significant limitations that push many teams toward separate directories. First, all environments share the same provider configuration: you cannot easily use different AWS accounts per workspace without complex provider alias tricks. Second, if your production environment has additional resources that dev does not need (WAF rules, CloudFront distributions, compliance monitoring), you end up with count = terraform.workspace == "prod" ? 1 : 0 scattered throughout your code, which becomes unreadable. Third, workspaces provide no protection against accidentally running apply in the wrong workspace. A sleep-deprived engineer who forgets to run terraform workspace select prod before applying a production hotfix could accidentally modify the dev environment.

Separate directories (often called the directory-per-environment pattern) give you complete isolation. Each environment has its own directory with its own backend configuration, provider configuration, and state. This means prod can use a different AWS account, different provider version constraints, and completely different resource compositions. The tradeoff is code duplication: you need to keep common modules in sync across directories.

The modern consensus in the Terraform community is to use workspaces for lightweight environment differentiation within a single account and team, and separate directories (or separate Terraform Cloud workspaces with VCS integration) for production-grade multi-account setups. Many teams use a hybrid approach: modules contain the shared logic, and each environment directory calls those modules with environment-specific variables.
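The count-based conditional mentioned above looks roughly like this in practice. This is a sketch under assumptions: the WAF resource, its name, and the output are hypothetical stand-ins for any prod-only resource, chosen to show why the pattern spreads through a codebase.

```hcl
locals {
  # True only when the active workspace is "prod"
  is_prod = terraform.workspace == "prod"
}

# Prod-only resource: zero instances in every other workspace
resource "aws_wafv2_web_acl" "payments" {
  count = local.is_prod ? 1 : 0

  name  = "payments-waf-${terraform.workspace}"
  scope = "REGIONAL"

  default_action {
    allow {}
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "payments-waf"
    sampled_requests_enabled   = true
  }
}

# Every reference must now handle the zero-instance case.
# one() returns null when the list is empty and the single element otherwise.
output "waf_arn" {
  value = one(aws_wafv2_web_acl.payments[*].arn)
}
```

Each prod-only resource needs the same count expression, and each consumer needs the same empty-list handling, which is the readability cost the directory approach avoids by simply not declaring the resource in dev.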
Code Example
# Using workspaces with conditional configuration for the payments platform
# Define local values that change based on the active workspace
locals {
# Map workspace names to environment-specific configurations
environment_config = {
# Development environment uses minimal resources to save costs
dev = {
instance_class = "db.t3.medium"
replica_count = 1
vpc_cidr = "10.10.0.0/16"
enable_waf = false
backup_retention = 3
}
# Staging mirrors production structure but at reduced scale
staging = {
instance_class = "db.r6g.large"
replica_count = 2
vpc_cidr = "10.20.0.0/16"
enable_waf = true
backup_retention = 7
}
# Production runs at full scale with maximum protection
prod = {
instance_class = "db.r6g.2xlarge"
replica_count = 3
vpc_cidr = "10.30.0.0/16"
enable_waf = true
backup_retention = 35
}
}
# Look up the current workspace's config from the map above (errors in the "default" workspace, which has no entry)
config = local.environment_config[terraform.workspace]
}
# Configure the S3 backend: each non-default workspace gets its own state object
terraform {
# S3 backend stores each workspace's state at a separate key path
backend "s3" {
# Shared state bucket for all environments
bucket = "fintech-terraform-state"
# Base key: non-default workspaces are stored at env:/<workspace>/<key> (env: is the default workspace_key_prefix)
key = "payments-platform/terraform.tfstate"
# Region where the state bucket resides
region = "us-east-1"
# Lock table shared across all workspaces
dynamodb_table = "fintech-terraform-locks"
# Encrypt state at rest for compliance
encrypt = true
}
}
# VPC sized according to the current workspace
resource "aws_vpc" "payments_vpc" {
# CIDR block varies by environment: 10.10.0.0/16 in dev, 10.20.0.0/16 in staging, 10.30.0.0/16 in prod
cidr_block = local.config.vpc_cidr
# Enable DNS hostnames for service discovery within the VPC
enable_dns_hostnames = true
# Tag with the workspace name so resources are identifiable in the console
tags = {
Name = "payments-vpc-${terraform.workspace}"
Environment = terraform.workspace
ManagedBy = "terraform"
}
}
# RDS cluster scaled per environment using workspace-driven locals
resource "aws_rds_cluster_instance" "payments_db_instances" {
# Create the number of replicas specified for this workspace
count = local.config.replica_count
# Unique identifier includes workspace and instance index
identifier = "payments-db-${terraform.workspace}-${count.index}"
# Associate with the payments database cluster
cluster_identifier = aws_rds_cluster.payments_db.id
# Instance class varies by workspace — t3.medium in dev, r6g.2xlarge in prod
instance_class = local.config.instance_class
# Use the same Aurora PostgreSQL engine as the cluster
engine = aws_rds_cluster.payments_db.engine
# Tag for environment identification and cost tracking
tags = {
Environment = terraform.workspace
Service = "payments-db"
}
}
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│                    Workspace Approach                    │
│                                                          │
│    ┌───────────────────────────────────────────────┐     │
│    │           Shared HCL Configuration            │     │
│    │      main.tf │ variables.tf │ outputs.tf      │     │
│    └───────────────────────┬───────────────────────┘     │
│                            │                             │
│            ┌───────────────┼───────────────┐             │
│            │               │               │             │
│      ┌─────▼─────┐   ┌─────▼─────┐   ┌─────▼─────┐       │
│      │ Workspace │   │ Workspace │   │ Workspace │       │
│      │    dev    │   │  staging  │   │   prod    │       │
│      │           │   │           │   │           │       │
│      │ State: A  │   │ State: B  │   │ State: C  │       │
│      │ t3.medium │   │ r6g.large │   │  r6g.2xl  │       │
│      └───────────┘   └───────────┘   └───────────┘       │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                    Directory Approach                    │
│                                                          │
│    ┌────────────┐   ┌────────────┐   ┌────────────┐      │
│    │ envs/dev/  │   │envs/stage/ │   │ envs/prod/ │      │
│    │ main.tf    │   │ main.tf    │   │ main.tf    │      │
│    │ backend.tf │   │ backend.tf │   │ backend.tf │      │
│    │ State: X   │   │ State: Y   │   │ State: Z   │      │
│    │ AcctID: 111│   │ AcctID: 222│   │ AcctID: 333│      │
│    └─────┬──────┘   └─────┬──────┘   └─────┬──────┘      │
│          │                │                │             │
│          └────────────────┼────────────────┘             │
│                           │                              │
│                 ┌─────────▼─────────┐                    │
│                 │  Shared Modules   │                    │
│                 │  modules/vpc/     │                    │
│                 │  modules/rds/     │                    │
│                 │  modules/ecs/     │                    │
│                 └───────────────────┘                    │
└──────────────────────────────────────────────────────────┘
Quick Answer
The Azure DevOps Data Migration Tool (OpsImport) performs a high-fidelity migration of project collections from Azure DevOps Server to Azure DevOps Services, preserving full history of source code, work items, builds, and identities. The process involves running the migration tool to validate and generate import files, uploading the DACPAC database export to Azure blob storage, submitting the import request, and completing identity mapping to link on-premises Active Directory accounts to Azure AD identities.
Detailed Answer
Think of migrating from Azure DevOps Server to Services like moving an entire museum collection from one building to another across the country. You cannot simply load paintings into a truck: each piece must be cataloged, wrapped in appropriate protective materials, transported in climate-controlled vehicles, and reinstalled in the new building with the same spatial relationships, labels, and security settings. The catalog (identities) must be translated because room numbers (Active Directory SIDs) are different in the new building (Azure AD). The Data Migration Tool is the professional moving company that handles all of this complexity: it exports the database, ships it to Azure, imports it, and maps the old identity references to new ones.

The migration begins with preparation and validation. You must be running a supported version of Azure DevOps Server (2019 or later, with latest updates). The Data Migration Tool validates your collection against a set of readiness checks: unsupported field types, oversized attachments, deprecated features in use, identity conflicts, and data integrity issues. Common blocking issues include Git repositories exceeding 10GB, TFVC files exceeding 1GB, custom process templates with unsupported elements, and work items with more than 1000 revisions. The validation report identifies these issues and provides remediation guidance. Some issues require modifying data before migration; others simply generate warnings about features that will behave differently in Services.

The actual migration process has two modes: dry run (a full test import into a temporary organization that is deleted after a limited retention period) and production import. For the production import, you generate a DACPAC (Data-Tier Application Package) from your collection database using SQL Server tools. This DACPAC is a compressed export of the entire database schema and data. You upload the DACPAC to an Azure blob storage container that you create for the import, and you give the import service access to it through a SAS token. For large databases, this upload can take hours or days depending on bandwidth. Microsoft then processes the import on their infrastructure, applying schema transformations to adapt the on-premises database structure to the multi-tenant Services format. The import duration depends on database size: small collections (under 10GB) complete in hours, while large collections (500GB+) may take days.

Identity mapping is the most complex aspect of the migration. On-premises Azure DevOps Server uses Active Directory (AD) security identifiers (SIDs) for all permissions, work item assignments, and history attribution. Azure DevOps Services uses Azure Active Directory (Azure AD, now Microsoft Entra ID) identities. The migration tool generates an identity map file listing every AD identity referenced in the collection alongside the Azure AD identity it is expected to map to. For organizations that have already synchronized their AD to Azure AD using Azure AD Connect, many mappings are automatic. For identities that have no Azure AD equivalent (departed employees, service accounts, renamed accounts), you must decide whether to map them to existing Azure AD accounts or leave them as historical references. Incorrect identity mapping results in permission errors and misattributed history.

Post-migration validation and cutover planning are critical. During migration, the source Server is taken offline (or set to read-only) to prevent changes that would not be captured in the migration. After import completes, you validate that all repositories, work items, pipelines, and permissions exist correctly in Services. Pipelines need reconfiguration because Service Connections, agent pools, and variable groups do not migrate automatically. Self-hosted agents must be re-registered to point to the new Services organization. External integrations (webhooks, API clients, IDE connections) must be updated to use the new dev.azure.com URLs. Plan a cutover window that accounts for DNS changes, client reconfiguration, and team communication.

The production gotcha is underestimating the identity mapping effort and the pipeline reconfiguration work. Organizations with 10+ years of TFS history have thousands of unique identities from employees who have left, contractors, and service accounts. Each must be manually reviewed and mapped. Another common issue is TFVC workspace configuration: developers with local TFVC workspaces must recreate them pointing to the new Services instance. Pipeline migration requires recreating every Service Connection, variable group, and agent pool in the new organization, then updating pipeline YAML to reference the new resource names. Budget 2-4 weeks of effort beyond the actual data migration for this reconfiguration work.
Code Example
# Azure DevOps Data Migration: Complete Process
# Windows commands (Migrator.exe, SqlPackage.exe, az with C:\ paths) are written for PowerShell,
# where a trailing backtick continues the line. The agent script at the end runs on Linux.

# Step 1: Install the Data Migration Tool
# Download from https://www.microsoft.com/download/details.aspx?id=54274
# Extract to C:\DataMigration

# Step 2: Run validation against collection
Migrator.exe validate /collection:http://tfs.internal:8080/tfs/DefaultCollection `
  /tenantDomainName:bank-corp.onmicrosoft.com `
  /outputPath:C:\Migration\Validation
# Review validation results
# C:\Migration\Validation\Results.json contains:
# - Blocking issues (must fix before migration)
# - Warnings (features that behave differently)
# - Identity mapping file template

# Step 3: Fix blocking issues
# Example: Repository too large
# git filter-branch to remove large files
# Or: Move large binaries to Git LFS before migration

# Step 4: Generate the import specification (import.json) and the identity map
# CUS (Central US) is an example region code; use the code for your target region
Migrator.exe prepare /collection:http://tfs.internal:8080/tfs/DefaultCollection `
  /tenantDomainName:bank-corp.onmicrosoft.com `
  /region:CUS
# Identity map log, simplified excerpt:
# Source (AD),Target (Azure AD),Status
# BANK\ramesh.a,[email protected],Matched
# BANK\john.former,[email protected],NotFound ← must resolve
# BANK\svc-build,[email protected],ServiceAccount

# Step 5: Generate DACPAC export
# Use SqlPackage.exe from SQL Server tools
# Extract (not Export) produces a .dacpac; ExtractAllTableData includes the data
SqlPackage.exe /Action:Extract `
  /SourceServerName:"sql-server.internal" `
  /SourceDatabaseName:"Tfs_DefaultCollection" `
  /TargetFile:"C:\Migration\DefaultCollection.dacpac" `
  /p:ExtractAllTableData=true `
  /p:IgnoreUserLoginMappings=true `
  /p:IgnorePermissions=true `
  /p:Storage=File

# Step 6: Upload DACPAC to Azure blob storage
# Use a SAS token generated for the storage container you created for the import
az storage blob upload `
  --account-name importstorageaccount `
  --container-name import `
  --file C:\Migration\DefaultCollection.dacpac `
  --name DefaultCollection.dacpac `
  --sas-token "?sv=2021-06-08&ss=b&srt=co&sp=rwdlac..."

# Step 7: Submit import request
# import.json (generated by prepare; path here is illustrative) carries the target
# organization name (bank-corp), the DACPAC file name, the container SAS URL,
# and the import type (DryRun or ProductionRun)
Migrator.exe import /importFile:C:\Migration\import.json

# Step 8: Monitor import progress
# Check status at: https://dev.azure.com/bank-corp/_admin/_import
# Status transitions: Queued → InProgress → Completed/Failed

# Step 9: Post-migration validation
# Verify repos
az repos list --org https://dev.azure.com/bank-corp --project payments-platform
# Verify work items preserved
az boards query --org https://dev.azure.com/bank-corp `
  --wiql "SELECT [System.Id] FROM workitems WHERE [System.TeamProject] = 'payments-platform'" `
  --output table

# Step 10: Reconfigure pipelines (these don't auto-migrate)
# Create service connections
az devops service-endpoint create `
  --service-endpoint-configuration aws-connection.json `
  --org https://dev.azure.com/bank-corp `
  --project payments-platform
# Re-register self-hosted agents (run on the Linux agent host)
./config.sh --unattended \
  --url https://dev.azure.com/bank-corp \
  --auth pat --token $NEW_PAT \
  --pool "Default" --agent "build-01" --replace
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│          TFS/Server to Services Migration Flow           │
│                                                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐   │
│  │1.Validate│→ │2.Identity│→ │3.Export  │→ │4.Upload │   │
│  │  (fix    │  │   Map    │  │  DACPAC  │  │  to Blob│   │
│  │  issues) │  │  (AD→AAD)│  │  (SQL)   │  │  Storage│   │
│  └──────────┘  └──────────┘  └──────────┘  └────┬────┘   │
│                                                 │        │
│                                                 ↓        │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐   │
│  │8.Cutover │← │7.Reconfig│← │6.Validate│← │5.Import │   │
│  │  (DNS,   │  │  (Service│  │  (repos, │  │  (MSFT  │   │
│  │  clients)│  │   Conns, │  │   WIs,   │  │  process│   │
│  │          │  │   agents)│  │  history)│  │   queue)│   │
│  └──────────┘  └──────────┘  └──────────┘  └─────────┘   │
│                                                          │
│  What migrates:              What does NOT migrate:      │
│  ├── Source code + hist      ├── Service Connections     │
│  ├── Work items + revs       ├── Agent pools/agents      │
│  ├── Git branches/tags       ├── Variable groups         │
│  ├── Build history           ├── Extensions (reinstall)  │
│  └── Test results            └── Webhooks/integrations   │
└──────────────────────────────────────────────────────────┘
Quick Answer
Classic pipelines use a GUI-based editor stored in Azure DevOps metadata, while YAML pipelines define CI/CD as code in a version-controlled file. YAML enables pull request reviews, template reuse, and multi-stage deployments in a single file.
Detailed Answer
Imagine two ways to give driving directions: a voice-guided GPS (Classic) where you click through turns on a screen, versus a written route card (YAML) that you can photocopy, annotate, version, and hand to anyone. Both get you to the destination, but the written card travels with the project and can be peer-reviewed before the trip. Classic pipelines were the original Azure DevOps experience. Build definitions use a visual task editor where you drag-and-drop tasks like NuGet Restore, MSBuild, or Docker Build. Release definitions add environments with deployment gates, approvals, and artifact triggers. The configuration is stored as JSON metadata inside Azure DevOps, not in your repository. This means pipeline changes do not go through pull requests, cannot be easily diffed, and are invisible in your Git history. YAML pipelines store the entire pipeline definition in an azure-pipelines.yml file committed alongside your source code. Every change to the pipeline goes through the same pull request workflow as application code. YAML supports multi-stage pipelines (build, test, deploy to staging, deploy to production) in a single file with conditional execution, template references, and environment approvals. The extends keyword and template repositories enable centralized governance across hundreds of pipelines. Under the hood, both pipeline types use the same agent infrastructure and task ecosystem. A Classic build task like DotNetCoreCLI@2 is the same task referenced in YAML as - task: DotNetCoreCLI@2. The difference is purely in how the orchestration is defined and stored. In production, most organizations are migrating from Classic to YAML because Microsoft has signaled Classic pipelines will not receive new features. The gotcha is that Classic Release pipelines have some features (like release gates with Azure Monitor integration and graphical deployment visualization) that require extra YAML configuration using environments and checks. 
Teams migrating often underestimate the effort to replicate approval workflows, variable scoping, and artifact filtering that Classic provided through the GUI.
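The centralized governance that the extends keyword provides can be sketched as follows. This is a minimal illustration, not a definitive setup: the template repository name (platform/pipeline-templates), the template file (secure-build.yml), and the buildProject parameter are all hypothetical and would be replaced with your organization's own.

```yaml
# azure-pipelines.yml in the application repo (sketch; names are hypothetical)
resources:
  repositories:
    - repository: templates                  # alias referenced after the @ below
      type: git                              # Azure Repos Git in the same organization
      name: platform/pipeline-templates      # hypothetical project/repository
      ref: refs/heads/main
trigger:
  - main
extends:
  template: secure-build.yml@templates       # hypothetical template file
  parameters:
    buildProject: 'src/payments-api/*.csproj'
---
# secure-build.yml in the template repo (sketch)
parameters:
  - name: buildProject
    type: string
stages:
  - stage: Build
    jobs:
      - job: BuildJob
        pool:
          vmImage: 'ubuntu-latest'
        steps:
          - task: DotNetCoreCLI@2
            displayName: 'Build'
            inputs:
              command: 'build'
              projects: ${{ parameters.buildProject }}
              arguments: '--configuration Release'
```

Because the application pipeline only supplies parameters, the stages and steps themselves stay under the control of whoever owns the template repository, and changes to them go through that repository's pull requests.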
Code Example
# Classic pipeline equivalent in YAML (azure-pipelines.yml)
# This replaces a Classic Build + Release definition pair
trigger:
  branches:
    include:
      - main
      - release/*
pool:
  vmImage: 'ubuntu-latest'
stages:
  - stage: Build
    displayName: 'Build payments-api'
    jobs:
      - job: BuildJob
        steps:
          - task: DotNetCoreCLI@2
            displayName: 'Restore packages'
            inputs:
              command: 'restore'
              projects: 'src/payments-api/*.csproj'
          - task: DotNetCoreCLI@2
            displayName: 'Build solution'
            inputs:
              command: 'build'
              projects: 'src/payments-api/*.csproj'
              arguments: '--configuration Release'
          - task: DotNetCoreCLI@2
            displayName: 'Run unit tests'
            inputs:
              command: 'test'
              projects: 'tests/**/*.csproj'
          - task: PublishBuildArtifacts@1
            inputs:
              PathtoPublish: '$(Build.ArtifactStagingDirectory)'
              ArtifactName: 'drop'
  - stage: DeployStaging
    displayName: 'Deploy to Staging'
    dependsOn: Build
    jobs:
      - deployment: DeployToStaging
        environment: 'payments-staging'
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo 'Deploying to staging environment'
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│                     Classic Pipeline                     │
│  ┌──────────┐    ┌──────────────┐    ┌─────────────┐     │
│  │  Build   │───→│ Release Def  │───→│ Environment │     │
│  │  (GUI)   │    │    (GUI)     │    │    (GUI)    │     │
│  └──────────┘    └──────────────┘    └─────────────┘     │
│  Stored in Azure DevOps metadata (not in Git)            │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│                      YAML Pipeline                       │
│  ┌────────────────────────────────────────────────────┐  │
│  │ azure-pipelines.yml (in Git repo)                  │  │
│  │ stages: Build → Test → Deploy Staging → Deploy Prod│  │
│  └────────────────────────────────────────────────────┘  │
│  Versioned │ PR-reviewed │ Templated │ Multi-stage       │
└──────────────────────────────────────────────────────────┘
Quick Answer
Create an azure-pipelines.yml file in your repo root defining a trigger, agent pool, and build steps. Azure DevOps detects the file and offers to create the pipeline, or you can use az pipelines create to wire it up via CLI.
Detailed Answer
Think of a YAML pipeline file like a recipe card you tape to your refrigerator. Anyone who opens the fridge (clones the repo) immediately knows exactly how to cook the dish (build the project) without asking the chef (searching through a GUI for hidden configuration). The minimum viable YAML pipeline has three elements: a trigger that specifies which branches activate the pipeline, a pool that defines which agent runs the work, and steps that list the actual build commands. For a .NET project, the steps typically include restore, build, test, and publish. For Node.js, they include npm install, lint, test, and build. Azure DevOps provides starter templates when you create a new pipeline through the portal, automatically detecting your project type. When the pipeline runs, Azure DevOps provisions a fresh agent from the specified pool. Microsoft-hosted agents come pre-installed with common SDKs (.NET, Node.js, Python, Java, Go) and tools (Docker, kubectl, Terraform). The agent clones your repository, executes each step sequentially, and reports results back. Build artifacts like compiled binaries or Docker images can be published for downstream stages. In production, even a basic pipeline should include caching for package restore (to speed up builds from 5 minutes to 90 seconds), test result publishing (so failures appear in the PR UI), and branch filters (to avoid running on documentation-only branches). The variables section externalizes configuration like SDK versions so upgrades require changing one line. The most common gotcha for beginners is indentation errors in YAML causing cryptic parse failures. Azure DevOps provides a YAML editor with IntelliSense in the portal, but many teams prefer editing locally with the Azure Pipelines VS Code extension that provides schema validation. 
Another frequent issue is the agent not having a required tool version — use the UseDotNet@2 or NodeTool@0 tasks to explicitly install the version you need rather than relying on whatever is pre-installed.
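The caching and branch-filter recommendations above can be sketched roughly as follows. This is a sketch under stated assumptions: it assumes the projects commit packages.lock.json files (the cache key is derived from them), and the docs/* branch prefix is a hypothetical naming convention for documentation-only branches.

```yaml
# Sketch only: assumes packages.lock.json files are committed to the repo
trigger:
  branches:
    include:
      - main
      - feature/*
    exclude:
      - docs/*        # hypothetical prefix for documentation-only branches
pool:
  vmImage: 'ubuntu-latest'
variables:
  # Redirect the NuGet package folder into the workspace so Cache@2 can save it
  NUGET_PACKAGES: $(Pipeline.Workspace)/.nuget/packages
steps:
  - task: Cache@2
    displayName: 'Cache NuGet packages'
    inputs:
      # Key changes whenever a lock file changes, which invalidates the cache
      key: 'nuget | "$(Agent.OS)" | **/packages.lock.json,!**/bin/**,!**/obj/**'
      restoreKeys: |
        nuget | "$(Agent.OS)"
        nuget
      path: $(NUGET_PACKAGES)
  - task: DotNetCoreCLI@2
    displayName: 'Restore NuGet packages'
    inputs:
      command: 'restore'
      projects: 'src/payments-api/**/*.csproj'
```

The cache step must run before the restore step; on a cache hit the restore finds the packages already on disk, and on a miss the folder is saved at the end of the job for later runs.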
Code Example
# azure-pipelines.yml: Basic .NET build pipeline for payments-api
trigger:
  - main
  - feature/*
pool:
  vmImage: 'ubuntu-latest'
variables:
  buildConfiguration: 'Release'
  dotnetVersion: '8.0.x'
steps:
  - task: UseDotNet@2
    displayName: 'Install .NET SDK'
    inputs:
      packageType: 'sdk'
      version: '$(dotnetVersion)'
  - task: DotNetCoreCLI@2
    displayName: 'Restore NuGet packages'
    inputs:
      command: 'restore'
      projects: 'src/payments-api/**/*.csproj'
      feedsToUse: 'select'
      vstsFeed: 'contoso-internal-feed'
  - task: DotNetCoreCLI@2
    displayName: 'Build payments-api'
    inputs:
      command: 'build'
      projects: 'src/payments-api/**/*.csproj'
      arguments: '--configuration $(buildConfiguration) --no-restore'
  - task: DotNetCoreCLI@2
    displayName: 'Run unit tests'
    inputs:
      command: 'test'
      projects: 'tests/**/*.csproj'
      arguments: '--configuration $(buildConfiguration) --collect:"XPlat Code Coverage"'
  - task: PublishTestResults@2
    displayName: 'Publish test results'
    inputs:
      testResultsFormat: 'VSTest'
      testResultsFiles: '**/*.trx'
---
# azure-pipelines.yml: Basic Node.js pipeline for fraud-detector
trigger:
  - main
pool:
  vmImage: 'ubuntu-latest'
steps:
  - task: NodeTool@0
    displayName: 'Use Node.js 20.x'
    inputs:
      versionSpec: '20.x'
  - script: npm ci
    displayName: 'Install dependencies (clean)'
  - script: npm run lint
    displayName: 'Run ESLint'
  - script: npm run test -- --coverage
    displayName: 'Run Jest tests with coverage'
  - script: npm run build
    displayName: 'Build production bundle'
◈ Architecture Diagram
┌─────────────────────────────────────────────────┐
│             Pipeline Execution Flow             │
│                                                 │
│  ┌─────────┐   ┌─────────┐   ┌──────────────┐   │
│  │ Trigger │──→│  Agent  │──→│  Clone Repo  │   │
│  │ (push)  │   │ (pool)  │   │  (checkout)  │   │
│  └─────────┘   └─────────┘   └──────┬───────┘   │
│                                     │           │
│                                     ↓           │
│  ┌─────────┐   ┌─────────┐   ┌──────────────┐   │
│  │Publish  │←──│  Test   │←──│    Build     │   │
│  │Artifacts│   │   Run   │   │   Compile    │   │
│  └─────────┘   └─────────┘   └──────────────┘   │
└─────────────────────────────────────────────────┘
Quick Answer
Multi-stage YAML pipelines define build, test, and deploy stages in a single azure-pipelines.yml file. Approval gates are configured on Azure DevOps Environments, requiring manual approval or automated checks before the deployment job targeting that environment can proceed.
Detailed Answer
Think of a relay race with checkpoints. Each runner (stage) must complete their leg before the next starts, and at certain checkpoints (environments), a judge (approver) must wave the green flag before the next runner can go. The entire race plan is written down in advance (YAML), but human judgment controls progression at critical points. Multi-stage YAML pipelines replaced the old Classic release pipelines with a code-as-configuration approach. A single YAML file defines multiple stages — typically Build, Test, Deploy-Dev, Deploy-Staging, Deploy-Prod. Each stage contains jobs, and each job contains steps. Stages run sequentially by default but can be configured with dependsOn to run in parallel or in custom orders. The deployment jobs reference Azure DevOps Environments. Environments are the key to approval gates. You create environments (dev, staging, production) in Azure DevOps under Pipelines > Environments. On each environment, you configure Approvals and Checks: manual approvals (specific users or groups must approve), business hours check (only deploy during work hours), branch control (only allow deployments from the main branch), and exclusive lock (prevent concurrent deployments). When a pipeline stage targets an environment with approvals, it pauses and notifies the approvers. At production scale, teams configure progressively stricter gates. Dev deploys automatically on every PR merge. Staging requires one approval from the QA lead. Production requires two approvals from different teams (dev lead + ops lead), business hours enforcement, and a branch control check that only allows the main branch. Templates extract common stage definitions into reusable files, so 50 pipelines share the same deploy-to-production stage with identical gates. The non-obvious gotcha is that environment approvals apply to the environment resource, not the pipeline. If you rename or recreate an environment, you lose all configured approvals and must set them up again. 
Also, approval timeouts default to 30 days — if nobody approves within that window, the pipeline run expires. Teams should set shorter timeouts (24-48 hours) and configure approval notifications to avoid stale pipeline runs accumulating.
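The use of dependsOn to run stages in parallel can be sketched as follows. It is a minimal illustration: the UnitTests and SecurityScan stages and their echo steps are hypothetical placeholders, and the production environment is assumed to already exist with its approvals configured under Pipelines > Environments.

```yaml
# Sketch: fan-out after Build, then fan-in before the gated production deploy
trigger:
  - main
pool:
  vmImage: ubuntu-latest
stages:
  - stage: Build
    jobs:
      - job: BuildApp
        steps:
          - script: echo "Building"
  - stage: UnitTests                 # hypothetical stage
    dependsOn: Build                 # same dependency as SecurityScan, so both can run in parallel
    jobs:
      - job: RunUnitTests
        steps:
          - script: echo "Running unit tests"
  - stage: SecurityScan              # hypothetical stage
    dependsOn: Build
    jobs:
      - job: RunScan
        steps:
          - script: echo "Running security scan"
  - stage: DeployProd
    dependsOn:                       # waits for both parallel stages to succeed
      - UnitTests
      - SecurityScan
    jobs:
      - deployment: DeployToProd
        environment: production      # approvals and checks live on this environment, not in YAML
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to production"
```

Nothing in the file mentions approvers. The pause before DeployProd comes entirely from the checks attached to the production environment, which is why recreating that environment silently removes the gate.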
Code Example
# azure-pipelines.yml: Multi-stage pipeline with approval gates
trigger:
  branches:
    include: [main]  # Only trigger on main branch
stages:
  - stage: Build
    jobs:
      - job: BuildApp
        pool:
          vmImage: ubuntu-latest  # Microsoft-hosted agent
        steps:
          - script: dotnet build --configuration Release  # Build the .NET application
          - task: PublishBuildArtifacts@1  # Publish artifacts for deploy stages
            inputs:
              pathtoPublish: $(Build.ArtifactStagingDirectory)
              artifactName: drop
  - stage: DeployDev
    dependsOn: Build  # Runs after Build completes
    jobs:
      - deployment: DeployToDev
        environment: dev  # No approvals configured, so it deploys automatically
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to dev"  # Deploy steps here
  - stage: DeployProd
    dependsOn: DeployDev  # Runs after dev succeeds
    jobs:
      - deployment: DeployToProd
        environment: production  # Has manual approval configured
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to production"  # Deploy steps here
Quick Answer
Upgrading TFS to Azure DevOps Server involves backing up databases, verifying hardware requirements, running the installer which performs an in-place upgrade of the application tier and database schema, then validating all collections, build agents, and extensions. The upgrade supports skipping versions (TFS 2018 directly to Server 2022) but requires careful pre-upgrade validation and rollback planning.
Detailed Answer
Think of upgrading TFS to Azure DevOps Server like renovating a house while people are still living in it. You need to photograph every room beforehand (backup), verify the new appliances fit the existing plumbing and wiring (compatibility checks), schedule the renovation during a weekend when disruption is minimal (maintenance window), perform the work (installer), verify everything works (validation), and keep the original photos in case you need to restore something (rollback plan). Skipping a version is like jumping from a 1990s kitchen directly to a 2024 design: possible because the renovation contractor handles all intermediate structural changes, but the transformation is more dramatic and testing more critical. The pre-upgrade phase is the most important and most frequently rushed. Begin by documenting your current environment: TFS version and update level, SQL Server version, Windows Server version, number of project collections, total database size, installed extensions, configured build agents, and any custom plugins or event handlers. Check the compatibility matrix: Azure DevOps Server 2022 supports SQL Server 2017, 2019, and 2022 and runs on Windows Server 2019 or 2022. If your current SQL Server version is not supported, you must upgrade SQL Server first, which is a separate project with its own planning. Run the TFS upgrade readiness tool (included in the Azure DevOps Server installer) against your databases to identify any blocking issues like unsupported collation settings, deprecated features, or corrupted work item data. The backup strategy must be comprehensive and tested. Back up all collection databases, the configuration database, the warehouse database (if used), the reporting databases, and the file system cache. Do not rely solely on SQL Server backups: also export your build definitions, release definitions, extension configurations, and agent pool settings using the Azure DevOps CLI or REST API. 
These exports serve as documentation even if the upgrade succeeds, because they provide a reference point for validating that everything migrated correctly. Most importantly, test your restore process in a separate environment before the real upgrade. A backup you have never restored is a backup you cannot trust. The upgrade itself is surprisingly straightforward when prerequisites are met. Download the Azure DevOps Server 2022 installer, run it on the application tier server, and it detects the existing TFS installation. The wizard offers an upgrade path, validates the environment, and then performs the schema migration on each collection database. For large databases (500GB+), the schema migration can take several hours. The installer handles all intermediate version migrations: if you are upgrading from TFS 2018, it applies the schema changes for 2018→2019→2020→2022 sequentially. During this time, the server is offline. Plan your maintenance window based on database size: roughly 1 hour per 100GB is a conservative estimate, though actual times depend on SQL Server hardware. Post-upgrade validation is where most teams cut corners and regret it later. Verify that all project collections are online and accessible. Check that build agents reconnect (agents from TFS 2018 may need updating). Validate that XAML build definitions were preserved as read-only references. Test that Git and TFVC repositories are accessible with full history. Verify that work item queries, dashboards, and extensions function correctly. Run a sample build pipeline to confirm agents execute properly. Check that notification subscriptions still deliver emails. If you use the reporting warehouse or SharePoint integration, verify those connections. The production gotcha is the agents and extensions gap. Self-hosted build agents from TFS 2018 use an older agent version that may not be compatible with Azure DevOps Server 2022 features. 
You will likely need to deploy new agents or update existing ones. Extensions installed from the Visual Studio Marketplace may have version compatibility issues: some extensions that worked on TFS 2018 have been deprecated or replaced. Test all critical extensions in a pre-production upgrade before touching production. Another common issue is authentication: if you switch from NTLM to Kerberos or add Azure AD authentication during the upgrade, all existing PATs and service account connections need reconfiguration.
Code Example
# Pre-upgrade validation and backup procedure
# Step 1: Document current environment
Invoke-RestMethod -Method GET -Uri "http://tfs.internal:8080/tfs/_apis/projectCollections" `
  -Headers @{Authorization="Basic $base64Pat"}
# Check TFS version
curl -u user:pat http://tfs.internal:8080/tfs/_apis/connectionData
# Look for: "releaseType":"Release", "version":"16.153.x" (TFS 2018)
# Step 2: Verify SQL Server compatibility
# Azure DevOps Server 2022 supports SQL Server 2017, 2019, and 2022
SELECT @@VERSION -- Check current SQL version
SELECT name, collation_name FROM sys.databases -- Verify collation
# Step 3: Backup all databases
-- SQL Server backup script for all TFS databases
BACKUP DATABASE [Tfs_Configuration]
TO DISK = 'E:\Backups\PreUpgrade\Tfs_Configuration.bak'
WITH COMPRESSION, CHECKSUM;
BACKUP DATABASE [Tfs_DefaultCollection]
TO DISK = 'E:\Backups\PreUpgrade\Tfs_DefaultCollection.bak'
WITH COMPRESSION, CHECKSUM;
BACKUP DATABASE [Tfs_Warehouse]
TO DISK = 'E:\Backups\PreUpgrade\Tfs_Warehouse.bak'
WITH COMPRESSION, CHECKSUM;
# Step 4: Export build definitions (for documentation)
az pipelines list --org http://tfs.internal:8080/tfs/DefaultCollection \
--project payments-platform --output json > build-definitions-backup.json
# Step 5: Run upgrade readiness check
# Launch Azure DevOps Server 2022 installer
# Select: "Upgrade" → "Pre-production upgrade validation"
# Or use command line:
# TfsConfig.exe preUpgrade /sqlInstance:SQLServer\Instance
# Step 6: Perform the upgrade (during maintenance window)
# Run installer on Application Tier server
# Installer detects existing TFS 2018/2019 installation
# Follow wizard: Upgrade → Select configuration database → Validate → Upgrade
# Step 7: Post-upgrade validation
# Verify collections are online
curl -u user:pat https://devops.internal:8080/tfs/_apis/projectCollections
# Verify agent pools
az pipelines agent list --pool-id 1 \
--org https://devops.internal:8080/tfs/DefaultCollection
# Update self-hosted agents to latest version
./config.sh --unattended \
--url https://devops.internal:8080/tfs/DefaultCollection \
--auth negotiate \
--pool "Default" \
--agent "build-agent-01" \
--replace
# Verify Git repositories are accessible
git clone https://devops.internal:8080/tfs/DefaultCollection/Project/_git/repo
# Step 8: Rollback procedure (if upgrade fails)
# Stop Azure DevOps Server services
# Restore SQL databases from pre-upgrade backups
# Reinstall previous TFS version pointing to restored databases
# Verify functionality
# Upgrade path reference:
# TFS 2018 → Azure DevOps Server 2022 (direct, skips 2019/2020)
# TFS 2017 → Azure DevOps Server 2022 (direct)
# TFS 2015 → Azure DevOps Server 2022 (direct; TFS 2015 or newer can upgrade in one hop)
# TFS 2013 → TFS 2018 → Azure DevOps Server 2022 (two hops)
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│         TFS to Azure DevOps Server Upgrade Path          │
│                                                          │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────┐   │
│  │ TFS 2015 │──→│ TFS 2018 │──→│  Server  │──→│Server│   │
│  └──────────┘   └────┬─────┘   │   2020   │   │ 2022 │   │
│                      │         └──────────┘   └──────┘   │
│                      │              ↑              ↑     │
│                      └──────────────┴──────────────┘     │
│                        (Direct upgrade supported)        │
│                                                          │
│  Upgrade Process:                                        │
│ ┌──────────┐  ┌────────────┐  ┌───────────┐  ┌─────────┐ │
│ │1. Backup │→ │2. Validate │→ │3. Upgrade │→ │4. Verify│ │
│ │• SQL DBs │  │• SQL ver   │  │• Run      │  │• Repos  │ │
│ │• Config  │  │• Win ver   │  │  installer│  │• Agents │ │
│ │• Agents  │  │• Disk      │  │• Schema   │  │• Builds │ │
│ │• Export  │  │• Extensions│  │  migrate  │  │• Exts   │ │
│ └──────────┘  └────────────┘  └───────────┘  └─────────┘ │
│                                                          │
│  Rollback: Restore SQL backups + reinstall previous ver  │
└──────────────────────────────────────────────────────────┘