Quick Answer
Configure Terraform Enterprise workspaces with scheduled plan-only runs (e.g., nightly) that detect differences between actual infrastructure and the Terraform state. Alert on drift via webhook notifications, categorize drift by severity, and either auto-remediate safe drifts or create tickets for manual review.
Detailed Answer
Think of drift detection like a nightly security guard doing rounds. The guard has a checklist of how every door, window, and safe should look. If something has changed (a window left open, a safe combination altered, a new lock installed), the guard reports it. Drift detection in Terraform works the same way: scheduled plan runs compare what actually exists in AWS against what Terraform expects, and any discrepancy is flagged for investigation.

Infrastructure drift occurs when the actual state of cloud resources diverges from the Terraform-declared state. This happens through manual console changes (someone modifies a security group via the AWS console), changes by other tools (an automation script modifies a resource that Terraform also manages), auto-scaling events that modify resource attributes, and AWS service updates that change default behaviors. In a banking environment, drift is a compliance risk: if your Terraform code declares that an RDS instance has encryption enabled but someone disables it through the console, your compliance posture is degraded and your Terraform state does not reflect reality.

Terraform Enterprise supports plan-only runs, and its health assessments feature can check enabled workspaces for drift on a recurring basis. When you need a fixed cadence, typically nightly for production workspaces and weekly for non-production, trigger plan-only runs through the runs API from an external scheduler. The plan refreshes state through provider API calls and compares the result against the current Terraform configuration. If the plan detects changes (resources to update, create, or destroy), drift has occurred. TFE marks the run as 'planned and finished' with a non-empty plan, and you can configure webhook notifications to alert your team via Slack, PagerDuty, or a custom drift-tracking system.

At enterprise scale, not all drift is equal. A changed tag is low-severity drift that might be auto-remediated. A modified security group rule is high-severity drift that requires immediate investigation, because someone may have opened a port that violates PCI-DSS. A deleted resource is critical drift that needs urgent attention. Build a drift classification system: the webhook from TFE sends the plan summary to a Lambda function or custom service that parses the plan output, categorizes each change by resource type and attribute, assigns a severity level, and routes the notification appropriately. Low-severity drift creates a Jira ticket for the next sprint. High-severity drift pages the security team. Critical drift triggers an incident response.

For drift remediation, there are two approaches. Auto-remediation configures TFE to automatically apply the plan when drift is detected, restoring infrastructure to the declared state. This is appropriate for low-risk drifts like tag changes or description updates, but dangerous for high-risk resources: auto-applying a plan that wants to recreate an RDS instance would cause downtime. Selective auto-remediation uses Sentinel policies to evaluate the drift plan: if the only changes are to tags and descriptions, auto-apply; if the plan includes any destroy or replace actions, block and alert. Manual remediation requires a human to review the drift, determine whether the Terraform code or the infrastructure should be updated, and either apply the plan or update the code to match the new reality.

The biggest gotcha is drift detection generating noise that teams ignore. If your scheduled plans consistently show drift from resources that Terraform partially manages (like ASG instance counts that change with auto-scaling), the team learns to dismiss all drift alerts. Use lifecycle ignore_changes blocks in Terraform for attributes that are expected to drift (like the ASG desired_capacity), and ensure your scheduled plans only flag genuine unauthorized changes.

Another gotcha is API rate limiting: running terraform plan across 200 workspaces simultaneously hammers the AWS API. Stagger your scheduled plans across the night, and use workspace tags to group and schedule them in batches.

Finally, drift detection only catches drift in resources Terraform manages; resources created manually outside of Terraform are invisible to it. Complement TFE drift detection with AWS Config rules that detect unmanaged resources.
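The selective auto-remediation rule described above (auto-apply only when nothing but tags and descriptions changed) can be sketched in Python against the JSON produced by `terraform show -json`. In TFE this decision would normally live in a Sentinel policy; the function names and the `SAFE_ATTRIBUTES` set here are assumptions made for illustration, and the sketch only compares top-level attributes.

```python
# Attribute names treated as safe to auto-remediate (an assumption for this sketch).
SAFE_ATTRIBUTES = {"tags", "tags_all", "description"}


def changed_attributes(before, after):
    """Return the top-level attribute names whose values differ."""
    before = before or {}
    after = after or {}
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


def can_auto_apply(plan):
    """True only if every change is an in-place update limited to safe attributes."""
    for rc in plan.get("resource_changes", []):
        change = rc["change"]
        actions = change["actions"]
        if actions == ["no-op"]:
            continue
        if actions != ["update"]:
            # Creates, deletes, and replacements always need a human.
            return False
        diff = changed_attributes(change.get("before"), change.get("after"))
        if not diff <= SAFE_ATTRIBUTES:
            return False
    return True


# Hypothetical plan fragments for illustration.
tag_only_plan = {"resource_changes": [{
    "type": "aws_instance",
    "change": {"actions": ["update"],
               "before": {"ami": "ami-1", "tags": {"team": "a"}},
               "after": {"ami": "ami-1", "tags": {"team": "b"}}},
}]}
replace_plan = {"resource_changes": [{
    "type": "aws_db_instance",
    "change": {"actions": ["delete", "create"], "before": {}, "after": {}},
}]}
```

Anything this check rejects falls through to the alerting path, so an unexpected action type fails closed instead of being applied.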
Code Example
# TFE workspace with drift detection enabled
resource "tfe_workspace" "payments_infra_prod" {
  name              = "payments-infra-production"
  organization      = "bank-platform"
  terraform_version = "1.7.0"
  auto_apply        = false

  # Health assessments run TFE's built-in periodic drift detection
  assessments_enabled = true

  vcs_repo {
    identifier     = "bank/payments-infrastructure"
    branch         = "main"
    oauth_token_id = var.github_oauth_id
  }
}

# Note: the tfe provider does not appear to offer a cron-style run-schedule
# resource. For a fixed nightly window (e.g., 2 AM ET / 7 AM UTC), have an
# external scheduler request a plan-only run through the TFE runs API.

# Webhook notification for drift alerts
resource "tfe_notification_configuration" "drift_alert" {
  name             = "drift-detection-alert"
  enabled          = true
  workspace_id     = tfe_workspace.payments_infra_prod.id
  destination_type = "generic" # Custom webhook
  url              = "https://drift-handler.bank.internal/webhook"
  triggers = [
    "assessment:drifted",  # Health assessment detected drift
    "run:needs_attention", # Run is waiting on human action
    "run:errored",         # Plan failed (possible API issue)
  ]
}
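
If the nightly cadence is driven by an external scheduler, each run can be requested through the TFE runs API (`POST /api/v2/runs`) with the `plan-only` attribute set. The following is a minimal Python sketch that only builds the request body and spreads workspaces across a time window; the function names and workspace IDs are hypothetical, and sending the request (HTTP client, bearer token) is left out.

```python
def build_plan_only_run(workspace_id, message="Scheduled drift check"):
    """Build the JSON:API body for a plan-only run on one workspace."""
    return {
        "data": {
            "type": "runs",
            "attributes": {"plan-only": True, "message": message},
            "relationships": {
                "workspace": {"data": {"type": "workspaces", "id": workspace_id}}
            },
        }
    }


def stagger_offsets(workspace_ids, window_minutes):
    """Spread workspaces evenly across the window.

    Returns {workspace_id: minutes after the window start}.
    """
    count = len(workspace_ids)
    if count == 0:
        return {}
    step = window_minutes / count
    return {ws: int(i * step) for i, ws in enumerate(workspace_ids)}


# Hypothetical workspace IDs spread over a four-hour overnight window.
offsets = stagger_offsets(["ws-a", "ws-b", "ws-c", "ws-d"], 240)
payload = build_plan_only_run("ws-a")
```

Spacing the requests this way keeps hundreds of workspaces from refreshing against the AWS API at the same moment.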
---
# Drift classification Lambda (triggered by TFE webhook)
# drift-handler/handler.py

# Numeric ranks for comparing severities. Comparing the strings directly
# would sort alphabetically ('low' > 'high' > 'critical').
SEVERITY_RANK = {'low': 0, 'high': 1, 'critical': 2}

# Resource types whose modification is treated as security-relevant
SECURITY_RESOURCE_TYPES = {
    'aws_security_group_rule',
    'aws_iam_policy',
    'aws_iam_role_policy',
    'aws_kms_key',
    'aws_s3_bucket_policy',
}


def classify_drift(event):
    """Classify drift severity based on resource type and change type."""
    plan_summary = event.get('plan_summary', {})
    changes = plan_summary.get('resource_changes', [])
    severity = 'low'
    findings = []
    for change in changes:
        resource_type = change['type']
        actions = change['actions']
        # Critical: any destroy (a replace shows up as delete plus create)
        if 'delete' in actions or 'replace' in actions:
            level = 'critical'
            findings.append(f"CRITICAL: {resource_type} will be {actions}")
        # High: security-related resources modified
        elif resource_type in SECURITY_RESOURCE_TYPES:
            level = 'high'
            findings.append(f"HIGH: {resource_type} drifted")
        # Low: tags, descriptions, non-functional changes
        else:
            level = 'low'
            findings.append(f"LOW: {resource_type} drifted")
        # Keep the highest severity seen so far
        if SEVERITY_RANK[level] > SEVERITY_RANK[severity]:
            severity = level
    return severity, findings


def route_alert(severity, findings, workspace_name):
    """Route drift alerts based on severity.

    pagerduty_alert, slack_alert, create_jira_incident, and
    create_jira_ticket are integration helpers defined elsewhere.
    """
    if severity == 'critical':
        # Page security team immediately
        pagerduty_alert(f"CRITICAL drift in {workspace_name}", findings)
        create_jira_incident(workspace_name, findings)
    elif severity == 'high':
        # Slack alert to security channel
        slack_alert('#security-ops', workspace_name, findings)
        create_jira_ticket('HIGH', workspace_name, findings)
    else:
        # Low priority: create ticket for next sprint
        create_jira_ticket('LOW', workspace_name, findings)
---
# Terraform lifecycle blocks to reduce drift noise
# Ignore expected drift from auto-scaling
resource "aws_autoscaling_group" "payments_api" {
  # ... configuration ...
  lifecycle {
    ignore_changes = [
      desired_capacity,  # Changes with auto-scaling
      target_group_arns, # Changes with blue-green deploys
    ]
  }
}

# Ignore expected drift from external secret rotation
resource "aws_db_instance" "settlements_db" {
  # ... configuration ...
  lifecycle {
    ignore_changes = [
      password, # Rotated by Vault, not managed by Terraform
    ]
  }
}
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│             Infrastructure Drift Detection Pipeline             │
│                                                                 │
│  ┌──────────────────┐                                           │
│  │  TFE Workspace   │  Scheduled: Nightly at 2 AM               │
│  │  Plan-Only Run   │──────────────────────────────┐            │
│  └──────────────────┘                              │            │
│                                                    ▼            │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                terraform plan (read-only)                 │  │
│  │                                                           │  │
│  │    Declared State ←──compare──→ Actual Infrastructure     │  │
│  │   (Terraform code)               (AWS API responses)      │  │
│  └──────────────────────┬────────────────────────────────────┘  │
│                         │                                       │
│              ┌──────────▼──────────┐                            │
│              │  Changes Detected?  │                            │
│              └──┬──────────────┬───┘                            │
│              No │              │ Yes                            │
│          ┌──────▼────┐  ┌──────▼───────────────────────────┐    │
│          │ No drift  │  │ Webhook → Drift Classifier       │    │
│          │ All good  │  │                                  │    │
│          └───────────┘  │  ┌─────────┐ ┌────────┐ ┌─────┐  │    │
│                         │  │CRITICAL │ │  HIGH  │ │ LOW │  │    │
│                         │  │Delete/  │ │SecGroup│ │Tags │  │    │
│                         │  │Replace  │ │IAM/KMS │ │Desc │  │    │
│                         │  │→ Page   │ │→ Slack │ │→Jira│  │    │
│                         │  │  SecOps │ │  Alert │ │ Tkt │  │    │
│                         │  └─────────┘ └────────┘ └─────┘  │    │
│                         └──────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
Quick Answer
Store state in an S3 bucket with versioning enabled, server-side encryption (SSE-KMS), and a DynamoDB table for state locking. Structure the bucket with environment-prefixed keys (prod/networking/terraform.tfstate) and restrict access using IAM policies scoped to each environment's prefix. Prevent corruption through locking, versioning for rollback, and CI/CD-only access patterns.
Detailed Answer
Terraform state file management is like managing a bank vault's ledger: the ledger (state file) records what is in every safe deposit box (cloud resource), and if the ledger is corrupted or leaked, you either lose track of assets or expose their locations to unauthorized parties. The storage, encryption, access control, and corruption prevention for state files must be treated with the same rigor as production database backups.

The S3 backend is the standard for AWS-centric teams. The bucket itself requires several hardening measures: versioning enabled so you can recover previous state versions if an apply corrupts the current state, server-side encryption with a dedicated KMS key (not the default aws/s3 key) so you can audit and rotate encryption independently, public access blocked via the S3 Block Public Access settings, and a bucket policy that explicitly denies unencrypted uploads. The bucket should be in a dedicated SharedServices or Management account, separate from any workload account, so that a compromised workload account cannot directly access state.

The key structure within the bucket follows a hierarchy: {account-id-or-env}/{stack-name}/terraform.tfstate. For example: prod/networking/terraform.tfstate, prod/eks-cluster/terraform.tfstate, prod/payments-database/terraform.tfstate. This structure enables per-stack state isolation and per-environment IAM scoping. The Prod account's TerraformExecutionRole gets an S3 policy allowing s3:GetObject and s3:PutObject only on keys prefixed with prod/ (plus s3:ListBucket on the bucket, which the backend also needs), while the Dev role can only access dev/. This prevents a Dev pipeline misconfiguration from overwriting Prod state.

DynamoDB state locking prevents concurrent modifications. Create a single DynamoDB table (PAY_PER_REQUEST billing) with a partition key named LockID of type String. Every Terraform operation that could write state acquires a lock first by writing a conditional item to this table. If two engineers run terraform apply simultaneously on the same stack, the second write fails its condition check (ConditionalCheckFailedException) and Terraform reports a lock error, or keeps retrying for the duration of -lock-timeout if that flag is set. The lock record contains the operator's hostname, the operation type, and a timestamp, which helps diagnose stale locks from crashed CI pipelines.

Corruption prevention goes beyond locking. S3 versioning provides a recovery path: if an apply fails midway and leaves state inconsistent, you can restore a previous version using aws s3api list-object-versions and aws s3api get-object with the desired VersionId. With local state, Terraform also writes a backup of the previous state before modifying it (terraform.tfstate.backup), but this does not help in CI/CD, where state is remote and runners are ephemeral. For critical production stacks, enable S3 Replication to copy state to a bucket in another region for disaster recovery.

The most dangerous corruption scenario is partial apply failure: Terraform creates some resources but crashes before writing updated state. The created resources become orphans: they exist in AWS but are not tracked by Terraform. Recovery requires manually importing the orphaned resources using terraform import or, in Terraform 1.5+, using import blocks. To reduce this risk, break large configurations into smaller stacks so each apply touches fewer resources, and use the -target flag only as a last resort since it creates partial state updates by design.
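
The versioning-based recovery path can be sketched as a small Python helper that picks which version to restore from a `list-object-versions` response. It assumes the `Versions` entries have the usual `Key`, `VersionId`, `IsLatest`, and `LastModified` fields; the function name and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def previous_version_id(versions, key):
    """Return the VersionId of the newest non-current version of `key`, or None."""
    candidates = [v for v in versions if v["Key"] == key and not v["IsLatest"]]
    if not candidates:
        return None
    newest = max(candidates, key=lambda v: v["LastModified"])
    return newest["VersionId"]


# Hypothetical listing for illustration; real VersionIds are opaque strings.
STATE_KEY = "prod/networking/terraform.tfstate"
sample_versions = [
    {"Key": STATE_KEY, "VersionId": "v3", "IsLatest": True,
     "LastModified": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"Key": STATE_KEY, "VersionId": "v2", "IsLatest": False,
     "LastModified": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"Key": STATE_KEY, "VersionId": "v1", "IsLatest": False,
     "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"Key": "prod/eks-cluster/terraform.tfstate", "VersionId": "x9", "IsLatest": False,
     "LastModified": datetime(2024, 1, 5, tzinfo=timezone.utc)},
]

restore_id = previous_version_id(sample_versions, STATE_KEY)
```

The returned ID is what you would pass as the version to aws s3api get-object. Filtering on the exact key matters because a prefix listing can return versions for neighbouring stacks as well.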
Code Example
# State backend configuration with full security hardening
terraform {
  # S3 backend for remote state storage
  backend "s3" {
    # Dedicated state bucket in the SharedServices account
    bucket = "valuemomentum-terraform-state-prod"
    # Environment-prefixed key for access control scoping
    key = "prod/payments-platform/networking/terraform.tfstate"
    # Primary region for state storage
    region = "us-east-1"
    # DynamoDB table for state locking
    dynamodb_table = "terraform-state-locks"
    # Enable SSE-KMS encryption with a dedicated key
    encrypt = true
    # KMS key ARN for state file encryption
    kms_key_id = "arn:aws:kms:us-east-1:555555555555:key/mrk-abc123"
    # Use the SharedServices account profile for state access
    profile = "valuemomentum-shared-services"
  }
}

# S3 bucket for Terraform state (provisioned once by bootstrap)
resource "aws_s3_bucket" "terraform_state" {
  # Bucket name following organization naming convention
  bucket = "valuemomentum-terraform-state-prod"
  # Prevent accidental deletion of the state bucket
  force_destroy = false

  tags = {
    Purpose   = "terraform-state-storage"
    ManagedBy = "bootstrap-terraform"
  }
}

# Enable versioning for state file recovery
resource "aws_s3_bucket_versioning" "state_versioning" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id

  # Enable versioning to recover from corruption
  versioning_configuration {
    status = "Enabled"
  }
}

# Server-side encryption with dedicated KMS key
resource "aws_s3_bucket_server_side_encryption_configuration" "state_encryption" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      # Use KMS encryption instead of AES-256
      sse_algorithm = "aws:kms"
      # Dedicated KMS key for independent rotation and audit
      # (aws_kms_key.terraform_state_key is defined elsewhere in the bootstrap stack)
      kms_master_key_id = aws_kms_key.terraform_state_key.arn
    }
    # Use S3 Bucket Keys to reduce the number of KMS requests
    bucket_key_enabled = true
  }
}

# Block all public access to the state bucket
resource "aws_s3_bucket_public_access_block" "state_public_block" {
  # Reference the state bucket
  bucket = aws_s3_bucket.terraform_state.id

  # Block all forms of public access
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# DynamoDB table for state locking
resource "aws_dynamodb_table" "terraform_locks" {
  # Table name matching backend configuration
  name = "terraform-state-locks"
  # Pay-per-request to avoid capacity planning
  billing_mode = "PAY_PER_REQUEST"
  # LockID is the required partition key for Terraform
  hash_key = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  # Enable point-in-time recovery for lock table safety
  point_in_time_recovery {
    enabled = true
  }
}

# IAM policy scoping Prod role to only prod/ state prefix
# data "aws_iam_policy_document" "prod_state_access" {
#   statement {
#     effect    = "Allow"
#     actions   = ["s3:GetObject", "s3:PutObject"]
#     resources = ["arn:aws:s3:::valuemomentum-terraform-state-prod/prod/*"]
#   }
#   statement {
#     effect    = "Deny"
#     actions   = ["s3:*"]
#     resources = ["arn:aws:s3:::valuemomentum-terraform-state-prod/dev/*"]
#   }
# }
Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│             Terraform State Security Architecture             │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  SharedServices Account                                       │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  S3: valuemomentum-terraform-state-prod             │      │
│  │  ┌──────────────────────────────────────────────┐   │      │
│  │  │  Versioning: Enabled                         │   │      │
│  │  │  Encryption: SSE-KMS (dedicated key)         │   │      │
│  │  │  Public Access: Blocked                      │   │      │
│  │  │  Replication: us-east-1 → us-west-2 (DR)     │   │      │
│  │  └──────────────────────────────────────────────┘   │      │
│  │                                                     │      │
│  │  Key Structure:                                     │      │
│  │  ├── dev/                                           │      │
│  │  │   ├── networking/terraform.tfstate               │      │
│  │  │   ├── eks-cluster/terraform.tfstate              │      │
│  │  │   └── payments-db/terraform.tfstate              │      │
│  │  ├── qa/                                            │      │
│  │  │   └── ...                                        │      │
│  │  ├── uat/                                           │      │
│  │  │   └── ...                                        │      │
│  │  └── prod/  ← Prod role can ONLY access this        │      │
│  │      ├── networking/terraform.tfstate               │      │
│  │      ├── eks-cluster/terraform.tfstate              │      │
│  │      └── payments-db/terraform.tfstate              │      │
│  └─────────────────────────────────────────────────────┘      │
│                                                               │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  DynamoDB: terraform-state-locks                    │      │
│  │  ┌──────────────────────────────────────────────┐   │      │
│  │  │ LockID (PK)  │ Info │ Who │ Operation        │   │      │
│  │  │ prod/net/... │ ...  │ ci  │ apply            │   │      │
│  │  └──────────────────────────────────────────────┘   │      │
│  └─────────────────────────────────────────────────────┘      │
│                                                               │
│  Access Pattern:                                              │
│  CI/CD Runner → OIDC → AssumeRole → Scoped S3 Access          │
│  ┌──────────┐    ┌────────────┐    ┌──────────────┐           │
│  │ Pipeline │───→│ Prod Role  │───→│ prod/* only  │           │
│  │ (OIDC)   │    │ (IAM)      │    │ (S3 policy)  │           │
│  └──────────┘    └────────────┘    └──────────────┘           │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Prevent cross-environment contamination through four layers: separate state files per environment with IAM-scoped access, provider configurations locked to specific AWS accounts via assume_role, module versioning with pinned tags so an untested module change cannot propagate, and CI/CD pipeline guardrails that validate the target environment before apply.
Detailed Answer
Preventing cross-environment contamination in Terraform is like building firewalls between apartments in a building: you need physical separation (state isolation), locked doors (IAM boundaries), independent utilities (provider configurations), and a building code (CI/CD guardrails) that prevents shortcuts through shared walls.

The first layer is state file isolation. Each environment must have its own state file with its own backend configuration. Never share a state file between environments, even with workspaces, if the blast radius of corruption is unacceptable. The state file contains sensitive data including resource IDs, IP addresses, and sometimes plaintext outputs. An S3 bucket policy should restrict each environment's Terraform role to only its own key prefix: the prod role can access s3://state-bucket/prod/* but is explicitly denied s3://state-bucket/dev/*, and the dev role gets the mirror-image policy. This prevents a misconfigured pipeline in one environment from reading or overwriting another environment's state.

The second layer is provider-level isolation. Each environment's provider block must assume a role in its specific AWS account. Even if someone accidentally passes the wrong tfvars file, the provider configuration ensures Terraform operates in the correct account. Add a validation check using the aws_caller_identity data source: compare the actual account ID against the expected one and fail early if they do not match. This catches the scenario where an engineer runs terraform apply with prod credentials but dev configuration, or vice versa.

The third layer is module versioning. When environments share modules from a private registry or Git repository, use pinned version tags. Dev might use module version 2.3.0-rc1 while Prod uses 2.2.0 (the last stable release). Without version pinning, a module change pushed to the main branch immediately affects every environment that references source = "git::...?ref=main". This is the most common cause of accidental cross-environment impact: someone fixes a bug in a shared VPC module, the fix has a typo, and every environment that references the module head picks up the broken code on next apply.

The fourth layer is CI/CD pipeline guardrails. The pipeline should validate environment consistency before plan: check that the workspace name matches the tfvars file, verify the AWS account ID matches the target environment, and confirm the Git branch is allowed to deploy to that environment (only main can deploy to prod). Implement a pre-plan script that runs aws sts get-caller-identity and compares the account against an expected value from the pipeline configuration.

Remote state data sources are a particularly dangerous vector for cross-environment bleed. When a production EKS module reads the networking module's state via terraform_remote_state, it must reference the production networking state, not dev. Parameterize the remote state data source's backend configuration using the environment variable: data.terraform_remote_state.networking.config.key should resolve to prod/networking/terraform.tfstate through interpolation, not a hardcoded path. A common gotcha is a hardcoded key that is correct in the environment where it was written but points at another environment's state when someone copies the configuration without updating it.

The ultimate safeguard is defense in depth: even if one layer fails, the others prevent damage. If the IAM policy has a bug that allows dev access to prod state, the provider's assume_role still locks operations to the dev account. If the provider configuration is wrong, the account ID validation check fails before any resources are touched.
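
The pre-plan guardrail checks can be sketched as one Python function that collects every mismatch before terraform plan runs. The `<environment>.tfvars` naming convention, the branch mapping, and the function name are assumptions for this sketch; the account IDs are placeholders.

```python
import os

# Placeholder account IDs per environment.
EXPECTED_ACCOUNTS = {
    "dev": "111111111111",
    "qa": "222222222222",
    "uat": "333333333333",
    "prod": "444444444444",
}

# Which branches may deploy to which environment (an assumed mapping).
ALLOWED_BRANCHES = {
    "dev": {"develop", "main"},
    "qa": {"develop", "main"},
    "uat": {"main"},
    "prod": {"main"},
}


def validate_target(environment, tfvars_file, branch, actual_account):
    """Return a list of guardrail violations; an empty list means safe to plan."""
    errors = []
    expected = EXPECTED_ACCOUNTS.get(environment)
    if expected is None:
        errors.append(f"unknown environment: {environment}")
    elif actual_account != expected:
        errors.append(f"expected account {expected} but authenticated to {actual_account}")
    # Assumes variable files are named <environment>.tfvars
    if os.path.basename(tfvars_file) != f"{environment}.tfvars":
        errors.append(f"tfvars file {tfvars_file} does not match environment {environment}")
    if branch not in ALLOWED_BRANCHES.get(environment, set()):
        errors.append(f"branch {branch} may not deploy to {environment}")
    return errors


ok = validate_target("prod", "envs/prod.tfvars", "main", "444444444444")
bad = validate_target("prod", "envs/dev.tfvars", "feature/x", "111111111111")
```

Returning all violations at once, instead of stopping at the first, gives the engineer the full picture in a single pipeline failure. In a pipeline, the caller would print the list and exit non-zero when it is not empty.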
Code Example
# Account identity validation: fail fast on wrong account
# Fetch the actual AWS account identity
data "aws_caller_identity" "current" {}

# Validate the account ID matches the expected environment
locals {
  # Map of expected account IDs per environment
  expected_accounts = {
    dev  = "111111111111"
    qa   = "222222222222"
    uat  = "333333333333"
    prod = "444444444444"
  }
  # Check if current account matches the target environment
  account_validated = (
    data.aws_caller_identity.current.account_id ==
    local.expected_accounts[var.environment]
  )
}

# Validation resource that fails the plan if the account does not match.
# A precondition replaces the older trick of assigning a string to count.
resource "terraform_data" "account_validation" {
  lifecycle {
    precondition {
      condition     = local.account_validated
      error_message = "Running in the wrong AWS account for environment ${var.environment}."
    }
  }
}

# Remote state data source, parameterized per environment
data "terraform_remote_state" "networking" {
  # S3 backend for reading the networking layer state
  backend = "s3"
  config = {
    # Same state bucket as all other stacks
    bucket = "valuemomentum-terraform-state-prod"
    # Key parameterized by environment to prevent cross-env reads
    key = "${var.environment}/networking/terraform.tfstate"
    # Same region as the backend
    region = "us-east-1"
  }
}

# Use networking outputs safely scoped to the correct environment
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name scoped to the environment
  name     = "payments-eks-${var.environment}"
  version  = "1.29"
  role_arn = aws_iam_role.eks_cluster_role.arn

  vpc_config {
    # Subnet IDs from the SAME environment's networking state
    subnet_ids              = data.terraform_remote_state.networking.outputs.private_subnet_ids
    endpoint_private_access = true
    # Public endpoint is disabled in prod only
    endpoint_public_access = var.environment != "prod"
  }
}

# Module versioning, pinned per environment
module "payments_vpc" {
  # Pinned Git tag prevents untested changes from propagating
  source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.2.0"
  # In dev, you might test a release candidate:
  # source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.3.0-rc1"
  vpc_name    = "payments-vpc-${var.environment}"
  vpc_cidr    = var.vpc_cidr
  environment = var.environment
}

# CI/CD pre-plan validation script (run before terraform plan)
# #!/bin/bash
# EXPECTED_ACCOUNT=$(jq -r ".${ENVIRONMENT}" accounts.json)
# ACTUAL_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
# if [ "$EXPECTED_ACCOUNT" != "$ACTUAL_ACCOUNT" ]; then
#   echo "FATAL: Expected account $EXPECTED_ACCOUNT but authenticated to $ACTUAL_ACCOUNT"
#   exit 1
# fi
Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│ Cross-Environment Protection Layers │
├───────────────────────────────────────────────────────────────┤
│ │
│ Layer 1: State Isolation (IAM-Scoped) │
│ ┌──────────────┐ DENY ┌──────────────┐ │
│ │ Dev Role │─────X─────│ prod/* │ │
│ │ (IAM) │ │ state keys │ │
│ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ALLOW ┌──────────────┐ │
│ │ Dev Role │───────────│ dev/* │ │
│ │ (IAM) │ │ state keys │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ Layer 2: Provider Account Lock │
│ ┌──────────────────────────────────────────┐ │
│ │ provider "aws" { │ │
│ │ assume_role { │ │
│ │ role_arn = ".../${var.env}/Role" │ │
│ │ } │ │
│ │ } │ │
│ │ → Operations locked to target account │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Layer 3: Account ID Validation │
│ ┌──────────────────────────────────────────┐ │
│ │ aws_caller_identity.account_id │ │
│ │ == expected_accounts[var.environment] │ │
│ │ → FAIL FAST if wrong account │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Layer 4: Module Version Pinning │
│ ┌──────────────────────────────────────────┐ │
│ │ Dev: source = "...?ref=v2.3.0-rc1" │ │
│ │ Prod: source = "...?ref=v2.2.0" │ │
│ │ → Untested changes cannot reach prod │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Layer 5: CI/CD Pipeline Guardrails │
│ ┌──────────────────────────────────────────┐ │
│ │ Branch → Environment mapping │ │
│ │ main → prod (requires approval) │ │
│ │ develop → dev (auto-apply) │ │
│ │ Pre-plan: sts get-caller-identity check │ │
│ └──────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Implement a multi-stage pipeline: PR triggers terraform plan with output posted as a PR comment, OPA/Sentinel policy checks validate compliance, manual approval gates (GitHub Environments with required reviewers) protect production, and merge-to-main triggers terraform apply using the saved plan file. Use OIDC for keyless authentication and concurrency controls to prevent parallel applies on the same stack.
Detailed Answer
A Terraform CI/CD pipeline is like an air traffic control system: every infrastructure change (flight) must file a plan (flight plan), get reviewed by controllers (PR reviewers), receive clearance (approval gate), and land on the correct runway (target environment), all while preventing two planes from using the same runway simultaneously (state locking and concurrency control).

The pipeline begins with authentication. Modern pipelines use OIDC federation instead of stored AWS credentials. GitHub Actions requests a JWT from GitHub's OIDC provider, presents it to AWS STS via AssumeRoleWithWebIdentity, and receives short-lived credentials scoped to the Terraform execution role. The OIDC trust policy restricts which repositories, branches, and environments can assume the role: production apply roles should only be assumable by the main branch, while plan roles can be assumed by any branch. This eliminates long-lived access keys that could be exfiltrated from CI secrets.

The plan stage runs on every pull request. It executes terraform init, terraform validate, terraform fmt -check, and terraform plan -out=plan.tfplan. The plan output is captured and posted as a PR comment using a tool like tfcmt, or by taking the output that the hashicorp/setup-terraform wrapper exposes and posting it from a script step. Reviewers see exactly what resources will be created, modified, or destroyed, including sensitive changes like security group rule modifications or IAM policy updates. The saved plan file is uploaded as a CI artifact for use in the apply stage.

Policy-as-code gates run between plan and approval. Open Policy Agent (OPA) evaluates the plan JSON (terraform show -json plan.tfplan) against organizational policies: no S3 buckets without encryption, no security groups with 0.0.0.0/0 ingress on port 22, all RDS instances must have deletion protection in production. These checks are non-negotiable: a policy violation fails the pipeline regardless of who approves the PR. Sentinel serves the same purpose in Terraform Cloud/Enterprise environments.

The approval gate differs by environment. Dev and QA may auto-apply on merge, since the PR review itself is sufficient approval. UAT requires team lead approval via a GitHub Environment with one required reviewer. Production requires two approvals from the platform-admins team, with a 15-minute wait timer to prevent hasty approvals. These are configured as GitHub Environments with protection rules, which the apply job references via the environment keyword.

The apply stage triggers after merge to main. Critically, it should use the saved plan file from the plan stage rather than re-running plan, because a fresh plan could include changes nobody reviewed if the infrastructure or state moved in the meantime. If the saved plan is stale (state serial mismatch), Terraform rejects it and the pipeline must re-plan. After successful apply, the pipeline posts results to a Slack channel (#infra-changes-prod) and creates a GitHub deployment record for audit trail.

Concurrency control prevents two merged PRs from applying simultaneously to the same stack. GitHub Actions concurrency groups scoped to the stack name (concurrency: group: terraform-payments-prod) ensure only one apply runs at a time. Queued runs wait for the current apply to complete. Combined with DynamoDB state locking, this provides two layers of concurrent modification prevention.
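
To make the policy gate concrete, here is a Python stand-in for one of the rules above (no 0.0.0.0/0 ingress on port 22), evaluated against the output of terraform show -json. A real pipeline would express this in Rego or Sentinel; the function names are hypothetical and the sketch covers only IPv4 CIDRs on aws_security_group and aws_security_group_rule resources.

```python
def opens_ssh_to_world(rule):
    """True if an ingress rule exposes port 22 to 0.0.0.0/0."""
    if "0.0.0.0/0" not in (rule.get("cidr_blocks") or []):
        return False
    if str(rule.get("protocol")) == "-1":
        return True  # protocol -1 means all protocols and ports
    low, high = rule.get("from_port"), rule.get("to_port")
    return low is not None and high is not None and low <= 22 <= high


def deny_messages(plan):
    """Return one message per resource change that violates the SSH rule."""
    messages = []
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after") or {}
        if rc.get("type") == "aws_security_group":
            rules = after.get("ingress") or []
        elif rc.get("type") == "aws_security_group_rule" and after.get("type") == "ingress":
            rules = [after]
        else:
            continue
        if any(opens_ssh_to_world(rule) for rule in rules):
            messages.append(f"{rc.get('address', '?')}: port 22 open to 0.0.0.0/0")
    return messages


# Hypothetical plan fragment for illustration.
sample_plan = {"resource_changes": [
    {"address": "aws_security_group.bastion", "type": "aws_security_group",
     "change": {"actions": ["create"], "after": {"ingress": [
         {"protocol": "tcp", "from_port": 22, "to_port": 22,
          "cidr_blocks": ["0.0.0.0/0"]}]}}},
    {"address": "aws_security_group_rule.https", "type": "aws_security_group_rule",
     "change": {"actions": ["create"], "after": {
         "type": "ingress", "protocol": "tcp", "from_port": 443, "to_port": 443,
         "cidr_blocks": ["0.0.0.0/0"]}}},
]}

violations = deny_messages(sample_plan)
```

A non-empty result would fail the pipeline step, matching the rule that no approval can override a policy violation.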
Code Example
# .github/workflows/terraform-payments.yml
# Multi-stage Terraform pipeline with OIDC and approval gates
name: Payments Infrastructure Pipeline
# Trigger on PRs and pushes to main affecting payments infra
on:
pull_request:
paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']
push:
branches: [main]
paths: ['infrastructure/envs/prod/**', 'infrastructure/modules/**']
# OIDC permissions for keyless AWS authentication
permissions:
id-token: write
contents: read
pull-requests: write
# Prevent concurrent applies on the same stack
concurrency:
group: terraform-payments-prod-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
# Stage 1: Validate and plan on every PR
plan:
runs-on: ubuntu-latest
if: github.event_name == 'pull_request'
steps:
# Checkout the infrastructure code
- uses: actions/checkout@v4
# OIDC authentication — plan role (read-only)
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformPlan
aws-region: us-east-1
# Install pinned Terraform version
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: 1.7.4
# Initialize the backend and download providers
- name: Init
run: terraform -chdir=infrastructure/envs/prod init -input=false
# Validate syntax and configuration
- name: Validate
run: terraform -chdir=infrastructure/envs/prod validate
# Format check to enforce style standards
- name: Format Check
run: terraform fmt -check -recursive infrastructure/
# Generate execution plan and save to file
- name: Plan
run: terraform -chdir=infrastructure/envs/prod plan -input=false -out=prod.tfplan
# Export plan as JSON for OPA policy evaluation
- name: Export Plan JSON
run: terraform -chdir=infrastructure/envs/prod show -json prod.tfplan > plan.json
# Run OPA policy checks against the plan
- name: OPA Policy Check
run: |
opa eval --data policies/ --input plan.json "data.terraform.deny[msg]" --fail-defined
# Post plan output as a PR comment for reviewers
- name: Comment Plan on PR
uses: borchero/terraform-plan-comment@v2
with:
working-directory: infrastructure/envs/prod
# Upload plan artifact for the apply stage
- uses: actions/upload-artifact@v4
with:
name: prod-tfplan
path: infrastructure/envs/prod/prod.tfplan
retention-days: 5
# Stage 2: Apply after merge with manual approval
apply:
runs-on: ubuntu-latest
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
# Production environment with required approvers and wait timer
environment:
name: production-payments
url: https://console.aws.amazon.com/eks
steps:
- uses: actions/checkout@v4
# OIDC authentication — apply role (read-write)
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::444444444444:role/GitHubActions-TerraformApply
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: 1.7.4
# Re-init and apply (saved plan may be stale after merge)
- name: Init and Apply
run: |
terraform -chdir=infrastructure/envs/prod init -input=false
terraform -chdir=infrastructure/envs/prod apply -input=false -auto-approve
# Notify team of successful deployment
- name: Slack Notification
if: success()
uses: slackapi/slack-github-action@v1
with:
payload: '{"text": "Prod payments infra deployed by ${{ github.actor }}"}'
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐ │ Terraform CI/CD Pipeline with Approval Gates │ ├───────────────────────────────────────────────────────────────┤ │ │ │ PR Opened │ │ ┌──────────────────────────────────────────────────┐ │ │ │ Stage 1: Plan (on every PR) │ │ │ │ │ │ │ │ OIDC → AssumeRole (Plan Role, read-only) │ │ │ │ ┌──────┐ ┌────────┐ ┌────┐ ┌──────┐ │ │ │ │ │ init │→│validate│→│fmt │→│ plan │ │ │ │ │ └──────┘ └────────┘ └────┘ └──┬───┘ │ │ │ │ │ │ │ │ │ ┌──────┴──────┐ │ │ │ │ │ plan.json │ │ │ │ │ └──────┬──────┘ │ │ │ │ ↓ │ │ │ │ ┌────────────────────┐ │ │ │ │ │ OPA Policy Check │ │ │ │ │ │ - no public S3 │ │ │ │ │ │ - encryption on │ │ │ │ │ │ - tags required │ │ │ │ │ └────────┬───────────┘ │ │ │ │ ↓ │ │ │ │ ┌────────────────────┐ │ │ │ │ │ PR Comment with │ │ │ │ │ │ plan output │ │ │ │ │ └────────────────────┘ │ │ │ └──────────────────────────────────────────────────┘ │ │ │ │ PR Approved + Merged to main │ │ ┌──────────────────────────────────────────────────┐ │ │ │ Stage 2: Manual Approval Gate │ │ │ │ GitHub Environment: production-payments │ │ │ │ Required reviewers: 2 from platform-admins │ │ │ │ Wait timer: 15 minutes │ │ │ └──────────────────────────────────────────────────┘ │ │ ↓ │ │ ┌──────────────────────────────────────────────────┐ │ │ │ Stage 3: Apply (after approval) │ │ │ │ │ │ │ │ OIDC → AssumeRole (Apply Role, read-write) │ │ │ │ ┌──────┐ ┌────────────────┐ │ │ │ │ │ init │→│ apply │ │ │ │ │ └──────┘ │ -auto-approve │ │ │ │ │ └───────┬────────┘ │ │ │ │ ↓ │ │ │ │ ┌──────────────┐ │ │ │ │ │ Slack notify │ │ │ │ │ │ #infra-changes│ │ │ │ │ └──────────────┘ │ │ │ └──────────────────────────────────────────────────┘ │ │ │ │ Concurrency: group=terraform-payments-prod (1 at a time) │ └───────────────────────────────────────────────────────────────┘
Quick Answer
Detect drift by running terraform plan -refresh-only to compare actual infrastructure against state without proposing changes. Remediate by either importing the manual change into Terraform (terraform import or import blocks), reverting the manual change by running terraform apply to converge back to the declared configuration, or updating the Terraform code to reflect the intentional change and then applying.
Detailed Answer
Terraform state drift is like someone rearranging furniture in a room that has a blueprint: the blueprint (state file) says the couch is by the window, but someone physically moved it to the center of the room (console change). Terraform detects this discrepancy during the refresh phase of plan and proposes moving the couch back to the window (converging to declared state). The question is whether the move was intentional or accidental — and that determines whether you update the blueprint or move the couch back. Drift detection happens during the refresh phase of terraform plan. For every resource tracked in state, Terraform calls the cloud provider's API to read the current configuration. If the API response differs from what state records, Terraform updates its in-memory state and then diffs that against your HCL configuration. The -refresh-only flag runs only the refresh phase without proposing configuration-driven changes, making it a pure drift detection scan. The output shows which attributes have drifted and their before/after values. There are three categories of drift, each requiring a different remediation strategy. The first is accidental drift: an engineer manually opened port 443 on a security group to debug a connectivity issue and forgot to revert it. The fix is to run terraform apply, which converges the security group back to the declared configuration, removing the manually added rule. This is Terraform's self-healing property — the declared state is the source of truth. The second is intentional drift: an operations engineer manually scaled up an RDS instance from db.r6g.xlarge to db.r6g.2xlarge during a traffic incident. The change was correct and should be preserved. The fix is to update the Terraform code to reflect the new instance class, then run terraform plan to verify the plan shows no changes (the code now matches reality). 
If you run apply without updating the code, Terraform would downgrade the instance back to the original size, potentially causing another outage. The third is untracked resource creation: someone created a new S3 bucket via the console that Terraform knows nothing about. Since Terraform only tracks resources in its state, it cannot detect untracked resources. Tools like AWS Config, driftctl (now part of Snyk), or CloudQuery inventory the entire account so that it can be compared against Terraform state to find resources that exist but are not managed. Once identified, you either import the resource into Terraform using import blocks (Terraform 1.5+) or the terraform import command, or you delete the resource if it should not exist. Proactive drift prevention is better than reactive detection. Implement AWS Config rules that alert on configuration changes not made by the Terraform execution role. Set up CloudTrail-based alarms that trigger when console users modify resources tagged with ManagedBy=terraform. Use IAM policies that restrict console users to read-only access for Terraform-managed resource types. Schedule a daily terraform plan -refresh-only in CI that posts drift reports to a Slack channel; this catches drift within 24 hours instead of discovering it during the next deployment. The lifecycle meta-argument ignore_changes is the escape hatch for expected drift. Auto Scaling groups change desired_capacity based on scaling policies, ECS services change desired_count, and some resources have attributes that are set once and then managed externally. Adding these attributes to ignore_changes tells Terraform to ignore differences in them when planning, preventing false positives and accidental reverts of legitimate operational changes.
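The daily scan described above hinges on how the exit status of terraform plan -refresh-only -detailed-exitcode is interpreted: 0 means no changes, 1 means the plan itself failed, and 2 means changes were detected. The sketch below is one possible wrapper, not a definitive implementation. The function names classify_plan_exit and run_drift_scan and the TERRAFORM_BIN override are assumptions introduced for illustration; they are not part of Terraform.

```shell
# Map the -detailed-exitcode values to a drift status string.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;  # plan succeeded, nothing to report
    2) echo "drift" ;;  # plan succeeded, changes were detected
    *) echo "error" ;;  # 1 (or anything else): the plan itself failed
  esac
}

# Run a refresh-only plan in the given directory and print the status.
# TERRAFORM_BIN is a hypothetical override so the wrapper can be exercised
# without a real Terraform installation.
run_drift_scan() {
  local dir="$1" rc=0
  "${TERRAFORM_BIN:-terraform}" -chdir="$dir" plan -refresh-only \
    -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  classify_plan_exit "$rc"
}

# Example usage (commented out so that sourcing this file runs nothing):
# status="$(run_drift_scan infrastructure/envs/prod)"
# [ "$status" = "drift" ] && echo "alert the team"
```

Keeping "error" separate from "drift" matters for alerting: an expired credential or a backend outage should not be reported as a console change.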
Code Example
# Drift detection and remediation workflow
# Step 1: Run refresh-only plan to detect drift without proposing changes
# terraform plan -refresh-only -out=drift-check.tfplan
# This shows which resources have drifted from their recorded state
# Step 2: Review the drift report
# terraform show drift-check.tfplan
# Example output:
# ~ aws_security_group_rule.payments_api_ingress
# from_port: 443 → 8080 (someone changed the port manually)
# Step 3a: Revert accidental drift — apply converges back to declared state
# terraform apply
# This restores the security group rule to port 443 as declared in code
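# Step 3b: Adopt intentional drift by updating code to match reality
resource "aws_rds_cluster" "payments_db" {
  # Cluster identifier for the payments transaction database
  cluster_identifier      = "payments-db-prod"
  engine                  = "aurora-postgresql"
  engine_version          = "15.4"
  deletion_protection     = true
  backup_retention_period = 30
}
# For Aurora, the instance class is set on the cluster instance, not the cluster.
# The instance resource name and identifier below are illustrative.
resource "aws_rds_cluster_instance" "payments_db_writer" {
  identifier         = "payments-db-prod-writer"
  cluster_identifier = aws_rds_cluster.payments_db.id
  engine             = aws_rds_cluster.payments_db.engine
  engine_version     = aws_rds_cluster.payments_db.engine_version
  # Updated instance class to match the manual scaling during incident
  # Previously: db.r6g.xlarge, changed during traffic spike on 2026-06-15
  instance_class = "db.r6g.2xlarge"
}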
# Step 3c: Import untracked resources using import blocks (TF 1.5+)
import {
# S3 bucket created manually via console during incident response
to = aws_s3_bucket.payments_audit_logs
# The actual bucket name to import from AWS
id = "valuemomentum-payments-audit-logs-prod"
}
# Resource block to match the imported bucket's configuration
resource "aws_s3_bucket" "payments_audit_logs" {
# Bucket name matching the manually created bucket
bucket = "valuemomentum-payments-audit-logs-prod"
tags = {
Purpose = "audit-log-storage"
Environment = "prod"
ManagedBy = "terraform"
ImportedOn = "2026-06-20"
}
}
# Lifecycle ignore_changes for expected drift patterns
resource "aws_autoscaling_group" "payments_api_fleet" {
# ASG name following the naming convention
name = "payments-api-fleet-prod-use1"
# Baseline desired capacity — autoscaler adjusts this
desired_capacity = 6
# Minimum instances for SLA compliance
min_size = 3
# Maximum instances during peak events
max_size = 24
launch_template {
id = aws_launch_template.payments_api.id
version = "$Latest"
}
lifecycle {
# Ignore desired_capacity — managed by cluster autoscaler
# Ignore target_group_arns — managed by EKS ingress controller
ignore_changes = [desired_capacity, target_group_arns]
}
}
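# Scheduled drift detection in CI (runs daily at 6 AM UTC)
# .github/workflows/drift-detection.yml
# name: Daily Drift Detection
# on:
#   schedule:
#     - cron: '0 6 * * *'
# jobs:
#   detect-drift:
#     runs-on: ubuntu-latest
#     steps:
#       - uses: actions/checkout@v4
#       # AWS credentials (for example the OIDC plan role) must be configured before init
#       - uses: hashicorp/setup-terraform@v3
#         with:
#           terraform_version: 1.7.4
#           # Disable the wrapper so the raw exit code reaches the shell
#           terraform_wrapper: false
#       - run: terraform -chdir=infrastructure/envs/prod init -input=false
#       - id: plan
#         run: |
#           set +e
#           terraform -chdir=infrastructure/envs/prod plan -refresh-only -detailed-exitcode -input=false
#           echo "exitcode=$?" >> "$GITHUB_OUTPUT"
#       # Exit code 2 means drift detected; exit code 1 means the plan itself failed
#       - if: steps.plan.outputs.exitcode == '2'
#         env:
#           # Secret name is an assumption; use whatever your repository defines
#           SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
#         run: |
#           curl -X POST -H 'Content-type: application/json' -d '{"text": "DRIFT DETECTED in prod payments infra"}' "$SLACK_WEBHOOK"
#       - if: steps.plan.outputs.exitcode == '1'
#         run: exit 1
◈ Architecture Diagram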
┌───────────────────────────────────────────────────────────────┐
│         Terraform State Drift Detection & Remediation         │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌───────────────┐    Manual    ┌──────────────────┐          │
│  │ Terraform     │    Change    │ AWS Console      │          │
│  │ State File    │              │ or CLI           │          │
│  │               │              │                  │          │
│  │ sg port: 443  │              │ sg port: 8080    │          │
│  │ (declared)    │              │ (actual)         │          │
│  └───────┬───────┘              └────────┬─────────┘          │
│          │                               │                    │
│          └───────────────┬───────────────┘                    │
│                          ↓                                    │
│               ┌──────────────────────┐                        │
│               │ terraform plan       │                        │
│               │ -refresh-only        │                        │
│               │                      │                        │
│               │ DRIFT DETECTED:      │                        │
│               │ sg port: 443 → 8080  │                        │
│               └──────────┬───────────┘                        │
│                          │                                    │
│           ┌──────────────┼──────────────┐                     │
│           ↓              ↓              ↓                     │
│  ┌──────────────┐┌─────────────┐┌──────────────────┐          │
│  │ Accidental   ││ Intentional ││ Untracked        │          │
│  │ Drift        ││ Drift       ││ Resource         │          │
│  │              ││             ││                  │          │
│  │ terraform    ││ Update HCL  ││ terraform import │          │
│  │ apply        ││ to match    ││ or delete the    │          │
│  │ (revert to   ││ reality,    ││ resource         │          │
│  │ declared)    ││ then plan   ││                  │          │
│  └──────────────┘└─────────────┘└──────────────────┘          │
│                                                               │
│  Proactive Detection:                                         │
│  ┌──────────────────────────────────────────────────┐         │
│  │ Daily CI Job (cron: 0 6 * * *)                   │         │
│  │ terraform plan -refresh-only -detailed-exitcode  │         │
│  │                                                  │         │
│  │ Exit 0 → No drift → all clear                    │         │
│  │ Exit 2 → Drift detected → Slack alert            │         │
│  └──────────────────────────────────────────────────┘         │
│                                                               │
│  Expected Drift (ignore_changes):                             │
│  ┌──────────────────────────────────────────────────┐         │
│  │ ASG desired_capacity → managed by autoscaler     │         │
│  │ ECS desired_count → managed by scaling policy    │         │
│  │ ignore_changes = [desired_capacity]              │         │
│  └──────────────────────────────────────────────────┘         │
└───────────────────────────────────────────────────────────────┘
Quick Answer
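State locking prevents concurrent modifications by acquiring a lock on the state before any operation that could write it. With the S3+DynamoDB backend, Terraform writes a lock item to the DynamoDB table using a conditional put. If a second operator starts an operation while the lock is held, the conditional write fails (DynamoDB returns a ConditionalCheckFailedException) and Terraform exits with an "Error acquiring the state lock" message that identifies the lock holder. Terraform does not wait by default; pass -lock-timeout to retry for a set period, and reserve terraform force-unlock for stale locks.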
Detailed Answer
Terraform state locking is a concurrency control mechanism that prevents two or more operators from writing to the same state file simultaneously, which would cause state corruption and potentially orphaned cloud resources. Think of it like a database row-level lock: before Terraform can modify state, it must first acquire an exclusive lock, and any other process attempting to modify the same state must wait or fail. When you run terraform apply, terraform plan, or any other operation that could write state, Terraform sends a Lock request to the backend. For S3+DynamoDB, this means writing an item to the DynamoDB table whose partition key, LockID, is the state path in the form <bucket>/<key>. The item's Info attribute holds the lock metadata as JSON: a unique lock ID (a UUID), the operation type, who holds the lock (user@hostname), the Terraform version, the creation time, and the state path. A separate item with the same path plus an -md5 suffix stores the state file's digest for consistency checks. DynamoDB's conditional write ensures atomicity: if an item already exists with that key, the write fails with a ConditionalCheckFailedException, and Terraform reports that the state is locked by another process. In production, lock conflicts arise in several scenarios. The most common is when two engineers run terraform apply on the same workspace concurrently. Terraform will display the lock holder's information including their username, operation type, and when the lock was acquired. The blocked operator must wait for the first operation to complete and retry, or run with -lock-timeout so that Terraform retries automatically. Another common scenario is a CI/CD pipeline crash: if a pipeline runner dies mid-apply, the lock remains in DynamoDB as a stale lock. This requires manual intervention using terraform force-unlock with the lock ID. Force-unlock is dangerous in production because you cannot guarantee the previous operation completed cleanly. Before force-unlocking, you should verify the state of the actual infrastructure using AWS console or CLI, check the DynamoDB table directly to confirm the lock metadata, and review CloudTrail logs for any API calls made by the crashed process. 
A safer pattern is to implement lock timeouts in your CI/CD pipeline: pass -lock-timeout so that runs wait for a held lock instead of failing immediately, bound the job's total runtime, and have the pipeline run terraform force-unlock only after confirming no infrastructure changes are in progress. Different backends handle locking differently. Consul uses its built-in session-based locking with TTL. Azure Blob Storage uses native blob leases. Google Cloud Storage creates a lock object guarded by a generation precondition, so only one writer can create it. Not all backends support locking; the local backend does via filesystem locks, but NFS-mounted local backends have notoriously unreliable locking, which is why teams migrate to remote backends. The etcd backends, which were removed in Terraform 1.3, relied on etcd's compare-and-swap primitive. Understanding your backend's locking semantics is critical for disaster recovery planning, because a corrupted lock table can block all infrastructure changes across your organization.
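Before reaching for force-unlock, it helps to look at the lock item itself. The sketch below assumes the S3+DynamoDB backend and the default workspace, where the LockID is the bucket name joined to the state key with a slash. The helper names lock_id_for and show_lock and the AWS_BIN override are hypothetical and exist only for this example; aws dynamodb get-item is the real CLI call being wrapped.

```shell
# Build the LockID used by the S3 backend for the default workspace:
# "<bucket>/<key>".
lock_id_for() {
  printf '%s/%s\n' "$1" "$2"
}

# Fetch the lock item so its Info attribute (lock ID, Who, Operation,
# Created) can be reviewed before deciding on force-unlock.
# AWS_BIN is a hypothetical override used to exercise the function
# without real AWS access.
show_lock() {
  local table="$1" bucket="$2" key="$3"
  "${AWS_BIN:-aws}" dynamodb get-item \
    --table-name "$table" \
    --key "{\"LockID\":{\"S\":\"$(lock_id_for "$bucket" "$key")\"}}"
}

# Example usage (commented out so that sourcing this file runs nothing):
# show_lock terraform-state-locks-prod fintech-corp-terraform-state-prod \
#   infrastructure/payments-vpc/terraform.tfstate
# terraform apply -input=false -lock-timeout=5m
```

An empty response means no lock is held. If an item is returned, the ID inside its Info attribute is the value that terraform force-unlock expects.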
Code Example
# Backend configuration with S3 state locking via DynamoDB
terraform {
# Define the S3 backend for remote state storage
backend "s3" {
# S3 bucket storing the payments platform state files
bucket = "fintech-corp-terraform-state-prod"
# State file path scoped to the payments VPC workspace
key = "infrastructure/payments-vpc/terraform.tfstate"
# AWS region where the state bucket resides
region = "us-east-1"
# DynamoDB table that manages state locks
dynamodb_table = "terraform-state-locks-prod"
# Enable server-side encryption for state at rest
encrypt = true
# Use the shared infrastructure AWS profile
profile = "fintech-infra-admin"
}
}
# DynamoDB table resource for state locking (provisioned separately)
resource "aws_dynamodb_table" "terraform_locks" {
# Table name matching the backend configuration reference
name = "terraform-state-locks-prod"
# Pay-per-request to avoid capacity planning for lock operations
billing_mode = "PAY_PER_REQUEST"
# LockID is the required partition key for Terraform state locks
hash_key = "LockID"
# Define the LockID attribute as a string type
attribute {
name = "LockID"
type = "S"
}
# Tag for cost allocation and ownership tracking
tags = {
Team = "platform-engineering"
Environment = "production"
ManagedBy = "terraform-bootstrap"
}
}
# Example: force-unlock command when a CI pipeline crashes
# terraform force-unlock 2b6a6738-5ef0-7c20-a036-48eb6273784f
◈ Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│                Terraform State Locking Flow                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐   Lock Request    ┌──────────────────┐    │
│  │ Operator A   │──────────────────→│ DynamoDB Table   │    │
│  │ terraform    │  LockID: abc123   │ terraform-state  │    │
│  │ apply        │←──────────────────│ -locks-prod      │    │
│  └──────────────┘   Lock Acquired   └──────────────────┘    │
│         │                                    ↑              │
│         │ Write State                        │ Lock Request │
│         ↓                                    │ (BLOCKED)    │
│  ┌──────────────┐                   ┌──────────────────┐    │
│  │ S3 Bucket    │                   │ Operator B       │    │
│  │ fintech-corp │                   │ terraform apply  │    │
│  │ -terraform   │                   │ (waiting...)     │    │
│  │ -state-prod  │                   └──────────────────┘    │
│  └──────────────┘                                           │
│         │                                                   │
│         │ Apply Complete                                    │
│         ↓                                                   │
│  ┌──────────────┐  Unlock Request   ┌──────────────────┐    │
│  │ Operator A   │──────────────────→│ DynamoDB Table   │    │
│  │ (finished)   │   Delete LockID   │ (lock released)  │    │
│  └──────────────┘                   └──────────────────┘    │
│                                              │              │
│                                              ↓              │
│                                     ┌──────────────────┐    │
│                                     │ Operator B       │    │
│                                     │ Lock Acquired    │    │
│                                     │ (proceeds)       │    │
│                                     └──────────────────┘    │
└─────────────────────────────────────────────────────────────┘
Quick Answer
Terraform plan detects drift by reading the current state of every managed resource via provider API calls and comparing the result against the state file. Any difference is drift. However, it only checks attributes it manages, cannot detect resources created out-of-band that are not in state, and depends on providers reporting attributes accurately, which not all of them do.
Detailed Answer
Terraform plan's drift detection works through a refresh-then-diff process. Think of it like an inventory audit: Terraform reads the last known inventory (state file), physically checks every item in the warehouse (API calls to cloud providers), updates the inventory with actual findings (state refresh), and then compares the updated inventory against the blueprint (configuration). Any discrepancies between the refreshed state and the desired configuration become the plan. The refresh phase is where drift detection happens. For every resource tracked in the state file, Terraform calls the provider's ReadResource RPC method, which translates to cloud API calls. For an aws_rds_cluster.payments_db, this triggers a DescribeDBClusters API call. The provider compares the API response against the state file's recorded attributes. If the production database's backup_retention_period was changed from 30 to 7 via the AWS console, the refresh detects this as drift and updates the in-memory state. After refresh, Terraform diffs the refreshed state against the configuration. If your configuration says backup_retention_period = 30 but the refreshed state shows 7, the plan proposes changing it back to 30. This is Terraform's self-healing property: it converges actual infrastructure toward the declared configuration. However, the limitations are significant and often misunderstood in production. First, Terraform only detects drift on resources it manages. If someone creates an additional security group rule via the AWS console that is not in Terraform's state, Terraform has no knowledge of it. This is the 'unknown unknowns' problem: Terraform cannot detect resources it does not track. Second, not all providers report all attributes during refresh. Some cloud APIs return partial data, or certain attributes are write-only (like passwords). 
The AWS provider, for example, cannot detect drift on certain IAM policy document orderings because the API returns a canonicalized version that may not match the original. Third, the refresh phase can be slow and expensive. In a large infrastructure with thousands of resources, the refresh makes thousands of API calls, which can hit rate limits and take tens of minutes. The -refresh=false flag skips the refresh for faster plans, but this trades drift detection for speed. Fourth, eventual consistency in cloud APIs can cause false drift detection. After an AWS resource is created, the API may return stale data for seconds or minutes. Running plan immediately after apply can show phantom drift that resolves itself. Fifth, Terraform cannot detect drift on resource dependencies that are not explicitly modeled. If a VPC peering connection's route table was modified outside Terraform but the peering resource itself was not, Terraform might not detect the functional impact. Tools like AWS Config, CloudTrail-based drift detection, or driftctl (now part of Snyk) fill these gaps by scanning entire accounts for unmanaged resources.
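One way to reduce false alerts from the eventual-consistency problem described above is to re-check before reporting. The sketch below re-runs a check command a few times and only reports a non-zero status if it persists. The function name confirm_drift and its arguments are assumptions for illustration; the attempt count and pause are placeholders to tune, not recommended values.

```shell
# Re-run a check command until it exits 0 or the attempts run out.
# Returns the last exit status, so a persistent 2 from
# "terraform plan -detailed-exitcode" still signals drift.
confirm_drift() {
  local attempts="$1" pause="$2" rc=0 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
      return 0
    fi
    # Pause only between attempts, not after the final one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$pause"
    fi
    i=$((i + 1))
  done
  return "$rc"
}

# Example usage (commented out so that sourcing this file runs nothing):
# confirm_drift 3 60 terraform -chdir=infrastructure/envs/prod \
#   plan -refresh-only -detailed-exitcode -input=false
```

The trade-off is latency: each retry repeats the full refresh, so on large states this multiplies API calls and can itself run into the rate limits mentioned above.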
Code Example
# Demonstrating drift detection behavior with refresh configuration
# Backend configuration for the payments infrastructure state
terraform {
# Required Terraform version for refresh-only plan support
required_version = ">= 1.5.0"
# S3 backend with state locking for the payments platform
backend "s3" {
# State bucket for the production payments infrastructure
bucket = "fintech-corp-terraform-state-prod"
# State file path for the payments database workspace
key = "payments-database/terraform.tfstate"
# Primary region for state storage
region = "us-east-1"
# Lock table to prevent concurrent modifications
dynamodb_table = "terraform-state-locks-prod"
}
}
# RDS cluster that we want to detect drift on
resource "aws_rds_cluster" "payments_db" {
# Cluster identifier for the payments transaction database
cluster_identifier = "payments-db-production"
# Aurora PostgreSQL engine for transaction processing
engine = "aurora-postgresql"
# Engine version validated by the DBA team
engine_version = "15.4"
# Backup retention: 30 days for PCI compliance
# If someone changes this via console, plan will detect drift
backup_retention_period = 30
# Deletion protection must stay enabled in production
deletion_protection = true
# Preferred maintenance window outside peak transaction hours
preferred_maintenance_window = "sun:03:00-sun:04:00"
}
# Lifecycle rule to ignore drift on specific attributes
resource "aws_autoscaling_group" "payments_api_fleet" {
# ASG name following the organization convention
name = "payments-api-fleet-production"
# Desired capacity managed by autoscaling policies, not Terraform
desired_capacity = 6
# Minimum instances for baseline transaction processing
min_size = 3
# Maximum instances during peak shopping events
max_size = 24
# Launch template for the payments API container hosts
launch_template {
# Reference the payments API launch template
id = aws_launch_template.payments_api.id
# Always use the latest validated AMI version
version = "$Latest"
}
# Ignore drift on desired_capacity because autoscaling changes it
lifecycle {
# Prevent Terraform from reverting autoscaler decisions
ignore_changes = [desired_capacity]
}
}
# Refresh-only plan command to detect drift without proposing changes
# terraform plan -refresh-only -out=drift-report.tfplan
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐ │ Terraform Plan Drift Detection Flow │ ├───────────────────────────────────────────────────────────────┤ │ │ │ Phase 1: Refresh (Drift Detection) │ │ ┌──────────────┐ ReadResource ┌──────────────────┐ │ │ │ State File │ RPC calls │ Cloud Provider │ │ │ │ │──────────────────→│ APIs │ │ │ │ payments_db: │ │ │ │ │ │ retention=30 │←──────────────────│ Actual: ret=7 │ │ │ │ │ API Response │ (console change) │ │ │ └──────┬───────┘ └──────────────────┘ │ │ │ │ │ │ Update in-memory state │ │ ↓ │ │ ┌──────────────┐ │ │ │ Refreshed │ │ │ │ State │ payments_db: retention=7 (drift detected) │ │ └──────┬───────┘ │ │ │ │ │ Phase 2: Diff (Plan Generation) │ │ │ │ │ ↓ │ │ ┌──────────────┐ Compare ┌──────────────────┐ │ │ │ Refreshed │─────────────→│ Configuration │ │ │ │ State │ │ (main.tf) │ │ │ │ retention=7 │ │ retention=30 │ │ │ └──────────────┘ └──────────────────┘ │ │ │ │ │ ↓ │ │ ┌───────────────────────────────────────────────────┐ │ │ │ Plan Output: │ │ │ │ ~ aws_rds_cluster.payments_db │ │ │ │ ~ backup_retention_period: 7 → 30 │ │ │ │ (drift will be corrected on apply) │ │ │ └───────────────────────────────────────────────────┘ │ │ │ │ Limitations: │ │ ┌────────────────────────────────────────────────┐ │ │ │ ✗ Cannot detect unmanaged resources │ │ │ │ ✗ Write-only attributes invisible to refresh │ │ │ │ ✗ Eventual consistency → false drift │ │ │ │ ✗ Rate limits slow large-scale refresh │ │ │ └────────────────────────────────────────────────┘ │ └───────────────────────────────────────────────────────────────┘
Quick Answer
Implement a CI/CD pipeline that runs terraform plan on pull requests, posts the plan output as a PR comment for review, requires manual approval before apply, and uses remote state locking to prevent concurrent operations. Use OIDC authentication, separate plan and apply stages, and implement policy-as-code gates with Sentinel or OPA.
Detailed Answer
A production-grade Terraform CI/CD pipeline must solve five problems: authentication without long-lived credentials, plan visibility for reviewers, approval gates before destructive changes, concurrency control to prevent conflicting applies, and policy enforcement to catch compliance violations before they reach infrastructure. Think of it like a surgical operation workflow: the surgeon (engineer) proposes an operation (plan), it gets reviewed by a board (PR review), approved by an authority (manual approval gate), and only then executed in a controlled environment (apply) with safeguards (state locking). Authentication should use OIDC federation. GitHub Actions, GitLab CI, and CircleCI all support OIDC tokens that can be exchanged for short-lived AWS credentials via STS AssumeRoleWithWebIdentity. This eliminates the need to store AWS access keys as CI secrets, which is a common audit finding. The OIDC trust policy should be scoped to specific repositories and branches to prevent unauthorized access. The pipeline structure typically has three stages. The first stage runs on every pull request: terraform init, terraform validate, terraform fmt -check, and terraform plan -out=plan.tfplan. The plan output is captured and posted as a PR comment using a tool like tfcmt or a custom script that parses the plan JSON output. This gives reviewers visibility into exactly what will change. The second stage is the approval gate. For non-production environments, this might be automatic after PR merge. For production, it requires explicit manual approval. In GitHub Actions, this is implemented using environments with required reviewers. In GitLab, it is a manual job gate. The approval should be from someone other than the PR author (four-eyes principle) and ideally from a platform engineering team member who understands the blast radius. The third stage runs terraform apply using the saved plan file. 
This is critical: apply the saved plan rather than generating a fresh one at apply time, because infrastructure or code may have changed between the plan and apply stages and a fresh plan could contain changes that nobody reviewed. The saved plan file ensures exactly what was reviewed gets applied; if the state has changed since the plan was created, Terraform rejects the plan as stale and the pipeline must re-plan and go back through review. After apply, the pipeline should post the apply output back to the PR or a Slack channel for visibility. Policy-as-code adds guardrails. HashiCorp Sentinel (Terraform Cloud/Enterprise) or Open Policy Agent (open source) evaluate the plan against organizational policies: no public S3 buckets, all RDS instances must have encryption, all security groups must have descriptions. These checks run after plan but before approval, catching violations early. Production gotchas include handling plan file expiration (plan files are tied to specific provider plugin versions and to the state they were created from, so they become stale when state changes), managing workspace-level parallelism (only one pipeline should operate on a workspace at a time), and dealing with long-running applies that exceed CI timeout limits. Some teams implement a Terraform-specific lock in Redis or DynamoDB beyond the state lock, to queue pipeline runs at the workspace level.
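The saved-plan rule in the third stage can be enforced with a small guard in the apply job. The sketch below refuses to run if the plan file is missing or empty, instead of silently falling back to a fresh plan. The function name apply_saved_plan, the return code 3, and the TERRAFORM_BIN override are assumptions made for this example; how the plan file reaches the apply job (for instance as a build artifact) is left to the pipeline.

```shell
# Apply only a previously saved plan file; never fall back to a fresh plan.
# With -chdir, the plan file path is resolved relative to the given directory.
# TERRAFORM_BIN is a hypothetical override so the guard can be exercised
# without a real Terraform installation.
apply_saved_plan() {
  local dir="$1" plan_file="$2"
  if [ ! -s "$dir/$plan_file" ]; then
    echo "no saved plan at $dir/$plan_file; refusing to apply" >&2
    return 3
  fi
  "${TERRAFORM_BIN:-terraform}" -chdir="$dir" apply -input=false "$plan_file"
}

# Example usage (commented out so that sourcing this file runs nothing):
# apply_saved_plan infrastructure/payments payments.tfplan
```

No -auto-approve is needed here because applying a saved plan file does not prompt for confirmation. If the state changed after the plan was created, Terraform itself rejects the stale plan, which is the behavior the review process relies on.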
Code Example
# GitHub Actions workflow for Terraform CI/CD with approval gates
# File: .github/workflows/terraform-payments-infra.yml
name: Payments Infrastructure Terraform Pipeline
# Trigger on pull requests targeting the main branch
on:
pull_request:
# Only run when infrastructure code changes
paths:
- 'infrastructure/payments/**'
push:
branches:
- main
paths:
- 'infrastructure/payments/**'
# OIDC token permissions for AWS authentication
permissions:
# Allow requesting OIDC JWT tokens from GitHub
id-token: write
# Allow posting plan output as PR comments
pull-requests: write
# Allow reading repository contents
contents: read
# Prevent concurrent runs on the same branch/PR
concurrency:
# Group by workflow name and PR number or branch
group: terraform-payments-${{ github.event.pull_request.number || github.ref }}
# Cancel in-progress plan runs but never cancel apply
cancel-in-progress: ${{ github.event_name == 'pull_request' }}
jobs:
# Plan stage: runs on every pull request
terraform-plan:
# Use the latest Ubuntu runner for consistency
runs-on: ubuntu-latest
# Only run plan on pull requests, not on merge
if: github.event_name == 'pull_request'
steps:
# Checkout the payments infrastructure code
- uses: actions/checkout@v4
# Configure AWS credentials via OIDC federation
- uses: aws-actions/configure-aws-credentials@v4
with:
# OIDC role scoped to this repository and branch
role-to-assume: arn:aws:iam::111111111111:role/GitHubActions-TerraformPlan
# Region for API calls and state backend
aws-region: us-east-1
# Install the pinned Terraform version
- uses: hashicorp/setup-terraform@v3
with:
# Version locked to match team standard
terraform_version: 1.7.4
# Initialize Terraform with backend configuration
- name: Terraform Init
# Run init in the payments infrastructure directory
run: terraform init -input=false
working-directory: infrastructure/payments
# Run format check to enforce code style
- name: Terraform Format Check
# Fail the pipeline if code is not formatted
run: terraform fmt -check -recursive
working-directory: infrastructure/payments
# Generate the execution plan and save to file
- name: Terraform Plan
# Save plan to file so the PR comment shows exactly this plan
run: terraform plan -input=false -out=payments.tfplan
working-directory: infrastructure/payments
# Post plan output as a PR comment for reviewers
- name: Post Plan to PR
# Use tfcmt for formatted plan comments
run: tfcmt plan -- terraform show payments.tfplan
working-directory: infrastructure/payments
# Apply stage: runs after merge with manual approval
terraform-apply:
# Use the latest Ubuntu runner for consistency
runs-on: ubuntu-latest
# Only run apply on push to main (after PR merge)
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
# Require manual approval from the platform-admins team
environment: production-payments
steps:
# Checkout the merged infrastructure code
- uses: actions/checkout@v4
# Configure AWS credentials with apply permissions
- uses: aws-actions/configure-aws-credentials@v4
with:
# Apply role has write permissions to production
role-to-assume: arn:aws:iam::111111111111:role/GitHubActions-TerraformApply
# Same region as the plan stage
aws-region: us-east-1
# Install the same Terraform version as plan stage
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: 1.7.4
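# Initialize and apply the configuration
- name: Terraform Init and Apply
# Auto-approve because approval happened via GitHub environment
# Note: this step re-plans against the merged code; to apply exactly the reviewed plan,
# pass the saved plan file from the plan stage to terraform apply instead
run: |
terraform init -input=false
terraform apply -input=false -auto-approve
working-directory: infrastructure/payments
◈ Architecture Diagram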
┌──────────────────────────────────────────────────┐
│ Terraform CI/CD Pipeline with Approval Gates     │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│ Engineer opens PR                                │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Stage 1: Plan (on PR)                            │
│ init → validate → fmt check → plan               │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Policy Check (OPA / Sentinel)                    │
│ - No public S3 buckets                           │
│ - All RDS encrypted                              │
│ - Security groups have descriptions              │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Plan posted as PR comment (tfcmt)                │
│ Reviewer examines changes                        │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Stage 2: Manual Approval                         │
│ (GitHub Environment: production-payments)        │
│ Required reviewers: platform-admins team         │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Stage 3: Apply (on merge to main)                │
│ init → apply -auto-approve                       │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Notify: Slack #payments-infra-changes            │
└──────────────────────────────────────────────────┘
Quick Answer
Handle state corruption by first enabling S3 versioning for automatic backups, then recovering using terraform state pull to inspect the corrupted state, restoring from a previous version, or as a last resort using terraform import to rebuild state from existing infrastructure. Always maintain state backups and test recovery procedures regularly.
Detailed Answer
Terraform state corruption is one of the most dangerous operational incidents because the state file is the single source of truth mapping your configuration to real infrastructure. Think of it like a hospital's patient registry: if the registry is corrupted, you do not know which patient is in which room, and any action you take risks harming the wrong patient. State corruption can cause Terraform to plan the destruction of resources whose state entries are wrong, create duplicates of resources it cannot find, or fail entirely with cryptic deserialization errors.
State corruption occurs in several ways. The most common is a partial write during terraform apply: if the process is killed mid-apply (OOM killer, network interruption, CI timeout), the state file may contain a partially updated resource. Another common cause is manual state file editing with JSON syntax errors. Concurrent writes without proper locking can interleave state updates. Provider bugs can write malformed attribute values. And rarely, S3 eventual consistency (in older S3 behavior) could serve a stale state version.
The first line of defense is prevention. Enable S3 versioning on your state bucket so every state write creates a new version, giving you point-in-time recovery. Enable MFA delete on the bucket to prevent accidental or malicious version deletion. Use DynamoDB state locking to prevent concurrent writes. Run terraform plan before apply to catch state inconsistencies early. Implement CI/CD pipelines that prevent direct state manipulation.
When corruption occurs, follow a structured recovery procedure. First, immediately stop all Terraform operations on the affected workspace. If your CI/CD pipeline has queued runs, cancel them. This prevents compounding the corruption.
Second, assess the damage. Run terraform state pull to download the current state file and inspect it. Check the JSON structure: is it valid JSON? Are the serial number and lineage fields present? Use jq to examine specific resources. Compare the state against your actual infrastructure using AWS CLI or console.
Third, attempt recovery from backup. If S3 versioning is enabled, list previous versions with aws s3api list-object-versions, download a known-good version, and push it back using terraform state push. Before writing, state push checks the file against the backend's current state: it refuses the push if the lineage differs or if the backend's serial is higher than the serial in the file. A restored older version normally has a lower serial, so raise the serial in the file above the backend's current value before pushing. Be careful with the -force flag: it skips both checks, including the lineage check that prevents pushing state from the wrong workspace.
Fourth, if no backup is available, perform selective state surgery. Use terraform state rm to remove the corrupted resource entries, then terraform import to re-import the existing infrastructure resources. This is tedious for large configurations but preserves the non-corrupted portions of state. For each imported resource, run terraform plan to verify the import produced a state entry that matches your configuration.
Fifth, as a last resort for total state loss, you can rebuild the entire state from scratch using terraform import for every resource. This is the infrastructure equivalent of a bare-metal restore. Tools like terraformer and former2 can help by scanning your AWS account and generating Terraform configuration for what they find, but their output requires careful validation.
After recovery, conduct a post-mortem. Implement additional safeguards: S3 bucket replication to a separate account for disaster recovery, automated state backup jobs that copy state to a different storage system, monitoring on state file size and serial number for anomaly detection, and regular recovery drills where you practice restoring state from backup in a non-production workspace.
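The serial check in the third step can be handled with a small helper. This is a minimal sketch, assuming jq is installed and that the state pulled from the backend is still valid JSON; the function names and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for restoring an older state version.
# Assumes jq is installed; file names are placeholders.

# Print a serial that is one higher than the backend's current serial.
next_serial() {
  echo $(( $1 + 1 ))
}

# Emit the recovered state with its serial raised above the backend's.
#   $1 = known-good state downloaded from S3 versioning
#   $2 = state pulled from the backend before the restore
bump_serial() {
  current=$(jq -r '.serial' "$2") || return 1
  jq --argjson s "$(next_serial "$current")" '.serial = $s' "$1"
}

# Typical use (commented out so nothing runs on load):
#   terraform state pull > corrupted-state-backup.json
#   bump_serial recovered-state.json corrupted-state-backup.json > restore.json
#   terraform state push restore.json
```

If the pulled file cannot be parsed at all, bump_serial stops with an error rather than guessing; in that case the current serial has to be read from another source before deciding what to push.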
Code Example
# State corruption recovery playbook
# Step 1: Pull and inspect the corrupted state
# Download the current state file for local inspection
# terraform state pull > corrupted-state-backup.json
# Step 2: Check S3 versioning for previous good state
# List all versions of the payments state file
# aws s3api list-object-versions \
# --bucket fintech-corp-terraform-state-prod \
# --prefix payments-database/terraform.tfstate \
# --query 'Versions[*].{VersionId:VersionId,Modified:LastModified,Size:Size}'
# Step 3: Download a known-good state version
# aws s3api get-object \
# --bucket fintech-corp-terraform-state-prod \
# --key payments-database/terraform.tfstate \
# --version-id "abc123def456" \
# recovered-state.json
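# Step 4: Push the recovered state back
# state push rejects the file if the backend's serial is higher,
# so raise "serial" in recovered-state.json above the current value first
# terraform state push recovered-state.json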
# Prevention: S3 bucket with versioning and replication
resource "aws_s3_bucket" "terraform_state" {
# Bucket name following the organization naming convention
bucket = "fintech-corp-terraform-state-prod"
# Prevent accidental deletion of the state bucket
force_destroy = false
# Tags for ownership and cost allocation
tags = {
Team = "platform-engineering"
Purpose = "terraform-state-storage"
Criticality = "critical"
}
}
# Enable versioning for point-in-time state recovery
resource "aws_s3_bucket_versioning" "terraform_state" {
# Reference the state storage bucket
bucket = aws_s3_bucket.terraform_state.id
# Enable versioning to retain all state file versions
versioning_configuration {
# Enabled status ensures every write creates a new version
status = "Enabled"
}
}
# Server-side encryption for state files at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
# Reference the state storage bucket
bucket = aws_s3_bucket.terraform_state.id
# Encryption rule using AWS KMS for audit trail
rule {
# Apply encryption to all new state file versions
apply_server_side_encryption_by_default {
# Use a dedicated KMS key for state encryption
sse_algorithm = "aws:kms"
# KMS key managed by the platform engineering team
kms_master_key_id = aws_kms_key.terraform_state_encryption.arn
}
# Use an S3 Bucket Key to reduce the number of KMS requests
bucket_key_enabled = true
}
}
# Cross-region replication for disaster recovery
resource "aws_s3_bucket_replication_configuration" "terraform_state_dr" {
# Source bucket for replication
bucket = aws_s3_bucket.terraform_state.id
# IAM role with permissions to replicate objects
role = aws_iam_role.terraform_state_replication.arn
# Replication rule for all state files
rule {
# Unique identifier for the replication rule
id = "state-dr-replication"
# Enable the replication rule
status = "Enabled"
# Replicate to a bucket in a different region
destination {
# DR bucket in us-west-2 for geographic redundancy
bucket = aws_s3_bucket.terraform_state_dr.arn
# Store replicas in Infrequent Access to reduce DR storage cost
storage_class = "STANDARD_IA"
}
}
}
# Lifecycle policy to manage state version retention
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
# Reference the state storage bucket
bucket = aws_s3_bucket.terraform_state.id
# Rule to manage old state file versions
rule {
# Unique identifier for the lifecycle rule
id = "state-version-retention"
# Enable the lifecycle rule
status = "Enabled"
# Transition old versions to cheaper storage after 30 days
noncurrent_version_transition {
# Move to Glacier after 30 days for cost savings
noncurrent_days = 30
# Glacier storage for long-term state version retention
storage_class = "GLACIER"
}
# Delete very old versions after 365 days
noncurrent_version_expiration {
# Retain versions for one year for compliance audits
noncurrent_days = 365
}
}
}
# Import command for rebuilding state (example)
# terraform import aws_rds_cluster.payments_db payments-db-production
# terraform import aws_vpc.payments_network vpc-0a1b2c3d4e5f67890
# terraform import aws_security_group.payments_api_ingress sg-0a1b2c3d4e5f67890
◈ Architecture Diagram
┌──────────────────────────────────────────────────┐
│ Terraform State Recovery Workflow                │
└──────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────┐
│ Corruption Detected!                             │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 1: Stop All Operations                      │
│ Cancel CI/CD queued runs                         │
│ Notify platform-engineering team                 │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 2: Assess Damage                            │
│ terraform state pull > corrupted-backup.json     │
│ jq '.resources | length' corrupted-backup.json   │
│ Compare against AWS CLI inventory                │
└───────────┬──────────────────────────┬───────────┘
            ↓                          ↓
┌───────────────────────┐  ┌───────────────────────┐
│ S3 version available  │  │ No backup available   │
│ Step 3A: Restore      │  │ Step 3B: Rebuild      │
│ from S3 version       │  │ from scratch          │
│ aws s3api get-object  │  │ terraform state rm    │
│   --version-id        │  │ terraform import      │
│ terraform state push  │  │   (each resource)     │
└───────────┬───────────┘  └───────────┬───────────┘
            └────────────┬─────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 4: Validate Recovery                        │
│ terraform plan (expect no changes)               │
│ Verify resource count matches cloud inventory    │
└────────────────────────┬─────────────────────────┘
                         ↓
┌──────────────────────────────────────────────────┐
│ Step 5: Post-Mortem                              │
│ Document root cause                              │
│ Implement additional safeguards                  │
│ Schedule recovery drill for next quarter         │
└──────────────────────────────────────────────────┘
Quick Answer
Remote state gives a team a shared, durable view of managed infrastructure, while state locking prevents two Terraform runs from changing the same state at the same time. If state is split around convenience instead of ownership and dependency boundaries, teams create hidden coupling, stale outputs, and unsafe apply ordering.
Detailed Answer
Think of Terraform state like the official city property registry. If two clerks update the same parcel records at the same time, one may overwrite the other's change and the registry becomes untrustworthy. Remote state puts the registry in a shared office instead of on one clerk's laptop, and locking is the sign on the desk that says one clerk is actively editing this registry right now. Without that sign, production changes become a race.
In Terraform, state maps configuration resources to real infrastructure objects and stores attributes Terraform needs for planning. Local state is simple for one person, but it breaks down for teams because every operator needs the newest file and must coordinate manually. Remote backends such as HCP Terraform, S3 with locking, Azure Blob, or GCS move state to a shared system. Locking prevents concurrent plans or applies from corrupting the same state or producing plans from stale assumptions.
The internal flow is: Terraform loads configuration, initializes the backend, reads the latest state, attempts to acquire a lock, refreshes resource data through providers, builds a dependency graph, produces a plan, and applies changes if approved. During apply, state is updated as resources change. If the process crashes, the lock may need cleanup, but force-unlocking should be treated like removing a warning tag from dangerous machinery: only do it after confirming no run is active.
At scale, teams split state to reduce blast radius, improve performance, and align ownership. A network platform state might own VPCs, subnets, and transit gateways, while an application state consumes published subnet IDs. The dangerous split is by file size or folder convenience rather than lifecycle. If an app state creates security groups that the network state mutates, or two states manage different arguments on the same resource, Terraform cannot reason globally and drift becomes normal.
The non-obvious gotcha is that remote state outputs are not a service discovery system. They expose snapshots from another state, and consumers may act on outputs that are valid syntactically but operationally stale. Senior engineers prefer stable contracts, narrow outputs, versioned modules, explicit ownership, and separate data stores for values consumed by non-Terraform systems. They also design emergency unlock procedures, backend access controls, and audit trails before the first production incident.
Code Example
# Initialize the shared production backend instead of using local state.
terraform init -backend-config=env/prod.s3backend
# Create an auditable plan from the latest locked remote state.
terraform plan -out=prod.tfplan
# Apply exactly the reviewed plan while Terraform holds the backend lock.
terraform apply prod.tfplan
# Remove a stale lock only after confirming no apply is still running.
terraform force-unlock LOCK_ID
# List resources owned by this state so ownership boundaries can be reviewed.
terraform state list
◈ Architecture Diagram
┌──────────┐
│ Engineer │
└────┬─────┘
     ↓
┌──────────┐
│ Backend  │
└────┬─────┘
     ↓ lock
┌──────────┐
│  State   │
└────┬─────┘
     ↓ graph
┌──────────┐
│   Plan   │
└────┬─────┘
     ↓ apply
┌──────────┐
│  Cloud   │
└──────────┘
Quick Answer
At scale, Terraform state must be stored in remote backends like S3 with DynamoDB locking or Terraform Cloud, split into small blast-radius units by domain or environment, and isolated via workspaces or directory structure. State locking prevents concurrent applies from corrupting state, and state splitting ensures a single terraform apply cannot accidentally destroy unrelated infrastructure.
Detailed Answer
Think of a hospital records system. If every department writes to one giant patient file simultaneously, records get corrupted and the wrong medication gets administered. Splitting records by department, locking each file during edits, and storing everything in a central secure archive prevents these disasters. Terraform state management works the same way: it is the record of what infrastructure exists, and mismanaging it causes outages.
Terraform state is a JSON file that maps every resource in your configuration to a real cloud object. When terraform plan runs, it reads the state to determine what exists, compares it to the desired configuration, and calculates the diff. If two engineers run terraform apply simultaneously against the same state, one overwrites the other's changes, causing state corruption where Terraform's view of the world no longer matches reality. Remote backends solve storage and collaboration: S3 stores the state file durably, DynamoDB provides a lock table so only one operation can modify state at a time, and versioning on the S3 bucket enables recovery from bad applies.
Internally, when terraform apply starts, it sends a Lock request to the backend. For S3+DynamoDB, this writes a lock record to the DynamoDB table with a unique ID, the user's identity, and a timestamp. If another process already holds the lock, Terraform exits with an error by default, or keeps retrying for the duration given by -lock-timeout. After the apply completes, Terraform writes the updated state to S3 and releases the lock. If a process crashes mid-apply, the lock record does not expire on its own; it remains until someone releases it with terraform force-unlock. Terraform Cloud handles locking internally and adds run queues so multiple plans can exist but only one apply executes at a time per workspace.
At production scale, the critical architectural decision is state splitting. A monolithic state file containing the VPC, databases, Kubernetes clusters, DNS records, and application services means a single terraform apply can accidentally destroy the database while updating a DNS record. The recommended pattern is splitting state by blast radius: network foundations in one state, data layer in another, compute in another, and application configurations in their own states. Each state has its own backend configuration and can use terraform_remote_state data sources or outputs stored in SSM Parameter Store to share values. Workspaces can further separate environments (dev, staging, production) within the same configuration, but they should not be used as a substitute for proper state splitting, because all workspaces in a configuration share the same codebase, backend, and permissions.
The non-obvious gotcha is that terraform_remote_state creates a hard coupling between states, and if the upstream state is corrupted or the output names change, downstream plans break. Many mature teams replace terraform_remote_state with data sources that look up infrastructure by tags or names, or they store shared values in AWS SSM Parameter Store or HashiCorp Consul, which decouples state files completely. Another trap is that S3 bucket versioning alone does not stop someone with sufficient permissions from permanently deleting object versions; teams must also enable MFA Delete or use S3 Object Lock in regulated environments.
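One way to give each layer its own state is to keep a single bucket and vary only the key at init time. The sketch below assumes each layer's backend "s3" block leaves key unset (partial configuration) and uses a hypothetical team/layer naming scheme; the function names and directory paths are illustrative only.

```shell
#!/bin/sh
# Hypothetical naming scheme: <team>/<layer>/terraform.tfstate

# Build the state key for one team and layer.
state_key() {
  printf '%s/%s/terraform.tfstate\n' "$1" "$2"
}

# Initialize one layer's working directory against its own state key.
#   $1 = team, $2 = layer, $3 = directory holding that layer's configuration
init_layer() {
  terraform -chdir="$3" init -backend-config="key=$(state_key "$1" "$2")"
}

# Typical use (commented out so nothing runs on load):
#   init_layer payments data-layer ./data-layer
#   init_layer payments compute ./compute
```

Because every layer is initialized separately, a plan or apply in one directory only ever loads that layer's state, which is what limits the blast radius.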
Code Example
# backend.tf — Remote backend configuration for the payments data layer
terraform {
# Use S3 as the remote state storage backend
backend "s3" {
# S3 bucket dedicated to Terraform state files
bucket = "company-terraform-state-prod"
# State file path scoped to team and layer
key = "payments/data-layer/terraform.tfstate"
# AWS region for the state bucket
region = "us-east-1"
# DynamoDB table for state locking and consistency checking
dynamodb_table = "terraform-state-locks"
# Enable server-side encryption for state at rest
encrypt = true
# Use a specific KMS key for encryption
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/payments-tf-state-key"
}
}
# Reference outputs from the network layer state without tight coupling
# Using SSM Parameter Store instead of terraform_remote_state
data "aws_ssm_parameter" "vpc_id" {
# Parameter path set by the network layer's terraform apply
name = "/infrastructure/network/vpc-id"
}
data "aws_ssm_parameter" "private_subnet_ids" {
# Comma-separated subnet IDs stored by the network team
name = "/infrastructure/network/private-subnet-ids"
}
# Use the decoupled values in resource configuration
resource "aws_db_subnet_group" "payments" {
# Subnet group name for the payments database
name = "payments-db-subnets"
# Split the comma-separated parameter value into a list
subnet_ids = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
# Tag for operational identification
tags = {
Team = "payments"
Layer = "data"
}
}
# DynamoDB lock table must exist before backend configuration
# This is typically created by a bootstrap state or manually
# aws dynamodb create-table \
# --table-name terraform-state-locks \
# --attribute-definitions AttributeName=LockID,AttributeType=S \
# --key-schema AttributeName=LockID,KeyType=HASH \
# --billing-mode PAY_PER_REQUEST
◈ Architecture Diagram
┌──────────┐
│ tf apply │
└────┬─────┘
     │
┌────┴─────┐
│   Lock   │
│ DynamoDB │
└────┬─────┘
     │
┌────┴─────┐
│   Read   │
│ S3 State │
└────┬─────┘
     │
┌────┴─────┐
│  Apply   │
│ Changes  │
└────┬─────┘
     │
┌────┴─────┐
│  Write   │
│ S3 State │
└────┬─────┘
     │
┌────┴─────┐
│  Unlock  │
└──────────┘
Quick Answer
Running terraform plan -refresh-only detects drift by comparing actual cloud resources against the stored state file without proposing configuration changes. Import blocks bring unmanaged resources under Terraform control declaratively. Moved blocks refactor resource addresses in state without destroying and recreating infrastructure. Together they let architects reconcile drift, adopt existing resources, and restructure code safely.
Detailed Answer
Think of a warehouse inventory system. Drift detection is like a stock audit: you compare what the computer says is on the shelf to what is actually there. Import is like scanning a product that was placed on the shelf without being logged into the system. Move is like changing the shelf label without physically moving the product. All three keep the inventory accurate without throwing anything away.
Infrastructure drift occurs when cloud resources are modified outside Terraform, whether through the console, CLI, another IaC tool, or automated processes like auto-scaling. terraform plan -refresh-only reads the current state of every managed resource from the cloud provider APIs and compares it to the stored state file. It shows what has changed in the real world without proposing any configuration-level changes. This is distinct from a regular terraform plan, which both refreshes state and compares it to the desired configuration. Running refresh-only plans on a schedule helps teams detect unauthorized changes before they cause incidents.
Internally, refresh-only mode calls the same provider Read functions that a normal plan uses, but it stops after updating the in-memory state representation. It shows a diff between the previously stored state and the freshly read state, highlighting attributes that changed externally. If the operator approves the refresh with terraform apply -refresh-only, the state file is updated to match reality without making any infrastructure changes.
Import blocks, introduced in Terraform 1.5, allow declarative imports in configuration files rather than the imperative terraform import CLI command. An import block specifies the cloud resource ID and the resource address it should map to. If the matching resource block has not been written yet, terraform plan -generate-config-out=generated.tf can write a starting configuration for it. Moved blocks tell Terraform that a resource has been renamed or restructured in the configuration (for example, moving from a flat resource to a module or changing a resource's for_each key) so it updates the state address rather than planning a destroy and create.
At production scale, drift detection should be automated. Teams run terraform plan -refresh-only in CI on a daily schedule and alert on any detected drift. The plan output is stored as an artifact for audit trail. Import blocks are essential during brownfield adoption, when a company has existing infrastructure created manually or by CloudFormation and wants to manage it with Terraform. Without import blocks, the alternative is terraform import commands that must be run manually for each resource, which is error-prone and not version-controlled. Moved blocks are critical during refactoring: when a team restructures modules, renames resources for clarity, or converts single resources to for_each collections, moved blocks prevent Terraform from destroying the production database and recreating it.
The non-obvious gotcha with refresh-only is that it only detects drift in resources Terraform already manages; it cannot find resources created outside Terraform. Teams need cloud-native tools like AWS Config or Azure Policy for complete drift coverage. With import blocks, the generated configuration may not match the team's coding standards and needs manual cleanup. With moved blocks, the from address must exactly match the current state address, including module paths and index keys. If a typo means the from address matches nothing in state, the block has no effect and Terraform plans to destroy the existing object and create a new one. Architects should always run terraform plan after adding moved blocks and verify that no destroy/create actions appear.
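The scheduled check described above could be wired into CI roughly as follows. This is a sketch, assuming the documented -detailed-exitcode convention for terraform plan (0 for no changes, 1 for an error, 2 for changes present); how refresh-only plans report through that flag should be confirmed on the Terraform version in use, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical CI helper for a scheduled drift check.

# Map a terraform plan -detailed-exitcode status to a label.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a refresh-only plan, keep the plan file as an audit artifact,
# and succeed only when no drift was reported.
drift_check() {
  terraform plan -refresh-only -detailed-exitcode -input=false -out=drift-report.tfplan
  rc=$?
  status=$(classify_plan_exit "$rc")
  echo "drift check result: $status"
  [ "$status" = "clean" ]
}

# Typical use in a nightly job (commented out so nothing runs on load):
#   drift_check || echo "review drift-report.tfplan"
```

Treating every status other than 0 and 2 as an error keeps a failed plan from being mistaken for a clean one.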
Code Example
# Detect drift on the payments infrastructure without changing anything
terraform plan -refresh-only -out=drift-report.tfplan
# Review the drift report to see what changed externally
terraform show drift-report.tfplan
# Apply the saved refresh-only plan to update state to match reality (no infra changes)
terraform apply drift-report.tfplan
# Import an existing RDS instance that was created manually in the console
# payments-data/main.tf
import {
# Specify the AWS resource ID of the existing database
id = "payments-orders-prod"
# Map it to this Terraform resource address
to = aws_db_instance.orders
}
resource "aws_db_instance" "orders" {
# Identifier matching the existing RDS instance name
identifier = "payments-orders-prod"
# Instance class matching the existing configuration
instance_class = "db.r6g.large"
# Engine matching the existing database
engine = "postgres"
# Engine version matching the existing database
engine_version = "16.3"
# Storage matching the existing allocation
allocated_storage = 500
# Prevent accidental deletion of the production database
deletion_protection = true
# Take a final snapshot if this instance is ever deleted
skip_final_snapshot = false
# Tag for operational identification
tags = {
Team = "payments"
Environment = "prod"
ManagedBy = "terraform"
}
}
# Later refactor: once the resource block above lives inside a module,
# a moved block updates the state address instead of destroying the instance
moved {
# Old address before modularization
from = aws_db_instance.orders
# New address inside the database module
to = module.orders_database.aws_db_instance.this
}
◈ Architecture Diagram
┌──────────┐
│  Cloud   │
│ (actual) │
└────┬─────┘
     │ refresh
┌────┴─────┐
│  State   │
│ (stored) │
└────┬─────┘
     │ compare
┌────┴─────┐
│  Config  │
│ (desired)│
└────┬─────┘
     │
┌────┴─────┐
│   Plan   │
│ (action) │
└──────────┘
Quick Answer
Terraform state is a JSON file that maps your HCL configuration to real-world infrastructure resources. Remote state stores this file in a shared backend like S3 or Terraform Cloud, enabling team collaboration, state locking, and disaster recovery.
Detailed Answer
Terraform state is the backbone of how Terraform understands what infrastructure it manages. Think of it like a warehouse inventory ledger: without it, workers would walk into the warehouse every day not knowing what is already on the shelves, what was ordered, or what needs restocking. The state file (terraform.tfstate) is that ledger. It records every resource Terraform has created, its current attributes, metadata about dependencies, and the mapping between your HCL resource blocks and the actual cloud API objects.
Internally, the state file is a JSON document containing a version number, a serial counter that increments on every write, a lineage UUID that uniquely identifies a state chain, and an array of resource objects. Each resource entry stores the provider, the resource type, the resource name, the mode (managed or data), and the full set of attributes returned by the provider API after creation. When you run terraform plan, Terraform reads this state, calls the cloud APIs to refresh the actual status of each resource, and then computes the diff between desired (your HCL) and actual (the refreshed state). Without state, Terraform would have no way to know that aws_db_instance.payments_db already exists and would try to create a duplicate every time.
Local state works fine for a solo developer experimenting, but it becomes dangerous in production for several reasons. First, if two engineers run terraform apply simultaneously against their own copies of local state, they can create conflicting resources or corrupt state entirely, because there is no shared locking mechanism. Second, if your laptop dies or someone accidentally deletes the state file, you lose the mapping between code and infrastructure, making it extremely difficult to recover. Third, local state may contain sensitive outputs like database passwords or API keys stored in plaintext on a developer workstation.
Remote state solves all of these problems. When you configure a backend like S3 with DynamoDB locking, the state file lives in a durable, versioned object store. DynamoDB provides a distributed lock so that only one terraform apply can run at a time, preventing race conditions. S3 versioning gives you automatic backup of every state revision, so you can roll back if something goes wrong. Additionally, remote state enables the terraform_remote_state data source, which lets one Terraform project read outputs from another: for example, a networking project can export VPC IDs that an application project consumes.
In production, teams typically enforce remote state from day one using a backend configuration block, require encryption at rest and in transit, restrict access via IAM policies, and enable state locking. Ignoring remote state is one of the most common causes of Terraform disasters in growing organizations.
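Enforcing remote state "from day one" can be backed by a simple CI guard. The following is a rough sketch that only checks whether any .tf file in a directory declares a backend block; the function name is hypothetical, the pattern match is an assumption rather than a real HCL parse, and configurations that use a cloud block instead of a backend would need a different check.

```shell
#!/bin/sh
# Hypothetical CI guard: fail when a configuration declares no backend block.

# Succeed if any .tf file in directory $1 contains a line starting a backend block.
has_backend_block() {
  grep -Eqs '^[[:space:]]*backend[[:space:]]+"' "$1"/*.tf
}

# Typical use (commented out so nothing runs on load):
#   has_backend_block ./payments-platform || {
#     echo "no remote backend configured" >&2
#     exit 1
#   }
```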
Code Example
# Configure S3 backend with DynamoDB locking for the payments infrastructure
terraform {
# Use the S3 backend to store state remotely
backend "s3" {
# S3 bucket dedicated to Terraform state files
bucket = "fintech-terraform-state-prod"
# Path within the bucket for this specific project's state
key = "payments-platform/us-east-1/terraform.tfstate"
# AWS region where the S3 bucket lives
region = "us-east-1"
# DynamoDB table used for state locking to prevent concurrent applies
dynamodb_table = "fintech-terraform-locks"
# Enable server-side encryption of the state file at rest
encrypt = true
# Use a specific KMS key for encryption instead of default S3 key
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/payments-state-key"
}
}
# Read remote state from the networking project to get VPC details
data "terraform_remote_state" "networking" {
# Use the S3 backend to read another project's state
backend = "s3"
config = {
# Same state bucket but different key path for the networking project
bucket = "fintech-terraform-state-prod"
# The networking team's state file location
key = "networking/us-east-1/terraform.tfstate"
# Region must match the bucket's region
region = "us-east-1"
}
}
# Use the VPC ID from the networking project's remote state
resource "aws_db_subnet_group" "payments_db_subnets" {
# Name the subnet group after the payments database
name = "payments-db-subnet-group"
# Pull private subnet IDs from the networking project's outputs
subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
# Tag for cost tracking and ownership
tags = {
Team = "payments-backend"
Service = "payments-db"
}
}
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│                  Developer Workstation                   │
│  ┌──────────────┐    ┌──────────────┐                    │
│  │ main.tf      │    │ variables.tf │                    │
│  │ (HCL Config) │    │ (Inputs)     │                    │
│  └──────┬───────┘    └──────┬───────┘                    │
│         │                   │                            │
│         └─────────┬─────────┘                            │
│                   │                                      │
│          ┌────────▼────────┐                             │
│          │ terraform plan  │                             │
│          │ terraform apply │                             │
│          └────────┬────────┘                             │
└───────────────────┼──────────────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │ S3 Backend          │
         │  ┌───────────────┐  │
         │  │ .tfstate file │  │
         │  │ (encrypted)   │  │
         │  └───────────────┘  │
         └──────────┬──────────┘
                    │
         ┌──────────▼──────────┐
         │ DynamoDB Lock       │
         │  ┌───────────────┐  │
         │  │ LockID: hash  │  │
         │  │ Who: engineer │  │
         │  │ Created: ts   │  │
         │  └───────────────┘  │
         └─────────────────────┘
Quick Answer
Terraform plan is a dry-run that shows what changes Terraform would make without modifying any infrastructure, while terraform apply actually executes those changes. Plan reads state and config, computes a diff, and outputs it; apply performs that diff against real cloud APIs.
Detailed Answer
Understanding the difference between plan and apply is fundamental, but the internals reveal much more than just 'one previews, the other executes.' Think of terraform plan like a restaurant printing a receipt before charging your card — it shows exactly what will happen so you can review it. terraform apply is when the charge actually goes through. When you run terraform plan, Terraform performs several steps internally. First, it loads the configuration by parsing all .tf files in the current directory and resolving module sources. Second, it reads the current state file to understand what resources already exist. Third, it performs a state refresh — making API calls to every cloud provider to check the actual status of each managed resource and updating the in-memory state with real attributes. This refresh step is crucial because someone might have manually changed a security group rule outside of Terraform. Fourth, Terraform builds a dependency graph of all resources, computes the diff between desired state (your HCL) and actual state (refreshed), and produces an execution plan showing creates, updates, and destroys with specific attribute changes. The plan output uses a clear notation: + for create, ~ for update in-place, - for destroy, and -/+ for destroy-then-recreate (also called a forced replacement). When you see -/+ next to your production database, that is the moment you should stop and investigate — it means Terraform wants to destroy and recreate that resource, which could mean data loss. terraform apply by default runs a plan first and asks for confirmation before proceeding. Once confirmed, Terraform walks the dependency graph and makes real API calls to create, update, or destroy resources. It processes independent resources in parallel (up to 10 by default, configurable with -parallelism) and sequential resources in dependency order. 
As resource operations complete, Terraform records the results in state, so that even if apply is interrupted midway, the state reflects what was actually created.
A critical production practice is using saved plan files. You run terraform plan -out=tfplan to save the plan to a binary file, review it, and then run terraform apply tfplan. This guarantees that what you reviewed is exactly what gets applied: no re-computation, and no changes from someone else's commit sneaking in between plan and apply. If the state has changed since the plan was saved, Terraform rejects the plan as stale instead of applying it. In CI/CD pipelines this two-stage approach is essential: the plan stage runs in a pull request for review, and the apply stage runs only after merge, using the exact saved plan.
One subtle gotcha: terraform apply without a saved plan re-computes the plan at apply time, so the infrastructure could have changed between when you reviewed the plan output and when apply runs. In fast-moving environments with multiple teams, this gap can cause surprises. Always use saved plans in production workflows.
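Part of the review step between plan and apply can be automated. As a rough sketch, the script below scans the JSON form of a saved plan (as produced by terraform show -json tfplan) for resources whose actions include both "delete" and "create", which is how a replacement is reported. The function name find_replacements is hypothetical, and SAMPLE_PLAN is a hand-written, heavily trimmed stand-in for real output, which has many more fields.

```python
import json


def find_replacements(plan_json):
    """Return the addresses of resources the plan would destroy and recreate."""
    plan = json.loads(plan_json)
    replaced = []
    for resource in plan.get("resource_changes", []):
        actions = resource["change"]["actions"]
        # A replacement lists both actions: ["delete", "create"], or
        # ["create", "delete"] when create_before_destroy is in effect.
        if "delete" in actions and "create" in actions:
            replaced.append(resource["address"])
    return replaced


# Illustrative sample only; not captured from a real Terraform run.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_rds_cluster.payments_db",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_security_group.payments_db_sg",
         "change": {"actions": ["update"]}},
        {"address": "aws_db_subnet_group.payments_db_subnets",
         "change": {"actions": ["no-op"]}},
    ]
})

flagged = find_replacements(SAMPLE_PLAN)
if flagged:
    print("Replacement planned, review before apply:", flagged)
```

A pipeline could fail the plan stage, or require an extra approval, whenever the returned list contains a stateful resource such as a database.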
Code Example
# Step 1: Run plan and save the output to a binary plan file
# The -out flag saves the computed plan for exact replay during apply
# terraform plan -out=payments-deploy-2024-03-15.tfplan
# Step 2: Review the plan output carefully before applying
# Look for any -/+ (destroy and recreate) on stateful resources
# terraform show payments-deploy-2024-03-15.tfplan
# Step 3: Apply the exact saved plan without re-computation
# terraform apply payments-deploy-2024-03-15.tfplan
# In a CI/CD pipeline, the three commands above typically live in a Makefile or CI script
# The Terraform configuration they plan and apply is shown below
# Variable definitions for the payments infrastructure deployment
variable "db_instance_class" {
  # The RDS instance size for the payments database
  description = "Instance class for the payments RDS cluster"
  # Enforce string type to prevent accidental numeric input
  type = string
  # Default to a production-grade instance size
  default = "db.r6g.xlarge"
}
# RDS cluster that plan will evaluate and apply will create/update
resource "aws_rds_cluster" "payments_db" {
  # Unique cluster identifier following the naming convention
  cluster_identifier = "payments-db-prod-us-east-1"
  # Use Aurora PostgreSQL for the payments database engine
  engine = "aurora-postgresql"
  # Pin to a specific engine version to avoid surprise upgrades
  engine_version = "15.4"
  # Place the database in the payments VPC private subnets
  db_subnet_group_name = aws_db_subnet_group.payments_db_subnets.name
  # Use the payments database security group for network access control
  vpc_security_group_ids = [aws_security_group.payments_db_sg.id]
  # Master username for the database administrator account
  master_username = "payments_admin"
  # Pull the password from AWS Secrets Manager, never hardcode
  master_password = data.aws_secretsmanager_secret_version.db_password.secret_string
  # Enable deletion protection to prevent accidental terraform destroy
  deletion_protection = true
  # Skip final snapshot only in dev; always snapshot in prod
  skip_final_snapshot = false
  # Use a fixed final snapshot name. Embedding timestamp() here would
  # change the value on every run, so every plan would report an update
  final_snapshot_identifier = "payments-db-prod-us-east-1-final"
  # Tags for cost allocation and ownership tracking
  tags = {
    Service     = "payments-processing"
    Environment = "production"
    BackupTier  = "critical"
  }
}
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────┐
│  terraform plan                                           │
│                                                           │
│  ┌──────────┐   ┌──────────┐   ┌──────────┐               │
│  │ Load HCL │──→│   Read   │──→│ Refresh  │               │
│  │  Config  │   │  State   │   │ via API  │               │
│  └──────────┘   └──────────┘   └────┬─────┘               │
│                                     │                     │
│                              ┌──────▼──────┐              │
│                              │   Compute   │              │
│                              │    Diff     │              │
│                              └──────┬──────┘              │
│                                     │                     │
│                              ┌──────▼──────┐              │
│                              │ Plan Output │              │
│                              │  + create   │              │
│                              │  ~ update   │              │
│                              │  - destroy  │              │
│                              └──────┬──────┘              │
└─────────────────────────────────────┼─────────────────────┘
                                      │
                            ┌─────────▼──────────┐
                            │  Saved Plan File   │
                            │  (.tfplan binary)  │
                            └─────────┬──────────┘
                                      │
┌─────────────────────────────────────┼─────────────────────┐
│  terraform apply                    │                     │
│                              ┌──────▼──────┐              │
│                              │  Walk Dep   │              │
│                              │    Graph    │              │
│                              └──────┬──────┘              │
│                                     │                     │
│                   ┌────────────────┬┴────────────────┐    │
│             ┌─────▼─────┐   ┌──────▼─────┐  ┌────────▼┐   │
│             │  Create   │   │   Update   │  │ Destroy │   │
│             │ Resources │   │ Resources  │  │ Removed │   │
│             └─────┬─────┘   └──────┬─────┘  └────────┬┘   │
│                   └────────────────┼─────────────────┘    │
│                             ┌──────▼──────┐               │
│                             │ Write State │               │
│                             └─────────────┘               │
└───────────────────────────────────────────────────────────┘