Coinbase

13 interview questions · terraform

How do you design Terraform module governance with a private registry, versioning, and team ownership?

advanced · modules · terraform

Quick Answer

Publish vetted, hardened Terraform modules to a private registry (TFE or Artifactory) with semantic versioning. Each module has an owning team responsible for maintenance, security updates, and documentation. Consumer teams pin module versions and upgrade through a managed process.

Detailed Answer

Think of a private module registry as an internal app store for a bank. Instead of every team building its own RDS setup from scratch (with varying levels of security hardening), the platform team publishes a vetted 'RDS Module' to the internal store. Application teams install it, configure their specific parameters (database name, size), and get encryption, backup, monitoring, and compliance built in. The platform team updates the module when security requirements change, and consumer teams upgrade on their own schedule, just like updating an app on your phone.

A private Terraform module registry serves as the single source of truth for approved infrastructure patterns. In Terraform Enterprise, the private registry is built in: you publish modules from Git repositories that follow the naming convention terraform-<provider>-<name> (like terraform-aws-eks-cluster or terraform-aws-rds-postgresql). In organizations using open-source Terraform, alternatives include JFrog Artifactory's Terraform module repositories, a self-hosted registry that implements the Terraform module registry protocol, or even Git-based module references with version tags. The key principle is that production infrastructure should only use modules from the private registry, never ad-hoc inline resources. Sentinel policies enforce this by checking the module source in every Terraform plan.

Semantic versioning is critical for module governance. Every module follows semver (major.minor.patch): patch versions fix bugs without changing behavior, minor versions add features backward-compatibly, and major versions include breaking changes that require consumer updates. When the platform team updates the RDS module so that it applies a new mandatory tag internally (a minor version bump, because no consumer input changes), consumer teams see the new version in the registry but continue using their pinned version until they are ready to upgrade. When a major version changes the module interface (removing an input variable or changing an output format), consumer teams must explicitly update their code. Version constraints in consumer code (version = "~> 2.0" allows any 2.x release but not 3.0) allow automatic adoption of patches and minor updates while protecting against breaking changes.

Team ownership is formalized through module CODEOWNERS files and documentation. Each module has an owning team listed in the repository's CODEOWNERS file, ensuring that any PR to the module requires review from the owners. The owning team is responsible for:

- Security patching: updating provider versions, fixing CVEs.
- Documentation: a README with usage examples, input/output descriptions, and architecture diagrams.
- Testing: automated tests using Terratest or terraform-compliance that run in CI.
- Deprecation communication: announcing when older versions will lose support.

In a banking organization, module ownership maps to infrastructure domains: the networking team owns the VPC and transit gateway modules, the database team owns the RDS and ElastiCache modules, and the platform team owns the EKS and observability modules.

The module development lifecycle follows a structured process. A team identifies a repeated infrastructure pattern (every team needs an S3 bucket with encryption, versioning, access logging, and lifecycle policies). They build a module, test it with Terratest (creating real infrastructure in a sandbox account, validating it, then destroying it), write documentation, and publish version 1.0.0 to the private registry. Consumer teams adopt the module with a pinned version constraint. When a security requirement changes (for example, a new PCI-DSS control requires S3 Object Lock), the module team releases version 1.1.0 if the new feature can be an optional input, or version 2.0.0 if the feature must be mandatory (a breaking change for consumers not passing the new input). The module team announces the update through internal channels and provides migration guides for major version bumps.

The biggest gotcha is creating modules that are either too opinionated or too flexible. A module that hardcodes the instance type, subnet, and tags is useless because every consumer has different requirements. A module that exposes every single AWS resource argument as an input variable is just a wrapper around the provider with no added value. Good modules encode organizational opinions (encryption is always on, backups are always enabled, monitoring is always configured) while exposing legitimate customization points (instance size, database name, backup retention period).

Another gotcha is orphaned modules: modules published to the registry but never updated, with no clear owner. Implement a quarterly module health review where each module is checked for dependency updates, provider compatibility, and active ownership. Deprecate modules that are no longer maintained, with clear migration paths to replacements.
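
To make the constraint syntax concrete, here is a minimal sketch of three pinning styles a consumer team might choose between. It reuses the registry path from this answer's code example; the module labels and version numbers are illustrative assumptions, and the module inputs are omitted for brevity.

```hcl
# Exact pin: nothing changes until the consumer edits this line.
module "db_exact_pin" {
  source  = "app.terraform.io/bank-platform/rds-postgresql/aws"
  version = "2.3.1"
  # ...module inputs omitted
}

# Pessimistic constraint on the minor version: accepts >= 2.3.0 and < 3.0.0,
# so new minor and patch releases are picked up on `terraform init -upgrade`.
module "db_minor_range" {
  source  = "app.terraform.io/bank-platform/rds-postgresql/aws"
  version = "~> 2.3"
  # ...module inputs omitted
}

# Pessimistic constraint on the patch version: accepts >= 2.3.0 and < 2.4.0,
# so only patch releases are picked up.
module "db_patch_range" {
  source  = "app.terraform.io/bank-platform/rds-postgresql/aws"
  version = "~> 2.3.0"
  # ...module inputs omitted
}
```

Which form fits depends on how much the consumer trusts the owning team's semver discipline: the third form is the one that matches a strict "patches only" policy.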

Code Example

# Private module structure: terraform-aws-rds-postgresql
# Published to TFE private registry
#
# terraform-aws-rds-postgresql/
# ├── main.tf           # Core RDS resources
# ├── variables.tf      # Input variables
# ├── outputs.tf        # Output values
# ├── versions.tf       # Provider version constraints
# ├── sentinel.hcl      # Policy tests
# ├── README.md         # Usage documentation
# ├── CODEOWNERS        # Team ownership
# ├── CHANGELOG.md      # Version history
# └── test/
#     └── rds_test.go   # Terratest integration tests

# main.tf - Encodes banking compliance requirements
resource "aws_db_instance" "this" {
  identifier     = "${var.team}-${var.service_name}-${var.environment}"
  engine         = "postgres"
  engine_version = var.engine_version
  instance_class = var.instance_class

  # MANDATORY: Encryption at rest (PCI-DSS)
  storage_encrypted = true
  kms_key_id        = var.kms_key_arn

  # MANDATORY: Automated backups
  backup_retention_period = max(var.backup_retention_days, 7)  # Min 7 days
  backup_window           = "03:00-04:00"

  # MANDATORY: Multi-AZ for production
  multi_az = var.environment == "production" ? true : var.multi_az

  # MANDATORY: No public access
  publicly_accessible    = false
  db_subnet_group_name   = aws_db_subnet_group.this.name
  vpc_security_group_ids = [aws_security_group.this.id]

  # MANDATORY: Audit logging for compliance
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  # MANDATORY: Deletion protection in production
  deletion_protection = var.environment == "production"

  tags = merge(var.additional_tags, {
    ManagedBy          = "terraform"
    Module             = "terraform-aws-rds-postgresql"
    ModuleVersion      = "2.3.0"
    Team               = var.team
    DataClassification = var.data_classification
    ComplianceScope    = "pci-dss"
  })
}

# variables.tf - Expose customization, encode opinions
variable "service_name" {
  description = "Name of the service this database supports"
  type        = string
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]+$", var.service_name))
    error_message = "Service name must be lowercase alphanumeric with hyphens."
  }
}

variable "instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.r6g.large"
  validation {
    condition     = can(regex("^db\\.", var.instance_class))
    error_message = "Must be a valid RDS instance class."
  }
}

variable "data_classification" {
  description = "Data classification level (required for compliance)"
  type        = string
  validation {
    condition     = contains(["public", "internal", "confidential", "restricted"], var.data_classification)
    error_message = "Must be one of: public, internal, confidential, restricted."
  }
}
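
# outputs.tf - Sketch only; this file is assumed and its contents are illustrative.
# The Terratest example in this answer reads an output named "storage_encrypted",
# so the module would need to export something like the following.
output "storage_encrypted" {
  description = "Whether encryption at rest is enabled on the instance"
  value       = aws_db_instance.this.storage_encrypted
}

# Hypothetical convenience output for consumers that need the connection endpoint.
output "endpoint" {
  description = "Connection endpoint of the database instance"
  value       = aws_db_instance.this.endpoint
}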
---
# Consumer usage - payments team
# services/payments-api/infra/database.tf
module "settlements_db" {
  source  = "app.terraform.io/bank-platform/rds-postgresql/aws"
  version = "~> 2.3"  # Accept 2.x minor and patch releases (>= 2.3, < 3.0)

  service_name        = "settlements-processor"
  team                = "payments"
  environment         = "production"
  instance_class      = "db.r6g.xlarge"
  engine_version      = "15.4"
  kms_key_arn         = data.aws_kms_key.banking.arn
  data_classification = "restricted"  # PII and financial data
  vpc_id              = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids          = data.terraform_remote_state.network.outputs.database_subnets

  additional_tags = {
    CostCenter = "payments-engineering"
  }
}
---
# Sentinel policy: enforce private registry usage
# policies/governance/require-private-registry.sentinel
import "tfconfig/v2" as tfconfig

approved_registry = "app.terraform.io/bank-platform"

module_calls = filter tfconfig.module_calls as _, mc {
  mc.source is not ""
}

all_modules_from_registry = rule {
  all module_calls as _, mc {
    mc.source matches approved_registry + "/.+"
  }
}

main = rule { all_modules_from_registry }
---
// Terratest - automated module validation
// test/rds_test.go
package test

import (
  "testing"
  "github.com/gruntwork-io/terratest/modules/terraform"
  "github.com/stretchr/testify/assert"
)

func TestRdsModule(t *testing.T) {
  terraformOptions := &terraform.Options{
    TerraformDir: "../",
    Vars: map[string]interface{}{
      "service_name":        "test-settlements",
      "team":                "platform-test",
      "environment":         "sandbox",
      "data_classification": "internal",
      // Placeholder ARN; point this at a real KMS key in the sandbox account
      "kms_key_arn":         "arn:aws:kms:us-east-1:123456:key/test",
    },
  }
  defer terraform.Destroy(t, terraformOptions)
  terraform.InitAndApply(t, terraformOptions)

  // Validate encryption is enabled
  encrypted := terraform.Output(t, terraformOptions, "storage_encrypted")
  assert.Equal(t, "true", encrypted)
}
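
For teams on open-source Terraform without a registry, the Git-based alternative mentioned in the detailed answer can be sketched as follows. The Git host, repository path, and tag are hypothetical; the inputs mirror the consumer example above.

```hcl
module "settlements_db" {
  # Hypothetical internal Git host; ?ref pins the module to a release tag.
  # The version argument only works with registry sources, so the tag is
  # the only pin and range constraints such as "~> 2.3" are not available.
  source = "git::https://git.example.com/platform/terraform-aws-rds-postgresql.git?ref=v2.3.0"

  service_name        = "settlements-processor"
  team                = "payments"
  environment         = "production"
  data_classification = "restricted"
  # ...remaining inputs as in the registry-based example
}
```

The trade-off is that every upgrade, including a patch release, becomes an explicit code change in each consumer repository.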

◈ Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│            Terraform Module Governance Lifecycle                │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Module Development (Platform Team)                       │  │
│  │                                                           │  │
│  │  Identify Pattern → Build Module → Terratest → Publish   │  │
│  │                                                           │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐ │  │
│  │  │ Hardened │→ │ Unit +   │→ │ Semver   │→ │ Private  │ │  │
│  │  │ Defaults │  │ Integ    │  │ Tag      │  │ Registry │ │  │
│  │  │ (encrypt,│  │ Tests    │  │ v2.3.0   │  │ Publish  │ │  │
│  │  │  backup) │  │          │  │          │  │          │ │  │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘ │  │
│  └───────────────────────────────────────────────────────────┘  │
│                             │                                   │
│                             ▼                                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Private Registry (app.terraform.io/bank-platform)        │  │
│  │                                                           │  │
│  │  terraform-aws-rds-postgresql  v2.3.0  (DB Team)         │  │
│  │  terraform-aws-eks-cluster     v3.1.0  (Platform Team)   │  │
│  │  terraform-aws-vpc-banking     v1.8.0  (Network Team)    │  │
│  │  terraform-aws-s3-compliant    v2.0.0  (Platform Team)   │  │
│  └───────────────────────────┬───────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Consumer Teams (Pin versions, upgrade on schedule)        │  │
│  │                                                           │  │
│  │  module "settlements_db" {                                │  │
│  │    source  = "app.terraform.io/.../rds-postgresql/aws"    │  │
│  │    version = "~> 2.3"  # Minor and patch, < 3.0           │  │
│  │  }                                                        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Sentinel: all prod modules MUST use the private registry │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

How do you deploy EKS clusters using Terraform, and what module structure do you use for the cluster, node groups, and networking?

advanced · modules · terraform

Quick Answer

Deploy EKS using a layered module structure: a networking module for VPC/subnets, a cluster module for the EKS control plane with OIDC provider, and a node-groups module for managed node groups with launch templates. Each module has its own state file and communicates through remote state data sources or SSM parameters.

Detailed Answer

Deploying EKS with Terraform is like building a three-story building: the foundation is networking (VPC, subnets, NAT gateways), the structural frame is the EKS control plane (API server, etcd, OIDC provider), and the floors are the node groups (compute capacity where workloads actually run). Each layer depends on the one below it, and you should be able to rebuild any floor without demolishing the entire building.

The networking module provisions a dedicated VPC with public subnets for load balancers, private subnets for worker nodes, and optionally isolated subnets for databases. Load balancer subnet discovery relies on specific subnet tags: kubernetes.io/cluster/<cluster-name> = shared on all subnets (required by older EKS versions and still useful when several clusters share a VPC), kubernetes.io/role/elb = 1 on public subnets for internet-facing ALBs, and kubernetes.io/role/internal-elb = 1 on private subnets for internal services. The module outputs subnet IDs and the VPC ID for consumption by the cluster module. NAT gateways should be deployed per-AZ in production for high availability, meaning three NAT gateways across us-east-1a, us-east-1b, and us-east-1c.

The cluster module creates the EKS control plane using the aws_eks_cluster resource or the terraform-aws-modules/eks/aws community module. Critical configurations include the Kubernetes version (pin to a specific minor version like 1.29), the cluster endpoint access (private-only or public-and-private with CIDR restrictions), envelope encryption for secrets using a dedicated KMS key, and the OIDC provider for IAM Roles for Service Accounts (IRSA). The OIDC provider is frequently missed but essential: it enables pods to assume IAM roles without injecting AWS credentials, which avoids distributing long-lived access keys to workloads.

The node groups module manages EKS managed node groups with launch templates. Production clusters typically need multiple node groups: a system node group (t3.xlarge, 3 nodes, taints for system workloads like CoreDNS and kube-proxy), an application node group (m5.2xlarge, 3-15 nodes with cluster autoscaler), and optionally a GPU node group (g4dn.xlarge for ML inference). Each node group uses a custom launch template to specify the AMI (Amazon EKS-optimized AMI), bootstrap arguments, block device mappings (100 GiB gp3 root volume), and user data for kubelet configuration. The node group's update_config (for example max_unavailable) controls how nodes are replaced in a rolling fashion when the launch template changes.

In production, state separation between these modules is critical. If your node group Terraform runs into an error, you do not want it to affect the EKS control plane state. Use separate state files: one for networking, one for the cluster, and one per node group pool. Pass data between modules using terraform_remote_state data sources or aws_ssm_parameter lookups. This blast radius isolation means a botched node group change cannot accidentally destroy the control plane.

A common gotcha is the chicken-and-egg problem with cluster authentication. The aws-auth ConfigMap (which controls IAM-to-Kubernetes RBAC mapping) requires a running cluster, but nodes need their IAM role mapped in aws-auth before they can join the cluster. The solution is to use the EKS access entries API (available on newer clusters; the minimum platform version depends on the Kubernetes version) instead of managing aws-auth directly, or to use the kubernetes provider with the EKS cluster's endpoint and token to manage the ConfigMap in the same apply as the cluster creation.
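
The access entries approach can be sketched with the AWS provider's aws_eks_access_entry and aws_eks_access_policy_association resources. This is a minimal illustration under two assumptions: the cluster's access_config authentication mode includes API (API or API_AND_CONFIG_MAP), and var.platform_admin_role_arn is a hypothetical input naming the IAM role that should administer the cluster.

```hcl
variable "platform_admin_role_arn" {
  description = "IAM role ARN that should get cluster admin access (hypothetical input)"
  type        = string
}

# Register the IAM principal with the cluster; no aws-auth ConfigMap edit needed.
resource "aws_eks_access_entry" "platform_admins" {
  cluster_name  = aws_eks_cluster.payments_cluster.name
  principal_arn = var.platform_admin_role_arn
  type          = "STANDARD"
}

# Attach an AWS-managed access policy to that principal, scoped to the whole cluster.
resource "aws_eks_access_policy_association" "platform_admins" {
  cluster_name  = aws_eks_cluster.payments_cluster.name
  principal_arn = aws_eks_access_entry.platform_admins.principal_arn
  policy_arn    = "arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy"

  access_scope {
    type = "cluster"
  }
}
```

Because both resources go through the EKS API rather than the Kubernetes API, they can be applied without configuring the kubernetes provider against a cluster that may not exist yet.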

Code Example

# modules/eks-cluster/main.tf — EKS control plane module
# Create the EKS cluster with private endpoint and envelope encryption
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name following org naming convention
  name     = "payments-eks-${var.environment}"
  # Pin to specific Kubernetes minor version
  version  = "1.29"
  # IAM role for the EKS control plane service
  role_arn = aws_iam_role.eks_cluster_role.arn

  # VPC configuration for the control plane ENIs
  vpc_config {
    # Private subnets from the networking module
    subnet_ids              = var.private_subnet_ids
    # Enable private API endpoint for in-VPC access
    endpoint_private_access = true
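    # Restrict public endpoint to CI/CD runner CIDRs only
    endpoint_public_access  = true
    # Public CIDR blocks allowed to reach the public endpoint. Private
    # RFC 1918 ranges such as 10.0.0.0/8 are not valid sources here; in-VPC
    # traffic uses the private endpoint instead.
    # var.cicd_runner_public_cidrs is an assumed list(string) input.
    public_access_cidrs     = var.cicd_runner_public_cidrs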
    # Security group for additional control plane access rules
    security_group_ids      = [aws_security_group.eks_cluster_sg.id]
  }

  # Envelope encryption for Kubernetes secrets using KMS
  encryption_config {
    # Encrypt the secrets resource type stored in etcd
    resources = ["secrets"]
    provider {
      # Dedicated KMS key for EKS secrets encryption
      key_arn = aws_kms_key.eks_secrets_key.arn
    }
  }

  # Enable all control plane logging for audit and troubleshooting
  enabled_cluster_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

  # Tags for cost allocation and ownership tracking
  tags = {
    Team        = "payments-platform"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# OIDC provider for IAM Roles for Service Accounts (IRSA)
resource "aws_iam_openid_connect_provider" "eks_oidc" {
  # OIDC issuer URL from the EKS cluster
  url = aws_eks_cluster.payments_cluster.identity[0].oidc[0].issuer
  # Audience for the STS AssumeRoleWithWebIdentity call
  client_id_list = ["sts.amazonaws.com"]
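  # TLS certificate thumbprint for the OIDC provider
  thumbprint_list = [data.tls_certificate.eks_oidc.certificates[0].sha1_fingerprint]
}

# Fetch the OIDC issuer's certificate chain (hashicorp/tls provider)
# so the thumbprint reference above resolves
data "tls_certificate" "eks_oidc" {
  url = aws_eks_cluster.payments_cluster.identity[0].oidc[0].issuer
}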
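
# modules/eks-cluster/outputs.tf - Sketch only; this file is assumed.
# It exposes the values the node group layer consumes, matching the
# "cluster_name, oidc_arn, endpoint" hand-off in the architecture diagram.
output "cluster_name" {
  description = "Name of the EKS cluster"
  value       = aws_eks_cluster.payments_cluster.name
}

output "oidc_arn" {
  description = "ARN of the IAM OIDC provider used for IRSA"
  value       = aws_iam_openid_connect_provider.eks_oidc.arn
}

output "endpoint" {
  description = "Kubernetes API server endpoint"
  value       = aws_eks_cluster.payments_cluster.endpoint
}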

# modules/eks-nodegroups/main.tf — Managed node groups
resource "aws_eks_node_group" "application_nodes" {
  # Reference the payments EKS cluster by name
  cluster_name    = var.cluster_name
  # Node group name identifying workload type
  node_group_name = "payments-app-nodes-${var.environment}"
  # IAM role for EC2 instances in this node group
  node_role_arn   = aws_iam_role.node_group_role.arn
  # Deploy nodes into private subnets only
  subnet_ids      = var.private_subnet_ids
  # Instance types optimized for payment processing workloads
  instance_types  = ["m5.2xlarge"]
  # Use AL2023 EKS-optimized AMI
  ami_type        = "AL2023_x86_64_STANDARD"

  # Autoscaling configuration for the node group
  scaling_config {
    # Minimum nodes for baseline capacity
    min_size     = 3
    # Desired nodes for normal transaction volume
    desired_size = 6
    # Maximum nodes for peak shopping events
    max_size     = 15
  }

  # Launch template for custom node configuration
  launch_template {
    # Reference the custom launch template
    id      = aws_launch_template.app_nodes.id
    # Use the latest version of the launch template
    version = aws_launch_template.app_nodes.latest_version
  }

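  # Rolling update strategy to avoid downtime
  update_config {
    # Update 1 node at a time for safe rolling deploys
    max_unavailable = 1
  }

  # Let the cluster autoscaler own desired_size so Terraform does not reset it
  lifecycle {
    ignore_changes = [scaling_config[0].desired_size]
  }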
}

# Launch template for application node group
resource "aws_launch_template" "app_nodes" {
  # Template name matching the node group convention
  name_prefix = "payments-app-nodes-${var.environment}"

  # 100 GiB gp3 root volume for container images and logs
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      # Root volume size in GiB for container runtime storage
      volume_size = 100
      # gp3 for consistent baseline IOPS without cost of io2
      volume_type = "gp3"
      # Encrypt node volumes with the account default KMS key
      encrypted   = true
    }
  }

  # Tag instances for cost tracking and identification
  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "payments-app-node-${var.environment}"
      NodeGroup   = "application"
      Environment = var.environment
    }
  }
}
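
The subnet tagging described in the detailed answer can be sketched as follows. This is a minimal illustration rather than the networking module itself: the variable names, CIDR blocks, and the single-AZ subnet pair are assumptions chosen to fit the 10.0.0.0/16 VPC used in this answer.

```hcl
variable "vpc_id" {
  description = "ID of the VPC created by the networking layer (assumed input)"
  type        = string
}

variable "cluster_name" {
  description = "EKS cluster name used in the discovery tag (assumed input)"
  type        = string
}

# Public subnet: tagged so internet-facing load balancers are placed here.
resource "aws_subnet" "public_1a" {
  vpc_id                  = var.vpc_id
  cidr_block              = "10.0.0.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true

  tags = {
    Name                                        = "public-us-east-1a"
    "kubernetes.io/role/elb"                    = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}

# Private subnet: tagged so internal load balancers are placed here.
resource "aws_subnet" "private_1a" {
  vpc_id            = var.vpc_id
  cidr_block        = "10.0.10.0/24"
  availability_zone = "us-east-1a"

  tags = {
    Name                                        = "private-us-east-1a"
    "kubernetes.io/role/internal-elb"           = "1"
    "kubernetes.io/cluster/${var.cluster_name}" = "shared"
  }
}
```

In a real module the per-AZ subnets would more likely be generated with for_each over a list of availability zones instead of being written out one by one.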

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│           EKS Terraform Module Structure                       │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Layer 1: Networking Module (state: networking.tfstate)        │
│  ┌─────────────────────────────────────────────────────┐     │
│  │  payments-vpc (10.0.0.0/16)                          │     │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │     │
│  │  │ Public    │  │ Public    │  │ Public    │          │     │
│  │  │ Subnet    │  │ Subnet    │  │ Subnet    │          │     │
│  │  │ 1a (ALB)  │  │ 1b (ALB)  │  │ 1c (ALB)  │          │     │
│  │  └──────────┘  └──────────┘  └──────────┘          │     │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐          │     │
│  │  │ Private   │  │ Private   │  │ Private   │          │     │
│  │  │ Subnet    │  │ Subnet    │  │ Subnet    │          │     │
│  │  │ 1a (Nodes)│  │ 1b (Nodes)│  │ 1c (Nodes)│          │     │
│  │  └──────────┘  └──────────┘  └──────────┘          │     │
│  │  NAT GW x3 (one per AZ for HA)                      │     │
│  └─────────────────────────────────────────────────────┘     │
│         │ outputs: vpc_id, subnet_ids                         │
│         ↓                                                     │
│  Layer 2: Cluster Module (state: eks-cluster.tfstate)         │
│  ┌─────────────────────────────────────────────────────┐     │
│  │  payments-eks-prod                                   │     │
│  │  ┌──────────────┐  ┌────────────┐  ┌─────────────┐ │     │
│  │  │ Control Plane │  │ OIDC       │  │ KMS Key     │ │     │
│  │  │ K8s 1.29     │  │ Provider   │  │ (secrets    │ │     │
│  │  │ API + etcd   │  │ (for IRSA) │  │  encryption)│ │     │
│  │  └──────────────┘  └────────────┘  └─────────────┘ │     │
│  └─────────────────────────────────────────────────────┘     │
│         │ outputs: cluster_name, oidc_arn, endpoint           │
│         ↓                                                     │
│  Layer 3: Node Groups (state: eks-nodegroups.tfstate)         │
│  ┌─────────────────────────────────────────────────────┐     │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────────┐│     │
│  │  │ System     │  │ Application│  │ GPU Nodes      ││     │
│  │  │ Nodes      │  │ Nodes      │  │ (optional)     ││     │
│  │  │ t3.xlarge  │  │ m5.2xlarge │  │ g4dn.xlarge    ││     │
│  │  │ 3 fixed    │  │ 3-15 auto  │  │ 0-4 auto       ││     │
│  │  │ 100Gi gp3  │  │ 100Gi gp3  │  │ 100Gi gp3      ││     │
│  │  └────────────┘  └────────────┘  └────────────────┘│     │
│  └─────────────────────────────────────────────────────┘     │
└───────────────────────────────────────────────────────────────┘
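
The hand-off between layers in the diagram can be sketched with terraform_remote_state data sources in the node group root configuration. The state keys follow the file names in the diagram; the bucket name, the relative module path, and the private_subnet_ids output name are assumptions for illustration.

```hcl
# Read the outputs published by the networking layer.
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "org-terraform-state" # assumed bucket name
    key    = "networking.tfstate"
    region = "us-east-1"
  }
}

# Read the outputs published by the cluster layer.
data "terraform_remote_state" "cluster" {
  backend = "s3"
  config = {
    bucket = "org-terraform-state" # assumed bucket name
    key    = "eks-cluster.tfstate"
    region = "us-east-1"
  }
}

# Node groups consume the lower layers' outputs without sharing their state.
module "node_groups" {
  source = "../../modules/eks-nodegroups" # hypothetical relative path

  environment        = "prod"
  cluster_name       = data.terraform_remote_state.cluster.outputs.cluster_name
  private_subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
```

Remote state grants read access to every output in the upstream state, so teams that want a narrower contract often publish individual values to SSM parameters and read them with aws_ssm_parameter instead.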

How do you structure AWS accounts for Dev, QA, UAT, and Prod environments in Terraform — single account or multi-account with AWS Organizations?

advanced · workspaces · terraform

Quick Answer

Use a multi-account strategy with AWS Organizations where each environment (Dev, QA, UAT, Prod) gets its own AWS account under an organizational unit. Terraform manages this through assume-role provider configurations, per-account state files in a centralized S3 bucket with prefixed keys, and shared modules versioned in a private registry.

Detailed Answer

Structuring AWS accounts for multiple environments is like managing a hospital: you would never put the emergency room (production), the training lab (dev), the simulation center (QA), and the dress rehearsal ward (UAT) in the same building with shared keys and power circuits. AWS Organizations provides the building-per-department model, giving each environment complete blast radius isolation at the account boundary.

The multi-account structure typically follows an organizational unit (OU) hierarchy. The organization root contains the management account (billing and Organizations API only; never deploy workloads here). Below it, you create OUs for Security (GuardDuty delegated admin, SecurityHub, CloudTrail aggregation), SharedServices (CI/CD runners, ECR registry, Route53 hosted zones, Terraform state bucket), and Workloads. The Workloads OU contains sub-OUs for NonProd (Dev, QA, UAT accounts) and Prod (production account). Service Control Policies (SCPs) at each OU level enforce guardrails: NonProd accounts cannot provision p4d.24xlarge instances or create public-facing resources, while the Prod OU has SCPs preventing deletion of CloudTrail logs or disabling encryption.

In Terraform, each account is targeted through provider assume_role blocks. The CI/CD pipeline authenticates to the SharedServices account via OIDC, then assumes a TerraformExecutionRole in the target account. This role is provisioned by an account-baseline module that every new account receives. The role has permissions scoped to the services that environment needs: Dev gets broad permissions for experimentation, while Prod has tightly scoped permissions with explicit deny on destructive actions like deleting RDS clusters without snapshots.

State file organization follows the account boundary. A single S3 bucket in the SharedServices account stores all state files, with key prefixes per account: s3://org-terraform-state/111111111111/networking/terraform.tfstate for the Dev account networking stack, s3://org-terraform-state/444444444444/networking/terraform.tfstate for Prod. Each account's TerraformExecutionRole has an S3 policy that restricts access to only its own prefix, preventing a Dev pipeline misconfiguration from reading or writing Prod state.

The single-account approach, which uses naming conventions and tags to separate environments, is tempting for small teams but creates dangerous failure modes at scale:

- A single IAM policy mistake can grant Dev workloads access to Prod databases.
- Security groups in a shared VPC can be referenced across environments.
- Billing attribution becomes guesswork with cost allocation tags instead of per-account bills.
- Most critically, AWS service quotas are shared: a runaway Dev autoscaling group can exhaust EC2 limits and prevent Prod from scaling during a traffic spike.

The gotcha with multi-account is cross-account resource sharing. VPC peering or Transit Gateway connects environments for legitimate data flows (QA reading anonymized Prod data, shared ECR images). Terraform must manage both sides of a peering connection: the requester in one account and the accepter in another, each with its own provider alias. This requires careful orchestration: apply the requester first, then the accepter, or use a two-phase apply with data sources that look up the peering connection ID.
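
The two-sided peering setup can be sketched with one provider alias per account. This is a minimal illustration for a same-region peering managed from a single configuration; the variable names are assumptions, and the role name follows the TerraformExecutionRole convention described above.

```hcl
variable "qa_account_id" { type = string }   # assumed input
variable "prod_account_id" { type = string } # assumed input
variable "qa_vpc_id" { type = string }       # assumed input
variable "prod_vpc_id" { type = string }     # assumed input

# Requester side: credentials for the QA account.
provider "aws" {
  alias  = "qa"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::${var.qa_account_id}:role/TerraformExecutionRole"
  }
}

# Accepter side: credentials for the Prod account.
provider "aws" {
  alias  = "prod"
  region = "us-east-1"
  assume_role {
    role_arn = "arn:aws:iam::${var.prod_account_id}:role/TerraformExecutionRole"
  }
}

# The requester creates the peering request from the QA VPC.
resource "aws_vpc_peering_connection" "qa_to_prod" {
  provider      = aws.qa
  vpc_id        = var.qa_vpc_id
  peer_vpc_id   = var.prod_vpc_id
  peer_owner_id = var.prod_account_id
}

# The accepter approves it from the Prod account.
resource "aws_vpc_peering_connection_accepter" "prod" {
  provider                  = aws.prod
  vpc_peering_connection_id = aws_vpc_peering_connection.qa_to_prod.id
  auto_accept               = true
}
```

When both sides live in one configuration, the reference from the accepter to the requester's ID gives Terraform the requester-then-accepter ordering. If the two accounts are managed by separate pipelines, the accepter side would instead look up the connection ID after the requester has been applied.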

Code Example

# organizations.tf — AWS Organizations account structure
# Create the organizational unit hierarchy
resource "aws_organizations_organizational_unit" "workloads" {
  # Top-level OU that holds all workload accounts
  name      = "Workloads"
  # Attach to the root of the organization
  parent_id = aws_organizations_organization.org.roots[0].id
}

resource "aws_organizations_organizational_unit" "nonprod" {
  # Sub-OU for non-production environments
  name      = "NonProd"
  # Parent is the Workloads OU
  parent_id = aws_organizations_organizational_unit.workloads.id
}

resource "aws_organizations_organizational_unit" "prod" {
  # Sub-OU for production environment with stricter SCPs
  name      = "Prod"
  # Parent is the Workloads OU
  parent_id = aws_organizations_organizational_unit.workloads.id
}

# Account definitions for each environment
resource "aws_organizations_account" "environments" {
  # Create one account per environment using for_each
  # (placeholder addresses; each account needs its own unique root email)
  for_each = {
    dev  = { email = "aws-dev@example.com",  ou = aws_organizations_organizational_unit.nonprod.id }
    qa   = { email = "aws-qa@example.com",   ou = aws_organizations_organizational_unit.nonprod.id }
    uat  = { email = "aws-uat@example.com",  ou = aws_organizations_organizational_unit.nonprod.id }
    prod = { email = "aws-prod@example.com", ou = aws_organizations_organizational_unit.prod.id }
  }
  # Account name following organization convention
  name      = "valuemomentum-${each.key}"
  # Unique root email per account (AWS requirement)
  email     = each.value.email
  # Place account in the correct OU
  parent_id = each.value.ou
  # IAM role created in the new account for cross-account access
  role_name = "TerraformExecutionRole"
  # Prevent accidental account closure
  close_on_deletion = false
}

# Provider configuration for targeting a specific environment.
# This block belongs in the workload configurations, not alongside the
# organization resources above: a provider cannot assume a role in an
# account that the same run is still creating.
variable "account_ids" {
  description = "Map of environment name to AWS account ID (assumed to be passed in by the pipeline)"
  type        = map(string)
}

locals {
  # Map environment names to account IDs
  account_map = var.account_ids
}

# Provider block for the target environment account
provider "aws" {
  # Region standardized across all accounts
  region = "us-east-1"
  # Assume the execution role in the target environment account
  assume_role {
    # Construct role ARN from the environment variable
    role_arn     = "arn:aws:iam::${local.account_map[var.environment]}:role/TerraformExecutionRole"
    # Session name for CloudTrail traceability
    session_name = "terraform-${var.environment}-pipeline"
  }
  # Default tags applied to every resource in this account
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = "valuemomentum-platform"
    }
  }
}

# SCP preventing destructive actions in production
resource "aws_organizations_policy" "prod_guardrails" {
  # Policy name identifying its purpose
  name        = "prod-environment-guardrails"
  # Description for audit and compliance documentation
  description = "Prevent destructive actions in production accounts"
  # SCP policy type for organizational guardrails
  type        = "SERVICE_CONTROL_POLICY"
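  # Policy document denying dangerous operations.
  # Verify that rds:SkipFinalSnapshot is a condition key AWS supports for
  # these actions before relying on this statement; a Deny conditioned on
  # a key that is absent from the request never matches.
  content = jsonencode({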
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "PreventRDSDeletionWithoutSnapshot"
        Effect    = "Deny"
        Action    = ["rds:DeleteDBCluster", "rds:DeleteDBInstance"]
        Resource  = "*"
        Condition = {
          Bool = { "rds:SkipFinalSnapshot" = "true" }
        }
      }
    ]
  })
}
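
The per-account state isolation described in the detailed answer can be sketched as an IAM policy attached to each account's execution role. This is a minimal illustration: the bucket name and key layout follow the examples in this answer, var.account_id is an assumed input, and the bucket policy on the SharedServices side (which must also allow the cross-account access) is not shown.

```hcl
variable "account_id" {
  description = "ID of the account whose state prefix this role may use (assumed input)"
  type        = string
}

data "aws_iam_policy_document" "state_access" {
  # Allow listing only under this account's prefix.
  statement {
    sid       = "ListOwnPrefix"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::org-terraform-state"]

    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["${var.account_id}/*"]
    }
  }

  # Allow reading and writing state objects only under this account's prefix.
  statement {
    sid       = "ReadWriteOwnState"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::org-terraform-state/${var.account_id}/*"]
  }
}

resource "aws_iam_role_policy" "state_access" {
  name   = "terraform-state-access"
  role   = "TerraformExecutionRole"
  policy = data.aws_iam_policy_document.state_access.json
}
```

With this in place, a Dev pipeline that is accidentally pointed at a Prod state key fails with an access error instead of overwriting Prod state.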

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│         AWS Organizations Multi-Account Structure              │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────────────────────────┐                 │
│  │  Root OU                                 │                 │
│  │  ┌─────────────────────────────────┐    │                 │
│  │  │  Management Account (billing)    │    │                 │
│  │  └─────────────────────────────────┘    │                 │
│  └──────────────┬──────────────────────────┘                 │
│        ┌────────┴────────┬──────────────────┐                │
│        ↓                 ↓                  ↓                │
│  ┌───────────┐   ┌──────────────┐   ┌──────────────┐        │
│  │ Security  │   │ SharedSvcs   │   │ Workloads OU │        │
│  │ OU        │   │ OU           │   │              │        │
│  │           │   │              │   │  ┌─────────┐ │        │
│  │ GuardDuty │   │ CI/CD runners│   │  │ NonProd │ │        │
│  │ SecHub    │   │ ECR registry │   │  │  ┌─────┐│ │        │
│  │ CloudTrail│   │ TF State S3  │   │  │  │ Dev ││ │        │
│  │           │   │ Route53      │   │  │  │ QA  ││ │        │
│  └───────────┘   └──────────────┘   │  │  │ UAT ││ │        │
│                                      │  │  └─────┘│ │        │
│                                      │  └─────────┘ │        │
│                                      │  ┌─────────┐ │        │
│                                      │  │ Prod    │ │        │
│                                      │  │  ┌─────┐│ │        │
│                                      │  │  │Prod ││ │        │
│                                      │  │  └─────┘│ │        │
│                                      │  └─────────┘ │        │
│                                      └──────────────┘        │
│                                                               │
│  State Isolation:                                             │
│  s3://org-terraform-state/                                    │
│  ├── 111111111111/dev/networking/terraform.tfstate            │
│  ├── 222222222222/qa/networking/terraform.tfstate             │
│  ├── 333333333333/uat/networking/terraform.tfstate            │
│  └── 444444444444/prod/networking/terraform.tfstate           │
└───────────────────────────────────────────────────────────────┘

How do you prevent one Terraform environment from accidentally affecting another when using shared modules and remote state?

advanced · state · terraform

Quick Answer

Prevent cross-environment contamination through four layers: separate state files per environment with IAM-scoped access, provider configurations locked to specific AWS accounts via assume_role, module versioning with pinned tags so an untested module change cannot propagate, and CI/CD pipeline guardrails that validate the target environment before apply.

Detailed Answer

Preventing cross-environment contamination in Terraform is like building firewalls between apartments in a building: you need physical separation (state isolation), locked doors (IAM boundaries), independent utilities (provider configurations), and a building code (CI/CD guardrails) that prevents shortcuts through shared walls.

The first layer is state file isolation. Each environment must have its own state file with its own backend configuration. Workspaces do give each environment a separate state file, but every workspace lives in the same backend under the same credentials, so do not rely on them alone if the blast radius of corruption is unacceptable. The state file contains sensitive data: resource IDs, IP addresses, and any secret attributes or outputs, all stored in plaintext. An S3 bucket policy should restrict each environment's Terraform role to only its own key prefix: the prod role can access s3://state-bucket/prod/* but is explicitly denied s3://state-bucket/dev/*. This prevents a misconfigured prod pipeline from reading or overwriting dev state.

The second layer is provider-level isolation. Each environment's provider block must assume a role in its specific AWS account. Provided each pipeline's credentials can assume only its own environment's role, passing the wrong tfvars file makes the assume-role call fail instead of letting Terraform operate in the wrong account. Add a validation check using the aws_caller_identity data source: compare the actual account ID against the expected one and fail early if they do not match. This catches the scenario where an engineer runs terraform apply with prod credentials but dev configuration, or vice versa.

The third layer is module versioning. When environments share modules from a private registry or Git repository, use pinned version tags. Dev might use module version 2.3.0-rc1 while Prod uses 2.2.0 (the last stable release). Without version pinning, a module change pushed to the main branch immediately affects every environment that references source = "git::...?ref=main". This is a frequent cause of accidental cross-environment impact: someone fixes a bug in a shared VPC module, the fix has a typo, and every environment that references the module head picks up the broken code on next apply.

The fourth layer is CI/CD pipeline guardrails. The pipeline should validate environment consistency before plan: check that the workspace name matches the tfvars file, verify the AWS account ID matches the target environment, and confirm the Git branch is allowed to deploy to that environment (only main can deploy to prod). Implement a pre-plan script that runs aws sts get-caller-identity and compares the account against an expected value from the pipeline configuration.

Remote state data sources are a particularly dangerous vector for cross-environment bleed. When a production EKS module reads the networking module's state via terraform_remote_state, it must reference the production networking state, not dev. Parameterize the key in the data source's config block with the environment variable, so that in prod it resolves to prod/networking/terraform.tfstate rather than to a hardcoded path. A common gotcha is a hardcoded key that works in dev and then silently keeps pointing at dev state when someone copies the configuration into prod without updating it.

The ultimate safeguard is defense in depth: even if one layer fails, the others prevent damage. If the IAM policy has a bug that allows dev access to prod state, the provider's assume_role still locks operations to the dev account. If the provider configuration is wrong, the account ID validation check fails before any resources are touched.
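
The state-prefix restriction from the first layer could be written as an IAM policy document. This is a minimal sketch, not a complete policy: the bucket name state-bucket and the prod/ and dev/ prefixes come from the text, while the data source name prod_state_access and the extra qa/ and uat/ prefixes are assumptions made for illustration. Permissions for the lock table are left out.

```hcl
# Illustrative policy for the prod Terraform role (names are placeholders)
data "aws_iam_policy_document" "prod_state_access" {
  # Allow listing, but only under the prod/ prefix
  statement {
    sid       = "ListProdPrefix"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::state-bucket"]
    condition {
      test     = "StringLike"
      variable = "s3:prefix"
      values   = ["prod/*"]
    }
  }

  # Allow reading and writing state objects under prod/ only
  statement {
    sid       = "ReadWriteProdState"
    effect    = "Allow"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::state-bucket/prod/*"]
  }

  # Explicit deny on the other environments' prefixes; an explicit deny
  # wins even if another policy later grants broader S3 access
  statement {
    sid     = "DenyOtherEnvironments"
    effect  = "Deny"
    actions = ["s3:*"]
    resources = [
      "arn:aws:s3:::state-bucket/dev/*",
      "arn:aws:s3:::state-bucket/qa/*",
      "arn:aws:s3:::state-bucket/uat/*",
    ]
  }
}
```

The rendered JSON is available as data.aws_iam_policy_document.prod_state_access.json and can be attached to the role with an aws_iam_role_policy or aws_iam_policy resource.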

Code Example

# Account identity validation — fail fast on wrong account
# Fetch the actual AWS account identity
data "aws_caller_identity" "current" {}

# Validate the account ID matches the expected environment
locals {
  # Map of expected account IDs per environment
  expected_accounts = {
    dev  = "111111111111"
    qa   = "222222222222"
    uat  = "333333333333"
    prod = "444444444444"
  }
  # Check if current account matches the target environment
  account_validated = (
    data.aws_caller_identity.current.account_id ==
    local.expected_accounts[var.environment]
  )
}

# Validation resource that fails the plan if the accounts do not match
resource "null_resource" "account_validation" {
  lifecycle {
    # Custom condition (Terraform 1.2+) gives a readable error message
    precondition {
      condition     = local.account_validated
      error_message = "Running in the wrong AWS account for this environment."
    }
  }
}

# Remote state data source — parameterized per environment
data "terraform_remote_state" "networking" {
  # S3 backend for reading the networking layer state
  backend = "s3"
  config = {
    # Same state bucket as all other stacks
    bucket = "valuemomentum-terraform-state-prod"
    # Key parameterized by environment to prevent cross-env reads
    key    = "${var.environment}/networking/terraform.tfstate"
    # Same region as the backend
    region = "us-east-1"
  }
}

# Use networking outputs safely scoped to the correct environment
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name scoped to the environment
  name     = "payments-eks-${var.environment}"
  version  = "1.29"
  role_arn = aws_iam_role.eks_cluster_role.arn
  vpc_config {
    # Subnet IDs from the SAME environment's networking state
    subnet_ids              = data.terraform_remote_state.networking.outputs.private_subnet_ids
    endpoint_private_access = true
    # Public endpoint is disabled only in prod
    endpoint_public_access  = var.environment != "prod"
  }
}

# Module versioning — pinned per environment
module "payments_vpc" {
  # Pinned Git tag prevents untested changes from propagating
  source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.2.0"
  # In dev, you might test a release candidate:
  # source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.3.0-rc1"
  vpc_name    = "payments-vpc-${var.environment}"
  vpc_cidr    = var.vpc_cidr
  environment = var.environment
}

# CI/CD pre-plan validation script (run before terraform plan)
# #!/bin/bash
# EXPECTED_ACCOUNT=$(jq -r ".${ENVIRONMENT}" accounts.json)
# ACTUAL_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
# if [ "$EXPECTED_ACCOUNT" != "$ACTUAL_ACCOUNT" ]; then
#   echo "FATAL: Expected account $EXPECTED_ACCOUNT but authenticated to $ACTUAL_ACCOUNT"
#   exit 1
# fi
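
As a complement to the caller-identity check, the AWS provider accepts an allowed_account_ids argument and refuses to run against any other account. The sketch below assumes a hypothetical variable, expected_account_id, that the pipeline sets independently of the tfvars file (for example through the TF_VAR_expected_account_id environment variable). It would replace the default provider block rather than sit beside it, and the assume_role settings are omitted for brevity.

```hcl
# Hypothetical variable supplied by the pipeline, not by the tfvars file
variable "expected_account_id" {
  type        = string
  description = "AWS account ID this pipeline is allowed to modify"
}

provider "aws" {
  region = "us-east-1"
  # The provider stops with an error if the resolved credentials
  # belong to any account other than the one listed here
  allowed_account_ids = [var.expected_account_id]
}
```

Keeping the expected account ID outside the tfvars file matters: if both the role ARN and the expected account were derived from the same environment variable, a wrong tfvars file would change both and the check would pass anyway.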

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│        Cross-Environment Protection Layers                     │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Layer 1: State Isolation (IAM-Scoped)                        │
│  ┌──────────────┐    DENY    ┌──────────────┐                │
│  │  Dev Role     │─────X─────│  prod/*       │                │
│  │  (IAM)       │            │  state keys   │                │
│  └──────────────┘            └──────────────┘                │
│  ┌──────────────┐   ALLOW    ┌──────────────┐                │
│  │  Dev Role     │───────────│  dev/*        │                │
│  │  (IAM)       │            │  state keys   │                │
│  └──────────────┘            └──────────────┘                │
│                                                               │
│  Layer 2: Provider Account Lock                               │
│  ┌──────────────────────────────────────────┐                │
│  │  provider "aws" {                         │                │
│  │    assume_role {                          │                │
│  │      role_arn = ".../${var.env}/Role"     │                │
│  │    }                                      │                │
│  │  }                                        │                │
│  │  → Operations locked to target account    │                │
│  └──────────────────────────────────────────┘                │
│                                                               │
│  Layer 3: Account ID Validation                               │
│  ┌──────────────────────────────────────────┐                │
│  │  aws_caller_identity.account_id          │                │
│  │    == expected_accounts[var.environment]  │                │
│  │  → FAIL FAST if wrong account            │                │
│  └──────────────────────────────────────────┘                │
│                                                               │
│  Layer 4: Module Version Pinning                              │
│  ┌──────────────────────────────────────────┐                │
│  │  Dev:  source = "...?ref=v2.3.0-rc1"     │                │
│  │  Prod: source = "...?ref=v2.2.0"         │                │
│  │  → Untested changes cannot reach prod     │                │
│  └──────────────────────────────────────────┘                │
│                                                               │
│  Layer 5: CI/CD Pipeline Guardrails                           │
│  ┌──────────────────────────────────────────┐                │
│  │  Branch → Environment mapping             │                │
│  │  main → prod (requires approval)          │                │
│  │  develop → dev (auto-apply)               │                │
│  │  Pre-plan: sts get-caller-identity check  │                │
│  └──────────────────────────────────────────┘                │
└───────────────────────────────────────────────────────────────┘

How do you structure Terraform for a multi-account AWS organization?

advanced · modules · terraform

Quick Answer

Use a hub-and-spoke model with separate state files per account, shared modules for common patterns, and assume-role providers to manage cross-account resources. Structure repositories with account-level directories, environment-level workspaces, and a central module registry for VPC, IAM, and security baseline configurations.

Detailed Answer

Structuring Terraform for a multi-account AWS organization requires solving three interconnected problems: provider authentication across accounts, state isolation between accounts, and code reuse across similar environments. Think of it like managing a franchise: each store (account) runs the same playbook but has its own inventory (state), and headquarters (management account) needs oversight into all of them.

The foundation is AWS Organizations with a well-defined account structure. Typically you have a management account (for Organizations API and billing), a security account (for GuardDuty, SecurityHub, CloudTrail aggregation), a shared-services account (for CI/CD, artifact repositories, DNS), and then workload accounts per environment or per team. Each account gets its own Terraform state file to ensure blast radius isolation: a misconfigured apply in the staging account cannot corrupt production state.

For provider configuration, the recommended pattern is assume-role chaining. Your CI/CD pipeline authenticates to a central deployment account using OIDC (for GitHub Actions) or instance profiles (for Jenkins on EC2), then assumes environment-specific roles in target accounts. Each account has a TerraformExecutionRole with least-privilege permissions. The provider block uses assume_role with the target account's role ARN, and you parameterize the account ID using variables or a map lookup.

The repository structure typically follows one of two patterns. The first is a monorepo with directory-per-account: infrastructure/accounts/production/, infrastructure/accounts/staging/, each with its own backend configuration and tfvars. The second is a module-based approach where a single root configuration uses for_each over a map of accounts to deploy baseline resources. Terraform provider blocks currently do not support for_each, so this approach still needs one aliased provider block per account. The monorepo approach is simpler to reason about but leads to code duplication. The module approach is DRY but increases blast radius.

Shared modules are the cornerstone of multi-account management. You create versioned modules for common patterns: a VPC module that enforces CIDR allocation from a central IPAM, a security-baseline module that deploys Config rules and GuardDuty, an IAM-baseline module that creates standard roles. These modules are published to a private Terraform registry or referenced via Git tags.

Production gotchas are numerous. Cross-account resource references require careful handling: you cannot reference a resource in another account's state without remote state data sources or SSM Parameter Store lookups. VPC peering across accounts needs accepter-side resources managed by the accepter account's Terraform. Service Control Policies attached from the management account can block Terraform operations in child accounts if not carefully scoped. And state backend permissions must be tightly controlled: each account's Terraform role should only access its own state prefix in S3.
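
For the directory-per-account layout, each directory carries its own backend block. The following is a minimal sketch for a hypothetical infrastructure/accounts/production/backend.tf. The bucket and key follow the state paths shown in this answer's architecture diagram, and the lock table name is an assumption.

```hcl
# infrastructure/accounts/production/backend.tf (illustrative)
terraform {
  backend "s3" {
    # Shared state bucket; each account writes under its own prefix
    bucket = "fintech-terraform-state"
    # Key unique to this account, matching the diagram's layout
    key    = "production/tfstate"
    region = "us-east-1"
    # Encrypt state at rest
    encrypt = true
    # Assumed name of the DynamoDB lock table
    dynamodb_table = "terraform-state-locks"
  }
}
```

The staging directory would differ only in the key, which is what keeps an apply in one account from touching another account's state.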

Code Example

# Provider configuration for multi-account AWS organization
# Map of account IDs for the fintech organization
locals {
  # Central registry of all AWS account IDs by environment name
  account_ids = {
    production  = "111111111111"
    staging     = "222222222222"
    security    = "333333333333"
    shared-svcs = "444444444444"
  }
  # Construct the IAM role ARN for cross-account access
  target_role_arn = "arn:aws:iam::${local.account_ids[var.environment]}:role/TerraformExecutionRole"
}

# Default provider assumes role into the target workload account
provider "aws" {
  # Region standardized across the organization
  region = "us-east-1"
  # Assume the execution role in the target account
  assume_role {
    # Role ARN constructed from the environment variable
    role_arn     = local.target_role_arn
    # Session name for CloudTrail audit traceability
    session_name = "terraform-ci-${var.environment}"
  }
  # Default tags applied to every resource in this account
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      CostCenter  = "platform-engineering"
    }
  }
}

# Aliased provider for the shared-services account (DNS, ECR)
provider "aws" {
  # Alias used when referencing shared-services resources
  alias  = "shared_services"
  # Same region as the primary provider
  region = "us-east-1"
  # Assume role into the shared-services account
  assume_role {
    # Shared services account role for DNS and registry management
    role_arn     = "arn:aws:iam::${local.account_ids["shared-svcs"]}:role/TerraformDNSRole"
    # Distinct session name for audit separation
    session_name = "terraform-ci-shared-svcs"
  }
}

# VPC module instantiation using the organization's standard module
module "payments_vpc" {
  # Versioned module from the private Terraform registry
  source  = "app.terraform.io/fintech-corp/vpc/aws"
  # Pin to an exact version for stability
  version = "3.2.1"
  # VPC name following the organization naming convention
  vpc_name = "payments-vpc-${var.environment}"
  # CIDR allocated from the central IPAM pool
  vpc_cidr = var.vpc_cidr_blocks[var.environment]
  # Enable DNS resolution for private hosted zone lookups
  enable_dns_hostnames = true
  # Deploy NAT gateways in each AZ for high availability
  single_nat_gateway = false
}
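
The cross-account VPC peering gotcha can be illustrated with the aliased provider defined above. This is a sketch under several assumptions: the VPC module exposes a vpc_id output, shared_services_vpc_id is a hypothetical input variable, and the role behind aws.shared_services has the EC2 permissions needed to accept peering. It manages both sides from one configuration; the accepter can equally live in the accepter account's own Terraform.

```hcl
# Hypothetical input: ID of the VPC in the shared-services account
variable "shared_services_vpc_id" {
  type = string
}

# Requester side, created with the default (workload account) provider
resource "aws_vpc_peering_connection" "to_shared_services" {
  # Assumed output of the VPC module
  vpc_id        = module.payments_vpc.vpc_id
  peer_vpc_id   = var.shared_services_vpc_id
  peer_owner_id = local.account_ids["shared-svcs"]
  # A cross-account request cannot be auto-accepted by the requester
  auto_accept = false
}

# Accepter side, created in the shared-services account
resource "aws_vpc_peering_connection_accepter" "from_workload" {
  provider                  = aws.shared_services
  vpc_peering_connection_id = aws_vpc_peering_connection.to_shared_services.id
  auto_accept               = true
}
```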

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│              AWS Organization Account Structure                │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  ┌─────────────────────┐                                      │
│  │  Management Account  │                                      │
│  │  (Organizations API) │                                      │
│  │  SCPs, Billing       │                                      │
│  └──────────┬──────────┘                                      │
│             │                                                  │
│     ┌───────┴────────┬──────────────┬──────────────┐          │
│     ↓                ↓              ↓              ↓          │
│ ┌──────────┐  ┌───────────┐  ┌───────────┐  ┌───────────┐    │
│ │ Security  │  │ Shared    │  │ Production│  │ Staging   │    │
│ │ Account   │  │ Services  │  │ Account   │  │ Account   │    │
│ │           │  │ Account   │  │           │  │           │    │
│ │ GuardDuty │  │ ECR, DNS  │  │ payments  │  │ payments  │    │
│ │ SecHub    │  │ CI/CD     │  │ -vpc      │  │ -vpc      │    │
│ │ CloudTrail│  │ Artifacts │  │ user-auth │  │ user-auth │    │
│ └──────────┘  └─────┬─────┘  └───────────┘  └───────────┘    │
│                      │                                        │
│                      │ AssumeRole                              │
│                      ↓                                        │
│              ┌──────────────┐                                  │
│              │  CI/CD Runner │                                  │
│              │  (OIDC Auth)  │                                  │
│              │               │→ assume TerraformExecutionRole  │
│              │               │→ per target account             │
│              └──────────────┘                                  │
│                                                               │
│  State Isolation:                                              │
│  ┌────────────────────────────────────────────────────┐       │
│  │ s3://fintech-terraform-state/production/tfstate    │       │
│  │ s3://fintech-terraform-state/staging/tfstate       │       │
│  │ s3://fintech-terraform-state/security/tfstate      │       │
│  └────────────────────────────────────────────────────┘       │
└───────────────────────────────────────────────────────────────┘

How do you design Terraform state management at scale using remote backends, state locking, workspace isolation, and state splitting to prevent operational incidents?

architect · backends · terraform

Quick Answer

At scale, Terraform state must be stored in remote backends like S3 with DynamoDB locking or Terraform Cloud, split into small blast-radius units by domain or environment, and isolated via workspaces or directory structure. State locking prevents concurrent applies from corrupting state, and state splitting ensures a single terraform apply cannot accidentally destroy unrelated infrastructure.

Detailed Answer

Think of a hospital records system. If every department writes to one giant patient file simultaneously, records get corrupted and the wrong medication gets administered. Splitting records by department, locking each file during edits, and storing everything in a central secure archive prevents these disasters. Terraform state management works the same way: it is the record of what infrastructure exists, and mismanaging it causes outages.

Terraform state is a JSON file that maps every resource in your configuration to a real cloud object. When terraform plan runs, it reads the state to determine what exists, compares it to the desired configuration, and calculates the diff. If two engineers run terraform apply simultaneously against the same state without locking, one overwrites the other's changes, causing state corruption where Terraform's view of the world no longer matches reality. Remote backends solve storage and collaboration: S3 stores the state file durably, DynamoDB provides a lock table so only one operation can modify state at a time, and versioning on the S3 bucket enables recovery from bad applies. Recent Terraform releases can also take the lock with a lock file in the S3 bucket itself (the use_lockfile backend argument), which removes the DynamoDB table.

Internally, when terraform apply starts, it sends a Lock request to the backend. For S3+DynamoDB, this writes a lock record to the DynamoDB table with a unique ID, the user's identity, and a timestamp. If another process already holds the lock, Terraform exits with an error, or keeps retrying for the duration given by -lock-timeout. After the apply completes, Terraform writes the updated state to S3 and releases the lock. If a process crashes mid-apply, the lock record does not expire on its own; it remains until someone releases it with terraform force-unlock and the lock ID. Terraform Cloud handles locking internally and adds run queues so multiple plans can exist but only one apply executes at a time per workspace.

At production scale, the critical architectural decision is state splitting. A monolithic state file containing the VPC, databases, Kubernetes clusters, DNS records, and application services means a single terraform apply can accidentally destroy the database while updating a DNS record. The recommended pattern is splitting state by blast radius: network foundations in one state, data layer in another, compute in another, and application configurations in their own states. Each state has its own backend configuration and can use terraform_remote_state data sources or outputs stored in SSM Parameter Store to share values. Workspaces can further separate environments (dev, staging, production) within the same configuration, but they should not be used as a substitute for proper state splitting, because all workspaces in a configuration share the same codebase, backend, and permissions.

The non-obvious gotcha is that terraform_remote_state creates a hard coupling between states: if the upstream state is corrupted or the output names change, downstream plans break. Many mature teams replace terraform_remote_state with data sources that look up infrastructure by tags or names, or they store shared values in AWS SSM Parameter Store or HashiCorp Consul, which decouples state files completely. Another trap is that S3 bucket versioning only protects against ordinary deletes and overwrites. A principal allowed to call s3:DeleteObjectVersion can still remove old versions permanently, so teams must also enable MFA Delete or use S3 Object Lock for regulatory environments.
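
The bucket-level protections mentioned above could be bootstrapped along these lines. This is a minimal sketch for a separate bootstrap configuration, assuming AWS provider v4 or later. The bucket name matches the backend example in this answer and the resource labels are placeholders. MFA Delete and Object Lock are deliberately left out.

```hcl
# State bucket, guarded against accidental terraform destroy
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state-prod"

  lifecycle {
    prevent_destroy = true
  }
}

# Keep every prior version of each state object for recovery
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# State files often contain secrets, so block all public access
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```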

Code Example

# backend.tf — Remote backend configuration for the payments data layer
terraform {
  # Use S3 as the remote state storage backend
  backend "s3" {
    # S3 bucket dedicated to Terraform state files
    bucket = "company-terraform-state-prod"
    # State file path scoped to team and layer
    key    = "payments/data-layer/terraform.tfstate"
    # AWS region for the state bucket
    region = "us-east-1"
    # DynamoDB table for state locking and consistency checking
    dynamodb_table = "terraform-state-locks"
    # Enable server-side encryption for state at rest
    encrypt = true
    # Use a specific KMS key for encryption
    kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/payments-tf-state-key"
  }
}

# Reference outputs from the network layer state without tight coupling
# Using SSM Parameter Store instead of terraform_remote_state
data "aws_ssm_parameter" "vpc_id" {
  # Parameter path set by the network layer's terraform apply
  name = "/infrastructure/network/vpc-id"
}

data "aws_ssm_parameter" "private_subnet_ids" {
  # Comma-separated subnet IDs stored by the network team
  name = "/infrastructure/network/private-subnet-ids"
}

# Use the decoupled values in resource configuration
resource "aws_db_subnet_group" "payments" {
  # Subnet group name for the payments database
  name = "payments-db-subnets"
  # Split the comma-separated parameter value into a list
  subnet_ids = split(",", data.aws_ssm_parameter.private_subnet_ids.value)
  # Tag for operational identification
  tags = {
    Team  = "payments"
    Layer = "data"
  }
}

# DynamoDB lock table must exist before backend configuration
# This is typically created by a bootstrap state or manually
# aws dynamodb create-table \
#   --table-name terraform-state-locks \
#   --attribute-definitions AttributeName=LockID,AttributeType=S \
#   --key-schema AttributeName=LockID,KeyType=HASH \
#   --billing-mode PAY_PER_REQUEST
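
The data sources above only work if the network layer publishes those parameters. A sketch of the producer side follows, living in the network layer's own configuration. The parameter paths come from the example above, while aws_vpc.main and aws_subnet.private are hypothetical resource names, with the subnets assumed to be created using count.

```hcl
# Network layer: publish the VPC ID for downstream states
resource "aws_ssm_parameter" "vpc_id" {
  name  = "/infrastructure/network/vpc-id"
  type  = "String"
  value = aws_vpc.main.id # hypothetical resource name
}

# Network layer: publish private subnet IDs as a comma-separated list
resource "aws_ssm_parameter" "private_subnet_ids" {
  name = "/infrastructure/network/private-subnet-ids"
  # StringList values are stored comma-separated, matching the split() above
  type  = "StringList"
  value = join(",", aws_subnet.private[*].id) # hypothetical resource name
}
```

Because consumers depend only on the parameter path, the network team can restructure its state or rename outputs without breaking downstream plans, as long as the paths stay stable.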

◈ Architecture Diagram

┌──────────┐
│ tf apply │
└────┬─────┘
     │
┌────┴─────┐
│ Lock     │
│ DynamoDB │
└────┬─────┘
     │
┌────┴─────┐
│ Read     │
│ S3 State │
└────┬─────┘
     │
┌────┴─────┐
│ Apply    │
│ Changes  │
└────┬─────┘
     │
┌────┴─────┐
│ Write    │
│ S3 State │
└────┬─────┘
     │
┌────┴─────┐
│ Unlock   │
└──────────┘

What design patterns should architects follow when building composable Terraform modules with proper versioning, input validation, and registry publishing?

architect · modules · terraform

Quick Answer

Composable modules follow a thin-wrapper pattern with clear input/output contracts, use variable validation blocks for early error detection, semantic versioning via Git tags for safe upgrades, and publish to a private registry for organizational reuse. Module composition uses outputs and data sources rather than nested module trees that create opaque dependency chains.

Detailed Answer

Think of building a house with prefabricated components. A good prefab wall panel has standard dimensions (clear interface), quality-tested materials (validation), a version number stamped on it (semantic versioning), and is available from a catalog (registry). A bad panel is custom-cut for one house, undocumented, and stored in someone's garage. Terraform module design follows the same principles.

A well-designed Terraform module encapsulates a single infrastructure concern with a clear input/output contract. The module should do one thing well, such as creating an RDS instance with standard security settings or provisioning a VPC with consistent CIDR allocation, rather than trying to create an entire environment. Variable validation blocks catch configuration errors at plan time rather than during apply or, worse, at runtime when a database accepts an invalid parameter and fails to start. Validation expressions use conditions and error messages to enforce naming patterns, CIDR ranges, instance size constraints, and environment-specific rules before any API call is made.

Internally, Terraform resolves module sources during terraform init. A module sourced from a Git repository with a version tag (git::https://github.com/company/terraform-aws-rds.git?ref=v2.3.1) is downloaded and cached in .terraform/modules. The version pin ensures that a new commit to the module repository does not unexpectedly change infrastructure across all consumers. When published to a private registry (Terraform Cloud, Artifactory, or a self-hosted registry), modules appear in a searchable catalog with documentation generated from variables.tf, outputs.tf, and README.md. The registry enforces semantic versioning, making it safe to specify version constraints like ~> 2.3 (any 2.x from 2.3 upward) in consumer configurations.

At production scale, module composition patterns matter as much as individual module quality. The recommended pattern is flat composition: a root configuration references multiple modules at the same level, passing outputs from one module as inputs to another, rather than nesting modules three or four levels deep. Deep nesting creates opaque dependency chains where a change in a leaf module requires understanding the full tree to predict impact. Root modules should be environment-specific (payments-prod, payments-staging) and pin module versions independently per environment so that staging can test a new module version before production adopts it. Teams should run terraform validate and tflint in CI for every module change, and use automated tests with Terratest or terraform test to verify module behavior.

The non-obvious gotcha is that module versioning only works if teams actually bump versions. A common failure pattern is pinning to a Git branch (ref=main) instead of a tag, which means terraform init on different days pulls different code. Another trap is overusing count or for_each in modules to make them do too many things: a module that creates either an RDS instance or an Aurora cluster based on a boolean variable becomes untestable and produces confusing plans. Architects should split divergent resources into separate modules rather than adding conditional logic that makes the module's behavior unpredictable.
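
The validation rules described above can themselves be covered by terraform test. The following is a minimal sketch of a hypothetical tests/validation.tftest.hcl inside the module directory, assuming Terraform 1.7 or later (for mock_provider) and a module whose instance_class input is restricted to an approved list. Any other required inputs of the module would need values in the variables block as well.

```hcl
# Mocked provider so the test needs no AWS credentials (Terraform 1.7+)
mock_provider "aws" {}

run "rejects_unapproved_instance_class" {
  # Plan only; nothing is created
  command = plan

  variables {
    instance_name        = "payments-orders-dev"
    instance_class       = "db.m5.24xlarge" # assumed to be outside the approved list
    allocated_storage_gb = 100
  }

  # The run passes only if the validation on this variable fails
  expect_failures = [var.instance_class]
}
```

Running terraform test in CI for each module change then guards the input contract itself: loosening a validation rule by accident turns into a failing test instead of a surprise in a consumer's plan.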

Code Example

# modules/rds-instance/variables.tf — Module input contract with validation
variable "instance_name" {
  # Human-readable name for the database instance
  type        = string
  description = "Name of the RDS instance, must follow naming convention"
  validation {
    # Enforce the team naming convention: team-service-env
    condition     = can(regex("^[a-z]+-[a-z]+-(?:dev|staging|prod)$", var.instance_name))
    error_message = "Instance name must match pattern: team-service-env (e.g., payments-orders-prod)."
  }
}

variable "instance_class" {
  # RDS instance type for compute sizing
  type        = string
  description = "RDS instance class, restricted to approved sizes"
  validation {
    # Only allow instance classes approved by the platform team
    condition     = contains(["db.t3.medium", "db.r6g.large", "db.r6g.xlarge", "db.r6g.2xlarge"], var.instance_class)
    error_message = "Instance class must be one of the platform-approved sizes."
  }
}

variable "allocated_storage_gb" {
  # Storage size in gigabytes
  type        = number
  description = "Allocated storage in GB, minimum 20, maximum 5000"
  validation {
    # Enforce storage boundaries to prevent cost overruns
    condition     = var.allocated_storage_gb >= 20 && var.allocated_storage_gb <= 5000
    error_message = "Storage must be between 20 and 5000 GB."
  }
}

# Root configuration consuming the module with a version constraint
# payments-prod/main.tf
module "orders_database" {
  # Source from private registry with semantic version constraint
  source  = "app.terraform.io/company/rds-instance/aws"
  # Accept any 2.x >= 2.3, reject 3.x. For a strict pin use an exact version
  # (e.g. "2.3.1"): the dependency lock file records providers, not modules.
  version = "~> 2.3"

  # Pass validated inputs to the module
  instance_name        = "payments-orders-prod"
  instance_class       = "db.r6g.large"
  allocated_storage_gb = 500
  # Pass outputs from the network module as inputs
  subnet_group_name  = module.vpc.database_subnet_group_name
  security_group_ids = [module.vpc.database_security_group_id]
}

# Output the database endpoint for consumption by other configurations
output "orders_db_endpoint" {
  # Expose the RDS endpoint for application configuration
  value       = module.orders_database.endpoint
  description = "Connection endpoint for the orders database"
}

◈ Architecture Diagram

┌──────────┐
│ Registry │
│ v2.3.1   │
└────┬─────┘
     │
┌────┴─────┐
│ Root Cfg │
│ prod     │
└──┬───┬───┘
   │   │
   ↓   ↓
┌─────┐┌─────┐
│ VPC ││ RDS │
│ mod ││ mod │
└──┬──┘└──┬──┘
   │      │
   ↓      ↓
┌──────────┐
│ outputs  │
└──────────┘

How does Terraform Cloud at enterprise scale use Sentinel policies, cost estimation, run tasks, and VCS-driven workflows to enforce governance without slowing delivery teams?

architectworkspacesterraform

▼

Quick Answer

Terraform Cloud enforces governance through Sentinel policies that evaluate plans as code before apply, cost estimation that flags unexpected spend, run tasks that integrate external checks like security scanners, and VCS workflows that trigger plans on pull requests. This shifts enforcement left into the plan phase so teams get fast feedback without needing manual approval for every change.

Detailed Answer

Think of a highway with automated safety systems. Speed cameras (Sentinel policies) automatically flag violations, fuel cost displays (cost estimation) warn drivers before they commit to a route, roadside inspection stations (run tasks) check specific safety requirements, and GPS-guided lanes (VCS workflows) route each vehicle through the correct path. The highway keeps moving because enforcement is automated, not manual. Terraform Cloud (and its self-hosted counterpart, Terraform Enterprise) provides a collaborative platform where infrastructure changes follow a standardized workflow: code is committed to VCS, a plan is triggered, governance checks run, and apply executes only after all checks pass. Sentinel is HashiCorp's policy-as-code framework that evaluates Terraform plans, state, and configuration using a purpose-built policy language. Policies can enforce rules like requiring encryption on all S3 buckets, restricting instance types to cost-approved sizes, mandating specific tags on every resource, or preventing deletion of production databases. Policies are organized into policy sets that are applied to specific workspaces or to all workspaces in an organization. Internally, the run pipeline processes stages in order: VCS trigger, terraform plan, cost estimation, Sentinel policy check, and terraform apply, with run tasks attachable at the pre-plan, post-plan, pre-apply, or post-apply stage. Cost estimation reads the plan and calculates the estimated monthly cost delta from cloud provider pricing data, surfacing changes like adding a db.r6g.2xlarge that increases monthly spend by roughly $1,200. Run tasks are webhook-based integrations that hand the run's plan data to external systems (security scanners like Snyk or Prisma Cloud, compliance checkers, or custom approval systems) and wait for a pass/fail response. Each run task can be advisory (warning only) or mandatory (blocking the run). On a pull request the pipeline runs automatically as a speculative plan, so the plan, cost estimate, and policy results reach developers in minutes rather than after a manual review; nothing is applied until the change is merged.
At production scale, governance design requires balancing safety with velocity. Hard-mandatory Sentinel policies should cover non-negotiable rules like encryption and tagging. Soft-mandatory policies allow overrides with justification for edge cases like temporary large instances for data migration. Advisory policies educate teams about best practices without blocking. Cost estimation thresholds can be set to require manager approval for changes exceeding a dollar amount. VCS workflows should use speculative plans on pull requests (plan only, no apply) so developers see the impact before merging, and auto-apply on the main branch for environments like dev where speed matters more than manual gates. The non-obvious gotcha is that Sentinel policies execute after the plan phase, so they cannot prevent Terraform from planning invalid configurations — they can only block the apply. If a Sentinel policy references a resource attribute that does not exist in the plan (because the resource was removed), the policy can fail with a confusing error rather than a clean policy violation. Teams should test Sentinel policies against mock plan data in CI using the Sentinel CLI before deploying them to Terraform Cloud. Another trap is over-engineering run tasks: each run task adds latency to the pipeline, and if the external system is slow or unreliable, it blocks every infrastructure change across the organization.
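The three enforcement tiers described above can be sketched in a policy set's sentinel.hcl file. This is a minimal illustration, not a recommended rule set: the policy names and file paths are hypothetical, and only the three enforcement_level values come from Sentinel itself.

```hcl
# sentinel.hcl (sketch): one policy per enforcement tier.
# Policy names and source paths are hypothetical.

# Non-negotiable rule: a failure blocks the run and cannot be overridden.
policy "require-encryption" {
  source            = "./policies/require-encryption.sentinel"
  enforcement_level = "hard-mandatory"
}

# Edge cases allowed: a failure blocks the run until an authorized user overrides it.
policy "restrict-instance-size" {
  source            = "./policies/restrict-instance-size.sentinel"
  enforcement_level = "soft-mandatory"
}

# Educational only: a failure is reported as a warning and the run continues.
policy "recommend-descriptions" {
  source            = "./policies/recommend-descriptions.sentinel"
  enforcement_level = "advisory"
}
```

Keeping the tier next to each policy in version control makes a change of enforcement level a reviewable diff rather than a UI setting.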

Code Example

# Sentinel policy: require encryption on all S3 buckets
# policies/s3-encryption-required.sentinel
import "tfplan/v2" as tfplan

# Find all S3 bucket resources being created or updated
s3_buckets = filter tfplan.resource_changes as _, rc {
  # Match only aws_s3_bucket resources with create or update actions
  rc.type is "aws_s3_bucket" and
  rc.mode is "managed" and
  (rc.change.actions contains "create" or rc.change.actions contains "update")
}

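# Check that every bucket has a server-side encryption configuration
# (with AWS provider v4+, encryption is usually declared on the separate
# aws_s3_bucket_server_side_encryption_configuration resource instead)
encryption_check = rule {
  all s3_buckets as _, bucket {
    # An absent nested block shows up in the plan as an empty list rather than
    # null, so require a non-null value that also has at least one entry
    (bucket.change.after.server_side_encryption_configuration else null) is not null and
    length(bucket.change.after.server_side_encryption_configuration) > 0
  }
}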

# Main rule that must pass for the apply to proceed
main = rule {
  encryption_check
}

# sentinel.hcl — Policy set configuration
# policy "s3-encryption-required" {
#   source            = "./policies/s3-encryption-required.sentinel"
#   enforcement_level = "hard-mandatory" # Cannot be overridden
# }

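# Terraform Cloud workspace creation via CLI
# The organization comes from the cloud block in the configuration, not from a CLI flag
# (the cloud block must select workspaces by tags for `workspace new` to apply).
# VCS connection and auto-apply are configured afterwards in the UI, the API, or the tfe provider
# terraform login
# terraform init
# terraform workspace new payments-prod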

# .terraform-cloud.auto.tfvars — Workspace variable defaults
# These are set in the Terraform Cloud UI or API for the workspace
# environment    = "prod"
# team           = "payments"
# cost_threshold = 500

# Run task configuration via API (register a security scanner)
# curl -s -X POST \
#   -H "Authorization: Bearer $TFC_TOKEN" \
#   -H "Content-Type: application/vnd.api+json" \
#   https://app.terraform.io/api/v2/organizations/company/tasks \
#   -d '{"data":{"type":"tasks","attributes":{"name":"snyk-iac-scan","url":"https://hooks.snyk.io/terraform-cloud","category":"task","hmac-key":"secret-key"}}}'

◈ Architecture Diagram

┌──────────┐
│ VCS Push │
└────┬─────┘
     ↓
┌──────────┐
│ tf plan  │
└────┬─────┘
     ↓
┌──────────┐
│ Cost Est │
└────┬─────┘
     ↓
┌──────────┐
│ Sentinel │
└────┬─────┘
     ↓
┌──────────┐
│ Run Task │
└────┬─────┘
     ↓
┌──────────┐
│ tf apply │
└──────────┘

How do you design multi-account, multi-region AWS deployments with Terraform using provider aliases and for_each on providers to avoid configuration sprawl?

architectprovidersterraform

▼

Quick Answer

Provider aliases define multiple AWS provider configurations for different regions or accounts within a single Terraform configuration. Because the providers argument of a module block must be static, each region or account gets its own module block that is handed the matching aliased provider; for_each cannot choose a different provider per instance. Architects use assume_role in provider blocks to cross account boundaries securely, and module-level providers pass the correct provider to each regional deployment.

Detailed Answer

Think of a restaurant chain opening locations in different cities. Rather than writing a completely new business plan for each city, the headquarters uses one standard restaurant blueprint and customizes the local supplier list, health department contact, and rental agreement per location. Provider aliases in Terraform work the same way: one infrastructure blueprint is deployed across regions and accounts by swapping the provider configuration. In AWS, multi-account architecture is the recommended pattern for blast-radius isolation: production, staging, shared services, security, and logging each live in separate AWS accounts under an AWS Organization. Multi-region deployment adds resilience and latency optimization by placing infrastructure closer to users. Without careful Terraform design, this creates an explosion of near-identical configuration files, one per account-region combination, that diverge over time and become unmaintainable. Internally, Terraform provider aliases allow declaring multiple instances of the same provider with different configurations. A provider block with alias = "us_west" and region = "us-west-2" coexists with the default provider using region = "us-east-1". Resources reference a specific provider using the provider argument, and modules receive one through the providers argument. For cross-account access, each aliased provider uses assume_role to temporarily adopt an IAM role in the target account. The providers argument accepts only static references such as aws.us_west, not expressions, so every instance created by for_each on a module block shares the same provider mapping; for_each can fan a module out within one region or account, but it cannot place each instance in a different one. At production scale, the pattern therefore involves one module block per region or account-region pair, each with a static providers assignment that passes the matching aliased provider, while a locals map holds the per-region inputs such as CIDR ranges so the values stay in one place.
State should be split so that each account-region combination has its own state file; otherwise a single state file becomes a massive blast radius. The deployment pipeline should use separate workspaces or directories per account-region pair, with the provider configuration driven by workspace-specific variables. Teams should also use terraform_remote_state or SSM parameters to share outputs like VPC IDs across account-region boundaries. The non-obvious gotcha is that Terraform does not support for_each on provider blocks themselves, so you cannot dynamically generate provider aliases from a map. Each provider alias must be declared statically in the configuration. This means if you add a new region, you must add a new provider alias block, which is a manual step that cannot be fully automated. Some teams work around this by using Terragrunt to generate provider blocks from a configuration file, or by using a code generator that produces the provider declarations. Another trap is that assume_role credentials expire during long applies. For large deployments, the session duration must be set high enough (up to 12 hours for a directly assumed role, bounded by the role's maximum session duration; role chaining caps the session at 1 hour) or the apply will fail midway with an expired token error.
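The session-expiry trap can be addressed in the provider block itself. The following is a sketch under assumptions: the alias name and the four-hour value are illustrative, the account ID and role name are reused from the example in this answer, and the role's maximum session duration in IAM is assumed to be at least as long as the requested duration.

```hcl
# Aliased provider that requests a longer assume-role session for long applies.
# The alias name and the 4h value are illustrative assumptions.
provider "aws" {
  alias  = "eu_west_1_long_apply"
  region = "eu-west-1"

  assume_role {
    role_arn     = "arn:aws:iam::222222222222:role/TerraformDeployRole"
    session_name = "terraform-payments-eu-prod"
    # Must not exceed the role's maximum session duration configured in IAM.
    # If this role is reached through role chaining, AWS limits the session to 1 hour.
    duration = "4h"
  }
}
```

If the pipeline's base credentials are themselves short-lived, raising this value alone may not be enough, so the apply time of the largest stack is worth measuring before choosing a duration.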

Code Example

# providers.tf — Static provider aliases for each target region
provider "aws" {
  # Default provider for the primary region
  region = "us-east-1"
  # Assume a role in the production account
  assume_role {
    role_arn     = "arn:aws:iam::111111111111:role/TerraformDeployRole"
    session_name = "terraform-payments-prod"
  }
}

provider "aws" {
  # Aliased provider for the secondary region
  alias  = "us_west_2"
  region = "us-west-2"
  # Same account, different region
  assume_role {
    role_arn     = "arn:aws:iam::111111111111:role/TerraformDeployRole"
    session_name = "terraform-payments-prod-west"
  }
}

provider "aws" {
  # Aliased provider for the EU region in a separate account
  alias  = "eu_west_1"
  region = "eu-west-1"
  # Assume role in the EU production account
  assume_role {
    role_arn     = "arn:aws:iam::222222222222:role/TerraformDeployRole"
    session_name = "terraform-payments-eu-prod"
  }
}

# main.tf — Deploy the same networking module to each region
locals {
  # Map of regions to their CIDR allocations. Provider references must be
  # written statically in each module block; they cannot be looked up from a map
  regions = {
    us_east_1 = { cidr = "10.1.0.0/16" }
    us_west_2 = { cidr = "10.2.0.0/16" }
    eu_west_1 = { cidr = "10.3.0.0/16" }
  }
}

# Deploy VPC module to US East (default provider)
module "vpc_us_east_1" {
  # Source from the internal registry with version pin
  source  = "app.terraform.io/company/vpc/aws"
  version = "~> 3.1"
  # Pass region-specific CIDR
  vpc_cidr    = local.regions.us_east_1.cidr
  environment = "prod"
  region_name = "us-east-1"
}

# Deploy VPC module to US West with aliased provider
module "vpc_us_west_2" {
  source  = "app.terraform.io/company/vpc/aws"
  version = "~> 3.1"
  providers = {
    # Pass the US West provider alias to the module
    aws = aws.us_west_2
  }
  vpc_cidr    = local.regions.us_west_2.cidr
  environment = "prod"
  region_name = "us-west-2"
}

# Deploy VPC module to EU West with cross-account provider
module "vpc_eu_west_1" {
  source  = "app.terraform.io/company/vpc/aws"
  version = "~> 3.1"
  providers = {
    # Pass the EU provider alias (different account + region)
    aws = aws.eu_west_1
  }
  vpc_cidr    = local.regions.eu_west_1.cidr
  environment = "prod"
  region_name = "eu-west-1"
}

◈ Architecture Diagram

┌──────────────────────┐
│   Root Config        │
│                      │
│ provider aws         │
│ provider aws.west    │
│ provider aws.eu      │
└──┬───────┬───────┬───┘
   │       │       │
   ↓       ↓       ↓
┌──────┐┌──────┐┌──────┐
│US-E-1││US-W-2││EU-W-1│
│Acct A││Acct A││Acct B│
│VPC   ││VPC   ││VPC   │
└──────┘└──────┘└──────┘

How does Terraform Enterprise differ from open-source Terraform, and when does an organization need TFE?

intermediategeneralterraform

▼

Quick Answer

Terraform Enterprise adds remote state management with RBAC, Sentinel policy-as-code for compliance enforcement, private module registries, team-based workspace access, and audit logging. Organizations need TFE when they require governance, collaboration at scale, and regulatory compliance that open-source cannot provide.

Detailed Answer

Think of open-source Terraform as a skilled carpenter with excellent tools who works alone from blueprints. Terraform Enterprise is a construction firm — it has the same carpenter, but adds project managers (RBAC), building inspectors (Sentinel policies), a parts catalog (private registry), apprentice supervision (workspace permissions), and a complete paper trail of every nail hammered (audit logs). A solo developer building a shed does not need a construction firm. A bank building a skyscraper absolutely does. Open-source Terraform provides the core functionality: HCL language for defining infrastructure, a provider ecosystem for interacting with cloud APIs, state management, plan and apply workflows, and module reuse. When a single engineer or small team manages infrastructure, open-source Terraform with a remote state backend (S3 + DynamoDB) works well. The limitations emerge at scale: who can run terraform apply against production? How do you enforce that all S3 buckets have encryption enabled? How do you share vetted modules across 20 teams without everyone copy-pasting and diverging? How do you prove to auditors that every infrastructure change was reviewed, approved, and logged? Open-source Terraform has no answers to these questions — it is a tool, not a platform. Terraform Enterprise addresses these gaps through several key features. Workspaces provide isolated environments with their own state, variables, and permissions — the payments-api infrastructure can be in one workspace with access limited to the payments team, while the settlements-db workspace is restricted to the database team. Sentinel policies run before every apply and enforce organizational rules as code — 'all RDS instances must have encryption at rest,' 'no IAM policies can use wildcard actions,' 'all resources must have cost-center and owner tags.' These policies are version-controlled, peer-reviewed, and provide automated compliance enforcement that auditors love. 
The private module registry lets the platform team publish vetted, hardened modules (like an approved EKS cluster configuration) that application teams consume — ensuring consistency without restricting autonomy. Remote execution in TFE solves the 'works on my machine' problem. Instead of engineers running terraform apply from their laptops (with their personal AWS credentials and whatever Terraform version they have installed), all plans and applies execute in TFE's managed runners with consistent Terraform versions, standardized provider credentials (injected via workspace variables), and no local state. This eliminates credential sprawl — engineers never need direct AWS access for infrastructure changes, because TFE holds the credentials and applies changes on their behalf. For banking, this is a massive security improvement: credentials are centralized, rotated, and never touch developer machines. In production at a bank, TFE becomes the control plane for all infrastructure changes. The typical workflow is: engineer creates a branch, makes infrastructure changes, opens a PR. TFE runs a speculative plan on the PR (visible as a GitHub check), showing exactly what will change. Team members review the plan diff alongside the code diff. After PR approval and merge, TFE runs the real plan and waits for workspace-level approval (configurable — some workspaces auto-apply, production workspaces require manual confirmation from an authorized approver). Sentinel policies gate the apply, and the entire execution is logged with timestamps, user identity, plan output, and apply results. These logs are retained for compliance and can be exported to SIEM systems for security monitoring. The gotcha is cost and operational overhead. TFE is expensive — it requires either a self-hosted installation (on-premise or in your cloud account) or a Terraform Cloud Business subscription. 
Self-hosted TFE needs its own infrastructure (compute, database, object storage), monitoring, backup, and upgrades. Many organizations start with Terraform Cloud (the SaaS version) for its lower operational burden and migrate to self-hosted TFE only when data residency requirements or network isolation mandates require it. Another common mistake is over-governing — applying strict Sentinel policies and manual approval gates to every workspace, including development and sandbox environments, which slows down experimentation. Use tiered governance: strict policies and approvals for production, advisory-only policies for staging, and minimal controls for development.
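Tiered governance can be expressed in the workspace definitions themselves. The sketch below shows only the development tier and is an assumption-laden illustration: the workspace name is hypothetical, the organization name is borrowed from the example in this answer, and no policy set is attached to the workspace.

```hcl
# Development workspace with minimal controls (hypothetical workspace name).
resource "tfe_workspace" "payments_infra_dev" {
  name         = "payments-infra-dev"
  organization = "bank-platform"

  # Successful plans apply without a manual confirmation step in dev.
  auto_apply = true
}
```

Production workspaces would keep auto_apply set to false and be the only ones listed in the strict policy set, so the difference between tiers stays visible in code review.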

Code Example

# Terraform Enterprise workspace configuration via Terraform
# (yes, you manage TFE with Terraform itself)
resource "tfe_organization" "bank" {
  name  = "bank-platform"
  email = "admin@example.com" # Placeholder address
}

# Private module registry - vetted EKS module
resource "tfe_registry_module" "eks_cluster" {
  organization = tfe_organization.bank.name
  vcs_repo {
    display_identifier = "bank/terraform-aws-eks-cluster"
    identifier         = "bank/terraform-aws-eks-cluster"
    oauth_token_id     = var.github_oauth_token_id
  }
}

# Production workspace with approval gate
resource "tfe_workspace" "payments_infra_prod" {
  name              = "payments-infra-production"
  organization      = tfe_organization.bank.name
  terraform_version = "1.7.0"         # Pinned version
  auto_apply        = false            # Require manual approval
  queue_all_runs    = true
  working_directory = "environments/production"

  vcs_repo {
    identifier     = "bank/payments-infrastructure"
    branch         = "main"
    oauth_token_id = var.github_oauth_token_id
  }
}

# RBAC - only senior engineers can approve production applies
resource "tfe_team" "payments_admins" {
  name         = "payments-infra-admins"
  organization = tfe_organization.bank.name
}

resource "tfe_team_access" "payments_prod" {
  access       = "write"   # Can queue plans and approve applies
  team_id      = tfe_team.payments_admins.id
  workspace_id = tfe_workspace.payments_infra_prod.id
}

# Developers can queue plans on production but cannot apply
resource "tfe_team" "payments_developers" {
  name         = "payments-developers"
  organization = tfe_organization.bank.name
}

resource "tfe_team_access" "payments_dev" {
  access       = "plan"    # Can queue and view plans, cannot apply
  team_id      = tfe_team.payments_developers.id
  workspace_id = tfe_workspace.payments_infra_prod.id
}

# Sentinel policy set attached to production workspaces
resource "tfe_policy_set" "pci_dss_compliance" {
  name         = "pci-dss-compliance"
  organization = tfe_organization.bank.name
  kind         = "sentinel"
  # Enforcement levels (hard-mandatory, soft-mandatory, advisory) are set
  # per policy in the sentinel.hcl file inside the policy repository

  vcs_repo {
    identifier     = "bank/sentinel-policies"
    branch         = "main"
    oauth_token_id = var.github_oauth_token_id
  }

  workspace_ids = [
    tfe_workspace.payments_infra_prod.id,
    tfe_workspace.settlements_infra_prod.id, # Workspace assumed to be defined elsewhere
  ]
}

# Comparing open-source vs TFE workflow
# Open-source: engineer runs locally
# $ terraform init       # Downloads providers to laptop
# $ terraform plan       # Uses personal AWS credentials
# $ terraform apply      # No approval gate, no audit log
# $ terraform state pull # State in S3, no RBAC on who reads it

# TFE: governed workflow
# 1. Engineer pushes to branch → TFE speculative plan on PR
# 2. Peer reviews plan output in GitHub PR check
# 3. Merge to main → TFE queues real plan
# 4. Sentinel policies evaluate → hard-mandatory must pass
# 5. Authorized team member approves apply
# 6. TFE applies using centralized credentials
# 7. Full audit log: who, when, what changed, plan output

◈ Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│          Open-Source Terraform vs Terraform Enterprise          │
│                                                                 │
│  ┌─────────────────────────┐  ┌──────────────────────────────┐  │
│  │   Open-Source Terraform │  │   Terraform Enterprise       │  │
│  │                         │  │                              │  │
│  │  ┌───────────────────┐  │  │  ┌────────────────────────┐  │  │
│  │  │ Local execution   │  │  │  │ Remote execution       │  │  │
│  │  │ Personal creds    │  │  │  │ Centralized creds      │  │  │
│  │  │ No approval gate  │  │  │  │ Workspace approval     │  │  │
│  │  │ S3 state (no RBAC)│  │  │  │ RBAC on state          │  │  │
│  │  │ No policy engine  │  │  │  │ Sentinel policies      │  │  │
│  │  │ Copy-paste modules│  │  │  │ Private registry       │  │  │
│  │  │ No audit trail    │  │  │  │ Full audit logging     │  │  │
│  │  └───────────────────┘  │  │  └────────────────────────┘  │  │
│  │                         │  │                              │  │
│  │  Good for:              │  │  Good for:                   │  │
│  │  - Solo/small teams     │  │  - Enterprise teams          │  │
│  │  - Non-regulated envs   │  │  - Regulated industries      │  │
│  │  - Experimentation      │  │  - Multi-team governance     │  │
│  └─────────────────────────┘  └──────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  TFE Workflow: PR → Speculative Plan → Sentinel Check    │  │
│  │  → Merge → Real Plan → Approval → Apply → Audit Log     │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

How do you use Terraform workspaces versus directory-based separation for managing multiple environments, and what are the tradeoffs?

intermediateworkspacesterraform

▼

Quick Answer

Terraform workspaces use a single configuration with separate state files selected by workspace name (terraform workspace select prod), while directory-based separation duplicates the configuration into per-environment directories (envs/dev/, envs/prod/) each with their own state. Workspaces are simpler but share all code paths; directories offer full isolation but require synchronization effort.

Detailed Answer

Choosing between workspaces and directory-based separation is like choosing between a multi-tenant apartment building (workspaces) and separate houses (directories). The apartment building shares plumbing and electrical systems — cheaper and easier to maintain, but a pipe burst affects everyone. Separate houses cost more to build and maintain, but a problem in one house never reaches another. Terraform workspaces create named instances of state within the same backend configuration. When you run terraform workspace new qa, Terraform creates a new state file at env:/qa/terraform.tfstate in your backend (S3 key prefix changes to include the workspace name). The configuration stays identical — same main.tf, same modules — and you use terraform.workspace in conditionals or lookups to vary behavior. For example, locals { instance_type = terraform.workspace == "prod" ? "m5.2xlarge" : "t3.large" }. The advantage is zero code duplication: a module upgrade applies to all environments by running apply in each workspace sequentially. Directory-based separation creates independent root configurations per environment: infrastructure/envs/dev/main.tf, infrastructure/envs/prod/main.tf. Each directory has its own backend configuration, its own provider setup, and its own terraform.tfvars. Shared logic lives in modules referenced by relative path or registry source. The directories might look identical initially, but they can diverge intentionally — Prod might have a WAF module that Dev does not, or Dev might test a newer provider version before promoting it. The workspace approach has several tradeoffs. On the positive side: single source of truth for infrastructure code, atomic module upgrades, and less repository sprawl. On the negative side: a syntax error in main.tf breaks all environments simultaneously, there is no way to run different Terraform or provider versions per workspace, and terraform.workspace conditionals scattered through the code create hidden complexity. 
Most critically, a careless terraform apply in the wrong workspace can destroy production — there is no structural guardrail preventing this, only the workspace prompt. Directory-based separation trades DRY for safety. Each environment is fully independent: you can upgrade Dev to Terraform 1.8 while Prod stays on 1.7, you can test a new module version in QA without touching UAT, and a broken configuration in one directory cannot affect another. The cost is synchronization: when you fix a bug in the Dev VPC module call, you must remember to propagate the fix to QA, UAT, and Prod directories. Tooling like Terragrunt mitigates this by generating directory-based configurations from a DRY template, giving you the isolation of directories with the maintainability of workspaces. The production recommendation for most teams at Value Momentum's scale is a hybrid approach: use directory-based separation for major infrastructure boundaries (networking, EKS cluster, databases) and workspaces within each directory for environment selection only when the configurations are truly identical. Never use workspaces as a substitute for separate AWS accounts — they operate within a single provider configuration and do not provide IAM or network isolation.
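The missing structural guardrail for workspaces can be approximated with preconditions. This is a sketch under assumptions rather than an established convention: the account IDs and resource names are hypothetical, each environment is assumed to live in its own AWS account, and Terraform 1.4 or later is assumed for the terraform_data resource.

```hcl
locals {
  # Hypothetical mapping of workspace name to the AWS account it may target.
  expected_account = {
    dev  = "333333333333"
    qa   = "444444444444"
    prod = "111111111111"
  }
}

# Account that the active credentials actually belong to.
data "aws_caller_identity" "current" {}

# Placeholder resource whose only job is to fail the plan on a mismatch.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = contains(keys(local.expected_account), terraform.workspace)
      error_message = "Unknown workspace. Select dev, qa, or prod before planning."
    }

    precondition {
      condition     = lookup(local.expected_account, terraform.workspace, "") == data.aws_caller_identity.current.account_id
      error_message = "Active AWS credentials do not belong to the account expected for this workspace."
    }
  }
}
```

A failed precondition stops the plan, so selecting the prod workspace while holding dev credentials (or the reverse) is caught before any change is proposed.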

Code Example

# Workspace-based approach — single configuration, multiple states
# main.tf — same code serves all environments
locals {
  # Map workspace names to environment-specific configurations
  env_config = {
    dev = {
      instance_type    = "t3.large"
      min_nodes        = 1
      max_nodes        = 3
      db_instance      = "db.t3.medium"
      multi_az         = false
    }
    qa = {
      instance_type    = "t3.xlarge"
      min_nodes        = 2
      max_nodes        = 4
      db_instance      = "db.t3.large"
      multi_az         = false
    }
    prod = {
      instance_type    = "m5.2xlarge"
      min_nodes        = 3
      max_nodes        = 15
      db_instance      = "db.r6g.2xlarge"
      multi_az         = true
    }
  }
  # Select the configuration for the current workspace
  current = local.env_config[terraform.workspace]
}

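# RDS cluster using workspace-driven configuration
resource "aws_rds_cluster" "payments_db" {
  # Cluster name includes workspace for uniqueness
  cluster_identifier = "payments-db-${terraform.workspace}"
  # Engine and version are environment-agnostic
  engine         = "aurora-postgresql"
  engine_version = "15.4"
  # Credentials: illustrative username; RDS manages the password in Secrets Manager
  master_username             = "payments_admin"
  manage_master_user_password = true
  # Protection and retention are stricter in prod
  deletion_protection     = terraform.workspace == "prod"
  backup_retention_period = terraform.workspace == "prod" ? 30 : 7
}

# Cluster instances take their class from the workspace-specific config map
resource "aws_rds_cluster_instance" "payments_db" {
  # Two instances (Aurora spreads them across AZs) when multi_az is set, otherwise one
  count              = local.current.multi_az ? 2 : 1
  identifier         = "payments-db-${terraform.workspace}-${count.index}"
  cluster_identifier = aws_rds_cluster.payments_db.id
  engine             = aws_rds_cluster.payments_db.engine
  engine_version     = aws_rds_cluster.payments_db.engine_version
  instance_class     = local.current.db_instance
}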

# Workspace CLI commands
# terraform workspace new dev
# terraform workspace new qa
# terraform workspace new prod
# terraform workspace select prod
# terraform plan -out=prod.tfplan
# terraform apply prod.tfplan

# ─────────────────────────────────────────────────────
# Directory-based approach — separate configs per environment
# infrastructure/envs/dev/backend.tf
# terraform {
#   backend "s3" {
#     bucket         = "valuemomentum-terraform-state"
#     key            = "dev/payments/terraform.tfstate"
#     region         = "us-east-1"
#     dynamodb_table = "terraform-locks"
#     encrypt        = true
#   }
# }

# infrastructure/envs/prod/backend.tf
# terraform {
#   backend "s3" {
#     bucket         = "valuemomentum-terraform-state"
#     key            = "prod/payments/terraform.tfstate"
#     region         = "us-east-1"
#     dynamodb_table = "terraform-locks"
#     encrypt        = true
#   }
# }

# infrastructure/envs/prod/main.tf
# module "payments_vpc" {
#   source  = "../../modules/vpc"
#   vpc_name = "payments-vpc-prod"
#   vpc_cidr = "10.4.0.0/16"
#   environment = "prod"
# }

◈ Architecture Diagram

┌───────────────────────────────────────────────────────────────┐
│     Workspaces vs Directory-Based Environment Separation      │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Workspace Approach:                                          │
│  ┌─────────────────────────────────────────────┐             │
│  │  Single Configuration (main.tf)              │             │
│  │  ┌──────────────────────────────────────┐   │             │
│  │  │  terraform.workspace → selects state  │   │             │
│  │  └──────────────────────────────────────┘   │             │
│  │      │            │            │             │             │
│  │      ↓            ↓            ↓             │             │
│  │  ┌───────┐   ┌───────┐   ┌───────┐          │             │
│  │  │ dev   │   │ qa    │   │ prod  │  states  │             │
│  │  │ state │   │ state │   │ state │          │             │
│  │  └───────┘   └───────┘   └───────┘          │             │
│  └─────────────────────────────────────────────┘             │
│  Pros: DRY, single upgrade path                               │
│  Cons: shared code = shared failures, no version isolation    │
│                                                               │
│  Directory Approach:                                          │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                │
│  │ envs/dev/ │  │ envs/qa/  │  │ envs/prod/│                │
│  │           │  │           │  │           │                │
│  │ main.tf   │  │ main.tf   │  │ main.tf   │                │
│  │ backend.tf│  │ backend.tf│  │ backend.tf│                │
│  │ dev.tfvars│  │ qa.tfvars │  │ prod.tfvar│                │
│  │           │  │           │  │ + WAF mod │                │
│  │ TF 1.7    │  │ TF 1.7    │  │ TF 1.7    │                │
│  │ state: dev│  │ state: qa │  │ state:prod│                │
│  └─────┬─────┘  └─────┬─────┘  └─────┬─────┘                │
│        │              │              │                       │
│        └──────────────┴──────────────┘                       │
│                       │                                       │
│               ┌───────┴───────┐                              │
│               │ Shared Modules│                              │
│               │ modules/vpc   │                              │
│               │ modules/eks   │                              │
│               │ modules/rds   │                              │
│               └───────────────┘                              │
│  Pros: full isolation, independent versions                   │
│  Cons: code duplication, sync overhead                        │
└───────────────────────────────────────────────────────────────┘

How do Terraform modules work and what makes a good module design?

intermediate · modules · terraform

Quick Answer

Terraform modules are reusable containers of related resources defined in a directory with its own variables, outputs, and resource blocks. Good module design follows single-responsibility, exposes minimal required variables, uses sensible defaults, and avoids hardcoding environment-specific values.

Detailed Answer

A Terraform module is a directory of .tf files that encapsulates a logical group of resources. Think of modules like functions in programming: they take inputs (variables), do something (create resources), and return outputs. The root module is the working directory where you run terraform commands, and any module you call from there is a child module. When you write module "payments_vpc" { source = "./modules/vpc" }, Terraform loads that directory as an isolated configuration unit with its own namespace.

Internally, when Terraform processes a module call, it places the module's resources under addresses prefixed with module.payments_vpc. Resources inside the module cannot directly reference resources outside it; the two sides communicate only through input variables and output values. This enforced encapsulation is what makes modules safe to reuse. Terraform also supports module sources from Git repositories, the Terraform Registry, S3 buckets, and HTTP URLs, enabling organization-wide module libraries.

Good module design starts with the single-responsibility principle. A VPC module should create a VPC, subnets, route tables, and NAT gateways; it should not also create your RDS database. Each module should represent one logical infrastructure component. Variables should have descriptions, type constraints, and sensible defaults where possible. For example, a VPC module might default to three availability zones and a /16 CIDR block but allow overrides. Outputs should expose the identifiers that downstream modules need (VPC ID, subnet IDs, security group IDs) and nothing more.

One critical design principle is avoiding hardcoded provider configurations inside modules. The module should inherit the provider from the calling module, not declare its own. This allows the same module to be used across multiple AWS accounts or regions by simply changing the provider in the root module. Similarly, backend configuration belongs only in the root module, and environment-specific values like account IDs should be passed in as variables rather than hardcoded inside modules.

Version pinning is essential for module stability. For registry sources, set the version argument; for Git sources, pin a tag or commit with ?ref=, because the version argument applies only to registry modules. A constraint such as version = "~> 2.0" accepts any 2.x release (new minor and patch versions) but not a breaking 3.0 release. Without version pinning, a fresh terraform init on Monday might pull different module code than the same command on Friday, leading to unpredictable infrastructure changes.

Production-grade modules also include validation blocks for input variables, meaningful error messages, comprehensive README documentation, and example configurations. Teams that invest in a well-designed internal module library typically provision infrastructure faster and see less configuration drift across environments.
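
The pinning rules above are easy to misread, so a short sketch may help. The registry address app.terraform.io/fintech-infra/vpc/aws and the module names ledger_vpc and audit_vpc are hypothetical and do not come from the original example; the constraint operators themselves are standard Terraform.

```hcl
# Registry source: the version argument accepts constraint expressions.
# The registry address below is a hypothetical private registry path.
module "payments_vpc" {
  source = "app.terraform.io/fintech-infra/vpc/aws"
  # "~> 2.4" allows 2.4.0 and any later 2.x release, but never 3.0.0
  version = "~> 2.4"

  vpc_name = "payments-platform-prod"
}

# A three-part constraint is stricter: only the patch number may move.
module "ledger_vpc" {
  source = "app.terraform.io/fintech-infra/vpc/aws"
  # "~> 2.4.1" allows 2.4.1 up to, but not including, 2.5.0
  version = "~> 2.4.1"

  vpc_name = "ledger-platform-prod"
}

# Git source: the version argument is not supported here,
# so the pin lives in the ?ref= query parameter instead.
module "audit_vpc" {
  source = "git::https://github.com/fintech-infra/terraform-modules.git//vpc?ref=v2.4.1"

  vpc_name = "audit-platform-prod"
}
```

Note that the dependency lock file (.terraform.lock.hcl) records provider selections only, not module versions, so the constraint in the module block is what keeps module upgrades deliberate.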

Code Example

# Root module calling the reusable VPC module for the payments platform
module "payments_vpc" {
  # Source the module from the internal Git repository at a pinned version tag
  source = "git::https://github.com/fintech-infra/terraform-modules.git//vpc?ref=v2.4.1"
  # Name the VPC after the service and environment
  vpc_name = "payments-platform-prod"
  # Use a /16 CIDR block giving 65536 addresses for the payments network
  vpc_cidr = "10.20.0.0/16"
  # Deploy across three availability zones for high availability
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  # Enable NAT gateway for private subnet internet access
  enable_nat_gateway = true
  # false gives one NAT gateway per AZ (prod); set true in non-prod to save costs
  single_nat_gateway = false
  # Enable DNS hostnames so RDS instances get resolvable DNS names
  enable_dns_hostnames = true
  # Tags applied to every resource the module creates
  common_tags = {
    Environment = "production"
    Team        = "payments-backend"
    CostCenter  = "CC-4421"
    ManagedBy   = "terraform"
  }
}

# Inside modules/vpc/variables.tf — well-designed module inputs
variable "vpc_name" {
  # Human-readable description shown in generated docs and interactive input prompts
  description = "Name prefix for all VPC resources"
  # Enforce string type at plan time
  type = string
  # Validate that the name follows the org naming convention
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]+$", var.vpc_name))
    error_message = "VPC name must be lowercase alphanumeric with hyphens."
  }
}

variable "vpc_cidr" {
  # Describe what this CIDR block is used for
  description = "CIDR block for the VPC network range"
  # Enforce string type
  type = string
  # Default to a /16 block if not specified
  default = "10.0.0.0/16"
}

# Inside modules/vpc/outputs.tf — expose only what consumers need
output "vpc_id" {
  # Describe the output for documentation and discoverability
  description = "The ID of the created VPC"
  # Reference the VPC resource's ID attribute
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  # Consumers use these to place databases and internal services
  description = "List of private subnet IDs across all availability zones"
  # Collect all private subnet IDs into a list
  value = aws_subnet.private[*].id
}
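
The provider-inheritance principle from the detailed answer can be sketched as follows. This is a minimal illustration, not part of the original example: the dr alias, the second region, the payments_vpc_dr module name, and the ">= 5.0" constraint are all assumptions.

```hcl
# Root module: two configurations of the same provider.
provider "aws" {
  region = "us-east-1"
}

provider "aws" {
  # Hypothetical alias for a second region
  alias  = "dr"
  region = "us-west-2"
}

# No providers argument: the module inherits the default aws provider.
module "payments_vpc" {
  source   = "./modules/vpc"
  vpc_name = "payments-platform-prod"
}

# Same module code, different region: the root module passes the
# aliased provider in explicitly.
module "payments_vpc_dr" {
  source   = "./modules/vpc"
  vpc_name = "payments-platform-dr"

  providers = {
    aws = aws.dr
  }
}

# Inside modules/vpc/versions.tf: the module declares which provider it
# needs, but contains no provider block with region or credentials.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0" # assumed minimum, adjust to your module
    }
  }
}
```

Because the module never configures the provider itself, pointing it at another account or region is a change in the root module only.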

◈ Architecture Diagram

┌─────────────────────────────────────────────────────┐
│              Root Module (Working Dir)               │
│                                                     │
│  ┌───────────────┐       ┌────────────────────┐     │
│  │ main.tf       │       │ variables.tf       │     │
│  │               │       │ env = "prod"       │     │
│  │ module call───┼──┐    │ region = "us-east" │     │
│  └───────────────┘  │    └────────────────────┘     │
└─────────────────────┼───────────────────────────────┘
                      │
          ┌───────────▼───────────┐
          │  Module: payments_vpc │
          │  (modules/vpc/)       │
          │                       │
          │  ┌─────────────────┐  │
          │  │ variables.tf    │  │
          │  │ (inputs)        │  │
          │  └────────┬────────┘  │
          │           │           │
          │  ┌────────▼────────┐  │
          │  │ main.tf         │  │
          │  │ aws_vpc         │  │
          │  │ aws_subnet      │  │
          │  │ aws_nat_gateway │  │
          │  └────────┬────────┘  │
          │           │           │
          │  ┌────────▼────────┐  │
          │  │ outputs.tf      │  │
          │  │ vpc_id          │  │
          │  │ subnet_ids      │  │
          │  └─────────────────┘  │
          └───────────────────────┘

How do Terraform workspaces work and when should you use them vs separate directories?

intermediate · workspaces · terraform

Quick Answer

Terraform workspaces allow you to maintain multiple state files for the same configuration, enabling environment separation (dev/staging/prod) with a single codebase. Separate directories are preferred when environments have significantly different resource compositions or provider configurations.

Detailed Answer

Terraform workspaces are a built-in mechanism for managing multiple instances of the same infrastructure configuration. Think of workspaces like branches in a photo editing app: you have the same source image but can apply different filters and adjustments to each branch independently. Each workspace gets its own state file, so resources in the 'dev' workspace are tracked separately from resources in the 'prod' workspace, even though they share the same Terraform code.

Internally, workspaces work by changing where state is stored. With the default local backend, Terraform stores the state of each non-default workspace in a terraform.tfstate.d/ directory with one subdirectory per workspace. With an S3 backend, the default workspace uses the configured key as-is, while every other workspace's state is stored at <workspace_key_prefix>/<workspace>/<key>, where workspace_key_prefix defaults to env:. With the key payments-platform/terraform.tfstate, the dev state lands at env:/dev/payments-platform/terraform.tfstate and the prod state at env:/prod/payments-platform/terraform.tfstate. The terraform.workspace value is available in your HCL code, letting you conditionally set values based on the active workspace.

Workspaces shine when your environments are structurally identical but differ in scale or configuration. If dev, staging, and production all need the same VPC, RDS cluster, ECS service, and ALB, but dev uses smaller instance types and fewer replicas, workspaces with conditional expressions or workspace-specific tfvars files work well. You can use terraform.workspace in locals to set instance sizes, replica counts, and CIDR ranges per environment.

However, workspaces have significant limitations that push many teams toward separate directories. First, all environments share the same backend and provider configuration; you cannot easily use different AWS accounts per workspace without complex provider alias tricks. Second, if your production environment has additional resources that dev does not need (WAF rules, CloudFront distributions, compliance monitoring), you end up with count = terraform.workspace == "prod" ? 1 : 0 scattered throughout your code, which becomes unreadable. Third, workspaces provide no protection against accidentally running apply in the wrong workspace. A sleep-deprived engineer who forgets to run terraform workspace select prod before applying a production hotfix could accidentally modify the dev environment.

Separate directories (often called the directory-per-environment pattern) give you complete isolation. Each environment has its own directory with its own backend configuration, provider configuration, and state. This means prod can use a different AWS account, different provider version constraints, and completely different resource compositions. The tradeoff is code duplication: you need to keep common modules in sync across directories.

A common recommendation in the Terraform community is to use workspaces for lightweight environment differentiation within a single account and team, and separate directories (or separate Terraform Cloud workspaces with VCS integration) for production-grade multi-account setups. Many teams use a hybrid approach: modules contain the shared logic, and each environment directory calls those modules with environment-specific variables.
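
The wrong-workspace risk described above can be reduced with a guard that fails the plan when the selected workspace and the supplied variable file disagree. This is one possible sketch, not an official pattern: the environment variable and the workspace_guard resource name are hypothetical, and it assumes Terraform 1.4 or later for the built-in terraform_data resource.

```hcl
# The environment the operator intends to change,
# set in dev.tfvars, staging.tfvars, or prod.tfvars.
variable "environment" {
  description = "Environment this run is meant to target"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be one of: dev, staging, prod."
  }
}

# terraform_data manages no cloud resource; here it exists only
# to carry the precondition below.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      # Both values are known at plan time, so a mismatch stops the plan
      condition     = var.environment == terraform.workspace
      error_message = "Selected workspace is '${terraform.workspace}' but the variable file targets '${var.environment}'."
    }
  }
}
```

With this in place, running terraform plan -var-file=prod.tfvars while the dev workspace is selected fails with the error message instead of proposing changes to the wrong environment.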

Code Example

# Using workspaces with conditional configuration for the payments platform
# Define local values that change based on the active workspace
locals {
  # Map workspace names to environment-specific configurations
  environment_config = {
    # Development environment uses minimal resources to save costs
    dev = {
      instance_class   = "db.t3.medium"
      replica_count    = 1
      vpc_cidr         = "10.10.0.0/16"
      enable_waf       = false
      backup_retention = 3
    }
    # Staging mirrors production structure but at reduced scale
    staging = {
      instance_class   = "db.r6g.large"
      replica_count    = 2
      vpc_cidr         = "10.20.0.0/16"
      enable_waf       = true
      backup_retention = 7
    }
    # Production runs at full scale with maximum protection
    prod = {
      instance_class   = "db.r6g.2xlarge"
      replica_count    = 3
      vpc_cidr         = "10.30.0.0/16"
      enable_waf       = true
      backup_retention = 35
    }
  }
  # Look up the current workspace's config from the map above
  # (fails with an invalid index error in the "default" workspace, which has no entry)
  config = local.environment_config[terraform.workspace]
}

# Configure the S3 backend; each non-default workspace gets its own key under the env:/ prefix
terraform {
  # S3 backend stores each workspace's state at a separate key path
  backend "s3" {
    # Shared state bucket for all environments
    bucket = "fintech-terraform-state"
    # Base key path; non-default workspaces are stored at env:/<workspace>/<this key>
    key = "payments-platform/terraform.tfstate"
    # Region where the state bucket resides
    region = "us-east-1"
    # Lock table shared across all workspaces
    dynamodb_table = "fintech-terraform-locks"
    # Encrypt state at rest for compliance
    encrypt = true
  }
}

# VPC sized according to the current workspace
resource "aws_vpc" "payments_vpc" {
  # Each workspace gets its own non-overlapping /16 range (10.10, 10.20, 10.30)
  cidr_block = local.config.vpc_cidr
  # Enable DNS hostnames for service discovery within the VPC
  enable_dns_hostnames = true
  # Tag with the workspace name so resources are identifiable in the console
  tags = {
    Name        = "payments-vpc-${terraform.workspace}"
    Environment = terraform.workspace
    ManagedBy   = "terraform"
  }
}

# RDS cluster scaled per environment using workspace-driven locals
resource "aws_rds_cluster_instance" "payments_db_instances" {
  # Create the number of replicas specified for this workspace
  count = local.config.replica_count
  # Unique identifier includes workspace and instance index
  identifier = "payments-db-${terraform.workspace}-${count.index}"
  # Associate with the payments database cluster
  cluster_identifier = aws_rds_cluster.payments_db.id
  # Instance class varies by workspace: db.t3.medium in dev, db.r6g.2xlarge in prod
  instance_class = local.config.instance_class
  # Use the same Aurora PostgreSQL engine as the cluster
  engine = aws_rds_cluster.payments_db.engine
  # Tag for environment identification and cost tracking
  tags = {
    Environment = terraform.workspace
    Service     = "payments-db"
  }
}
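
The example above defines enable_waf per workspace but never consumes it. A minimal sketch of how such a flag typically gates an optional resource follows; the trimmed enable_waf map, the payments resource name, and the waf_acl_arn output are hypothetical, and the web ACL has no rules, so it only illustrates the count pattern.

```hcl
locals {
  # Trimmed copy of the per-workspace map, keeping only the flag used here
  enable_waf = {
    dev     = false
    staging = true
    prod    = true
  }
}

# Created only in workspaces where the flag is true
resource "aws_wafv2_web_acl" "payments" {
  count = local.enable_waf[terraform.workspace] ? 1 : 0

  name  = "payments-waf-${terraform.workspace}"
  scope = "REGIONAL"

  # Allow by default; real deployments would add rule blocks
  default_action {
    allow {}
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "payments-waf-${terraform.workspace}"
    sampled_requests_enabled   = true
  }
}

# one() returns the single element, or null when count is 0
output "waf_acl_arn" {
  description = "ARN of the web ACL, or null where WAF is disabled"
  value       = one(aws_wafv2_web_acl.payments[*].arn)
}
```

Driving count from a named flag in the map keeps the condition readable, compared with repeating terraform.workspace == "prod" on every optional resource.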

◈ Architecture Diagram

┌──────────────────────────────────────────────────────────┐
│              Workspace Approach                           │
│                                                          │
│  ┌──────────────────────────────────────────────┐        │
│  │          Shared HCL Configuration            │        │
│  │   main.tf  │  variables.tf  │  outputs.tf    │        │
│  └──────────────────┬───────────────────────────┘        │
│                     │                                    │
│        ┌────────────┼────────────┐                       │
│        │            │            │                       │
│  ┌─────▼─────┐ ┌────▼─────┐ ┌───▼──────┐                │
│  │ Workspace │ │Workspace │ │Workspace │                │
│  │ dev       │ │ staging  │ │ prod     │                │
│  │           │ │          │ │          │                │
│  │ State: A  │ │ State: B │ │ State: C │                │
│  │ t3.medium │ │ r6g.large│ │r6g.2xl   │                │
│  └───────────┘ └──────────┘ └──────────┘                │
└──────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────┐
│              Directory Approach                           │
│                                                          │
│  ┌────────────┐  ┌────────────┐  ┌────────────┐         │
│  │ envs/dev/  │  │envs/stage/ │  │ envs/prod/ │         │
│  │ main.tf    │  │ main.tf    │  │ main.tf    │         │
│  │ backend.tf │  │ backend.tf │  │ backend.tf │         │
│  │ State: X   │  │ State: Y   │  │ State: Z   │         │
│  │ AcctID: 111│  │ AcctID: 222│  │ AcctID: 333│         │
│  └─────┬──────┘  └─────┬──────┘  └─────┬──────┘         │
│        │               │               │                │
│        └───────────────┼───────────────┘                │
│                        │                                │
│              ┌─────────▼─────────┐                      │
│              │  Shared Modules   │                      │
│              │  modules/vpc/     │                      │
│              │  modules/rds/     │                      │
│              │  modules/ecs/     │                      │
│              └───────────────────┘                      │
└──────────────────────────────────────────────────────────┘
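
For comparison, the directory approach in the diagram might look like the following for envs/prod/. This is a sketch under stated assumptions: the bucket name, the account ID 333333333333, the role name terraform-deploy, and the modules/waf module are hypothetical placeholders.

```hcl
# envs/prod/backend.tf: this environment owns its state location
terraform {
  backend "s3" {
    bucket  = "fintech-terraform-state-prod" # hypothetical per-account bucket
    key     = "payments-platform/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }
}

# envs/prod/main.tf: provider points at the production account only
provider "aws" {
  region = "us-east-1"

  assume_role {
    # Hypothetical role in the production account
    role_arn = "arn:aws:iam::333333333333:role/terraform-deploy"
  }
}

# Shared module, called with production values
module "payments_vpc" {
  source   = "../../modules/vpc"
  vpc_name = "payments-platform-prod"
  vpc_cidr = "10.30.0.0/16"
}

# Prod-only component: it simply does not appear in envs/dev/,
# so no count conditionals are needed. Inputs omitted for brevity.
module "waf" {
  source = "../../modules/waf" # hypothetical module
}
```

Running Terraform from envs/prod/ can only ever touch the production state and account, which is the isolation the workspace approach lacks.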