Terraform

Company Stories

How Stripe Manages 1,000+ AWS Accounts with Terraform Modules and Custom Provider Extensions

Challenge

Stripe processes hundreds of billions of dollars in payments annually and operates under some of the strictest financial compliance requirements in the technology industry, including PCI DSS Level 1, SOC 2, and multiple international banking regulations. By 2024, their AWS footprint had grown to over 1,000 accounts organized in a complex AWS Organizations hierarchy to enforce security boundaries between payment processing, merchant data, analytics, and corporate workloads. Each account required consistent baseline security configurations including CloudTrail, GuardDuty, Config Rules, IAM boundaries, VPC architectures, and encryption policies. Manual provisioning of new accounts took 3-4 weeks and required coordination across security, networking, and compliance teams. Configuration drift across existing accounts was rampant, with audits revealing that 15 percent of accounts had deviated from baseline security requirements, creating compliance gaps that threatened their PCI certification.

Solution

Stripe built a comprehensive Terraform-based account vending machine and infrastructure management platform. At the core is a set of highly opinionated Terraform modules organized in a layered architecture: foundation modules handle account baseline (CloudTrail, GuardDuty, Config, IAM boundaries), networking modules provision standardized VPC architectures with transit gateway connectivity, and service modules provide pre-approved patterns for common workloads like ECS services, RDS databases, and Lambda functions. They developed a custom Terraform provider that integrates with Stripe's internal service catalog and compliance system, automatically tagging resources with ownership, cost center, data classification, and compliance scope. The account vending pipeline is triggered through an internal portal where teams request new accounts by specifying their workload type and compliance requirements, which generates Terraform configurations from templates and applies them through an automated pipeline. They implemented a Terraform state management architecture using S3 backends with DynamoDB locking, organized hierarchically to match their AWS Organizations structure, with cross-account IAM roles for centralized state access. Sentinel policies enforce compliance rules at plan time, preventing non-compliant configurations from ever being applied. A continuous drift detection system runs terraform plan on every account daily and automatically generates pull requests to remediate detected drift.
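
The daily drift check described above can be sketched in Python around `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. This is a minimal sketch, not Stripe's tooling: the per-account working directory layout and the function names are assumptions, and opening the remediation pull request is left out.

```python
import subprocess
from enum import Enum


class PlanResult(Enum):
    IN_SYNC = 0  # plan succeeded and found nothing to change
    ERROR = 1    # plan itself failed
    DRIFT = 2    # plan succeeded and found pending changes


def classify_exit_code(code: int) -> PlanResult:
    """Map the exit codes of `terraform plan -detailed-exitcode` to a result."""
    try:
        return PlanResult(code)
    except ValueError:
        # Anything outside 0/1/2 is treated as a failure.
        return PlanResult.ERROR


def check_account(workdir: str) -> PlanResult:
    """Run a plan in one account's working directory (hypothetical layout)."""
    proc = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(proc.returncode)


def accounts_needing_remediation(results: dict) -> list:
    """Return the account names whose plan reported pending changes."""
    return sorted(name for name, r in results.items() if r is PlanResult.DRIFT)
```

A scheduler would call `check_account` once per account and pass the collected results to `accounts_needing_remediation` to decide where a remediation pull request is needed. Accounts that return `ERROR` are best reported separately, since a failed plan says nothing about drift.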

Outcome

New AWS account provisioning time fell from 3-4 weeks to under 2 hours, including all security baseline configurations. The share of accounts that deviated from the security baseline fell from 15 percent to under 1 percent through continuous drift detection and automated remediation. PCI DSS audit preparation time was cut from 6 weeks to 1 week because compliance evidence is generated automatically from Terraform state. Infrastructure changes across all accounts are now fully auditable, with complete provenance from pull request to applied change. Onboarding time for new infrastructure engineers fell from months to weeks through standardized module patterns.

Scale

Over 1,000 AWS accounts, hundreds of Terraform modules, thousands of Terraform runs per day across accounts, hundreds of engineers authoring Terraform configurations, PCI DSS Level 1 compliance across all payment-processing accounts.

Key Learnings

Layered module architecture with strict interface contracts enables independent evolution of foundation, networking, and application infrastructure without breaking consumers
Custom Terraform providers bridge the gap between generic cloud APIs and organization-specific compliance and cataloging requirements that no upstream provider addresses
Continuous drift detection with automated remediation PRs is essential at scale: policy enforcement at plan time is necessary but not sufficient when humans can modify resources outside Terraform

How HashiCorp Cloud Platform Uses Terraform Internally to Provision Customer Infrastructure at Scale

Challenge

HashiCorp Cloud Platform (HCP) provides managed versions of HashiCorp products including Vault, Consul, Boundary, Waypoint, and Packer as cloud services running on AWS, Azure, and GCP. Each customer cluster requires dedicated infrastructure — VPCs, compute instances, load balancers, DNS records, TLS certificates, and product-specific resources — that must be provisioned in the customer's chosen cloud region within minutes of purchase. By 2024, HCP was managing thousands of customer clusters across 20+ regions on multiple cloud providers. The challenge was achieving reliable, fast, and consistent infrastructure provisioning at scale while handling the inherent complexity of multi-cloud deployment. Early provisioning pipelines had a 5 percent failure rate due to cloud provider API rate limits, transient errors, and resource dependency race conditions. Each failed provisioning meant a customer waiting and a support ticket. The provisioning system needed to handle concurrent provisioning of hundreds of clusters during peak signup periods while maintaining isolation guarantees between customer workloads.

Solution

HashiCorp built their internal provisioning platform using Terraform as the core infrastructure-as-code engine, orchestrated through a custom control plane that manages the lifecycle of customer clusters. Each HCP product has a set of Terraform modules that encode the complete infrastructure specification for a cluster — from VPC networking to product-specific compute and storage resources. The control plane generates customer-specific Terraform variable files from the cluster specification and executes Terraform through a managed workspace system built on the same technology that powers Terraform Cloud. They implemented intelligent retry logic with exponential backoff and jitter at the Terraform provider level to handle cloud API rate limits and transient failures. A custom Terraform provider called terraform-provider-hcp-internal manages HCP-specific resources like cluster registrations, license allocations, and cross-cluster networking configurations. Provisioning operations are queued and executed through a worker pool that rate-limits concurrent operations per cloud region to stay within API quotas. For multi-cloud support, they maintain parallel module implementations for AWS, Azure, and GCP with a common interface layer that abstracts cloud-specific details. State management uses Terraform Cloud's built-in state storage with workspace-per-cluster isolation ensuring that operations on one customer's infrastructure never affect another.
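
Exponential backoff with jitter, as mentioned above, can be illustrated with a short Python sketch. It uses the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The source says the retry logic lives at the Terraform provider level; this standalone version only shows the idea, and `TransientError`, the default base, and the cap are assumptions.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures such as API throttling."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: uniform delay in [0, min(cap, base * 2**attempt)]."""
    rng = rng if rng is not None else random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retry(op: Callable[[], T], max_attempts: int = 5,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None) -> T:
    """Call op, retrying on TransientError with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last failure
            sleep(backoff_delay(attempt, rng=rng))
    raise AssertionError("loop always returns or raises")
```

The jitter matters when many clusters are provisioned at once: without it, clients that were throttled together retry together and hit the same rate limit again.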

Outcome

Provisioning failure rate reduced from 5 percent to under 0.5 percent through intelligent retry logic and rate limit management. Average cluster provisioning time reduced from 25 minutes to 12 minutes through parallelized resource creation and optimized module dependency graphs. Infrastructure drift across customer clusters eliminated through regular reconciliation runs that detect and remediate unexpected changes. Multi-cloud parity achieved with identical customer experience across AWS, Azure, and GCP deployments. Engineering team velocity improved as new HCP products can reuse foundational provisioning modules, reducing time-to-market for new managed services from months to weeks.

Scale

Thousands of managed customer clusters, 20+ cloud regions across AWS, Azure, and GCP, hundreds of concurrent provisioning operations during peak, millions of Terraform resource instances managed, 99.9 percent provisioning success rate target.

Key Learnings

Using your own product at scale is the ultimate testing strategy — HCP's internal Terraform usage surfaces edge cases and scale limitations that no external testing could replicate
Cloud API rate limits are the primary reliability bottleneck for large-scale Terraform operations — intelligent queuing and per-region rate limiting are essential, not optional
Workspace-per-customer isolation in state management is non-negotiable for multi-tenant platforms — shared state creates unacceptable blast radius for customer-impacting operations
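
The per-region limiting in the second learning can be sketched as a worker pool that caps how many operations run concurrently in each region. This is an illustration under assumptions, not HCP's control plane: `RegionLimiter`, `provision_all`, and the limit values are hypothetical, and the sketch caps concurrency rather than requests per second.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class RegionLimiter:
    """Caps how many operations may run at once in each cloud region."""

    def __init__(self, max_per_region: int) -> None:
        self._max = max_per_region
        self._guard = threading.Lock()
        self._semaphores = {}

    def _semaphore(self, region: str) -> threading.BoundedSemaphore:
        # Create each region's semaphore lazily, under a lock.
        with self._guard:
            if region not in self._semaphores:
                self._semaphores[region] = threading.BoundedSemaphore(self._max)
            return self._semaphores[region]

    def run(self, region: str, op: Callable[[], object]) -> object:
        """Run op once a slot for its region is free."""
        with self._semaphore(region):
            return op()


def provision_all(jobs, limiter: RegionLimiter, workers: int = 16) -> list:
    """jobs is a list of (region, callable); results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(limiter.run, region, op) for region, op in jobs]
        return [f.result() for f in futures]
```

One trade-off in this simple form is that a job waiting for a region slot still occupies a pool thread. A production queue would more likely keep a separate queue per region so that a busy region cannot starve the others.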

How Coinbase Standardized Multi-Cloud Infrastructure with Terraform Modules and Sentinel Policies

Challenge

Coinbase operates one of the largest cryptocurrency exchanges, serving over 100 million verified users across 100+ countries. As a financial services company handling digital assets worth billions of dollars, they face extreme regulatory scrutiny from the SEC, NYDFS, and international regulators. By 2024, their infrastructure spanned AWS as the primary cloud with GCP for specific workloads, totaling hundreds of AWS accounts and dozens of GCP projects. Different engineering teams had adopted ad-hoc infrastructure provisioning practices — some used Terraform, others used CloudFormation, and some still provisioned manually through the AWS Console. This fragmentation created massive security and compliance risks: a 2023 internal audit found that 20 percent of cloud resources lacked proper encryption configuration, 30 percent of S3 buckets had overly permissive access policies, and IAM roles across accounts had accumulated excessive permissions over years of manual changes. Regulatory examiners flagged the lack of consistent infrastructure governance as a material risk finding.

Solution

Coinbase launched an initiative called Infrastructure Standards to consolidate all cloud provisioning onto Terraform with mandatory policy enforcement through HashiCorp Sentinel. They built a comprehensive Terraform module library called coinbase-terraform-modules containing over 200 versioned, security-hardened modules covering every cloud resource type used across the organization. Each module embeds security best practices by default — encryption at rest and in transit enabled, public access blocked, logging enabled, and least-privilege IAM policies generated automatically. A custom module called secure-account-baseline provisions the complete security foundation for new AWS accounts and GCP projects in one apply. Sentinel policies are organized into three tiers: critical policies that block deployment (no unencrypted storage, no public endpoints without WAF, no wildcard IAM permissions), standard policies that require team lead override (non-standard instance types, cross-account access), and advisory policies that generate recommendations. They built an internal Terraform platform called Terraform-at-Coinbase (TaC) that provides a web UI for plan visualization, cost estimation, policy evaluation results, and approval workflows. All Terraform changes require peer review through GitHub pull requests with automated plan output posted as PR comments. A migration team spent 6 months importing existing manually-created resources into Terraform state using terraform import and custom state manipulation tooling.
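
The three-tier policy model can be sketched in Python as a stand-in for Sentinel, whose own enforcement levels (hard-mandatory, soft-mandatory, advisory) map naturally onto the critical, standard, and advisory tiers. The resource shape, policy names, and allowed instance sizes below are hypothetical and chosen only to show how the tiers affect the decision.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Tier(Enum):
    CRITICAL = "critical"  # always blocks the deployment
    STANDARD = "standard"  # blocks unless an override is recorded
    ADVISORY = "advisory"  # never blocks, only recommends


@dataclass(frozen=True)
class Policy:
    name: str
    tier: Tier
    passes: Callable[[dict], bool]  # called with one planned resource


def evaluate(resources: Iterable[dict], policies: Iterable[Policy],
             overrides: frozenset = frozenset()) -> dict:
    """Sort policy failures by tier and decide whether the change may proceed."""
    policies = list(policies)
    blocked, pending, advice = [], [], []
    for res in resources:
        for policy in policies:
            if policy.passes(res):
                continue
            finding = f"{res['address']}: {policy.name}"
            if policy.tier is Tier.CRITICAL:
                blocked.append(finding)
            elif policy.tier is Tier.STANDARD:
                if policy.name not in overrides:
                    pending.append(finding)
            else:
                advice.append(finding)
    return {
        "allowed": not blocked and not pending,
        "blocked": blocked,
        "needs_override": pending,
        "advice": advice,
    }


# Hypothetical example policies over a simplified resource shape.
POLICIES = [
    Policy("storage-encrypted", Tier.CRITICAL,
           lambda r: r["type"] != "storage" or r.get("encrypted", False)),
    Policy("standard-instance-type", Tier.STANDARD,
           lambda r: r["type"] != "instance" or r.get("size") in {"m5.large", "m5.xlarge"}),
    Policy("has-description", Tier.ADVISORY,
           lambda r: bool(r.get("description"))),
]
```

Passing `overrides=frozenset({"standard-instance-type"})` models a recorded team lead override: the standard-tier failure no longer blocks, while a critical failure cannot be overridden at all.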

Outcome

Unencrypted resources reduced from 20 percent to zero within 6 months of Sentinel policy enforcement. Overly permissive S3 buckets reduced from 30 percent to under 2 percent. Regulatory audit findings related to infrastructure governance reduced from 12 material findings to 1. Average time to provision production-ready infrastructure for new services reduced from 2 weeks to 4 hours. Security incident rate related to misconfigured infrastructure decreased by 75 percent. All infrastructure changes now have complete audit trail from PR to applied change, satisfying SOX and regulatory requirements.

Scale

Hundreds of AWS accounts and dozens of GCP projects, over 200 Terraform modules in the internal registry, 500+ engineers authoring Terraform, thousands of Terraform applies per week, managing infrastructure supporting 100+ million users.

Key Learnings

Terraform import at scale is a massive engineering effort — automated tooling for discovering unmanaged resources and generating import configurations is essential for brownfield adoption
Three-tier Sentinel policy organization balances security rigor with developer velocity — not every policy needs to be a hard block, but critical security controls must be non-negotiable
Security-hardened default modules eliminate entire categories of misconfigurations — when the secure path is the easy path, compliance becomes automatic rather than aspirational

How Capital One Moved from Manual Cloud Provisioning to Terraform-Driven Infrastructure Across 100+ Teams

Challenge

Capital One was the first major US bank to go all-in on public cloud, completing their data center exit in 2020. By 2024, they operated entirely on AWS with hundreds of accounts supporting over 100 engineering teams building banking applications, credit card systems, fraud detection platforms, and customer-facing mobile applications. Despite being cloud-native, many teams had adopted AWS Console-based provisioning and custom scripts for infrastructure management, creating a fragmented landscape where no two teams managed infrastructure the same way. The Office of the Comptroller of the Currency (OCC) and Federal Reserve examiners increasingly required demonstrable infrastructure change controls, audit trails, and consistent security configurations — requirements that manual provisioning fundamentally could not satisfy. A 2023 internal assessment found that infrastructure changes across teams had no consistent approval workflow, rollback procedures varied wildly, and disaster recovery configurations were inconsistent. With over 50 million customers' financial data at stake, the regulatory and security implications were severe.

Solution

Capital One built an enterprise-wide Terraform adoption program called Cloud Infrastructure as Code (CIaC) that standardized infrastructure provisioning across all 100+ engineering teams. The program had three pillars: a curated Terraform module catalog, an automated pipeline platform, and a governance framework. The module catalog provides pre-approved, security-reviewed Terraform modules for every AWS service used at Capital One, with modules encoding bank-specific compliance requirements like encryption with customer-managed KMS keys, VPC flow log retention for 7 years, and mandatory tagging for cost allocation and regulatory reporting. The pipeline platform, built on Jenkins and later migrated to GitHub Actions, provides a standardized Terraform workflow with automated plan, policy check, approval gate, and apply stages. Every infrastructure change goes through a peer review process where Terraform plan output is posted to pull requests and must be approved by both a team member and a designated infrastructure reviewer. They built a custom Open Policy Agent (OPA) integration that evaluates Terraform plans against over 300 security and compliance policies before any apply is permitted. For the migration itself, they created a Terraform adoption toolkit that helped teams import existing AWS resources into Terraform management, providing automated resource discovery using AWS Config data, HCL code generation, and guided import workflows. A dedicated platform engineering team of 25 engineers supported the rollout with office hours, training sessions, and embedded coaching.
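
A plan-time policy check of the kind described above usually works on the JSON form of a plan, produced with `terraform show -json <planfile>`, which lists each planned change under `resource_changes` with its `address`, `change.actions`, and `change.after` values. The real integration would express policies in Rego for OPA. The sketch below is a Python stand-in for one mandatory-tagging rule, and the required tag keys are assumptions.

```python
import json

# Hypothetical tag keys; the source only says tagging is mandatory.
REQUIRED_TAGS = {"cost_center", "owner"}


def load_plan(path: str) -> dict:
    """Load the output of `terraform show -json <planfile>` saved to a file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_violations(plan: dict) -> list:
    """Report created or updated resources that lack a required tag."""
    found = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if not {"create", "update"} & set(change.get("actions", [])):
            continue  # skip no-op, read, and pure delete actions
        after = change.get("after") or {}
        if "tags" not in after:
            continue  # simplification: treat resources without a tags attribute as untaggable
        missing = REQUIRED_TAGS - set(after["tags"] or {})
        if missing:
            found.append(f"{rc['address']}: missing tags {sorted(missing)}")
    return found
```

A pipeline stage would fail the run when `find_violations` returns a non-empty list. The sketch ignores tags that are only known after apply and tags inherited from provider-level defaults, both of which a real policy would need to handle.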

Outcome

Infrastructure change audit trail coverage went from approximately 40 percent to 100 percent within 18 months. Mean time to provision new production environments decreased from 3 weeks to 2 days. Regulatory examination findings related to infrastructure change management fell from 8 to zero. Disaster recovery infrastructure consistency improved from 60 percent to 98 percent as standardized DR modules ensured every production workload had a matching DR configuration. In the first year, OPA policies blocked over 15,000 non-compliant changes that previously would have been applied without review. Team satisfaction surveys showed 78 percent of engineers preferred the new Terraform workflow over previous manual processes.

Scale

Over 100 engineering teams, hundreds of AWS accounts, 300+ OPA compliance policies, thousands of Terraform applies per week, managing infrastructure for 50+ million customer accounts, full compliance with OCC and Federal Reserve requirements.

Key Learnings

Enterprise Terraform adoption requires dedicated enablement teams — technology alone does not drive adoption, humans need training, coaching, and support through the transition
OPA policies must be versioned and tested like application code — policy regressions are as dangerous as application bugs in regulated environments
Importing existing infrastructure into Terraform is 80 percent of the effort in brownfield enterprise adoption — module authoring is the easy part
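
Part of the import effort can be automated by generating Terraform `import` blocks (available since Terraform 1.5) from a list of discovered resources, after which `terraform plan -generate-config-out=<file>` can draft the matching HCL. The sketch below shows only the rendering step. The `Discovered` shape and the naming rule are assumptions, and it does not reproduce Capital One's toolkit.

```python
import json
import re
from typing import Iterable, NamedTuple


class Discovered(NamedTuple):
    """One unmanaged resource found by a discovery step (hypothetical shape)."""
    tf_type: str    # Terraform resource type, for example "aws_s3_bucket"
    import_id: str  # provider-specific import ID, for example the bucket name


def local_name(import_id: str) -> str:
    """Derive a valid Terraform resource name from an import ID."""
    name = re.sub(r"[^A-Za-z0-9_]", "_", import_id).strip("_").lower()
    if not name or name[0].isdigit():
        name = f"r_{name}"  # names may not be empty or start with a digit
    return name


def render_import_blocks(resources: Iterable[Discovered]) -> str:
    """Render one import block per discovered resource."""
    blocks = []
    for res in resources:
        blocks.append(
            "import {\n"
            f"  to = {res.tf_type}.{local_name(res.import_id)}\n"
            f"  id = {json.dumps(res.import_id)}\n"
            "}\n"
        )
    return "\n".join(blocks)
```

The sketch does not handle two IDs that sanitize to the same name, or IDs containing `${`, which HCL would treat as interpolation. A real generator would need to check for both before writing the file.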

How Datadog Uses Terraform to Manage Their Monitoring Infrastructure Across 20+ AWS Regions

Challenge

Datadog is one of the largest cloud monitoring and observability platforms, ingesting trillions of data points per day from millions of hosts and containers across their customers' infrastructure. Their own infrastructure spans over 20 AWS regions to provide low-latency data ingestion close to customers worldwide. By 2024, they operated thousands of AWS resources per region including EC2 instances, EKS clusters, RDS databases, ElastiCache clusters, Kinesis streams, and hundreds of custom networking configurations. The infrastructure challenge was unique: each region needed to be a near-identical copy of their core architecture but with region-specific customizations for capacity, network peering, and compliance requirements (especially for EU data residency). Manual region expansion took 6-8 weeks and required a senior engineer's full attention, limiting their ability to expand into new regions to meet customer demand. Configuration drift between regions was a persistent problem: a fix applied to one region's Kinesis configuration might not be propagated to all other regions for weeks, leading to inconsistent behavior and debugging nightmares where issues could not be reproduced across regions.

Solution

Datadog built a Terraform-based region management platform called RegionForge that treats each AWS region as an instantiation of a parameterized Terraform configuration. The core architecture is defined in a set of base modules that encode the complete Datadog region topology — from VPC CIDR allocation and transit gateway setup to EKS cluster configuration and data pipeline infrastructure. Region-specific customizations are layered on top through a variable hierarchy: global defaults, region-tier overrides (primary vs secondary vs edge), and individual region overrides. RegionForge uses a custom orchestration layer built on top of Terraform that manages the dependency graph across regions — for example, ensuring that global resources like Route 53 hosted zones and IAM roles are provisioned before any region-specific resources that depend on them. They implemented a region expansion pipeline where adding a new region requires only a single YAML file specifying the region identifier, capacity tier, and any specific overrides, which generates the complete Terraform configuration and applies it through an automated pipeline. Cross-region consistency is enforced through a continuous reconciliation system that runs terraform plan across all regions every 2 hours, comparing actual state to desired state and alerting on any drift. They built a custom Terraform provider for Datadog-internal resources that manages their own monitoring, alerting, and dashboard configurations as code — effectively using Datadog to monitor Datadog, configured through Terraform.
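
The variable hierarchy (global defaults, then region-tier overrides, then individual region overrides) amounts to a layered deep merge. The Python sketch below assumes the region specification has already been parsed from its YAML file into a dict with hypothetical keys `region`, `tier`, and `overrides`. It is an illustration of the layering, not RegionForge itself.

```python
import json
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return base with override applied; nested dicts are merged key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_region(global_defaults: dict, tier_overrides: dict, region_spec: dict) -> dict:
    """Layer global defaults, tier overrides, then region overrides."""
    layered = deep_merge(global_defaults, tier_overrides.get(region_spec["tier"], {}))
    layered = deep_merge(layered, region_spec.get("overrides", {}))
    layered["region"] = region_spec["region"]
    return layered


def to_tfvars_json(variables: dict) -> str:
    """Serialize resolved variables for a *.tfvars.json file."""
    return json.dumps(variables, indent=2, sort_keys=True)
```

Writing the result to a `.tfvars.json` file lets Terraform load it with `-var-file`. Because later layers only override the keys they name, every difference between two regions is visible in the tier or region override rather than hidden in copied configuration.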

Outcome

New region expansion time reduced from 6-8 weeks to under 3 days including validation and burn-in testing. Cross-region configuration drift incidents reduced from approximately 15 per quarter to near zero through continuous reconciliation. Infrastructure team productivity improved by 4x as measured by regions managed per engineer. Incident response time for infrastructure issues improved by 60 percent because all regions have identical, well-documented Terraform configurations. Successfully expanded into 5 new regions in 2024 to support customer demand, compared to 2 regions the previous year.

Scale

20+ AWS regions, thousands of resources per region, trillions of data points ingested daily, millions of monitored hosts, hundreds of Terraform modules, infrastructure supporting 26,000+ customers.

Key Learnings

Treating regions as parameterized instances of a base configuration eliminates the multi-region consistency problem — differences between regions should be explicit parameters, not implicit drift
Continuous reconciliation every 2 hours catches drift before it causes incidents — waiting for the next deployment to detect drift means living with unknown inconsistencies
Using your own product to monitor your own infrastructure creates a powerful feedback loop: Datadog monitors Datadog through Terraform-managed dashboards and alerts