17 interview questions · kubernetes, prometheus, terraform
Quick Answer
Cilium loads eBPF programs into the Linux kernel to handle packet forwarding, service load balancing, network policy, and L7 observability without iptables rules or per-pod sidecar proxies. Architects must evaluate kernel version requirements, observability maturity via Hubble, CNI migration complexity, and the loss of fine-grained L7 control that a full sidecar proxy provides.
Detailed Answer
Think of a highway toll system. Traditional kube-proxy is like a toll booth where every car stops, gets checked, and is directed to its lane. eBPF with Cilium is like an electronic pass reader embedded in the road surface: the car never stops, the toll is processed at wire speed, and the road itself knows which lane to direct traffic into without a booth.

Cilium replaces the iptables-based kube-proxy and the per-pod user-space proxy model used by traditional service meshes. Instead of maintaining thousands of iptables rules that the kernel evaluates linearly, Cilium attaches eBPF programs to network hooks inside the kernel. These programs handle service IP translation, load balancing across endpoints, and L3/L4 network policy enforcement without packets ever leaving kernel space. For that traffic, this eliminates the context switches between kernel and user space that Envoy-based sidecars require for every connection. L7 rules (such as HTTP method and path filters) are the exception: Cilium redirects matching traffic to a node-local Envoy proxy that it manages, so L7 enforcement still passes through user space, but through one shared proxy per node rather than one sidecar per Pod.

Internally, Cilium uses several eBPF map types to store service endpoints, identity labels, policy rules, and connection tracking state. When a packet arrives, the eBPF program attached to the network interface or socket looks up the destination service, selects a backend Pod (randomly by default, or with Maglev consistent hashing when configured), rewrites headers, and forwards the packet, all within a single kernel call chain. Hubble, the observability layer built on top of Cilium, taps into these eBPF data paths to provide flow logs, DNS visibility, and HTTP metrics without injecting a per-pod proxy.

At production scale, Cilium reportedly runs in over 5,000 production deployments as of 2025, including platforms at Adobe, Bell Canada, and multiple hyperscalers. Teams should monitor eBPF program load errors, map memory usage, endpoint synchronization latency, dropped flow events in Hubble, and kernel version compatibility. Cilium requires Linux kernel 5.10 or later for full feature support, and some advanced features need newer kernels still: BBR congestion control for Pods in the bandwidth manager, for example, needs kernel 5.18 or later.
The non-obvious gotcha is that Cilium does not fully replicate every L7 feature of Envoy-based meshes. While it handles mTLS via SPIFFE identities, basic HTTP routing, and L7 policy, complex traffic management like retries with budgets, circuit breaking with outlier detection, or gRPC-aware load balancing may still require a sidecar or gateway proxy. Architects should map their actual L7 requirements before declaring a full service mesh unnecessary, because removing sidecars and then re-adding them later is a painful migration.
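The lookup path described above (service map, connection tracking, backend selection) can be sketched with plain dictionaries. This is a toy model in Python, not Cilium's data path: the map layouts, the `forward` function, and every address are hypothetical, and the backend choice is a simple hash of the flow tuple modulo the backend count, which is neither Cilium's random selection nor Maglev.

```python
import hashlib

# Hypothetical stand-ins for eBPF maps: service VIP -> backends, plus a conntrack table.
SERVICE_MAP = {("10.96.0.10", 8080): ["10.0.1.5", "10.0.2.7", "10.0.3.9"]}
CONNTRACK = {}


def pick_backend(flow, backends):
    """Map a flow tuple onto one backend deterministically (hash modulo count)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return backends[int.from_bytes(digest[:4], "big") % len(backends)]


def forward(src_ip, src_port, dst_ip, dst_port):
    """Return the backend IP the packet is rewritten to, or None if dst is not a service."""
    flow = (src_ip, src_port, dst_ip, dst_port)
    if flow in CONNTRACK:
        # Established flow: reuse the earlier decision so the connection stays pinned.
        return CONNTRACK[flow]
    backends = SERVICE_MAP.get((dst_ip, dst_port))
    if not backends:
        # Destination is not a known service VIP: no translation happens.
        return None
    backend = pick_backend(flow, backends)
    CONNTRACK[flow] = backend
    return backend
```

The point of the sketch is that each step is a constant-time map lookup keyed by the packet's own fields, in contrast to walking a rule list whose length grows with the number of services.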
Code Example
# Install Cilium with kube-proxy replacement enabled on a fresh cluster
helm install cilium cilium/cilium --version 1.16.4 \
--namespace kube-system \
--set kubeProxyReplacement=true \
--set k8sServiceHost=api.payments-cluster.internal \
--set k8sServicePort=6443 \
--set hubble.enabled=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true
# Verify Cilium replaced kube-proxy and is handling service translation
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep KubeProxyReplacement
# View real-time HTTP flows for the payments namespace using Hubble
# (run inside a Cilium agent pod, which bundles the hubble CLI; shows that node's flows)
kubectl -n kube-system exec ds/cilium -- hubble observe --namespace payments --protocol http
# List the service load-balancing entries held in eBPF maps on a node
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list
# Apply an L7 network policy that restricts HTTP methods on the checkout API
apiVersion: cilium.io/v2 # Cilium-specific CRD for extended network policy
kind: CiliumNetworkPolicy # Extends Kubernetes NetworkPolicy with L7 rules
metadata:
  name: checkout-api-l7-policy # Policy name describing its scope
  namespace: payments # Applies to the payments namespace
spec:
  endpointSelector:
    matchLabels:
      app: checkout-api # Targets the checkout API pods
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: web-frontend # Allows traffic only from the frontend
      toPorts:
        - ports:
            - port: "8080" # The checkout API listening port
              protocol: TCP # HTTP runs over TCP
          rules:
            http:
              - method: POST # Allows POST for creating orders
                path: /api/v2/orders # Restricts to the orders endpoint
              - method: GET # Allows GET for reading order status
                path: /api/v2/orders/.* # Permits path parameters for order lookups
Architecture Diagram
  ┌──────────┐   ┌──────────┐
  │  Pod A   │   │  Pod B   │
  └────┬─────┘   └────┬─────┘
       │              │
       ↓              ↓
┌─────────────────────────────┐
│        eBPF (kernel)        │
│  ┌────────┐ ┌────────────┐  │
│  │ Svc LB │ │ L7 Policy  │  │
│  └────────┘ └────────────┘  │
│  ┌────────┐ ┌────────────┐  │
│  │ConnTrk │ │ Hubble Tap │  │
│  └────────┘ └────────────┘  │
└─────────────────────────────┘
Quick Answer
Istio ambient mesh replaces per-pod Envoy sidecars with two shared components: ztunnel, a per-node L4 proxy handling mTLS and basic routing, and optional waypoint proxies for L7 policy. Architects must evaluate the migration path for existing sidecar workloads, L7 feature parity, multi-cluster ambient support maturity, and the operational tradeoff of shared node-level proxies versus isolated per-pod proxies.
Detailed Answer
Think of an apartment building with two security options. The sidecar model gives every apartment its own security guard who checks visitors at the apartment door: effective but expensive. The ambient model puts a guard at the building entrance who checks IDs for everyone, and only apartments that need advanced screening get a shared floor-level inspector. You get security everywhere with far fewer guards.

Istio ambient mesh reached general availability with Istio 1.24 in November 2024 (Istio 1.22 was the beta release) and has become production-stable through 2025 and into 2026. It fundamentally changes how the data plane is deployed. Traditional Istio injects an Envoy sidecar into every Pod, which adds memory overhead (typically 50-100 MB per Pod), increases startup latency, and creates operational complexity around sidecar injection, upgrade ordering, and resource accounting. Ambient mesh removes all of this by separating L4 and L7 concerns into shared infrastructure.

The architecture has two layers. Ztunnel is a lightweight Rust-based proxy that runs as a DaemonSet on every node. It handles all L4 concerns: mTLS encryption and identity using SPIFFE certificates, TCP-level authorization policy, and basic connection routing. Ztunnel throughput has reportedly improved by about 75 percent over recent releases, and the proxy adds little latency. For workloads that need L7 features (HTTP routing, retries, header-based authorization, traffic splitting), architects deploy waypoint proxies, which are shared Envoy instances typically scoped to a namespace or to specific services rather than injected per Pod.

In production migration, teams should start by enabling ambient mode on a namespace using the label istio.io/dataplane-mode=ambient. Existing sidecar workloads can coexist with ambient workloads during migration. The key evaluation points are:

- L7 feature gaps between sidecar and waypoint proxy configurations.
- Whether multi-cluster ambient mesh is mature enough (it was introduced as alpha in Istio 1.27).
- How existing Istio AuthorizationPolicy and VirtualService resources translate.
- Whether a shared ztunnel on a node creates a blast radius concern, where a ztunnel crash affects all Pods on that node.

The non-obvious gotcha is that ambient mesh changes the failure domain. In sidecar mode, a proxy crash affects one Pod. In ambient mode, a ztunnel crash can disrupt networking for every Pod on that node. This makes ztunnel reliability, resource limits, and upgrade strategy (rolling DaemonSet updates) critical. Architects should also verify that their observability stack captures ztunnel metrics and waypoint proxy metrics in the same dashboards, because the telemetry surface shifts from per-pod to per-node and per-namespace.
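The memory and failure-domain tradeoff above comes down to simple arithmetic, sketched here in Python. Only the 50-100 MB per-sidecar range comes from the text; the cluster size, the per-node ztunnel figure, and the waypoint figures in the usage note are illustrative assumptions, and the function names are hypothetical.

```python
def sidecar_overhead_mb(pod_count, per_sidecar_mb):
    """Total proxy memory when every Pod carries its own Envoy sidecar."""
    return pod_count * per_sidecar_mb


def ambient_overhead_mb(node_count, ztunnel_mb, waypoint_count, waypoint_mb):
    """Total proxy memory with one ztunnel per node plus shared waypoint proxies."""
    return node_count * ztunnel_mb + waypoint_count * waypoint_mb


def pods_affected_by_proxy_crash(mode, pods_per_node):
    """Failure domain of a single proxy crash: one Pod (sidecar) or a whole node (ambient)."""
    if mode == "sidecar":
        return 1
    if mode == "ambient":
        return pods_per_node
    raise ValueError(f"unknown data plane mode: {mode!r}")
```

For an assumed cluster of 20 nodes with 30 Pods each, `sidecar_overhead_mb(600, 50)` and `sidecar_overhead_mb(600, 100)` bracket the sidecar cost at 30,000 to 60,000 MB. With an assumed 64 MB per ztunnel and four 100 MB waypoints, `ambient_overhead_mb(20, 64, 4, 100)` gives 1,680 MB, while `pods_affected_by_proxy_crash("ambient", 30)` shows the price: one crash now touches 30 Pods instead of one.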
Code Example
# Enable ambient mesh mode on the payments namespace
kubectl label namespace payments istio.io/dataplane-mode=ambient
# Verify ztunnel is running on every node in the mesh
kubectl get pods -n istio-system -l app=ztunnel -o wide
# Deploy a waypoint proxy for L7 policy and enroll the payments namespace to use it
istioctl waypoint apply --namespace payments --name payments-waypoint --enroll-namespace
# Verify the waypoint proxy is ready and accepting traffic
kubectl get gateway payments-waypoint -n payments
# Apply an L7 AuthorizationPolicy that requires the waypoint proxy
apiVersion: security.istio.io/v1 # Istio security API for authorization
kind: AuthorizationPolicy # Controls which requests are allowed
metadata:
  name: checkout-api-auth # Policy name describing its scope
  namespace: payments # Namespace where the waypoint proxy runs
spec:
  targetRefs:
    - kind: Service # Targets a specific Kubernetes Service
      group: "" # Core API group
      name: checkout-api # The service to protect
  action: ALLOW # Permits matching requests
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/payments/sa/web-frontend"] # SPIFFE identity of the caller
      to:
        - operation:
            methods: ["POST"] # Allows only POST requests
            paths: ["/api/v2/orders"] # Restricts to the orders endpoint
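# Check ztunnel connection metrics on a node
# (the ztunnel image is minimal and may not ship curl, so port-forward and query from your workstation)
kubectl -n istio-system port-forward ds/ztunnel 15020:15020 &
curl -s localhost:15020/metrics | grep istio_tcp_connections
Architecture Diagram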
┌────────── Node ───────────────┐
│ ┌────────────┐ ┌────────────┐ │
│ │   Pod A    │ │   Pod B    │ │
│ │(no sidecar)│ │(no sidecar)│ │
│ └─────┬──────┘ └─────┬──────┘ │
│       └───────┬──────┘        │
│         ┌─────┴─────┐         │
│         │  ztunnel  │ (L4)    │
│         │ mTLS+auth │         │
│         └─────┬─────┘         │
└───────────────┼───────────────┘
                ↓
           ┌──────────┐
           │ Waypoint │ (L7)
           │  Proxy   │
           └──────────┘
Quick Answer
Publish vetted, hardened Terraform modules to a private registry (TFE or Artifactory) with semantic versioning. Each module has an owning team responsible for maintenance, security updates, and documentation. Consumer teams pin module versions and upgrade through a managed process.
Detailed Answer
Think of a private module registry like an internal app store for a bank. Instead of every team building their own RDS setup from scratch (with varying levels of security hardening), the platform team publishes a vetted 'RDS Module' to the internal store. Application teams install it, configure their specific parameters (database name, size), and get encryption, backup, monitoring, and compliance built in. The platform team updates the module when security requirements change, and consumer teams upgrade on their own schedule, just like updating an app on your phone.

A private Terraform module registry serves as the single source of truth for approved infrastructure patterns. In Terraform Enterprise, the private registry is built in: you publish modules from Git repositories with the naming convention terraform-<provider>-<name> (like terraform-aws-eks-cluster or terraform-aws-rds-postgresql). In organizations using open-source Terraform, alternatives include a Terraform module repository in JFrog Artifactory, a self-hosted registry that implements the module registry protocol, or even Git-based module references with version tags. The key principle is that production infrastructure should only use modules from the private registry, never ad-hoc inline resources. This is enforced via Sentinel policies that check the module source in every Terraform plan.

Semantic versioning is critical for module governance. Every module follows semver (major.minor.patch): patch versions fix bugs without changing behavior, minor versions add features backward-compatibly, and major versions include breaking changes that require consumer updates. When the platform team updates the RDS module to accept a new optional tag input (minor version bump), consumer teams see the new version in the registry but continue using their pinned version until they are ready to upgrade. When a major version changes the module interface (removing an input variable, making a new input mandatory, or changing an output format), consumer teams must explicitly update their code. Version constraints in consumer code (version = "~> 2.0" accepts any 2.x release) allow automatic adoption of patches and minor updates while protecting against breaking changes.

Team ownership is formalized through module CODEOWNERS files and documentation. Each module has an owning team listed in the repository's CODEOWNERS file, ensuring that any PR to the module requires review from the owners. The owning team is responsible for:

- Security patching: updating provider versions and fixing CVEs.
- Documentation: a README with usage examples, input/output descriptions, and architecture diagrams.
- Testing: automated tests using Terratest or terraform-compliance that run in CI.
- Deprecation communication: announcing when older versions will lose support.

In a banking organization, module ownership maps to infrastructure domains: the networking team owns the VPC and transit gateway modules, the database team owns the RDS and ElastiCache modules, and the platform team owns the EKS and observability modules.

The module development lifecycle follows a structured process. A team identifies a repeated infrastructure pattern (every team needs an S3 bucket with encryption, versioning, access logging, and lifecycle policies). They build a module, test it with Terratest (creating real infrastructure in a sandbox account, validating it, then destroying it), write documentation, and publish version 1.0.0 to the private registry. Consumer teams adopt the module with a pinned version constraint. When a security requirement changes (for example, a new PCI-DSS control requires S3 Object Lock), the module team releases version 1.1.0 if the new feature is an optional input, or version 2.0.0 if the feature must be mandatory (a breaking change for consumers not passing the new input). The module team announces the update through internal channels and provides migration guides for major version bumps.

The biggest gotcha is creating modules that are either too opinionated or too flexible. A module that hardcodes the instance type, subnet, and tags is useless because every consumer has different requirements. A module that exposes every single AWS resource argument as an input variable is just a wrapper around the provider with no added value. Good modules encode organizational opinions (encryption is always on, backups are always enabled, monitoring is always configured) while exposing legitimate customization points (instance size, database name, backup retention period).

Another gotcha is orphaned modules: modules published to the registry but never updated, with no clear owner. Implement a quarterly module health review where each module is checked for dependency updates, provider compatibility, and active ownership. Deprecate modules that are no longer maintained, with clear migration paths to replacements.
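The pessimistic constraint operator is easy to misread, so here is a minimal Python sketch of how `~>` bounds a version range. It is an illustration rather than Terraform's own resolver: the function names are hypothetical, and pre-release suffixes such as `-rc1` are deliberately not handled.

```python
def parse_version(text):
    """Turn '2.3.0' into (2, 3, 0); pre-release tags are out of scope for this sketch."""
    return tuple(int(part) for part in text.strip().split("."))


def _pad(parts, length=3):
    """Right-pad with zeros so (3,) compares as (3, 0, 0)."""
    return parts + (0,) * (length - len(parts))


def satisfies_pessimistic(version, constraint):
    """Check a version against a '~> X.Y' or '~> X.Y.Z' constraint.

    Only the rightmost component given in the constraint may increase:
    '~> 2.3' means >= 2.3.0 and < 3.0.0; '~> 2.3.0' means >= 2.3.0 and < 2.4.0.
    """
    constraint = constraint.strip()
    if not constraint.startswith("~>"):
        raise ValueError("only '~>' constraints are handled in this sketch")
    lower = parse_version(constraint[2:])
    if len(lower) < 2:
        raise ValueError("expected at least MAJOR.MINOR after '~>'")
    # Bump the second-to-last component and drop the last to get the exclusive upper bound.
    upper = lower[:-2] + (lower[-2] + 1,)
    return _pad(lower) <= _pad(parse_version(version)) < _pad(upper)
```

Under this reading, `satisfies_pessimistic("2.9.4", "~> 2.3")` is true, so a two-part constraint admits new minor releases as well as patches. A team that wants patches only would pin three parts, as in `"~> 2.3.0"`, which rejects `2.4.0`.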
Code Example
# Private module structure: terraform-aws-rds-postgresql
# Published to TFE private registry
#
# terraform-aws-rds-postgresql/
# ├── main.tf # Core RDS resources
# ├── variables.tf # Input variables
# ├── outputs.tf # Output values
# ├── versions.tf # Provider version constraints
# ├── sentinel.hcl # Policy tests
# ├── README.md # Usage documentation
# ├── CODEOWNERS # Team ownership
# ├── CHANGELOG.md # Version history
# └── test/
# └── rds_test.go # Terratest integration tests
# main.tf - Encodes banking compliance requirements
resource "aws_db_instance" "this" {
  identifier     = "${var.team}-${var.service_name}-${var.environment}"
  engine         = "postgres"
  engine_version = var.engine_version
  instance_class = var.instance_class

  # MANDATORY: Encryption at rest (PCI-DSS)
  storage_encrypted = true
  kms_key_id        = var.kms_key_arn

  # MANDATORY: Automated backups
  backup_retention_period = max(var.backup_retention_days, 7) # Min 7 days
  backup_window           = "03:00-04:00"

  # MANDATORY: Multi-AZ for production
  multi_az = var.environment == "production" ? true : var.multi_az

  # MANDATORY: No public access
  publicly_accessible    = false
  db_subnet_group_name   = aws_db_subnet_group.this.name
  vpc_security_group_ids = [aws_security_group.this.id]

  # MANDATORY: Audit logging for compliance
  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  # MANDATORY: Deletion protection in production
  deletion_protection = var.environment == "production"

  tags = merge(var.additional_tags, {
    ManagedBy          = "terraform"
    Module             = "terraform-aws-rds-postgresql"
    ModuleVersion      = "2.3.0"
    Team               = var.team
    DataClassification = var.data_classification
    ComplianceScope    = "pci-dss"
  })
}
# variables.tf - Expose customization, encode opinions
variable "service_name" {
  description = "Name of the service this database supports"
  type        = string

  validation {
    condition     = can(regex("^[a-z][a-z0-9-]+$", var.service_name))
    error_message = "Service name must be lowercase alphanumeric with hyphens."
  }
}

variable "instance_class" {
  description = "RDS instance class"
  type        = string
  default     = "db.r6g.large"

  validation {
    condition     = can(regex("^db\\.", var.instance_class))
    error_message = "Must be a valid RDS instance class."
  }
}

variable "data_classification" {
  description = "Data classification level (required for compliance)"
  type        = string

  validation {
    condition     = contains(["public", "internal", "confidential", "restricted"], var.data_classification)
    error_message = "Must be one of: public, internal, confidential, restricted."
  }
}
---
# Consumer usage - payments team
# services/payments-api/infra/database.tf
module "settlements_db" {
  source  = "app.terraform.io/bank-platform/rds-postgresql/aws"
  version = "~> 2.3" # Accept 2.x minor and patch releases from 2.3 up, never 3.0 (use "~> 2.3.0" for patches only)

  service_name        = "settlements-processor"
  team                = "payments"
  environment         = "production"
  instance_class      = "db.r6g.xlarge"
  engine_version      = "15.4"
  kms_key_arn         = data.aws_kms_key.banking.arn
  data_classification = "restricted" # PII and financial data
  vpc_id              = data.terraform_remote_state.network.outputs.vpc_id
  subnet_ids          = data.terraform_remote_state.network.outputs.database_subnets

  additional_tags = {
    CostCenter = "payments-engineering"
  }
}
---
# Sentinel policy: enforce private registry usage
# policies/governance/require-private-registry.sentinel
import "tfconfig/v2" as tfconfig

approved_registry = "app.terraform.io/bank-platform"

module_calls = filter tfconfig.module_calls as _, mc {
	mc.source is not ""
}

all_modules_from_registry = rule {
	all module_calls as _, mc {
		mc.source matches approved_registry + "/.+"
	}
}

main = rule { all_modules_from_registry }
---
# Terratest - automated module validation
# test/rds_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestRdsModule(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../",
		Vars: map[string]interface{}{
			"service_name":        "test-settlements",
			"team":                "platform-test",
			"environment":         "sandbox",
			"data_classification": "internal",
			"kms_key_arn":         "arn:aws:kms:us-east-1:123456:key/test",
		},
	}

	// Always tear down the sandbox infrastructure, even if assertions fail
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	// Validate encryption is enabled
	encrypted := terraform.Output(t, terraformOptions, "storage_encrypted")
	assert.Equal(t, "true", encrypted)
}
Architecture Diagram
┌─────────────────────────────────────────────────────────────────┐
│              Terraform Module Governance Lifecycle              │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Module Development (Platform Team)                       │  │
│  │                                                           │  │
│  │  Identify Pattern → Build Module → Terratest → Publish    │  │
│  │                                                           │  │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │  │
│  │  │ Hardened │→ │ Unit +   │→ │ Semver   │→ │ Private  │   │  │
│  │  │ Defaults │  │ Integ    │  │ Tag      │  │ Registry │   │  │
│  │  │ (encrypt,│  │ Tests    │  │ v2.3.0   │  │ Publish  │   │  │
│  │  │  backup) │  │          │  │          │  │          │   │  │
│  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │  │
│  └───────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Private Registry (app.terraform.io/bank-platform)        │  │
│  │                                                           │  │
│  │  terraform-aws-rds-postgresql  v2.3.0  (DB Team)          │  │
│  │  terraform-aws-eks-cluster     v3.1.0  (Platform Team)    │  │
│  │  terraform-aws-vpc-banking     v1.8.0  (Network Team)     │  │
│  │  terraform-aws-s3-compliant    v2.0.0  (Platform Team)    │  │
│  └───────────────────────────┬───────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Consumer Teams (Pin versions, upgrade on schedule)       │  │
│  │                                                           │  │
│  │  module "settlements_db" {                                │  │
│  │    source  = "app.terraform.io/.../rds-postgresql/aws"    │  │
│  │    version = "~> 2.3"  # minor + patch within 2.x         │  │
│  │  }                                                        │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Sentinel: all prod modules MUST use the private registry │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Quick Answer
Deploy EKS using a layered module structure: a networking module for VPC/subnets, a cluster module for the EKS control plane with OIDC provider, and a node-groups module for managed node groups with launch templates. Each module has its own state file and communicates through remote state data sources or SSM parameters.
Detailed Answer
Deploying EKS with Terraform is like building a three-story building: the foundation is networking (VPC, subnets, NAT gateways), the structural frame is the EKS control plane (API server, etcd, OIDC provider), and the floors are the node groups (compute capacity where workloads actually run). Each layer depends on the one below it, and you should be able to rebuild any floor without demolishing the entire building.

The networking module provisions a dedicated VPC with public subnets for load balancers, private subnets for worker nodes, and optionally isolated subnets for databases. EKS load balancer discovery relies on specific subnet tags: kubernetes.io/cluster/<cluster-name> = shared on all subnets, kubernetes.io/role/elb = 1 on public subnets for internet-facing ALBs, and kubernetes.io/role/internal-elb = 1 on private subnets for internal services. The module outputs subnet IDs and the VPC ID for consumption by the cluster module. NAT gateways should be deployed per-AZ in production for high availability, meaning three NAT gateways across us-east-1a, us-east-1b, and us-east-1c.

The cluster module creates the EKS control plane using the aws_eks_cluster resource or the terraform-aws-modules/eks/aws community module. Critical configurations include the Kubernetes version (pin to a specific minor version like 1.29), the cluster endpoint access (private-only or public-and-private with CIDR restrictions), envelope encryption for secrets using a dedicated KMS key, and the OIDC provider for IAM Roles for Service Accounts (IRSA). The OIDC provider is frequently missed but essential for IRSA: it enables pods to assume IAM roles without injecting long-lived AWS credentials. EKS Pod Identity is a newer alternative that serves the same goal.

The node groups module manages EKS managed node groups with launch templates. Production clusters typically need multiple node groups: a system node group (t3.xlarge, 3 nodes, taints for system workloads like CoreDNS), an application node group (m5.2xlarge, 3-15 nodes with cluster autoscaler), and optionally a GPU node group (g4dn.xlarge for ML inference). Each node group uses a custom launch template to specify the AMI (Amazon EKS-optimized AMI), bootstrap arguments, block device mappings (100 GiB gp3 root volume), and user data for kubelet configuration. The node group's update configuration controls how nodes are replaced in a rolling fashion when the launch template version changes.

In production, state separation between these modules is critical. If your node group Terraform runs into an error, you do not want it to affect the EKS control plane state. Use separate state files: one for networking, one for the cluster, and one per node group pool. Pass data between modules using terraform_remote_state data sources or aws_ssm_parameter lookups. This blast radius isolation means a botched node group change cannot accidentally destroy the control plane.

A common gotcha is the chicken-and-egg problem with cluster access configuration. The aws-auth ConfigMap (which controls IAM-to-Kubernetes RBAC mapping) requires a running cluster, but nodes and other IAM principals need an entry in it before they can join or access the cluster. The solution is to use the EKS access entries API (available once the cluster is on a recent enough platform version, which varies by Kubernetes version) instead of managing aws-auth directly, or to use the kubernetes provider with the EKS cluster's endpoint and token to manage the ConfigMap in the same apply as the cluster creation.
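The subnet tagging and NAT gateway rules above are mechanical enough to express as two small functions. This is a Python sketch for illustration only: the function names are hypothetical, and the single-NAT fallback for non-production environments is an assumption, since the text only prescribes one gateway per AZ in production.

```python
def eks_subnet_tags(cluster_name, public):
    """Return the discovery tags for one subnet, following the three tag rules in the text."""
    tags = {f"kubernetes.io/cluster/{cluster_name}": "shared"}
    if public:
        # Public subnets host internet-facing load balancers.
        tags["kubernetes.io/role/elb"] = "1"
    else:
        # Private subnets host internal load balancers.
        tags["kubernetes.io/role/internal-elb"] = "1"
    return tags


def nat_gateway_count(availability_zones, production):
    """One NAT gateway per AZ in production; a single shared one otherwise (assumption)."""
    return len(availability_zones) if production else 1
```

For example, `eks_subnet_tags("payments-eks-prod", public=False)` yields the cluster tag plus `kubernetes.io/role/internal-elb`, and `nat_gateway_count(["us-east-1a", "us-east-1b", "us-east-1c"], production=True)` returns 3, matching the per-AZ layout described above.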
Code Example
# modules/eks-cluster/main.tf: EKS control plane module
# Create the EKS cluster with private endpoint and envelope encryption
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name following org naming convention
  name = "payments-eks-${var.environment}"
  # Pin to specific Kubernetes minor version
  version = "1.29"
  # IAM role for the EKS control plane service
  role_arn = aws_iam_role.eks_cluster_role.arn

  # VPC configuration for the control plane ENIs
  vpc_config {
    # Private subnets from the networking module
    subnet_ids = var.private_subnet_ids
    # Enable private API endpoint for in-VPC access
    endpoint_private_access = true
    # Keep the public endpoint, restricted to CI/CD runner CIDRs only
    endpoint_public_access = true
    # Public egress CIDRs of the CI/CD runners (hypothetical variable; the original
    # listed RFC 1918 ranges, which are not valid for the public endpoint allow list)
    public_access_cidrs = var.ci_runner_public_cidrs
    # Security group for additional control plane access rules
    security_group_ids = [aws_security_group.eks_cluster_sg.id]
  }

  # Envelope encryption for Kubernetes secrets using KMS
  encryption_config {
    # Encrypt the secrets resource type stored in etcd
    resources = ["secrets"]
    provider {
      # Dedicated KMS key for EKS secrets encryption
      key_arn = aws_kms_key.eks_secrets_key.arn
    }
  }

  # Enable all control plane logging for audit and troubleshooting
  enabled_cluster_log_types = ["api", "audit", "authenticator", "controllerManager", "scheduler"]

  # Tags for cost allocation and ownership tracking
  tags = {
    Team        = "payments-platform"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

# Fetch the OIDC issuer's TLS certificate chain (requires the hashicorp/tls provider)
data "tls_certificate" "eks_oidc" {
  url = aws_eks_cluster.payments_cluster.identity[0].oidc[0].issuer
}

# OIDC provider for IAM Roles for Service Accounts (IRSA)
resource "aws_iam_openid_connect_provider" "eks_oidc" {
  # OIDC issuer URL from the EKS cluster
  url = aws_eks_cluster.payments_cluster.identity[0].oidc[0].issuer
  # Audience for the STS AssumeRoleWithWebIdentity call
  client_id_list = ["sts.amazonaws.com"]
  # TLS certificate thumbprint for the OIDC provider
  thumbprint_list = [data.tls_certificate.eks_oidc.certificates[0].sha1_fingerprint]
}
# modules/eks-nodegroups/main.tf: Managed node groups
resource "aws_eks_node_group" "application_nodes" {
  # Reference the payments EKS cluster by name
  cluster_name = var.cluster_name
  # Node group name identifying workload type
  node_group_name = "payments-app-nodes-${var.environment}"
  # IAM role for EC2 instances in this node group
  node_role_arn = aws_iam_role.node_group_role.arn
  # Deploy nodes into private subnets only
  subnet_ids = var.private_subnet_ids
  # Instance types optimized for payment processing workloads
  instance_types = ["m5.2xlarge"]
  # Use AL2023 EKS-optimized AMI
  ami_type = "AL2023_x86_64_STANDARD"

  # Autoscaling configuration for the node group
  scaling_config {
    # Minimum nodes for baseline capacity
    min_size = 3
    # Desired nodes for normal transaction volume
    desired_size = 6
    # Maximum nodes for peak shopping events
    max_size = 15
  }

  # Launch template for custom node configuration
  launch_template {
    # Reference the custom launch template
    id = aws_launch_template.app_nodes.id
    # Use the latest version of the launch template
    version = aws_launch_template.app_nodes.latest_version
  }

  # Rolling update strategy to avoid downtime
  update_config {
    # Update 1 node at a time for safe rolling deploys
    max_unavailable = 1
  }
}
# Launch template for application node group
resource "aws_launch_template" "app_nodes" {
  # Template name prefix matching the node group convention
  name_prefix = "payments-app-nodes-${var.environment}"

  # 100 GiB gp3 root volume for container images and logs
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      # 100 GiB root volume for container runtime storage
      volume_size = 100
      # gp3 for consistent baseline IOPS without cost of io2
      volume_type = "gp3"
      # Encrypt node volumes with the account default KMS key
      encrypted = true
    }
  }

  # Tag instances for cost tracking and identification
  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "payments-app-node-${var.environment}"
      NodeGroup   = "application"
      Environment = var.environment
    }
  }
}
Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│                EKS Terraform Module Structure                 │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Layer 1: Networking Module (state: networking.tfstate)       │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  payments-vpc (10.0.0.0/16)                         │      │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐     │      │
│  │  │ Public     │  │ Public     │  │ Public     │     │      │
│  │  │ Subnet     │  │ Subnet     │  │ Subnet     │     │      │
│  │  │ 1a (ALB)   │  │ 1b (ALB)   │  │ 1c (ALB)   │     │      │
│  │  └────────────┘  └────────────┘  └────────────┘     │      │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐     │      │
│  │  │ Private    │  │ Private    │  │ Private    │     │      │
│  │  │ Subnet     │  │ Subnet     │  │ Subnet     │     │      │
│  │  │ 1a (Nodes) │  │ 1b (Nodes) │  │ 1c (Nodes) │     │      │
│  │  └────────────┘  └────────────┘  └────────────┘     │      │
│  │  NAT GW x3 (one per AZ for HA)                      │      │
│  └─────────────────────────────────────────────────────┘      │
│        │ outputs: vpc_id, subnet_ids                          │
│        ↓                                                      │
│  Layer 2: Cluster Module (state: eks-cluster.tfstate)         │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  payments-eks-prod                                  │      │
│  │  ┌───────────────┐  ┌────────────┐  ┌─────────────┐ │      │
│  │  │ Control Plane │  │ OIDC       │  │ KMS Key     │ │      │
│  │  │ K8s 1.29      │  │ Provider   │  │ (secrets    │ │      │
│  │  │ API + etcd    │  │ (for IRSA) │  │ encryption) │ │      │
│  │  └───────────────┘  └────────────┘  └─────────────┘ │      │
│  └─────────────────────────────────────────────────────┘      │
│        │ outputs: cluster_name, oidc_arn, endpoint            │
│        ↓                                                      │
│  Layer 3: Node Groups (state: eks-nodegroups.tfstate)         │
│  ┌─────────────────────────────────────────────────────┐      │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │      │
│  │  │ System      │  │ Application │  │ GPU Nodes   │  │      │
│  │  │ Nodes       │  │ Nodes       │  │ (optional)  │  │      │
│  │  │ t3.xlarge   │  │ m5.2xlarge  │  │ g4dn.xlarge │  │      │
│  │  │ 3 fixed     │  │ 3-15 auto   │  │ 0-4 auto    │  │      │
│  │  │ 100Gi gp3   │  │ 100Gi gp3   │  │ 100Gi gp3   │  │      │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  │      │
│  └─────────────────────────────────────────────────────┘      │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Prevent cross-environment contamination through four layers: separate state files per environment with IAM-scoped access, provider configurations locked to specific AWS accounts via assume_role, module versioning with pinned tags so an untested module change cannot propagate, and CI/CD pipeline guardrails that validate the target environment before apply.
Detailed Answer
Preventing cross-environment contamination in Terraform is like building firewalls between apartments in a building: you need physical separation (state isolation), locked doors (IAM boundaries), independent utilities (provider configurations), and a building code (CI/CD guardrails) that prevents shortcuts through shared walls.

The first layer is state file isolation. Each environment must have its own state file with its own backend configuration. Workspaces give each environment a separate state file but keep them all in one backend under one set of credentials, so do not rely on workspaces alone if the blast radius of corruption is unacceptable. The state file contains sensitive data including resource IDs, IP addresses, and sometimes plaintext outputs. An S3 bucket policy should restrict each environment's Terraform role to only its own key prefix: the prod role can access s3://state-bucket/prod/* but is explicitly denied s3://state-bucket/dev/*. This prevents a misconfigured prod pipeline from reading or overwriting dev state.

The second layer is provider-level isolation. Each environment's provider block must assume a role in its specific AWS account. Even if someone accidentally passes the wrong tfvars file, the provider configuration ensures Terraform operates in the correct account. Add a validation check using the aws_caller_identity data source: compare the actual account ID against the expected one and fail early if they do not match. This catches the scenario where an engineer runs terraform apply with prod credentials but dev configuration, or vice versa.

The third layer is module versioning. When environments share modules from a private registry or Git repository, use pinned version tags. Dev might use module version 2.3.0-rc1 while Prod uses 2.2.0 (the last stable release). Without version pinning, a module change pushed to the main branch immediately affects every environment that references source = "git::...?ref=main". This is the most common cause of accidental cross-environment impact: someone fixes a bug in a shared VPC module, the fix has a typo, and every environment that references the module head picks up the broken code on next apply.

The fourth layer is CI/CD pipeline guardrails. The pipeline should validate environment consistency before plan: check that the workspace name matches the tfvars file, verify the AWS account ID matches the target environment, and confirm the Git branch is allowed to deploy to that environment (only main can deploy to prod). Implement a pre-plan script that runs aws sts get-caller-identity and compares the account against an expected value from the pipeline configuration.

Remote state data sources are a particularly dangerous vector for cross-environment bleed. When a production EKS module reads the networking module's state via terraform_remote_state, it must reference the production networking state, not dev. Parameterize the remote state data source's backend configuration using the environment variable: in prod, data.terraform_remote_state.networking.config.key should resolve to prod/networking/terraform.tfstate rather than a hardcoded path. A common gotcha is a hardcoded key that is correct in the environment where it was written but silently points at another environment's state once someone copies the configuration without updating the key.

The ultimate safeguard is defense in depth: even if one layer fails, the others prevent damage. If the IAM policy has a bug that allows dev access to prod state, the provider's assume_role still locks operations to the dev account. If the provider configuration is wrong, the account ID validation check fails before any resources are touched.
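The state-isolation layer described above can be sketched as a small policy generator. This is a minimal illustration, not a complete backend policy: the bucket name `state-bucket` comes from the example path in the text, while the environment list, the statement IDs, and the chosen S3 actions are assumptions (a real S3 backend role typically also needs list and locking permissions).

```python
import json

# Hypothetical values for illustration only
STATE_BUCKET = "state-bucket"
ENVIRONMENTS = ["dev", "qa", "uat", "prod"]


def state_policy(environment: str) -> dict:
    """Build an IAM policy document that scopes a Terraform role to one state prefix."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    own_prefix = f"arn:aws:s3:::{STATE_BUCKET}/{environment}/*"
    # Every other environment's prefix goes into an explicit Deny
    other_prefixes = [
        f"arn:aws:s3:::{STATE_BUCKET}/{env}/*"
        for env in ENVIRONMENTS
        if env != environment
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOwnStatePrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": own_prefix,
            },
            {
                "Sid": "DenyOtherEnvironments",
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": other_prefixes,
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(state_policy("prod"), indent=2))
```

Generating the document from one environment list keeps the Allow and Deny statements in step when a new environment is added, which is the main reason to template it rather than hand-write one policy per role.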
Code Example
# Account identity validation: fail fast on wrong account
# Fetch the actual AWS account identity
data "aws_caller_identity" "current" {}
# Validate the account ID matches the expected environment
locals {
  # Map of expected account IDs per environment
  expected_accounts = {
    dev  = "111111111111"
    qa   = "222222222222"
    uat  = "333333333333"
    prod = "444444444444"
  }
  # Check if current account matches the target environment
  account_validated = (
    data.aws_caller_identity.current.account_id ==
    local.expected_accounts[var.environment]
  )
}
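# Validation resource that fails the plan if the accounts mismatch
resource "null_resource" "account_validation" {
  lifecycle {
    # A precondition (Terraform 1.2+) reports a readable error; the older trick of
    # assigning a string to count only fails with an opaque type-conversion error
    precondition {
      condition     = local.account_validated
      error_message = "ERROR: Running in wrong AWS account"
    }
  }
}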
# Remote state data source: parameterized per environment
data "terraform_remote_state" "networking" {
  # S3 backend for reading the networking layer state
  backend = "s3"
  config = {
    # Same state bucket as all other stacks
    bucket = "valuemomentum-terraform-state-prod"
    # Key parameterized by environment to prevent cross-env reads
    key = "${var.environment}/networking/terraform.tfstate"
    # Same region as the backend
    region = "us-east-1"
  }
}
# Use networking outputs safely scoped to the correct environment
resource "aws_eks_cluster" "payments_cluster" {
  # Cluster name scoped to the environment
  name     = "payments-eks-${var.environment}"
  version  = "1.29"
  role_arn = aws_iam_role.eks_cluster_role.arn
  vpc_config {
    # Subnet IDs from the SAME environment's networking state
    subnet_ids              = data.terraform_remote_state.networking.outputs.private_subnet_ids
    endpoint_private_access = true
    # Public endpoint is disabled only in prod
    endpoint_public_access = var.environment != "prod"
  }
}
# Module versioning: pinned per environment
module "payments_vpc" {
  # Pinned Git tag prevents untested changes from propagating
  source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.2.0"
  # In dev, you might test a release candidate:
  # source = "git::https://github.com/valuemomentum/tf-modules.git//vpc?ref=v2.3.0-rc1"
  vpc_name    = "payments-vpc-${var.environment}"
  vpc_cidr    = var.vpc_cidr
  environment = var.environment
}
# CI/CD pre-plan validation script (run before terraform plan)
# #!/bin/bash
# EXPECTED_ACCOUNT=$(jq -r ".${ENVIRONMENT}" accounts.json)
# ACTUAL_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
# if [ "$EXPECTED_ACCOUNT" != "$ACTUAL_ACCOUNT" ]; then
#   echo "FATAL: Expected account $EXPECTED_ACCOUNT but authenticated to $ACTUAL_ACCOUNT"
#   exit 1
# fi
◈ Architecture Diagram
┌───────────────────────────────────────────────────────────────┐
│              Cross-Environment Protection Layers              │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Layer 1: State Isolation (IAM-Scoped)                        │
│    ┌──────────────┐   DENY    ┌──────────────┐                │
│    │ Dev Role     │─────X─────│ prod/*       │                │
│    │ (IAM)        │           │ state keys   │                │
│    └──────────────┘           └──────────────┘                │
│    ┌──────────────┐   ALLOW   ┌──────────────┐                │
│    │ Dev Role     │───────────│ dev/*        │                │
│    │ (IAM)        │           │ state keys   │                │
│    └──────────────┘           └──────────────┘                │
│                                                               │
│  Layer 2: Provider Account Lock                               │
│    ┌──────────────────────────────────────────┐               │
│    │ provider "aws" {                         │               │
│    │   assume_role {                          │               │
│    │     role_arn = ".../${var.env}/Role"     │               │
│    │   }                                      │               │
│    │ }                                        │               │
│    │ → Operations locked to target account    │               │
│    └──────────────────────────────────────────┘               │
│                                                               │
│  Layer 3: Account ID Validation                               │
│    ┌──────────────────────────────────────────┐               │
│    │ aws_caller_identity.account_id           │               │
│    │   == expected_accounts[var.environment]  │               │
│    │ → FAIL FAST if wrong account             │               │
│    └──────────────────────────────────────────┘               │
│                                                               │
│  Layer 4: Module Version Pinning                              │
│    ┌──────────────────────────────────────────┐               │
│    │ Dev:  source = "...?ref=v2.3.0-rc1"      │               │
│    │ Prod: source = "...?ref=v2.2.0"          │               │
│    │ → Untested changes cannot reach prod     │               │
│    └──────────────────────────────────────────┘               │
│                                                               │
│  Layer 5: CI/CD Pipeline Guardrails                           │
│    ┌──────────────────────────────────────────┐               │
│    │ Branch → Environment mapping             │               │
│    │   main → prod (requires approval)        │               │
│    │   develop → dev (auto-apply)             │               │
│    │ Pre-plan: sts get-caller-identity check  │               │
│    └──────────────────────────────────────────┘               │
└───────────────────────────────────────────────────────────────┘
Quick Answer
Composable modules follow a thin-wrapper pattern with clear input/output contracts, use variable validation blocks for early error detection, semantic versioning via Git tags for safe upgrades, and publish to a private registry for organizational reuse. Module composition uses outputs and data sources rather than nested module trees that create opaque dependency chains.
Detailed Answer
Think of building a house with prefabricated components. A good prefab wall panel has standard dimensions (clear interface), quality-tested materials (validation), a version number stamped on it (semantic versioning), and is available from a catalog (registry). A bad panel is custom-cut for one house, undocumented, and stored in someone's garage. Terraform module design follows the same principles.

A well-designed Terraform module encapsulates a single infrastructure concern with a clear input/output contract. The module should do one thing well (create an RDS instance with standard security settings, or provision a VPC with consistent CIDR allocation) rather than trying to create an entire environment. Variable validation blocks catch configuration errors at plan time rather than during apply or, worse, at runtime when a database accepts an invalid parameter and fails to start. Validation expressions use conditions and error messages to enforce naming patterns, CIDR ranges, instance size constraints, and environment-specific rules before any API call is made.

Internally, Terraform resolves module sources during terraform init. A module sourced from a Git repository with a version tag (git::https://github.com/company/terraform-aws-rds.git?ref=v2.3.1) is downloaded and cached in .terraform/modules. The version pin ensures that a new commit to the module repository does not unexpectedly change infrastructure across all consumers. When published to a private registry (Terraform Cloud, Artifactory, or a self-hosted registry), modules appear in a searchable catalog with documentation generated from variables.tf, outputs.tf, and README.md. The registry enforces semantic versioning, making it safe to specify version constraints like ~> 2.3 (any 2.x from 2.3 upward) in consumer configurations.

At production scale, module composition patterns matter as much as individual module quality. The recommended pattern is flat composition: a root configuration references multiple modules at the same level, passing outputs from one module as inputs to another, rather than nesting modules three or four levels deep. Deep nesting creates opaque dependency chains where a change in a leaf module requires understanding the full tree to predict impact. Root modules should be environment-specific (payments-prod, payments-staging) and pin module versions independently per environment so that staging can test a new module version before production adopts it. Teams should run terraform validate and tflint in CI for every module change, and use automated tests with terratest or terraform test to verify module behavior.

The non-obvious gotcha is that module versioning only works if teams actually bump versions. A common failure pattern is pinning to a Git branch (ref=main) instead of a tag, which means terraform init on different days pulls different code. Another trap is overusing count or for_each in modules to make them do too many things: a module that creates either an RDS instance or an Aurora cluster based on a boolean variable becomes untestable and produces confusing plans. Architects should split divergent resources into separate modules rather than adding conditional logic that makes the module's behavior unpredictable.
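To make the `~> 2.3` constraint mentioned above concrete, here is a minimal sketch of the pessimistic-constraint rule: only the rightmost component named in the constraint may increase. The function names are hypothetical, and the sketch ignores pre-release suffixes and single-component constraints, which Terraform itself handles.

```python
def parse(version: str) -> tuple:
    """Turn '2.3.1' into (2, 3, 1); pre-release suffixes are not handled here."""
    return tuple(int(part) for part in version.split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against a '~> X.Y' or '~> X.Y.Z' style constraint."""
    base = parse(constraint.replace("~>", "").strip())
    if len(base) < 2:
        raise ValueError("this sketch expects at least major.minor")
    candidate = parse(version)
    # Pad with zeros so that 2.3 compares like 2.3.0
    width = max(len(base), len(candidate))
    lower = base + (0,) * (width - len(base))
    candidate = candidate + (0,) * (width - len(candidate))
    # Upper bound: bump the second-to-last constraint component, zero the rest
    upper = base[:-2] + (base[-2] + 1,)
    upper = upper + (0,) * (width - len(upper))
    return lower <= candidate < upper


if __name__ == "__main__":
    for v in ["2.2.9", "2.3.0", "2.9.1", "3.0.0"]:
        print(v, satisfies_pessimistic(v, "~> 2.3"))
```

Under this rule `~> 2.3` admits any 2.x release from 2.3 upward, while `~> 2.3.1` admits only 2.3.x patch releases, which is why the number of components in the constraint matters.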
Code Example
# modules/rds-instance/variables.tf: module input contract with validation
variable "instance_name" {
  # Human-readable name for the database instance
  type        = string
  description = "Name of the RDS instance, must follow naming convention"
  validation {
    # Enforce the team naming convention: team-service-env
    condition     = can(regex("^[a-z]+-[a-z]+-(?:dev|staging|prod)$", var.instance_name))
    error_message = "Instance name must match pattern: team-service-env (e.g., payments-orders-prod)."
  }
}
variable "instance_class" {
  # RDS instance type for compute sizing
  type        = string
  description = "RDS instance class, restricted to approved sizes"
  validation {
    # Only allow instance classes approved by the platform team
    condition     = contains(["db.t3.medium", "db.r6g.large", "db.r6g.xlarge", "db.r6g.2xlarge"], var.instance_class)
    error_message = "Instance class must be one of the platform-approved sizes."
  }
}
variable "allocated_storage_gb" {
  # Storage size in gigabytes
  type        = number
  description = "Allocated storage in GB, minimum 20, maximum 5000"
  validation {
    # Enforce storage boundaries to prevent cost overruns
    condition     = var.allocated_storage_gb >= 20 && var.allocated_storage_gb <= 5000
    error_message = "Storage must be between 20 and 5000 GB."
  }
}
# Root configuration consuming the module with version pin
# payments-prod/main.tf
module "orders_database" {
  # Source from private registry with semantic version constraint
  source  = "app.terraform.io/company/rds-instance/aws"
  version = "~> 2.3" # Accept any 2.x >= 2.3, reject 3.x
  # Pass validated inputs to the module
  instance_name        = "payments-orders-prod"
  instance_class       = "db.r6g.large"
  allocated_storage_gb = 500
  # Pass outputs from the network module as inputs
  subnet_group_name  = module.vpc.database_subnet_group_name
  security_group_ids = [module.vpc.database_security_group_id]
}
# Output the database endpoint for consumption by other configurations
output "orders_db_endpoint" {
  # Expose the RDS endpoint for application configuration
  value       = module.orders_database.endpoint
  description = "Connection endpoint for the orders database"
}
◈ Architecture Diagram
 ┌──────────┐
 │ Registry │
 │  v2.3.1  │
 └────┬─────┘
      │
 ┌────┴─────┐
 │ Root Cfg │
 │   prod   │
 └──┬───┬───┘
    │   │
    ↓   ↓
┌─────┐┌─────┐
│ VPC ││ RDS │
│ mod ││ mod │
└──┬──┘└──┬──┘
   │      │
   ↓      ↓
 ┌──────────┐
 │ outputs  │
 └──────────┘
Quick Answer
Terraform Cloud enforces governance through Sentinel policies that evaluate plans as code before apply, cost estimation that flags unexpected spend, run tasks that integrate external checks like security scanners, and VCS workflows that trigger plans on pull requests. This shifts enforcement left into the plan phase so teams get fast feedback without needing manual approval for every change.
Detailed Answer
Think of a highway with automated safety systems. Speed cameras (Sentinel policies) automatically flag violations, fuel cost displays (cost estimation) warn drivers before they commit to a route, roadside inspection stations (run tasks) check specific safety requirements, and GPS-guided lanes (VCS workflows) route each vehicle through the correct path. The highway keeps moving because enforcement is automated, not manual.

Terraform Cloud (and its self-hosted counterpart, Terraform Enterprise) provides a collaborative platform where infrastructure changes follow a standardized workflow: code is committed to VCS, a plan is triggered, governance checks run, and apply executes only after all checks pass. Sentinel is HashiCorp's policy-as-code framework that evaluates Terraform plans, state, and configuration using a policy language. Policies can enforce rules like requiring encryption on all S3 buckets, restricting instance types to cost-approved sizes, mandating specific tags on every resource, or preventing deletion of production databases. Policies are organized into policy sets that are applied to specific workspaces or all workspaces in an organization.

Internally, the run pipeline processes stages in order: VCS trigger, terraform plan, cost estimation, Sentinel policy check, run tasks, and terraform apply. Cost estimation parses the plan output and calculates the monthly cost delta using HashiCorp's pricing database, surfacing changes like adding a db.r6g.2xlarge that increases monthly spend by $1,200. Run tasks are webhook-based integrations that pass the plan to external systems (security scanners like Snyk or Prisma Cloud, compliance checkers, or custom approval systems) and wait for a pass/fail response. Each run task can be advisory (warning only) or mandatory (blocking apply). The entire pipeline runs automatically on pull request creation, giving developers feedback in minutes rather than waiting for a manual review.

At production scale, governance design requires balancing safety with velocity. Hard-mandatory Sentinel policies should cover non-negotiable rules like encryption and tagging. Soft-mandatory policies allow overrides with justification for edge cases like temporary large instances for data migration. Advisory policies educate teams about best practices without blocking. Cost thresholds can be expressed as policies over the cost estimate, for example requiring manager approval for changes exceeding a dollar amount. VCS workflows should use speculative plans on pull requests (plan only, no apply) so developers see the impact before merging, and auto-apply on the main branch for environments like dev where speed matters more than manual gates.

The non-obvious gotcha is that Sentinel policies execute after the plan phase, so they cannot prevent Terraform from planning invalid configurations; they can only block the apply. If a Sentinel policy references a resource attribute that does not exist in the plan (because the resource was removed), the policy can fail with a confusing error rather than a clean policy violation. Teams should test Sentinel policies against mock plan data in CI using the Sentinel CLI before deploying them to Terraform Cloud. Another trap is over-engineering run tasks: each run task adds latency to the pipeline, and if the external system is slow or unreliable, it blocks every infrastructure change across the organization.
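The three enforcement levels described above reduce to a small decision rule. The sketch below is a simplified model of that rule, not Terraform Cloud's implementation; the `PolicyResult` type, the `can_apply` function, and the policy names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PolicyResult:
    name: str
    level: str   # "advisory", "soft-mandatory", or "hard-mandatory"
    passed: bool


def can_apply(results: list, override_approved: bool = False) -> bool:
    """Decide whether a run may proceed to apply given its policy results."""
    for result in results:
        if result.passed:
            continue
        if result.level == "hard-mandatory":
            return False  # a failure here can never be overridden
        if result.level == "soft-mandatory" and not override_approved:
            return False  # blocked until an authorized user overrides
        # advisory failures only produce a warning
    return True


if __name__ == "__main__":
    run = [
        PolicyResult("s3-encryption-required", "hard-mandatory", True),
        PolicyResult("instance-size-limit", "soft-mandatory", False),
        PolicyResult("tag-style-guide", "advisory", False),
    ]
    print(can_apply(run))                          # blocked without an override
    print(can_apply(run, override_approved=True))  # proceeds once overridden
```

Modeling the levels this way makes the trade-off visible: every policy promoted to hard-mandatory removes an escape hatch, so that level is best reserved for rules with no legitimate exceptions.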
Code Example
# Sentinel policy: require encryption on all S3 buckets
# policies/s3-encryption-required.sentinel
import "tfplan/v2" as tfplan
# Find all S3 bucket resources being created or updated
s3_buckets = filter tfplan.resource_changes as _, rc {
  # Match only aws_s3_bucket resources with create or update actions
  rc.type is "aws_s3_bucket" and
    rc.mode is "managed" and
    (rc.change.actions contains "create" or rc.change.actions contains "update")
}
# Check that every bucket has a server-side encryption configuration
# Note: this inspects the inline block; with AWS provider v4+ encryption is usually a
# separate aws_s3_bucket_server_side_encryption_configuration resource, so adapt the check
encryption_check = rule {
  all s3_buckets as _, bucket {
    # Verify the server_side_encryption_configuration block is not null after apply
    bucket.change.after.server_side_encryption_configuration is not null
  }
}
# Main rule that must pass for the apply to proceed
main = rule {
  encryption_check
}
# sentinel.hcl — Policy set configuration
# policy "s3-encryption-required" {
# source = "./policies/s3-encryption-required.sentinel"
# enforcement_level = "hard-mandatory" # Cannot be overridden
# }
# Terraform Cloud workspace creation via CLI
# The organization comes from the cloud block in the configuration; VCS connection and
# auto-apply are workspace settings (UI, API, or tfe provider), not flags of this command
# terraform login
# terraform workspace new payments-prod
# .terraform-cloud.auto.tfvars — Workspace variable defaults
# These are set in the Terraform Cloud UI or API for the workspace
# environment = "prod"
# team = "payments"
# cost_threshold = 500
# Run task configuration via API (register a security scanner)
# curl -s -X POST \
# -H "Authorization: Bearer $TFC_TOKEN" \
# -H "Content-Type: application/vnd.api+json" \
# https://app.terraform.io/api/v2/organizations/company/tasks \
# -d '{"data":{"type":"tasks","attributes":{"name":"snyk-iac-scan","url":"https://hooks.snyk.io/terraform-cloud","category":"task","hmac-key":"secret-key"}}}'
◈ Architecture Diagram
┌──────────┐
│ VCS Push │
└────┬─────┘
     ↓
┌──────────┐
│ tf plan  │
└────┬─────┘
     ↓
┌──────────┐
│ Cost Est │
└────┬─────┘
     ↓
┌──────────┐
│ Sentinel │
└────┬─────┘
     ↓
┌──────────┐
│ Run Task │
└────┬─────┘
     ↓
┌──────────┐
│ tf apply │
└──────────┘
Quick Answer
Terraform modules are reusable containers of related resources defined in a directory with its own variables, outputs, and resource blocks. Good module design follows single-responsibility, exposes minimal required variables, uses sensible defaults, and avoids hardcoding environment-specific values.
Detailed Answer
A Terraform module is essentially a directory containing .tf files that encapsulate a logical group of resources. Think of modules like functions in programming: they take inputs (variables), do something (create resources), and return outputs. The root module is your working directory where you run terraform commands, and any module you call from there is a child module. When you write module "payments_vpc" { source = "./modules/vpc" }, Terraform loads that directory as an isolated configuration unit with its own namespace.

Internally, when Terraform processes a module call, it creates a separate resource namespace prefixed with module.payments_vpc. Resources inside the module cannot directly access resources outside it; they communicate only through input variables and output values. This enforced encapsulation is what makes modules safe to reuse. Terraform also supports module sources from Git repositories, the Terraform Registry, S3 buckets, and HTTP URLs, enabling organization-wide module libraries.

Good module design starts with the single-responsibility principle. A VPC module should create a VPC, subnets, route tables, and NAT gateways; it should not also create your RDS database. Each module should represent one logical infrastructure component. Variables should have descriptions, type constraints, and sensible defaults where possible. For example, a VPC module might default to three availability zones and a /16 CIDR block but allow overrides. Outputs should expose the identifiers that downstream modules need (VPC ID, subnet IDs, security group IDs) and nothing more.

One critical design principle is avoiding hardcoded provider configurations inside modules. The module should inherit the provider from the calling module, not declare its own. This allows the same module to be used across multiple AWS accounts or regions by simply changing the provider in the root module. Similarly, avoid hardcoding backend configurations or environment-specific values like account IDs inside modules.

Version pinning is essential for module stability. When sourcing modules from a registry or Git, always pin to a specific version or Git tag. Using version = "~> 2.0" gives you minor and patch updates within 2.x but not breaking major version changes. The version argument applies only to registry sources; Git sources are pinned with ?ref= in the source URL instead. Without version pinning, a terraform init on Monday might pull different module code than the same command on Friday, leading to unpredictable infrastructure changes.

Production-grade modules also include validation blocks for input variables, meaningful error messages, comprehensive README documentation, and example configurations. Teams that invest in a well-designed internal module library see dramatic reductions in infrastructure provisioning time and configuration drift across environments.
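The /16 default and three-availability-zone layout mentioned above can be checked with a few lines of address arithmetic. This is only a sketch of one possible allocation scheme: the /20 subnet size and the `plan_subnets` helper are assumptions for illustration, not what any particular VPC module does.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, az_names: list, new_prefix: int = 20) -> dict:
    """Assign one subnet of size /new_prefix to each availability zone, in order."""
    network = ipaddress.ip_network(vpc_cidr)
    subnets = network.subnets(new_prefix=new_prefix)
    return {az: str(next(subnets)) for az in az_names}


if __name__ == "__main__":
    vpc = ipaddress.ip_network("10.20.0.0/16")
    print(vpc.num_addresses)  # size of the /16 block
    zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
    for az, cidr in plan_subnets("10.20.0.0/16", zones).items():
        print(az, cidr)
```

Deriving subnets from the VPC CIDR rather than listing them by hand is one way a module can offer a sensible default while still letting callers override only the top-level range.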
Code Example
# Root module calling the reusable VPC module for the payments platform
module "payments_vpc" {
  # Source the module from the internal Git repository at a pinned version tag
  source = "git::https://github.com/fintech-infra/terraform-modules.git//vpc?ref=v2.4.1"
  # Name the VPC after the service and environment
  vpc_name = "payments-platform-prod"
  # Use a /16 CIDR block giving 65536 addresses for the payments network
  vpc_cidr = "10.20.0.0/16"
  # Deploy across three availability zones for high availability
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  # Enable NAT gateway for private subnet internet access
  enable_nat_gateway = true
  # Use a single NAT gateway in non-prod to save costs, one per AZ in prod
  single_nat_gateway = false
  # Enable DNS hostnames so RDS instances get resolvable DNS names
  enable_dns_hostnames = true
  # Tags applied to every resource the module creates
  common_tags = {
    Environment = "production"
    Team        = "payments-backend"
    CostCenter  = "CC-4421"
    ManagedBy   = "terraform"
  }
}
# Inside modules/vpc/variables.tf: well-designed module inputs
variable "vpc_name" {
  # Human-readable description used in generated docs and interactive prompts
  description = "Name prefix for all VPC resources"
  # Enforce string type at plan time
  type = string
  # Validate that the name follows the org naming convention
  validation {
    condition     = can(regex("^[a-z][a-z0-9-]+$", var.vpc_name))
    error_message = "VPC name must be lowercase alphanumeric with hyphens."
  }
}
variable "vpc_cidr" {
  # Describe what this CIDR block is used for
  description = "CIDR block for the VPC network range"
  # Enforce string type
  type = string
  # Default to a /16 block if not specified
  default = "10.0.0.0/16"
}
# Inside modules/vpc/outputs.tf: expose only what consumers need
output "vpc_id" {
  # Describe the output for documentation and discoverability
  description = "The ID of the created VPC"
  # Reference the VPC resource's ID attribute
  value = aws_vpc.main.id
}
output "private_subnet_ids" {
  # Consumers use these to place databases and internal services
  description = "List of private subnet IDs across all availability zones"
  # Collect all private subnet IDs into a list
  value = aws_subnet.private[*].id
}
◈ Architecture Diagram
┌─────────────────────────────────────────────────────┐
│              Root Module (Working Dir)              │
│                                                     │
│  ┌───────────────┐      ┌────────────────────┐      │
│  │ main.tf       │      │ variables.tf       │      │
│  │               │      │ env = "prod"       │      │
│  │ module call───┼──┐   │ region = "us-east" │      │
│  └───────────────┘  │   └────────────────────┘      │
└─────────────────────┼───────────────────────────────┘
                      │
          ┌───────────▼───────────┐
          │ Module: payments_vpc  │
          │ (modules/vpc/)        │
          │                       │
          │  ┌─────────────────┐  │
          │  │ variables.tf    │  │
          │  │ (inputs)        │  │
          │  └────────┬────────┘  │
          │           │           │
          │  ┌────────▼────────┐  │
          │  │ main.tf         │  │
          │  │ aws_vpc         │  │
          │  │ aws_subnet      │  │
          │  │ aws_nat_gateway │  │
          │  └────────┬────────┘  │
          │           │           │
          │  ┌────────▼────────┐  │
          │  │ outputs.tf      │  │
          │  │ vpc_id          │  │
          │  │ subnet_ids      │  │
          │  └─────────────────┘  │
          └───────────────────────┘
Quick Answer
Multi-window burn rate alerting fires when the error rate burns through the error budget faster than expected across both a long window (1h) and a short window (5m). This reduces alert noise compared to static thresholds by only alerting when the burn rate is sustained enough to exhaust the budget within the SLO period.
Detailed Answer
Think of a car's fuel gauge. A static threshold alert says 'warn at 25% fuel', but that ignores whether you are on a highway burning fuel fast or parked with the engine off. Multi-window burn rate is like saying 'warn when fuel consumption over the last hour would empty the tank before you reach the next gas station, AND you are still burning fast right now.' This catches real problems while ignoring brief spikes.

SLO-based alerting starts with defining an error budget. If your SLO is 99.9% availability over 30 days, your error budget is 0.1%, about 43 minutes of downtime. The burn rate is how fast you are consuming this budget. A burn rate of 1x means you will exactly exhaust the budget by the end of the period. A burn rate of 14.4x means you will exhaust the 30-day budget in just 2 days.

Multi-window burn rate uses two windows to reduce false positives. The long window (typically 1 hour) detects sustained error rates that threaten the budget. The short window (typically 5 minutes) confirms the problem is still happening right now. Both conditions must be true for the alert to fire. This prevents alerting on brief spikes that self-resolve (they move the short window but barely register in the long window) and on historical errors that have already been fixed (they still show in the long window, but the short window has recovered). Google's SRE Workbook recommends multiple tiers: 14.4x burn rate over 1h/5m and 6x over 6h/30m as pages, plus 1x over 3d/6h as a ticket.

At production scale, teams define recording rules that pre-compute error ratios for each SLI at multiple windows. The error ratio is calculated as rate(http_requests_total{status=~"5.."}[window]) / rate(http_requests_total[window]). Recording rules at 5m, 30m, 1h, and 6h windows avoid expensive queries at alert evaluation time. Grafana dashboards show the remaining error budget as a percentage, making it visual whether the team can ship features or must focus on reliability.

The non-obvious gotcha is that burn rate alerts assume a uniform error distribution, which rarely matches reality. A 5-minute total outage that burns roughly 12% of the monthly budget followed by 29 days of perfect operation is very different from a constant 0.1% error rate. Teams should complement burn rate alerts with absolute threshold alerts for catastrophic failures (error rate > 50% for 1 minute) that would cause immediate user impact regardless of the monthly budget.
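The budget and burn-rate figures quoted above follow from simple arithmetic, sketched here with hypothetical helper names. It assumes a 30-day SLO period, as in the text.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Minutes of full downtime the SLO allows over the period."""
    return (1 - slo) * period_days * 24 * 60


def days_to_exhaustion(burn_rate: float, period_days: int = 30) -> float:
    """How long the budget lasts if the current burn rate is sustained."""
    return period_days / burn_rate


def alert_threshold(slo: float, burn_rate: float) -> float:
    """Error ratio that corresponds to a given burn rate."""
    return burn_rate * (1 - slo)


def budget_consumed(burn_rate: float, window_hours: float, period_days: int = 30) -> float:
    """Fraction of the whole budget spent during one alert window."""
    return burn_rate * window_hours / (period_days * 24)


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))   # budget in minutes
    print(round(days_to_exhaustion(14.4), 2))      # days until the budget is gone
    print(round(alert_threshold(0.999, 14.4), 4))  # error ratio used in the alert
    print(round(budget_consumed(14.4, 1), 3))      # share of budget burned in 1h
```

The last function explains where 14.4 comes from: it is the burn rate at which one hour of errors consumes 2% of a 30-day budget, which is the paging threshold this tier is built around.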
Code Example
# Recording rules for multi-window error ratios
# prometheus-rules.yaml
groups:
  - name: slo-payments-api
    rules:
      # 5-minute error ratio (short window)
      - record: payments_api:error_ratio:5m
        expr: |
          sum(rate(http_requests_total{service="payments-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{service="payments-api"}[5m]))
      # 1-hour error ratio (long window)
      - record: payments_api:error_ratio:1h
        expr: |
          sum(rate(http_requests_total{service="payments-api",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{service="payments-api"}[1h]))
      # Multi-window burn rate alert (14.4x = exhausts 30-day budget in 2 days)
      - alert: PaymentsAPIHighBurnRate
        expr: |
          payments_api:error_ratio:1h > (14.4 * 0.001)
          and
          payments_api:error_ratio:5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "payments-api burning error budget at 14.4x rate"
Quick Answer
Prometheus reuses the newest sample only if it falls within the lookback window, which defaults to 5 minutes. When a target or metric disappears, Prometheus writes a staleness marker so queries stop returning the old value instead of silently carrying it forever.
Detailed Answer
Think of a train station display board. If the 8:10 train reported its location two minutes ago, the board can still show a useful last-known position. If the train has not reported for an hour, showing that old position would mislead passengers. Prometheus has the same problem with metrics: a recent sample is fine to use at query time, but an old sample should eventually disappear so graphs and alerts do not pretend the system is healthy.

PromQL, the Prometheus query language, evaluates instant queries at a single timestamp and range queries at many evenly spaced timestamps. For each evaluation timestamp, Prometheus looks backward for the newest sample inside the lookback window. The default lookback is 5 minutes, and it is configurable. This lets queries work even when scrapes do not land exactly on graph step boundaries. Without this behavior, normal scrape timing jitter would create broken graphs and unreliable aggregations.

Staleness adds another layer. If a target scrape no longer returns a series that previously existed, or if service discovery removes a target entirely, Prometheus can write a staleness marker for that time series. After that marker, instant queries no longer return the old value for that series. This prevents stale readings from being treated as current values in aggregations like sum, avg, or alert expressions. If fresh samples later arrive for the same label set, the series simply reappears.

Production alerting gets subtle here. An alert like `up == 0` catches failed scrapes where the target is still known but unreachable. However, it may not catch a target that vanished from service discovery, because there may be no `up` series left to evaluate. For detecting missing services, `absent()` or inventory-based alerts are usually needed. Engineers also tune scrape_interval, scrape_timeout, evaluation_interval, and the alert `for` duration so brief network hiccups do not page people while true disappearances still get caught quickly.

The experienced gotcha is that a graph can look flat or empty for different reasons. A flat line may mean Prometheus is carrying a recent last sample inside the lookback window, while an empty graph may mean the series went stale, not that the value became zero. Exporters that attach their own timestamps can behave differently and may keep the last value visible until lookback expires. A common band-aid is using `or vector(0)` everywhere, which makes dashboards look tidy but hides missing telemetry. Senior engineers learn to distinguish between zero, missing, stale, and failed-scrape states explicitly rather than papering over the differences.
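The interaction between the lookback window and staleness markers can be modeled in a few lines. This is a deliberately simplified sketch, not Prometheus's implementation: the `STALE` sentinel stands in for the real staleness marker, and the `instant_value` function and sample layout are hypothetical.

```python
from typing import Optional

LOOKBACK_SECONDS = 300.0  # the 5-minute default described in the text
STALE = object()          # stand-in for a staleness marker


def instant_value(samples: list, eval_time: float,
                  lookback: float = LOOKBACK_SECONDS) -> Optional[float]:
    """Return the value an instant query would see, or None if nothing is returned.

    samples is a list of (timestamp, value) pairs sorted by timestamp.
    """
    newest = None
    for ts, value in samples:
        if ts <= eval_time:
            newest = (ts, value)
    if newest is None:
        return None       # no sample exists yet at this time
    ts, value = newest
    if value is STALE:
        return None       # the series was marked stale: stop returning the old value
    if eval_time - ts > lookback:
        return None       # the newest sample is too old to reuse
    return value


if __name__ == "__main__":
    scraped = [(0, 1.0), (15, 1.0), (30, STALE)]
    print(instant_value(scraped, 20))       # recent sample is reused
    print(instant_value(scraped, 40))       # stale marker hides the series
    print(instant_value([(0, 5.0)], 400))   # lookback expired without a marker
```

The two `None` cases at the end are the ones that look identical on a dashboard, which is why the text recommends treating missing, stale, and zero as separate states.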
Code Example
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=up{job="payments-api"}' # Checks whether Prometheus still sees the payments-api target and whether the latest scrape succeeded.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=absent(up{job="payments-api"})' # Detects the case where the target disappeared from service discovery and no up series exists.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=max_over_time(up{job="payments-api"}[10m])' # Shows whether the target was present at any point during the last 10 minutes.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=time() - timestamp(up{job="payments-api"})' # Measures how old the newest up sample is for the target.
promtool check rules /etc/prometheus/rules/payments-availability.yml # Validates alert rules before reloading them into Prometheus.
◈ Architecture Diagram
┌──────────┐
│  Target  │
└────┬─────┘
     ↓ scrape
┌──────────┐
│  Sample  │
└────┬─────┘
     ↓ query
┌──────────┐
│ Lookback │
└────┬─────┘
     ↓
┌──────────┐ ┌──────────┐
│ Present  │ │  Stale   │
└────┬─────┘ └────┬─────┘
     ↓            ↓
┌──────────┐ ┌──────────┐
│  Alert   │ │  Absent  │
└──────────┘ └──────────┘
Quick Answer
Use histograms when you need to aggregate percentiles across many instances and tie them to SLOs. Classic histograms need explicit bucket boundaries, native histograms reduce that manual work, and summaries calculate percentiles inside the app but cannot be safely combined across replicas.
Detailed Answer
Think of measuring checkout wait times by placing customers into labeled bins: under 100 ms, under 300 ms, under 1 second, and so on. If the bins are chosen around the thresholds the business actually cares about, the data is useful. If every bin is too wide, too narrow, or missing the SLO boundary, the final percentile looks precise but answers the wrong question. Prometheus histograms are that binning system for measurements like request duration. A classic Prometheus histogram exposes cumulative bucket counters using the `le` label, which stands for less than or equal, plus `_sum` and `_count` series. Prometheus calculates percentiles using the `histogram_quantile()` function over rates of those buckets. The big advantage of this design is that you can aggregate across pods, nodes, clusters, or jobs before calculating the percentile, which is why histograms are the go-to for distributed services. The cost is extra time series: each bucket boundary creates another series for every label combination. Native histograms change the storage model by representing many bucket spans more compactly and letting Prometheus handle histogram samples directly. They reduce some of the pain of choosing bucket boundaries manually and support more flexible percentile exploration. However, they require compatible Prometheus settings, client libraries, remote write backends, and query paths, so you need to check the full chain before adopting them. Summaries are a different animal: they compute selected quantiles inside each application process. That can be useful for a single process, but averaging p95 values across replicas is statistically wrong because each process saw a different number and shape of requests. The query path matters for getting correct results. For classic histograms, you typically apply `rate()` to `_bucket` counters, aggregate with `sum by (le, service)` or similar, then call `histogram_quantile()`. 
The `le` label must survive until the quantile function runs because it represents the bucket boundary. For SLO checks like seeing what fraction of requests finish under 300 ms, having an exact bucket at that boundary makes the calculation simple and reliable. For Apdex-style scores, you need buckets at both the satisfied and tolerated thresholds. The gotcha is that histogram percentiles are estimates, and how good the estimate is depends entirely on where you place the buckets. A p99 alert built on buckets of 100 ms, 1 second, and 10 seconds cannot accurately tell the difference between 1.2 seconds and 8 seconds. Another common mistake is averaging per-pod p95 values in Grafana, which gives equal weight to quiet pods and busy pods. Experienced engineers pick bucket boundaries around the SLO thresholds users care about, keep labels low-cardinality, aggregate buckets before computing quantiles, and verify that the remote storage path preserves the histogram type they depend on.
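To make the bucket-placement gotcha concrete, here is a minimal Python sketch of the linear-interpolation idea behind classic-histogram quantiles. It is an illustration, not Prometheus's actual implementation; the function name mirrors the PromQL function only for readability, and the bucket counts are made-up sample data.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    buckets must be sorted by upper bound and end with (math.inf, total),
    like the `le` series of a classic histogram.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lands in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical data: same 1000 requests, two different bucket layouts.
coarse = [(0.1, 900), (1.0, 980), (10.0, 1000), (math.inf, 1000)]
fine = [(0.1, 900), (1.0, 980), (1.5, 1000), (10.0, 1000), (math.inf, 1000)]

p99_coarse = histogram_quantile(0.99, coarse)
p99_fine = histogram_quantile(0.99, fine)
print(p99_coarse, p99_fine)
```

With the coarse layout the p99 estimate falls halfway through the 1 s to 10 s bucket, even though every slow request actually finished within 1.5 s. Adding a single boundary at 1.5 s pulls the estimate back to where the data really is.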
Code Example
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket{job="payments-api"}[5m])))' # Computes fleet-wide p95 latency from classic histogram buckets after aggregating by bucket boundary.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_request_duration_seconds_bucket{job="payments-api",le="0.3"}[5m])) / sum(rate(http_request_duration_seconds_count{job="payments-api"}[5m]))' # Calculates the fraction of payments-api requests completed within the 300 ms SLO bucket.
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=sum(rate(http_request_duration_seconds_sum{job="payments-api"}[5m])) / sum(rate(http_request_duration_seconds_count{job="payments-api"}[5m]))' # Calculates average latency from histogram sum and count without using quantile math.
promtool check rules /etc/prometheus/rules/payments-latency-slo.yml # Validates histogram-based recording and alerting rules before deployment.
◈ Architecture Diagram
┌──────────┐
│ Request  │
└────┬─────┘
     ↓
┌──────────┐
│ Buckets  │
└────┬─────┘
     ↓
┌──────────┐
│   Rate   │
└────┬─────┘
     ↓
┌──────────┐
│ Sum by le│
└────┬─────┘
     ↓
┌──────────┐
│   p95    │
└──────────┘
Quick Answer
Remote write tails the Prometheus WAL into per-destination queues, shards the work across parallel senders, batches samples, and retries on failure. If queues fill up, Prometheus stops reading from the WAL for that destination. If the receiver stays down too long, unsent samples can be lost as WAL data gets compacted away.
Detailed Answer
Think of a shipping dock that sends packages from a factory to a central warehouse. The factory keeps making boxes, workers load them into several truck lanes, and sometimes the warehouse slows down. If the lanes fill up, boxes pile up at the dock. If the warehouse stays closed for hours, the factory has to choose between halting intake, using more dock space, or eventually throwing away boxes it can no longer hold. Prometheus remote write is that shipping dock for metrics. Prometheus first ingests samples locally through its normal scrape and WAL path. A remote write component then reads from the WAL, maps internal series IDs to their label sets, queues samples, and sends compressed HTTP requests to the configured remote endpoint. That endpoint might be Grafana Mimir, Thanos Receive, Cortex, VictoriaMetrics, or a managed cloud service. Remote write is not a magic way to backfill historical data; it is mainly a streaming replication path from the local ingestion flow. Backpressure shows up when the remote endpoint is slow, returning errors, rate-limiting, or totally unreachable. Prometheus uses shards, which are parallel sending workers, to improve throughput. Each shard has an in-memory queue with a capacity limit and a maximum batch size. Failed requests get retried with exponential backoff. Prometheus can automatically adjust the shard count based on the incoming sample rate and how long sends are taking. But once queues fill up, reading from the WAL for that remote write target is blocked, and pending samples start piling up. In production, the key metrics to watch are pending samples, failed samples, retried samples, send batch duration, current shard count, and queue capacity. Tuning usually starts with the receiver side: confirm it is healthy, not throttling, and not rejecting samples due to tenant limits or bad labels. Then tune Prometheus settings like `max_samples_per_send`, `capacity`, `max_shards`, and backoff values. 
Capacity should generally be several times the batch size, but setting it too high increases Prometheus memory usage. Write relabeling can drop expensive or unnecessary samples before they even leave Prometheus. The gotcha is that cranking every knob up can make the outage worse. More shards can overwhelm a backend that is trying to recover. More queue capacity can cause Prometheus memory pressure, especially during high series churn because remote write caches series labels. Another gotcha involves the two-hour WAL window: if remote write stays blocked longer than the WAL can hold unsent data, samples get lost when the WAL is compacted. Senior engineers treat remote write tuning as end-to-end flow control, not just a matter of making the queue bigger.
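The trade-offs above come down to simple arithmetic. The sketch below assumes a retry delay that doubles from `min_backoff` up to `max_backoff`, and that `capacity` is a per-shard buffer; the helper names are hypothetical and the numbers match the example configuration that follows.

```python
def backoff_schedule(min_backoff, max_backoff, retries):
    """Seconds waited before each retry, doubling until the cap is reached."""
    waits = []
    wait = min_backoff
    for _ in range(retries):
        waits.append(wait)
        wait = min(wait * 2, max_backoff)
    return waits

def max_buffered_samples(capacity, max_shards):
    """Upper bound on samples held in memory when every shard queue is full."""
    return capacity * max_shards

def capacity_ratio(capacity, max_samples_per_send):
    """How many full batches fit in one shard queue."""
    return capacity / max_samples_per_send

schedule = backoff_schedule(min_backoff=1, max_backoff=30, retries=7)
worst_case = max_buffered_samples(capacity=30000, max_shards=10)
ratio = capacity_ratio(capacity=30000, max_samples_per_send=5000)
print(schedule, worst_case, ratio)
```

Doubling `max_shards` or `capacity` doubles the worst-case buffer, which is why raising both at once during an outage can turn a slow receiver into a Prometheus memory problem.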
Code Example
remote_write: # Sends locally ingested samples to a central backend such as Mimir or Thanos Receive.
  - url: https://mimir-write.monitoring.svc/api/v1/push # Points Prometheus at the remote write receiver endpoint.
    name: payments-mimir # Gives this remote write queue a stable name in metrics and logs.
    remote_timeout: 30s # Bounds each send request so slow receivers do not hang workers forever.
    queue_config: # Controls memory queues and parallel send workers for this remote write target.
      max_samples_per_send: 5000 # Sends larger batches to improve throughput when the receiver supports them.
      capacity: 30000 # Keeps per-shard capacity about six times the batch size to absorb short slowdowns.
      max_shards: 10 # Caps parallelism so Prometheus does not overload the central backend during recovery.
      min_shards: 2 # Starts with two workers so the queue can drain promptly after restart.
      min_backoff: 1s # Waits at least one second before retrying a failed send.
      max_backoff: 30s # Prevents retry storms by backing off repeated failures.
    write_relabel_configs: # Drops samples before remote write to reduce bandwidth and receiver load.
      - source_labels: [__name__] # Selects samples by metric name before deciding whether to send them.
        regex: 'go_.*' # Matches noisy runtime metrics that the central backend does not need.
        action: drop # Drops matching samples from remote write while keeping local scrape data.
◈ Architecture Diagram
┌──────────┐
│   WAL    │
└────┬─────┘
     ↓
┌──────────┐
│  Queue   │
└────┬─────┘
     ↓
┌──────────┐
│  Shards  │
└────┬─────┘
     ↓
┌──────────┐
│ Receiver │
└────┬─────┘
     ↓
┌──────────┐
│  Object  │
└──────────┘
Quick Answer
Alertmanager groups related alerts, deduplicates notifications, routes them to the right receiver, silences planned noise, and inhibits lower-level alerts when a parent alert explains them. In HA mode, Prometheus should send alerts directly to every Alertmanager peer, not through a load balancer.
Detailed Answer
Think of a hospital emergency department during a city-wide power failure. Thousands of alarms pour in from buildings, traffic lights, and elevators. Operators do not want a separate phone call for each alarm. They want one grouped incident per affected area, with enough detail to know which buildings still need help. Alertmanager is that dispatch layer for Prometheus alerts. Prometheus evaluates alerting rules and sends firing or resolved alerts to Alertmanager over HTTP. Alertmanager then groups alerts by chosen labels, routes each group through a routing tree to the right receiver (Slack, PagerDuty, email), deduplicates repeated notifications, applies silences for planned maintenance, and applies inhibitions when one alert makes another redundant. For example, if a ClusterDown alert is firing, an inhibition rule can suppress thousands of pod-level alerts from that same cluster because they are all symptoms of the same root cause. Grouping is label-driven. The group_by setting picks which labels define a notification group. group_wait delays the first notification briefly so related alerts can arrive together. group_interval controls how often new alerts get added to an existing group. repeat_interval controls how frequently unresolved alerts are re-sent. Inhibition rules compare a source alert against target alerts using matchers and equality labels. Silences use matchers and time windows, and are usually created through the Alertmanager UI or API during maintenance windows. For high availability, multiple Alertmanager instances form a cluster and share notification state through a gossip protocol. Prometheus should be configured with all Alertmanager peers listed as targets. The Prometheus docs warn against putting a load balancer between Prometheus and Alertmanager because each Prometheus instance needs to deliver alerts to the full cluster so deduplication and state replication work correctly. 
Teams also set external labels like cluster, region, and replica carefully so Alertmanager can tell independent environments apart while still deduplicating HA Prometheus replicas. The gotcha is that label design can either flood your team or hide a real outage. If group_by includes pod, every single pod failure during a deployment becomes a separate page. If it only groups by alertname, unrelated production and staging incidents might collapse into one notification. Inhibition can be dangerous too -- if the source alert is too broad or fires too easily, it can silence real alerts. Senior engineers test alert routes with sample payloads, keep grouping labels tied to ownership and blast radius, and regularly review active silences to make sure planned maintenance windows have not turned into black holes for real incidents.
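The effect of the grouping labels is easy to see with a small Python sketch. It only mimics the idea of keying alerts by their `group_by` label values; the function and the sample alerts are hypothetical, not Alertmanager code.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the chosen grouping labels."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)

# Five crashing pods in production plus one in staging (made-up data).
alerts = [
    {"alertname": "PodCrashLooping", "cluster": "prod-eu",
     "namespace": "payments", "pod": f"payments-api-{i}"}
    for i in range(5)
] + [
    {"alertname": "PodCrashLooping", "cluster": "staging",
     "namespace": "payments", "pod": "payments-api-0"},
]

by_blast_radius = group_alerts(alerts, ["cluster", "namespace", "alertname"])
by_pod = group_alerts(alerts, ["cluster", "namespace", "alertname", "pod"])
by_name_only = group_alerts(alerts, ["alertname"])
print(len(by_blast_radius), len(by_pod), len(by_name_only))
```

Grouping by cluster, namespace, and alert name yields one notification per environment. Adding `pod` produces one per pod, and grouping by `alertname` alone merges production and staging into a single notification.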
Code Example
# prometheus.yml
alerting: # Configures where Prometheus sends evaluated alerts.
  alertmanagers: # Lists Alertmanager targets for alert delivery.
    - static_configs: # Uses explicit peer targets instead of a load-balanced single endpoint.
        - targets: ['alertmanager-0:9093','alertmanager-1:9093','alertmanager-2:9093'] # Sends alerts directly to every HA Alertmanager peer.
# alertmanager.yml (separate file, loaded by Alertmanager rather than Prometheus)
route: # Defines the root Alertmanager routing tree.
  receiver: sre-pager # Sends unmatched production alerts to the SRE paging receiver.
  group_by: ['cluster','namespace','alertname'] # Groups by blast radius without grouping unrelated clusters together.
  group_wait: 30s # Waits briefly so related alerts from the same incident can arrive together.
  group_interval: 5m # Waits at least 5m before notifying about new alerts added to an existing group.
  repeat_interval: 4h # Prevents repeated pages for the same unresolved alert group.
inhibit_rules: # Suppresses noisy child alerts when a parent outage alert is already firing.
  - source_matchers: ['alertname="ClusterDown"'] # Uses the cluster-level outage alert as the inhibition source.
    target_matchers: ['severity="warning"'] # Suppresses lower-severity warning alerts during the parent outage.
    equal: ['cluster'] # Applies inhibition only inside the same cluster label value.
◈ Architecture Diagram
┌──────────┐
│  Rules   │
└────┬─────┘
     ↓
┌──────────┐
│  Alerts  │
└────┬─────┘
     ↓
┌──────────┐
│  Group   │
└────┬─────┘
     ↓
┌──────────┐     ┌──────────┐
│ Inhibit  │←────│ Silence  │
└────┬─────┘     └──────────┘
     ↓
┌──────────┐
│  Route   │
└────┬─────┘
     ↓
┌──────────┐
│  Pager   │
└──────────┘
Quick Answer
Store dashboard JSON and alert rule YAML in Git. Use Grafana provisioning, Grafonnet (a Jsonnet library), or Terraform's Grafana provider to define dashboards as code. Changes go through PR review, CI validates syntax, and CD applies them automatically. Updating 10 dashboards means changing one template and pushing a single commit.
Detailed Answer
Think of it like managing a chain of restaurants where every location has to serve the same menu. Instead of calling each manager and dictating changes over the phone, which is like clicking around in Grafana's UI, you update the master menu in a shared drive, the managers review it, and an automated system prints and ships the new menus to all locations at once. The GitOps workflow for Grafana has three main approaches, from simple to powerful. The simplest is Grafana's built-in provisioning: you put dashboard JSON files and alert rule YAML files in a directory that Grafana watches, usually mounted via a ConfigMap in Kubernetes. When the files change, Grafana reloads them. You store these files in Git, and your CI/CD pipeline updates the ConfigMap every time a change merges to main. The second approach uses Grafonnet, a Jsonnet library for generating Grafana dashboard JSON programmatically. Instead of writing raw 500-line JSON files by hand, you write concise Jsonnet code that generates them. This is where updating 10 dashboards at once becomes easy: if all 10 share a common template, say a service dashboard with CPU, memory, error rate, and latency panels, you define the template once and pass in parameters per service. Changing the template changes all 10 dashboards in one commit. Jsonnet compiles down to JSON, which then gets provisioned into Grafana. The third approach uses Terraform with the Grafana provider. You define dashboards, folders, alert rules, and notification channels as Terraform resources. The CI pipeline runs `terraform plan` on pull requests to show what would change and `terraform apply` on merge. This gives you state management, drift detection, and the full Terraform workflow. For large organizations managing hundreds of dashboards across multiple Grafana instances, this is the most maintainable path. For alerts, Grafana's alerting rules and notification policies can also be defined in YAML and provisioned alongside dashboards. 
The entire alerting chain -- rules, routing policies, contact points, and message templates -- lives in Git, version-controlled and reviewable. The day-to-day workflow looks like this: a developer creates a branch, modifies dashboard Jsonnet or Terraform files, and opens a pull request. CI runs syntax checks such as `jsonnet-lint` or `terraform validate` and optionally renders a preview. A reviewer approves, the PR merges to main, and the CD pipeline applies the changes to Grafana. The big win is that every change is code-reviewed, version-controlled, and reversible with a simple `git revert`.
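The "CI validates syntax" step can be as small as a script that parses every generated dashboard file. The following Python sketch is one hypothetical way to do it: the function name, the required fields, and the sample files are assumptions for illustration, not a Grafana tool.

```python
import json

def validate_dashboards(files):
    """files maps filename -> JSON text. Returns a list of error strings."""
    errors = []
    seen_uids = {}
    for filename, text in sorted(files.items()):
        try:
            dashboard = json.loads(text)
        except json.JSONDecodeError as exc:
            errors.append(f"{filename}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            errors.append(f"{filename}: top level must be an object")
            continue
        for field in ("uid", "title"):
            if not dashboard.get(field):
                errors.append(f"{filename}: missing {field}")
        uid = dashboard.get("uid")
        if uid and uid in seen_uids:
            # Two files with one uid would overwrite each other when applied.
            errors.append(f"{filename}: uid {uid!r} already used by {seen_uids[uid]}")
        elif uid:
            seen_uids[uid] = filename
    return errors

# Made-up inputs: one clean set and one with three distinct problems.
good_files = {"payments-api.json": '{"uid": "payments-api-svc", "title": "payments-api"}'}
bad_files = {
    "a.json": '{"uid": "x", "title": "A"}',
    "b.json": '{"uid": "x", "title": "B"}',
    "c.json": "{not json",
    "d.json": '{"title": "D"}',
}
good_errors = validate_dashboards(good_files)
bad_errors = validate_dashboards(bad_files)
print(good_errors, bad_errors)
```

A pipeline would read the files from the build output directory and fail the job when the returned list is non-empty. Checking for duplicate `uid` values matters because templated dashboards are easy to copy without changing the identifier.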
Code Example
# ─── Approach 1: Grafana Provisioning via ConfigMap ───
# dashboards.yaml (Grafana provisioning config)
apiVersion: 1
providers:
  - name: default
    type: file
    options:
      path: /var/lib/grafana/dashboards # Watch this directory
      foldersFromFilesStructure: true
# Mount dashboards from ConfigMap in Kubernetes
# kubectl create configmap grafana-dashboards \
#   --from-file=dashboards/ -n monitoring
# ─── Approach 2: Grafonnet (Jsonnet) ───
# service-dashboard.jsonnet: one template, many dashboards
# Uses the current Grafonnet library (installed into vendor/ with jsonnet-bundler).
local g = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
# Data source name is an assumption; change it to match your Grafana setup.
local datasource = 'prometheus';
# Template function, reused for all services
local serviceDashboard(name, namespace) =
  g.dashboard.new(name + ' Service Dashboard')
  + g.dashboard.withUid(name + '-svc')
  + g.dashboard.withPanels([
    # CPU panel
    g.panel.timeSeries.new(name + ' CPU')
    + g.panel.timeSeries.queryOptions.withTargets([
      g.query.prometheus.new(
        datasource,
        'sum(rate(container_cpu_usage_seconds_total{namespace="' + namespace + '", pod=~"' + name + '.*"}[5m]))'
      ),
    ]),
    # Error rate panel (assumes the job label matches the service name)
    g.panel.timeSeries.new(name + ' Error Rate')
    + g.panel.timeSeries.queryOptions.withTargets([
      g.query.prometheus.new(
        datasource,
        'sum(rate(http_requests_total{namespace="' + namespace + '", job="' + name + '", status=~"5.."}[5m]))'
      ),
    ]),
  ]);
# Generate 10 dashboards from one template
{
  'payments-api.json': serviceDashboard('payments-api', 'production'),
  'checkout-svc.json': serviceDashboard('checkout-svc', 'production'),
  'user-auth.json': serviceDashboard('user-auth', 'production'),
  # ... 7 more services
}
# Build: jsonnet -J vendor/ -m output/ service-dashboard.jsonnet
# ─── Approach 3: Terraform ───
# Folder referenced by the dashboard below (title is an illustrative value)
resource "grafana_folder" "production" {
  title = "Production"
}
resource "grafana_dashboard" "payments" {
  config_json = file("dashboards/payments-api.json")
  folder      = grafana_folder.production.id
}
# CI Pipeline (.github/workflows/grafana.yml)
# on PR: terraform plan → post diff as comment
# on merge: terraform apply → dashboards updated
◈ Architecture Diagram
GitOps Workflow for Grafana:
┌──────────┐     PR     ┌──────────┐
│Developer │───────────►│   Git    │
│          │            │  (main)  │
│  edit    │   review   │          │
│ .jsonnet │◄───────────│ CI runs: │
│  or .tf  │   approve  │   lint   │
└──────────┘            │   plan   │
                        └────┬─────┘
                             │ merge
                             ▼
                    ┌─────────────────┐
                    │   CD Pipeline   │
                    │                 │
                    │  jsonnet build  │
                    │       OR        │
                    │ terraform apply │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │     Grafana     │
                    │                 │
                    │  10 dashboards  │
                    │  updated from   │
                    │   1 template    │
                    └─────────────────┘
Template → 10 dashboards:
┌──────────────┐
│ svc-template │
│   .jsonnet   │──► payments-api.json
│              │──► checkout-svc.json
│              │──► user-auth.json
│              │──► ... 7 more
│  1 change =  │
│  10 updates  │
└──────────────┘
Quick Answer
Symptom-based alerting fires on things users actually feel, like high error rates, slow responses, or SLO budget burn, instead of internal causes like high CPU or disk at 80%. It cuts alert noise dramatically because many internal causes map to just a few user-facing symptoms. You implement it with SLO-based burn rate alerts in Prometheus using multi-window, multi-burn-rate rules.
Detailed Answer
Think of it like a car dashboard. Cause-based alerting would mean separate warning lights for every internal part: fuel injector pressure, alternator voltage, coolant thermostat position, oxygen sensor reading. You would have 200 lights and no idea which ones matter. Symptom-based alerting gives you one light that says 'engine temperature high,' and the mechanic investigates the cause from there. Traditional monitoring creates alerts for every possible internal state: CPU above 80%, disk above 85%, memory above 90%, Pod restarts above 3, queue depth above 1000. This leads to massive alert fatigue. A team with 50 microservices might have 500-plus alert rules, most of which fire for brief spikes that fix themselves. Engineers start ignoring alerts, and when a real outage happens, the critical signal is buried in noise. Symptom-based alerting flips this around. You alert on what users experience: the error rate is burning through the SLO budget faster than sustainable, latency has crossed the SLO target, or availability has dropped below the threshold. These are called SLI-based alerts, where SLI stands for Service Level Indicator. If CPU is at 95% but the error rate is 0% and latency is normal, there is no user impact, so no alert is needed. If CPU is at 40% but the error rate is 5%, users are hurting, so you alert right away. The best way to implement this is the multi-window, multi-burn-rate approach from Google's Site Reliability Workbook. You define an SLO such as 99.9% availability over 30 days, which gives you an error budget of 43.2 minutes of allowed downtime. Then you create burn rate alerts. A fast-burn alert fires when the error rate is consuming budget at 14.4 times the sustainable rate, meaning 2% of the monthly budget is spent every hour and the entire budget would be gone in about 50 hours. This catches acute incidents. A slow-burn alert fires at 1 times the sustainable rate held over 3 days, which consumes 10% of the budget and catches gradual degradation.
Each alert uses two time windows, a short one like 5 minutes and a long one like 1 hour, so a single brief spike does not trigger a false alarm. In Prometheus, this translates to recording rules that calculate error ratios over multiple windows, plus alert rules that compare burn rates against thresholds. Grafana displays an SLO dashboard showing remaining error budget, burn rate trends, and alert status. The result: instead of 500 noisy alerts, you might have 10 to 20 SLO-based alerts across all services, each one actionable and tied to real user impact.
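The burn-rate numbers are plain arithmetic, sketched below in Python so they can be re-derived for a different SLO or window. The function names are hypothetical; the formulas assume a 30-day SLO window and that a burn rate of 1 spends the budget exactly over that window.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed 'bad' minutes in the SLO window."""
    return (1 - slo) * window_days * 24 * 60

def hours_to_exhaustion(burn_rate, window_days=30):
    """How long the whole budget lasts at a constant burn rate."""
    return window_days * 24 / burn_rate

def budget_consumed(burn_rate, alert_window_hours, window_days=30):
    """Fraction of the budget spent if the burn rate holds for the alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)

def error_ratio_threshold(burn_rate, slo):
    """Error ratio that corresponds to a given burn rate."""
    return burn_rate * (1 - slo)

budget = error_budget_minutes(0.999)
fast_hours = hours_to_exhaustion(14.4)
fast_spent = budget_consumed(14.4, alert_window_hours=1)
slow_spent = budget_consumed(1, alert_window_hours=72)
fast_threshold = error_ratio_threshold(14.4, 0.999)
print(budget, fast_hours, fast_spent, slow_spent, fast_threshold)
```

At 14.4x, a 30-day budget lasts 720 / 14.4 = 50 hours, and one hour at that rate spends 2% of it. The threshold of 14.4 x 0.001 = 0.0144 is the value an alert expression compares the error ratio against.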
Code Example
# ─── SLO Definition ───
# Service: payments-api
# SLO: 99.9% availability (error budget: 0.1% or 43.2 min/month)
# ─── Recording Rules (prometheus-rules.yaml) ───
groups:
  - name: payments-slo
    rules:
      # Error ratio over different windows
      - record: payments:error_ratio:5m
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payments-api"}[5m]))
      - record: payments:error_ratio:1h
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="payments-api"}[1h]))
      - record: payments:error_ratio:6h
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="payments-api"}[6h]))
      # 3-day window used by the slow-burn alert below
      - record: payments:error_ratio:3d
        expr: |
          sum(rate(http_requests_total{job="payments-api",status=~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="payments-api"}[3d]))
      # ─── Alert Rules (burn rate) ───
      # Fast burn: 14.4x budget consumption → page immediately
      - alert: PaymentsSLOFastBurn
        expr: |
          payments:error_ratio:5m > (14.4 * 0.001)
          and
          payments:error_ratio:1h > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Payments API burning error budget 14x too fast"
          description: "At this rate, the 30-day budget is exhausted in about 2 days"
      # Slow burn: 1x sustained → ticket (not page)
      - alert: PaymentsSLOSlowBurn
        expr: |
          payments:error_ratio:6h > (1 * 0.001)
          and
          payments:error_ratio:3d > (1 * 0.001)
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Payments API slowly burning error budget"
          description: "Gradual degradation; investigate this week"
◈ Architecture Diagram
Cause-Based (noisy):          Symptom-Based (actionable):
┌────────────────────┐        ┌────────────────────┐
│ CPU > 80%     PAGE │        │                    │
│ Disk > 85%    PAGE │        │ Error rate > SLO   │
│ Memory > 90%  PAGE │        │ burn rate?         │
│ Restarts > 3  PAGE │        │                    │
│ Queue > 1000  PAGE │        │ YES → PAGE         │
│ Latency spike PAGE │        │ NO  → silence      │
└────────────────────┘        └────────────────────┘
500+ alerts, most noise       10-20 alerts, all real
Multi-Window Burn Rate:
Error Budget: 43.2 min/month (99.9% SLO)
┌─── Fast Burn ──────────────────────┐
│ 14.4x burn rate                    │
│ 5min window AND 1hr window         │
│ → exhausts budget in ~2 days       │
│ → PAGE immediately                 │
└────────────────────────────────────┘
┌─── Slow Burn ──────────────────────┐
│ 1x burn rate                       │
│ 6hr window AND 3day window         │
│ → exhausts budget in 30 days       │
│ → ticket, investigate this week    │
└────────────────────────────────────┘
Quick Answer
Recording rules pre-compute expensive PromQL queries and save the results as new time series, making dashboards load faster. Alerting rules check PromQL conditions at regular intervals and fire alerts to Alertmanager when conditions stay true for a set duration.
Detailed Answer
Think of recording rules like a restaurant that preps ingredients before the dinner rush. Instead of chopping vegetables from scratch for every order, the kitchen pre-chops during quiet hours. Recording rules pre-compute expensive PromQL queries on a schedule so dashboards load instantly. Alerting rules are like a smoke detector: they continuously check a condition and sound the alarm when something crosses a threshold for long enough to be a real problem. Both types of rules are defined in YAML files and loaded by Prometheus through the rule_files config. They are organized into rule groups, where each group has a name and an optional evaluation interval. Recording rules have a record field (the name of the new metric to create) and an expr field (the PromQL expression to evaluate). The naming convention follows the pattern level:metric:operations -- for example, namespace:http_requests_total:rate5m tells you the aggregation level, the base metric, and what operation was applied. Alerting rules have an alert field (the alert name), an expr field, an optional for duration, labels to attach, and annotations for human-readable descriptions. Under the hood, Prometheus evaluates rules within each group sequentially but can run multiple groups in parallel. The evaluation interval defaults to the global setting but can be overridden per group. For recording rules, each evaluation writes a new sample to the TSDB with the current timestamp. For alerting rules, the evaluation produces one of three states: inactive (the expression returned nothing), pending (the expression matched but the for duration has not passed yet), or firing (the expression has been true for at least the for duration). When an alert hits firing state, Prometheus sends it to all configured Alertmanagers. In production, recording rules are essential for scaling dashboards. 
Without them, 50 engineers opening the same Grafana dashboard during an incident would each trigger the same expensive aggregation 50 times per refresh. Recording rules compute it once and store the result. A common pattern is building a pyramid: raw metrics get aggregated into per-service rates, then those rates get aggregated into per-team totals. For alerting, the for clause is critical -- it prevents false alarms from momentary spikes. A for: 5m clause means the condition must hold at every evaluation for 5 minutes before the alert fires. A key gotcha with recording rules is evaluation order between dependent rules. If rule A depends on the output of rule B, both should be in the same group with B listed first, because rules within a group run sequentially. Across groups, evaluation order is not guaranteed, so A may read a stale value of B. For alerting rules, a common mistake is leaving out the for clause entirely, which causes alerts to fire on every brief spike. Another pitfall is hardcoding values in annotations instead of using template variables. Always include {{ $labels.instance }} and {{ $value }} in your annotation templates so on-call engineers can immediately see which target is affected and how bad it is.
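The inactive, pending, and firing states can be traced with a short Python sketch. It is a simplified model of the behavior described above, not Prometheus source code: `alert_states` is a hypothetical helper that takes one boolean per evaluation and assumes a fixed evaluation interval.

```python
def alert_states(matches, for_seconds, interval_seconds):
    """Return the alert state after each evaluation.

    matches holds one boolean per evaluation: did the expression return a result?
    """
    states = []
    active_since = None
    for i, matched in enumerate(matches):
        now = i * interval_seconds
        if not matched:
            # Any evaluation without a result resets the for-timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states

# for: 5m with a 1-minute evaluation interval.
sustained = alert_states([True] * 7, for_seconds=300, interval_seconds=60)
blip = alert_states([True, True, False, True], for_seconds=300, interval_seconds=60)
no_for = alert_states([True, False], for_seconds=0, interval_seconds=60)
print(sustained, blip, no_for)
```

A sustained problem stays pending for five evaluations and then fires, while a two-minute blip never leaves pending. With no for duration, the very first matching evaluation fires, which is the "alerts on every brief spike" failure mode.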
Code Example
# prometheus-rules.yml - Recording and Alerting Rules
# Loaded via: rule_files: ['prometheus-rules.yml'] in prometheus.yml
groups:
  # Recording rules for payments-api performance metrics
  - name: payments_api_recording_rules # Group name for organization
    interval: 30s # Evaluate every 30 seconds
    rules:
      # Pre-compute per-service request rate
      - record: service:http_requests_total:rate5m # New time series name (level:metric:operation)
        expr: | # PromQL expression to evaluate
          sum by (service, environment) (
            rate(http_requests_total[5m]) # Rate of requests over 5 minutes
          )
        labels:
          aggregated_by: "recording_rule" # Custom label to identify pre-computed metrics
      # Pre-compute error ratio (0 to 1), keeping environment for alert annotations
      - record: service:http_error_rate:ratio_rate5m # Error ratio as a recording rule
        expr: | # Avoids expensive division in dashboards
          sum by (service, environment) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (service, environment) (rate(http_requests_total[5m]))
      # Pre-compute p99 latency per service
      - record: service:http_request_duration:p99_5m # 99th percentile latency
        expr: | # histogram_quantile is expensive at query time
          histogram_quantile(0.99,
            sum by (le, service) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )
  # Alerting rules for checkout-service SLOs
  - name: checkout_service_alerts # Alerting rule group
    rules:
      # Alert when error rate exceeds 1% for 5 minutes
      - alert: CheckoutHighErrorRate # Alert name (PascalCase convention)
        expr: > # PromQL condition to evaluate
          service:http_error_rate:ratio_rate5m{service="checkout-service"} > 0.01
        for: 5m # Must be true for 5 min before firing
        labels:
          severity: critical # Routing label for Alertmanager
          team: checkout # Team responsible for this alert
        annotations:
          summary: "High error rate on checkout-service" # Short description
          description: > # Detailed description with templates
            Error rate is {{ $value | humanizePercentage }}
            for {{ $labels.service }} in {{ $labels.environment }}.
          runbook_url: "https://wiki.internal/runbooks/checkout-errors" # Link to remediation steps
      # Alert when p99 latency exceeds 2 seconds
      - alert: CheckoutHighLatency # Latency SLO violation alert
        expr: > # Use pre-computed recording rule
          service:http_request_duration:p99_5m{service="checkout-service"} > 2.0
        for: 10m # Longer for-clause to reduce noise
        labels:
          severity: warning # Warning severity, not critical
          team: checkout # Ownership label
        annotations:
          summary: "P99 latency exceeds 2s on checkout-service"
          description: > # Include actual value for quick triage
            P99 latency is {{ $value | humanizeDuration }}
            for {{ $labels.service }}.
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│                Recording Rules vs Alerting Rules                 │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│       ┌──────────────────────────────────────────────────┐       │
│       │                 Recording Rules                  │       │
│       │                                                  │       │
│       │  Expensive PromQL ──→ Evaluate every 30s         │       │
│       │  expr             ──→ Store as new metric        │       │
│       │                       in TSDB                    │       │
│       │                                                  │       │
│       │  sum(rate(http_total[5m])) → service:http:rate5m │       │
│       │  [complex query]             [pre-computed]      │       │
│       └──────────────────────────────────────────────────┘       │
│                                                                  │
│       ┌──────────────────────────────────────────────────┐       │
│       │                  Alerting Rules                  │       │
│       │                                                  │       │
│       │  PromQL expr ──→ Evaluate ──→ State Machine      │       │
│       │                                                  │       │
│       │  ┌──────────┐   ┌──────────┐   ┌──────────┐      │       │
│       │  │ INACTIVE │──→│ PENDING  │──→│ FIRING   │      │       │
│       │  │ expr=∅   │   │ expr=true│   │ for:5m   │      │       │
│       │  └──────────┘   │ timer<5m │   │ elapsed  │      │       │
│       │        ↑        └────┬─────┘   └────┬─────┘      │       │
│       │        │             │              │            │       │
│       │        └──expr=false─┘              ↓            │       │
│       │                               ┌────────────┐     │       │
│       │                               │Alertmanager│     │       │
│       │                               │  routing   │     │       │
│       │                               └────────────┘     │       │
│       └──────────────────────────────────────────────────┘       │
└──────────────────────────────────────────────────────────────────┘
Quick Answer
Alertmanager receives alerts from Prometheus, groups related ones by labels, routes them to the right receiver (Slack, PagerDuty, email) using a routing tree, and supports silences to temporarily mute notifications. Grouping reduces noise by batching alerts that share the same labels into one notification.
Detailed Answer
Think of Alertmanager as a hospital triage system. Patients (alerts) arrive and are grouped by condition. Cardiac cases go to cardiology, broken bones go to orthopedics (routing rules). If the hospital is doing planned maintenance on radiology machines, they put up a sign saying ignore false alarms from 2am to 4am (silences). The triage nurse does not page the doctor twice for the same patient (deduplication), and waits a few minutes to batch patients arriving together (group_wait). Alertmanager is a separate process that receives alerts from one or more Prometheus servers through its /api/v2/alerts endpoint. Its configuration defines receivers (notification channels like Slack or PagerDuty), a routing tree (which alerts go where), inhibition rules (suppress certain alerts when others are already firing), and templates for formatting notifications. The routing tree starts with a root route that has a default receiver. Child routes match on alert labels using matchers. Routes are evaluated top to bottom, and the first matching child wins unless continue: true is set, which lets evaluation continue to the next sibling. Grouping is Alertmanager's most important noise reduction feature. When group_by is set to something like [service, environment], all alerts with the same service and environment label values get bundled into a single notification. Three timing settings control notification behavior: group_wait is how long to wait for more alerts before sending the first notification for a new group (default 30 seconds), group_interval is the minimum time between updates to an existing group when new alerts arrive (default 5 minutes), and repeat_interval is how long to wait before resending an unresolved alert (default 4 hours). Getting these right is the difference between a useful alert system and one that either floods your phone or misses real problems. In production, a well-designed routing tree mirrors your organization's on-call structure. 
Critical payment alerts go to PagerDuty for immediate paging. Warning-level alerts for batch jobs go to a Slack channel. Silences are created through the Alertmanager UI or API and match alerts by label matchers -- they are essential during deployments and maintenance windows. Inhibition rules automatically suppress downstream alerts: when the entire cluster is unreachable (KubeAPIDown), you do not want 500 pod alerts flooding the channel. The inhibition rule says if KubeAPIDown is firing, suppress all alerts with the same cluster label. A common mistake is setting group_by to too many labels, like [service, pod, instance]. This creates one notification per pod, which defeats the purpose of grouping. On the other hand, the special value group_by: ['...'] groups nothing -- every alert becomes its own group. Another pitfall is setting repeat_interval too low, causing alert fatigue from constant re-notifications for chronic issues. The sweet spot is usually 4 to 12 hours. Engineers also forget that silences expire. If you create a 1-hour silence for a deployment that takes 2 hours, alerts will resume halfway through. Always add generous padding to silence durations.
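The first-match behavior of the routing tree, including the effect of `continue`, can be modeled in a few lines of Python. This is a simplified sketch with exact-match labels only; `resolve_receivers` and the dictionary layout are hypothetical, and the tree mirrors the example configuration in the next section.

```python
def matches(route, labels):
    """True when every label in the route's match block equals the alert's label."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())

def resolve_receivers(route, labels):
    """Walk the tree depth-first and return the receivers the alert ends up at."""
    receivers = []
    for child in route.get("routes", []):
        if not matches(child, labels):
            continue
        receivers.extend(resolve_receivers(child, labels))
        if not child.get("continue", False):
            break  # first matching sibling wins unless continue is set
    if not receivers:
        receivers.append(route["receiver"])  # no child matched: stay on this node
    return receivers

tree = {
    "receiver": "slack-default",
    "routes": [
        {"match": {"severity": "critical", "team": "payments"}, "receiver": "pagerduty-payments"},
        {"match": {"severity": "critical"}, "receiver": "pagerduty-general"},
        {"match": {"severity": "warning"}, "receiver": "slack-default", "routes": [
            {"match": {"team": "checkout"}, "receiver": "slack-checkout"},
            {"match": {"team": "payments"}, "receiver": "slack-payments"},
        ]},
    ],
}

critical_payments = resolve_receivers(tree, {"severity": "critical", "team": "payments"})
critical_other = resolve_receivers(tree, {"severity": "critical", "team": "checkout"})
warning_checkout = resolve_receivers(tree, {"severity": "warning", "team": "checkout"})
warning_unowned = resolve_receivers(tree, {"severity": "warning", "team": "search"})
print(critical_payments, critical_other, warning_checkout, warning_unowned)
```

Order matters: if the general critical route were listed before the payments route, payments pages would go to the general on-call instead. Setting `continue` to true on a route lets an alert reach later siblings as well, which is how one alert can notify two receivers.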
Code Example
# alertmanager.yml - Alertmanager configuration
global:
  resolve_timeout: 5m                                            # Mark alert as resolved if not re-received in 5m
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxx'  # Default Slack webhook
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'       # PagerDuty Events API v2

# Notification templates
templates:
  - '/etc/alertmanager/templates/*.tmpl'  # Path to custom notification templates

# Routing tree - determines which alerts go to which receivers
route:
  receiver: 'slack-default'           # Default receiver if no child route matches
  group_by: ['alertname', 'service']  # Group alerts by these labels
  group_wait: 30s                     # Wait 30s for more alerts before first notification
  group_interval: 5m                  # Wait 5m between updates to existing groups
  repeat_interval: 4h                 # Re-send unresolved alerts every 4 hours
  routes:
    # Critical payment alerts go to PagerDuty immediately
    - match:
        severity: critical            # Match alerts with severity=critical
        team: payments                # AND team=payments
      receiver: 'pagerduty-payments'  # Route to PagerDuty
      group_wait: 10s                 # Shorter wait for critical alerts
      repeat_interval: 1h             # Re-page every hour if unresolved
      continue: false                 # Stop matching after this route (the default)
    # All critical alerts (non-payments) go to PagerDuty general
    - match:
        severity: critical            # Match any critical alert
      receiver: 'pagerduty-general'   # General on-call PagerDuty
      group_wait: 15s                 # Quick notification for critical
    # Warning alerts go to team-specific Slack channels
    - match:
        severity: warning             # Match warning-level alerts
      receiver: 'slack-default'       # Fallback if no team route below matches
      routes:
        - match:
            team: checkout            # Checkout team warnings
          receiver: 'slack-checkout'  # Team-specific Slack channel
        - match:
            team: payments            # Payments team warnings
          receiver: 'slack-payments'  # Payments Slack channel

# Inhibition rules - suppress alerts when others are firing
inhibit_rules:
  - source_match:                    # When this alert is firing...
      alertname: 'KubeAPIDown'       # Kubernetes API server is down
    target_match_re:                 # ...suppress these alerts
      alertname: 'Kube.*'            # All Kubernetes-related alerts
    equal: ['cluster']               # Only if cluster label matches
  - source_match:                    # When a critical alert is firing...
      severity: 'critical'
    target_match:
      severity: 'warning'            # ...suppress warning alerts
    equal: ['alertname', 'service']  # For the same alert and service

# Receivers - notification channel configurations
receivers:
  - name: 'slack-default'                        # Default Slack receiver
    slack_configs:
      - channel: '#alerts-general'               # Slack channel name
        send_resolved: true                      # Notify when alert resolves
        title: '{{ .GroupLabels.alertname }}'    # Alert name as title
        text: >-                                 # Notification body template
          {{ range .Alerts }}
          *{{ .Labels.service }}* - {{ .Annotations.summary }}
          {{ end }}
  - name: 'pagerduty-payments'                   # PagerDuty for payments team
    pagerduty_configs:
      - routing_key: 'payments-service-key-xxx'  # Events API v2 integration key (severity is only sent with v2)
        severity: '{{ .CommonLabels.severity }}' # Map to PD severity; severity is not in group_by, so use CommonLabels
        description: '{{ .CommonAnnotations.summary }}'  # Alert summary
  - name: 'slack-checkout'                       # Checkout team Slack channel
    slack_configs:
      - channel: '#checkout-alerts'              # Team-specific channel
        send_resolved: true                      # Send resolution notifications
  - name: 'slack-payments'                       # Payments team Slack channel
    slack_configs:
      - channel: '#payments-alerts'              # Team-specific channel
        send_resolved: true                      # Send resolution notifications
  - name: 'pagerduty-general'                    # General on-call PagerDuty
    pagerduty_configs:
      - routing_key: 'general-oncall-key-xxx'    # Events API v2 integration key
Architecture Diagram
┌──────────────────────────────────────────────────────────────────┐
│                    Alertmanager Routing Tree                     │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Prometheus ──→ POST /api/v2/alerts ──→ Alertmanager             │
│                                                                  │
│  ┌──────────────────────────────────────────────────────┐        │
│  │ Root Route                                           │        │
│  │ receiver: slack-default                              │        │
│  │ group_by: [alertname, service]                       │        │
│  │                                                      │        │
│  │ ├── severity=critical AND team=payments              │        │
│  │ │   └── receiver: pagerduty-payments                 │        │
│  │ │                                                    │        │
│  │ ├── severity=critical                                │        │
│  │ │   └── receiver: pagerduty-general                  │        │
│  │ │                                                    │        │
│  │ └── severity=warning                                 │        │
│  │     ├── team=checkout                                │        │
│  │     │   └── receiver: slack-checkout                 │        │
│  │     └── team=payments                                │        │
│  │         └── receiver: slack-payments                 │        │
│  └──────────────────────────────────────────────────────┘        │
│                                                                  │
│  Grouping Timeline:                                              │
│  ┌────────┐  ┌──────────┐  ┌──────────────┐  ┌──────────────┐    │
│  │ Alert  │→ │group_wait│→ │ 1st Notify   │→ │group_interval│    │
│  │ Arrives│  │  (30s)   │  │ (batch sent) │  │    (5m)      │    │
│  └────────┘  └──────────┘  └──────────────┘  └──────┬───────┘    │
│                                                     │            │
│                                             ┌───────↓───────┐    │
│                                             │repeat_interval│    │
│                                             │     (4h)      │    │
│                                             │  re-send if   │    │
│                                             │  unresolved   │    │
│                                             └───────────────┘    │
└──────────────────────────────────────────────────────────────────┘
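The routing tree in the diagram can be walked with a few lines of Python. This is a rough sketch of the matching rule (siblings checked top to bottom, first match wins unless continue is set, parent receiver used when no child matches), assuming exact-equality matchers only. The resolve and matches helpers and the dict layout are hypothetical and are not Alertmanager's API.

```python
def matches(route, labels):
    """A route matches when every label in its 'match' dict equals the alert's value."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def resolve(route, labels):
    """Return the receivers that would be notified for an alert under this route."""
    receivers = []
    for child in route.get("routes", []):
        if matches(child, labels):
            receivers.extend(resolve(child, labels))
            if not child.get("continue", False):
                break  # first matching sibling wins
    # If no child matched, the current route's receiver handles the alert.
    return receivers or [route["receiver"]]


# Same shape as the routing tree in the diagram.
tree = {
    "receiver": "slack-default",
    "routes": [
        {"match": {"severity": "critical", "team": "payments"},
         "receiver": "pagerduty-payments"},
        {"match": {"severity": "critical"}, "receiver": "pagerduty-general"},
        {"match": {"severity": "warning"}, "receiver": "slack-default",
         "routes": [
             {"match": {"team": "checkout"}, "receiver": "slack-checkout"},
             {"match": {"team": "payments"}, "receiver": "slack-payments"},
         ]},
    ],
}

print(resolve(tree, {"severity": "critical", "team": "payments"}))  # ['pagerduty-payments']
print(resolve(tree, {"severity": "warning", "team": "checkout"}))   # ['slack-checkout']
print(resolve(tree, {"severity": "info"}))                          # ['slack-default']
```

Setting "continue": True on the first child would make a critical payments alert reach both pagerduty-payments and pagerduty-general, which is how continue fans one alert out to several receivers.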