Uber

17 interview questions · kubernetes, prometheus

kubernetesprometheusadvancedarchitectintermediate

How do you run Kafka on Kubernetes using Strimzi, and what are the production challenges?

advancedgeneralkubernetes

▼

Quick Answer

Strimzi provides a Kubernetes Operator that manages Kafka clusters as custom resources. It handles broker deployment, topic management, user authentication, and rolling upgrades. Production challenges include persistent storage sizing, rack awareness for HA, and managing JVM heap tuning across broker pods.

Detailed Answer

Think of running a post office inside a shopping mall. The mall provides space, power, and security (Kubernetes), but the post office needs its own sorting machines, mailboxes, and delivery routes (Kafka brokers, topics, partitions). Strimzi is the contractor that builds and maintains the post office inside the mall, handling construction, repairs, and expansions without shutting down mail service. Strimzi is a CNCF project that provides Kubernetes Operators for running Apache Kafka. Instead of manually deploying Kafka brokers as StatefulSets with complex configuration, you declare a Kafka custom resource that specifies replicas, storage, listeners, and authentication. The Strimzi Operator reconciles this into the actual Kubernetes resources: StatefulSets for brokers and ZooKeeper (or KRaft controllers), Services for client access, ConfigMaps for broker configuration, and PersistentVolumeClaims for data. Internally, the Strimzi Operator watches for Kafka, KafkaTopic, KafkaUser, and KafkaConnect custom resources. When you create a Kafka CR with 3 replicas and 100Gi storage, the Operator creates a StatefulSet with 3 pods, each with a PVC for data and a PVC for logs. It configures broker IDs, advertised listeners with proper DNS names, inter-broker replication, and rack awareness using topology labels. For client access, Strimzi creates bootstrap Services and per-broker Services so clients can discover and connect to specific brokers. At production scale, the key challenges are storage performance (Kafka is I/O intensive, requiring SSDs with provisioned IOPS), JVM tuning (brokers need careful heap sizing to avoid long GC pauses), rolling upgrades (Strimzi performs rolling restarts but you must ensure ISR counts stay healthy), and monitoring (JMX metrics exported via Prometheus JMX exporter). Teams should configure PodDisruptionBudgets to prevent multiple brokers from going down simultaneously during node maintenance. The non-obvious gotcha is that Kafka's advertised listeners must be reachable by all clients. In Kubernetes, if you expose Kafka outside the cluster using NodePort or LoadBalancer listeners, each broker needs its own external address. Strimzi handles this with per-broker Services, but misconfigured DNS or security groups can cause clients to connect to the bootstrap but fail when redirected to individual brokers. Always test external client connectivity from outside the cluster before going live.

Code Example

# Install Strimzi Operator in the kafka namespace
kubectl create namespace kafka # Dedicated namespace for Kafka components
kubectl apply -f https://strimzi.io/install/latest?namespace=kafka -n kafka # Deploy Strimzi Operator CRDs and controller

# Create a 3-broker Kafka cluster with KRaft mode (no ZooKeeper)
apiVersion: kafka.strimzi.io/v1beta2 # Strimzi Kafka API
kind: Kafka # Custom resource for a Kafka cluster
metadata:
  name: payments-kafka # Cluster name used in Service DNS
  namespace: kafka # Deploy in the kafka namespace
spec:
  kafka:
    version: 3.7.0 # Kafka version
    replicas: 3 # Three brokers for HA
    listeners:
    - name: plain # Internal plaintext listener
      port: 9092 # Standard Kafka port
      type: internal # ClusterIP access only
    - name: tls # Internal TLS listener
      port: 9093 # TLS port
      type: internal # Encrypted internal traffic
      tls: true # Enable TLS encryption
    storage:
      type: persistent-claim # Use PVCs for durability
      size: 100Gi # 100GB per broker
      class: gp3 # AWS gp3 for consistent IOPS
    config:
      num.partitions: 12 # Default partitions per topic
      default.replication.factor: 3 # Replicate to all brokers
      min.insync.replicas: 2 # Require 2 ISR for acks=all

# Check broker pod status
kubectl get pods -n kafka -l strimzi.io/cluster=payments-kafka # Shows broker and controller pods

# Check under-replicated partitions
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions # Should return empty if healthy

◈ Architecture Diagram

┌──────────────────────────────┐
│  Strimzi Operator            │
│  (watches Kafka CRs)         │
└──────────┬───────────────────┘
           ↓
┌──────────────────────────────┐
│  Kafka CR → StatefulSet      │
│  ┌────────┐┌────────┐┌──────┐│
│  │Broker 0││Broker 1││Brk 2 ││
│  │100Gi   ││100Gi   ││100Gi ││
│  └────────┘└────────┘└──────┘│
└──────────────────────────────┘

How does Kafka enable event-driven microservices, and how do you guarantee message ordering?

advancedgeneralkubernetes

▼

Quick Answer

Kafka decouples producers from consumers through topics and partitions, enabling asynchronous event-driven communication. Message ordering is guaranteed within a single partition by using a consistent partition key. Exactly-once semantics require idempotent producers and transactional writes.

Detailed Answer

Think of a newspaper printing operation. Writers (producers) submit articles to different sections (topics) — sports, finance, weather. Each section has multiple columns (partitions). Articles within the same column are printed in order, but articles across different columns may interleave. If you want all articles about the same team in order, you assign them to the same column using the team name as a key. Kafka enables event-driven architecture by acting as a durable, distributed message log between microservices. Instead of services calling each other directly via HTTP (tight coupling), services publish events to Kafka topics and other services consume them independently. A payments service publishes payment-completed events, and the notification service, analytics service, and fraud detection service each consume those events at their own pace without knowing about each other. Internally, each Kafka topic is divided into partitions. Producers send messages with an optional key, and Kafka hashes the key to determine the partition. All messages with the same key go to the same partition, and within a partition, messages are strictly ordered by offset. Consumer groups assign partitions to consumers, so each partition is read by exactly one consumer in a group. For exactly-once semantics, Kafka provides idempotent producers (deduplicate retries using producer ID and sequence number) and transactional writes (atomic writes across multiple partitions). At production scale, partition key design is the most important decision. For a payments system, using the account ID as the partition key ensures all events for one account are ordered — payment initiated, payment authorized, payment completed. If you use random partitioning for throughput, you lose ordering guarantees. Consumer group rebalancing during scaling or failures can cause brief processing pauses, so teams should use cooperative sticky assignor to minimize disruption. The non-obvious gotcha is that ordering only works within a partition, not across partitions. If you need global ordering across all events, you must use a single partition — but that limits throughput to one consumer. Most systems design for per-entity ordering (all events for account-123 in order) rather than global ordering, which is sufficient for almost all business requirements.

Code Example

# Create a topic with 12 partitions and replication factor 3
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create --topic payment-events \
  --partitions 12 \
  --replication-factor 3 # 12 partitions for parallel consumption, 3 replicas for durability

# Produce a keyed message (account ID ensures ordering per account)
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-console-producer.sh \
  --bootstrap-server localhost:9092 \
  --topic payment-events \
  --property parse.key=true \
  --property key.separator=: # Format: account-123:{"event":"payment_completed"}

# Consume with a consumer group
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic payment-events \
  --group settlements-processor \
  --from-beginning # Reads from earliest offset for new group

# Check consumer lag per partition
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group settlements-processor # Shows LAG column per partition

◈ Architecture Diagram

┌──────────┐
│ Producer │
│ key=acct │
└────┬─────┘
     ↓ hash(key)
┌─────────────────────┐
│ Topic: payments     │
│ ┌───┐ ┌───┐ ┌───┐  │
│ │P0 │ │P1 │ │P2 │  │
│ │ordered│ordered│ordered│
│ └───┘ └───┘ └───┘  │
└─────────────────────┘
     ↓
┌──────────┐
│ Consumer │
│ Group    │
└──────────┘

How do you handle Kafka consumer lag and backpressure in high-throughput payment processing?

advancedgeneralkubernetes

▼

Quick Answer

Monitor consumer lag per partition using kafka-consumer-groups CLI or Prometheus metrics. Scale consumers with KEDA based on lag thresholds. Handle backpressure with dead letter queues for failed messages, circuit breakers for slow downstream services, and partition rebalancing to distribute load evenly.

Detailed Answer

Think of a call center during a product recall. If calls come in faster than agents can answer, the queue grows (consumer lag). You can add more agents (scale consumers), redirect overflow calls to voicemail (dead letter queue), or temporarily stop accepting new calls from the website (backpressure to producers). The key is detecting the queue growth early and responding before callers give up. Consumer lag is the difference between the latest message offset in a partition and the consumer's current committed offset. A lag of zero means the consumer is caught up. Growing lag means messages are being produced faster than consumed. In a payment processing system, growing lag means transactions are being delayed, which can trigger timeouts, duplicate processing attempts, and compliance violations for settlement deadlines. The response strategy has three layers. First, scale consumers: use KEDA (Kubernetes Event-Driven Autoscaler) with a Kafka trigger that watches consumer group lag and scales the Deployment from 1 to N consumers. Each consumer in the group gets assigned partitions, so the maximum parallelism equals the number of partitions. Second, handle failures: messages that fail processing after retries go to a dead letter topic for manual investigation rather than blocking the entire partition. Third, apply backpressure: if a downstream service (like the fraud detection API) is slow, implement circuit breakers so consumers pause processing rather than overwhelming the failing service. At production scale, partition count determines your maximum consumer parallelism. If you have 12 partitions and 12 consumers, each handles one partition. Adding a 13th consumer gives it nothing to do. Teams should provision enough partitions upfront (at least 3x the expected peak consumer count) because repartitioning an existing topic requires data migration. Monitor lag per partition, not just aggregate lag, because a single hot partition with a slow consumer can cause localized delays while the aggregate looks healthy. The non-obvious gotcha is that consumer lag spikes are normal during deployments. When consumers restart during a rolling update, partitions are reassigned and the new consumers start from the last committed offset, which may be slightly behind. This brief lag spike should resolve within minutes. Alert on lag that is continuously growing for 15+ minutes, not on momentary spikes. Also, max.poll.records and max.poll.interval.ms settings determine how many messages a consumer fetches per poll and how long it can take to process them — misconfiguring these causes unnecessary rebalances that worsen lag.

Code Example

# Check consumer lag per partition
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group settlements-processor # Shows LAG per partition

# KEDA ScaledObject to auto-scale consumers based on lag
apiVersion: keda.sh/v1alpha1 # KEDA API
kind: ScaledObject # Auto-scaling configuration
metadata:
  name: settlements-processor-scaler # Scaler for the settlements consumer
  namespace: payments # Application namespace
spec:
  scaleTargetRef:
    name: settlements-processor # Deployment to scale
  minReplicaCount: 1 # Minimum 1 consumer always running
  maxReplicaCount: 12 # Max equals partition count
  triggers:
  - type: kafka # KEDA Kafka trigger
    metadata:
      bootstrapServers: payments-kafka-kafka-bootstrap.kafka:9092 # Kafka bootstrap address
      consumerGroup: settlements-processor # Consumer group to monitor
      topic: payment-events # Topic to watch
      lagThreshold: "100" # Scale up when lag exceeds 100 messages per partition

# Dead letter topic configuration in consumer application
# If processing fails after 3 retries, send to DLQ
# spring.kafka.consumer.properties.max.poll.records=50
# spring.kafka.consumer.properties.max.poll.interval.ms=300000

How do you identify whether a pod restart is caused by OOMKilled, a connectivity failure, or an application-level bug?

advancedpodskubernetes

▼

Quick Answer

Check the container's last termination reason and exit code: OOMKilled shows reason=OOMKilled with exit 137, connectivity failures show timeout errors in application logs with exit 1, and application bugs show stack traces or panic messages in logs with exit 1 or 2. The distinction comes from correlating exit codes, termination reasons, and log content.

Detailed Answer

Think of a car that keeps stalling. A mechanic checks three things in order: is it out of fuel (OOMKilled — out of memory), is the road blocked (connectivity — cannot reach a dependency), or is the engine itself broken (application bug). Each has a different diagnostic signature, and checking them in the right order saves time. In Kubernetes, every container termination has metadata that points to the cause. The container status records the exit code, the termination reason, and the termination message. OOMKilled is the clearest: Kubernetes sets the reason field to OOMKilled and the exit code to 137. This means the kernel's Out-Of-Memory killer terminated the process because it exceeded its cgroup memory limit. The container did not choose to exit — it was killed by the kernel. For connectivity failures, the exit code is typically 1 (generic application error) and the logs show timeout or connection refused messages when trying to reach a database, cache, or external API. The key diagnostic is checking the application logs for patterns like 'connection refused,' 'timeout,' 'no such host,' or 'TLS handshake failed.' You can verify by execing into the pod and testing connectivity manually with nc, curl, or nslookup to isolate whether it is a DNS, network policy, or service availability issue. For application bugs, the exit code is 1 or sometimes 2 (misuse), and the logs show stack traces, null pointer exceptions, panic messages, or assertion failures. These are predictable (deterministic) — the same input or configuration triggers the same crash. You can distinguish them from connectivity issues because the error occurs during request processing or startup logic, not during a connection attempt. The non-obvious gotcha is that OOMKilled can masquerade as an application bug if you only check logs. When the OOM killer strikes, the process is terminated immediately — there may be no log line because the application never got a chance to write one. If you see a container with exit code 137, zero log output, and high restart count, check the termination reason field directly. Also, a JVM application may show exit code 1 with a java.lang.OutOfMemoryError in logs if it hits the JVM heap limit before hitting the cgroup limit — this is an application-level OOM, not a kernel OOMKill, and the fix is different (increase JVM heap, not container memory limit).

Code Example

# Step 1: Check termination reason and exit code
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason, exitCode, startedAt, finishedAt

# Step 2: If exit code 137, confirm OOMKilled
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -i 'oom\|killed\|reason' # Confirms OOMKilled

# Step 3: Check memory usage vs limits for OOMKilled
kubectl top pod payments-api-7d9f8b6c4-abc12 -n payments # Current memory usage
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}' # Configured memory limit

# Step 4: If exit code 1, check logs for connectivity vs application error
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=100 # Check for timeout/connection vs stack trace

# Step 5: Test connectivity from inside the pod
kubectl exec -n payments deploy/payments-api -- nc -zv payments-db.internal 5432 # Test database connectivity
kubectl exec -n payments deploy/payments-api -- nslookup redis-cache.payments.svc # Test DNS resolution

# Quick reference for exit codes:
# Exit 0   = Normal termination (container completed successfully)
# Exit 1   = Application error (check logs for stack trace or connection error)
# Exit 137 = SIGKILL (OOMKilled by kernel or killed by kubelet)
# Exit 143 = SIGTERM (graceful shutdown, often from liveness probe failure)

◈ Architecture Diagram

┌──────────────┐
│ Pod Restart  │
└──────┬───────┘
       ↓
┌──────────────┐
│ Exit Code?   │
├──────┬───────┤
│ 137  │  1    │
│ OOM  │ Logs? │
└──┬───┴───┬───┘
   ↓       ↓
┌─────┐ ┌────────┐
│ OOM │ │Timeout?│
│Kill │ ├────┬───┤
└─────┘ │Yes │No │
        ↓    ↓
     ┌────┐┌────┐
     │Conn││ Bug│
     └────┘└────┘

How do HPA and VPA work together for autoscaling in production?

advancedschedulingkubernetes

▼

Quick Answer

HPA (Horizontal Pod Autoscaler) scales the number of Pod replicas based on metrics like CPU or custom metrics, while VPA (Vertical Pod Autoscaler) adjusts individual Pod resource requests and limits. Using them together requires careful configuration to avoid conflicts where both try to respond to the same metric.

Detailed Answer

Imagine a restaurant kitchen during peak hours. Horizontal scaling is hiring more cooks to handle more orders in parallel -- each cook handles a portion of the workload. Vertical scaling is upgrading your existing cooks to faster, more skilled chefs who can each handle more complex dishes. In practice, you need both strategies: more cooks for sheer volume, and better-equipped cooks so each one operates efficiently. That is the relationship between HPA and VPA in Kubernetes. HPA watches specified metrics (CPU use, memory, or custom/external metrics) and adjusts the replica count of a Deployment, ReplicaSet, or StatefulSet. It runs a control loop every 15 seconds (configurable via --horizontal-pod-autoscaler-sync-period) that calculates the desired replicas using the formula: desiredReplicas = ceil(currentReplicas * (currentMetricValue / desiredMetricValue)). VPA, on the other hand, monitors actual resource consumption of Pods over time and recommends or automatically updates the resource requests and limits in the Pod spec. VPA has three modes: Off (recommendations only), Initial (sets requests at Pod creation), and Auto (evicts and recreates Pods with updated requests). Internally, HPA queries the metrics API (metrics.k8s.io for resource metrics, custom.metrics.k8s.io for custom metrics, or external.metrics.k8s.io for external sources like Prometheus). The metrics-server or a Prometheus adapter populates these APIs. HPA calculates the ratio of current to desired metric values across all Pods, applies a tolerance (default 10%) to prevent flapping, and issues a scale request to the API server. VPA consists of three components: the Recommender (analyzes historical usage and computes recommendations), the Updater (evicts Pods whose requests deviate significantly from recommendations), and the Admission Controller (mutates Pod specs at creation time to inject recommended requests). The Recommender uses a decaying histogram of resource usage to generate its recommendations. In production, running HPA and VPA together on the same metric (like CPU) creates a conflict. HPA sees high CPU and adds replicas; VPA sees high CPU and increases requests per Pod. Both react to the same signal, leading to over-provisioning or oscillation. The recommended pattern is to use HPA for CPU-based scaling and VPA in recommendation-only mode (mode: Off) so operators can manually adjust requests based on VPA suggestions. Alternatively, use HPA with custom metrics (like requests-per-second from Prometheus) and let VPA manage CPU and memory requests in Auto mode, since they are responding to different signals. Multidimensional Pod Autoscaler (MPA), available in some managed Kubernetes distributions, attempts to coordinate both axes natively. A non-obvious gotcha is that VPA in Auto mode evicts Pods to apply new resource requests, which means it causes rolling restarts that can impact availability if your PodDisruptionBudget is not configured correctly. Another trap: HPA uses resource requests as the baseline for percentage calculations (e.g., 80% CPU target means 80% of the CPU request), so if VPA increases the request, the same absolute CPU usage now represents a lower percentage, potentially causing HPA to scale in and reduce replicas. This feedback loop can destabilize your scaling behavior. Always set VPA minAllowed and maxAllowed bounds to prevent runaway resource allocation, and use HPA stabilization windows (behavior.scaleDown.stabilizationWindowSeconds) to dampen rapid fluctuations.

Code Example

# HPA scaling on custom metric (requests-per-second) to avoid conflict with VPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-api-hpa  # HPA for the payments API service
  namespace: payments  # Same namespace as the target Deployment
spec:
  scaleTargetRef:  # Reference to the workload being scaled
    apiVersion: apps/v1  # API version of the target
    kind: Deployment  # Scale a Deployment
    name: payments-api  # Name of the Deployment
  minReplicas: 3  # Never go below 3 replicas for HA
  maxReplicas: 25  # Cap at 25 to control costs
  metrics:  # Use custom metric to avoid conflict with VPA on CPU
    - type: Pods  # Per-pod custom metric
      pods:
        metric:
          name: http_requests_per_second  # Custom metric from Prometheus adapter
        target:
          type: AverageValue  # Target average across all Pods
          averageValue: "100"  # Scale up when RPS exceeds 100 per Pod
  behavior:  # Fine-tune scaling behavior to prevent flapping
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
        - type: Percent  # Scale down by percentage
          value: 10  # Remove at most 10% of Pods per period
          periodSeconds: 60  # Evaluate every 60 seconds
    scaleUp:
      stabilizationWindowSeconds: 30  # React quickly to traffic spikes
      policies:
        - type: Pods  # Scale up by fixed number
          value: 4  # Add at most 4 Pods per period
          periodSeconds: 60  # Evaluate every 60 seconds
---
# VPA managing CPU and memory requests (Auto mode safe since HPA uses custom metric)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payments-api-vpa  # VPA for the same payments API
  namespace: payments  # Same namespace
spec:
  targetRef:  # Reference to the workload
    apiVersion: apps/v1  # API version of the target
    kind: Deployment  # Target Deployment
    name: payments-api  # Same Deployment as HPA targets
  updatePolicy:
    updateMode: Auto  # Automatically evict and recreate Pods with new requests
  resourcePolicy:
    containerPolicies:
      - containerName: payments-api  # Apply to the main container
        minAllowed:  # Floor to prevent under-provisioning
          cpu: 250m  # Minimum 250 millicores
          memory: 256Mi  # Minimum 256MB
        maxAllowed:  # Ceiling to prevent runaway costs
          cpu: "2"  # Maximum 2 CPU cores
          memory: 2Gi  # Maximum 2GB memory
        controlledResources:  # Only manage these resources
          - cpu  # VPA manages CPU requests
          - memory  # VPA manages memory requests

◈ Architecture Diagram

┌──────────┐         ┌──────────┐
│  Metrics │         │  Metrics │
│  Server  │         │ Prometheus│
└────┬─────┘         └────┬─────┘
     │ cpu/mem             │ rps
     ↓                     ↓
┌──────────┐         ┌──────────┐
│   VPA    │         │   HPA    │
│ Adjusts  │         │ Adjusts  │
│ Requests │         │ Replicas │
└────┬─────┘         └────┬─────┘
     │                     │
     └──────────┬──────────┘
                ↓
         ┌──────────┐
         │ Payments │
         │   API    │
         │ Deploy   │
         └──────────┘

How do pod topology spread constraints work internally in the Kubernetes scheduler, and what production failures can occur when they interact with cluster autoscaling?

architectschedulingkubernetes

▼

Quick Answer

Topology spread constraints tell the scheduler to distribute Pods across failure domains defined by node labels such as zone or hostname, using maxSkew to control imbalance. When combined with cluster autoscaling, problems arise if a zone has zero nodes — the autoscaler may not know about the zone, causing the scheduler to leave Pods pending indefinitely.

Detailed Answer

Think of seating guests at a wedding reception. You want to spread friends evenly across tables so no table is overcrowded and no group is isolated. The wedding planner checks how many people are at each table and seats the next guest at the most empty one, but if a table does not exist yet (no physical table has been set up), the planner cannot seat anyone there even if the venue has room. Topology spread constraints in Kubernetes work the same way. Kubernetes topology spread constraints are declared in the Pod spec under topologySpreadConstraints. Each constraint specifies a topologyKey (a node label like topology.kubernetes.io/zone or kubernetes.io/hostname), a maxSkew (the maximum allowed difference in Pod count between the most-populated and least-populated domain), a whenUnsatisfiable behavior (DoNotSchedule or ScheduleAnyway), and a labelSelector to identify which Pods count toward the spread calculation. Internally, the scheduler evaluates topology spread during the Filter and Score phases. In the Filter phase, it eliminates nodes where placing the Pod would violate the maxSkew when whenUnsatisfiable is DoNotSchedule. In the Score phase, it ranks remaining nodes by how well they balance the distribution. The scheduler considers the topologyKey label on existing nodes to define domains — a domain only exists if at least one node carries that label value. It then counts matching Pods per domain and calculates whether the new Pod can land in each domain without exceeding maxSkew. At production scale, the interaction with cluster autoscaling creates subtle failures. If a node pool in one availability zone scales to zero, that zone disappears from the scheduler's topology map. The scheduler only sees zones with active nodes, so it may consider a two-zone spread sufficient even when three zones are available. When maxSkew is 1 and whenUnsatisfiable is DoNotSchedule, the scheduler can leave Pods pending because it cannot place them in a zone that has no nodes, and the autoscaler may not create a node in the missing zone because it does not see pending Pods that specifically require it. This chicken-and-egg problem is one of the most common production issues with topology spread constraints. The non-obvious gotcha is that topology spread constraints count all matching Pods, including ones that are terminating, not-ready, or failing. During a rolling update, old Pods being terminated still count toward the spread calculation, which can cause new Pods to be unschedulable until the old ones are fully removed. Architects should set minDomains to explicitly declare how many zones the spread should consider, use node affinity in combination with spread constraints to ensure the autoscaler knows about expected zones, and monitor for unschedulable Pods with topology spread violation events.

Code Example

# Apply a Deployment with zone and node spread constraints
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages replicated Pods
metadata:
  name: checkout-api # Production checkout service
  namespace: payments # Team namespace
spec:
  replicas: 6 # Six replicas to spread across three zones with two per zone
  selector:
    matchLabels:
      app: checkout-api # Pod selector
  template:
    metadata:
      labels:
        app: checkout-api # Label used by spread constraint selector
    spec:
      topologySpreadConstraints:
      - maxSkew: 1 # Allows at most one Pod difference between zones
        topologyKey: topology.kubernetes.io/zone # Spreads across availability zones
        whenUnsatisfiable: DoNotSchedule # Strictly enforces zone balance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
        minDomains: 3 # Expects three zones even if some have zero nodes
      - maxSkew: 1 # Allows at most one Pod difference between nodes within a zone
        topologyKey: kubernetes.io/hostname # Spreads across individual nodes
        whenUnsatisfiable: ScheduleAnyway # Prefers balance but allows imbalance
        labelSelector:
          matchLabels:
            app: checkout-api # Counts only checkout-api Pods
      containers:
      - name: api # Application container
        image: registry.company.com/checkout-api:3.7.2 # Versioned production image
        resources:
          requests:
            cpu: 250m # Minimum CPU for scheduling
            memory: 512Mi # Minimum memory for scheduling

# Check Pod distribution across zones
kubectl get pods -n payments -l app=checkout-api -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'

# Identify Pods pending due to topology spread violations
kubectl get events -n payments --field-selector reason=FailedScheduling | grep topology

◈ Architecture Diagram

┌─── Zone A ──┐ ┌─── Zone B ──┐ ┌─── Zone C ──┐
│ ┌────┐┌────┐│ │ ┌────┐┌────┐│ │ ┌────┐┌────┐│
│ │Pod1││Pod2││ │ │Pod3││Pod4││ │ │Pod5││Pod6││
│ └────┘└────┘│ │ └────┘└────┘│ │ └────┘└────┘│
│  maxSkew=1  │ │  maxSkew=1  │ │  maxSkew=1  │
└─────────────┘ └─────────────┘ └─────────────┘

How should architects combine Vertical Pod Autoscaler and Horizontal Pod Autoscaler in the same cluster without creating scaling conflicts, and when does KEDA fit better than either?

architectgeneralkubernetes

▼

Quick Answer

VPA and HPA conflict when both scale on the same metric because VPA resizes Pods while HPA changes Pod count, creating feedback loops. The safe pattern is VPA on memory and HPA on CPU or custom metrics, or VPA in recommendation-only mode. KEDA fits better for event-driven workloads that scale to zero or react to external queue depth rather than Pod-level CPU or memory.

Detailed Answer

Think of a restaurant kitchen during dinner rush. The horizontal approach is adding more cooks to handle more orders. The vertical approach is giving each cook a bigger stove and more counter space so they can cook faster. If you try both approaches based on the same signal — how backed up the order queue is — the kitchen oscillates between adding cooks and upgrading stoves in a confusing loop. You need different signals for each scaling axis. The Horizontal Pod Autoscaler watches metrics like CPU use, memory use, or custom metrics, and adjusts the number of Pod replicas to meet a target value. The Vertical Pod Autoscaler observes actual resource consumption over time and recommends or applies changes to the resource requests and limits of individual Pods. When both use CPU as their metric, VPA may increase a Pod's CPU request, which changes the Pod's use percentage, which then triggers HPA to scale down replicas, which increases use again, creating an oscillation loop. The recommended production pattern separates their concerns. Run VPA in recommendation mode (updateMode: Off) or target only memory, while HPA scales on CPU or custom metrics. Alternatively, use VPA in Auto mode for stateful workloads where horizontal scaling is impractical — databases, caches, or single-instance controllers — and reserve HPA for stateless services that benefit from replica scaling. Some teams use VPA recommendations to set initial resource requests in CI/CD pipelines rather than letting VPA mutate Pods at runtime, which avoids the Pod restart that VPA triggers when updating in-place. At production scale, KEDA (Kubernetes Event-Driven Autoscaler) fills a gap that neither HPA nor VPA addresses well. KEDA scales based on external event sources — message queue depth in Kafka or SQS, pending items in a Redis stream, HTTP request rate from Prometheus, or custom metrics from any source with a KEDA scaler. Critically, KEDA can scale Deployments to zero replicas when there is no work, which standard HPA cannot do (HPA's minimum is one). This makes KEDA the right choice for batch processors, event consumers, and asynchronous workers where idle cost matters. KEDA works by deploying ScaledObject resources that create HPA objects under the hood, so it integrates with existing Kubernetes autoscaling infrastructure. The non-obvious gotcha is that VPA in Auto mode restarts Pods when it changes resource requests, which can cause brief service disruption and interact badly with PodDisruptionBudget limits. Teams often discover this during peak traffic when VPA decides to right-size all replicas simultaneously. Architects should set VPA update policies with minReplicas and eviction requirements, and test VPA behavior during high-traffic scenarios before enabling Auto mode on critical services.

Code Example

# Deploy VPA in recommendation mode for a service — no automatic Pod restarts
apiVersion: autoscaling.k8s.io/v1 # VPA API group
kind: VerticalPodAutoscaler # Recommends or applies resource changes
metadata:
  name: checkout-api-vpa # VPA for the checkout service
  namespace: payments # Same namespace as the target
spec:
  targetRef:
    apiVersion: apps/v1 # References a Deployment
    kind: Deployment # Target workload type
    name: checkout-api # The Deployment to analyze
  updatePolicy:
    updateMode: "Off" # Generates recommendations without applying them
  resourcePolicy:
    containerPolicies:
    - containerName: api # Target the main container
      minAllowed:
        cpu: 100m # Never recommend below 100m CPU
        memory: 256Mi # Never recommend below 256Mi memory
      maxAllowed:
        cpu: 2 # Cap recommendations at 2 CPU cores
        memory: 4Gi # Cap recommendations at 4Gi memory

# Deploy HPA scaling on CPU utilization (safe alongside VPA on memory)
apiVersion: autoscaling/v2 # HPA v2 API for custom metrics support
kind: HorizontalPodAutoscaler # Scales replica count
metadata:
  name: checkout-api-hpa # HPA for the checkout service
  namespace: payments # Same namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1 # References the same Deployment
    kind: Deployment # Target workload type
    name: checkout-api # The Deployment to scale
  minReplicas: 3 # Never scale below three replicas
  maxReplicas: 20 # Cap at twenty replicas
  metrics:
  - type: Resource # Uses built-in resource metrics
    resource:
      name: cpu # Scales on CPU utilization only
      target:
        type: Utilization # Target a percentage of requests
        averageUtilization: 70 # Scale up when average CPU exceeds 70 percent

# Read VPA recommendations without applying them
kubectl get vpa checkout-api-vpa -n payments -o jsonpath='{.status.recommendation.containerRecommendations[*]}'

◈ Architecture Diagram

┌─────────────────────────────────┐
│        Scaling Decision         │
├───────────┬───────────┬─────────┤
│  HPA      │  VPA      │  KEDA   │
│ ─ ─ ─ ─  │ ─ ─ ─ ─   │ ─ ─ ─  │
│ CPU/custom│ Memory    │ Queue   │
│ → replicas│ → sizing  │ → 0..N │
│ stateless │ stateful  │ events  │
└───────────┴───────────┴─────────┘

How does Karpenter differ from Cluster Autoscaler in node provisioning strategy, and when should architects choose one over the other for production workloads?

architectschedulingkubernetes

▼

Quick Answer

Karpenter provisions nodes directly from cloud provider APIs based on pending Pod requirements, selecting instance types dynamically without predefined node groups. Cluster Autoscaler adjusts the size of existing node groups. Karpenter is faster and more flexible for heterogeneous workloads, while Cluster Autoscaler is simpler for teams with well-defined node group templates and multi-cloud portability needs.

Detailed Answer

Think of hiring staff for a catering company. Cluster Autoscaler is like posting a job ad for a specific role — you have predefined job descriptions (node groups), and when you need more people, you hire from those templates. Karpenter is like a staffing agency that looks at the actual tasks on the board, finds a person with exactly the right skills and availability, and places them immediately. The agency is faster and more flexible, but you need to trust it with your hiring criteria. Cluster Autoscaler has been the standard Kubernetes node scaling solution since early Kubernetes versions. It works by monitoring pending Pods that cannot be scheduled due to insufficient resources, then scaling up one of the configured node groups (Auto Scaling Groups on AWS, Managed Instance Groups on GCP, or VM Scale Sets on Azure). It also scales down by identifying underutilized nodes and draining them. The key limitation is that node groups must be pre-configured with specific instance types, and the autoscaler chooses among existing groups rather than selecting the optimal instance type for each workload. Karpenter, originally created by AWS and now a CNCF project, takes a fundamentally different approach. Instead of managing node groups, Karpenter watches for unschedulable Pods and directly provisions compute from the cloud provider API, choosing the best instance type based on Pod resource requirements, node selectors, affinity rules, and topology spread constraints. NodePool resources define constraints like allowed instance families, availability zones, capacity types (on-demand or spot), and expiration policies. Karpenter evaluates all pending Pods together and can bin-pack them onto a single optimally-sized instance rather than scaling a node group one unit at a time. At production scale, Karpenter typically provisions nodes in under 60 seconds compared to 2-5 minutes for Cluster Autoscaler, because it skips the Auto Scaling Group scaling process and calls the EC2 API directly. Karpenter also handles node disruption proactively through consolidation, which replaces underutilized nodes with cheaper or better-fitting ones, and expiration, which rotates nodes to pick up AMI updates. However, Karpenter is currently most mature on AWS, with Azure support in development and GCP support community-driven. Teams needing multi-cloud portability or running on GKE or AKS may still prefer Cluster Autoscaler. The non-obvious gotcha is that Karpenter's flexibility requires careful constraint definition. Without proper NodePool limits on instance families, maximum Pods per node, or total cluster capacity, Karpenter can provision very large or very expensive instances. It can also create infrastructure drift if the team's Terraform or IaC does not account for Karpenter-managed nodes. Architects should set explicit NodePool constraints, integrate Karpenter's provisioned nodes into their monitoring and cost dashboards, and understand that Karpenter manages node lifecycle independently of any external node group definition.

Code Example

# Karpenter NodePool that provisions cost-optimized compute for general workloads
apiVersion: karpenter.sh/v1 # Karpenter API group
kind: NodePool # Defines provisioning constraints and behavior
metadata:
  name: general-workloads # Pool name for general-purpose services
spec:
  template:
    spec:
      requirements:
      - key: karpenter.sh/capacity-type # Defines instance purchasing model
        operator: In
        values: [on-demand, spot] # Allows both on-demand and spot instances
      - key: node.kubernetes.io/instance-type # Limits instance families
        operator: In
        values: [m6i.large, m6i.xlarge, m7i.large, m7i.xlarge, c6i.large, c6i.xlarge] # Curated list of right-sized instances
      - key: topology.kubernetes.io/zone # Constrains to specific zones
        operator: In
        values: [us-east-1a, us-east-1b, us-east-1c] # All three availability zones
      nodeClassRef:
        group: karpenter.k8s.aws # AWS-specific node configuration
        kind: EC2NodeClass # References the EC2 node template
        name: general-al2023 # Node class with AL2023 AMI and security groups
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized # Replaces wasteful nodes automatically
    consolidateAfter: 30s # Waits 30 seconds before consolidating
  limits:
    cpu: 200 # Maximum total CPU across all nodes in this pool
    memory: 800Gi # Maximum total memory across all nodes in this pool
---
apiVersion: karpenter.k8s.aws/v1 # AWS-specific Karpenter API
kind: EC2NodeClass # Configures the EC2 instance template
metadata:
  name: general-al2023 # Referenced by the NodePool above
spec:
  amiSelectorTerms:
  - alias: al2023@latest # Uses the latest Amazon Linux 2023 EKS-optimized AMI
  subnetSelectorTerms:
  - tags:
      karpenter.sh/discovery: payments-cluster # Discovers subnets by tag
  securityGroupSelectorTerms:
  - tags:
      karpenter.sh/discovery: payments-cluster # Discovers security groups by tag

# Check which instances Karpenter provisioned and why
kubectl get nodeclaims -o custom-columns='NAME:.metadata.name,TYPE:.status.instanceType,ZONE:.metadata.labels.topology\.kubernetes\.io/zone,CAPACITY:.metadata.labels.karpenter\.sh/capacity-type'

◈ Architecture Diagram

┌──────────────────────────────────────┐
│         Cluster Autoscaler           │
│  Pending → ASG Scale → Node Ready    │
│  (2-5 min, fixed instance types)     │
└──────────────────────────────────────┘
┌──────────────────────────────────────┐
│         Karpenter                    │
│  Pending → EC2 API → Node Ready      │
│  (<60s, dynamic instance selection)  │
└──────────────────────────────────────┘

How do you monitor Kafka on Kubernetes, and what metrics and alerts matter most?

intermediatemonitoringkubernetes

▼

Quick Answer

Monitor Kafka using JMX metrics exported to Prometheus via the JMX Exporter sidecar. Critical metrics include under-replicated partitions, consumer group lag, request latency (produce/fetch), ISR shrink rate, and broker disk usage. Alert on under-replicated partitions > 0 and consumer lag growing continuously.

Detailed Answer

Think of monitoring a highway system. You track traffic flow (throughput), lane closures (under-replicated partitions), backup length (consumer lag), and road surface condition (disk usage). A single lane closure might not cause problems, but if multiple lanes close simultaneously, traffic grinds to a halt. The same applies to Kafka — individual metric spikes are normal, but correlated spikes indicate a systemic problem. Kafka exposes hundreds of JMX metrics covering broker performance, topic throughput, consumer behavior, and replication health. On Kubernetes with Strimzi, the JMX Exporter runs as a sidecar container in each broker pod, converting JMX MBeans into Prometheus-compatible metrics on an HTTP endpoint. A ServiceMonitor resource tells Prometheus to scrape these endpoints, and Grafana dashboards visualize the data. The most critical metrics fall into four categories. Replication health: kafka.server UnderReplicatedPartitions should always be zero — any non-zero value means data is at risk. Consumer health: kafka.consumer.group lag per partition shows how far behind consumers are — growing lag means consumers cannot keep up with producers. Broker performance: kafka.network RequestMetrics for produce and fetch request latency — p99 above 100ms indicates broker pressure. Resource usage: disk utilization per broker — Kafka stores messages on disk, and running out stops the broker. At production scale, alerting should follow the symptom-based approach. Alert on under-replicated partitions greater than zero for more than 5 minutes (data durability risk), consumer lag increasing for more than 15 minutes (processing falling behind), produce request p99 latency above 200ms (client impact), and disk usage above 75% (capacity planning trigger). Avoid alerting on individual broker CPU spikes — they are often transient during rebalancing. The non-obvious gotcha is that consumer lag metrics are only accurate when consumers are actively polling. If a consumer crashes and stops polling, the lag metric freezes at the last known value rather than showing increasing lag. Teams should also monitor consumer group state (Stable, Rebalancing, Dead) and alert on groups stuck in Rebalancing for more than 5 minutes, which indicates a consumer that keeps crashing during rebalance.

Code Example

# Check under-replicated partitions across all topics
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions # Should return empty when healthy

# Check consumer group lag
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe --group settlements-processor # Shows CURRENT-OFFSET, LOG-END-OFFSET, LAG

# Prometheus alert rule for under-replicated partitions
# groups:
# - name: kafka-alerts
#   rules:
#   - alert: KafkaUnderReplicatedPartitions
#     expr: kafka_server_replicamanager_underreplicatedpartitions > 0
#     for: 5m
#     labels:
#       severity: critical
#     annotations:
#       summary: "Kafka broker {{ $labels.pod }} has under-replicated partitions"
#
#   - alert: KafkaConsumerLagGrowing
#     expr: delta(kafka_consumergroup_lag[15m]) > 1000
#     for: 15m
#     labels:
#       severity: warning
#     annotations:
#       summary: "Consumer group {{ $labels.consumergroup }} lag growing on {{ $labels.topic }}"

◈ Architecture Diagram

┌──────────┐
│ Broker   │
│ JMX      │
└────┬─────┘
     ↓
┌──────────┐
│JMX Export│
│ (sidecar)│
└────┬─────┘
     ↓
┌──────────┐
│Prometheus│
└────┬─────┘
     ↓
┌──────────┐
│ Grafana  │
│ + Alerts │
└──────────┘

What changed with Kafka KRaft mode replacing ZooKeeper, and why does it matter?

intermediategeneralkubernetes

▼

Quick Answer

KRaft mode uses Kafka's internal Raft-based consensus protocol for metadata management instead of an external ZooKeeper ensemble. This simplifies operations by removing a separate distributed system to manage, speeds up controller failover from seconds to milliseconds, and reduces the cluster's resource footprint.

Detailed Answer

Think of a company that used to outsource its HR department to an external firm (ZooKeeper). Every hiring decision, org chart change, and payroll update had to go through the external firm, adding latency and a dependency. KRaft mode brings HR in-house — the company manages its own employee records directly, which is faster, simpler, and eliminates the external dependency. Historically, Kafka relied on Apache ZooKeeper for metadata management: tracking which brokers are alive, which broker is the controller, partition leadership assignments, topic configurations, and ACLs. This meant operating two distributed systems — Kafka and ZooKeeper — each with their own deployment, monitoring, scaling, and failure modes. ZooKeeper required its own ensemble of 3 or 5 nodes, its own storage, and its own expertise to troubleshoot. KRaft (Kafka Raft) replaces ZooKeeper with a built-in metadata quorum. A subset of Kafka nodes run as controllers using the Raft consensus protocol to manage cluster metadata. The controller quorum elects a leader, and all metadata changes (topic creation, partition reassignment, broker registration) go through this leader. The metadata is stored as a replicated log within Kafka itself, using the same storage engine as regular Kafka topics. This means controller failover happens in milliseconds instead of the seconds it took with ZooKeeper leader election. At production scale, KRaft simplifies operations significantly. You deploy and manage one system instead of two. Cluster startup is faster because brokers do not need to wait for ZooKeeper to be available. Scaling is simpler because you do not need to resize the ZooKeeper ensemble separately. Monitoring is unified — you no longer need separate dashboards and alerts for ZooKeeper health. Strimzi supports KRaft mode since Kafka 3.5, and ZooKeeper support is being deprecated. The non-obvious gotcha is that KRaft changes how you think about controller nodes. In ZooKeeper mode, any broker could become the controller. In KRaft mode, you explicitly designate controller nodes (or run combined controller+broker nodes for smaller clusters). For production, dedicated controller nodes are recommended because they avoid resource contention between metadata operations and message processing. Teams migrating from ZooKeeper mode should test KRaft in staging first, as some older Kafka clients may not support KRaft-specific metadata protocols.

Code Example

# Strimzi Kafka CR with KRaft mode (no ZooKeeper)
apiVersion: kafka.strimzi.io/v1beta2 # Strimzi API
kind: Kafka # Kafka cluster custom resource
metadata:
  name: payments-kafka # Cluster name
  namespace: kafka # Namespace
  annotations:
    strimzi.io/kraft: enabled # Enable KRaft mode
spec:
  kafka:
    version: 3.7.0 # Kafka version with stable KRaft
    replicas: 3 # Three broker nodes
    storage:
      type: persistent-claim # Persistent storage
      size: 100Gi # Per-broker storage
    config:
      process.roles: broker,controller # Combined mode for small clusters
      controller.quorum.voters: [email protected]:9093,[email protected]:9093,[email protected]:9093 # KRaft voter list

# Verify KRaft mode is active (no ZooKeeper pods should exist)
kubectl get pods -n kafka # Should show only kafka-kafka-* pods, no zookeeper pods

# Check controller quorum status
kubectl exec -n kafka payments-kafka-kafka-0 -- bin/kafka-metadata.sh \
  --snapshot /var/lib/kafka/data/__cluster_metadata-0/00000000000000000000.log \
  --cluster-id $(kubectl exec -n kafka payments-kafka-kafka-0 -- cat /var/lib/kafka/data/meta.properties | grep cluster.id | cut -d= -f2) # Shows metadata log entries

◈ Architecture Diagram

┌─ ZooKeeper Mode ──────┐  ┌─ KRaft Mode ─────────┐
│ ZK Ensemble (3 nodes) │  │ No ZooKeeper needed  │
│ + Kafka Brokers (3)   │  │ Kafka nodes handle   │
│ = 6 nodes total       │  │ metadata internally  │
│ Failover: seconds     │  │ = 3 nodes total      │
│ Two systems to manage │  │ Failover: millisecs  │
└───────────────────────┘  └──────────────────────┘

How do HPA, VPA, and Cluster Autoscaler work together to handle traffic spikes?

intermediateschedulingkubernetes

▼

Quick Answer

HPA scales pods horizontally based on CPU, memory, or custom metrics. VPA adjusts pod resource requests and limits vertically to right-size containers. Cluster Autoscaler adds or removes nodes when pods cannot be scheduled due to insufficient cluster capacity. Together, they form a three-layer scaling system: right-size pods, scale pod count, then scale infrastructure.

Detailed Answer

Think of a restaurant handling a dinner rush. VPA is like giving each chef a bigger workstation when they are cramped (vertical scaling of individual workers). HPA is like calling in more chefs when orders pile up (horizontal scaling of worker count). Cluster Autoscaler is like opening additional kitchen rooms when there is no space for more chefs (infrastructure scaling). Each addresses a different bottleneck, and they work best when coordinated. The Horizontal Pod Autoscaler (HPA) watches metrics and adjusts the replica count of a Deployment or StatefulSet. By default it uses CPU utilization, but it can target memory, custom metrics (requests per second from Prometheus), or external metrics (SQS queue depth from CloudWatch). HPA evaluates every 15 seconds (configurable), calculates the desired replica count using the formula desiredReplicas = currentReplicas * (currentMetric / targetMetric), and scales accordingly. It respects minReplicas and maxReplicas boundaries and has stabilization windows to prevent flapping. The Vertical Pod Autoscaler (VPA) analyzes historical resource usage and recommends or automatically adjusts the CPU and memory requests and limits for containers. It operates in three modes: Off (only recommends), Initial (sets resources only at pod creation), and Auto (evicts and recreates pods with updated resources). VPA solves the problem of developers guessing resource requests — either setting them too high (wasting cluster capacity) or too low (causing throttling and OOMKills). It uses a recommender component that analyzes metrics over time and an updater that evicts pods needing adjustment. The Cluster Autoscaler watches for pods stuck in Pending state because no node has sufficient resources. When it detects unschedulable pods, it evaluates which node group can accommodate them and triggers the cloud provider to add nodes. Conversely, it removes underutilized nodes (below 50 percent utilization by default) after a cooldown period, draining pods safely using PodDisruptionBudgets. It works with AWS Auto Scaling Groups, GCP Managed Instance Groups, or Azure VM Scale Sets. In production, coordination between these three autoscalers requires careful planning. The critical rule is never to use HPA and VPA on the same metric for the same pod, because they will fight: HPA tries to add replicas while VPA tries to resize existing ones, creating oscillation. The recommended pattern is HPA on CPU with VPA on memory in recommendation-only mode, or HPA on custom metrics while VPA handles resource right-sizing in Initial mode. Cluster Autoscaler must respond within 30-60 seconds to add nodes or traffic will be dropped during spikes. Teams configure node pool warm-up strategies or use Karpenter for faster, more flexible node provisioning. The non-obvious gotcha is scaling lag. HPA reacts in seconds but new pods need time to start (image pull, readiness probe). Cluster Autoscaler takes 1-3 minutes to provision new nodes. During a sudden traffic spike, the system can drop requests for several minutes. Mitigation strategies include setting higher minReplicas during known peak windows, using PodPriority to preempt less critical workloads, over-provisioning with pause pods (low-priority placeholder pods that get evicted instantly to free capacity), and combining with KEDA for event-driven scaling that responds to queue depth before CPU rises.

Code Example

# Check current HPA status including current vs target metrics and replica count
kubectl get hpa -n payments

# Describe HPA to see scaling events, conditions, and metric sources
kubectl describe hpa payments-api -n payments

# Example HPA manifest scaling on CPU and custom requests-per-second metric
# ---
# apiVersion: autoscaling/v2
# kind: HorizontalPodAutoscaler
# metadata:
#   name: payments-api
#   namespace: payments
# spec:
#   scaleTargetRef:
#     apiVersion: apps/v1
#     kind: Deployment
#     name: payments-api
#   minReplicas: 4
#   maxReplicas: 40
#   metrics:
#   - type: Resource
#     resource:
#       name: cpu
#       target:
#         type: Utilization
#         averageUtilization: 65

# Check VPA recommendations without applying them (Off mode)
kubectl get vpa payments-api -n payments -o yaml | grep -A10 recommendation

# Check Cluster Autoscaler status and recent scaling decisions
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Look for pods stuck in Pending that might trigger Cluster Autoscaler
kubectl get pods -A --field-selector=status.phase=Pending

◈ Architecture Diagram

┌─────────────────────────────────────────────────┐
│                Traffic Spike                      │
└────────────────────┬────────────────────────────┘
                     ↓
┌─────────────────────────────────────────────────┐
│  Layer 1: HPA (seconds)                          │
│  Scale pods 4 → 12 based on CPU/custom metrics   │
└────────────────────┬────────────────────────────┘
                     ↓ pods Pending?
┌─────────────────────────────────────────────────┐
│  Layer 2: Cluster Autoscaler (1-3 minutes)       │
│  Add nodes to fit unschedulable pods             │
└────────────────────┬────────────────────────────┘
                     ↓ right-size over time
┌─────────────────────────────────────────────────┐
│  Layer 3: VPA (background)                       │
│  Adjust resource requests based on actual usage  │
└─────────────────────────────────────────────────┘

Why do Kubernetes pods get evicted, and what commands and steps do you use to diagnose and resolve pod eviction?

intermediatepodskubernetes

▼

Quick Answer

Pods are evicted when a node is under resource pressure — disk (DiskPressure), memory (MemoryPressure), or PID exhaustion (PIDPressure). The kubelet evicts pods based on QoS class priority: BestEffort first, then Burstable, then Guaranteed last. Diagnosis starts with kubectl describe node to check conditions and kubectl get events to find eviction reasons.

Detailed Answer

Think of an overcrowded bus. When the bus exceeds its weight limit, the driver must ask some passengers to leave. Passengers without tickets (BestEffort pods) are asked first, then those with partial tickets (Burstable pods), and finally full-fare passengers (Guaranteed pods) are the last to go. The bus driver does not choose randomly — there is a clear priority order based on who has the strongest claim to stay. In Kubernetes, pod eviction is the kubelet's mechanism for protecting node stability. When a node runs low on a critical resource — memory, temporary (ephemeral) storage, or process IDs — the kubelet begins evicting pods to reclaim that resource. This is different from preemption (which is the scheduler removing lower-priority pods to make room for higher-priority ones) and different from API-initiated eviction (which is used during node drain for maintenance). Internally, the kubelet monitors resource usage against configurable eviction thresholds. The default soft eviction threshold for memory is memory.available < 100Mi, and for disk is nodefs.available < 10%. When a threshold is breached, the kubelet sets the corresponding node condition (MemoryPressure, DiskPressure, PIDPressure) and begins ranking pods for eviction. The ranking uses QoS class: BestEffort pods (no resource requests or limits) are evicted first, Burstable pods (requests set but lower than limits) are evicted next based on how much they exceed their requests, and Guaranteed pods (requests equal limits for all containers) are evicted last. At production scale, the most common eviction cause is ephemeral storage exhaustion from container logs, emptyDir volumes, or container writable layers growing unbounded. Memory-based evictions happen when applications have memory leaks or when resource limits are set too low for actual workload requirements. Teams should monitor node conditions, set appropriate resource requests and limits to ensure critical pods get Guaranteed QoS, configure log rotation to prevent disk pressure, and use PodDisruptionBudgets to limit the impact of evictions on service availability. The non-obvious gotcha is that eviction thresholds have both soft and hard variants. Soft evictions give pods a grace period to terminate cleanly, while hard evictions kill pods immediately. If the hard eviction threshold is hit (e.g., memory.available < 50Mi), the kubelet kills pods without waiting for graceful shutdown, which can cause data loss or incomplete request processing. Architects should ensure hard thresholds are never reached by setting soft thresholds with enough buffer.

Code Example

# Check node conditions for resource pressure
kubectl describe node ip-10-0-1-42.ec2.internal | grep -A5 'Conditions' # Shows MemoryPressure, DiskPressure status

# Find eviction events in the namespace
kubectl get events -n payments --field-selector reason=Evicted --sort-by='.lastTimestamp' # Lists evicted pods with reasons

# Check which pod was evicted and why
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.reason}' # Shows 'Evicted'
kubectl get pod payments-api-7d9f8b6c4-evicted -n payments -o jsonpath='{.status.message}' # Shows the resource that triggered eviction

# Check node resource usage
kubectl top node ip-10-0-1-42.ec2.internal # Shows current CPU and memory usage

# Check disk usage on the node (requires node access)
kubectl debug node/ip-10-0-1-42.ec2.internal -it --image=busybox -- df -h # Shows filesystem usage on the node

# Check QoS class of pods to understand eviction priority
kubectl get pods -n payments -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass' # Shows BestEffort, Burstable, or Guaranteed

# Set proper resource requests equal to limits for Guaranteed QoS
# resources:
#   requests:
#     cpu: 250m      # Request equals limit for Guaranteed QoS
#     memory: 512Mi  # Request equals limit for Guaranteed QoS
#   limits:
#     cpu: 250m      # Matches request
#     memory: 512Mi  # Matches request

◈ Architecture Diagram

┌──────────────────────────┐
│ Node Resource Pressure   │
│ Memory < 100Mi           │
└────────────┬─────────────┘
             ↓
┌──────────────────────────┐
│ Eviction Priority        │
│ 1. BestEffort  (first)   │
│ 2. Burstable   (next)    │
│ 3. Guaranteed  (last)    │
└──────────────────────────┘

How do you troubleshoot a pod stuck in CrashLoopBackOff, and what are the most common root causes?

intermediatepodskubernetes

▼

Quick Answer

CrashLoopBackOff means the container starts, crashes, and Kubernetes restarts it with exponential backoff (10s, 20s, 40s, up to 5 minutes). Common causes are application startup errors, missing environment variables or secrets, misconfigured commands or entrypoints, failed health probes, and OOMKilled. Diagnosis uses kubectl logs --previous, kubectl describe pod, and checking exit codes.

Detailed Answer

Think of a light switch connected to a circuit breaker. You flip the switch (container starts), the circuit overloads (container crashes), and the breaker trips (Kubernetes waits before retrying). Each time you try again, the breaker waits longer before allowing another attempt. CrashLoopBackOff is Kubernetes telling you that the container keeps failing and the wait time between restarts is increasing. In Kubernetes, CrashLoopBackOff is not a separate error state — it is the backoff delay that kubelet applies after repeated container crashes. The container exits with a non-zero code, kubelet restarts it after 10 seconds, it crashes again, kubelet waits 20 seconds, then 40, then 80, capping at 300 seconds (5 minutes). The pod status shows CrashLoopBackOff during these waiting periods and Error or Completed when the container actually exits. The most common root causes fall into categories. Application errors: the application throws an unhandled exception during startup because a required database is unreachable, a configuration file is malformed, or a required API key is missing. Configuration errors: the container command or args field is wrong (pointing to a script that does not exist in the image), the image tag points to a version with a different entrypoint, or a required environment variable is not set. Resource errors: the container is OOMKilled immediately on startup because the memory limit is too low for the JVM heap or the application's baseline memory footprint. Probe errors: an aggressive liveness probe kills the container before it finishes starting up, especially for Java applications with long startup times. At production scale, the diagnostic sequence is: first check exit code with kubectl describe pod (exit code 1 = application error, 137 = OOMKilled/SIGKILL, 143 = SIGTERM). Then check previous container logs with kubectl logs --previous since the current container may have already crashed. Check whether the container image recently changed with kubectl rollout history. Verify that ConfigMaps, Secrets, and PersistentVolumeClaims referenced by the pod actually exist in the namespace. The non-obvious gotcha is that CrashLoopBackOff can be caused by a liveness probe that is too aggressive during startup. If the liveness probe starts checking before the application is ready and the initialDelaySeconds is too short, the probe fails, kubelet kills the container, it restarts, and the cycle continues. The fix is to use a startup probe with a longer timeout to protect the liveness probe during application initialization, or to increase the liveness probe's initialDelaySeconds and failureThreshold.

Code Example

# Check pod status and restart count
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments # Shows status CrashLoopBackOff and restart count

# Get the exit code to categorize the failure
kubectl describe pod payments-api-7d9f8b6c4-abc12 -n payments | grep -A10 'Last State' # Exit code 1=app error, 137=OOMKilled

# Check logs from the PREVIOUS crashed container (critical — current container may already be dead)
kubectl logs payments-api-7d9f8b6c4-abc12 -n payments --previous --tail=50 # Shows why the last container died

# Check if required ConfigMaps and Secrets exist
kubectl get configmap payments-config -n payments # Verify ConfigMap exists
kubectl get secret payments-db-credentials -n payments # Verify Secret exists

# Check if the container command is correct by inspecting the image
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.spec.containers[0].command}' # Shows configured command

# Check if OOMKilled is the cause
kubectl get pod payments-api-7d9f8b6c4-abc12 -n payments -o jsonpath='{.status.containerStatuses[0].lastState.terminated}' # Shows reason and exit code

# Fix startup probe to prevent liveness probe from killing slow-starting apps
# startupProbe:
#   httpGet:
#     path: /health        # Startup health endpoint
#     port: 8080           # Application port
#   failureThreshold: 30   # Allow 30 x 10s = 5 minutes to start
#   periodSeconds: 10      # Check every 10 seconds during startup

◈ Architecture Diagram

┌──────────┐
│ Start    │
└────┬─────┘
     ↓
┌──────────┐
│ Crash    │←─── Exit 1: App Error
│ (exit≠0) │←─── Exit 137: OOMKill
└────┬─────┘←─── Exit 143: Probe
     ↓
┌──────────┐
│ Backoff  │
│10→20→40s │
└────┬─────┘
     ↓
┌──────────┐
│ Restart  │
└──────────┘

How do Mimir and Thanos scale Prometheus, and when would you pick one over the other?

advancedmonitoringprometheus

▼

Quick Answer

Both add high availability and long-term storage to Prometheus by deduplicating data from replicas and saving metrics in object storage like S3. Thanos bolts a sidecar onto each existing Prometheus and adds a global query layer. Mimir receives metrics via remote-write into a standalone, horizontally scalable system. Thanos is easier to adopt; Mimir scales further but needs more infrastructure.

Detailed Answer

Think of Prometheus like a single security camera with a local hard drive. It records great footage but the drive fills up in two weeks, and if the camera breaks, you lose everything. Thanos is like adding a cloud backup to each existing camera -- a sidecar uploads footage to cheap cloud storage, and a central monitor can search all cameras at once. Mimir is like replacing the local drives entirely -- all cameras stream footage straight to a centralized, scalable recording system in the cloud. Thanos achieves high availability by running a sidecar container next to each Prometheus instance. The sidecar does two things: it uploads Prometheus's local two-hour TSDB blocks to object storage such as S3, GCS, or MinIO, and it exposes a gRPC Store API that the Thanos Query component can reach. Thanos Query acts as a global query layer. It fans out PromQL queries to all sidecars for recent data and to the Store Gateway for historical data in object storage, then deduplicates results from high-availability Prometheus pairs using external labels. A separate Thanos Compactor merges and downsamples blocks in object storage so long-range queries stay fast. Mimir, originally called Cortex and donated by Grafana Labs, works differently. Instead of sidecars, each Prometheus sends metrics via remote-write to Mimir's Distributor component. Mimir is built as a set of microservices that you can scale independently. Distributors accept writes and shard them across Ingesters, which hold recent data in memory and periodically flush blocks to object storage. Queriers answer PromQL requests by reading from both Ingesters for fresh data and Store Gateways for older data. Because each piece scales on its own, Mimir handles very large workloads without bottlenecks. The operational trade-offs come down to complexity versus capability. Thanos is easier to adopt because you keep your existing Prometheus instances and just attach sidecars -- no change to the write path. But the sidecar model means every Prometheus still needs local disk for TSDB blocks, and queries fan out to many sidecars, which can slow down in large environments. Mimir offloads storage entirely from Prometheus so it becomes almost stateless, giving you better multi-tenancy, faster global queries, and no local disk dependency. The trade-off is that you now operate a distributed system with Distributors, Ingesters, Compactors, and Store Gateways, each needing proper resource tuning. At scale, say 50-plus Prometheus instances and 10 million or more active series, Mimir generally performs better because queries hit centralized, pre-indexed storage rather than fanning out to dozens of sidecars. For smaller setups with 5 to 10 Prometheus instances, Thanos is simpler and perfectly fine. A common migration path is to start with Thanos, then move to Mimir when query performance becomes a bottleneck.

Code Example

# ─── Thanos Architecture ───
# Sidecar alongside each Prometheus instance
# prometheus.yaml (add external labels for dedup)
global:
  external_labels:
    cluster: payments-prod        # Identifies this Prometheus
    replica: prometheus-0         # HA pair identifier

# Thanos Sidecar container (runs next to Prometheus)
# --tsdb.path=/prometheus        # Reads Prometheus TSDB
# --objstore.config-file=bucket.yml  # S3 bucket config
# --grpc-address=0.0.0.0:10901  # Store API endpoint

# Thanos Query (global query layer)
# --store=prometheus-0-sidecar:10901
# --store=prometheus-1-sidecar:10901
# --store=thanos-store-gateway:10901
# --query.replica-label=replica  # Dedup HA pairs

# ─── Mimir Architecture ───
# Prometheus remote-writes to Mimir
# prometheus.yaml
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: payments-team  # Multi-tenancy

# Mimir components (each independently scalable):
# Distributor  → receives writes, shards to ingesters
# Ingester     → buffers recent data, flushes to S3
# Querier      → handles PromQL queries
# Store Gateway→ serves historical data from S3
# Compactor    → merges blocks, downsamples

# Query both systems:
# Thanos: http://thanos-query:9090 (PromQL-compatible)
# Mimir:  http://mimir-querier:8080/prometheus (PromQL-compatible)

◈ Architecture Diagram

Thanos Architecture:
  ┌────────────┐  ┌────────────┐
  │Prometheus-0│  │Prometheus-1│
  │ + Sidecar  │  │ + Sidecar  │
  └─────┬──────┘  └─────┬──────┘
        │ upload        │ upload
        ▼               ▼
  ┌──────────────────────────┐
  │     Object Storage (S3)  │
  └──────────┬───────────────┘
             │
  ┌──────────┴───────────────┐
  │      Thanos Query        │
  │  fans out to sidecars    │
  │  + Store Gateway         │
  │  deduplicates HA pairs   │
  └──────────────────────────┘

  Mimir Architecture:
  ┌────────────┐  ┌────────────┐
  │Prometheus-0│  │Prometheus-1│
  │remote-write│  │remote-write│
  └─────┬──────┘  └─────┬──────┘
        │               │
        ▼               ▼
  ┌──────────────────────────┐
  │     Distributor          │
  └──────────┬───────────────┘
             │ shards
     ┌───────┼───────┐
     ▼       ▼       ▼
  ┌──────┐┌──────┐┌──────┐
  │Ingest││Ingest││Ingest│
  └──┬───┘└──┬───┘└──┬───┘
     │ flush │       │
     ▼       ▼       ▼
  ┌──────────────────────────┐
  │     Object Storage (S3)  │
  └──────────────────────────┘

How does Thanos enable long-term Prometheus storage and global queries across clusters?

architectmonitoringprometheus

▼

Quick Answer

Thanos uploads Prometheus TSDB blocks to object storage for long-term retention and provides a global query layer that fans out across multiple Prometheus instances. Architects must size the Store Gateway for index caching, configure compaction for downsampling, and watch query latency, storage costs, and compaction lag.

Detailed Answer

Think of a library system across a city. Each branch library (Prometheus instance) keeps recent books on its shelves, but older books go to a central warehouse (object storage). When a researcher wants to search across all branches and the warehouse at the same time, a central catalog system (Thanos Querier) knows where every book is and fetches it from the right place. Thanos does exactly this for Prometheus metrics. Thanos was built to fix two basic Prometheus limitations: local storage is not durable (if the node dies, metrics are lost), and a single Prometheus can only query its own data. In a multi-cluster environment with 15 Kubernetes clusters, each running its own Prometheus, there is no built-in way to query across all clusters or keep metrics beyond the local retention period (typically 15-30 days). Thanos adds a sidecar to each Prometheus that uploads completed TSDB blocks to object storage (S3, GCS, or Azure Blob), a Store Gateway that serves historical blocks from object storage, a Querier that merges results from sidecars and Store Gateways, and a Compactor that downsamples and compacts blocks for faster long-range queries. Here is the detailed flow. Prometheus writes 2-hour TSDB blocks to local disk. The Thanos Sidecar watches the Prometheus data directory and uploads completed blocks to the object storage bucket. Each block contains a meta.json describing its time range, labels, and resolution. The Querier implements the Prometheus HTTP API and receives PromQL queries. It fans out to all connected StoreAPI endpoints -- sidecars for recent data and Store Gateways for historical data -- deduplicates overlapping series using external labels, and returns merged results. The Compactor runs as a singleton (meaning exactly one instance), downloading blocks from object storage, merging overlapping blocks, creating downsampled versions at 5-minute and 1-hour resolutions, and re-uploading the compacted blocks. This cuts storage costs and speeds up queries over long time ranges. At production scale, the Store Gateway is the most resource-hungry component because it must cache block index headers in memory to answer queries quickly. A cluster with 500 million active time series and 1 year of retention may have hundreds of thousands of blocks, requiring Store Gateway instances with 32-64 GB of memory for index caching. The Compactor must keep up with block production -- if it falls behind, queries over historical data slow down because the Querier has to open many small blocks instead of a few large ones. Architects should watch thanos_compact_group_compactions_failures_total, thanos_store_bucket_cache_hits_total, thanos_query_store_api_duration_seconds, and overall object storage bucket size and cost. The sneaky gotcha is that Thanos deduplication depends on consistent external labels. If two Prometheus instances scrape the same targets but have different or missing external labels, the Querier cannot deduplicate correctly and returns duplicate series that produce wrong aggregation results. Another trap is that the Compactor is a singleton -- running two Compactors against the same bucket causes data corruption because they fight over the same blocks. Teams using Thanos Compactor must guarantee exactly-one behavior, typically via a Kubernetes Deployment with replicas: 1 and a PodDisruptionBudget that prevents eviction during compaction.

Code Example

# Deploy Thanos Sidecar alongside Prometheus using Helm values
# values-prometheus.yaml for kube-prometheus-stack
# prometheus:
#   prometheusSpec:
#     replicas: 2 # HA Prometheus pair in each cluster
#     retention: 6h # Short local retention since Thanos handles long-term
#     externalLabels:
#       cluster: payments-prod-us-east # Unique label for deduplication
#       region: us-east-1 # Region label for filtering queries
#     thanos:
#       objectStorageConfig:
#         existingSecret:
#           name: thanos-objstore-config # Secret containing S3 bucket config
#           key: objstore.yml # Key within the secret

# thanos-objstore-config secret content
# objstore.yml:
# type: S3
# config:
#   bucket: company-thanos-metrics-prod
#   endpoint: s3.us-east-1.amazonaws.com
#   region: us-east-1

# Deploy Thanos Querier that connects to all cluster sidecars and store gateways
kubectl apply -f thanos-querier.yaml

# thanos-querier.yaml
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages the Querier replicas
metadata:
  name: thanos-querier # Central query component
  namespace: monitoring # Observability namespace
spec:
  replicas: 3 # Three replicas for high availability
  selector:
    matchLabels:
      app: thanos-querier # Pod selector
  template:
    metadata:
      labels:
        app: thanos-querier # Label for Service discovery
    spec:
      containers:
      - name: querier # Thanos Querier container
        image: quay.io/thanos/thanos:v0.36.1 # Pinned Thanos version
        args:
        - query # Run in query mode
        - --grpc-address=0.0.0.0:10901 # gRPC address for other components
        - --http-address=0.0.0.0:9090 # HTTP address for PromQL API
        - --endpoint=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc # Discover sidecars via DNS SRV
        - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc # Discover store gateways via DNS SRV
        - --query.replica-label=prometheus_replica # Deduplicate HA Prometheus pairs
        ports:
        - containerPort: 9090 # HTTP port for Grafana and API queries
          name: http # Port name for Service
        - containerPort: 10901 # gRPC port for inter-component communication
          name: grpc # Port name for Service

# Query across all clusters for the payment API error rate
# curl http://thanos-querier:9090/api/v1/query --data-urlencode \
#   'query=sum(rate(http_requests_total{service="payments-api",code=~"5.."}[5m])) by (cluster)'

◈ Architecture Diagram

┌──────────┐  ┌──────────┐
│Prometheus│  │Prometheus│
│Cluster A │  │Cluster B │
└────┬─────┘  └────┬─────┘
     │ Sidecar     │ Sidecar
     ↓             ↓
┌─────────────────────────┐
│    Object Storage (S3)  │
└────────────┬────────────┘
             │
     ┌───────┴───────┐
     ↓               ↓
┌──────────┐  ┌──────────┐
│Store GW  │  │Compactor │
└────┬─────┘  └──────────┘
     │
┌────┴─────┐
│ Querier  │
└──────────┘

How does Prometheus federation work, and when should you use it versus Thanos or remote write?

architectmonitoringprometheus

▼

Quick Answer

Hierarchical federation pulls pre-aggregated metrics from lower-level Prometheus instances into a global one for fleet-wide dashboards. Cross-service federation pulls specific metrics from another team's Prometheus. Use federation for aggregated views or targeted sharing, but pick Thanos or remote write when you need full-resolution global queries or long-term retention.

Detailed Answer

Think of a news organization. Hierarchical federation is like regional bureaus sending headline summaries to the national desk -- the national editor gets the big picture without reading every local article. Cross-service federation is like the sports desk borrowing a specific stat from the finance desk's data feed. In both cases, only selected information flows upward or sideways, not everything. Prometheus federation uses the /federate endpoint to expose a subset of metrics from one Prometheus instance so another Prometheus can scrape them. In hierarchical federation, a global Prometheus sits above multiple datacenter or cluster-level Prometheus instances. Each lower-level instance runs recording rules that pre-aggregate raw metrics into summary time series -- for example, computing the 99th percentile request latency per service every minute. The global Prometheus scrapes only these aggregated metrics from the /federate endpoint of each lower-level instance, giving it a fleet-wide view without ingesting the raw per-pod or per-instance metrics. Under the hood, the /federate endpoint accepts match[] parameters that filter which metrics are exposed. The global Prometheus configures a scrape job with honor_labels: true to keep the original labels from the source instances, and metrics_path: /federate with params that specify the match expressions. Cross-service federation uses the same mechanism but horizontally: the payments team's Prometheus scrapes specific metrics like http_requests_total or circuit_breaker_state from the user-auth team's Prometheus to monitor a critical dependency. The match expression is narrow, pulling only the exact metrics needed rather than the entire dataset. At production scale, hierarchical federation works well when the global Prometheus only needs aggregated views -- overall error rates, cluster-level resource utilization, SLO compliance percentages. It falls short when engineers need to drill into full-resolution metrics for incident investigation because the global instance only has pre-aggregated data. That is where Thanos or remote write to a central TSDB like Cortex, Mimir, or VictoriaMetrics becomes necessary. Remote write pushes all raw metrics from every Prometheus to a central store, enabling full-resolution global queries at the cost of more storage and ingestion infrastructure. Pick hierarchical federation for cost-effective fleet-wide dashboards, cross-service federation for targeted dependency monitoring with minimal coupling, and Thanos or remote write when incident responders need full-resolution cross-cluster queries. The sneaky gotcha is that federation creates a scrape-interval dependency. The global Prometheus scrapes the /federate endpoint at its own interval (typically 60 seconds), but the source metrics were produced at the source's scrape interval (typically 15-30 seconds). This creates staleness windows where the global view lags behind reality. If the global scrape interval is longer than twice the source's recording rule evaluation interval, data points can be missed entirely. Another trap is that /federate is expensive for the source Prometheus -- serving thousands of metrics via /federate on every scrape adds CPU and memory pressure to the source. Teams should use recording rules to keep the number of series exposed via /federate small rather than using broad match expressions that pull raw metrics.

Code Example

# Recording rule on the cluster-level Prometheus to pre-aggregate for federation
# rules/payments-aggregation.yaml
apiVersion: monitoring.coreos.com/v1 # Prometheus Operator CRD
kind: PrometheusRule # Defines recording and alerting rules
metadata:
  name: payments-federation-rules # Rules for federation aggregation
  namespace: monitoring # Monitoring namespace
spec:
  groups:
  - name: payments-federation # Rule group name
    interval: 30s # Evaluate every 30 seconds
    rules:
    - record: cluster:http_request_duration_seconds:p99 # Pre-aggregated p99 latency
      expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="payments-api"}[5m])) by (le, cluster))
    - record: cluster:http_requests_total:rate5m # Pre-aggregated request rate
      expr: sum(rate(http_requests_total{service="payments-api"}[5m])) by (cluster, code)
    - record: cluster:up:ratio # Availability ratio across all targets
      expr: count(up{job="payments-api"} == 1) / count(up{job="payments-api"})

# Global Prometheus scrape config for hierarchical federation
# prometheus-global.yaml (scrape_configs section)
# scrape_configs:
#   - job_name: 'federate-us-east'
#     honor_labels: true # Preserve original labels from source Prometheus
#     metrics_path: '/federate' # Use the federation endpoint
#     params:
#       'match[]': # Filter to only pull pre-aggregated metrics
#         - '{__name__=~"cluster:.*"}' # Match all cluster-level recording rules
#     static_configs:
#       - targets:
#         - 'prometheus-us-east.monitoring.svc:9090' # US East cluster Prometheus
#         labels:
#           source_cluster: us-east-prod # Label identifying the source
#   - job_name: 'federate-eu-west'
#     honor_labels: true # Preserve original labels from source
#     metrics_path: '/federate' # Federation endpoint
#     params:
#       'match[]':
#         - '{__name__=~"cluster:.*"}' # Same filter pattern
#     static_configs:
#       - targets:
#         - 'prometheus-eu-west.monitoring.svc:9090' # EU West cluster Prometheus
#         labels:
#           source_cluster: eu-west-prod # Source identification label

# Cross-service federation: payments team scrapes auth service circuit breaker status
# Added to the payments team's Prometheus scrape config
# - job_name: 'cross-federate-auth'
#   honor_labels: true
#   metrics_path: '/federate'
#   params:
#     'match[]':
#       - 'circuit_breaker_state{service="user-auth-service"}'
#       - 'http_requests_total{service="user-auth-service",code=~"5.."}'  
#   static_configs:
#     - targets: ['prometheus-auth.monitoring.svc:9090']

◈ Architecture Diagram

┌─────────────────────────┐
│   Global Prometheus     │
│   (aggregated view)     │
└────┬──────────────┬─────┘
     │ /federate    │ /federate
┌────┴─────┐  ┌────┴─────┐
│Cluster A │  │Cluster B │
│Prometheus│  │Prometheus│
│rec. rules│  │rec. rules│
└────┬─────┘  └────┬─────┘
     │ scrape      │ scrape
┌────┴─────┐  ┌────┴─────┐
│ Targets  │  │ Targets  │
└──────────┘  └──────────┘

Thanos Sidecar vs Thanos Receive -- when should you pick each one?

architectmonitoringprometheus

▼

Quick Answer

Thanos Sidecar is pull-based: it uploads completed Prometheus TSDB blocks to object storage and serves recent data via StoreAPI. Thanos Receive is push-based: Prometheus remote-writes metrics to a stateful Receive cluster that replicates and uploads blocks. Pick Sidecar for simpler Kubernetes-native setups, and Receive when Prometheus cannot reach object storage directly or you need multi-tenancy.

Detailed Answer

Think of two ways to archive office documents. The Sidecar approach is like each department scanning its own files and uploading them to cloud storage -- simple and decentralized, but each department needs cloud access. The Receive approach is like all departments sending their files to a central mailroom that handles scanning, copies, and archiving -- more infrastructure in the middle, but departments need no cloud access and the mailroom can sort files by department. Thanos Sidecar runs as a container alongside each Prometheus Pod. It watches the Prometheus data directory for completed TSDB blocks (every 2 hours) and uploads them to object storage. It also exposes a StoreAPI gRPC endpoint that the Querier uses to access recent data still in Prometheus's local TSDB. This model is pull-based: Prometheus writes to local disk as usual, and the Sidecar pulls completed blocks into object storage. The main advantages are simplicity (no extra stateful components), tight coupling with the Prometheus lifecycle, and the ability to serve recent data with zero extra latency since it reads directly from Prometheus's local TSDB. Thanos Receive implements the Prometheus remote write API. Prometheus instances are configured with remote_write to send metrics to a Receive cluster. Receive ingests the data, applies tenant labels, replicates across Receive instances for durability (configurable replication factor, typically 3), and writes TSDB blocks to local disk before uploading to object storage. The Receive cluster is stateful and must be carefully sized for ingestion throughput, disk IOPS, and memory. It serves both recent and locally-stored historical data via StoreAPI. At production scale, the decision depends on your infrastructure and organizational needs. Sidecar is the go-to pattern when Prometheus runs in Kubernetes with direct access to object storage (S3, GCS), when the team wants minimal extra infrastructure, and when HA is handled by running duplicate Prometheus pairs with the Querier deduplicating via external labels. Receive is the better choice when Prometheus runs in edge locations, on-premise datacenters, or environments where direct object storage access is blocked by network policy or compliance rules. Receive also enables multi-tenancy: each tenant's metrics can be routed to specific Receive instances with tenant-level resource limits and retention. The Receive hashring distributes incoming series across instances based on tenant and metric labels, allowing horizontal scaling of ingestion. The sneaky gotcha with Sidecar is the 2-hour upload delay -- completed TSDB blocks are only uploaded after Prometheus compacts them, so there is a window where data exists only on Prometheus's local disk. If the Prometheus Pod crashes before upload, that block can be lost (though HA pairs reduce this risk). With Receive, the gotcha is operational complexity: Receive is a stateful distributed system that requires careful hashring configuration, anti-affinity scheduling, and monitoring of replication lag. If a Receive instance falls behind on ingestion, back-pressure can cause Prometheus remote write queues to grow, eventually dropping data. Architects must size Receive for peak ingestion rate plus a 30 percent buffer, and watch thanos_receive_write_failures_total and remote_storage_queue_highest_sent_timestamp_seconds.

Code Example

# Sidecar pattern: Prometheus with Thanos Sidecar in Kubernetes
# kube-prometheus-stack Helm values
# prometheus:
#   prometheusSpec:
#     replicas: 2 # HA pair for redundancy
#     retention: 4h # Short retention since Sidecar handles long-term
#     externalLabels:
#       cluster: payments-prod # Unique cluster label for deduplication
#     thanos:
#       image: quay.io/thanos/thanos:v0.36.1 # Sidecar image
#       objectStorageConfig:
#         existingSecret:
#           name: thanos-s3-config # S3 bucket configuration
#           key: objstore.yml

# Receive pattern: Prometheus remote-writes to Thanos Receive
# prometheus-remote-write.yaml (Prometheus config)
# remote_write:
#   - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
#     headers:
#       THANOS-TENANT: payments # Multi-tenant header for isolation
#     queue_config:
#       capacity: 10000 # Buffer capacity before dropping
#       max_shards: 30 # Parallel write shards
#       min_shards: 3 # Minimum active shards
#       max_samples_per_send: 5000 # Batch size per remote write request

# Thanos Receive StatefulSet
apiVersion: apps/v1 # Stable StatefulSet API
kind: StatefulSet # Stateful for persistent storage
metadata:
  name: thanos-receive # Receive component
  namespace: monitoring # Observability namespace
spec:
  replicas: 3 # Three instances for replication factor 3
  serviceName: thanos-receive # Headless service for peer discovery
  selector:
    matchLabels:
      app: thanos-receive # Pod selector
  template:
    metadata:
      labels:
        app: thanos-receive # Label for Service and Querier discovery
    spec:
      containers:
      - name: receive # Thanos Receive container
        image: quay.io/thanos/thanos:v0.36.1 # Pinned version
        args:
        - receive # Run in receive mode
        - --grpc-address=0.0.0.0:10901 # gRPC for Querier StoreAPI
        - --http-address=0.0.0.0:10902 # HTTP for health and metrics
        - --remote-write.address=0.0.0.0:19291 # Remote write ingestion endpoint
        - --receive.replication-factor=3 # Replicate to all 3 instances
        - --receive.hashrings-file=/etc/thanos/hashring.json # Hashring config
        - --tsdb.path=/data/receive # Local TSDB storage path
        - --tsdb.retention=6h # Keep blocks locally before upload
        - --objstore.config-file=/etc/thanos/objstore.yml # S3 upload config
        ports:
        - containerPort: 19291 # Remote write ingestion port
          name: remote-write # Port name
        - containerPort: 10901 # gRPC StoreAPI port
          name: grpc # Port name
        volumeMounts:
        - name: data # Persistent volume for TSDB blocks
          mountPath: /data/receive # Mount path matching tsdb.path
  volumeClaimTemplates:
  - metadata:
      name: data # PVC name for each replica
    spec:
      accessModes: [ReadWriteOnce] # Single-node access
      resources:
        requests:
          storage: 100Gi # Storage for local TSDB blocks before upload

◈ Architecture Diagram

┌── Sidecar Pattern ──┐  ┌── Receive Pattern ──┐
│                     │  │                     │
│ ┌────────┐          │  │ ┌────────┐          │
│ │Prom    │          │  │ │Prom    │          │
│ │+ Sidecar──→ S3    │  │ │        │──remote  │
│ └────────┘   upload │  │ └───┬────┘   write  │
│              (pull) │  │     ↓        (push) │
│                     │  │ ┌────────┐          │
│                     │  │ │Receive │──→ S3    │
│                     │  │ │(x3 HA) │          │
│                     │  │ └────────┘          │
└─────────────────────┘  └─────────────────────┘