DigitalOcean

4 interview questions · prometheus

How do Mimir and Thanos scale Prometheus, and when would you pick one over the other?

advanced · monitoring · prometheus

Quick Answer

Both add high availability and long-term storage to Prometheus by deduplicating data from replicas and saving metrics in object storage like S3. Thanos bolts a sidecar onto each existing Prometheus and adds a global query layer. Mimir receives metrics via remote-write into a standalone, horizontally scalable system. Thanos is easier to adopt; Mimir scales further but needs more infrastructure.

Detailed Answer

Think of Prometheus as a single security camera with a local hard drive. It records great footage, but the drive fills up in two weeks, and if the camera breaks, you lose everything. Thanos is like adding a cloud backup to each existing camera: a sidecar uploads footage to cheap cloud storage, and a central monitor can search all cameras at once. Mimir is like replacing the local drives entirely: all cameras stream footage straight to a centralized, scalable recording system in the cloud.

Thanos achieves high availability by running a sidecar container next to each Prometheus instance. The sidecar does two things: it uploads Prometheus's local two-hour TSDB blocks to object storage such as S3, GCS, or MinIO, and it exposes a gRPC Store API that the Thanos Query component can reach. Thanos Query acts as a global query layer. It fans out PromQL queries to all sidecars for recent data and to the Store Gateway for historical data in object storage, then deduplicates results from high-availability Prometheus pairs by ignoring the external label that identifies each replica. A separate Thanos Compactor merges and downsamples blocks in object storage so long-range queries stay fast.

Mimir, which Grafana Labs created as a fork of Cortex, works differently. Instead of sidecars, each Prometheus sends metrics via remote-write to Mimir's Distributor component. Mimir is built as a set of microservices that you can scale independently. Distributors accept writes and shard them across Ingesters, which hold recent data in memory and periodically flush blocks to object storage. Queriers answer PromQL requests by reading from both Ingesters for fresh data and Store Gateways for older data. Because each piece scales on its own, Mimir can handle very large workloads without a single component becoming the bottleneck.

The operational trade-offs come down to complexity versus capability. Thanos is easier to adopt because you keep your existing Prometheus instances and just attach sidecars, with no change to the write path. But the sidecar model means every Prometheus still needs local disk for TSDB blocks, and queries fan out to many sidecars, which can slow down in large environments. Mimir moves long-term storage out of Prometheus, so each Prometheus becomes almost stateless and needs only a small local disk for its write-ahead log. In return you get better multi-tenancy and faster global queries. The trade-off is that you now operate a distributed system with Distributors, Ingesters, Compactors, and Store Gateways, each needing proper resource tuning.

At scale, say 50-plus Prometheus instances and 10 million or more active series, Mimir generally performs better because queries hit centralized, pre-indexed storage rather than fanning out to dozens of sidecars. For smaller setups with 5 to 10 Prometheus instances, Thanos is simpler and perfectly fine. A common migration path is to start with Thanos, then move to Mimir when query performance becomes a bottleneck.
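The deduplication step is easier to picture with a small sketch. The Python below is a simplified illustration, not how Thanos implements it: the function name `dedup_by_replica` and the sample data are hypothetical, and it merges samples per timestamp, whereas the real Querier uses a more careful strategy for choosing between replicas. The idea it shows is the same: two series that differ only in the replica label are treated as one.

```python
from collections import defaultdict


def dedup_by_replica(series, replica_label="replica"):
    """Merge series that differ only in the replica label (simplified)."""
    merged = defaultdict(dict)
    for labels, samples in series:
        # Identity of the series once the replica label is ignored.
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            # Keep the first sample seen for each timestamp.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


# Two HA replicas scraped the same target; each one missed a scrape.
series = [
    ({"__name__": "up", "cluster": "payments-prod", "replica": "prometheus-0"},
     [(0, 1.0), (15, 1.0)]),
    ({"__name__": "up", "cluster": "payments-prod", "replica": "prometheus-1"},
     [(15, 1.0), (30, 1.0)]),
]
result = dedup_by_replica(series)
for key, points in result.items():
    print(dict(key), points)
```

If the two replicas carried different values for any other external label, such as `cluster`, the keys would no longer match and both series would survive, which is why consistent external labels matter.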

Code Example

# ─── Thanos Architecture ───
# Sidecar alongside each Prometheus instance
# prometheus.yaml (add external labels for dedup)
global:
  external_labels:
    cluster: payments-prod        # Identifies this Prometheus
    replica: prometheus-0         # HA pair identifier

# Thanos Sidecar container (runs next to Prometheus)
# --tsdb.path=/prometheus        # Reads Prometheus TSDB
# --prometheus.url=http://localhost:9090  # Prometheus in the same Pod
# --objstore.config-file=bucket.yml  # S3 bucket config
# --grpc-address=0.0.0.0:10901  # Store API endpoint

# Thanos Query (global query layer)
# --endpoint replaces the deprecated --store flag
# --endpoint=prometheus-0-sidecar:10901
# --endpoint=prometheus-1-sidecar:10901
# --endpoint=thanos-store-gateway:10901
# --query.replica-label=replica  # Dedup HA pairs

# ─── Mimir Architecture ───
# Prometheus remote-writes to Mimir
# prometheus.yaml
remote_write:
  - url: http://mimir-distributor:8080/api/v1/push
    headers:
      X-Scope-OrgID: payments-team  # Multi-tenancy

# Mimir components (each independently scalable):
# Distributor  → receives writes, shards to ingesters
# Ingester     → buffers recent data, flushes to S3
# Querier      → handles PromQL queries
# Store Gateway→ serves historical data from S3
# Compactor    → merges and deduplicates blocks (Mimir does not downsample)

# Query both systems:
# Thanos: http://thanos-query:9090 (PromQL-compatible)
# Mimir:  http://mimir-querier:8080/prometheus (PromQL-compatible)

◈ Architecture Diagram

Thanos Architecture:
  ┌────────────┐  ┌────────────┐
  │Prometheus-0│  │Prometheus-1│
  │ + Sidecar  │  │ + Sidecar  │
  └─────┬──────┘  └─────┬──────┘
        │ upload        │ upload
        ▼               ▼
  ┌──────────────────────────┐
  │     Object Storage (S3)  │
  └──────────┬───────────────┘
             │
  ┌──────────┴───────────────┐
  │      Thanos Query        │
  │  fans out to sidecars    │
  │  + Store Gateway         │
  │  deduplicates HA pairs   │
  └──────────────────────────┘

  Mimir Architecture:
  ┌────────────┐  ┌────────────┐
  │Prometheus-0│  │Prometheus-1│
  │remote-write│  │remote-write│
  └─────┬──────┘  └─────┬──────┘
        │               │
        ▼               ▼
  ┌──────────────────────────┐
  │     Distributor          │
  └──────────┬───────────────┘
             │ shards
     ┌───────┼───────┐
     ▼       ▼       ▼
  ┌──────┐┌──────┐┌──────┐
  │Ingest││Ingest││Ingest│
  └──┬───┘└──┬───┘└──┬───┘
     │ flush │       │
     ▼       ▼       ▼
  ┌──────────────────────────┐
  │     Object Storage (S3)  │
  └──────────────────────────┘

How does Thanos enable long-term Prometheus storage and global queries across clusters?

architect · monitoring · prometheus

Quick Answer

Thanos uploads Prometheus TSDB blocks to object storage for long-term retention and provides a global query layer that fans out across multiple Prometheus instances. Architects must size the Store Gateway for index caching, configure compaction for downsampling, and watch query latency, storage costs, and compaction lag.

Detailed Answer

Think of a library system across a city. Each branch library (Prometheus instance) keeps recent books on its shelves, but older books go to a central warehouse (object storage). When a researcher wants to search across all branches and the warehouse at the same time, a central catalog system (Thanos Querier) knows where every book is and fetches it from the right place. Thanos does exactly this for Prometheus metrics.

Thanos was built to fix two basic Prometheus limitations: local storage is not durable (if the node dies, metrics are lost), and a single Prometheus can only query its own data. In a multi-cluster environment with 15 Kubernetes clusters, each running its own Prometheus, there is no built-in way to query across all clusters or keep metrics beyond the local retention period (typically 15-30 days). Thanos adds a sidecar to each Prometheus that uploads completed TSDB blocks to object storage (S3, GCS, or Azure Blob), a Store Gateway that serves historical blocks from object storage, a Querier that merges results from sidecars and Store Gateways, and a Compactor that downsamples and compacts blocks for faster long-range queries.

Here is the detailed flow. Prometheus writes 2-hour TSDB blocks to local disk. The Thanos Sidecar watches the Prometheus data directory and uploads completed blocks to the object storage bucket. Each block contains a meta.json describing its time range, labels, and resolution. The Querier implements the Prometheus HTTP API and receives PromQL queries. It fans out to all connected StoreAPI endpoints -- sidecars for recent data and Store Gateways for historical data -- deduplicates overlapping series using external labels, and returns merged results. The Compactor runs as a singleton (meaning exactly one instance), downloading blocks from object storage, merging overlapping blocks, creating downsampled versions at 5-minute and 1-hour resolutions, and re-uploading the compacted blocks. This cuts storage costs and speeds up queries over long time ranges.

At production scale, the Store Gateway is the most resource-hungry component because it must cache block index headers in memory to answer queries quickly. A cluster with 500 million active time series and 1 year of retention may have hundreds of thousands of blocks, requiring Store Gateway instances with 32-64 GB of memory for index caching. The Compactor must keep up with block production -- if it falls behind, queries over historical data slow down because the Querier has to open many small blocks instead of a few large ones. Architects should watch thanos_compact_group_compactions_failures_total, thanos_store_bucket_cache_hits_total, thanos_query_store_api_duration_seconds, and overall object storage bucket size and cost.

The sneaky gotcha is that Thanos deduplication depends on consistent external labels. If two Prometheus instances scrape the same targets but have different or missing external labels, the Querier cannot deduplicate correctly and returns duplicate series that produce wrong aggregation results. Another trap is that the Compactor is a singleton -- running two Compactors against the same bucket causes data corruption because they fight over the same blocks. Teams using Thanos Compactor must guarantee exactly-one behavior, typically via a Kubernetes Deployment with replicas: 1 and a PodDisruptionBudget that prevents eviction during compaction.
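To make the downsampling step concrete, here is a minimal Python sketch of the idea. It is an illustration under stated assumptions, not the Compactor's code: the function name `downsample` is hypothetical, and it assumes the aggregates kept per window are count, sum, min, and max. The real implementation also has to deal with counter resets and chunk encoding.

```python
def downsample(samples, window_seconds=300):
    """Group (timestamp, value) samples into fixed windows and keep aggregates."""
    windows = {}
    for ts, value in samples:
        # Start of the window this sample falls into.
        start = ts - (ts % window_seconds)
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return dict(sorted(windows.items()))


# One sample every 15 seconds for 10 minutes: 40 raw samples.
raw = [(t, float(t // 15)) for t in range(0, 600, 15)]
downsampled = downsample(raw)
for start, agg in downsampled.items():
    print(start, agg)
```

Forty raw samples collapse into two 5-minute windows here, which is where the savings on long-range queries come from: a query over a year reads far fewer points when it can use the aggregated windows instead of raw samples.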

Code Example

# Deploy Thanos Sidecar alongside Prometheus using Helm values
# values-prometheus.yaml for kube-prometheus-stack
# prometheus:
#   prometheusSpec:
#     replicas: 2 # HA Prometheus pair in each cluster
#     retention: 6h # Short local retention since Thanos handles long-term
#     externalLabels:
#       cluster: payments-prod-us-east # Unique label for deduplication
#       region: us-east-1 # Region label for filtering queries
#     thanos:
#       objectStorageConfig:
#         existingSecret:
#           name: thanos-objstore-config # Secret containing S3 bucket config
#           key: objstore.yml # Key within the secret

# thanos-objstore-config secret content
# objstore.yml:
# type: S3
# config:
#   bucket: company-thanos-metrics-prod
#   endpoint: s3.us-east-1.amazonaws.com
#   region: us-east-1

# Deploy Thanos Querier that connects to all cluster sidecars and store gateways
kubectl apply -f thanos-querier.yaml

# thanos-querier.yaml
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages the Querier replicas
metadata:
  name: thanos-querier # Central query component
  namespace: monitoring # Observability namespace
spec:
  replicas: 3 # Three replicas for high availability
  selector:
    matchLabels:
      app: thanos-querier # Pod selector
  template:
    metadata:
      labels:
        app: thanos-querier # Label for Service discovery
    spec:
      containers:
      - name: querier # Thanos Querier container
        image: quay.io/thanos/thanos:v0.36.1 # Pinned Thanos version
        args:
        - query # Run in query mode
        - --grpc-address=0.0.0.0:10901 # gRPC address for other components
        - --http-address=0.0.0.0:9090 # HTTP address for PromQL API
        - --endpoint=dnssrv+_grpc._tcp.thanos-sidecar.monitoring.svc # Discover sidecars via DNS SRV
        - --endpoint=dnssrv+_grpc._tcp.thanos-store.monitoring.svc # Discover store gateways via DNS SRV
        - --query.replica-label=prometheus_replica # Deduplicate HA Prometheus pairs
        ports:
        - containerPort: 9090 # HTTP port for Grafana and API queries
          name: http # Port name for Service
        - containerPort: 10901 # gRPC port for inter-component communication
          name: grpc # Port name for Service

# Query across all clusters for the payment API error rate
# curl http://thanos-querier:9090/api/v1/query --data-urlencode \
#   'query=sum(rate(http_requests_total{service="payments-api",code=~"5.."}[5m])) by (cluster)'

◈ Architecture Diagram

┌──────────┐  ┌──────────┐
│Prometheus│  │Prometheus│
│Cluster A │  │Cluster B │
└────┬─────┘  └────┬─────┘
     │ Sidecar     │ Sidecar
     ↓             ↓
┌─────────────────────────┐
│    Object Storage (S3)  │
└────────────┬────────────┘
             │
     ┌───────┴───────┐
     ↓               ↓
┌──────────┐  ┌──────────┐
│Store GW  │  │Compactor │
└────┬─────┘  └──────────┘
     │
┌────┴─────┐
│ Querier  │
└──────────┘

How does Prometheus federation work, and when should you use it versus Thanos or remote write?

architect · monitoring · prometheus

Quick Answer

Hierarchical federation pulls pre-aggregated metrics from lower-level Prometheus instances into a global one for fleet-wide dashboards. Cross-service federation pulls specific metrics from another team's Prometheus. Use federation for aggregated views or targeted sharing, but pick Thanos or remote write when you need full-resolution global queries or long-term retention.

Detailed Answer

Think of a news organization. Hierarchical federation is like regional bureaus sending headline summaries to the national desk -- the national editor gets the big picture without reading every local article. Cross-service federation is like the sports desk borrowing a specific stat from the finance desk's data feed. In both cases, only selected information flows upward or sideways, not everything.

Prometheus federation uses the /federate endpoint to expose a subset of metrics from one Prometheus instance so another Prometheus can scrape them. In hierarchical federation, a global Prometheus sits above multiple datacenter or cluster-level Prometheus instances. Each lower-level instance runs recording rules that pre-aggregate raw metrics into summary time series -- for example, computing the 99th percentile request latency per service every minute. The global Prometheus scrapes only these aggregated metrics from the /federate endpoint of each lower-level instance, giving it a fleet-wide view without ingesting the raw per-pod or per-instance metrics.

Under the hood, the /federate endpoint accepts match[] parameters that filter which metrics are exposed. The global Prometheus configures a scrape job with honor_labels: true to keep the original labels from the source instances, and metrics_path: /federate with params that specify the match expressions. Cross-service federation uses the same mechanism but horizontally: the payments team's Prometheus scrapes specific metrics like http_requests_total or circuit_breaker_state from the user-auth team's Prometheus to monitor a critical dependency. The match expression is narrow, pulling only the exact metrics needed rather than the entire dataset.

At production scale, hierarchical federation works well when the global Prometheus only needs aggregated views -- overall error rates, cluster-level resource utilization, SLO compliance percentages. It falls short when engineers need to drill into full-resolution metrics for incident investigation because the global instance only has pre-aggregated data. That is where Thanos or remote write to a central TSDB like Cortex, Mimir, or VictoriaMetrics becomes necessary. Remote write pushes all raw metrics from every Prometheus to a central store, enabling full-resolution global queries at the cost of more storage and ingestion infrastructure. Pick hierarchical federation for cost-effective fleet-wide dashboards, cross-service federation for targeted dependency monitoring with minimal coupling, and Thanos or remote write when incident responders need full-resolution cross-cluster queries.

The sneaky gotcha is that federation creates a scrape-interval dependency. The global Prometheus scrapes the /federate endpoint at its own interval (typically 60 seconds), but the source metrics were produced at the source's scrape interval (typically 15-30 seconds). This creates staleness windows where the global view lags behind reality. Whenever the global scrape interval is longer than the source's recording rule evaluation interval, the global instance skips the intermediate data points, so short spikes can be missed entirely. Another trap is that /federate is expensive for the source Prometheus -- serving thousands of metrics via /federate on every scrape adds CPU and memory pressure to the source. Teams should use recording rules to keep the number of series exposed via /federate small rather than using broad match expressions that pull raw metrics.
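It can help to see what a federation scrape looks like on the wire. The Python sketch below builds the request URL that a global Prometheus would effectively issue, using only the standard library. The helper name `federate_url` is hypothetical, and the host name is illustrative; the /federate path and the match[] parameter are the parts that come from Prometheus itself.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per selector."""
    # urlencode percent-encodes the brackets, braces, and quotes for us.
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url}/federate?{query}"


url = federate_url(
    "http://prometheus-us-east.monitoring.svc:9090",  # illustrative host name
    ['{__name__=~"cluster:.*"}'],
)
print(url)

# Decoding the query string recovers the original selector.
decoded = parse_qs(urlsplit(url).query)
print(decoded["match[]"])
```

Pasting a URL like this into curl against a source Prometheus is a quick way to check how many series a match expression actually exposes before wiring it into a scrape job.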

Code Example

# Recording rule on the cluster-level Prometheus to pre-aggregate for federation
# rules/payments-aggregation.yaml
apiVersion: monitoring.coreos.com/v1 # Prometheus Operator CRD
kind: PrometheusRule # Defines recording and alerting rules
metadata:
  name: payments-federation-rules # Rules for federation aggregation
  namespace: monitoring # Monitoring namespace
spec:
  groups:
  - name: payments-federation # Rule group name
    interval: 30s # Evaluate every 30 seconds
    rules:
    - record: cluster:http_request_duration_seconds:p99 # Pre-aggregated p99 latency
      expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="payments-api"}[5m])) by (le, cluster))
    - record: cluster:http_requests_total:rate5m # Pre-aggregated request rate
      expr: sum(rate(http_requests_total{service="payments-api"}[5m])) by (cluster, code)
    - record: cluster:up:ratio # Availability ratio across all targets
      # sum/count still yields 0 when every target is down; count(up == 1) would return no data
      expr: sum(up{job="payments-api"}) / count(up{job="payments-api"})

# Global Prometheus scrape config for hierarchical federation
# prometheus-global.yaml (scrape_configs section)
# scrape_configs:
#   - job_name: 'federate-us-east'
#     honor_labels: true # Preserve original labels from source Prometheus
#     metrics_path: '/federate' # Use the federation endpoint
#     params:
#       'match[]': # Filter to only pull pre-aggregated metrics
#         - '{__name__=~"cluster:.*"}' # Match all cluster-level recording rules
#     static_configs:
#       - targets:
#         - 'prometheus-us-east.monitoring.svc:9090' # US East cluster Prometheus
#         labels:
#           source_cluster: us-east-prod # Label identifying the source
#   - job_name: 'federate-eu-west'
#     honor_labels: true # Preserve original labels from source
#     metrics_path: '/federate' # Federation endpoint
#     params:
#       'match[]':
#         - '{__name__=~"cluster:.*"}' # Same filter pattern
#     static_configs:
#       - targets:
#         - 'prometheus-eu-west.monitoring.svc:9090' # EU West cluster Prometheus
#         labels:
#           source_cluster: eu-west-prod # Source identification label

# Cross-service federation: payments team scrapes auth service circuit breaker status
# Added to the payments team's Prometheus scrape config
# - job_name: 'cross-federate-auth'
#   honor_labels: true
#   metrics_path: '/federate'
#   params:
#     'match[]':
#       - 'circuit_breaker_state{service="user-auth-service"}'
#       - 'http_requests_total{service="user-auth-service",code=~"5.."}'  
#   static_configs:
#     - targets: ['prometheus-auth.monitoring.svc:9090']

◈ Architecture Diagram

┌─────────────────────────┐
│   Global Prometheus     │
│   (aggregated view)     │
└────┬──────────────┬─────┘
     │ /federate    │ /federate
┌────┴─────┐  ┌────┴─────┐
│Cluster A │  │Cluster B │
│Prometheus│  │Prometheus│
│rec. rules│  │rec. rules│
└────┬─────┘  └────┬─────┘
     │ scrape      │ scrape
┌────┴─────┐  ┌────┴─────┐
│ Targets  │  │ Targets  │
└──────────┘  └──────────┘

Thanos Sidecar vs Thanos Receive -- when should you pick each one?

architect · monitoring · prometheus

Quick Answer

Thanos Sidecar is pull-based: it uploads completed Prometheus TSDB blocks to object storage and serves recent data via StoreAPI. Thanos Receive is push-based: Prometheus remote-writes metrics to a stateful Receive cluster that replicates and uploads blocks. Pick Sidecar for simpler Kubernetes-native setups, and Receive when Prometheus cannot reach object storage directly or you need multi-tenancy.

Detailed Answer

Think of two ways to archive office documents. The Sidecar approach is like each department scanning its own files and uploading them to cloud storage -- simple and decentralized, but each department needs cloud access. The Receive approach is like all departments sending their files to a central mailroom that handles scanning, copies, and archiving -- more infrastructure in the middle, but departments need no cloud access and the mailroom can sort files by department.

Thanos Sidecar runs as a container alongside each Prometheus Pod. It watches the Prometheus data directory for completed TSDB blocks (every 2 hours) and uploads them to object storage. It also exposes a StoreAPI gRPC endpoint that the Querier uses to access recent data still in Prometheus's local TSDB. This model is pull-based: Prometheus writes to local disk as usual, and the Sidecar picks up completed blocks and ships them to object storage. The main advantages are simplicity (no extra stateful components), tight coupling with the Prometheus lifecycle, and the ability to serve recent data with zero extra latency since it reads directly from Prometheus's local TSDB.

Thanos Receive implements the Prometheus remote write API. Prometheus instances are configured with remote_write to send metrics to a Receive cluster. Receive ingests the data, applies tenant labels, replicates across Receive instances for durability (configurable replication factor, typically 3), and writes TSDB blocks to local disk before uploading to object storage. The Receive cluster is stateful and must be carefully sized for ingestion throughput, disk IOPS, and memory. It serves both recent and locally-stored historical data via StoreAPI.

At production scale, the decision depends on your infrastructure and organizational needs. Sidecar is the go-to pattern when Prometheus runs in Kubernetes with direct access to object storage (S3, GCS), when the team wants minimal extra infrastructure, and when HA is handled by running duplicate Prometheus pairs with the Querier deduplicating via external labels. Receive is the better choice when Prometheus runs in edge locations, on-premise datacenters, or environments where direct object storage access is blocked by network policy or compliance rules. Receive also enables multi-tenancy: each tenant's metrics can be routed to specific Receive instances with tenant-level resource limits and retention. The Receive hashring distributes incoming series across instances based on tenant and metric labels, allowing horizontal scaling of ingestion.

The sneaky gotcha with Sidecar is the 2-hour upload delay -- completed TSDB blocks are only uploaded after Prometheus compacts them, so there is a window where data exists only on Prometheus's local disk. If the Prometheus Pod crashes before upload, that block can be lost (though HA pairs reduce this risk). With Receive, the gotcha is operational complexity: Receive is a stateful distributed system that requires careful hashring configuration, anti-affinity scheduling, and monitoring of replication lag. If a Receive instance falls behind on ingestion, back-pressure can cause Prometheus remote write queues to grow, eventually dropping data. Architects must size Receive for peak ingestion rate plus a 30 percent buffer, and watch thanos_receive_write_failures_total on the Receive side and prometheus_remote_storage_queue_highest_sent_timestamp_seconds on the Prometheus side.
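The hashring idea can be sketched in a few lines of Python. This is a deliberately simplified illustration: the function name `pick_endpoints`, the endpoint names, and the use of SHA-256 with a modulo are all assumptions made for the example, and Thanos ships its own hashing algorithms that behave differently when instances are added or removed. What the sketch shows is the principle: the tenant plus the series labels deterministically select which Receive instances store a series, and the replication factor decides how many copies exist.

```python
import hashlib


def pick_endpoints(tenant, labels, endpoints, replication_factor=3):
    """Choose which Receive instances store a series (simplified hash-mod ring)."""
    # Sort the labels so the same series always produces the same identity string.
    series_id = tenant + "|" + ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.sha256(series_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(endpoints)
    # Never ask for more replicas than there are instances.
    count = min(replication_factor, len(endpoints))
    # The primary owner plus the next instances on the ring hold the replicas.
    return [endpoints[(start + i) % len(endpoints)] for i in range(count)]


# Hypothetical five-instance Receive hashring.
endpoints = [f"thanos-receive-{i}.thanos-receive.monitoring.svc:10901" for i in range(5)]
chosen = pick_endpoints(
    "payments",
    {"__name__": "http_requests_total", "code": "500"},
    endpoints,
)
print(chosen)
```

Because the choice depends only on the tenant and labels, every Receive instance that gets a write request can work out independently where the series belongs and forward it there. A plain modulo scheme like this one reshuffles most series whenever the instance count changes, which is one reason production hashrings need careful configuration.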

Code Example

# Sidecar pattern: Prometheus with Thanos Sidecar in Kubernetes
# kube-prometheus-stack Helm values
# prometheus:
#   prometheusSpec:
#     replicas: 2 # HA pair for redundancy
#     retention: 4h # Short retention since Sidecar handles long-term
#     externalLabels:
#       cluster: payments-prod # Unique cluster label for deduplication
#     thanos:
#       image: quay.io/thanos/thanos:v0.36.1 # Sidecar image
#       objectStorageConfig:
#         existingSecret:
#           name: thanos-s3-config # S3 bucket configuration
#           key: objstore.yml

# Receive pattern: Prometheus remote-writes to Thanos Receive
# prometheus-remote-write.yaml (Prometheus config)
# remote_write:
#   - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
#     headers:
#       THANOS-TENANT: payments # Multi-tenant header for isolation
#     queue_config:
#       capacity: 10000 # Samples buffered per shard before WAL reads are blocked
#       max_shards: 30 # Parallel write shards
#       min_shards: 3 # Minimum active shards
#       max_samples_per_send: 5000 # Batch size per remote write request

# Thanos Receive StatefulSet
apiVersion: apps/v1 # Stable StatefulSet API
kind: StatefulSet # Stateful for persistent storage
metadata:
  name: thanos-receive # Receive component
  namespace: monitoring # Observability namespace
spec:
  replicas: 3 # Three instances for replication factor 3
  serviceName: thanos-receive # Headless service for peer discovery
  selector:
    matchLabels:
      app: thanos-receive # Pod selector
  template:
    metadata:
      labels:
        app: thanos-receive # Label for Service and Querier discovery
    spec:
      containers:
      - name: receive # Thanos Receive container
        image: quay.io/thanos/thanos:v0.36.1 # Pinned version
        args:
        - receive # Run in receive mode
        - --grpc-address=0.0.0.0:10901 # gRPC for Querier StoreAPI
        - --http-address=0.0.0.0:10902 # HTTP for health and metrics
        - --remote-write.address=0.0.0.0:19291 # Remote write ingestion endpoint
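        - --receive.replication-factor=3 # Replicate to all 3 instances
        - --receive.hashrings-file=/etc/thanos/hashring.json # Hashring config
        - --receive.local-endpoint=$(POD_NAME).thanos-receive.monitoring.svc:10901 # This Pod's address as listed in the hashring
        - --label=receive_replica="$(POD_NAME)" # External label identifying this instance (label name is illustrative)
        - --tsdb.path=/data/receive # Local TSDB storage path
        - --tsdb.retention=6h # Keep blocks locally before upload
        - --objstore.config-file=/etc/thanos/objstore.yml # S3 upload config
        env:
        - name: POD_NAME # Pod name injected through the downward API
          valueFrom:
            fieldRef:
              fieldPath: metadata.name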
        ports:
        - containerPort: 19291 # Remote write ingestion port
          name: remote-write # Port name
        - containerPort: 10901 # gRPC StoreAPI port
          name: grpc # Port name
        volumeMounts:
        - name: data # Persistent volume for TSDB blocks
          mountPath: /data/receive # Mount path matching tsdb.path
  volumeClaimTemplates:
  - metadata:
      name: data # PVC name for each replica
    spec:
      accessModes: [ReadWriteOnce] # Single-node access
      resources:
        requests:
          storage: 100Gi # Storage for local TSDB blocks before upload

◈ Architecture Diagram

┌── Sidecar Pattern ──┐  ┌── Receive Pattern ──┐
│                     │  │                     │
│ ┌────────┐          │  │ ┌────────┐          │
│ │Prom    │          │  │ │Prom    │          │
│ │+ Sidecar──→ S3    │  │ │        │──remote  │
│ └────────┘   upload │  │ └───┬────┘   write  │
│              (pull) │  │     ↓        (push) │
│                     │  │ ┌────────┐          │
│                     │  │ │Receive │──→ S3    │
│                     │  │ │(x3 HA) │          │
│                     │  │ └────────┘          │
└─────────────────────┘  └─────────────────────┘