36 reviewed items across 3 content types.
Shopify processes over 10,000 transactions per second during flash sales. Their monolith-to-microservices migration created blind spots in request tracing across 280+ services, causing checkout failures that cost an estimated $2.3M per hour during peak traffic events.
Netflix operates over 1,000 microservices across multiple AWS regions serving 260 million subscribers. They were locked into a proprietary observability vendor costing $18M annually, and needed to migrate to a multi-backend strategy without disrupting production observability during the transition.
Stripe processes billions of dollars in payments monthly across 35 countries. A subtle intermittent payment failure affecting 0.3% of transactions in the EU region was causing $4.7M in monthly revenue loss, but the root cause was invisible because logs, metrics, and traces lived in separate silos with no correlation.
Uber operates 4,500+ microservices written in Go, Java, Python, and Node.js. Manual instrumentation coverage was only 22% because teams resisted adding tracing code, creating massive observability blind spots during the migration from their legacy Jaeger-based system to a standardized OpenTelemetry deployment.
Airbnb serves 150 million users with 800+ microservices. Their SLO (Service Level Objective) monitoring was based on synthetic checks that missed real user experience degradation. During a major booking outage, SLOs showed green while 12% of real users experienced failures, eroding trust in the SLO framework entirely.
Datadog, while offering its own proprietary agents, needed to support the growing number of customers sending telemetry via OTLP. Internally, their own infrastructure of 600+ services needed to dogfood OTLP ingestion at a scale of 2.1 trillion spans per day, exposing performance bottlenecks in their ingestion pipeline that affected all OTLP customers.
Sending telemetry directly from application SDKs to observability backends creates tight coupling between your application code and your vendor infrastructure. When you need to switch vendors, change sampling rates, add attribute enrichment, or route different telemetry types to different backends, you must modify and redeploy every application. At scale with hundreds of services, this becomes operationally impossible. The collector acts as a decoupling layer that absorbs changes in your observability infrastructure without requiring application redeployments. It also provides critical capabilities that SDKs cannot efficiently implement: tail-based sampling requires seeing all spans in a trace before making a sampling decision, which is only possible at a centralized collection point. Batch processing at the collector reduces the number of outbound connections and allows for retry logic that would be wasteful to implement in every application process. The collector also serves as a security boundary, handling authentication tokens for backend APIs so that individual applications do not need access to observability platform credentials.
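To make the tail-based sampling point concrete, here is a minimal stdlib-only sketch of why the decision has to happen where a whole trace is visible. It is not the Collector's implementation: the `Span` type, the `tail_sample` function, and the 500 ms threshold are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span record used only for this sketch."""
    trace_id: str
    name: str
    duration_ms: float
    is_error: bool = False


def tail_sample(spans, slow_ms=500.0):
    """Keep every span of any trace that contains an error or a slow span."""
    # Group spans by trace first: the decision needs the complete trace.
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)

    kept = []
    for trace_spans in by_trace.values():
        # One interesting span is enough to keep the whole trace.
        if any(s.is_error or s.duration_ms >= slow_ms for s in trace_spans):
            kept.extend(trace_spans)
    return kept
```

An SDK inside a single service only ever sees its own spans, so it cannot know that a span elsewhere in the same trace failed. A central collection point can group by trace ID before deciding.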
Semantic conventions are standardized attribute names and values that keep telemetry data consistent and interoperable across services, teams, and observability tools. Without them, one team might use 'http.status' while another uses 'http_status_code' and a third uses 'response.code', so unified dashboards, consistent alerts, and cross-service trace queries all need per-team special cases. The problem compounds over time: as new services are added and old ones evolve, attribute names drift further apart. Observability backends increasingly build features on top of semantic conventions: Grafana Tempo's service graph is generated from specific span attributes, Jaeger's critical path analysis depends on standardized span kinds, and many backends auto-generate RED (rate, errors, duration) metrics from semantically tagged HTTP and RPC spans. Using non-standard attribute names forfeits these capabilities. Furthermore, when you eventually need to migrate between observability backends, semantic conventions keep your data model portable because backends that support OpenTelemetry interpret the same attribute names the same way.
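As a rough illustration of the naming drift described above, the sketch below rewrites the three ad hoc keys onto `http.response.status_code`, the attribute name used by the stable OpenTelemetry HTTP semantic conventions. The `ALIASES` table and the `normalize_attributes` helper are hypothetical. In practice this kind of rename is usually done once in the pipeline rather than in application code.

```python
# Hypothetical alias table: ad hoc keys mapped to the semantic-convention name.
ALIASES = {
    "http.status": "http.response.status_code",
    "http_status_code": "http.response.status_code",
    "response.code": "http.response.status_code",
}


def normalize_attributes(attrs):
    """Return a copy of attrs with known ad hoc keys renamed."""
    normalized = {}
    for key, value in attrs.items():
        # Unknown keys pass through unchanged.
        normalized[ALIASES.get(key, key)] = value
    return normalized
```

Once every service emits the same key, a single query on `http.response.status_code` covers all of them.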
Every piece of telemetry data needs infrastructure context to be actionable during incident response. When you see a spike in error rates, you need to immediately know which Kubernetes cluster, namespace, node, pod, cloud region, and availability zone the errors are coming from. Without automatic resource detection, teams either manually configure these attributes (which inevitably drift from reality as infrastructure changes) or omit them entirely (making telemetry useless for infrastructure-level debugging). Resource attributes are the foundation of telemetry filtering and aggregation: they let you answer questions like 'Is this error happening across all AZs or just us-east-1a?' or 'Did this latency regression start after a specific deployment?' Automatic detection eliminates human error and ensures consistency across all services regardless of which team deployed them. It also enables powerful automation: if every span includes the k8s.deployment.name and service.version, you can automatically correlate deployments with performance regressions using change detection algorithms on your metrics backend.
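A minimal sketch of automatic resource detection, assuming the attributes are available as environment variables. `OTEL_RESOURCE_ATTRIBUTES` and `OTEL_SERVICE_NAME` are standard OpenTelemetry variables. The `K8S_*` variable names are hypothetical and stand in for values a deployment might inject through the Kubernetes Downward API. The `detect_resource` function is illustrative only and does not handle percent-encoded values.

```python
import os

# Hypothetical env var names mapped to semantic-convention attribute keys.
K8S_ENV_TO_ATTR = (
    ("K8S_POD_NAME", "k8s.pod.name"),
    ("K8S_NAMESPACE", "k8s.namespace.name"),
    ("K8S_NODE_NAME", "k8s.node.name"),
)


def detect_resource(environ=None):
    """Build a resource-attribute dict from the process environment."""
    env = os.environ if environ is None else environ
    attrs = {}

    # OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs.
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()

    # OTEL_SERVICE_NAME takes precedence over service.name set above.
    if env.get("OTEL_SERVICE_NAME"):
        attrs["service.name"] = env["OTEL_SERVICE_NAME"]

    for var, key in K8S_ENV_TO_ATTR:
        if env.get(var):
            attrs[key] = env[var]
    return attrs
```

Because the values come from the environment the platform sets, they follow the infrastructure as it changes instead of depending on hand-maintained configuration.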
Spans without proper error recording are traces without teeth. When a span completes with status UNSET (the default), observability backends cannot distinguish between successful operations and silent failures. As a result, error-based sampling policies miss real errors, error rate dashboards undercount failures, and SLO calculations based on trace data are inaccurate. The problem is especially severe with auto-instrumentation: while HTTP instrumentation typically sets span status based on response codes, business logic errors (such as a declined payment, a failed inventory check, or a validation rule rejecting input) often complete as successful HTTP 200 responses with error details in the response body. If your spans do not explicitly record these as errors, your tracing system can show a 99.9% success rate while your business is experiencing a 5% failure rate. Proper error recording also enables root cause analysis: when a span records the exception type, message, and stacktrace as a span event, engineers can often diagnose the root cause from the trace alone without having to correlate with logs. This reduces MTTR because the trace becomes a self-contained debugging artifact.
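The HTTP 200 case above can be sketched as follows. `SketchSpan` is a hypothetical stand-in for a real span object, not the OpenTelemetry API, and `PaymentDeclined` and `charge` are illustrative names. The event name `exception` and the `exception.type`, `exception.message`, and `exception.stacktrace` attribute keys follow the OpenTelemetry semantic conventions for exceptions.

```python
import traceback
from dataclasses import dataclass, field


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a span; status starts as UNSET."""
    name: str
    status: str = "UNSET"
    status_description: str = ""
    events: list = field(default_factory=list)

    def record_exception(self, exc):
        # Attach the exception as a span event so the trace is self-contained.
        stack = traceback.format_exception(type(exc), exc, exc.__traceback__)
        self.events.append({
            "name": "exception",
            "attributes": {
                "exception.type": type(exc).__name__,
                "exception.message": str(exc),
                "exception.stacktrace": "".join(stack),
            },
        })

    def set_error(self, description):
        self.status = "ERROR"
        self.status_description = description


class PaymentDeclined(Exception):
    """Hypothetical business-logic failure."""


def charge(span, approved):
    """Always answers HTTP 200; the span carries the real outcome."""
    try:
        if not approved:
            raise PaymentDeclined("card declined by issuer")
        return {"http_status": 200, "ok": True}
    except PaymentDeclined as exc:
        span.record_exception(exc)
        span.set_error(str(exc))
        return {"http_status": 200, "ok": False}
```

HTTP auto-instrumentation would see only the 200 response here. The explicit `set_error` call is what makes the failure visible to error-rate dashboards and error-based sampling.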
Context propagation is the mechanism that ties individual spans together into a coherent distributed trace. Without it, your traces break at every service boundary, producing disconnected fragments that cannot be used for end-to-end latency analysis or root cause identification. While HTTP context propagation using W3C TraceContext headers works automatically with most auto-instrumentation, asynchronous boundaries like message queues, event buses, and background job processors require explicit propagation that many teams overlook. When a web request publishes a message to Kafka and a consumer processes it minutes later, the consumer's spans are orphaned unless the producer injected trace context into the message headers and the consumer extracted it. This creates a massive blind spot: in event-driven architectures, the most complex and error-prone processing happens asynchronously, yet it is precisely this processing that lacks trace continuity. Without async propagation, you cannot answer questions like 'How long did it take from when the user clicked checkout to when the order was fulfilled?' because the trace ends at the Kafka producer and a new, unrelated trace begins at the consumer.
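The producer-inject and consumer-extract steps can be sketched with the W3C Trace Context `traceparent` header, whose version 00 format is `00-<trace-id>-<parent-id>-<flags>`. This is a stdlib-only illustration, not the OpenTelemetry propagator API: `inject`, `extract`, and `start_consumer_span` are hypothetical helpers, and a plain dict stands in for the message headers of whatever queue client is in use.

```python
import re
import secrets

# Version 00 traceparent: 32 hex trace id, 16 hex parent id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers, trace_id, span_id, sampled=True):
    """Producer side: write the current trace context into message headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Consumer side: read trace context, or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero ids are invalid per the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def start_consumer_span(headers):
    """Continue the producer's trace if possible, else start a new one."""
    parent = extract(headers)
    return {
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "parent_span_id": parent["parent_span_id"] if parent else None,
        "span_id": secrets.token_hex(8),
    }
```

When the headers carry no context, `start_consumer_span` falls back to a fresh trace ID, which is exactly the orphaned-trace situation the paragraph describes.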
OpenTelemetry Collectors process high volumes of telemetry data in memory, and without proper limits, they are susceptible to out-of-memory (OOM) kills that cause complete telemetry data loss. This typically happens during traffic spikes, when downstream backends are slow or unavailable, or when tail-based sampling buffers grow unbounded. An OOM-killed collector does not just lose the data it was processing; it creates a gap in your observability during exactly the moments when you need it most, because traffic spikes and backend issues often coincide with production incidents. The problem is compounded by Kubernetes memory limits: if the collector's memory usage exceeds its container memory limit, the kernel kills the process immediately (Kubernetes reports the container as OOMKilled), without giving the collector a chance to flush buffered data. The restarted container starts empty, having lost all in-flight telemetry. In a DaemonSet deployment, an OOM kill on one node means every service on that node loses observability for the duration of the restart. In a gateway deployment, it can mean total observability blackout. Setting proper memory limits with backpressure allows graceful degradation: the collector refuses or sheds load before it reaches the hard limit, and a pipeline that prioritizes by importance can drop the least important data first while preserving critical telemetry like error traces.
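The graceful-degradation idea can be sketched as a bounded buffer that refuses new items when full and gives up non-error data before error data. This is a hypothetical policy for illustration only, not how the Collector is implemented. `BoundedBuffer`, `offer`, and the item shape (a dict with an `is_error` flag) are assumptions, and the limit is counted in items rather than bytes to keep the example short.

```python
class BoundedBuffer:
    """Hypothetical in-memory buffer with a hard item limit."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.items = []
        self.dropped = 0

    def offer(self, item):
        """Return True if buffered, False if the sender should back off."""
        if len(self.items) < self.max_items:
            self.items.append(item)
            return True
        # Full: an error item may displace the oldest non-error item.
        if item["is_error"] and self._evict_oldest_non_error():
            self.items.append(item)
            return True
        # Otherwise refuse, which pushes backpressure onto the sender.
        return False

    def _evict_oldest_non_error(self):
        for index, buffered in enumerate(self.items):
            if not buffered["is_error"]:
                self.items.pop(index)
                self.dropped += 1
                return True
        return False
```

The buffer never grows past `max_items`, so memory stays bounded. A `False` return is the backpressure signal a sender would use to retry later instead of piling more data onto a struggling process.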