Kubernetes

Company Stories

How Netflix Runs 1000+ Microservices on Titus/Kubernetes with Custom Scheduler Extensions for GPU Workloads

Netflix

Challenge

Netflix operates one of the largest streaming platforms in the world, serving over 260 million subscribers across 190+ countries. By 2024, their container platform Titus was running over 1,000 microservices generating millions of container launches per day. The challenge was twofold. First, they needed to converge their proprietary Titus container platform onto upstream Kubernetes without disrupting the massive fleet of existing workloads that depended on Titus APIs and scheduling semantics. Second, they had to support a rapidly growing GPU workload footprint for content encoding, recommendation model training, and real-time personalization inference. Their existing scheduler could not efficiently bin-pack GPU workloads alongside CPU-heavy microservices, leading to GPU utilization rates below 40 percent and millions of dollars in wasted compute. The migration needed to happen without any perceivable impact on the streaming experience for its hundreds of millions of subscribers during peak hours.

Solution

Netflix pursued a multi-year convergence strategy to rebuild Titus on top of Kubernetes, adopting the Kubernetes control plane while preserving the Titus API layer that thousands of internal developers already depended on. They built a custom Kubernetes scheduler extension called Titus Kube Scheduler that implements bin-packing algorithms optimized for heterogeneous workloads mixing CPU, memory, network, and GPU resource dimensions. For GPU workloads specifically, they developed topology-aware scheduling that understands NVLink interconnects and PCIe topology, ensuring that multi-GPU training jobs are placed on nodes where GPUs share high-bandwidth interconnects. They implemented a custom resource model that extends the Kubernetes device plugin framework to expose GPU memory, compute capability, and interconnect topology as schedulable resources. The migration itself used a shadow-traffic approach where workloads ran simultaneously on both legacy Titus and new Titus-on-Kubernetes, with automated comparison of scheduling decisions and performance metrics. They built a custom admission controller that enforces resource quotas, security policies, and Netflix-specific compliance requirements. The entire platform runs across multiple AWS regions with federation managed through a custom multi-cluster control plane.
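To make the topology-aware placement idea concrete, here is a minimal Python sketch. It is not Netflix's scheduler: the function name `pick_gpus`, the notion of an "island" (a group of GPUs sharing a high-bandwidth interconnect), and the island labels are all assumptions made for illustration. The sketch prefers to satisfy a multi-GPU request from a single island and only spans islands when no single island has enough free GPUs.

```python
from typing import Optional


def pick_gpus(free_by_island: dict[str, list[int]], count: int) -> Optional[list[int]]:
    """Choose `count` free GPU ids on one node, preferring a single island.

    `free_by_island` maps a hypothetical interconnect-island label to the
    free GPU ids in that island. Returns None if the node cannot fit the job.
    """
    # Islands that can hold the whole job on their own.
    candidates = [
        (len(gpus), name)
        for name, gpus in free_by_island.items()
        if len(gpus) >= count
    ]
    if candidates:
        # Best fit: use the smallest island that is still large enough,
        # leaving bigger islands intact for bigger jobs.
        _, name = min(candidates)
        return sorted(free_by_island[name])[:count]

    # Fallback: span islands, draining the largest first so the job
    # crosses as few slow links as possible.
    pooled: list[int] = []
    for name in sorted(free_by_island, key=lambda n: -len(free_by_island[n])):
        pooled.extend(sorted(free_by_island[name]))
    return pooled[:count] if len(pooled) >= count else None
```

With `{"nvlink-0": [0, 1, 3], "nvlink-1": [4, 5, 6, 7]}`, a 2-GPU job lands on GPUs 0 and 1 and a 4-GPU job takes the whole second island, which is the behavior simple GPU counting cannot express.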

Outcome

GPU utilization improved from under 40 percent to over 65 percent through topology-aware scheduling, saving tens of millions of dollars annually in GPU compute costs. Container startup latency improved by 30 percent through optimized image caching and pre-warming. Developer experience remained unchanged during migration since the Titus API layer was preserved. Scheduling throughput increased to handle over 10,000 scheduling decisions per second. Zero customer-facing incidents occurred during the multi-year migration.

Scale

Over 1,000 microservices, millions of containers launched daily, 200,000+ concurrent containers at peak, tens of thousands of GPU instances for ML and encoding workloads, spanning multiple AWS regions. Engineering organization of 2,000+ engineers deploying to the platform.

Key Learnings

Preserving existing developer APIs during a platform migration is more important than adopting upstream APIs directly — developer trust and velocity matter more than API purity
GPU scheduling requires topology awareness beyond simple resource counting — NVLink and PCIe topology dramatically affect multi-GPU training performance
Shadow traffic comparison between old and new scheduling systems catches subtle regression bugs that unit tests cannot detect

How Spotify Uses Kubernetes as a Universal Control Plane with Backstage for Developer Experience

Spotify

Challenge

Spotify had over 2,000 engineers working across hundreds of autonomous squads, each responsible for their own microservices. By 2024, they operated over 2,000 microservices running in production across multiple GCP regions. The core challenge was that while Kubernetes solved container orchestration, it did not solve the developer experience problem — engineers needed to understand dozens of different tools, dashboards, and configuration systems to deploy and operate their services. Service catalog sprawl made it impossible to track ownership, dependencies, and compliance status. New engineers took weeks to become productive because tribal knowledge about deployment pipelines, monitoring setup, and infrastructure provisioning was scattered across wikis, Slack channels, and individual team runbooks. Additionally, different squads had adopted incompatible deployment patterns, making platform-wide upgrades and security patching extremely difficult.

Solution

Spotify built Backstage, an open-source developer portal that provides a unified interface for all infrastructure operations, and deeply integrated it with their Kubernetes platform. Backstage serves as the single pane of glass where developers discover services, create new projects from software templates, view deployment status, check CI/CD pipelines, and access documentation. Under the hood, Backstage uses Kubernetes as a universal control plane not just for running workloads but for managing the lifecycle of all software components. They implemented custom Kubernetes operators that reconcile the desired state of services defined in Backstage software catalog entries with actual running infrastructure. Software templates in Backstage auto-generate Kubernetes manifests, Helm charts, CI/CD pipeline configurations, and monitoring dashboards for new services. They built a plugin architecture that allows squads to extend Backstage with custom capabilities while maintaining a consistent developer experience. The Kubernetes integration includes real-time pod status, deployment history, and resource utilization visible directly in the Backstage service catalog.
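The operator pattern mentioned above boils down to a reconcile loop: compare desired state with observed state and emit the actions that close the gap. The following Python sketch illustrates only that core comparison. The function name, the dictionary shape, and the service names in the usage note are hypothetical, and a real operator would act on Kubernetes API objects rather than plain dictionaries.

```python
def reconcile(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Return the (verb, name) actions needed to move `actual` toward `desired`.

    Both arguments map a component name to its spec. The result is ordered:
    creates and updates first (by name), then deletes (by name).
    """
    actions: list[tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions
```

Given a catalog that declares `playlist-api` with 3 replicas and `search` with 2, and a cluster running `playlist-api` with 2 replicas plus an unlisted `legacy` service, the function returns an update, a create, and a delete. Running it again once the states match returns an empty list, which is the idempotence that makes repeated reconciliation safe.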

Outcome

New engineer onboarding time reduced from weeks to days with golden path templates. Service creation time dropped from hours of manual setup to under 10 minutes using Backstage software templates. Platform-wide security patches that previously took months to propagate across squads now complete in days through centralized template updates. Service ownership coverage went from approximately 60 percent to over 95 percent through mandatory catalog registration. DORA deployment frequency metrics improved by 3x across the organization.

Scale

Over 2,000 microservices, 2,000+ engineers across hundreds of squads, multiple GCP Kubernetes clusters, tens of thousands of pods in production, over 400 Backstage plugins contributed by internal teams. The Backstage open-source project has 27,000+ GitHub stars and has been adopted by thousands of companies.

Key Learnings

Developer experience is a force multiplier — investing in a unified portal pays dividends across every team rather than optimizing individual workflows
Kubernetes as a control plane extends beyond workload orchestration — custom operators can reconcile any desired-state resource model
Golden path templates reduce cognitive load without removing flexibility — squads can deviate when needed but the default path handles 80 percent of cases

How Airbnb Built Their ML Training Platform on Kubernetes with Ray and Anyscale Training 12B Parameter Models

Airbnb

Challenge

Airbnb's machine learning teams needed to train increasingly large models for search ranking, pricing optimization, fraud detection, and personalized recommendations. By 2024, their most complex models had grown to 12 billion parameters, requiring distributed training across hundreds of GPUs. The existing ML infrastructure was a patchwork of custom-built job schedulers, manual GPU allocation, and team-specific training scripts that could not scale to handle the computational demands of large model training. GPU cluster utilization averaged only 35 percent because teams reserved entire nodes for exclusive use during training runs that rarely fully utilized all allocated GPUs. Training jobs frequently failed at 80 percent completion due to GPU memory errors or network partitioning, requiring full restarts that wasted hours of expensive compute time. The ML platform team had only 15 engineers supporting over 200 ML practitioners across the company.

Solution

Airbnb built their next-generation ML training platform on Kubernetes using Ray as the distributed computing framework and Anyscale as the managed Ray platform layer. They deployed dedicated GPU Kubernetes clusters on AWS using a mix of p4d.24xlarge (A100) and p5.48xlarge (H100) instances managed through Karpenter for just-in-time node provisioning. Ray on Kubernetes handles distributed training orchestration, automatically managing worker placement, fault tolerance, and gradient synchronization across hundreds of GPU workers. They built a custom Kubernetes operator called MLTrainController that manages the lifecycle of training jobs, automatically configuring Ray clusters based on model architecture and dataset size. Checkpointing is automated every N steps to S3 with automatic resume from the last checkpoint on failure, eliminating the wasted compute from full restarts. They implemented gang scheduling using the Kubernetes Coscheduling plugin to ensure all workers for a distributed training job are scheduled atomically. A priority-based preemption system allows high-priority production model retraining to preempt experimental jobs, with preempted jobs automatically resuming when resources become available.
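Gang scheduling, as described above, means a distributed job is admitted only when every worker can be placed at once. The sketch below shows that all-or-nothing decision in Python. It does not use the Coscheduling plugin's API; `gang_schedule` and the node names are assumptions for illustration, and the greedy largest-node-first placement is just one simple policy.

```python
from typing import Optional


def gang_schedule(
    workers: int, gpus_per_worker: int, free_gpus: dict[str, int]
) -> Optional[dict[str, int]]:
    """Place all `workers` or none.

    `free_gpus` maps node name to free GPU count. Returns a plan of
    {node: workers placed there}, or None if the full cohort does not fit.
    """
    if workers <= 0:
        return {}
    remaining = workers
    plan: dict[str, int] = {}
    # Greedy: fill the nodes with the most free GPUs first.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        slots = min(free // gpus_per_worker, remaining)
        if slots:
            plan[node] = slots
            remaining -= slots
        if remaining == 0:
            return plan
    # Partial placement is discarded: nothing is reserved.
    return None
```

With free GPUs `{"a": 8, "b": 8, "c": 4}`, two 8-GPU workers are placed on `a` and `b`, while a request for three such workers returns `None` instead of holding two nodes idle while waiting for a third.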

Outcome

GPU cluster utilization improved from 35 percent to 72 percent through better bin-packing and preemption-based scheduling. Training job completion rate improved from 65 percent to 96 percent through automated checkpointing and fault recovery. Time to train the largest search ranking model decreased from 14 days to 4 days through optimized distributed training configurations. ML engineer productivity increased 3x as measured by models shipped to production per quarter. Infrastructure cost per training run decreased by 45 percent despite larger model sizes.

Scale

Training models up to 12 billion parameters, GPU clusters with hundreds of A100 and H100 GPUs, over 200 ML practitioners submitting training jobs, thousands of training runs per week, processing petabytes of training data from S3.

Key Learnings

Gang scheduling is essential for distributed GPU training — partial scheduling wastes resources when workers cannot proceed without the full cohort
Automated checkpointing with resume-from-failure transforms GPU training economics by eliminating the 80-percent-complete restart problem
Karpenter's just-in-time provisioning is critical for GPU workloads where pre-provisioned capacity is prohibitively expensive
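The checkpoint-and-resume learning above can be sketched as a small Python training loop. This is a toy under stated assumptions: it writes JSON checkpoints to a local path rather than S3, the "training step" is a stand-in that increments a number, and the names `train`, `every`, and `fail_at` are hypothetical. The point is the control flow: save every N steps, and on restart continue from the last saved step instead of step zero.

```python
import json
import os
from typing import Optional


def train(total_steps: int, every: int, path: str, fail_at: Optional[int] = None) -> tuple[int, float]:
    """Run a toy training loop with periodic checkpoints.

    Resumes from `path` if a checkpoint exists. `fail_at` simulates a crash
    after that many completed steps. Returns (final step, final state).
    """
    step, state = 0, 0.0
    if os.path.exists(path):
        with open(path) as f:
            ckpt = json.load(f)
        step, state = ckpt["step"], ckpt["state"]

    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        state += 1.0  # stand-in for one optimizer step
        step += 1
        if step % every == 0:
            # Write to a temp file, then rename, so a crash mid-write
            # never leaves a corrupt checkpoint behind.
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "state": state}, f)
            os.replace(tmp, path)
    return step, state
```

If a 100-step run with checkpoints every 10 steps fails after step 85, the next call picks up at step 80 and repeats only 5 steps of work rather than 85.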

How Uber Migrated from Apache Mesos to Kubernetes for 4000+ Microservices with Zero Downtime

Uber

Challenge

Uber had been running one of the largest Apache Mesos deployments in the world, powering over 4,000 microservices that handle ride matching, pricing, payments, mapping, and logistics across 10,000+ cities globally. By 2023, maintaining and scaling the Mesos-based platform had become increasingly difficult as the open-source community shifted focus to Kubernetes. Recruiting engineers with Mesos expertise became nearly impossible, and the custom tooling built on top of Mesos represented millions of lines of code that only a shrinking group of engineers could maintain. The platform handled over 1 million requests per second at peak during events like New Year's Eve across all time zones. Any migration disruption could directly impact rider safety, driver earnings, and delivery reliability. The existing Mesos deployment used a custom framework called Peloton for job scheduling that had deep integration with Uber's service discovery, load balancing, and observability infrastructure. A big-bang migration was out of the question given the blast radius.

Solution

Uber executed a multi-year incremental migration from Mesos to Kubernetes using a dual-orchestrator strategy. They built a unified platform abstraction layer called UP (Uber Platform) that provided a single API for developers regardless of whether their workloads ran on Mesos or Kubernetes underneath. This allowed the migration to happen transparently from the developer perspective. Phase one focused on stateless microservices, migrating workloads cluster-by-cluster using a canary approach where each service ran simultaneously on both Mesos and Kubernetes with traffic gradually shifted via their custom load balancer. They built custom Kubernetes controllers that replicated Peloton scheduling semantics including resource overcommit, priority-based preemption, and placement constraints that developers depended on. Service discovery was unified through a custom integration layer that registered Kubernetes pods in Uber's existing Hyperbahn and TChannel-based service mesh while simultaneously supporting the Kubernetes-native service model. They developed automated migration tooling that could analyze a service's Mesos configuration, generate equivalent Kubernetes manifests, deploy the service to Kubernetes, run comparison tests, and shift traffic — all through a self-service workflow. For stateful services including Schemaless databases, Cherami message queues, and Cadence workflow engines, they developed specialized migration operators that handled data replication and cutover.
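The canary step in this migration, where traffic shifts gradually while both stacks are compared, can be illustrated with a short Python sketch. Nothing here reflects Uber's actual tooling: the function name, the step schedule, and the error-rate tolerance are all assumed values chosen for the example. The rule is simple: advance to the next traffic percentage only while the new stack's error rate stays within tolerance of the old one, and otherwise send all traffic back.

```python
def next_weight(
    current: int,
    old_error_rate: float,
    new_error_rate: float,
    steps: tuple[int, ...] = (1, 5, 25, 50, 100),
    tolerance: float = 0.001,
) -> int:
    """Return the next percentage of traffic for the new orchestrator.

    Rolls back to 0 if the new stack is worse than the old one by more
    than `tolerance`; otherwise moves to the next step in the schedule.
    """
    if new_error_rate > old_error_rate + tolerance:
        return 0  # regression detected: shift everything back
    for step in steps:
        if step > current:
            return step
    return 100  # already fully migrated
```

A service at 5 percent with comparable error rates moves to 25 percent, while a service at 25 percent whose new-stack error rate jumps from 0.2 percent to 1 percent drops back to 0.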

Outcome

Completed migration of over 4,000 microservices from Mesos to Kubernetes with zero customer-impacting downtime incidents attributed to the migration. Infrastructure operational costs decreased by 30 percent through better bin-packing and resource utilization on Kubernetes. Developer deployment velocity increased by 40 percent as standardized Kubernetes tooling replaced custom Mesos wrappers. On-call incident rate for platform issues decreased by 50 percent due to better self-healing capabilities in Kubernetes. Recruiting pipeline for platform engineers improved dramatically with Kubernetes as the standard.

Scale

Over 4,000 microservices, millions of containers, over 1 million requests per second at peak, 10,000+ cities served globally, multiple data center regions, thousands of engineers deploying daily, petabytes of data processed.

Key Learnings

A platform abstraction layer above the orchestrator is essential for large-scale migrations — it decouples developer workflows from infrastructure implementation details
Migrating scheduling semantics is harder than migrating workloads — custom scheduling behaviors that developers depend on must be faithfully replicated or migration will fail
Dual-stack operation with traffic comparison is the safest migration pattern for high-availability systems — never trust that equivalent configs produce equivalent behavior

How Shopify Handles Black Friday Traffic Spikes on Kubernetes with Aggressive HPA and Pod Pre-warming

Shopify

Challenge

Shopify powers over 2 million online stores and processes billions of dollars in transactions during Black Friday/Cyber Monday (BFCM), their single largest traffic event. During BFCM 2024, peak traffic exceeded 80 million requests per minute with order creation rates spiking to over 150,000 orders per minute. The standard Kubernetes Horizontal Pod Autoscaler operates reactively: it scales up after detecting increased load, but the 60-90 second delay between metric collection, scaling decision, and pod readiness means thousands of requests hit overloaded pods during traffic spikes. For Shopify, even a 30-second degradation during peak BFCM means millions of dollars in lost merchant sales and severe reputational damage. Their application pods had cold-start times of 40-60 seconds due to Ruby on Rails boot time, JIT compilation warm-up, and connection pool establishment to databases and caches. The traffic pattern is also uniquely challenging because it features sharp step-function increases as flash sales launch simultaneously across thousands of stores.

Solution

Shopify developed a comprehensive Kubernetes scaling strategy combining predictive pre-scaling, aggressive HPA tuning, and custom pod pre-warming infrastructure. They built a custom pre-scaling controller that uses historical BFCM traffic data combined with real-time merchant flash sale schedules to pre-provision capacity 15 minutes before predicted traffic spikes. The HPA configuration uses custom metrics from their internal load balancer rather than CPU utilization, with extremely aggressive scaling parameters: a scale-up stabilization window of zero seconds, scale-up rate of 100 percent every 15 seconds, and a custom metric that combines requests-per-second with p99 latency. They built a pod pre-warming system where standby pods are maintained in a warm pool — these pods are fully booted with established database connections and warm JIT caches but are not registered with the load balancer. When the HPA triggers a scale-up, instead of launching cold pods, the system promotes pre-warmed pods from the warm pool into the active serving fleet in under 2 seconds. The warm pool is itself auto-replenished by a background controller that continuously boots new standby pods. They use Kubernetes pod topology spread constraints to ensure pods are distributed across availability zones and node groups to prevent correlated failures.
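The scale-up parameters described above map onto the `behavior.scaleUp` section of an `autoscaling/v2` HorizontalPodAutoscaler: `stabilizationWindowSeconds: 0` plus a `Percent` policy with `value: 100` and `periodSeconds: 15`. The Python sketch below shows roughly what that means for one scaling period. It applies the documented HPA ratio formula and then caps growth at the percentage policy. It is a simplification that ignores the HPA's tolerance band and its accounting across overlapping periods, and the function name and the `max_replicas` default are assumptions.

```python
import math


def desired_replicas(
    current: int,
    metric_value: float,
    target_value: float,
    max_percent_increase: int = 100,
    max_replicas: int = 10_000,
) -> int:
    """Replica count after one scaling period.

    Uses desired = ceil(current * metric / target), then limits growth to
    `max_percent_increase` percent of the current count for this period.
    """
    raw = math.ceil(current * metric_value / target_value)
    cap = current + math.ceil(current * max_percent_increase / 100)
    return max(1, min(raw, cap, max_replicas))
```

With 100 replicas and load at three times the target, the ratio formula asks for 300 replicas, but the 100 percent policy allows only 200 in the first 15-second period. This is why even an aggressive HPA needs several periods to absorb a step-function spike, and why pre-scaling and warm pools are layered on top.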

Outcome

BFCM 2024 was handled with zero downtime, with p99 latency staying under 200ms even during peak traffic spikes. Effective pod scale-up time fell from 60-90 seconds to under 2 seconds through pre-warming. Over $9.3 billion in total BFCM sales was processed successfully. Infrastructure cost during BFCM fell by 25 percent compared to the previous year, because better capacity planning and pre-warming proved more efficient than brute-force over-provisioning. Recovery from unexpected traffic spikes improved from minutes to single-digit seconds.

Scale

Over 2 million stores, 80+ million requests per minute at BFCM peak, 150,000+ orders per minute, thousands of Kubernetes pods scaling from baseline to peak capacity, multiple GCP regions with global load balancing.

Key Learnings

Reactive autoscaling is insufficient for flash-sale traffic patterns: predictive pre-scaling combined with warm pools is essential for responding to traffic spikes within seconds
HPA custom metrics based on actual request load outperform CPU-based metrics for web applications where CPU is not the binding constraint
Pod pre-warming transforms the cold-start problem from an application optimization challenge into an infrastructure scheduling problem that Kubernetes can solve
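The warm-pool mechanism behind these learnings can be modeled in a few lines of Python. This is a toy model, not Shopify's controller: the class name `WarmPool`, its methods, and the generated pod names are hypothetical, and "booting" a pod is reduced to appending a name to a queue. It shows the two halves of the design: promotion moves already-warm pods into service immediately, and a separate replenish step refills the pool in the background.

```python
from collections import deque


class WarmPool:
    """Toy model of a pool of pre-warmed standby pods."""

    def __init__(self, target_size: int) -> None:
        self.target_size = target_size
        self.standby: deque[str] = deque()  # booted, not receiving traffic
        self.serving: list[str] = []        # registered with the load balancer
        self._next_id = 0

    def replenish(self) -> int:
        """Boot standby pods until the pool is full; return how many were booted."""
        booted = 0
        while len(self.standby) < self.target_size:
            self.standby.append(f"pod-{self._next_id}")
            self._next_id += 1
            booted += 1
        return booted

    def promote(self, count: int) -> int:
        """Promote up to `count` warm pods; return the shortfall needing cold starts."""
        promoted = min(count, len(self.standby))
        for _ in range(promoted):
            self.serving.append(self.standby.popleft())
        return count - promoted
```

The return value of `promote` is the useful signal: a non-zero shortfall means the scale-up outran the pool and some pods will pay the full cold-start cost, which suggests the target pool size was too small for that spike.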