12 interview questions · azure, kubernetes
Quick Answer
The kube-scheduler uses a two-phase approach: first it filters out nodes that cannot run the Pod (predicates), then it scores the remaining feasible nodes using priority functions. The node with the highest aggregate score wins the binding.
Detailed Answer
Think of the Kubernetes scheduler like a hiring manager filling a position. First, you eliminate candidates who do not meet the minimum qualifications (no relevant degree, wrong location, missing certifications) -- that is the filtering phase. Then, among the qualified candidates, you rank them by how well they fit the role (years of experience, cultural fit, salary expectations) -- that is the scoring phase. The best-ranked candidate gets the offer, and in Kubernetes, the best-ranked node gets the Pod.
The kube-scheduler is a control-plane component that watches the API server for newly created Pods that have no node assignment (spec.nodeName is empty). When it detects an unscheduled Pod, it begins a scheduling cycle. The scheduler maintains an internal scheduling queue that orders Pods by priority and, within the same priority, by how long they have been queued. The placement decision happens in two distinct phases: filtering (formerly called predicates) and scoring (formerly called priorities).
During the filtering phase, the scheduler evaluates each node against a set of filter plugins. These include NodeResourcesFit (checking CPU and memory requests against allocatable capacity; the legacy predicate name was PodFitsResources), NodePorts (ensuring requested host ports are available; formerly PodFitsHostPorts), NodeAffinity (matching node labels against nodeSelector and required affinity rules), TaintToleration (verifying the Pod tolerates the node's NoSchedule and NoExecute taints), PodTopologySpread (enforcing hard topology spread constraints), and VolumeBinding (checking that required persistent volumes can be provisioned or are available on that node). Any node that fails even one filter is eliminated. If no node passes filtering, the Pod remains Pending, and the scheduler may attempt preemption: if evicting lower-priority Pods would make a node feasible, it nominates that node and evicts the victims.
In the scoring phase, each surviving node is evaluated by scoring plugins, each of which produces a score normalized to the range 0 to 100. The NodeResourcesBalancedAllocation plugin favors nodes that would have balanced CPU and memory use after placing the Pod. The ImageLocality plugin gives higher scores to nodes that already have the container image cached, reducing pull time. InterPodAffinity scores nodes based on whether co-locating the Pod with other Pods matches preferred affinity or anti-affinity rules. NodeResourcesFit scores with a configurable strategy: LeastAllocated prefers nodes with the most free resources, while MostAllocated does the opposite for bin-packing. Each plugin score is multiplied by a configurable weight, and the weighted scores are summed. The node with the highest total score is selected (ties are broken at random), and the scheduler creates a Binding object to assign the Pod to that node.
At production scale with thousands of nodes, the scheduler uses the percentageOfNodesToScore parameter to avoid evaluating every single node, which would be too slow: it stops looking for feasible nodes once it has found enough candidates. The default is adaptive, roughly 50% for a 100-node cluster and 10% for a 5000-node cluster, with a floor of 5%. The scheduler also supports scheduling profiles, which let one scheduler binary expose several plugin configurations under different scheduler names. The scheduling framework has extension points like PreFilter, Filter, PostFilter, PreScore, Score, Reserve, Permit, PreBind, Bind, and PostBind, making it highly extensible.
A non-obvious gotcha is that the scheduler makes decisions based on a snapshot of the cluster state, which can become stale in highly dynamic environments. The default scheduler places Pods one at a time and accounts for its own in-flight placements, but if a second scheduler (or anything else that sets spec.nodeName) targets the same node at the same moment, one of the Pods can be rejected by the kubelet because the resources were already consumed. Additionally, the percentageOfNodesToScore optimization means the scheduler might not always find the globally optimal node -- it finds a good-enough node quickly. Resource requests (not limits) drive filtering, so Pods without requests are treated as requesting zero resources, which can lead to node overcommitment.
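DaemonSet Pods are a special case: since Kubernetes 1.12 they are placed by the default scheduler rather than bound directly by the DaemonSet controller. The controller creates each Pod with a node affinity term that pins it to its target node, and the scheduler performs the actual binding.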
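The two phases described above can be sketched in a few lines of Python. This is a toy model, not the kube-scheduler implementation: the Node and Pod classes, the two scoring functions, and the weights are all simplified assumptions made for illustration. The sampling function follows the adaptive default the Kubernetes documentation describes for percentageOfNodesToScore (about 50% at 100 nodes, 10% at 5000 nodes, never below 5%).

```python
from dataclasses import dataclass, field


@dataclass
class Node:  # hypothetical, simplified view of a node
    name: str
    alloc_cpu_m: int          # allocatable CPU in millicores
    alloc_mem_mi: int         # allocatable memory in MiB
    used_cpu_m: int = 0       # sum of CPU requests already placed here
    used_mem_mi: int = 0      # sum of memory requests already placed here
    images: set = field(default_factory=set)


@dataclass
class Pod:  # hypothetical, simplified view of a Pod's requests
    cpu_m: int
    mem_mi: int
    image: str


def num_feasible_nodes_to_find(total_nodes: int) -> int:
    """How many feasible nodes to collect before scoring starts."""
    if total_nodes < 100:
        return total_nodes
    percent = max(5, 50 - total_nodes // 125)
    return max(100, total_nodes * percent // 100)


def fits_resources(pod: Pod, node: Node) -> bool:
    """Filter: requests must fit into what is left of allocatable capacity."""
    return (node.used_cpu_m + pod.cpu_m <= node.alloc_cpu_m
            and node.used_mem_mi + pod.mem_mi <= node.alloc_mem_mi)


def least_allocated(pod: Pod, node: Node) -> float:
    """Score 0-100: more free capacity after placement scores higher."""
    cpu_free = (node.alloc_cpu_m - node.used_cpu_m - pod.cpu_m) / node.alloc_cpu_m
    mem_free = (node.alloc_mem_mi - node.used_mem_mi - pod.mem_mi) / node.alloc_mem_mi
    return 100 * (cpu_free + mem_free) / 2


def image_locality(pod: Pod, node: Node) -> float:
    """Score 0-100: simplified to all-or-nothing on a cached image."""
    return 100.0 if pod.image in node.images else 0.0


DEFAULT_WEIGHTS = {least_allocated: 1, image_locality: 1}  # assumed weights


def schedule(pod, nodes, weights=DEFAULT_WEIGHTS):
    """Filter, then pick the node with the highest weighted score sum."""
    limit = num_feasible_nodes_to_find(len(nodes))
    feasible = []
    for node in nodes:
        if fits_resources(pod, node):
            feasible.append(node)
            if len(feasible) >= limit:
                break  # enough candidates, stop filtering early
    if not feasible:
        return None  # the Pod would stay Pending
    return max(feasible,
               key=lambda n: sum(w * plugin(pod, n) for plugin, w in weights.items()))
```

Changing the weights changes the winner: with image locality weighted at zero, the emptiest node wins even if another node already has the image cached.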
Code Example
# Custom scheduler profile with adjusted score-plugin weights
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: payments-scheduler # Custom scheduler name for payments workloads
    plugins:
      score:
        enabled:
          - name: NodeResourcesBalancedAllocation # Prefer nodes with balanced CPU/memory
            weight: 2 # Double the weight for balanced allocation
          - name: ImageLocality # Prefer nodes that already have the image cached
            weight: 1 # Standard weight for image locality
    pluginConfig:
      - name: NodeResourcesFit # In the v1 API the strategy is an argument, not a separate plugin
        args:
          scoringStrategy:
            type: LeastAllocated # Spread load; MostAllocated would bin-pack instead
      - name: PodTopologySpread # Configure default topology spread constraints
        args:
          defaultingType: List # Use list-based defaulting
          defaultConstraints: # Spread across zones by default
            - maxSkew: 1 # Allow at most 1 Pod difference between zones
              topologyKey: topology.kubernetes.io/zone # Spread across AZs
              whenUnsatisfiable: ScheduleAnyway # Soft constraint - still schedule if skew exceeded
---
# Pod with resource requests that drive scheduling decisions
apiVersion: v1
kind: Pod
metadata:
  name: payments-api-7f8d9c # Realistic Pod name with hash suffix
  namespace: payments # Namespace for the payments service
  labels:
    app: payments-api # Label for service discovery and for the spread selector below
    tier: backend # Tier label (not used by the spread selector)
spec:
  schedulerName: payments-scheduler # Use the custom scheduler profile defined above
  topologySpreadConstraints: # Spread Pods across zones for HA
    - maxSkew: 1 # Maximum difference in Pod count between zones
      topologyKey: topology.kubernetes.io/zone # Spread across AZs
      whenUnsatisfiable: DoNotSchedule # Hard constraint - block if cannot satisfy
      labelSelector: # Match Pods with the same app label
        matchLabels:
          app: payments-api # Select all payments-api Pods
  containers:
    - name: payments-api # Main container name
      image: registry.internal.io/payments-api:v2.4.1 # Internal registry image
      resources:
        requests: # These values drive the scheduler filtering phase
          cpu: 500m # Request half a CPU core
          memory: 512Mi # Request 512 MiB of memory
        limits: # Limits enforce runtime cgroups constraints
          cpu: "1" # Limit to 1 full CPU core
          memory: 1Gi # Limit to 1 GiB of memory
◈ Architecture Diagram
┌───────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│Unscheduled│    │  Filter  │    │  Score   │    │   Bind   │
│    Pod    │───→│  Phase   │───→│  Phase   │───→│ to Node  │
└───────────┘    └──────────┘    └──────────┘    └──────────┘
                      │               │
                      ↓               ↓
                 ┌──────────┐    ┌──────────┐
                 │Eliminate │    │ Rank by  │
                 │Infeasible│    │ Weighted │
                 │  Nodes   │    │  Scores  │
                 └──────────┘    └──────────┘
Quick Answer
When etcd loses quorum (majority of members are down), the cluster becomes read-only and cannot process writes, meaning no new Pods can be scheduled and no state changes can be persisted. Recovery involves either restoring enough members to regain quorum or rebuilding from a snapshot backup.
Detailed Answer
Imagine a board of directors that requires a majority vote to approve any decision. If the company has five board members and three resign suddenly, the remaining two cannot approve anything -- even if they agree -- because they lack the required majority. The company is paralyzed: no new hires, no budget changes, nothing. That is what happens when etcd loses quorum: the remaining members know the current state but cannot authorize any changes.
In Kubernetes, etcd is the single source of truth for all cluster state -- every Pod definition, Service, ConfigMap, Secret, and controller state lives in etcd, and the kube-apiserver is the only component that reads from and writes to it. Etcd uses the Raft consensus algorithm, which requires a strict majority of members, floor(N/2) + 1, to agree on every write. For a 3-member etcd cluster, quorum requires 2 members; for 5 members, it requires 3. When quorum is lost, etcd rejects all writes and all linearizable reads; the surviving members can still answer serializable reads, which may be stale.
The failure then propagates quickly. The kube-apiserver begins returning errors for any mutating request (POST, PUT, PATCH, DELETE) because etcd refuses writes. The kube-controller-manager and the scheduler hold their leader-election leases through the apiserver and can no longer renew them. The scheduler cannot bind Pods to nodes. Existing workloads continue running because kubelets keep their Pod specs locally and container runtimes are independent of the control plane. However, no new Pods can be created, no scaling can occur, node heartbeats cannot be recorded (so nodes can be flagged NotReady once writes resume), and self-healing stops entirely. The cluster is alive but brain-dead.
Recovery depends on the failure scenario. If etcd members are down due to transient issues (network partition, disk pressure, or crashed processes), the fastest path is to bring enough members back online to restore quorum. Check each member with etcdctl endpoint status and etcdctl member list. If a member's data is corrupted, remove it from the cluster with etcdctl member remove, then re-add it as a new member with etcdctl member add and let it rejoin and replicate; these membership changes are themselves writes, so they only work while the cluster still has quorum. For catastrophic failure where all members are lost, you must restore from an etcd snapshot. Take regular snapshots with etcdctl snapshot save, then restore with etcdctl snapshot restore (etcdutl snapshot restore in etcd 3.5 and later) to a new data directory on each member, setting the initial-cluster and initial-advertise-peer-urls flags for that member. After restoration, restart etcd and verify the kube-apiserver reconnects.
In production, etcd failures at scale are often caused by slow disks, a database bloated by too many Kubernetes objects, or poorly tuned compaction. The write-ahead log (WAL) is sensitive to disk latency; the etcd documentation recommends keeping p99 WAL fsync latency below 10ms, which in practice means dedicated SSDs. A non-obvious gotcha is that etcd v3 has a default storage quota of 2GB (8GB is the suggested maximum), and if the database exceeds the quota, etcd raises a NOSPACE alarm and accepts only reads and deletes, which from the Kubernetes side looks much like quorum loss. Another trap: during recovery, members that are restored from different snapshots or started with stale peer URLs can fail to form a cluster or diverge from each other. Always restore all members from the same snapshot and use a fresh cluster token to prevent old members from rejoining.
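The quorum arithmetic above is easy to check with a short sketch. The function names here are illustrative and not part of any etcd API; the only fact used is the Raft majority rule, floor(N/2) + 1.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while writes still succeed."""
    return members - quorum(members)


def can_write(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        print(f"{n} members: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

The output shows why etcd clusters are sized with odd member counts: going from 3 to 4 members raises the quorum from 2 to 3 but still tolerates only one failure.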
Code Example
# Check etcd cluster health and member status
# Flags: endpoint to query, CA certificate, client certificate and client key for TLS auth
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster

# List all etcd members and their status (connect to a surviving member, table output)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  member list -w table

# Create a timestamped snapshot backup (run this as a CronJob in production)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Restore from snapshot on each etcd member (shown for etcd-0)
# --name and --initial-advertise-peer-urls are this member's own values,
# --initial-cluster lists all members, the new token keeps old members from rejoining,
# and --data-dir points at a fresh directory to avoid conflicts.
# On etcd 3.5+ the same restore flags are accepted by: etcdutl snapshot restore
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot-20260615-030000.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.1.10:2380,etcd-1=https://10.0.1.11:2380,etcd-2=https://10.0.1.12:2380 \
  --initial-cluster-token=etcd-cluster-recovery-1 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380 \
  --data-dir=/var/lib/etcd-restored
◈ Architecture Diagram
┌──────────┐    ┌──────────┐    ┌──────────┐
│  etcd-0  │    │  etcd-1  │    │  etcd-2  │
│ HEALTHY  │    │   DOWN   │    │   DOWN   │
└────┬─────┘    └──────────┘    └──────────┘
     │
     ↓
┌──────────┐    ┌──────────┐
│  Quorum  │───→│ API Srvr │
│   LOST   │    │   Read   │
│ No Write │    │   Only   │
└──────────┘    └────┬─────┘
                     │
          ┌──────────┴──────────┐
          ↓                     ↓
     ┌──────────┐          ┌──────────┐
     │Scheduler │          │Controller│
     │ Blocked  │          │ Blocked  │
     └──────────┘          └──────────┘
Quick Answer
Architects tune etcd by sizing disks for low-latency IOPS, adjusting compaction and defragmentation schedules, monitoring database size and peer latency, and separating the events store. Sharding the main etcd or using virtual clusters becomes necessary when a single etcd instance approaches the 8 GB suggested maximum (or, as a rough rule of thumb, 30,000-40,000 objects) and API server latency degrades.
Detailed Answer
Think of a library card catalog. When the library has a few thousand books, one cabinet handles lookups fine. But when the library grows to millions of books and hundreds of librarians are searching simultaneously, you either need a faster cabinet, multiple cabinets organized by subject, or a way to archive old cards. Etcd is that card catalog for Kubernetes: every resource definition, status update, and event is a card in the catalog.
Etcd is the sole persistent store for Kubernetes cluster state. Every API server read and write flows through etcd, making its performance the ceiling for cluster responsiveness. For large clusters (those with tens of thousands of Pods, thousands of Services, or high churn from controllers and operators), etcd becomes the bottleneck before CPU, memory, or network do. The key metrics are WAL fsync latency (which depends on disk IOPS), database size, number of keys, leader election frequency, and peer round-trip time between etcd members.
Internally, etcd uses a B-tree index with multi-version concurrency control (MVCC), keeping every revision of every key until compacted. Compaction removes old revisions, and defragmentation reclaims disk space after compaction. Without regular compaction, the database grows without bound. The kube-apiserver requests compaction every five minutes by default, but operators must also schedule defragmentation because compaction alone does not return space to the filesystem. On cloud providers, using SSD volumes with provisioned IOPS (like gp3 with 6000+ IOPS on AWS) is critical because etcd performance degrades sharply when fsync latency exceeds 10 milliseconds.
At production scale, the first architectural decision is separating the events store. Kubernetes Events are high-volume, short-lived objects that create write pressure without carrying critical state. Running a dedicated etcd instance for Events significantly reduces load on the main etcd cluster. AWS EKS offers provisioned control plane tiers (XL, 2XL, 4XL) that scale etcd database limits up to 16 GB for clusters running AI and ML workloads with many custom resources. When even separated events and tuned compaction are insufficient, the next scaling levers are true etcd sharding (distributing different resource types to separate etcd clusters) or virtual clusters that maintain independent etcd instances per tenant.
The non-obvious gotcha is that etcd performance problems often manifest as API server timeouts or slow kubectl responses, and teams blame the API server rather than looking at etcd disk latency. A single slow etcd member in a three-node cluster can drag down the entire quorum because the leader waits for a majority of members to acknowledge writes. Architects should alert on p99 fsync duration, database size approaching 8 GB, and any leader changes, because a leader election storm during high write load can cascade into control-plane unavailability.
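To make the alerting advice concrete, here is a rough sketch of estimating p99 fsync latency from the cumulative buckets of a Prometheus histogram such as etcd_disk_wal_fsync_duration_seconds. The interpolation mirrors the general idea behind PromQL's histogram_quantile but is a simplified re-implementation, and the alert thresholds (10 ms, 80% of an assumed 8 GiB quota) are illustrative assumptions rather than official defaults.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound_seconds, cumulative_count) pairs.

    The list must include the +Inf bucket, whose count is the total.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                return prev_bound  # cannot interpolate, fall back to the lower edge
            # Linear interpolation inside the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


def etcd_alerts(p99_fsync_s, db_size_bytes, quota_bytes=8 * 1024**3):
    """Return alert messages; thresholds are assumptions for illustration."""
    alerts = []
    if p99_fsync_s > 0.010:
        alerts.append("WAL fsync p99 above 10 ms")
    if db_size_bytes > 0.8 * quota_bytes:
        alerts.append("database above 80% of quota")
    return alerts


if __name__ == "__main__":
    sample = [(0.001, 0), (0.002, 50), (0.004, 90), (0.008, 98), (0.016, 100), (math.inf, 100)]
    p99 = histogram_quantile(0.99, sample)
    print(p99, etcd_alerts(p99, 7 * 1024**3))
```

In a real setup the bucket counts would come from the etcd /metrics endpoint, and rates over a time window would be used rather than raw cumulative counts.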
Code Example
# Check etcd database size, leader status, and raft index for a member
ETCDCTL_API=3 etcdctl endpoint status --endpoints=https://etcd-0.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table

# Monitor the fsync latency histogram from Prometheus metrics
# (the client port requires the same TLS client credentials as etcdctl)
curl -s --cacert /etc/kubernetes/pki/etcd/ca.crt \
  --cert /etc/kubernetes/pki/etcd/server.crt \
  --key /etc/kubernetes/pki/etcd/server.key \
  https://etcd-0.etcd.kube-system:2379/metrics | grep etcd_disk_wal_fsync_duration_seconds

# Trigger a manual defragmentation on a specific member during a maintenance window
ETCDCTL_API=3 etcdctl defrag --endpoints=https://etcd-1.etcd.kube-system:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Configure API server to use a separate etcd instance for Event objects
# In kube-apiserver manifest or startup flags:
# --etcd-servers=https://etcd-main:2379 # Main etcd for all resources
# --etcd-servers-overrides=/events#https://etcd-events:2379 # Separate etcd for Events

# Check Kubernetes object counts by resource type to identify growth
kubectl get --raw='/metrics' | grep apiserver_storage_objects | sort -t' ' -k2 -rn | head -20
◈ Architecture Diagram
    ┌──────────┐
    │API Server│
    └─┬──────┬─┘
      │      │
      ↓      ↓
   ┌─────┐ ┌─────────┐
   │Main │ │Events   │
   │etcd │ │etcd     │
   │< 8GB│ │separate │
   └──┬──┘ └─────────┘
      │
   ┌──┴───────────┐
   │Compact+Defrag│
   └──────────────┘
Quick Answer
Topology spread constraints tell the scheduler to distribute Pods across failure domains defined by node labels such as zone or hostname, using maxSkew to control imbalance. When combined with cluster autoscaling, problems arise if a zone has zero nodes — the autoscaler may not know about the zone, causing the scheduler to leave Pods pending indefinitely.
Detailed Answer
Think of seating guests at a wedding reception. You want to spread friends evenly across tables so no table is overcrowded and no group is isolated. The wedding planner checks how many people are at each table and seats the next guest at the emptiest one, but if a table does not exist yet (no physical table has been set up), the planner cannot seat anyone there even if the venue has room. Topology spread constraints in Kubernetes work the same way.
Kubernetes topology spread constraints are declared in the Pod spec under topologySpreadConstraints. Each constraint specifies a topologyKey (a node label like topology.kubernetes.io/zone or kubernetes.io/hostname), a maxSkew (the maximum allowed difference in Pod count between the most-populated and least-populated domain), a whenUnsatisfiable behavior (DoNotSchedule or ScheduleAnyway), and a labelSelector to identify which Pods count toward the spread calculation.
Internally, the scheduler evaluates topology spread during the Filter and Score phases. In the Filter phase, it eliminates nodes where placing the Pod would violate the maxSkew when whenUnsatisfiable is DoNotSchedule. In the Score phase, it ranks remaining nodes by how well they balance the distribution. The scheduler uses the topologyKey label on existing nodes to define domains, so a domain only exists if at least one node carries that label value. It then counts matching Pods per domain and checks, for each domain, whether the Pod count after placement minus the smallest count in any domain stays within maxSkew.
At production scale, the interaction with cluster autoscaling creates subtle failures. If a node pool in one availability zone scales to zero, that zone disappears from the scheduler's topology map. The scheduler only sees zones with active nodes, so it may consider a two-zone spread sufficient even when three zones are available. When maxSkew is 1 and whenUnsatisfiable is DoNotSchedule, the scheduler can leave Pods pending because it cannot place them in a zone that has no nodes, and the autoscaler may not create a node in the missing zone because it does not see pending Pods that specifically require it. This chicken-and-egg problem is one of the most common production issues with topology spread constraints.
The non-obvious gotcha is that topology spread constraints count every Pod that matches the labelSelector, whether or not it is ready (Pods that are already terminating are generally skipped). During a rolling update, Pods from the old ReplicaSet usually match the same selector and still count toward the spread calculation, which can leave new Pods unschedulable or unevenly placed until the old ones are gone; adding pod-template-hash through matchLabelKeys restricts the count to the current revision. Architects should set minDomains to explicitly declare how many zones the spread should consider (when fewer domains exist, the scheduler treats the global minimum as zero), use node affinity in combination with spread constraints to ensure the autoscaler knows about expected zones, and monitor for unschedulable Pods with topology spread violation events.
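The skew check and the effect of minDomains can be illustrated with a small sketch. This models only the DoNotSchedule filter rule as the Kubernetes documentation states it (count after placement minus the global minimum must not exceed maxSkew, and the global minimum is treated as 0 when fewer than minDomains domains exist); the function name and data shapes are assumptions for illustration.

```python
def allowed_domains(pod_counts, max_skew, min_domains=None):
    """Return the domains where one more matching Pod may be placed.

    pod_counts maps each *visible* domain (a zone with at least one node)
    to its current number of matching Pods.
    """
    if not pod_counts:
        return []
    global_min = min(pod_counts.values())
    if min_domains is not None and len(pod_counts) < min_domains:
        global_min = 0  # missing domains are treated as empty
    return sorted(
        domain
        for domain, count in pod_counts.items()
        if count + 1 - global_min <= max_skew
    )


if __name__ == "__main__":
    # Zone c has scaled to zero, so the scheduler only sees zones a and b.
    visible = {"a": 2, "b": 2}
    print(allowed_domains(visible, max_skew=1))                 # ['a', 'b']
    print(allowed_domains(visible, max_skew=1, min_domains=3))  # []
```

Without minDomains the two visible zones keep absorbing Pods. With minDomains set to 3 the Pod stays Pending instead, which is the signal a cluster autoscaler needs to bring up a node in the missing zone.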
Code Example
# Apply a Deployment with zone and node spread constraints
apiVersion: apps/v1 # Stable Deployment API
kind: Deployment # Manages replicated Pods
metadata:
  name: checkout-api # Production checkout service
  namespace: payments # Team namespace
spec:
  replicas: 6 # Six replicas to spread across three zones with two per zone
  selector:
    matchLabels:
      app: checkout-api # Pod selector
  template:
    metadata:
      labels:
        app: checkout-api # Label used by spread constraint selector
    spec:
      topologySpreadConstraints:
        - maxSkew: 1 # Allows at most one Pod difference between zones
          topologyKey: topology.kubernetes.io/zone # Spreads across availability zones
          whenUnsatisfiable: DoNotSchedule # Strictly enforces zone balance
          labelSelector:
            matchLabels:
              app: checkout-api # Counts only checkout-api Pods
          minDomains: 3 # Expects three zones even if some have zero nodes
        - maxSkew: 1 # Allows at most one Pod difference between nodes
          topologyKey: kubernetes.io/hostname # Spreads across individual nodes
          whenUnsatisfiable: ScheduleAnyway # Prefers balance but allows imbalance
          labelSelector:
            matchLabels:
              app: checkout-api # Counts only checkout-api Pods
      containers:
        - name: api # Application container
          image: registry.company.com/checkout-api:3.7.2 # Versioned production image
          resources:
            requests:
              cpu: 250m # Minimum CPU for scheduling
              memory: 512Mi # Minimum memory for scheduling
# Check which node each Pod landed on
kubectl get pods -n payments -l app=checkout-api -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'
# Map nodes to zones (the zone label lives on Nodes, not on Pods)
kubectl get nodes -L topology.kubernetes.io/zone
# Identify Pods pending due to topology spread violations
kubectl get events -n payments --field-selector reason=FailedScheduling | grep topology
◈ Architecture Diagram
┌─── Zone A ──┐ ┌─── Zone B ──┐ ┌─── Zone C ──┐
│ ┌────┐┌────┐│ │ ┌────┐┌────┐│ │ ┌────┐┌────┐│
│ │Pod1││Pod2││ │ │Pod3││Pod4││ │ │Pod5││Pod6││
│ └────┘└────┘│ │ └────┘└────┘│ │ └────┘└────┘│
│ maxSkew=1   │ │ maxSkew=1   │ │ maxSkew=1   │
└─────────────┘ └─────────────┘ └─────────────┘
Quick Answer
Scheduler plugins hook into the scheduling framework's extension points (PreFilter, Filter, PreScore, Score, Reserve, Permit, PreBind, Bind) to add custom logic like gang scheduling, co-scheduling, or capacity reservation. Scheduling profiles allow running multiple schedulers with different plugin configurations. Risks include increased scheduling latency, unintended Pod starvation, and complex debugging when plugins interact.
Detailed Answer
Think of a wedding seating planner with very specific rules. The basic planner checks table capacity and guest preferences. But this wedding also requires that certain groups of guests must all be seated simultaneously (gang scheduling), some tables are reserved for VIPs until the last minute (capacity reservation), and guests from rival families must never share an aisle (anti-affinity). Standard rules cannot express all of this, so the planner adds specialized checkers at different stages of the seating process. Kubernetes scheduler plugins work the same way.
The Kubernetes scheduling framework replaced the old policy-based scheduler configuration with a plugin architecture. The scheduler processes each Pod through a pipeline of extension points: PreFilter (validate and preprocess), Filter (eliminate ineligible nodes), PostFilter (handle unschedulable Pods, typically by preemption), PreScore (prepare scoring data), Score (rank eligible nodes), Reserve (tentatively claim resources), Permit (wait or approve), PreBind (prepare external resources), and Bind (commit the Pod to a node). Each extension point can have multiple plugins that run in order.
Scheduling profiles allow a single kube-scheduler binary to expose multiple scheduler personalities. Each profile has a name and its own set of enabled, disabled, and configured plugins. A Pod selects its scheduler by setting spec.schedulerName. This means architects can run a default profile for general workloads and a specialized profile for GPU workloads, batch jobs, or latency-sensitive services without deploying separate scheduler binaries. The scheduler-plugins project from Kubernetes SIGs provides production-grade plugins like Coscheduling (gang scheduling for batch workloads that need all Pods scheduled together), Capacity Scheduling (enforcing elastic quotas across namespaces), and Trimaran (load-aware scoring based on measured node utilization from a metrics provider such as metrics-server or Prometheus).
At production scale, custom scheduler plugins require careful testing because they affect every Pod placement decision made by their profile. A slow PreFilter or Score plugin increases scheduling latency for all Pods using that profile. A buggy Filter plugin can make nodes ineligible when they should be available, causing Pods to remain pending. Plugin ordering matters because a rejection from an earlier plugin short-circuits the plugins after it. Architects should measure scheduler latency percentiles (the scheduler's duration histograms, such as scheduler_scheduling_attempt_duration_seconds), unschedulable Pod counts, and plugin-specific metrics before and after enabling custom plugins.
The non-obvious gotcha is debugging scheduling failures with custom plugins. When a Pod is unschedulable, the scheduler event reports the reasons returned by the plugins that rejected it, but the interaction between multiple plugins can create emergent behavior that is hard to trace. For example, a topology spread constraint combined with a capacity reservation plugin can create scenarios where Pods are pending not because of resource shortage but because the combination of constraints has no feasible solution. Architects should use the scheduler's verbose logging, the scheduling-queue metrics, and dry-run scheduling tools to validate plugin interactions before production deployment.
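The Permit extension point is what makes gang scheduling possible: a plugin can hold a Pod in a waiting state until the rest of its group is ready. The sketch below is a toy model of that idea only. It is not the Coscheduling plugin's implementation; the class, method names, and return values are hypothetical.

```python
from collections import defaultdict


class GangPermit:
    """Toy Permit-style gate: hold Pods until the whole gang has arrived."""

    def __init__(self, min_available):
        # min_available maps a group name to the number of Pods it needs
        self.min_available = dict(min_available)
        self.waiting = defaultdict(list)

    def permit(self, pod, group):
        """Return ("Wait", []) or ("Allow", [all Pods released together])."""
        self.waiting[group].append(pod)
        if len(self.waiting[group]) >= self.min_available[group]:
            released = self.waiting.pop(group)
            return "Allow", released
        return "Wait", []

    def timeout(self, group):
        """On timeout, reject everything still waiting so resources are freed."""
        return self.waiting.pop(group, [])


if __name__ == "__main__":
    gate = GangPermit({"fraud-training-gang": 3})
    for worker in ("worker-0", "worker-1", "worker-2"):
        print(worker, gate.permit(worker, "fraud-training-gang"))
```

The first two workers wait and the third releases all three at once. The timeout path matters in practice: without it, a gang that can never be completed would hold reserved capacity indefinitely, which is one way custom plugins cause the Pod starvation mentioned above.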
Code Example
# KubeSchedulerConfiguration with two profiles: default and batch-coscheduling
# Note: Coscheduling is not in the stock kube-scheduler; this assumes a scheduler
# binary built from the kubernetes-sigs/scheduler-plugins project.
apiVersion: kubescheduler.config.k8s.io/v1 # Scheduler configuration API
kind: KubeSchedulerConfiguration # Configures the kube-scheduler binary
profiles:
  - schedulerName: default-scheduler # Default profile for general workloads
    plugins:
      queueSort: # All profiles share one queue, so they must use the same queueSort plugin
        enabled:
          - name: Coscheduling
        disabled:
          - name: "*"
      score:
        enabled:
          - name: NodeResourcesFit # Scores nodes by resource availability
            weight: 1 # Standard weight
          - name: InterPodAffinity # Scores based on Pod affinity preferences
            weight: 1 # Standard weight
  - schedulerName: batch-scheduler # Specialized profile for ML training jobs
    plugins:
      queueSort:
        enabled:
          - name: Coscheduling # Sorts Pods so gang members are scheduled together
        disabled:
          - name: "*"
      preFilter:
        enabled:
          - name: Coscheduling # Validates that all gang members exist
      postFilter:
        enabled:
          - name: Coscheduling # Rejects the rest of a gang that cannot be placed
      permit:
        enabled:
          - name: Coscheduling # Holds Pods until all gang members are schedulable
      reserve:
        enabled:
          - name: Coscheduling # Tracks reservations for the complete gang
# A batch training job that uses gang scheduling via the batch-scheduler profile
# The pod-group.scheduling.sigs.k8s.io/* labels follow the older label-based convention;
# newer scheduler-plugins releases use a PodGroup custom resource instead.
apiVersion: batch/v1 # Standard Job API
kind: Job # Batch workload requiring all workers to start together
metadata:
  name: fraud-model-training # Distributed training job
  namespace: ml-platform # ML team namespace
spec:
  parallelism: 4 # Four parallel training workers
  completions: 4 # Job completes when all four finish
  template:
    metadata:
      labels: # The scheduler reads labels on the Pods, so both go on the template
        pod-group.scheduling.sigs.k8s.io/name: fraud-training-gang # Gang scheduling group name
        pod-group.scheduling.sigs.k8s.io/min-available: "4" # All four workers must be scheduled
    spec:
      schedulerName: batch-scheduler # Uses the coscheduling profile
      restartPolicy: Never # Jobs require Never or OnFailure
      containers:
        - name: trainer # Distributed training worker container
          image: registry.company.com/fraud-trainer:4.3.1 # ML training image
          resources:
            requests:
              cpu: "4" # Four CPU cores per worker
              memory: 16Gi # 16 GiB memory per worker
# Monitor scheduler performance metrics for the batch profile
# Scheduler metrics are served by kube-scheduler itself (port 10259), not by the API server.
# Run on a control-plane node; $TOKEN is a placeholder for a token allowed to read /metrics.
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10259/metrics | grep scheduling_attempt_duration_seconds | grep batch-scheduler
◈ Architecture Diagram
┌──────────────────────────────────┐
│ Scheduling Pipeline              │
│                                  │
│ PreFilter → Filter → PostFilter  │
│      ↓                           │
│ PreScore → Score → Reserve       │
│      ↓                           │
│ Permit → PreBind → Bind          │
│                                  │
│ ┌──────────┐ ┌────────────────┐  │
│ │ Default  │ │ Batch Profile  │  │
│ │ Profile  │ │ +Coscheduling  │  │
│ └──────────┘ └────────────────┘  │
└──────────────────────────────────┘
Quick Answer
The control plane consists of the API Server (REST frontend that validates all requests), etcd (distributed key-value store holding all cluster state), Scheduler (assigns Pods to nodes based on constraints), and Controller Manager (runs reconciliation loops that drive actual state toward desired state).
Detailed Answer
Think of the control plane like the management floor of a large warehouse. The API Server is the front desk receptionist who handles every single request coming in, whether it's a customer placing an order, a supervisor checking inventory, or a new employee asking for directions. Every interaction with the warehouse goes through this one desk, no exceptions. etcd is the filing cabinet behind the desk that holds the single source of truth: every order, every employee record, every inventory count. The Scheduler is the floor manager who decides which aisle worker handles which incoming package based on who has capacity. The Controller Manager is the quality inspector who constantly walks the floor comparing what should be happening (orders to fulfill) with what is actually happening (packages on shelves), and files corrective actions when they don't match.

The API Server (kube-apiserver) is the only component that talks directly to etcd. Every kubectl command, every internal component communication, and every webhook goes through the API Server as HTTPS REST calls. It performs authentication (who are you?), authorization via RBAC (are you allowed to do this?), admission control (should this request be modified or rejected?), and validation (is this object well-formed according to its schema?) before persisting anything to etcd. It also serves the watch API, which lets other components subscribe to changes in real time rather than polling. This is how the entire system stays reactive.

etcd is a distributed, strongly consistent key-value store built on the Raft consensus algorithm. It stores every object in the cluster: every Pod spec, every Service definition, every Secret, every ConfigMap. In production, etcd runs as a 3- or 5-node cluster (always odd numbers for quorum) and is often the first component to cause cluster-wide outages when it becomes unhealthy. etcd performance directly determines API Server response time: slow disk I/O on etcd nodes is the number one silent killer of Kubernetes clusters. Production teams typically dedicate SSD-backed nodes exclusively to etcd and monitor fsync latency religiously.

The Scheduler (kube-scheduler) watches the API Server for newly created Pods that have no node assigned (spec.nodeName is empty). For each unscheduled Pod, it runs a two-phase algorithm: filtering (eliminate nodes that don't meet hard requirements like resource requests, nodeSelector, taints/tolerations, and required affinity rules) and scoring (rank remaining nodes by soft preferences like spreading Pods across failure domains, preferring nodes with the image already cached, or balancing resource utilization). The highest-scoring node wins, and the Scheduler writes the node assignment back to the API Server.

The Controller Manager (kube-controller-manager) is actually dozens of separate control loops compiled into a single binary for simplicity. Each controller watches a specific resource type and reconciles actual state with desired state. The ReplicaSet controller ensures the right number of Pods exist. The Deployment controller manages ReplicaSets during rollouts. The Node controller detects when nodes go offline. The Endpoint controller populates Service endpoints. The Job controller manages batch workloads. If the Controller Manager crashes, no reconciliation happens: Pods keep running but nothing self-heals until it's back.

A critical production gotcha: many teams monitor Pod health but forget to monitor control plane health. If the API Server is overloaded (common with too many custom controllers or misconfigured HPA polling intervals), the entire cluster becomes unresponsive: you can't deploy, can't scale, can't even see what's broken. Production clusters should have dedicated monitoring for API Server request latency (apiserver_request_duration_seconds), etcd fsync duration (etcd_disk_wal_fsync_duration_seconds), and scheduler queue depth (scheduler_pending_pods).
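The Scheduler's filter-then-score algorithm described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Node` and `Pod` classes, the node names, and the scoring weights are all hypothetical and far simpler than the real kube-scheduler plugins.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int    # unreserved CPU in millicores (hypothetical field)
    free_mem_mi: int   # unreserved memory in MiB (hypothetical field)
    taints: set = field(default_factory=set)
    cached_images: set = field(default_factory=set)


@dataclass
class Pod:
    image: str
    cpu_m: int
    mem_mi: int
    tolerations: set = field(default_factory=set)


def feasible(node, pod):
    """Filtering phase: every hard requirement must hold."""
    fits = node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi
    tolerated = node.taints <= pod.tolerations  # all taints are tolerated
    return fits and tolerated


def score(node, pod):
    """Scoring phase: soft preferences, higher is better (made-up weights)."""
    leftover_cpu = (node.free_cpu_m - pod.cpu_m) / 1000  # favour emptier nodes
    image_bonus = 5 if pod.image in node.cached_images else 0
    return leftover_cpu + image_bonus


def schedule(nodes, pod):
    candidates = [n for n in nodes if feasible(n, pod)]
    if not candidates:
        return None  # the Pod would stay Pending
    return max(candidates, key=lambda n: score(n, pod)).name


nodes = [
    Node("node-a", 4000, 8192, cached_images={"payments-api:1.0"}),
    Node("node-b", 6000, 8192),
    Node("node-c", 8000, 16384, taints={"gpu"}),  # filtered: taint not tolerated
    Node("node-d", 200, 512),                     # filtered: not enough CPU
]
pod = Pod("payments-api:1.0", cpu_m=500, mem_mi=256)
print(schedule(nodes, pod))  # node-a: the cached image outweighs node-b's spare CPU
```

Note how node-c has the most free capacity but never reaches the scoring phase: a single failed filter removes a node regardless of how well it would have scored.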
Code Example
# Check control plane component health
# (componentstatuses is deprecated since v1.19; prefer the health endpoints)
kubectl get componentstatuses
kubectl get --raw='/healthz?verbose'

# View control plane Pods (self-hosted clusters like kubeadm)
kubectl get pods -n kube-system -l tier=control-plane
# NAME                                READY   STATUS
# etcd-master-01                      1/1     Running
# kube-apiserver-master-01            1/1     Running
# kube-controller-manager-master-01   1/1     Running
# kube-scheduler-master-01            1/1     Running

# Check API Server response time (latency issues?)
kubectl get --raw='/metrics' | grep apiserver_request_duration

# Verify etcd cluster health
kubectl exec -n kube-system etcd-master-01 -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health

# Check scheduler is making decisions
kubectl get events --field-selector reason=Scheduled -A

# View controller-manager logs for reconciliation errors
kubectl logs -n kube-system kube-controller-manager-master-01 \
  --tail=50 | grep -i error

# Monitor API Server audit logs for troubleshooting
# (configured via --audit-policy-file on API Server; audit events only
# appear in the container log when --audit-log-path is set to "-")
kubectl logs -n kube-system kube-apiserver-master-01 \
  | grep payments-api
◈ Architecture Diagram
┌─────────────── Control Plane ──────────────────────────┐
│ │
│ ┌────────────────────────────────────────┐ │
│ │ kube-apiserver │ │
│ │ REST frontend + auth + admission │ │
│ └──────────┬────────────┬────────────────┘ │
│ │ │ │
│ watch │ │ read/write │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ │ │ etcd │ │
│ │ │ key-value │ │
│ │ │ cluster state│ │
│ │ └──────────────┘ │
│ │ │
│ ┌────────┴──────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────────────────┐ │
│ │kube-scheduler│ │ kube-controller-manager │ │
│ │ │ │ │ │
│ │ filter → │ │ ReplicaSet controller │ │
│ │ score → │ │ Deployment controller │ │
│ │ bind Pod to │ │ Node controller │ │
│ │ best node │ │ Endpoint controller │ │
│ └──────────────┘ └──────────────────────────┘ │
│ │
└────────────────────────────────────────────────────────┘
│
│ API Server communicates with
▼
┌─────── Worker Nodes ──────┐
│ kubelet │ kube-proxy │
└───────────────────────────┘
Quick Answer
Each worker node runs kubelet (agent that starts and monitors Pods), kube-proxy (manages networking rules for Service routing), and a container runtime (containerd or CRI-O that actually runs containers). The kubelet communicates with the API Server via HTTPS to receive Pod assignments and report node status.
Detailed Answer
Think of a worker node like a restaurant kitchen station. The kubelet is the line cook who receives tickets (Pod specs) from the head chef (API Server) and actually prepares the dishes (starts containers). The container runtime (containerd or CRI-O) is the set of pots, pans, and ovens that the cook uses: the actual tools that do the cooking. And kube-proxy is the waiter who knows which table ordered which dish, making sure the right plate gets to the right customer via the correct route. Without any one of these three, the kitchen cannot function.

The kubelet is the primary node agent. It registers the node with the API Server, then watches (via the API Server's watch mechanism) for PodSpecs assigned to its node. When a new Pod is scheduled to its node, the kubelet calls the container runtime through the Container Runtime Interface (CRI) to pull images and start containers. It then continuously monitors container health by executing liveness and readiness probes at configured intervals. If a liveness probe fails, the kubelet restarts the container. It reports Pod status and node conditions (memory pressure, disk pressure, PID pressure, network unavailable) back to the API Server every 10 seconds by default (the node-status-update-frequency). If the node controller sees no heartbeat for the node-monitor-grace-period (40 seconds by default in many releases), the node is marked NotReady.

The container runtime is the software that actually creates and runs containers using Linux kernel features like namespaces (isolation) and cgroups (resource limits). Kubernetes removed the built-in Docker integration (dockershim) in v1.24, so it now requires a CRI-compatible runtime. The two production choices are containerd (lightweight, used by EKS, GKE, and most managed platforms) and CRI-O (purpose-built for Kubernetes, used by OpenShift). The kubelet communicates with the runtime via a Unix socket using the gRPC-based CRI protocol. The runtime handles image pulling, container lifecycle, and log management.

kube-proxy runs on every node and implements the Service abstraction. When you create a Service, kube-proxy watches the API Server for Service and Endpoints (or EndpointSlice) objects, then programs the node's networking stack to route traffic correctly. In the default iptables mode, it creates iptables rules that perform DNAT (destination NAT) to translate the virtual Service IP to a real Pod IP, with random selection for load balancing. In IPVS mode (better for clusters with thousands of Services), it uses the Linux kernel's IPVS load balancer, which supports multiple algorithms (round-robin, least-connections, source-hash). kube-proxy does NOT proxy traffic through itself. It only configures networking rules; actual packets flow directly from source to destination Pod.

For the normal control flow, communication between worker nodes and the control plane is initiated from the node side: the kubelet opens the connection to the API Server, not the reverse. This suits a security design in which worker nodes can sit in less trusted networks. The kubelet establishes a persistent watch connection to the API Server, which means it receives updates the instant they happen (Pod scheduled, Pod deleted, ConfigMap changed) without polling. The exception is features like `kubectl exec` and `kubectl logs`, where the API Server connects to the kubelet's HTTPS endpoint (port 10250), which is why the kubelet has its own TLS certificate.

A common production gotcha: if the container runtime's socket becomes unresponsive (containerd hung, disk full preventing image pulls), the kubelet cannot start new Pods or report accurate status. The node might still show Ready for a while because the kubelet process itself is fine, but Pods scheduled there will be stuck in ContainerCreating. Monitoring containerd/CRI-O process health separately from kubelet health is essential for catching this early.
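The "random selection" in iptables mode is worth a small worked example. kube-proxy writes one rule per endpoint and the kernel evaluates them in order, so each rule carries a probability of 1/n, 1/(n-1), and so on, with the last rule always matching. The sketch below (function names are hypothetical, and it models only the arithmetic, not real iptables rules) checks that this chain gives every endpoint an equal share.

```python
from fractions import Fraction


def rule_probabilities(n):
    """Probability on each of n sequential rules: 1/n, 1/(n-1), ..., 1."""
    return [Fraction(1, n - i) for i in range(n)]


def effective_share(n):
    """Chance that a new connection is sent to each endpoint."""
    shares = []
    remaining = Fraction(1)  # probability that no earlier rule has matched
    for p in rule_probabilities(n):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


print(rule_probabilities(3))  # per-rule probabilities: 1/3, 1/2, 1
print(effective_share(3))     # every endpoint ends up with 1/3
```

The per-rule probabilities look uneven, but because a rule is only reached when all earlier rules declined, the resulting distribution across endpoints is uniform.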
Code Example
# Check node status and conditions
kubectl get nodes -o wide
kubectl describe node worker-node-01
# Look for: Conditions section (MemoryPressure, DiskPressure, PIDPressure)

# View kubelet logs on the node (SSH required)
sudo journalctl -u kubelet --since "10 minutes ago" | tail -50

# Check kubelet's view of Pods on this node
kubectl get pods --field-selector spec.nodeName=worker-node-01 -A

# Verify container runtime is healthy
sudo crictl info   # Runtime status
sudo crictl ps     # Running containers
sudo crictl pods   # Running Pod sandboxes

# Check kube-proxy is programming iptables rules
sudo iptables -t nat -L KUBE-SERVICES | head -20

# View kube-proxy mode and configuration
kubectl get configmap kube-proxy -n kube-system -o yaml

# Check that the API Server can reach the kubelet
# (the request is proxied to the kubelet's /healthz endpoint)
kubectl get --raw='/api/v1/nodes/worker-node-01/proxy/healthz'

# Debug networking by checking kube-proxy logs
kubectl logs -n kube-system -l k8s-app=kube-proxy \
  --tail=30

# Check node resource capacity vs allocatable
kubectl describe node worker-node-01 | grep -A5 'Capacity\|Allocatable'
# Capacity:
#   cpu:     8
#   memory:  32Gi
# Allocatable:   ← what's available for Pods (after system reserved)
#   cpu:     7600m
#   memory:  30Gi
◈ Architecture Diagram
┌────────────── Worker Node ──────────────────────────────┐
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ kubelet │ │
│ │ • Registers node with API Server │ │
│ │ • Watches for Pod assignments │ │
│ │ • Executes liveness/readiness probes │ │
│ │ • Reports node status every 10s │ │
│ └────────┬──────────────────────┬─────────────────┘ │
│ │ CRI (gRPC) │ HTTPS │
│ ▼ │ │
│ ┌─────────────────┐ │ │
│ │ Container Runtime│ │ │
│ │ (containerd) │ │ │
│ │ │ │ │
│ │ ┌────┐ ┌────┐ │ │ │
│ │ │Pod │ │Pod │ │ │ │
│ │ │ A │ │ B │ │ │ │
│ │ └────┘ └────┘ │ │ │
│ └──────────────────┘ │ │
│ │ │
│ ┌─────────────────┐ │ │
│ │ kube-proxy │ │ │
│ │ │ │ │
│ │ iptables/IPVS │ │ │
│ │ rules for │ │ │
│ │ Service routing │ │ │
│ └────────┬────────┘ │ │
│ │ │ │
└───────────┼──────────────────────┼──────────────────────┘
│ │
│ watch Services │ watch PodSpecs
│ & Endpoints │ report status
▼ ▼
┌──────────────────────────────────┐
│ kube-apiserver │
│ (Control Plane) │
└──────────────────────────────────┘
Quick Answer
The control plane is the brain of the cluster: the API Server is the single point of communication, etcd stores all cluster state, the Scheduler assigns Pods to nodes, and the Controller Manager runs reconciliation loops that maintain desired state. On worker nodes, the kubelet manages Pods and kube-proxy handles networking.
Detailed Answer
Think of a Kubernetes cluster like an airport. The control plane is the airport operations center: the people and systems that coordinate everything. The API server is the main radio tower: every communication between pilots (kubectl), ground crew (kubelets), and air traffic control (controllers) goes through it. Nobody talks directly to anyone else. Etcd is the flight database: every flight plan, gate assignment, and schedule is recorded there, and if this database goes down, the airport can't function. The Scheduler is the gate assignment officer who decides which arriving plane goes to which gate based on gate size, availability, and terminal capacity. The Controller Manager is the operations team that constantly walks the airport comparing the schedule to reality: 'Gate 3 should have a plane, it doesn't, redirect one there.'

The API Server (kube-apiserver) is the only component that talks to etcd. When you run `kubectl get pods`, kubectl sends an HTTPS request to the API server, which authenticates you, checks your RBAC permissions, retrieves the data from etcd, and returns it. When you create a Deployment, the API server validates the manifest, stores it in etcd, and notifies watching controllers via its built-in watch mechanism. Every component in the cluster (scheduler, controllers, kubelets) communicates exclusively through the API server.

Etcd is a distributed key-value store that holds the entire state of the cluster: every Pod, Service, Secret, ConfigMap, and node registration. It uses the Raft consensus algorithm to maintain consistency across multiple replicas (production clusters run 3 or 5 etcd members). Etcd is the most critical component: if etcd data is lost and there's no backup, the cluster is unrecoverable. This is why etcd backup is a non-negotiable operational requirement.

The Scheduler (kube-scheduler) watches for newly created Pods that have no node assigned. For each unscheduled Pod, it runs a two-phase process: filtering (which nodes CAN run this Pod: enough CPU? right architecture? matching tolerations?) and scoring (which node is BEST: least loaded? closest to existing Pods? has the image cached?). The winning node is written to the Pod's spec.nodeName field, and the kubelet on that node picks it up.

The Controller Manager (kube-controller-manager) runs dozens of control loops, each responsible for one type of resource. The Deployment controller watches Deployments and manages ReplicaSets. The ReplicaSet controller watches ReplicaSets and manages Pods. The Node controller monitors node heartbeats and marks nodes as unhealthy. The Endpoints controller updates Service endpoints when Pods change. Each controller follows the same pattern: observe current state → compare to desired state → take action to converge. This reconciliation loop is the fundamental operating principle of Kubernetes.

On each worker node, the kubelet is the agent that actually runs Pods. It watches the API server for Pods assigned to its node, pulls container images, starts containers via the container runtime (containerd), and reports status back. Kube-proxy runs on every node and maintains network rules (iptables or IPVS) that implement Service routing. A common misconception is that kube-proxy proxies traffic. In iptables mode, it doesn't: it just programs the kernel's packet filtering rules and gets out of the way.
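The observe → compare → act pattern can be shown with a toy ReplicaSet-style controller. This is a minimal sketch, not how the real controller is implemented: the `reconcile` function, the `web-N` Pod names, and the in-memory list standing in for cluster state are all assumptions made for illustration.

```python
import itertools


def reconcile(desired_replicas, running_pods, make_name):
    """One pass: observe the current Pods, compare to desired, act on the gap."""
    pods = list(running_pods)                 # observe
    diff = desired_replicas - len(pods)       # compare
    if diff > 0:                              # act: create the missing Pods
        pods += [make_name() for _ in range(diff)]
    elif diff < 0:                            # act: remove the surplus Pods
        pods = pods[:desired_replicas]
    return pods


counter = itertools.count(1)
new_name = lambda: f"web-{next(counter)}"     # hypothetical naming scheme

pods = reconcile(3, [], new_name)
print(pods)                                   # ['web-1', 'web-2', 'web-3']

pods.remove("web-2")                          # simulate a crashed Pod
pods = reconcile(3, pods, new_name)
print(pods)                                   # ['web-1', 'web-3', 'web-4']

pods = reconcile(2, pods, new_name)           # user scales down to 2
print(pods)                                   # ['web-1', 'web-3']
```

The controller never needs to know why the state drifted (crash, manual delete, scale change); it only compares counts and converges, which is what makes the pattern self-healing.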
Code Example
# Check control plane component health
# (componentstatuses is deprecated since v1.19 but still illustrates the idea)
kubectl get componentstatuses

# View all control plane Pods (on kubeadm-style clusters they run as static Pods on master nodes)
kubectl get pods -n kube-system

# Check API server endpoint
kubectl cluster-info
# Kubernetes control plane is running at https://10.0.0.1:6443

# View etcd members (if you have access to the master node)
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Take an etcd backup (CRITICAL for disaster recovery)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check node status and kubelet version
kubectl get nodes -o wide

# View kubelet logs on a node (SSH required)
journalctl -u kubelet -f

# Check kube-proxy mode (iptables vs IPVS)
kubectl get configmap kube-proxy -n kube-system -o yaml | grep mode
◈ Architecture Diagram
┌──────────── Control Plane ──────────────┐
│ │
│ ┌──────────┐ watches ┌────────┐ │
│ │Scheduler │◄─────────────►│ API │ │
│ │ │ │ Server │ │
│ └──────────┘ ┌─────────►│ │ │
│ │ │ only │ │
│ ┌──────────┐ │ │component│ │
│ │Controller│────┘ │that │ │
│ │ Manager │ watches │talks to │ │
│ └──────────┘ │ etcd │ │
│ └───┬────┘ │
│ │ │
│ ┌─────▼─────┐ │
│ │ etcd │ │
│ │ (cluster │ │
│ │ state) │ │
│ └───────────┘ │
└─────────────────────────────────────────┘
│
API server
watches/updates
│
┌────────────┼────────────────┐
│ │ │
▼ ▼ ▼
┌──────────┐┌──────────┐ ┌──────────┐
│ Node 1 ││ Node 2 │ │ Node 3 │
│ ││ │ │ │
│ kubelet ││ kubelet │ │ kubelet │
│ kube- ││ kube- │ │ kube- │
│ proxy ││ proxy │ │ proxy │
│ ││ │ │ │
│ Pod Pod ││ Pod Pod │ │ Pod Pod │
└──────────┘└──────────┘ └──────────┘
Quick Answer
Three control plane nodes provide high availability through etcd's Raft consensus, which requires a majority quorum. With 3 members, quorum is 2 — so the cluster survives one node failure. With 2 members, losing one loses quorum.
Detailed Answer
Think of it like a committee that makes decisions by majority vote. If you have 3 committee members, you need 2 to agree (majority) to pass any decision. If one member is sick, the remaining 2 can still vote and make decisions. But if you only had 2 members and one got sick, you'd have 1 out of 2, which is not a majority, so no decisions can be made and everything stops.

The reason is etcd, the distributed key-value store that holds all Kubernetes cluster state. Etcd uses the Raft consensus algorithm, which requires a strict majority of members to agree on any write. This majority is called quorum. For 3 members, quorum = 2 (you can lose 1). For 5 members, quorum = 3 (you can lose 2). For 2 members, quorum = 2 (you can lose 0, which makes 2 members WORSE than 1 for availability, because there are now two machines whose failure stops the cluster).

When one control plane node fails in a 3-node setup, here's what happens: etcd continues operating because 2 of 3 members still form quorum. The API server pods on the remaining 2 nodes handle all requests (the load balancer in front of them routes around the failed node). The scheduler and controller manager use leader election: one was active, the others were on standby. If the active leader was on the failed node, a standby takes over once the leader lease expires, typically within seconds. From the user's perspective, kubectl commands might have a brief hiccup (~5-10 seconds) while traffic shifts away from the failed node, but the cluster continues operating normally.

However, losing TWO of three control plane nodes is catastrophic: etcd loses quorum (only 1 of 3 remaining), and all writes fail. The API server cannot process any creates, updates, or deletes; it may still answer some reads from its cache, but etcd's default consistent reads also need quorum, so treat the cluster as read-only at best. Existing workloads on worker nodes keep running (the kubelet continues managing pods independently), but you cannot deploy anything new, scale, or recover pods that fail. The cluster stays in this degraded state until quorum is restored.

Why not 5 or 7 control plane nodes? Each etcd write must be acknowledged by a majority before it's committed. More members means more replication traffic and higher write latency. For most clusters, the trade-off of 3 nodes (survive 1 failure, fast writes) is optimal. Large enterprise clusters sometimes use 5 nodes for extra resilience, but 7+ is almost never justified because the write performance penalty outweighs the marginal availability gain.
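The quorum arithmetic above is simple enough to check directly. A small sketch (the function names are hypothetical) that computes the majority size and the number of failures a cluster of each size can survive:

```python
def quorum(members):
    """Strict majority of the cluster."""
    return members // 2 + 1


def tolerated_failures(members):
    """Members that can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3, 4, 5, 7):
    print(f"{n} members: quorum = {quorum(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

The output shows why even member counts are avoided: 4 members tolerate the same single failure as 3, and 2 members tolerate none, so the extra member adds cost and write overhead without adding resilience.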
Code Example
# Check etcd member health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://master-0:2379,https://master-1:2379,https://master-2:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Expected: each of the three endpoints is reported as healthy

# Check etcd member list (pass the same --endpoints and TLS flags as above)
ETCDCTL_API=3 etcdctl member list --write-out=table

# Check which controller-manager and scheduler instance is the leader
# (current releases record leader election in Lease objects)
kubectl get lease kube-scheduler -n kube-system -o yaml
kubectl get lease kube-controller-manager -n kube-system -o yaml

# Check control plane node status
kubectl get nodes -l node-role.kubernetes.io/control-plane
# NAME       STATUS     ROLES           AGE
# master-0   Ready      control-plane   365d
# master-1   Ready      control-plane   365d
# master-2   NotReady   control-plane   365d   ← failed

# Backup etcd (CRITICAL: do this before any maintenance)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
◈ Architecture Diagram
3-Node Control Plane:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│          │ │          │ │          │
│ API Srvr │ │ API Srvr │ │ API Srvr │
│ etcd     │ │ etcd     │ │ etcd     │
│ Sched    │ │ Sched    │ │ Sched    │
│ CtrlMgr  │ │ CtrlMgr  │ │ CtrlMgr  │
└──────────┘ └──────────┘ └──────────┘
  leader ★     standby      standby

Quorum math:
3 members → quorum = 2 → tolerate 1 failure ✓
5 members → quorum = 3 → tolerate 2 failures ✓
2 members → quorum = 2 → tolerate 0 failures ✗

Master-2 fails:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│ etcd ✓   │ │ etcd ✓   │ │ etcd ✗   │
│ leader ★ │ │ standby  │ │ DOWN     │
└──────────┘ └──────────┘ └──────────┘
2/3 = quorum ✓ → cluster operational

Master-0 ALSO fails:
┌──────────┐ ┌──────────┐ ┌──────────┐
│ Master-0 │ │ Master-1 │ │ Master-2 │
│ etcd ✗   │ │ etcd ✓   │ │ etcd ✗   │
│ DOWN     │ │ alone!   │ │ DOWN     │
└──────────┘ └──────────┘ └──────────┘
1/3 = NO quorum ✗ → read-only mode
Quick Answer
Classic pipelines use a GUI-based editor stored in Azure DevOps metadata, while YAML pipelines define CI/CD as code in a version-controlled file. YAML enables pull request reviews, template reuse, and multi-stage deployments in a single file.
Detailed Answer
Imagine two ways to give driving directions: a voice-guided GPS (Classic) where you click through turns on a screen, versus a written route card (YAML) that you can photocopy, annotate, version, and hand to anyone. Both get you to the destination, but the written card travels with the project and can be peer-reviewed before the trip.

Classic pipelines were the original Azure DevOps experience. Build definitions use a visual task editor where you drag-and-drop tasks like NuGet Restore, MSBuild, or Docker Build. Release definitions add environments with deployment gates, approvals, and artifact triggers. The configuration is stored as JSON metadata inside Azure DevOps, not in your repository. This means pipeline changes do not go through pull requests, cannot be easily diffed, and are invisible in your Git history.

YAML pipelines store the entire pipeline definition in an azure-pipelines.yml file committed alongside your source code. Every change to the pipeline goes through the same pull request workflow as application code. YAML supports multi-stage pipelines (build, test, deploy to staging, deploy to production) in a single file with conditional execution, template references, and environment approvals. The extends keyword and template repositories enable centralized governance across hundreds of pipelines.

Under the hood, both pipeline types use the same agent infrastructure and task ecosystem. A Classic build task like DotNetCoreCLI@2 is the same task referenced in YAML as - task: DotNetCoreCLI@2. The difference is purely in how the orchestration is defined and stored.

In production, most organizations are migrating from Classic to YAML because Microsoft has signaled Classic pipelines will not receive new features. The gotcha is that Classic Release pipelines have some features (like release gates with Azure Monitor integration and graphical deployment visualization) that require extra YAML configuration using environments and checks. Teams migrating often underestimate the effort to replicate approval workflows, variable scoping, and artifact filtering that Classic provided through the GUI.
Code Example
# Classic pipeline equivalent in YAML: azure-pipelines.yml
# This replaces a Classic Build + Release definition pair
trigger:
  branches:
    include:
      - main
      - release/*

pool:
  vmImage: 'ubuntu-latest'

stages:
  - stage: Build
    displayName: 'Build payments-api'
    jobs:
      - job: BuildJob
        steps:
          - task: DotNetCoreCLI@2
            displayName: 'Restore packages'
            inputs:
              command: 'restore'
              projects: 'src/payments-api/*.csproj'
          - task: DotNetCoreCLI@2
            displayName: 'Build solution'
            inputs:
              command: 'build'
              projects: 'src/payments-api/*.csproj'
              arguments: '--configuration Release'
          - task: DotNetCoreCLI@2
            displayName: 'Run unit tests'
            inputs:
              command: 'test'
              projects: 'tests/**/*.csproj'
          # Added so the staging directory is not empty when it is published below
          - task: DotNetCoreCLI@2
            displayName: 'Publish to staging directory'
            inputs:
              command: 'publish'
              publishWebProjects: false
              projects: 'src/payments-api/*.csproj'
              arguments: '--configuration Release --output $(Build.ArtifactStagingDirectory)'
          - task: PublishBuildArtifacts@1
            inputs:
              PathtoPublish: '$(Build.ArtifactStagingDirectory)'
              ArtifactName: 'drop'

  - stage: DeployStaging
    displayName: 'Deploy to Staging'
    dependsOn: Build
    jobs:
      - deployment: DeployToStaging
        environment: 'payments-staging'
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo 'Deploying to staging environment'
◈ Architecture Diagram
┌──────────────────────────────────────────────────────────┐
│ Classic Pipeline                                         │
│ ┌──────────┐    ┌──────────────┐    ┌─────────────┐      │
│ │  Build   │───→│ Release Def  │───→│ Environment │      │
│ │  (GUI)   │    │    (GUI)     │    │    (GUI)    │      │
│ └──────────┘    └──────────────┘    └─────────────┘      │
│ Stored in Azure DevOps metadata (not in Git)             │
└──────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────┐
│ YAML Pipeline                                            │
│ ┌────────────────────────────────────────────────────┐   │
│ │ azure-pipelines.yml (in Git repo)                  │   │
│ │ stages: Build → Test → Deploy Staging → Deploy Prod│   │
│ └────────────────────────────────────────────────────┘   │
│ Versioned │ PR-reviewed │ Templated │ Multi-stage        │
└──────────────────────────────────────────────────────────┘
Quick Answer
Create an azure-pipelines.yml file in your repo root defining a trigger, agent pool, and build steps. Azure DevOps detects the file and offers to create the pipeline, or you can use az pipelines create to wire it up via CLI.
Detailed Answer
Think of a YAML pipeline file like a recipe card you tape to your refrigerator. Anyone who opens the fridge (clones the repo) immediately knows exactly how to cook the dish (build the project) without asking the chef (searching through a GUI for hidden configuration).

The minimum viable YAML pipeline has three elements: a trigger that specifies which branches activate the pipeline, a pool that defines which agent runs the work, and steps that list the actual build commands. For a .NET project, the steps typically include restore, build, test, and publish. For Node.js, they include npm install, lint, test, and build. Azure DevOps provides starter templates when you create a new pipeline through the portal, automatically detecting your project type.

When the pipeline runs, Azure DevOps provisions a fresh agent from the specified pool. Microsoft-hosted agents come pre-installed with common SDKs (.NET, Node.js, Python, Java, Go) and tools (Docker, kubectl, Terraform). The agent clones your repository, executes each step sequentially, and reports results back. Build artifacts like compiled binaries or Docker images can be published for downstream stages.

In production, even a basic pipeline should include caching for package restore (which can shorten builds considerably, for example from about 5 minutes to about 90 seconds), test result publishing (so failures appear in the PR UI), and branch or path filters (to avoid running on documentation-only changes). The variables section externalizes configuration like SDK versions so upgrades require changing one line.

The most common gotcha for beginners is indentation errors in YAML causing cryptic parse failures. Azure DevOps provides a YAML editor with IntelliSense in the portal, but many teams prefer editing locally with the Azure Pipelines VS Code extension that provides schema validation. Another frequent issue is the agent not having a required tool version. Use the UseDotNet@2 or NodeTool@0 tasks to explicitly install the version you need rather than relying on whatever is pre-installed.
Code Example
# azure-pipelines.yml: Basic .NET build pipeline for payments-api
trigger:
  - main
  - feature/*

pool:
  vmImage: 'ubuntu-latest'

variables:
  buildConfiguration: 'Release'
  dotnetVersion: '8.0.x'

steps:
  - task: UseDotNet@2
    displayName: 'Install .NET SDK'
    inputs:
      packageType: 'sdk'
      version: '$(dotnetVersion)'
  - task: DotNetCoreCLI@2
    displayName: 'Restore NuGet packages'
    inputs:
      command: 'restore'
      projects: 'src/payments-api/**/*.csproj'
      feedsToUse: 'select'
      vstsFeed: 'contoso-internal-feed'
  - task: DotNetCoreCLI@2
    displayName: 'Build payments-api'
    inputs:
      command: 'build'
      projects: 'src/payments-api/**/*.csproj'
      arguments: '--configuration $(buildConfiguration) --no-restore'
  - task: DotNetCoreCLI@2
    displayName: 'Run unit tests'
    inputs:
      command: 'test'
      projects: 'tests/**/*.csproj'
      arguments: '--configuration $(buildConfiguration) --collect:"XPlat Code Coverage"'
  - task: PublishTestResults@2
    displayName: 'Publish test results'
    inputs:
      testResultsFormat: 'VSTest'
      testResultsFiles: '**/*.trx'
---
# azure-pipelines.yml: Basic Node.js pipeline for fraud-detector
trigger:
  - main

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: NodeTool@0
    displayName: 'Use Node.js 20.x'
    inputs:
      versionSpec: '20.x'
  - script: npm ci
    displayName: 'Install dependencies (clean)'
  - script: npm run lint
    displayName: 'Run ESLint'
  - script: npm run test -- --coverage
    displayName: 'Run Jest tests with coverage'
  - script: npm run build
    displayName: 'Build production bundle'
◈ Architecture Diagram
┌─────────────────────────────────────────────────┐
│ Pipeline Execution Flow                         │
│                                                 │
│ ┌─────────┐   ┌─────────┐   ┌──────────────┐    │
│ │ Trigger │──→│  Agent  │──→│  Clone Repo  │    │
│ │ (push)  │   │ (pool)  │   │  (checkout)  │    │
│ └─────────┘   └─────────┘   └──────┬───────┘    │
│                                    │            │
│                                    ↓            │
│ ┌─────────┐   ┌─────────┐   ┌──────────────┐    │
│ │Publish  │←──│  Test   │←──│    Build     │    │
│ │Artifacts│   │   Run   │   │   Compile    │    │
│ └─────────┘   └─────────┘   └──────────────┘    │
└─────────────────────────────────────────────────┘
Quick Answer
Multi-stage YAML pipelines define build, test, and deploy stages in a single azure-pipelines.yml file. Approval gates are configured on Azure DevOps Environments, requiring manual approval or automated checks before the deployment job targeting that environment can proceed.
Detailed Answer
Think of a relay race with checkpoints. Each runner (stage) must complete their leg before the next starts, and at certain checkpoints (environments), a judge (approver) must wave the green flag before the next runner can go. The entire race plan is written down in advance (YAML), but human judgment controls progression at critical points.

Multi-stage YAML pipelines supersede the old Classic release pipelines with a code-as-configuration approach. A single YAML file defines multiple stages, typically Build, Test, Deploy-Dev, Deploy-Staging, Deploy-Prod. Each stage contains jobs, and each job contains steps. Stages run sequentially by default but can be configured with dependsOn to run in parallel or in custom orders. The deployment jobs reference Azure DevOps Environments.

Environments are the key to approval gates. You create environments (dev, staging, production) in Azure DevOps under Pipelines > Environments. On each environment, you configure Approvals and Checks: manual approvals (specific users or groups must approve), business hours check (only deploy during work hours), branch control (only allow deployments from the main branch), and exclusive lock (prevent concurrent deployments). When a pipeline stage targets an environment with approvals, it pauses and notifies the approvers.

At production scale, teams configure progressively stricter gates. Dev deploys automatically on every PR merge. Staging requires one approval from the QA lead. Production requires two approvals from different teams (dev lead + ops lead), business hours enforcement, and a branch control check that only allows the main branch. Templates extract common stage definitions into reusable files, so 50 pipelines share the same deploy-to-production stage with identical gates.

The non-obvious gotcha is that environment approvals apply to the environment resource, not the pipeline. If you rename or recreate an environment, the pipeline may end up targeting an environment without the configured approvals, and you must set them up again. Also, approval timeouts default to 30 days. If nobody approves within that window, the pipeline run expires. Teams should set shorter timeouts (24-48 hours) and configure approval notifications to avoid stale pipeline runs accumulating.
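To see how dependsOn shapes execution order, the stage graph can be modelled as a dependency map and grouped into "waves" of stages whose dependencies are all satisfied. This is only an illustrative sketch using Python's standard graphlib module; the `stage_waves` helper and the stage names are hypothetical, and it models ordering only, not approvals or how Azure DevOps actually schedules agents.

```python
from graphlib import TopologicalSorter


def stage_waves(depends_on):
    """Group stages into waves; stages in the same wave could run in parallel."""
    sorter = TopologicalSorter(depends_on)  # maps stage -> stages it depends on
    sorter.prepare()                        # raises CycleError on circular dependsOn
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every dependency already finished
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical pipeline: two deployments fan out from Build, then fan in to Prod
stages = {
    "Build": [],
    "DeployDev": ["Build"],
    "DeployStaging": ["Build"],
    "DeployProd": ["DeployDev", "DeployStaging"],
}

waves = stage_waves(stages)
print(waves)  # [['Build'], ['DeployDev', 'DeployStaging'], ['DeployProd']]
```

Because DeployDev and DeployStaging both depend only on Build, they land in the same wave; DeployProd waits for both, which is the fan-out/fan-in shape that dependsOn makes possible.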
Code Example
# azure-pipelines.yml: Multi-stage pipeline with approval gates
trigger:
  branches:
    include: [main]                  # Only trigger on main branch

stages:
  - stage: Build
    jobs:
      - job: BuildApp
        pool:
          vmImage: ubuntu-latest     # Microsoft-hosted agent
        steps:
          # Build and publish the .NET application into the staging directory
          # so the artifact published below is not empty
          - script: dotnet publish --configuration Release --output $(Build.ArtifactStagingDirectory)
          - task: PublishBuildArtifacts@1   # Publish artifacts for deploy stages
            inputs:
              pathtoPublish: $(Build.ArtifactStagingDirectory)
              artifactName: drop

  - stage: DeployDev
    dependsOn: Build                 # Runs after Build completes
    jobs:
      - deployment: DeployToDev
        environment: dev             # No approvals configured, so it deploys automatically
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to dev"          # Deploy steps here

  - stage: DeployProd
    dependsOn: DeployDev             # Runs after dev succeeds
    jobs:
      - deployment: DeployToProd
        environment: production      # Has manual approval configured
        strategy:
          runOnce:
            deploy:
              steps:
                - script: echo "Deploying to production"   # Deploy steps here