Kubernetes Troubleshooting Guide

A systematic approach to debugging any Kubernetes problem. Follow the decision tree for each symptom.

The Universal Debugging Checklist

# 1. What is the current state?
kubectl get pods -n <ns> -o wide

# 2. What happened recently?
kubectl get events --sort-by='.lastTimestamp' -n <ns> | tail -20

# 3. What does the object say?
kubectl describe pod <pod> -n <ns>

# 4. What does the container say?
kubectl logs <pod> -n <ns> --previous   # output of the last terminated container (after a restart)
kubectl logs <pod> -n <ns> -f           # stream the current container
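
When the same triage is repeated often, the four steps can be bundled into one helper. The following is a minimal sketch, not a standard tool: the function name `triage` and the section banners are hypothetical, and it assumes `kubectl` already points at the right cluster.

```shell
# triage NAMESPACE POD: run the four checklist steps in order.
# Hypothetical helper; name and banners are illustrative.
triage() (
  ns=$1 pod=$2
  echo "== 1. current state =="
  kubectl get pod "$pod" -n "$ns" -o wide
  echo "== 2. recent events for this pod =="
  kubectl get events -n "$ns" --sort-by=.lastTimestamp \
    --field-selector "involvedObject.name=$pod" | tail -n 20
  echo "== 3. describe =="
  kubectl describe pod "$pod" -n "$ns"
  echo "== 4. logs (previous container, then current) =="
  # --previous fails when the container has never restarted; that is not an error here.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null ||
    echo "(no previous container)"
  kubectl logs "$pod" -n "$ns" --tail=50
)
```

Usage: `triage <ns> <pod>`. The event query is narrowed to the one pod with a field selector so that unrelated namespace noise does not push the relevant lines out of the last 20.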

Pod Status Decision Tree

`Pending` — Pod not scheduled

kubectl describe pod <pod> -n <ns>
# Look at "Events:" section at the bottom

Event message	Root cause	Fix
`Insufficient cpu` / `Insufficient memory`	No node has enough resources	Scale cluster, reduce requests, or remove resource hogs
`0/3 nodes are available: 3 node(s) had untolerated taint`	Tainted nodes, pod lacks toleration	Add toleration or remove taint
`0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector`	nodeSelector/nodeAffinity mismatch	Fix labels: `kubectl get nodes --show-labels`
`no persistent volumes available`	No PV matches PVC	Check StorageClass, create PV, or check provisioner
`pod has unbound immediate PersistentVolumeClaims`	PVC stuck Pending	`kubectl describe pvc <name>`
`Too many pods`	Node at maxPods limit (110 default)	Add nodes or raise the kubelet `maxPods` setting

# Check node resource availability
kubectl describe nodes | grep -A 10 "Allocated resources"

# Check if scheduler is running
kubectl get pods -n kube-system -l component=kube-scheduler
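
To see the scheduler's messages for every unschedulable pod in a namespace at once, instead of describing pods one by one, the events can be filtered by reason. This is a sketch; the helper name `why_pending` is hypothetical.

```shell
# why_pending NAMESPACE: one line per FailedScheduling event, newest last.
# Hypothetical helper; assumes events have not yet expired (they are kept for a limited time).
why_pending() (
  ns=$1
  kubectl get events -n "$ns" \
    --field-selector reason=FailedScheduling \
    --sort-by=.lastTimestamp \
    -o custom-columns=POD:.involvedObject.name,MESSAGE:.message
)
```

The MESSAGE column carries the same text as the event messages in the table above, so each row can be matched directly against a root cause.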

`CrashLoopBackOff` — Container keeps crashing

# Get logs from the crashed container
kubectl logs <pod> -n <ns> --previous

# Describe for exit code
kubectl describe pod <pod> -n <ns>
# Look for: "Last State: Terminated  Reason: Error  Exit Code: X"

Exit Code	Meaning	Common cause
`1`	General application error	App crash — check logs
`2`	Misuse of shell built-in	Bad script/command
`126`	Command not executable	Wrong path or permissions
`127`	Command not found	Missing binary in image
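`128+N`	Killed by signal N	`137` = SIGKILL (128+9), typical of OOM kills; `143` = SIGTERM (128+15)
`OOMKilled` (shown as Reason, exit code 137)	Container exceeded its memory limit	Increase memory limit or fix the leak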
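
The exit code table can be expressed as a small lookup, which is handy in scripts that summarise many crashed containers. A minimal sketch follows; the function name `explain_exit` is hypothetical, and the signal numbers assume Linux.

```shell
# explain_exit CODE: print what a container exit code most likely means.
# Hypothetical helper; signal numbers are the Linux ones.
explain_exit() (
  code=$1
  if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
    sig=$((code - 128))
    case $sig in
      2)  name=SIGINT ;;
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    echo "$code: killed by $name (128+$sig)"
    exit 0
  fi
  case $code in
    0)   echo "0: exited cleanly" ;;
    1)   echo "1: general application error" ;;
    2)   echo "2: misuse of shell built-in" ;;
    126) echo "126: command not executable" ;;
    127) echo "127: command not found" ;;
    *)   echo "$code: application-defined exit code" ;;
  esac
)
```

The code to feed it can be read without parsing `describe` output: `kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`.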

# Run a one-off debug pod to test the image
kubectl run debug --image=<same-image> --rm -it --command -- /bin/sh

# Override the entrypoint to prevent crashing, then exec in
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- /bin/sh

`ImagePullBackOff` / `ErrImagePull`

kubectl describe pod <pod> -n <ns>
# Events will show: "Failed to pull image ... 401 Unauthorized"
#                or "Failed to pull image ... not found"

Symptom	Fix
`401 Unauthorized`	Create/fix imagePullSecret
`not found` / `manifest unknown`	Wrong image name or tag
`connection refused`	Registry unreachable from node
Works locally, fails in cluster	Private registry — add imagePullSecret

# Create docker registry secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<pass> \
  -n <namespace>

# Add to pod spec
spec:
  imagePullSecrets:
  - name: regcred
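
If the pull still fails with `401 Unauthorized` after the secret is attached, it can help to confirm what the secret actually contains. The sketch below decodes it; the helper name `show_pull_secret` is hypothetical. Note that the output includes the registry credentials, so avoid running it in shared terminals or CI logs.

```shell
# show_pull_secret NAMESPACE SECRET: print the decoded .dockerconfigjson of a registry secret.
# Hypothetical helper; the output contains credentials.
show_pull_secret() (
  ns=$1 secret=$2
  kubectl get secret "$secret" -n "$ns" \
    -o 'jsonpath={.data.\.dockerconfigjson}' | base64 --decode
)
```

Usage: `show_pull_secret <namespace> regcred`. The registry host under `auths` must match the host in the image reference, and the secret must live in the same namespace as the pod.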

`OOMKilled` — Out Of Memory

kubectl describe pod <pod> -n <ns>
# Last State: Terminated  Reason: OOMKilled

# Check memory usage before it dies (requires metrics-server)
kubectl top pods -n <ns> --containers

# Fix: increase memory limit
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "1Gi"    # was too low
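
In a multi-container pod it is not always obvious which container hit its limit. The following sketch lists the containers whose last termination reason was `OOMKilled`; the helper name `oom_containers` is hypothetical.

```shell
# oom_containers NAMESPACE POD: names of containers whose last termination was OOMKilled.
# Hypothetical helper; only inspects regular containers, not init containers.
oom_containers() (
  ns=$1 pod=$2
  kubectl get pod "$pod" -n "$ns" -o \
    'jsonpath={range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\n"}{end}' |
    awk -F'\t' '$2 == "OOMKilled" { print $1 }'
)
```

Raise the limit only on the containers this prints; increasing limits on every container in the pod makes scheduling harder without fixing anything.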

`Terminating` (stuck forever)

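# Force delete (removes the API object without waiting for the kubelet; the container may keep running on the node)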
kubectl delete pod <pod> -n <ns> --grace-period=0 --force

# If still stuck — check for finalizers
kubectl get pod <pod> -n <ns> -o json | jq '.metadata.finalizers'

# Remove finalizers
kubectl patch pod <pod> -n <ns> -p '{"metadata":{"finalizers":[]}}' --type=merge

Service Not Reachable

Step-by-step when a Service returns connection refused or times out:

# Step 1: Does the service exist?
kubectl get svc <name> -n <ns>

# Step 2: Does it have endpoints? (If empty, selector isn't matching any pods)
kubectl get endpoints <name> -n <ns>

# Step 3: Do pods match the service selector?
kubectl get svc <name> -n <ns> -o jsonpath='{.spec.selector}'
kubectl get pods -n <ns> -l <key>=<value>   # use the selector labels

# Step 4: Is the targetPort correct?
kubectl get svc <name> -n <ns> -o yaml | grep -A5 ports

# Step 5: Test from inside the cluster (the indented commands run in the pod's shell)
kubectl run net-test --image=busybox --rm -it -- sh
  wget -O- http://<svc>.<ns>.svc.cluster.local:<port>
  nslookup <svc>.<ns>.svc.cluster.local

# Step 6: Is a NetworkPolicy blocking traffic?
kubectl get networkpolicy -n <ns>

Common causes:

No endpoints — label selector mismatch between Service and Pods
Wrong port — port vs targetPort confusion
Pod not Ready — readiness probe failing, so endpoint excluded
NetworkPolicy — default-deny rule blocking traffic
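
Steps 2 and 3 can be combined: read the Service's selector and list exactly the pods it selects, including their Ready state. This is a sketch under one assumption, namely that `kubectl` prints the selector as a JSON object (recent versions do; older ones printed `map[...]`). The helper names `selector_to_labels` and `svc_pods` are hypothetical.

```shell
# selector_to_labels: turn {"app":"web","tier":"fe"} on stdin into app=web,tier=fe.
# Safe for labels because keys and values cannot contain braces, quotes, or colons.
selector_to_labels() {
  tr -d '{}"' | tr ':' '='
}

# svc_pods NAMESPACE SERVICE: list the pods matched by the Service's selector.
svc_pods() (
  ns=$1 svc=$2
  sel=$(kubectl get svc "$svc" -n "$ns" -o 'jsonpath={.spec.selector}' | selector_to_labels)
  if [ -z "$sel" ]; then
    echo "service $svc has no selector" >&2
    exit 1
  fi
  kubectl get pods -n "$ns" -l "$sel" -o wide
)
```

An empty result points to a label mismatch. Pods that are listed but show `0/1` in the READY column point to a failing readiness probe, which also removes them from the endpoints.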

Node `NotReady`

# Check node conditions
kubectl describe node <node>
# Look for: "Ready False" with reason

# SSH to the node
ssh <node>

# Check kubelet
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Common kubelet errors
# "failed to get node info" → API server unreachable
# "certificate has expired" → rotate certs
# "PLEG is not healthy" → container runtime issue

# Check container runtime
systemctl status containerd
crictl ps
crictl images

# Check disk space (DiskPressure)
df -h
du -sh /var/lib/containerd

# Clean up unused images
crictl rmi --prune

# Check memory (MemoryPressure)
free -h
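
Before logging in to a node, the conditions can be read from the API in a compact form. A sketch, with hypothetical helper names `not_ready_nodes` and `node_conditions`:

```shell
# not_ready_nodes: names of nodes whose STATUS column does not start with "Ready".
# "Ready,SchedulingDisabled" (cordoned) still counts as Ready here.
not_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}

# node_conditions NODE: one line per condition: type, status, reason.
node_conditions() (
  node=$1
  kubectl get node "$node" -o \
    'jsonpath={range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
)
```

Usage: `for n in $(not_ready_nodes); do echo "== $n"; node_conditions "$n"; done`. A `DiskPressure` or `MemoryPressure` line with status `True` tells you which of the checks above to run first.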

DNS Resolution Failures

# Test DNS from inside a pod
kubectl run dns-test --image=busybox --rm -it -- nslookup kubernetes.default

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml

DNS format:

<service>.<namespace>.svc.cluster.local       → ClusterIP
<pod-ip-dashes>.<namespace>.pod.cluster.local → Pod IP

# Test all DNS levels (run inside a pod)
nslookup my-svc                                       # short name (same namespace)
nslookup my-svc.staging                               # cross-namespace
nslookup my-svc.staging.svc.cluster.local             # FQDN
nslookup example.com                                  # external name (upstream resolution)
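
The names to try follow mechanically from the DNS format, so they can be generated rather than typed. The sketch below prints them from shortest to fully qualified; the helper name `dns_names` is hypothetical and the cluster domain defaults to `cluster.local`, which is an assumption about your cluster.

```shell
# dns_names SERVICE NAMESPACE [CLUSTER_DOMAIN]: candidate names, shortest first.
# Hypothetical helper; CLUSTER_DOMAIN defaults to cluster.local.
dns_names() (
  svc=$1 ns=$2 domain=${3:-cluster.local}
  printf '%s\n' "$svc" "$svc.$ns" "$svc.$ns.svc" "$svc.$ns.svc.$domain"
)

# Inside a debug pod:
#   for n in $(dns_names my-svc staging); do nslookup "$n" || echo "FAILED: $n"; done
```

If only the fully qualified name resolves, the problem is usually the search path rather than CoreDNS itself; compare `cat /etc/resolv.conf` inside the pod with the namespace you expect.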

Deployment Not Updating

# Check rollout status
kubectl rollout status deployment/<name> -n <ns>

# Check if pods are actually on new version
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

# Why is the rollout stuck?
kubectl describe deployment <name> -n <ns>
# Check the "Conditions" section for "Progressing False" (reason: ProgressDeadlineExceeded)

# Common: new pods won't start because of readiness probe failure
kubectl describe pod <new-pod> -n <ns>

# Rollback immediately
kubectl rollout undo deployment/<name> -n <ns>
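
The status check and the rollback can be chained so that a stalled rollout is reverted after a fixed wait. This is a sketch of one possible policy, not a recommendation for every workload; the helper name `rollout_or_undo` and the 120s default are hypothetical.

```shell
# rollout_or_undo NAMESPACE DEPLOYMENT [TIMEOUT]: wait for the rollout, undo it if it does not finish.
# Hypothetical helper; TIMEOUT is passed to kubectl as-is (for example 90s or 5m).
rollout_or_undo() (
  ns=$1 name=$2 timeout=${3:-120s}
  if kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout"; then
    echo "rollout complete"
  else
    echo "rollout did not finish within $timeout; rolling back" >&2
    kubectl rollout undo "deployment/$name" -n "$ns"
  fi
)
```

Usage: `rollout_or_undo <ns> <name> 5m`. Capture `kubectl describe pod <new-pod> -n <ns>` before the rollback if you still need to know why the new pods failed, since the undo removes them.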

PVC Stuck in `Pending`

kubectl describe pvc <name> -n <ns>
# Events will say why

Reason	Fix
`no persistent volumes available`	No PV matches the claim's access mode, size, or StorageClass; create one or use a StorageClass with dynamic provisioning
`storageclass not found`	Wrong `storageClassName` — check `kubectl get sc`
`waiting for first consumer`	StorageClass has `volumeBindingMode: WaitForFirstConsumer` — normal, binds when pod is created
`ProvisioningFailed`	Cloud provisioner error — check provisioner logs

# Check available StorageClasses
kubectl get storageclass

# Check if a matching PV exists
kubectl get pv

# Check provisioner pod logs (example: aws-ebs-csi-driver)
kubectl logs -n kube-system -l app=ebs-csi-controller

HPA Not Scaling

kubectl describe hpa <name> -n <ns>
# Look for: "Warning  FailedGetResourceMetric"

# Common causes:
# 1. metrics-server not installed
kubectl get pods -n kube-system | grep metrics-server

# 2. Resource requests not set on deployment (HPA can't calculate utilization %)
kubectl get deployment <name> -n <ns> -o yaml | grep -A4 resources

# 3. Current replicas already at max
kubectl get hpa <name> -n <ns>
# MAXPODS column = maxReplicas — if REPLICAS == MAXPODS, it's at ceiling
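
When metrics are flowing and the HPA is below its ceiling, it helps to work out what replica count the HPA should be asking for. The sketch below applies the documented core formula, desired = ceil(currentReplicas × currentMetric / targetMetric), using integer arithmetic. The helper name `hpa_desired` is hypothetical, and the sketch deliberately ignores the min/max clamp and the controller's tolerance band (10% by default), inside which no scaling happens.

```shell
# hpa_desired CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Prints ceil(CURRENT_REPLICAS * CURRENT_METRIC / TARGET_METRIC) for integer inputs.
# Hypothetical helper; does not model tolerance, stabilization windows, or min/max replicas.
hpa_desired() (
  replicas=$1 current=$2 target=$3
  echo $(( (replicas * current + target - 1) / target ))
)
```

For example, `hpa_desired 4 90 60` prints `6`: four replicas at 90% utilization against a 60% target should grow to six. If the result equals the current replica count, the HPA is behaving correctly and the target is simply not exceeded.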

etcd Issues

# Check etcd pod
kubectl get pods -n kube-system | grep etcd

# Check etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check etcd member list
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# etcd db size (default quota is 2 GiB via --quota-backend-bytes; 8 GiB is the suggested maximum)
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table

# Compact + defrag if db is too large (pass the same --endpoints and TLS flags as above)
ETCDCTL_API=3 etcdctl compact <revision>
ETCDCTL_API=3 etcdctl defrag --endpoints=https://127.0.0.1:2379 ...
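
Because every command repeats the same endpoint and certificate flags, a thin wrapper keeps the maintenance steps readable. This is a sketch assuming the kubeadm-style paths used above; the helper names `etcd_k` and `etcd_revision` are hypothetical.

```shell
# etcd_k ARGS...: etcdctl with the endpoint and TLS flags used throughout this section.
# Hypothetical wrapper; adjust the paths if your cluster was not set up with kubeadm.
etcd_k() {
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    "$@"
}

# etcd_revision: read `endpoint status --write-out=json` on stdin, print the current revision.
etcd_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Typical sequence (compacting to the current revision discards all older history):
#   rev=$(etcd_k endpoint status --write-out=json | etcd_revision)
#   etcd_k compact "$rev"
#   etcd_k defrag
```

Defragmentation blocks the member while it runs, so on a multi-member cluster it is usually done one member at a time.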

Quick Reference: Pod Status Meanings

Status	Meaning	First action
`Pending`	Scheduler can't place it	`kubectl describe pod` → Events
`ContainerCreating`	Image pulling / volumes mounting	`kubectl describe pod` → Events
`Running`	All containers running	Check readiness if service issues
`CrashLoopBackOff`	Container keeps exiting	`kubectl logs --previous`
`OOMKilled`	Memory limit hit	Increase limit, check for leak
`ImagePullBackOff`	Can't pull image	Check name, tag, imagePullSecret
`Error`	Container exited with error	`kubectl logs --previous`
`Terminating`	Being deleted	Check finalizers if stuck
`Evicted`	Node pressure evicted pod	Check node disk/memory
`Completed`	Job pod finished successfully	Normal for Jobs
`Unknown`	Node lost contact	Check node status
