Kubernetes Troubleshooting Guide
A systematic approach to debugging common Kubernetes problems. Find the symptom below and follow its decision tree.
The Universal Debugging Checklist
kubectl get pods -n <ns> -o wide
kubectl get events --sort-by='.lastTimestamp' -n <ns> | tail -20
kubectl describe pod <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous
kubectl logs <pod> -n <ns> -f
Pod Status Decision Tree
Pending — Pod not scheduled
kubectl describe pod <pod> -n <ns>
| Event message | Root cause | Fix |
|---|---|---|
| Insufficient cpu / Insufficient memory | No node has enough resources | Scale cluster, reduce requests, or remove resource hogs |
| 0/3 nodes are available: 3 node(s) had untolerated taint | Tainted nodes, pod lacks toleration | Add toleration or remove taint |
| 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector | nodeSelector/nodeAffinity mismatch | Fix labels: kubectl get nodes --show-labels |
| no persistent volumes available | No PV matches PVC | Check StorageClass, create PV, or check provisioner |
| pod has unbound immediate PersistentVolumeClaims | PVC stuck Pending | kubectl describe pvc <name> |
| Unschedulable: Too many pods | Node at maxPods limit (110 default) | Add nodes or adjust maxPods |
kubectl describe nodes | grep -A 10 "Allocated resources"
kubectl get pods -n kube-system -l component=kube-scheduler
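The event-message table above can be read as a small decision procedure. The sketch below encodes it as a shell function; the function name and the short category labels it prints are illustrative assumptions, while the matched substrings are taken from the table.

```shell
#!/bin/sh
# classify_pending_event MESSAGE
# Map a scheduling event message to a root-cause category.
# The category labels are hypothetical; the substrings come from the table above.
classify_pending_event() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "resources: no node has enough free cpu/memory" ;;
    *"untolerated taint"*)
      echo "taint: pod lacks a matching toleration" ;;
    *"node affinity/selector"*)
      echo "affinity: nodeSelector/nodeAffinity matches no node" ;;
    *"no persistent volumes available"*|*"unbound immediate PersistentVolumeClaims"*)
      echo "storage: PVC is not bound" ;;
    *"Too many pods"*)
      echo "maxPods: node is at its pod limit" ;;
    *)
      echo "unknown: read the full event text" ;;
  esac
}
```

Pass it the message text from the Events section of kubectl describe pod; anything it reports as unknown still needs a manual read.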
CrashLoopBackOff — Container keeps crashing
kubectl logs <pod> -n <ns> --previous
kubectl describe pod <pod> -n <ns>
| Exit Code | Meaning | Common cause |
|---|---|---|
| 1 | General application error | App crash; check logs |
| 2 | Misuse of shell built-in | Bad script/command |
| 126 | Command not executable | Wrong path or permissions |
| 127 | Command not found | Missing binary in image |
| 128+N | Fatal signal N | 137 = SIGKILL (128+9), typically OOMKilled; 143 = SIGTERM (128+15) |
| OOMKilled | Out of memory | Increase memory limit |
kubectl run debug --image=<same-image> --rm -it --command -- /bin/sh
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- /bin/sh
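The 128+N rule from the exit-code table is easy to get wrong under pressure, so here is a minimal sketch that decodes a code for you. The function name is an assumption, and the signal name comes from the shell's own kill -l, so it reflects the machine you run it on rather than the container.

```shell
#!/bin/sh
# decode_exit_code CODE
# Print the conventional meaning of a container exit code.
# Usage idea (not run here): feed it the value printed by
#   kubectl get pod <pod> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
decode_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "success" ;;
    1)   echo "general application error" ;;
    2)   echo "misuse of shell built-in" ;;
    126) echo "command not executable" ;;
    127) echo "command not found" ;;
    *)
      if [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
        sig=$((code - 128))
        # kill -l N prints the name of signal N without the SIG prefix.
        name="$(kill -l "$sig" 2>/dev/null)" || name="unknown"
        echo "fatal signal $sig (SIG$name)"
      else
        echo "application-defined exit code"
      fi
      ;;
  esac
}
```

An exit code of 137 only says the process received SIGKILL; confirm an out-of-memory kill by checking that the Last State reason in kubectl describe pod is OOMKilled.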
ImagePullBackOff / ErrImagePull
kubectl describe pod <pod> -n <ns>
| Symptom | Fix |
|---|---|
| 401 Unauthorized | Create/fix imagePullSecret |
| not found / manifest unknown | Wrong image name or tag |
| connection refused | Registry unreachable from node |
| Works locally, fails in cluster | Private registry: add imagePullSecret |
kubectl create secret docker-registry regcred \
--docker-server=registry.example.com \
--docker-username=<user> \
--docker-password=<pass> \
-n <namespace>
spec:
  imagePullSecrets:
    - name: regcred
OOMKilled — Out Of Memory
kubectl describe pod <pod> -n <ns>
kubectl top pods -n <ns>
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "1Gi"
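When comparing the usage reported by kubectl top pods against the configured limit, it helps to put both on the same scale. The sketch below converts quantities such as 256Mi or 1Gi to bytes and reports usage as a percentage of the limit. Both function names are assumptions, and only whole-number quantities with the suffixes Ki, Mi, Gi, k, M, and G are handled.

```shell
#!/bin/sh
# mem_to_bytes QUANTITY
# Convert an integer Kubernetes memory quantity (e.g. 256Mi, 1Gi) to bytes.
# Assumption: fractional values and other suffixes are rejected, not parsed.
mem_to_bytes() {
  q="$1"
  case "$q" in
    *Ki) num="${q%Ki}"; mult=1024 ;;
    *Mi) num="${q%Mi}"; mult=$((1024 * 1024)) ;;
    *Gi) num="${q%Gi}"; mult=$((1024 * 1024 * 1024)) ;;
    *k)  num="${q%k}";  mult=1000 ;;
    *M)  num="${q%M}";  mult=1000000 ;;
    *G)  num="${q%G}";  mult=1000000000 ;;
    *)   num="$q";      mult=1 ;;
  esac
  case "$num" in
    ''|*[!0-9]*) echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
  echo $((num * mult))
}

# mem_usage_pct USAGE LIMIT
# Print usage as a whole-number percentage of the limit (rounded down).
mem_usage_pct() {
  used="$(mem_to_bytes "$1")" || return 1
  limit="$(mem_to_bytes "$2")" || return 1
  [ "$limit" -gt 0 ] || { echo "limit must be greater than zero" >&2; return 1; }
  echo $((used * 100 / limit))
}
```

For example, mem_usage_pct 900Mi 1Gi prints 87, which suggests the container is running close to its limit and either the limit or the workload needs attention.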
Terminating (stuck forever)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force
kubectl get pod <pod> -n <ns> -o json | jq '.metadata.finalizers'
kubectl patch pod <pod> -n <ns> -p '{"metadata":{"finalizers":[]}}' --type=merge
Service Not Reachable
Step-by-step when a Service returns connection refused or times out:
kubectl get svc <name> -n <ns>
kubectl get endpoints <name> -n <ns>
kubectl get svc <name> -n <ns> -o jsonpath='{.spec.selector}'
kubectl get pods -n <ns> -l <key>=<value>
kubectl get svc <name> -n <ns> -o yaml | grep -A5 ports
kubectl run net-test --image=busybox --rm -it -- sh
wget -O- http://<svc>.<ns>.svc.cluster.local:<port>
nslookup <svc>.<ns>.svc.cluster.local
kubectl get networkpolicy -n <ns>
Common causes:
- No endpoints: label selector mismatch between Service and Pods
- Wrong port: port vs targetPort confusion
- Pod not Ready: readiness probe failing, so endpoint excluded
- NetworkPolicy: default-deny rule blocking traffic
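The most common cause in the list above, a selector that does not match the pod labels, follows one rule: every key=value pair in the Service selector must appear on the pod, while extra pod labels are ignored. The sketch below checks that rule on two comma-separated strings. The function name and the comma-separated input format are assumptions; kubectl get pods --show-labels prints labels in that form, but the selector printed by the jsonpath command above needs reformatting first.

```shell
#!/bin/sh
# selector_matches SELECTOR LABELS
# Return 0 if every key=value pair in SELECTOR appears in LABELS.
# Both arguments are comma-separated lists such as "app=web,tier=fe".
# Returns 1 on mismatch and 2 if SELECTOR is empty.
selector_matches() {
  selector="$1"
  labels=",$2,"
  [ -n "$selector" ] || return 2
  # Walk the selector one key=value pair at a time.
  while [ -n "$selector" ]; do
    pair="${selector%%,*}"
    case "$labels" in
      *",$pair,"*) ;;        # pair is present on the pod
      *) return 1 ;;         # key missing or value differs
    esac
    case "$selector" in
      *,*) selector="${selector#*,}" ;;
      *)   selector="" ;;
    esac
  done
  return 0
}
```

A call such as selector_matches "app=web,tier=api" "app=web,tier=fe" fails, which is the situation that leaves kubectl get endpoints empty.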
Node NotReady
kubectl describe node <node>
ssh <node>
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager
systemctl status containerd
crictl ps
crictl images
df -h
du -sh /var/lib/containerd
crictl rmi --prune
free -h
DNS Resolution Failures
kubectl run dns-test --image=busybox --rm -it -- nslookup kubernetes.default
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
kubectl get configmap coredns -n kube-system -o yaml
DNS format:
<service>.<namespace>.svc.cluster.local → ClusterIP
<pod-ip-dashes>.<namespace>.pod.cluster.local → Pod IP
nslookup my-svc
nslookup my-svc.staging
nslookup my-svc.staging.svc.cluster.local
nslookup 8.8.8.8
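The two name formats above can be built mechanically, which avoids typos when testing resolution by hand. This is a minimal sketch; the function names are assumptions, and cluster.local is used as the cluster domain only because the formats above use it.

```shell
#!/bin/sh
# svc_fqdn SERVICE NAMESPACE [CLUSTER_DOMAIN]
# Print the fully qualified DNS name of a Service.
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# pod_dns_name POD_IP NAMESPACE [CLUSTER_DOMAIN]
# Print the DNS name of a pod: the IP with dots replaced by dashes.
pod_dns_name() {
  ip_dashes="$(printf '%s' "$1" | tr '.' '-')"
  printf '%s.%s.pod.%s\n' "$ip_dashes" "$2" "${3:-cluster.local}"
}
```

For example, nslookup "$(svc_fqdn my-svc staging)" queries my-svc.staging.svc.cluster.local. If the fully qualified name resolves but the short name does not, the search domains in the pod's /etc/resolv.conf are the next thing to check.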
Deployment Not Updating
kubectl rollout status deployment/<name> -n <ns>
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'
kubectl describe deployment <name> -n <ns>
kubectl describe pod <new-pod> -n <ns>
kubectl rollout undo deployment/<name> -n <ns>
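The jsonpath command above prints one line per pod with the pod name, a tab, and the image of the first container. A small filter over that output shows which pods are still on the old image. The function name is an assumption, and the sketch only looks at the two tab-separated fields that command produces.

```shell
#!/bin/sh
# pods_not_on_image EXPECTED_IMAGE
# Read "name<TAB>image" lines on stdin and print the names of pods
# whose image differs from EXPECTED_IMAGE.
# Usage idea (not run here): pipe the jsonpath command above into
#   pods_not_on_image registry.example.com/app:v2
pods_not_on_image() {
  awk -F'\t' -v want="$1" '$2 != want { print $1 }'
}
```

Empty output means every listed pod already runs the expected image, so the rollout itself is not the problem.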
PVC Stuck in Pending
kubectl describe pvc <name> -n <ns>
| Reason | Fix |
|---|---|
| no persistent volumes available | No PV with matching access mode/size exists |
| storageclass not found | Wrong storageClassName; check kubectl get sc |
| waiting for first consumer | StorageClass has volumeBindingMode: WaitForFirstConsumer; normal, binds when the consuming pod is created |
| ProvisioningFailed | Cloud provisioner error; check provisioner logs |
kubectl get storageclass
kubectl get pv
kubectl logs -n kube-system -l app=ebs-csi-controller
HPA Not Scaling
kubectl describe hpa <name> -n <ns>
kubectl get pods -n kube-system | grep metrics-server
kubectl get deployment <name> -n <ns> -o yaml | grep -A4 resources
kubectl get hpa <name> -n <ns>
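To judge whether the HPA should have scaled, compare what kubectl describe hpa reports with the documented scaling rule: desired replicas = ceil(current replicas × current metric / target metric). Utilization targets are percentages of the container's resource request, which is why the commands above check that requests are set. The sketch below does the arithmetic in integer shell math. The function name is an assumption, and the controller's tolerance band and min/max replica bounds are left out.

```shell
#!/bin/sh
# hpa_desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Print ceil(CURRENT_REPLICAS * CURRENT_METRIC / TARGET_METRIC).
# Metrics are whole numbers in the same unit, e.g. CPU utilization percent.
# Simplification: no tolerance band, no minReplicas/maxReplicas clamping.
hpa_desired_replicas() {
  current="$1"
  metric="$2"
  target="$3"
  [ "$target" -gt 0 ] || { echo "target must be greater than zero" >&2; return 1; }
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (current * metric + target - 1) / target ))
}
```

For example, hpa_desired_replicas 4 90 60 prints 6: four replicas at 90% utilization against a 60% target should grow to six. If the HPA shows the same inputs but no scale-up, look at maxReplicas and at the conditions listed by kubectl describe hpa.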
etcd Issues
kubectl get pods -n kube-system | grep etcd
ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
ETCDCTL_API=3 etcdctl member list \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
ETCDCTL_API=3 etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table
ETCDCTL_API=3 etcdctl compaction <revision>
ETCDCTL_API=3 etcdctl defrag --endpoints=https://127.0.0.1:2379 ...
Quick Reference: Pod Status Meanings
| Status | Meaning | First action |
|---|---|---|
| Pending | Scheduler can't place it | kubectl describe pod → Events |
| ContainerCreating | Image pulling / volumes mounting | kubectl describe pod → Events |
| Running | All containers running | Check readiness if service issues |
| CrashLoopBackOff | Container keeps exiting | kubectl logs --previous |
| OOMKilled | Memory limit hit | Increase limit, check for leak |
| ImagePullBackOff | Can't pull image | Check name, tag, imagePullSecret |
| Error | Container exited with error | kubectl logs --previous |
| Terminating | Being deleted | Check finalizers if stuck |
| Evicted | Node pressure evicted pod | Check node disk/memory |
| Completed | Job pod finished successfully | Normal for Jobs |
| Unknown | Node lost contact | Check node status |