
Kubernetes Troubleshooting Guide

A systematic approach to debugging any Kubernetes problem. Follow the decision tree for each symptom.


The Universal Debugging Checklist

# 1. What is the current state?
kubectl get pods -n <ns> -o wide

# 2. What happened recently?
kubectl get events --sort-by='.lastTimestamp' -n <ns> | tail -20

# 3. What does the object say?
kubectl describe pod <pod> -n <ns>

# 4. What does the container say?
kubectl logs <pod> -n <ns> --previous   # last terminated container (only exists after a restart)
kubectl logs <pod> -n <ns> -f           # follow the current container
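
The four steps can be wrapped in one helper so they always run in the same order. This is a minimal sketch, assuming a POSIX shell and a working kubectl context; the function name k8s_triage is hypothetical, not part of kubectl.

```shell
# Hypothetical helper (not a kubectl feature): runs the checklist in order.
# Usage: k8s_triage <pod> <ns>
k8s_triage() {
  pod=${1:?usage: k8s_triage <pod> <ns>}
  ns=${2:?usage: k8s_triage <pod> <ns>}

  echo "== 1. Current state =="
  kubectl get pods -n "$ns" -o wide

  echo "== 2. Recent events =="
  kubectl get events --sort-by='.lastTimestamp' -n "$ns" | tail -20

  echo "== 3. Object description =="
  kubectl describe pod "$pod" -n "$ns"

  echo "== 4. Container logs =="
  # --previous fails when the container never restarted,
  # so fall back to the current container's logs.
  kubectl logs "$pod" -n "$ns" --previous 2>/dev/null ||
    kubectl logs "$pod" -n "$ns" --tail=100
}
```

Call it as k8s_triage <pod> <ns>. The output is long, so piping it through less is usually easier to read.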

Pod Status Decision Tree

Pending — Pod not scheduled

kubectl describe pod <pod> -n <ns>
# Look at "Events:" section at the bottom

| Event message | Root cause | Fix |
|---|---|---|
| Insufficient cpu / Insufficient memory | No node has enough resources | Scale cluster, reduce requests, or remove resource hogs |
| 0/3 nodes are available: 3 node(s) had untolerated taint | Tainted nodes, pod lacks toleration | Add toleration or remove taint |
| 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector | nodeSelector/nodeAffinity mismatch | Fix labels: kubectl get nodes --show-labels |
| no persistent volumes available | No PV matches PVC | Check StorageClass, create PV, or check provisioner |
| pod has unbound immediate PersistentVolumeClaims | PVC stuck Pending | kubectl describe pvc <name> |
| Unschedulable: Too many pods | Node at maxPods limit (110 default) | Add nodes or adjust maxPods |

# Check node resource availability
kubectl describe nodes | grep -A 10 "Allocated resources"

# Check if scheduler is running
kubectl get pods -n kube-system -l component=kube-scheduler

CrashLoopBackOff — Container keeps crashing

# Get logs from the crashed container
kubectl logs <pod> -n <ns> --previous

# Describe for exit code
kubectl describe pod <pod> -n <ns>
# Look for: "Last State: Terminated  Reason: Error  Exit Code: X"

| Exit Code | Meaning | Common cause |
|---|---|---|
| 1 | General application error | App crash; check logs |
| 2 | Misuse of shell built-in | Bad script/command |
| 126 | Command not executable | Wrong path or permissions |
| 127 | Command not found | Missing binary in image |
| 128+N | Fatal signal N | 137 = SIGKILL (128+9), usually the OOM killer; 143 = SIGTERM (128+15) |
| OOMKilled (Reason field) | Out of memory | Increase memory limit |

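
The 128+N rule is easy to get wrong under pressure, so it can be handy to let the shell do the arithmetic. This is a rough sketch, assuming a POSIX shell whose kill -l accepts a signal number; the function name explain_exit_code is hypothetical.

```shell
# Hypothetical helper: decode the "Exit Code" shown by `kubectl describe pod`.
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    # `kill -l <number>` prints the signal name without the SIG prefix.
    if name=$(kill -l "$sig" 2>/dev/null); then
      name="SIG$name"
    else
      name="unknown signal"
    fi
    echo "exit $code = 128 + $sig ($name)"
  else
    case $code in
      0)   echo "exit 0 = success" ;;
      126) echo "exit 126 = command not executable" ;;
      127) echo "exit 127 = command not found" ;;
      *)   echo "exit $code = application error, check the logs" ;;
    esac
  fi
}
```

For example, explain_exit_code 137 reports signal 9 (SIGKILL). Whether the kill came from the OOM killer still has to be confirmed from the Reason field in kubectl describe pod.
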
# Run a one-off debug pod to test the image
kubectl run debug --image=<same-image> --rm -it --command -- /bin/sh

# Override the entrypoint to prevent crashing, then exec in
kubectl run debug --image=<same-image> --command -- sleep 3600
kubectl exec -it debug -- /bin/sh

ImagePullBackOff / ErrImagePull

kubectl describe pod <pod> -n <ns>
# Events will show: "Failed to pull image ... 401 Unauthorized"
#                or "Failed to pull image ... not found"

| Symptom | Fix |
|---|---|
| 401 Unauthorized | Create/fix imagePullSecret |
| not found / manifest unknown | Wrong image name or tag |
| connection refused | Registry unreachable from node |
| Works locally, fails in cluster | Private registry: add imagePullSecret |

# Create docker registry secret
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<pass> \
  -n <namespace>

# Add to pod spec
spec:
  imagePullSecrets:
  - name: regcred

OOMKilled — Out Of Memory

kubectl describe pod <pod> -n <ns>
# Last State: Terminated  Reason: OOMKilled

# Check memory usage before it dies
kubectl top pods -n <ns>
# Fix: increase memory limit
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "1Gi"    # was too low
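
To judge how close a container is to its limit, the usage from kubectl top has to be compared with the limit in the same unit. This is a small sketch of that comparison, assuming whole-number quantities with Ki, Mi, or Gi suffixes; the helper names to_kib and mem_pct are hypothetical.

```shell
# Hypothetical helper: convert a memory quantity (Ki, Mi, Gi) to KiB.
# Assumes whole numbers; fractional values such as 1.5Gi are not handled.
to_kib() {
  case $1 in
    *Ki) echo "${1%Ki}" ;;
    *Mi) echo $(( ${1%Mi} * 1024 )) ;;
    *Gi) echo $(( ${1%Gi} * 1024 * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Print usage as a whole-number percentage of the limit (rounded down).
# Usage: mem_pct <used> <limit>
mem_pct() {
  used=$(to_kib "$1") || return 1
  limit=$(to_kib "$2") || return 1
  echo $(( used * 100 / limit ))
}
```

For example, mem_pct 900Mi 1Gi compares a reading from kubectl top against a 1Gi limit. A container that sits near 100% shortly before it restarts is a strong hint that the limit is too low or that the app leaks memory.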

Terminating (stuck forever)

# Force delete (removes the API object without waiting for the kubelet to confirm the container has stopped)
kubectl delete pod <pod> -n <ns> --grace-period=0 --force

# If still stuck — check for finalizers
kubectl get pod <pod> -n <ns> -o json | jq '.metadata.finalizers'

# Remove finalizers (last resort: skips whatever cleanup the finalizer was waiting for)
kubectl patch pod <pod> -n <ns> -p '{"metadata":{"finalizers":[]}}' --type=merge

Service Not Reachable

Step-by-step when a Service returns connection refused or times out:

# Step 1: Does the service exist?
kubectl get svc <name> -n <ns>

# Step 2: Does it have endpoints? (If empty, selector isn't matching any pods)
kubectl get endpoints <name> -n <ns>

# Step 3: Do pods match the service selector?
kubectl get svc <name> -n <ns> -o jsonpath='{.spec.selector}'
kubectl get pods -n <ns> -l <key>=<value>   # use the selector labels

# Step 4: Is the targetPort correct?
kubectl get svc <name> -n <ns> -o yaml | grep -A5 ports

# Step 5: Test from inside the cluster
kubectl run net-test --image=busybox --rm -it -- sh
  wget -O- http://<svc>.<ns>.svc.cluster.local:<port>
  nslookup <svc>.<ns>.svc.cluster.local

# Step 6: Is a NetworkPolicy blocking traffic?
kubectl get networkpolicy -n <ns>

Common causes:

  • No endpoints: label selector mismatch between Service and Pods
  • Wrong port: port vs targetPort confusion
  • Pod not Ready: readiness probe failing, so endpoint excluded
  • NetworkPolicy: default-deny rule blocking traffic
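
The selector check in Step 3 can be scripted so the labels do not have to be retyped by hand. This is a minimal sketch, assuming kubectl prints the selector as a compact JSON object (recent versions do) and that label keys and values follow the usual Kubernetes character rules; the helper name selector_to_labels is hypothetical.

```shell
# Hypothetical helper: turn {"app":"web","tier":"api"} into app=web,tier=api
# so it can be passed to `kubectl get pods -l`.
selector_to_labels() {
  printf '%s\n' "$1" | sed -e 's/[{}"]//g' -e 's/:/=/g'
}

# Usage (needs cluster access):
#   sel=$(kubectl get svc <name> -n <ns> -o jsonpath='{.spec.selector}')
#   kubectl get pods -n <ns> -l "$(selector_to_labels "$sel")"
```

If the second command returns no pods, the Service has nothing to route to, which matches the "No endpoints" cause in the list above.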

Node NotReady

# Check node conditions
kubectl describe node <node>
# Look for: "Ready False" with reason

# SSH to the node
ssh <node>

# Check kubelet
systemctl status kubelet
journalctl -u kubelet -n 100 --no-pager

# Common kubelet errors
# "failed to get node info" → API server unreachable
# "certificate has expired" → rotate certs
# "PLEG is not healthy" → container runtime issue

# Check container runtime
systemctl status containerd
crictl ps
crictl images

# Check disk space (DiskPressure)
df -h
du -sh /var/lib/containerd

# Clean up unused images
crictl rmi --prune

# Check memory (MemoryPressure)
free -h

DNS Resolution Failures

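# Test DNS from inside a pod (busybox:1.28 is pinned because nslookup in some later busybox images is known to misbehave)
kubectl run dns-test --image=busybox:1.28 --rm -it --restart=Never -- nslookup kubernetes.default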

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns

# Check CoreDNS ConfigMap
kubectl get configmap coredns -n kube-system -o yaml

DNS format:

<service>.<namespace>.svc.cluster.local       → ClusterIP
<pod-ip-dashes>.<namespace>.pod.cluster.local → Pod IP
# Test all DNS levels
nslookup my-svc                                       # short name (same namespace)
nslookup my-svc.staging                               # cross-namespace
nslookup my-svc.staging.svc.cluster.local             # FQDN
nslookup example.com                                  # external DNS (an IP such as 8.8.8.8 would only test reverse lookup)
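
Walking the names from shortest to fully qualified shows which level of resolution breaks. This sketch only generates the names, assuming the cluster domain is cluster.local unless a third argument says otherwise; the helper name dns_names is hypothetical.

```shell
# Hypothetical helper: print the lookup names for a Service, shortest first.
# Usage: dns_names <svc> <ns> [cluster-domain]
dns_names() {
  svc=$1
  ns=$2
  domain=${3:-cluster.local}   # assumption: the default cluster domain
  printf '%s\n' "$svc" "$svc.$ns" "$svc.$ns.svc" "$svc.$ns.svc.$domain"
}

# Usage inside a pod: stop at the first name that fails to resolve.
#   for name in $(dns_names my-svc staging); do
#     nslookup "$name" || { echo "failed at: $name"; break; }
#   done
```

If the FQDN resolves but the short names do not, the search list in the pod's /etc/resolv.conf is the first thing to inspect.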

Deployment Not Updating

# Check rollout status
kubectl rollout status deployment/<name> -n <ns>

# Check if pods are actually on new version
kubectl get pods -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}'

# Why is the rollout stuck?
kubectl describe deployment <name> -n <ns>
# Check: "Conditions" section — look for "Progressing False"

# Common: new pods won't start because of readiness probe failure
kubectl describe pod <new-pod> -n <ns>

# Rollback immediately
kubectl rollout undo deployment/<name> -n <ns>
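
The status check and the rollback can be combined so a stuck rollout is reverted automatically. This is a minimal sketch, assuming that rolling back is the right reaction in your environment; the function name rollout_or_undo and the 120s default are hypothetical choices.

```shell
# Hypothetical helper: wait for a rollout, roll back if it does not finish in time.
# Usage: rollout_or_undo <deployment> <ns> [timeout]
rollout_or_undo() {
  name=$1
  ns=$2
  timeout=${3:-120s}
  # `rollout status` exits non-zero when the timeout expires or the rollout fails.
  if kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout"; then
    echo "rollout of $name succeeded"
  else
    echo "rollout of $name failed, rolling back" >&2
    kubectl rollout undo "deployment/$name" -n "$ns"
    return 1
  fi
}
```

In a deploy script this would run right after kubectl apply. The non-zero return lets the pipeline fail even though the cluster is back on the previous revision.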

PVC Stuck in Pending

kubectl describe pvc <name> -n <ns>
# Events will say why

| Reason | Cause / Fix |
|---|---|
| no persistent volumes available | No PV with matching access mode/size exists |
| storageclass not found | Wrong storageClassName; check kubectl get sc |
| waiting for first consumer | StorageClass has volumeBindingMode: WaitForFirstConsumer; normal, binds when pod is created |
| ProvisioningFailed | Cloud provisioner error; check provisioner logs |

# Check available StorageClasses
kubectl get storageclass

# Check if a matching PV exists
kubectl get pv

# Check provisioner pod logs (example: aws-ebs-csi-driver)
kubectl logs -n kube-system -l app=ebs-csi-controller

HPA Not Scaling

kubectl describe hpa <name> -n <ns>
# Look for: "Warning  FailedGetResourceMetric"

# Common causes:
# 1. metrics-server not installed
kubectl get pods -n kube-system | grep metrics-server

# 2. Resource requests not set on deployment (HPA can't calculate utilization %)
kubectl get deployment <name> -n <ns> -o yaml | grep -A4 resources

# 3. Current replicas already at max
kubectl get hpa <name> -n <ns>
# MAXPODS column = maxReplicas — if REPLICAS == MAXPODS, it's at ceiling
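
It helps to know what replica count the HPA should be asking for. The documented core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below computes only that, assuming whole-number inputs; it ignores the controller's tolerance band and the min/max clamping, and the function name hpa_desired is hypothetical.

```shell
# Hypothetical helper: ceil(replicas * current / target) with integer arithmetic.
# Usage: hpa_desired <current-replicas> <current-metric> <target-metric>
hpa_desired() {
  replicas=$1
  current=$2
  target=$3
  # Integer ceiling division: (a + b - 1) / b
  echo $(( (replicas * current + target - 1) / target ))
}
```

For example, 4 replicas at 90% CPU against a 60% target gives hpa_desired 4 90 60, which is 6. If the HPA reports fewer replicas than this, check for the maxReplicas ceiling or missing metrics described above.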

etcd Issues

# Check etcd pod
kubectl get pods -n kube-system | grep etcd

# Check etcd health
ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Check etcd member list
ETCDCTL_API=3 etcdctl member list \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# etcd db size (default quota is 2GB, set with --quota-backend-bytes; 8GB is the suggested maximum)
ETCDCTL_API=3 etcdctl endpoint status \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  --write-out=table

# Compact + defrag if db is too large
ETCDCTL_API=3 etcdctl compact <revision>
ETCDCTL_API=3 etcdctl defrag --endpoints=https://127.0.0.1:2379 ...
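
The <revision> for compact is usually taken from the current endpoint status. This is a small sketch of extracting it, assuming the JSON output contains a compact "revision":<number> field; the helper name etcd_revision is hypothetical.

```shell
# Hypothetical helper: read `etcdctl endpoint status --write-out=json` on stdin
# and print the first "revision" value found.
etcd_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Usage (add the same --endpoints and TLS flags as the commands above):
#   rev=$(ETCDCTL_API=3 etcdctl endpoint status --write-out=json ... | etcd_revision)
#   ETCDCTL_API=3 etcdctl compact "$rev" ...
```

Compaction only marks old revisions as removable. The defrag step is what actually returns the space to the filesystem.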

Quick Reference: Pod Status Meanings

| Status | Meaning | First action |
|---|---|---|
| Pending | Scheduler can't place it | kubectl describe pod → Events |
| ContainerCreating | Image pulling / volumes mounting | kubectl describe pod → Events |
| Running | All containers running | Check readiness if service issues |
| CrashLoopBackOff | Container keeps exiting | kubectl logs --previous |
| OOMKilled | Memory limit hit | Increase limit, check for leak |
| ImagePullBackOff | Can't pull image | Check name, tag, imagePullSecret |
| Error | Container exited with error | kubectl logs --previous |
| Terminating | Being deleted | Check finalizers if stuck |
| Evicted | Node pressure evicted pod | Check node disk/memory |
| Completed | Job pod finished successfully | Normal for Jobs |
| Unknown | Node lost contact | Check node status |