118 handy commands across 19 operational areas, grouped from simple checks to complex production moves.
Pods, deployments, events, logs, rollouts, and cluster debugging.

kubectl get pods -n payments -o wide
Shows pod status, node placement, restarts, and pod IPs. This is the first quick scan during most Kubernetes incidents.

kubectl describe pod payments-api-7c9df -n payments
Use this for events, image pull failures, probe failures, scheduling errors, resource pressure, and volume mount problems.

kubectl debug -n payments -it pod/payments-api-7c9df --image=nicolaka/netshoot --target=app
Attaches a temporary troubleshooting container to inspect DNS, TCP, routes, certificates, and process-level symptoms without rebuilding the app image.

kubectl top pods -A --sort-by=memory
Use during node pressure or OOM investigations to identify noisy workloads before checking limits, requests, JVM heap, or sidecars.

kubectl get events -n payments --sort-by=.lastTimestamp
Events explain scheduling, image pull, probe, OOM, and admission failures that may not appear in application logs.

kubectl rollout history deployment/payments-api -n payments && kubectl rollout undo deployment/payments-api -n payments
Use history to confirm the bad revision, then roll back quickly when a deployment is clearly causing user impact.

kubectl get endpointslices -n payments -l kubernetes.io/service-name=payments-api -o wide
Confirms whether ready pods are actually behind the Service. This catches selector mismatches and readiness-gate failures.
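The history-then-undo step can also target a specific revision rather than simply the previous one. The sketch below is one possible wrapper, not a prescribed procedure: the function name rollback_to, the 120-second timeout, and the revision number in the usage line are assumptions for illustration.

```shell
# Roll a deployment back to a known-good revision and wait for it to settle.
# Assumptions: namespace, deployment name, and revision are supplied by the caller;
# the 120s timeout is an arbitrary illustrative value.
rollback_to() {
  local ns="$1" deploy="$2" rev="$3"
  # Show what the target revision contained before changing anything.
  kubectl rollout history "deployment/${deploy}" -n "$ns" --revision="$rev" || return 1
  # Roll back to that specific revision instead of "whatever came before".
  kubectl rollout undo "deployment/${deploy}" -n "$ns" --to-revision="$rev" || return 1
  # Block until the rollback finishes or the timeout expires.
  kubectl rollout status "deployment/${deploy}" -n "$ns" --timeout=120s
}

# Usage: rollback_to payments payments-api 3
```

Pinning the revision avoids rolling back onto another bad release when several deploys happened in quick succession.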
Images, containers, logs, runtime inspection, and local troubleshooting.

docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
Gives a clean status table without noisy columns. Useful when checking whether a local service actually started.

docker logs --since=15m -f api
Streams only recent logs so you do not waste time scrolling through old startup output.

docker inspect api --format "{{json .NetworkSettings.Networks}}"
Shows networks, aliases, IP addresses, and gateway information when containers cannot reach each other.

docker diff api
Lists files added, changed, or deleted in the container's filesystem since it was created. Helpful for debugging unexpected writes, generated config, or missing mounted volumes.

docker run --rm -it --entrypoint sh api:latest
Bypasses the default entrypoint so you can inspect files, permissions, environment expectations, and installed tools.

docker stats --no-stream
Shows CPU, memory, network, and block I/O usage for quick local saturation checks.

docker builder prune --filter until=24h
Cleans build cache older than 24 hours while keeping recent layers available. Useful when local Docker builds fill disk.
Multi-container local stacks, service dependencies, health checks, and logs.

docker compose up -d
Creates networks, volumes, and containers from compose YAML without blocking your terminal.

docker compose config
Renders the final compose file after variable interpolation and merges. Use this before blaming Docker for a YAML/env problem.

docker compose up -d --build app
Rebuilds and recreates the app service. Dependencies are started if they are not already running but are otherwise left alone.

docker compose run --rm --entrypoint sh app
Starts a temporary container using the app service config so you can inspect env vars, DNS, mounted files, and binaries.

docker compose logs -f --tail=100 app
Keeps database and helper service logs out of the way while debugging the app container.

docker compose up -d --no-deps --build app
Rebuilds only the changed service and, because of --no-deps, never touches databases, queues, or observability services.

docker inspect --format "{{json .State.Health}}" thedevopsproject-app-1
Reads Docker health-check state directly when compose says a service is running but the app is not ready.
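When a script has to wait for readiness rather than read the health state once, the same inspect call can be polled. This is a minimal sketch under assumptions: the function name wait_healthy and its default timeout and interval are illustrative, and the container must define a health check for .State.Health to exist.

```shell
# Poll Docker's health-check status until the container reports "healthy".
# Assumptions: the container defines a HEALTHCHECK; name, timeout, and interval
# are caller-supplied (the defaults of 60s and 2s are illustrative).
wait_healthy() {
  local name="$1" timeout="${2:-60}" interval="${3:-2}" waited=0 status=""
  while [ "$waited" -lt "$timeout" ]; do
    status="$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null || true)"
    if [ "$status" = "healthy" ]; then
      echo "healthy after ${waited}s"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "timed out after ${timeout}s (last status: ${status:-unknown})" >&2
  return 1
}

# Usage: wait_healthy thedevopsproject-app-1 90 && echo "safe to run smoke tests"
```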
Plan review, state inspection, drift checks, imports, and safe infrastructure changes.

terraform init
Downloads providers and configures backend state. Run this first in a new workspace or after provider/backend changes.

terraform plan -out=tfplan
Creates an exact plan artifact. Applying this file prevents accidental differences between review and execution.

terraform state show aws_instance.api
Inspects the current state values Terraform believes exist, useful when provider drift or imports are confusing.

terraform state mv aws_instance.old aws_instance.new
Renames a resource in state without recreating infrastructure. Use during refactors after reviewing a plan carefully.

terraform fmt -recursive && terraform validate
Catches syntax, provider schema, and formatting issues before plan review.

terraform plan -var-file=prod.tfvars -out=prod.tfplan
Keeps environment inputs explicit and saves the exact plan that should be reviewed and applied.

terraform import aws_s3_bucket.logs company-prod-logs
Brings an existing resource under Terraform state. Always follow with plan review to align configuration.
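For drift checks and for the plan review that should follow an import, the -detailed-exitcode flag lets a script distinguish "nothing to change" from "changes pending". The wrapper below is a sketch under assumptions: the function name plan_and_classify is hypothetical, and it expects an already initialized working directory.

```shell
# Classify a saved plan by exit code so CI can tell "no drift" from "changes pending".
# With -detailed-exitcode, terraform plan exits 0 (no changes), 1 (error), or 2 (changes present).
# Assumption: run from an initialized working directory; file names are caller-supplied.
plan_and_classify() {
  local var_file="$1" plan_file="$2" rc=0
  terraform plan -detailed-exitcode -var-file="$var_file" -out="$plan_file" || rc=$?
  case "$rc" in
    0) echo "no changes" ;;
    2) echo "changes pending: review with 'terraform show $plan_file', then 'terraform apply $plan_file'" ;;
    *) echo "plan failed (exit $rc)" >&2; return 1 ;;
  esac
}

# Usage: plan_and_classify prod.tfvars prod.tfplan
```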
Identity, EC2, IAM, EKS, S3, CloudWatch, RDS, and production triage.

aws sts get-caller-identity
Always verify the account, role, and user before making production changes.

aws logs filter-log-events --log-group-name /aws/eks/payments --filter-pattern "ERROR" --limit 20
Quickly searches managed logs without opening the console. Add start time filters during real incidents.

aws rds describe-events --source-type db-instance --duration 60
Shows recent RDS failovers, maintenance, backups, parameter changes, and availability events.

aws s3api list-buckets --query "Buckets[].Name" --output text
Start with inventory, then inspect each bucket policy, ACL, block public access, encryption, lifecycle, and access logs.

aws eks update-kubeconfig --region us-east-1 --name prod-platform
Writes the cluster context locally so kubectl can authenticate through AWS IAM.

aws ecs list-tasks --cluster prod --desired-status STOPPED --query "taskArns[0:10]"
Starts ECS incident triage by locating recently stopped tasks before describing exit codes and stopped reasons.

aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=AttachRolePolicy --max-results 20
Helps investigate privilege changes, emergency access, and unexpected role policy attachments.
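Two of the habits above, verifying identity first and adding a start time to log searches, are easy to script. The following is a sketch, not a standard tool: the function names require_account and recent_errors, the sample account ID, and the 15-minute default are assumptions for illustration.

```shell
# Refuse to continue unless the active credentials belong to the expected account.
# Assumption: the 12-digit account ID is passed in by the caller.
require_account() {
  local expected="$1" actual
  actual="$(aws sts get-caller-identity --query Account --output text)" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "refusing to continue: authenticated to ${actual}, expected ${expected}" >&2
    return 1
  fi
}

# Search a log group for ERROR lines from the last N minutes (default 15).
# --start-time takes milliseconds since the Unix epoch.
recent_errors() {
  local group="$1" minutes="${2:-15}" start_ms
  start_ms=$(( ($(date +%s) - minutes * 60) * 1000 ))
  aws logs filter-log-events --log-group-name "$group" \
    --filter-pattern "ERROR" --start-time "$start_ms" --limit 20
}

# Usage: require_account 111122223333 && recent_errors /aws/eks/payments 15
```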
CPU, memory, disk, processes, networking, cron, rsync, and systemd.

df -h
Shows filesystem usage. Full root, log, or data partitions commonly cause outages and failed deploys.

find /var/log -type f -size +100M -exec ls -lh {} \;
Use when disk fills unexpectedly. Pair with logrotate or application log settings.

ss -lntp
Shows TCP listeners and owning processes. Faster and more reliable than guessing whether a service bound correctly.

strace -tt -p 1234
Attaches to a live process to see blocking file, network, DNS, or permission calls. Use carefully in production.

systemctl status nginx --no-pager
Shows active state, recent logs, restart behavior, and unit-file hints in one place.

ps aux --sort=-%mem | head -15
Quickly identifies processes consuming memory before deeper heap, cache, or leak investigation.

tcpdump -i eth0 host 10.0.2.15 -w incident.pcap
Creates a packet capture for DNS, TLS, retransmission, or protocol analysis. Use with care on busy hosts.
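The df check above can be turned into a filter that prints only the filesystems worth worrying about. This is a small sketch: the function name over_threshold and the 90 percent cutoff are illustrative, and it assumes POSIX-format output from df -P with mount points that contain no spaces.

```shell
# Print mount points whose usage is at or above a threshold percentage.
# Reads `df -P` output on stdin so the filter can be tested without a real disk.
over_threshold() {
  awk -v limit="$1" 'NR > 1 {
    used = $5
    sub(/%/, "", used)          # "91%" -> "91"
    if (used + 0 >= limit + 0) print $6, $5
  }'
}

# Usage: df -P | over_threshold 90
```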
Scheduled jobs, backup syncs, lock files, permissions, and missed executions.

crontab -l
Shows the current user schedule. Remember cron has a smaller environment than your shell.

journalctl -u cron --since "1 hour ago"
Use this to confirm whether cron triggered the job at all before debugging the script.

rsync -avz --delete --dry-run /data/ app@backup:/data/
Preview destructive sync behavior before deleting files on the destination.

flock -n /tmp/backup.lock rsync -avz /data/ app@backup:/data/
Uses a lock file so long-running jobs do not overlap and corrupt backups or saturate I/O.

crontab backup.cron && crontab -l
Replaces the current user's crontab with a reviewed cron file and immediately confirms what cron will run.

rsync -aHAX --numeric-ids /srv/data/ backup:/srv/data/
Preserves hard links, ACLs, extended attributes, and numeric ownership for system-style backups.

rsync -az --partial --bwlimit=50000 /data/ app@backup:/data/
Keeps a large sync from saturating links and preserves partial files if the transfer is interrupted.
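The flock pattern above generalizes to any scheduled job. The sketch below wraps it in a helper; the function name run_locked, the lock path, and the cron schedule in the comment are assumptions for illustration.

```shell
# Run a command under an exclusive, non-blocking lock.
# Returns non-zero immediately if another run still holds the lock.
run_locked() {
  local lock="$1"
  shift
  flock -n "$lock" "$@"
}

# Usage from a script:
#   run_locked /tmp/backup.lock rsync -az --partial /data/ app@backup:/data/
# Equivalent crontab line (illustrative schedule; cron's PATH is minimal, so use absolute paths):
#   */30 * * * * /usr/bin/flock -n /tmp/backup.lock /usr/bin/rsync -az --partial /data/ app@backup:/data/
```

Because -n makes flock fail instead of waiting, an overlapping run is skipped rather than queued, which is usually what a periodic backup wants.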
Certificate expiry, SANs, chains, trust stores, handshakes, and mutual auth.

openssl x509 -in server.crt -noout -dates
Quickly confirms notBefore and notAfter dates for a certificate file.

openssl s_client -connect api.example.com:443 -servername api.example.com -showcerts
Checks SNI, served certificates, chain order, and handshake errors from the client point of view.

openssl verify -CAfile ca.pem server.crt
Confirms whether a certificate chains to the expected CA bundle.

curl --cert client.crt --key client.key --cacert ca.pem https://api.example.com/health
Validates both server trust and client authentication. Use when service mesh or gateway mTLS fails.

openssl x509 -in server.crt -noout -text
Look for Subject Alternative Name entries when hostname validation fails despite an unexpired certificate.

openssl s_client -connect api.example.com:443 -tls1_2 -servername api.example.com
Confirms whether a service still accepts or rejects a specific TLS version.

kubectl get secret api-tls -n ingress -o jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -subject -issuer -dates
Decodes the certificate stored in a Kubernetes secret and shows its subject, issuer, and expiry. This is the stored copy, which may differ from what the ingress is currently serving.
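Reading the dates by eye works during an incident, but a scheduled check needs a yes/no answer. The sketch below uses the -checkend option of openssl x509; the function name cert_expires_within and the 30-day window in the usage line are assumptions.

```shell
# Succeed if the certificate file expires within the given number of days.
cert_expires_within() {
  local cert="$1" days="$2"
  # -checkend exits 0 when the cert is still valid N seconds from now, so invert it.
  ! openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null
}

# Usage: cert_expires_within server.crt 30 && echo "renew server.crt soon"
```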
Thread dumps, heap dumps, GC, native memory, non-heap, and container memory limits.

jps -lv
Shows JVM process IDs and startup arguments so you can target the right process.

jstack -l 1234 > thread-dump.txt
Use for deadlocks, blocked threads, high CPU, stuck requests, or pool starvation.

jcmd 1234 GC.heap_dump /tmp/heap.hprof
Captures heap for memory leak analysis. Ensure enough disk space before running in production.

jcmd 1234 VM.native_memory summary
Helps explain memory outside Java heap: metaspace, threads, code cache, direct buffers, and native allocations. The JVM must have been started with -XX:NativeMemoryTracking=summary (or detail) for this to return data.

jcmd 1234 GC.heap_info
Shows heap layout and usage. Pair with metaspace and native memory checks when container RSS is high.

jcmd 1234 GC.class_histogram > class-histogram.txt
Counts live objects by class and helps identify suspicious growth before taking a full heap dump.

jcmd 1234 JFR.start name=incident settings=profile duration=120s filename=/tmp/incident.jfr
Captures CPU, allocation, lock, GC, and thread events with lower overhead than many ad hoc profilers.
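A single thread dump shows one instant; threads that are stuck appear in the same state across several samples. The loop below is a sketch of that common practice, which the list above does not itself prescribe: the function name thread_dump_series, the default of three samples five seconds apart, and the output file names are all assumptions.

```shell
# Capture several thread dumps a few seconds apart so stuck threads stand out across samples.
# Assumptions: three samples, five seconds apart, written to the current directory.
thread_dump_series() {
  local pid="$1" count="${2:-3}" pause="${3:-5}" i=1
  while [ "$i" -le "$count" ]; do
    jstack -l "$pid" > "thread-dump-${i}.txt" || return 1
    if [ "$i" -lt "$count" ]; then
      sleep "$pause"
    fi
    i=$((i + 1))
  done
}

# Usage: thread_dump_series 1234
```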
Topics, partitions, consumer lag, replication, ISR, retention, and broker health.

kafka-topics.sh --bootstrap-server broker:9092 --list
Confirms the cluster is reachable and shows available topics.

kafka-consumer-groups.sh --bootstrap-server broker:9092 --describe --group payments-consumer
Shows current offset, log end offset, lag, and partition assignment for a consumer group.

kafka-topics.sh --bootstrap-server broker:9092 --describe --topic payments
Use to inspect partition count, leader broker, replicas, and in-sync replicas.

kafka-consumer-groups.sh --bootstrap-server broker:9092 --group payments-consumer --topic payments --reset-offsets --to-earliest --execute
Replays messages from earliest offset. The consumer group must be stopped first, and you can preview the result by replacing --execute with --dry-run. Treat as a controlled operation because it can duplicate processing.

kafka-console-producer.sh --bootstrap-server broker:9092 --topic payments
Useful for validating producer connectivity, ACLs, and topic availability during setup or incident triage.

kafka-console-consumer.sh --bootstrap-server broker:9092 --topic payments --from-beginning --max-messages 10
Samples stored messages to verify serialization, routing, headers, and whether data is arriving.

kafka-broker-api-versions.sh --bootstrap-server broker:9092
Confirms protocol compatibility between clients and brokers after upgrades.
PromQL, scrape targets, alerts, dashboards, SLI/SLO, and burn-rate analysis.

curl -s http://prometheus:9090/api/v1/targets
Shows which scrape targets are up or down and why metrics may be missing.

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Basic SLI query for server error ratio over five minutes.

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
Calculates p99 latency from histogram buckets grouped by service.

sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 14.4 * 0.001
Example fast-burn alert for a 99.9 percent availability SLO. Tune windows and multiplier to your policy.

curl -s http://prometheus:9090/api/v1/alerts
Shows currently firing and pending alerts, including labels and annotations.

curl -G http://prometheus:9090/api/v1/query --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m]))'
Runs a PromQL query from automation or CI without opening the Prometheus UI.

(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 14.4 * 0.001) and (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 14.4 * 0.001)
Combines short and longer windows to reduce noisy SLO alerts while still catching fast user-impacting burn.
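The 14.4 * 0.001 term in the burn-rate alerts is the burn-rate multiplier times the allowed error ratio (1 minus the SLO target). The arithmetic can be checked with a short sketch; the function names are hypothetical, and the 30-day budget window is an assumption that should match your own SLO period.

```shell
# Turn an SLO target and burn-rate multiplier into the error-ratio threshold used in the alert.
burn_threshold() {
  # $1 = SLO target as a fraction (0.999), $2 = burn-rate multiplier (14.4)
  awk -v slo="$1" -v burn="$2" 'BEGIN { printf "%.4f\n", burn * (1 - slo) }'
}

# Fraction of a 30-day error budget consumed if that burn rate is sustained for N hours.
# Assumption: the SLO window is 30 days (720 hours).
budget_consumed() {
  # $1 = burn-rate multiplier, $2 = hours sustained
  awk -v burn="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", burn * hours / (30 * 24) }'
}

burn_threshold 0.999 14.4   # prints 0.0144
budget_consumed 14.4 1      # prints 0.02
```

In other words, a 14.4x burn sustained for one hour spends about 2 percent of a 30-day budget, which is why that multiplier is paired with the one-hour window.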
Connection checks, slow queries, indexes, replication, backups, and production safety.

mongosh --eval "db.adminCommand({ ping: 1 })"
Confirms the client can connect and the server can answer a basic command.

mysql -e "SHOW FULL PROCESSLIST;"
Shows active queries, locks, long-running sessions, and blocked connections.

db.orders.find({status: "pending"}).explain("executionStats")
Use in mongosh to identify collection scans, bad indexes, and high document examination counts.

mysql -e "SHOW REPLICA STATUS\G"
Checks replication lag, IO thread, SQL thread, last error, and failover readiness.

db.currentOp({ "secs_running": { $gt: 5 } })
Run in mongosh to find long-running operations, blocked writes, and expensive reads.

mysql -e "SHOW INDEX FROM orders;" appdb
Shows existing indexes so you can compare them with slow query predicates and joins.

rs.status()
Run in mongosh to inspect primary/secondary health, election state, replication lag hints, and member errors.
Container scanning, SAST gates, policy exceptions, and remediation evidence.

trivy image api:latest
Finds OS and application dependency vulnerabilities in a container image.

trivy image --severity HIGH,CRITICAL --exit-code 1 api:latest
Turns scanning into a CI gate that fails only on HIGH or CRITICAL findings. Lower-severity findings are filtered out of this report, so run an unfiltered scan separately if you need to see them.

trivy config ./terraform
Finds Terraform, Kubernetes, and other infrastructure misconfigurations before deployment.

cx scan create --project-name payments-api --source . --branch main
Starts a SAST scan from CI/CD. Pair with policy gates and a triage workflow for false positives.

trivy image --format json --output trivy-report.json api:latest
Creates machine-readable evidence for CI artifacts, exception review, or vulnerability dashboards.

trivy fs --scanners vuln,secret,misconfig .
Checks the working tree for dependency vulnerabilities, leaked secrets, and configuration issues.

cx scan create --project-name payments-api --source . --branch main --threshold "high=0;medium=10"
Example CI-style SAST gate that blocks new high-risk findings and limits medium-risk accumulation.
Operational scripts, APIs, JSON/YAML, subprocess safety, retries, and cloud SDKs.

python3 -m venv .venv && source .venv/bin/activate
Keeps automation dependencies isolated from system Python.

python3 -m json.tool response.json
Quickly validates and formats JSON output from scripts or curl captures.

python3 -Wd scripts/smoke_api.py
Surfaces deprecation warnings that can break automation during future runtime upgrades.

python3 -m cProfile -o profile.out scripts/sync_inventory.py
Captures timing data so you can identify slow API calls, inefficient parsing, or accidental loops.

python3 -m pip install -r requirements.txt
Installs automation dependencies from a reviewed manifest so scripts behave consistently across hosts.

python3 -m unittest discover -s scripts/tests -v
Runs standard-library tests without requiring pytest, useful for lightweight automation repos.

python3 -X importtime scripts/sync_inventory.py
Finds slow imports and startup overhead in automation that runs frequently from cron or CI.
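Because json.tool exits non-zero on malformed input, the validation command above also works as a guard inside shell automation. A minimal sketch, with the function name is_valid_json as an assumption:

```shell
# Fail fast when a captured API response is not valid JSON.
# json.tool exits non-zero on a parse error, so the exit status is the answer.
is_valid_json() {
  python3 -m json.tool "$1" >/dev/null 2>&1
}

# Usage: is_valid_json response.json || { echo "response.json is not valid JSON" >&2; exit 1; }
```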
Cluster health, shards, indexes, snapshots, search latency, and log analytics.

curl -s "http://opensearch:9200/_cluster/health?pretty"
Shows green/yellow/red status, active shards, initializing shards, relocating shards, and unassigned shards. Quote the URL so the shell does not treat the ? as a glob character.

curl -s "http://opensearch:9200/_cat/indices?v&s=store.size:desc"
Finds large indexes that may be driving disk pressure, slow snapshots, or retention issues.

curl -s "http://opensearch:9200/_cluster/allocation/explain?pretty"
Explains why a shard cannot allocate, such as disk watermark, missing node attributes, or replica constraints.

curl -X PUT "http://opensearch:9200/_snapshot/prod_repo/snap-2026-06-26?wait_for_completion=true"
Runs a cluster snapshot before risky maintenance. Repository must already be configured and healthy.
Log pipelines, grok parsing, backpressure, outputs, registry state, and delivery checks.

logstash --path.settings /etc/logstash -t
Validates pipeline syntax before restarting Logstash.

filebeat test config -e
Checks YAML, modules, inputs, and output configuration with logs printed to stderr.

filebeat test output -e
Confirms connectivity and authentication to OpenSearch, Elasticsearch, Logstash, or another configured output.

logstash -f pipeline.conf --log.level debug
Runs a pipeline with verbose logs so you can troubleshoot grok failures, conditionals, and output retries.
CI/CD pipelines, deployments, connectors, delegates, environments, and rollback evidence.

harness login
Confirms CLI access before triggering pipelines or inspecting project resources.

harness pipeline list --project-id payments --org-id platform
Shows available pipelines and identifiers needed for automation.

harness pipeline run --project-id payments --org-id platform --pipeline-id deploy-api
Triggers a deployment or CI workflow from the CLI using known Harness identifiers.

harness delegate list --account-id ACCOUNT_ID
Checks whether delegates are available to execute Kubernetes, cloud, or artifact operations.
Quorum health, znodes, Kafka metadata legacy mode, sessions, watches, and latency.

echo ruok | nc zookeeper 2181
Returns imok when ZooKeeper is reachable and responding to four-letter commands.

echo stat | nc zookeeper 2181
Shows leader/follower mode, connections, latency, packets, and znode count.

zkCli.sh -server zookeeper:2181 ls /
Confirms namespace contents and whether clients are writing expected znodes.

echo wchs | nc zookeeper 2181
Shows watch counts and helps diagnose clients creating too many watches or sessions.
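The ruok probe is simple to wrap for use in a health script. The sketch below is illustrative: the function name zk_ok and the two-second nc timeout are assumptions, and it presumes the server permits ruok through its 4lw.commands.whitelist setting, which recent ZooKeeper versions restrict by default.

```shell
# Return success only if ZooKeeper answers "imok" to the ruok four-letter command.
# Assumption: ruok is allowed by 4lw.commands.whitelist on the server.
zk_ok() {
  local host="$1" port="${2:-2181}" reply
  reply="$(echo ruok | nc -w 2 "$host" "$port" 2>/dev/null || true)"
  [ "$reply" = "imok" ]
}

# Usage: zk_ok zookeeper 2181 || echo "ZooKeeper did not answer imok" >&2
```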
Reliability indicators, error budgets, customer satisfaction, support satisfaction, and incident review.

good_events / total_events
The core SLI shape: define good user-visible events, divide by total valid events, and track over the SLO window.

1 - ((1 - current_availability) / (1 - slo_target))
Shows how much reliability budget remains for a target such as 99.9 percent monthly availability.

(positive_survey_responses / total_survey_responses) * 100
Measures customer satisfaction from survey responses. Useful as a business-facing reliability companion metric.

(satisfied_support_responses / total_support_responses) * 100
Support satisfaction helps detect reliability pain that may not be visible in service-level telemetry alone.
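The remaining-budget formula above can be worked through with concrete numbers. This is a small sketch with a hypothetical function name and illustrative availability figures; a negative result means the budget is already overspent.

```shell
# Remaining error budget as a fraction: 1 - (observed error rate / allowed error rate).
error_budget_remaining() {
  # $1 = current availability (0.9995), $2 = SLO target (0.999)
  awk -v a="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 1 - ((1 - a) / (1 - t)) }'
}

error_budget_remaining 0.9995 0.999   # prints 0.50
error_budget_remaining 0.9985 0.999   # prints -0.50 (budget overspent)
```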