39 reviewed items across 4 content types.
A payments team built 34 container images on every merge. CI build time reached 42 minutes during peak hours and delayed production hotfixes.
The payments-api Java container restarts every 2-4 hours in production. Kubernetes events show OOMKilled with exit code 137. Application logs stop abruptly with no graceful shutdown messages. The JVM never logs an OutOfMemoryError because the kernel kills the process with SIGKILL, typically when the container exceeds its memory limit, before the JVM can detect the condition.
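The exit code itself encodes the cause. By convention, a code above 128 means the process died from signal number (code - 128), so 137 maps to signal 9 (SIGKILL), which a process cannot catch or log. A minimal Python sketch of that arithmetic follows; `describe_exit_code` is a hypothetical helper name, not part of any Kubernetes tooling, and it assumes a POSIX host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code (hypothetical helper, POSIX signal numbers assumed)."""
    if code == 0:
        return "exited normally"
    if code > 128:
        try:
            # Codes above 128 conventionally mean "terminated by signal (code - 128)".
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            # (code - 128) is not a known signal number on this platform.
            pass
    return f"exited with error status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

Because SIGKILL gives the process no chance to run shutdown hooks, the absence of both graceful shutdown messages and an OutOfMemoryError is consistent with this reading.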
CI pipeline Docker build steps take 10-15 minutes when they previously completed in under 2 minutes. The build log shows 'Sending build context to Docker daemon 4.2GB' at the start. Developer local builds are equally slow. The actual build layers complete quickly but the context transfer dominates total build time.
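To find what is inflating a build context like the 4.2GB one above, it can help to rank the top-level entries of the context directory by size. The following Python sketch does that under simple assumptions: `context_sizes` is a hypothetical helper, it measures the raw directory, and it does not read any existing .dockerignore file.

```python
import os


def context_sizes(root: str) -> list[tuple[str, int]]:
    """Return (top-level entry name, total bytes) pairs under root, largest first."""
    totals = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            size = 0
            # Sum every regular file below this top-level directory.
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(dirpath, name)
                    if not os.path.islink(path):
                        size += os.path.getsize(path)
        elif entry.is_file(follow_symlinks=False):
            size = entry.stat(follow_symlinks=False).st_size
        else:
            continue  # skip symlinks and special files
        totals[entry.name] = size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Print the ten largest entries of the current directory in megabytes.
    for name, size in context_sizes(".")[:10]:
        print(f"{size / 1_000_000:10.1f} MB  {name}")
```

Entries that dominate the list and are not needed by the Dockerfile are candidates for a .dockerignore file.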
CI agents begin failing with 'no space left on device' errors during docker build or docker pull steps. The failures are intermittent, affecting different pipelines at random times. Agent disk usage monitoring shows /var/lib/docker consuming 95%+ of available disk space. Restarting the CI agent temporarily resolves the issue but it recurs within 24-48 hours.
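Since a restart only resets the disk temporarily, one option is to check usage before each job and trigger cleanup early. The Python sketch below shows only the check; the function names and the 85% threshold are assumptions chosen for illustration, and the cleanup itself (for example `docker system prune`) is left to the caller.

```python
import shutil


def usage_fraction(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem that holds path."""
    total, used, _free = shutil.disk_usage(path)
    return used / total


def needs_cleanup(path: str, threshold: float = 0.85) -> bool:
    """True when disk usage has reached the threshold (0.85 is an assumed default)."""
    return usage_fraction(path) >= threshold


if __name__ == "__main__":
    # /var/lib/docker is the path named in the monitoring data above.
    print(needs_cleanup("/var/lib/docker"))
```

Running the check ahead of the 95% mark leaves headroom for the image pulls and build layers of the job that is about to start.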
The checkout-worker container cannot connect to the payments-api container by service name in a docker-compose environment. curl requests to http://payments-api:8080/health return 'Could not resolve host: payments-api'. The same containers work fine when using IP addresses directly. The issue affects local development and integration testing environments but not production Kubernetes.
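Because IP addresses work, the symptom points at name resolution rather than connectivity. A small Python check, run from inside the checkout-worker container, can confirm that split; `can_resolve` is a hypothetical helper, and the injectable `resolver` parameter exists only so the function can be exercised without a network.

```python
import socket


def can_resolve(host: str, resolver=socket.getaddrinfo) -> bool:
    """Return True if the host name resolves to at least one address."""
    try:
        resolver(host, None)
        return True
    except socket.gaierror:
        # Name lookup failed; this is a DNS problem, not a connection problem.
        return False


if __name__ == "__main__":
    print(can_resolve("payments-api"))
```

If this returns False, one thing worth checking is which network the containers are attached to: Docker resolves container names on user-defined networks, while containers on the default bridge network can reach each other only by IP address.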
A security audit revealed that multiple CI pipeline containers had read-write access to the Docker daemon through a mounted /var/run/docker.sock. A compromised container in a third-party dependency scan pipeline was able to spawn privileged containers on the host, access the host filesystem, read environment variables from other containers (including database credentials), and exfiltrate secrets to an external endpoint.
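An audit like this can be partly automated by scanning pipeline or compose definitions for the socket mount. The Python sketch below assumes the definitions have already been parsed into a mapping of service name to settings with a compose-style `volumes` list; `services_mounting_socket` and the sample service names are hypothetical.

```python
DOCKER_SOCKET = "/var/run/docker.sock"


def services_mounting_socket(services: dict) -> list[str]:
    """Return the sorted names of services whose volumes mount the Docker socket."""
    flagged = []
    for name, spec in services.items():
        for volume in spec.get("volumes", []):
            # Volumes may be short-form strings ("src:dst[:mode]") or long-form mappings.
            if isinstance(volume, str):
                source = volume.split(":", 1)[0]
            else:
                source = volume.get("source", "")
            if source == DOCKER_SOCKET:
                # Flag regardless of mount mode; the mode suffix is not inspected here.
                flagged.append(name)
                break
    return sorted(flagged)


if __name__ == "__main__":
    sample = {
        "dependency-scan": {"volumes": ["/var/run/docker.sock:/var/run/docker.sock"]},
        "unit-tests": {"volumes": ["./src:/app/src"]},
    }
    print(services_mounting_socket(sample))
```

Any service the scan reports has the same level of control over the host's Docker daemon that the compromised container used.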
Every CI pipeline run for the payments-api service rebuilds all Docker layers from scratch, taking 8-12 minutes per build instead of the expected 1-2 minutes with cache hits. The Dockerfile uses multi-stage builds and the layer ordering appears correct, but the cache is never reused between CI runs. Local developer builds do use the cache and complete in under 2 minutes.
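One plausible explanation, stated here as an assumption rather than a confirmed diagnosis, is that each CI run starts on an agent with an empty local layer cache. In that case a possible approach is to store the cache in a registry with the buildx `--cache-to` and `--cache-from` options. The Python sketch below only assembles such a command line; `buildx_command` and both image references are hypothetical.

```python
def buildx_command(image: str, cache_ref: str, context: str = ".") -> list[str]:
    """Assemble a docker buildx invocation that imports and exports a registry cache."""
    return [
        "docker", "buildx", "build",
        # Import layers cached by earlier runs, if the cache image exists.
        "--cache-from", f"type=registry,ref={cache_ref}",
        # Export the cache again; mode=max also includes intermediate build stages.
        "--cache-to", f"type=registry,ref={cache_ref},mode=max",
        "--tag", image,
        "--push",
        context,
    ]


if __name__ == "__main__":
    # Both references are placeholders for illustration.
    print(" ".join(buildx_command(
        "registry.example.com/payments-api:latest",
        "registry.example.com/payments-api:buildcache",
    )))
```

Whether the registry cache exporter is available can depend on the builder driver in use, so this is worth verifying against the CI environment before adopting it.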
A large runtime image that carries build tools, test fixtures, and package caches expands the CVE surface and slows deployment. When dependency installation is placed after copying all source code, every source edit invalidates the most expensive build layers, which is most painful during incident hotfixes. Teams that leave image size and layer order unaddressed typically discover the problem during a production incident, when fixing it reactively can cost 10-100x more than preventing it and leaves them firefighting instead of building. The risk grows at scale: when multiple teams depend on the same infrastructure, one team's shortcut becomes another team's outage. The pattern is predictable: the build works fine in development, survives staging, and fails in production under real traffic and real failure conditions.
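The layer-ordering effect can be illustrated with a toy model of the build cache: once one step's inputs change, that step and every later step rebuild. The Python sketch below is a simplified model, not Docker's actual cache implementation, and the step labels and input names are invented for illustration.

```python
def rebuilt_steps(steps, changed):
    """Return labels of steps that rebuild, given ordered (label, inputs) pairs.

    A step is rebuilt if any of its inputs changed or if any earlier step was rebuilt.
    """
    rebuilt = []
    invalidated = False
    for label, inputs in steps:
        if inputs & changed:
            invalidated = True
        if invalidated:
            rebuilt.append(label)
    return rebuilt


# Source copied before dependency installation.
source_first = [
    ("COPY . .", {"src", "manifest"}),
    ("install dependencies", set()),
]
# Dependency manifest copied and installed before the rest of the source.
deps_first = [
    ("COPY manifest", {"manifest"}),
    ("install dependencies", set()),
    ("COPY . .", {"src", "manifest"}),
]

if __name__ == "__main__":
    print(rebuilt_steps(source_first, {"src"}))  # ['COPY . .', 'install dependencies']
    print(rebuilt_steps(deps_first, {"src"}))    # ['COPY . .']
```

With the manifest copied first, a source-only edit leaves the dependency installation layer cached, which is the saving that matters during a hotfix.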