9 reviewed items across 4 content types.
A security analytics team migrated 400 million historical log documents into a new Elastic cluster while production search stayed online. The cluster ran 18 data nodes across hot and warm tiers.
A log backfill job started at 1:00 AM. Indexing throughput dropped by 65%, search latency rose, and the hot tier showed sustained merge pressure.
A logging cluster with two master-eligible nodes worked until a maintenance reboot split coordination. Search was intermittently unavailable even though most data nodes were alive. Elasticsearch resilience depends on maintaining a safe majority for cluster-state decisions and enough shard copies to keep serving traffic during recovery. Without following this practice, teams typically discover the problem during a production incident. The cost of fixing issues reactively is 10-100x higher than preventing them proactively. This becomes especially dangerous at scale when multiple teams depend on the same infrastructure, because one team's shortcut becomes another team's outage. Organizations that skip this practice often find themselves in a cycle of firefighting instead of building. The pattern is predictable: it works fine in development, survives staging, and fails spectacularly in production under real traffic and real failure conditions.