Cloudflare post-mortem ties global outage to bot-management bug


Cloudflare has published a detailed post-mortem on the November 18, 2025 outage that took major platforms such as X and ChatGPT offline, saying a faulty Bot Management configuration file, produced after a database permissions change, caused its core proxy software to fail across its global network.

In a follow-up to the incident covered in our earlier report on the outage’s impact on major websites, Cloudflare said the disruption began at 11:20 UTC, when its systems started returning widespread HTTP 5xx errors due to an internal bug, not a cyberattack or any form of malicious activity. The company restored most traffic flows by 14:30 UTC and reported all systems back to normal at 17:06 UTC.

According to the technical account, the chain of failures started with a change to permissions in a ClickHouse database cluster at 11:05 UTC. That change altered how metadata queries behaved, and the query used to generate a “feature file” for Cloudflare’s Bot Management system began returning duplicate rows. As a result, the feature file unexpectedly doubled in size and was then propagated to every machine that handles traffic in Cloudflare’s network.
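As a rough illustration of that failure mode, the Rust sketch below (invented names and data, not Cloudflare’s actual pipeline) shows how a metadata query that suddenly returns each feature row twice doubles the generated file unless the generation step deduplicates, and how a simple dedup guard would keep the count stable.

```rust
use std::collections::BTreeSet;

// Hypothetical sketch (invented names, not Cloudflare's pipeline): a metadata
// query that starts returning each feature row twice silently doubles the
// generated file unless the generation step deduplicates.

#[derive(Clone)]
struct FeatureRow {
    name: String, // e.g. a feature/column name returned by the metadata query
}

/// Naive generation: one output entry per returned row.
fn build_feature_file(rows: &[FeatureRow]) -> Vec<String> {
    rows.iter().map(|r| r.name.clone()).collect()
}

/// Defensive variant: deduplicate by feature name before writing the file.
fn build_feature_file_dedup(rows: &[FeatureRow]) -> Vec<String> {
    rows.iter()
        .map(|r| r.name.clone())
        .collect::<BTreeSet<_>>()
        .into_iter()
        .collect()
}

fn main() {
    let original = vec![
        FeatureRow { name: "bot_score_model".into() },
        FeatureRow { name: "ua_entropy".into() },
    ];

    // After the permissions change, the same logical rows come back twice,
    // e.g. because the same table is now visible under two database names.
    let mut duplicated = original.clone();
    duplicated.extend(original.clone());

    println!("naive: {} features", build_feature_file(&duplicated).len()); // 4
    println!("dedup: {} features", build_feature_file_dedup(&duplicated).len()); // 2
}
```

The point of the dedup variant is simply that the doubling was a property of the query’s output, not of the features themselves, so a guard at generation time would have kept the file within its expected size.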

The affected feature file is read by software modules in Cloudflare’s core proxy layer, which use it to assign bot scores and apply security rules to HTTP requests. That software had a hard limit on the size and number of features it could safely load. When the new, oversized configuration arrived, the Bot Management module exceeded its limit and triggered a panic in the Rust code that powers the newer FL2 version of Cloudflare’s proxy engine, causing it to return HTTP 5xx errors for dependent traffic.
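The article does not reproduce the exact code path, but a minimal Rust sketch of the described failure mode, with invented names and an arbitrary limit, shows how treating an over-limit error as unreachable turns a malformed configuration into a process-wide panic rather than a rejected update.

```rust
// Hypothetical sketch of the failure mode described above: a module with a
// hard cap on the number of features it will load treats the over-limit error
// as impossible and panics instead of rejecting the oversized file.
// Names and the limit value are invented for illustration.

const MAX_FEATURES: usize = 200; // stand-in for the preallocated capacity

struct BotFeatures {
    names: Vec<String>,
}

impl BotFeatures {
    /// Fallible load that reports an over-limit file instead of crashing.
    fn load(feature_file: &[String]) -> Result<Self, String> {
        if feature_file.len() > MAX_FEATURES {
            return Err(format!(
                "feature file has {} entries, limit is {}",
                feature_file.len(),
                MAX_FEATURES
            ));
        }
        Ok(Self { names: feature_file.to_vec() })
    }
}

fn main() {
    // Simulate the oversized file produced by the duplicated query output.
    let oversized: Vec<String> = (0..2 * MAX_FEATURES)
        .map(|i| format!("feature_{i}"))
        .collect();

    // Treating the error path as impossible converts a bad configuration into
    // a crash, which a proxy surfaces to clients as HTTP 5xx responses.
    let features = BotFeatures::load(&oversized)
        .expect("feature count exceeds preallocated limit"); // panics here

    println!("loaded {} features", features.names.len()); // never reached
}
```

The design question this raises is whether a proxy should fail closed on a bad configuration push or keep serving with the last known-good file; in this incident the crash path meant every dependent request failed until a valid file was redeployed.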

Cloudflare said the first visible effect on users was a spike in 5xx error codes starting at 11:20 UTC, with the error rate then oscillating as new versions of the feature file were generated every five minutes. When the query happened to run on updated nodes in the ClickHouse cluster, it produced a bad file; when it ran on nodes that had not yet been updated, it produced a valid one. This led to alternating periods of partial recovery and renewed failure until all nodes were generating the faulty configuration and the system stabilized in a fully degraded state.
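To make that oscillation concrete, here is a toy simulation (made-up node counts and a deterministic stand-in for query routing, purely illustrative) of a staged roll-out where each five-minute cycle lands on one cluster node: while only some nodes carry the new permissions the output flips between good and bad files, and once all nodes are updated every cycle produces a bad one.

```rust
// Toy simulation of the flapping behavior: each five-minute cycle the feature
// file is regenerated by whichever cluster node serves the query. Updated
// nodes produce a bad file, not-yet-updated nodes a good one. Node counts and
// the routing rule are invented for illustration.

fn main() {
    let total_nodes = 8;

    for cycle in 0..10 {
        // The permissions roll-out progresses: one more node updated per cycle.
        let updated_nodes = (cycle + 1).min(total_nodes);

        // Deterministic stand-in for "which node happened to serve the query".
        let chosen_node = (cycle * 3) % total_nodes;
        let bad = chosen_node < updated_nodes;

        println!(
            "cycle {:2}: {}/{} nodes updated -> {} feature file",
            cycle,
            updated_nodes,
            total_nodes,
            if bad { "BAD " } else { "good" }
        );
    }
    // Once every node is updated, every cycle produces a bad file and the
    // network settles into the stable, fully degraded state described above.
}
```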

The impact extended beyond core content delivery and security. Cloudflare’s summary lists its CDN and security services, the Turnstile challenge system, Workers KV key-value store, the Access authentication product and the Cloudflare dashboard among the affected components. Customers saw internal error pages, Turnstile failed to load on login screens, Workers KV experienced elevated 5xx rates and Access users were unable to authenticate, though existing sessions continued to work. Latency across the CDN also increased as debugging and observability tools consumed additional CPU.

During the early stages of the incident, teams initially suspected a large-scale attack, in part because Cloudflare’s externally hosted status page, which does not run on the company’s own infrastructure, became unavailable around the same time. Internal incident chats cited concern that a high-volume attack might be targeting both production systems and the status page, before engineers traced the behavior back to the Bot Management feature file and the underlying database query change.

Mitigation progressed in several stages. At 13:05 UTC, Cloudflare implemented bypasses so that Workers KV and Access could fall back to a prior version of the core proxy, reducing error rates for those services. By 14:24 UTC, the team had stopped the creation and propagation of new Bot Management configuration files and tested a known-good version. At 14:30 UTC, the corrected file was deployed globally, restoring most services, while remaining systems were restarted and residual errors cleared over the following hours. Full recovery was recorded at 17:06 UTC.

The company called the incident its worst outage since 2019 in terms of impact on core traffic. While past incidents had affected the dashboard or specific new features, Cloudflare noted that it had not seen a failure in more than six years that prevented the majority of core traffic from transiting its network. In the post-mortem, Cloudflare’s leadership emphasized the disruption to customers and the wider internet, stating that any period during which its network cannot route traffic is unacceptable for an infrastructure provider of its scale.
