Cloudflare CEO Matthew Prince has explained that a global outage at the company was caused by an issue within its Bot Management system. This system uses an AI model to score incoming website requests, estimating whether each is likely to come from a bot by analyzing features held in a "feature file." A recent change to the query that generates this feature file caused the file to contain the same information many times over, making it excessively large.
The unexpected size triggered an error within the Bot Management system, so users attempting to access websites protected by Cloudflare received error codes instead. The outage began approximately 15 minutes after the feature file update was implemented.
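To make the failure mode concrete, the following is a minimal sketch of a guard that checks a newly generated feature list before it is loaded. It is not Cloudflare's code: the function names, the representation of the feature file as a list of strings, and the limit of 5 entries are all assumptions chosen for illustration.

```python
from typing import List

# Hypothetical cap on the number of features; the value is illustrative only.
MAX_FEATURES = 5


def validate(features: List[str]) -> List[str]:
    """Return a list of problems found in a feature list (empty if it looks sane)."""
    problems = []
    if len(features) != len(set(features)):
        problems.append("duplicate entries")
    if len(features) > MAX_FEATURES:
        problems.append("too many entries")
    return problems


def choose_feature_file(new: List[str], last_good: List[str]) -> List[str]:
    """Load the new feature list only if it passes validation.

    Otherwise keep the previously accepted list, so a bad update
    degrades to stale data rather than a hard error.
    """
    if not validate(new):
        return new
    return last_good


if __name__ == "__main__":
    last_good = ["ua_entropy", "req_rate"]      # hypothetical feature names
    bloated = ["ua_entropy", "req_rate"] * 4    # same rows repeated, as in the incident
    print(validate(bloated))
    print(choose_feature_file(bloated, last_good))
```

In this sketch a bloated file is rejected up front and the earlier list stays in use. Whether a real system should fall back silently, alert, or refuse to start is a design decision this example does not settle.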
Cloudflare initially suspected a cyberattack, possibly a DDoS attack, but later determined that no external malicious activity was involved. The issue was an internal error stemming from the oversized feature file.
Cloudflare engineers stopped the propagation of the oversized file and replaced it with an earlier version, restoring services. The company is implementing measures to prevent similar outages, such as keeping error reports from overwhelming its systems.
The outage was the worst the company has experienced since 2019.