Inspiring Tech Leaders

When the Internet Breaks – Cloudflare’s 2025 Outage

Dave Roberts Season 5 Episode 34

The recent global internet outage on November 18th, 2025, was a reminder of the fragility of our digital world. When a single configuration error at Cloudflare took down major services like OpenAI, X, Spotify, and thousands of others, it exposed a critical vulnerability in modern infrastructure.

In this episode of the Inspiring Tech Leaders podcast, I look at the root cause of the Cloudflare incident that cascaded into a global catastrophe.

This episode is a must-listen for anyone concerned with digital supply chain risk and systemic failure. I discuss: 

💡 The Concentration Risk - Why a failure in one core provider can cause a systemic shock across the global economy, following similar recent incidents at AWS and Microsoft Azure.

💡 Building for Failure – The critical importance of the Principle of Graceful Degradation.

💡 Mandating Out-of-Band Access - Why your internal tools must be independent of the infrastructure they manage to ensure you can fix problems when the main network is down.

💡 Auditing Automation - The need for rigorous pre-deployment testing and blast radius analysis for all automated configuration changes.

The question is no longer if another failure will happen, but whether your organisation will be prepared when it does.

Available on: Apple Podcasts | Spotify | YouTube | All major podcast platforms


Start building your thought leadership portfolio today with INSPO.  Wherever you are in your professional journey, whether you're just starting out or well established, you have knowledge, experience, and perspectives worth sharing. Showcase your thinking, connect through ideas, and make your voice part of something bigger at INSPO - https://www.inspo.expert/



I’m truly honoured that the Inspiring Tech Leaders podcast is now reaching listeners in over 85 countries and 1,250+ cities worldwide. Thank you for your continued support! If you’ve enjoyed the podcast, please leave a review and subscribe to ensure you're notified about future episodes. For further information visit - https://priceroberts.com

Welcome to the Inspiring Tech Leaders podcast, with me Dave Roberts.  This is the podcast that talks with tech leaders from across the industry, exploring their insights, sharing their experiences, and offering valuable advice to technology professionals.  The podcast also explores technology innovations and the evolving tech landscape, providing listeners with actionable guidance and inspiration.

In today’s podcast I’m discussing yet another big Internet outage. Just a few days ago, on November 18th, 2025, the internet broke again! OpenAI’s ChatGPT crashed. X went dark. Spotify, DoorDash, Discord, and Patreon, along with thousands of websites and apps, experienced outages. For many, it felt like half the internet had suddenly evaporated, and all of it pointed back to a single company: Cloudflare.

At around 11:20 UTC on 18th November, Cloudflare, a company most consumers never interact with directly but rely on constantly, began experiencing a severe global outage. Because Cloudflare sits at crucial junction points of the internet, providing security, DNS, performance optimisation, and traffic routing, the incident had widespread ripple effects. Users across the world began encountering server errors, stalled loading screens, and sudden connection failures. Early speculation suggested everything from a massive DDoS attack to a state-sponsored cyber incident, but Cloudflare moved quickly to clarify that this was not an attack. The root cause was internal, and frankly, it was a subtle, almost mundane failure that cascaded into a global catastrophe.

Cloudflare’s investigation revealed that one of their automated bot-classification systems had begun generating a corrupted configuration file. This feature file is rebuilt every five minutes and is used across Cloudflare’s global network to determine how traffic is handled. The core problem was a permission change in their ClickHouse database. The change, intended to improve security and reliability, inadvertently caused the query that generates the feature file to return a large number of duplicate entries. The result? The configuration file, which was supposed to stay within a fixed size, began to double, then triple, in size. Many of Cloudflare’s servers had a built-in hard limit on how large that file could be, a memory limit of 200 features, to be precise. When the servers tried to load this unexpectedly oversized file, they crashed. And because the system regenerates the file every five minutes, servers would boot up, attempt to load the corrupted file, crash again, and repeat the cycle. This created a vicious, self-inflicted denial-of-service attack on Cloudflare’s own infrastructure.
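To make that failure mode concrete, here is a minimal Python sketch. It is purely illustrative, every name and number in it is hypothetical rather than taken from Cloudflare’s systems, and it simply contrasts a loader that treats an oversized file as fatal with one that falls back to the last known-good configuration.

```python
# Illustrative sketch only, not Cloudflare's actual code.
# Contrast a loader that crashes on an oversized, machine-generated config file
# with one that rejects the bad file and keeps the last known-good version.

MAX_FEATURES = 200  # hard limit, echoing the cap described in the episode


def load_features_fatal(path: str) -> list[str]:
    """Treats an oversized feature file as a fatal error."""
    with open(path) as f:
        features = [line.strip() for line in f if line.strip()]
    if len(features) > MAX_FEATURES:
        # An unhandled error here takes the whole process down, and because the
        # file is regenerated every five minutes, the crash repeats after restart.
        raise RuntimeError(f"feature file too large: {len(features)} > {MAX_FEATURES}")
    return features


def load_features_graceful(path: str, last_good: list[str]) -> list[str]:
    """Rejects a bad file but keeps serving traffic with the previous config."""
    try:
        with open(path) as f:
            features = [line.strip() for line in f if line.strip()]
        if len(features) > MAX_FEATURES:
            raise ValueError(f"feature file too large: {len(features)} > {MAX_FEATURES}")
        return features
    except Exception as err:
        # Degrade gracefully: log, alert, and keep the last known-good configuration.
        print(f"Rejected new feature file ({err}); keeping last known-good version")
        return last_good
```

The second variant is the behaviour you want from anything that consumes machine-generated configuration: refuse the bad input, keep serving with what you already had, and raise an alarm for a human.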

Engineers eventually halted the corrupted file generation, pushed out a clean version, and began the painstaking process of manually rebooting core systems across their global network. By around 14:30 UTC, much of Cloudflare’s traffic was restored, but the disruption continued to affect downstream services for hours.

This incident matters not just because of the disruption it caused, but because of what it reveals about the modern internet. And the most sobering realisation is that this is not a new story. We have seen this movie before, and the script is always the same: a small, internal configuration change that leads to a massive, global outage.

In the span of just a few weeks, we've seen both AWS and Microsoft Azure suffer massive global outages too. The AWS disruption, centred in the critical US-EAST-1 region, was traced back to a bug in the automated DNS management for DynamoDB, which prevented services from being found and led to a cascading failure across thousands of dependent applications. Similarly, the Azure outage was caused by a simple yet catastrophic configuration error in the Azure Front Door service, one that a software defect allowed to bypass the usual safety checks. These back-to-back failures serve as a stark reminder of the concentration risk inherent in modern digital infrastructure, where a failure in one core provider can cause a systemic shock across the global economy.

Cloudflare isn’t just a content delivery network, though; it’s a critical piece of the global web. Its tools protect and accelerate millions of websites. When something that centralised fails, the impact is enormous. The consolidation of internet infrastructure into a handful of platforms, such as Cloudflare, AWS, Google Cloud, and Microsoft Azure, means that a single point of failure can now affect a significant percentage of global internet traffic. Furthermore, the outage wasn’t triggered by a massive cyberattack or a catastrophic hardware fault; it was another configuration error that cascaded through an automated system. In a world increasingly powered by AI workloads, real-time apps, and edge computing, we are relying on complex, automated systems to manage other complex, automated systems. The Cloudflare incident is a reminder that automation can accelerate failure just as quickly as it accelerates success. For many organisations affected, the most concerning part wasn’t the outage itself but how little they realised they depended on Cloudflare until something went wrong. It raised tough questions for leaders about their digital supply chains, dependency risks, and the resilience of the systems they rely on daily. The outage demonstrated that while centralisation offers speed and convenience, it also creates systemic vulnerability.

The lessons from this event are clear, and they must become board-level concerns, not just engineering tasks. The first lesson is to Build for Failure, Not Perfection, also known as the Principle of Graceful Degradation. Outages will happen. The goal is not to prevent all failures, but to ensure that when they do occur, services degrade gracefully instead of collapsing entirely. Tech leaders must invest in architectural patterns that isolate failures. This means moving away from monolithic systems and towards microservices, ensuring that a failure in the bot management system, for example, does not take down the entire DNS resolution service. An actionable step here is to implement circuit breakers and bulkheads in your architecture. A circuit breaker can stop a failing service from being called repeatedly, giving it time to recover. A bulkhead partitions your system so that a failure in one area doesn't sink the entire ship.
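To make the circuit breaker idea a little more tangible, here is a minimal, hypothetical Python sketch; it is not a production library, and the thresholds are arbitrary.

```python
import time


class CircuitBreaker:
    """Minimal illustrative circuit breaker: fail fast after repeated failures."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before the breaker opens
        self.reset_timeout = reset_timeout          # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hammering a struggling dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result
```

Wrap calls to a flaky dependency in something like breaker.call(fetch_bot_score, request), where fetch_bot_score is just a stand-in name, and the rest of the system keeps responding without that dependency, which is exactly what graceful degradation looks like in practice.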

The second lesson is to Diversify and Decentralise Dependencies. Redundancy in CDN, DNS, and routing is no longer optional; it really is essential. Relying on a single provider, no matter how reliable, is a single point of failure. An actionable step is to implement a multi-CDN strategy and use multiple DNS providers. For critical applications, explore multi-cloud or hybrid-cloud architectures. This is expensive, yes, but the cost of a multi-hour global outage is far greater than the cost of redundancy. The Cloudflare outage showed that even if you have a secondary provider, the sheer volume of traffic that suddenly shifts to it when the primary fails can overwhelm it. True resilience requires active-active redundancy, where traffic is constantly split between providers.
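As a rough illustration of what active-active splitting means, here is a hypothetical Python sketch. The provider names, endpoints, and weights are invented, and in practice this logic normally lives in your DNS or load-balancing layer rather than in application code.

```python
import random

# Hypothetical providers for illustration; both carry live traffic all the time.
PROVIDERS = [
    {"name": "cdn-provider-a", "endpoint": "https://a.example-cdn.net", "weight": 0.5},
    {"name": "cdn-provider-b", "endpoint": "https://b.example-cdn.net", "weight": 0.5},
]


def healthy(provider: dict) -> bool:
    # Placeholder health check; a real one would probe the endpoint and
    # track error rates over a sliding window.
    return True


def pick_endpoint() -> str:
    """Weighted choice among healthy providers (active-active, not failover)."""
    candidates = [p for p in PROVIDERS if healthy(p)]
    if not candidates:
        raise RuntimeError("no healthy providers available")
    total = sum(p["weight"] for p in candidates)
    roll = random.uniform(0, total)
    for p in candidates:
        roll -= p["weight"]
        if roll <= 0:
            return p["endpoint"]
    return candidates[-1]["endpoint"]
```

Because both providers are handling real traffic every day, a failure of one never dumps an unfamiliar flood of requests onto a cold standby.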

In the Cloudflare incident, a critical failure was the inability of engineers to access the tools they needed to fix the problem. Their internal systems were dependent on the very infrastructure that had failed. Therefore, an actionable step is to mandate out-of-band access for critical infrastructure. This means having a completely separate, isolated network path, perhaps a physical console or a dedicated, non-internet-dependent VPN, that engineers can use to access core routers and configuration systems when the main network is down. This is the digital equivalent of having a backup generator for your emergency room.

We must also Audit the Automation. The root cause was a corrupted configuration file generated by an automated system. This points to a failure in configuration management and testing. An actionable step is to implement rigorous pre-deployment testing for all automated configuration changes. This should include a blast radius analysis to determine the maximum impact of a change before it is deployed globally. Cloudflare’s file size limit was a good idea, but the failure was in the automated system that generated the file. All automated systems must have a kill switch or a roll-back mechanism that is independent of the system itself.
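As a sketch of what such a pre-deployment gate might look like, here is an illustrative Python example. The thresholds and checks are hypothetical, and a real pipeline would also canary the change to a small slice of servers and watch error rates before any global rollout.

```python
# Illustrative pre-deployment gate for an automated configuration change.
MAX_FEATURES = 200       # hard cap on entries, mirroring the limit discussed above
MAX_CHANGE_RATIO = 0.10  # block changes that alter more than 10% of entries at once


def validate_config(new: list[str], current: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the change may proceed."""
    problems = []
    if len(new) > MAX_FEATURES:
        problems.append(f"too many entries: {len(new)} > {MAX_FEATURES}")
    if len(set(new)) != len(new):
        problems.append("duplicate entries detected")
    changed = len(set(new) ^ set(current))
    if current and changed / len(current) > MAX_CHANGE_RATIO:
        problems.append(f"blast radius too large: {changed} entries changed")
    return problems


def deploy(new: list[str], current: list[str]) -> list[str]:
    """Kill switch: refuse a suspicious rollout and keep the current config."""
    problems = validate_config(new, current)
    if problems:
        print("Deployment blocked:", "; ".join(problems))
        return current
    return new
```

The point is not these particular checks; it is that the gate and the rollback path live outside the system that generates the file, so a bad generation run cannot push itself into production.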

This Cloudflare outage will likely be remembered not for how long it lasted, but for what it revealed. It showed that even the strongest parts of the internet can falter, that tiny misconfigurations can trigger global incidents, and that resilience must be a board-level concern. Cloudflare will improve, as they always do. But this won’t be the last major outage. The question is not whether another failure will happen, but whether organisations will be prepared when it does. Tech leaders, your job is no longer just to build systems that work, but to build systems that fail safely.

Well, that is all for today. Thanks for tuning in to the Inspiring Tech Leaders podcast. If you enjoyed this episode, don’t forget to subscribe, leave a review, and share it with your network.  You can find more insights, show notes, and resources at www.inspiringtechleaders.com

Head over to the social media channels; you can find Inspiring Tech Leaders on X, Instagram, INSPO and TikTok.  Let me know your thoughts on the mega-scale outages we’ve seen happen consistently over the last few months.

Thanks again for listening, and until next time, stay curious, stay connected, and keep pushing the boundaries of what is possible in tech.