Cloudflare Outage Hits Internet Backbone: What Happened and What Comes Next
- Carlos Martinez
On November 18, 2025, around 11:28 UTC, major services like X, ChatGPT, Spotify, and Canva stopped responding. This wasn’t a single app failing - Cloudflare’s network was having issues, and anything relying on it was affected.
The outage came just weeks after the AWS incident on October 20, 2025. It’s another example of how much of the internet depends on a small set of providers.
Cloudflare traced the problem to a permission change in their ClickHouse cluster. The change created an oversized configuration file for their bot-management system. When it propagated globally, proxy processes failed. The outage lasted just over five hours and affected thousands of services.
Let’s take a look at what caused the outage and why it affected so many services.
What Caused the Cloudflare Outage on November 18, 2025?

The outage started with a small change to database access controls in Cloudflare's ClickHouse cluster. While the change seemed routine, it set off a chain of events that quickly affected the entire network.
The Initial Trigger: Database Permissions and Feature File Growth
At 11:05 UTC, Cloudflare deployed a change to improve security and reliability of their distributed database queries. The change made implicit table access explicit, allowing users to see metadata for all tables they could access, including underlying data storage tables in the r0 database.
This change affected the query that generates the Bot Management feature file. The query didn't filter by database name, so it began returning duplicate rows for each feature, inflating the file from around 60 features to more than 200.
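To make that concrete, here's a minimal Python sketch of how a metadata query that ignores the database name returns one row per database for every feature. It is illustrative only: the feature names and the default/r0 split are stand-ins based on Cloudflare's description, not their actual schema or code, and for simplicity the count merely doubles here rather than crossing the 200-feature mark.

```python
# Illustrative only -- not Cloudflare's code. Simulated rows from a
# system.columns-style metadata query. After the permissions change, each
# feature shows up once per database because nothing filters on database name.
rows = (
    [{"database": "default", "feature": f"feature_{i}"} for i in range(60)]
    + [{"database": "r0", "feature": f"feature_{i}"} for i in range(60)]
)

def build_feature_file(rows, database_filter=None):
    """Collect feature names, optionally restricted to a single database."""
    features = []
    for row in rows:
        if database_filter and row["database"] != database_filter:
            continue
        features.append(row["feature"])
    return features

print(len(build_feature_file(rows)))                             # 120: duplicates inflate the file
print(len(build_feature_file(rows, database_filter="default")))  # 60: filtered, as intended
```

The query-side fix is equally small: constrain the metadata lookup to the intended database, or deduplicate by feature name before writing the file.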
The Bot Management system uses machine learning to score every request passing through Cloudflare's network. It relies on a feature file that refreshes every few minutes, allowing rapid response to new bot attack patterns. This file gets propagated to every server in Cloudflare's global network.
Why the Distributed Proxy Network Couldn't Handle It
Cloudflare's proxy system preallocates memory for performance optimization. The Bot Management module had a hard limit of 200 features, well above the typical 60 features in use. When servers received the oversized file with more than 200 features, the system panicked and crashed.
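The shape of that failure is easy to model. Below is a hedged Python analogue (the real proxy is not written in Python, and these names are invented): a loader sized for at most 200 features treats an oversized file as a fatal error instead of falling back to the last known-good configuration.

```python
# Illustrative sketch only -- hypothetical names, not Cloudflare's implementation.
FEATURE_LIMIT = 200  # memory is preallocated for at most this many features

def load_feature_file(features: list[str]) -> list[str]:
    """Load bot-management features into a fixed-size buffer."""
    if len(features) > FEATURE_LIMIT:
        # The analogue of the proxy's panic: an unhandled error that takes the
        # worker down instead of keeping the previous known-good file in place.
        raise RuntimeError(
            f"feature file has {len(features)} entries, limit is {FEATURE_LIMIT}"
        )
    return features

load_feature_file([f"f{i}" for i in range(60)])    # typical file: fine
# load_feature_file([f"f{i}" for i in range(230)]) # oversized file: raises, i.e. the outage
```

A more forgiving design validates the incoming file and keeps serving the previous known-good version when validation fails, which mirrors what the manual recovery ultimately did by pushing a known good file.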
The real problem was the failure mode. Every proxy server across Cloudflare's network received the same corrupted file at roughly the same time. Instead of isolated failures that could be routed around, every server running the affected software version failed simultaneously.
What made diagnosis harder was the fluctuating behavior. The feature file regenerated every five minutes from the ClickHouse cluster. Because the cluster was being gradually updated, sometimes it produced good files, sometimes bad ones. Services would recover for a few minutes, then fail again. This pattern initially suggested a DDoS attack rather than an internal configuration issue.
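A small simulation shows why the symptoms looked so erratic. Assuming, purely for illustration, that half the ClickHouse nodes had the new permissions at any given moment (the real rollout proportions shifted over time), each five-minute regeneration was effectively a coin flip between a healthy file and a poisoned one:

```python
# Toy model of the fluctuating failures -- node split and feature totals are illustrative.
import random

UPDATED_FRACTION = 0.5  # share of cluster nodes already running the new permissions

def regenerate_feature_count() -> int:
    """Each refresh may hit an updated node (duplicate rows) or an old one."""
    hit_updated_node = random.random() < UPDATED_FRACTION
    return 230 if hit_updated_node else 60  # anything over 200 crashes the proxies

for minute in range(0, 30, 5):
    count = regenerate_feature_count()
    status = "proxies crash" if count > 200 else "proxies healthy"
    print(f"t+{minute:02d} min: {count} features -> {status}")
```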
Cloudflare's Response Timeline
Cloudflare detected the first errors at 11:28 UTC, 23 minutes after the database change deployed. Their automated monitoring flagged elevated error rates across customer traffic. The team initially investigated Workers KV service degradation, which appeared to be the primary symptom.
By 13:05 UTC, they implemented bypasses for Workers KV and Cloudflare Access, routing them to an older version of the proxy. This reduced impact for dependent services but didn't solve the core problem.
At 13:37 UTC, the team confirmed the Bot Management configuration file was the trigger and began working on restoration. They stopped automatic deployment of new feature files at 14:24 UTC and deployed a known good version at 14:30 UTC. Most services began recovering at that point.
Full resolution took until 17:06 UTC as teams restarted remaining services that had entered bad states. The total duration from initial impact to complete recovery was five hours and 38 minutes.
Which Major Companies and Apps Were Affected?
X, ChatGPT, Spotify, Canva, and League of Legends were all confirmed to have been impacted by the outage. The disruption wasn’t limited to these services - it also affected platforms like Dropbox, Shopify, and Coinbase.
Consumer Services
X experienced widespread access issues. Requests couldn’t reach origin servers because Cloudflare’s proxy layer was affected, even though the backend systems were functioning normally.
ChatGPT returned HTTP 5xx errors. OpenAI’s systems were fine, but traffic routed through Cloudflare was disrupted.
Spotify streams were interrupted for many users, and Canva’s platform was temporarily unreachable, affecting access to projects and collaboration features. League of Legends players encountered connection errors that prevented them from joining matches.
All of these services rely on Cloudflare for DNS, DDoS protection, CDN, and edge computing. When that layer encounters problems, services can become inaccessible regardless of backend health.
Critical Infrastructure: Transit Systems, Retail, Financial Services
Transit: The MTA had issues with its website and trip planning tools during rush hour, and NJ Transit reported similar problems. Some airline operations and hospitals were also affected.
Retail and E-commerce: Shopify and Dropbox experienced access and checkout problems. The timing near Black Friday disrupted peak shopping hours and paused some digital ad campaigns. Even services like McDonald’s ordering kiosks were affected.
Financial Services: Banking and fintech apps had intermittent access issues and payment flows were disrupted, though core banking systems stayed online. Companies like Moody’s also reported problems.
In all cases, the backend systems were operational, but users couldn’t reach them because the infrastructure layer failed.
The Business Impact
The business impact was significant and widespread: lost revenue, damaged customer trust and brand reputation, and operational disruption across sectors ranging from social media to financial services and critical infrastructure.
Duration and Service Level Breaches
The outage lasted about five and a half hours. For services targeting 99.9% uptime, which allows roughly 8.8 hours of downtime per year, this single incident consumed well over half the annual budget. Services promising 99.99% uptime, with a budget of about 53 minutes per year, exceeded their yearly limit several times over.
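The budget math is easy to check:

```python
# Back-of-the-envelope downtime budgets.
minutes_per_year = 365 * 24 * 60  # 525,600

for sla in (0.999, 0.9999):
    budget_min = minutes_per_year * (1 - sla)
    print(f"{sla:.2%} uptime -> {budget_min:.0f} min (~{budget_min / 60:.1f} h) per year")

# 99.90% uptime -> 526 min (~8.8 h) per year
# 99.99% uptime -> 53 min (~0.9 h) per year
# A 5.5-hour incident eats roughly 60% of a 99.9% budget and exceeds a 99.99% budget several times over.
```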
Recovery didn’t happen all at once. Some regions started functioning again within a couple of hours, while others remained affected for the full duration. Even after the outage was officially resolved, returning traffic added load and required time to stabilize fully.
Financial and Reputational Costs
Every minute of downtime meant users couldn’t access services. E-commerce platforms couldn’t complete transactions, and financial services experienced temporary disruptions. Teams focused on restoring services, and customer support quickly became overwhelmed as users reported issues.
The outage occurred just before Black Friday, and many retailers and marketing teams paused high-spending ad campaigns. This affected immediate revenue and disrupted the algorithms that optimize ad delivery, which could influence performance through the holiday period.
Even brief interruptions can affect user trust, as customers may hesitate to return after service problems.
Why This Wasn't a Cyberattack
When the outage began, some initially speculated it could be a coordinated attack due to the broad impact across multiple services.
The root cause, however, was an internal configuration change. Unlike a malicious attack, which usually targets specific services or attempts to access data, this failure affected all services in the same way because every server received the same misconfigured file.
Cloudflare confirmed that a database permissions change unintentionally doubled the size of a critical configuration file. Servers couldn’t process the oversized file and crashed as a result. Internal errors like this can have widespread effects because they bypass protections designed for external threats.
Pattern Recognition: Previous Major Outages
Similar failures have happened before. On June 12, 2025, a Google Cloud software update caused a null pointer error that disrupted authentication across many apps for over three hours.
On October 20, 2025, AWS had a DynamoDB API update error affecting 113 services including EC2, Lambda, and SQS.
In March 2025, Microsoft rolled out a code change that caused multi-hour downtime for Microsoft 365 services, including Outlook and Teams, as well as parts of Azure. Fastly in June 2021 and AWS S3 in February 2017 also saw widespread disruptions caused by configuration and operator errors.
| Provider | Date | Duration | Root Cause | Impact |
| --- | --- | --- | --- | --- |
| AWS S3 | Feb 2017 | 4 hours | Command typo | Netflix, Slack, Trello offline |
| Fastly | June 2021 | 2 hours | Config error | CNN, Reddit, Gov.uk unreachable |
| Google Cloud | June 12, 2025 | 3+ hours | Software update bug | Authentication failures, many apps inaccessible |
| Microsoft | March 2025 | Multi-hour | Code change | Microsoft 365, Outlook, Teams, Azure affected |
| AWS | Oct 2025 | ~4 hours | DynamoDB API update error | 113 services, including EC2, Lambda, SQS |
| Cloudflare | Nov 2025 | 5.5 hours | Database permissions change | X, ChatGPT, Spotify offline |
The Real Problem: Infrastructure Concentration
The November outage shows how much of the internet runs through just a few major providers.
A few companies now handle most of the cloud infrastructure. AWS holds roughly 32% of the cloud infrastructure market, Microsoft Azure about 23%, and Google Cloud about 11%. These three dominate the compute and storage layers.
At the edge and CDN level, concentration is even higher. Cloudflare alone processes over 46 million HTTP requests per second, and with Fastly, they handle a large portion of global web traffic.
Running global infrastructure requires huge investment, and scale brings efficiency and performance. These providers give capabilities that would be hard to replicate, including reliable performance and global reach.
Single Points of Failure at Scale
When many services depend on one provider, that provider can become a single point of failure. A company might run compute on AWS, CDN on Cloudflare, and DNS elsewhere. If Cloudflare fails, the service can stop responding even though other layers remain operational.
This pattern repeats across the internet. In November, social media platforms, AI tools, gaming services, public transit apps, and financial services all experienced disruptions because they relied on the same infrastructure layer.
Learning from History
Major outages have a consistent pattern. After the 2017 AWS incident, some companies restructured their systems to run across multiple regions or providers.
The takeaway isn’t speculation - if you rely on a single provider, any failure in their infrastructure can directly impact your services. Planning for redundancy and testing failover isn’t optional; it’s part of running reliable systems today.
Multi-Provider Architecture as Risk Management
Using multiple providers is no longer optional for critical systems. Relying on a single provider introduces a clear risk: any failure in their infrastructure can directly affect your services.
Benefits Beyond Resilience
Multi-provider setups maintain service availability when one provider has issues. They also provide negotiating leverage, reduce vendor lock-in, and improve geographic redundancy by spreading infrastructure across different locations.
This approach works best for stateless services - content delivery, APIs, static assets, and frontend apps. Stateful services are more complex, but patterns exist to handle them.
Implementing Automated Failover
Failover requires coordinated systems. Health checks continuously monitor provider performance, and routing automatically shifts traffic when failures occur; a minimal sketch of that loop follows the points below.
DNS-level failover redirects traffic if the primary provider is down, though propagation delays can occur.
Load balancers distribute traffic dynamically and stop routing to failed providers immediately.
Orchestration tools like Terraform and Kubernetes help manage infrastructure consistently across providers.
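As a minimal sketch of that health-check-and-route loop: the script below probes two hypothetical CDN health endpoints and prefers the primary while it still answers. In production the selection step would update DNS records or a load-balancer pool through the provider's API rather than just returning a name.

```python
"""Minimal health-check/failover sketch. Endpoint URLs are hypothetical."""
import urllib.request

PROVIDERS = [
    ("primary-cdn", "https://primary.example.com/healthz"),
    ("secondary-cdn", "https://secondary.example.com/healthz"),
]

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """A provider counts as healthy if its health endpoint returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def pick_provider() -> str:
    """Return the first healthy provider, preferring the primary."""
    for name, health_url in PROVIDERS:
        if is_healthy(health_url):
            return name
    raise RuntimeError("no healthy provider available")

if __name__ == "__main__":
    print("routing traffic via:", pick_provider())
```

The important property is that the decision is automatic and continuous; a failover path that is never exercised outside of an outage rarely works when it is finally needed.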
Multi-Provider Strategies
High-risk sectors like financial services rely on multi-cloud and multi-CDN frameworks because downtime is not acceptable. These setups often combine providers such as Cloudflare, Akamai, and AWS CloudFront to improve reliability.
Video streaming services use multi-CDN strategies to maintain performance, automatically or manually redirecting traffic to the best-performing CDN during disruptions.
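A hedged sketch of that "best-performing CDN" selection: probe a small test object on each CDN and steer traffic to the fastest healthy responder. The CDN names and probe URLs below are placeholders; commercial multi-CDN platforms do this continuously with real-user measurements rather than a single synthetic probe.

```python
"""Latency-based CDN selection sketch. CDN names and URLs are hypothetical."""
import time
import urllib.request

CDNS = {
    "cdn-a": "https://cdn-a.example.com/probe.bin",
    "cdn-b": "https://cdn-b.example.com/probe.bin",
}

def probe_latency(url: str, timeout: float = 2.0) -> float:
    """Return response time in seconds, or infinity if the probe fails."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return time.monotonic() - start
    except Exception:
        return float("inf")

def best_cdn() -> str:
    """Pick the CDN with the lowest measured latency."""
    latencies = {name: probe_latency(url) for name, url in CDNS.items()}
    return min(latencies, key=latencies.get)

print("steering traffic to:", best_cdn())
```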
Organizations with mature DevOps practices that invest in multi-DNS and multi-CDN architectures with automated failover are generally better positioned to handle outages. Platforms like IO River, Cedexis, and Cloudharmony help orchestrate traffic dynamically across providers, ensuring services remain accessible even when one provider experiences issues.
Your Next Move
Outages like the one at Cloudflare remind us that no single provider is fail-proof. The key is to design systems that can continue operating even if one layer fails.
Consider spreading critical workloads across multiple providers and setting up automated failover where possible. Focus on monitoring and testing your traffic routing so you can respond quickly when issues arise.
You can also connect with us to review your infrastructure and implement multi-provider strategies that improve resilience and reduce downtime.
Frequently Asked Questions
What caused the Cloudflare outage on November 18, 2025?
A database permissions change deployed at 11:05 UTC caused a query to return duplicate rows, inflating Cloudflare's Bot Management feature file well past its usual size. When the oversized file propagated to proxy servers globally, it exceeded the preallocated feature limit and caused widespread crashes.
How long did the Cloudflare outage last?
The outage lasted 5 hours and 38 minutes, from 11:28 UTC to 17:06 UTC. Service restoration happened in waves, with some regions recovering within two hours while others remained affected for the full duration.
Which companies were affected by the Cloudflare downtime?
Major affected services included X, ChatGPT, Spotify, Canva, and League of Legends. Public transit systems, e-commerce platforms, and banking interfaces also experienced disruptions. Any service depending on Cloudflare for CDN, DDoS protection, or DNS faced potential impact.
Was the Cloudflare outage a cyberattack?
No. Cloudflare confirmed the outage resulted from an internal database configuration change, not an external attack. The uniform failure pattern across all services indicated a shared configuration issue rather than targeted malicious activity.
How can companies protect themselves from future outages?
Implement multi-provider architecture for critical services. Use automated health checks and failover mechanisms to route traffic to healthy providers when problems occur. Start with stateless services like CDN and API endpoints where failover is straightforward, then extend to more complex systems as expertise develops.
What is a multi-provider strategy?
A multi-provider strategy distributes infrastructure across multiple vendors rather than relying on a single provider. For example, using both Cloudflare and Fastly for CDN services with automatic failover, or running compute workloads across AWS and Azure. This approach provides resilience when any single provider experiences issues while offering benefits like vendor leverage and reduced lock-in.