AWS Outage Hits Major Services: What Happened and What Comes Next
- Carlos Martinez
- 6 days ago
- 11 min read
On October 20, 2025, AWS experienced a major outage starting at 07:11 GMT. An error in a technical update to the DynamoDB API broke DNS configuration, preventing applications from resolving the service's server addresses. The failure cascaded across 113 AWS services, including EC2, Lambda, and SQS.
The outage lasted approximately three hours. By 10:11 GMT, Amazon reported all services had returned to normal operations, though backlogs remained.
Downdetector recorded over 11 million outage reports globally, with more than 3 million coming from the U.S. At the height of the incident, around 2,500 companies experienced disruptions, affecting platforms like Snapchat, Reddit, Venmo, Roblox, and major airlines. Even hours later, nearly 400 companies were still reporting issues.
This wasn't a cyberattack. Scythe CEO Bryson Bort confirmed it was an infrastructure failure, likely human error.
The bigger issue is architectural: if your entire stack runs on one cloud provider, a regional outage means complete downtime. AWS holds 30% of the global cloud market. When US-EAST-1 fails, that's a lot of broken services.
Let's look at what happened, why it matters, and what you can do about it.
What Caused the AWS Outage in October 2025?

The disruption began when AWS's US-EAST-1 region in Northern Virginia suffered a failure. A technical update to the DynamoDB API contained an error that affected DNS configuration inside AWS.
The Initial Trigger: DynamoDB and DNS Failures
According to Amazon's updates, the issue started after a technical update to the API of DynamoDB, a database service that powers thousands of apps and platforms. The update error affected the Domain Name System inside AWS, which directs traffic to the correct servers.
When DNS requests failed, services couldn’t reach DynamoDB’s API endpoints. Applications dependent on that data stopped working, and other AWS components began failing in sequence.
Within minutes, systems tied to EC2, Lambda, and other core AWS services were affected. The DNS failure caused a domino effect, including issues with Network Load Balancers that route traffic between servers.
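For a concrete picture of how this shows up in application code, here is a minimal sketch of a DynamoDB read using boto3, the AWS SDK for Python. The table name, key schema, and retry settings are illustrative assumptions, not details from Amazon's incident report; the relevant point is that a DNS failure surfaces as the endpoint being unreachable, not as a database error.

```python
# Minimal sketch: how a DynamoDB call surfaces a DNS/endpoint failure.
# Table name, key schema, and retry settings are illustrative assumptions.
import time

import boto3
from botocore.exceptions import EndpointConnectionError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def get_user(user_id: str, attempts: int = 3) -> dict | None:
    """Read one item, retrying briefly if the endpoint cannot be reached."""
    for attempt in range(attempts):
        try:
            response = dynamodb.get_item(
                TableName="users",                    # hypothetical table
                Key={"user_id": {"S": user_id}},      # hypothetical key schema
            )
            return response.get("Item")
        except EndpointConnectionError:
            # Raised when the regional endpoint cannot be reached, for example
            # because DNS resolution for it fails.
            time.sleep(2 ** attempt)                  # simple exponential backoff
    return None  # let the caller decide how to degrade gracefully
```

Retries like this absorb transient blips, but they can't help when the DNS records themselves stay broken for hours; at that point the only question is whether the application degrades gracefully or falls over.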
US-EAST-1: A Single Point of Failure?
Northern Virginia is home to AWS's oldest and largest region, located in "Data Center Alley" where hundreds of facilities operate. Many companies default to US-EAST-1 because it was AWS's first region and still hosts the most services. AWS tests new features here before rolling them out globally.
This concentration creates vulnerability. This is the third major incident in US-EAST-1 since 2020. While technical causes have varied - DNS errors, load balancer issues, API misconfigurations - the pattern is consistent: a single region supports a large share of internet traffic.
AWS divides regions into Availability Zones, separate data centers designed to isolate failures. In theory, an application spread across multiple zones should survive a problem in one zone. The October 20 outage affected the network layer connecting these zones, so failover between them didn't work effectively.
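As a small illustration of that zone layout, the sketch below simply enumerates a region's Availability Zones with boto3; it says nothing about the shared inter-zone network layer that failed in this incident.

```python
# Sketch: list the Availability Zones of one AWS region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)
for zone in response["AvailabilityZones"]:
    print(zone["ZoneName"], zone["ZoneId"])
```

Spreading workloads across those zones protects against a single-facility failure, which is exactly why a fault in the network connecting them hurts so much.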
Amazon's Official Response Timeline
07:11 GMT: AWS outage begins
Morning: AWS acknowledges the problem and engages engineers immediately
Throughout the day: AWS works on "multiple parallel paths to accelerate recovery"
10:11 GMT (6:11 a.m. ET): Amazon reports all services returned to normal operations
Hours after: Amazon notes a remaining backlog of messages "that they will finish processing over the next few hours"
Later: AWS commits to publishing a detailed post-event summary
Which Major Companies and Apps Were Affected?
The outage disrupted platforms across every industry, from communication apps to financial systems and IoT devices.
Company/App | Company/App | Company/App |
--- | --- | --- |
Adobe Creative Cloud | HBO Max | Signal |
Ally | Hinge | Slack |
Amazon Prime Video | Hulu | Smartsheet |
Apple Music | IMDb | Snapchat |
Asana | InstaCart | Square |
AT&T | Instructure | Starbucks |
Availity | League of Legends | Steam |
Blackboard | Life 360 | Strava |
Blink Security | Lyft | T-Mobile |
Boost Mobile | McDonald's app | Tidal |
Chime | Microsoft Office | Trello |
Classlink | Microsoft Teams | Truist |
Coinbase | My Fitness Pal | Ubisoft Connect |
CollegeBoard | Navy Federal Credit Union | Venmo |
Coursera | New York Times | Verizon |
Dead By Daylight | Office 365 | VR Chat |
Delta Air Lines | Outlook | Whatnot |
Duolingo | | Wordle |
Epic Games Store | PlayStation Network | Xbox |
Fanduel | Pokémon Go | Xero |
Fetch | | Zillow |
Fortnite | Ring | Zoom |
GoDaddy | Roblox | |
GrubHub | Roku | |
Airlines and Travel Services
Delta and United Airlines experienced system failures. AT&T and Verizon had network issues. Lyft couldn't process rides.
The disruptions affected check-in systems and operational workflows, though specific details about flight delays or cancellations weren't publicly reported.
Apps and Services: Snapchat, Reddit, Fortnite
Consumer-facing applications failed across the board. Snapchat users couldn't send messages or view stories. Reddit became inaccessible. Fortnite players lost connection to game servers. Roblox went offline. Gaming platforms including Xbox, PlayStation Network, Steam, and League of Legends experienced disruptions.
Ring doorbells stopped recording. Amazon's own services, including Prime Video and the main website, became partially inaccessible. Entertainment platforms like Netflix, Hulu, HBO Max, Apple Music, and Roku reported problems.
Banking and Financial Apps: Venmo, Coinbase
Financial services disruptions carried significant impact. Venmo users couldn't send or receive money. Coinbase had issues during trading hours.
Banks including Navy Federal Credit Union, Truist, Ally, and Chime reported problems with mobile apps and online banking platforms.
Other affected services included Square payment processing, Robinhood trading platform, and various financial institutions. The outage highlighted the vulnerability of running critical financial infrastructure on shared cloud platforms.
Complete List of Affected Services
About 2,500 companies experienced disruptions. Here are the major platforms that went down:
Communication & Collaboration: WhatsApp, Signal, Zoom, Slack, Microsoft Teams, Office 365, Outlook
Streaming & Entertainment: Netflix, Hulu, HBO Max, Prime Video, Apple Music, Roku, IMDb, Tidal
Gaming: Roblox, Fortnite, Xbox, PlayStation Network, Steam, League of Legends, Dead By Daylight, Wordle, Pokémon Go
Financial: Venmo, Robinhood, Coinbase, Navy Federal Credit Union, Truist, Square, Ally, Chime, FanDuel
Retail & Food: Starbucks, McDonald's app, Instacart, GrubHub, Fetch, Etsy
Smart Home & IoT: Ring, Blink Security, Life360
Education: Duolingo, Coursera, Blackboard, CollegeBoard, Canvas, Classlink, Instructure
Productivity: Asana, Trello, Smartsheet, Adobe Creative Cloud, Canva, GoDaddy
Media: The New York Times, Associated Press, The Wall Street Journal, ESPN, Pinterest
Other: AT&T, Verizon, Lyft, T-Mobile, My Fitness Pal, Strava, Zillow, Xero, Availity, Boost Mobile, Ubisoft Connect, VR Chat, Whatnot
Some platforms recovered within an hour. Others continued facing issues for several hours.
The Business Impact: Billions in Lost Productivity
The outage disrupted more than operational systems. It affected business continuity, customer trust, and revenue streams worldwide.
How Long the Outage Lasted
The main outage period ran from 07:11 GMT to 10:11 GMT on October 20, 2025 - approximately three hours. However, Amazon noted a remaining backlog of messages "that they will finish processing over the next few hours." Even after official resolution, Downdetector continued showing problems with platforms like OpenAI, ESPN, and Apple Music.
At the height of the incident, around 2,500 companies reported disruptions. Hours later, nearly 400 companies were still experiencing issues. Different services recovered at different rates depending on their dependencies and architecture.
Estimated Financial Losses
Exact figures aren’t publicly available, but the outage affected thousands of businesses, freezing online payments, delaying logistics, and interrupting day-to-day operations.
Downdetector logged over 11 million reports globally, including more than 3 million in the U.S., showing the scale of users and workers affected. Joshua Mahony, chief market analyst at Scope Markets, noted that the impact spanned multiple sectors but was manageable once services were restored.
Financial markets were largely unaffected, with Amazon’s stock closing slightly higher, reflecting that such outages are considered infrastructure risks rather than systemic company issues.
Customer Frustration and Communication Challenges
The outage shows how dependent daily operations have become on cloud infrastructure. People couldn't work, access money, or use communication tools. Smart home devices stopped functioning. Students couldn't access coursework. Gaming communities went dark simultaneously.
Companies struggled to communicate with customers when their own platforms ran on AWS.
The breadth of simultaneous failures across unrelated apps confused users who didn't understand the common infrastructure dependency.
Why This Wasn't a Cyberattack
Early reports speculated that the outage might be a cyberattack, but security experts quickly ruled that out. The root cause was a technical update to DynamoDB’s API that disrupted DNS routing inside AWS. Engineers were immediately engaged and worked along multiple paths to restore services.
Bryson Bort, CEO of Scythe, summarized the situation clearly: most major cloud outages result from human or configuration errors, not malicious activity.
The distinction matters: security protects against threats, but avoiding outages depends on sound system design, proper redundancy, and thorough testing.
Exploring alternative infrastructure options, such as Google Cloud Platform development, can help diversify workloads and reduce reliance on a single provider.
Pattern of Provider-Level Outages
While this specific incident wasn't an attack, it fits a pattern of provider-level infrastructure failures. Similar DNS-related incidents have occurred at other major cloud providers:
2021 - Microsoft Azure: DNS outage caused by a traffic spike
2021 - Akamai Edge: DNS bug in their system
July 2025 - Cloudflare: DNS resolver outage from an internal misconfiguration
These incidents share common characteristics: single points of failure, cascading effects across dependent services, and widespread disruption despite being technical rather than malicious in origin.
While DNS issues are common across internet infrastructure, disruption on this scale is rare; it reflects AWS's market reach and the concentration of workloads on its platform.
The Real Problem: Overreliance on a Single Cloud Provider
The October outage wasn't primarily about AWS's technical failure. It shows how the industry has concentrated risk in ways that make such failures inevitable and catastrophic.
Cloud Concentration and Market Share: AWS, Azure, Google Cloud
In Q2 2025, the global cloud market approached $100 billion in quarterly revenue. According to Statista and Synergy Research Group:
AWS: $29.7 billion (30%)
Microsoft Azure: $19.8 billion (20%)
Google Cloud: $12.9 billion (13%)
Oracle: $3.0 billion (3%)
Alibaba Cloud: $4.0 billion (4%)
Others: $31.8 billion (30%)
Three companies control 63% of global cloud infrastructure. Only two other cloud service providers operate at similar scale to AWS: Microsoft Azure and Google Cloud Platform. When one provider experiences a regional failure, a massive portion of internet services face potential disruption.
Joshua Mahony explained: "Amazon Web Services has cornered 30 percent of the market alone. Their users are not going to suddenly jump ship. Their businesses are deeply ingrained."
This concentration creates systemic risk. Organizations building exclusively on AWS infrastructure face inherent vulnerability, regardless of how well they architect their own systems within that platform.
Systemic Risk of Monocloud Architectures
A monocloud architecture means running your entire infrastructure on a single cloud provider. This approach offers real benefits: simpler operations, better integration between services, potential cost advantages from volume discounts, and a single vendor relationship to manage.
The downside is mathematical. If 100% of your infrastructure runs on one provider, you have 0% uptime when that provider experiences a regional outage. Your redundancy measures, backup systems, and disaster recovery plans become irrelevant when they all depend on the same underlying infrastructure experiencing the failure.
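As a back-of-the-envelope illustration (the availability figure below is an assumption, not an AWS number, and it treats the two providers as failing independently):

```python
# Back-of-the-envelope availability math. The 99.9% figure is an assumption,
# and the calculation assumes the two providers fail independently.
single = 0.999                       # availability of one provider/region
both_down = (1 - single) ** 2        # chance both are down at the same time
combined = 1 - both_down

print(f"single provider:   {single:.4%}")    # 99.9000%
print(f"two independent:   {combined:.4%}")  # 99.9999%
```

Independence is the big caveat: if both deployments share a dependency, such as the same DNS provider or payment processor, the real improvement is smaller.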
Because so many workloads and services are concentrated in US-EAST-1, even a relatively small technical fault can affect a large number of customers and applications simultaneously. The solution isn't leaving AWS but designing systems that can fail safely.
Lessons from Past Outages: Why Companies Keep Making the Same Mistake
This is the third major AWS outage in US-EAST-1 since 2020. Each event had different technical causes - DNS errors, load balancer issues, API misconfigurations - but all share the same pattern: a single region supporting too much internet traffic.
After each major outage, companies discuss diversification and resilience. Industry experts issue warnings about concentration risk. Some organizations implement changes. Most don't. The pattern repeats because changing infrastructure is expensive, complex, and doesn't show immediate return on investment until disaster strikes.
Companies continue defaulting to single-provider architectures because:
Migration costs are high
Teams have deep expertise in one platform
Integrated services work seamlessly within one provider
Multi-cloud adds operational complexity
The risk feels theoretical until it's not
Building for resilience requires intentional investment: using multiple regions within AWS, running critical workloads across more than one provider, and testing failover systems regularly.
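One concrete pattern for the multi-region piece is a read path that falls back to a second region, sketched below. It assumes a DynamoDB Global Table named "users" replicated to both regions; the names are illustrative.

```python
# Sketch: read from the primary region, fall back to a replica region.
# Assumes a DynamoDB Global Table named "users" replicated to both regions.
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

REGIONS = ["us-east-1", "us-west-2"]  # primary first, replica second

def get_user(user_id: str) -> dict | None:
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region)
        try:
            response = client.get_item(
                TableName="users",
                Key={"user_id": {"S": user_id}},
            )
            return response.get("Item")
        except (EndpointConnectionError, ClientError):
            continue  # region unreachable or erroring; try the next one
    return None
```

Writes are the harder half: Global Tables replicate asynchronously, so a regional failover can briefly serve stale data, which is usually an acceptable trade for staying online.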
Why Multi-Cloud Architecture Is No Longer Optional
Building systems that span multiple cloud providers adds complexity and costs more initially. For critical systems that can't afford extended downtime, these tradeoffs are necessary.
Benefits of Multi-Cloud for Critical Systems
Distributing your infrastructure across AWS, Azure, and Google Cloud offers clear operational advantages:
Redundancy: If one provider experiences an outage, services on others continue running. Users may notice slower performance, but core operations remain online.
Reduced vendor lock-in: Multi-cloud setups allow you to move workloads or adjust usage based on cost, performance, or service changes.
Geographic optimization: Different providers perform better in different regions. Multi-cloud lets you choose the best provider for local performance and compliance needs.
Workload flexibility: You can allocate compute-intensive tasks, store data according to regulations, and use the strongest services from each provider without compromise.
How Failover Across Providers Prevents Catastrophe
Failover means automatically redirecting traffic from a failed system to a working backup. In multi-cloud architectures, this means routing requests to a different cloud provider when your primary experiences problems.
Implementation requires planning and infrastructure investment:
Data replication: Critical data must exist across providers, synchronized in real-time or near-real-time. This ensures your backup systems have current information when they need to take over.
Application compatibility: You need to maintain synchronized application versions that can run on different platforms. This might mean containerized applications that run identically on AWS, Azure, or Google Cloud.
Traffic routing: Configure DNS or load balancers that detect failures and switch providers quickly. This requires health checks, automated decision-making, and pre-configured routing rules.
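A minimal sketch of that traffic-routing step, assuming each provider exposes a simple health endpoint (the URLs are hypothetical, and production setups usually do this at the DNS or global load-balancer layer rather than in application code):

```python
# Sketch: pick the first healthy provider endpoint before sending traffic.
# URLs are hypothetical; real setups usually do this in DNS or a load balancer.
import requests

ENDPOINTS = [
    "https://api.aws.example.com",   # primary, hosted on AWS
    "https://api.gcp.example.com",   # secondary, hosted on another provider
]

def healthy(base_url: str, timeout: float = 2.0) -> bool:
    try:
        return requests.get(f"{base_url}/health", timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def choose_endpoint() -> str:
    for url in ENDPOINTS:
        if healthy(url):
            return url
    raise RuntimeError("no healthy provider endpoint")  # page someone
```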
Two implementation approaches:
Active-active architectures send traffic to multiple providers simultaneously. All providers handle production load at all times. When one fails, the others absorb its traffic. This provides the fastest failover, often transparent to users, but costs more to operate since you're running full infrastructure across multiple providers.
Active-passive setups keep backup providers ready but idle until the primary fails. This costs less because you're not paying for full production capacity on backup providers, but failover takes longer and requires manual or automated intervention to activate backups.
Both approaches require regular testing. Failover systems that haven't been tested in production-like conditions often fail when you need them most.
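That testing can be as unglamorous as a scheduled drill that fakes a primary outage and checks that traffic lands on the secondary. A minimal, self-contained sketch (the endpoints and the router function are hypothetical):

```python
# Sketch: a failover drill that fakes a primary outage and verifies the
# secondary is chosen. Endpoint names and the router are hypothetical.
from unittest import mock

PRIMARY = "https://api.aws.example.com"
SECONDARY = "https://api.gcp.example.com"

def choose_endpoint(health_check) -> str:
    """Return the first endpoint whose health check passes."""
    for url in (PRIMARY, SECONDARY):
        if health_check(url):
            return url
    raise RuntimeError("no healthy endpoint")

def test_failover_to_secondary_when_primary_is_down():
    # Pretend only the secondary answers its health check.
    health_check = mock.Mock(side_effect=lambda url: url == SECONDARY)
    assert choose_endpoint(health_check) == SECONDARY
```

Drills like this, run regularly against staging and occasionally against production, are what keep an active-passive setup from being a backup that exists only on paper.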
Real-World Multi-Cloud Success
Companies using multi-cloud setups can keep services running during provider outages, while single-provider systems risk downtime.
Financial services: Critical transaction systems span multiple providers, so issues with one don’t halt operations.
Streaming platforms: Content delivery spreads across providers to maintain availability if one has a disruption.
E-commerce: Checkout and payment systems replicate across clouds, allowing order processing even if some components slow down.
These setups aren’t about avoiding outages but maintaining essential services when disruptions happen.
Build Resilient, Multi-Cloud Systems Today
Maintaining availability across cloud providers requires careful planning. We help design architectures that run across AWS, Azure, and Google Cloud to match the reliability requirements of your business.
The first step is identifying which systems are critical, what recovery times are acceptable for other services, and how to balance resilience with operational complexity and cost. Decisions are based on practical risk assessment, not assumptions.
Technically, this includes managing data replication, failover routing, and monitoring across providers while keeping operations efficient.
You can also connect with our experts to assess your current cloud setup and design architectures that maintain availability across providers and regions.
Frequently Asked Questions
What caused the AWS outage in October 2025?
An error in a technical update to the DynamoDB API broke DNS configuration inside AWS. DNS translates website names into IP addresses. When the configuration failed, applications couldn't find DynamoDB's API endpoints. The DNS issue cascaded to 113 AWS services including EC2, Lambda, and SQS. The problem originated from a routine update error, not cyberattacks or security breaches.
How long did the AWS outage last?
Approximately three hours on October 20, 2025, from 07:11 GMT to 10:11 GMT (6:11 a.m. ET). AWS reported all services returned to normal operations by 10:11 GMT, though Amazon noted a remaining backlog of messages "that they will finish processing over the next few hours." Some platforms continued showing issues on Downdetector after the main outage was resolved.
Which companies were affected by the AWS outage?
Around 2,500 companies reported disruptions, including Snapchat, Reddit, Roblox, Slack, Venmo, Delta Air Lines, United Airlines, Coinbase, The New York Times, Netflix, Fortnite, WhatsApp, Signal, Zoom, Ring, Wordle, and hundreds of others across communication, gaming, financial services, streaming, retail, education, and smart home categories.
Was the AWS outage a cyberattack?
No. Scythe CEO Bryson Bort confirmed it resulted from human error in a technical update. AWS identified the problem as an error in a DynamoDB API update that broke DNS configuration. No security breaches or external interference occurred.
How can companies protect themselves from future AWS outages?
Implement multi-cloud architectures distributing infrastructure across AWS, Azure, and Google Cloud. Set up automated failover systems, replicate critical data across providers, and maintain application versions that run on different platforms. Use multiple regions within AWS. Test failover systems regularly. Maintain incident response plans for provider-level failures.
What is a multi-cloud strategy?
Using multiple cloud providers simultaneously instead of relying on one vendor. Companies run workloads across AWS, Azure, and Google Cloud, or maintain redundant systems across providers. This reduces vendor lock-in, provides redundancy if one provider fails, and allows optimization based on each provider's strengths in different regions or services.

