How a DNS Failure in AWS Stopped Half the Internet for 15 Hours

October 20, 2025

My smart home stopped being smart. Alexa was silent as a tomb. The Ring doorbell showed "connecting…" indefinitely. Snapchat, Fortnite, and other normally reliable services turned into screens reading: Warning - Technical Maintenance.

It wasn’t a cyberattack. It wasn’t a hardware failure. It was three letters: DNS.

What Actually Happened

The timeline of events from AWS’s official reports looked like this:

23:49 PDT ─┐
           │ DNS resolution error for DynamoDB endpoints
           ├─> DynamoDB API errors (US-EAST-1)
           │
02:24 PDT ─┤ DNS FIXED! 🎉 
           │ ...but...
           │
           ├─> EC2 launch system failed (DynamoDB dependency)
           │   └─> RDS, ECS, Glue - everything using EC2
           │
09:38 PDT ─┤ Network Load Balancer health checks failure
           │   └─> Lambda execution environments
           │   └─> CloudWatch metrics
           │   └─> Total connection chaos
           │
15:01 PDT ─┴─> FULL RESTORATION (15h 12min later)

Affected services:
├─ Gaming: Fortnite, Roblox, PlayStation Network
├─ Social: Snapchat, Signal
├─ Finance: Coinbase, Robinhood, Venmo, US banks
├─ Amazon's own: Alexa, Ring, Prime Video
├─ Atlassian: Jira, Confluence
└─ Infrastructure: IAM, Support tickets, DynamoDB Global Tables

DNS: The 12 Bytes That Stopped Half the Internet

What exactly is DNS, and why did it cause such massive problems and losses? DNS is essentially the internet’s phone book: it translates human-readable names into the IP addresses that machines use to reach each other.

Let’s imagine we don’t use DNS - we want to search something on Google. Instead of google.com, we’d have to type an IP address, like 142.250.179.142.

Could you remember a string of numbers instead of google.com? Probably not! DNS is our phone book, allowing us to communicate using easy-to-remember web addresses.

How Does It Work?

Code is worth more than 1000 words. Using C and this very simplified program, let’s query a DNS server for the DynamoDB IP! (Or you can use ready-made programs such as host or nslookup.)

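#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    // DNS queries normally travel as a single UDP datagram
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    // Stop waiting after 3 seconds instead of blocking forever on a lost packet
    struct timeval tv = {3, 0};
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    // Fill the address field by field; the struct layout differs between systems
    struct sockaddr_in dns;
    memset(&dns, 0, sizeof(dns));
    dns.sin_family = AF_INET;
    dns.sin_port = htons(53);
    dns.sin_addr.s_addr = inet_addr("8.8.8.8");

    // This is the ENTIRE DNS query - just 12 bytes header + domain name
    unsigned char q[] = {
        0xAA,0xAA,0x01,0x00,0,1,0,0,0,0,0,0,  // DNS header: ID, flags (recursion desired), 1 question
        8,'d','y','n','a','m','o','d','b',
        9,'u','s','-','e','a','s','t','-','1',
        9,'a','m','a','z','o','n','a','w','s',
        3,'c','o','m',0,
        0,1,0,1  // Type A, Class IN
    };

    unsigned char r[512];

    if (sendto(sock, q, sizeof(q), 0, (struct sockaddr*)&dns, sizeof(dns)) < 0) {
        perror("sendto"); close(sock); return 1;
    }
    ssize_t len = recvfrom(sock, r, sizeof(r), 0, NULL, NULL);
    close(sock);

    // Need a full header, RCODE 0 (no error) and at least one answer record
    if (len < 16 || (r[3] & 0x0F) != 0 || (r[6] == 0 && r[7] == 0)) {
        fprintf(stderr, "No usable DNS answer\n");
        return 1;
    }

    // A DNS response has a complex structure; this shortcut assumes the last
    // record is an A record, so its IPv4 address is the final 4 bytes
    printf("IP: %d.%d.%d.%d\n", r[len-4], r[len-3], r[len-2], r[len-1]);

    return 0;
}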

And the response looks like this:

IP: 3.218.180.124

53 is the DNS port. 8.8.8.8 is Google’s public DNS server (you can use any DNS server). That’s all you need to ask where dynamodb.us-east-1.amazonaws.com is.
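The hard-coded byte array in the program above can also be generated from a hostname string. The following is a minimal sketch, not a full resolver: the function names `encode_qname` and `build_a_query` are made up for this illustration, and it only builds a single-question query for an A record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode "a.b.c" as length-prefixed DNS labels ending in a zero byte.
   Returns the number of bytes written, or 0 on error.
   (Hypothetical helper, named for this sketch only.) */
size_t encode_qname(const char *host, unsigned char *out, size_t cap) {
    assert(host != NULL && out != NULL);
    size_t n = 0;
    const char *p = host;
    while (*p) {
        const char *dot = strchr(p, '.');
        size_t len = dot ? (size_t)(dot - p) : strlen(p);
        /* Labels are 1..63 bytes; always keep room for the final zero byte */
        if (len == 0 || len > 63 || n + 1 + len + 1 > cap) return 0;
        out[n++] = (unsigned char)len;
        memcpy(out + n, p, len);
        n += len;
        p += len;
        if (*p == '.') p++;           /* step over the separator */
    }
    if (n == 0 || n + 1 > 255) return 0;   /* empty name, or longer than DNS allows */
    out[n++] = 0;                     /* root label terminates the name */
    return n;
}

/* Build a complete query for an A record: 12-byte header + name + type/class.
   Returns the total length, or 0 on error. */
size_t build_a_query(const char *host, uint16_t id, unsigned char *out, size_t cap) {
    assert(host != NULL && out != NULL);
    if (cap < 12 + 1 + 4) return 0;
    memset(out, 0, 12);
    out[0] = (unsigned char)(id >> 8);     /* transaction ID, high byte */
    out[1] = (unsigned char)(id & 0xFF);   /* transaction ID, low byte */
    out[2] = 0x01;                         /* flags: recursion desired */
    out[5] = 1;                            /* QDCOUNT = 1 question */
    size_t name_len = encode_qname(host, out + 12, cap - 12 - 4);
    if (name_len == 0) return 0;
    size_t n = 12 + name_len;
    out[n++] = 0; out[n++] = 1;            /* QTYPE = A */
    out[n++] = 0; out[n++] = 1;            /* QCLASS = IN */
    return n;
}
```

Calling `build_a_query("dynamodb.us-east-1.amazonaws.com", 0xAAAA, buf, sizeof(buf))` should produce the same 50 bytes as the array q in the program: 12 for the header, 34 for the encoded name, and 4 for type and class.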

DNS queries usually travel over UDP, which is much simpler than TCP and therefore very fast. Like any internet protocol, DNS is standardized; the message format is defined in RFC 1035.
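Reading the last four bytes of the response, as the program above does, only works when the final record happens to be an A record. A more careful reader walks the message as RFC 1035 lays it out: header, question section, then answer records. The sketch below does that for the first A record only; the names `skip_name`, `rd16` and `first_a_record` and the return-code convention are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skip one encoded name starting at pos.
   Returns the offset just past it, or 0 if the message is malformed. */
static size_t skip_name(const unsigned char *msg, size_t len, size_t pos) {
    while (pos < len) {
        unsigned char c = msg[pos];
        if (c == 0) return pos + 1;               /* root label ends the name */
        if ((c & 0xC0) == 0xC0)                   /* 2-byte compression pointer ends it too */
            return pos + 2 <= len ? pos + 2 : 0;
        if (c & 0xC0) return 0;                   /* reserved label types */
        pos += 1 + (size_t)c;                     /* ordinary label: length byte + data */
    }
    return 0;
}

/* Read a 16-bit big-endian integer. */
static unsigned rd16(const unsigned char *p) {
    return ((unsigned)p[0] << 8) | p[1];
}

/* Find the first A record in a DNS response.
   Returns 0 and fills ip[4] on success, a positive RCODE if the server
   reported an error (3 = NXDOMAIN), -1 if the message is malformed,
   and -2 if the answer section holds no A record. */
int first_a_record(const unsigned char *msg, size_t len, unsigned char ip[4]) {
    assert(msg != NULL && ip != NULL);
    if (len < 12) return -1;                      /* header is always 12 bytes */
    if (!(msg[2] & 0x80)) return -1;              /* QR bit clear: not a response */
    int rcode = msg[3] & 0x0F;
    if (rcode != 0) return rcode;

    unsigned qd = rd16(msg + 4), an = rd16(msg + 6);
    size_t pos = 12;
    for (unsigned i = 0; i < qd; i++) {           /* skip the echoed question(s) */
        pos = skip_name(msg, len, pos);
        if (pos == 0 || pos + 4 > len) return -1;
        pos += 4;                                 /* QTYPE + QCLASS */
    }
    for (unsigned i = 0; i < an; i++) {           /* walk the answer records */
        pos = skip_name(msg, len, pos);
        if (pos == 0 || pos + 10 > len) return -1;
        unsigned type = rd16(msg + pos), rdlen = rd16(msg + pos + 8);
        pos += 10;                                /* TYPE, CLASS, TTL, RDLENGTH */
        if (pos + rdlen > len) return -1;
        if (type == 1 && rdlen == 4) { memcpy(ip, msg + pos, 4); return 0; }
        pos += rdlen;                             /* e.g. a CNAME before the A record */
    }
    return -2;
}
```

Distinguishing "server error", "malformed" and "no A record" matters in practice: a response that parses cleanly but carries no address looks very different to a caller than a timeout does.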

The outage itself affected DynamoDB - a NoSQL database used for storing and processing large amounts of data, especially at high scale and traffic intensity.

Even those who don’t use DynamoDB were affected by the outage because, for example, EC2 internally uses DynamoDB to store metadata.

DynamoDB as Distributed Architecture

DynamoDB, like other AWS services, runs on a distributed architecture, which we can illustrate with the following diagram:

🏗️ DynamoDB ARCHITECTURE (distributed)

                    LOAD BALANCER VIP
                 (Virtual IP: 52.94.133.131)
                            ↓
        ┌───────────────────┼───────────────────┐
        │                   │                   │
    ┌───▼────┐         ┌───▼────┐         ┌───▼────┐
    │ Node 1 │         │ Node 2 │         │ Node 3 │
    │  AZ-1a │         │  AZ-1b │         │  AZ-1c │
    │   IP   │         │   IP   │         │   IP   │
    └────────┘         └────────┘         └────────┘
        ✅                 ✅                 ✅
     WORKING            WORKING            WORKING

So if everything is distributed, even the DNS servers, what went wrong? DNS resolution for the DynamoDB endpoints in us-east-1 failed, so the healthy nodes behind the name could no longer be found.

As it turns out, “multi-region” requires true independence, not just data replication.

How does this relate to our DNS problem?

Many AWS services, despite running in different locations, shared a single point of contact: DNS address resolution in us-east-1.

// Entire infrastructure dependent on a single service
DynamoDB_IP = dns_resolve("dynamodb.us-east-1.amazonaws.com");
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                          When this fails - everything fails!
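From an application's point of view, the pseudocode above usually corresponds to a call to the POSIX function getaddrinfo(). The sketch below wraps it in a hypothetical helper, `resolve_ipv4`, to show where a DNS failure surfaces in ordinary code: as an error code the caller has to handle, long before any connection is attempted.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <arpa/inet.h>

/* Resolve host to its first IPv4 address, written as text into out.
   Returns 0 on success, or the nonzero getaddrinfo() error code on failure
   (pass it to gai_strerror() for a readable message).
   (Hypothetical helper, named for this sketch only.) */
int resolve_ipv4(const char *host, char *out, size_t cap) {
    assert(host != NULL && out != NULL);

    struct addrinfo hints, *res = NULL;
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;        /* IPv4 only, to keep the sketch short */
    hints.ai_socktype = SOCK_STREAM;  /* avoid one result per socket type */

    int rc = getaddrinfo(host, NULL, &hints, &res);
    if (rc != 0) return rc;           /* e.g. EAI_NONAME or EAI_AGAIN */

    const struct sockaddr_in *sa = (const struct sockaddr_in *)res->ai_addr;
    const char *ok = inet_ntop(AF_INET, &sa->sin_addr, out, (socklen_t)cap);
    freeaddrinfo(res);
    return ok ? 0 : EAI_SYSTEM;
}
```

A caller would pass a buffer of at least INET_ADDRSTRLEN bytes and check the return value before using the address. When the name cannot be resolved, nothing after this call can proceed, no matter how healthy the servers behind the name are.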

What Did This Mean in Practice?

The problem wasn’t in DynamoDB. The problem wasn’t in the servers. The problem was in the MAP.

All AWS services were asking:

Where is dynamodb.us-east-1.amazonaws.com?

And nobody could answer. Servers were running. Data was safe. But NOBODY knew how to get there.

What Does This Mean for Us and What Lessons Can We Learn?

The cloud isn’t magic, just someone else’s servers that can fail
Multi-region doesn’t mean independence (something can always use a single location we don’t know about)
Test failures, especially network failures like DNS
Don’t rely on a single DNS provider
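The last two lessons can be sketched in code. The example below is one possible shape, not a recommendation for any particular library: it tries several resolvers in order and, if all of them fail, falls back to the last address that worked. A second resolver does not help when the authoritative record itself is missing, which is why the sketch keeps a last known good answer. Everything here (`resolver_fn`, `last_good_t`, `resolve_with_fallback`, and the two stub resolvers) is hypothetical, and the cache holds a single host for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A resolver returns 0 and fills ip[4] on success, nonzero on failure. */
typedef int (*resolver_fn)(const char *host, unsigned char ip[4]);

/* Last address that resolved successfully (one host only in this sketch). */
typedef struct {
    unsigned char ip[4];
    int valid;                 /* 1 once any lookup has succeeded */
} last_good_t;

enum { RESOLVED_FRESH = 0, RESOLVED_STALE = 1, RESOLVE_FAILED = -1 };

/* Try each resolver in order; on total failure, serve the stale answer. */
int resolve_with_fallback(const char *host,
                          const resolver_fn *resolvers, size_t count,
                          last_good_t *cache, unsigned char ip[4]) {
    assert(host != NULL && resolvers != NULL && cache != NULL && ip != NULL);
    for (size_t i = 0; i < count; i++) {
        if (resolvers[i](host, ip) == 0) {
            memcpy(cache->ip, ip, 4);     /* remember the fresh answer */
            cache->valid = 1;
            return RESOLVED_FRESH;
        }
    }
    if (cache->valid) {
        memcpy(ip, cache->ip, 4);         /* may be outdated, caller decides */
        return RESOLVED_STALE;
    }
    return RESOLVE_FAILED;
}

/* Fault-injection stubs: stand-ins for real resolvers when testing failures. */
int resolver_down(const char *host, unsigned char ip[4]) {
    (void)host; (void)ip;
    return -1;                            /* simulates a DNS outage */
}

int resolver_ok(const char *host, unsigned char ip[4]) {
    (void)host;
    ip[0] = 192; ip[1] = 0; ip[2] = 2; ip[3] = 10;   /* documentation-range address */
    return 0;
}
```

Passing the resolvers in as function pointers is what makes the failure testable: a test can hand in `resolver_down` for every slot and check that the caller gets RESOLVED_STALE or RESOLVE_FAILED instead of hanging. A stale address is only useful while the servers behind it have not moved, so real code would also need an age limit on the cached entry.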
