Incident Report: ENA PPS Limits, Bandwidth Throttling & DNS Blackholes
Sanitized post-mortem data extracted from official AWS Sev-1 documentation. Select the exact symptom below to traverse the diagnostic tree.
🚨 T-0: Monitoring Alert / Select Initial Symptom
Identify the primary failure mode from your ethtool metrics:
1. Hard Packet-Per-Second (PPS) Cap Exceeded
🩸 The "Bleeding" Indicator (Raw Log):
$ ethtool -S eth0 | grep pps_allowance_exceeded
pps_allowance_exceeded: 14502
🧠 Under the Hood (5 Whys Root Cause): The combined inbound and outbound packet-per-second rate has exceeded the allowance for this EC2 instance size. Even when total bandwidth (Gbps) is low, millions of tiny packets can blow through the PPS allowance, at which point the Nitro system shapes the microburst and drops the excess packets immediately.
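To see why the PPS ceiling bites long before the bandwidth ceiling, here is a quick back-of-the-envelope sketch (the throughput and packet sizes are illustrative, not instance-specific limits):

```python
# Illustrative math: why small packets exhaust a PPS allowance at a
# bandwidth level that looks completely harmless in Gbps terms.

def packets_per_second(throughput_gbps: float, packet_bytes: int) -> float:
    """PPS required to sustain a given throughput at a fixed packet size."""
    bits_per_packet = packet_bytes * 8
    return throughput_gbps * 1e9 / bits_per_packet

# 1 Gbps of 100-byte packets: 1.25 million pps of pressure on the allowance.
small = packets_per_second(1.0, 100)

# The same 1 Gbps carried in 9001-byte jumbo frames: roughly 14k pps.
jumbo = packets_per_second(1.0, 9001)

print(f"100B packets: {small:,.0f} pps; 9001B frames: {jumbo:,.0f} pps")
```

This is the entire rationale for the jumbo-frame remediation below: the same payload in ~90x larger packets means ~90x fewer packets per second.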
🛠️ The Fix (CLI Remediation):
# 1. Enable Jumbo Frames (MTU 9001) to pack more data per packet, drastically reducing overall PPS
sudo ip link set dev eth0 mtu 9001
# 2. Make the MTU change persistent across reboots (Amazon Linux 2/CentOS)
echo "MTU=9001" | sudo tee -a /etc/sysconfig/network-scripts/ifcfg-eth0
sudo systemctl restart network
⚠️ Blast Radius: Jumbo frames must be supported end-to-end. If traffic crosses an internet gateway or VPN that enforces a 1500 MTU, enabling 9001 MTU will cause fragmentation or complete packet loss.
2. Aggregate Network Bandwidth Allowance Exceeded
🩸 The "Bleeding" Indicator:
$ ethtool -S eth0 | grep bw_allowance_exceeded
bw_in_allowance_exceeded: 8432
bw_out_allowance_exceeded: 1021
🧠 Under the Hood: The instance has entirely depleted its network I/O burst credit bucket, or is attempting to sustain traffic above its hard baseline aggregate throughput. The AWS hypervisor actively polices the traffic, dropping packets to enforce the instance-type bandwidth limits.
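These counters lend themselves to automated detection before users notice drops. Below is a minimal sketch of a watchdog-style parser for `ethtool -S` output; the sample text and function name are illustrative:

```python
# Parse `ethtool -S eth0`-style output and flag any ENA allowance-exceeded
# counters that are non-zero. SAMPLE is illustrative, not from a real host.

SAMPLE = """\
     bw_in_allowance_exceeded: 8432
     bw_out_allowance_exceeded: 1021
     pps_allowance_exceeded: 0
"""

def parse_allowances(ethtool_output: str) -> dict[str, int]:
    """Extract all *_allowance_exceeded counters into a name -> value map."""
    counters = {}
    for line in ethtool_output.splitlines():
        name, _, value = line.strip().partition(": ")
        if "allowance_exceeded" in name:
            counters[name] = int(value)
    return counters

counters = parse_allowances(SAMPLE)
breached = [name for name, value in counters.items() if value > 0]
print(breached)
```

In practice you would diff these counters between polling intervals, since they are cumulative since driver reset; a non-zero but static value means the breach is historical, not ongoing.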
🛠️ The Fix:
# 1. Stop the instance to safely modify its underlying hardware type
aws ec2 stop-instances --instance-ids <your-instance-id>
# 2. Upgrade to a network-optimized instance type (e.g., from c5.large to c5n.large) for higher baseline Gbps
aws ec2 modify-instance-attribute --instance-id <your-instance-id> --instance-type Value=c5n.large
# 3. Start the instance back up
aws ec2 start-instances --instance-ids <your-instance-id>
⚠️ Blast Radius: Instance type changes require downtime. Ensure your architecture supports a hard stop and start.
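The same stop → modify → start sequence can be scripted for repeatability. This is a hedged boto3-style sketch, exercised here against a recording stub rather than a live account (`resize_instance` and `_StubEC2` are illustrative names; the call shapes and waiter name mirror the EC2 API, including the `{"Value": ...}` structure the instance-type attribute expects):

```python
def resize_instance(ec2, instance_id: str, new_type: str) -> None:
    """Stop an instance, change its type, and start it again."""
    ec2.stop_instances(InstanceIds=[instance_id])
    # Block until the instance is fully stopped before modifying it.
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    # InstanceType is an AttributeValue structure: {"Value": "<type>"}
    ec2.modify_instance_attribute(
        InstanceId=instance_id, InstanceType={"Value": new_type}
    )
    ec2.start_instances(InstanceIds=[instance_id])

class _StubEC2:
    """Records calls so the sequence can be shown without AWS credentials."""
    def __init__(self):
        self.calls = []
    def stop_instances(self, **kw):
        self.calls.append("stop_instances")
    def get_waiter(self, name):
        self.calls.append(f"wait:{name}")
        class _Waiter:
            def wait(self, **kw): pass
        return _Waiter()
    def modify_instance_attribute(self, **kw):
        self.calls.append("modify_instance_attribute")
    def start_instances(self, **kw):
        self.calls.append("start_instances")

stub = _StubEC2()
resize_instance(stub, "i-0123456789abcdef0", "c5n.large")
print(stub.calls)
```

Against a real account you would pass `boto3.client("ec2")` instead of the stub; the instance ID above is a placeholder.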
3. Local Proxy Service PPS Throttle (IMDS/DNS Blackhole)
🩸 The "Bleeding" Indicator:
$ ethtool -S eth0 | grep linklocal_allowance_exceeded
linklocal_allowance_exceeded: 5042
🧠 Under the Hood: The instance is hammering the Amazon-provided VPC DNS resolver (169.254.169.253) or the Instance Metadata Service (169.254.169.254). AWS enforces a strict, non-configurable limit of 1024 packets per second, per network interface, to these link-local services. Exceeding it causes immediate DNS resolution failures and application timeouts.
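The policing at 1024 PPS behaves like a token bucket: sustained traffic at the limit passes, but a burst above the refill rate is silently dropped. This toy model illustrates the effect (the refill rate matches the documented limit; the burst depth and arrival pattern are assumptions for illustration):

```python
# Toy token-bucket model of the per-ENI link-local packet limit.

class TokenBucket:
    def __init__(self, rate_pps: float, burst: float):
        self.rate = rate_pps     # refill rate: 1024 tokens (packets) per second
        self.burst = burst       # maximum bucket depth (assumed value)
        self.tokens = burst      # start with a full bucket

    def allow(self, now: float, last: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst depth.
        self.tokens = min(self.burst, self.tokens + (now - last) * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False             # no token available: packet is dropped

bucket = TokenBucket(rate_pps=1024, burst=1024)
dropped = 0
last = 0.0
# Simulate 2048 DNS queries fired in a 100 ms burst.
for i in range(2048):
    now = 0.1 * i / 2048
    if not bucket.allow(now, last):
        dropped += 1
    last = now

print(f"dropped {dropped} of 2048 burst queries")
```

Roughly 45% of the burst is dropped even though the *average* rate over a full second would have been fine, which is exactly why a local cache that absorbs bursts (below) fixes this class of incident.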
🛠️ The Fix:
# Install a local DNS caching daemon to drastically reduce queries sent to the VPC resolver
sudo yum install -y dnsmasq
# Point dnsmasq at the VPC resolver as its upstream, and stop it reading
# /etc/resolv.conf for upstreams (which would loop back to 127.0.0.1)
echo "server=169.254.169.253" | sudo tee -a /etc/dnsmasq.conf
echo "no-resolv" | sudo tee -a /etc/dnsmasq.conf
sudo systemctl enable --now dnsmasq
# Only then direct local lookups at the cache
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf
⚠️ Blast Radius: Low risk, but not zero. /etc/resolv.conf is routinely rewritten by DHCP clients and NetworkManager, so pin the 127.0.0.1 entry in your network manager's configuration or it will be lost on reboot or lease renewal.