Incident Report: ENA Watchdog Timeouts, MMIO Failures & Conntrack Drops

Sanitized post-mortem data extracted from official AWS Sev-1 documentation. Select the exact symptom below to traverse the diagnostic tree.

🚨 T-0: Monitoring Alert / Select Initial Symptom

Identify the primary failure mode:

1. ENA Keep-Alive Watchdog Timeout (Hardware Desync)

🩸 The "Bleeding" Indicator (Raw Log):

ena 0000:00:07.0 eth1: Keep alive watchdog timeout.

🧠 Under the Hood (5 Whys Root Cause): The Elastic Network Adapter (ENA) device on the Nitro card periodically sends keep-alive notifications to the guest driver. When the driver stops receiving them within the watchdog window, it treats the hardware and driver as desynchronized and triggers a hard device reset, instantly severing all active network streams.

🛠️ The Fix (CLI Remediation):

# 1. Force an instance Stop/Start to migrate to healthy underlying hardware
aws ec2 stop-instances --instance-ids <your-instance-id>
aws ec2 start-instances --instance-ids <your-instance-id>

# 2. Once recovered, verify the ENA driver version (aim for >= 2.2.11)
modinfo ena

⚠️ Blast Radius: Stopping and starting the instance causes immediate downtime and changes ephemeral public IPs. Data on instance-store volumes will be permanently wiped.
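Once the instance is back, the driver-version check in step 2 can be scripted. The sketch below is a hypothetical helper (the function name `ena_version_ok` and the parsing of `modinfo` output are assumptions, not part of any AWS tooling); it uses `sort -V` to compare version strings against the 2.2.11 floor quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check the installed ENA driver version against a minimum.
# Assumes `modinfo ena` emits a "version:" line; 2.2.11 is the floor from the runbook.
ena_version_ok() {
  local current="$1" minimum="${2:-2.2.11}"
  # sort -V orders version strings numerically; if the minimum sorts first
  # (or equals current), the installed driver meets the floor.
  [ "$(printf '%s\n' "$minimum" "$current" | sort -V | head -n1)" = "$minimum" ]
}

# Example usage on a live instance (uncomment there):
# current=$(modinfo ena | awk '/^version:/ {print $2}')
# ena_version_ok "$current" && echo "ENA driver OK" || echo "ENA driver too old"
```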

2. ENA Memory Mapped I/O (MMIO) Read Timeout

🩸 The "Bleeding" Indicator:

[ENA_COM: ena_com_reg_bar_read32] reading reg failed for timeout. expected: req id offset

🧠 Under the Hood: An incompatible driver version, transient hardware failure, or severely busy PCIe bus prevents the initialization procedure from reading MMIO registers. The OS fails to attach the virtual network interface.

🛠️ The Fix:

# 1. Check if the driver is attempting to load but hanging in dmesg
dmesg | grep -i ena

# 2. If the OS boots but the network is dead, reinstall the driver via DKMS
#    (deliver the commands via Systems Manager or user-data on next boot)
sudo dkms install -m amzn-drivers -v <latest_ena_version>
# Rebuild the initramfs so the new module loads at boot
# (update-initramfs is Debian/Ubuntu; on Amazon Linux use `sudo dracut -f` instead)
sudo update-initramfs -c -k all

⚠️ Blast Radius: Modifying kernel modules requires a reboot to take effect. If the driver compilation fails, the instance may become unreachable over the network until rescued (e.g., by detaching the root volume and repairing it from a healthy instance).
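Step 1 above can be turned into a quick triage check. This is a minimal sketch: the function name `ena_mmio_failed` is hypothetical, and the pattern it greps for is simply the MMIO indicator string quoted earlier in this section.

```shell
# Hypothetical triage helper: scan kernel log text for the MMIO read-timeout
# signature shown in the "Bleeding Indicator" above.
ena_mmio_failed() {
  # Succeeds (exit 0) if the ENA MMIO register-read timeout appears in the input
  # (reads stdin when no file is given, so it can sit at the end of a pipe).
  grep -q 'ena_com_reg_bar_read32.*reading reg failed for timeout' "$@"
}

# Example usage on a live instance:
# dmesg | ena_mmio_failed && echo "MMIO read timeout detected - driver failed to attach"
```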

3. Security Group Connection Tracking Exhaustion

🩸 The "Bleeding" Indicator:

# Visible via ethtool metrics
conntrack_allowance_exceeded: 45021

🧠 Under the Hood: The instance exceeded its maximum allowance of stateful tracked connections (conntrack) enforced at the Security Group level. The EC2 hypervisor acts as a circuit breaker and silently drops packets for any new connection that would require tracking; established tracked flows continue to pass.

🛠️ The Fix:

# 1. Confirm the packet drops are strictly due to conntrack exhaustion
ethtool -S eth0 | grep conntrack_allowance_exceeded

# 2. Make high-traffic flows "untracked": permit the port from 0.0.0.0/0 on ingress,
#    and permit all response traffic to 0.0.0.0/0 on egress (AWS only untracks a
#    TCP/UDP flow when the response-direction rule allows 0.0.0.0/0 on all ports)
aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-egress --group-id sg-12345678 --protocol -1 --cidr 0.0.0.0/0

⚠️ Blast Radius: Opening outbound rules to 0.0.0.0/0 to bypass connection tracking may violate strict internal compliance policies. Review with your security team.
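To confirm the drops in step 1 are ongoing rather than historical, the counter can be sampled twice and compared. A minimal sketch, assuming the standard `ethtool -S` `name: value` output format; the function name `conntrack_drops` is hypothetical.

```shell
# Hypothetical parser: extract the conntrack_allowance_exceeded counter from
# `ethtool -S` output (reads stdin when no file is given); prints 0 if absent.
conntrack_drops() {
  awk -F': *' '/conntrack_allowance_exceeded/ {print $2; found=1}
               END {if (!found) print 0}' "$@"
}

# Example usage (two samples a minute apart show whether the counter is climbing):
# before=$(ethtool -S eth0 | conntrack_drops)
# sleep 60
# after=$(ethtool -S eth0 | conntrack_drops)
# [ "$after" -gt "$before" ] && echo "conntrack drops still occurring"
```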