Incident Report: PMTUD Blackholes, Untracked SG Drops, and EIP PTR Blocks
Sanitized post-mortem data drawn from Sev-1 incident reviews. Select the exact symptom below to traverse the diagnostic tree.
🚨 T-0: Monitoring Alert / Select Initial Symptom
Identify the primary failure mode from your network packet captures or API logs:
1. PMTUD Fragmentation Drop (Network Blackhole)
🩸 The "Bleeding" Indicator (Raw Log):
# Seen via tcpdump during a stalled SSH or HTTPS session
IP 172.31.0.1 > 10.0.0.5: ICMP 172.31.10.5 unreachable - need to frag (mtu 1500), length 556
🧠 Under the Hood (5 Whys Root Cause): Jumbo frames (9001-byte MTU) attempt to cross a VPN or Transit Gateway hop that only supports a 1500-byte MTU. The transit hop drops the oversized packet and sends back an ICMP Type 3 Code 4 (Destination Unreachable: Fragmentation Needed) message. However, the EC2 instance's Security Group blocks inbound ICMP, blinding the OS to the drop. Path MTU Discovery (PMTUD) fails, creating a silent traffic blackhole.
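Before touching any rules, the blackhole can be confirmed with DF-bit pings. A minimal sketch, assuming a Linux instance and using 10.0.0.5 as a placeholder far endpoint (the ICMP payload is the MTU minus the 20-byte IPv4 header and 8-byte ICMP header):

```shell
# Largest ICMP payload that fits a given path MTU:
# payload = MTU - 20 (IPv4 header) - 8 (ICMP header)
MTU=1500
PAYLOAD=$((MTU - 28))   # 1472 bytes for a standard 1500-byte path
# -M do sets the Don't Fragment bit; if this probe stalls while a
# smaller payload (e.g. -s 1200) succeeds, PMTUD is blackholed.
echo "ping -M do -s ${PAYLOAD} -c 3 10.0.0.5"
```

If an 8973-byte probe (9001 minus headers) dies while the 1472-byte one succeeds, the 1500-byte transit hop is confirmed.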
🛠️ The Fix (CLI Remediation):
# Option A: Fix PMTUD by allowing the required ICMP "Fragmentation Needed" messages through the Security Group
# (for ICMP rules, FromPort is the ICMP type and ToPort is the ICMP code: here Type 3, Code 4)
aws ec2 authorize-security-group-ingress \
--group-id sg-0123456789abcdef0 \
--ip-permissions 'IpProtocol=icmp,FromPort=3,ToPort=4,IpRanges=[{CidrIp=0.0.0.0/0}]'
# Option B: If ICMP MUST stay blocked, forcefully shrink the instance MTU down to standard Ethernet size
# (takes effect immediately but is not persistent; bake it into the network config to survive reboots)
sudo ip link set dev eth0 mtu 1500
⚠️ Blast Radius: Lowering MTU to 1500 limits maximum throughput and increases CPU overhead per Gbps due to higher packet counts, but instantly restores connectivity across VPN tunnels.
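Where only tunnel-bound traffic needs the smaller MTU, a narrower variant of Option B clamps it per route instead of on the whole interface, so intra-VPC flows keep jumbo frames. A sketch with a placeholder remote CIDR (10.8.0.0/16) and a placeholder next hop (172.31.0.1):

```shell
# Clamp MTU only for traffic crossing the tunnel; "lock" also pins
# the derived TCP MSS even if PMTUD later advertises a larger value.
sudo ip route replace 10.8.0.0/16 via 172.31.0.1 dev eth0 mtu lock 1500
```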
2. Untracked Security Group Connection Sever
🩸 The "Bleeding" Indicator:
# Application log output mid-transfer
[ERROR] Server unexpectedly closed network connection. Connection reset by peer.
🧠 Under the Hood: Security groups are stateful: connection tracking automatically allows return traffic for permitted flows. But when a rule permits all traffic (0.0.0.0/0 on all ports, in both directions), AWS leaves those flows untracked rather than consuming connection-table capacity. If an engineer later tightens that rule, there is no tracked state to grandfather in the active sessions, so every established TCP connection through the rule is dropped immediately.
🛠️ The Fix:
# 1. Temporarily revert the SG rule back to the untracked state to stop the bleeding
aws ec2 authorize-security-group-ingress \
--group-id sg-0123456789abcdef0 \
--protocol tcp --port 80 --cidr 0.0.0.0/0
# 2. To implement least-privilege rules safely: Set up an ALB/NLB in front, OR wait for a maintenance window to drain active connections before tightening.
⚠️ Blast Radius: Modifying rules from untracked (0.0.0.0/0) to tracked (specific IPs) is extremely destructive to live traffic. Prepare for an instant hard-sever of all active client sessions.
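Before tightening the rule in a maintenance window, it helps to size the blast radius first. A minimal sketch, assuming a Linux instance with iproute2's `ss`, counting the sessions a hard-sever would hit:

```shell
# -H drops the header row, -t selects TCP, -n skips name resolution;
# every remaining line is one established session at risk.
ESTABLISHED=$(ss -Htn state established | wc -l)
echo "active TCP sessions at risk: ${ESTABLISHED}"
```

Drain or re-check until this count is near zero before flipping the rule from untracked to tracked.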
3. Elastic IP Transfer Blocked by Reverse DNS
🩸 The "Bleeding" Indicator:
{
"ErrorCode": "InvalidTransfer.AddressCustomPtrSet",
"ErrorMessage": "The Elastic IP address cannot be transferred because it has a custom Reverse DNS (PTR) record configured."
}
🧠 Under the Hood: An engineer attempted to transfer an Elastic IP (EIP) to a different AWS account. The EC2 control plane intercepts and aborts the transfer because the EIP still has a custom Reverse DNS (PTR) record attached. This is a hard security measure that prevents accidental cross-account DNS spoofing.
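The blocker can be confirmed from the SOURCE account before retrying; `describe-addresses-attribute` shows any custom PTR record still attached (the allocation ID is a placeholder):

```shell
# An empty Addresses list means the attribute is already at the AWS
# default and the PTR record no longer blocks the transfer.
aws ec2 describe-addresses-attribute \
--allocation-ids eipalloc-0abcdef1234567890 \
--attribute domain-name
```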
🛠️ The Fix:
# 1. In the SOURCE account, reset the EIP's reverse DNS attribute back to the AWS default
aws ec2 reset-address-attribute \
--allocation-id eipalloc-0abcdef1234567890 \
--attribute domain-name
# 2. Re-initiate the cross-account EIP transfer
aws ec2 enable-address-transfer \
--allocation-id eipalloc-0abcdef1234567890 \
--transfer-account-id 123456789012
⚠️ Blast Radius: Resetting the PTR record is safe for connectivity, but mail servers (like Postfix/Exchange) sending from that EIP will immediately start failing strict spam checks (forward-confirmed reverse DNS, FCrDNS) until the destination account configures a new PTR record.
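Once the transfer completes, the DESTINATION account can re-attach a PTR record to restore FCrDNS. A sketch with a placeholder hostname (the forward A record must already resolve to the EIP before AWS accepts the attribute):

```shell
# Re-create the custom reverse DNS (PTR) record on the transferred EIP
aws ec2 modify-address-attribute \
--allocation-id eipalloc-0abcdef1234567890 \
--domain-name mail.example.com
```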