Incident Report: Stale SG Peering Drops & EFA OS-Bypass Routing Failures

Sanitized post-mortem data extracted from official AWS Sev-1 documentation. Select the exact symptom below to traverse the diagnostic tree.

🚨 T-0: Monitoring Alert / Select Initial Symptom

Identify the primary failure mode from your network topology or HPC logs:

1. EFA OS-Bypass Subnet Routing Failure (Silent Drop)

🩸 The "Bleeding" Indicator (Raw Log):

# Standard TCP/ping works, but MPI/NCCL hangs indefinitely
[WARN] ep_read: node 10.0.2.55 is unreachable via EFA.
# tcpdump on the destination shows nothing either way: SRD traffic bypasses the kernel network stack, so packet capture cannot confirm delivery.

🧠 Under the Hood (5 Whys Root Cause): The Elastic Fabric Adapter (EFA) uses OS-bypass to deliver microsecond-scale latency. EFA OS-bypass traffic is not routable and is confined to a single subnet. When HPC instances are deployed across different subnets or Availability Zones, standard IP traffic (such as SSH or ping) still routes through the VPC router. That success masks the real failure: the Nitro hypervisor silently drops every OS-bypass packet at the subnet boundary, so MPI/NCCL hangs while ordinary connectivity checks pass.

🛠️ The Fix (CLI Remediation):

# 1. Audit the subnets of your HPC cluster instances to prove they are split
aws ec2 describe-instances \
    --instance-ids i-0abcdef1234567890 i-0987654321fedcba0 \
    --query "Reservations[*].Instances[*].[InstanceId, SubnetId, Placement.AvailabilityZone]" \
    --output table

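# 2. To fix OS-bypass, terminate the instances that sit outside the target subnet and relaunch them into one shared subnet.
# 3. Best practice: launch into a cluster placement group, which keeps the instances in a single AZ.
#    A placement group does not pin the subnet, so specify the same subnet for every instance at launch.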
aws ec2 create-placement-group --group-name hpc-cluster-pg --strategy cluster
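
The audit in step 1 can also be scripted. What follows is a minimal sketch, assuming the `describe-instances` response was saved as JSON (`--output json`, without the `--query` filter). The function name `group_by_subnet` and the subnet IDs in the sample are hypothetical, not taken from the incident.

```python
import json
from collections import defaultdict


def group_by_subnet(describe_instances_json: str) -> dict[str, list[str]]:
    """Map each SubnetId to the instance IDs launched in it."""
    response = json.loads(describe_instances_json)
    by_subnet: dict[str, list[str]] = defaultdict(list)
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            subnet_id = instance.get("SubnetId")
            if subnet_id is None:
                # Terminated instances no longer report a subnet; skip them.
                continue
            by_subnet[subnet_id].append(instance["InstanceId"])
    return dict(by_subnet)


# Hypothetical sample shaped like `aws ec2 describe-instances --output json`.
sample = json.dumps({
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0abcdef1234567890", "SubnetId": "subnet-aaaa"}]},
        {"Instances": [{"InstanceId": "i-0987654321fedcba0", "SubnetId": "subnet-bbbb"}]},
    ]
})

subnets = group_by_subnet(sample)
if len(subnets) > 1:
    print(f"Cluster is split across {len(subnets)} subnets: {sorted(subnets)}")
```

More than one key in the result means the cluster is split and OS-bypass traffic between the groups will not be delivered.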

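⚠️ Blast Radius: An existing instance cannot be moved to a different subnet, so every instance outside the target subnet must be terminated and relaunched. In practice this means a cluster teardown and redeployment; plan for a full outage of the affected HPC job queue.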

2. Stale Security Group Rules (VPC Peering Disconnect)

🩸 The "Bleeding" Indicator:

{
    "StaleSecurityGroupSet": [
        {
            "GroupId": "sg-0123456789abcdef0",
            "StaleIpPermissions": [
                {
                    "UserIdGroupPairs": [
                        {
                            "GroupId": "sg-peer-deleted",
                            "PeeringStatus": "deleted"
                        }
                    ]
                }
            ]
        }
    ]
}

🧠 Under the Hood: Your local Security Group rule referenced a Security Group ID in a peered VPC. Either the VPC peering connection was deleted or the peer AWS account deleted its security group, which breaks the cross-VPC reference. AWS then reports the local rule as stale, and traffic that depended on that rule is no longer allowed.
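
To see which local rules are affected, the stale-group output can be flattened into one row per broken reference. This is a minimal sketch, assuming the JSON shape shown in the indicator above; the function name `stale_references` is hypothetical.

```python
import json


def stale_references(stale_json: str) -> list[tuple[str, str, str]]:
    """Return (local_group_id, peer_group_id, peering_status) per stale ingress reference."""
    response = json.loads(stale_json)
    found = []
    for group in response.get("StaleSecurityGroupSet", []):
        for permission in group.get("StaleIpPermissions", []):
            for pair in permission.get("UserIdGroupPairs", []):
                found.append((
                    group["GroupId"],
                    pair.get("GroupId", ""),
                    pair.get("PeeringStatus", ""),
                ))
    return found


# Sample copied from the indicator output above.
stale_sample = json.dumps({
    "StaleSecurityGroupSet": [{
        "GroupId": "sg-0123456789abcdef0",
        "StaleIpPermissions": [{
            "UserIdGroupPairs": [
                {"GroupId": "sg-peer-deleted", "PeeringStatus": "deleted"}
            ]
        }],
    }]
})

for local_id, peer_id, status in stale_references(stale_sample):
    print(f"{local_id} references {peer_id} (peering status: {status})")
```

The sketch covers ingress rules only; stale egress rules would need the same loop applied to the egress list in the response.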

🛠️ The Fix:

# 1. Identify all stale security groups in your VPC
aws ec2 describe-stale-security-groups --vpc-id vpc-0123456789abcdef0

# 2. Revoke the stale ingress rule. The protocol and port must match the stale rule exactly (tcp/443 is an example).
aws ec2 revoke-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --source-group sg-peer-deleted \
    --protocol tcp --port 443

# 3. If peering was re-established, re-add the rule with the NEW peering connection's target SG.
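
When several stale rules need cleanup, the revoke command in step 2 can be generated from each stale permission entry instead of typed by hand. This is a rough sketch that only builds and prints the command and never calls AWS. The function name `revoke_command` is hypothetical, and the `IpProtocol`, `FromPort`, and `ToPort` values in the sample are assumptions chosen to match the tcp/443 example, since the indicator output above omits them.

```python
import json
import shlex


def revoke_command(local_group_id: str, stale_permission: dict) -> str:
    """Build a revoke-security-group-ingress command for one stale permission entry."""
    rule = {"IpProtocol": stale_permission["IpProtocol"]}
    # Port fields are absent for protocol "-1" (all traffic), so copy them only if present.
    for key in ("FromPort", "ToPort"):
        if key in stale_permission:
            rule[key] = stale_permission[key]
    # Keep only the fields that identify the peer group; UserId matters for cross-account peers.
    rule["UserIdGroupPairs"] = [
        {k: pair[k] for k in ("GroupId", "UserId") if k in pair}
        for pair in stale_permission.get("UserIdGroupPairs", [])
    ]
    args = [
        "aws", "ec2", "revoke-security-group-ingress",
        "--group-id", local_group_id,
        "--ip-permissions", json.dumps([rule]),
    ]
    return shlex.join(args)


# Assumed sample entry: protocol and ports are illustrative, not from the incident data.
permission = {
    "IpProtocol": "tcp",
    "FromPort": 443,
    "ToPort": 443,
    "UserIdGroupPairs": [{"GroupId": "sg-peer-deleted", "PeeringStatus": "deleted"}],
}

print(revoke_command("sg-0123456789abcdef0", permission))
```

Review the printed command before running it, because a revoked rule has to be re-created manually if the peer group turns out to be valid.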

⚠️ Blast Radius: Removing stale rules is generally safe and recommended for security hygiene. Before revoking, confirm with the peer account that the security group or peering connection is not in the middle of being recreated; otherwise you will have to re-add the rule once the peer is back.