Incident Report: EFA, SRD, and ENI Provisioning Failures

Sanitized post-mortem data extracted from official AWS Sev-1 documentation. Select the exact symptom below to traverse the diagnostic tree.

🚨 T-0: Monitoring Alert / Select Initial Symptom

Identify the primary failure mode:

1. EBS Volume Detachment Deadlock (State: busy)

🩸 The "Bleeding" Indicator (Raw Log):

"Attachments": [
  {
    "AttachTime": "2016-07-21T23:44:52.000Z",
    "InstanceId": "i-fedc9876",
    "VolumeId": "vol-1234abcd",
    "State": "busy",
    "DeleteOnTermination": false,
    "Device": "/dev/sdf"
  }
]

🧠 Under the Hood (5 Whys Root Cause): An API detach was requested while the guest OS still had the block device mounted. EC2 does not complete the detach while the device is in use, because pulling it out from under the filesystem could corrupt it. The attachment therefore stays in the busy state until the volume is unmounted in the guest or the detach is forced.
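
Because the root cause is an active mount inside the guest, it can help to confirm that from the instance before touching the API again. The following is a minimal sketch, not an official procedure: `device_is_mounted` is a hypothetical helper name, and the optional second argument exists only so the check can be run against a sample mount table.

```shell
#!/bin/sh
# Hypothetical helper (not part of any AWS tooling): succeeds if DEVICE, or a
# source whose name starts with DEVICE (such as a partition), is listed in the
# mount table. MOUNTS_FILE defaults to /proc/mounts.
device_is_mounted() {
  device="$1"
  mounts_file="${2:-/proc/mounts}"
  # The first field of each mount table line is the source device.
  awk -v dev="$device" '
    $1 == dev || index($1, dev) == 1 { found = 1 }
    END { exit !found }
  ' "$mounts_file"
}
```

Usage on the instance: `device_is_mounted /dev/sdf && echo "still mounted"`. Pass the device name as the guest sees it, which may differ from the name shown in the API output.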

🛠️ The Fix (CLI Remediation):

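# 1. SSH into the instance and unmount the volume so pending writes are flushed.
#    The guest may expose /dev/sdf under a different name (for example /dev/xvdf
#    or an NVMe device); check with lsblk before unmounting.
sudo umount /dev/sdf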

# 2. If the OS is unresponsive and a clean unmount is impossible, force the detachment
aws ec2 detach-volume --volume-id vol-1234abcd --force

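⚠️ Blast Radius: Forcing detachment denies the guest OS the chance to flush file system caches and metadata. It can cause file system corruption and data loss if I/O operations were in flight, so run a file system check on the volume before reusing it. Use only as a last resort on failed instances.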
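
To keep the forced path a deliberate decision, the state check can be scripted so that it only reports and never forces. This is a sketch under assumptions: `attachment_state` and `confirm_detached` are hypothetical wrapper names, and the AWS CLI is assumed to be configured with credentials and a default Region.

```shell
#!/bin/sh
# Hypothetical wrappers around the AWS CLI; names are illustrative only.

# Print the state of the volume's first attachment (for example "busy").
attachment_state() {
  aws ec2 describe-volumes --volume-ids "$1" \
    --query 'Volumes[0].Attachments[0].State' --output text
}

# After unmounting in the guest, wait for the volume to become available.
# If the built-in waiter gives up, report the current state and return
# non-zero; the decision to use --force is left to the operator.
confirm_detached() {
  volume_id="$1"
  if aws ec2 wait volume-available --volume-ids "$volume_id"; then
    echo "$volume_id detached"
  else
    echo "$volume_id still attached (state: $(attachment_state "$volume_id"))" >&2
    return 1
  fi
}
```

Usage: `confirm_detached vol-1234abcd || echo "escalate before forcing"`.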

2. EC2 Instance Connect Host Key Rejection

🩸 The "Bleeding" Indicator:

Error: Host key validation failed for EC2 Instance Connect

🧠 Under the Hood: The instance's host keys were rotated locally, and rotated keys are not uploaded automatically to the AWS trusted host keys database. EIC sees a fingerprint it does not recognize and refuses the connection to guard against a MITM attack.

🛠️ The Fix:

# 1. Bypass EIC and connect via standard SSH using the raw PEM key
ssh -i my_ec2_private_key.pem ec2-user@<instance-public-dns>

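# 2. Run the EIC host key harvester script to upload the new fingerprints.
#    The path below is the Amazon Linux 2 location; on Ubuntu the script is
#    expected at /usr/share/ec2-instance-connect/eic_harvest_hostkeys.
sudo /opt/aws/bin/eic_harvest_hostkeys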

⚠️ Blast Radius: Safe to execute in production. Does not interrupt existing SSH sessions or application traffic.
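
Since the harvester script does not live at the same path on every distribution, a small lookup avoids hard-coding it. This is a sketch, assuming the two locations below (Amazon Linux 2 and Ubuntu, as recalled from the EC2 Instance Connect documentation); `find_harvester` is a hypothetical helper name, and other images may place the script elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: print the first executable path among the arguments.
# With no arguments it checks two assumed default locations
# (Amazon Linux 2 first, then Ubuntu).
find_harvester() {
  if [ "$#" -eq 0 ]; then
    set -- /opt/aws/bin/eic_harvest_hostkeys \
           /usr/share/ec2-instance-connect/eic_harvest_hostkeys
  fi
  for candidate in "$@"; do
    if [ -x "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

Usage on the instance: `harvester=$(find_harvester) && sudo "$harvester"`. To see which fingerprints will be uploaded, `ssh-keygen -lf` can be run against each `/etc/ssh/ssh_host_*_key.pub` file first.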

3. Immediate Instance Termination on Launch (VolumeLimitExceeded)

🩸 The "Bleeding" Indicator:

"StateReason": {
  "Message": "Client.VolumeLimitExceeded: Volume limit exceeded",
  "Code": "Server.InternalError"
}

🧠 Under the Hood: The launch request pushed the account past its EBS storage quota for the Region. The control plane cannot allocate the root block device, so it aborts the launch and the instance goes straight from pending to terminated without ever reaching running.
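
The state reason can be pulled straight from the terminated instance to confirm this diagnosis before deleting anything. A minimal sketch, assuming a configured AWS CLI; `termination_reason` and `is_volume_limit_failure` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers around the AWS CLI; names are illustrative only.

# Print the StateReason message recorded for an instance.
termination_reason() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].StateReason.Message' --output text
}

# Succeed only if the recorded reason mentions VolumeLimitExceeded.
is_volume_limit_failure() {
  case "$(termination_reason "$1")" in
    *VolumeLimitExceeded*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `is_volume_limit_failure <instance-id> && echo "EBS quota is the cause"`. Terminated instances remain visible to `describe-instances` only for a short time, so run the check soon after the failed launch.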

🛠️ The Fix:

# 1. Identify orphaned or unattached volumes eating up quota
aws ec2 describe-volumes --filters Name=status,Values=available --query "Volumes[*].VolumeId"

# 2. Delete unused volumes to free up quota
aws ec2 delete-volume --volume-id <orphaned-vol-id>

# 3. Retry the instance launch. If no volumes can safely be deleted, request an EBS quota increase instead.

⚠️ Blast Radius: Deleting available volumes is destructive and permanent. Verify volumes contain no required data before purging.
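
Listing only volume IDs gives little to go on when deciding what is safe to purge. The sketch below adds size and creation time to the listing and totals the reclaimable storage. It assumes a configured AWS CLI; `list_unattached_volumes` and `total_unattached_gib` are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers around the AWS CLI; names are illustrative only.

# One line per unattached volume: ID, size in GiB, creation time (tab-separated).
list_unattached_volumes() {
  aws ec2 describe-volumes \
    --filters Name=status,Values=available \
    --query 'Volumes[*].[VolumeId,Size,CreateTime]' --output text
}

# Sum the size column to estimate how much quota a cleanup would release.
total_unattached_gib() {
  list_unattached_volumes | awk '{ total += $2 } END { print total + 0 }'
}
```

Usage: review the output of `list_unattached_volumes` with the volume owners before running any `delete-volume` command. Where there is doubt about a volume's contents, taking a snapshot with `aws ec2 create-snapshot --volume-id <orphaned-vol-id>` first keeps a recovery path open.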