Scenario Overview
You've received an alert: your EC2 instance is failing status checks or has become unresponsive. Users can't access the application, and SSH connections are timing out. What do you do?
This is one of the most common scenarios DevOps engineers face. In this guide, I'll walk you through a systematic approach to diagnose and resolve this issue.
Understanding EC2 Status Checks
AWS performs two types of status checks on EC2 instances:
System Status Checks
These checks monitor the AWS infrastructure hosting your instance and detect problems such as:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host
Instance Status Checks
These checks monitor the software and network configuration of your instance and detect problems such as:
- Failed system status checks
- Incorrect networking or startup configuration
- Exhausted memory
- Corrupted file system
- Incompatible kernel
Step-by-Step Troubleshooting Guide
Step 1: Check the AWS Console
First, gather information from the AWS Console:
# Using AWS CLI
aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0 \
  --include-all-instances

Look for:
- System Status Check: impaired or insufficient-data
- Instance Status Check: impaired or insufficient-data
- System Reachability Check: passed or failed
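If you want both results at a glance, a --query filter can pull just the two check states out of the response; the instance ID below is the same example used above:

# Show only the two check states (example instance ID)
aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0 \
  --include-all-instances \
  --query 'InstanceStatuses[0].{System:SystemStatus.Status,Instance:InstanceStatus.Status}' \
  --output table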
Step 2: Review System Logs
Access the system log without needing SSH:
# Get system log
aws ec2 get-console-output \
  --instance-id i-1234567890abcdef0 \
  --latest

Common issues to look for:
- Kernel panic messages
- Disk mount failures
- Network configuration errors
- Out of memory (OOM) killer messages
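A quick way to scan for these is to save the console output to a file and grep it. This is a minimal sketch: the file name is arbitrary, and the raw Output field is base64-encoded, so decode it first if your CLI version doesn't do that for you.

# Save the console output for searching (file name is just an example)
aws ec2 get-console-output \
  --instance-id i-1234567890abcdef0 \
  --latest \
  --output text \
  --query Output > console.log

# Scan for common fatal signatures
grep -iE "kernel panic|out of memory|oom-killer|mount.*fail" console.log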
Step 3: Check CloudWatch Metrics
Review key metrics before the failure:
# Check CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-12-10T00:00:00Z \
  --end-time 2024-12-10T12:00:00Z \
  --period 300 \
  --statistics Average

Look for patterns:
- 100% CPU before failure (runaway process)
- Memory pressure (if CloudWatch agent installed)
- Disk I/O spikes
- Network anomalies
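The same call also works against the per-check metrics (StatusCheckFailed_System and StatusCheckFailed_Instance), which helps pin down which check failed and when. A sketch, using the same example instance and time window:

# Pinpoint when the instance status check started failing
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_Instance \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-12-10T00:00:00Z \
  --end-time 2024-12-10T12:00:00Z \
  --period 300 \
  --statistics Maximum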
Resolution Strategies
For System Status Check Failures
If the system status check fails, the issue is with AWS infrastructure:
- Stop and Start the Instance (not reboot)
aws ec2 stop-instances --instance-ids i-1234567890abcdef0

# Wait for stopped state
aws ec2 start-instances --instance-ids i-1234567890abcdef0

Note: Stop/Start migrates the instance to new hardware. This changes the public IP unless you use an Elastic IP.
- Wait for AWS Resolution - Sometimes AWS is already aware and working on it
- Create AMI and Launch New Instance - For persistent issues
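For the AMI route, something along these lines works; the AMI name, instance type, key pair, and security group below are placeholders you'd replace with your own:

# Create an AMI from the impaired instance
aws ec2 create-image \
  --instance-id i-1234567890abcdef0 \
  --name "rescue-ami-2024-12-10" \
  --no-reboot

# Once the AMI is available, launch a replacement instance from it
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.medium \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0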
For Instance Status Check Failures
If the instance status check fails, the issue is within your instance:
- Detach and Mount Root Volume
# Stop the failed instance
aws ec2 stop-instances --instance-ids i-failed-instance

# Detach the root volume
aws ec2 detach-volume --volume-id vol-049df61146c4d7901

# Attach to a rescue instance
aws ec2 attach-volume \
  --volume-id vol-049df61146c4d7901 \
  --instance-id i-rescue-instance \
  --device /dev/sdf

- Review and Fix Configuration

# On rescue instance, mount the volume
sudo mount /dev/xvdf1 /mnt/rescue

# Check fstab for errors
cat /mnt/rescue/etc/fstab

# Review system logs
less /mnt/rescue/var/log/messages

- Common Fixes
- Remove problematic entries from /etc/fstab
- Fix network configuration in /etc/sysconfig/network-scripts/
- Clear problematic cron jobs
- Increase swap space
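As an illustration of the fstab fix, assuming a hypothetical secondary volume /dev/xvdg1 is what's blocking boot, you can comment it out or make it non-fatal from the rescue instance:

# Comment out the suspect entry while you investigate
sudo sed -i 's|^/dev/xvdg1|#/dev/xvdg1|' /mnt/rescue/etc/fstab

# Or keep the mount but make it non-fatal at boot, e.g.:
# /dev/xvdg1  /data  xfs  defaults,nofail  0  2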
Memory-Related Issues
If OOM (Out of Memory) caused the failure:
# Check for OOM killer in messages
grep -i "out of memory" /mnt/rescue/var/log/messages
grep -i "oom" /mnt/rescue/var/log/messages

Solutions:
- Upgrade to larger instance type
- Add swap space
- Identify and fix memory-leaking applications
- Set up memory-based CloudWatch alarms
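Adding swap is straightforward once the instance is reachable again; the size and path below are examples:

# Create and enable a 2 GB swap file
sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist across reboots
echo '/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab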
Prevention Strategies
- Set Up Monitoring
# Create CloudWatch alarm for status check
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-StatusCheck-Failed" \
  --metric-name StatusCheckFailed \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

- Enable Detailed Monitoring for 1-minute granularity
- Use Auto Recovery for automatic instance recovery
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-AutoRecover" \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover

- Implement Health Checks at the application level
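What application-level health checks look like depends on your stack; if you front the application with an ALB, one option is to point the target group at a dedicated health endpoint. The target group ARN and /health path below are placeholders:

# Point an existing target group at the application's health endpoint
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/0123456789abcdef \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3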
Key Takeaways
- Always check both system and instance status checks
- System issues = Stop/Start (not reboot)
- Instance issues = Investigate logs and configuration
- Monitor proactively with CloudWatch alarms
- Document your troubleshooting steps for future reference
This is Part 1 of our 10-part EC2 Troubleshooting series.