Scenario Overview
You've received an alert: your EC2 instance is failing status checks or has become unresponsive. Users can't access the application, and SSH connections are timing out. What do you do?
This is one of the most common scenarios DevOps engineers face. In this guide, I'll walk you through a systematic approach to diagnose and resolve this issue.
Understanding EC2 Status Checks
AWS performs two types of status checks on EC2 instances:
System Status Checks
These checks monitor the AWS infrastructure hosting your instance and detect problems such as:
- Loss of network connectivity
- Loss of system power
- Software issues on the physical host
- Hardware issues on the physical host
Instance Status Checks
These checks monitor the software and network configuration of your instance and detect problems such as:
- Failed system status checks
- Incorrect networking or startup configuration
- Exhausted memory
- Corrupted file system
- Incompatible kernel
Step-by-Step Troubleshooting Guide
Step 1: Check the AWS Console
First, gather information from the AWS Console:
# Using AWS CLI
aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0 \
  --include-all-instances

Look for:
- System Status Check: impaired or insufficient-data
- Instance Status Check: impaired or insufficient-data
- System Reachability Check: passed or failed
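If you want both results at a glance, a --query filter can pull just the two check states out of the response; the instance ID below is the same example used above:

# Show only the two check states (example instance ID)
aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0 \
  --include-all-instances \
  --query 'InstanceStatuses[0].{System:SystemStatus.Status,Instance:InstanceStatus.Status}' \
  --output table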
Step 2: Review System Logs
Access the system log without needing SSH:
# Get system log
aws ec2 get-console-output \
  --instance-id i-1234567890abcdef0 \
  --latest

Common issues to look for:
- Kernel panic messages
- Disk mount failures
- Network configuration errors
- Out of memory (OOM) killer messages
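A quick way to scan for these is to save the console output to a file and grep it. This is a minimal sketch: the file name is arbitrary, and the raw Output field is base64-encoded, so decode it first if your CLI version doesn't do that for you.

# Save the console output for searching (file name is just an example)
aws ec2 get-console-output \
  --instance-id i-1234567890abcdef0 \
  --latest \
  --output text \
  --query Output > console.log

# Scan for common fatal signatures
grep -iE "kernel panic|out of memory|oom-killer|mount.*fail" console.log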
Step 3: Check CloudWatch Metrics
Review key metrics before the failure:
# Check CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-12-10T00:00:00Z \
  --end-time 2024-12-10T12:00:00Z \
  --period 300 \
  --statistics Average

Look for patterns:
- 100% CPU before failure (runaway process)
- Memory pressure (if CloudWatch agent installed)
- Disk I/O spikes
- Network anomalies
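The same call also works against the per-check metrics (StatusCheckFailed_System and StatusCheckFailed_Instance), which helps pin down which check failed and when. A sketch, using the same example instance and time window:

# Pinpoint when the instance status check started failing
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_Instance \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-12-10T00:00:00Z \
  --end-time 2024-12-10T12:00:00Z \
  --period 300 \
  --statistics Maximum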
Resolution Strategies
For System Status Check Failures
If the system status check fails, the issue is with AWS infrastructure:
- Stop and Start the Instance (not reboot)
aws ec2 stop-instances --instance-ids i-1234567890abcdef0

# Wait for stopped state
aws ec2 start-instances --instance-ids i-1234567890abcdef0

Note: Stop/Start migrates the instance to new hardware. This changes the public IP unless you use an Elastic IP.
- Wait for AWS Resolution - Sometimes AWS is already aware and working on it
- Create AMI and Launch New Instance - For persistent issues
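For the AMI route, something along these lines works; the AMI name, instance type, key pair, and security group below are placeholders you'd replace with your own:

# Create an AMI from the impaired instance
aws ec2 create-image \
  --instance-id i-1234567890abcdef0 \
  --name "rescue-ami-2024-12-10" \
  --no-reboot

# Once the AMI is available, launch a replacement instance from it
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.medium \
  --key-name my-key-pair \
  --security-group-ids sg-0123456789abcdef0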
For Instance Status Check Failures
If the instance status check fails, the issue is within your instance:
- Detach and Mount Root Volume
# Stop the failed instance
aws ec2 stop-instances --instance-ids i-failed-instance

# Detach the root volume
aws ec2 detach-volume --volume-id vol-049df61146c4d7901

# Attach to a rescue instance
aws ec2 attach-volume \
  --volume-id vol-049df61146c4d7901 \
  --instance-id i-rescue-instance \
  --device /dev/sdf

- Review and Fix Configuration

# On rescue instance, mount the volume
sudo mount /dev/xvdf1 /mnt/rescue

# Check fstab for errors
cat /mnt/rescue/etc/fstab

# Review system logs
less /mnt/rescue/var/log/messages

- Common Fixes
- Remove problematic entries from /etc/fstab
- Fix network configuration in /etc/sysconfig/network-scripts/
- Clear problematic cron jobs
- Increase swap space
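As an illustration of the fstab fix, assuming a hypothetical secondary volume /dev/xvdg1 is what's blocking boot, you can comment it out or make it non-fatal from the rescue instance:

# Comment out the suspect entry while you investigate
sudo sed -i 's|^/dev/xvdg1|#/dev/xvdg1|' /mnt/rescue/etc/fstab

# Or keep the mount but make it non-fatal at boot, e.g.:
# /dev/xvdg1  /data  xfs  defaults,nofail  0  2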
Memory-Related Issues
If OOM (Out of Memory) caused the failure:
# Check for OOM killer in messages
grep -i "out of memory" /mnt/rescue/var/log/messages
grep -i "oom" /mnt/rescue/var/log/messages

Solutions:
- Upgrade to larger instance type
- Add swap space
- Identify and fix memory-leaking applications
- Set up memory-based CloudWatch alarms
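Adding swap is straightforward once the instance is reachable again; the size and path below are examples:

# Create and enable a 2 GB swap file
sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist across reboots
echo '/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab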
Prevention Strategies
- Set Up Monitoring
# Create CloudWatch alarm for status check
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-StatusCheck-Failed" \
  --metric-name StatusCheckFailed \
  --namespace AWS/EC2 \
  --statistic Maximum \
  --period 300 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789:alerts

- Enable Detailed Monitoring for 1-minute granularity
- Use Auto Recovery for automatic instance recovery
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-AutoRecover" \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 2 \
  --threshold 1 \
  --alarm-actions arn:aws:automate:us-east-1:ec2:recover

- Implement Health Checks at the application level
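What application-level health checks look like depends on your stack; if you front the application with an ALB, one option is to point the target group at a dedicated health endpoint. The target group ARN and /health path below are placeholders:

# Point an existing target group at the application's health endpoint
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/0123456789abcdef \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3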
Key Takeaways
- Always check both system and instance status checks
- System issues = Stop/Start (not reboot)
- Instance issues = Investigate logs and configuration
- Monitor proactively with CloudWatch alarms
- Document your troubleshooting steps for future reference
This is Part 1 of our 10-part EC2 Troubleshooting series.