EC2 Instance Unresponsive or Failing Status Checks

Learn how to diagnose and fix EC2 instances that become unresponsive or fail AWS status checks. A step-by-step troubleshooting guide.

Scenario Overview

You've received an alert: your EC2 instance is failing status checks or has become unresponsive. Users can't access the application, and SSH connections are timing out. What do you do?

This is one of the most common scenarios DevOps engineers face. In this guide, I'll walk you through a systematic approach to diagnose and resolve this issue.

Understanding EC2 Status Checks

AWS performs two types of status checks on EC2 instances:

System Status Checks

These checks monitor the AWS infrastructure hosting your instance and detect problems such as:

  • Loss of network connectivity
  • Loss of system power
  • Software issues on the physical host
  • Hardware issues on the physical host

Instance Status Checks

These checks monitor the software and network configuration of your individual instance and detect problems such as:

  • Failed system status checks
  • Incorrect networking or startup configuration
  • Exhausted memory
  • Corrupted file system
  • Incompatible kernel
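
Each check also surfaces as its own CloudWatch metric, so you can confirm which one is failing from the CLI. A minimal sketch using the standard StatusCheckFailed_System and StatusCheckFailed_Instance metrics (the instance ID and time window are placeholders):

# 1 = the check failed during the period, 0 = it passed
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name StatusCheckFailed_System \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-12-10T00:00:00Z \
  --end-time 2024-12-10T12:00:00Z \
  --period 300 \
  --statistics Maximum

# Repeat with --metric-name StatusCheckFailed_Instance to isolate instance-level problems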

Step-by-Step Troubleshooting Guide

Step 1: Check the AWS Console

First, gather information from the AWS Console:

# Using AWS CLI
aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0 \
  --include-all-instances

Look for:

  • System Status Check: impaired or insufficient-data
  • Instance Status Check: impaired or insufficient-data
  • The reachability detail under each check: passed or failed
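
To pull just those fields, a --query filter helps; this is a sketch and the instance ID is a placeholder:

# Summarize system status, instance status, and the reachability details
aws ec2 describe-instance-status \
  --instance-ids i-1234567890abcdef0 \
  --include-all-instances \
  --query 'InstanceStatuses[].{System:SystemStatus.Status,Instance:InstanceStatus.Status,Details:InstanceStatus.Details}'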

Step 2: Review System Logs

Access the system log without needing SSH:

# Get system log
aws ec2 get-console-output \
  --instance-id i-1234567890abcdef0 \
  --latest

Common issues to look for:

  • Kernel panic messages
  • Disk mount failures
  • Network configuration errors
  • Out of memory (OOM) killer messages
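
To scan the log for these patterns in one pass, something like the following works. Depending on your CLI version the Output field may be returned base64-encoded; pipe it through base64 -d first if so:

# Extract the console log and grep for the usual suspects
aws ec2 get-console-output \
  --instance-id i-1234567890abcdef0 \
  --latest \
  --query 'Output' \
  --output text | grep -iE 'panic|oom|out of memory|mount|read-only|ext4|xfs'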

Step 3: Check CloudWatch Metrics

Review key metrics before the failure:

# Check CPU utilization
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-12-10T00:00:00Z \
  --end-time 2024-12-10T12:00:00Z \
  --period 300 \
  --statistics Average

Look for patterns:

  • 100% CPU before failure (runaway process)
  • Memory pressure (if the CloudWatch agent is installed)
  • Disk I/O spikes
  • Network anomalies
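
Memory is not published by default. If the CloudWatch agent is running with its default configuration, it reports under the CWAgent namespace; a sketch assuming the default mem_used_percent metric name:

# Memory usage before the failure (requires the CloudWatch agent)
aws cloudwatch get-metric-statistics \
  --namespace CWAgent \
  --metric-name mem_used_percent \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2024-12-10T00:00:00Z \
  --end-time 2024-12-10T12:00:00Z \
  --period 300 \
  --statistics Average Maximum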

Resolution Strategies

For System Status Check Failures

If the system status check fails, the issue is with AWS infrastructure:

  1. Stop and Start the Instance (not reboot)
    aws ec2 stop-instances --instance-ids i-1234567890abcdef0
    # Wait for stopped state
    aws ec2 start-instances --instance-ids i-1234567890abcdef0

    Note: Stop/Start migrates the instance to new hardware. This changes the public IP unless you use an Elastic IP.

  2. Wait for AWS Resolution - Sometimes AWS is already aware of the underlying host issue and working on it
  3. Create AMI and Launch New Instance - For persistent issues (see the sketch after this list)
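
For option 3, a minimal sketch; the AMI name, instance type, subnet, and key pair are placeholders, and --no-reboot avoids stopping the instance at the cost of a potentially inconsistent file system snapshot:

# Snapshot the failed instance as an AMI
aws ec2 create-image \
  --instance-id i-1234567890abcdef0 \
  --name "rescue-ami-2024-12-10" \
  --no-reboot

# Launch a replacement from the new AMI (use the ImageId returned above)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type t3.micro \
  --subnet-id subnet-0123456789abcdef0 \
  --key-name my-key-pair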

For Instance Status Check Failures

If the instance status check fails, the issue is within your instance:

  1. Detach and Mount Root Volume
    # Stop the failed instance
    aws ec2 stop-instances --instance-ids i-failed-instance
    
    # Detach the root volume
    aws ec2 detach-volume --volume-id vol-049df61146c4d7901
    
    # Attach to a rescue instance
    aws ec2 attach-volume \
      --volume-id vol-049df61146c4d7901 \
      --instance-id i-rescue-instance \
      --device /dev/sdf
  2. Review and Fix Configuration
    # On the rescue instance, create a mount point and mount the volume
    sudo mkdir -p /mnt/rescue
    sudo mount /dev/xvdf1 /mnt/rescue   # on Nitro instances the device may appear as /dev/nvme1n1p1
    
    # Check fstab for errors
    cat /mnt/rescue/etc/fstab
    
    # Review system logs
    less /mnt/rescue/var/log/messages
  3. Common Fixes
    • Remove problematic entries from /etc/fstab
    • Fix network configuration in /etc/sysconfig/network-scripts/
    • Clear problematic cron jobs
    • Increase swap space
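
Once the fixes are in place, unmount the volume, move it back, and start the original instance. A sketch assuming the same volume and instance IDs as above and a root device name of /dev/xvda (verify the AMI's root device name; it may be /dev/sda1):

# On the rescue instance
sudo umount /mnt/rescue

# Detach from the rescue instance and reattach to the original instance as its root device
aws ec2 detach-volume --volume-id vol-049df61146c4d7901
aws ec2 attach-volume \
  --volume-id vol-049df61146c4d7901 \
  --instance-id i-failed-instance \
  --device /dev/xvda

# Start the original instance and watch the status checks
aws ec2 start-instances --instance-ids i-failed-instance
aws ec2 describe-instance-status --instance-ids i-failed-instance --include-all-instances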

Memory-Related Issues

If OOM (Out of Memory) caused the failure:

# Check for OOM killer in messages
grep -i "out of memory" /mnt/rescue/var/log/messages
grep -i "oom" /mnt/rescue/var/log/messages

Solutions:

  • Upgrade to larger instance type
  • Add swap space (see the sketch after this list)
  • Identify and fix memory-leaking applications
  • Set up memory-based CloudWatch alarms
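
If you add swap, a minimal sketch run on the instance itself, assuming a 2 GB swap file at /swapfile (size it to your workload):

# Create and enable a 2 GB swap file
sudo fallocate -l 2G /swapfile    # use dd if fallocate is unavailable on your file system
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Persist across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

# Verify
swapon --show
free -h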

Prevention Strategies

  1. Set Up Monitoring
    # Create CloudWatch alarm for status check
    aws cloudwatch put-metric-alarm \
      --alarm-name "EC2-StatusCheck-Failed" \
      --metric-name StatusCheckFailed \
      --namespace AWS/EC2 \
      --statistic Maximum \
      --period 300 \
      --threshold 1 \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
      --evaluation-periods 2 \
      --alarm-actions arn:aws:sns:us-east-1:123456789:alerts
  2. Enable Detailed Monitoring for 1-minute granularity
  3. Use Auto Recovery so AWS automatically recovers the instance after a system status check failure
    aws cloudwatch put-metric-alarm \
      --alarm-name "EC2-AutoRecover" \
      --namespace AWS/EC2 \
      --metric-name StatusCheckFailed_System \
      --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
      --statistic Maximum \
      --period 60 \
      --evaluation-periods 2 \
      --threshold 1 \
      --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-actions arn:aws:automate:us-east-1:ec2:recover
  4. Implement Health Checks at the application level (see the sketch after this list)
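
Status checks only confirm the instance is reachable, not that the application is healthy. For option 4, a minimal application-level probe, assuming a hypothetical /health endpoint on localhost port 8080; run it from cron, a systemd timer, or wire an equivalent into your load balancer's target group health check:

#!/usr/bin/env bash
# health-check.sh - exit non-zero if the app does not answer within 5 seconds
ENDPOINT="http://localhost:8080/health"   # hypothetical endpoint - adjust for your app

if curl --silent --fail --max-time 5 "$ENDPOINT" > /dev/null; then
  echo "healthy"
else
  echo "unhealthy" >&2
  exit 1
fi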

Key Takeaways

  • Always check both system and instance status checks
  • System issues = Stop/Start (not reboot)
  • Instance issues = Investigate logs and configuration
  • Monitor proactively with CloudWatch alarms
  • Document your troubleshooting steps for future reference

This is Part 1 of our 10-part EC2 Troubleshooting series.
