Mike Slinn

Rescuing a Catastrophic Upgrade to Ubuntu 20.10

Published 2020-10-25. Last modified 2022-01-27.
Time to read: 7 minutes.

This site is categorized under AWS, Bash, Ubuntu.

The upgrade from Ubuntu 20.04 to 20.10 has been especially problematic for each of the half-dozen Xubuntu systems that I manage. One important server that I run on Scaleway became unresponsive shortly after the installation started and would not boot, and another important server on AWS ran fine, but did not allow logins.

This blog post details what I did to recover the AWS server using a standard *nix procedure that any competent system administrator would be comfortable with: chroot.

Before Linux had cgroups, we used chroot and its close cousin, jail. I used chroot for the technical basis of Zamples back in 2001.

Because the chroot environment will be set up in a way that it shares the rescue system’s /var/run directory, the rescue system should have all upgrades in place and should be rebooted if /var/run/reboot-required exists.
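
The reboot check described above can be sketched as a small Bash function. The function name reboot_required and its optional path argument are assumptions made for illustration; only the /var/run/reboot-required path comes from the text.

```shell
#!/bin/bash

# Succeeds when the marker file exists.
# $1 - optional path to the marker file (hypothetical parameter, handy for testing)
reboot_required() {
  [ -e "${1:-/var/run/reboot-required}" ]
}

if reboot_required; then
  echo "Reboot the rescue system before setting up the chroot"
else
  echo "No reboot pending"
fi
```

Run this on the rescue system after applying upgrades and before mounting anything.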

AWS also provides a tool called EC2Rescue, which does a complicated series of actions to accomplish something similar. Here is additional documentation. I find the AWS documentation is frequently obtuse, and the approach taken by most AWS products and tools is extremely general. Consequently I often find myself wasting a lot of time trying to get things to work. I don’t subscribe to AWS support; if I had subscribed to expensive enterprise-level support, complete with an AWS expert to hold my hand while I attempted to resurrect the server, I might have tried using EC2Rescue. On the other hand, when pressed with an emergency, I prefer to lean on tried-and-true methods like chroot.

This blog post concludes with two Bash scripts to automate the details. I wrote the second script in late January 2022, about 15 months after this post was first published. The second script is less user-friendly, but more fine-grained. That is a reasonable tradeoff, and both approaches have merit.

Setup

AWS CLI

I prefer the AWS CLI to the web console, and this article uses the AWS CLI exclusively. Installation instructions are here.

jq

I also use jq for parsing JSON in the bash shell. Install it on Debian-style Linux distros such as Ubuntu like this:

Shell
$ sudo apt install jq
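
Before going further it may help to confirm that both tools are on the PATH. The following is a minimal sketch; the helper name missing_tools is made up for this example.

```shell
#!/bin/bash

# Print the name of each listed tool that is not on the PATH.
# missing_tools is an illustrative helper, not part of the AWS CLI or jq.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

# Prints nothing when both tools used in this article are installed.
missing_tools aws jq
```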

Discover Information About the Problem EC2 Instance

Getting the AWS EC2 Instance Information

Because my problem EC2 instance has a tag called Name with Value production, I was able to easily obtain a JSON representation of all the information about it. I stored the JSON in an environment variable called AWS_EC2_PRODUCTION.

The results are shown in unselectable text so that you can easily use this sample code yourself. To copy the code to your clipboard, click on the little copy icon at the top right-hand corner of the scrolling code display area. Because the prompt and the results are unselectable, your clipboard will only pick up the code you need to paste in order to run the example yourself.

Shell
$ AWS_EC2_PRODUCTION="$(
  aws ec2 describe-instances | \
  jq '.Reservations[].Instances[] | select(any(.Tags[]?; .Key=="Name" and .Value=="production"))'
)"

$ echo "$AWS_EC2_PRODUCTION"
{
  "AmiLaunchIndex": 0,
  "ImageId": "ami-e29b9388",
  "InstanceId": "i-825eb905",
  "InstanceType": "t2.small",
  "KeyName": "sslTest",
  "LaunchTime": "2017-10-12T16:24:14.000Z",
  "Monitoring": {
    "State": "disabled"
  },
  "Placement": {
    "AvailabilityZone": "us-east-1c",
    "GroupName": "",
    "Tenancy": "default"
  },
  "PrivateDnsName": "ip-10-0-0-201.ec2.internal",
  "PrivateIpAddress": "10.0.0.201",
  "ProductCodes": [],
  "PublicDnsName": "",
  "PublicIpAddress": "52.207.225.143",
  "State": {
    "Code": 16,
    "Name": "running"
  },
  "StateTransitionReason": "",
  "SubnetId": "subnet-49de033f",
  "VpcId": "vpc-f16a0895",
  "Architecture": "x86_64",
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/sda1",
      "Ebs": {
        "AttachTime": "2016-04-05T19:07:17.000Z",
        "DeleteOnTermination": true,
        "Status": "attached",
        "VolumeId": "vol-1c8903b4"
      }
    }
  ],
  "ClientToken": "GykZz1459883236367",
  "EbsOptimized": false,
  "Hypervisor": "xen",
  "NetworkInterfaces": [
    {
      "Association": {
        "IpOwnerId": "amazon",
        "PublicDnsName": "",
        "PublicIp": "52.207.225.143"
      },
      "Attachment": {
        "AttachTime": "2016-04-05T19:07:16.000Z",
        "AttachmentId": "eni-attach-a58bd15f",
        "DeleteOnTermination": true,
        "DeviceIndex": 0,
        "Status": "attached"
      },
      "Description": "Primary network interface",
      "Groups": [
        {
          "GroupName": "testSG",
          "GroupId": "sg-4cbc6f35"
        }
      ],
      "Ipv6Addresses": [],
      "MacAddress": "0a:a4:be:1b:8e:eb",
      "NetworkInterfaceId": "eni-fa4f65bb",
      "OwnerId": "031372724784",
      "PrivateIpAddress": "10.0.0.201",
      "PrivateIpAddresses": [
        {
          "Association": {
            "IpOwnerId": "amazon",
            "PublicDnsName": "",
            "PublicIp": "52.207.225.143"
          },
          "Primary": true,
          "PrivateIpAddress": "10.0.0.201"
        }
      ],
      "SourceDestCheck": true,
      "Status": "in-use",
      "SubnetId": "subnet-49de033f",
      "VpcId": "vpc-f16a0895",
      "InterfaceType": "interface"
    }
  ],
  "RootDeviceName": "/dev/sda1",
  "RootDeviceType": "ebs",
  "SecurityGroups": [
    {
      "GroupName": "testSG",
      "GroupId": "sg-4cbc6f35"
    }
  ],
  "SourceDestCheck": true,
  "Tags": [
    {
      "Key": "Name",
      "Value": "production"
    }
  ],
  "VirtualizationType": "hvm",
  "CpuOptions": {
    "CoreCount": 1,
    "ThreadsPerCore": 1
  },
  "CapacityReservationSpecification": {
    "CapacityReservationPreference": "open"
  },
  "HibernationOptions": {
    "Configured": false
  },
  "MetadataOptions": {
    "State": "applied",
    "HttpTokens": "optional",
    "HttpPutResponseHopLimit": 1,
    "HttpEndpoint": "enabled"
  }
}
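
As an alternative to filtering the whole listing with jq, EC2 can filter by tag on the server side. This is a sketch under assumptions: the helper name find_instance_by_name is hypothetical, and it returns only the first matching instance, which suits an account with a single instance tagged production.

```shell
#!/bin/bash

# Ask EC2 for the first instance whose Name tag equals $1.
# find_instance_by_name is an illustrative helper name.
find_instance_by_name() {
  aws ec2 describe-instances \
    --filters "Name=tag:Name,Values=$1" \
    --query 'Reservations[].Instances[] | [0]' \
    --output json
}

# Usage (requires configured AWS credentials):
#   AWS_EC2_PRODUCTION="$( find_instance_by_name production )"
```

The --query expression is JMESPath, evaluated by the AWS CLI itself, so jq is not needed for this step.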

Getting the AWS EC2 Problem Instance Id

The instance ID for the problem EC2 instance is easily extracted from the JSON saved in AWS_EC2_PRODUCTION:

Shell
$ AWS_PROBLEM_INSTANCE_ID="$(
  jq -r .InstanceId <<< "$AWS_EC2_PRODUCTION"
)"

$ echo "$AWS_PROBLEM_INSTANCE_ID"
i-825eb905 

Getting the AWS EC2 Problem Instance IP Address

The IP address for the problem EC2 instance is easily extracted from the JSON saved in AWS_EC2_PRODUCTION:

Shell
$ AWS_PROBLEM_IP="$(
  jq -r .PublicIpAddress <<< "$AWS_EC2_PRODUCTION"
)"

$ echo "$AWS_PROBLEM_IP"
52.207.225.143

Getting the AWS EC2 Problem Availability Zone

The AWS availability zone for the problem EC2 instance is easily extracted from the JSON saved in AWS_EC2_PRODUCTION:

Shell
$ AWS_AVAILABILITY_ZONE="$(
  jq -r .Placement.AvailabilityZone <<< "$AWS_EC2_PRODUCTION"
)"

$ echo "$AWS_AVAILABILITY_ZONE"
us-east-1c 

Getting the AWS EC2 Problem Volume ID

The following command line extracts the volume id of the problem server’s system drive into an environment variable called $AWS_PROBLEM_VOLUME_ID:

Shell
$ AWS_PROBLEM_VOLUME_ID="$(
  jq -r '.BlockDeviceMappings[].Ebs.VolumeId' <<< "$AWS_EC2_PRODUCTION"
)"

$ echo "$AWS_PROBLEM_VOLUME_ID"
vol-1c8903b4 
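
Every later step depends on these variables, and jq -r prints the word null when a key is missing, so a quick sanity check can save time. This is a minimal sketch; require_vars is a helper name invented for this example.

```shell
#!/bin/bash

# Fail if any named variable is unset, empty, or the literal string "null"
# (which is what jq -r prints for a missing key).
# require_vars is an illustrative helper, not from the original procedure.
require_vars() {
  local name value missing=0
  for name in "$@"; do
    value="${!name:-}"
    if [ -z "$value" ] || [ "$value" = null ]; then
      >&2 echo "Error: $name is empty, unset or null"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `require_vars AWS_PROBLEM_INSTANCE_ID AWS_PROBLEM_IP AWS_AVAILABILITY_ZONE AWS_PROBLEM_VOLUME_ID` returns a non-zero status and names the offending variables if any extraction above failed.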

Make a Snapshot of the Problem Server

One approach, which would be living dangerously, would be to mount the system volume of the problem server on another server, set up chroot, attempt to repair the drive image, remount the repaired drive on the problem server, and reboot the server. I am never that optimistic. Things invariably go wrong. Instead, we will take a snapshot of the problem drive, turn the snapshot into a volume, repair the volume, swap in the repaired volume on the problem system, and reboot that system.

It is better to shut down the EC2 instance before making a snapshot; however, a snapshot can be taken whenever the server is idling. We will need to shut down the server anyway, so that could be done now, or at the last minute.

I made a snapshot with a description like production 2020-10-25, tagged it with Created and Name tags, and saved the snapshot id in an environment variable called AWS_PROBLEM_SNAPSHOT_ID:

Shell
$ AWS_PROBLEM_SNAPSHOT_ID="$(
  aws ec2 create-snapshot --volume-id "$AWS_PROBLEM_VOLUME_ID" \
    --description "production $(date '+%Y-%m-%d')" \
    --tag-specifications "ResourceType=snapshot,Tags=[{Key=Created,Value=$(date '+%Y-%m-%d')},{Key=Name,Value=\"Broken do-release-upgrade 20.04 to 20.10\"}]" | \
  jq -r .SnapshotId
)"

$ echo "$AWS_PROBLEM_SNAPSHOT_ID"
snap-0a856be1f58b8a856 

$ aws ec2 wait snapshot-completed --snapshot-ids "$AWS_PROBLEM_SNAPSHOT_ID"

Snapshots only take a few minutes to complete. The aws ec2 wait command blocks until the specified operation finishes.
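
A waiter does not block forever. As far as I know, snapshot-completed polls every 15 seconds and gives up with a non-zero exit status after 40 checks, which a large volume could exceed. A generic retry wrapper is one way to cope; the function name retry is my own, and this is only a sketch.

```shell
#!/bin/bash

# Run a command up to $1 times, stopping at the first success.
# retry is an illustrative helper, not an AWS CLI feature.
retry() {
  local attempts="$1" n
  shift
  for (( n = 1; n <= attempts; n++ )); do
    "$@" && return 0
    >&2 echo "Attempt $n of $attempts failed: $*"
  done
  return 1
}

# Usage (requires configured AWS credentials):
#   retry 3 aws ec2 wait snapshot-completed --snapshot-ids "$AWS_PROBLEM_SNAPSHOT_ID"
```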

Create Rescue Volume From Snapshot

Once the snapshot process has completed, create a new volume from the snapshot. The default volume type is gp2. We’ll refer to this volume as $AWS_RESCUE_VOLUME_ID. It is important to create the volume in the same availability zone as the problem EC2 instance so that it can easily be attached. This command applies a tag called Name, with the value rescue, for easy identification.

Shell
$ AWS_RESCUE_VOLUME_ID="$(
  aws ec2 create-volume \
    --availability-zone "$AWS_AVAILABILITY_ZONE" \
    --snapshot-id "$AWS_PROBLEM_SNAPSHOT_ID" \
    --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=rescue}]' | \
  jq -r .VolumeId
)"

$ echo "$AWS_RESCUE_VOLUME_ID"
vol-0e20fd22d2dc5a933 

$ aws ec2 wait volume-available --volume-ids "$AWS_RESCUE_VOLUME_ID"

Use an EC2 Spot Instance For the Rescue Server

Now that the rescue volume is available, we need to mount it on a server, which I’ll call the rescue server. We’ll refer to the server where the rescue volume is prepared via its instance id, saved as AWS_EC2_RESCUE_ID. You can either create a new EC2 instance for this purpose, or use an existing EC2 instance.

The rescue server does not need to be anything special; a tiny virtual machine of any description will do fine. However, some rescue operations will be much easier if the type of operating system is the same as that on the problem drive. Yesterday I blogged about how to find a suitable AMI, and determine its image-id.

Shell
$ AWS_AMI="$(
  aws ec2 describe-images \
    --owners 099720109477 \
    --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-*-*-amd64-server-????????" \
              "Name=state,Values=available" \
    --query "reverse(sort_by(Images, &CreationDate))[:1]" | \
  jq -r '.[0]'
)"

$ echo "$AWS_AMI"
{
  "Architecture": "x86_64",
  "CreationDate": "2020-10-30T14:07:42.000Z",
  "ImageId": "ami-0c71ec98278087e60",
  "ImageLocation": "099720109477/ubuntu/images/hvm-ssd/ubuntu-groovy-20.10-amd64-server-20201030",
  "ImageType": "machine",
  "Public": true,
  "OwnerId": "099720109477",
  "PlatformDetails": "Linux/UNIX",
  "UsageOperation": "RunInstances",
  "State": "available",
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/sda1",
      "Ebs": {
        "DeleteOnTermination": true,
        "SnapshotId": "snap-00bf581086dd686e5",
        "VolumeSize": 8,
        "VolumeType": "gp2",
        "Encrypted": false
      }
    },
    {
      "DeviceName": "/dev/sdb",
      "VirtualName": "ephemeral0"
    },
    {
      "DeviceName": "/dev/sdc",
      "VirtualName": "ephemeral1"
    }
  ],
  "Description": "Canonical, Ubuntu, 20.10, amd64 groovy image build on 2020-10-30",
  "EnaSupport": true,
  "Hypervisor": "xen",
  "Name": "ubuntu/images/hvm-ssd/ubuntu-groovy-20.10-amd64-server-20201030",
  "RootDeviceName": "/dev/sda1",
  "RootDeviceType": "ebs",
  "SriovNetSupport": "simple",
  "VirtualizationType": "hvm"
} 

Now let's extract the ID of the AMI image and save it as AWS_AMI_ID.

Shell
$ AWS_AMI_ID="$( jq -r '.ImageId' <<< "$AWS_AMI" )"

$ echo "$AWS_AMI_ID"
ami-0c71ec98278087e60 

Volumes can be attached to running and stopped server instances. The load on the rescue server will likely be light and short-lived. An EC2 spot instance is ideal, and only costs two cents per hour! The spot instance will likely only be needed for 15 minutes. I specified my subnet id as SubnetId, the security group sg-4cbc6f35 and the AvailabilityZone.
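
Spot prices vary by instance type and availability zone, so it can be worth checking the current price first. The sketch below wraps aws ec2 describe-spot-price-history; the helper name current_spot_price is an assumption, and passing the current time as --start-time is intended to return the price in effect right now.

```shell
#!/bin/bash

# Print the spot price currently in effect.
# $1 - instance type, $2 - availability zone
# current_spot_price is an illustrative helper name.
current_spot_price() {
  aws ec2 describe-spot-price-history \
    --instance-types "$1" \
    --availability-zone "$2" \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" \
    --query 'SpotPriceHistory[0].SpotPrice' \
    --output text
}

# Usage (requires configured AWS credentials):
#   current_spot_price t2.small us-east-1c
```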

Shell
$ AWS_EC2_RESCUE="$(
  aws ec2 run-instances \
    --image-id "$AWS_AMI_ID" \
    --instance-market-options '{ "MarketType": "spot" }' \
    --instance-type t2.small \
    --key-name rsa-2020-11-03.pub \
    --network-interfaces '[ {
        "DeviceIndex": 0,
        "Groups": ["sg-4cbc6f35"],
        "SubnetId": "subnet-49de033f",
        "DeleteOnTermination": true,
        "AssociatePublicIpAddress": true
      } ]' \
    --placement '{ "AvailabilityZone": "us-east-1c" }'
)"

$ echo "$AWS_EC2_RESCUE"
{
    "Groups": [],
    "Instances": [
        {
            "AmiLaunchIndex": 0,
            "ImageId": "ami-0dba2cb6798deb6d8",
            "InstanceId": "i-012a54aefcd333de9",
            "InstanceType": "t2.small",
            "KeyName": "rsa-2020-11-03.pub",
            "LaunchTime": "2020-11-03T23:19:50.000Z",
            "Monitoring": {
                "State": "disabled"
            },
            "Placement": {
                "AvailabilityZone": "us-east-1c",
                "GroupName": "",
                "Tenancy": "default"
            },
            "PrivateDnsName": "ip-10-0-0-210.ec2.internal",
            "PrivateIpAddress": "10.0.0.210",
            "ProductCodes": [],
            "PublicDnsName": "",
            "State": {
                "Code": 0,
                "Name": "pending"
            },
            "StateTransitionReason": "",
            "SubnetId": "subnet-49de033f",
            "VpcId": "vpc-f16a0895",
            "Architecture": "x86_64",
            "BlockDeviceMappings": [],
            "ClientToken": "026583fb-c94e-4bca-bdd2-8dcdcaa3aae9",
            "EbsOptimized": false,
            "EnaSupport": true,
            "Hypervisor": "xen",
            "InstanceLifecycle": "spot",
            "NetworkInterfaces": [
                {
                    "Attachment": {
                        "AttachTime": "2020-11-03T23:19:50.000Z",
                        "AttachmentId": "eni-attach-04feb4d36cf5c6792",
                        "DeleteOnTermination": true,
                        "DeviceIndex": 0,
                        "Status": "attaching"
                    },
                    "Description": "",
                    "Groups": [
                        {
                            "GroupName": "testSG",
                            "GroupId": "sg-4cbc6f35"
                        }
                    ],
                    "Ipv6Addresses": [],
                    "MacAddress": "0a:6d:ba:c5:65:4b",
                    "NetworkInterfaceId": "eni-09ef90920cfb29dd9",
                    "OwnerId": "031372724784",
                    "PrivateIpAddress": "10.0.0.210",
                    "PrivateIpAddresses": [
                        {
                            "Primary": true,
                            "PrivateIpAddress": "10.0.0.210"
                        }
                    ],
                    "SourceDestCheck": true,
                    "Status": "in-use",
                    "SubnetId": "subnet-49de033f",
                    "VpcId": "vpc-f16a0895",
                    "InterfaceType": "interface"
                }
            ],
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SecurityGroups": [
                {
                    "GroupName": "testSG",
                    "GroupId": "sg-4cbc6f35"
                }
            ],
            "SourceDestCheck": true,
            "SpotInstanceRequestId": "sir-rrs9gm3j",
            "StateReason": {
                "Code": "pending",
                "Message": "pending"
            },
            "VirtualizationType": "hvm",
            "CpuOptions": {
                "CoreCount": 1,
                "ThreadsPerCore": 1
            },
            "CapacityReservationSpecification": {
                "CapacityReservationPreference": "open"
            },
            "MetadataOptions": {
                "State": "pending",
                "HttpTokens": "optional",
                "HttpPutResponseHopLimit": 1,
                "HttpEndpoint": "enabled"
            }
        }
    ],
    "OwnerId": "031372724784",
    "ReservationId": "r-0d45e1919e7bad5c9"
} 

We can use jq to extract the EC2 InstanceId of the spot instance and save it as AWS_EC2_RESCUE_ID:

Shell
$ AWS_EC2_RESCUE_ID="$(
  jq -r '.Instances[].InstanceId' <<< "$AWS_EC2_RESCUE"
)"

$ echo "$AWS_EC2_RESCUE_ID"
i-012a54aefcd333de9

We need to retrieve the IP address of the newly created EC2 spot instance. This instance will disappear (terminate) once it shuts down, so do not shut it down until the rescue is finished.

Shell
$ aws ec2 describe-instances --instance-ids "$AWS_EC2_RESCUE_ID"
{
    "Reservations": [
        {
            "Groups": [],
            "Instances": [
                {
                    "AmiLaunchIndex": 0,
                    "ImageId": "ami-0dba2cb6798deb6d8",
                    "InstanceId": "i-012a54aefcd333de9",
                    "InstanceType": "t2.small",
                    "KeyName": "rsa-2020-11-03.pub",
                    "LaunchTime": "2020-11-03T23:19:50.000Z",
                    "Monitoring": {
                        "State": "disabled"
                    },
                    "Placement": {
                        "AvailabilityZone": "us-east-1c",
                        "GroupName": "",
                        "Tenancy": "default"
                    },
                    "PrivateDnsName": "ip-10-0-0-210.ec2.internal",
                    "PrivateIpAddress": "10.0.0.210",
                    "ProductCodes": [],
                    "PublicDnsName": "",
                    "PublicIpAddress": "54.242.88.254",
                    "State": {
                        "Code": 16,
                        "Name": "running"
                    },
                    "StateTransitionReason": "",
                    "SubnetId": "subnet-49de033f",
                    "VpcId": "vpc-f16a0895",
                    "Architecture": "x86_64",
                    "BlockDeviceMappings": [
                        {
                            "DeviceName": "/dev/sda1",
                            "Ebs": {
                                "AttachTime": "2020-11-03T23:19:51.000Z",
                                "DeleteOnTermination": true,
                                "Status": "attached",
                                "VolumeId": "vol-0c44c8c009d1fafda"
                            }
                        }
                    ],
                    "ClientToken": "026583fb-c94e-4bca-bdd2-8dcdcaa3aae9",
                    "EbsOptimized": false,
                    "EnaSupport": true,
                    "Hypervisor": "xen",
                    "InstanceLifecycle": "spot",
                    "NetworkInterfaces": [
                        {
                            "Association": {
                                "IpOwnerId": "amazon",
                                "PublicDnsName": "",
                                "PublicIp": "54.242.88.254"
                            },
                            "Attachment": {
                                "AttachTime": "2020-11-03T23:19:50.000Z",
                                "AttachmentId": "eni-attach-04feb4d36cf5c6792",
                                "DeleteOnTermination": true,
                                "DeviceIndex": 0,
                                "Status": "attached"
                            },
                            "Description": "",
                            "Groups": [
                                {
                                    "GroupName": "testSG",
                                    "GroupId": "sg-4cbc6f35"
                                }
                            ],
                            "Ipv6Addresses": [],
                            "MacAddress": "0a:6d:ba:c5:65:4b",
                            "NetworkInterfaceId": "eni-09ef90920cfb29dd9",
                            "OwnerId": "031372724784",
                            "PrivateIpAddress": "10.0.0.210",
                            "PrivateIpAddresses": [
                                {
                                    "Association": {
                                        "IpOwnerId": "amazon",
                                        "PublicDnsName": "",
                                        "PublicIp": "54.242.88.254"
                                    },
                                    "Primary": true,
                                    "PrivateIpAddress": "10.0.0.210"
                                }
                            ],
                            "SourceDestCheck": true,
                            "Status": "in-use",
                            "SubnetId": "subnet-49de033f",
                            "VpcId": "vpc-f16a0895",
                            "InterfaceType": "interface"
                        }
                    ],
                    "RootDeviceName": "/dev/sda1",
                    "RootDeviceType": "ebs",
                    "SecurityGroups": [
                        {
                            "GroupName": "testSG",
                            "GroupId": "sg-4cbc6f35"
                        }
                    ],
                    "SourceDestCheck": true,
                    "SpotInstanceRequestId": "sir-rrs9gm3j",
                    "VirtualizationType": "hvm",
                    "CpuOptions": {
                        "CoreCount": 1,
                        "ThreadsPerCore": 1
                    },
                    "CapacityReservationSpecification": {
                        "CapacityReservationPreference": "open"
                    },
                    "HibernationOptions": {
                        "Configured": false
                    },
                    "MetadataOptions": {
                        "State": "applied",
                        "HttpTokens": "optional",
                        "HttpPutResponseHopLimit": 1,
                        "HttpEndpoint": "enabled"
                    }
                }
            ],
            "OwnerId": "031372724784",
            "ReservationId": "r-0d45e1919e7bad5c9"
        }
    ]
}
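
The public IP address is buried in that output. A minimal sketch for pulling it out follows; the helper name instance_public_ip and the private key path shown in the usage comment are assumptions, while ubuntu is the default login for Ubuntu AMIs.

```shell
#!/bin/bash

# Print the public IP address of an EC2 instance.
# $1 - EC2 instance id
# instance_public_ip is an illustrative helper name.
instance_public_ip() {
  aws ec2 describe-instances \
    --instance-ids "$1" \
    --query 'Reservations[0].Instances[0].PublicIpAddress' \
    --output text
}

# Usage (requires configured AWS credentials; the key path is hypothetical):
#   AWS_RESCUE_IP="$( instance_public_ip i-012a54aefcd333de9 )"
#   ssh -i ~/.ssh/rsa-2020-11-03 "ubuntu@$AWS_RESCUE_IP"
```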

Mount the Rescue Volume On the Rescue Server

We need to select a device name to be assigned to the rescue disk once it is attached to an EC2 instance. The available names depend on what names are already in use on the rescue server. After logging into the rescue server, I ran the lsblk Linux command to see the available disk devices and their mount points.

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop1     7:1    0 53.1M  1 loop /snap/lxd/10984
loop2     7:2    0 88.4M  1 loop /snap/core/7169
loop3     7:3    0 97.8M  1 loop /snap/core/10185
loop4     7:4    0 53.1M  1 loop /snap/lxd/11348
xvda    202:0    0    8G  0 disk
└─xvda1 202:1    0    8G  0 part /

The lsblk output omits the /dev/ prefix instead of showing full device paths. With that in mind, we can see that the only disk device on the rescue server is /dev/xvda, and its only partition, /dev/xvda1, is mounted on the root directory. Because Linux drives are normally named sequentially, we should name the rescue disk /dev/xvdb. Let’s define an environment variable called AWS_RESCUE_DRIVE to memorialize that decision.

$ AWS_RESCUE_DRIVE=/dev/xvdb
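
Picking the next free name can also be automated. This sketch assumes Xen-style xvd device names, as on the rescue server shown above; newer instance types expose NVMe names instead and would need a different approach. The function name next_free_xvd is made up for this example.

```shell
#!/bin/bash

# Print the first /dev/xvdX name (b through z) that is not already in use.
# Reads one device name per line on stdin, as printed by `lsblk -dn -o NAME`.
# next_free_xvd is an illustrative helper name.
next_free_xvd() {
  local used letter
  used="$(cat)"
  for letter in {b..z}; do
    if ! grep -qx "xvd$letter" <<< "$used"; then
      echo "/dev/xvd$letter"
      return 0
    fi
  done
  return 1
}

# Usage on the rescue server:
#   AWS_RESCUE_DRIVE="$( lsblk -dn -o NAME | next_free_xvd )"
```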

The aws ec2 attach-volume command attaches the rescue volume to the rescue server. The --device option specifies the device name for the rescue volume, which in the following example is /dev/xvdb:

Shell
$ AWS_ATTACH_VOLUME="$(
  aws ec2 attach-volume \
    --device $AWS_RESCUE_DRIVE \
    --instance-id $AWS_EC2_RESCUE_ID \
    --volume-id $AWS_RESCUE_VOLUME_ID
)"

$ echo "$AWS_ATTACH_VOLUME"
{
    "AttachTime": "2020-10-26T14:34:55.222Z",
    "InstanceId": "i-d3b03954",
    "VolumeId": "vol-0e20fd22d2dc5a933",
    "State": "attaching",
    "Device": "/dev/xvdb"
}

$ aws ec2 wait volume-in-use --volume-ids "$AWS_RESCUE_VOLUME_ID"

The details of the attached rescue drive are provided by fdisk -l:

Shell
$ sudo fdisk -l | sed -n -e '/xvdb/,$p'
Disk /dev/xvdb: 12 GiB, 12884901888 bytes, 25165824 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00000000

Device     Boot Start      End  Sectors Size Id Type
/dev/xvdb1 *    16065 25165790 25149726  12G 83 Linux

Now it is time to mount the rescue drive on the rescue server. Ubuntu has a directory called /mnt whose purpose is to act as a mount point:

Shell
$ sudo mount /dev/xvdb1 /mnt

Let’s confirm that the drive is mounted:

Shell
$ df -h | grep '^/dev/' | grep -v '^/dev/loop'
/dev/xvda1      7.8G  6.3G  1.1G  86% /
/dev/xvdb1       12G  9.0G  2.2G  82% /mnt

The last line shows that this drive is mounted on /mnt and it is 82% full.

Set Up a chroot to Establish an Environment for Making Repairs

We need to mount some more file systems before we perform the chroot. The following binds the rescue server’s /dev, /dev/shm, /sys, and /run to the same paths within the rescue volume. Because programs like do-release-upgrade need a tty, I also mount devpts and proc. These mounts only last until the next server reboot. The chroot command is issued after all the mounts are in place.

Warning: mounting /run and then updating the system on the rescue disk from within a chroot may change the host system’s /run contents. If the package managers (apt and dpkg) get out of sync with the actual state of the host system, you won’t be able to update the host system until you restore the host system’s image from the snapshot that we made earlier.

Shell
$ sudo mount -o bind /dev /mnt/dev

$ sudo mount -o bind /dev/shm /mnt/dev/shm

$ sudo mount -o bind /sys /mnt/sys

$ sudo mount -o bind /run /mnt/run

$ sudo mount -t proc proc /mnt/proc

$ sudo mount -t devpts devpts /mnt/dev/pts

$ sudo chroot /mnt

root@ip-10-0-0-189:/#

Notice how the prompt changed after the chroot. That is your clue that it is active.
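
The mount sequence can be wrapped in a function with a dry-run switch, so the commands can be reviewed before anything is mounted. This is only a sketch: the names run_cmd and prepare_chroot and the DRY_RUN variable are assumptions, not part of the procedure above.

```shell
#!/bin/bash

# Echo the command when DRY_RUN=1, otherwise run it.
run_cmd() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Mount the file systems a chroot needs, parents before children.
# $1 - directory where the rescue volume is mounted, for example /mnt
prepare_chroot() {
  local root="$1" dir
  for dir in /dev /dev/shm /sys /run; do
    run_cmd sudo mount -o bind "$dir" "$root$dir"
  done
  run_cmd sudo mount -t proc proc "$root/proc"
  run_cmd sudo mount -t devpts devpts "$root/dev/pts"
}

# Print the six mount commands without running them.
DRY_RUN=1 prepare_chroot /mnt
```

Calling prepare_chroot /mnt without DRY_RUN performs the mounts, after which sudo chroot /mnt can be issued as before.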

Correct the Problem

This step depends on whatever is wrong. I won’t bore you with the problem I had.

Unmount the New Volume

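Exit the chroot and unmount the rescue volume from the rescue server. Nested mount points such as /mnt/dev/pts and /mnt/dev/shm must be unmounted before /mnt/dev, otherwise umount reports that the target is busy.

Shell
# exit

$ sudo umount /mnt/dev/pts

$ sudo umount /mnt/dev/shm

$ sudo umount /mnt/dev

$ sudo umount /mnt/proc

$ sudo umount /mnt/run

$ sudo umount /mnt/sys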

$ sudo umount /mnt
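
Unmount order matters because a mount point cannot be released while something is still mounted beneath it. The sketch below lists the mount points under a directory deepest first; the function name mounts_under is an assumption, and it does not handle mount points containing spaces. The umount command from util-linux also has a -R option that unmounts recursively.

```shell
#!/bin/bash

# Print every mount point at or below $1, children before parents.
# Reads text in /proc/self/mounts format on stdin.
# mounts_under is an illustrative helper name.
mounts_under() {
  local root="${1%/}"
  awk -v root="$root" \
    '$2 == root || index($2, root "/") == 1 { print $2 }' | LC_ALL=C sort -r
}

# Usage on the rescue server, after leaving the chroot:
#   mounts_under /mnt < /proc/self/mounts | while read -r m; do sudo umount "$m"; done
```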

Detach the rescue volume from the rescue server. This can be done from any machine where the AWS CLI is configured with your account credentials.

Shell
$ aws ec2 detach-volume --volume-id $AWS_RESCUE_VOLUME_ID

$ aws ec2 wait volume-available --volume-ids "$AWS_RESCUE_VOLUME_ID"

Detach the Problem Volume

The problem server must be shut down for this to work. Detach the problem volume from the problem server. This can be done from any machine where the AWS CLI is configured with your account credentials.

Shell
$ aws ec2 stop-instances --instance-ids "$AWS_PROBLEM_INSTANCE_ID"

$ aws ec2 wait instance-stopped --instance-ids "$AWS_PROBLEM_INSTANCE_ID"

$ aws ec2 detach-volume --volume-id "$AWS_PROBLEM_VOLUME_ID"

$ aws ec2 wait volume-available --volume-ids "$AWS_PROBLEM_VOLUME_ID"

Replace the Problem Volume

Now it is time to replace the problem volume, which contains the boot drive of the problem system, with the repaired rescue volume. BTW, AWS EC2 refers to this instance’s boot drive as /dev/sda1 (the RootDeviceName shown earlier), even though Linux gives the device a different name, such as /dev/xvda1.

replaceSystemVolume Bash function

This Bash function detaches the volume containing the current boot drive of an EC2 instance and replaces it with another volume. If the EC2 instance is running then it is first stopped.

#!/bin/bash

# Abort on the first failing command
set -e

function replaceSystemVolume {
  # $1 - EC2 instance id
  # $2 - id of the new volume to attach as the system boot drive
  local EC2_INSTANCE EC2_NAME EC2_STATE ROOT_DEVICE ATTACHED_VOLUME_ID

  EC2_INSTANCE="$(
    aws ec2 describe-instances --instance-ids "$1" | \
    jq -r ".Reservations[0].Instances[0]"
  )"

  # Value of the Name tag, if any; only used in messages
  EC2_NAME="$(
    jq -r ".Tags[]? | select(.Key==\"Name\") | .Value" <<< "$EC2_INSTANCE"
  )"

  # Device name that EC2 uses for the boot drive, for example /dev/sda1
  ROOT_DEVICE="$( jq -r ".RootDeviceName" <<< "$EC2_INSTANCE" )"

  # Only the volume attached as the boot drive, not every attached volume
  ATTACHED_VOLUME_ID="$(
    jq -r --arg root "$ROOT_DEVICE" \
      '.BlockDeviceMappings[] | select(.DeviceName == $root) | .Ebs.VolumeId' \
      <<< "$EC2_INSTANCE"
  )"

  if [[ "$ATTACHED_VOLUME_ID" == "$2" ]]; then
    >&2 echo "VolumeId $2 is already attached to EC2 instance $1"
    exit 1
  fi

  EC2_STATE="$( jq -r ".State.Name" <<< "$EC2_INSTANCE" )"

  if [ "$EC2_STATE" == running ]; then
    echo "Stopping EC2 instance $1 $EC2_NAME"
    aws ec2 stop-instances --instance-ids "$1"
    aws ec2 wait instance-stopped --instance-ids "$1"
  fi

  # Skip the detach if no boot volume is currently attached
  if [ "$ATTACHED_VOLUME_ID" ]; then
    aws ec2 detach-volume --volume-id "$ATTACHED_VOLUME_ID"
    aws ec2 wait volume-available --volume-ids "$ATTACHED_VOLUME_ID"
  fi

  aws ec2 attach-volume \
    --device "$ROOT_DEVICE" \
    --instance-id "$1" \
    --volume-id "$2"
  aws ec2 wait volume-in-use --volume-ids "$2"

  aws ec2 start-instances --instance-ids "$1"
  aws ec2 wait instance-running --instance-ids "$1"
}

replaceSystemVolume "$@"

Here is how to use it:

Shell
$ replaceSystemVolume "$AWS_PROBLEM_INSTANCE_ID" "$AWS_RESCUE_VOLUME_ID"

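The same function works for the rescue server. Here the rescue server is my preview instance, whose instance id is AWS_EC2_RESCUE_ID. This replaces the rescue volume on preview with preview’s original volume, whose id must already be saved in AWS_PREVIEW_VOLUME_ID: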

Shell
$ replaceSystemVolume "$AWS_EC2_RESCUE_ID" "$AWS_PREVIEW_VOLUME_ID"

Boot the Problem System

Boot the problem system and verify the problem is solved.

Shell
$ aws ec2 start-instances --instance-ids $AWS_PROBLEM_INSTANCE_ID

$ aws ec2 wait instance-running --instance-ids $AWS_PROBLEM_INSTANCE_ID
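
A running state only means the virtual machine was started. The instance-status-ok waiter additionally waits for the EC2 status checks to pass. Here is a minimal sketch that chains the two; the function name verify_instance is an assumption. Since my original problem was that logins were refused, an actual ssh login remains the real test.

```shell
#!/bin/bash

# Wait until the instance is running and has passed its EC2 status checks.
# $1 - EC2 instance id
# verify_instance is an illustrative helper name.
verify_instance() {
  aws ec2 wait instance-running --instance-ids "$1" &&
  aws ec2 wait instance-status-ok --instance-ids "$1" &&
  echo "Instance $1 passed its status checks"
}

# Usage (requires configured AWS credentials):
#   verify_instance "$AWS_PROBLEM_INSTANCE_ID" && ssh "ubuntu@$AWS_PROBLEM_IP"
```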

Acknowledgements

This article was inspired by this excellent article, which uses the AWS web console to achieve similar results.