Serverless Approach to Backup and Restore EBS Volumes

EC2 instances use EBS volumes as root volumes and as additional data stores. For production environments, it is necessary to select the proper EBS volume type for the workload and to back up EBS volumes regularly. We need a solution that can back up application data and restore it from EBS volume snapshots at any point in time. We should also not pay unnecessary costs for archiving older snapshots. Here, I discuss how to choose the right EBS volume type and describe a mechanism for managing EBS snapshots.

EBS Volume types

Amazon EBS provides different volume types, each with different performance characteristics and cost models. We can choose the volume type based on our application requirements (the type of workload) to achieve higher performance while reducing overall storage cost.

EBS volumes are available in two categories: SSD-backed volumes and HDD-backed volumes. SSD-backed volumes are used when the workload is I/O intensive, i.e. a transactional workload in which the application reads and writes frequently. Their performance is rated in IOPS. HDD-backed volumes are used when the application requires continuous reads and writes to the disk with high throughput at a lower price. Their performance is rated in throughput (MiB/s).

SSD Backed EBS Volumes

SSD volumes are high-performance EBS storage backed by solid-state drive technology. They are available as General Purpose SSD (gp2) and Provisioned IOPS SSD (io1) volumes. gp2 volumes suit a wide variety of workloads by balancing price and performance, whereas io1 volumes provide high performance and high throughput and are suitable for mission-critical, low-latency workloads, especially databases such as Cassandra, MongoDB, Postgres, MySQL, and Oracle Database. A gp2 volume can be created with a size from 1 GiB to 16 TiB, and an io1 volume from 4 GiB to 16 TiB. The maximum IOPS per volume is 16,000 for gp2 and 64,000 for io1, depending on volume size and provisioning.
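As a worked example of how gp2 performance depends on size: gp2 baseline performance scales at 3 IOPS per GiB, with a floor of 100 IOPS and a ceiling of 16,000 IOPS per volume. The sketch below only does that arithmetic; the function name is illustrative and is not part of any AWS SDK.

```python
def gp2_baseline_iops(size_gib):
    """Return the baseline IOPS of a gp2 volume of the given size in GiB."""
    if not 1 <= size_gib <= 16384:
        raise ValueError('gp2 volumes range from 1 GiB to 16 TiB (16384 GiB)')
    # 3 IOPS per GiB, never below 100 and never above 16,000
    return min(max(100, 3 * size_gib), 16000)


for size in (10, 100, 1000, 16384):
    print(str(size) + ' GiB -> ' + str(gp2_baseline_iops(size)) + ' IOPS')
```

A small root volume therefore sits at the 100 IOPS floor, which is one reason to keep I/O-heavy data directories on a separate, larger or provisioned volume.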

HDD Backed EBS Volumes

HDD volumes are throughput-oriented volumes at a lower price, backed by magnetic storage. They are further classified into Throughput Optimized HDD (st1) and Cold HDD (sc1). st1 is intended for frequently accessed workloads at a lower price than SSD, and sc1 for infrequently accessed workloads at the lowest price. HDD volumes are generally used for data warehouses and big data solutions, where both throughput and storage price matter. They cannot be used as a boot volume for an EC2 instance. These volume types can be created with a size from 500 GiB to 16 TiB, and the maximum IOPS per volume ranges from 250 (sc1) to 500 (st1). The maximum throughput per volume ranges from 250 MiB/s (sc1) to 500 MiB/s (st1).
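To illustrate how HDD throughput depends on volume size, here is a minimal sketch. It assumes the published baseline rates of 40 MiB/s per TiB for st1 and 12 MiB/s per TiB for sc1, capped at 500 MiB/s and 250 MiB/s respectively; check these figures against the current AWS documentation before relying on them. The names `HDD_PROFILES` and `hdd_baseline_throughput` are hypothetical.

```python
# (baseline MiB/s per TiB, maximum MiB/s per volume); assumed figures
HDD_PROFILES = {
    'st1': (40, 500),
    'sc1': (12, 250),
}


def hdd_baseline_throughput(volume_type, size_gib):
    """Return the baseline throughput in MiB/s for an st1 or sc1 volume."""
    per_tib, cap = HDD_PROFILES[volume_type]
    return min(per_tib * size_gib / 1024.0, cap)


print(hdd_baseline_throughput('st1', 1024))   # a 1 TiB st1 volume
print(hdd_baseline_throughput('sc1', 2048))   # a 2 TiB sc1 volume
```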

Selecting a proper EBS volume type for an application

Suppose our solution relies on frequent access to data, with frequent transactions between the application and the database. We then need high-performance, reliable data storage at a comparatively low cost. We use a General Purpose SSD volume for the root volume and a Provisioned IOPS SSD volume for the application and database data directories, which require continuous read and write operations with low latency. We monitor the volume's performance under load, when the application receives the maximum traffic it can handle without much latency. Based on that, we may increase or decrease the provisioned IOPS to balance performance against storage cost.
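One way to turn that monitoring into a number is to take the CloudWatch `VolumeReadOps` and `VolumeWriteOps` sums for the busiest period and divide by the period length in seconds. The sketch below does that arithmetic and adds some headroom. The 20 percent headroom and both function names are assumptions for illustration, not a recommendation from AWS; the 100 and 64,000 bounds are the io1 limits.

```python
import math


def average_iops(read_ops, write_ops, period_seconds):
    """Average IOPS over one CloudWatch period from the summed operation counts."""
    return (read_ops + write_ops) / float(period_seconds)


def suggested_provisioned_iops(peak_iops, headroom_percent=20):
    """Peak IOPS plus headroom, clamped to the io1 range of 100 to 64,000."""
    target = int(math.ceil(peak_iops * (100 + headroom_percent) / 100.0))
    return min(max(target, 100), 64000)


# Example: 540,000 reads and 360,000 writes during a 5-minute period
peak = average_iops(540000, 360000, 300)
print(peak)
print(suggested_provisioned_iops(peak))
```

The resulting value could then be applied with the EC2 `modify_volume` call, which needs AWS credentials and is therefore left out of the sketch.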

EBS volume data protection

Encrypting an EBS volume protects data both at rest and in transit. We can encrypt an EBS volume when creating it. An encrypted volume encrypts all the data stored on it, as well as the data moving between the EC2 instance and EBS storage. Snapshots created from an encrypted volume are also encrypted. Encryption has only a minimal effect on I/O latency and otherwise does not reduce volume performance. Encrypted snapshots cannot be made public, and they can be shared with other AWS accounts only when they are encrypted with a customer managed KMS key rather than the default AWS managed key. We can use encrypted volumes for the application data directory to secure our data at rest and in transit.
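A minimal sketch of requesting an encrypted volume with boto3 follows. The helper `encrypted_volume_request` is hypothetical; it only builds the keyword arguments for the EC2 `create_volume` call, so it can be inspected without AWS credentials. The availability zone and size are placeholder values.

```python
def encrypted_volume_request(availability_zone, size_gib, volume_type='gp2', kms_key_id=None):
    """Build the arguments for ec2.create_volume() with encryption enabled."""
    request = {
        'AvailabilityZone': availability_zone,
        'Size': size_gib,
        'VolumeType': volume_type,
        'Encrypted': True,
    }
    # Without KmsKeyId, EBS uses the account's default key for the region
    if kms_key_id:
        request['KmsKeyId'] = kms_key_id
    return request


request = encrypted_volume_request('us-east-1a', 100)
print(request)

# With credentials configured, the volume could be created like this:
# import boto3
# ec2 = boto3.client('ec2', region_name='us-east-1')
# ec2.create_volume(**request)
```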

The architecture diagram shows a Lambda function that takes daily snapshot backups of the EBS volumes in an AWS account across all regions and purges the snapshots that are older than the specified retention period in days. With this solution, we do not need to clean up any EBS snapshots manually, and the number of EBS snapshots stays in line with the retention policy. It reduces unnecessary AWS billing cost and avoids keeping snapshots that are no longer useful.

A volume to be backed up by our solution should have a tag with a key such as “Backup” and some value for that key. Each snapshot is named with the volume ID followed by a timestamp. The function then looks for tagged snapshots that were created more than the retention period (in days) ago and deletes them.
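The naming and retention rules described above can be sketched as two small pure functions. The names `snapshot_name` and `is_expired` are illustrative and do not appear in the Lambda code; the volume ID is a placeholder.

```python
import datetime


def snapshot_name(volume_id, now):
    """Name a snapshot after its volume ID and a UTC timestamp."""
    return volume_id + '-' + now.strftime('%Y%m%d%H%M%S')


def is_expired(start_time, now, retention_days):
    """True when the snapshot is at least retention_days old."""
    return (now - start_time) >= datetime.timedelta(days=retention_days)


now = datetime.datetime(2019, 6, 8, 0, 0, 0)
print(snapshot_name('vol-0123456789abcdef0', now))
print(is_expired(datetime.datetime(2019, 6, 1), now, 7))
print(is_expired(datetime.datetime(2019, 6, 2), now, 7))

# Opting a volume in to the backup is a single tagging call (needs credentials):
# import boto3
# ec2 = boto3.client('ec2', region_name='us-east-1')
# ec2.create_tags(Resources=['vol-0123456789abcdef0'],
#                 Tags=[{'Key': 'Backup', 'Value': 'true'}])
```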

How to implement this Solution

We implement this solution as a CloudFormation stack. The CloudFormation template creates the Lambda function with all of its important configuration, such as runtime, handler, memory settings, and timeout value. For security, we attach a Lambda execution role with a fine-grained IAM policy. This IAM role and policy are defined in the same CloudFormation template. The Lambda function is triggered by a CloudWatch event that acts as a cron scheduler. Each time the Lambda function runs, it creates volume snapshots and purges the snapshots that are older than the retention period. The CloudFormation template also creates the event rule in CloudWatch with the permission it needs to invoke the Lambda function. This infrastructure can be reproduced in multiple AWS accounts by using the same CloudFormation template.
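The author's template is not reproduced here. As a rough sketch of the pieces it describes, the Python below builds the schedule rule, the invoke permission, and an execution policy as plain dictionaries that could be serialized into a JSON CloudFormation template. The logical IDs (`BackupFunction`, `BackupScheduleRule`, `BackupSchedulePermission`), the daily 01:00 UTC schedule, and the placeholder topic ARN are all assumptions.

```python
import json


def schedule_resources(function_logical_id='BackupFunction', schedule='cron(0 1 * * ? *)'):
    """Event rule that fires on a schedule, plus permission to invoke the function."""
    return {
        'BackupScheduleRule': {
            'Type': 'AWS::Events::Rule',
            'Properties': {
                'ScheduleExpression': schedule,
                'State': 'ENABLED',
                'Targets': [{'Arn': {'Fn::GetAtt': [function_logical_id, 'Arn']},
                             'Id': 'EbsSnapshotBackupTarget'}],
            },
        },
        'BackupSchedulePermission': {
            'Type': 'AWS::Lambda::Permission',
            'Properties': {
                'FunctionName': {'Ref': function_logical_id},
                'Action': 'lambda:InvokeFunction',
                'Principal': 'events.amazonaws.com',
                'SourceArn': {'Fn::GetAtt': ['BackupScheduleRule', 'Arn']},
            },
        },
    }


def execution_policy_document(topic_arn):
    """IAM policy limited to the calls the backup function makes."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {'Effect': 'Allow',
             'Action': ['ec2:DescribeRegions', 'ec2:DescribeVolumes',
                        'ec2:DescribeSnapshots', 'ec2:CreateSnapshot',
                        'ec2:CreateTags', 'ec2:DeleteSnapshot'],
             'Resource': '*'},
            {'Effect': 'Allow', 'Action': 'sns:Publish', 'Resource': topic_arn},
            {'Effect': 'Allow',
             'Action': ['logs:CreateLogGroup', 'logs:CreateLogStream', 'logs:PutLogEvents'],
             'Resource': 'arn:aws:logs:*:*:*'},
        ],
    }


template = {'AWSTemplateFormatVersion': '2010-09-09', 'Resources': schedule_resources()}
print(json.dumps(template, indent=2, sort_keys=True))
print(json.dumps(execution_policy_document(
    'arn:aws:sns:us-east-1:123456789012:ebs-snapshot-backup'), indent=2, sort_keys=True))
```

A complete template would also declare the `AWS::Lambda::Function` and `AWS::IAM::Role` resources that these fragments refer to.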

Monitoring this serverless application

We can use CloudWatch metrics to monitor our solution. CloudWatch exposes metrics about Lambda invocations, errors, and duration. With those metrics, we can optimize our solution if needed. For debugging, we use the logs that Lambda publishes to a CloudWatch log stream.
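As an illustration, the sketch below builds a query for the hourly error count of the function over the last day, using the `AWS/Lambda` namespace and the `Errors` metric. The helper name and the function name `ebs-snapshot-backup` are assumptions; the helper only prepares the arguments for CloudWatch's `get_metric_statistics` call.

```python
import datetime


def lambda_errors_query(function_name, end_time, hours=24):
    """Arguments for cloudwatch.get_metric_statistics(): hourly error sums."""
    return {
        'Namespace': 'AWS/Lambda',
        'MetricName': 'Errors',
        'Dimensions': [{'Name': 'FunctionName', 'Value': function_name}],
        'StartTime': end_time - datetime.timedelta(hours=hours),
        'EndTime': end_time,
        'Period': 3600,
        'Statistics': ['Sum'],
    }


query = lambda_errors_query('ebs-snapshot-backup', datetime.datetime(2019, 6, 8))
print(query['StartTime'], query['EndTime'])

# With credentials configured:
# import boto3
# cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
# for point in cloudwatch.get_metric_statistics(**query)['Datapoints']:
#     print(point['Timestamp'], point['Sum'])
```

Swapping `MetricName` for `Duration` or `Invocations` gives the other execution metrics mentioned above.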

Refactoring/Maintenance of application

Implementing this solution does not require a maintenance window, and it does not impact anything else in our infrastructure. The solution is easy to maintain. The Lambda function's code is stored in S3. Whenever we change the code, we have to update the Lambda function so that the change takes effect on the next execution.

The code base of this application is available in my GitHub repository.
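Pointing the function at a new code package in S3 can be done with the Lambda `update_function_code` call. A minimal sketch follows; the helper name, function name, bucket, and key are all placeholders.

```python
def update_code_request(function_name, bucket, key):
    """Arguments for lambda_client.update_function_code() using a package in S3."""
    return {
        'FunctionName': function_name,
        'S3Bucket': bucket,
        'S3Key': key,
        'Publish': True,  # publish a new version so the change is traceable
    }


request = update_code_request('ebs-snapshot-backup', 'my-artifacts-bucket',
                              'lambda/ebs_snapshot_backup.zip')
print(request)

# With credentials configured:
# import boto3
# boto3.client('lambda', region_name='us-east-1').update_function_code(**request)
```

If the function is managed by the CloudFormation stack, updating the stack with the new S3 key is the alternative that keeps the template and the deployed code in sync.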

import os
import boto3
import datetime
import traceback

def lambda_handler(event, context):
    # default values used when the environment variables are not set
    snapshot_tag = os.getenv('snapshot_tag', 'Backup')
    retention_period_in_days = os.getenv('retention_period_in_days', '7')
    sns_topic_arn = os.getenv('sns_topic_arn', 'arn:aws:sns:us-east-1:113809544561:ebs-snapshot-backup')

    # list the names of all regions available to this account
    def get_aws_regions(ec2_client):
        try:
            regions = ec2_client.describe_regions()
            return [region['RegionName'] for region in regions['Regions']]
        except Exception as e:
            print('Unable to get AWS region list')
            print(e)
            traceback.print_exc()
            exit(1)

    # list all ebs volumes that carry the backup tag key
    def get_ebs_volumes_with_specified_tags(client, snapshot_tag):
        try:
            ebs_volumes = []
            paginator = client.get_paginator('describe_volumes')
            response_iterator = paginator.paginate(
                Filters=[{'Name': 'tag-key', 'Values': [snapshot_tag]}],
                PaginationConfig={
                    'MaxItems': 1000,
                    'PageSize': 5,
                }
            )
            for response in response_iterator:
                for volume in response['Volumes']:
                    ebs_volumes.append(volume['VolumeId'])
            return ebs_volumes
        except Exception as e:
            print('Unable to get EBS volumes with specified tags')
            print(e)
            traceback.print_exc()
            exit(1)

    # take a snapshot of every volume in the above list
    def create_ebs_volume_snapshot(client, ebs_volumes, snapshot_tag):
        try:
            snapshot_list = []
            timestamp = datetime.datetime.utcnow().strftime('%Y%m%d%H%M%S')
            for volume in ebs_volumes:
                # tag the snapshot with its name and with the backup tag key,
                # so that the purge step can find it later
                snapshot = client.create_snapshot(
                    Description='EBS-Volume-Snapshot-Backup-' + volume + '-' + timestamp,
                    VolumeId=volume,
                    TagSpecifications=[
                        {
                            'ResourceType': 'snapshot',
                            'Tags': [
                                {
                                    'Key': 'Name',
                                    'Value': volume + '-' + timestamp
                                }, {
                                    'Key': snapshot_tag,
                                    'Value': volume + '-' + timestamp
                                }
                            ]
                        },
                    ],
                )
                snapshot_list.append(snapshot['SnapshotId'])
            return snapshot_list
        except Exception as e:
            print('Unable to create EBS Snapshot Backup')
            print(e)
            traceback.print_exc()
            exit(1)

    # list all ebs snapshots which are older than the retention period
    def get_ebs_snapshot_less_than_retention_period(client, snapshot_tag):
        try:
            old_ebs_snapshots = []
            current_time = datetime.datetime.utcnow()
            paginator = client.get_paginator('describe_snapshots')
            response_iterator = paginator.paginate(
                # only consider snapshots owned by this account
                OwnerIds=['self'],
                Filters=[{'Name': 'tag-key', 'Values': [snapshot_tag]}],
                PaginationConfig={
                    'MaxItems': 1000,
                    'PageSize': 5,
                }
            )
            for response in response_iterator:
                for snapshot in response['Snapshots']:
                    # StartTime is a timezone-aware UTC datetime; drop the tzinfo
                    # so that it can be subtracted from utcnow()
                    start = snapshot['StartTime'].replace(tzinfo=None)
                    age_in_days = (current_time - start).total_seconds() / (3600 * 24.0)
                    if age_in_days >= float(retention_period_in_days):
                        old_ebs_snapshots.append(snapshot['SnapshotId'])
            return old_ebs_snapshots
        except Exception as e:
            print('Unable to get EBS Snapshots which are older than the retention period')
            print(e)
            traceback.print_exc()
            exit(1)

    # main method, which is called by the handler function
    def run_task():
        try:
            if snapshot_tag == 'Name':
                print('Tag key **Name** is not allowed. Use a different tag key such as ProdBackup, DevBackup, Backup, etc.')
                exit(1)
            # start every run with an empty report file
            file_path = '/tmp/msg.txt'
            if os.path.isfile(file_path):
                os.remove(file_path)
            sts_client = boto3.client('sts', region_name='us-east-1')
            account_id = sts_client.get_caller_identity()["Account"]
            with open(file_path, 'a') as report:
                report.write('*' * 100 + '\n')
                report.write('EBS Volume Snapshot Backup Details(AWS account ' + str(account_id) + '):' + '\n')
                report.write('Backup retention period in days: ' + str(retention_period_in_days) + '\n')
                report.write('Snapshot Tag: ' + snapshot_tag + '\n')
                report.write('EBS volumes under snapshot tag key will be backed up by creating an EBS volume snapshot.' + '\n')
                report.write('EBS snapshots will be persisted up to the retention period in days and older snapshots will be purged.' + '\n')
            ec2_client = boto3.client('ec2', region_name='us-east-1')
            regions = get_aws_regions(ec2_client)
            overall_ebs_volumes_tobe_backed_up = 0
            overall_ebs_snapshot_created = 0
            overall_ebs_snapshot_deleted = 0
            for region in regions:
                ec2_conn = boto3.client('ec2', region_name=region)
                ebs_volumes = get_ebs_volumes_with_specified_tags(client=ec2_conn, snapshot_tag=snapshot_tag)
                if len(ebs_volumes) != 0:
                    with open(file_path, 'a') as report:
                        report.write('Number of volumes to be backed up in ' + region + ': ' + str(len(ebs_volumes)) + '\n')
                        report.write(','.join(ebs_volumes) + '\n')
                create_snapshot = create_ebs_volume_snapshot(client=ec2_conn, ebs_volumes=ebs_volumes, snapshot_tag=snapshot_tag)
                if len(create_snapshot) != 0:
                    with open(file_path, 'a') as report:
                        report.write('Number of snapshots created in ' + region + ': ' + str(len(create_snapshot)) + '\n')
                        report.write(','.join(create_snapshot) + '\n')
                old_ebs_snapshots = get_ebs_snapshot_less_than_retention_period(client=ec2_conn, snapshot_tag=snapshot_tag)
                # delete all the ebs snapshots from the above list
                for old_ebs_snapshot in old_ebs_snapshots:
                    ec2_conn.delete_snapshot(SnapshotId=old_ebs_snapshot)
                if len(old_ebs_snapshots) != 0:
                    with open(file_path, 'a') as report:
                        report.write('Number of deleted snapshots in ' + region + ': ' + str(len(old_ebs_snapshots)) + '\n')
                        report.write(','.join(old_ebs_snapshots) + '\n')
                overall_ebs_volumes_tobe_backed_up += len(ebs_volumes)
                overall_ebs_snapshot_created += len(create_snapshot)
                overall_ebs_snapshot_deleted += len(old_ebs_snapshots)
            with open(file_path, 'a') as report:
                report.write('Number of volumes to be backed up across regions: ' + str(overall_ebs_volumes_tobe_backed_up) + '\n')
                report.write('Number of ebs snapshots created across regions: ' + str(overall_ebs_snapshot_created) + '\n')
                report.write('Number of ebs snapshots deleted across regions: ' + str(overall_ebs_snapshot_deleted) + '\n')
            # send a notification to the cloud administrator only when something happened
            count = overall_ebs_volumes_tobe_backed_up + overall_ebs_snapshot_created + overall_ebs_snapshot_deleted
            if count != 0:
                with open(file_path, 'r') as report:
                    message = report.read()
                print(message)
                try:
                    print('Sending email to the cloud administrator')
                    # the region is the fourth field of the topic ARN
                    sns_client = boto3.client('sns', region_name=sns_topic_arn.split(':')[3])
                    send_message = sns_client.publish(
                        TopicArn=sns_topic_arn,
                        Message=message,
                        Subject='EBS Volume Snapshot Backup Details(AWS account ' + str(account_id) + ')',
                    )
                    print('Message id: ' + send_message['MessageId'])
                    print('Email sent successfully.')
                except Exception as e:
                    print('Unable to send email.')
                    print(e)
            else:
                print('No EBS volumes to be backed up and no EBS snapshots to be deleted.')
        except Exception as e:
            print(e)
            traceback.print_exc()
            exit(1)

    run_task()

# if __name__ == '__main__':
#     lambda_handler(1, 1)
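
The listing above covers backup and purge only. Restoring is a matter of creating a new volume from one of the snapshots and attaching it to an instance. The sketch below is not part of the original repository: `latest_snapshot` and `restore_request` are hypothetical helpers that pick the newest completed snapshot of a volume from a `describe_snapshots` result and build the arguments for `create_volume`. The sample data is invented.

```python
import datetime


def latest_snapshot(snapshots, volume_id):
    """Return the newest completed snapshot of volume_id, or None."""
    candidates = [s for s in snapshots
                  if s['VolumeId'] == volume_id and s['State'] == 'completed']
    if not candidates:
        return None
    return max(candidates, key=lambda s: s['StartTime'])


def restore_request(snapshot, availability_zone, volume_type='gp2'):
    """Arguments for ec2.create_volume() that restore the given snapshot."""
    return {
        'SnapshotId': snapshot['SnapshotId'],
        'AvailabilityZone': availability_zone,
        'VolumeType': volume_type,
    }


# Invented sample shaped like the 'Snapshots' list of describe_snapshots()
snapshots = [
    {'SnapshotId': 'snap-a', 'VolumeId': 'vol-1', 'State': 'completed',
     'StartTime': datetime.datetime(2019, 6, 1)},
    {'SnapshotId': 'snap-b', 'VolumeId': 'vol-1', 'State': 'completed',
     'StartTime': datetime.datetime(2019, 6, 2)},
    {'SnapshotId': 'snap-c', 'VolumeId': 'vol-1', 'State': 'pending',
     'StartTime': datetime.datetime(2019, 6, 3)},
]
newest = latest_snapshot(snapshots, 'vol-1')
print(restore_request(newest, 'us-east-1a'))

# With credentials configured:
# import boto3
# ec2 = boto3.client('ec2', region_name='us-east-1')
# volume = ec2.create_volume(**restore_request(newest, 'us-east-1a'))
# ec2.attach_volume(VolumeId=volume['VolumeId'], InstanceId='i-...', Device='/dev/sdf')
```

The new volume must be created in the same availability zone as the instance it will be attached to.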
https://dzone.com/articles/serverless-approach-to-backup-and-restore-ebs-volu

Happy Learning!
