How To Maximize AWS Cloud Reliability and Availability

What Happened?

“Typo blamed for Amazon’s Internet-crippling outage!” and similar news headlines drew attention to the five-hour service disruption that Amazon Web Services (AWS) suffered in its US-East-1 region in Northern Virginia on Tuesday, February 28, 2017. The outage affected its Simple Storage Service (S3) file storage. It also affected other services in the area, such as the S3 management console, Elastic Compute Cloud (EC2) virtual servers, Amazon Elastic Block Store (EBS) block storage, and Lambda event-driven computing. Most customer websites that depend on S3 were stuck or slow. However, no data loss occurred directly from the disruption. The cause was a human error: an authorized employee removed more server capacity than intended, which triggered a cascade of outages.

Why Is AWS Important?

In 2006, Amazon Web Services (AWS) started offering information technology (IT) infrastructure services to businesses in the form of cloud computing. Per Gartner, AWS is the leader in cloud infrastructure as a service (IaaS). Per Synergy Research Group, AWS has a market share of over 40% (while Microsoft, Google, and IBM combined have a market share of 23%), with predicted revenue of $13 billion in 2017.

Why Is Cloud Computing Popular?

Cloud computing is the on-demand delivery via the Internet of computing, storage, database, analytics, applications, developer tools, and other IT resources through a cloud services platform with usage-based pricing. Public cloud offerings can be used successfully for new and existing applications. Public cloud offerings enhance developer productivity and infrastructure and operations efficiency. Traditional solutions include application hosting, backup and storage, content delivery, websites, enterprise applications, and the Internet of Things. The benefits of cloud computing are its reliability, low cost, scalability, agility, elasticity, openness, flexibility, and security. A key driver of cloud computing is the ability to replace up-front capital infrastructure expenses with low variable costs that scale quickly with business needs.

Which AWS Services Are Most Relevant?

For database professionals, the most relevant AWS cloud services are likely to be virtual machines via Elastic Compute Cloud (EC2), managed databases via Relational Database Service (RDS), and file storage via Simple Storage Service (S3). For example, use EC2 to set up cloud virtual machines with Microsoft Windows Server and Microsoft SQL Server, use RDS to set up a managed cloud database that is compatible with Microsoft SQL Server, and use S3 to store database backup files and application data.

How Reliable Is AWS?

Reliability is of prime importance for AWS. For this purpose, AWS provides a global infrastructure including 42 availability zones within 16 geographic regions. AWS also offers numerous services to ensure reliability, such as CloudWatch resource and application monitoring and CloudFormation configuration templates.

Regardless, major service disruptions occur. Examples of the well-documented service disruptions that affected EC2 and RDS are:

  • On Sunday, June 5, 2016, AWS suffered a service disruption in its Sydney region for several hours. The outage affected its EC2 instances running in a single availability zone. The cause was slow breakers during an electrical storm.

  • On Sunday, September 20, 2015, AWS suffered a service disruption in its US-East region in Northern Virginia for several hours. The outage affected its DynamoDB NoSQL database service. It also affected other services in the area such as Simple Queue Service (SQS) fully managed message queues, EC2 Auto Scaling, CloudWatch resource and application monitoring, and the AWS Console. The cause was a power outage followed by inadequate failover procedures.

  • On Monday, December 24, 2012, AWS suffered a service disruption in its US-East region in Northern Virginia for a day. The outage affected its Elastic Load Balancing (ELB) service for EC2. The cause was a human error by an authorized employee that resulted in the deletion of load balancing state data.

  • On Monday, October 22, 2012, AWS suffered a service disruption in its US-East region in Northern Virginia for several hours. The outage affected its EBS volumes in a single availability zone. It also affected other services in the area such as EC2, RDS, and ELB. Many customer websites were affected. The cause was a software bug on the EBS storage servers.

  • On Friday, June 29, 2012, AWS suffered a service disruption in its US-East-1 region in Northern Virginia for an hour. The outage affected EC2, EBS, ELB, and RDS. Several customer websites that rely on AWS were offline. The cause was failing backup power generators during an electrical storm.

  • On Wednesday, April 20, 2011, AWS suffered a service disruption in its US-East-1 region in Northern Virginia for several days. The outage affected a subset of the EBS volumes in a single availability zone. It also affected other services in the area such as EC2 and RDS. The cause was an incorrectly executed scheduled router traffic shift during a capacity upgrade.

What Can I Do to Stay Out of Trouble?

One could jump to the conclusion that perhaps it is wise to avoid the US-East region in Northern Virginia. However, AWS located the majority of its servers in this area. It probably makes more sense to add redundancy to minimize the dependency on a single availability zone and a single region. However, such redundancy carries a significant cost, and it does not ensure reliability by itself. Such measures are only one component of a much larger set of recommended processes and systems. That is, the transparent self-managing aspect of the cloud is not a replacement for proper planning and architectural design.
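
The trade-off between redundancy and reliability can be made concrete with some availability arithmetic. The following is a minimal sketch, assuming that zones fail independently (which the incidents listed above suggest is optimistic) and using a hypothetical per-zone availability of 99.9%. The function names are illustrative and are not part of any AWS API.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant deployments, assuming independent failures."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # 0.999 is an assumed figure for illustration, not a published AWS number.
    for copies in (1, 2, 3):
        a = parallel_availability(0.999, copies)
        print(f"{copies} zone(s): availability {a:.9f}, "
              f"about {downtime_hours_per_year(a):.5f} hours down per year")
```

Under these assumptions, a second zone reduces the expected downtime from roughly 8.76 hours per year to well under a minute. Correlated failures, such as a region-wide outage, make the real gain smaller, which is one reason redundancy alone does not ensure reliability.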

Where Can I Find Guidance?

To help its customers build reliable, secure, high performing, resilient, efficient, and cost-effective cloud infrastructure for their applications and data, AWS developed its Well-Architected framework. This framework provides general design principles, best practices, and a set of questions to review for each existing or proposed architecture. Its general design principles to facilitate good design in the cloud are to stop guessing capacity needs, test systems at production scale, automate to make architectural experimentation easier, allow for evolutionary architectures, use data-driven designs, and improve through game days. Its five conceptual areas are security, reliability, performance efficiency, cost optimization, and operational excellence. This blog post focuses specifically on reliability and operational excellence.

Reliability

Reliability refers to recovering a system from infrastructure or service failures, dynamically acquiring computing resources to meet demand, and mitigating disruptions such as misconfigurations or transient network issues. The five design principles are to test recovery procedures, automatically recover from failure, scale horizontally to increase aggregate system availability, stop guessing capacity, and manage change in automation. The key service is CloudWatch to monitor run-time metrics. For each of the three best practice areas, the nine best practices and the key AWS services and features are:
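
Automatic recovery from transient network issues is commonly implemented as a retry with exponential backoff. The following is a minimal sketch of that general technique, not an AWS API; `retry_with_backoff`, `TransientError`, and all parameters are hypothetical names chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TransientError with capped, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the delay ceiling on each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Random jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0.0, ceiling))
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so that the delay can be replaced in tests. Only errors that are known to be temporary should be retried; retrying a misconfiguration just delays the failure.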

Foundations: The two key questions are “How do you manage service limits for your accounts?” and “How are you planning your network topology?”. The two key services are Identity and Access Management (IAM) to control access to services and resources securely; and Virtual Private Cloud (VPC) to provision a private, isolated section of AWS to launch resources in a virtual network.

Change Management: The three key questions are “How does your system adapt to changes in demand?”, “How are you monitoring resources?”, and “How are you executing change?”. The two key services are CloudTrail to record application programming interface (API) calls for accounts and to deliver log files for auditing; and Config to provide a detailed inventory of resources and configuration, and to continuously record configuration changes.

Failure Management: The four key questions are “How are you backing up your data?”, “How does your system withstand component failures?”, “How are you testing for resiliency?”, and “How are you planning for disaster recovery?”. The key service is CloudFormation to provide templates for the creation of resources and to provision the resources in a predictable fashion.
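
A CloudFormation template is a JSON or YAML document that declares the resources to provision. As a minimal sketch, the following Python builds a template for a single versioned S3 bucket that could hold database backup files. The logical name `BackupBucket` and the helper `build_backup_template` are hypothetical, and a real template would need to be reviewed against the requirements of the workload.

```python
import json


def build_backup_template() -> dict:
    """Return a minimal CloudFormation template declaring one versioned S3 bucket."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative bucket for database backup files",
        "Resources": {
            # "BackupBucket" is a hypothetical logical name.
            "BackupBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    # Versioning keeps earlier copies of overwritten backup files.
                    "VersioningConfiguration": {"Status": "Enabled"},
                },
            }
        },
    }


if __name__ == "__main__":
    # Print the template as JSON so it can be saved to a file and reviewed.
    print(json.dumps(build_backup_template(), indent=2))
```

Keeping such a template under version control means the same resources can be recreated predictably in another availability zone or region.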

Operational Excellence

Operational excellence refers to running and monitoring systems to deliver business value and to continually improve supporting processes and procedures. The six design principles are to perform repetitive operations with code, align operations processes to business objectives, implement regular small incremental changes, test for responses to unexpected events, learn from operational events and failures, and keep operations procedures current. The two primary services are CloudFormation to create templates based on best practices and to provision resources properly; and CloudWatch to monitor metrics, collect logs, generate alerts, and trigger responses. For each of the three best practice areas, the six best practices and the key services and features are:

Preparation: The two key questions are “What best practices for cloud operations are you using?” and “How are you doing configuration management for your workload?”. The four key services are Config to provide a detailed inventory of resources and configuration, and to continuously record configuration changes; Service Catalog to create a standardized set of service offerings aligned with best practices; and Auto Scaling and Simple Queue Service (SQS) to automate workloads.

Operations: The two key questions are “How are you evolving your workload while minimizing the impact of change?” and “How do you monitor your workload to ensure it is operating as expected?”. The key services are CodeCommit, CodeDeploy, and CodePipeline to manage and automate code changes to workloads; AWS software development kits (SDKs) and third-party libraries to automate operational changes; and CloudTrail to audit and track changes.

Responses: The two key questions are “How do you respond to unplanned operational events?” and “How is escalation managed when responding to unforeseen operational events?”. The key service is CloudWatch with its alarms and events to provide effective and automated responses.
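
An alarm of this kind typically fires only after a metric has breached a threshold for several consecutive evaluation periods, which avoids reacting to a single spike. The following is a simplified sketch of that rule, not the CloudWatch implementation; `alarm_state` and its parameters are hypothetical names.

```python
from typing import Sequence


def alarm_state(datapoints: Sequence[float], threshold: float, periods: int) -> str:
    """Return "ALARM" if the last `periods` datapoints all exceed `threshold`."""
    if periods < 1:
        raise ValueError("periods must be at least 1")
    if len(datapoints) < periods:
        # Not enough history yet to decide either way.
        return "INSUFFICIENT_DATA"
    recent = datapoints[-periods:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"
```

For example, with CPU utilization percentages sampled once per period, `alarm_state([50, 95, 97], threshold=90, periods=2)` reports an alarm, whereas `alarm_state([50, 95, 60], threshold=90, periods=2)` does not because the most recent value is back below the threshold.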

Takeaways

For business continuity, do not trust the cloud provider to handle all eventualities. Instead, continue to rely on expert system and database administrators to execute proper planning and architectural design. Specifically, maximize the availability and performance of applications and data by replicating them so that they automatically fail over without interruption between availability zones in the same region and between different regions for fault tolerance and low latency. Finally, accept that the cloud comes with a significant cost, and that taking shortcuts is likely to result in painful consequences down the road.
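
The failover order described above, first another availability zone in the same region and then a different region, can be sketched as a simple selection rule. This is a minimal illustration under assumed data, not a description of any AWS service; the `Replica` type, the `choose_replica` function, and the replica names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Replica:
    """Hypothetical description of one copy of an application or database."""
    name: str
    region: str
    zone: str
    healthy: bool


def choose_replica(
    replicas: Iterable[Replica], home_region: str, home_zone: str
) -> Optional[Replica]:
    """Pick a healthy replica: home zone first, then home region, then any region."""

    def preference(replica: Replica) -> int:
        if replica.region == home_region and replica.zone == home_zone:
            return 0
        if replica.region == home_region:
            return 1
        return 2

    healthy = [r for r in replicas if r.healthy]
    # min() keeps the first of equally preferred replicas, so input order breaks ties.
    return min(healthy, key=preference, default=None)
```

In practice the health flag would come from monitoring, and the selection would be automated so that no manual step is needed during an outage.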