Cloud Day 2—High Availability and Disaster Recovery in the Cloud

by Nov 19, 2020

One of my favorite topics in cloud computing is disaster recovery. A decade ago, if you wanted true disaster recovery across multiple geographic regions, you had to build or lease multiple data centers, with staff on location in each of those regions. This was a daunting expense that only the largest and wealthiest companies could afford. With cloud computing, however, many of the disaster recovery scenarios that once required expensive networking hardware or enterprise-class storage are built into cloud vendors' service offerings. In this post you will learn about high availability and disaster recovery options in cloud platforms and how these technologies integrate with SQL Server and SQL platform as a service offerings.

VMs—Let’s Talk About Availability

Virtual machines in both major cloud platforms have a standard SLA of around 99.5 to 99.9%, which works out to roughly 45 minutes to 4 hours of downtime a month. Those numbers are pretty good for a lot of smaller on-premises shops, but may not meet the needs of mission-critical applications. Both Microsoft and Amazon offer features to provide you with higher levels of uptime, the first being Availability Sets in Azure and placement groups in AWS. These features effectively do the same thing: they ensure that a group of VMs (for example, the nodes of a SQL Server Availability Group) is placed on separate hardware (and storage) within the same data center. In both cases this increases availability to 99.95%, or about 21 minutes of downtime a month, per the service level agreement (SLA), which defines the amount of downtime allowed before the cloud vendor has to refund the customer.

Both Azure and AWS also support Availability Zones, a concept similar to Availability Sets but spread across data centers in a given region. This allows your workloads to survive the failure of a single data center in that region. Both providers connect the zones in a region with very low-latency networking, which allows you to run workloads across data centers in a synchronous fashion, so you will not incur any loss of data in this type of failure. The SLA with both vendors for Availability Zones is 99.99% for two or more VMs, which is about 4 minutes of downtime a month. It is important to note that you still need another layer of data protection beyond just this infrastructure. In the case of SQL Server, that would most likely be Always On Availability Groups.
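
The downtime figures quoted in this section are simple arithmetic on the SLA percentage. Here is a minimal Python sketch of that conversion. The function name and the choice of an average month (365.25 / 12 days) are my own assumptions for illustration; each vendor's SLA document defines its own measurement period, so treat the results as approximations.

```python
# Convert an SLA percentage into the downtime it permits per month.
# Assumption: an "average" month of 365.25 / 12 days (about 30.44 days).
MINUTES_PER_MONTH = 365.25 / 12 * 24 * 60  # 43,830 minutes


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Return the minutes per month a service can be down and still meet the SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return MINUTES_PER_MONTH * (1 - sla_percent / 100)


if __name__ == "__main__":
    # The four SLA tiers discussed in this section.
    for sla in (99.5, 99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% -> {minutes:.1f} minutes/month ({minutes / 60:.2f} hours)")
```

Running it gives roughly 219, 44, 22, and 4 minutes a month for the four tiers, which lines up with the rounded numbers above.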

Disaster Recovery—VMs First

High availability is used to protect against transient failures like a server going down or a disk failure. Disaster recovery, on the other hand, protects against events such as severe weather or a major software failure that affect an entire region. This applies to both cloud vendors: regional outages, while not common, can happen to any vendor. One key difference between Azure and AWS in this space is that Azure has a concept of paired regions. For example, the West US 2 region is paired with the West Central US region. This means a couple of things. First, there’s a bigger network connection between those regions. Second, software gets rolled out to only one region of the pair at a time, so if there’s a bug that affects services, you are somewhat protected by having your resources across both of those regions. The other place this matters is with storage: Azure allows blob storage to be geo-replicated (which is a good basic DR solution for SQL Server backup files), and the replication is always to the paired region (this is not user-configurable). So it’s preferable to build your DR solution across paired regions in Azure.

When I talk to customers about their DR needs, I like to understand their tolerance for downtime and their budget. Shipping backups to another region is the cheapest DR option. The storage for your backups is relatively inexpensive, and in the event of a regional outage, you could relatively quickly spin up a bunch of VMs running SQL Server and start restoring your databases. The downtime for such a solution is probably hours, or at the most a day or two. To minimize downtime in this scenario, you should have all of your infrastructure defined in templates (Azure Resource Manager, Terraform, AWS CloudFormation) in your source control system, so you can easily redeploy your systems.
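
To get a feel for where "hours, at the most a day or two" comes from, here is a rough back-of-the-envelope estimate in Python. This is only a sketch: the function name, the simple model (redeploy time plus backup size divided by restore throughput), and every number in the example are hypothetical. Measure your own restore speeds before relying on any estimate like this.

```python
def estimate_restore_hours(backup_gb: float,
                           restore_mb_per_sec: float,
                           provisioning_minutes: float) -> float:
    """Rough downtime estimate for a backup-and-restore DR plan.

    Simplified model: redeploy the infrastructure from templates first,
    then restore all backups sequentially at a constant throughput.
    """
    if restore_mb_per_sec <= 0:
        raise ValueError("restore_mb_per_sec must be positive")
    restore_seconds = backup_gb * 1024 / restore_mb_per_sec
    return provisioning_minutes / 60 + restore_seconds / 3600


if __name__ == "__main__":
    # Hypothetical workload: 2 TB of backups, 200 MB/s sustained restore
    # throughput, and 45 minutes to redeploy VMs from templates.
    hours = estimate_restore_hours(backup_gb=2048,
                                   restore_mb_per_sec=200,
                                   provisioning_minutes=45)
    print(f"Estimated downtime: {hours:.1f} hours")
```

The model ignores things like log restores, DNS changes, and application validation, so real downtime will be longer. It is still useful for showing customers how backup size and restore throughput drive the recovery time of this option.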

The second, slightly more expensive option is specific to Azure: Azure Site Recovery. For $25/month per VM, plus storage costs, you can replicate your VM data, and the changes that take place, to a second Azure region. This is similar to VMware’s Site Recovery Manager in an on-premises world, and kind of, sort of works with databases. Like most infrastructure-based recovery solutions, it doesn’t do well with highly transactional databases, as recovery of active transactions can result in data loss. I prefer this solution for migrations rather than disaster recovery, but if most of your systems are small and not that busy, it can be a one-click DR solution that covers all of your infrastructure.

Finally, there’s the option of constructing a live DR site in a second region, or even a second cloud provider (be careful of bandwidth charges if you go cross-cloud). Most commonly, this is done with Always On Availability Groups, but you can also use log shipping to build a DR site. You can also take advantage of the distributed availability group feature of SQL Server to avoid having to configure a multi-site cluster; however, it is only supported on SQL Server 2016 and up, and then only on Enterprise Edition. One thing many organizations do is downscale the servers in their secondary region to reduce costs. In the event of a failure, which would require some downtime anyway, you would resize those VMs to match production.

What About PaaS?

Platform as a service offerings such as Azure SQL Database, Azure SQL Managed Instance, and Amazon RDS for SQL Server are built on highly available platforms. That means you do not have to worry about building an availability group or a cluster, because the service provides that for you. This is one of the reasons PaaS services can appear to be more expensive than running standalone VMs. However, the cloud service doesn’t provide disaster recovery by default; you need to deploy that yourself. With Azure SQL Database you can use geo-replication, which replicates your database to another region, and you can have up to four copies. Azure SQL Managed Instance is similar, but you can only have one copy. Amazon RDS supports cross-region disaster recovery using change data capture (CDC).

Cloud DR is easier to implement than what you would have traditionally done in an on-premises world, but it still carries some inherent costs and complexity. The upside is that all of the infrastructure is in place in terms of networking and storage. The downside is that there is an almost infinite array of options, and you need to make the right choices for your budget and business requirements.