In this blog I want to talk about how Azure SQL Database achieves high availability. One of the major benefits of moving from on-prem SQL Server to Azure SQL Database is how much easier it is to have high availability – no need for creating and managing a SQL Server failover cluster, Always On availability groups, database mirroring, log shipping, SAN replication, etc.
Azure SQL Database is a highly available database Platform as a Service that guarantees your database is up and running 99.99% of the time, with no need for you to manage maintenance windows and downtime. It is a fully managed SQL Server Database Engine hosted in the Azure cloud that ensures your SQL Server database is always upgraded and patched without affecting your workload. Azure automatically handles patching, backups, replication, failure detection (whether the failure is in the underlying hardware, software, or network), deployment of bug fixes, failovers, database upgrades, and other maintenance tasks. Azure SQL Database can quickly recover even in the most critical circumstances, ensuring that your data is always available.
Azure SQL Database is based on the SQL Server Database Engine architecture, adjusted for the cloud environment in order to ensure 99.99% availability even in the case of infrastructure failures. There are two high-availability architectural models used in Azure SQL Database (both of them ensuring 99.99% availability):
(NOTE: Basic/Standard/Premium are service tiers that are DTU-based and used only for SQL Database Single, and General Purpose/Business Critical are vCore-based and used for both SQL Database Single and SQL Database Managed Instance)
- Basic/Standard/General Purpose model, which is based on remote storage. This architectural model relies on the high availability and reliability of the storage tier, but it might see some performance degradation during maintenance activities. This model uses Azure Premium Storage disks.
- Premium/Business Critical model, which is based on a cluster of database engine processes. This architectural model relies on the fact that there is always a quorum of available database engine nodes, and it has minimal performance impact on your workload even during maintenance activities. This model uses Always On availability groups and locally attached SSD storage, and it provides higher IOPS and throughput than Basic/Standard/General Purpose.
Azure SQL Database runs on the latest stable version of SQL Server Database Engine and Windows OS, and most of the users would not notice that the upgrades are performed continuously.
More details on these two options:
High availability in the Basic/Standard/General Purpose service tiers is achieved by separating the compute and storage layers and replicating the data in the storage tier (which uses Azure Premium Storage):
There are two layers:
- Active compute nodes: A stateless compute layer that runs the sqlservr.exe process and contains only transient and cached data (for example, the plan cache, buffer pool, and columnstore pool). This stateless SQL Server node is operated by Azure Service Fabric, which initializes the process, monitors the health of the node, and performs failover to another node if necessary
- Azure Storage accounts: A stateful data layer with the database files (.mdf/.ldf) stored in Azure Premium Storage disks, which is remote storage (i.e., it is accessed over the network, using the Azure network infrastructure). Azure SQL Database is able to use Azure Premium Storage by taking advantage of SQL Server's native capability to use database files directly in Azure Blob Storage. This means that there is no disk or network share that hosts the database files; instead, the file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage. Azure Storage guarantees that there will be no data loss for any record placed in any database file (three copies of the data are made via locally redundant storage, LRS). This built-in data availability/redundancy ensures that every record in the log file and every page in the data files is preserved even if the SQL Server process crashes. Note that the tempdb database does not use Azure Premium Storage; rather, it is located on local SSD storage, which provides very low latency and high IOPS/throughput
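To make the "file path is a URL" idea concrete, here is a minimal sketch of what such a page-blob path looks like. The storage account, container, and file names below are purely hypothetical; the real account/container layout Azure uses internally is not exposed to you:

```python
def page_blob_url(storage_account: str, container: str, file_name: str) -> str:
    """Build the HTTPS-style path a database file gets when it lives in
    Azure Blob Storage instead of on a local disk or network share.
    All three arguments are illustrative placeholders."""
    return f"https://{storage_account}.blob.core.windows.net/{container}/{file_name}"

# A data file's "path" is then a URL rather than a drive letter:
print(page_blob_url("examplestorageacct", "data", "mydb.mdf"))
# https://examplestorageacct.blob.core.windows.net/data/mydb.mdf
```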
Whenever the database engine or operating system is upgraded, some part of the underlying infrastructure fails, or a critical issue is detected in the SQL Server process, Azure Service Fabric will move the stateless SQL Server process to another stateless compute node. A set of redundant ("spare") nodes is waiting to run the new compute service in case of failover, in order to minimize failover time. Data in the Azure storage layer is not affected, and the data/log files are attached to the newly initialized SQL Server process. Failover time can be measured in seconds. This process guarantees 99.99% availability, but a heavy workload may experience some performance impact due to the transition time and the fact that the new SQL Server node starts with a cold cache.
See Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose) for info on performance improvement.
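Because a General Purpose failover takes seconds rather than milliseconds, client code should treat connection errors as potentially transient and retry with backoff. A minimal sketch of that pattern, with `operation` and `ConnectionError` standing in for your database driver's connect/execute call and its transient-error exception:

```python
import random
import time

def run_with_retry(operation, attempts=5, base_delay=1.0):
    """Retry an operation that may fail transiently while the stateless
    SQL process is being moved to a spare compute node. Since failover
    is measured in seconds, a short exponential backoff usually rides
    it out. `operation` is any zero-argument callable."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:  # stand-in for your driver's transient error
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # exponential backoff with jitter before reconnecting
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```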
High availability in the Premium/Business Critical service tiers is designed for intensive workloads that cannot tolerate the performance impact of ongoing maintenance operations.
In the Premium/Business Critical model, Azure SQL Database integrates compute and storage on a single node. High availability in this architectural model is achieved by replicating both compute (the SQL Server Database Engine process) and storage (locally attached SSD), deployed as an Always On availability group with enough replicas to achieve quorum and provide HA guarantees (currently four nodes, as shown below):
Both the SQL Server Database Engine process and the underlying .mdf/.ldf files are placed on the same node, with locally attached SSD storage providing low latency to your workload. High availability is implemented using Always On availability groups. Every database is a cluster of database nodes, with one primary database that is accessible to the customer workload and three secondary processes containing copies of the data. The primary node constantly pushes changes to the secondary nodes to ensure that the data is available on the secondary replicas if the primary node crashes for any reason. Failover is handled by Azure Service Fabric: one secondary replica becomes the primary node, a new secondary replica is created to ensure enough nodes in the cluster, and the workload is automatically redirected to the new primary node. Failover time is measured in milliseconds for most workloads, and the new primary instance is immediately ready to continue serving requests.
For the Premium/Business Critical model for SQL Database Single (not yet for SQL Database Managed Instance), there is built-in support for Availability Zones in the Central US, France Central, West Europe, and Southeast Asia regions at no extra cost (see Azure SQL Database offers zone redundant Premium databases and elastic pools). In short, Availability Zones are unique physical locations within an Azure region. Each zone is made up of one or more datacenters equipped with independent power, cooling, and networking. Making a database zone redundant places each replica in a different zone (and therefore a different datacenter), as opposed to the default of having all the replicas in the same datacenter. While it does not change the 99.99% SLA, a zone redundant database is much more resilient to large-scale failures, including datacenter-level failures, and it reduces the risk of downtime from regional natural disasters (see this excellent graphic on Azure resiliency). Note that it makes the existing databases more resilient, but it does not improve the level of resilience of management operations like CRUD; if those are a critical part of the application logic, geo-replication is the way to go. For more info, check out Zone redundant configuration. Be aware that because zone redundant databases have replicas in different datacenters with some distance between them, the increased network latency may increase the commit time and thus impact the performance of some OLTP workloads.
A note on the difference in failover handling compared to on-prem: the database engine cannot control the failover because it may not be running when a failover has to occur, i.e. it may have just crashed. Failover has to be initiated by a component external to the database engine. For traditional SQL Server, this component is Windows Server Failover Clustering; for SQL Database and Managed Instance, it is Service Fabric.
IO Performance difference:
vCore model (from here):
| | General Purpose | Business Critical |
| --- | --- | --- |
| IO throughput (approximate) | Single database: 500 IOPS per vCore with 7,000 maximum IOPS. Managed Instance: 500-7,500 IOPS per data file (depends on size of file) | 5,000 IOPS per core with 200,000 maximum IOPS |
DTU model (from here):
| | Basic | Standard | Premium |
| --- | --- | --- | --- |
| IO throughput (approximate) | 2.5 IOPS per DTU | 2.5 IOPS per DTU | 48 IOPS per DTU |
| IO latency (approximate) | 5 ms (read), 10 ms (write) | 5 ms (read), 10 ms (write) | 2 ms (read/write) |
Note Managed Instance only supports the vCore model.
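The per-vCore and per-DTU figures above translate into quick capacity estimates. A small sketch using the approximate numbers from the tables (remember these are approximations, and Managed Instance General Purpose limits are per data file rather than per vCore, so it is not modeled here):

```python
def gp_single_db_iops(vcores: int) -> int:
    """Approximate IOPS cap for a General Purpose single database:
    500 IOPS per vCore, capped at 7,000 (figures from the vCore table)."""
    return min(500 * vcores, 7000)

def bc_iops(vcores: int) -> int:
    """Approximate IOPS cap for Business Critical:
    5,000 IOPS per core, capped at 200,000."""
    return min(5000 * vcores, 200000)

def dtu_iops(dtus: int, premium: bool = False) -> float:
    """Approximate IOPS for the DTU model: 48 IOPS per DTU for Premium,
    2.5 IOPS per DTU for Basic/Standard."""
    return dtus * (48 if premium else 2.5)

print(gp_single_db_iops(8))   # 4000
print(gp_single_db_iops(24))  # 7000 (hits the 7,000 cap)
print(bc_iops(8))             # 40000
print(dtu_iops(100))          # 250.0
```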
A word about storage: Azure SQL Database single databases and Managed Instance both use Azure Storage page blobs as the underlying persistent storage for their databases. Azure premium managed disks are just premium page blobs with some API make-up so that they look like disks. Azure Storage does not officially publish performance numbers per page blob size, while Azure premium managed disks do publish this info, which is why tables like this show performance numbers for disks and not page blobs.
A word about failover times: In the Business Critical case, there is a secondary replica that is an exact read-only copy of the primary instance, so failover is just a switch to a new IP address and is almost instant. In more realistic cases, there is always some lag on the secondary replica because it is constantly redoing transaction log records sent from the primary node. Failover time is then the time needed to apply all remaining transaction log records to become consistent with the primary node, after which the switch to the new IP is completed. Under a heavy workload that saturates both the primary and secondary replicas, the secondary may not be able to immediately catch up to the primary, so the log redo time might be even longer. The exact time depends on the workload, and there are no official numbers or a formula to calculate it. In the General Purpose case, there is a stateless compute node ready to run sqlservr.exe that attaches the .mdf/.ldf files from remote storage. This is a cold-cache process that must be initialized, so failover time is longer than in Business Critical; it also depends on the database size and can vary.
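As noted above, there is no official formula, but the redo catch-up component of a Business Critical failover can be modeled back-of-the-envelope as the unapplied log backlog divided by the secondary's redo rate. The numbers below are purely illustrative:

```python
def redo_catchup_seconds(backlog_mb: float, redo_rate_mb_per_s: float) -> float:
    """Rough, unofficial model: before a secondary replica can take over,
    it must finish redoing the transaction log it has not yet applied, so
    the catch-up portion of failover is roughly backlog / redo rate."""
    if redo_rate_mb_per_s <= 0:
        raise ValueError("redo rate must be positive")
    return backlog_mb / redo_rate_mb_per_s

# 50 MB of unapplied log, secondary redoing at 100 MB/s:
print(redo_catchup_seconds(50, 100))  # 0.5 -> sub-second under light lag
```

This also shows why a saturated secondary hurts: if the backlog grows faster than the redo rate can drain it, the catch-up time stretches out.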
Finally, if you are interested in how Microsoft manages data integrity for Azure SQL Database, check out Data Integrity in Azure SQL Database.