Creating a large data warehouse in Azure

Microsoft Azure provides you two options when hosting your SQL Server-based data warehouse: Microsoft Azure SQL Database and SQL Server in Azure Virtual Machine.  Which one is appropriate based on the size of the data warehouse?  What are some hardware features to choose from for an Azure VM for a large data warehouse?

Let’s look at each option.

Microsoft Azure SQL Database is a Platform-as-a-service (PaaS), or more specifically a relational database-as-a-service.  It is built on standardized hardware and software that is owned, hosted, and maintained by Microsoft.  You can develop directly on the service using built-in features and functionality.  When using Azure SQL Database, you pay-as-you-go with options to scale up or out for greater power.  Azure SQL Database has a max database size of 1TB (see Azure SQL Database Service Tiers and Performance Levels).  Also, not all, but most, of the SQL Server 2014 Transact-SQL statements are supported in Azure SQL Database (see Azure SQL Database Transact-SQL Reference) and it does not support SQL Server instance level features (such as, SQL Server Agent, Analysis Services, Integration Services, or Reporting Services).  But there are some features that land first in Azure SQL Database before on-prem SQL Server, such as Row-Level Security and Dynamic Data Masking.  Note that Azure SQL Database has built-in fault tolerance infrastructure capabilities that enable high availability as well as business continuity options (see Azure SQL Database Business Continuity).  So this option is only appropriate if you have a relatively small data warehouse that does not require full SQL support.

The other option that will almost always be the correct choice for a large data warehouse is to create a Azure VM that has SQL Server 2014 installed, resulting in an Infrastructure-as-a-service (IaaS).  This allows you to run SQL Server inside a virtual machine in the cloud.  Similar to Azure SQL Database, it is built on standardized hardware that is owned, hosted, and maintained by Microsoft.  When using SQL Server in a VM, you can either bring your own SQL Server license to Azure (by uploading a Windows Server VHD to Azure) or use one of the preconfigured SQL Server images in the Azure portal.  If going with a preconfigured image you should choose “SQL Server 2014 Enterprise Optimized for Data Warehousing on Windows Server 2012 R2” (see VM Images Optimized for Transactional and DW workloads in Azure VM Gallery) which will attach 15 data disks (12 disks for a 12TB data pool and 3 disks for a 3TB log pool).  You will also need to choose the virtual machine size for your VM.  Note you can setup high availability and disaster recovery solutions (see High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines).  This resulting VM will be very similar to an on-prem SQL Server solution except for the various hardware configurations that you have to choose from for your virtual machine size.

If you look at the Azure Virtual Machines Pricing for SQL Server, here are the options you would want to consider for your virtual machine size:

  • A7, 8 cores, 56GB memory, 605GB max disk size, 16 Persistent 1TB Data Disks Max, $3/hr SQL Enterprise
  • A9, 16 cores, 112GB RDMA memory, 382GB max disk size, 40Gbit/s InfiniBand, 16 Persistent 1TB Data Disks Max, $6/hr SQL Enterprise
  • D14, 16 cores, 112GB memory, 800GB max disk size (SSD), 32 Persistent 1TB Data Disks Max, cpu 60% faster than A-series, $6/hr SQL Enterprise
  • G5, 32 cores, 448GB memory, 6596GB max disk size (SSD), 64 Persistent 1TB Data Disks Max, cpu Xeon E5 v3, $12/hr SQL Enterprise.  See Largest VM in the Cloud

The “Persistent 1TB Data Disks” refers to connecting external attached drives but be aware that network bandwidth can be a bottleneck.  From Virtual Machine and Cloud Service Sizes for Azure you can see for G5 you can add 64 of 1TB data disks yielding the potential total volume size of up to 64 TB.  See How to Attach a Data Disk to a Windows Virtual Machine to attach a drive and give it a new drive letter, or attach drives to use as a storage space which is a way to make multiple disks appear as one (see Windows Server 2012 Storage Virtualization Explained).

If this still is not enough disk size for your data warehouse, you will need to use an on-prem SQL Server solution or an MPP solution such as Microsoft’s Analytics Platform System.

There is a script you can download (see Deploy a SQL Server Data Warehouse in Windows Azure Virtual Machines) that allows a user to create a *Data Warehousing* optimized VM on Azure running SQL Server 2012 or SQL Server 2014 and will also attach empty disks to the VM to be used for SQL Server data and log files.

untitled

For info on how to connect to an Azure VM using SSMS, click here.  To connect SSMS to an Azure SQL database, click here.

If you are concerned about data throughput from on-prem to azure, check out the ExpressRoute service.  ExpressRoute enables dedicated, private, high-throughput network connectivity between Azure datacenters and your on-premises IT environments.  Using ExpressRoute, you can connect your existing datacenters to Azure without having to flow any traffic over the public Internet, and enable–guaranteed network quality-of-service and the ability to use Azure as a natural extension of an existing private network or datacenter.

More info:

Introduction to Automating Deployment of SQL Server in Azure IaaS Virtual Machines

Understanding Azure SQL Database and SQL Server in Azure VMs

Inside Microsoft Azure SQL Database

Automating Image-Based Deployment of SQL Server on Azure IaaS VMs – Preparing OS Image

About James Serra

James is a big data and data warehousing solution architect at Microsoft. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 25 years of IT experience.
This entry was posted in Azure, Azure SQL Database, Data warehouse, SQLServerPedia Syndication. Bookmark the permalink.

8 Responses to Creating a large data warehouse in Azure

  1. Dave Gugg says:

    For A7 at $3 an hour I calculate a year would cost $26,280 a year. This means I could purchase my own hardware that would last five years for the same cost as running in Azure for two years. Why would Azure IaaS ever be a good solution?

    • James Serra says:

      There are a lot of other costs involved with purchasing your own hardware. The biggest is the $3/hour for Azure IaaS include the licensing for Windows and SQL Server. A7 has 8 cores and to license SQL Server enterprise edition for that many cores can be tens of thousands of dollars per year. Then there is the cost for Windows, the cost of co-location, maintaining the servers, high availability costs, etc.

  2. Don Schaeffer says:

    I’m not following why the VM route is preferrable to SQL Database. Would you elaborate a little on that?

    • James Serra says:

      Hi Don,

      The main reason is SQL Database only supports a database size max of 500GB, as well as not all T-SQL statements are supported in SQL Database.

  3. Fredrik Knutson says:

    Where will the new Azure SQL Data Warehouse fit into all this?

  4. Pingback: HA/DR for Azure SQL Database - SQL Server - SQL Server - Toad World

  5. David says:

    Hi James,
    I ran across this recently. I am looking at potentially creating a Data Warehouse using SQL Server on an Azure VM. How would this relate to SQL 2016? Will there be an updated script of image available for doing this with 2016?

    Thanks