What product to use to transform my data?

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL).  Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data.  There is good news and bad news when it comes to which product to use.  The good news is there are a lot of products to choose from.  The bad news is there are a lot of products to choose from :-).  I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part):

  • SSIS – The product used most often for on-prem ETL still has use cases when moving to the cloud, especially when migrating a solution from on-prem to the cloud.  If you have hundreds of SSIS packages, it may make the most sense to continue to use those packages and just change the destination adapter (i.e. from your on-prem SQL Server to Azure SQL Database) and have them execute under the Azure-SSIS integration runtime (IR) in Azure Data Factory (ADF).  Keep in mind that if a package accesses local resources (like storing temp files to a local path or using custom controls), some re-work will be required.  A concern will be if a package is moving a ton of data, as executing transformations row-by-row (ETL) can be much slower than batch transformations (ELT) – see Difference between ETL and ELT.  Also, the data has to be moved to the location of the IR and the compute power is limited by the size of the IR cluster.  For those SSIS packages that are too slow, you can replace them with one of the products below
  • Azure Data Factory (ADF) – Now that ADF has a new feature called Data Flow, it can transform data, so it is more than just an orchestration tool.  Behind the scenes, the ADF JSON code that is created when you build a solution is converted to the appropriate code in the Scala programming language and is prepared, compiled and executed in Azure Databricks.  This means Data Flow operates in an ELT manner: it loads the data into a place where Databricks can access it, performs the transformations, and then moves it to the destination.  ADF provides a native ETL scheduler so that you can automate data transformation and movement processes either through visual data flows or via script activities that execute in Databricks or other execution engines (so, like with SSIS, data flows are row-by-row transformations, and for large amounts of data it may be faster to execute a batch transformation via a script in Databricks).  My thoughts on when to use ADF: obviously if you are already using it, or if your skill set lies in SSIS, as it’s pretty easy to learn ADF with an SSIS background
  • Databricks – It is a Spark-based analytics platform which makes it great to use if you like to work with Spark, Python, Scala, and notebooks.  When choosing between Databricks and ADF, what I’ve noticed is that it depends highly on the customer personas and their capabilities.  There are plenty of Data Engineers and Data Scientists who want to get deep into Python or Scala and sling some code in Databricks Notebooks.  But the larger audience who wants to focus on building business logic to clean customer/address data, for example, doesn’t want to learn Python libraries, and will use the ADF visual data flow designer.  Many of those are also Data Engineers and Data Scientists, but then we start to move up the value stack to include Data Analysts and Business Analysts, which is where we start to overlap with Power BI Dataflow (see below).  Either way, when you want to orchestrate these cleaning routines with schedules, triggers, and monitors, you want that to be through ADF.  Keep in mind if you code your transformations in Databricks Notebooks, you will be responsible for maintaining that code, troubleshooting, and scheduling those routines
  • HDInsight (HDI) – Databricks is the preferred product over HDI, unless the customer has a mature Hadoop ecosystem already established.  But more and more I tend to find that the majority of workloads are Spark, so Databricks is a better option.  In terms of pure Spark workloads Databricks greatly outperforms HDI.  Although Databricks clusters can be up and running indefinitely, they’re intended to be created and terminated as necessary – so if you wish to use other Apache Hadoop data transformation tools and have them available 24/7, then HDI may be a better option than Databricks
  • PolyBase – Azure SQL Data Warehouse (SQL DW) supports PolyBase, so if you are using SQL DW you can pull data from the data lake into SQL DW using T-SQL.  An option is to use PolyBase to import the raw data from the data lake into SQL DW and then clean it there using Stored Procedures, especially if you want to stick with T-SQL and don’t want to deal with Spark or Hive or other more-difficult technologies.  But the benefit of cleaning the data as it sits in the data lake is to off-load the cleaning process so you are not affecting user queries and to reduce the storage space in SQL DW.  Cleaning the data in the data lake can also be less expensive and there are more tools available for data manipulation than just T-SQL.  Another reason is by cleaning the data in the data lake you can make it available to data scientists or power users who don’t have to wait until it’s in SQL DW
  • Power BI Dataflow – In short, Dataflow integrates data lake and data prep technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep).  So even though the main use case for Dataflow is an individual solution or a small workload, you can think of Power Query as the “compute” part that transforms data that lands in the data lake and can then be used as part of an enterprise solution

In addition to what products to use to clean the data, another consideration is where to clean the data.  Say I have an on-prem SQL Server database and I’m exporting some tables into CSV files, copying them to the data lake (ADLS Gen2), and then to Azure SQL DW.  And I not only have to clean the data but also join these CSV files with other files already in the data lake.  Should I clean the data as it moves from the data source to the data lake via ADF, or always land the raw data in the data lake and then clean it, join it, and land it back into the data lake (via ADF or Databricks)?  Or clean it as it moves from the data lake to SQL DW via ADF or Databricks?  In most cases you will want to do all the transforming in the data lake.  So copy the raw data into the data lake, transform it and write it back into the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW, all of which can be orchestrated by ADF.  But this is not “one size fits all”: if you are building a solution that moves lots of data from many different data sources, there may be a handful of use cases within that solution where it is faster and/or easier to clean the data as it moves to or from the data lake.
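
To make the raw-to-cleaned pattern concrete, here is a minimal Databricks (PySpark) sketch of the kind of transformation an ADF pipeline might trigger.  The storage account, container, paths, and column names are made up for illustration, and spark is the SparkSession that Databricks provides in a notebook.

```python
# Minimal Databricks (PySpark) sketch: read raw CSV files from the data lake,
# clean and join them, then write the result to a "cleaned" layer.
# The account, container, paths, and columns below are illustrative only.
from pyspark.sql import functions as F

raw_base = "abfss://datalake@mystorageaccount.dfs.core.windows.net/raw"
cleaned_base = "abfss://datalake@mystorageaccount.dfs.core.windows.net/cleaned"

# spark is the SparkSession Databricks provides in every notebook
customers = spark.read.option("header", True).csv(f"{raw_base}/customers/")
addresses = spark.read.option("header", True).csv(f"{raw_base}/addresses/")

cleaned = (
    customers
    .dropDuplicates(["CustomerID"])                 # remove duplicate rows
    .withColumn("Email", F.lower(F.trim("Email")))  # standardize a column
    .join(addresses, on="CustomerID", how="left")   # join two data lake files
)

# Write back to the data lake as Parquet; ADF or PolyBase can then load it into SQL DW
cleaned.write.mode("overwrite").parquet(f"{cleaned_base}/customers/")
```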

More info:

Decision Tree for Enterprise Information Management (EIM)

Big Data Decision Tree v4

Posted in Azure Data Factory, Azure Databricks, Big Data, Data Lake, ETL, HDInsight, PolyBase, Power BI, SQLServerPedia Syndication, SSIS | 9 Comments

Azure Data Factory Data Flow

Azure Data Factory v2 (ADF) has a new feature in public preview called Data Flow.  I have usually described ADF as an orchestration tool instead of an Extract-Transform-Load (ETL) tool since it has the “E” and “L” in ETL but not the “T”.  But now it has the data transformation capability, making ADF the equivalent of “SSIS in the cloud” since it has the ability to mimic SSIS Data Flow business logic.

Data Flow works without you having to write any lines of code as you build the solution by using drag-and-drop features of the ADF interface to perform data transformations like data cleaning, aggregation, and preparation.  Behind the scenes, the ADF JSON code is converted to the appropriate code in the Scala programming language and is prepared, compiled and executed in Azure Databricks, which automatically scales out as needed.
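
As a rough mental model only (this is not the code ADF actually generates), a Derived Column transformation followed by an Aggregate is conceptually similar to the PySpark below, which is the kind of Spark work that runs on the Databricks cluster behind the scenes.  The path and column names are invented for illustration.

```python
# Conceptual PySpark equivalent of two Data Flow operations (illustrative only):
#   Derived Column -> withColumn()
#   Aggregate      -> groupBy().agg()
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.read.parquet("abfss://datalake@mystorageaccount.dfs.core.windows.net/raw/sales/")

result = (
    sales
    .withColumn("LineTotal", F.col("Quantity") * F.col("UnitPrice"))  # Derived Column
    .groupBy("CustomerID")
    .agg(F.sum("LineTotal").alias("TotalSpend"))                      # Aggregate
)
result.show()
```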

There is an excellent chart that shows the SSIS tasks and their equivalent ADF operations.  There are currently 13 different dataset manipulation operations, with more to come:

  • New Branch
  • Join
  • Conditional Split
  • Union
  • Lookup
  • Derived Column
  • Aggregate
  • Surrogate Key
  • Exists
  • Select
  • Filter
  • Sort
  • Extend

The Data Flow feature in ADF is currently in limited public preview.  If you would like to try this out on your Data Factories, please fill out this form to request whitelisting your Azure Subscription for ADF Data Flows: http://aka.ms/dataflowpreview.  Once your subscription has been enabled, you will see “Data Factory V2 (with data flows)” as an option from the Azure Portal when creating Data Factories.

Follow Mark Kromer and the ADF team on Twitter to stay up to date on the rollout of the preview.  Also check out the ADF Data Flow documentation and these ADF Data Flow videos.

Don’t confuse this with Dataflows in Power BI; the two have nothing to do with each other.

More info:

Azure Data Factory v2 and its available components in Data Flows

WHAT IS AZURE DATA FACTORY DATA FLOW?

AZURE DATA FACTORY DATA FLOW PREVIEW: SABERMETRICS EXAMPLE

Posted in Azure Data Factory, SQLServerPedia Syndication | 2 Comments

Should I load structured data into my data lake?

With data lakes becoming very popular, a question I have been hearing often from customers is, “Should I load structured/relational data into my data lake?”.  I talked about this a while back in my blog post What is a data lake? and will expand on it in this blog.  Melissa Coates also talked about this recently, and I used her graphic below to illustrate:

I would not say it’s commonplace to load structured data into the data lake, but I do see it frequently.

In most cases it is not necessary to first copy relational source data into the data lake and then into the data warehouse, especially when keeping in mind the effort to migrate existing ETL jobs that are already copying source data into the data warehouse, but there are some good use cases for doing just that:

  1. Wanting to offload the data refinement to the data lake (usually Hadoop), so the processing and space on the enterprise data warehouse (EDW) is reduced, and you avoid clashing with people running queries/reports on the EDW
  2. Wanting to use some Hadoop technologies/tools to refine/filter data that do the refinement quicker/better than your EDW
  3. It can ingest large files quickly and provide data redundancy
  4. ELT jobs on EDW are taking too long because of increasing data volumes and increasing rate of ingesting (velocity), so offload some of them to the Hadoop data lake
  5. The data lake is a good place for data that you “might” use down the road.  You can land it in the data lake and have users use SQL via Polybase to look at the data and determine if it has value.  Note PolyBase allows end-users to query data in a data lake using regular SQL, so they are not required to learn any Hadoop-related technologies.  PolyBase even allows the end-user to use SQL, or any reporting tool that uses SQL, to join data in a relational database with data in a Hadoop cluster
  6. Have a backup of the raw data in case you need to load it again due to an ETL error (and not have to go back to the source).  You can keep a long history of raw data
  7. A power user/data scientist wants to make use of the structured data in the data lake (usually combining it with non-relational data)
  8. As a quicker way to load data into Azure SQL Data Warehouse via PolyBase from the data lake (which is usually much faster than using SSIS to copy from the source to Azure SQL Data Warehouse)
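
As an example of that last item, here is a hedged sketch of loading a CSV folder from the data lake into Azure SQL DW with PolyBase, run from Python via pyodbc.  The server, database, table, paths, and the DataLakeCredential database-scoped credential are placeholders; check the PolyBase documentation for the options that match your storage setup.

```python
# Sketch: load a CSV folder from the data lake into Azure SQL DW using PolyBase.
# Server, database, data source, paths, and table names are placeholders, and the
# database-scoped credential (DataLakeCredential) is assumed to already exist.
import pyodbc

dw = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=mydwserver.database.windows.net;DATABASE=mydw;"
    "UID=myuser;PWD=mypassword",
    autocommit=True,
)

dw.execute("""
CREATE EXTERNAL DATA SOURCE DataLake
WITH (TYPE = HADOOP,
      LOCATION = 'abfss://datalake@mystorageaccount.dfs.core.windows.net',
      CREDENTIAL = DataLakeCredential);
""")

dw.execute("""
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));
""")

dw.execute("""
CREATE EXTERNAL TABLE dbo.Customer_ext (CustomerID INT, CustomerName NVARCHAR(200))
WITH (LOCATION = '/cleaned/customers/', DATA_SOURCE = DataLake, FILE_FORMAT = CsvFormat);
""")

# CTAS pulls the external data into SQL DW in parallel across the compute nodes
dw.execute("""
CREATE TABLE dbo.Customer
WITH (DISTRIBUTION = ROUND_ROBIN)
AS SELECT * FROM dbo.Customer_ext;
""")
```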

You will have to balance these benefits against the extra work required to export data from a relational source into a format such as CSV, copy it to the data lake (where it can be cleaned with a tool such as Databricks), and then copy it into the relational EDW such as Azure SQL Data Warehouse.  Keep in mind that for the relational data moved to the data lake, you will lose valuable metadata such as data types, constraints, foreign keys, etc.
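
For reference, the export-and-land step can be as simple as the following Python sketch; the server, storage account, credential, and paths are placeholders, and note the comment about losing the relational metadata.

```python
# Sketch: export a SQL Server table to CSV and land it in the raw layer of ADLS Gen2.
# Server, storage account, key, and paths are placeholders.
import pandas as pd
import pyodbc
from azure.storage.filedatalake import DataLakeServiceClient

sql_conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=onpremsql;DATABASE=SalesDB;UID=myuser;PWD=mypassword"
)
# Data types, constraints, and foreign keys are lost once the data is flattened to CSV
df = pd.read_sql("SELECT * FROM dbo.Customer", sql_conn)

lake = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential="<storage-account-key>",
)
file_client = lake.get_file_system_client("datalake").get_file_client("raw/customers/customers.csv")
file_client.upload_data(df.to_csv(index=False), overwrite=True)
```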

I have seen some customers with relational data who bypassed the data lake and had the source systems copy data directly to the EDW (i.e. Azure SQL Data Warehouse) using SSIS/ADF, but still wanted all data in the data lake, so they used Databricks to copy the data warehouse structures into the data lake using the SQL Data Warehouse Spark Connector.  In these cases, the customers used the data lake for static data (i.e. web logs, IoT data streams and older ERP data that will not change) and managed non-static data directly in the EDW, as moving data updates through the data lake to the EDW adds a layer of complexity they wanted to avoid.

I have also seen cases when the customer moved EDW data to Hadoop, refined it, and moved it back to the EDW which gave the benefit of offloading the processing and/or when they needed to use Hadoop tools.

A common practice when I see customers migrate an on-prem EDW solution to the cloud is they don’t use the data lake for their current sources (which are usually all structured data), but will use it for all new sources (which are usually mostly unstructured data).  However, they like the benefits listed above and will slowly modify most of their ETL packages (usually SSIS) to use the data lake, in some cases replacing SSIS with Azure Data Factory.

More info:

When Should We Load Relational Data to a Data Lake?

Posted in Data Lake, SQLServerPedia Syndication | 4 Comments

Premium blob storage

As a follow-up to my blog Azure Archive Blob Storage, Microsoft has released another storage tier called Azure Premium Blob Storage (announcement).  It is in private preview in US East 2, US Central and US West regions.

This is a performance tier in Azure Blob Storage, complementing the existing Hot, Cool, and Archive tiers.  Data in Premium Blob Storage is stored on solid-state drives, which are known for lower latency and higher transactional rates compared to traditional hard drives.

It is ideal for workloads that require very fast access time such as interactive video editing, static web content, and online transactions.  It also works well for workloads that perform many relatively small transactions, such as capturing telemetry data, message passing, and data transformation.

Microsoft internal testing shows that both average and 99th percentile server latency is significantly better than the Hot access tier, providing faster and more consistent response times for both read and write across a range of object sizes.

Premium Blob Storage is available with Locally-Redundant Storage and comes with High-Throughput Block Blobs (HTBB), which provides a) improved write throughput when ingesting larger block blobs, b) instant write throughput, and c) throughput that is unaffected by container and blob names.

You can store block blobs and append blobs in Premium Blob Storage (page blobs are not yet available).  To use Premium Blob Storage you provision a new ‘Block Blob’ storage account in your subscription and start creating containers and blobs using the existing Blob Service REST API and/or any existing tools such as AzCopy or Azure Storage Explorer.
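
For example, with the current Python SDK (azure-storage-blob v12), creating a container and uploading a block blob to a Premium Block Blob account looks the same as it does for any other account; the connection string and names below are placeholders.

```python
# Sketch: create a container and upload a block blob to a (Premium) Block Blob
# storage account using the azure-storage-blob SDK. Names are placeholders;
# the same code works against Hot/Cool accounts.
from azure.storage.blob import BlobServiceClient

conn_str = (
    "DefaultEndpointsProtocol=https;AccountName=mypremiumacct;"
    "AccountKey=<account-key>;EndpointSuffix=core.windows.net"
)
service = BlobServiceClient.from_connection_string(conn_str)

container = service.create_container("telemetry")             # create a container
blob = container.get_blob_client("device01/2018-12-01.json")  # blob path

with open("2018-12-01.json", "rb") as data:
    blob.upload_blob(data, overwrite=True)                    # uploads as a block blob
```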

Premium Blob Storage has a higher data storage cost, but a lower transaction cost, compared to data stored in the regular Hot tier.  This means it can be less expensive overall for workloads with very high transaction rates.  Check out the pricing page for more details.

At present data stored in Premium cannot be tiered to Hot, Cool or Archive access tiers.  Microsoft is working on supporting object tiering in the future.  To move data, you can synchronously copy blobs using the new PutBlockFromURL API (sample code) or a version of AzCopy that supports this API.  PutBlockFromURL synchronously copies data server side, which means that the data has finished copying when the call completes and all data movement happens inside Azure Storage.

Check out the announcement on how to sign up for the private preview.

To summarize the four storage tiers:

  • Premium storage (preview) provides high performance hardware for data that is accessed frequently
  • Hot storage is optimized for storing data that is accessed frequently
  • Cool storage is optimized for storing data that is infrequently accessed and stored for at least 30 days
  • Archive storage is optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements (on the order of hours)

More info:

Premium Block Blob Storage – a new level of performance

Posted in Azure, SQLServerPedia Syndication | 1 Comment

SQL Server 2019 Big Data Clusters

At the Microsoft Ignite conference, Microsoft announced that SQL Server 2019 is now in preview and that SQL Server 2019 will include Apache Spark and Hadoop Distributed File System (HDFS) for scalable compute and storage.  This new architecture that combines the SQL Server database engine, Spark, and HDFS into a unified data platform is called a “big data cluster”, deployed as containers on Kubernetes.  Big data clusters can be deployed in any cloud where there is a managed Kubernetes service, such as Azure Kubernetes Service (AKS), or in on-premises Kubernetes clusters, such as AKS on Azure Stack.  The SQL Server 2019 relational database engine in a big data cluster leverages an elastically scalable storage layer that integrates SQL Server and HDFS to scale to petabytes of data storage.  The Spark engine is now part of SQL Server:

While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer.  The virtual data layer allows users to query data from many sources through a single, unified interface.  Access to sensitive data sets can be controlled from a single location.  The delays inherent to ETL need not apply; data can always be up to date.  Storage costs and data governance complexity are minimized.  See the pros and cons of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.

SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:

In SQL Server 2019 big data clusters, the SQL Server engine has gained the ability to natively read HDFS files, such as CSV and parquet files, by using SQL Server instances collocated on each of the HDFS data nodes that can filter and aggregate data locally in parallel across all of the HDFS data nodes.

Performance of PolyBase queries in SQL Server 2019 big data clusters can be boosted further by distributing the cross-partition aggregation and shuffling of the filtered query results to “compute pools” comprised of multiple SQL Server instances that work together (this is similar to a PolyBase scale-out group).

When you combine the enhanced PolyBase connectors with SQL Server 2019 big data clusters data pools, data from external data sources can be partitioned and cached across all the SQL Server instances in a data pool, creating a “scale-out data mart”.  There can be more than one scale-out data mart in a given data pool, and a data mart can combine data from multiple external data sources and tables, making it easy to integrate and cache combined data sets from multiple external sources.  This will also be a great solution for importing IoT data.

SQL Server 2019 big data clusters make it easier for big data sets to be joined to the data stored in the enterprise relational database, enabling people and apps that use SQL Server to query big data more easily.  The value of the big data greatly increases when it is not just in the hands of the data scientists and big data engineers but is also included in reports, dashboards, and applications used by regular end users.  At the same time, the data scientists can continue to use big data ecosystem tools against HDFS while also utilizing easy, real-time access to the high-value data in SQL Server because it is all part of one integrated, complete system.

Azure Data Studio (previously released under the name of SQL Operations Studio) is an open-source, multi-purpose data management and analytics tool for DBAs, data scientists, and data engineers.  New extensions for Azure Data Studio integrate the user experience for working with relational data in SQL Server with big data.  The new HDFS browser lets analysts, data scientists, and data engineers easily view the HDFS files and directories in the big data cluster, upload/download files, open them, and delete them if needed.  The new built-in notebooks in Azure Data Studio are built on Jupyter, enabling data scientists and engineers to write Python, R, or Scala code with Intellisense and syntax highlighting before submitting the code as Spark jobs and viewing the results inline.  Notebooks facilitate collaboration between teammates working on a data analysis project together.  Lastly, the External Table Wizard, which uses PolyBase connectors, simplifies the process of creating external data sources and tables, including column mappings (it’s much easier than the current way of creating external tables).
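
For example, a notebook cell submitted as a Spark job against the cluster’s HDFS might look like the PySpark below; the file path and column names are hypothetical.

```python
# PySpark cell in an Azure Data Studio notebook, run as a Spark job in the
# big data cluster. The HDFS path and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

clicks = spark.read.option("header", True).csv("/clickstream/web_clickstreams.csv")

top_pages = (
    clicks.groupBy("page_id")
          .agg(F.count("*").alias("hits"))   # hits per page
          .orderBy(F.desc("hits"))
          .limit(10)
)
top_pages.show()
```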

There will also be a management service that provisions agents on each pod to collect monitoring data and logs, which can be viewed via a browser-based cluster admin portal; that portal will also provide managed services for HA, backup/recovery, security, and provisioning.

In summary, SQL Server 2019 Big Data Clusters improve the four V’s of big data: volume, variety, velocity, and veracity.

More info:

Introducing Microsoft SQL Server 2019 Big Data Clusters

What are SQL Server 2019 big data clusters?

SQL Server 2019 Big Data Clusters white paper

Video New Feature Offerings in SQL Server 2019

Video SQL Server vNext meets AI and Big Data

Video The future of SQL Server and big data

Video Deep dive on SQL Server and big data

Posted in Big Data, PolyBase, SQL Server 2019, SQLServerPedia Syndication | 1 Comment

Azure SQL Database Hyperscale

At Microsoft Ignite, one of the announcements was for Azure SQL Database Hyperscale, which was made available in public preview October 1st, 2018 in 12 different Azure regions.  SQL Database Hyperscale is a new SQL-based and highly scalable service tier for single databases that adapts on-demand to your workload’s needs.  With SQL Database Hyperscale, databases can quickly auto-scale up to 100TB, eliminating the need to pre-provision storage resources, and significantly expanding the potential for app growth without being limited by storage size.  Check out the documentation.

Compared to current Azure SQL Database service tiers, Hyperscale provides the following additional capabilities:

  • Support for up to 100 TB of database size
  • Nearly instantaneous database backups (based on file snapshots stored in Azure Blob storage) regardless of size with no IO impact on Compute
  • Fast database restores (based on file snapshots) in minutes rather than hours or days (not a size of data operation)
  • Higher overall performance due to higher log throughput and faster transaction commit times regardless of data volumes
  • Rapid scale out – you can provision one or more read-only nodes for offloading your read workload and for use as hot-standbys
  • Rapid Scale up – you can, in constant time, scale up your compute resources to accommodate heavy workloads as and when needed, and then scale the compute resources back down when not needed

The Hyperscale service tier removes many of the practical limits traditionally seen in cloud databases.  Where most other databases are limited by the resources available in a single node, databases in the Hyperscale service tier have no such limits.  With its flexible storage architecture, storage grows as needed.  In fact, Hyperscale databases aren’t created with a defined max size.  A Hyperscale database grows as needed – and you are billed only for the capacity you use.  Storage is dynamically allocated between 5 GB and 100 TB, in 1 GB increments.  For read-intensive workloads, the Hyperscale service tier provides rapid scale-out by provisioning additional read replicas as needed for offloading read workloads.

The Hyperscale service tier is primarily intended for customers who have large databases on-premises and want to modernize their applications by moving to the cloud, or for customers who are already in the cloud and are limited by the maximum database size restrictions (1-4 TB).  It is also intended for customers who seek high performance and high scalability for storage and compute.

The Hyperscale service tier supports all SQL Server workloads, but it is primarily optimized for OLTP.  The Hyperscale service tier also supports hybrid and analytical (data mart) workloads.

It is available under the vCore-based purchasing options for SQL Database (it is not available yet for SQL Database Managed Instance).
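
If you prefer T-SQL over the portal, a Hyperscale database can be created (or an existing single database moved to the tier) by specifying a Hyperscale service objective.  A hedged sketch run from Python follows; the server, credentials, and the exact HS_* service objective are placeholders, so check the documentation for the values valid in your region.

```python
# Sketch: create a Hyperscale database with T-SQL from Python.
# Server, credentials, and service objective are placeholders; check the current
# documentation for valid HS_* service objectives.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=master;"
    "UID=myadmin;PWD=mypassword",
    autocommit=True,  # CREATE/ALTER DATABASE cannot run inside a transaction
)

conn.execute(
    "CREATE DATABASE HyperscaleDemo "
    "(EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_4');"
)

# An existing single database can also be moved into the Hyperscale tier
# (at the time of writing this is a one-way migration):
# conn.execute("ALTER DATABASE MyDb MODIFY (EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_4');")
```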

Azure SQL Database Hyperscale is built based on a new cloud-born architecture which decouples compute, log and storage.

A Hyperscale database contains the following different types of nodes:

Compute nodes

The compute nodes look like a traditional SQL Server, but without local data files or log files.  The compute node is where the relational engine lives, so all the language elements, query processing, and so on, occur here.  All user interactions with a Hyperscale database happen through these compute nodes.  Compute nodes have SSD-based caches (labeled RBPEX – Resilient Buffer Pool Extension in the preceding diagram) to minimize the number of network round trips required to fetch a page of data.  There is one primary compute node where all the read-write workloads and transactions are processed.  There are one or more secondary compute nodes that act as hot standby nodes for failover purposes, as well as act as read-only compute nodes for offloading read workloads (if this functionality is desired).

Log service

The log service externalizes the transactional log from a Hyperscale database.  The log service node accepts log records from the primary compute node, persists them in a durable log cache, and forwards the log records to the rest of the compute nodes (so they can update their caches) as well as the relevant page server(s), so that the data can be updated there.  In this way, all data changes from the primary compute node are propagated through the log service to all the secondary compute nodes and page servers.  Finally, the log record(s) are pushed out to long-term storage in Azure Standard Storage, which is an infinite storage repository.  This mechanism removes the necessity for frequent log truncation.  The log service also has local cache to speed up access, so you have end-to-end log latencies in the 2ms range.  When the Azure Ultra SSD storage technology becomes available, this latency will shrink to around 0.4ms for remote, fully durable and resilient data storage.

Page servers

The page servers host and maintain the data files.  They consume the log stream from the log service and apply the data modifications described in the log stream to the data files.  Read requests for data pages that are not found in the compute node’s local data cache or RBPEX are sent over the network to the page servers that own the pages.  In page servers, the data files are persisted in Azure Storage and are heavily cached through RBPEX (SSD-based caches).

Page servers are systems representing a scaled-out storage engine.  Multiple page servers will be created for a large database.  When the database is growing and available space in existing page servers is lower than a threshold, a new page server is automatically added to the database.  Since page servers work independently, the database can grow with no local resource constraints.  Each page server is responsible for a subset of the pages in the database.  Nominally, each page server controls one terabyte of data.  No data is shared on more than one page server (outside of replicas that are kept for redundancy and availability).  The job of a page server is to serve database pages out to the compute nodes on demand, and to keep the pages updated as transactions update data.  Page servers are kept up-to-date by playing log records from the log service.  Long-term storage of data pages is kept in Azure Standard Storage for additional reliability.

Azure standard storage node

The Azure storage node is the final destination of data from page servers.  This storage is used for backup purposes as well as for replication between Azure regions.  Backups consist of snapshots of data files.  Restore operations are fast from these snapshots, and data can be restored to any point in time.

Automated backup and point in time restore

In a Hyperscale database, snapshots of the data files are taken from the page servers periodically to replace the traditional streaming backup.  This allows for a backup of a very large database in just a few seconds.  Together with the log records stored in the log service, you can restore the database to any point in time during retention (7 days in public preview and will be configurable up to 35 days at general availability) in a very short time, regardless of the database size (a 50TB database takes just 8 minutes to restore!).

Since the backups are file-snapshot based, they are nearly instantaneous.  Storage and compute separation enables pushing down the backup/restore operation to the storage layer to reduce the processing burden on the primary compute node.  As a result, the backup of a large database does not impact the performance of the primary compute node.  Similarly, restores are done by copying the file snapshot and as such are not a size-of-data operation.  For restores within the same storage account, the restore operation is fast.

Highly Available Components

Each of the components in the Azure SQL Database Hyperscale architecture is designed to be highly available so that there will be no interruptions in database access:

  • Compute nodes each have at least one replica running and hot at all times.  In the event of a compute node failure or failover, the replica would immediately take over the Primary role, keeping the database online and available.  A replacement replica can be started up very quickly, and can warm up its caches as a background task without impacting production performance.  Other replicas may be configured for read scale-out purposes.  With this architecture, the compute nodes are effectively stateless
  • Page servers each have a standby replica online with their RBPEX cache fully populated, so they’re available to take over for the active page server in the event of failure.  Again, because they’re stateless outside of cached data, a replacement can be online very quickly, without any risk of data loss
  • The log service doesn’t have a hot standby, but this is unnecessary as the log service has no cached data and can be replaced as quickly as you could failover to a standby replica.  The data that’s managed by the log service resides first in the landing zone, which is resilient Azure Premium Storage (soon Ultra SSD storage), and is ultimately persisted in Azure Standard storage for long-term retention

So, as you can see, there’s no component that represents a single point of failure. All Hyperscale components have active standbys, with the exception of the log service, and all data is backed by fully resilient Azure storage.

More info:

Video New performance and scale enhancements for Azure SQL Database

Introducing Azure SQL Database Hyperscale

Posted in Azure SQL Database, SQLServerPedia Syndication | 1 Comment

Azure Database Migration Service (DMS)

As I first mentioned in my blog Microsoft database migration tools, the Azure Database Migration Service (DMS) is a PaaS solution that makes it easy to migrate from on-prem/RDS to Azure and from one database type to another.  I’ll give a brief overview of some of the key features:

The first thing you will do is create an instance of Azure Database Migration Service, which basically reserves compute power in a region (make sure it’s the region where your destination databases will be – there is a limit of two DMS instances, but you can email support to have it increased).  DMS allows you to choose an already existing VNET with connectivity to your source, or create a basic VNET that can connect to your source servers that have public facing IPs (or are under the same VNET as the DMS service or are accessible via VNET peering or tunneling).  In case your migration target is SQL Database Managed Instance (SQL MI), this also needs to be the VNET where your SQL MI instance is located (you can skip this step if migrating to SQL Database Single since it doesn’t support VNET).  Note the usual case is to connect to an existing VNET that has connectivity to the source and target so you don’t have to create a new VNET when creating a DMS.  After you create the DMS, you can create a migration project and run migration activities.

Note that DMS now supports using existing SQL Server backup files for migrations from SQL Server to SQL MI, saving time and making the overall migration process easier to perform.

When creating a migration project, you will choose either “online data migration” or “offline data migration” to refer to migrations with and without ongoing replication, respectively.

Offline migrations do a backup & restore in the SQL to SQL VM and SQL to SQL MI scenarios.  For SQL to SQL Database Single, DMS copies the data itself (schema migration support is to be added soon) using a home-built DMS streaming pipeline that uses bulk copy APIs, which has some of the best throughput compared to existing technologies.  The same tech ships in Database Migration Assistant (DMA), but DMS is more reliable and scalable.

Online migrations (also known as continuous migrations, minimal downtime, continuous sync) uses tech that is based on reading the logs and streaming the data (for migrations to SQL Database Single it syncs via transactional replication and you must use DMA first to migrate the schema which may need changes).  In private preview is a new pipeline for migrations to SQL MI which will be based on log shipping.

Here is the list of all the current migration options (as of December 2018):

The new Database Migration Guide is for enterprise customers, partners, and business decision makers who are interested in moving to Azure cloud services (i.e. migrating from Oracle or SQL Server to Azure Data Services).  The Database Migration Guide provides comprehensive, step-by-step guidance for performing migrations, as well as improves the discoverability of the guidance, tools, software, and programs that are available to assist customers in performing these migrations.  Also, this white paper​ will guide you through the thought process and steps required to migrate your database workloads from on-premises to Azure-based cloud services.

Currently Microsoft does not have assessment rules in DMA specifically for SQL MI but it should be available soon.

More info:

Migrating and modernizing your data estate to Azure with Data Migration Services : Build 2018

Migration from SQL Server to Azure SQL Database Managed Instance

Posted in Database Migration Service, SQLServerPedia Syndication | 1 Comment

Azure SQL Database Read Scale-Out

Read Scale-Out is a little-known feature that allows you to load balance Azure SQL Database read-only workloads using the capacity of read-only replicas, for free.

As mentioned in my blog Azure SQL Database high availability, each database in the Premium tier (DTU-based purchasing model) or in the Business Critical tier (vCore-based purchasing model) is automatically provisioned with several AlwaysOn read-only replicas using synchronous-commit mode to support the availability SLA of 99.99% (these AlwaysOn replicas are created automatically even if you are not using geo-replication).  These replicas are provisioned with the same performance level as the read-write replica used by the regular database connections.  The Read Scale-Out feature allows you to load balance SQL Database read-only workloads using the capacity of one of the read-only replicas instead of all queries hitting the read-write replica.  This way the read-only workload will be isolated from the main read-write workload and will not affect its performance.  This feature is intended for applications that include logically separated read-only workloads, such as analytics, and therefore could gain performance benefits using this additional capacity at no extra cost.

I highlighted “one” above to bring attention to the fact that only one replica is used, meaning it does not use multiple read-only replicas and load balance between them.

Another option for read-only workloads: if you also decide to use geo-replication (which is not free), this will create secondary databases (currently up to four) using asynchronous-commit mode that can be made readable, and you can direct connections to each of those secondaries in the connection string and do your own load balancing between them.  For more info on geo-replication see my blog Azure SQL Database disaster recovery.

And if you are using Read Scale-Out to load balance read-only workloads on a database that is geo-replicated (e.g. as a member of a failover group), make sure that Read Scale-Out is enabled on both the primary and the geo-replicated secondary databases.  This will ensure the same load-balancing effect when your application connects to the new primary after failover.

To read how to enable Read Scale-Out and send queries to the read-only replica, check out Use read-only replicas to load balance read-only query workloads (preview).
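
In practice, routing a query to the read-only replica is just a connection-string change (ApplicationIntent=ReadOnly).  A minimal Python sketch, with server, database, and credentials as placeholders:

```python
# Sketch: send a read-only query to the read-only replica by setting
# ApplicationIntent=ReadOnly in the connection string. Names are placeholders.
import pyodbc

read_only_conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=reportuser;PWD=mypassword;"
    "ApplicationIntent=ReadOnly"   # routes the session to a read-only replica
)

# Verify which replica the session landed on: READ_ONLY means the replica
row = read_only_conn.execute(
    "SELECT CAST(DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS NVARCHAR(20))"
).fetchone()
print(row[0])
```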

More info:

Overview: Active geo-replication and auto-failover groups

Create linked server to readable secondary replica in Managed Instance Business Critical service tier

Azure SQL DB Readable Secondary

Posted in Azure SQL Database, SQLServerPedia Syndication | 2 Comments

Azure SQL Database disaster recovery

My last blog post was on Azure SQL Database high availability and I would like to continue along that discussion with a blog post about disaster recovery in Azure SQL Database.  First, a clarification on the difference between high availability and disaster recovery:

High Availability (HA) – Keeping your database up 100% of the time with no data loss during common problems.  Redundancy at system level, focus on failover, addresses single predictable failure, focus is on technology.  SQL Server IaaS would handle this with:

  • Always On Failover cluster instances
  • Always On Availability Groups (in same Azure region)
  • SQL Server data files in Azure

Disaster Recovery (DR) – Protection if major disaster or unusual failure wipes out your database.  Use of alternate site, focus on re-establishing services, addresses multiple failures, includes people and processes to execute recovery.  Usually includes HA also.  SQL Server IaaS would handle this with:

  • Log Shipping
  • Database Mirroring
  • Always On Availability Groups (different Azure regions)
  • Backup to Azure

Azure SQL Database makes setting up disaster recovery so much easier than SQL Server IaaS (in a VM).  Disaster recovery is done via active geo-replication, which is an Azure SQL Database feature that allows you to create readable replicas of your database in the same or different data center (region).  All it takes is navigating to this page and choosing the region to create a secondary database (this example of active geo-replication is configured with a primary in the North Central US region and secondary in the South Central US region):

Once created, the secondary database is populated with the data copied from the primary database.  This process is known as seeding.  After the secondary database has been created and seeded, updates to the primary database are asynchronously replicated to the secondary database automatically.  Asynchronous replication means that transactions are committed on the primary database before they are replicated to the secondary database.
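
Active geo-replication can also be configured with T-SQL (or PowerShell/ARM templates) instead of the portal.  Here is a hedged sketch run from Python against the primary server’s master database; the server and database names are placeholders, and the secondary server must already exist in the target region.

```python
# Sketch: add a geo-replication secondary with T-SQL, run against the *primary*
# server's master database. Server and database names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=servernamenorthcentralus.database.windows.net;DATABASE=master;"
    "UID=myadmin;PWD=mypassword",
    autocommit=True,
)

conn.execute(
    "ALTER DATABASE MyDb "
    "ADD SECONDARY ON SERVER servernamesouthcentralus "
    "WITH (ALLOW_CONNECTIONS = ALL);"   # readable secondary
)

# Later, a planned failover is initiated from the *secondary* server's master database:
# ALTER DATABASE MyDb FAILOVER;
```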

Compare this to setting up AlwaysOn Availability Groups!  And then think about the time it takes to monitor and maintain AlwaysOn Availability Groups, stuff that you won’t have to worry about anymore, and you can see why Azure SQL database is such a pleasure.

Active geo-replication is designed as a business continuity solution that allows an application to perform quick disaster recovery in case of a data center scale outage.  If geo-replication is enabled, the application can initiate a failover to a secondary database in a different Azure region.  Up to four secondaries are supported in the same or different regions, and the secondaries can also be used for read-only access queries.  The failover can be initiated manually by the application or the user.  After failover, the new primary has a different connection end point.  As each secondary is a discrete database with the same name as the primary but in a different server you will need to reconfigure your application(s) with an updated connection string.

Each geo-replicated database will be on a server in a region with server names you give them such as ServerNameWestUS.database.windows.net (primary) and ServerNameEastUS.database.windows.net (secondary).  If you want to load balance between them in your application for read queries, if the application itself is load balanced between regions you can use Azure Traffic Manager.  If the application itself is not geo-replicated, then use Azure Load Balancer.

Auto-failover groups is an extension of active geo-replication.  It is designed to manage the failover of multiple geo-replicated databases simultaneously using an application initiated failover or by delegating failover to be done by the SQL Database service based on a user defined criteria.  The latter allows you to automatically recover multiple related databases in a secondary region after a catastrophic failure or other unplanned event that results in full or partial loss of the SQL Database service’s availability in the primary region.  Because auto-failover groups involve multiple databases, these databases must be configured on the primary server.  Both primary and secondary servers for the databases in the failover group must be in the same subscription.  Auto-failover groups support replication of all databases in the group to only one secondary server in a different region.

If you are using active geo-replication and for any reason your primary database fails, or simply needs to be taken offline, you can initiate failover to any of your secondary databases.  When failover is activated to one of the secondary databases, all other secondaries are automatically linked to the new primary.  If you are using auto-failover groups to manage database recovery, any outage that impacts one or several of the databases in the group results in automatic failover.  You can configure the auto-failover policy that best meets your application needs, or you can opt out and use manual activation.  In addition, auto-failover groups provide read-write and read-only listener end-points that remain unchanged during failovers.  Whether you use manual or automatic failover activation, failover switches all secondary databases in the group to primary.  After the database failover is completed, the DNS record is automatically updated to redirect the end-points to the new region.

Active geo-replication leverages the Always On technology of SQL Server to asynchronously replicate committed transactions on the primary database to a secondary database using snapshot isolation.  The primary and secondary instances in a geo-replication relationship have independent HA capabilities, the same as a standalone instance would have.  Auto-failover groups provide the group semantics on top of active geo-replication but the same asynchronous replication mechanism is used.  While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to never have partial transactions.  Cross-region redundancy enables applications to quickly recover from a permanent loss of an entire datacenter or parts of a datacenter caused by natural disasters, catastrophic human errors, or malicious acts.  The specific recovery point objective (RPO) data can be found at Overview of Business Continuity (The time period of updates that you might lose is under 5 seconds).

More info:

Data Integrity in Azure SQL Database

High-availability and Azure SQL Database

Overview of business continuity with Azure SQL Database

Overview: Active geo-replication and auto-failover groups

Designing globally available services using Azure SQL Database

Spotlight on SQL Database Active Geo-Replication

Azure SQL Database Business Continuity Enhancements

Posted in Azure SQL Database, SQLServerPedia Syndication | 3 Comments

Azure SQL Database high availability

In this blog I want to talk about how Azure SQL Database achieves high availability.  One of the major benefits from moving from on-prem SQL Server to Azure SQL Database is how much easier it is to have high availability – no need for creating and managing a SQL Server failover cluster, AlwaysOn availability groups, database mirroring, log shipping, SAN replication, etc.

Azure SQL Database is a highly available database Platform as a Service that guarantees that your database is up and running 99.99% of the time, without you having to worry about maintenance and downtime.  This is a fully managed SQL Server Database Engine process hosted in the Azure cloud that ensures your SQL Server database is always upgraded/patched without affecting your workload.  Azure automatically handles patching, backups, replication, failure detection, underlying hardware, software or network failures, deploying bug fixes, failovers, database upgrades, and other maintenance tasks.  Azure SQL Database can quickly recover even in the most critical circumstances, ensuring that your data is always available.

Azure SQL Database is based on the SQL Server Database Engine architecture that is adjusted for the cloud environment in order to ensure 99.99% availability even in the cases of infrastructure failures.  There are two high-availability architectural models that are used in Azure SQL Database (both of them ensuring 99.99% availability):

(NOTE: Basic/Standard/Premium are service tiers that are DTU-based and used only for SQL Database Single, and General Purpose/Business Critical are vCore-based and used for both SQL Database Single and SQL Database Managed Instance)

  • Basic/Standard/General Purpose model that is based on remote storage.  This architectural model relies on high availability and reliability of the storage tier, but it might have some potential performance degradation during maintenance activities.  This model uses Azure Premium Storage Disks
  • Premium/Business Critical model that is based on a cluster of database engine processes.  This architectural model relies on the fact that there is always a quorum of available database engine nodes, and it has minimal performance impact on your workload even during maintenance activities.  This model uses AlwaysOn Availability Groups and locally attached SSD storage, and provides higher IOPS and throughput than Basic/Standard/General Purpose
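
If you are not sure which of these two models applies to a given database, you can check its edition and service objective with T-SQL.  A quick Python sketch, with connection details as placeholders:

```python
# Sketch: check the edition/service objective of the current database, which tells
# you whether the remote-storage or the Business Critical/Premium HA model applies.
# Connection details are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=myuser;PWD=mypassword"
)

row = conn.execute("""
    SELECT d.name, so.edition, so.service_objective
    FROM sys.databases AS d
    JOIN sys.database_service_objectives AS so ON so.database_id = d.database_id
    WHERE d.name = DB_NAME();
""").fetchone()

print(row[0], row[1], row[2])   # e.g. mydb Premium P2
```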

Azure SQL Database runs on the latest stable version of SQL Server Database Engine and Windows OS, and most of the users would not notice that the upgrades are performed continuously.

More details on these two options:

Basic/Standard/General Purpose

High availability in these service tiers is achieved by separation of compute and storage layers and the replication of data in the storage tier (which uses Azure Premium Storage):

There are two layers:

  • Active compute nodes: A stateless compute layer that runs the sqlservr.exe process and contains only transient and cached data (for example – plan cache, buffer pool, column store pool).  This stateless SQL Server node is operated by Azure Service Fabric, which initializes the process, controls the health of the node, and performs failover to another node if necessary
  • Azure Storage accounts: A stateful data layer with database files (.mdf/.ldf) that are stored in Azure Premium Storage Disks, which is remote storage (i.e. it is accessed over the network, using Azure network infrastructure).  It is able to use Azure Premium Storage by taking advantage of SQL Server’s native capability to use database files directly in Azure Blob Storage.  This means that there is not a disk or a network share that hosts database files; instead, the file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage.  Azure Storage guarantees that there will be no data loss of any record that is placed in any database file (since three copies of the data are made via LRS).  Azure Storage has built-in data availability/redundancy that ensures that every record in the log file or page in the data file will be preserved even if the SQL Server process crashes.  Note the tempdb database does not use Azure Premium Storage; rather, it is located on the local SSD storage, which provides very low latency and high IOPS/throughput

Whenever the database engine or operating system is upgraded, or some part of underlying infrastructure fails, or if some critical issue is detected in the SQL Server process, Azure Service Fabric will move the stateless SQL Server process to another stateless compute node.  There is a set of redundant (“spare”) nodes that is waiting to run new compute service in case of failover in order to minimize failover time.  Data in the Azure storage layer is not affected, and data/log files are attached to the newly initialized SQL Server process.  Failover time can be measured in seconds.  This process guarantees 99.99% availability, but it might have some performance impacts on a heavy workload that is running due to transition time and the fact the new SQL Server node starts with cold cache.

See Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose) for info on performance improvement.

Premium/Business Critical

High availability in these service tiers is designed for intensive workloads that cannot tolerate any performance impact due to the ongoing maintenance operations.

In the Premium/Business Critical model, Azure SQL Database integrates compute and storage on a single node.  High availability in this architectural model is achieved by replication of compute (the SQL Server Database Engine process) and storage (locally attached SSD) deployed in an Always On Availability Groups cluster with enough replicas to achieve quorum and provide HA guarantees (currently 4 nodes as shown below):


Both the SQL Server Database Engine process and the underlying mdf/ldf files are placed on the same node with locally attached SSD storage providing low latency to your workload.  High availability is implemented using Always On Availability Groups.  Every database is a cluster of database nodes with one primary database that is accessible for the customer workload, and three secondary processes containing copies of data.  The primary node constantly pushes the changes to secondary nodes in order to ensure that the data is available on secondary replicas if the primary node crashes for any reason.  Failover is handled by the Azure Service Fabric – one secondary replica becomes the primary node and a new secondary replica is created to ensure enough nodes in the cluster, and the workload is automatically redirected to the new primary node.  Failover time is measured in milliseconds for most workloads, and the new primary instance is immediately ready to continue serving requests.

A note on the difference in the handling of a failover compared to on-prem: The database engine cannot control the failover because it may not be running when a failover has to occur, i.e. it may have just crashed.  Failover has to be initiated by a component external to the database engine.  For traditional SQL Server, this component is Windows Failover Clustering.  For SQL DB and MI, this is Service Fabric.

IO Performance difference:

vCore model (from here):

  • General Purpose – IO throughput (approximate): Singleton Database: 500 IOPS per vCore with 7000 maximum IOPS; Managed Instance: 500-7500 IOPS per data file (depends on size of file)
  • Business Critical – IO throughput (approximate): 5000 IOPS per core with 200000 maximum IOPS

DTU model (from here):

  • Basic – IO throughput (approximate): 2.5 IOPS per DTU; IO latency (approximate): 5 ms (read), 10 ms (write)
  • Standard – IO throughput (approximate): 2.5 IOPS per DTU; IO latency (approximate): 5 ms (read), 10 ms (write)
  • Premium – IO throughput (approximate): 48 IOPS per DTU; IO latency (approximate): 2 ms (read/write)

Note Managed Instance only supports the vCore model.

A word about storage: Azure SQL Database Singleton and Managed Instance both use Azure storage page blobs as the underlying persistent storage for its databases.  Azure premium managed disks are just premium page blobs with some API make-up so they look like disks.  Azure storage page blobs do not officially announce performance numbers per size while Azure premium managed disks announce this info, so that is why tables like this will show performance numbers for disks and not page blobs.

A word about failover times:  In the Business Critical case, there is a secondary replica that is an exact read-only copy of the primary instance, so failover is just a switch to a new IP address and it is almost instant.  In more realistic cases, there is always some lag in the secondary replica because it is constantly redoing transaction log records that are sent from the primary node.  Failover time is equal to the time needed to apply all remaining transaction log records to become consistent with the primary node, and then the switch to the new IP is completed.  Under a heavy workload that saturates both primary and secondary replicas there is a chance that the secondary cannot immediately catch up to the primary, so the log redo time might be even longer.  The exact time depends on the workload and there are no official numbers or formula to calculate this.  In the General Purpose case, there is a stateless compute node ready to run sqlservr.exe that attaches the .mdf/.ldf files from the remote storage.  This is a cold-cache process that must be initialized, so failover time is longer than Business Critical.  Failover time depends on the database size and also can vary.

Finally, if you are interested in how Microsoft manages data integrity for Azure SQL Database, check out Data Integrity in Azure SQL Database.

More info:

High-availability and Azure SQL Database

Overview of business continuity with Azure SQL Database

Overview: Active geo-replication and auto-failover groups

Reaching Azure disk storage limit on General Purpose Azure SQL Database Managed Instance

File layout in General Purpose Azure SQL Managed Instance

What is built-in intelligence in Azure SQL Database?

Posted in Azure SQL Database, SQLServerPedia Syndication | 2 Comments