Should I load structured data into my data lake?

With data lakes becoming very popular, a question I often hear from customers is, “Should I load structured/relational data into my data lake?”  I talked about this a while back in my blog post What is a data lake? and will expand on it in this blog.  Melissa Coates also talked about this recently, and I used her graphic below to illustrate:

I would not say it’s commonplace to load structured data into the data lake, but I do see it frequently.

In most cases it is not necessary to first copy relational source data into the data lake and then into the data warehouse, especially when you keep in mind the effort of migrating existing ETL jobs that already copy source data into the data warehouse.  But there are some good use cases for doing just that:

  1. Wanting to offload the data refinement to the data lake (usually Hadoop), so the processing and space on the enterprise data warehouse (EDW) is reduced, and you avoid clashing with people running queries/reports on the EDW
  2. Wanting to use some Hadoop technologies/tools to refine/filter data that do the refinement quicker/better than your EDW
  3. It can ingest large files quickly and provide data redundancy
  4. ELT jobs on the EDW are taking too long because of increasing data volumes and an increasing rate of ingestion (velocity), so offload some of them to the Hadoop data lake
  5. The data lake is a good place for data that you “might” use down the road.  You can land it in the data lake and have users use SQL via PolyBase to look at the data and determine if it has value.  Note that PolyBase allows end users to query data in a data lake using regular SQL, so they are not required to learn any Hadoop-related technologies.  PolyBase even allows the end user to use SQL, or any reporting tool that uses SQL, to join data in a relational database with data in a Hadoop cluster (see the sketch after this list)
  6. Have a backup of the raw data in case you need to load it again due to an ETL error (and not have to go back to the source).  You can keep a long history of raw data
  7. A power user/data scientist wants to make use of the structured data in the data lake (usually combining it with non-relational data)
  8. As a quicker way to load data into Azure SQL Data Warehouse via PolyBase from the data lake (which is usually much faster than using SSIS to copy from the source to Azure SQL Data Warehouse)
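
To make the PolyBase option in #5 concrete, here is a minimal sketch of exposing a CSV file in the data lake as an external table and joining it to relational data.  All names, paths, and the storage account are illustrative (not from any specific environment), and depending on your setup you may also need a database scoped credential:

```sql
-- Minimal PolyBase sketch (illustrative names/paths; a database scoped
-- credential may be required if the storage container is not public).
CREATE EXTERNAL DATA SOURCE MyDataLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://raw@mystorageaccount.blob.core.windows.net'
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- Expose the raw files as a table; no data is moved until the table is queried.
CREATE EXTERNAL TABLE dbo.SalesStaging
(
    SaleId      INT,
    CustomerId  INT,
    SaleAmount  DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/2018/', DATA_SOURCE = MyDataLake, FILE_FORMAT = CsvFormat);

-- End users can join lake data with relational data using regular SQL.
SELECT c.CustomerName, SUM(s.SaleAmount) AS TotalSales
FROM dbo.SalesStaging AS s
JOIN dbo.Customer AS c ON c.CustomerId = s.CustomerId
GROUP BY c.CustomerName;
```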

You will have to balance these benefits against the extra work required to export data from a relational source into a format such as CSV and copy it to the data lake (where it can be cleaned with a tool such as Databricks and then copied into the relational EDW, such as Azure SQL Data Warehouse).  Keep in mind that relational data moved to the data lake loses valuable metadata such as data types, constraints, and foreign keys.

I have seen some customers with relational data who bypassed the data lake and had the source systems copy data directly to the EDW (i.e. Azure SQL Data Warehouse) using SSIS/ADF, but who still wanted all data in the data lake, so they used Databricks to copy the data warehouse structures into the data lake using the SQL Data Warehouse Spark Connector.  In these cases, the customers used the data lake for static data (e.g. web logs, IoT data streams, and older ERP data that will not change) and managed non-static data directly in the EDW, as moving data updates through the data lake to the EDW adds a layer of complexity they wanted to avoid.

I have also seen cases where the customer moved EDW data to Hadoop, refined it, and moved it back to the EDW, which gave them the benefit of offloading the processing and/or the ability to use Hadoop tools.

A common practice I see when customers migrate an on-prem EDW solution to the cloud is that they don’t use the data lake for their current sources (which are usually all structured data), but will use it for all new sources (which are usually mostly unstructured data).  However, they like the benefits listed above and will slowly modify most of their ETL packages (usually SSIS) to use the data lake, in some cases replacing SSIS with Azure Data Factory.

More info:

When Should We Load Relational Data to a Data Lake?

Posted in Data Lake, SQLServerPedia Syndication

Premium blob storage

As a follow-up to my blog Azure Archive Blob Storage, Microsoft has released another storage tier called Azure Premium Blob Storage (announcement).  It is in private preview in US East 2, US Central and US West regions.

This is a performance tier in Azure Blob Storage, complementing the existing Hot, Cool, and Archive tiers.  Data in Premium Blob Storage is stored on solid-state drives, which are known for lower latency and higher transactional rates compared to traditional hard drives.

It is ideal for workloads that require very fast access time such as interactive video editing, static web content, and online transactions.  It also works well for workloads that perform many relatively small transactions, such as capturing telemetry data, message passing, and data transformation.

Microsoft internal testing shows that both average and 99th percentile server latency is significantly better than the Hot access tier, providing faster and more consistent response times for both read and write across a range of object sizes.

Premium Blob Storage is available with Locally-Redundant Storage and comes with High-Throughput Block Blobs (HTBB), which provides a) improved write throughput when ingesting larger block blobs, b) instant write throughput, and c) throughput that is unaffected by container and blob names.

You can store block blobs and append blobs in Premium Blob Storage (page blobs are not yet available).  To use Premium Blob Storage you provision a new ‘Block Blob’ storage account in your subscription and start creating containers and blobs using the existing Blob Service REST API and/or any existing tools such as AzCopy or Azure Storage Explorer.

Premium Blob Storage has a higher data storage cost but a lower transaction cost compared to data stored in the regular Hot tier.  This means it can be cost effective, and even less expensive, for workloads with very high transaction rates.  Check out the pricing page for more details.

At present, data stored in Premium cannot be tiered to the Hot, Cool or Archive access tiers.  Microsoft is working on supporting object tiering in the future.  To move data, you can synchronously copy blobs using the new PutBlockFromURL API (sample code) or a version of AzCopy that supports this API.  PutBlockFromURL synchronously copies data server side, which means that the data has finished copying when the call completes and all data movement happens inside Azure Storage.

Check out the announcement on how to sign up for the private preview.

To summarize the four storage tiers:

  • Premium storage (preview) provides high-performance hardware for data that is accessed frequently
  • Hot storage is optimized for storing data that is accessed frequently
  • Cool storage is optimized for storing data that is infrequently accessed and stored for at least 30 days
  • Archive storage is optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements (on the order of hours)

More info:

Premium Block Blob Storage – a new level of performance

Posted in Azure, SQLServerPedia Syndication

SQL Server 2019 Big Data Clusters

At the Microsoft Ignite conference, Microsoft announced that SQL Server 2019 is now in preview and that it will include Apache Spark and the Hadoop Distributed File System (HDFS) for scalable compute and storage.  This new architecture, which combines the SQL Server database engine, Spark, and HDFS into a unified data platform, is called a “big data cluster” and is deployed as containers on Kubernetes.  Big data clusters can be deployed in any cloud where there is a managed Kubernetes service, such as Azure Kubernetes Service (AKS), or in on-premises Kubernetes clusters, such as AKS on Azure Stack.  The SQL Server 2019 relational database engine in a big data cluster leverages an elastically scalable storage layer that integrates SQL Server and HDFS to scale to petabytes of data storage.  The Spark engine is now part of SQL Server:

While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer.  The virtual data layer allows users to query data from many sources through a single, unified interface.  Access to sensitive data sets can be controlled from a single location.  The delays inherent to ETL need not apply; data can always be up to date.  Storage costs and data governance complexity are minimized.  See the pros and cons of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.

SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:
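
As a rough illustration of what that data virtualization looks like in T-SQL, here is a hedged sketch of an external table over an Oracle source using the new ODBC-based PolyBase connectors.  The syntax follows the preview documentation, but the server, credential, and three-part location are placeholders:

```sql
-- Requires a database master key; all names below are placeholders.
CREATE DATABASE SCOPED CREDENTIAL OracleCredential
WITH IDENTITY = 'oracle_user', SECRET = '<password>';

CREATE EXTERNAL DATA SOURCE OracleHR
WITH (
    LOCATION = 'oracle://oraclehost:1521',
    CREDENTIAL = OracleCredential
);

-- Column names/types must match the source table.
CREATE EXTERNAL TABLE dbo.Employees
(
    EmployeeId INT,
    FirstName  NVARCHAR(50),
    LastName   NVARCHAR(50)
)
WITH (
    LOCATION = '[XE].[HR].[EMPLOYEES]',   -- service.schema.table; exact format depends on the source
    DATA_SOURCE = OracleHR
);

-- The Oracle table can now be queried and joined like any local table.
SELECT TOP (10) * FROM dbo.Employees;
```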

In SQL Server 2019 big data clusters, the SQL Server engine has gained the ability to natively read HDFS files, such as CSV and parquet files, by using SQL Server instances collocated on each of the HDFS data nodes that can filter and aggregate data locally in parallel across all of the HDFS data nodes.
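
For example, exposing a CSV file that lands in the cluster’s HDFS looks roughly like this.  This is a sketch based on the preview tutorials; the built-in SqlStoragePool data source (and its location), as well as all table, column, and path names, should be treated as assumptions that may change before release:

```sql
-- The storage pool data source may already exist in a big data cluster;
-- the location shown here is illustrative for the preview builds.
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');

CREATE EXTERNAL FILE FORMAT csv_file
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

-- External table over files in the HDFS directory /clickstream_data.
CREATE EXTERNAL TABLE dbo.web_clickstreams_hdfs
(
    wcs_click_date_sk BIGINT,
    wcs_user_sk       BIGINT,
    wcs_web_page_sk   BIGINT
)
WITH (LOCATION = '/clickstream_data', DATA_SOURCE = SqlStoragePool, FILE_FORMAT = csv_file);

-- Join the HDFS data with a relational table in the same instance.
SELECT TOP (10) c.CustomerName, COUNT(*) AS Clicks
FROM dbo.web_clickstreams_hdfs AS h
JOIN dbo.Customers AS c ON c.CustomerSK = h.wcs_user_sk
GROUP BY c.CustomerName
ORDER BY Clicks DESC;
```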

Performance of PolyBase queries in SQL Server 2019 big data clusters can be boosted further by distributing the cross-partition aggregation and shuffling of the filtered query results to “compute pools” comprised of multiple SQL Server instances that work together (this is similar to a PolyBase scale-out group).

When you combine the enhanced PolyBase connectors with SQL Server 2019 big data clusters data pools, data from external data sources can be partitioned and cached across all the SQL Server instances in a data pool, creating a “scale-out data mart”.  There can be more than one scale-out data mart in a given data pool, and a data mart can combine data from multiple external data sources and tables, making it easy to integrate and cache combined data sets from multiple external sources.  This will also be a great solution for importing IoT data.

SQL Server 2019 big data clusters make it easier for big data sets to be joined to the data stored in the enterprise relational database, enabling people and apps that use SQL Server to query big data more easily.  The value of the big data greatly increases when it is not just in the hands of the data scientists and big data engineers but is also included in reports, dashboards, and applications used by regular end users.  At the same time, the data scientists can continue to use big data ecosystem tools against HDFS while also utilizing easy, real-time access to the high-value data in SQL Server because it is all part of one integrated, complete system.

Azure Data Studio (previously released under the name of SQL Operations Studio) is an open-source, multi-purpose data management and analytics tool for DBAs, data scientists, and data engineers.  New extensions for Azure Data Studio integrate the user experience for working with relational data in SQL Server with big data.  The new HDFS browser lets analysts, data scientists, and data engineers easily view the HDFS files and directories in the big data cluster, upload/download files, open them, and delete them if needed.  The new built-in notebooks in Azure Data Studio are built on Jupyter, enabling data scientists and engineers to write Python, R, or Scala code with Intellisense and syntax highlighting before submitting the code as Spark jobs and viewing the results inline.  Notebooks facilitate collaboration between teammates working on a data analysis project together.  Lastly, the External Table Wizard, which uses PolyBase connectors, simplifies the process of creating external data sources and tables, including column mappings (it’s much easier than the current way of creating external tables).

There will also be a management service that will provision a bunch of agents on each pod that will collect monitoring data and the logs that can be seen via a browser-based cluster admin portal, which will also provide managed services for HA, backup/recovery, security, and provisioning.

In summary, SQL Server 2019 Big Data Clusters improves the 4 V’s of Big Data with these features:

More info:

Introducing Microsoft SQL Server 2019 Big Data Clusters

What are SQL Server 2019 big data clusters?

SQL Server 2019 Big Data Clusters white paper

Video New Feature Offerings in SQL Server 2019

Video SQL Server vNext meets AI and Big Data

Video The future of SQL Server and big data

Video Deep dive on SQL Server and big data

Posted in Big Data, PolyBase, SQL Server 2019, SQLServerPedia Syndication

Azure SQL Database Hyperscale

At Microsoft Ignite, one of the announcements was for Azure SQL Database Hyperscale, which was made available in public preview October 1st, 2018 in 12 different Azure regions.  SQL Database Hyperscale is a new SQL-based and highly scalable service tier for single databases that adapts on-demand to your workload’s needs.  With SQL Database Hyperscale, databases can quickly auto-scale up to 100TB, eliminating the need to pre-provision storage resources, and significantly expanding the potential for app growth without being limited by storage size.  Check out the documentation.

Compared to current Azure SQL Database service tiers, Hyperscale provides the following additional capabilities:

  • Support for up to 100 TB of database size
  • Nearly instantaneous database backups (based on file snapshots stored in Azure Blob storage) regardless of size, with no IO impact on compute
  • Fast database restores (based on file snapshots) in minutes rather than hours or days (not a size-of-data operation)
  • Higher overall performance due to higher log throughput and faster transaction commit times regardless of data volumes
  • Rapid scale-out – you can provision one or more read-only nodes for offloading your read workload and for use as hot standbys
  • Rapid scale-up – you can, in constant time, scale up your compute resources to accommodate heavy workloads when needed, and then scale the compute resources back down when not needed

The Hyperscale service tier removes many of the practical limits traditionally seen in cloud databases.  Where most other databases are limited by the resources available in a single node, databases in the Hyperscale service tier have no such limits.  With its flexible storage architecture, storage grows as needed.  In fact, Hyperscale databases aren’t created with a defined max size.  A Hyperscale database grows as needed – and you are billed only for the capacity you use.  Storage is dynamically allocated between 5 GB and 100 TB, in 1 GB increments.  For read-intensive workloads, the Hyperscale service tier provides rapid scale-out by provisioning additional read replicas as needed for offloading read workloads.
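
Provisioning works like any other service tier; as a hedged sketch, creating a Hyperscale database with T-SQL looks something like this (run in the master database of your logical server; the HS_Gen5_4 service objective name follows the preview naming and may change):

```sql
-- Create a Hyperscale database with 4 Gen5 vCores (names are illustrative).
CREATE DATABASE HyperscaleDemo
(
    EDITION = 'Hyperscale',
    SERVICE_OBJECTIVE = 'HS_Gen5_4'
);

-- Scale compute up or down later; storage grows automatically and is billed by use.
ALTER DATABASE HyperscaleDemo
MODIFY (SERVICE_OBJECTIVE = 'HS_Gen5_8');
```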

The Hyperscale service tier is primarily intended for customers who have large on-premises databases and want to modernize their applications by moving to the cloud, or for customers who are already in the cloud and are limited by the maximum database size restrictions (1-4 TB).  It is also intended for customers who seek high performance and high scalability for storage and compute.

The Hyperscale service tier supports all SQL Server workloads, but it is primarily optimized for OLTP.  The Hyperscale service tier also supports hybrid and analytical (data mart) workloads.

It is available under the vCore-based purchasing options for SQL Database (it is not available yet for SQL Database Managed Instance).

Azure SQL Database Hyperscale is built based on a new cloud-born architecture which decouples compute, log and storage.

A Hyperscale database contains the following different types of nodes:

Compute nodes

The compute nodes look like a traditional SQL Server, but without local data files or log files.  The compute node is where the relational engine lives, so all the language elements, query processing, and so on, occur here.  All user interactions with a Hyperscale database happen through these compute nodes.  Compute nodes have SSD-based caches (labeled RBPEX – Resilient Buffer Pool Extension in the preceding diagram) to minimize the number of network round trips required to fetch a page of data.  There is one primary compute node where all the read-write workloads and transactions are processed.  There are one or more secondary compute nodes that act as hot standby nodes for failover purposes, as well as act as read-only compute nodes for offloading read workloads (if this functionality is desired).

Log service

The log service externalizes the transactional log from a Hyperscale database.  The log service node accepts log records from the primary compute node, persists them in a durable log cache, and forwards the log records to the rest of the compute nodes (so they can update their caches) as well as the relevant page server(s), so that the data can be updated there.  In this way, all data changes from the primary compute node are propagated through the log service to all the secondary compute nodes and page servers.  Finally, the log record(s) are pushed out to long-term storage in Azure Standard Storage, which is an infinite storage repository.  This mechanism removes the necessity for frequent log truncation.  The log service also has local cache to speed up access.

Page servers

The page servers host and maintain the data files.  They consume the log stream from the log service and apply the data modifications described in the log stream to the data files.  Read requests for data pages that are not found in the compute node’s local data cache or RBPEX are sent over the network to the page servers that own the pages.  In page servers, the data files are persisted in Azure Storage and are heavily cached through RBPEX (SSD-based caches).

Page servers are systems representing a scaled-out storage engine.  Multiple page servers will be created for a large database.  When the database is growing and available space in existing page servers is lower than a threshold, a new page server is automatically added to the database.  Since page servers work independently, the database can grow with no local resource constraints.  Each page server is responsible for a subset of the pages in the database.  Nominally, each page server controls one terabyte of data.  No data is shared across page servers (outside of replicas that are kept for redundancy and availability).  The job of a page server is to serve database pages out to the compute nodes on demand, and to keep the pages updated as transactions update data.  Page servers are kept up to date by playing log records from the log service.  Long-term storage of data pages is kept in Azure Standard Storage for additional reliability.

Azure standard storage node

The Azure storage node is the final destination of data from the page servers.  This storage is used for backup purposes as well as for replication between Azure regions.  Backups consist of snapshots of data files.  Restore operations from these snapshots are fast, and data can be restored to any point in time.

Automated backup and point in time restore

In a Hyperscale database, snapshots of the data files are taken from the page servers periodically to replace the traditional streaming backup.  This allows for a backup of a very large database in just a few seconds.  Together with the log records stored in the log service, you can restore the database to any point in time during retention (7 days in public preview) in a very short time, regardless of the database size (a 50TB database takes just 8 minutes to restore!).

Since the backups are file-snapshot based, they are nearly instantaneous.  Storage and compute separation enables pushing down the backup/restore operation to the storage layer, reducing the processing burden on the primary compute node.  As a result, the backup of a large database does not impact the performance of the primary compute node.  Similarly, restores are done by copying the file snapshot and as such are not a size-of-data operation.  For restores within the same storage account, the restore operation is fast.

More info:

Video New performance and scale enhancements for Azure SQL Database

Posted in Azure SQL Database, SQLServerPedia Syndication

Azure Database Migration Service (DMS)

As I first mentioned in my blog Microsoft database migration tools, the Azure Database Migration Service (DMS) is a PaaS solution that makes it easy to migrate from on-prem/RDS to Azure and from one database type to another.  I’ll give a brief overview of some of the key features:

The first thing you will do is create an instance of Azure Database Migration Service, which basically reserves compute power in a region (make sure it’s the region where your destination databases will be – there is a limit of two DMS instances, but you can email support to have it increased).  DMS allows you to choose an already existing VNET with connectivity to your source, or create a basic VNET that can connect to source servers that have public-facing IPs (or are under the same VNET as the DMS service, or are accessible via VNET peering or tunneling).  If your migration target is SQL Database Managed Instance (SQL MI), this also needs to be the VNET where your SQL MI instance is located (you can skip this step if migrating to SQL Database Single, since it doesn’t support VNETs).  Note that the usual case is to connect to an existing VNET that has connectivity to the source and target, so you don’t have to create a new VNET when creating a DMS.  After you create the DMS, you can create a migration project and run migration activities.

Note that DMS now supports using existing SQL Server backup files for migrations from SQL Server to SQL MI, saving time and making the overall migration process easier to perform.

When creating a migration project, you will choose either “online data migration” or “offline data migration” to refer to migrations with and without ongoing replication, respectively.

Offline migrations do a backup and restore in the SQL to SQL VM and SQL to SQL MI scenarios.  For SQL to SQL Database Single, DMS copies the data itself (schema migration support is to be added soon) using a home-built streaming pipeline that uses bulk copy APIs, which has some of the best throughput compared to existing technologies.  The same tech ships in the Database Migration Assistant (DMA), but DMS is more reliable and scalable.

Online migrations (also known as continuous migrations, minimal downtime, or continuous sync) use technology based on reading the logs and streaming the data (for migrations to SQL Database Single it syncs via transactional replication, and you must use DMA first to migrate the schema, which may need changes).  In private preview is a new pipeline for migrations to SQL MI that will be based on log shipping.

Here is the list of all the current migration options along with the roadmap:

The new Database Migration Guide is for enterprise customers, partners, and business decision makers who are interested in moving to Azure cloud services (e.g. migrating from Oracle or SQL Server to Azure Data Services).  The Database Migration Guide provides comprehensive, step-by-step guidance for performing migrations, and improves the discoverability of the guidance, tools, software, and programs that are available to assist customers in performing these migrations.  Also, this white paper will guide you through the thought process and steps required to migrate your database workloads from on-premises to Azure-based cloud services.

Currently Microsoft does not have assessment rules in DMA specifically for SQL MI, but they should be available soon.

More info:

Migrating and modernizing your data estate to Azure with Data Migration Services : Build 2018

Migration from SQL Server to Azure SQL Database Managed Instance

Posted in Database Migration Service, SQLServerPedia Syndication

Azure SQL Database Read Scale-Out

Read Scale-Out is a little-known feature that allows you to load balance Azure SQL Database read-only workloads using the capacity of read-only replicas, for free.

As mentioned in my blog Azure SQL Database high availability, each database in the Premium tier (DTU-based purchasing model) or in the Business Critical tier (vCore-based purchasing model) is automatically provisioned with several Always On read-only replicas using synchronous-commit mode to support the availability SLA of 99.99% (these Always On replicas are created automatically even if you are not using geo-replication).  These replicas are provisioned with the same performance level as the read-write replica used by the regular database connections.  The Read Scale-Out feature allows you to load balance SQL Database read-only workloads using the capacity of one of the read-only replicas instead of having all queries hit the read-write replica.  This way the read-only workload is isolated from the main read-write workload and will not affect its performance.  This feature is intended for applications that include logically separated read-only workloads, such as analytics, which can therefore gain performance benefits from this additional capacity at no extra cost.

I highlighted “one” above to bring attention to the fact that only one replica is used, meaning it does not use multiple read-only replicas and load balance between them.

Another option for read-only workloads: if you also decide to use geo-replication (which is not free), this will create secondary databases (currently up to four) using asynchronous-commit mode that can be made readable, and you can direct connections to each of those secondaries directly in the connection string and do your own load balancing between them.  For more info on geo-replication see my blog Azure SQL Database disaster recovery.

And if you are using Read Scale-Out to load balance read-only workloads on a database that is geo-replicated (e.g. as a member of a failover group), make sure that Read Scale-Out is enabled on both the primary and the geo-replicated secondary databases.  This will ensure the same load-balancing effect when your application connects to the new primary after failover.

To read how to enable Read Scale-Out and send queries to the read-only replica, check out Use read-only replicas to load balance read-only query workloads (preview).
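
Once Read Scale-Out is enabled, routing is driven entirely by the connection string; a minimal sketch (server and database names are placeholders):

```sql
-- Connect with ApplicationIntent=ReadOnly to be routed to the read-only replica:
--   Server=tcp:myserver.database.windows.net,1433;Database=mydb;ApplicationIntent=ReadOnly;...
-- Once connected, verify which kind of replica served the session:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS Updateability;
-- Returns READ_ONLY on the read-only replica and READ_WRITE on the primary.
```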

More info:

Overview: Active geo-replication and auto-failover groups

Create linked server to readable secondary replica in Managed Instance Business Critical service tier

Azure SQL DB Readable Secondary

Posted in Azure SQL Database, SQLServerPedia Syndication

Azure SQL Database disaster recovery

My last blog post was on Azure SQL Database high availability and I would like to continue along that discussion with a blog post about disaster recovery in Azure SQL Database.  First, a clarification on the difference between high availability and disaster recovery:

High Availability (HA) – Keeping your database up 100% of the time with no data loss during common problems.  Redundancy at system level, focus on failover, addresses single predictable failure, focus is on technology.  SQL Server IaaS would handle this with:

  • Always On Failover cluster instances
  • Always On Availability Groups (in same Azure region)
  • SQL Server data files in Azure

Disaster Recovery (DR) – Protection if major disaster or unusual failure wipes out your database.  Use of alternate site, focus on re-establishing services, addresses multiple failures, includes people and processes to execute recovery.  Usually includes HA also.  SQL Server IaaS would handle this with:

  • Log Shipping
  • Database Mirroring
  • Always On Availability Groups (different Azure regions)
  • Backup to Azure

Azure SQL Database makes setting up disaster recovery so much easier than SQL Server IaaS (in a VM).  Disaster recovery is done via active geo-replication, which is an Azure SQL Database feature that allows you to create readable replicas of your database in the same or different data center (region).  All it takes is navigating to this page and choosing the region to create a secondary database (this example of active geo-replication is configured with a primary in the North Central US region and secondary in the South Central US region):

Once created, the secondary database is populated with the data copied from the primary database.  This process is known as seeding.  After the secondary database has been created and seeded, updates to the primary database are asynchronously replicated to the secondary database automatically.  Asynchronous replication means that transactions are committed on the primary database before they are replicated to the secondary database.

Compare this to setting up AlwaysOn Availability Groups!  And then think about the time it takes to monitor and maintain AlwaysOn Availability Groups, stuff that you won’t have to worry about anymore, and you can see why Azure SQL Database is such a pleasure.

Active geo-replication is designed as a business continuity solution that allows an application to perform quick disaster recovery in case of a data center scale outage.  If geo-replication is enabled, the application can initiate a failover to a secondary database in a different Azure region.  Up to four secondaries are supported in the same or different regions, and the secondaries can also be used for read-only access queries.  The failover can be initiated manually by the application or the user.  After failover, the new primary has a different connection endpoint.  As each secondary is a discrete database with the same name as the primary but on a different server, you will need to reconfigure your application(s) with an updated connection string.

Each geo-replicated database will be on a server in a region, with server names you give them, such as ServerNameWestUS.database.windows.net (primary) and ServerNameEastUS.database.windows.net (secondary).  If you want to load balance read queries between them in your application: when the application itself is load balanced between regions, you can use Azure Traffic Manager; if the application itself is not geo-replicated, use Azure Load Balancer.
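
For reference, active geo-replication can also be configured and failed over with T-SQL instead of the portal.  A hedged sketch using the server names above (statements run in the master database; the database name is a placeholder):

```sql
-- On the primary server (ServerNameWestUS), in master:
-- create a readable secondary of MyDB on the other logical server.
ALTER DATABASE [MyDB]
ADD SECONDARY ON SERVER [ServerNameEastUS]
WITH (ALLOW_CONNECTIONS = ALL);

-- On the secondary server (ServerNameEastUS), in master:
-- initiate a planned failover; the old primary automatically becomes the new secondary.
ALTER DATABASE [MyDB] FAILOVER;

-- If the primary is unavailable, force the failover (recent transactions may be lost):
-- ALTER DATABASE [MyDB] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```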

Auto-failover groups are an extension of active geo-replication.  They are designed to manage the failover of multiple geo-replicated databases simultaneously, using an application-initiated failover or by delegating failover to the SQL Database service based on user-defined criteria.  The latter allows you to automatically recover multiple related databases in a secondary region after a catastrophic failure or other unplanned event that results in full or partial loss of the SQL Database service’s availability in the primary region.  Because auto-failover groups involve multiple databases, these databases must be configured on the primary server.  Both primary and secondary servers for the databases in the failover group must be in the same subscription.  Auto-failover groups support replication of all databases in the group to only one secondary server in a different region.

If you are using active geo-replication and for any reason your primary database fails, or simply needs to be taken offline, you can initiate failover to any of your secondary databases.  When failover is activated to one of the secondary databases, all other secondaries are automatically linked to the new primary.  If you are using auto-failover groups to manage database recovery, any outage that impacts one or several of the databases in the group results in automatic failover.  You can configure the auto-failover policy that best meets your application needs, or you can opt out and use manual activation.  In addition, auto-failover groups provide read-write and read-only listener end-points that remain unchanged during failovers.  Whether you use manual or automatic failover activation, failover switches all secondary databases in the group to primary.  After the database failover is completed, the DNS record is automatically updated to redirect the end-points to the new region.

Active geo-replication leverages the Always On technology of SQL Server to asynchronously replicate committed transactions on the primary database to a secondary database using snapshot isolation.  The primary and secondary instances in a geo-replication relationship have independent HA capabilities, the same as a standalone instance would have.  Auto-failover groups provide the group semantics on top of active geo-replication but the same asynchronous replication mechanism is used.  While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to never have partial transactions.  Cross-region redundancy enables applications to quickly recover from a permanent loss of an entire datacenter or parts of a datacenter caused by natural disasters, catastrophic human errors, or malicious acts.  The specific recovery point objective (RPO) data can be found at Overview of Business Continuity (The time period of updates that you might lose is under 5 seconds).

More info:

Data Integrity in Azure SQL Database

High-availability and Azure SQL Database

Overview of business continuity with Azure SQL Database

Overview: Active geo-replication and auto-failover groups

Designing globally available services using Azure SQL Database

Spotlight on SQL Database Active Geo-Replication

Azure SQL Database Business Continuity Enhancements

Posted in Azure SQL Database, SQLServerPedia Syndication

Azure SQL Database high availability

In this blog I want to talk about how Azure SQL Database achieves high availability.  One of the major benefits from moving from on-prem SQL Server to Azure SQL Database is how much easier it is to have high availability – no need for creating and managing a SQL Server failover cluster, AlwaysOn availability groups, database mirroring, log shipping, SAN replication, etc.

Azure SQL Database is a highly available database Platform as a Service that guarantees that your database is up and running 99.99% of the time, without worrying about maintenance and downtime.  It is a fully managed SQL Server Database Engine process hosted in the Azure cloud that ensures your SQL Server database is always upgraded/patched without affecting your workload.  Azure automatically handles patching, backups, replication, failure detection, underlying hardware, software or network failures, deploying bug fixes, failovers, database upgrades, and other maintenance tasks.  Azure SQL Database can quickly recover even in the most critical circumstances, ensuring that your data is always available.

Azure SQL Database is based on the SQL Server Database Engine architecture that is adjusted for the cloud environment in order to ensure 99.99% availability even in the cases of infrastructure failures.  There are two high-availability architectural models that are used in Azure SQL Database (both of them ensuring 99.99% availability):

(NOTE: Basic/Standard/Premium are service tiers that are DTU-based and used only for SQL Database Single, and General Purpose/Business Critical are vCore-based and used for both SQL Database Single and SQL Database Managed Instance)

  • Basic/Standard/General Purpose model that is based on remote storage.  This architectural model relies on high availability and reliability of the storage tier, but it might have some potential performance degradation during maintenance activities.  This model uses Azure Premium Storage Disks
  • Premium/Business Critical model that is based on a cluster of database engine processes.  This architectural model relies on the fact that there is always a quorum of available database engine nodes, and it has minimal performance impact on your workload even during maintenance activities.  This model uses Always On Availability Groups and locally attached SSD storage, and it provides higher IOPS and throughput than Basic/Standard/General Purpose

Azure SQL Database runs on the latest stable version of SQL Server Database Engine and Windows OS, and most of the users would not notice that the upgrades are performed continuously.

More details on these two options:

Basic/Standard/General Purpose

High availability in these service tiers is achieved by separation of compute and storage layers and the replication of data in the storage tier (which uses Azure Premium Storage):

There are two layers:

  • Active compute nodes: A stateless compute layer that is running the sqlserver.exe process and contains only transient and cached data (for example – plan cache, buffer pool, column store pool).  This stateless SQL Server node is operated by Azure Service Fabric, which initializes the process, controls the health of the node, and performs failover to another node if necessary
  • Azure Storage accounts: A stateful data layer with database files (.mdf/.ldf) that are stored in Azure Premium Storage Disks, which is remote storage (i.e. it is accessed over the network, using Azure network infrastructure).  It is able to use Azure Premium Storage by taking advantage of SQL Server’s native capability to use database files directly in Azure Blob Storage (see the sketch below).  This means that there is not a disk or a network share that hosts database files; instead, the file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage.  Azure Storage guarantees that there will be no data loss for any record placed in a database file (since three copies of the data are made via LRS).  Azure Storage has built-in data availability/redundancy that ensures that every record in the log file or page in a data file will be preserved even if the SQL Server process crashes.  Note that the tempdb database does not use Azure Premium Storage; rather, it is located on local SSD storage, which provides very low latency and high IOPS/throughput
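
The “database files directly in Azure Blob Storage” capability referenced in the second bullet also exists in standalone SQL Server, which makes the architecture easier to picture.  Azure SQL Database wires this up for you, but a hedged illustration of the underlying mechanism looks like this (storage account, container, and SAS token are placeholders):

```sql
-- Standalone SQL Server can place data/log files directly in Azure Blob Storage.
-- The credential name must match the container URL; the SAS token is a placeholder.
CREATE CREDENTIAL [https://myaccount.blob.core.windows.net/data]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token-without-leading-question-mark>';

CREATE DATABASE MyDb
ON (NAME = MyDb_data,
    FILENAME = 'https://myaccount.blob.core.windows.net/data/MyDb_data.mdf')
LOG ON (NAME = MyDb_log,
    FILENAME = 'https://myaccount.blob.core.windows.net/data/MyDb_log.ldf');
```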

Whenever the database engine or operating system is upgraded, some part of the underlying infrastructure fails, or a critical issue is detected in the SQL Server process, Azure Service Fabric will move the stateless SQL Server process to another stateless compute node.  There is a set of redundant (“spare”) nodes waiting to run a new compute service in case of failover, in order to minimize failover time.  Data in the Azure storage layer is not affected, and data/log files are attached to the newly initialized SQL Server process.  Failover time can be measured in seconds.  This process guarantees 99.99% availability, but it might have some performance impact on a heavy workload due to the transition time and the fact that the new SQL Server node starts with a cold cache.

See Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose) for info on performance improvement.

Premium/Business Critical

High availability in these service tiers is designed for intensive workloads that cannot tolerate any performance impact due to the ongoing maintenance operations.

In the Premium/Business Critical model, Azure SQL Database integrates compute and storage on a single node.  High availability in this architectural model is achieved by replication of compute (the SQL Server Database Engine process) and storage (locally attached SSD) deployed in an Always On Availability Groups cluster with enough replicas to achieve quorum and provide HA guarantees (currently 4 nodes, as shown below):


Both the SQL Server Database Engine process and the underlying mdf/ldf files are placed on the same node with locally attached SSD storage providing low latency to your workload.  High availability is implemented using Always On Availability Groups.  Every database is a cluster of database nodes with one primary database that is accessible for the customer workload, and three secondary processes containing copies of data.  The primary node constantly pushes the changes to secondary nodes in order to ensure that the data is available on secondary replicas if the primary node crashes for any reason.  Failover is handled by the Azure Service Fabric – one secondary replica becomes the primary node and a new secondary replica is created to ensure enough nodes in the cluster, and the workload is automatically redirected to the new primary node.  Failover time is measured in milliseconds for most workloads, and the new primary instance is immediately ready to continue serving requests.

A note on the difference in the handling of a failover compared to on-prem: The database engine cannot control the failover because it may not be running when a failover has to occur, i.e. it may have just crashed.  Failover has to be initiated by a component external to the database engine.  For traditional SQL Server, this component is Windows Failover Clustering.  For SQL DB and MI, this is Service Fabric.

IO Performance difference:

vCore model (from here):

  • General Purpose – IO throughput (approximate): Singleton Database: 500 IOPS per vCore with 7,000 maximum IOPS; Managed Instance: 500-7,500 IOPS per data file (depends on size of file)
  • Business Critical – IO throughput (approximate): 5,000 IOPS per core with 200,000 maximum IOPS

DTU model (from here):

  • Basic – IO throughput (approximate): 2.5 IOPS per DTU; IO latency (approximate): 5 ms (read), 10 ms (write)
  • Standard – IO throughput (approximate): 2.5 IOPS per DTU; IO latency (approximate): 5 ms (read), 10 ms (write)
  • Premium – IO throughput (approximate): 48 IOPS per DTU; IO latency (approximate): 2 ms (read/write)

Note Managed Instance only supports the vCore model.

A word about storage: Azure SQL Database Singleton and Managed Instance both use Azure Storage page blobs as the underlying persistent storage for their databases.  Azure premium managed disks are just premium page blobs with some API make-up so they look like disks.  Azure Storage page blobs do not officially announce performance numbers per size, while Azure premium managed disks do, which is why tables like this show performance numbers for disks and not page blobs.

A word about failover times: in the Business Critical case, there is a secondary replica that is an exact read-only copy of the primary instance, so failover is just a switch to a new IP address and it is almost instant.  In more realistic cases, there is always some lag on the secondary replica because it is constantly redoing transaction log records sent from the primary node.  Failover time is equal to the time needed to apply all remaining transaction log records to become consistent with the primary node, after which the switch to the new IP is completed.  Under a heavy workload that saturates both the primary and secondary replica, there is a chance that the secondary cannot immediately catch up to the primary, so the log redo time might be even longer.  The exact time depends on the workload, and there are no official numbers or formula to calculate it.  In the General Purpose case, there is a stateless compute node ready to run sqlservr.exe that attaches the .mdf/.ldf files from remote storage.  This is a cold-cache process that must be initialized, so failover time is longer than in the Business Critical case.  Failover time also depends on the database size and can vary.

Finally, if you are interested in how Microsoft manages data integrity for Azure SQL Database, check out Data Integrity in Azure SQL Database.

More info:

High-availability and Azure SQL Database

Overview of business continuity with Azure SQL Database

Overview: Active geo-replication and auto-failover groups

Reaching Azure disk storage limit on General Purpose Azure SQL Database Managed Instance

File layout in General Purpose Azure SQL Managed Instance

What is built-in intelligence in Azure SQL Database?

Posted in Azure SQL Database, SQLServerPedia Syndication

Power BI: Dataflows

Dataflows, previously called Common Data Service for Analytics as well as Datapools, will be in preview soon and I wanted to explain in this blog what it is and how it can help you get value out of your data quickly (it’s a follow-up to my blog Getting value out of data quickly).

In short, Dataflows integrate data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep).  Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across major functions such as marketing, sales, service, and finance, along with connectors that ingest data from the most common sources into these schemas.  This greatly simplifies modeling and integration challenges (it prevents multiple metadata definitions of the same data).  You can also extend the CDM by creating custom entities.  Lastly, Microsoft and their partners will be shipping out-of-the-box applications that run on Power BI that populate data in the Common Data Model and deliver insights through Power BI.

A dataflow is not just the data itself, but also the logic for how the data is manipulated.  Dataflows belong to the Data Warehouse/Mart/Lake family.  Their main job is to aggregate, cleanse, transform, integrate and harmonize data from a large and growing set of supported on-premises and cloud-based data sources, including Dynamics 365, Salesforce, Azure SQL Database, Excel, and SharePoint.  Dataflows hold a collection of data-lake-stored entities (i.e. tables), which are stored in internal Power BI Common Data Model-compliant folders in Azure Data Lake Storage Gen2.

This adds two new layers to Power BI (Dataflows and Storage):

But you can instead use your own Azure Data Lake Storage Gen2 account, allowing other Azure services to reuse the data (e.g. Azure Databricks can be used to manipulate the data).

You can also setup incremental refresh for any entity, link to entities from other dataflows, and can pull data down from the dataflows into Power BI desktop.

To use dataflows, in the Power BI Service, under a Workspace, choose Create – Dataflow – Add entities.  This starts Power Query Online, and you then choose a connector from one of the many data sources (just like you do with Power Query in Power BI Desktop).  Then choose a table to import, and the screen will look like this:

To create a dashboard from these entities, in Power BI Desktop you simply choose Get Data -> Power BI dataflows.

The bottom line is Power BI users can now easily create a dataflow to prepare data in a centralized storage, using a standardized schema, ready for easy consumption, reuse, and generation of business insights.

Dataflows are a great way to have a power user get value out of data without involving IT.  But while this adds enterprise tools to Power BI, it does not mean you are creating an enterprise solution.  You still may need to create a data warehouse and cubes: See The need for having both a DW and cubes and Is the traditional data warehouse dead?.

More info:

Self-service data prep with dataflows

Microsoft Common Data Services

Video Introduction to Common Data Service For Analytics

Video Common Data Service for Analytics (CDS-A) and Power BI – an Introduction

Power BI expands self-service prep for big data, unifies modern and enterprise BI

Video Introducing: Advanced data prep with dataflows—for unified data and powerful insights

Dataflows in Power BI: A Data Analytics Gamechanger?

Video Introduction to the Microsoft Common Data Model

Video Power BI data preparation with Dataflows

Power BI Dataflows, and Why You Should Care

Terminology Check – What are Data Flows?

Dataflows in Power BI

Dataflows: The Good, the Bad, the Ugly

Posted in Power BI, SQLServerPedia Syndication

Power BI new feature: Composite models

There are two really great features just added to Power BI that I wanted to blog about: Composite models and Dual storage mode.  This is part of the July release for Power BI Desktop and it is in preview (see Power BI Desktop July 2018 Feature Summary).  I’ll also talk about a future release called Aggregations.

First a review of the two ways to connect to a data source:

Import – The selected tables and columns are imported into Power BI Desktop.  As you create or interact with a visualization, Power BI Desktop uses the imported data.  You must refresh the data, which imports the full data set again (or use the preview feature incremental refresh), to see any changes that occurred to the underlying data since the initial import or the most recent refresh.  Import datasets in the Power BI service have a 10GB dataset limitation for the Premium version and a 1GB limitation for the free version (although with compression you can import much larger data sets).  See Data sources in Power BI Desktop

DirectQuery – No data is imported or copied into Power BI Desktop.  As you create or interact with a visualization, Power BI Desktop queries the underlying data source, which means you’re always viewing current data.  DirectQuery lets you build visualizations over very large datasets, where it otherwise would be unfeasible to first import all of the data with pre-aggregation.  See Data sources supported by DirectQuery.

Until now in Power BI, when you connected to a data source using DirectQuery, it was not possible to connect to any other data source in the same report (all tables had to come from a single database), nor to include data that had been imported.  The new composite model feature removes this restriction, allowing a single report to seamlessly combine data from one or more DirectQuery sources, and/or combine data from a mix of DirectQuery sources and imported data.  This means you can combine multiple DirectQuery sources with multiple Import sources.  If your report has some DirectQuery tables and some import tables, the status bar on the bottom right of your report will show a storage mode of ‘Mixed.’  Clicking on this allows all tables to be switched to import mode easily.

For example, with composite models it’s possible to build a model that combines sales data from an enterprise data warehouse using DirectQuery, with data on sales targets that is in a departmental SQL Server database using DirectQuery, along with some data imported from a spreadsheet.  A model that combines data from more than one DirectQuery source, or combines DirectQuery with imported data is referred to as a composite model.

Also, composite models include a new feature called dual storage mode.  If you are using DirectQuery currently, all visuals result in queries being sent to the backend source, even simple visuals such as a slicer showing all the Product Categories.  The ability to define a table as having a storage mode of “Dual” means that a copy of the data for that table will also be imported, and any visuals that reference only columns from this table will use the imported data and not require a query to the underlying source.  The benefits are improved performance and a lessened load on the backend source.  And when a Dual table is queried together with large DirectQuery tables, it operates as a DirectQuery table, so the large tables’ data does not need to be imported in order to join with it.

Another feature due out in the next 90 days is “Aggregations”, which allows you to create aggregation tables.  This new feature, along with composite models and dual storage mode, allows you to create a solution that uses huge datasets.  For example, say I have two related tables: one is at the detail grain, called Sales, and another holds the aggregated totals of Sales, called Sales_Agg.  Sales is set to DirectQuery storage mode and Sales_Agg is set to Import storage mode.  If a user sends a query with a SELECT statement that has a GROUP BY that can be filled by the Sales_Agg table, the data will be pulled from cache in milliseconds since that table was imported (for example, 1.6 billion aggregated rows imported from SQL DW compressed to 10GB in memory).  If a user sends a query with a GROUP BY for a field that is not in the Sales_Agg table, it will do a DirectQuery to the Sales table (for example, sending a Spark query to a 23-node HDI Spark cluster of 1 trillion detail rows of 250TB, taking about 40 seconds).  The user is not aware there is a Sales_Agg table (all aggregation tables are hidden) – they simply send a query to Sales and Power BI automatically redirects the query to the best table to use.  And if using a Date table, it can be set to Dual mode so it joins with Sales_Agg in memory in the first part of the example, or joins with Sales on the data source using DirectQuery in the second part of the example (so it does not have to pull the 1 trillion detail rows into Power BI in order to join with the imported Date table).
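
To make the aggregation hit/miss logic concrete, here are the two query shapes expressed as plain SQL against the sources.  Power BI actually generates the source queries from your DAX/visual interactions, and the table and column names below are illustrative, building on the Sales/Sales_Agg example:

```sql
-- Answerable from the imported Sales_Agg table (grain: date and product),
-- so it is served from the in-memory cache in milliseconds.
SELECT d.CalendarYear, p.ProductCategory, SUM(a.SalesAmount) AS SalesAmount
FROM Sales_Agg AS a
JOIN DimDate    AS d ON d.DateKey    = a.DateKey      -- Date table in Dual mode joins in memory
JOIN DimProduct AS p ON p.ProductKey = a.ProductKey
GROUP BY d.CalendarYear, p.ProductCategory;

-- Groups by a column that is not in Sales_Agg (CustomerName), so it falls
-- through to DirectQuery against the detail Sales table on the source.
SELECT c.CustomerName, SUM(s.SalesAmount) AS SalesAmount
FROM Sales AS s
JOIN DimCustomer AS c ON c.CustomerKey = s.CustomerKey
GROUP BY c.CustomerName;
```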

So you can think of aggregations as a replacement for creating an Azure Analysis Services tabular data model, saving on cost and optimization work.

You will need to right-click the Sales_Agg table and choose “Manage aggregations” to map the aggregated Sales_Agg table columns to the detail Sales table columns.  There is also a “Precedence” field that allows you to have multiple aggregation tables on the same fact table at different grains:

You can also create a report with a drillthrough feature where users can right-click on a data point in a report page that was built with an aggregation table and drillthrough to a focused page to get details that are filtered to that context that is built using DirectQuery.

So in summary, there are three values for storage mode at the table level:

  • Import – When set to Import, imported tables are cached.  Queries submitted to the Power BI dataset that return data from Import tables can only be fulfilled from cached data
  • DirectQuery – With this setting, DirectQuery tables are not cached.  Queries submitted to the Power BI dataset (for example, DAX queries) that return data from DirectQuery tables can only be fulfilled by executing on-demand queries to the data source.  Queries submitted to the data source use the query language for that data source (for example, SQL)
  • Dual – Dual tables can act as either cached or not cached, depending on the context of the query submitted to the Power BI dataset.  In some cases, queries are fulfilled from cached data; in other cases, queries are fulfilled by executing an on-demand query to the data source

Note that changing a table to Import is an irreversible operation; it cannot be changed back to DirectQuery, or back to Dual.  Also note there are two limitations during the preview period: DirectQuery only supports the tabular model (not multi-dimensional model) and you can’t publish files to the Power BI service.

More info:

Power BI Monthly Digest – July 2018

Composite models in Power BI Desktop (Preview)

Storage mode in Power BI Desktop (Preview)

Using DirectQuery in Power BI

Power BI Composite Models: The Good, The Bad, The Ugly

Composite Model; DirectQuery and Import Data Combined; Evolution Begins in Power BI

Video Building Enterprise grade BI models with Microsoft Power BI Premium

Video Building a data model to support 1 trillion rows of data and more with Microsoft Power BI Premium

Video Power BI and the Future for Modern and Enterprise BI

Video Introducing: Advanced data prep with dataflows—for unified data and powerful insights

Understanding Power BI Dual Storage

Video Microsoft Power BI Premium: Building enterprise-grade BI models for big data

Posted in Power BI, SQLServerPedia Syndication