Azure SQL Database Hyperscale

At Microsoft Ignite, one of the announcements was for Azure SQL Database Hyperscale, which was made available in public preview October 1st, 2018 in 12 different Azure regions.  SQL Database Hyperscale is a new SQL-based and highly scalable service tier for single databases that adapts on-demand to your workload’s needs.  With SQL Database Hyperscale, databases can quickly auto-scale up to 100TB, eliminating the need to pre-provision storage resources, and significantly expanding the potential for app growth without being limited by storage size.  Check out the documentation.

Compared to current Azure SQL Database service tiers, Hyperscale provides the following additional capabilities:

  • Support for up to 100 TB of database size
  • Nearly instantaneous database backups (based on file snapshots stored in Azure Blob storage) regardless of size, with no IO impact on compute
  • Fast database restores (based on file snapshots) in minutes rather than hours or days (not a size-of-data operation)
  • Higher overall performance due to higher log throughput and faster transaction commit times regardless of data volumes
  • Rapid scale-out – you can provision one or more read-only nodes for offloading your read workload and for use as hot standbys
  • Rapid scale-up – you can, in constant time, scale up your compute resources to accommodate heavy workloads when needed, and then scale the compute resources back down when not needed

The Hyperscale service tier removes many of the practical limits traditionally seen in cloud databases.  Where most other databases are limited by the resources available in a single node, databases in the Hyperscale service tier have no such limits.  With its flexible storage architecture, storage grows as needed.  In fact, Hyperscale databases aren’t created with a defined max size.  A Hyperscale database grows as needed – and you are billed only for the capacity you use.  Storage is dynamically allocated between 5 GB and 100 TB, in 1 GB increments.  For read-intensive workloads, the Hyperscale service tier provides rapid scale-out by provisioning additional read replicas as needed for offloading read workloads.
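For example, here is a minimal T-SQL sketch of creating a Hyperscale database and later scaling its compute (run in the master database of an Azure SQL logical server; the database name and service objectives are placeholders):

```sql
-- Create a Hyperscale database; no max size needs to be pre-provisioned.
-- 'HS_Gen5_4' = Hyperscale service tier, Gen5 hardware, 4 vCores.
CREATE DATABASE [MyHyperscaleDB]
    ( EDITION = 'Hyperscale', SERVICE_OBJECTIVE = 'HS_Gen5_4' );

-- Later, scale compute up (or down) independently of storage.
ALTER DATABASE [MyHyperscaleDB]
    MODIFY ( SERVICE_OBJECTIVE = 'HS_Gen5_8' );
```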

The Hyperscale service tier is primarily intended for customers who have large on-premises databases and want to modernize their applications by moving to the cloud, or for customers who are already in the cloud and are limited by the maximum database size restrictions (1-4 TB).  It is also intended for customers who seek high performance and high scalability for storage and compute.

The Hyperscale service tier supports all SQL Server workloads, but it is primarily optimized for OLTP.  The Hyperscale service tier also supports hybrid and analytical (data mart) workloads.

It is available under the vCore-based purchasing model for SQL Database (it is not yet available for SQL Database Managed Instance).

Azure SQL Database Hyperscale is built based on a new cloud-born architecture which decouples compute, log and storage.

A Hyperscale database contains the following different types of nodes:

Compute nodes

The compute nodes look like a traditional SQL Server, but without local data files or log files.  The compute node is where the relational engine lives, so all the language elements, query processing, and so on, occur here.  All user interactions with a Hyperscale database happen through these compute nodes.  Compute nodes have SSD-based caches (labeled RBPEX – Resilient Buffer Pool Extension in the preceding diagram) to minimize the number of network round trips required to fetch a page of data.  There is one primary compute node where all the read-write workloads and transactions are processed.  There are one or more secondary compute nodes that act as hot standby nodes for failover purposes, as well as act as read-only compute nodes for offloading read workloads (if this functionality is desired).

Log service

The log service externalizes the transactional log from a Hyperscale database.  The log service node accepts log records from the primary compute node, persists them in a durable log cache, and forwards the log records to the rest of the compute nodes (so they can update their caches) as well as the relevant page server(s), so that the data can be updated there.  In this way, all data changes from the primary compute node are propagated through the log service to all the secondary compute nodes and page servers.  Finally, the log record(s) are pushed out to long-term storage in Azure Standard Storage, which is an infinite storage repository.  This mechanism removes the necessity for frequent log truncation.  The log service also has local cache to speed up access.

Page servers

The page servers host and maintain the data files.  They consume the log stream from the log service and apply the data modifications described in the log stream to the data files.  Read requests for data pages that are not found in the compute node’s local data cache or RBPEX are sent over the network to the page servers that own those pages.  In page servers, the data files are persisted in Azure Storage and are heavily cached through RBPEX (SSD-based caches).

Page servers are systems representing a scaled-out storage engine.  Multiple page servers are created for a large database.  When the database is growing and available space in existing page servers falls below a threshold, a new page server is automatically added to the database.  Because page servers work independently, the database can grow with no local resource constraints.  Each page server is responsible for a subset of the pages in the database.  Nominally, each page server controls one terabyte of data.  No data is shared across more than one page server (outside of the replicas that are kept for redundancy and availability).  The job of a page server is to serve database pages out to the compute nodes on demand, and to keep the pages updated as transactions update data.  Page servers are kept up to date by playing log records from the log service.  Long-term storage of data pages is kept in Azure Standard Storage for additional reliability.

Azure standard storage node

The Azure storage node is the final destination of data from page servers.  This storage is used for backup purposes as well as for replication between Azure regions.  Backups consist of snapshots of data files.  Restore operations from these snapshots are fast, and data can be restored to any point in time.

Automated backup and point in time restore

In a Hyperscale database, snapshots of the data files are taken from the page servers periodically to replace the traditional streaming backup.  This allows for a backup of a very large database in just a few seconds.  Together with the log records stored in the log service, you can restore the database to any point in time during retention (7 days in public preview) in a very short time, regardless of the database size.

Because the backups are file-snapshot based, they are nearly instantaneous.  The separation of storage and compute enables pushing the backup/restore operation down to the storage layer to reduce the processing burden on the primary compute node.  As a result, the backup of a large database does not impact the performance of the primary compute node.  Similarly, restores are done by copying the file snapshots and as such are not a size-of-data operation.  For restores within the same storage account, the restore operation is fast.

More info:

Video New performance and scale enhancements for Azure SQL Database


Azure Database Migration Service (DMS)

As I first mentioned in my blog Microsoft database migration tools, the Azure Database Migration Service (DMS) is a PaaS solution that makes it easy to migrate from on-prem/RDS to Azure and from one database type to another.  I’ll give a brief overview of some of the key features:

The first thing you will do is create an instance of Azure Database Migration Service, which basically reserves compute power in a region (make sure it’s the region where your destination databases will be – there is a limit of two DMS instances, but you can email support to have it increased).  DMS allows you to choose an already existing VNET with connectivity to your source, or create a basic VNET that can connect to your source servers that have public-facing IPs (or are under the same VNET as the DMS service, or are accessible via VNET peering or tunneling).  In case your migration target is SQL Database Managed Instance (SQL MI), this also needs to be the VNET where your SQL MI instance is located (you can skip this step if migrating to SQL Database Single since it doesn’t support VNETs).  Note the usual case is to connect to an existing VNET that has connectivity to the source and target so you don’t have to create a new VNET when creating a DMS.  After you create the DMS, you can create a migration project and run migration activities.

Note that DMS now supports using existing SQL Server backup files for migrations from SQL Server to SQL MI, saving time and making the overall migration process easier to perform.
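DMS handles the restore of those backup files for you.  For comparison, restoring a backup file into SQL MI manually uses the instance’s native restore-from-URL capability; a hedged T-SQL sketch of that manual equivalent (the storage account, container, SAS token, and backup file name are all placeholders):

```sql
-- Credential named after the container URL holding the .bak files; the SAS token is a placeholder.
CREATE CREDENTIAL [https://mystorageacct.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token-without-leading-question-mark>';

-- Native restore from URL on SQL Database Managed Instance.
RESTORE DATABASE [MyDb]
FROM URL = 'https://mystorageacct.blob.core.windows.net/backups/MyDb.bak';
```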

When creating a migration project, you will choose either “online data migration” or “offline data migration” to refer to migrations with and without ongoing replication, respectively.

An offline migration does a backup and restore in the SQL Server to SQL VM and SQL Server to SQL MI scenarios.  For SQL Server to SQL Database Single, it copies the data itself (schema migration support is to be added soon) using a home-built DMS streaming pipeline that uses bulk copy APIs, which offers some of the best throughput compared to existing technologies.  The same technology ships in the Data Migration Assistant (DMA), but DMS is more reliable and scalable.

Online migrations (also known as continuous migrations, minimal downtime, or continuous sync) use technology that is based on reading the logs and streaming the data (for migrations to SQL Database Single it syncs via transactional replication, and you must use DMA first to migrate the schema, which may need changes).  In private preview is a new pipeline for migrations to SQL MI that will be based on log shipping.

Here is the list of all the current migration options along with the roadmap:

The new Database Migration Guide is for enterprise customers, partners, and business decision makers who are interested in moving to Azure cloud services (i.e. migrating from Oracle or SQL Server to Azure Data Services).  The Database Migration Guide provides comprehensive, step-by-step guidance for performing migrations, and improves the discoverability of the guidance, tools, software, and programs that are available to assist customers in performing these migrations.  Also, this white paper will guide you through the thought process and steps required to migrate your database workloads from on-premises to Azure-based cloud services.

Currently Microsoft does not have assessment rules in DMA specifically for SQL MI, but they should be available soon.

More info:

Migrating and modernizing your data estate to Azure with Data Migration Services : Build 2018


Azure SQL Database Read Scale-Out

Read Scale-Out is a little-known feature that allows you to load balance Azure SQL Database read-only workloads using the capacity of read-only replicas, for free.

As mentioned in my blog Azure SQL Database high availability, each database in the Premium tier (DTU-based purchasing model) or in the Business Critical tier (vCore-based purchasing model) is automatically provisioned with several Always On read-only replicas using synchronous-commit mode to support the availability SLA of 99.99% (these Always On replicas are created automatically even if you are not using geo-replication).  These replicas are provisioned with the same performance level as the read-write replica used by the regular database connections.  The Read Scale-Out feature allows you to load balance SQL Database read-only workloads using the capacity of one of the read-only replicas instead of all queries hitting the read-write replica.  This way the read-only workload is isolated from the main read-write workload and will not affect its performance.  This feature is intended for applications that include logically separated read-only workloads, such as analytics, which can gain performance benefits from this additional capacity at no extra cost.

I highlighted “one” above to bring attention to the fact that only one replica is used, meaning it does not use multiple read-only replicas and load balance between them.

Another option for read-only workloads is to also use geo-replication (which is not free).  This will create secondary databases (currently up to four) using asynchronous-commit mode that can be made readable, and you can direct connections to each of those secondaries directly in the connection string and do your own load balancing between them.  For more info on geo-replication see my blog Azure SQL Database disaster recovery.

And if you are using Read Scale-Out to load balance read-only workloads on a database that is geo-replicated (e.g. as a member of a failover group), make sure that Read Scale-Out is enabled on both the primary and the geo-replicated secondary databases.  This will ensure the same load-balancing effect when your application connects to the new primary after failover.

To read how to enable Read Scale-Out and send queries to the read-only replica, check out Use read-only replicas to load balance read-only query workloads (preview).
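As a quick illustration (a hedged sketch; the server and database names are placeholders): once Read Scale-Out is enabled, routing is driven entirely by the ApplicationIntent keyword in the connection string, and you can confirm from within the session which replica you reached:

```sql
-- Connection string fragment (client side), not T-SQL:
--   Server=tcp:myserver.database.windows.net;Database=MyDb;ApplicationIntent=ReadOnly;
-- Omitting ApplicationIntent (or using ReadWrite) routes you to the read-write replica.

-- Once connected, check which kind of replica is serving the session:
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS Updateability;
-- READ_ONLY  -> you are on the read-only replica
-- READ_WRITE -> you are on the read-write (primary) replica
```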

More info:

Overview: Active geo-replication and auto-failover groups

Create linked server to readable secondary replica in Managed Instance Business Critical service tier


Azure SQL Database disaster recovery

My last blog post was on Azure SQL Database high availability and I would like to continue along that discussion with a blog post about disaster recovery in Azure SQL Database.  First, a clarification on the difference between high availability and disaster recovery:

High Availability (HA) – Keeping your database up 100% of the time with no data loss during common problems.  Redundancy at system level, focus on failover, addresses single predictable failure, focus is on technology.  SQL Server IaaS would handle this with:

  • Always On Failover cluster instances
  • Always On Availability Groups (in same Azure region)
  • SQL Server data files in Azure

Disaster Recovery (DR) – Protection if major disaster or unusual failure wipes out your database.  Use of alternate site, focus on re-establishing services, addresses multiple failures, includes people and processes to execute recovery.  Usually includes HA also.  SQL Server IaaS would handle this with:

  • Log Shipping
  • Database Mirroring
  • Always On Availability Groups (different Azure regions)
  • Backup to Azure

Azure SQL Database makes setting up disaster recovery so much easier than SQL Server IaaS (in a VM).  Disaster recovery is done via active geo-replication, which is an Azure SQL Database feature that allows you to create readable replicas of your database in the same or different data center (region).  All it takes is navigating to this page and choosing the region to create a secondary database (this example of active geo-replication is configured with a primary in the North Central US region and secondary in the South Central US region):

Once created, the secondary database is populated with the data copied from the primary database.  This process is known as seeding.  After the secondary database has been created and seeded, updates to the primary database are asynchronously replicated to the secondary database automatically.  Asynchronous replication means that transactions are committed on the primary database before they are replicated to the secondary database.

Compare this to setting up AlwaysOn Availability Groups!  And then think about the time it takes to monitor and maintain AlwaysOn Availability Groups, stuff that you won’t have to worry about anymore, and you can see why Azure SQL database is such a pleasure.

Active geo-replication is designed as a business continuity solution that allows an application to perform quick disaster recovery in case of a data center scale outage.  If geo-replication is enabled, the application can initiate a failover to a secondary database in a different Azure region.  Up to four secondaries are supported in the same or different regions, and the secondaries can also be used for read-only access queries.  The failover can be initiated manually by the application or the user.  After failover, the new primary has a different connection end point.  Since each secondary is a discrete database with the same name as the primary but on a different server, you will need to reconfigure your application(s) with an updated connection string.
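For reference, active geo-replication can also be set up and failed over with plain T-SQL (a hedged sketch; the server and database names are placeholders, and each statement runs in the master database of the server noted in the comment):

```sql
-- On the primary server's master database: add a readable geo-secondary.
ALTER DATABASE [MyDb]
    ADD SECONDARY ON SERVER [my-secondary-server]
    WITH ( ALLOW_CONNECTIONS = ALL );

-- On the secondary server's master database: planned failover
-- (waits until the secondary is fully synchronized, then swaps roles).
ALTER DATABASE [MyDb] FAILOVER;

-- Or, in a disaster where the primary is unreachable, force the failover
-- and accept possible data loss:
-- ALTER DATABASE [MyDb] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```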

Auto-failover groups are an extension of active geo-replication.  They are designed to manage the failover of multiple geo-replicated databases simultaneously, using an application-initiated failover or by delegating failover to the SQL Database service based on user-defined criteria.  The latter allows you to automatically recover multiple related databases in a secondary region after a catastrophic failure or other unplanned event that results in full or partial loss of the SQL Database service’s availability in the primary region.  Because auto-failover groups involve multiple databases, these databases must be configured on the primary server.  Both the primary and secondary servers for the databases in the failover group must be in the same subscription.  Auto-failover groups support replication of all databases in the group to only one secondary server in a different region.

If you are using active geo-replication and for any reason your primary database fails, or simply needs to be taken offline, you can initiate failover to any of your secondary databases.  When failover is activated to one of the secondary databases, all other secondaries are automatically linked to the new primary.  If you are using auto-failover groups to manage database recovery, any outage that impacts one or several of the databases in the group results in automatic failover.  You can configure the auto-failover policy that best meets your application needs, or you can opt out and use manual activation.  In addition, auto-failover groups provide read-write and read-only listener end-points that remain unchanged during failovers.  Whether you use manual or automatic failover activation, failover switches all secondary databases in the group to primary.  After the database failover is completed, the DNS record is automatically updated to redirect the end-points to the new region.

Active geo-replication leverages the Always On technology of SQL Server to asynchronously replicate committed transactions on the primary database to a secondary database using snapshot isolation.  The primary and secondary instances in a geo-replication relationship have independent HA capabilities, the same as a standalone instance would have.  Auto-failover groups provide the group semantics on top of active geo-replication but the same asynchronous replication mechanism is used.  While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to never have partial transactions.  Cross-region redundancy enables applications to quickly recover from a permanent loss of an entire datacenter or parts of a datacenter caused by natural disasters, catastrophic human errors, or malicious acts.  The specific recovery point objective (RPO) data can be found at Overview of Business Continuity (The time period of updates that you might lose is under 5 seconds).

More info:

Data Integrity in Azure SQL Database

High-availability and Azure SQL Database

Overview of business continuity with Azure SQL Database

Overview: Active geo-replication and auto-failover groups

Designing globally available services using Azure SQL Database

Spotlight on SQL Database Active Geo-Replication

Azure SQL Database Business Continuity Enhancements


Azure SQL Database high availability

In this blog I want to talk about how Azure SQL Database achieves high availability.  One of the major benefits from moving from on-prem SQL Server to Azure SQL Database is how much easier it is to have high availability – no need for creating and managing a SQL Server failover cluster, AlwaysOn availability groups, database mirroring, log shipping, SAN replication, etc.

Azure SQL Database is a highly available database Platform as a Service that guarantees that your database is up and running 99.99% of the time, without you worrying about maintenance and downtime.  This is a fully managed SQL Server Database Engine process hosted in the Azure cloud that ensures that your SQL Server database is always upgraded/patched without affecting your workload.  Azure automatically handles patching, backups, replication, failure detection, underlying hardware, software or network failures, deploying bug fixes, failovers, database upgrades, and other maintenance tasks.  Azure SQL Database can quickly recover even in the most critical circumstances, ensuring that your data is always available.

Azure SQL Database is based on the SQL Server Database Engine architecture that is adjusted for the cloud environment in order to ensure 99.99% availability even in the cases of infrastructure failures.  There are two high-availability architectural models that are used in Azure SQL Database (both of them ensuring 99.99% availability):

(NOTE: Basic/Standard/Premium are service tiers that are DTU-based and used only for SQL Database Single, and General Purpose/Business Critical are vCore-based and used for both SQL Database Single and SQL Database Managed Instance)

  • Basic/Standard/General Purpose model that is based on remote storage.  This architectural model relies on the high availability and reliability of the storage tier, but it might have some potential performance degradation during maintenance activities.  This model uses Azure Premium Storage disks
  • Premium/Business Critical model that is based on a cluster of database engine processes.  This architectural model relies on the fact that there is always a quorum of available database engine nodes and has minimal performance impact on your workload even during maintenance activities.  This model uses Always On Availability Groups and locally attached SSD storage, and provides higher IOPS and throughput than Basic/Standard/General Purpose

Azure SQL Database runs on the latest stable version of SQL Server Database Engine and Windows OS, and most of the users would not notice that the upgrades are performed continuously.

More details on these two options:

Basic/Standard/General Purpose

High availability in these service tiers is achieved by separation of compute and storage layers and the replication of data in the storage tier (which uses Azure Premium Storage):

There are two layers:

  • Active compute nodes: A stateless compute layer that is running the sqlservr.exe process and contains only transient and cached data (for example – plan cache, buffer pool, column store pool).  This stateless SQL Server node is operated by Azure Service Fabric, which initializes the process, controls the health of the node, and performs failover to another node if necessary
  • Azure Storage accounts: A stateful data layer with database files (.mdf/.ldf) that are stored in Azure Premium Storage disks, which is remote storage (i.e. it is accessed over the network, using Azure network infrastructure).  It is able to use Azure Premium Storage by taking advantage of SQL Server’s native capability to use database files directly in Azure Blob Storage.  This means that there is not a disk or a network share that hosts database files; instead, the file path is an HTTPS URL, and each database file is a page blob in Azure Blob Storage.  Azure Storage guarantees that there will be no data loss for any record that is placed in any database file (since three copies of the data are made via LRS).  Azure Storage has built-in data availability/redundancy that ensures that every record in the log file or page in the data file will be preserved even if the SQL Server process crashes.  Note the tempdb database is not using Azure Premium Storage but rather is located on local SSD storage, which provides very low latency and high IOPS/throughput

Whenever the database engine or operating system is upgraded, or some part of the underlying infrastructure fails, or if some critical issue is detected in the SQL Server process, Azure Service Fabric will move the stateless SQL Server process to another stateless compute node.  There is a set of redundant (“spare”) nodes waiting to run the new compute service in case of failover in order to minimize failover time.  Data in the Azure storage layer is not affected, and data/log files are attached to the newly initialized SQL Server process.  Failover time can be measured in seconds.  This process guarantees 99.99% availability, but it might have some performance impact on a heavy workload that is running, due to the transition time and the fact that the new SQL Server node starts with a cold cache.

See Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose) for info on performance improvement.

Premium/Business Critical

High availability in these service tiers is designed for intensive workloads that cannot tolerate any performance impact due to the ongoing maintenance operations.

In the premium model, Azure SQL Database integrates compute and storage on a single node.  High availability in this architectural model is achieved by replication of compute (the SQL Server Database Engine process) and storage (locally attached SSD) deployed in an Always On Availability Groups cluster with enough replicas to achieve quorum and provide HA guarantees (currently 4 nodes as shown below):


Both the SQL Server Database Engine process and the underlying mdf/ldf files are placed on the same node with locally attached SSD storage providing low latency to your workload.  High availability is implemented using Always On Availability Groups.  Every database is a cluster of database nodes with one primary database that is accessible for the customer workload, and three secondary processes containing copies of data.  The primary node constantly pushes the changes to secondary nodes in order to ensure that the data is available on secondary replicas if the primary node crashes for any reason.  Failover is handled by the Azure Service Fabric – one secondary replica becomes the primary node and a new secondary replica is created to ensure enough nodes in the cluster, and the workload is automatically redirected to the new primary node.  Failover time is measured in milliseconds for most workloads, and the new primary instance is immediately ready to continue serving requests.

A note on the difference in the handling of a failover compared to on-prem: The database engine cannot control the failover because it may not be running when a failover has to occur, i.e. it may have just crashed.  Failover has to be initiated by a component external to the database engine.  For traditional SQL Server, this component is Windows Failover Clustering.  For SQL DB and MI, this is Service Fabric.

IO Performance difference:

vCore model (from here):

IO throughput (approximate):

  • General Purpose – Singleton database: 500 IOPS per vCore, with a 7,000 IOPS maximum
  • General Purpose – Managed Instance: 500-7,500 IOPS per data file (depends on size of file)
  • Business Critical: 5,000 IOPS per core, with a 200,000 IOPS maximum

DTU model (from here):

  • Basic: ~2.5 IOPS per DTU; IO latency ~5 ms (read), 10 ms (write)
  • Standard: ~2.5 IOPS per DTU; IO latency ~5 ms (read), 10 ms (write)
  • Premium: ~48 IOPS per DTU; IO latency ~2 ms (read/write)

Note: Managed Instance only supports the vCore model.
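The numbers above are approximations; to see the IO latency your own database files are actually experiencing, here is a sketch using the standard sys.dm_io_virtual_file_stats DMV (works in both Azure SQL Database and Managed Instance; no custom objects assumed):

```sql
-- Average read/write latency per database file since the statistics were last reset.
SELECT f.file_id,
       f.type_desc,
       vfs.io_stall_read_ms  * 1.0 / NULLIF(vfs.num_of_reads, 0)  AS avg_read_latency_ms,
       vfs.io_stall_write_ms * 1.0 / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
JOIN sys.database_files AS f
    ON vfs.file_id = f.file_id;
```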

A word about storage: Azure SQL Database Singleton and Managed Instance both use Azure Storage page blobs as the underlying persistent storage for their databases.  Azure premium managed disks are just premium page blobs with some API make-up so they look like disks.  Azure Storage page blobs do not officially announce performance numbers per size, while Azure premium managed disks do announce this info, which is why tables like this show performance numbers for disks and not page blobs.

A word about failover times:  In the Business Critical case, there is a secondary replica that is an exact read-only copy of the primary instance, so failover is just a switch to a new IP address and is almost instant.  In more realistic cases, there is always some lag in the secondary replica because it is constantly redoing transaction log records that are sent from the primary node.  Failover time is equal to the time needed to apply all remaining transaction log records to become consistent with the primary node, and then the switch to the new IP is completed.  Under a heavy workload that saturates both the primary and secondary replica, there is a chance that the secondary cannot immediately catch up to the primary, so the log redo time might be even longer.  The exact time depends on the workload, and there are no official numbers or a formula to calculate this.  In the General Purpose case, there is a stateless compute node ready to run sqlservr.exe that attaches the .mdf/.ldf files from remote storage.  This is a cold-cache process that must be initialized, so failover time is longer than in Business Critical.  Failover time depends on the database size and can also vary.

Finally, if you are interested in how Microsoft manages data integrity for Azure SQL Database, check out Data Integrity in Azure SQL Database.

More info:

High-availability and Azure SQL Database

Overview of business continuity with Azure SQL Database

Overview: Active geo-replication and auto-failover groups

Reaching Azure disk storage limit on General Purpose Azure SQL Database Managed Instance

File layout in General Purpose Azure SQL Managed Instance

What is built-in intelligence in Azure SQL Database?


Power BI: Dataflows

Dataflows, previously called Common Data Service for Analytics as well as Datapools, will be in preview soon and I wanted to explain in this blog what it is and how it can help you get value out of your data quickly (it’s a follow-up to my blog Getting value out of data quickly).

In short, Dataflows integrates data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep).  Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across the major functions such as marketing, sales, service, and finance, along with connectors that ingest data from the most common sources into these schemas.  This greatly simplifies modeling and integration challenges (it prevents multiple metadata definitions for the same data).  You can also extend the CDM by creating custom entities.  Lastly – Microsoft and their partners will be shipping out-of-the-box applications that run on Power BI that populate data in the Common Data Model and deliver insights through Power BI.

A dataflow is not just the data itself, but also logic on how the data is manipulated.  Dataflows belong to the Data Warehouse/Mart/Lake family.  Its main job is to aggregate, cleanse, transform, integrate and harmonize data from a large and growing set of supported on-premises and cloud-based data sources including Dynamics 365, Salesforce, Azure SQL Database, Excel, SharePoint.  Dataflows hold a collection of data-lake stored entities (i.e. tables) which are stored in internal Power BI Common Data Model compliant folders in Azure Data Lake Storage Gen2.

This adds two new layers to Power BI (Dataflows and Storage):

But you can instead use your own Azure Data Lake Storage Gen2 account, allowing other Azure services to reuse the data (i.e. Azure Databricks can be used to manipulate the data).

You can also setup incremental refresh for any entity, link to entities from other dataflows, and can pull data down from the dataflows into Power BI desktop.

To use dataflows, in the Power BI Service, under a Workspace: Create – Dataflow – Add entities: This starts online Power Query and you then choose a connector from one of the many data sources (just like you do with Power Query in Power BI Desktop).  Then choose a table to import and the screen will look like this:

To create a dashboard from these entities, in Power BI Desktop you simply choose Get Data -> Power BI dataflows.

The bottom line is Power BI users can now easily create a dataflow to prepare data in a centralized storage, using a standardized schema, ready for easy consumption, reuse, and generation of business insights.

Dataflows are a great way to have a power user get value out of data without involving IT.  But while this adds enterprise tools to Power BI, it does not mean you are creating an enterprise solution.  You still may need to create a data warehouse and cubes: See The need for having both a DW and cubes and Is the traditional data warehouse dead?.

More info:

Self-service data prep with dataflows

Microsoft Common Data Services

Video Introduction to Common Data Service For Analytics

Video Common Data Service for Analytics (CDS-A) and Power BI – an Introduction

Power BI expands self-service prep for big data, unifies modern and enterprise BI

Video Introducing: Advanced data prep with dataflows—for unified data and powerful insights

Dataflows in Power BI: A Data Analytics Gamechanger?

Video Introduction to the Microsoft Common Data Model

Video Power BI data preparation with Dataflows


Power BI new feature: Composite models

There are two really great features just added to Power BI that I wanted to blog about: Composite models and Dual storage mode.  This is part of the July release for Power BI Desktop and it is in preview (see Power BI Desktop July 2018 Feature Summary).  I’ll also talk about a future release called Aggregations.

First a review of the two ways to connect to a data source:

Import – The selected tables and columns are imported into Power BI Desktop.  As you create or interact with a visualization, Power BI Desktop uses the imported data.  You must refresh the data, which imports the full data set again (or use the preview feature incremental refresh), to see any changes that occurred to the underlying data since the initial import or the most recent refresh.  Import datasets in the Power BI service have a 10GB dataset limitation for the Premium version and a 1GB limitation for the free version (although with compression you can import much larger data sets).  See Data sources in Power BI Desktop

DirectQuery – No data is imported or copied into Power BI Desktop.  As you create or interact with a visualization, Power BI Desktop queries the underlying data source, which means you’re always viewing current data.  DirectQuery lets you build visualizations over very large datasets, where it otherwise would be unfeasible to first import all of the data with pre-aggregation.  See Data sources supported by DirectQuery.

Up until now in Power BI, when you connect to a data source using DirectQuery, it is not possible to connect to any other data source in the same report (all tables must come from a single database), nor to include data that has been imported.  The new composite model feature removes this restriction, allowing a single report to seamlessly combine data from one or more DirectQuery sources, and/or combine data from a mix of DirectQuery sources and imported data.  So this means you can combine multiple DirectQuery sources with multiple Import sources.  If your report has some DirectQuery tables and some import tables, the status bar on the bottom right of your report will show a storage mode of ‘Mixed.’  Clicking on this allows all tables to be switched to import mode easily.

For example, with composite models it’s possible to build a model that combines sales data from an enterprise data warehouse using DirectQuery, with data on sales targets that is in a departmental SQL Server database using DirectQuery, along with some data imported from a spreadsheet.  A model that combines data from more than one DirectQuery source, or combines DirectQuery with imported data is referred to as a composite model.

Also, composite models include a new feature called dual storage mode.  If you are using DirectQuery currently, all visuals will result in queries being sent to the backend source, even for simple visuals such as a slicer showing all the Product Categories.  The ability to define a table as having a storage mode of “Dual” means that a copy of the data for that table will also be imported, and any visuals that reference only columns from this table will use the imported data and not require a query to the underlying source.  The benefits of this are improved performance and a reduced load on the backend source.  But when a Dual table is queried together with large tables that use DirectQuery, it will operate as a DirectQuery table, so the large tables’ data does not need to be imported in order to join with it.

Another feature due out in the next 90 days is “Aggregations”, which allows you to create aggregation tables.  This new feature, along with composite models and dual storage mode, allows you to create a solution that uses huge datasets.  For example, say I have two related tables: one is at the detail grain, called Sales, and another is the aggregated totals of Sales, called Sales_Agg.  Sales is set to DirectQuery storage mode and Sales_Agg is set to Import storage mode.  If a user sends a query with a SELECT statement that has a GROUP BY that can be filled by the Sales_Agg table, the data will be pulled from cache in milliseconds since that table was imported (for example, 1.6 billion aggregated rows imported from SQL DW compressed to 10GB in memory).  If a user sends a query with a GROUP BY for a field that is not in the Sales_Agg table, it will do a DirectQuery to the Sales table (for example, sending a Spark query to a 23-node HDI Spark cluster of 1 trillion detail rows of 250TB, taking about 40 seconds).  The user is not aware there is a Sales_Agg table (all aggregation tables are hidden) – they simply send a query to Sales and Power BI automatically redirects the query to the best table to use.  And if using a Date table, it can be set to Dual mode so it joins with Sales_Agg in memory in the first part of the example, or joins with Sales on the data source using DirectQuery in the second part of the example (so it does not have to pull the 1 trillion detail rows into Power BI in order to join with the imported Date table).
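To make the redirection concrete, here is a hedged sketch of the two query shapes, expressed in SQL terms the way the paragraph above describes them (the table and column names are the hypothetical ones from the example; in practice the user just interacts with visuals that generate DAX):

```sql
-- Grouping only by columns that exist in Sales_Agg: answered from the imported
-- (in-memory) aggregation table in milliseconds.
SELECT CustomerKey, SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY CustomerKey;

-- Grouping by a detail-level column that is not in Sales_Agg: falls back to a
-- DirectQuery against the detail Sales table (e.g. the Spark source).
SELECT SalesOrderNumber, SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY SalesOrderNumber;
```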

So you can think of aggregations as a replacement for creating an Azure Analysis Services tabular data model, saving on cost and optimization work.

You will need to right-click the Sales_Agg table and choose “Manage aggregations” to map the aggregated Sales_Agg table columns to the detail Sales table columns.  There is also a “Precedence” field that allows you to have multiple aggregation tables on the same fact table at different grains:

You can also create a report with a drillthrough feature, where users can right-click on a data point in a report page that was built with an aggregation table and drill through to a focused page, built using DirectQuery, to see details that are filtered to that context.

So in summary, there are three values for storage mode at the table level:

  • Import – When set to Import, imported tables are cached.  Queries submitted to the Power BI dataset that return data from Import tables can only be fulfilled from cached data
  • DirectQuery – With this setting, DirectQuery tables are not cached.  Queries submitted to the Power BI dataset (for example, DAX queries) that return data from DirectQuery tables can only be fulfilled by executing on-demand queries to the data source.  Queries submitted to the data source use the query language for that data source (for example, SQL)
  • Dual – Dual tables can act as either cached or not cached, depending on the context of the query submitted to the Power BI dataset.  In some cases, queries are fulfilled from cached data; in other cases, queries are fulfilled by executing an on-demand query to the data source

Note that changing a table to Import is an irreversible operation; it cannot be changed back to DirectQuery, or back to Dual.  Also note there are two limitations during the preview period: DirectQuery only supports the tabular model (not multi-dimensional model) and you can’t publish files to the Power BI service.

More info:

Power BI Monthly Digest – July 2018

Composite models in Power BI Desktop (Preview)

Storage mode in Power BI Desktop (Preview)

Using DirectQuery in Power BI

Power BI Composite Models: The Good, The Bad, The Ugly

Composite Model; DirectQuery and Import Data Combined; Evolution Begins in Power BI

Video Building Enterprise grade BI models with Microsoft Power BI Premium

Video Building a data model to support 1 trillion rows of data and more with Microsoft Power BI Premium

Video Power BI and the Future for Modern and Enterprise BI

Video Introducing: Advanced data prep with dataflows—for unified data and powerful insights

Understanding Power BI Dual Storage

Video Microsoft Power BI Premium: Building enterprise-grade BI models for big data


The need for having both a DW and cubes

I have heard some people say if you have a data warehouse, there is no need for cubes (when I say “cubes” I am referring to tabular and multidimensional OLAP models).  And I have heard others say if you have OLAP cubes, you don’t need a data warehouse.  I strongly disagree with both these statements, as almost all the customers I see that are building a modern data warehouse use both in their solutions.  Here are some reasons for both:

Why have a data warehouse if you can just use a cube?

  • Breaking down complex steps so easier to build cube
  • Cube is departmental view (cube builder not thinking enterprise solution)
  • Easier to clean/join/master data in DW
  • Processing cube is slow against sources
  • One place to control data for consistency and have one version of the truth
  • Use by tools that need relational format
  • Cube does not have all data
  • Cube may be behind in data updates (needs processing)
  • DW is place to integrate data
  • Risk of having multiple cubes doing same thing
  • DW keeps historical records
  • Easier to create data marts from DW

Reasons to report off cubes instead of the data warehouse (a summary from my prior blog post of Why use a SSAS cube?):

  • Semantic layer
  • Handle many concurrent users
  • Aggregating data for performance
  • Multidimensional analysis
  • No joins or relationships
  • Hierarchies, KPIs
  • Row-level Security
  • Advanced time-calculations
  • Slowly Changing Dimensions (SCD)
  • Required for some reporting tools

The typical architecture I see looks like this:


Monitoring Azure SQL Database

There are a number of options to monitor Azure SQL Database.  In this post I will briefly cover the built-in options and not 3rd-party products that I blogged about a while back (see Azure SQL Database monitoring).

Monitoring alerts you to problems.  Monitoring also helps you determine whether your database has excess capacity or is having trouble because resources are maxed out, so you can then decide whether it’s time to adjust the performance level and service tier of your database.  You can monitor your database using:

  • Graphical tools in the Azure portal (click “Resource” on the Overview blade): monitor a single database’s metrics of CPU percentage, DTU percentage, Data IO percentage, Database size percentage and more.  You can configure alerts if metrics exceed or fall below a certain threshold over a time period – click “Alerts (Classic)” under “Monitoring”.
  • Use SQL dynamic management views (DMVs): The two main ones are sys.resource_stats in the logical master database of your server, and sys.dm_db_resource_stats in the user database (see the example query after this list).  You can use the sys.dm_db_resource_stats view in every SQL database.  The sys.dm_db_resource_stats view shows recent resource use data relative to the service tier.  Average percentages for CPU, data IO, log writes, and memory are recorded every 15 seconds and are maintained for 1 hour.  Because this view provides a more granular look at resource use, use sys.dm_db_resource_stats first for any current-state analysis or troubleshooting.  The sys.resource_stats view in the master database has additional information that can help you monitor the performance of your SQL database at its specific service tier and performance level.  The data is collected every 5 minutes and is maintained for approximately 14 days.  This view is useful for a longer-term historical analysis of how your SQL database uses resources.  See Monitoring Azure SQL Database using dynamic management views for other DMVs you might want to use
  • Monitor resource usage using SQL Database Query Performance Insight (requires Query Store).  Review top CPU consuming queries and view individual query details
  • Azure SQL Intelligent Insights is proactive monitoring that uses built-in intelligence to continuously monitor database usage through artificial intelligence and detect disruptive events that cause poor performance.  Once detected, a detailed analysis is performed that generates a diagnostics log (usually to Azure Log Analytics) with an intelligent assessment of the issue.  This assessment consists of a root cause analysis of the database performance issue and, where possible, recommendations for performance improvements.  Intelligent Insights analyzes SQL Database performance by comparing the database workload from the last hour with the past seven-day baseline workload.  It also monitors absolute operational thresholds and detects issues with excessive wait times, critical exceptions, and issues with query parameterizations that might affect performance.  The system automatically considers changes to the workload and changes in the number of query requests made to the database to dynamically determine normal and out-of-the-ordinary database performance thresholds.  Integration of Intelligent Insights with Azure Log Analytics is performed through first enabling Intelligent Insights logging (selecting “SQLInsights” under LOG) and then configuring Intelligent Insights log data to be streamed into Azure Log Analytics, which is a feature of the Operations Management Suite (OMS)
  • Azure SQL Analytics: provides reporting and alerting capabilities on top of the Intelligent Insights and other diagnostics log data as well as metric data
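As an example of the DMV option above, a minimal sketch of a query against sys.dm_db_resource_stats (run in the user database; no custom objects assumed):

```sql
-- Most recent resource usage samples (captured every 15 seconds, kept for ~1 hour),
-- expressed as percentages of the service tier's limits.
SELECT TOP (20)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;
```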

Other ways of monitoring SQL Database:

More info:

Monitoring database performance in Azure SQL Database


Azure Data Lake Store Gen2

Big news!  The next generation of Azure Data Lake Store (ADLS) has arrived.  See the official announcement.

In short, ADLS Gen2 is the combination of the current ADLS (now called Gen1) and Blob storage.  Gen2 is built on Blob storage.  By GA, ADLS Gen2 will have all the features of both, which means it will have features such as limitless storage capacity, support for all Blob tiers (Hot, Cool, and Archive), the new lifecycle management feature, Azure Active Directory integration, a hierarchical file system, and read-access geo-redundant storage.

Gen2 has a capability called “multi-modal”, which means customers can use either Blob object store APIs or the new Gen2 file system APIs.  The key here is that both blob and file system semantics are now supported over the same data.

For existing customers of Gen1, once Gen2 is GA, no new features will be added to Gen1.  Customers can stay on Gen1 if they don’t need any new capabilities, or they can move to Gen2 where they can leverage all the goodness of the combined capabilities.  They can upgrade when they choose to do so.

Existing customers of Blob storage can continue to use Blob storage to save a bit of money (storage costs will be the same between Blob and Gen2 but transaction costs will be a bit higher for Gen2 due to the overhead of namespaces).  By GA, existing Blob storage accounts will just need to “enable Gen2” to get all the features of Gen2.  Before GA, they will need to copy their data from Blob storage to Gen2.

New customers should go with Gen2 unless the simplicity of an object store is all that is needed – for example,  storing images, storing backup data, website hosting, etc where the apps really don’t benefit from a file system namespace and the customer wants to save a bit of money on transaction costs.

Note that Blob storage and ADLS Gen1 will continue to exist and that Gen2 pricing will be roughly half of Gen1.

It was announced yesterday (June 27th) and will be available for a limited public preview (customers will have to sign up).

Because ADLS Gen2 is part of blob storage, it is a “ring 0” service and will at GA be available in all regions.  The limited public preview program kicks off with two regions in the US with new regions added throughout the preview window.

For those using the current Blob SDKs: Initially the SDKs are different and some code changes will be required.  Microsoft is looking at whether they can reduce the need for code changes.  For customers using the WASB or ADLS driver, it will be as simple as switching to the new Gen2 driver and changing configs.

Check out the Azure Data Lake Storage Gen2 overview video for more info as well as A closer look at Azure Data Lake Storage Gen2 and finally check out the Gen2 documentation.
