Power BI: Dataflows

Dataflows, previously called Common Data Service for Analytics as well as Datapools, will be in preview soon, and I wanted to explain in this blog what they are and how they can help you get value out of your data quickly (it’s a follow-up to my blog Getting value out of data quickly).

In short, dataflows integrate data lake and ETL technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize, and manage data within their Power BI experience (think of it as self-service data prep).  Dataflows include a standard schema, called the Common Data Model (CDM), that contains the most common business entities across major functions such as marketing, sales, service, and finance, along with connectors that ingest data from the most common sources into these schemas.  This greatly simplifies modeling and integration challenges (it prevents multiple metadata definitions for the same data).  You can also extend the CDM by creating custom entities.  Lastly – Microsoft and its partners will ship out-of-the-box applications that run on Power BI, populate data in the Common Data Model, and deliver insights through Power BI.

A dataflow is not just the data itself but also the logic for how the data is manipulated.  Dataflows belong to the data warehouse/mart/lake family.  Their main job is to aggregate, cleanse, transform, integrate, and harmonize data from a large and growing set of supported on-premises and cloud-based data sources, including Dynamics 365, Salesforce, Azure SQL Database, Excel, and SharePoint.  A dataflow holds a collection of data-lake-stored entities (i.e. tables) that are kept in internal, Common Data Model-compliant folders in Azure Data Lake Storage Gen2.

This adds two new layers to Power BI: dataflows and storage.

But you can instead use your own Azure Data Lake Storage Gen2 account, allowing other Azure services to reuse the data (e.g. Azure Databricks can be used to manipulate the data).

You can also set up incremental refresh for any entity, link to entities from other dataflows, and pull data from dataflows into Power BI Desktop.

To use dataflows, in the Power BI service, under a workspace, choose Create – Dataflow – Add entities.  This starts Power Query Online, and you then choose a connector from one of the many data sources (just like you do with Power Query in Power BI Desktop) and then choose a table to import.

To create a dashboard from these entities, in Power BI Desktop you simply choose Get Data -> Power BI dataflows.

The bottom line is that Power BI users can now easily create a dataflow to prepare data in centralized storage, using a standardized schema, ready for easy consumption, reuse, and generation of business insights.

Dataflows are a great way for a power user to get value out of data without involving IT.  But while this adds enterprise tools to Power BI, it does not mean you are creating an enterprise solution.  You still may need to create a data warehouse and cubes: see The need for having both a DW and cubes and Is the traditional data warehouse dead?.

More info:

Self-service data prep with dataflows

Microsoft Common Data Services

Video Introduction to Common Data Service For Analytics

Video Common Data Service for Analytics (CDS-A) and Power BI – an Introduction

Power BI expands self-service prep for big data, unifies modern and enterprise BI

Video Introducing: Advanced data prep with dataflows—for unified data and powerful insights

Dataflows in Power BI: A Data Analytics Gamechanger?


Power BI new feature: Composite models

There are two really great features just added to Power BI that I wanted to blog about: Composite models and Dual storage mode.  These are part of the July release of Power BI Desktop and are in preview (see Power BI Desktop July 2018 Feature Summary).  I’ll also talk about a future release called Aggregations.

First a review of the two ways to connect to a data source:

Import – The selected tables and columns are imported into Power BI Desktop.  As you create or interact with a visualization, Power BI Desktop uses the imported data.  You must refresh the data, which imports the full data set again (or use the preview feature incremental refresh), to see any changes that occurred to the underlying data since the initial import or the most recent refresh.  Import datasets in the Power BI service have a 10GB dataset limitation for the Premium version and a 1GB limitation for the free version (although with compression you can import much larger data sets).  See Data sources in Power BI Desktop.

DirectQuery – No data is imported or copied into Power BI Desktop.  As you create or interact with a visualization, Power BI Desktop queries the underlying data source, which means you’re always viewing current data.  DirectQuery lets you build visualizations over very large datasets, where it otherwise would be unfeasible to first import all of the data with pre-aggregation.  See Data sources supported by DirectQuery.

Until now in Power BI, when you connected to a data source using DirectQuery, it was not possible to connect to any other data source in the same report (all tables had to come from a single database), nor to include data that had been imported.  The new composite model feature removes this restriction, allowing a single report to seamlessly combine data from one or more DirectQuery sources, and/or combine data from a mix of DirectQuery sources and imported data.  So this means you can combine multiple DirectQuery sources with multiple Import sources.  If your report has some DirectQuery tables and some import tables, the status bar on the bottom right of your report will show a storage mode of ‘Mixed.’  Clicking on this allows all tables to be switched to import mode easily.

For example, with composite models it’s possible to build a model that combines sales data from an enterprise data warehouse using DirectQuery, with data on sales targets that is in a departmental SQL Server database using DirectQuery, along with some data imported from a spreadsheet.  A model that combines data from more than one DirectQuery source, or combines DirectQuery with imported data is referred to as a composite model.

Also, composite models include a new feature called dual storage mode.  If you are using DirectQuery currently, all visuals result in queries being sent to the backend source, even for simple visuals such as a slicer showing all the Product Categories.  The ability to define a table as having a storage mode of “Dual” means that a copy of the data for that table will also be imported, and any visuals that reference only columns from this table will use the imported data and will not require a query to the underlying source.  The benefits of this are improved performance and a lighter load on the backend source.  And when a dual table is queried together with large DirectQuery tables, it operates as a DirectQuery table, so the data of those large tables does not need to be imported in order to join with it.

Another feature due out in the next 90 days is “Aggregations,” which allows you to create aggregation tables.  This new feature, along with composite models and dual storage mode, allows you to create a solution that uses huge datasets.  For example, say I have two related tables: one at the detail grain called Sales, and another holding the aggregated totals of Sales called Sales_Agg.  Sales is set to DirectQuery storage mode and Sales_Agg is set to Import storage mode.  If a user sends a query with a SELECT statement that has a GROUP BY that can be filled by the Sales_Agg table, the data will be pulled from cache in milliseconds since that table was imported (for example, 1.6 billion aggregated rows imported from SQL DW compressed to 10GB in memory).  If a user sends a query with a GROUP BY for a field that is not in the Sales_Agg table, it will do a DirectQuery to the Sales table (for example, sending a Spark query to a 23-node HDI Spark cluster holding 1 trillion detail rows, about 250TB, taking about 40 seconds).  The user is not aware there is a Sales_Agg table (all aggregation tables are hidden) – they simply send a query against Sales and Power BI automatically redirects the query to the best table to use.  And if using a Date table, it can be set to Dual mode so it joins with Sales_Agg in memory in the first part of the example, or joins with Sales at the data source using DirectQuery in the second part of the example (so it does not have to pull the 1 trillion detail rows into Power BI in order to join with the imported Date table).
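As a rough illustration of the two paths (in practice Power BI generates the queries from your DAX; the column names here are hypothetical):

```sql
-- Query 1: groups only by columns that exist in Sales_Agg,
-- so Power BI answers it from the imported (cached) aggregation table.
SELECT ProductCategory, SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY ProductCategory;

-- Query 2: groups by a detail-level column that is not in Sales_Agg,
-- so Power BI falls back to DirectQuery against the detail Sales table.
SELECT OrderId, SUM(SalesAmount) AS TotalSales
FROM Sales
GROUP BY OrderId;
```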

So you can think of aggregations as a replacement for creating an Azure Analysis Services tabular data model, saving on cost and optimization work.

You will need to right-click the Sales_Agg table and choose “Manage aggregations” to map the aggregated Sales_Agg table columns to the detail Sales table columns.  There is also a “Precedence” field that allows you to have multiple aggregation tables on the same fact table at different grains:

You can also create a report with a drillthrough feature: users right-click on a data point in a report page that was built with an aggregation table and drill through to a focused page, built using DirectQuery, with details filtered to that context.

So in summary, there are three values for storage mode at the table level:

  • Import – When set to Import, imported tables are cached.  Queries submitted to the Power BI dataset that return data from Import tables can only be fulfilled from cached data
  • DirectQuery – With this setting, DirectQuery tables are not cached.  Queries submitted to the Power BI dataset (for example, DAX queries) that return data from DirectQuery tables can only be fulfilled by executing on-demand queries to the data source.  Queries submitted to the data source use the query language for that data source (for example, SQL)
  • Dual – Dual tables can act as either cached or not cached, depending on the context of the query submitted to the Power BI dataset.  In some cases, queries are fulfilled from cached data; in other cases, queries are fulfilled by executing an on-demand query to the data source

Note that changing a table to Import is an irreversible operation; it cannot be changed back to DirectQuery or to Dual.  Also note there are two limitations during the preview period: DirectQuery only supports the tabular model (not the multidimensional model), and you can’t publish files to the Power BI service.

More info:

Power BI Monthly Digest – July 2018

Composite models in Power BI Desktop (Preview)

Storage mode in Power BI Desktop (Preview)

Using DirectQuery in Power BI

Power BI Composite Models: The Good, The Bad, The Ugly

Composite Model; DirectQuery and Import Data Combined; Evolution Begins in Power BI

Video Building Enterprise grade BI models with Microsoft Power BI Premium

Video Building a data model to support 1 trillion rows of data and more with Microsoft Power BI Premium

Video Power BI and the Future for Modern and Enterprise BI

Video Introducing: Advanced data prep with dataflows—for unified data and powerful insights

Understanding Power BI Dual Storage


The need for having both a DW and cubes

I have heard some people say if you have a data warehouse, there is no need for cubes (when I say “cubes” I am referring to tabular and multidimensional OLAP models).  And I have heard others say if you have OLAP cubes, you don’t need a data warehouse.  I strongly disagree with both these statements, as almost all the customers I see that are building a modern data warehouse use both in their solutions.  Here are some reasons for both:

Why have a data warehouse if you can just use a cube?

  • Breaking down complex steps makes it easier to build the cube
  • Cube is departmental view (cube builder not thinking enterprise solution)
  • Easier to clean/join/master data in DW
  • Processing cube is slow against sources
  • One place to control data for consistency and have one version of the truth
  • Used by tools that need a relational format
  • Cube does not have all data
  • Cube may be behind in data updates (needs processing)
  • DW is place to integrate data
  • Risk of having multiple cubes doing same thing
  • DW keeps historical records
  • Easier to create data marts from DW

Reasons to report off cubes instead of the data warehouse (a summary from my prior blog post Why use a SSAS cube?):

  • Semantic layer
  • Handle many concurrent users
  • Aggregating data for performance
  • Multidimensional analysis
  • No joins or relationships
  • Hierarchies, KPI’s
  • Row-level Security
  • Advanced time-calculations
  • Slowly Changing Dimensions (SCD)
  • Required for some reporting tools

The typical architecture I see looks like this:


Monitoring Azure SQL Database

There are a number of options to monitor Azure SQL Database.  In this post I will briefly cover the built-in options, not the 3rd-party products that I blogged about a while back (see Azure SQL Database monitoring).

Monitoring alerts you to problems.  It also helps you determine whether your database has excess capacity or is struggling because resources are maxed out, so you can decide whether it’s time to adjust the performance level and service tier of your database.  You can monitor your database using:

  • Graphical tools in the Azure portal (click “Resource” on the Overview blade): monitor a single database’s metrics such as CPU percentage, DTU percentage, Data IO percentage, Database size percentage, and more.  You can configure alerts if metrics exceed or fall below a certain threshold over a time period – click “Alerts (Classic)” under “Monitoring”.
  • Use SQL dynamic management views (DMVs): The two main ones are sys.resource_stats in the logical master database of your server and sys.dm_db_resource_stats in the user database.  You can use the sys.dm_db_resource_stats view in every SQL database; it shows recent resource use data relative to the service tier.  Average percentages for CPU, data IO, log writes, and memory are recorded every 15 seconds and are maintained for 1 hour.  Because this view provides a more granular look at resource use, use sys.dm_db_resource_stats first for any current-state analysis or troubleshooting.  The sys.resource_stats view in the master database has additional information that can help you monitor the performance of your SQL database at its specific service tier and performance level.  The data is collected every 5 minutes and is maintained for approximately 14 days, so this view is useful for longer-term historical analysis of how your SQL database uses resources.  See Monitoring Azure SQL Database using dynamic management views for other DMVs you might want to use (a sample query is sketched after this list)
  • Monitor resource usage using SQL Database Query Performance Insight (requires Query Store): review top CPU-consuming queries and view individual query details
  • Azure SQL Intelligent Insights is proactive monitoring that uses built-in intelligence to continuously monitor database usage through artificial intelligence and detect disruptive events that cause poor performance.  Once detected, a detailed analysis is performed that generates a diagnostics log (usually to Azure Log Analytics) with an intelligent assessment of the issue.  This assessment consists of a root cause analysis of the database performance issue and, where possible, recommendations for performance improvements.  Intelligent Insights analyzes SQL Database performance by comparing the database workload from the last hour with the past seven-day baseline workload.  It also monitors absolute operational thresholds and detects issues with excessive wait times, critical exceptions, and issues with query parameterizations that might affect performance.  The system automatically considers changes to the workload and changes in the number of query requests made to the database to dynamically determine normal and out-of-the-ordinary database performance thresholds.  Integration of Intelligent Insights with Azure Log Analytics is performed through first enabling Intelligent Insights logging (selecting “SQLInsights” under LOG) and then configuring Intelligent Insights log data to be streamed into Azure Log Analytics, which is a feature of the Operations Management Suite (OMS)
  • Azure SQL Analytics: provides reporting and alerting capabilities on top of the Intelligent Insights and other diagnostics log data as well as metric data
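Here is a minimal sketch of the DMV approach mentioned in the list above (the database name is hypothetical):

```sql
-- Last hour of resource use, recorded every 15 seconds (run in the user database)
SELECT end_time, avg_cpu_percent, avg_data_io_percent,
       avg_log_write_percent, avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;

-- Roughly 14 days of history, recorded every 5 minutes (run in the logical master database)
SELECT start_time, end_time, avg_cpu_percent, avg_data_io_percent, avg_log_write_percent
FROM sys.resource_stats
WHERE database_name = 'MyDatabase'   -- hypothetical name
ORDER BY start_time DESC;
```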

Other ways of monitoring SQL Database:

More info:

Monitoring database performance in Azure SQL Database


Azure Data Lake Store Gen2

Big news!  The next generation of Azure Data Lake Store (ADLS) has arrived.  See the official announcement.

In short, ADLS Gen2 is the combination of the current ADLS (now called Gen1) and Blob storage.  Gen2 is built on Blob storage.  By GA, ADLS Gen2 will have all the features of both, which means it will have features such as limitless storage capacity, support for all Blob tiers (Hot, Cool, and Archive), the new lifecycle management feature, Azure Active Directory integration, a hierarchical file system, and read-access geo-redundant storage.

Gen2 has a capability called “multi-modal,” which means customers can use either Blob object store APIs or the new Gen2 file system APIs.  The key here is that both blob and file system semantics are now supported over the same data.

For existing customers of Gen1, once Gen2 is GA, no new features will be added to Gen1.  Customers can stay on Gen1 if they don’t need any new capabilities, or they can move to Gen2 and leverage all the goodness of the combined capabilities.  They can upgrade when they choose to do so.

Existing customers of Blob storage can continue to use Blob storage to save a bit of money (storage costs will be the same between Blob and Gen2 but transaction costs will be a bit higher for Gen2 due to the overhead of namespaces).  By GA, existing Blob storage accounts will just need to “enable Gen2” to get all the features of Gen2.  Before GA, they will need to copy their data from Blob storage to Gen2.

New customers should go with Gen2 unless the simplicity of an object store is all that is needed – for example, storing images, storing backup data, website hosting, etc., where the apps really don’t benefit from a file system namespace and the customer wants to save a bit of money on transaction costs.

Note that Blob storage and ADLS Gen1 will continue to exist and that Gen2 pricing will be roughly half of Gen1.

It was announced yesterday (June 27th) and will be available for a limited public preview (customers will have to sign up).

Because ADLS Gen2 is part of blob storage, it is a “ring 0” service and will at GA be available in all regions.  The limited public preview program kicks off with two regions in the US with new regions added throughout the preview window.

For those using the current Blob SDKs: initially the SDKs are different and some code changes will be required.  Microsoft is looking at whether they can reduce the need for code changes.  For customers using the WASB or ADLS driver, it will be as simple as switching to the new Gen2 driver and changing configs.

Check out the Azure Data Lake Storage Gen2 overview video for more info as well as A closer look at Azure Data Lake Storage Gen2 and finally check out the Gen2 documentation.


Analytics Platform System (APS) AU7 released

The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), has just received an appliance update (AU7), which is sort of like a service pack, except that it includes many new features.

Below is what is new in this release:

Customers will get significantly improved query performance and enhanced security features with this release.  APS AU7 builds on the appliance update 6 (APS 2016) release as a foundation.  Upgrading to APS appliance update 6 is a prerequisite to upgrading to appliance update 7.

Faster performance

APS AU7 now provides the ability to automatically create statistics and update existing outdated statistics for improved query optimization.  APS AU7 also adds support for setting multiple variables from a single SELECT statement, reducing the number of redundant round trips to the server and improving overall query and ETL performance.  Other T-SQL features include the HASH and ORDER GROUP query hints to provide more control over query execution plans.
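A rough T-SQL sketch of those two additions (table and column names are hypothetical):

```sql
-- Assign multiple variables from a single SELECT (one round trip instead of several)
DECLARE @min_price money, @max_price money;
SELECT @min_price = MIN(ListPrice),
       @max_price = MAX(ListPrice)
FROM dbo.DimProduct;            -- hypothetical table

-- Query hints that give more control over the execution plan
SELECT ProductKey, COUNT(*) AS order_count
FROM dbo.FactOrders             -- hypothetical table
GROUP BY ProductKey
OPTION (HASH GROUP);            -- or OPTION (ORDER GROUP)
```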

Better security

APS AU7 also includes the latest firmware and drivers, along with the hardware and software patches from our hardware partners that address the Spectre/Meltdown vulnerabilities.

Management enhancements

Customers already on APS 2016 will experience an enhanced upgrade process to APS AU7, allowing a shorter maintenance window with the ability to uninstall and roll back to a previous version.  AU7 also introduces a section called Feature Switch in the configuration manager, giving customers the ability to customize the behavior of new features.

More info:

Microsoft releases the latest update of Analytics Platform System


Understanding Cosmos DB

Cosmos DB is an awesome product that is mainly used for large-scale OLTP solutions.  Any web, mobile, gaming, or IoT application that needs to handle massive amounts of data, reads, and writes at a globally distributed scale with near-real-time response times for a variety of data is a great use case (it can be scaled out to support many millions of transactions per second).  Because it fits in the NoSQL category and is a scale-out solution, it can be difficult to wrap your head around how it works if you come from the relational world (i.e. SQL Server).  So this blog will lay out the differences in how Cosmos DB works.

First, a quick comparison of terminology to help you understand the difference:

RDBMS                        | Cosmos DB (Document Model) | Cosmos DB (Graph Model)
Database                     | Database                   | Database
Table, view                  | Collection                 | Graph
Row                          | Document (JSON)            | Vertex
Column                       | Property                   | Property
Foreign Key                  | Reference                  | Edge
Join                         | Embedded document          | .out()
Partition Key/Sharding Key   | Partition Key              | Partition Key

From Welcome to Azure Cosmos DB and other documentation, here are some key points to understand:

  • You can distribute your data to any number of Azure regions, with the click of a button. This enables you to put your data where your users are, ensuring the lowest possible latency to your customers
  • When a new region gets added, it is available for operations within 30 minutes anywhere in the world (assuming your data is 100 TBs or less).
  • To control the exact sequence of regional failovers in case of an outage, Azure Cosmos DB enables you to associate a priority with the various regions associated with the database account
  • Azure Cosmos DB enables you to configure the regions (associated with the database) for “read”, “write” or “read/write” regions.
  • For Cosmos DB to offer strong consistency in a globally distributed setup, it needs to synchronously replicate the writes or to synchronously perform cross-region reads.  The speed of light and the wide area network reliability dictates that strong consistency will result in higher latencies and reduced availability of database operations.  Hence, in order to offer guaranteed low latencies at the 99th percentile and 99.99% availability for all single region accounts and all multi-region accounts with relaxed consistency, and 99.999% availability on all multi-region database accounts, it must employ asynchronous replication.  This in-turn requires that it must also offer well-defined, relaxed consistency model(s) – weaker than strong (to offer low latency and availability guarantees) and ideally stronger than “eventual” consistency (with an intuitive programming model)
  • Using Azure Cosmos DB’s multi-homing APIs, an app always knows where the nearest region is and sends requests to the nearest data center.  All of this is possible with no config changes.  You set your write-region and as many read-regions as you want, and the rest is handled for you
  • As you add and remove regions to your Azure Cosmos DB database, your application does not need to be redeployed and continues to be highly available thanks to the multi-homing API capability
  • It supports multiple data models, including but not limited to document, graph, key-value, table, and column-family data models
  • APIs for the following data models are supported with SDKs available in multiple languages: SQL API, MongoDB API, Cassandra API, Gremlin API, and Table API
  • 99.99% availability SLA for all single-region database accounts, and 99.999% read availability on all multi-region database accounts.  Deploy to any number of Azure regions for higher availability and better performance
  • For a typical 1KB item, Cosmos DB guarantees end-to-end latency of reads under 10 ms and indexed writes under 15 ms at the 99th percentile within the same Azure region.  The median latencies are significantly lower (under 5 ms).  So you will want to deploy your app and your database to multiple regions to have users all over the world have the same low latency.  If you have an app in one region but the Cosmos DB database in another, then you will have additional latency between the regions (see Azure Latency Test to determine what that latency would be, or go to see existing latency via the Azure Portal and choose Azure Cosmos DB then choose your database then choose Metrics -> Consistency -> SLA -> Replication latency)
  • Developers reserve throughput of the service according to the application’s varying load.  Behind the scenes, Cosmos DB will scale up resources (memory, processor, partitions, replicas, etc.) to achieve that requested throughput while maintaining the 99th percentile of latency for reads to under 10 ms and for writes to under 15 ms. Throughput is specified in request units (RUs) per second.  The number of RUs consumed for a particular operation varies based upon a number of factors, but the fetching of a single 1KB document by id spends roughly 1 RU.  Delete, update, and insert operations consume roughly 5 RUs assuming 1 KB documents.  Big queries and stored procedure executions can consume 100s or 1000s of RUs based upon the complexity of the operations needed.  For each collection (bucket of documents), you specify the RUs
  • Throughput directly affects how much the user is charged but can be tuned up dynamically to handle peak load and down to save costs when more lightly loaded by using the Azure Portal, one of the supported SDKs, or the REST API
  • Request Units (RU) are used to guarantee throughput in Cosmos DB.  You will pay for what you reserve, not what you use.  RUs are provisioned by region and can vary by region as a result.  But they are not shared between regions.  This will require you to understand usage patterns in each region you have a replica
  • For applications that exceed the provisioned request unit rate for a container, requests to that collection are throttled until the rate drops below the reserved level.  When a throttle occurs, the server preemptively ends the request with RequestRateTooLargeException (HTTP status code 429) and returns the x-ms-retry-after-ms header indicating the amount of time, in milliseconds, that the user must wait before reattempting the request.  So, you will get 10ms reads as long as requests stay under the set RU’s
  • Cosmos DB provides five consistency levels: strong, bounded-staleness, session, consistent prefix, and eventual.  The further to the left in this list, the greater the consistency but the higher the RU cost which essentially lowers available throughput for the same RU setting.  Session level consistency is the default.  Even when set to lower consistency level, any arbitrary set of operations can be executed in an ACID-compliant transaction by performing those operations from within a stored procedure.  You can also change the consistency level for each request using the x-ms-consistency-level request header or the equivalent option in your SDK
  • Azure Cosmos DB accounts that are configured to use strong consistency cannot associate more than one Azure region with their Azure Cosmos DB account
  • There is no support for GROUP BY or other aggregation functionality found in database systems (a workaround is to use the Spark to Cosmos DB connector)
  • No database schema/index management – it automatically indexes all the data it ingests without requiring any schema or indexes and serves blazing fast queries.  By default, every field in each document is automatically indexed generally providing good performance without tuning to specific query patterns.  These defaults can be modified by setting an indexing policy which can vary per field.
  • Industry-leading, financially backed, comprehensive service level agreements (SLAs) for availability, latency, throughput, and consistency for your mission-critical data
  • There is a local emulator running under MS Windows for developer desktop use (was added in the fall of 2016)
  • Capacity options for a collection: Fixed (max of 10GB and 400 – 10,000 RU/s), Unlimited (1,000 – 100,000 RU/s). You can contact support if you need more than 100,000 RU/s.  There is no limit to the total amount of data or throughput that a container can store in Azure Cosmos DB
  • Costs: SSD Storage (per GB): $0.25 GB/month; Reserved RUs/second (per 100 RUs, 400 RUs minimum): $0.008/hour (for all regions except Japan and Brazil which are more)
  • Global distribution (also known as global replication/geo-redundancy/geo-replication) is for delivering low-latency access to data to end users no matter where they are located around the globe and for adding regional resiliency for business continuity and disaster recovery (BCDR).  When you choose to make containers span across geographic regions, you are billed for the throughput and storage for each container in every region and the data transfer between regions
  • Cosmos DB implements optimistic concurrency so there are no locks or blocks but instead, if two transactions collide on the same data, one of them will fail and will be asked to retry
  • Because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively “weak links” and will not be verified by the database itself.  If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.
  • You can set up a policy to geo-fence a database to specific regions.  This geo-fencing capability is especially useful when dealing with data sovereignty compliance that requires data to never leave a specific geographical boundary
  • Backups are taken every four hours and two are kept at all times.  Also, in the event of database deletion, the backups will be kept for thirty days before being discarded.  With these rules in place, the client knows that in the event of some unintended data modification, they have an eight-hour window to get support involved and start the restore process
  • Cosmos DB is an Azure data storage solution which means that the data at rest is encrypted by default and data is encrypted in transit.  If you need Role-Based Access Control (RBAC), Azure Active Directory (AAD) is supported in Cosmos DB
  • Within Cosmos DB, partitions are used to distribute your data for optimal read and write operations.  It is recommended to create a granular key with highly distinct values.  The partitions are managed for you.  Cosmos DB will split or merge partitions to keep the data properly distributed.  Keep in mind your key needs to support distributed writes and distributed reads (a query sketch that targets a single partition key appears after this list)
  • Until recently, writes could only be made to one region, but multi-region writes are now in private preview.  See Multi-master at global scale with Azure Cosmos DB.  With Azure Cosmos DB multi-master support, you can perform writes on containers of data (for example, collections, graphs, tables) distributed anywhere in the world.  You can update data in any region that is associated with your database account.  These data updates can propagate asynchronously.  In addition to providing fast access and write latency to your data, multi-master also provides a practical solution for failover and load-balancing issues.  To compare scaling out writes with Cosmos DB versus SQL Server, check out Distributed Writes
  • The new Azure Cosmos DB Explorer is in public preview – a full-screen, standalone, web-based version of the Data Explorer many of you already use in the Azure portal for Cosmos DB
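As a minimal sketch of a SQL API query (the container alias c is the convention; the property names are hypothetical), filtering on the partition key keeps the query scoped to a single partition, which keeps the RU charge low:

```sql
-- Cosmos DB SQL API: customerId is assumed to be the container's partition key
SELECT c.id, c.orderDate, c.total
FROM c
WHERE c.customerId = "12345"
  AND c.total > 100
```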

Azure Cosmos DB allows you to scale throughput (as well as storage) elastically across any number of regions depending on your needs or demand.

Azure Cosmos DB distributed and partitioned collections

The above picture shows a single Azure Cosmos DB container horizontally partitioned (across three resource partitions within a region) and then globally distributed across three Azure regions.

An Azure Cosmos DB container gets distributed in two dimensions (i) within a region and (ii) across regions. Here’s how (see Partition and scale in Azure Cosmos DB for more info):

  • Local distribution: Within a single region, an Azure Cosmos DB container is horizontally scaled out in terms of resource partitions.  Each resource partition manages a set of keys and is strongly consistent and highly available, being physically represented by four replicas (also called a replica set) with state machine replication among those replicas.  Azure Cosmos DB is a fully resource-governed system, where a resource partition is responsible for delivering its share of throughput for the budget of system resources allocated to it.  The scaling of an Azure Cosmos DB container is transparent to the users.  Azure Cosmos DB manages the resource partitions and splits and merges them as needed as storage and throughput requirements change
  • Global distribution: If it is a multi-region database, each of the resource partitions is then distributed across those regions.  Resource partitions owning the same set of keys across various regions form a partition set (see preceding figure).  Resource partitions within a partition set are coordinated using state machine replication across multiple regions associated with the database.  Depending on the consistency level configured, the resource partitions within a partition set are configured dynamically using different topologies (for example, star, daisy-chain, tree etc.)

The following links can help with understanding the core concepts better: Request units in Azure Cosmos DB, Performance tips for Azure Cosmos DB and .NET, Tuning query performance with Azure Cosmos DB, Partitioning in Azure Cosmos DB using the SQL API, and Leverage Azure CosmosDB metrics to find issues.

You can Try Azure Cosmos DB for Free without an Azure subscription, free of charge and commitments.  For a good training course on Cosmos DB check out Developing Planet-Scale Applications in Azure Cosmos DB and Learning Azure Cosmos DB.

More info:

Relational databases vs Non-relational databases

A technical overview of Azure Cosmos DB

Azure Cosmos DB vs. SQL Server: Scalability Differences


Getting value out of data quickly

There are times when you need to create a “quick and dirty” solution to build a report.  This blog will show you one way of using a few Azure products to accomplish that.  This should not be viewed as a replacement for a data warehouse, but rather as a way to quickly show a customer how to get value out of their data, or to build a one-time report, or to see if certain data would be useful to move into your data warehouse.

Let’s look at a high-level architecture for building a report quickly using NCR data (restaurant data):

This solution has the restaurant data that is in an on-prem SQL Server replicated to Azure SQL Database using transactional replication.  Azure Data Factory is then used to copy the point-of-sale transaction logs in Azure SQL Database into Azure Data Lake Store.  Then Azure Data Lake Analytics with U-SQL is used to transform/clean the data and store it back into Azure Data Lake Store.  That data is then used in Power BI to create the reports and dashboards (business users can build the models in Power BI, and the data can be refreshed multiple times during the day via the new incremental refresh).  This is all done with Platform-as-a-Service products, so there is nothing to set up and install and no VMs – just quickly and easily doing all the work via the Azure portal.
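A minimal U-SQL sketch of that transform/clean step (the file paths and columns are hypothetical):

```sql
// Read the raw point-of-sale extract from the data lake
@sales =
    EXTRACT StoreId int,
            SaleDate DateTime,
            Amount decimal
    FROM "/raw/pos_transactions.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// A simple cleanup/aggregation pass
@totals =
    SELECT StoreId,
           SUM(Amount) AS TotalAmount
    FROM @sales
    GROUP BY StoreId;

// Write the curated result back to the data lake for Power BI to consume
OUTPUT @totals
TO "/curated/store_totals.csv"
USING Outputters.Csv(outputHeader: true);
```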

This solution is inexpensive since there is no need for the more expensive services like Azure SQL Data Warehouse or Azure Analysis Services, and Azure Data Lake Analytics is a job service that you only pay for when a query runs (where you specify the analytics units to use).

Some things to keep in mind with a solution like this:

  • Power BI has been called “reporting crack” because once a business user is exposed to it they want more.  And this solution gives them their first taste
  • This solution should have a very limited scope – it’s more like a proof-of-concept and should be a short-term solution
  • It takes the approach of ELT instead of ETL in that data is loaded into Azure Data Lake Store and then converted using the power of Azure Data Lake Analytics instead of it being transformed during the move from the source system to the data lake like you usually do when using SSIS
  • This limits the data model building to one person using it for themselves or a department, versus having multiple people build models for an enterprise solution using Azure Analysis Services
  • This results in quick value but sacrifices an enterprise solution that includes performance, data governance, data history, referential integrity, security, and master data management.  Also, you will not be able to use tools that need to work against a relational format
  • This solution will normally require a power user to develop reports since it’s working against a data lake instead of an easier-to-use relational model or a tabular model

An even better way to get value out of data quickly is with another product that is in preview called Common Data Service for Analytics.  More on this in my next blog.


Microsoft Build event announcements

Another Microsoft event and another bunch of exciting announcements.  At the Microsoft Build event this week, the major announcements in the data platform space were:

Multi-master at global scale with Azure Cosmos DB.  Perform writes on containers of data (for example, collections, graphs, tables) distributed anywhere in the world. You can update data in any region that is associated with your database account. These data updates can propagate asynchronously. In addition to providing fast access and write latency to your data, multi-master also provides a practical solution for failover and load-balancing issues.  More info

Azure Cosmos DB Provision throughput at the database level in preview.  Azure Cosmos DB customers with multiple collections can now provision throughput at a database level and share throughput across the database, making large collection databases cheaper to start and operate.  More info

Virtual network service endpoint for Azure Cosmos DB.  Generally available today, virtual network service endpoint (VNET) helps to ensure access to Azure Cosmos DB from the preferred virtual network subnet.  The feature will remove the manual change of IP and provide an easier way to manage access to Azure Cosmos DB endpoint.  More info

Azure Cognitive Search now in preview.  Cognitive Search, a new preview feature in the existing Azure Search service, includes an enrichment pipeline allowing customers to find rich structured information from documents.  That information can then become part of the Azure Search index.  Cognitive Search also integrates with Natural Language Processing capabilities and includes built-in enrichers called cognitive skills.  Built-in skills help to perform a variety of enrichment tasks, such as the extraction of entities from text or image analysis and OCR capabilities.  Cognitive Search is also extensible and can connect to your own custom-built skills.  More info

Azure SQL Database and Data Warehouse TDE with customer managed keys.  Now generally available, Azure SQL Database and Data Warehouse Transparent Data Encryption (TDE) offers Bring Your Own Key (BYOK) support with Azure Key Vault integration.  Azure Key Vault provides highly available and scalable secure storage for RSA cryptographic keys backed by FIPS 140-2 Level 2 validated Hardware Security Modules (HSMs).  Key Vault streamlines the key management process and enables customers to maintain full control of encryption keys and allows them to manage and audit key access.  This is one of the most frequently requested features by enterprise customers looking to protect sensitive data and meet regulatory or security compliance obligations.  More info

Azure Database Migration Service is now generally available.  This is a service that was designed to be a seamless, end-to-end solution for moving on-premises SQL Server, Oracle, and other relational databases to the cloud. The service will support migrations of homogeneous/heterogeneous source-target pairs, and the guided migration process will be easy to understand and implement.  More info

4 new features now available in Azure Stream Analytics: Public preview: Session window; Private preview: C# custom code support for Stream Analytics jobs on IoT Edge, Blob output partitioning by custom attribute, Updated Built-In ML models for Anomaly Detection.  More info


Azure SQL Data Warehouse Gen2 announced

On Monday the general availability of the Compute Optimized Gen2 tier of Azure SQL Data Warehouse was announced.  With this performance-optimized tier, Microsoft is dramatically accelerating query performance and concurrency.

The changes in Azure SQL DW Compute Optimized Gen2 tier are:

  • 5x query performance via an adaptive caching technology, which takes a blended approach of using remote storage in combination with a fast SSD cache layer (using NVMe drives) that places data next to compute based on user access patterns and frequency
  • Significant improvement in serving concurrent queries (32 to 128 queries/cluster)
  • Removes limits on columnar data volume, enabling unlimited columnar data storage
  • 5 times higher computing power compared to the current generation by leveraging the latest hardware innovations that Azure offers via additional Service Level Objectives (DW7500c, DW10000c, DW15000c and DW30000c)
  • Added Transparent Data Encryption with customer-managed keys

The Azure SQL DW Compute Optimized Gen2 tier will roll out to 20 regions initially (you can find the full list of regions available), with subsequent rollouts to all other Azure regions.  If you have a Gen1 data warehouse, take advantage of the latest generation of the service by upgrading.  If you are getting started, try the Azure SQL DW Compute Optimized Gen2 tier today.
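Once on Gen2, moving to one of the new service level objectives mentioned above can be done with a single T-SQL statement run against the master database on the logical server – a minimal sketch (the data warehouse name is hypothetical):

```sql
-- Scale a Gen2 data warehouse to one of the new service level objectives
ALTER DATABASE MyDataWarehouse
MODIFY (SERVICE_OBJECTIVE = 'DW7500c');
```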

More info:

Turbocharge cloud analytics with Azure SQL Data Warehouse

Blazing fast data warehousing with Azure SQL Data Warehouse

Video Microsoft Mechanics video
