Common Data Model

The Common Data Model (CDM) is a shared data model: a common place to keep data that is shared between applications and data sources.  Another way to think of it is as a way to organize data from many sources, each in a different format, into a standard structure.

The Common Data Model includes over 340 standardized, extensible data schemas that Microsoft and its partners have published.  This collection of predefined schemas includes entities, attributes, semantic metadata, and relationships.  The schemas represent commonly used concepts and activities, such as Account and Campaign, to simplify the creation, aggregation, and analysis of data.  Having the metadata stored along with the data is very helpful when using a data lake, which normally lacks metadata due to its schema-on-read nature.  This graphic shows some elements of the standard entities (for more information see Common Data Model repository on GitHub):

Use the Visual Entity Navigator for interactively exploring entities, entity extensions/inheritance, attributes, and relationships.

Industry Accelerators are foundational components within the Microsoft Power Platform and Dynamics 365 that enable ISVs and other solution providers to quickly build industry vertical solutions.  The accelerators extend the Common Data Model with new entities to support a data schema for concepts within specific industries.  Think of the time savings if you are in one of these industries and need to create a data model.  Instead of spending weeks or months creating one from scratch, you have one already built for you.  Microsoft is focused on delivering accelerators for these industries and others:

The CDM is already supported in the Common Data Service for Apps, Dynamics 365, PowerApps, and Power BI via Dataflows, and it will be supported in many upcoming Azure data services.

With Power BI Dataflows, data is stored in CDM form in Azure Data Lake Storage (ADLS) Gen2 – either in internal storage provided by Power BI or in your organization’s ADLS Gen2 account (see Dataflows and Azure Data Lake integration (Preview)).  A dataflow can map the output of a query to an entity in the Common Data Model.  This is handled with the “Map to Standard” option in the Dataflow Power Query Editor (see CDM and Azure Data Services Integration).  Note that healthcare is currently the only industry accelerator supported for Power BI Dataflows.

The following diagram shows a sample CDM folder in ADLS Gen2, created by a Power BI Dataflow, that contains three entities:
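Each such CDM folder contains the entity data files plus a model.json file that carries the entity and attribute metadata.  To make that concrete, below is a minimal sketch (not from the original post) of reading that metadata programmatically, assuming the azure-storage-file-datalake Python package; the account, filesystem, and dataflow paths are placeholders.

```python
# Hedged sketch: list the entities described in a CDM folder's model.json in ADLS Gen2.
# Assumes the azure-storage-file-datalake package; account, filesystem, and paths are placeholders.
import json
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key-or-aad-token>",
)
filesystem = service.get_file_system_client("powerbi")  # filesystem the dataflow writes to
model_file = filesystem.get_file_client("MyWorkspace/MyDataflow/model.json")

model = json.loads(model_file.download_file().readall())
for entity in model.get("entities", []):
    attributes = [a["name"] for a in entity.get("attributes", [])]
    print(entity["name"], attributes)
```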

Still to be determined is how exactly CDM fits into a modern data warehouse solution, as there are still a number of products that do not have integration points into it yet (e.g. Azure Data Factory is not yet able to write directly to CDM).  However, there are manual steps you can do to integrate Azure Data Factory, Databricks, Azure Machine Learning, and SQL Data Warehouse with CDM until the products gain built-in integration (see CDM and Azure Data Services Integration).

If we think about just Power BI Dataflows, the current state is limiting: Power BI requires its own filesystem inside ADLS Gen2, which means that sadly we can’t integrate that storage with the rest of our data lake.  In turn, that makes it difficult to combine the enterprise data lake, Power BI Dataflows (which may produce CDM entities or custom entities), and CDM data from other products such as Dynamics.

Also, keep in mind that a modern data warehouse also has a relational database in the solution that is fed from the data lake, so the CDM in the data lake would need to be replicated in the relational database.  However, using Power BI Dataflows a business user can do the work of copying and transforming data from multiple source systems into the CDM in the data lake, instead of waiting for IT to do it (who may have a backlog and who may not understand the data).

One more thing to point out: the Open Data Initiative (ODI) is a collaboration between Microsoft, Adobe, and SAP to use the same common data models in their products.

More info:

Video Dataflows and data interop with the Common Data Model

Video Tips & Tricks for getting data of any shape and size into Azure Data Lake using Dataflows

What is the Common Data Model and Why Should I Care? Part 3 of Dataflow Series in Power BI

Video Introduction to the Microsoft Common Data Model

Video Microsoft Common Data Model (CDM): An introductory session – BRK2052


Microsoft Build event announcements

Another Microsoft event and another bunch of exciting announcements.  At the Microsoft Build event last week, the major announcements in the data platform and AI space were:

Machine Learning Services enhancements

In private preview is a new visual interface for Azure Machine Learning that adds drag-and-drop workflow capabilities to the Azure Machine Learning service.  It simplifies the process of building, testing, and deploying machine learning models for customers who prefer a visual experience over coding, and it brings the best of ML Studio and the AML service together.  The drag-and-drop experience lets any data scientist quickly build a model without coding, while still giving enough flexibility to fine-tune the model.  The AML service as the backend platform offers the scalability, security, and debuggability that ML Studio can’t provide.  The easy deployment capability in the visual interface generates the score.py file and creates images, and with a few clicks a trained model can be deployed to any AKS cluster associated with the AML service.  Think of this as Azure Machine Learning Studio V2. More info
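For reference, here is a hedged, code-first sketch of that last step (deploying a registered model to an AKS cluster attached to the workspace) using the azureml-core SDK; the workspace config, model name, scoring script, environment file, and cluster name are all hypothetical.

```python
# Hedged sketch using azureml-core; model, script, environment, and cluster names are hypothetical.
from azureml.core import Workspace, Model
from azureml.core.compute import AksCompute
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()                              # reads subscription/workspace from config.json
model = Model(ws, name="churn-model")                     # a model already registered in the workspace

inference_config = InferenceConfig(entry_script="score.py",  # scoring script (the visual tool can generate one)
                                   runtime="python",
                                   conda_file="env.yml")
deploy_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)
aks_target = AksCompute(ws, "my-aks-cluster")             # AKS cluster already attached to the workspace

service = Model.deploy(ws, "churn-service", [model],
                       inference_config, deploy_config, deployment_target=aks_target)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```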

Create, run, and explore automated machine learning experiments in the Azure portal without a single line of code. Automated machine learning automates the process of selecting the best algorithm to use for your specific data, so you can generate a machine learning model quickly. More info
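For those who prefer the SDK over the portal, below is a hedged sketch of the equivalent automated ML experiment; the dataset, label column, and experiment names are hypothetical.

```python
# Hedged sketch of an automated ML run via the azureml SDK; dataset and column names are hypothetical.
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
training_data = Dataset.get_by_name(ws, "customer-churn")     # a registered tabular dataset

automl_config = AutoMLConfig(task="classification",
                             training_data=training_data,
                             label_column_name="Churned",
                             primary_metric="AUC_weighted",
                             iterations=20)                   # number of algorithm/featurization combos to try

run = Experiment(ws, "automl-churn").submit(automl_config, show_output=True)
best_run, fitted_model = run.get_output()                     # best pipeline found by the sweep
```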

Azure Cognitive Services enhancements

Personalizer, along with Anomaly Detector and Content Moderator, is part of the new Decision category of Cognitive Services, which provides recommendations to enable informed and efficient decision-making for users.  Available now in preview. More info

Wrangling Data Flows in Azure Data Factory

A new capability called Wrangling Data Flows, available in preview, gives users the ability to explore and wrangle data at scale.  Wrangling Data Flows empowers users to visually discover and explore their data without writing a single line of code.  It is described at https://mybuild.techcommunity.microsoft.com/sessions/76997?source=speakerdetail#top-anchor (at the 21-minute mark).  It is what you use with Power Query Online: the execution runtime underneath is Mapping Data Flows, but the user interface is the Power Query UI. More info

Azure Database for PostgreSQL Hyperscale (Citus)

Now in Public Preview, this brings high-performance scaling to PostgreSQL database workloads by horizontally scaling a single database across hundreds of nodes to deliver blazingly fast performance and scale. With horizontal scale-out, you can fit more data in-memory, parallelize queries across hundreds of nodes, and index data faster. The addition of Hyperscale (Citus) as a deployment option for Azure Database for PostgreSQL simplifies your infrastructure and application design, saving organizations time to focus on their business needs. More info
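As a quick illustration of what horizontally scaling a single database looks like in practice, here is a hedged sketch (connection details and table/column names are placeholders) of creating a table and sharding it across the worker nodes with the Citus create_distributed_table function, using psycopg2.

```python
# Hedged sketch: shard a table across Hyperscale (Citus) worker nodes; names are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="mycitus.postgres.database.azure.com",
    dbname="citus", user="citus", password="<password>", sslmode="require",
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        tenant_id  bigint NOT NULL,
        event_id   bigint NOT NULL,
        payload    jsonb,
        created_at timestamptz DEFAULT now()
    );
""")

# Citus function that distributes (shards) the table on tenant_id across the worker nodes
cur.execute("SELECT create_distributed_table('events', 'tenant_id');")
conn.commit()
```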

Azure SQL Database Hyperscale

Now Generally Available. More info

Azure SQL Database Serverless

Azure SQL Database serverless is a new compute tier that optimizes price-performance and simplifies performance management for databases with intermittent, unpredictable usage.  Serverless automatically scales compute for single databases based on workload demand and bills for compute used per second.  Serverless databases automatically pause during inactive periods, when only storage is billed, and automatically resume when activity returns.  Serverless helps customers focus on building apps faster and more efficiently.  As with all databases in SQL Database, serverless databases are always up to date, highly available, and benefit from built-in intelligence for further security and performance optimization.

Azure SQL Data Warehouse’s support for semi-structured data

Azure SQL Data Warehouse now supports semi-structured data.  With one service, both structured and semi-structured data formats (like JSON) can be analyzed directly from the data warehouse for faster insights.  In private preview. More info

Azure Data Explorer enhancements

The new Spark connector will enable customers to seamlessly read data from Azure Data Explorer into a Spark DataFrame as well as write data to it.  This enables a number of use cases related to data transformation and machine learning in Spark, and the ability to use Azure Data Explorer as a data source/destination for interactive analytics or to operationalize machine learning models for fast scoring of data arriving in Azure Data Explorer.

Azure Data Explorer Python and R plugins will enable customers to embed Python and R code as part of the query. More info

Continuous data export: This feature writes CSV or Parquet files to the data lake as data streams in via event hubs, IoT hubs, or any other path.  It enables building analytical solutions over fresh data and seamlessly filling the data lake.

The ability to query data in the lake in its natural format using Azure Data Explorer.  Simply define this data as an external table in Azure Data Explorer and query it.  You can then join/union it with more data from Azure Data Explorer, SQL Server, and more.

Autoscale for Azure HDInsight

The Autoscale capability for Azure HDInsight is now available in public preview. Autoscale automatically scales your Spark, Hive, or MapReduce HDInsight clusters up or down based on load or a pre-defined schedule.

  • Load-based Autoscale uses several workload-specific metrics, such as CPU and memory usage, to intelligently scale your cluster between user-configurable min and max sizes based on the load
  • Schedule-based Autoscale allows you to set your own custom schedule for cluster scaling. For example, you can set your cluster to scale up to 10 nodes starting at 9:00 am and scale back down to 3 nodes at 9:00 pm on working days

The Autoscale capability can be configured using the Azure portal or programmatically using ARM templates.


Where should I clean my data?

As a follow-up to my blogs What product to use to transform my data? and Should I load structured data into my data lake?, I wanted to talk about where you should clean your data when building a modern data warehouse in Azure.  As an example, let’s say I have an on-prem SQL Server database and I want to copy one million rows from a few tables to a data lake (ADLS Gen2) and then to Azure SQL DW, where the data will be used to generate Power BI reports (for background on a data lake, check out What is a data lake? and Why use a data lake? and Data lake details).  There are five places where you could clean the data:

  1. Clean the data and optionally aggregate it as it sits in the source system.  The tool used for this would depend on the source system that stores the data (e.g. if SQL Server, you would use stored procedures).  The only benefit of this option is that if you aggregate the data, you will move less data from the source system to Azure, which can be helpful if you have a small pipe to Azure and don’t need the row-level details.  The disadvantages are: the raw source data is not available in the data lake, so you would always need to go back to the source system if you needed it again, and it may not even still exist in the source system; you would put extra stress on the source system when doing the cleaning, which could affect end users of the system; it could take a long time to clean the data, as the source system may not have fast performance; and you would not be able to use other tools (e.g. Hadoop, Databricks) to clean it.  Strongly advise against this option
  2. Clean data as it goes from the source to the data lake.  The products that could be used for this include SSIS, Azure Data Factory (ADF) Data Flows (renamed ADF Mapping Data Flows), Databricks, or Power BI Dataflow.  Note that ADF Mapping Data Flows can only be used against four data sources: Azure SQL DB, Azure SQL DW, Parquet (from ADLS & Blob), and Delimited Text (from ADLS & Blob).  Think of these four data sources as staging areas for the data to be used by ADF Mapping Data Flows.  The disadvantages of this option are: it adds time pulling data from the source system, which could affect performance for end users of the system; and you won’t have the raw data in the data lake.  If using SSIS or Power BI Dataflow, you would also get nonparallel row-by-row transformations (see Difference between ETL and ELT), which could be very slow when cleaning many rows of data.  See below for how SSIS compares to ADF Mapping Data Flows and Databricks.  Strongly advise against this option
  3. Clean data after it has landed in the data lake.  You land the data into a raw area in the data lake, clean it, then write it to a cleaned area in the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW via PolyBase, all of which can be orchestrated by ADF.  This is the preferred option as it results in having the raw data in the data lake, minimizes the time spent hitting the source system, and saves costs compared to cleaning the data in a data warehouse (see Reduce Costs By Adding A Data Lake To Your Cloud Data Warehouse).  This is also the fastest way to clean the data.  ADF makes it real easy to move data from a source system (it supports 79 sources) to the data lake (ADLS Gen2) by creating an ADF pipeline that uses the Copy Activity with a Self-Hosted Integration Runtime connected to your on-prem SQL Server.  You could land the data in the data lake in Parquet format with a high Data Integration Units (DIU) setting to make it super-fast.  ADF also makes it real easy via the Copy Activity to copy the cleaned data in the data lake to 23 destinations (“sinks”) such as SQL DW.  In addition, you could copy the cleaned data in the data lake to an MDM solution to master it, then write it back to the data lake in another layer called the mastered layer.  The cleaned area is also where you could join the data from various files together or split them apart for security reasons (e.g. splitting a file with rows from different countries into one file per country).  Products that could be used to clean the data: ADF Mapping Data Flows, Databricks, or HDInsight.  See below for how ADF Mapping Data Flows compares to Databricks, and see the PySpark sketch following this list for what the cleaning step can look like
  4. Clean data as it moves from the raw area in the data lake to SQL DW.  The disadvantages of this option are that you won’t have the cleaned data in the data lake, it does row-by-row transformations which are slow, and some products don’t use PolyBase so it would be slow to load the data into SQL DW.  Products that can be used for this include ADF Mapping Data Flows or Databricks using the SQL DW Connector (which uses PolyBase).  Strongly advise against this option
  5. Copy data from the raw area in the data lake to SQL DW and then clean it via stored procedures in SQL DW.  The disadvantages of this option are that it is more costly (SQL DW costs more than Databricks), the data transformations can affect users running queries on SQL DW, and the cleaned data won’t be available in the data lake for others to use.  You would use ADF to copy the data from the raw data area into staging in SQL DW, which would use PolyBase.  Then you would execute stored procedures to clean the data and copy it to production tables.  Strongly advise against this option
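As promised above, here is a minimal PySpark sketch of what the cleaning step in option 3 can look like when run in Databricks; the paths, column names, and cleaning rules are hypothetical.

```python
# Hedged PySpark sketch of option 3: read the raw layer, clean it, write the cleaned layer.
# Paths, column names, and cleaning rules are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()   # on Databricks, `spark` is already provided

raw_path = "abfss://datalake@myaccount.dfs.core.windows.net/raw/sales/"
cleaned_path = "abfss://datalake@myaccount.dfs.core.windows.net/cleaned/sales/"

df = spark.read.parquet(raw_path)

cleaned = (df.dropDuplicates(["OrderId"])                        # remove duplicate rows
             .filter(F.col("OrderDate").isNotNull())             # drop rows missing a key field
             .withColumn("Country", F.upper(F.trim("Country")))  # standardize a text column
             .withColumn("Amount", F.col("Amount").cast("decimal(18,2)")))

# Write the cleaned layer; ADF can then load it into SQL DW via PolyBase.
cleaned.write.mode("overwrite").parquet(cleaned_path)
```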

Be aware this is not a “one size fits all”: if you are building a solution that moves lots of data from many different data sources, there may be a handful of use cases within that solution where it is faster and/or easier to clean the data as it moves to or from the data lake.

A bit about how ADF Mapping Data Flows works “under the covers”.  ADF Mapping Data Flows is a big deal as it brings GUI-based design together with scale.  It uses the Azure Databricks compute engine under the covers:  ADF JSON code is converted to the appropriate code in the Scala programming language and will be prepared, compiled and executed in Azure Databricks which will automatically scale-out as needed.  The distribution of data across worker/compute nodes in Databricks is where the optimizations come in: each of the transformations in the data flow has an “optimize” tab that lets you control how the data gets partitioned across the worker nodes.  If your source is Azure SQL DB, it’ll be using the out-of-the-box JDBC driver, which means the Databricks workers are accessing Azure SQL DB directly.  There are slight exceptions – if Azure SQL Data Warehouse is the source/destination, then it can land the data temporarily in blob storage (not the hierarchical namespace / ADLS Gen2) before Polybasing it in.  If the source/destination can Polybase, you set a staging linked service/container in the Data Flow activity settings.

How does ADF Mapping Data Flows compare to using SSIS?  In SSIS, you would create an SSIS Data Flow Task to do, for example, a Lookup and a Data Conversion on the one million rows.  This could be slow because it would have to copy the data from SQL Server in batches (usually 10,000 rows) to the SSIS server (so you’re dependent on the size of the SSIS server – see How Much Memory Does SSIS Need?), then go row-by-row updating the data, and land each batch in a cleaned layer in ADLS Gen2 before doing the next one.  With ADF Mapping Data Flows, you create an ADF pipeline that uses the Copy Activity to copy the one million rows from SQL Server to a raw area in ADLS Gen2, then create a Data Flow activity in the ADF pipeline to do the transformations (see Azure Data Factory Data Flow), which behind the scenes fires up Databricks, puts the data in a Spark in-memory DataFrame across the workers, and runs Scala code on the DataFrame to update the rows.  This is very fast because the reads from the data lake into memory are done in parallel by the Databricks worker engines (you can be reading as many extents of that file as you have worker cores), and because Databricks can be scaled for the transformations (it’s still doing row-by-row iterations, but it’s doing them in parallel according to the number of workers/cores and the way the data is distributed across those workers).  After the transformations are done, you would use the ADF Mapping Data Flow Sink transformation to save the data in a cleaned layer in ADLS Gen2.  The end result is that ADF Mapping Data Flows is likely to be much faster than SSIS.

What is the difference between using Databricks vs ADF Mapping Data Flows?  ADF Mapping Data Flows is much easier to use because of its drag-and-drop interface, while Databricks is all code-based.  However, use Databricks if you like notebooks, want to use ML tools, and want more monitoring.  Performance when using ADF Mapping Data Flows is the same as using Databricks separately or using the Databricks Notebook activity in ADF, unless you use a different partitioning scheme (controlled by the “optimize” tab for ADF Mapping Data Flows) or a different cluster size.  The default for ADF Mapping Data Flows is 8 cores of general compute, with the driver node assigned 4 cores and the rest assigned to workers; that is part of the definition of the default Azure IR.  If you wish to set your own configuration, you create a new Azure IR (which can have up to 256 cores) and choose that named IR in your Data Flow activity.  Performance also depends on the startup time of Databricks when using an ADF Mapping Data Flow, which can take five minutes but can be greatly reduced if you turn on Mapping Data Flow Debug Mode or use the time-to-live setting (available soon), compared to using a separate Databricks cluster that is already running (instantaneous if you keep it running, but that can be very expensive).  At best you can expect at least a one-minute set-up time when using ADF Mapping Data Flows.  That’s the normal overhead of job execution in a distributed big data system like Spark, which is what ADF Mapping Data Flows is using under the hood.  The good news is that as you scale that workload up to large data volumes, that one-minute set-up time won’t increase with your workload, so your experienced performance will improve with data volume (see Mapping data flows performance and tuning guide).  Finally, keep in mind that Databricks allows you to build a generic solution that could clean data in many files, while ADF Mapping Data Flows requires a package per object.

More info:

ADF Data Flow tutorial videos

ADF Mapping Data Flow Patterns

Comparing Mapping and Wrangling Data Flows in Azure Data Factory


Two more of my presentations

I recently made available two more presentations that you might find helpful.  Feel free to download them and present them to others (adding a line that you got them from me is all I ask).  There is a full list of all my presentations with slide decks here.  I continually update all my decks, but unfortunately SlideShare no longer allows you to update decks, so just email me at jamesserra3@gmail.com if you want the latest one.

Azure data platform overview

Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build. (slides)

Machine Learning and AI

The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases. (slides)


Azure Data Explorer

Azure Data Explorer (ADX) was announced as generally available on Feb 7th.  In short, ADX is a fully managed data analytics service for near real-time analysis on large volumes of data streaming (i.e. log and telemetry data) from such sources as applications, websites, or IoT devices.  ADX makes it simple to ingest this data and enables you to perform complex ad-hoc queries on the data in seconds – ADX has speeds of up to 200MB/sec per node (up to 1000 nodes) and queries across a billion records take less than a second.  A typical use case is when you are generating terabytes of data from which you need to understand quickly what that data is telling you, as opposed to a traditional database that takes longer to get value out of the data because of the effort to collect the data and place it in the database before you can start to explore it.

It’s a tool for speculative analysis of your data, one that can inform the code you build, optimizing what you query for or helping build new models that can become part of your machine learning platform.  It not only works on numbers but also does full-text search on semi-structured or unstructured data.  One of my favorite demos was watching a query over 6 trillion log records – counting the number of critical errors by doing a full-text search for the word ‘alert’ in the event text – take just 2.7 seconds.  Because of this speed, ADX can be a replacement for search and log analytics engines such as Elasticsearch, Splunk, Logstash, or InfluxDB.  One way I heard it described that I liked was to think of it as an optimized cache on top of a data lake.

Azure Data Explorer integrates with other major services to provide an end-to-end solution that includes data collection, ingestion, storage, indexing, querying, and visualization.  It has a pivotal role in the data warehousing flow by executing the EXPLORE step below on terabytes of diverse raw data.

In addition to getting instant feedback from streaming data (via ad-hoc analysis of source data), another way to use ADX is to have it consume large amounts of streaming data, aggregate it, have it land in its internal database, and then have that aggregated data copied to ADLS Gen2 in CSV format via data export or Azure Data Factory.  Once there, it can then be transformed and cleaned using another tool such as Databricks (since ADX can’t transform data) and then copied to a traditional data warehouse such as Azure SQL Data Warehouse.  This can be much faster than trying to use other tools such as Azure Data Factory to land tons of streaming data into ADLS Gen2 and much less expensive than landing streaming data into Azure SQL Data Warehouse.

The following diagram shows the different aspects of working with Azure Data Explorer:

Work in Azure Data Explorer generally follows this pattern:

  1. Create database: Create a dedicated cluster (currently available VMs are D11_v2, D12_v2, D13_v2, D14_v2, L4, L8, L16) and then create one or more databases in that cluster.  See Quickstart: Create an Azure Data Explorer cluster and database.  You can scale up the cluster (increasing the size of the node/instance) or scale out the cluster, also known as autoscale (adding more nodes/instances).  Note the max number of nodes you can scale out to depends on the VM you have chosen
  2. Ingest data: Load data from a streaming or a batch source (Event Hub, IoT Hub, Blob storage or ADLS Gen2 via Event Grid, Kafka, the Logstash plugin, apps via APIs, or Azure Data Factory) into database tables in the database created in step 1 so that you can run queries against it.  This is done by creating a target table in ADX, connecting the source (such as an event hub) to this table, and mapping the incoming data to the column names in the target table.  Supported formats are Avro, CSV, JSON, MULTILINE JSON, PSV, SOH, SCSV, TSV, and TXT.  The database is maintenance free, so no DBA is needed for it.  See Quickstart: Ingest data from Event Hub into Azure Data Explorer
  3. Query database: Use an Azure web application or download the Kusto Explorer client tool to run, review, and share queries and results (you can go to the web application directly via dataexplorer.azure.com or via the “Query” link on the Azure Data Explorer cluster blade in the Azure portal).  In addition, you can send queries programmatically (using an SDK to query data from your code) or to a REST API endpoint (a minimal Python example follows this list).  The Azure portal-based query builder displays the results in a table where you can filter your data and even apply basic pivot table options.  It also includes tools for basic visualization, and you can choose from a range of chart types.  ADX uses a different query language called Kusto Query Language (KQL) that is for querying only (you cannot update or delete data using KQL) but is a very robust and powerful language that has 400+ operators (filtering, aggregation, grouping, joining, etc., as well as full-text indexing and time series analysis) and has built-in machine learning features such as clustering, regression, detecting anomalies, and forecasting.  Note the Kusto service can interpret and run T-SQL queries with some language limitations.  KQL can also query data from a SQL Database, Cosmos DB, and blob files.  KQL is the same language used in Azure Log Analytics and Application Insights.  You can find more information about the query language here.  You can also take an online PluralSight course.  See Quickstart: Query data in Azure Data Explorer
  4. Visualize results: You can visualize your analysis in ADX either through the ‘render’ command in KQL or via three options for connecting to data in Power BI: use the built-in connector, import a query from Azure Data Explorer, or use a SQL query.  Finally, there is a Python plugin so you can start using tools like the Anaconda analytics environment and Jupyter Notebooks with saved trained ML models that work with your Azure Data Explorer data sets
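As mentioned in step 3, queries can also be sent programmatically.  Below is a hedged sketch using the azure-kusto-data Python SDK; the cluster URL, database, table, and column names are placeholders.

```python
# Hedged sketch: run a KQL query against ADX with the azure-kusto-data SDK; names are placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

cluster = "https://mycluster.westus2.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
client = KustoClient(kcsb)

# KQL: count critical events per hour over the last day
query = """
Events
| where Timestamp > ago(1d) and Level == "Critical"
| summarize Count = count() by bin(Timestamp, 1h)
| order by Timestamp asc
"""

response = client.execute("MyDatabase", query)
for row in response.primary_results[0]:
    print(row["Timestamp"], row["Count"])
```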

All the products and tools that can interact with ADX, and its key capabilities:

  1. Capability for many data types, formats, and sources
    – Structured (numbers), semi-structured (JSON/XML), and free text
  2. Batch or streaming ingestion
    – Use a managed ingestion pipeline or queue a request for pull ingestion
  3. Compute and storage isolation
    – Independent scale out / scale in
    – Persistent data in Azure Blob Storage
    – Caching for low latency on compute
  4. Multiple options to support data consumption
    – Use out-of-the-box tools and connectors, or use APIs/SDKs for a custom solution

Kusto stores its ingested data in reliable storage (Azure Blob Storage), away from its actual processing nodes (e.g. Azure Compute).  To speed up queries on that data, Kusto caches this data (or parts of it) on its processing nodes, on SSD or even in RAM.  Kusto includes a sophisticated cache mechanism designed to make intelligent decisions about which data objects to cache.  The Kusto cache provides a granular cache policy (set when creating a database in step 1 above) that customers can use to differentiate between two data cache policies: hot data cache and cold data cache.  The Kusto cache will attempt to keep all data that falls into the hot data cache in local SSD (or RAM), up to the defined size of the hot data cache.  The remaining local SSD space will be used to hold data that is not categorized as hot.

The main implications of setting the hot cache policy are:

  • Cost: The cost of reliable storage can be dramatically cheaper than local SSD (for example, in Azure it is currently about 45 times cheaper)
  • Performance: Data can be queried faster when it is in local SSD.  This is particularly true for range queries, i.e. queries that scan large quantities of data
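The cache policy itself is set with a Kusto control command.  Below is a hedged sketch (the table name and retention window are placeholders) using the same azure-kusto-data client as in the earlier query example.

```python
# Hedged sketch: set the hot cache window on a table with a Kusto control command.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://mycluster.westus2.kusto.windows.net")
client = KustoClient(kcsb)

# Keep the last 31 days of the Events table on local SSD/RAM; older data is served from blob storage
client.execute_mgmt("MyDatabase", ".alter table Events policy caching hot = 31d")
```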

For more details on why ADX is so fast, check out Azure Data Explorer Technology 101 and Azure Data Explorer whitepaper.

Note that Azure Time Series Insights (TSI) is built on top of Azure Data Explorer.  TSI is a SaaS offering as part of the IoT product suite and ADX is a PaaS offering, part of the Azure data platform, powering TSI and other SaaS apps.  So, using ADX allows us to get a layer deeper than TSI and build a custom solution.  Products like TSI, Azure Monitor Log Analytics, Application insights, and Azure Security Insights are pre-built solutions for a specific purpose, where ADX is used when you are building your own analytics solution or platform and need full control of data sources, data schema and data management, the powerful analytics capabilities of KQL over time series and telemetry data, and a reserved resources cost model.

To sum up in one sentence, Azure Data Explorer is a big data analytics cloud platform optimized for interactive, ad-hoc queries on top of fast flowing data.  To get an idea of the cost, check out the cost estimator and note that your cluster can be stopped (you only pay for when it is running).

More info:

Azure Data Explorer Technology 101

Azure Data Explorer whitepaper

What is Azure Data Explorer and Kusto Querying Language (KQL)?

Time series analysis in Azure Data Explorer

How to save $$$ when using Azure Data Explorer

How to use Azure Data Explorer for large-scale data analysis

Video An overview of Azure Data Explorer (ADX)

Video Azure Data Explorer – Query billions of records in seconds!


My latest presentations

I frequently present at user groups, and always try to create a brand new presentation to keep things interesting.  We all know technology changes so quickly so there is no shortage of topics!  There is a list of all my presentations with slide decks.  Here are the new presentations I created the past year:

Building a modern data warehouse

Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each. (slides)

AI for an intelligent cloud and intelligent edge: Discover, deploy, and manage with Azure ML services

Discover, manage, deploy, monitor – rinse and repeat.  In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools while relying on Azure ML to optimize them to run in hardware accelerated environments for the cloud and the edge using FPGAs and Neural Network accelerators.  We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you.  Finally, we illustrate how you can leverage the model telemetry to retrain and improve your content. (slides)

Power BI for Big Data and the New Look of Big Data Solutions

New features in Power BI give it enterprise tools, but that does not mean it automatically creates an enterprise solution.  In this talk we will cover these new features (composite models, aggregations tables, dataflow) as well as Azure Data Lake Store Gen2, and describe the use cases and products of an individual, departmental, and enterprise big data solution.  We will also talk about why a data warehouse and cubes still should be part of an enterprise solution, and how a data lake should be organized. (slides)

How to build your career

In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft.  Along the way I saw my yearly income triple.  Is it because I know some secret?  Is it because I am a genius?  No!  It is just about laying out your career path, setting goals, and doing the work.

I’ll cover tips I learned over my career on everything from interviewing to building your personal brand.  I’ll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.

Your first step to enhancing your career will be to attend this session! Let me be your career coach! (slides)

Is the traditional data warehouse dead?

With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that?  No!  In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds.  I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake.  I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution.  And I’ll put it all together by showing common big data architectures. (slides)


Azure Data Lake Store Gen2 is GA

Azure Data Lake Store (ADLS) Gen2 was made generally available on February 7th.  In short, ADLS Gen2 is the best of the previous version of ADLS (now called ADLS Gen1) and Azure Blob Storage.  ADLS Gen2 is built on Blob storage and because of that it is a “ring 0” service and is available in all regions.  To create an ADLS Gen2 account, see Quickstart: Create an Azure Data Lake Storage Gen2 storage account.  Note that creating Blob storage or ADLS Gen2 both follow the same process, which is to create a storage account.  The difference is there is an Advanced tab with an ADLS Gen2 section where you set Hierarchical namespace (HNS) to Enabled to make the account ADLS Gen2 instead of Blob.

ADLS Gen2 has most of the features of both ADLS Gen1 and Blob storage (with the features not supported yet listed below).  Features currently supported include limitless storage capacity, Azure Active Directory integration, hierarchical file system, and read-access geo-redundant storage.

When to use Blob vs ADLS Gen2

New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2.  The exception is non-analytical use cases that only need object storage rather than hierarchical storage (i.e. video, images, backup files), in which case you can use Blob storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2, but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).

Upgrade guidance

Soon, existing Blob storage accounts (general purpose v2) will have a seamless mechanism to convert them to ADLS Gen2 and get all the features of ADLS Gen2 (e.g. go to the Configuration screen and set Hierarchical namespace to Enabled).  If your Blob storage account is general purpose v1, you will first need to upgrade it to general purpose v2 before you can set Hierarchical namespace to Enabled.  For now you can’t upgrade to ADLS Gen2 this way, so you will need to copy your data from Blob storage to ADLS Gen2.

For existing customers of ADLS Gen1, no new features will be added to ADLS Gen1.  You can stay on ADLS Gen1 if you don’t need any of the new capabilities, or move to ADLS Gen2 where you can leverage all the goodness of the combined capabilities and save money (ADLS Gen2 pricing will be roughly half of ADLS Gen1).  You can upgrade when you choose to do so, which means copying the data from ADLS Gen1 to ADLS Gen2 and making any app modifications required to run on ADLS Gen2, with migration tools coming soon.  See Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 for help on migrating, as well as a list of the Azure products and 3rd-party products that don’t yet support ADLS Gen2 and may make you want to hold off on migrating.

What works and what does not work

You might want to stay with Blob storage for now if there is a feature that is not yet supported in ADLS Gen2 such as soft delete, snapshots, object level storage tiers (Hot, Cool, and Archive), and lifecycle management (see Known issues with Azure Data Lake Storage Gen2).  If you are using a 3rd-party app or an Azure app, make sure that it supports ADLS Gen2 (see Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2).  If you are using the WASB or ADLS driver, it will be as simple as switching to the new ADLS Gen2 driver and changing configs.

Blob storage APIs and ADLS Gen2 APIs aren’t yet interoperable with each other in an ADLS Gen2 account (an account with HNS enabled).  If you have tools, applications, services, or scripts that use Blob APIs, and you want to use them to work with all of the content that you upload to your account, then don’t upgrade until Blob APIs become interoperable with ADLS Gen2 APIs.  Using a storage account without a hierarchical namespace means you don’t have access to ADLS Gen2-specific features, such as directory and filesystem access control lists.
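To show what those Gen2-specific features look like from code, here is a hedged sketch using the azure-storage-file-datalake package that creates a filesystem and a real directory and sets a POSIX-style ACL on it; the account, names, and the group object ID are placeholders.

```python
# Hedged sketch: create a filesystem and directory in an HNS-enabled account and set an ACL.
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential="<account-key-or-aad-token>",
)

filesystem = service.create_file_system("datalake")     # a filesystem (container with HNS enabled)
raw_dir = filesystem.create_directory("raw/sales")      # a real directory, not just a name prefix

# Replace the directory ACL: owner gets full access; the owning group and a specific AAD group get read/execute
raw_dir.set_access_control(
    acl="user::rwx,group::r-x,other::---,group:<aad-group-object-id>:r-x")
```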

Note that the file size limit in ADLS Gen2 is 5TB.

Replication options

Below is an excellent slide that describes the four replication options available with Azure Blob Storage, which are also available for ADLS Gen2.  Notice the SLA, which is a guarantee of the percentage of time Azure will successfully process requests to read or write data from storage.  Also notice the durability, which specifies the chance of losing data, which is incredibly small.

Check out the Azure Data Lake Storage Gen2 overview video for more info, as well as A closer look at Azure Data Lake Storage Gen2, and finally check out the Gen2 documentation.

More info:

Individually great, collectively unmatched: Announcing updates to 3 great Azure Data Services

10 THINGS TO KNOW ABOUT AZURE DATA LAKE STORAGE GEN2

Resources for Learning About Azure Data Lake Storage Gen2


Storage options for SQL Server database files in Azure

If you are using SQL Server in an Azure VM (IaaS) you have a number of options for where to store the database files (.mdf, .ldf, and .ndf).  Most customers use managed disks, available in a number of offerings: Standard HDD, Standard SSD, Premium SSD, and Ultra SSD.  Managed disks are highly recommended over unmanaged disks (see Azure Managed vs Unmanaged disks: The choice).  You create and then attach one or more managed disks to the VM.  Each VM comes with built-in storage, but this is a temporary storage drive, labeled as the D: drive, and is not persisted to Azure blob storage.  You do not want to store your user database files or user transaction log files on the D: drive (see Performance best practices for Azure VM).  The same goes for the C: drive, which holds the operating system, and should not be used for database files.

If using managed disks, the recommendation is to attach several premium SSDs to a VM.  From Sizes for Windows virtual machines in Azure you can see for the size DS5_v2 you can add 64 of 4TB data disks (P50) yielding the potential total volume size of up to 256TB per VM, and if you use the preview sizes (P80) your application can have up to around 2PB of storage per VM (64 of 32TB disks).  With premium storage disks, your applications can achieve 80,000 I/O operations per second (IOPS) per VM, and a disk throughput of up to 2,000 megabytes per second (MB/s) per VM.  Note the more disks you add the more storage you get and the more IOPS you get, but only up to a certain point due to VM limits, meaning at some point more disks get you more storage but not more IOPS.

Ultra SSD is in preview and an option for data-intensive workloads such as SAP HANA, top tier databases, and transaction-heavy workloads and has a disk capacity up to 4PB (64 of 64TB disks) and 160,000 IOPS per disk with a disk throughput of up to 2,000 megabytes per second (MB/s) per VM.  Ultra SSD has higher performance than Premium SSD so you can use fewer of them to achieve the performance you need, simplifying things (i.e. possibly not having to create a storage space) and saving costs without sacrificing IOPS (see Mission critical performance with Ultra SSD for SQL Server on Azure VM).  Note that there is no SQL Server VM image that uses Ultra SSD from the Azure Marketplace yet.

In the past, after provisioning a SQL Server VM (a SQL VM created from an Azure Marketplace image), you had to manually attach and configure the right number of data disks to provide the desired number of IOPS or throughput (MB/s).  Then you needed to stripe your SQL files across the disks or create a Storage Pool to divide the IOPS or throughput across them.  Finally, you’d have to configure SQL Server according to the performance best practices for Azure VM.

A couple of years ago, Microsoft made this part of the provisioning experience (but only when choosing a SQL Server VM from the Marketplace, not a regular VM).  When creating the VM you can easily configure the desired IOPS, throughput, and storage size within the limits of the selected VM size, as well as the target workload to optimize for (either online transaction processing or data warehousing).  As you change the IOPS, throughput, and storage size, it will automatically select the right number of disks to attach to the VM.  During the VM provisioning, if more than one disk is required for your specified settings, it will automatically create one Windows storage space (virtual drive) across all disks (see Windows Server 2012 Storage Virtualization Explained). For more details see Storage configuration for SQL Server VMs.

If you are using Ultra SSD, since there is no SQL Server VM image in the Azure Marketplace that uses it, you will need to implement storage spaces manually if you wish to use more than one data disk – see Implementing Storage Spaces Inside Azure Virtual Machines.

Keep in mind that Azure premium managed disks are just premium page blobs with some API make-up so they look like disks (disks and blobs use different APIs).  Azure storage page blobs do not officially announce performance numbers per size while Azure premium managed disks announce this info, so that is why tables like this will show performance numbers for disks and not page blobs.  Page blobs are one of three types of blob storage, the other two being block blobs and append blobs.

If you need extreme performance for SQL Server and are OK with not having high-availability for storage, the Lsv2-series VMs are optimized to use the local disk on the node attached directly to the VM rather than using durable data disks.  This allows for greater IOPS / throughput for your workloads.  It supports up to 19.2TB of storage.  Your applications can achieve up to 3.4M I/O operations per second (IOPS) per VM, and a disk throughput of up to 22,000 megabytes per second (MB/s) per VM.  However, you miss out on the high-availability benefits of Azure blob storage, so instead of using it for SQL Server, it is better used for NoSQL stores such as Apache Cassandra and MongoDB which replicate data across multiple VMs to achieve persistence in the event of the failure of a single VM.

Another option for storing the database files is Azure blob storage.  I see this used for some on-prem SQL Server installations when the latency caused by accessing blob storage from on-prem is not a big deal (see SQL Server data files in Microsoft Azure and SQL Server 2014 and Windows Azure Blob Storage Service: Better Together).  It is also sometimes used when using SQL Server in a VM with a large database (3-30 TB range) where a fast backup/restore is needed, as a file-snapshot backup (BACKUP DATABASE using WITH FILE_SNAPSHOT) can be done very quickly (note a file-snapshot backup cannot be done when database files are on managed disks, so you must do streaming backups).  Other benefits of storing the database files in Azure blob storage are Instant Log Initialization for SQL Server in Azure, fast consistency checking, and reporting off of a restored copy of the snapshot.  Be aware there are some limitations: you are limited to a database size of 35TB; geo-replication for your storage account is not supported (if a storage account is geo-replicated and a geo-failover happened, database corruption could occur); there is no local SSD cache to potentially improve storage performance; and the same network bandwidth is shared between the VM network traffic and the database storage traffic, which can affect network-intensive applications.  For more info see SQL Server VLDB in Azure: DBA Tasks Made Simple.
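For reference, a file-snapshot backup is just a T-SQL command.  Here is a hedged sketch issuing it through pyodbc; the server, database, and container URL are placeholders, and a SAS credential for the storage container is assumed to already exist on the instance.

```python
# Hedged sketch: file-snapshot backup of a database whose files live in blob storage.
# Server, database, and URL are placeholders; a SAS credential for the container must already exist.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myvm.example.com;"
    "DATABASE=master;UID=sqladmin;PWD=<password>",
    autocommit=True)   # BACKUP cannot run inside a transaction

conn.execute("""
    BACKUP DATABASE MyBigDb
    TO URL = 'https://myaccount.blob.core.windows.net/backups/MyBigDb.bak'
    WITH FILE_SNAPSHOT;
""")
```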

In looking at the PaaS version of SQL Server, Azure SQL Database, here is the underlying persistent storage that is used for its databases for the various options:

  • Singleton: DTU Basic tiers (max 2GB database size) and Standard tiers (max 1TB database size) use blob storage, Premium tier (max 4TB database size) uses a locally attached SSD; vCore General Purpose tier (max 4TB database size) uses premium blob storage, Business Critical tier (max 4TB database size) uses a locally attached SSD
  • Managed Instance: General Purpose tier (max 8TB database size) uses premium blob storage (see Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose)), Business Critical tier (max 4TB database size) uses a locally attached SSD
  • Hyperscale: Uses standard blob storage (max 100TB database size) but achieves very high performance due to multiple SSD-based caching tiers

More info:

Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose)

Azure SQL Database high availability

Premium blob storage

High-performance Premium Storage and managed disks for VMs

About disks storage for Azure Windows VMs

Tutorial: Use Azure Blob storage service with SQL Server 2016

Storage Configuration Guidelines for SQL Server on Azure VM

How to Choose the Correct Azure VM Disk Tier

Larger, more powerful Managed Disks for Azure Virtual Machines

Storage Considerations for Running SQL Server in Azure


What product to use to transform my data?

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL).  Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data.  There is good news and bad news when it comes to which product to use.  The good news is there are a lot of products to choose from.  The bad news is there are a lot of products to choose from :-).  I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part):

  • SSIS – The product used most often for on-prem ETL still has use cases when moving to the cloud, especially when migrating a solution from on-prem to the cloud.  If you have hundreds of SSIS packages, it may make the most sense to continue to use those packages and just change the destination adapter (e.g. from your on-prem SQL Server to Azure SQL Database) and have them execute under the Azure-SSIS integration runtime (IR) in Azure Data Factory (ADF).  Keep in mind if a package is accessing local resources (like storing temp files to a local path or using custom controls), some re-work will be required.  A concern will be if a package is moving a ton of data, as executing transformations row-by-row (ETL) can be much slower than batch transformations (ELT) – see Difference between ETL and ELT.  Also, the data has to be moved to the location of the IR and the compute power is limited by the size of the IR cluster.  For those SSIS packages that are too slow, you can just replace them with one of the products below
  • Azure Data Factory (ADF) – Now that ADF has a new feature called Data Flow, it can transform data so it is more than just an orchestration tool.  Behind the scenes, the ADF JSON code that is created when you build a solution is converted to the appropriate code in the Scala programming language and is prepared, compiled, and executed in Azure Databricks.  This means Data Flow operates in an ELT manner: it loads the data into a place where Databricks can access it, performs the transformations, and then moves it to the destination.  ADF provides a native ETL scheduler so that you can automate data transformation and movement processes either through visual data flows or via script activities that execute in Databricks or other execution engines (so, like with SSIS, data flows are row-by-row transformations, and for large amounts of data it may be faster to execute a batch transformation via a script in Databricks).  My thoughts on when to use ADF are obviously if you are already using it or if your skillset lies in SSIS, as it’s pretty easy to learn ADF with an SSIS background
  • Databricks – It is a Spark-based analytics platform which makes it great to use if you like to work with Spark, Python, Scala, and notebooks.  When choosing between Databricks and ADF, what I’ve noticed is that it depends highly on the customer personas and their capabilities.  There are plenty of Data Engineers and Data Scientists who want to get deep into Python or Scala and sling some code in Databricks Notebooks.  But the larger audience who wants to focus on building business logic to clean customer/address data, for example, doesn’t want to learn Python libraries, and will use the ADF visual data flow designer.  Many of those are also Data Engineers and Data Scientists, but then we start to move up the value stack to include Data Analysts and Business Analysts, which is where we start to overlap with Power BI Dataflow (see below).  Either way, when you want to orchestrate these cleaning routines with schedules, triggers, and monitors, you want that to be through ADF.  Keep in mind if you code your transformations in Databricks Notebooks, you will be responsible for maintaining that code, troubleshooting, and scheduling those routines
  • HDInsight (HDI) – Databricks is the preferred product over HDI, unless the customer has a mature Hadoop ecosystem already established.  But more and more I tend to find that the majority of workloads are Spark, so Databricks is a better option.  In terms of pure Spark workloads, Databricks greatly outperforms HDI.  Although Databricks clusters can be up and running indefinitely, they’re intended to be created and terminated as necessary – so if you wish to use other Apache Hadoop data transformation tools and have them available 24/7, then HDI may be a better option than Databricks
  • PolyBase – Azure SQL Data Warehouse (SQL DW) supports PolyBase, so if you are using SQL DW you can pull data from the data lake into SQL DW using T-SQL (a minimal example follows this list).  An option is to use PolyBase to import the raw data from the data lake into SQL DW and then clean it there using stored procedures, especially if you want to stick with T-SQL and don’t want to deal with Spark or Hive or other more-difficult technologies.  But the benefit of cleaning the data as it sits in the data lake is to off-load the cleaning process so you are not affecting user queries and to reduce the storage space in SQL DW.  Cleaning the data in the data lake can also be less expensive and there are more tools available for data manipulation than just T-SQL.  Another reason is that by cleaning the data in the data lake you can make it available to data scientists or power users who don’t have to wait until it’s in SQL DW
  • Power BI Dataflow – In short, Dataflows integrates data lake and data prep technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of Power BI service and not just Power BI Desktop and is called Power Query online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep).  So even though the main use case for Dataflow is for an individual solution or for small workloads, you can think of Power Query as the “compute” part that transforms data that lands in the data lake and can then be used as part of an enterprise solution
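As referenced in the PolyBase bullet above, here is a hedged sketch of that pattern: expose a cleaned file in the data lake as an external table in SQL DW and load it in parallel with CTAS.  The external data source and file format are assumed to already exist, and all names are placeholders; the T-SQL is issued through pyodbc to keep the examples in one language.

```python
# Hedged sketch: PolyBase external table over the data lake plus a CTAS load into SQL DW.
# Assumes the external data source and file format already exist; all names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=mydw.database.windows.net;"
    "DATABASE=mydw;UID=loader;PWD=<password>",
    autocommit=True)

conn.execute("""
    CREATE EXTERNAL TABLE ext.Sales (
        OrderId  bigint,
        Country  varchar(50),
        Amount   decimal(18,2)
    )
    WITH (
        LOCATION = '/cleaned/sales/',
        DATA_SOURCE = MyDataLake,      -- external data source pointing at ADLS Gen2
        FILE_FORMAT = ParquetFormat    -- external file format defined earlier
    );
""")

# CTAS pulls the data in parallel through PolyBase into a distributed internal table
conn.execute("""
    CREATE TABLE dbo.Sales
    WITH (DISTRIBUTION = HASH(OrderId), CLUSTERED COLUMNSTORE INDEX)
    AS SELECT * FROM ext.Sales;
""")
```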

In addition to what products to use to clean the data, another consideration is where to clean the data.  Say I have an on-prem SQL Server database and I’m exporting some tables into CSV files and then copying them to the data lake (ADLS Gen2) and then to Azure SQL DW.  And I not only have to clean the data but also join these CSV files with other files in the data lake.  Should I clean the data as it moves from the data source to the data lake via ADF, or always land the raw data in the data lake and then clean it, join it, and land it back into the data lake (via ADF or Databricks)?  Or clean it as it moves from the data lake to SQL DW via ADF or Databricks?  In most cases you will want to do all the transforming in the data lake.  So copy the raw data into the data lake, transform it and write it back into the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW, all of which can be orchestrated by ADF.  But this is not a “one size fits all” as you may be building a solution that is moving lots of data from many different data sources, and there may be a handful of use cases within that solution where it may be faster and/or easier to clean the data as it moves to or from the data lake.

More info:

Decision Tree for Enterprise Information Management (EIM)

Big Data Decision Tree v4


Azure Data Factory Data Flow

Azure Data Factory v2 (ADF) has a new feature in public preview called Data Flow.  I have usually described ADF as an orchestration tool instead of an Extract-Transform-Load (ETL) tool since it has the “E” and “L” in ETL but not the “T”.  But now it has the data transformation capability, making ADF the equivalent of “SSIS in the cloud” since it has the ability to mimic SSIS Data Flow business logic.

Data Flow works without you having to write any lines of code: you build the solution by using drag-and-drop features of the ADF interface to perform data transformations like data cleaning, aggregation, and preparation.  Behind the scenes, the ADF JSON code is converted to the appropriate code in the Scala programming language and will be prepared, compiled, and executed in Azure Databricks, which will automatically scale out as needed.

There is an excellent chart that shows the SSIS tasks and the equivalent ADF operations.  There are currently 13 different dataset manipulation operations, with more to come:

  • New Branch
  • Join
  • Conditional Split
  • Union
  • Lookup
  • Derived Column
  • Aggregate
  • Surrogate Key
  • Exists
  • Select
  • Filter
  • Sort
  • Extend

The Data Flow feature in ADF is currently in limited public preview.  If you would like to try this out on your Data Factories, please fill out this form to request whitelisting your Azure Subscription for ADF Data Flows: http://aka.ms/dataflowpreview.  Once your subscription has been enabled, you will see “Data Factory V2 (with data flows)” as an option from the Azure Portal when creating Data Factories.

Follow Mark Kromer and the ADF team on Twitter to stay up to date on the rollout of the preview.  Also check out the ADF Data Flow’s documentation and these ADF Data Flow’s videos.

Don’t confuse this with Dataflows in Power BI; they have nothing to do with each other.

More info:

Azure Data Factory v2 and its available components in Data Flows

WHAT IS AZURE DATA FACTORY DATA FLOW?

AZURE DATA FACTORY DATA FLOW PREVIEW: SABERMETRICS EXAMPLE

SSIS-INSPIRED VISUAL DATA TRANSFORMATION COMES TO AZURE DATA FACTORY

Mapping Data Flow in Azure Data Factory (v2)
