Two more of my presentations

I recently made available two more presentations that you might find helpful.  Feel free to download them and present them to others (adding a line that you got them from me is all I ask).  There is a full list of all my presentations with slide decks here.  I continually update all my decks, but unfortunately SlideShare no longer allows you to update decks, so just email me at jamesserra3@gmail.com if you want the latest one.

Azure data platform overview

Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build. (slides)

Machine Learning and AI

The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases. (slides)

Posted in SQLServerPedia Syndication | Leave a comment

Azure Data Explorer

Azure Data Explorer (ADX) was announced as generally available on Feb 7th.  In short, ADX is a fully managed data analytics service for near real-time analysis of large volumes of streaming data (e.g. log and telemetry data) from sources such as applications, websites, or IoT devices.  ADX makes it simple to ingest this data and enables you to perform complex ad-hoc queries on the data in seconds – ADX has ingestion speeds of up to 200MB/sec per node (up to 1000 nodes) and queries across a billion records take less than a second.  A typical use case is when you are generating terabytes of data and need to quickly understand what that data is telling you, as opposed to a traditional database, where it takes longer to get value out of the data because of the effort to collect the data and place it in the database before you can start to explore it.

It’s a tool for speculative analysis of your data, one that can inform the code you build, optimizing what you query for or helping build new models that can become part of your machine learning platform.  It not only works on numbers but also does full-text search on semi-structured or unstructured data.  One of my favorite demos was watching a query over 6 trillion log records count the number of critical errors, via a full-text search for the word ‘alert’ in the event text, in just 2.7 seconds.  Because of this speed, ADX can be a replacement for search and log analytics engines such as Elasticsearch, Splunk, Logstash, or InfluxDB.  One way I heard it described that I liked was to think of it as an optimized cache on top of a data lake.

Azure Data Explorer integrates with other major services to provide an end-to-end solution that includes data collection, ingestion, storage, indexing, querying, and visualization.  It has a pivotal role in the data warehousing flow by executing the EXPLORE step below on terabytes of diverse raw data.

In addition to getting instant feedback from streaming data (via ad-hoc analysis of source data), another way to use ADX is to have it consume large amounts of streaming data, aggregate it, have it land in its internal database, and then have that aggregated data copied to ADLS Gen2 in CSV format via data export or Azure Data Factory.  Once there, it can then be transformed and cleaned using another tool such as Databricks (since ADX can’t transform data) and then copied to a traditional data warehouse such as Azure SQL Data Warehouse.  This can be much faster than trying to use other tools such as Azure Data Factory to land tons of streaming data into ADLS Gen2 and much less expensive than landing streaming data into Azure SQL Data Warehouse.

The following diagram shows the different aspects of working with Azure Data Explorer:

Work in Azure Data Explorer generally follows this pattern:

  1. Create database: Create a dedicated cluster (currently available VMs are D11_v2, D12_v2, D13_v2, D14_v2, L4, L8, L16) and then create one or more databases in that cluster.  See Quickstart: Create an Azure Data Explorer cluster and database.  You can scale up the cluster (increasing the size of the node/instance) or scale out the cluster, also known as autoscale (adding more nodes/instances).  Note that the maximum number of nodes you can scale out to depends on the VM size you have chosen
  2. Ingest data: Load data from a streaming or batch source (Event Hub, IoT Hub, Blob storage or ADLS Gen2 via Event Grid, Kafka, the Logstash plugin, apps via APIs, or Azure Data Factory) into tables in the database created in step 1 so that you can run queries against it.  This is done by creating a target table in ADX, connecting the source such as an event hub to this table, and mapping the incoming data to the column names in the target table (see the Python sketch after this list).  Supported formats are Avro, CSV, JSON, MULTILINE JSON, PSV, SOH, SCSV, TSV, and TXT.  The database is maintenance free, so no DBA is needed for it.  See Quickstart: Ingest data from Event Hub into Azure Data Explorer
  3. Query database: Use the Azure web application or download the Kusto Explorer client tool to run, review, and share queries and results (you can go to the web application directly via dataexplorer.azure.com or via the “Query” link on the Azure Data Explorer cluster blade in the Azure portal).  In addition, you can send queries programmatically (using an SDK to query data from your code) or to a REST API endpoint.  The Azure portal-based query builder displays the results in a table where you can filter your data and even apply basic pivot table options.  It also includes tools for basic visualization, and you can choose from a range of chart types.  ADX uses its own query language called Kusto Query Language (KQL) that is for querying only (you cannot update or delete data using KQL) but is a very robust and powerful language with 400+ operators (filtering, aggregation, grouping, joining, etc., as well as full-text indexing and time series analysis) and built-in machine learning features such as clustering, regression, anomaly detection, and forecasting.  Note that the Kusto service can also interpret and run T-SQL queries with some language limitations.  KQL can also query data from a SQL Database, Cosmos DB, and blob files.  KQL is the same language used in Azure Log Analytics and Application Insights.  You can find more information about the query language here.  You can also take an online PluralSight course.  See Quickstart: Query data in Azure Data Explorer
  4. Visualize results: You can visualize your analysis in ADX either through the ‘render’ operator in KQL or via three options for connecting to data in Power BI: use the built-in connector, import a query from Azure Data Explorer, or use a SQL query.  Finally, there is a Python plugin so you can start using tools like the Anaconda analytics environment and Jupyter Notebooks with saved trained ML models that work with your Azure Data Explorer data sets
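
To make steps 2 and 3 a bit more concrete, here is a minimal sketch using the azure-kusto-data Python package (recent versions).  The cluster, database, table, and mapping names are all made up, and device-code authentication is just one of several auth options.

```python
# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Hypothetical cluster and database names - replace with your own
cluster = "https://mycluster.westus2.kusto.windows.net"
client = KustoClient(KustoConnectionStringBuilder.with_aad_device_authentication(cluster))
database = "TelemetryDb"

# Step 2: create the target table and a JSON ingestion mapping that the
# Event Hub (or other) connection will use (control commands start with a dot)
client.execute_mgmt(database, ".create table Events (Timestamp: datetime, Device: string, Level: string, Message: string)")
client.execute_mgmt(database, ".create table Events ingestion json mapping 'EventsMapping' "
                              "'[{\"column\":\"Timestamp\",\"path\":\"$.timestamp\"},"
                              "{\"column\":\"Device\",\"path\":\"$.device\"},"
                              "{\"column\":\"Level\",\"path\":\"$.level\"},"
                              "{\"column\":\"Message\",\"path\":\"$.message\"}]'")

# Step 3: query with KQL - count critical events per hour
response = client.execute(database, """
Events
| where Level == "Critical"
| summarize Errors = count() by bin(Timestamp, 1h)
| order by Timestamp asc
""")
for row in response.primary_results[0]:
    print(row["Timestamp"], row["Errors"])
```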

All the products and tools that can interact with ADX:

  1. Capability for many data types, formats, and sources
    – Structured (numbers), semi-structured (JSON/XML), and free text
  2. Batch or streaming ingestion
    – Use the managed ingestion pipeline or queue a request for pull ingestion
  3. Compute and storage isolation
    – Independent scale out / scale in
    – Persistent data in Azure Blob Storage
    – Caching for low latency on compute
  4. Multiple options to support data consumption
    – Use out-of-the-box tools and connectors or use APIs/SDKs for custom solutions

Kusto stores its ingested data in reliable storage (Azure Blob Storage), away from its actual processing nodes (e.g. Azure compute).  To speed up queries on that data, Kusto caches this data (or parts of it) on its processing nodes, on SSD or even in RAM.  Kusto includes a sophisticated cache mechanism designed to make intelligent decisions as to which data objects to cache.  The Kusto cache provides a granular cache policy (set when creating a database in step #1 above) that customers can use to differentiate between two data cache policies: hot data cache and cold data cache.  The Kusto cache will attempt to keep all data that falls into the hot data cache in local SSD (or RAM), up to the defined size of the hot data cache.  The remaining local SSD space will be used to hold data that is not categorized as hot.
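
The hot cache window is just a policy on the database (or even on individual tables), so it can be changed after the database is created with a control command.  A minimal sketch, with a made-up cluster and database name and an arbitrary 31-day window:

```python
# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

client = KustoClient(KustoConnectionStringBuilder.with_aad_device_authentication(
    "https://mycluster.westus2.kusto.windows.net"))  # hypothetical cluster

# Keep the last 31 days of data in the hot cache (local SSD/RAM); older data
# stays in blob storage and is still queryable, just with slower range scans.
client.execute_mgmt("TelemetryDb", ".alter database TelemetryDb policy caching hot = 31d")
```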

The main implications of setting the hot cache policy are:

  • Cost: The cost of reliable storage can be dramatically cheaper than local SSD (for example, in Azure it is currently about 45 times cheaper)
  • Performance: Data can be queried faster when it is in local SSD.  This is particularly true for range queries, i.e. queries that scan large quantities of data

For more details on why ADX is so fast, check out Azure Data Explorer Technology 101 and Azure Data Explorer whitepaper.

Note that Azure Time Series Insights (TSI) is built on top of Azure Data Explorer.  TSI is a SaaS offering that is part of the IoT product suite, while ADX is a PaaS offering, part of the Azure data platform, powering TSI and other SaaS apps.  So, using ADX allows us to go a layer deeper than TSI and build a custom solution.  Products like TSI, Azure Monitor Log Analytics, Application Insights, and Azure Security Insights are pre-built solutions for a specific purpose, whereas ADX is used when you are building your own analytics solution or platform and need full control of data sources, data schema and data management, the powerful analytics capabilities of KQL over time series and telemetry data, and a reserved resources cost model.

To sum up in one sentence, Azure Data Explorer is a big data analytics cloud platform optimized for interactive, ad-hoc queries on top of fast flowing data.  To get an idea of the cost, check out the cost estimator and note that your cluster can be stopped (you only pay when it is running).

More info:

Azure Data Explorer Technology 101

Azure Data Explorer whitepaper

What is Azure Data Explorer and Kusto Querying Language (KQL)?

Time series analysis in Azure Data Explorer

How to save $$$ when using Azure Data Explorer

How to use Azure Data Explorer for large-scale data analysis

Video An overview of Azure Data Explorer (ADX)

Video Azure Data Explorer – Query billions of records in seconds!

Posted in SQLServerPedia Syndication | 3 Comments

My latest presentations

I frequently present at user groups, and always try to create a brand new presentation to keep things interesting.  We all know technology changes quickly, so there is no shortage of topics!  There is a list of all my presentations with slide decks.  Here are the new presentations I created over the past year:

Building a modern data warehouse

Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap with others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each. (slides)

AI for an intelligent cloud and intelligent edge: Discover, deploy, and manage with Azure ML services

Discover, manage, deploy, monitor – rinse and repeat.  In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools while relying on Azure ML to optimize them to run in hardware accelerated environments for the cloud and the edge using FPGAs and Neural Network accelerators.  We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you.  Finally, we illustrate how you can leverage the model telemetry to retrain and improve your content. (slides)

Power BI for Big Data and the New Look of Big Data Solutions

New features in Power BI give it enterprise tools, but that does not mean it automatically creates an enterprise solution.  In this talk we will cover these new features (composite models, aggregation tables, dataflows) as well as Azure Data Lake Store Gen2, and describe the use cases and products of an individual, departmental, and enterprise big data solution.  We will also talk about why a data warehouse and cubes still should be part of an enterprise solution, and how a data lake should be organized. (slides)

How to build your career

In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft.  Along the way I saw my yearly income triple.  Is it because I know some secret?  Is it because I am a genius?  No!  It is just about laying out your career path, setting goals, and doing the work.

I’ll cover tips I learned over my career on everything from interviewing to building your personal brand.  I’ll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.

Your first step to enhancing your career will be to attend this session! Let me be your career coach! (slides)

Is the traditional data warehouse dead?

With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that?  No!  In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds.  I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake.  I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution.  And I’ll put it all together by showing common big data architectures. (slides)

Posted in Presentation, SQLServerPedia Syndication | 2 Comments

Azure Data Lake Store Gen2 is GA

Azure Data Lake Store (ADLS) Gen2 was made generally available on February 7th.  In short, ADLS Gen2 is the best of the previous version of ADLS (now called ADLS Gen1) and Azure Blob Storage.  ADLS Gen2 is built on Blob storage, and because of that it is a “ring 0” service and is available in all regions.  To create an ADLS Gen2 account, see Quickstart: Create an Azure Data Lake Storage Gen2 storage account.  Note that creating Blob storage and ADLS Gen2 both follow the same process, which is to create a storage account.  The difference is there is an Advanced tab with an ADLS Gen2 section where you set Hierarchical namespace (HNS) to Enabled to make the account ADLS Gen2 instead of Blob.
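
If you want to script this instead of clicking through the portal, here is a rough sketch using a recent azure-mgmt-storage Python SDK; the subscription, resource group, and account names are placeholders, and the key bit is the is_hns_enabled flag:

```python
# pip install azure-mgmt-storage azure-identity
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(DefaultAzureCredential(), "<your-subscription-id>")

# Creating ADLS Gen2 is just creating a StorageV2 account with
# the hierarchical namespace (HNS) enabled.
poller = client.storage_accounts.begin_create(
    "my-resource-group",          # hypothetical resource group
    "mydatalakegen2",             # hypothetical account name
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_RAGRS"},   # read-access geo-redundant storage
        "is_hns_enabled": True,              # this is what makes it ADLS Gen2
    },
)
account = poller.result()
print(account.name, account.is_hns_enabled)
```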

ADLS Gen2 has most of the features of both ADLS Gen1 and Blob storage (with the features not supported yet listed below).  Features currently supported include limitless storage capacity, Azure Active Directory integration, hierarchical file system, and read-access geo-redundant storage.

When to use Blob vs ADLS Gen2

New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless you have non-analytical use cases that only need object storage rather than hierarchical storage (i.e. video, images, backup files).  In that case you can use Blob storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2, but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).

Upgrade guidance

Soon, existing Blob storage accounts (general purpose v2) will have a seamless mechanism to convert them to ADLS Gen2 to get all the features of ADLS Gen2 (e.g. go to the Configuration screen and set Hierarchical namespace to Enabled).  If your Blob storage account is general purpose v1, you will first need to upgrade it to general purpose v2 before you can set Hierarchical namespace to Enabled.  For now you can’t upgrade to ADLS Gen2 this way, so you will need to copy your data from Blob storage to ADLS Gen2.

For existing customers of ADLS Gen1, no new features will be added to ADLS Gen1.  You can stay on ADLS Gen1 if you don’t need any of the new capabilities, or you can move to ADLS Gen2, where you can leverage all the goodness of the combined capabilities and save money (ADLS Gen2 pricing will be roughly half of ADLS Gen1).  You can upgrade when you choose to do so, which is done by copying the data from ADLS Gen1 to ADLS Gen2 and making any app modifications required to run on ADLS Gen2, with migration tools coming soon.  See Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2 for help on migrating, as well as the list of some Azure products and 3rd-party products that don’t yet support ADLS Gen2 that may make you want to hold off on migrating.

What works and what does not work

You might want to stay with Blob storage for now if there is a feature that is not yet supported in ADLS Gen2, such as soft delete, snapshots, object level storage tiers (Hot, Cool, and Archive), and lifecycle management (see Known issues with Azure Data Lake Storage Gen2).  If you are using a 3rd-party app or an Azure app, make sure that it supports ADLS Gen2 (see Upgrade your big data analytics solutions from Azure Data Lake Storage Gen1 to Azure Data Lake Storage Gen2).  If you are using the WASB or ADLS driver, it will be as simple as switching to the new ADLS Gen2 driver (ABFS) and changing configs, as sketched below.
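
For example, in a Databricks or HDInsight Spark notebook (where spark is already defined), the switch is mostly a matter of changing the URI scheme and account config from the WASB (Blob) driver to the ABFS (ADLS Gen2) driver.  A sketch with made-up account, file system, and path names:

```python
# Old path through the WASB (Blob) driver:
#   wasbs://mycontainer@myaccount.blob.core.windows.net/raw/sales.csv
# New path through the ABFS (ADLS Gen2) driver - note the dfs endpoint:
#   abfss://myfilesystem@myaccount.dfs.core.windows.net/raw/sales.csv

spark.conf.set(
    "fs.azure.account.key.myaccount.dfs.core.windows.net",  # storage account key auth
    "<storage-account-key>")

df = (spark.read
      .option("header", "true")
      .csv("abfss://myfilesystem@myaccount.dfs.core.windows.net/raw/sales.csv"))
df.show(5)
```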

Blob Storage APIs and ADLS Gen2 APIs aren’t interoperable with each other yet in an ADLS Gen2 account (an account with HNS enabled).  If you have tools, applications, services, or scripts that use Blob APIs, and you want to use them to work with all of the content that you upload to your account, then don’t upgrade until Blob APIs become interoperable with ADLS Gen2 APIs.  Using a storage account without a hierarchical namespace means you don’t have access to ADLS Gen2 specific features, such as directory and filesystem access control lists.

Note that the file size limit in ADLS Gen2 is 5TB.

Replication options

Below is an excellent slide that describes the four replication options available with Azure Blob Storage, which are also available for ADLS Gen2.  Notice the SLA, which is a guarantee of the percentage of time Azure will successfully process requests to read data or write data from storage.  Also notice the durability, which specifies the chances of losing data, which are incredibly small.

Check out the Azure Data Lake Storage Gen2 overview video for more info as well as A closer look at Azure Data Lake Storage Gen2 and finally check out the Gen2 documentation.

More info:

Individually great, collectively unmatched: Announcing updates to 3 great Azure Data Services

10 THINGS TO KNOW ABOUT AZURE DATA LAKE STORAGE GEN2

Resources for Learning About Azure Data Lake Storage Gen2

Posted in Azure Data Lake, SQLServerPedia Syndication | 4 Comments

Storage options for SQL Server database files in Azure

If you are using SQL Server in an Azure VM (IaaS) you have a number of options of where to store the database files (.mdf, .ldf, and .ndf).  Most customers use managed disks, available in a number of offerings: Standard HDD, Standard SSD, Premium SSD, and Ultra SSD.  Managed disks are highly recommended over unmanaged disks (see Azure Managed vs Unmanaged disks : The choice).  You create and then attach one or more managed disks to the VM.  Each VM comes with built-in storage, but this is a temporary storage drive, labeled as the D: drive, and is not persisted to Azure blob storage.  You do not want to store your user database files or user transaction log files on the D: drive (see Performance best practices for Azure VM).  The same goes for the C: drive, which holds the operating system and should not be used for database files.

If using managed disks, the recommendation is to attach several premium SSDs to a VM.  From Sizes for Windows virtual machines in Azure you can see that for the DS5_v2 size you can add 64 4TB data disks (P50), yielding a potential total volume size of up to 256TB per VM, and if you use the preview disk sizes (P80) your application can have up to around 2PB of storage per VM (64 32TB disks).  With premium storage disks, your applications can achieve 80,000 I/O operations per second (IOPS) per VM, and a disk throughput of up to 2,000 megabytes per second (MB/s) per VM.  Note that the more disks you add, the more storage and IOPS you get, but only up to a certain point due to VM limits, meaning that at some point more disks get you more storage but not more IOPS.
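
As a rough illustration (not an official recipe), this is roughly what creating a P50-sized premium data disk and attaching it to an existing SQL Server VM looks like with a recent azure-mgmt-compute Python SDK; the subscription, resource group, VM, and disk names are all made up:

```python
# pip install azure-mgmt-compute azure-identity
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

compute = ComputeManagementClient(DefaultAzureCredential(), "<your-subscription-id>")
rg, vm_name = "sql-rg", "sqlvm01"   # hypothetical names

# Create a 4TB premium SSD (P50-sized) managed disk
disk = compute.disks.begin_create_or_update(rg, "sqldata01", {
    "location": "eastus",
    "sku": {"name": "Premium_LRS"},
    "disk_size_gb": 4095,
    "creation_data": {"create_option": "Empty"},
}).result()

# Attach it to the VM as an additional data disk (repeat for more disks,
# then stripe a storage space across them inside the guest OS)
vm = compute.virtual_machines.get(rg, vm_name)
vm.storage_profile.data_disks.append({
    "lun": len(vm.storage_profile.data_disks),
    "name": disk.name,
    "create_option": "Attach",
    "managed_disk": {"id": disk.id},
})
compute.virtual_machines.begin_create_or_update(rg, vm_name, vm).result()
```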

Ultra SSD is in preview and is an option for data-intensive workloads such as SAP HANA, top tier databases, and transaction-heavy workloads.  It offers a capacity of up to 4PB per VM (64 64TB disks) and 160,000 IOPS per disk, with a disk throughput of up to 2,000 megabytes per second (MB/s) per VM.  Ultra SSD has higher performance than Premium SSD, so you can use fewer of them to achieve the performance you need, simplifying things (i.e. possibly not having to create a storage space) and saving costs without sacrificing IOPS (see Mission critical performance with Ultra SSD for SQL Server on Azure VM).  Note that there is no SQL Server VM image that uses Ultra SSD in the Azure Marketplace yet.

In the past, after provisioning a SQL Server VM (a SQL VM created from an Azure Marketplace image), you had to manually attach and configure the right number of data disks to provide the desired number of IOPS or throughput (MB/s).  Then you needed to stripe your SQL files across the disks or create a Storage Pool to divide the IOPS or throughput across them.  Finally, you’d have to configure SQL Server according to the performance best practices for Azure VM.

A couple of years ago, Microsoft made this part of the provisioning experience (but only when choosing a SQL Server VM from the Marketplace, not a regular VM).  When creating the VM you can easily configure the desired IOPS, throughput, and storage size within the limits of the selected VM size, as well as the target workload to optimize for (either online transaction processing or data warehousing).  As you change the IOPS, throughput, and storage size, it will automatically select the right number of disks to attach to the VM.  During the VM provisioning, if more than one disk is required for your specified settings, it will automatically create one Windows storage space (virtual drive) across all disks (see Windows Server 2012 Storage Virtualization Explained). For more details see Storage configuration for SQL Server VMs.

If you are using Ultra SSD, since there is no SQL Server VM image in the Azure Marketplace that uses it, you will need to implement storage spaces manually if you wish to use more than one data disk – see Implementing Storage Spaces Inside Azure Virtual Machines.

Keep in mind that Azure premium managed disks are just premium page blobs with some API make-up so they look like disks (disks and blobs use different APIs).  Azure storage page blobs do not officially announce performance numbers per size while Azure premium managed disks announce this info, so that is why tables like this will show performance numbers for disks and not page blobs.  Page blobs are one of three types of blob storage, the other two being block blobs and append blobs.

If you need extreme performance for SQL Server and are OK with not having high-availability for storage, the Lsv2-series VMs are optimized to use the local disk on the node attached directly to the VM rather than using durable data disks.  This allows for greater IOPS / throughput for your workloads.  It supports up to 19.2TB of storage.  Your applications can achieve up to 3.4M I/O operations per second (IOPS) per VM, and a disk throughput of up to 22,000 megabytes per second (MB/s) per VM.  However, you miss out on the high-availability benefits of Azure blob storage, so instead of using it for SQL Server, it is better used for NoSQL stores such as Apache Cassandra and MongoDB which replicate data across multiple VMs to achieve persistence in the event of the failure of a single VM.

Another option for storing the database files is Azure blob storage.  I see this used for some on-prem SQL Server installations when the latency caused by accessing blob storage from on-prem is not a big deal (see SQL Server data files in Microsoft Azure and SQL Server 2014 and Windows Azure Blob Storage Service: Better Together).  It is also sometimes used when using SQL Server in a VM with a large database (3-30 TB range) and a fast backup/restore is needed, as a file-snapshot backup (BACKUP DATABASE using WITH FILE_SNAPSHOT) can be done very quickly (note that a file-snapshot backup cannot be done when database files are on managed disks, so you must do streaming backups there).  Other benefits that come with storing the database files in Azure blob storage are Instant Log Initialization for SQL Server in Azure, fast consistency checking, and reporting off of a restored copy of the snapshot.  Be aware there are some limitations: you are limited to a database size of 35TB; geo-replication for your storage account is not supported (if a storage account is geo-replicated and a geo-failover happened, database corruption could occur); there is no local SSD cache to potentially improve storage performance; and the same network bandwidth is shared between the VM network traffic and the database storage traffic, which can affect network intensive applications.  For more info see SQL Server VLDB in Azure: DBA Tasks Made Simple.
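
To give a feel for the file-snapshot backup mentioned above, here is a hedged sketch (made-up server, database, storage account, container, and SAS token) that creates the required credential and takes the nearly instantaneous backup; it only applies when the database files themselves live in blob storage:

```python
# pip install pyodbc
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sqlvm01;DATABASE=master;"
    "UID=sa;PWD=<password>", autocommit=True)
cur = conn.cursor()

# Credential is named after the container URL; the secret is a SAS token (not the account key)
cur.execute("""
CREATE CREDENTIAL [https://myaccount.blob.core.windows.net/sqlbackups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token-without-leading-question-mark>';
""")

# File-snapshot backup: snapshots the data/log blobs instead of streaming all the data
cur.execute("""
BACKUP DATABASE MyBigDb
TO URL = 'https://myaccount.blob.core.windows.net/sqlbackups/MyBigDb.bak'
WITH FILE_SNAPSHOT;
""")
```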

In looking at the PaaS version of SQL Server, Azure SQL Database, here is the underlying persistent storage that is used for its databases for the various options:

  • Singleton: DTU Basic tiers (max 2GB database size) and Standard tiers (max 1TB database size) use blob storage, Premium tier (max 4TB database size) uses a locally attached SSD; vCore General Purpose tier (max 4TB database size) uses premium blob storage, Business Critical tier (max 4TB database size) uses a locally attached SSD
  • Managed Instance: General Purpose tier (max 8TB database size) uses premium blob storage (see Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose)), Business Critical tier (max 4TB database size) uses a locally attached SSD
  • Hyperscale: Uses standard blob storage (max 100TB database size) but achieves very high performance due to multiple SSD-based caching tiers

More info:

Storage performance best practices and considerations for Azure SQL DB Managed Instance (General Purpose)

Azure SQL Database high availability

Premium blob storage

High-performance Premium Storage and managed disks for VMs

About disks storage for Azure Windows VMs

Tutorial: Use Azure Blob storage service with SQL Server 2016

Storage Configuration Guidelines for SQL Server on Azure VM

How to Choose the Correct Azure VM Disk Tier

Larger, more powerful Managed Disks for Azure Virtual Machines

Posted in Azure, SQL Server, SQLServerPedia Syndication | 6 Comments

What product to use to transform my data?

If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake.  And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL).  Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data.  There is good news and bad news when it comes to which product to use.  The good news is there are a lot of products to choose from.  The bad news is there are a lot of products to choose from :-).  I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part):

  • SSIS – The product used most often for on-prem ETL still has use cases when moving to the cloud, especially when migrating a solution from on-prem to the cloud.  If you have hundreds of SSIS packages, it may make the most sense to continue to use those packages and just change the destination adapter (i.e. from your on-prem SQL Server to Azure SQL Database) and have them execute under the Azure-SSIS integration runtime (IR) in Azure Data Factory (ADF).  Keep in mind that if a package is accessing local resources (like storing temp files to a local path or using custom controls), some re-work will be required.  A concern will be if a package is moving a ton of data, as executing transformations row-by-row (ETL) can be much slower than batch transformations (ELT) – see Difference between ETL and ELT.  Also, the data has to be moved to the location of the IR, and the compute power is limited by the size of the IR cluster.  For those SSIS packages that are too slow, you can replace them with one of the products below
  • Azure Data Factory (ADF) – Now that ADF has a new feature called Data Flow, it can transform data, so it is more than just an orchestration tool.  Behind the scenes, the ADF JSON code that is created when you build a solution is converted to the appropriate code in the Scala programming language and is prepared, compiled, and executed in Azure Databricks.  This means Data Flow operates in an ELT manner: it loads the data into a place where Databricks can access it, performs the transformations, and then moves it to the destination.  ADF provides a native ETL scheduler so that you can automate data transformation and movement processes either through visual data flows or via script activities that execute in Databricks or other execution engines (so, like with SSIS, data flows are row-by-row transformations, and for large amounts of data it may be faster to execute a batch transformation via a script in Databricks).  My thoughts on when to use ADF: obviously if you are already using it, or if your skillset lies in SSIS, as it’s pretty easy to learn ADF with an SSIS background
  • Databricks – Databricks is a Spark-based analytics platform, which makes it great to use if you like to work with Spark, Python, Scala, and notebooks.  When choosing between Databricks and ADF, what I’ve noticed is that it depends highly on the customer personas and their capabilities.  There are plenty of Data Engineers and Data Scientists who want to get deep into Python or Scala and sling some code in Databricks Notebooks.  But the larger audience who wants to focus on building business logic to clean customer/address data, for example, doesn’t want to learn Python libraries, and will use the ADF visual data flow designer.  Many of those are also Data Engineers and Data Scientists, but then we start to move up the value stack to include Data Analysts and Business Analysts, which is where we start to overlap with Power BI Dataflow (see below).  Either way, when you want to orchestrate these cleaning routines with schedules, triggers, and monitors, you want that to be through ADF.  Keep in mind that if you code your transformations in Databricks Notebooks, you will be responsible for maintaining that code, troubleshooting it, and scheduling those routines
  • HDInsight (HDI) – Databricks is the preferred product over HDI, unless the customer has a mature Hadoop ecosystem already established.  But more and more I tend to find that the majority of workloads are Spark, so Databricks is a better option.  In terms of pure Spark workloads Databricks greatly outperforms HDI.  Although Databricks clusters can be up and running indefinitely, they’re intended to be created and terminated as necessary – so if you wish to use other Apache Hadoop data transformation tools and have them available 24/7, then HDI may be a better option than Databricks
  • PolyBase – Azure SQL Data Warehouse (SQL DW) supports PolyBase, so if you are using SQL DW you can pull data from the data lake into SQL DW using T-SQL.  An option is to use PolyBase to import the raw data from the data lake into SQL DW and then clean it there using stored procedures, especially if you want to stick with T-SQL and don’t want to deal with Spark or Hive or other more-difficult technologies (see the sketch after this list).  But the benefit of cleaning the data as it sits in the data lake is to off-load the cleaning process so you are not affecting user queries and to reduce the storage space in SQL DW.  Cleaning the data in the data lake can also be less expensive, and there are more tools available for data manipulation than just T-SQL.  Another reason is that by cleaning the data in the data lake you make it available to data scientists or power users who don’t have to wait until it’s in SQL DW
  • Power BI Dataflow – In short, Dataflow integrates data lake and data prep technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of the Power BI service, not just Power BI Desktop, and is called Power Query Online) can create, customize, and manage data within their Power BI experience (think of it as self-service data prep).  So even though the main use case for Dataflow is an individual solution or small workloads, you can think of Power Query as the “compute” part that transforms data that lands in the data lake and can then be used as part of an enterprise solution
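
To make the PolyBase option concrete, here is a hedged sketch of exposing CSV files in the data lake as an external table in SQL DW and then loading them with CTAS, driven from Python via pyodbc.  Every name (server, credential, storage account, container, table) is a placeholder, and it assumes a database master key already exists:

```python
# pip install pyodbc
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=mydw.database.windows.net;"
    "DATABASE=MyDW;UID=loader;PWD=<password>", autocommit=True)
cur = conn.cursor()

for stmt in [
    # Point PolyBase at the data lake (storage key auth shown; names are placeholders)
    """CREATE DATABASE SCOPED CREDENTIAL LakeCred
       WITH IDENTITY = 'user', SECRET = '<storage-account-key>'""",
    """CREATE EXTERNAL DATA SOURCE LakeSource
       WITH (TYPE = HADOOP,
             LOCATION = 'abfss://raw@mydatalake.dfs.core.windows.net',
             CREDENTIAL = LakeCred)""",
    """CREATE EXTERNAL FILE FORMAT CsvFormat
       WITH (FORMAT_TYPE = DELIMITEDTEXT,
             FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2))""",
    # The external table is just metadata over the files under /sales/
    """CREATE EXTERNAL TABLE dbo.SalesExternal (SaleId INT, Amount DECIMAL(10,2), SaleDate DATE)
       WITH (LOCATION = '/sales/', DATA_SOURCE = LakeSource, FILE_FORMAT = CsvFormat)""",
    # CTAS pulls the data into SQL DW in parallel, where it can be cleaned with T-SQL
    """CREATE TABLE dbo.Sales
       WITH (DISTRIBUTION = HASH(SaleId))
       AS SELECT * FROM dbo.SalesExternal""",
]:
    cur.execute(stmt)
```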

In addition to which products to use to clean the data, another consideration is where to clean the data.  Say I have an on-prem SQL Server database and I’m exporting some tables into CSV files, then copying them to the data lake (ADLS Gen2), and then to Azure SQL DW.  And I not only have to clean the data but also join these CSV files with other files already in the data lake.  Should I clean the data as it moves from the data source to the data lake via ADF, or always land the raw data in the data lake and then clean it, join it, and land it back into the data lake (via ADF or Databricks)?  Or clean it as it moves from the data lake to SQL DW via ADF or Databricks?  In most cases you will want to do all the transforming in the data lake.  So copy the raw data into the data lake, transform it and write it back into the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW, all of which can be orchestrated by ADF.  But this is not “one size fits all”: if you are building a solution that moves lots of data from many different data sources, there may be a handful of use cases within that solution where it is faster and/or easier to clean the data as it moves to or from the data lake.
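
As a rough Databricks-flavored sketch of that raw-to-cleaned pattern (all paths, file names, and column names are invented): read the raw CSV exports, clean and join them, and write the result back to a cleaned layer of the lake that ADF (or PolyBase) can then copy into SQL DW.

```python
# Runs in a Databricks notebook, where `spark` is already defined
from pyspark.sql import functions as F

raw = "abfss://datalake@mydatalake.dfs.core.windows.net/raw"
cleaned = "abfss://datalake@mydatalake.dfs.core.windows.net/cleaned"

orders = spark.read.option("header", "true").csv(f"{raw}/orders/*.csv")
customers = spark.read.option("header", "true").csv(f"{raw}/customers/*.csv")

cleaned_orders = (
    orders
    .dropDuplicates(["OrderId"])                       # remove reloaded rows
    .filter(F.col("Amount").cast("decimal(10,2)").isNotNull())
    .withColumn("OrderDate", F.to_date("OrderDate"))   # restore the type lost in CSV
    .join(customers.select("CustomerId", "Region"), "CustomerId", "left")
)

# Land the cleaned layer back in the lake; ADF copies it on to SQL DW
cleaned_orders.write.mode("overwrite").parquet(f"{cleaned}/orders")
```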

More info:

Decision Tree for Enterprise Information Management (EIM)

Big Data Decision Tree v4

Posted in Azure Data Factory, Azure Databricks, Big Data, Data Lake, ETL, HDInsight, PolyBase, Power BI, SQLServerPedia Syndication, SSIS | 9 Comments

Azure Data Factory Data Flow

Azure Data Factory v2 (ADF) has a new feature in public preview called Data Flow.  I have usually described ADF as an orchestration tool instead of an Extract-Transform-Load (ETL) tool since it has the “E” and “L” in ETL but not the “T”.  But now it has the data transformation capability, making ADF the equivalent of “SSIS in the cloud” since it has the ability to mimic SSIS Data Flow business logic.

Data Flow works without you having to write any lines of code, as you build the solution by using drag-and-drop features of the ADF interface to perform data transformations like data cleaning, aggregation, and preparation.  Behind the scenes, the ADF JSON code is converted to the appropriate code in the Scala programming language and is prepared, compiled, and executed in Azure Databricks, which will automatically scale out as needed.

There is an excellent chart that shows the SSIS tasks and their equivalent ADF operations.  There are currently 13 different dataset manipulation operations, with more to come:

  • New Branch
  • Join
  • Conditional Split
  • Union
  • Lookup
  • Derived Column
  • Aggregate
  • Surrogate Key
  • Exists
  • Select
  • Filter
  • Sort
  • Extend

The Data Flow feature in ADF is currently in limited public preview.  If you would like to try this out on your Data Factories, please fill out this form to request whitelisting your Azure Subscription for ADF Data Flows: http://aka.ms/dataflowpreview.  Once your subscription has been enabled, you will see “Data Factory V2 (with data flows)” as an option from the Azure Portal when creating Data Factories.

Follow Mark Kromer and the ADF team on Twitter to stay up to date on the rollout of the preview.  Also check out the ADF Data Flow documentation and these ADF Data Flow videos.

Don’t confuse this with Dataflows in Power BI; the two have nothing to do with each other.

More info:

Azure Data Factory v2 and its available components in Data Flows

WHAT IS AZURE DATA FACTORY DATA FLOW?

AZURE DATA FACTORY DATA FLOW PREVIEW: SABERMETRICS EXAMPLE

SSIS-INSPIRED VISUAL DATA TRANSFORMATION COMES TO AZURE DATA FACTORY

Posted in Azure Data Factory, SQLServerPedia Syndication | 2 Comments

Should I load structured data into my data lake?

With data lakes becoming very popular, a common question I have been hearing from customers is, “Should I load structured/relational data into my data lake?”.  I talked about this a while back in my blog post What is a data lake? and will expand on it in this post.  Melissa Coates also talked about this recently, and I used her graphic below to illustrate:

I would not say it’s commonplace to load structured data into the data lake, but I do see it frequently.

In most cases it is not necessary to first copy relational source data into the data lake and then into the data warehouse, especially when keeping in mind the effort to migrate existing ETL jobs that are already copying source data into the data warehouse, but there are some good use cases to do just that:

  1. Wanting to offload the data refinement to the data lake (usually Hadoop), so the processing and space on the enterprise data warehouse (EDW) is reduced, and you avoid clashing with people running queries/reports on the EDW
  2. Wanting to use some Hadoop technologies/tools to refine/filter data that do the refinement quicker/better than your EDW
  3. It can ingest large files quickly and provide data redundancy
  4. ELT jobs on the EDW are taking too long because of increasing data volumes and an increasing rate of ingestion (velocity), so offload some of them to the Hadoop data lake
  5. The data lake is a good place for data that you “might” use down the road.  You can land it in the data lake and have users use SQL via PolyBase to look at the data and determine if it has value.  Note that PolyBase allows end-users to query data in a data lake using regular SQL, so they are not required to learn any Hadoop-related technologies.  PolyBase even allows the end-user to use SQL, or any reporting tool that uses SQL, to join data in a relational database with data in a Hadoop cluster
  6. Have a backup of the raw data in case you need to load it again due to an ETL error (and not have to go back to the source).  You can keep a long history of raw data
  7. A power user/data scientist wants to make use of the structured data in the data lake (usually combining it with non-relational data)
  8. As a quicker way to load data into Azure SQL Data Warehouse via PolyBase from the data lake (which is usually much faster than using SSIS to copy from the source to Azure SQL Data Warehouse)

You will have to balance these benefits with the extra work required to export data from a relational source into a format such as CSV and copy it to the data lake (where it can be cleaned with a tool such as Databricks and then copied into a relational EDW such as Azure SQL Data Warehouse).  Keep in mind that for the relational data that is moved to the data lake, you will lose valuable metadata such as data types, constraints, and foreign keys.
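
A bare-bones sketch of that export step (server, database, table, storage account, and file system names are all made up): pull a table out of the relational source with pandas and land it as a CSV file in the raw zone of ADLS Gen2.

```python
# pip install pyodbc pandas azure-storage-file-datalake azure-identity
import pandas as pd
import pyodbc
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Extract: read the source table from the on-prem SQL Server
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=onpremsql;DATABASE=ERP;"
    "Trusted_Connection=yes")
df = pd.read_sql("SELECT * FROM dbo.Customers", conn)

# Load: write it as CSV into the raw zone of the data lake
# (your identity needs a data-plane role such as Storage Blob Data Contributor)
service = DataLakeServiceClient(
    "https://mydatalake.dfs.core.windows.net", credential=DefaultAzureCredential())
file_client = service.get_file_system_client("raw").get_file_client("erp/customers.csv")
file_client.upload_data(df.to_csv(index=False), overwrite=True)
```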

I have seen some customers with relational data bypass the data lake and have the source systems copy data directly to the EDW (i.e. Azure SQL Data Warehouse) using SSIS/ADF, but they still wanted all data in the data lake, so they used Databricks to copy the data warehouse structures into the data lake using the SQL Data Warehouse Spark Connector.  In these cases, the customers used the data lake for static data (i.e. web logs, IoT data streams, and older ERP data that will not change) and managed non-static data directly in the EDW, as moving data updates through the data lake to the EDW adds a layer of complexity they wanted to avoid.

I have also seen cases where the customer moved EDW data to Hadoop, refined it, and moved it back to the EDW, which gave them the benefit of offloading the processing and/or the ability to use Hadoop tools.

A common practice when I see customers migrate an on-prem EDW solution to the cloud is they don’t use the data lake for their current sources (which are usually all structured data), but will use it for all new sources (which are usually mostly unstructured data).  However, they like the benefits listed above and will slowly modify most of their ETL packages (usually SSIS) to use the data lake, in some cases replacing SSIS with Azure Data Factory.

More info:

When Should We Load Relational Data to a Data Lake?

Posted in Data Lake, SQLServerPedia Syndication | 4 Comments

Premium blob storage

As a follow-up to my blog Azure Archive Blob Storage, Microsoft has released another storage tier called Azure Premium Blob Storage (announcement).  It is in private preview in the US East 2, US Central, and US West regions.

This is a performance tier in Azure Blob Storage, complementing the existing Hot, Cool, and Archive tiers.  Data in Premium Blob Storage is stored on solid-state drives, which are known for lower latency and higher transactional rates compared to traditional hard drives.

It is ideal for workloads that require very fast access time such as interactive video editing, static web content, and online transactions.  It also works well for workloads that perform many relatively small transactions, such as capturing telemetry data, message passing, and data transformation.

Microsoft internal testing shows that both average and 99th percentile server latency is significantly better than the Hot access tier, providing faster and more consistent response times for both read and write across a range of object sizes.

Premium Blob Storage is available with Locally-Redundant Storage and comes with High-Throughput Block Blobs (HTBB), which provides a) improved write throughput when ingesting larger block blobs, b) instant write throughput, and c) throughput that is not affected by container and blob names.

You can store block blobs and append blobs in Premium Blob Storage (page blobs are not yet available).  To use Premium Blob Storage you provision a new ‘Block Blob’ storage account in your subscription and start creating containers and blobs using the existing Blob Service REST API and/or any existing tools such as AzCopy or Azure Storage Explorer.
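
To illustrate, here is a hedged sketch of provisioning a premium ‘Block Blob’ account and writing a blob to it with the Python management and storage SDKs; the subscription, resource group, account, and container names are invented, and your identity needs the appropriate data-plane role for the upload to work:

```python
# pip install azure-mgmt-storage azure-storage-blob azure-identity
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.storage.blob import BlobServiceClient

cred = DefaultAzureCredential()
mgmt = StorageManagementClient(cred, "<your-subscription-id>")

# A premium block blob account: kind BlockBlobStorage with a Premium_LRS sku
mgmt.storage_accounts.begin_create("my-rg", "mypremiumblobs", {
    "location": "eastus2",
    "kind": "BlockBlobStorage",
    "sku": {"name": "Premium_LRS"},
}).result()

# Containers and blobs then work exactly like any other Blob storage account
blobs = BlobServiceClient("https://mypremiumblobs.blob.core.windows.net", credential=cred)
container = blobs.create_container("telemetry")
container.upload_blob("device01/2019-02-07.json", b'{"temp": 21.4}')
```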

Premium Blob Storage has a higher data storage cost but a lower transaction cost compared to data stored in the regular Hot tier.  This makes it cost effective, and it can be less expensive overall, for workloads with very high transaction rates.  Check out the pricing page for more details.

At present, data stored in Premium cannot be tiered to the Hot, Cool, or Archive access tiers.  Microsoft is working on supporting object tiering in the future.  To move data, you can synchronously copy blobs using the new PutBlockFromURL API (sample code) or a version of AzCopy that supports this API.  PutBlockFromURL synchronously copies data server side, which means that the data has finished copying when the call completes and all data movement happens inside Azure Storage.

Check out the announcement on how to sign up for the private preview.

To summarize the four storage tiers:

  • Premium storage (preview) provides high performance hardware for data that is accessed frequently
  • Hot storage is optimized for storing data that is accessed frequently
  • Cool storage is optimized for storing data that is infrequently accessed and stored for at least 30 days
  • Archive storage is optimized for storing data that is rarely accessed and stored for at least 180 days with flexible latency requirements (on the order of hours)

More info:

Premium Block Blob Storage – a new level of performance

Azure Premium Blob Storage public preview

Posted in Azure, SQLServerPedia Syndication | 1 Comment

SQL Server 2019 Big Data Clusters

At the Microsoft Ignite conference, Microsoft announced that SQL Server 2019 is now in preview and that SQL Server 2019 will include Apache Spark and Hadoop Distributed File System (HDFS) for scalable compute and storage.  This new architecture, which combines the SQL Server database engine, Spark, and HDFS into a unified data platform, is called a “big data cluster” and is deployed as containers on Kubernetes.  Big data clusters can be deployed in any cloud where there is a managed Kubernetes service, such as Azure Kubernetes Service (AKS), or in on-premises Kubernetes clusters, such as AKS on Azure Stack.  The SQL Server 2019 relational database engine in a big data cluster leverages an elastically scalable storage layer that integrates SQL Server and HDFS to scale to petabytes of data storage. The Spark engine is now part of SQL Server:

While extract, transform, load (ETL) has its use cases, an alternative to ETL is data virtualization, which integrates data from disparate sources, locations, and formats, without replicating or moving the data, to create a single “virtual” data layer.  The virtual data layer allows users to query data from many sources through a single, unified interface.  Access to sensitive data sets can be controlled from a single location. The delays inherent to ETL need not apply; data can always be up to date.  Storage costs and data governance complexity are minimized.  See the pros and cons of data virtualization via Data Virtualization vs Data Warehouse and Data Virtualization vs. Data Movement.

SQL Server 2019 big data clusters with enhancements to PolyBase act as a virtual data layer to integrate structured and unstructured data from across the entire data estate (SQL Server, Azure SQL Database, Azure SQL Data Warehouse, Azure Cosmos DB, MySQL, PostgreSQL, MongoDB, Oracle, Teradata, HDFS, Blob Storage, Azure Data Lake Store) using familiar programming frameworks and data analysis tools:

In SQL Server 2019 big data clusters, the SQL Server engine has gained the ability to natively read HDFS files, such as CSV and Parquet files, by using SQL Server instances collocated on each of the HDFS data nodes to filter and aggregate data locally in parallel across all of the HDFS data nodes.
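
As a hedged sketch of what that looks like (the sqlhdfs location, port, and file layout below follow the preview documentation and may well change between CTPs; all names are placeholders): expose a CSV directory in the cluster’s HDFS as an external table and then query and join it with plain T-SQL.

```python
# pip install pyodbc
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=<bdc-sql-master-endpoint>,31433;"
    "DATABASE=Sales;UID=admin;PWD=<password>", autocommit=True)  # port from your deployment
cur = conn.cursor()

# The storage pool exposes the cluster's HDFS through a built-in PolyBase data source
cur.execute("""
IF NOT EXISTS (SELECT * FROM sys.external_data_sources WHERE name = 'SqlStoragePool')
    CREATE EXTERNAL DATA SOURCE SqlStoragePool
    WITH (LOCATION = 'sqlhdfs://controller-svc/default');
""")
cur.execute("""
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));
""")
cur.execute("""
CREATE EXTERNAL TABLE dbo.WebClickstream (ClickTime DATETIME2, Url NVARCHAR(400), UserId INT)
WITH (DATA_SOURCE = SqlStoragePool, LOCATION = '/clickstream_data', FILE_FORMAT = CsvFormat);
""")

# Join big data in HDFS with a regular relational table, all in T-SQL
cur.execute("""
SELECT TOP 10 c.Region, COUNT(*) AS Clicks
FROM dbo.WebClickstream w JOIN dbo.Customers c ON w.UserId = c.CustomerId
GROUP BY c.Region ORDER BY Clicks DESC;
""")
print(cur.fetchall())
```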

Performance of PolyBase queries in SQL Server 2019 big data clusters can be boosted further by distributing the cross-partition aggregation and shuffling of the filtered query results to “compute pools” comprised of multiple SQL Server instances that work together (this is similar to a PolyBase scale-out group).

When you combine the enhanced PolyBase connectors with SQL Server 2019 big data clusters data pools, data from external data sources can be partitioned and cached across all the SQL Server instances in a data pool, creating a “scale-out data mart”.  There can be more than one scale-out data mart in a given data pool, and a data mart can combine data from multiple external data sources and tables, making it easy to integrate and cache combined data sets from multiple external sources.  This will also be a great solution for importing IoT data.

SQL Server 2019 big data clusters make it easier for big data sets to be joined to the data stored in the enterprise relational database, enabling people and apps that use SQL Server to query big data more easily.  The value of the big data greatly increases when it is not just in the hands of the data scientists and big data engineers but is also included in reports, dashboards, and applications used by regular end users.  At the same time, the data scientists can continue to use big data ecosystem tools against HDFS while also utilizing easy, real-time access to the high-value data in SQL Server because it is all part of one integrated, complete system.

Azure Data Studio (previously released under the name of SQL Operations Studio) is an open-source, multi-purpose data management and analytics tool for DBAs, data scientists, and data engineers.  New extensions for Azure Data Studio integrate the user experience for working with relational data in SQL Server with big data.  The new HDFS browser lets analysts, data scientists, and data engineers easily view the HDFS files and directories in the big data cluster, upload/download files, open them, and delete them if needed.  The new built-in notebooks in Azure Data Studio are built on Jupyter, enabling data scientists and engineers to write Python, R, or Scala code with Intellisense and syntax highlighting before submitting the code as Spark jobs and viewing the results inline.  Notebooks facilitate collaboration between teammates working on a data analysis project together.  Lastly, the External Table Wizard, which uses PolyBase connectors, simplifies the process of creating external data sources and tables, including column mappings (it’s much easier than the current way of creating external tables).

There will also be a management service that provisions agents on each pod to collect monitoring data and logs, which can be viewed via a browser-based cluster admin portal; the portal will also provide managed services for HA, backup/recovery, security, and provisioning.

In summary, SQL Server 2019 Big Data Clusters improves the 4 V’s of Big Data with these features:

More info:

Introducing Microsoft SQL Server 2019 Big Data Clusters

What are SQL Server 2019 big data clusters?

SQL Server 2019 Big Data Clusters white paper

Video New Feature Offerings in SQL Server 2019

Video SQL Server vNext meets AI and Big Data

Video The future of SQL Server and big data

Video Deep dive on SQL Server and big data

Posted in Big Data, PolyBase, SQL Server 2019, SQLServerPedia Syndication | 1 Comment