Azure Data Factory and SSIS compared

I see a lot of confusion when it comes to Azure Data Factory (ADF) and how it compares to SSIS.  It is not simply “SSIS in the cloud”.  See What is Azure Data Factory? for an overview of ADF, and I’ll assume you know SSIS.  So how are they different?

SSIS is an Extract-Transform-Load (ETL) tool, but ADF is an Extract-and-Load tool, as it does not do any transformations within the tool itself.  Instead, those are done by ADF calling out to other services, for example a stored procedure on a SQL Server that does the transformation, a Hive job, or a U-SQL job in Azure Data Lake Analytics.  Think of it more as an orchestration tool (a minimal example of such a stored procedure follows the list below).  SSIS has the added benefit of doing transformations, but keep in mind that the performance of any transformation depends on the power of the server SSIS is installed on, as the data to be transformed will be pushed to that SSIS server.  Other major differences:

  • ADF is a cloud-based service (via the ADF editor in the Azure portal) and, since it is a PaaS tool, requires no hardware or installation.  SSIS is authored in a desktop tool (SSDT) and requires a good-sized server that you have to manage, on which you have to install SQL Server with SSIS
  • ADF uses JSON scripts for its orchestration (coding), while SSIS uses drag-and-drop tasks (no coding)
  • ADF is pay-as-you-go via an Azure subscription, SSIS is a license cost as part of SQL Server
  • ADF can fire up HDInsight clusters and run Pig and Hive scripts.  SSIS cannot
  • SSIS has a powerful GUI, intellisense, and debugging.  ADF has a basic editor and no intellisense or debugging
  • SSIS is administered via SSMS, while ADF is administered via the Azure portal
  • SSIS has a wider range of supported data sources and destinations
  • SSIS has a programming SDK, automation via BIML, and third-party components.  ADF does not have a programming SDK, has automation via PowerShell, and no third-party components
  • SSIS has error handling.  ADF does not
  • ADF has “data lineage”, tagging and tracking the data from different sources.  SSIS does not have this
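
Since ADF orchestrates rather than transforms, a common pattern is an ADF pipeline activity that calls a T-SQL stored procedure once the data has been copied into a staging table.  A minimal sketch of such a procedure, assuming hypothetical stage.Sales and dbo.FactDailySales tables (not from any real pipeline):

    -- Hypothetical transformation procedure that an ADF pipeline activity
    -- (e.g. a stored procedure activity) could call after landing raw data.
    CREATE PROCEDURE dbo.TransformDailySales
        @LoadDate DATE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Cleanse and aggregate staged rows into the reporting table,
        -- pushing the transformation work to SQL Server rather than ADF.
        INSERT INTO dbo.FactDailySales (SaleDate, ProductID, TotalAmount)
        SELECT s.SaleDate, s.ProductID, SUM(s.Amount)
        FROM stage.Sales AS s
        WHERE s.SaleDate = @LoadDate
          AND s.Amount IS NOT NULL      -- drop obviously bad rows
        GROUP BY s.SaleDate, s.ProductID;
    END;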

Think of ADF as a complementary service to SSIS, with its main use case confined to inexpensively dealing with big data in the cloud.

Note that moving to the cloud requires you to think differently when it comes to loading a large amount of data, especially when using a product like SQL Data Warehouse (see Azure SQL Data Warehouse loading patterns and strategies).

More info:

Azure Data Factory vs SSIS

IoT Hub vs. Event Hub

There can sometimes be confusion, in IoT scenarios, between IoT Hub and Event Hub, as they can both be valid choices for streaming data ingestion.  My Microsoft colleague, Steve Busby, an IoT Technology Specialist, is here to clear up the confusion:

The majority of the time, if the data is coming directly from the devices, either directly or via a field-based gateway, IoT Hub will be the more appropriate choice.  Event Hub will generally be the more appropriate choice if the data will not be coming to Azure directly from the devices, but rather cloud-to-cloud through another provider or intra-cloud, or if the data is already landing on-premises and needs to be streamed to the cloud from a small number of internal endpoints.  There are exceptions to both conditions, of course.

Both solutions offer very high throughput data ingestion and can handle tremendous streaming data volumes.  In fact, today, IoT Hub is primarily a set of additional services that wrap an underlying Event Hub.

Event Hub, however, has some limitations that may make it inappropriate for an IoT scenario when data is being pulled from devices in the field, specifically:

  • Event Hub is limited to 5000 concurrent AMQP connections.  At “IoT scale”, there would likely be a lot more devices than that
  • Event Hub does not authenticate individual device connections, but rather all the devices essentially share the same (or few) access keys.  You also cannot enable/disable communication from a single device.
  • Event Hub is ingestion only and has no facility for sending traffic back to the devices (command and control or device management scenarios)
  • Event Hub does not support MQTT.  For better or worse, MQTT is a very popular IoT protocol and is supported by competitors, particularly AWS

In addition to the features of Event Hub, IoT Hub also adds:

  • Scale-out connections.  A single IoT Hub has been tested to more than a million concurrent connections
  • IoT Hub does individual device authentication (via either a device-specific key, or x.509 certificate) and you can enable/disable individual devices
  • Bi-directional communication, allowing you to send either asynchronous or synchronous commands to the device
  • Supports AMQP and HTTP, like Event Hub, but also adds support for MQTT as a transport protocol
  • Device Management – the ability to orchestrate firmware updates and desired and reported configuration management with devices
  • Device Twins – A queryable cloud-side representation of the state of a device, allowing customers to group and manage their devices “at IoT scale”
  • File upload – orchestrate file uploads from the device through the hub

As you can see, both Event Hub and IoT Hub are valuable Azure services with distinct use cases.  Event Hub is great for high-throughput ingestion (only) of data from a limited number of sources (like cloud to cloud, on-prem to cloud, or intra-cloud).  In those scenarios, Event Hub will generally be appropriate and significantly cheaper than IoT Hub.  However, for IoT scenarios where interacting directly with the individual devices is involved, IoT Hub will generally be the better technology, and it adds a lot of features we need to be competitive.

More info

Comparison of Azure IoT Hub and Azure Event Hubs

Microsoft Products vs Hadoop/OSS Products

Microsoft’s end goal is for Azure to become the best cloud platform for customers to run their data workloads.  This means Microsoft will provide customers the best environment to run their big data/Hadoop workloads as well as a place where Microsoft can offer services with our unique point-of-view.  A specific decision point on using Hadoop is whether the customer wants to use open source technologies or not.  Some of the benefits of running open source software (OSS) on Azure include:

  • Quick installs
  • Support
  • Easy scale
  • Products work together
  • Don’t need to get your own hardware

To determine the cost savings by moving your OSS to Azure, see the Total Cost of Ownership (TCO) Calculator.

Of course there are many benefits of using Microsoft products over OSS, such as ease of use, support, ease of finding people with the right skills, less frequent version updates, more compatibility between products, etc.  But there are still reasons to use OSS, so I created a list that shows many of the Microsoft products and their equivalent, or close equivalent, Hadoop/OSS product.

I tried to list only Apache products unless there was no equivalent Apache product or there is a really popular non-Apache OSS product.

Microsoft Product | Hadoop/Open Source Software Product
Office365/Excel | OpenOffice/Calc
DocumentDB | MongoDB, HBase, Cassandra
SQL Database | SQLite, MySQL, PostgreSQL, MariaDB
Azure Data Lake Analytics/YARN | None
Azure VM/IaaS | OpenStack
Blob Storage | HDFS, Ceph (Note: These are distributed file systems and Blob storage is not distributed)
Azure HBase | Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion
Event Hub | Apache Kafka
Azure Stream Analytics | Apache Storm, Apache Spark, Twitter Heron
Power BI | Apache Zeppelin, Jupyter, Airbnb Caravel, Kibana
HDInsight | Hortonworks (pay), Cloudera (pay), MapR (pay)
Azure ML | Apache Mahout, Apache Spark MLlib
Microsoft R Open | R
SQL Data Warehouse | Apache Hive, Apache Drill, Presto
IoT Hub | Apache NiFi
Azure Data Factory | Apache Falcon, Airbnb Airflow
Azure Data Lake Storage/WebHDFS | HDFS, Ozone
Azure Analysis Services/SSAS | Apache Kylin, AtScale (pay)
SQL Server Reporting Services | None
Hadoop Indexes | Jethro Data (pay)
Azure Data Catalog | Apache Atlas
PolyBase | Apache Drill
Azure Search | Apache Solr, Elasticsearch (Azure Search is built on Elasticsearch)
Others | Apache Flink, Apache Ambari, Apache Ranger, Apache Knox

Many of the Hadoop/OSS products are available in Azure.  If you feel I’m missing some products from this list, please let me know as this is very subjective and comments are always welcome!

Create quick queries quickly with SSAS in Azure

In a perfect world a data warehouse would always return all queries very quickly, no matter what the size or complexity of the data warehouse.  But the reality is that sometimes we won’t get the performance we need from the data warehouse, and the solution is to use a SQL Server Analysis Services (SSAS) cube, which also has other benefits (see Why use a SSAS cube?).  Thanks to Azure, creating a cube can be done in a matter of hours, giving you quick results for just about any query no matter how big the data warehouse gets.

If you want to create a cube quickly, you can do so by creating a VM in Azure and installing SQL Server and SSAS (or choosing a VM from the marketplace with SQL Server and SSAS already installed), assuming the following restrictions are acceptable:

  • No high availability needed (can be accomplished in the future by creating another VM in an availability set, syncing cubes, and using an Azure load balancer)
  • No disaster recovery needed (can be accomplished in the future by creating another VM in another region and syncing cubes).  The solution for now would be to back up the cube and restore it if needed, or just rebuild and reprocess the cube
  • Number of concurrent connections under a hundred.  More connections can be handled in the future by scaling up the VM and/or creating multiple VMs, syncing cubes, and adding a load balancer (keeping in mind VMs can be shut down during off hours to save cost).  The number of concurrent connections is highly variable based on the type of querying and caching.  See SQL Server Best Practices Article
  • Using the SSAS tabular model, with a cube size less than 180GB.  Cube compression is between 10-100x, and the 180GB number assumes a conservative average of 10x and using 60% of the server’s memory.  A larger cube can be accommodated via an SSAS multidimensional model, as it does not require the cube to be in memory like tabular.  A multidimensional SSAS cube can use HOLAP storage mode to support more source data, as the cube will not keep a copy of the detail records from the source system, greatly reducing cube size and processing time.  Drilling down to the detail records can still be done through the cube, which accomplishes this by querying the source system
  • Available maintenance window at night to process the tabular cube (in the future a “cube processing VM” can be added to process the cube and then sync the cubes, to avoid queries hitting a cube while it is being processed and slowing them down)
  • Only need to update the cube once a day (in the future a processing VM and cube syncing can be used, as well as partitioning)
  • Future expansion can be via Azure Analysis Services (in preview) as well as creating multiple cubes by subject areas and using a true scale-out architecture
  • It’s a balance between using extra developer hours to save cost and choosing more hardware up-front

To choose the appropriate size VM, see Hardware Sizing a Tabular Solution (SQL Server Analysis Services), Sizes for virtual machines in Azure, Largest VM in the Cloud, and How Much RAM Do I Need For My SSAS Tabular Server?

More info:

SQL Server 2008 R2 Analysis Services Operations Guide

Why a Semantic Layer Like Azure Analysis Services is Relevant (Part 1)

Data gateway confusion

Microsoft has created data gateways as a way for cloud products such as Power BI and Azure Machine Learning to access on-prem data (an approach often called a hybrid model).  This is most useful for companies that do not want to put their data in the cloud, or that simply have too much data to upload to the cloud.

The confusion is that there are a number of different gateways, and some of those have been renamed multiple times.  I’ll try to provide a quick summary to help you understand all the current options:

On-premises data gateway: Formerly called the enterprise version.  Multiple users can share and reuse a gateway in this mode.  This gateway can be used by Power BI, PowerApps, Microsoft Flow, or Azure Logic Apps.  For Power BI, this includes support for both scheduled refresh and DirectQuery.  To add a data source such as SQL Server that can be used by the gateway, check out Manage your data source – SQL Server.  To connect the gateway to your Power BI, you will sign in to Power BI after you install it (see On-premises data gateway in-depth).

Personal gateway: This is for Power BI only and can be used as an individual without any administrator configuration.  This can only be used for on-demand refresh and scheduled refresh (not DirectQuery).

To install either of these gateways, see On-premises data gateway.  The differences between the two:

[Image: feature comparison of the on-premises data gateway and the personal gateway]

Data Management Gateway: This is the old solution for Power BI but still used for Azure Data Factory and Azure Machine Learning.  You can download it here.

Here are the differences between the Data Management Gateway and the On-premises data gateway:

 | Data Management Gateway | On-premises data gateway
Target scenarios | Use Data Management Gateway for building advanced analytics solutions.  With Data Management Gateway you can securely and efficiently transfer data between on-prem and cloud, and integrate on-prem transformation with the cloud. | Use the on-premises data gateway to keep Power BI dashboards and reports up-to-date with on-premises data sources.
Key capabilities | Batch data transfer to/from a set of supported sources and destinations; Execute T-SQL against on-prem SQL Server | DirectQuery to SQL Server; Live connection to Analysis Services; Scheduled refresh against a set of supported data sources
Cloud services it works with | Azure Data Factory; Azure Machine Learning | Power BI; PowerApps (Preview); Azure Logic Apps (Preview); Microsoft Flow (Preview)

Microsoft has another gateway that is specific to the Azure Analysis Services (Preview) service, at least for now, also (confusingly) called the On-premises data gateway.

What is HTAP?

I have been seeing the term “HTAP” mentioned a lot recently, and I thought I would briefly explain the term.  HTAP stands for Hybrid Transactional and Analytical Processing.

HTAP is used to describe the capability of a single database that can perform both online transaction processing (OLTP) and online analytical processing (OLAP) for the purpose of real-time operational intelligence processing.  The term was created by Gartner in 2014.

In the SQL Server world you can think of it as: In-memory analytics (columnstore) + in-memory OLTP = real-time operational analytics.  Microsoft supports this in SQL Server 2016 (see SQL Server 2016 real-time operational analytics).
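
As a minimal sketch of what this looks like in SQL Server 2016 (the table and index names are hypothetical, not from any particular workload):

    -- Hypothetical OLTP table; transactions insert and update rows as usual.
    CREATE TABLE dbo.Sales (
        SaleID    INT IDENTITY(1,1) PRIMARY KEY,
        ProductID INT NOT NULL,
        SaleDate  DATETIME2 NOT NULL,
        Amount    MONEY NOT NULL
    );

    -- Updatable nonclustered columnstore index (new in SQL Server 2016):
    -- analytical queries scan the columnstore while OLTP continues to hit
    -- the rowstore, giving real-time operational analytics on one table.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Sales
        ON dbo.Sales (ProductID, SaleDate, Amount);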

To clarify the difference between OLTP and OLAP: OLTP workloads, as well as Operational Data Stores (ODS), are operational workloads.  They are low-latency, high-volume, high-concurrency workloads that are usually used to operate a business, such as taking and fulfilling orders, making shipments, billing customers, and collecting payments.  OLAP (BI/EDW) workloads, on the other hand, are analytics workloads.  They are relatively higher-latency, lower-volume, and lower-concurrency workloads that are used to improve the performance of a company by analyzing operational, historical, and external data (“big data”) in order to make strategic decisions, or take actions, that improve the quality of products, customer experience, and so forth, as well as to do predictive analytics.
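
To make the contrast concrete, here is a rough T-SQL sketch of each workload, reusing the hypothetical dbo.Sales table from the sketch above:

    -- OLTP: a short, high-concurrency transaction.
    UPDATE dbo.Sales SET Amount = Amount + 10.00 WHERE SaleID = 42;

    -- OLAP: a longer-running aggregation over history, served by the
    -- columnstore index in a real-time operational analytics setup.
    SELECT ProductID, YEAR(SaleDate) AS SaleYear, SUM(Amount) AS Revenue
    FROM dbo.Sales
    GROUP BY ProductID, YEAR(SaleDate);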

So an HTAP query engine must be able to serve everything, from simple, short transactional queries to complex, long-running analytical ones, delivering on the service-level objectives for all these workloads.

This does not replace the need for a separate data warehouse when you need to integrate data from multiple sources before running the analytics workload or when you require extreme analytics performance using pre-aggregated data such as cubes.  Or when one server serving both OLTP and OLAP simply does not give you enough processing power and you need additional servers.

More info:

In Search of Database Nirvana

HTAP: What it Is and Why it Matters

Evaluating HTAP Databases for Machine Learning Applications

What is Hybrid Transaction/Analytical Processing (HTAP)?

Accelerate SQL Server 2016 HTAP performance with Windows 2016 and HPE Persistent Memory technology (video)

How HTAP Database Technology Can Help You

Total Cost of Ownership (TCO) Calculator

For a long time clients would ask me how to determine the cost savings of migrating their applications and databases to Azure.  I never had a good answer until now: the Total Cost of Ownership (TCO) Calculator.  Now in preview, just provide a brief description of your on-premises environment to get an instant estimate of the cost savings you can realize by migrating your application workloads to Microsoft Azure.  Here is an example:

Enter the inputs:

[Screenshot: TCO Calculator inputs]

Enter the assumptions:

[Screenshot: TCO Calculator assumptions]

Hit the “Calculate” button, and out comes the report:

[Screenshots: TCO Calculator report]

Azure Advisor

Azure Advisor is a really cool personalized recommendation engine that provides you with proactive best practices guidance on High Availability, Security, Performance and Cost when using Azure.  It is in public preview.  Learn more about Azure Advisor here.

Azure Advisor analyzes your resource configuration and usage telemetry to detect risks and potential issues.  It then draws on Azure best practices to recommend solutions that will reduce your cost and improve the security, performance, and reliability of your applications.

It is really easy to use.  Just log in to the Azure portal and click here (or click on the More services link at the bottom of the menu and select Azure Advisor).  You then pick the subscription you want to analyze, filter by resource groups if you wish, and click the “Get recommendations” button.  You will then see a list of recommendations that you can filter based on category (High Availability, Security, Performance, Cost):

[Screenshot: Azure Advisor recommendations overview]

You can then click to see the list of recommendations (some which have a snooze button so you can delay or dismiss taking action):

[Screenshot: list of recommendations, some with a snooze button]

You can then drill down to see the list of resources the recommendation is for:

[Screenshot: resources a recommendation applies to]

And then you can perform the fix:

[Screenshot: performing the recommended fix]

This is an easy way to optimally configure your Azure resources!

More info:

AZURE ADVISOR – YOUR PERSONAL CLOUD ASSISTANT

A Quick Tour of the Microsoft Azure Advisor

Analytics Platform System 2016 Release

Microsoft has a new release of the Analytics Platform System (APS).  This appliance update, called APS 2016, has been released to manufacturing and is now generally available.  APS is Microsoft’s scale-out, Massively Parallel Processing, fully integrated system for data warehouse specific workloads.

This release is built on the latest SQL Server 2016 release and offers additional language surface coverage to aid in migrations from SQL Server and other platforms.  It adds PolyBase connectivity to the current versions of Hadoop from Hortonworks, additional PolyBase security with Kerberos support and credential support for Azure Storage Blobs, greater indexing and collation support, and improvements to the setup and upgrade experience with FQDN support.

The majority of these capabilities have shipped in the monthly releases of the Azure SQL Data Warehouse service and/or SQL Server 2016, following the cloud-first principle of shipping, getting feedback, and improving rapidly across all of our products.

What’s New in the Release:

SQL Server 2016 re-platform and T-SQL compatibility improvements to reduce migration friction from SMP SQL Server

Data Types:

Windowing Functions:

Security Functions:

Additional Functions:

  • NEWID() – creates a UNIQUEIDENTIFIER
  • RAND() – returns a pseudo-random float value from 0-1
  • DBCC SHRINKDATABASE() – shrinks the size of data and log files in a database
  • sp_spaceused() – displays the disk space used or reserved in a table or database
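
A quick, illustrative T-SQL sketch of the functions above (the table and database names are hypothetical):

    SELECT NEWID() AS NewGuid,      -- a new UNIQUEIDENTIFIER
           RAND()  AS RandomFloat;  -- pseudo-random float between 0 and 1

    EXEC sp_spaceused 'dbo.FactSales';  -- space used by a hypothetical table
    DBCC SHRINKDATABASE (MyDatabase);   -- shrink data/log files in a database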

Polybase/Hadoop Enhancements

  • Support for the latest distributions from Hortonworks (HDP 2.4 and HDP 2.5)
  • Kerberos support via database scoped credentials
  • Credential support with Azure Storage Blobs (a T-SQL sketch follows below)
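
As a rough sketch of what the new credential support looks like in T-SQL (the storage account, container, and key below are placeholders):

    -- A master key is required before creating a database scoped credential.
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

    -- The secret is the storage account access key; IDENTITY can be any string.
    CREATE DATABASE SCOPED CREDENTIAL AzureStorageCred
    WITH IDENTITY = 'user', SECRET = '<storage account access key>';

    CREATE EXTERNAL DATA SOURCE MyAzureStorage
    WITH (TYPE = HADOOP,
          LOCATION = 'wasbs://mycontainer@mystorageacct.blob.core.windows.net',
          CREDENTIAL = AzureStorageCred);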

Install/Upgrade Enhancements

  • WSUS updated images
  • Security updates taken in updated firmware and drivers
  • Fully Qualified Domain Name support
  • APS installer includes PAV and hotfixes
  • The latest generation processor support (Broadwell), DDR4 DIMMs, and improved DIMM throughput

The latest APS update is an addition to the already existing data warehouse portfolio from Microsoft, covering a range of technology and deployment options that help customers get to insights faster.  Customers exploring data warehouse products can also consider SQL Server with Fast Track for Data Warehouse or Azure SQL Data Warehouse, a cloud-based, fully managed service.

For more details about these features, please visit our online documentation or download the client tools.

PASS Summit Announcements: PolyBase enhancements

Microsoft usually has some interesting announcements at the PASS Summit, and this year was no exception.  I’m writing a set of blogs covering the major announcements.  Next up is the PolyBase enhancements.

PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.  It allows you to run queries on external data in Hadoop or Azure blob storage.  The queries are optimized to push computation to Hadoop via MapReduce jobs.

By simply using Transact-SQL (T-SQL) statements, you can import and export data back and forth between relational tables in SQL Server and non-relational data stored in Hadoop or Azure Blob Storage.  You can also query the external data from within a T-SQL query and join it with relational data.
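
A minimal sketch of that T-SQL, assuming a hypothetical Hadoop cluster address and file layout:

    -- Point at a Hadoop cluster and describe the file layout.
    CREATE EXTERNAL DATA SOURCE MyHadoop
    WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.1:8020');

    CREATE EXTERNAL FILE FORMAT PipeDelimitedText
    WITH (FORMAT_TYPE = DELIMITEDTEXT,
          FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

    -- An external table over files in HDFS; no data moves at creation time.
    CREATE EXTERNAL TABLE dbo.WebLogs (
        LogDate DATE,
        Url     NVARCHAR(400),
        Hits    INT
    )
    WITH (LOCATION = '/weblogs/',
          DATA_SOURCE = MyHadoop,
          FILE_FORMAT = PipeDelimitedText);

    -- Query the external (Hadoop) data with ordinary T-SQL.
    SELECT LogDate, SUM(Hits) AS TotalHits
    FROM dbo.WebLogs
    GROUP BY LogDate;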

The major use cases for PolyBase are:

  • Load data: Use Hadoop as an ETL tool to cleanse data before loading to data warehouse with PolyBase
  • Interactively Query: Analyze relational data with semi-structured data using split-based query processing
  • Age-out Data: Age-out data to HDFS and use it as ‘cold’ but queryable storage (see the export sketch after this list)
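
For the age-out scenario, a sketch of exporting cold rows with PolyBase, assuming dbo.WebLogsArchive is an external table over HDFS (defined like dbo.WebLogs above) and dbo.WebLogsLocal is a regular relational table:

    -- One-time setup: allow INSERT into external tables (PolyBase export).
    EXEC sp_configure 'allow polybase export', 1;
    RECONFIGURE;

    -- Age out cold rows to HDFS; they stay queryable via the external table.
    INSERT INTO dbo.WebLogsArchive
    SELECT LogDate, Url, Hits
    FROM dbo.WebLogsLocal
    WHERE LogDate < '2015-01-01';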

The main benefits of PolyBase are:

  • New business insights across your data lake
  • Leverage existing skillsets and BI tools
  • Faster time to insights and simplified ETL process

PolyBase supports the following file formats: Delimited text (UTF-8), Hive RCFile, Hive ORC, Parquet, gzip, zlib, Snappy compressed files.

For more details see: Introduction to PolyBase presentation, PolyBase Guide, and the list of supported data sources here.

[Image: PolyBase accessing Hadoop and Azure Blob Storage from SQL Server]

PolyBase was first made available in the Analytics Platform System in March 2013, and then in SQL Server 2016.  The announcement at the PASS Summit was that, via a preview coming early next year, PolyBase in the next version of SQL Server will support Teradata, Oracle, SQL Server, and MongoDB in addition to Hadoop and Azure Blob Storage.  And Azure Data Lake Store will be supported in Azure SQL Data Warehouse PolyBase.

With SQL Server 2016, you can create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance (see PolyBase scale-out groups):

[Image: PolyBase scale-out group in SQL Server 2016]
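
Compute nodes join an existing scale-out group from the node itself; a sketch, assuming a hypothetical head node name and the default control channel port:

    -- Run on a compute node that already has PolyBase installed, pointing it
    -- at the head node (the PolyBase services must be restarted afterwards).
    EXEC sp_polybase_join_group
        @head_node_address = N'SQLHEAD01',
        @dms_control_channel_port = 16450,
        @head_node_sql_server_instance_name = N'MSSQLSERVER';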

In summary, the main reasons to use PolyBase:

  • Ability to integrate SQL Server with data stored in HDFS or Windows Azure Storage BLOB
  • Commodity hardware and storage are cheap, easily distributed on HDFS; increases data reliability at a low cost
  • Increasing number of different types of data: structured, unstructured, and semi-structured (each can be stored on the system best suited to it and queried in one place)
  • Increasing size of data and strong aversion to data deletion due to company culture or restrictions

More info:

Integrating Big Data and SQL Server 2016

PolyBase in SQL Server 2016 video

Polybase in SQL Server – Big Data Queried with T-SQL video

Why Polybase?
