If you are building a big data solution in the cloud, you will likely be landing most of the source data into a data lake. And much of this data will need to be transformed (i.e. cleaned and joined together – the “T” in ETL). Since the data lake is just storage (i.e. Azure Data Lake Storage Gen2 or Azure Blob Storage), you need to pick a product that will be the compute and will do the transformation of the data. There is good news and bad news when it comes to which product to use. The good news is there are a lot of products to choose from. The bad news is there are a lot of products to choose from :-). I’ll try to help your decision-making by talking briefly about most of the Azure choices and the best use cases for each when it comes to transforming data (although some of these products also do the Extract and Load part):
- SSIS – The product used most often for on-prem ETL still has use cases for it when moving to the cloud, especially when migrating a solution from on-prem to the cloud. If you have hundreds of SSIS packages, it may make the most sense to continue to use those packages and just change the destination adapter (i.e. from your on-prem SQL Server to Azure SQL Database) and have them execute under the Azure-SSIS integration runtime (IR) in Azure Data Factory (ADF). Keep in mind if a package is accessing local resources (like storing temp files to a local path or using customer controls), some re-work will be required. A concern will be if a package is moving a ton of data, as executing transformations row-by-row (ETL) can be much slower than batch transformations (ELT) – see Difference between ETL and ELT. Also, the data has to be moved to the location of the IR and the compute power is limited by the size of the IR cluster. For those SSIS packages that are too slow you can just replace those with a product below
- Azure Data Factory (ADF) – Now that ADF has a new feature called Data Flow, it can transform data so it is more than just an orchestration tool. Behind the scenes, the ADF JSON code that is created when you build a solution is converted to the appropriate code in the Scala programming language and is prepared, compiled and executed in Azure Databricks. This means Data Flow operates in an ELT manner: It loads the data into a place where Databricks can access it, performs the transformations, and then moves it to the destination. ADF provides a native ETL scheduler so that you can automate data transformation and movement processes either through visual data flows or via script activities that execute in Databricks or other execution engines (so, like with SSIS, data flows are row-by-row transformations and for large amounts of data it may be faster to execute a batch transformation via a script in Databricks). My thoughts on when to use ADF are obviously if you are already using it or if your skillset lies in SSIS as it’s pretty easy to learn ADF with a SSIS background
- Databricks – It is a Spark-based analytics platform which makes it great to use if you like to work with Spark, Python, Scala, and notebooks. When choosing between Databricks and ADF, what I’ve noticed is that it depends highly on the customer personas and their capabilities. There are plenty of Data Engineers and Data Scientists who want to get deep into Python or Scala and sling some code in Databricks Notebooks. But the larger audience who wants to focus on building business logic to clean customer/address data, for example, doesn’t want to learn Python libraries, and will use the ADF visual data flow designer. Many of those are also Data Engineers and Data Scientists, but then we start to move up the value stack to include Data Analysts and Business Analysts, which is where we start to overlap with Power BI Dataflow (see below). Either way, when you want to orchestrate these cleaning routines with schedules, triggers, and monitors, you want that to be through ADF. Keep in mind if you code your transformations in Databricks Notebooks, you will be responsible for maintaining that code, troubleshooting, and scheduling those routines
- HDInsight (HDI) – Databricks is the preferred product over HDI, unless the customer has a mature Hadoop ecosystem already established. But more and more I tend to find that the majority of workloads are Spark, so Databricks is a better option. In terms of pure Spark workloads Databricks greatly outperforms HDI. Although Databricks clusters can be up and running indefinitely, they’re intended to be created and terminated as necessary – so if you wish to use other Apache Hadoop data transformation tools and have them available 24/7 then HDI may better a better option than Databricks
- PolyBase – Azure SQL Data Warehouse (SQL DW) supports PolyBase, so if you are using SQL DW you can pull data from the data lake into SQL DW using T-SQL. An option is to use PolyBase to import the raw data from the data lake into SQL DW and then clean it there using Stored Procedures, especially if you want to stick with T-SQL and don’t want to deal with Spark or Hive or other more-difficult technologies. But the benefit of cleaning the data as it sits in the data lake is to off-load the cleaning process so you are not affecting user queries and to reduce the storage space in SQL DW. Cleaning the data in the data lake can also be less expensive and there are more tools available for data manipulation than just T-SQL. Another reason is by cleaning the data in the data lake you can make it available to data scientists or power users who don’t have to wait until it’s in SQL DW
- Power BI Dataflow – In short, Dataflows integrates data lake and data prep technology directly into Power BI, so anyone with Power Query skills (yes – Power Query is now part of Power BI service and not just Power BI Desktop and is called Power Query online) can create, customize and manage data within their Power BI experience (think of it as self-service data prep). So even though the main use case for Dataflow is for an individual solution or for small workloads, you can think of Power Query as the “compute” part that transforms data that lands in the data lake and can then be used as part of an enterprise solution
In addition to what products to use to clean the data, another consideration is where to clean the data. Say I have an on-prem SQL Server database and I’m exporting some tables into CSV files and then copying them to the data lake (ADLS Gen2) and then to Azure SQL DW. And I not only have to clean the data but join these CSV files with other files also in the data lake. Should I clean the data as it moves from the data source to the data lake via ADF or always land the raw data in the data lake and then clean it and join it and land it back into the data lake (via ADF of Databricks)? Or clean it as it moves from the data lake to SQL DW via ADF or Databricks? In most cases you will want do all the transforming in the data lake. So copy the raw data into the data lake, transform it and write it back into the data lake (so you have multiple data lake layers such as raw and cleaned), then copy it to SQL DW, all of which can be orchestrated by ADF. But this is not a “one size fits all” as you may be building a solution that is moving lots of data from many different data sources as there may be a handful of use cases within that solution where it may be faster and/or easier to clean the data as it moves to or from the data lake.