Where should I clean my data? — 10 Comments

  1. Hello James,

    I am surprised by your recommendation of keeping the data in SQL DW, as the number of concurrent queries that can be executed is pretty low compared with Exadata, Teradata, and SQL Server. Didn't you face these issues at any of your clients?

  2. Thanks for the article, James! Can you point me to resources that give examples of how to do Type 1 and Type 2 dimensional modeling using Azure Databricks and Azure Data Lake? I understand the value of using Azure Databricks for the kind of data wrangling that is often necessary in data science work, but I don't understand how to use it to perform the ETL tasks that I currently do using SQL-based tools like MERGE statements and SSIS to populate data warehouses (see the SCD2 MERGE sketch after the comments below). I am trying to get a better understanding of the “modern data warehouse”.

  3. Hi James

    Great article as always.

    I’ve found that ADFv2 is exorbitantly expensive. Any advice on how to implement incremental load patterns using the Modern DW architecture? For example, how/where should I surface LastModifiedDateTime or LastIncrementalExecutionDateTime metadata for each table load: a CSV in ADLS Gen2? Good ol’ SQL Server? Databricks Delta? HDFS? I’d love to see a detailed article (see the watermark sketch after the comments below).

    Also, is there any plan for Power BI Dataflows integration with ADFv2 (via a component) or Databricks (via a jar/egg)? This is currently a big disconnect; for example, Databricks cannot natively read/write the model.json metadata file associated with Power BI Dataflows.

  4. Just wondering: is building SCD2 in Databricks faster than on-premises SQL Server? (See the SCD2 MERGE sketch after the comments below.)
    FYI, I tested the SCD2 MERGE SQL in SQL Server: for a dimension with 20 million rows, supposing half of them got updated, the SCD2 process runs in about 4 minutes, and I got 30 million records after the dimension processing (the 10 million changed rows each gained a new current version).
    I know for a fact I could not get the same performance on an on-premises data lake with Spark.

    • @Will, are you sure you have partitioned your data optimally in the on-premises data lake? Also, make sure none of your Spark jobs are starved of resources on the nodes, so that you are making an apples-to-apples comparison (see the tuning sketch after the comments below).

      We have done a test on Databricks Delta and have found it faster.

  5. James,

    Excellent summary. As I’m still learning these technologies, I wanted to try and test each of the scenarios mentioned and compare them.

    Are there any sample data sets or tutorials that I can use to compare each scenario? For example, if I had to copy two tables, do an aggregation, and then report on structured data, and then do the same with unstructured data sets. Hope that makes sense.
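
On the SCD questions in comments 2 and 4: below is a minimal sketch of a two-pass Type 2 pattern on a Databricks Delta table using Spark SQL MERGE. All names here (dim_customer, stg_customer, customer_id, name, city) are hypothetical placeholders, not anything from the article. The first statement expires the current version of any row whose tracked attributes changed; the second inserts a fresh current version for both changed keys and brand-new keys.

```sql
-- Pass 1: expire the current row for any key whose tracked attributes changed.
-- dim_customer / stg_customer are hypothetical names; adjust to your schema.
MERGE INTO dim_customer AS tgt
USING stg_customer AS src
  ON tgt.customer_id = src.customer_id AND tgt.is_current = true
WHEN MATCHED AND (tgt.name <> src.name OR tgt.city <> src.city) THEN
  UPDATE SET is_current = false, end_date = current_date();

-- Pass 2: insert a new current row for every key that no longer has one,
-- i.e., the keys expired above plus keys that are new to the dimension.
INSERT INTO dim_customer
SELECT src.customer_id,
       src.name,
       src.city,
       true               AS is_current,
       current_date()     AS effective_date,
       CAST(NULL AS DATE) AS end_date
FROM stg_customer AS src
LEFT JOIN dim_customer AS tgt
  ON tgt.customer_id = src.customer_id AND tgt.is_current = true
WHERE tgt.customer_id IS NULL;
```

A Type 1 dimension is simpler still: a single MERGE with WHEN MATCHED THEN UPDATE and WHEN NOT MATCHED THEN INSERT, overwriting attributes in place.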
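
On the incremental-load question in comment 3: one common way to surface the watermark is a small control table that an ADF Lookup activity reads before each load and a Stored Procedure activity advances afterwards. The sketch below is a hypothetical example (the etl.watermark table and dbo.Orders source are made up), and @watermark and @run_start stand in for values supplied by ADF pipeline expressions. The same pattern works whether the control table lives in Azure SQL DB, a Delta table, or a CSV in ADLS Gen2.

```sql
-- Hypothetical control table: one high-water mark per source table.
CREATE TABLE etl.watermark (
  table_name VARCHAR(200) NOT NULL PRIMARY KEY,
  watermark  DATETIME2    NOT NULL
);

-- ADF Lookup activity: read the current mark for this table.
SELECT watermark
FROM etl.watermark
WHERE table_name = 'dbo.Orders';

-- ADF Copy activity source query: pull only rows changed since the mark.
-- @watermark and @run_start are pipeline-supplied values, not T-SQL locals.
SELECT *
FROM dbo.Orders
WHERE LastModifiedDateTime >  @watermark
  AND LastModifiedDateTime <= @run_start;

-- Stored Procedure activity, after a successful copy: advance the mark.
UPDATE etl.watermark
SET watermark = @run_start
WHERE table_name = 'dbo.Orders';
```

The upper bound matters: capping each load at the pipeline's start time prevents rows that arrive mid-copy from being skipped by one run and then missed when the watermark advances past them.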
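
On the performance exchange in comment 4 and its reply: on Databricks Delta, the usual first levers are file compaction and co-locating rows by the merge key, since MERGE performance degrades badly over many small, unsorted files. A hypothetical tuning pass (table and column names made up) might look like this:

```sql
-- Compact small files and co-locate rows by the merge key so that
-- MERGE INTO rewrites fewer files. Databricks Delta syntax.
OPTIMIZE dim_customer
ZORDER BY (customer_id);
```

Whether this closes the gap with the ~4-minute SQL Server result in comment 4 still depends on cluster sizing; the reply's point about starved Spark jobs is about exactly that.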
