

What product to use to transform my data? — 9 Comments

  1. I’m surprised that you left off Azure Data Lake Analytics.

    The U-SQL set-based workflow integrated with C# makes for quite a robust tool set in the Transform arena. ADLA has many of the same features as the systems listed above (Azure Storage or ADLS integration, scale-out/clustering, etc.).

  2. ADLA is an option for customers looking for a C#-friendly analytics engine, but we are seeing more interest in open-source solutions. Just be aware that ADLA does not work with ADLS Gen2.

    • I ran a comparison between Azure Data Lake Analytics and Spark (Databricks). Data Lake Analytics has some serious problems that will be too slow to overcome and to catch up with Spark, which has already been through that pain and is maturing. I can’t see ADLA surviving, and Microsoft doesn’t need it to; it should focus on the services that bring something to the table, e.g. Power BI, Data Lake Store, Data Factory, and SQL Server.

      1. The string size limit is a serious problem in ADLA, and the workaround is a lot of frustrating coding and maintenance that can still hit the binary field limit. Compare that with how easy it is to load data in Spark.
      2. Type inference. Intelligent type inference is really important, especially with large, unwieldy data sets: you have to get the data in somehow to profile it before you can define the best schema or understand the data. This is really easy in Spark, which was designed for it from the ground up; I found it a huge pain in ADLA. It’s strongly typed, even on nulls, so simple things that work easily in Spark take lots of coding (see the sketches after this list).
      3. Ad-hoc query capability. It’s just quicker, smoother, and nicer to look at in Spark.
      4. Transactions and performance. Databricks has some pretty impressive black magic going on in Delta, with snapshot transaction isolation that allows quite an impressive lambda architecture within the same physical tables. There are performance enhancements too; they’ve been at it for a while!
      5. Spark natively supports arrays and complex types that can be inferred directly from JSON with very little typing. Handling JSON in Spark is pleasurable, and because the table types natively support equivalent structures, fully shredding the data isn’t necessary; that can be useful for multi-value attributes in a dimension, for example (see the sketches after this list). ADLA’s support for JSON was pretty poor.
      6. Full Parquet support. ADLA brought in Parquet, but again using it was a pain, and it doesn’t have complex types.
      7. The built-in function library in Spark is massive; there isn’t much you can’t do, and the custom API flexibility is huge. It’s a bit of a wrestle in ADLA.
      8. Spark supports a functional language (Scala) as well as OO, and it also supports Python and R with some performance hits. ADLA has Python, but it’s not as good, and there is no functional language.
      9. The database integration sinks and driver capabilities in Spark are quite comprehensive; ADLA tables are isolated – you can’t connect to them with anything!
      10. Spark is a one-stop shop for streaming and batch: the transformation code (DataFrames) is identical, or “unified” as they call it (see the sketches after this list).
      11. ADLA – I did like the job visualisation and heat map. That was a nice feature, and it was cheap. But I had so many failures on that platform that maybe it is not so cheap in the long run (pay per job).
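
      As a rough illustration of points 2 and 5 above, here is a minimal Spark (Scala) sketch of reading JSON with inferred, nested types. The path and the field names (name, address.city, orders) are made-up examples for illustration, not anything from the original post.

      ```scala
      // Hypothetical sketch: schema inference over nested JSON in Spark.
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.explode

      val spark = SparkSession.builder.appName("json-inference-sketch").getOrCreate()
      import spark.implicits._

      // Spark infers the full schema, including nested structs and arrays,
      // straight from the documents; no column-by-column typing needed.
      val raw = spark.read.json("/mnt/landing/customers/*.json") // made-up path
      raw.printSchema()

      // Nested fields and array columns can be queried directly...
      raw.select($"name", $"address.city", $"orders").show()

      // ...or exploded only where a flat shape is actually needed,
      // e.g. a multi-value attribute on a dimension.
      raw.select($"name", explode($"orders").as("order")).show()
      ```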
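
      And a similarly hedged sketch of the “unified” batch/streaming point (10): the same DataFrame transformation reused for a batch read and a streaming read. The directory and the eventType column are assumptions for illustration only.

      ```scala
      // Hypothetical sketch: one transformation, reused for batch and streaming.
      import org.apache.spark.sql.{DataFrame, SparkSession}
      import org.apache.spark.sql.functions.count

      val spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

      // The transformation is written once against DataFrames...
      def summarise(df: DataFrame): DataFrame =
        df.groupBy("eventType").agg(count("*").as("events"))

      // ...used for a batch read...
      val batch = spark.read.json("/data/events/")
      summarise(batch).show()

      // ...and for a streaming read of the same directory (streaming reads
      // need an explicit schema, so reuse the one inferred by the batch read).
      val stream = spark.readStream.schema(batch.schema).json("/data/events/")
      summarise(stream)
        .writeStream
        .outputMode("complete")
        .format("console")
        .start()
      ```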

      It may seem like I’m a Spark fanboy, but I wasn’t. I used to be very enthusiastic about ADLA because I’m familiar with C#, but the scars I came away with from ADLA still hurt, and Databricks just blew me away with how easy things were in comparison and with the potential of what you can do with it. I’m fully converted and can’t get enough. If you can code, you can code: C#, Scala, Java, coding is coding. Having a head for data, though, and for distributed coding, is where it all comes together.

      I have found some issues with Gen2 and Databricks at the moment, but I’m still looking into it. That could be me, not Gen2.

      PS: I don’t work for Databricks! That was a genuine comparative study.

  3. Great summary, thank you, James! I also find that doing transformations on-prem is sometimes worth considering, since that is compute capacity you already own and pay for, in contrast with cloud resources, where you pay as you go for compute. That only works, of course, if the data fits at on-prem scale, and it only makes sense when you have already paid for the machine.

  4. Pingback: Data Transformation Tools In The Azure Space – Curated SQL

  5. Hi,
    In ADF Data Flow, do you know if the source is able to read directly from gzip files?

    Right now this is possible with the Copy Activity, so I’m hoping it will be available in Data Flow as well.
    Also, will the Data Flow source read all files in blob storage without having to create any looping logic, again like the Copy Activity does?

    Thanks
