
Comments

Why use a data lake? — 13 Comments

  1. Pingback:Why Data Lakes? – Curated SQL

    • Hi Sacha,

      Good question! I have updated the blog to mention that option, and will have a future blog that will compare those two options and go over best use cases for both.

  2. James, without data governance on the way in, how do you keep the lake reasonably pristine? I think I get the concept, but in practice it seems like it could easily end up being the “data” share where everyone dumps any file that seems like data.

    • Great question, Andy. Having data governance on the way into the data lake would defeat the main benefits of a data lake, but this leaves open the possibility of the data lake becoming a data swamp, as you suggest. So data governance needs to apply to data once it lands in the lake, via the multiple stages of data in the data lake that I mentioned above, such as: raw, cleaned, mastered, test, production ready. Some say data governance should not happen until data is pulled out of the data lake, but I feel that should be limited to a user just exploring data to see if it is valuable for repeated use.

  3. For large data this makes good sense. A lot of organizations start wondering about a data warehouse when someone asks for ad hoc reporting. At that point the data is not that large, and typically an ad hoc solution is put together consisting of replicated transaction data and a BI tool. As the data gets large (this always happens) and more use cases arise for the stored data, this starts to create a problem. I think an important case needs to be made for why building a data lake matters from the start, and for how and which tools to use to make it work without key business stakeholders complaining about the added management and learning complexity. That case should also cover the key benefits of the tools and options available over a data lake/Hadoop-based architecture, and how to avoid the common pitfalls (the ones that make you think: hey, I am not a data company yet, should I really be taking on this headache?).

  4. James, great summary. Question on PolyBase and Azure Data Lake. Can/will PolyBase connect directly to an Azure Data Lake? Or does PolyBase have to first connect to an HDInsight/Hadoop cluster, which then connects to the Azure Data Lake?

    Thanks

  5. Pingback:The art of possible with the cloud – Cloud Data Architect

  6. Hi

    You write: “The disadvantages of a data lake are that it is usually not useful for analytical processing”

    I think just the opposite. It is excellent for analytical processing; that is actually what HDFS is made for: parallel processing of large data sets.
    Also, you don’t mention the possibility of using Stream Analytics on the message broker (IoT Hub). This would let you apply a model to your data as it comes off the message broker and clean it before saving it to the data lake. You could even create two separate streams, one for validated data and one for raw data. You can also let Stream Analytics write to Power BI directly and get good real-time graphs, etc.

    • Good point – I reworded it to say “The disadvantages of a data lake are that it is usually not good for quick and easy analytical processing”.

      Other good points about stream analytics – there are dozens of possibilities and I did not want to confuse people so much 🙂

  7. Pingback:The art of possible with the cloud | James Serra's Blog
