Previously I covered what a data lake is (including the Azure Data Lake and enhancements), and now I wanted to touch on the main reason why you might want to incorporate a data lake into your overall data warehouse solution.
To refresh, a data lake is a landing zone, usually in Hadoop, for disparate sources of data in their native format (NoSQL databases can be used for “data lakes” too, although they often require data conversion to a single format so are much less frequently used). Data is not structured or governed on its way into the data lake. This eliminates the upfront costs of data ingestion, especially transformation. Once data is in the lake, the data is available to everyone. You don’t need a priority understanding of how data is related when it is ingested, rather, it relies on the end-user to define those relationships as they consume it. Data governorship happens on the way out instead of on the way in. This makes a data lake very efficient in processing huge volumes of data. Another benefit is the data lake allows for data exploration and discovery, to find out if data is useful or to simply leave data in the data lake while you wait to see how you might eventually use it. Or it can be used by people who don’t require top-quality cleansed data and are desperately trying to address time-critical business requirements. Think of a data lake as a solution for any case in which you do not need the benefits that an RDBMS gives you, such as accelerated vperformance for query processing and complex joins, workload management features required to meet strict SLAs, and high concurrency. Such use cases can be as simple a one-time operational report, or as complicated as using the data lake to offload workloads for refining data, in which the data lake become the primary site for the preparation (cleansing, aggregation, and integration) of data.
The disadvantages of a data lake are that it is usually not good for quick and easy analytical processing. If you need to return queries very fast, as in a few seconds or less, the data lake won’t give you that performance. Also the data is usually not clean, and it’s not easy to join the data. This leads to the inability to determine data quality or the lineage of findings by other users that have previously found value in using the same data in the lake. And finally, it can be difficult for less technical people to explore and ask questions of the data (i.e. having to write a Spark job in Java). This is where federated query technologies like PolyBase help by allowing end users to use regular SQL to access data in the data lake as well as combine that data with data in an RDBMS. What many people don’t realize is data governance needs to be applied to a data lake, just like it needs to be applied to a data warehouse, if people will try to report from it (as opposed to passing the raw data in the data lake to a RDBMS where it will be cleaned and accessible to end users). This means you should have multiple stages of data in the data lake, such as: raw, cleaned, mastered, test, production ready. Don’t make the data lake become a collection of silos (a data swamp!). Think about creating a data lake competency center to prevent this.
A data warehouse, which is a central repository of integrated data from one or more disparate sources, has many benefits. However, it takes a lot of time and money to get the data ready to be analyzed, performing the tasks to acquire the data, prep it, govern it, and finding keys to join the data. Often integration is done without a clear visibility into the value that is going to be extracted from it. Just because you can join and relate the data, if you have not figured out the value, the risk is that if turns out not to be valuable you have wasted a lot of time. Also, if relationships change between new data sources it becomes a challenge. The big benefit of a data warehouse is speed of queries. If you create a dashboard where a user can slice and dice the data, you want to be hitting a data warehouse. You get the speed because a data warehouse can be in a RDBMS as well as a star schema format which will give you optimal performance, much more than you would get from querying a data lake.
So using both a data lake and a data warehouse is the ideal solution. Think of it as the data lake is a technology solution, and the data warehouse is the business solution. The best way to see this in action is an architecture diagram that lists the many possible technologies and tools that you can use:
The details on the numbered steps in the above diagram (which is cloud solution, but it can access on-prem sources):
1) Copy source data into the Azure Data Lake Store (twitter data example)
2) Massage/filter the data using Hadoop (or skip using Hadoop and use stored procedures in SQL DW/DB to massage data after step #5)
3) Pass data into Azure ML to build models using Hive query (or pass in directly from Blob Storage). You can use a Python package to pull data directly from the Azure Data Lake Store
4) Azure ML feeds prediction results into the data warehouse (you can also pull in data from SQL Database or SQL Data Warehouse)
5) Non-relational data in Azure Data Lake Store copied to data warehouse in relational format (optionally use PolyBase with external tables to avoid copying data)
6) Power BI pulls data from data warehouse to build dashboards and reports
7) Azure Data Catalog captures metadata from Azure Data Lake Store, SQL DW/DB, and SSAS cubes
8) Power BI can pull data from the Azure Data Lake Store via HDInsight/Spark (beta) or directly. Excel can pull data from the Azure Data Lake Store via Hive ODBC or PowerQuery/HDInsight
9) To support high concurrency if using SQL DW, or for easier end-user data layer, create an SSAS cube
The beauty in the above diagram is you are separating out storage (Azure Data Lake Store) from compute (HDInsight), so you can shut down your HDInsight cluster to save costs without affecting the data. So you can fire up a HDInsight cluster, scale it up, and do processing, and then shut it down when not in use or scale it down.
You can also use Azure Data Lake Analytics and uSQL instead of HDInsight/Hadoop. ADL Analytics (big data queries as a service) give you convenience, efficiency, and automatic scaling, while HDInisght (clustes as a service) gives you customization, control, and flexibility allowing you to leverage the Hadoop ecosystem (Spark, Storm, Hbase, etc).
There are so many possible technologies and tools that can be used. You can even have the data from a data lake feed a NoSQL database, a SSAS cube, a data mart, or go right into Power BI. There is not a cookie-cutter solution – it depends on such things as the type of data, the size, the frequency (real-time streaming), the skill set, the required speed, the allowed company software, the company risk or comfort level (bleeding edge, leading edge), whether there is sensitive data that cannot be in the cloud, etc. For example, even though the data lake is a staging area of the data warehouse, operational reports that run at night can be generated from the data lake in which various reporting tools that support the Azure Data Lake Store would be used. Or if you wanted to stream data, tools like Azure Event Hubs and Azure Stream Analytics would be used. There could even be cases where you want to take relational sources and use products like Data Quality Services (DQS) and Master Data Services (MDS) to clean and master the data, then import it using SSIS into SQL DW/DB, bypassing the data lake altogether.
If you would rather go the route of using Hadoop software, many of the above technologies have Hadoop or open source equivalents: AtScale and Apache Kylin create SSAS-like OLAP cubes on Hadoop, Jethro Data creates indexes on Hadoop data, Apache Atlas for metadata and lineage tools, Apache Drill to query Hadoop files via SQL, Apache Mahout or Spark MLib for machine learning, Apache Flink for distributed stream and batch data processing, Apache HBase for storing non-relational streaming data and supporting fast query response times, SQLite/MySQL/PostgreSQL/MariaDB for storing relational data, Apache Kafka for event queuing, Apache Falcon for data and pipeline management (ETL), Apache Solr and Apache ElasticSearch for full-text search, Kibana for dashboarding on top of ElasticSearch, Apache Zeppelin and Apache Jupyter for interactive data analytics, Apache Ambari for provisioning and managing and monitoring Hadoop clusters, Apache Ranger for data security, and Apache Knox for authentication and authorization.
Just be mindful the pace of change for Hadoop technologies can be too much for a large company. Consider the number of popular processing platforms over the past few years: MapReduce -> YARN -> Storm -> Spark -> Flink!
If you already have a data warehouse solution, incorporating a data lake does not mean all existing sources should land in the data lake. Rather, it can be used for new sources and down the road you can migrate some of the older sources. For now those existing sources can bypass the data lake and go right into the data warehouse.
Another big use for a data lake is for offloading data (i.e. historical data) from other systems, such as your data warehouse or operational systems where the per-terabyte cost of storage and processing is considerable higher.
The goal of this post is to educate you on all the tools so you choose the best one. The bottom line is a data lake should be a key component to your modern data architecture.