There have been a number of enhancements to Hadoop recently when it comes to fast interactive querying with such products as Hive LLAP and Spark SQL which are being used over slower interactive querying options such as Tez/Yarn and batch processing options such as MapReduce (see Azure HDInsight Performance Benchmarking: Interactive Query, Spark and Presto).
This has led to a question I have started to see from customers: Do I still need a data warehouse or can I just put everything in a data lake and report off of that using Hive LLAP or Spark SQL? Which leads to the argument: “Is the data warehouse dead?”
I think what is confusing is the argument should not be over whether the “data warehouse” is dead but clarified if the “traditional data warehouse” is dead, as the reasons that a “data warehouse” is needed are greater than ever (i.e. integrate many sources of data, reduce reporting stress on production systems, data governance including cleaning and mastering and security, historical analysis, user-friendly data structure, minimize silos, single version of the truth, etc – see Why You Need a Data Warehouse). And what is meant by a “traditional” data warehouse is usually referring to a relational data warehouse built using SQL Server (if using Microsoft products) and when a data lake is mentioned it is usually one that is built in Hadoop using Azure Data Lake Store (ADLS) and HDInsight (which has cluster types for Spark SQL and Hive LLAP that is also called Interactive Query).
I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution? The short answer is you should use both. The rest of this post will dig into the reasons why.
I touched on this ultimate question in a blog that is now over a few years old at Hadoop and Data Warehouses so this is a good time to provide an update. I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake? and What is a data lake? and my presentation
First lets talk about cost and dismiss the incorrect assumption that Hadoop is cheaper: Hadoop can be 3x cheaper for data refinement, but to build a data warehouse in Hadoop it can be 3x more expensive due to the cost of writing complex queries and analysis (based on a WinterCorp report and my experiences).
Understand that a “big data” solution does not mean just using Hadoop-related technologies, but could mean a combination of Hadoop and relational technologies and tools. Many clients will build their solution using just Microsoft products, while others use a combination of both Microsoft and open source (see Microsoft Products vs Hadoop/OSS Products). Building a data warehouse solution on the cloud or migrating to the cloud is often the best idea (see To Cloud or Not to Cloud – Should You Migrate Your Data Warehouse?) and you can often migrate to the cloud without retooling technology and skills.
I have seen Hadoop adopters typically falling into two broad categories: those who see it as a platform for big data innovation, and those who dream of it providing the same capabilities as an enterprise data warehouse but at a cheaper cost. Big data innovators are thriving on the Hadoop platform especially when used in combination with relational database technologies, mining and refining data at volumes that were never before possible. However, most of those who expected Hadoop to replace their enterprise data warehouse have been greatly disappointed, and in response have been building complex architectures that typically do not end up meeting their business requirements.
As far as reporting goes, whether to have users report off of a data lake or via a relational database and/or a cube is a balance between giving users data quickly and having them do the work to join, clean and master data (getting IT out-of-the-way) versus having IT make multiple copies of the data and cleaning, joining and mastering it to make it easier for users to report off of the data but dealing with the delay in waiting for IT to do all this. The risk in the first case is having users repeating the process to clean/join/master data and cleaning/joining/mastering it wrong and getting different answers to the same question. Another risk in the first case is slower performance because the data is not laid out efficiently. Most solutions incorporate both to allow power users or data scientists to access the data quickly via a data lake while allowing all the other users to access the data in a relational database or cube, making self-service BI a reality (as most users would not have the skills to access data in a data lake properly or at all so a cube would be appropriate as it provides a semantic layer among other advantages to make report building very easy – see Why use a SSAS cube?).
Relational data warehouses continue to meet the information needs of users and continue to provide value. Many people use them, depend on them, and don’t want them to be replaced with a data lake. Data lakes offer a rich source of data for data scientists and self-service data consumers (“power users”) and serves analytics and big data needs well. But not all data and information workers want to become power users. The majority (at least 90%) continue to need well-integrated, systematically cleansed, easy to access relational data that includes a large body of time-variant history. These people are best served with a data warehouse.
I can’t stress enough if you need high data quality reports you need to apply the exact same transformations to the same data to produce that report no matter what your technical implementation is. If you call it a data lake or a data warehouse, or use an ETL tool or Python code, the development and maintenance effort is still there. You need to avoid falling into the old mistake that the data lake does not need data governance. It’s not a place with unicorns and fairies that will magically make all the data come out properly – a data lake is just a glorified file folder.
Here are some of the reasons why it is not a good idea to have a data lake in Hadoop as your data warehouse and forgo a relational data warehouse:
- Hadoop does not provide for very fast query reads in all use cases. While Hadoop has come a long way in this area, Hive LLAP and Spark SQL have limits on what type of queries they support (i.e. not having full support for ANSI SQL such as certain aggregate functions which limits the range of users, tools, and applications that can access Hadoop data) and it still isn’t quite at the performance level that a relational database can provide
- Hadoop lacks a sophisticated query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing strategies and therefore performs poorly for complex queries
- Hadoop does not have the ability to place “hot” and “cold” data on a variety of storage devices with different levels of performance to reduce cost
- Hadoop is not relational, as all the data is in files in HDFS, so there is always a conversion process to convert the data to a relational format if a reporting tool requires it in a relational format
- Hadoop is not a database management system. It does not have functionality such as update/delete of data, referential integrity, statistics, ACID compliance, data security, and the plethora of tools and facilities needed to govern corporate data assets
- There is no metadata stored in HDFS, so another tool such as a Hive Metastore needs to be used to store that, adding complexity and slowing performance. And most metastores only work with a limited number of tools, requiring multiple metastores
- Finding expertise in Hadoop is very difficult: The small number of people who understand Hadoop and all its various versions and products versus the large number of people who know SQL
- Hadoop is super complex, with lot’s of integration with multiple technologies to make it work
- Hadoop has many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard. See all the various Apache Hadoop technologies here
- Some reporting tools don’t work against Hadoop
- May require end-users to learn new reporting tools and Hadoop technologies to query the data
- The newer Hadoop solutions (Tez, Spark, Hive LLAP etc) are still figuring themselves out. Customers might not want to take the risk of investing in one of these solutions that may become obsolete (like MapReduce)
- It might not save you much in costs: you still have to purchase hardware or pay for cloud consumption, support, licenses, training, and migration costs. As relational databases scale up, support non-standard data types like JSON, and run functions written in Python, Perl, and Scala, it makes it even more difficult to replace them with a data lake as the migration costs alone would be substantial
- If you need to combine relational data with Hadoop, you will need to move that relational data to Hadoop or invest in a technology such as PolyBase to query Hadoop data using SQL
- Is your current IT experience and comfort level mostly around non-Hadoop technologies, like SQL Server? Many companies have dozens or hundreds of employees that know SQL Server and not Hadoop so therefore would require a ton of training as Hadoop can be overwhelming
As far as performance, it is greatly affected by the use of indexing – Hive with LLAP (or not) doesn’t have indexing, so when you run a query, it reads all of the data (minus partition elimination). Spark SQL, on the other hand, isn’t really an interactive environment – it’s fast-batch – so again, not going to see the performance users will expect from a relational database. Also, a relational database still beats most competitors when performing complex, multi-way joins. Given that most analytic queries are just that, a traditional data warehouse still might be the right choice.
From a security standpoint, you would need to integrate Hive LLAP or Spark with Apache Ranger to support granular security definition at the column level, including data masking where appropriate.
Concurrency is another thing to think about – Hadoop clusters have to get VERY large to support hundreds or thousands of concurrent connections – remember, these systems aren’t designed for interactive usage – they are optimized for batch and we are trying to shoehorn interactivity on top of that.
A traditional relational data warehouse should be viewed as just one more data source available to a user on some very large federated data fabric. It is just pre-compiled to run certain queries very fast. And a data lake is another data source for the right type of people. A data lake should not be blocked from all users so you don’t have to tell everyone “please wait three weeks while I mistranslate your query request into a new measure and three new dimensions in the data warehouse”.
Most data lake vendors assume data scientists or skilled data analysts are the principal users of the data. So, they can feed these skilled data users the raw data. But most business users get lost in that morass. So, someone has to model the data so it makes sense to business users. In the past, IT did this, but now data scientists and data analysts can do it using powerful, self-service tools. But the real question is: does a data scientist or analyst think locally or globally? Do they create a model that supports just their use case or do think more broadly how this data set can support other use cases? So it may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.
I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo. But are you as big as them? Do you have their resources? Do you generate data like them? Do you want a solution that only 1% of the workforce has the skillset for? Is your IT department radical or is it conservative?
I think a relational data warehouse still has an important place: performance, ease of access, security, integration with reporting components, and concurrency all lean towards using it, especially when performing complex, multi-way joins that make up analytic queries which is the sweet spot for a traditional data warehouse.
The bottom line is a majority of end users need the data in a relational data warehouse to easily do self-service reporting off of it. A Hadoop data lake should not be a replacement for a data warehouse, but rather should augment/complement a data warehouse.