At the recent Microsoft Build Developer Conference, Executive Vice President Scott Guthrie announced Azure Data Lake. It is a new flavor of Azure Storage that can handle streaming data (low-latency, high-volume, small writes), is geo-distributed and data-locality aware, and allows individual files to be sized at petabyte scale.
Azure Data Lake is built to remove the restrictions found in traditional analytics infrastructure and realize the idea of a “data lake” – a single place to store every type of data in its native format, with no fixed limits on account or file size, high throughput to increase analytic performance, and native integration with the Hadoop ecosystem.
I have previously blogged about the benefits of a data lake (here). To review, a data lake is an enterprise-wide repository of every type of data, collected in a single place prior to any formal definition of requirements or schema, for the purposes of operational and exploratory analytics. Data lakes remove many of the restrictions constraining traditional analytics infrastructure, such as the pre-definition of schema, the cost of storing large datasets, and the proliferation of silos where data is located. Once captured in the data lake, different computational engines like Hadoop can analyze and mine the raw data to discover new insights. A data lake can also act as a lower-cost data preparation area prior to moving curated data into a data warehouse; in these cases, customers load data into the data lake before defining any transformation logic.
Capabilities of the Azure Data Lake include:
- HDFS for the cloud: Azure Data Lake is Microsoft’s HDFS-compatible Hadoop file system for the cloud, and it works with the Hadoop ecosystem, including Azure HDInsight, Hortonworks, and Cloudera
- Unlimited storage, petabyte files: Azure Data Lake has unbounded scale with no limits to how much data can be stored in a single account (Azure blobs have a 500TB limit per account). Azure Data Lake can also store very large files in the petabyte-range with immediate read/write access and high throughput (Azure blobs have a 5TB limit for individual files)
- Optimized for massive throughput: Azure Data Lake is built for running large analytic systems that need massive throughput to query and analyze petabytes of data. You need only focus on application logic; throughput can be tuned to meet the needs of the application
- High frequency, low latency, read immediately: Azure Data Lake is built to handle high volumes of small writes at low latency, making it well suited to near-real-time scenarios such as website analytics, Internet of Things (IoT) telemetry, sensor analytics, and others
- Store data in its native format: Azure Data Lake is built as a distributed file store, allowing you to store unstructured, semi-structured, and structured data without transformation or schema definition. This lets you store all of your data and analyze it in its native format
- Integration with Azure Active Directory: Azure Data Lake is integrated with Azure Active Directory for identity and access management over all of your data
- Automatic replication: three copies of your data are kept within a single data center
- Up and running in a few clicks: no hardware to purchase, install, tune, or maintain, and you can scale out on demand
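To make the high-frequency write capability concrete, here is a minimal sketch of the two-step flow that a WebHDFS-style file create performs. The host name is a hypothetical placeholder, and the code only describes the requests rather than sending them:

```python
# Sketch of the two-step WebHDFS CREATE flow used for writes.
# Step 1 asks the name node to create the file; it answers with
# an HTTP 307 redirect, and step 2 PUTs the file bytes to that
# redirect location. "myaccount.azuredatalake.net" is hypothetical.
def create_steps(host, path, data):
    """Return (method, url, body) tuples describing a WebHDFS write."""
    step1 = ("PUT", f"https://{host}/webhdfs/v1{path}?op=CREATE", None)
    # The real URL for step 2 comes from step 1's Location header;
    # "<redirect-location>" stands in for it here.
    step2 = ("PUT", "<redirect-location>", data)
    return [step1, step2]

steps = create_steps("myaccount.azuredatalake.net", "/events/e1.json", b'{"id": 1}')
```

Many small, low-latency writes of this shape are exactly the pattern the near-real-time scenarios above (IoT, sensors, website analytics) generate.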
Azure Data Lake can be addressed with Azure Storage APIs, and it is also compatible with the Hadoop Distributed File System (HDFS). That means Hadoop clusters can work with it directly, and tools such as PolyBase can query it as well.
Answers to common questions:
What are the differences between Azure Data Lake and Azure Storage? In short: petabyte file sizes, higher throughput, and built-in Hadoop integration. Azure Storage is a generic store for many use cases, whereas Azure Data Lake is optimized for big data analytics.
Will Azure Data Lake integrate with Azure HDInsight? Yes, Azure HDInsight will be able to access Azure Data Lake as a data source, similar to how it accesses Azure Blobs today. This will be available at public preview, allowing HDInsight customers to leverage a hyper-scale big data repository in conjunction with Hadoop.
How is Azure Data Lake different from HDFS? Azure Data Lake is an implementation of HDFS in the cloud and exposes the WebHDFS REST interface. The WebHDFS REST APIs cover a subset of the operations available in HDFS.
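As a rough illustration of what the WebHDFS REST interface looks like, the snippet below builds a standard WebHDFS URL. The account host (“myaccount.azuredatalake.net”) is a hypothetical placeholder, not a documented endpoint:

```python
# Sketch of a WebHDFS-style REST URL. WebHDFS exposes file
# operations (LISTSTATUS, OPEN, CREATE, ...) as query parameters
# on a /webhdfs/v1 path; the host below is hypothetical.
from urllib.parse import urlencode

def webhdfs_url(host, path, op, **params):
    """Build a WebHDFS v1 REST URL for a file path and operation."""
    query = urlencode({"op": op, **params})
    return f"https://{host}/webhdfs/v1{path}?{query}"

# Listing a directory uses the standard LISTSTATUS operation:
url = webhdfs_url("myaccount.azuredatalake.net", "/logs", "LISTSTATUS")
# An HTTP GET on this URL returns a JSON FileStatuses document;
# OPEN (read) follows the same URL pattern.
```

Because this interface is the standard WebHDFS one, any client that already speaks WebHDFS can address the store without custom code.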
When is Microsoft Azure Data Lake available? Today, Azure Data Lake is available only as a private preview. Public preview will follow the Build conference, initially in the US East 2 data center.
What technologies can use the Azure Data Lake? Any HDFS-compliant project can use Azure Data Lake (Spark, Storm, Flume, Sqoop, Kafka, R, etc.). The idea is to put all your data in the Azure Data Lake and then later use any technology on top of it (Azure ML, Azure Data Factory, HDInsight, Azure Stream Analytics, DocumentDB, Hortonworks, Cloudera, Azure Event Hubs, Qubole, etc.).
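As a sketch of what “any HDFS-compliant project” means in practice, a Hadoop cluster is typically pointed at an HDFS-compatible store through `core-site.xml`. The account host and the `swebhdfs://` (secure WebHDFS) scheme below are illustrative assumptions; the exact URI scheme for Azure Data Lake may differ:

```xml
<!-- core-site.xml sketch; the host and swebhdfs:// scheme are assumptions -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>swebhdfs://myaccount.azuredatalake.net</value>
  </property>
</configuration>
```

With a setting like this in place, commands such as `hdfs dfs -ls /` and any job that takes HDFS paths resolve against the remote store rather than a local cluster.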
Couldn’t this be done before with Azure Blob storage? The Data Lake is dramatically different from Azure Blob storage in a few areas. It is built specifically to run massively parallel queries, with a large improvement in throughput and scale and the ability to store very large files. There is also a much bigger vision for the Data Lake, and it will continue to differentiate itself from Blob storage over time.
You can sign up to be notified when the Azure Data Lake preview becomes available.