Storage options on Azure

Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying, and managing applications and services through a global network of Microsoft-managed and Microsoft partner-hosted datacenters.  Included in this platform are multiple ways of storing data.  Below I give a brief overview of each so you can get a feel for its best use case, with links that go into more detail:

  • Local Storage: Provides temporary storage for a running application instance.  Local storage represents a directory on the physical file system of the underlying hardware that an application instance is running on, and can be used to store any information that is specific to the local running application instance.  You can create a number of local stores for each instance
  • Windows Azure Storage:
    • Blob: A reliable, cost-effective cloud storage for large amounts of unstructured data, such as documents and media files.  It is a highly scalable, REST-based cloud object store.  The storage service offers three types of blobs: block blobs, page blobs, and append blobs.  Block blobs are best for sequential file I/O, page blobs are best for random-write pattern data, and append blobs are optimized for append operations
    • Data Lake Store: This is “HDFS as a service”.  It is a new Azure storage service, distinct from Blob Storage, that can handle streaming data (low latency, high volumes of small writes), is geo-distributed, is data-locality aware, and allows individual files to be sized at petabyte scale.  See Azure Data Lake
    • File: Fully managed file shares that use the standard SMB 3.0 protocol.  Migrate file share-based applications to the cloud with no code changes, allowing you to lift and shift legacy apps
    • Disks: Persistent block storage for Azure IaaS VMs.  The disks are actually .vhd files that reside in an Azure storage account.  You have the option of using Premium Storage disks, which are SSD-based with high IOPS and low latency
    • Table: A NoSQL key-value store for rapid development using massive semi-structured datasets.  Highly scalable to petabytes, scales dynamically based on load, and offers fast key/value lookups
    • Queue: A way to reliably store and send messages for scenarios including workflow processing or communication between application components.  Can decouple and scale components.  Helps your applications absorb unexpected traffic bursts and can prevent servers from being overwhelmed by a sudden flood of requests.  Instead of getting dropped, incoming requests are buffered in the queue until servers catch up—so traffic bursts don’t take down your applications
  • SQL Database: Fully managed database-as-a-service that is in the cloud and is based on the Microsoft SQL Server engine.  SQL Database delivers predictable performance, scalability with no downtime, business continuity and data protection—all with near-zero administration
  • DocumentDB: A NoSQL document database service that natively supports JSON and JavaScript directly inside the database engine.  It’s a solution for cloud applications where predictable throughput, low latency, and flexible queries are key

More info:

Data Storage Offerings on the Azure Platform

Azure Table Storage and Microsoft Azure SQL Database – Compared and Contrasted

Redundancy Options in Azure Blob Storage

So you have data in Azure Blob Storage and are concerned about reliability.  Have no fear!  There are three replication options for redundancy:

1. Locally Redundant Storage (LRS): All data in the storage account is made durable by replicating transactions synchronously to three different storage nodes within the same region (there are 24 regions throughout the world, with each region made up of multiple datacenters).

2. Geo Redundant Storage (GRS): This is the default option for redundancy when a storage account is created. Like LRS, transactions are replicated synchronously to three different storage nodes within the same primary region chosen for creating the storage account.  However, transactions are also queued for asynchronous replication to another secondary region that is hundreds of miles away from the primary (geo-replication).  In this secondary region the data is again made durable by replicating it to three more storage nodes there (i.e. total of 6 copies).  So even in the case of a complete regional outage or a regional disaster in which the primary location is not recoverable, your data is still durable.

3. Read Access – Geo Redundant Storage (RA-GRS): For a GRS storage account, there is the ability to turn on read-only access to the storage account’s data in the secondary region.  Since replication to the secondary region is done asynchronously, this provides an eventually consistent version of the data to read from.  When you enable read-only access to your secondary region, you get a secondary endpoint in addition to the primary endpoint for accessing your storage account.  This secondary endpoint is identical to the primary endpoint except for the suffix “-secondary”.  For example: if the primary endpoint is myaccount.<service>, the secondary endpoint is myaccount-secondary.<service>

These options mean the data in your Microsoft Azure storage account is always replicated to ensure durability and high availability, meeting the Azure Storage SLA even in the face of transient hardware failures.

For locally redundant storage, Microsoft stores CRCs of the data to ensure correctness and periodically reads and validates the CRCs to detect bit rot (random errors occurring on the disk media over time).  If a CRC check fails, the data is recovered via an automated process.  And since each VM disk is a blob in Azure storage, if a CRC check fails on a disk, the disk is automatically decommissioned and the data is recovered via the same automated process.

For remote storage (GRS and RA-GRS), in the event of a major disaster that affects the primary storage location, Microsoft will first try to manually restore the primary location.  Restoring the primary is given precedence because failing over to the secondary may result in recent delta changes being lost (replication is asynchronous), and not all applications may prefer failing over if availability to the primary can be restored.  Depending on the nature of the disaster and its impact, on some very rare occasions Microsoft may not be able to restore the primary location, and it would need to perform a geo-failover.

When this happens, affected customers will be notified via their subscription contact information or via the Azure portal.  As part of the failover, the DNS entry for the customer’s storage account would be updated to point from the primary location to the secondary location.  Once this DNS change is propagated, the existing Blob URIs will work.  This means that you do not need to change your application’s URIs – all existing URIs will work the same before and after a geo-failover.

After the failover occurs, the location that is accepting traffic is considered the new primary location for the storage account.  This location will remain the primary location unless another geo-failover were to occur.  Once the new primary is up and accepting traffic, Microsoft will bootstrap a new secondary to make the data geo-redundant again.

More info:

Windows Azure Storage Redundancy Options and Read Access Geo Redundant Storage

Azure Storage replication

Azure Business Continuity Technical Guidance

Copying data from Azure Blob Storage

In a previous blog I talked about copying on-prem data to Azure Blob Storage (Getting data into Azure Blob Storage).  Let’s say you have copied the data and it is sitting in Azure Blob Storage (or an Azure Data Lake) and you now want to copy it from Azure Blob Storage into either SQL Server on an Azure Virtual Machine (SQL Server IaaS), SQL DW, or SQL DB.  Below I cover the various ways to do this by listing the technology and the supported destinations:

Here is an important point to keep in mind when reviewing your options if you are building a Big Data platform: all of these options copy the data, but you can use PolyBase to query the data as it sits in Azure Blob Storage, avoiding the ETL time and the extra storage required to copy it.
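
As a rough illustration, here is a hedged T-SQL sketch of what that PolyBase setup looks like on Azure SQL Data Warehouse or SQL Server 2016 (APS syntax differs slightly); the storage account, container, credential, table, and column names are all made up:

```sql
-- Minimal PolyBase sketch; a database master key must already exist.
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- The data stays in blob storage; only the table metadata lives in the database.
CREATE EXTERNAL TABLE dbo.SalesExternal (
    SaleId    INT,
    SaleDate  DATE,
    Amount    DECIMAL(18, 2)
)
WITH (
    LOCATION = '/sales/2015/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = CsvFormat
);

-- Query it like any other table, with no copy step.
SELECT SaleDate, SUM(Amount) AS TotalSales
FROM dbo.SalesExternal
GROUP BY SaleDate;
```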

Note that all of the above technologies that work against Azure Blob Storage will also work against Azure Data Lake, except for PolyBase, which is not yet supported.

More info:

Load data into SQL Data Warehouse

Azure Data Lake enhancements

I first blogged about Microsoft’s new product, the Azure Data Lake, a few months back (here).  There are already enhancements, as announced at Strata + Hadoop World.  Here they are in brief:

  • The Azure Data Lake has been renamed to the Azure Data Lake Store.  The Data Lake Store provides a single repository where you can easily capture data of any size, type, and speed without forcing changes to your application as the data scales.  In the store your data is accessible from any HDFS application and tool via WebHDFS.  It will be available in preview later this year
  • A new service called Azure Data Lake Analytics, which is a distributed analytics service built on Apache YARN that allows developers to be productive immediately on big data.  This is accomplished by submitting a job to the service, where the service will automatically run it in parallel in the cloud and scale to process data of any size (scaling is achieved by simply moving a slider).  When the job completes, it winds down resources automatically, and you only pay for the processing power used.  This makes it easy to get started quickly and be productive with the SQL, .NET, or Hive skills you already have, whether you’re a DBA, data engineer, data architect, or data scientist.  Because the analytics service works over both structured and unstructured data, you can quickly analyze all of your data – social sentiment, web clickstreams, server logs, devices, sensors, and more.  There’s no infrastructure setup, configuration, or management.  It will be available in preview later this year
  • A new language called U-SQL, which is a big data language that seamlessly unifies the ease of use of SQL with the expressive power of C#.  U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the Azure Data Lake Store and across Azure Blob Storage, SQL Server in Azure, Azure SQL Database, and Azure SQL Data Warehouse.  U-SQL is built on the learnings from Microsoft’s internal experience with SCOPE and existing languages such as T-SQL, ANSI SQL, and Hive.  See Introducing U-SQL – A Language that makes Big Data Processing Easy
  • Azure HDInsight is now included as part of the Azure Data Lake.  This integrates the fully managed Apache Hadoop cluster service with the Azure Data Lake Store.  It was also announced that HDInsight on Linux, with an industry-leading 99.9% uptime SLA, is now generally available
  • Azure Data Lake Tools for Visual Studio provide an integrated development environment that spans the Azure Data Lake, dramatically simplifying authoring, debugging, and optimization for processing and analytics at any scale.  With this tool you can write U-SQL, and you can also write U-SQL in the Azure management portal.  See Azure Data Lake: Making Big Data Easy

Note that Hortonworks, Cloudera, and MapR will integrate with the Azure Data Lake store, as well as other partners.


The services are now in private preview and interested developers can sign up here.

More info:

Microsoft expands Azure Data Lake to unleash big data productivity

Announcing General Availability of HDInsight on Linux + new Data Lake Services and Language

Getting data into Azure Blob Storage

If you have on-prem data and want to copy it to Azure Blob Storage in the cloud, what are all the possible ways to do it?  There are many, and here is a quick review of them:

AzCopy: A popular command-line utility designed for high-performance uploading, downloading, and copying data to and from Microsoft Azure Blob Storage.  See Getting Started with the AzCopy Command-Line Utility

Azure Import/Export Service: Used to transfer large amounts of file data to Azure Blob storage in situations where uploading over the network is prohibitively expensive or not feasible by sending one or more hard drives containing that data to an Azure data center.  See Use the Microsoft Azure Import/Export Service to Transfer Data to Blob Storage

SSIS: The Microsoft SQL Server 2014 Integration Services (SSIS) Feature Pack for Azure gives SSIS the capability to connect to Azure Blob Storage.  It enables you to create SSIS packages that transfer data between Azure Blob Storage and on-premises data sources.  See Microsoft SQL Server 2014 Integration Services Feature Pack for Azure

Azure Data Factory (ADF): With the latest ADF service update and Data Management Gateway release, you can copy from on-premises file system and SQL Server to Azure Blob.  See Azure Data Factory Update – New Data Stores and Move data to and from Azure Blob using Azure Data Factory and Move data to and from SQL Server on-premises or on IaaS (Azure VM) using Azure Data Factory

FTP: Deployed in an Azure worker role, this code creates an FTP server that can accept connections from all popular FTP clients (like FileZilla, for example) for command and control of your blob storage account.  See FTP to Azure Blob Storage Bridge

Other Command-line Utilities: See Azure Command-Line Interface (CLI), CloudCopy Command Line Tool

Graphical Clients: Windows Azure Storage explorers that can be used to enumerate and/or transfer data to and from blobs.  See Azure Storage Explorer, Blob Transfer Utility for Windows Azure Blob Storage, CloudBerry Explorer.  For more, see Windows Azure Storage Explorers

PowerShell/Cmdlets, .NET SDK, .NET Azure Storage Client, JavaScript CLI: Work with Azure storage programmatically.  See Uploading data with Windows PowerShell and Uploading data with the Microsoft .NET Framework and Uploading data with the Azure Storage SDK and Using the Azure CLI with Azure Storage

Azure Portal: Allows downloading, deleting, and editing certain properties of blob files.  In the old portal, go to the Azure Portal, choose Storage, click on the storage name, and click Containers.  Click on a container and you will see a list of the blob files in that container, with options at the bottom to Download, Edit, and Delete.  In the new portal, go to the Azure Portal, choose Storage accounts, click on the storage account, and click on the Blobs service.  Click on a container and you will see a list of the blob files in that container.  Click on one of the files and you will see options to Download and Delete.

PolyBase: A technology that allows SQL to be used across relational data stores and non-relational Hadoop data.  See APS Polybase for Hadoop and Windows Azure Blob Storage (WASB) Integration.  PolyBase can move data between Azure Blob Storage and APS, SQL DW, or SQL Server 2016
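
For example, on APS and SQL DW, CREATE EXTERNAL TABLE AS SELECT (CETAS) can write query results back out to blob storage.  Here is a hedged sketch that assumes an external data source and file format (here called AzureBlobStore and CsvFormat) have already been created; the table and column names are invented:

```sql
-- Hypothetical CETAS sketch (APS / Azure SQL Data Warehouse):
-- writes the result of a query out to Azure Blob Storage as delimited text.
CREATE EXTERNAL TABLE dbo.DimCustomerExport
WITH (
    LOCATION = '/export/dimcustomer/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = CsvFormat
)
AS
SELECT CustomerKey, CustomerName, City
FROM dbo.DimCustomer;

-- The reverse direction (loading from blob into the warehouse) is a CTAS
-- over an existing external table, for example:
-- CREATE TABLE dbo.DimCustomerStaged WITH (DISTRIBUTION = HASH(CustomerKey))
-- AS SELECT * FROM dbo.DimCustomerExternal;
```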

Note that for on-prem to cloud transfers, there is a service called Azure ExpressRoute that gives you a faster pipe.

More info:

Upload data for Hadoop jobs in HDInsight

Relational databases vs Non-relational databases

I see a lot of confusion about the place and purpose of the many new database solutions (“NoSQL databases”) compared to the relational database solutions that have been around for many years.  So let me try to explain the differences and the best use cases for each.

First, let’s classify these database solutions into two groups:

1) Relational databases, which can also be called relational database management systems (RDBMS) or SQL databases.  The most popular of these are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2.  These RDBMSs are mostly used in large enterprise scenarios, with the exception of MySQL, which is mostly used to store data for web applications, typically as part of the popular LAMP stack (Linux, Apache, MySQL, PHP/Python/Perl).

2) Non-relational databases, also called NoSQL databases, the most popular being MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j.  These databases are usually grouped into four categories: key-value stores, graph stores, column stores, and document stores (see Types of NoSQL databases).

All relational databases can be used to manage transaction-oriented applications (OLTP), and most non-relational databases in the document store and column store categories can also be used for OLTP, adding to the confusion.  OLTP databases can be thought of as “operational” databases, characterized by frequent, short transactions that include updates, touch a small amount of data, and where concurrency of thousands of transactions is very important (examples include banking applications and online reservations).  Integrity of data is very important, so they support ACID transactions (Atomicity, Consistency, Isolation, Durability).  This is opposed to data warehouses, which are considered “analytical” databases characterized by long, complex queries that touch a large amount of data and require a lot of resources, and where updates are infrequent.  An example is analysis of sales over the past year.
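
To make the contrast concrete, here is a small T-SQL sketch (all table and column names are invented): a short OLTP transaction that touches two rows versus an analytical query that scans a year of sales:

```sql
-- OLTP: a short transaction that touches a handful of rows and runs thousands of times a second.
BEGIN TRANSACTION;
    UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE AccountId = 42;
    UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE AccountId = 99;
COMMIT TRANSACTION;

-- Analytical: a long-running query that scans a large slice of the data and runs infrequently.
SELECT p.Category, SUM(s.Amount) AS TotalSales
FROM dbo.Sales AS s
JOIN dbo.Products AS p ON p.ProductId = s.ProductId
WHERE s.SaleDate >= '2014-01-01' AND s.SaleDate < '2015-01-01'
GROUP BY p.Category
ORDER BY TotalSales DESC;
```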

Relational databases usually work with structured data, while non-relational databases usually work with semi-structured data (e.g., XML, JSON).

Let’s look at each group in more detail:

Relational Databases

A relational database is organized based on the relational model of data, as proposed by E.F. Codd in 1970.  This model organizes data into one or more tables (or “relations”) of rows and columns, with a unique key for each row.  Generally, each entity type that is described in a database has its own table, with the rows representing instances of that type of entity and the columns representing values attributed to that instance.  Since each row in a table has its own unique key, rows in a table can be linked to rows in other tables by storing the unique key of the row to which it should be linked (such a stored key is known as a “foreign key”).  Codd showed that data relationships of arbitrary complexity can be represented using this simple set of concepts.

Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.
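
As a small illustration of the model (all names are invented), here is a pair of tables linked by a foreign key, and the SQL that joins them back together:

```sql
-- Each entity type gets its own table; the foreign key links orders to customers.
CREATE TABLE dbo.Customers (
    CustomerId INT PRIMARY KEY,
    Name       VARCHAR(100)
);

CREATE TABLE dbo.Orders (
    OrderId    INT PRIMARY KEY,
    CustomerId INT FOREIGN KEY REFERENCES dbo.Customers(CustomerId),
    OrderDate  DATE,
    Total      DECIMAL(18, 2)
);

-- SQL reassembles the relationship at query time.
SELECT c.Name, o.OrderDate, o.Total
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId;
```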

The reasons for the dominance of relational databases are: simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data.

But to offer all of this, relational databases have to be incredibly complex internally.  For example, a relatively simple SELECT statement could have dozens of potential query execution paths, which the query optimizer evaluates at run time.  All of this is hidden from users, but under the hood the RDBMS determines the best “execution plan” to answer each request using things like cost-based algorithms.
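
In SQL Server, for instance, you can ask for the plan the optimizer picked rather than the query results.  This is a small sketch that reuses the hypothetical tables from the example above:

```sql
-- Return the estimated execution plan instead of executing the query.
SET SHOWPLAN_XML ON;
GO
SELECT c.Name, SUM(o.Total) AS TotalSpent
FROM dbo.Orders AS o
JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
GROUP BY c.Name;
GO
SET SHOWPLAN_XML OFF;
GO
```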

For large databases, especially ones used for web applications, the main concern is scalability.  As more and more applications are created in environments that have massive workloads (e.g., Amazon), their scalability requirements can change very quickly and grow very large.  Relational databases scale well, but usually only when that scaling happens on a single server (“scale-up”).  When the capacity of that single server is reached, you need to “scale out” and distribute the load across multiple servers, moving into so-called distributed computing.  This is when the complexity of relational databases starts to cause problems with their potential to scale.  If you try to scale to hundreds or thousands of servers, the complexities become overwhelming.  The characteristics that make relational databases so appealing are the very same ones that drastically reduce their viability as platforms for large distributed systems.

Non-relational databases

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

Motivations for this approach include:

  1. Simplicity of design.  Not having to deal with the “impedance mismatch” between the object-oriented approach used to write applications and the schema-based tables and rows of a relational database.  For example, storing all the customer order info in one document as opposed to having to join many tables together, resulting in less code to write, debug, and maintain (see the sketch after this list)
  2. Better “horizontal” scaling to clusters of machines, which solves the problem when the number of concurrent users skyrockets for applications that are accessible via the web and mobile devices.  Using documents makes it much easier to scale-out as all the info for that customer order is contained in one place as opposed to being spread out on multiple tables.  NoSQL databases automatically spread data across servers without requiring application changes (auto-sharding), meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool.  Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption
  3. Finer control over availability.  Servers can be added or removed without application downtime.  Most NoSQL databases support data replication, storing multiple copies of data across the cluster or even across data centers, to ensure high availability and disaster recovery
  4. To easily capture all kinds of data (“Big Data”), including unstructured and semi-structured data.  This allows for a flexible database that can easily and quickly accommodate any new type of data and is not disrupted by content structure changes.  This is because document databases are schemaless, allowing you to freely add fields to JSON documents without having to first define schema changes (schema-on-read instead of schema-on-write).  You can have documents with a different number of fields than other documents.  For example, a patient record may or may not contain fields that list allergies
  5. Speed.  The data structures used by NoSQL databases (e.g., JSON documents) differ from those used by default in relational databases, making many operations faster in NoSQL than in relational databases because there is no need to join tables (at the cost of increased storage space due to duplication of data – but storage space is so cheap nowadays this is usually not an issue).  In fact, most NoSQL databases do not even support joins
  6. Cost.  NoSQL databases usually use clusters of cheap commodity servers, while RDBMSs tend to rely on expensive proprietary servers and storage systems.  Also, licenses for RDBMSs can be quite expensive, while many NoSQL databases are open source and therefore free

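As referenced in point 1 above, here is a hedged sketch of the difference (all names are invented): the relational version assembles one customer order by joining several tables, while a document store keeps the same information in a single document:

```sql
-- Relational version: the pieces of one customer order live in several tables
-- and are joined back together at read time.
SELECT c.Name, o.OrderDate, li.ProductName, li.Quantity, a.City
FROM dbo.Orders AS o
JOIN dbo.Customers  AS c  ON c.CustomerId = o.CustomerId
JOIN dbo.OrderLines AS li ON li.OrderId   = o.OrderId
JOIN dbo.Addresses  AS a  ON a.AddressId  = o.ShipToAddressId
WHERE o.OrderId = 1001;

-- Document version: a store like DocumentDB or MongoDB would keep the same
-- order as one self-contained document, so no join is needed, e.g.:
-- { "orderId": 1001, "customer": { "name": "Jane" },
--   "shipTo": { "city": "Seattle" },
--   "lines": [ { "product": "Widget", "quantity": 3 } ] }
```
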
The particular suitability of a given NoSQL database depends on the problem it must solve.

NoSQL databases are increasingly used in big data and real-time web applications.  They became popular as databases went from serving a maximum of a few hundred users on an internal company application to thousands or millions of users on a web application.  NoSQL systems are also called “Not only SQL” to emphasize that they may also support SQL-like query languages.

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance.  Some things that block adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and huge existing investments in SQL.  Also, most NoSQL stores lack true ACID transactions, or only support transactions in certain circumstances and at certain levels (e.g., the document level).  Finally, RDBMSs are usually much simpler to use, as they have GUIs, whereas many NoSQL solutions only offer a command-line interface.

Comparing the two

One of the most significant limitations of relational databases is that each column in a row can hold only a single, atomic value, so related information must be spread across tables.  If we use a bank example, each aspect of a customer’s relationship with the bank is stored as separate rows in separate tables.  So the customer’s master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on.  All these tables are linked to each other through the use of relations such as primary keys and foreign keys.

Non-relational databases, specifically key-value stores (or key-value pairs), are radically different from this model.  Key-value pairs allow you to store several related items in one “row” of data in the same table.  We place the word “row” in quotes because a row here is not really the same thing as a row in a relational table.  For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan, and investment details.  All data relating to one customer would be conveniently stored together as one record.

This seems an obviously superior method of storing data, but it has a major drawback: key-value stores, unlike relational databases, cannot enforce relationships between data items.  For instance, in our key-value database, the customer details (name, social security, address, account number, loan processing number, etc.) would all be stored as one data record (instead of being stored in several tables, as in the relational model).  The customer’s transactions (account withdrawals, account deposits, loan repayments, bank charges, etc.) would also be stored as another single data record.

In the relational model, there is a built-in and foolproof method of ensuring and enforcing business logic and rules at the database layer (for instance, that a withdrawal is charged to the correct bank account) through primary keys and foreign keys.  In key-value stores, this responsibility falls squarely on the application logic, and many people are very uncomfortable leaving this crucial responsibility to the application alone.  This is one reason why relational databases will continue to be used.

However, when it comes to web-based applications that use databases, rigorously enforcing business logic is often not a top priority.  The highest priority is the ability to service large numbers of user requests, which are typically read-only queries.  For example, on a site like eBay, the majority of users simply browse and look through posted items (read-only operations).  Only a fraction of these users actually place bids or reserve items (read-write operations).  And remember, we are talking about millions, sometimes billions, of page views per day.  The eBay site administrators are more interested in quick response time to ensure faster page loading for the site’s users than in the traditional priorities of enforcing business rules or ensuring a balance between reads and writes.

Relational-model databases can be tweaked and set up to run large-scale read-only operations through data warehousing, and thus potentially serve a large number of users who are querying a large amount of data, especially when using relational MPP architectures like the Analytics Platform System, Teradata, Oracle Exadata, or IBM Netezza, which all support scaling out.  As mentioned before, data warehouses are distinct from typical databases in that they are used for more complex analysis of data.  This differs from the transactional (OLTP) database, whose main use is to support operational systems and offer day-to-day, small-scale reporting.

However, the real challenge is the relational model’s lack of scalability when dealing with OLTP applications, or any solution with a lot of individual writes, which is the domain of relational SMP architectures.  This is where non-relational models can really shine.  They can easily distribute their data loads across dozens, hundreds, and in extreme cases (think Google search) even thousands of servers.  With each server handling only a small percentage of the total requests from users, response time is very good for each individual user.  Although this distributed computing model can be built for relational databases, it is a real pain to implement, especially when there are a lot of writes (i.e., OLTP), requiring techniques like sharding, which usually requires significant coding outside of the application’s business logic.  This is because the relational model insists on data integrity at all levels, which must be maintained even as the data is accessed and modified by several different servers.  This is why the non-relational model is the architecture of choice for large-scale web applications such as cloud computing and social networking.

So in summary, RDBMSs struggle to scale horizontally for high transaction loads (millions of read-writes), while NoSQL databases handle high transaction loads well, but at the cost of data integrity and joins.

Keep in mind many solutions will use a combination of relational and non-relational databases (see What is Polyglot Persistence?).

Also keep in mind that you may not need the performance of a non-relational database, and instead just storing files in HDFS and using Apache Hive may be enough (Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis via a SQL-like language called HiveQL).

And to end on a note that adds to the confusion, there is another category forming called NewSQL: a class of modern RDBMSs that seek to provide the same scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional relational database system.  The disadvantages are that they are not suited to OLAP-style queries and they are inappropriate for databases over a few terabytes.  Examples include VoltDB, NuoDB, MemSQL, SAP HANA, Splice Machine, Clustrix, and Altibase.

[Image: the categories that many of the database products fit into]

An excellent graphic showing how all of these technologies fit into the Azure cloud can be found in Understanding NoSQL on Microsoft Azure.

The bottom line: a NoSQL solution is a good fit if you have an OLTP application that has thousands of users and a very large database requiring a scale-out solution, and/or it uses JSON data, in particular JSON data with varying structures.  You also get the benefit of high availability, since NoSQL solutions store multiple copies of the data.  Just keep in mind that for performance you may sacrifice data consistency, as well as the ability to join data, use SQL, and do quick mass updates.

More info:

MySQL vs MongoDB

MySQL vs. MongoDB: Looking At Relational and Non-Relational Databases

10 things you should know about NoSQL databases

Introduction to Databases

Difference between SQL and NoSQL : Comparision

SQL vs NoSQL Database Differences Explained with few Example DB

NoSQL, NewSQL, or RDBMS: How To Choose

NewSQL – RDBMS on Steroids

NoSQL vs NewSQL Databases Choose the Right Tool for the Right Job

SQL vs NoSQL: you do want to have a relational storage by default

Understanding NoSQL on Microsoft Azure

Hadoop and Microsoft

In my Introduction to Hadoop I talked about the basics of Hadoop.  In this post, I wanted to cover some of the more common Hadoop technologies and tools and show how they work together, in addition to showing how they work well with Microsoft technologies and tools.  So you don’t have to choose between going with Open-source software (OSS) and going with Microsoft.  Instead, they can be combined and work together very nicely.

In short, Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.  The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – Contains libraries and utilities needed by other Hadoop modules
  • Hadoop Distributed File System (HDFS) – A distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  • Hadoop MapReduce – A programming model for large-scale data processing.  It is designed for batch processing.  Although the Hadoop framework is implemented in Java, MapReduce applications can be written in other programming languages (R, Python, C#, etc.), but Java is the most popular
  • Hadoop YARN – YARN is a resource manager introduced in Hadoop 2 that was created by separating the processing engine and resource management capabilities of MapReduce as it was implemented in Hadoop 1 (see Hadoop 1.0 vs Hadoop 2.0).  YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop

The term “Hadoop” has come to refer not just to the base modules above, but also to the “ecosystem”, or collection of additional software packages that can be installed on top of or alongside Hadoop.  There are hundreds of such software packages (see the Apache Projects Directory) that are frequently updated, making it difficult to build a Hadoop solution that uses multiple software packages.  Making things much easier are companies that provide “Hadoop in a box”.  These are ready-to-use platforms that contain numerous Hadoop software packages pre-installed and where all the packages are tested to ensure they all work together.  These include Microsoft’s HDInsight, Hortonworks Data Platform (HDP), Cloudera’s CDH, and MapR’s Distribution.

Microsoft’s HDInsight is simply a managed Hadoop service in the cloud (via Microsoft Azure), built on Apache Hadoop and available on Windows and Linux.  Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying, and managing applications and services through a global network of Microsoft-managed and Microsoft partner-hosted datacenters.  Behind the covers, HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.  Although many software packages are installed by default on an HDInsight cluster, you can add additional packages via a Script Action (see Customize HDInsight clusters using Script Action).  With HDInsight you can easily build a Hadoop cluster in the cloud in minutes.  The cluster consists of virtual machines, with the data stored separately in Azure Storage blobs instead of HDFS (see Why use Blob Storage with HDInsight on Azure).  The great thing about this is you can shut down the cluster and the data will remain (or you can store the data in the native HDFS file system that is local to the compute nodes, but you would lose that data if you deleted your cluster).  This can result in substantial savings, as you can do things like shut down the cluster on weekends if it is not needed.  Since HDInsight is 100 percent Apache-based and not a special Microsoft version, as Hadoop evolves Microsoft will embrace the newer versions.  Moreover, Microsoft is a major contributor to the Hadoop/Apache project and has put a great deal of time and effort into improving various Hadoop tools.

[Image: the software packages and their versions included in the Hortonworks Data Platform]

HDInsight is currently using HDP 2.2 and will be upgraded in the next couple of months to HDP 2.3 (see What’s new in the Hadoop cluster versions provided by HDInsight?).  Microsoft also has a new partner program for Hadoop independent software vendors (ISVs) to expand their offerings on to HDInsight (see Azure HDInsight launches ISV Partner Program at WPC).

Pictured below is the HDInsight platform that shows how the most popular Hadoop software packages fit together:

  • Hive – A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.  The query language, called Hive Query Language (HQL), is similar to standard SQL.  Hive converts queries into MapReduce, Apache Tez, or Spark jobs (all three execution engines can run in Hadoop YARN).  This makes querying data much easier, as you are writing SQL instead of Java code (see the short HiveQL sketch below)
  • Storm – A distributed real-time computation system for processing fast, large streams of data, adding real-time data processing to Apache Hadoop
  • HBase – A scalable, distributed NoSQL database that supports structured data storage for large tables
  • Tez – Provides APIs to write YARN applications.  It is a replacement for MapReduce and is much faster.  It can process data for both batch and interactive use cases
  • Spark – Intended to be a drop-in replacement for Hadoop MapReduce, with improved performance as the main benefit.  It does not require YARN.  It is up to 100x faster than MapReduce, can also perform real-time analysis of data, and supports interactive processing.  It can run on HDFS as well as the Cassandra File System (CFS)

[Image: the HDInsight platform, showing how the most popular Hadoop software packages fit together]
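
To give a feel for the Hive bullet above, here is a hedged HiveQL sketch of querying delimited files that already sit in the cluster’s Azure Blob Storage account (the container, account, table, and column names are all made up):

```sql
-- Hedged HiveQL sketch: an external table over CSV files in blob storage.
CREATE EXTERNAL TABLE weblogs (
    log_date    STRING,
    user_id     STRING,
    url         STRING,
    duration_ms INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://mycontainer@mystorageaccount.blob.core.windows.net/weblogs/';

-- The query reads like SQL; Hive compiles it into MapReduce (or Tez/Spark) jobs.
SELECT url, COUNT(*) AS hits
FROM weblogs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```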

Data in a Hadoop solution can be processed either in a batch or in real time.

For batch processing, your choices are MapReduce or Spark.  MapReduce was the first and only way for a long time, but now Spark has become the best choice.  Spark has a number of benefits over MapReduce, such as performance, a unified programming model (it can be used for both batch and real-time data processing), a richer and simpler API, and multiple datastore support.

For real-time processing, the most popular tools are Kafka and Storm for event queuing, and Cassandra or HBase for storage.  These are all OSS solutions, but there are Microsoft equivalents should you choose to go that route (with most of the Microsoft products being easier to use than the OSS products).  The Microsoft equivalent of Kafka is Azure Event Hubs, of Storm is Azure Stream Analytics, and of HBase is Azure Blob Storage/Azure Data Lake.


Note that Spark is quickly becoming the tool of choice over Storm.  The difference in the way they work is that Storm processes one event at a time, while Hadoop or Spark processes a batch at a time.  The batch can be large, small, or really small, which is called a micro-batch.  The paradigm in Spark that processes micro-batches is called Spark Streaming.

Storm & Trident (Trident is a high-level abstraction for doing real-time computing on top of Storm) were previously the most popular solution for real-time streaming, but now the most popular is Spark Streaming (an abstraction on Spark to perform stateful stream processing).  Spark is also supported on HDInsight (see Announcing Spark for Azure HDInsight public preview).


I hope this has provided a clearer picture to you on which Hadoop technologies are the most popular for building solutions today and how Microsoft supports those technologies via HDInsight.  You can learn more about HDInsight and try it for free by clicking here.

Analytics Platform System (APS) AU4 released

The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), has just released an appliance update (AU4), which is sort of like a service pack, except that it includes many new features.  Below is what is new in this release:

AU4 continues to improve how SQL Server customers can migrate to APS and how existing customers can upgrade, by enhancing T-SQL compatibility, adding support for next-generation hardware from our IHV partners (Dell, Quanta, and HP), improving data return rates to power advanced analytics scenarios, and adding workload-level upgrades that reduce upgrade durations by 50%.  These features combine to offer additional functionality, reliability, and compatibility for appliance customers while reducing the downtime when moving to AU4.

What’s New in the Release:

T-SQL compatibility improvements to reduce migration friction from SQL Server SMP

  • Implements scalar user-defined functions – CREATE/DROP FUNCTION statements (a brief sketch follows this list)
  • Implements PIVOT/UNPIVOT relational operators
  • Introduces round-robin as a third table type for hybrid performance
  • Query plan stability to favor hash joins over nested loop joins in DW scenarios
  • Addition of a new DBCC command – DBCC DROPCLEANBUFFERS
  • Merged PDW Migration Advisor and Migration Utility into a single utility for DDL validation and creation with rules and best practices integration
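
As a brief illustration of the first two items and the new DBCC command, here is a hedged T-SQL sketch (the function, tables, and data are invented, and APS syntax may differ slightly in detail):

```sql
-- A scalar user-defined function, now supported on APS (hypothetical example).
CREATE FUNCTION dbo.FahrenheitToCelsius (@f FLOAT)
RETURNS FLOAT
AS
BEGIN
    RETURN (@f - 32.0) * 5.0 / 9.0;
END;

-- PIVOT, also now supported: turn rows of quarterly sales into columns.
SELECT ProductId, [1] AS Q1, [2] AS Q2, [3] AS Q3, [4] AS Q4
FROM (SELECT ProductId, SaleQuarter, Amount FROM dbo.Sales) AS src
PIVOT (SUM(Amount) FOR SaleQuarter IN ([1], [2], [3], [4])) AS p;

-- And the new DBCC command, handy for cold-cache performance testing.
DBCC DROPCLEANBUFFERS;
```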

Performance Improvement

  • 1.5x data return rate improvement for SELECT * queries improves external analytics (SAS, R) integration

Polybase/Hadoop Enhancements

  • Enabled Cloudera 5.1 and Hortonworks Data Platform 2.1 and 2.2 support
  • Extended partial aggregate pushdowns into Hadoop
  • Add Unit for HDInsight regions within the appliance

Install, Upgrade, and Servicing Improvements

  • Reduces end-to-end upgrade time from AU3 to AU4 by 50%
  • Top hitter upgrade reliability fixes
  • Platform update to SQL Server CU8 for latest stability and reliability improvements
  • Integrated Windows Server patches to minimize patching workflows
  • Upgrade support for appliances with HDInsight regions
  • Skip level upgrade support from previous versions to enable direct upgrades to AU4

Next Generation Hardware

  • Dell, Quanta, and HP (Gen9) next generation server support
    • 2x Intel Haswell processors, 256 GB (16x16GB) 2133MHz memory
  • Updated HP 5900 series switches including High Availability improvements
  • Mixed hardware version support allows extended appliance lifetime and expandability

More info:

APS AU4 Feature Review: 1.5x data return rate improvement

Analytics Platform System Appliance Update 4 Documentation and Client Tools

Cortana Analytics Suite

Yesterday at the Microsoft Worldwide Partner Conference in Orlando, Microsoft announced the Cortana Analytics Suite, which is a new package of data storage, information management, machine learning, and business intelligence software in a single convenient monthly subscription.  Microsoft’s Cortana personal digital assistant, until now available to consumers on mobile devices and PCs, is part of the new package, designed to give non-technical users a simple interface that lets them ask questions of data and receive answers in response.

In short, it is a fully managed big data and advanced analytics suite that transforms your data into intelligent action.


Here’s a video demonstrating some capabilities of the new Cortana Analytics Suite.

It will be available later this fall as an integrated offering with a simple monthly subscription model that takes advantage of all the relevant analytics and big data capabilities to make better decisions as well as intelligently automate operations.  It includes a comprehensive set of cloud services:

  • Preconfigured solutions (business scenarios): Recommendations, Forecasting, Churn, etc.
  • Personal digital assistant: Cortana
  • Perceptual intelligence (recognition of human interactions and intent): Face, Vision, Speech and Text Analytics
  • Dashboards and visualizations: Power BI
  • Machine learning and analytics (machine learning, complex event processing): Azure Machine Learning, Azure HDInsight (Hadoop), Azure Stream Analytics
  • Big data stores (big data repository, elastic data warehouse): Azure Data Lake, Azure SQL Data Warehouse
  • Information management (data orchestration, data catalog, event ingestion): Azure Data Factory, Azure Data Catalog, Event Hubs

Note that all of these services are available now so you can start using them immediately.  There will be an on-premises version of Cortana Analytics sometime in the future via the Microsoft Azure Stack (which provides enterprises with the ability to create private clouds with the same bits Microsoft uses to build Azure itself).  Cortana Analytics includes capabilities to both discover and connect to a variety of data sources including SQL Server and Azure SQL Database.

For more information on the Cortana Analytics Suite, see the links below.

More info:

Announcing Cortana Analytics Suite and New Partner Investments at WPC 2015

Microsoft Azure Data Catalog

Just announced is the Microsoft Azure Data Catalog, which is an enterprise metadata catalog / portal for the self-service discovery of data sources.  It becomes available on Monday next week, July 13, 2015.  Check out this short video on it.  My response to this is – woo hoo!  I have been waiting years for Microsoft to come up with a tool to catalog metadata and I’m excited this day has finally arrived.

From the Microsoft blog post announcement:

Businesses of every size face the challenge of sifting through their myriad data sources and discovering the right ones for a given problem. Although businesses collect and store tons of data as part of their everyday activities, too often they fail to reap the full benefit of all the data that’s being gathered. Employees too often end up spending more time searching for data than they actually do working with the data itself.

To address these problems, Azure Data Catalog uses a crowdsourced approach. Any user, for instance an analyst, data scientist or data developer, can register, enrich, discover, understand and consume data sources. Every user is empowered to register the data sources that they use. Registration extracts the structural metadata from the data source and stores it in the cloud-based Catalog, while the data itself remains in the data source.

Crowdsourced annotations let users who are knowledgeable about the data assets registered in the Catalog to enrich the system at any time. This helps others understand the data more readily, including its intended purpose and how it’s being used within the business.

Azure Data Catalog also lets users discover data sources by searching and filtering. Users can then connect to data sources using any tool of their choice, and they can similarly work with the data that they need using the tools with which they are already familiar.

Azure Data Catalog bridges the gap between IT and the business – it encourages the community of data producers, data consumers and data experts to share their business knowledge while still allowing IT to maintain control and oversight over all the data sources in their constantly evolving systems.

There will be open APIs to allow third parties to integrate directly with the Data Catalog for both registration and discovery of data sources.  There will be both a free and a standard edition (the free version has limits on the maximum number of users and the maximum number of catalog objects).

The Azure Data Catalog is an evolution of the existing Data Catalog that today ships as a feature of Power BI for Office 365.  Soon the two catalogs will merge into a single service.  The exact timing and details for how the Azure Data Catalog might integrate into the Power BI service and Power BI Designer are still to be decided.

More info:

Announcing the Public Preview of Azure Data Catalog

Azure Data Catalog: Improve Your BI Program Effectiveness
