What is Microsoft Azure Stream Analytics?

Microsoft Azure Stream Analytics (ASA) is a fully managed cloud service for real-time processing of streaming data.  ASA makes it easy to set up real-time analytic computations on data flowing in from devices, sensors, web sites, applications and infrastructure systems.  It supports a powerful high-level SQL-like language that dramatically simplifies the logic to visualize, alert, or act in near real-time.  ASA makes it simpler to build a wide range of Internet-of-Things (IoT) applications such as real-time remote device management, and to monitor and gain analytic insights from connected devices of all types including mobile phones and connected cars.

It was made generally available on April 16, 2015.


ASA supports two different types of inputs, either stream data or reference data, and two different input data sources, either Azure Event Hubs or files from Azure Blob Storage.

The ingestor, or data source, for stream analytics is usually an Azure Event Hub.  Event Hubs is a highly scalable publish-subscribe data integrator capable of consuming large volumes of events per second, enabling Azure to process vast amounts of data from connected applications or devices.  It provides a unified collection point for a broad array of platforms and devices, and as such, abstracts the complexity of ingesting multiple different input streams directly into the streaming analytics engine.  ASA has an Event Hubs adapter built into the offering.
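To make the publish-subscribe collection point concrete, here is a minimal Python sketch of the core idea: events are hashed by a partition key (say, a device ID) so that each device's events land in the same partition and stay ordered.  The hash and function names are invented for illustration; Event Hubs performs its own partitioning internally.

```python
from collections import defaultdict

def partition_for(device_id: str, partition_count: int) -> int:
    """Stable mapping from a partition key to one of N partitions.
    (Toy hash for illustration; not Event Hubs' actual algorithm.)"""
    return sum(device_id.encode()) % partition_count

def ingest(events, partition_count=4):
    """Fan (device_id, payload) events out to partitions; events sharing a
    partition key keep their relative order within a partition."""
    partitions = defaultdict(list)
    for device_id, payload in events:
        partitions[partition_for(device_id, partition_count)].append((device_id, payload))
    return partitions
```

A downstream consumer like ASA can then read each partition independently and in parallel.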

ASA supports five different types of outputs: Blob Storage, Event Hubs, Power BI, SQL Database, and Table Storage.  ASA also enhances SQL by supporting groupings by time (see Windowing).  ASA provides a native connector for SQL Database for consuming events that are output from the stream.
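To give a feel for the time-based groupings mentioned above, here is a hedged Python sketch of a tumbling window, the simplest of ASA's windowing constructs: time is divided into fixed, non-overlapping buckets and events are aggregated per bucket.  This illustrates the concept only; in ASA itself you would express it in the query language with something like GROUP BY TumblingWindow(second, 10).

```python
def tumbling_window_counts(events, window_seconds):
    """Count events per fixed, non-overlapping time window.

    events: iterable of (timestamp_in_seconds, payload) pairs.
    Each event falls into exactly one window, starting at a
    multiple of window_seconds.
    """
    counts = {}
    for ts, _payload in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts
```

For example, with a 10-second window, events at t=1 and t=4 share one bucket while an event at t=12 starts the next.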

One of the common presentation use cases for ASA is to analyze high-volume streaming data in real-time and view the insights on a live dashboard (a dashboard that updates in real-time without the user having to refresh the browser).  You can build a live dashboard using Power BI as an output for your ASA job (see Azure Stream Analytics & Power BI: Live dashboard for real-time analytics of streaming data).

It is easy to build an ASA solution, as it is all done through the Azure web portal.  There is no need to create a VM and remote into it.  It is also possible to easily integrate with Azure ML.

There are also a couple of tutorials you can check out if you want to build an end-to-end solution.

Microsoft has two other stream processing platforms: StreamInsight and Azure HDInsight Storm.

More info:

Stream Analytics documentation

Reference Architecture: Real-time event processing with Microsoft Azure Stream Analytics

Video An Introduction to Azure Stream Analytics

Video Gaining Real-Time IoT Insights using Azure Stream Analytics, AzureML and PowerBI

Azure Stream Analytics Team Blog

Video Azure Stream Analytics Demo

Building an IoT solution with Azure Event Hubs and Stream Analytics – Part 3

How to Process Google Data in Real Time with Azure Stream Analytics

Microsoft Azure Stream Analytics

Posted in Azure, SQLServerPedia Syndication

How an MPP appliance solution can improve your future

Massive parallel processing (MPP) is the future for data warehousing.

So what is MPP?  SQL Server is a Symmetric Multiprocessing (SMP) solution, which essentially means it uses one server.  MPP provides scalability and query performance by running independent servers in parallel.  That is the quick definition.  For more details, read What MPP means to SQL Server Parallel Data Warehouse.
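As a toy illustration of the difference, here is a Python sketch of the MPP "scatter-gather" pattern: each node scans only its own slice of the data in parallel, and a control node combines the partial results.  This is a conceptual sketch of the pattern, not how APS itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def mpp_sum(partitions):
    """Scatter: each 'node' aggregates its own partition in parallel.
    Gather: the control node combines the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(sum, partitions))  # one partial aggregate per node
    return sum(partials)                            # final combine step
```

Adding a node means adding a partition, which is where the near-linear scaling claims for MPP come from.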

Microsoft has an MPP appliance called the Analytics Platform System (APS).  If you are building a new data warehouse that will be of any decent size in the next few years (i.e. 1TB or greater), it is really a “no brainer” to purchase an MPP solution over an SMP solution.

Looking at the value over time for an MPP appliance vs. an SMP solution:

Comparison of Cumulative Cash Flows of Customer Experience Project using a Big Data Solution (MPP) vs. a Traditional Data Warehouse Appliance
Source: Wikibon 2011

The financial metrics of the two approaches were overwhelmingly in favor of the Big Data Solution (MPP) approach:

  • Big Data Approach:
    • Cumulative 3-year Cash Flow – $152M,
    • Net Present Value – $138M,
    • Internal Rate of Return (IRR) – 524%,
    • Breakeven – 4 months.
  • Traditional DW Appliance Approach:
    • Cumulative 3-year Cash Flow – $53M,
    • Net Present Value – $46M,
    • Internal Rate of Return (IRR) – 74%,
    • Breakeven – 26 months.
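For readers who want to reproduce metrics like these for their own project, here is a short Python sketch of how NPV and breakeven fall out of a monthly cash-flow series.  The cash flows below are hypothetical, invented for the example, not the Wikibon figures above.

```python
def npv(monthly_rate, cash_flows):
    """Net present value of a series of monthly cash flows."""
    return sum(cf / (1 + monthly_rate) ** i for i, cf in enumerate(cash_flows))

def breakeven_month(cash_flows):
    """First month at which cumulative cash flow turns non-negative (None if never)."""
    total = 0.0
    for month, cf in enumerate(cash_flows):
        total += cf
        if total >= 0:
            return month
    return None

# Hypothetical project: $100 up front, $30/month back afterwards.
flows = [-100.0, 30.0, 30.0, 30.0, 30.0]
```

With those flows, the project breaks even in month 4; discounting the flows at a monthly rate gives the NPV the same way the study above does over three years.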

The bottom line is that for big data projects, the traditional data warehouse approach is more expensive in IT resources, takes much longer to do, and provides a less attractive return-on-investment.

Getting into the specific reasons to choose MPP (i.e. APS) over SMP when building a data warehouse:

  • Be proactive instead of reactive (solve future performance problems now and not when they rear their ugly head)
  • The hardware cost to upgrade to powerful servers that are clustered (for high availability) can be more than a quarter rack of APS (which has high availability built-in)
  • You can use the same SQL Server EE licenses for APS
  • You can use your current premium support – just add hours for APS
  • You may just need to purchase hardware, and there are three vendors to choose from: HP, Dell, and Quanta.  Each solution uses commodity hardware that can be de-racked and repurposed if for whatever reason the MPP solution did not work out
  • It makes data warehouse design easier as you don’t need the typical “band aids” and work-arounds required with an SMP solution to get the required performance: star schemas, aggregate tables, cubes, data marts, multiple data warehouse servers, etc.  This also reduces the complexity of ETL and therefore makes it easier to maintain
  • Data warehouse development is quicker because of APS speed: http://www.jamesserra.com/archive/2014/10/non-obvious-apspdw-benefits/
  • Knocks down potential hardware barriers down the road, as it is a long-term solution where you can easily scale your hardware by sliding in a new rack instead of fork-lifting to a new server (i.e. purchase, build and tune the new hardware, back up and restore to it, move over security, repoint users to the new server)
  • You get 30x-100x performance improvement over SMP, and when scaling your hardware you get linear performance benefits (i.e. double your hardware and you double your performance) as opposed to the 20-30% performance improvement when fork-lifting to a new server
  • You can use your existing SQL Server skill set, as APS is very much like SQL Server, so little new training is needed
  • You don’t have to say “no” anymore to end-users when they ask for more data, for reasons like: we don’t have room on the SAN, we can’t give you the query performance, or we are bumping up against our maintenance window
  • If you need non-relational data down the road, you can use Hadoop, which is a platform designed and optimized for new forms of data, and then use PolyBase for easy access.  You can create a data lake in Hadoop for end-users to mine data and let you know what is useful
  • If your data warehouse has been around a while this may be a good time to re-engineer and development will be so much quicker with MPP
  • Even if you don’t have performance problems now, you may see the value in big data and analytics and want to be ready for it, especially IoT
  • Tons of additional benefits that you don’t get with SMP: See Parallel Data Warehouse (PDW) benefits made simple

If you are currently housing a data warehouse on SMP, it will almost always be worth the migration effort to switch to an MPP solution.  Remember that old saying: “pay me now or pay me later”!

But there are some cases where MPP may not be a good fit, or should be supplemented with an SMP solution and/or an SSAS cube:

  • High volume transactional workloads (OLTP).  MPP is for data warehousing (heavy reads and batch writes)
  • Small company with small budget
  • Thousands of concurrent users
  • Super-fast dashboard query performance
  • Chatty workload
  • 24/7 SLA
  • Need replication (true real-time updates)

Another benefit is that APS is an appliance solution, meaning it comes prebuilt with the software, hardware, and networking components already installed.  Think of it as “Big data in a box”.  Some of the appliance benefits:

  • Deploy in hours, not weeks
  • Save time by implementing a turnkey solution complete with hardware and software
  • Gain confidence and peace of mind by deploying a pre-tested and tuned data warehouse optimized for your specific workload
  • Reduce operational costs and simplify management
  • Reduce energy costs and environmental impact through balanced infrastructure and performance engineering
  • Maximize reliability through the use of industry standard infrastructure and software components
  • React quickly with unprecedented database agility
  • No individualized patching of servers
  • Much lower overall TCO

More info:

Financial Comparison of Big Data MPP Solution and Data Warehouse Appliance

Posted in Big Data, PDW/APS, SQLServerPedia Syndication

SQL Server 2016 public preview arriving this summer

SQL Server 2016 was recently announced.  Top new features include:

  • Always Encrypted protects data at rest and in motion.  With Always Encrypted, SQL Server can perform operations on encrypted data and, best of all, the encryption key resides with the application in the customer’s trusted environment.  Encryption and decryption of data happens transparently inside the application, which minimizes the changes that have to be made to existing applications.
  • Stretch Database allows you to dynamically stretch your warm and cold transactional data to Microsoft Azure, so your operational data is always at hand, no matter the size, and you benefit from the low cost of using Microsoft Azure.  You can use Always Encrypted with Stretch Database to extend your data in a more secure manner for greater peace of mind
  • Real-time Operational Analytics & In-Memory OLTP: For In-Memory OLTP, which customers today are using for up to 30x faster transactions than disk-based systems, you will now be able to apply this technology, tuned for transactional performance, to a significantly greater number of applications, as well as benefit from increased concurrency.  With these enhancements, you also have the unique capability to use an in-memory columnstore delivering 100x faster queries on top of in-memory OLTP, for in-memory performance and real-time operational analytics
  • Built-in Advanced Analytics, PolyBase and Mobile BI deliver advanced analytics as an additional workload in SQL Server with proven technology from Revolution Analytics.  Advanced analytics is more accessible and has increased performance for your advanced analytic workloads by bringing R processing closer to the data and building advanced analytic capabilities right into SQL Server.  Additionally, Microsoft has built PolyBase into SQL Server, expanding the power to extract value from unstructured and structured data using your existing T-SQL skills.
  • Additional security enhancements for Row-level Security and Dynamic Data Masking
  • Improvements to AlwaysOn for more robust availability and disaster recovery with multiple synchronous replicas and secondary load balancing
  • Native JSON support to offer better performance and support for many types of data
  • SQL Server Enterprise Information Management (EIM) tools and Analysis Services get an upgrade in performance, usability and scalability
  • Faster hybrid backups, high availability and disaster recovery scenarios to backup and restore your on-premises databases to Azure and place your SQL Server AlwaysOn secondaries in Azure
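To see why Always Encrypted can still evaluate equality predicates server-side, consider this toy Python sketch: with a deterministic scheme, equal plaintexts produce equal ciphertexts, so the server can match a WHERE clause without ever holding the key.  The HMAC below is only a stand-in for illustration; the real feature uses authenticated encryption with column encryption keys.

```python
import hashlib
import hmac

KEY = b"client-side-column-key"  # stays with the application, never sent to the server

def det_token(value: str) -> str:
    """Deterministic stand-in for ciphertext: equal inputs -> equal outputs.
    (Toy only -- real Always Encrypted uses proper authenticated encryption.)"""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# "Server side": rows store only the tokens, never plaintext.
rows = [{"ssn": det_token("111-22-3333")},
        {"ssn": det_token("999-88-7777")}]

def server_lookup(ssn_token: str):
    """The server compares ciphertext to ciphertext, blind to the values."""
    return [r for r in rows if r["ssn"] == ssn_token]
```

The client tokenizes the search value with its key and ships only the token, which is what lets the encrypted column remain searchable.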

Check out the data sheet for more details.  You can sign up to be notified of the preview availability this summer.

More info:

SQL Server 2016 public preview coming this summer

‘Always Encrypted’ Leads New SQL Server 2016 Features

SQL Server 2016 is coming….

Reading the SQL Server 2016 Data Sheet

Video on new SQL Server 2016 features: The SQL Server Evolution and The SQL Server Evolution: Deep Dive

SQL Server 2016 Security Roadmap Session Notes #MSIgnite

SQL Server 2016

SSAS Related Enhancements in SQL Server 2016

Microsoft Breathes Life back into SQL Server BI On-Prem

Top 7 Features Coming to SQL Server 2016

Posted in SQL Server, SQLServerPedia Syndication

Azure Data Lake

At the recent Microsoft Build Developer Conference, Executive Vice President Scott Guthrie announced the Azure Data Lake.  It is a new flavor of Azure Storage which can handle streaming data (low latency, high volume, short updates), is geo-distributed, data-locality aware and allows individual files to be sized at petabyte scale.

Azure Data Lake is built to solve for restrictions found in traditional analytics infrastructure and realize the idea of a “data lake” – a single place to store every type of data in its native format with no fixed limits on account size or file size, high throughput to increase analytic performance and native integration with the Hadoop ecosystem.

I have previously blogged about the benefits of a data lake.  To review, a data lake is an enterprise-wide repository of every type of data collected in a single place prior to any formal definition of requirements or schema for the purposes of operational and exploratory analytics.  Data lakes remove many of the restrictions constraining traditional analytics infrastructure like the pre-definition of schema, the cost of storing large datasets, and the propagation of different silos where data is located.  Once captured in the data lake, different computational engines like Hadoop can be used to analyze and mine the raw data to discover new insights.  Data lakes can also act as a lower cost data preparation location prior to moving curated data into a data warehouse.  In these cases, customers would load data into the data lake prior to defining any transformation logic.

Capabilities of the Azure Data Lake include:

  • HDFS for the cloud: Azure Data Lake is Microsoft’s implementation of a Hadoop File System compatible with HDFS that works with the Hadoop ecosystem including Azure HDInsight, Hortonworks, and Cloudera
  • Unlimited storage, petabyte files: Azure Data Lake has unbounded scale with no limits to how much data can be stored in a single account (Azure blobs have a 500TB limit per account).  Azure Data Lake can also store very large files in the petabyte-range with immediate read/write access and high throughput (Azure blobs have a 5TB limit for individual files)
  • Optimized for massive throughput: Azure Data Lake is built for running large analytic systems that require massive throughput to query and analyze petabytes of data.  You need only focus on the application logic and throughput can be tuned to meet the needs of the application
  • High frequency, low latency, read immediately: Azure Data Lake is built to handle high volumes of small writes at low latency making it optimized for near real-time scenarios like website analytics, Internet of Things (IoT), analytics from sensors, and others
  • Store data in its native format: Azure Data Lake is built as a distributed file store allowing you to store unstructured, semi-structured and structured data without transformation or schema definition.  This allows you to store all of your data and analyze them in their native format
  • Integration with Azure Active Directory: Azure Data Lake is integrated with Azure Active Directory for identity and access management over all of your data
  • Automatically replicates your data – 3 copies within a single data center
  • Up and running in a few clicks: no hardware to purchase or install or tune or maintain.  Scale out on demand

Azure Data Lake can be addressed with Azure Storage APIs, and it’s also compatible with the Hadoop Distributed File System (HDFS).  That means Hadoop clusters can use it directly, and PolyBase can query it as well.


Answers to common questions:

What are the differences between Azure Data Lake and Azure Storage?  In summary, this includes petabyte file sizes, high throughput, and built-in Hadoop integration.  Azure Storage is a generic store for many use cases whereas Azure Data Lake is optimized for doing big data analytics.

Will Azure Data Lake integrate with Azure HDInsight?  Yes, Azure HDInsight will be able to access Azure Data Lake as a data source similar to how it accesses Azure Blobs today.  This is available immediately at public preview allowing HDInsight customers to leverage a hyper scale big data repository in conjunction with Hadoop.

How is Azure Data Lake different from HDFS?  Azure Data Lake is an implementation of HDFS in the cloud and leverages the WebHDFS REST interface.  The WebHDFS REST APIs expose a subset of the APIs available for HDFS.
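Because the interface is WebHDFS, any HDFS-aware client talks to the store through plain REST calls.  The sketch below just builds such a URL in Python; the host name is a placeholder and no request is actually issued.

```python
from urllib.parse import urlencode

def webhdfs_url(host: str, path: str, op: str, **params) -> str:
    """Build a WebHDFS-style REST URL (e.g. op=LISTSTATUS, OPEN, GETFILESTATUS).
    An HDFS-compatible client would issue an HTTP GET against this URL."""
    query = urlencode({"op": op, **params})
    return f"https://{host}/webhdfs/v1{path}?{query}"
```

For example, listing a directory is just a GET on the path with op=LISTSTATUS, which is why so many Hadoop-ecosystem tools can plug in unchanged.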

When is Microsoft Azure Data Lake available?  Today, Azure Data Lake is only available as a private preview.  Public preview will be available post-Build conference.  At public preview, Data Lake will be available in the US East 2 data center.

What technologies can use the Azure Data Lake?  Any HDFS-compliant project can use Azure Data Lake (Spark, Storm, Flume, Sqoop, Kafka, R, etc.).  The idea is to put all your data in the Azure Data Lake and then later use any technology on top of it (Azure ML, Azure Data Factory, HDInsight, Azure Stream Analytics, DocumentDB, Hortonworks, Cloudera, Azure Event Hubs, Qubole, etc.).

Couldn’t this be done before with Azure blob storage?  The data lake is dramatically different from Azure Blob Storage in a few areas.  It is built specifically to run massively parallel queries, with a large improvement in throughput performance and scale, and the ability to store large files.  There is also a much bigger vision for the data lake, and it will continue to differentiate itself more and more from Blob Storage.

You can sign up to be notified when the Azure Data Lake preview becomes available.

More info:

Microsoft BUILDs its cloud Big Data story

Azure Data Lake: Why you might want one

Microsoft Announces Azure SQL Database elastic database, Azure SQL Data Warehouse, Azure Data Lake

6 Key Features from Microsoft’s Azure Data Lake

Posted in Big Data, SQLServerPedia Syndication

Azure SQL Data Warehouse

Analytics Platform System (APS) is Microsoft’s massively parallel processing (MPP) data warehouse technology.  This has only been available as an on-prem solution (see video Overview of Microsoft Analytics Platform System).  Until now.  At the recent Microsoft Build Developer Conference, Executive Vice President Scott Guthrie announced the Azure SQL Data Warehouse (SQL DW).  This is a cloud data warehouse-as-a-service (DWaaS) that will compete with Amazon’s Redshift.

But it has some additional benefits over Redshift:

  • With Redshift you must scale your data warehouse by increasing both the compute and storage units.  With SQL DW, compute and storage is decoupled so you can scale them individually.  This is a very different economic model that can save customers a lot of money as you don’t have to purchase additional storage when you just need more compute power, or vice-versa
  • The ability to pause compute when not in use so you only pay for storage, as opposed to Redshift, in which you are billed 24/7 for all the VMs that make up the nodes in your cluster
  • With Redshift you have to pick a pre-defined size and it can take hours to days to resize.  With SQL DW you can start small and grow or shrink in seconds
  • And keep in mind with Microsoft you can have a hybrid architecture that can use an on-prem APS combined with a SQL DW, allowing you to keep sensitive data on-prem and non-sensitive data in the cloud if you wish.  With Redshift you only have the option of keeping all your data in the cloud
  • There is also a lot more compatibility with SQL DW as it supports many features that Redshift does not, such as indexes, stored procs, SQL UDFs, partitioning, and constraints

SQL DW is built with the same technology as APS, except that instead of using SQL Server 2014 it uses version 12 of Azure SQL Database.  It also includes PolyBase.  PolyBase allows APS and SQL DW to query data in a Hadoop cluster, either directly or by pushing some of the work to Hadoop itself so the query is actually run using the Hadoop cluster’s CPUs.  The Hadoop data is made to look as if it were local to the data warehouse, so that end-users can use their existing skill sets to query it via SQL or any reporting tool that uses SQL (like Excel, SSRS, Power BI, etc.).  PolyBase can integrate with Hadoop in this manner via a Microsoft HDInsight cluster that can either be inside APS or in the cloud, or via a Hortonworks or Cloudera cluster.


SQL DW will work with existing data tools including Power BI for data visualization, Azure Machine Learning for advanced analytics, Azure Data Factory for data orchestration and Azure HDInsight.

The preview for Azure SQL Data Warehouse will be available later this calendar year.  You can sign up to be notified when the Azure SQL Data Warehouse preview becomes available.

I will have a lot more blogs about this new service in the coming months, so stay tuned!

More info:

Microsoft BUILDs its cloud Big Data story

Microsoft Announces Azure SQL Database elastic database, Azure SQL Data Warehouse, Azure Data Lake

Introducing Azure SQL Data Warehouse

Short introduction video SQL Data Warehouse – YouTube

Top reasons why enterprises should choose Azure SQL Data Warehouse

Video on new SQL Server 2016 features and SQL DW (minute 59 with demo): The SQL Server Evolution

Video Microsoft Azure SQL Data Warehouse Overview

Video Azure SQL Data Warehouse: Deep Dive

Posted in Data warehouse, PDW/APS, SQLServerPedia Syndication

TechEd is now Microsoft Ignite

In case you were wondering what happened to the TechEd conferences, Microsoft is bringing together the best of its previously individual events – the Management Summit and the Exchange, SharePoint, Lync, Project, and TechEd conferences – and taking it to the next level, based on what customers and partners have asked for.

The new event is called Microsoft Ignite and is going on now (May 4–8, 2015, Chicago, IL).

Posted in SQLServerPedia Syndication

Types of NoSQL databases

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.  NoSQL is often interpreted as Not-only-SQL to emphasize that such databases may also support SQL-like query languages.  Most NoSQL databases are designed to store large quantities of data in a fault-tolerant way.

NoSQL is simply the term that is used to describe a family of databases that are all non-relational.  While the technologies, data types, and use cases vary wildly among them, it is generally agreed that there are four types of NoSQL databases:

  • Key-value stores – These databases pair keys to values.  An analogy is a file system, where the path acts as the key and the file contents act as the value.  There are usually no fields to update; instead, the entire value other than the key must be updated if changes are to be made.  This simplicity scales well, but it can limit the complexity of the queries and other advanced features.  Examples are: Dynamo, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS, HyperDex, Azure Table Storage (see Redis vs Azure)
  • Graph stores – These excel at dealing with interconnected data.  Graph databases consist of connections, or edges, between nodes.  Both nodes and their edges can store additional properties such as key-value pairs.  The strength of a graph database is in traversing the connections between the nodes.  But they generally require all data to fit on one machine, limiting their scalability.  Examples include: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog, Sesame
  • Column stores – Relational databases store all the data in a particular table’s rows together on-disk, making retrieval of a particular row fast.  Column-family databases generally serialize all the values of a particular column together on-disk, which makes retrieval of a large amount of a specific attribute fast.  This approach lends itself well to aggregate queries and analytics scenarios where you might run range queries over a specific field.  Examples include: Accumulo, Cassandra, Druid, HBase, Vertica
  • Document stores – These databases store records as “documents” where a document can generally be thought of as a grouping of key-value pairs (it has nothing to do with storing actual documents such as a Word document).  Keys are always strings, and values can be stored as strings, numeric, Booleans, arrays, and other nested key-value pairs.  Values can be nested to arbitrary depths.  In a document database, each document carries its own schema — unlike an RDBMS, in which every row in a given table must have the same columns.  Examples include: Lotus Notes, Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB, Qizx, Cloudant, Azure DocumentDB (see MongoDB vs. Azure DocumentDB and An Overview of Microsoft Azure DocumentDB)
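The key-value trade-off described above — simple and scalable, but updates replace the whole value — can be seen in a few lines of Python.  This is a toy in-process store for illustration, not any particular product.

```python
class KeyValueStore:
    """Toy key-value store: values are opaque, so there is no partial update."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # whole-value replacement

    def get(self, key, default=None):
        return self._data.get(key, default)
```

Changing one field means a read-modify-write of the entire value, which is exactly the limitation noted in the key-value bullet.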



The CAP Theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
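A toy Python sketch makes the trade-off concrete: when a network partition separates two replicas, a CP system refuses the write (consistent but unavailable) while an AP system accepts it on the reachable replica (available but temporarily divergent).  The class and mode names here are invented for illustration.

```python
class Replica:
    def __init__(self):
        self.value = None

def write(replicas, value, partitioned, mode):
    """Apply a write under CAP: 'CP' rejects during a partition,
    'AP' accepts on the reachable replica and lets replicas diverge."""
    if not partitioned:
        for r in replicas:               # healthy network: replicate everywhere
            r.value = value
        return "ok"
    if mode == "CP":
        return "unavailable"             # refuse rather than risk inconsistency
    replicas[0].value = value            # AP: only the reachable replica updates
    return "ok"
```

AP systems typically repair the divergence later (anti-entropy, read repair), which is the "eventual consistency" you hear about with stores like Dynamo and Cassandra.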


Since you can only pick two guarantees, here is a list of NoSQL systems broken out by the two that they support:


Here is a quick summary of the most popular NoSQL products by group:

  • Key-value stores
    • Riak – Offers high availability, fault tolerance, operational simplicity, and scalability.  Riak is one of the more sophisticated data stores.  It offers most of the features found in others, then adds more control over duplication.  Although the basic structure stores pairs of keys and values, the options for retrieving them and guaranteeing their consistency are quite rich.
    • Redis – Like CouchDB and MongoDB, Redis stores documents or rows made up of key-value pairs.  Unlike the rest of the NoSQL world, it stores more than just strings or numbers in the value.  It will also include sorted and unsorted sets of strings as a value linked to a key, a feature that lets it offer some sophisticated set operations to the user.  There’s no need for the client to download data to compute the intersection when Redis can do it at the server.  Redis is also known for keeping the data in memory and only writing out the list of changes every once in a while.  Some don’t even call it a database, preferring instead to focus on the positive by labeling it a powerful in-memory cache that also writes to disk.  Traditional databases are slower because they wait until the disk gets the information before signaling that everything is OK.  Redis waits only until the data is in memory, something that’s obviously faster but potentially dangerous if the power fades at the wrong moment.
  • Document stores
    • MongoDB – Designed for scale, flexible data aggregation, and storing files of any size.  It has rich querying, high availability, and full indexing support, and is fast being adopted by many businesses.  It uses GridFS instead of HDFS.  MongoDB is designed for OLTP workloads: it can do complex queries, but it’s not necessarily the best fit for reporting-style workloads, nor is it a good choice if you need complex transactions.  However, MongoDB’s simplicity makes it a great place to start.  MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.
    • Couchbase – Can be used both as a document database that stores JSON documents or as a pure key-value database.  Click here for how it compares to MongoDB.  CouchDB, by contrast, stores documents, each of which is made up of a set of pairs that link a key with a value.  Its most radical change is in the query model: instead of a basic query structure that’s pretty similar to SQL, CouchDB searches for documents with two functions that map and reduce the data.  One formats the document, and the other makes a decision about what to include.
  • Column stores
    • Cassandra – Born at Facebook and built on Amazon’s Dynamo and Google’s BigTable, it is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.  It is essentially a hybrid between a key-value and a column-oriented (or tabular) database.  Many consider it superior to HBase and MongoDB for its use cases.  It can be used for both OLTP and data warehousing.  It replaces HDFS with the Cassandra File System (CFS).  Cassandra does not support joins or subqueries and emphasizes denormalization.
    • HBase – Patterned after Google BigTable, HBase is designed to provide fast, tabular access to the high-scale data stored on HDFS.  It is well suited for sparse data sets, which are common in many big data use cases.  HBase offers two broad use cases. First, it gives developers database-style access to Hadoop-scale storage, which means they can quickly read from or write to specific subsets of data without having to wade through the entire data store.  Most users and data-driven applications are used to working with the tables, columns, and rows of a database, and that’s what HBase provides.  Second, HBase provides a transactional platform for running high-scale, real-time applications. In this role, HBase is an ACID-compliant database that can run transactional applications.  That’s what conventional relational databases like Microsoft SQL Server are mostly used for, but HBase can handle the incredible volume, variety, and complexity of data encountered on the Hadoop platform.  Like other NoSQL databases, it doesn’t require a fixed schema, so you can quickly add new data even if it doesn’t conform to a predefined model.  It can be used for lightweight OLTP.  Tables are de-normalized for speed (so no joins), but updates can be slow.  HBase does not use Hadoop’s MapReduce capabilities directly, though HBase can integrate with Hadoop to serve as a source or destination of MapReduce jobs
  • Graph
    • Neo4j – Neo4j lets you fill up the data store with nodes and then add links between the nodes that mean things.  Social networking applications are its strength.  The code base comes with a number of common graph algorithms already implemented.  If you want to find the shortest path between two people — which you might for a site like LinkedIn — then the algorithms are waiting for you.
    • OrientDB – It is a document-based database, but the relationships are managed as in graph databases with direct connections between records.  It supports schema-less, schema-full and schema-mixed modes.  It has a strong security profiling system based on users and roles and supports SQL as a query language.  OrientDB uses a new indexing algorithm called MVRB-Tree, derived from the red–black tree and from the B+ tree; this reportedly has benefits of having both fast insertions and fast lookups.
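The server-side set operations mentioned for Redis above are easy to mimic in Python: the point is that the intersection happens where the data lives, so the client never downloads both sets.  TinySetStore is an invented name for this sketch; the method names echo Redis's SADD/SINTER commands.

```python
class TinySetStore:
    """Sketch of Redis-style server-side set operations."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        """Add members to the set stored at key (like Redis SADD)."""
        self._sets.setdefault(key, set()).update(members)

    def sinter(self, *keys):
        """Intersect the sets at the given keys server-side (like Redis SINTER)."""
        sets = [self._sets.get(k, set()) for k in keys]
        return set.intersection(*sets) if sets else set()
```

A client asking "which interests do Ann and Bob share?" sends one command and gets back only the (usually small) intersection, not both full sets.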


But I’ll leave you with this note: Although NoSQL databases are becoming more popular, according to DB-Engines Ranking they only make up about 12% of the total database market when you include relational databases!

More info:

Visual Guide to NoSQL Systems

List Of NoSQL Databases

Thumbtack: NoSQL Database Comparison by Ben Engber

MongoDB, Cassandra, and HBase — the three NoSQL databases to watch

What is Apache Cassandra?

DB-Engines database popularity ranking

NoSQL showdown: MongoDB vs. Couchbase

NoSQL standouts: New databases for new applications

The Rise and Fall of the NoSQL Empire (2007–2013)

What’s better for your big data application, SQL or NoSQL?

Considerations for using NoSQL technology on your next IT project

Difference between SQL and NoSQL : Comparision

SQL vs NoSQL Database Differences Explained with few Example DB

NoSQL Introduction

Getting Acquainted with NoSQL on Windows Azure

Data Store Map (full PDF map) – 451 Research

Which NoSQL Solution is Right For You?

Four and a Half Types of NoSQL Databases, and When to Use Them

Understanding NoSQL on Microsoft Azure

IT pros talk top enterprise NoSQL architecture challenges

Posted in Hadoop, SQLServerPedia Syndication | 7 Comments

What is a data lake?

A “data lake” is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.  It’s a great place for investigating, exploring, experimenting with, and refining data, in addition to archiving it.  There are various products you can use to build a data lake, such as Microsoft’s Azure Data Lake repository.  Data lakes are becoming increasingly necessary as there are now so many data sources that companies can use to make better business decisions, such as social networks, review web sites, online news, weather data, web logs, and sensor data.  All of these “big data” sources result in rapidly increasing data volumes and new data streams that all need to be analyzed.  Some characteristics of a data lake include:

  • A place to store unlimited amounts of long-term data in any format inexpensively, as Hadoop is usually a much lower-cost repository
  • Allows collection of data that you may or may not use later: “just in case”
  • Allows for easy integration of structured data, semi-structured data (e.g. XML, HTML), unstructured data (e.g. text, audio, video), and machine-generated data (e.g. sensor data)
  • A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: “just in time” or “schema on read”
  • Complements an Enterprise Data Warehouse (EDW) and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW
  • Frees up expensive EDW resources (storage and processing), especially for data refinement
  • Exploits the full power of a Hadoop cluster to speed up ETL processing compared to SMP data warehouse solutions
  • Allows data exploration to be performed without waiting for the EDW team to model and load the data, with the added benefit that exploration may reveal the data is not useful, saving the EDW team from wasting resources on it
  • An excellent area to run extreme analytics, such as running millions of scoring models concurrently on millions of accounts to detect credit card fraud, which is typically not a workload you would see running in a data warehouse
  • A place to land streaming data, for example from IoT devices or Twitter.  This data can also be analyzed during ETL processing (e.g. scoring Twitter sentiment)
  • An online archive for data warehouse data that is no longer analyzed on a frequent basis
  • Some processing is better done in Hadoop than in ETL tools like SSIS
  • Also called a bit bucket, staging area, landing zone, or enterprise data hub (Cloudera’s term)
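To make “schema on read” concrete, here is a minimal Python sketch (the records and field names are hypothetical): raw JSON events land in the lake as-is, with differing shapes, and structure is imposed only at query time by the analysis that reads them.

```python
import json

# Raw records landed as-is -- no schema enforced at write time,
# and the records do not all share the same fields.
raw_landing_zone = [
    '{"device": "d1", "temp_f": 71.3, "ts": "2015-04-16T10:00:00"}',
    '{"device": "d2", "humidity": 40, "ts": "2015-04-16T10:00:05"}',
    '{"device": "d1", "temp_f": 72.1, "ts": "2015-04-16T10:01:00"}',
]

# "Schema on read": the structure lives in the query, not the store.
# This analysis projects only the fields it needs, tolerating records
# that lack them.
readings = [
    (rec["device"], rec["temp_f"])
    for rec in map(json.loads, raw_landing_zone)
    if "temp_f" in rec
]
print(readings)  # [('d1', 71.3), ('d1', 72.1)]
```

A different analysis could read the same raw records tomorrow with a different “schema” (say, humidity by device) without any reload or migration.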

To use some pictures to show the benefit of a data lake, here is the traditional approach for a data warehouse environment:


But as we introduce larger data volumes into this environment along with non-relational data, we run into problems:


The impact if we keep the current architecture:

  • Useful data is dropped by introducing ETL “bias”
  • Potentially insightful data is lost
  • Latency is created as volumes increase and sources change
  • Data is duplicated through staging environments to support ETL
  • Expensive “reactive” hardware is needed to support processing scale requirements

So we take a new approach, where the non-relational data is copied to a data lake and refined, and then copied to the data warehouse.  Meanwhile, much of the relational data can keep being fed directly to the data warehouse using the current ETL, bypassing the data lake:


Note the use of ELT instead of ETL (loading the data into the data lake and then processing it).  This can speed up transformations as the data lake is usually in a Hadoop cluster that can transform data much faster than an ETL tool.  Many data warehouse solutions that use MPP technology have already switched to ELT and load data to staging tables in the MPP appliance and then transform the data to take advantage of the power of parallel SQL processing.  By changing the architecture for the analyst’s needs, we get the following benefits:
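As a minimal illustration of ELT (using Python’s built-in sqlite3 as a stand-in for the target engine; the table and column names are hypothetical): the raw rows are loaded into a staging table first, and the transformation then runs as SQL inside the database rather than in an external ETL tool before the load.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# L: load raw data as-is into a staging table
conn.execute("CREATE TABLE stg_sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO stg_sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# T: transform in-engine, letting the database's own (parallel) SQL
# processing do the heavy lifting
conn.execute("""
    CREATE TABLE fact_sales AS
    SELECT region, SUM(amount) AS total
    FROM stg_sales
    GROUP BY region
""")

print(conn.execute("SELECT region, total FROM fact_sales ORDER BY region").fetchall())
# [('east', 150.0), ('west', 75.0)]
```

On an MPP appliance or a Hadoop cluster the same pattern pays off because the transform step scales with the engine, not with the ETL server.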

  • Entire “universe” of data is captured and maintained
  • Mining of data via transformation on read leaves all data in place
  • Refineries leverage the power of the cloud and traditional technologies
  • Integration with traditional data warehousing methodologies
  • Scale can be pushed to cloud for more horsepower
  • Orchestration of data is a reality (less rigid, more flexible, operational)
  • Democratization of predictive analytics, data sets, services and reports

A question I hear a lot is “Should we store relational data in the data lake?”.  In most cases it is not necessary to copy relational source data into the data lake and then into the data warehouse, especially considering the effort to migrate existing ETL jobs that already copy source data into the data warehouse.  But there are some good use cases for doing just that:

  1. Wanting to offload the data refinement to Hadoop, so the processing and space on the EDW is reduced
  2. Wanting to use Hadoop technologies/tools to refine/filter data that are not supported by your EDW (e.g. JSON, images, video)
  3. Landing zone for unstructured data, as it can ingest large files quickly and provide data redundancy
  4. ELT jobs on the EDW are taking too long because of increasing data volumes and an increasing rate of ingestion (velocity), so offload some of them to the Hadoop data lake
  5. There may be cases when you want to move EDW data to Hadoop, refine it, and move it back to the EDW (offloading processing, or needing to use Hadoop tools)
  6. The data lake is a good place for data that you “might” use down the road.  You can land it in the data lake and have users query it with SQL via PolyBase to determine whether it has value

Note there are technologies, such as PolyBase, that allow end users to query data in a data lake using regular SQL, so they are not required to learn any Hadoop-related technologies.  In fact, PolyBase allows the end user to use SQL, or any reporting tool that uses SQL, to join data in a relational database with data in a Hadoop cluster.
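PolyBase itself is configured with T-SQL (external data sources and external tables).  Purely to illustrate the idea of one SQL query spanning relational rows and file-based lake data, here is a rough simulation using Python and sqlite3 in which the “lake” file is parsed and staged by hand; all names are hypothetical, and this is not the PolyBase API.

```python
import csv
import io
import sqlite3

# A file sitting in the "lake" (CSV text standing in for a file on HDFS/blob storage).
lake_file = "customer_id,clicks\n1,42\n2,7\n"

conn = sqlite3.connect(":memory:")

# The relational side: an ordinary database table.
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# Expose the file's rows to the SQL engine.  PolyBase does this declaratively
# (CREATE EXTERNAL TABLE); here we stage the parsed rows manually.
conn.execute("CREATE TABLE ext_clicks (customer_id INTEGER, clicks INTEGER)")
conn.executemany(
    "INSERT INTO ext_clicks VALUES (?, ?)",
    [(int(r["customer_id"]), int(r["clicks"]))
     for r in csv.DictReader(io.StringIO(lake_file))],
)

# One SQL query joining "relational" and "lake" data.
rows = conn.execute("""
    SELECT c.name, e.clicks
    FROM customers c JOIN ext_clicks e ON c.id = e.customer_id
    ORDER BY c.name
""").fetchall()
print(rows)  # [('Ann', 42), ('Bob', 7)]
```

The point of the real technology is that the staging step disappears: the external table definition makes the file queryable in place, so any SQL-speaking tool can join against it.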

As the diagram below shows, data from all types of sources can be brought into the data lake and transformed/filtered/cleaned/refined.  Within the data lake you will likely want to have various stages of the data, such as raw and cleaned.  There should also be governance on the data in the data lake, something I see many companies skipping (see Apache Atlas Project Proposed for Hadoop Governance).

The cleaned data is then copied to any analytic platform, as some analytical workloads go beyond the traditional workloads seen in data warehouses and are more likely to be processed on platforms better suited to these kinds of workloads, such as analysis of data in motion and graph analysis.  The spectrum of analytical workloads is now so broad that they cannot all be dealt with in a single enterprise data warehouse.  But also keep in mind that production reports can be generated right from the data lake, in addition to the data lake being an area for exploratory analysis.

More info:

Hadoop and Data Warehouses

The Modern Data Warehouse

Analysts Warn of Data Lake ‘Fallacy’ in Big Data Analytics

Make Sure Your Data Lake is Both Just-in-Case and Just-in-Time

Top Five Differences between Data Lakes and Data Warehouses

Hadoop vs Data Warehouse: Apples & Oranges?

Martin Rennhackkamp: Data Lake, Data Lake vs Data Warehouse, Building a Data Lake, Revolt against the Data Lake as Staging Area

Design Tip #174 Does Your Organization Need an Analytic Sandbox?

The Data Lake: A Brief SWOT Analysis

The Hadoop Data Refinery and Enterprise Data Hub

Gartner Says Beware of the Data Lake Fallacy

Data Lake vs Data Warehouse: Key Differences

Why Do I Need A Data Lake?

Hadoop Data Lakes in the Modern Data Platform

Posted in Data warehouse, Hadoop, SQLServerPedia Syndication | 4 Comments

Creating a large data warehouse in Azure

Microsoft Azure provides two options for hosting your SQL Server-based data warehouse: Microsoft Azure SQL Database and SQL Server in an Azure Virtual Machine.  Which one is appropriate based on the size of the data warehouse?  And what hardware features can you choose from for an Azure VM hosting a large data warehouse?

Let’s look at each option.

Microsoft Azure SQL Database is a Platform-as-a-service (PaaS), or more specifically a relational database-as-a-service.  It is built on standardized hardware and software that is owned, hosted, and maintained by Microsoft.  You can develop directly on the service using built-in features and functionality.  When using Azure SQL Database, you pay-as-you-go with options to scale up or out for greater power.  Azure SQL Database has a max database size of 1TB (see Azure SQL Database Service Tiers and Performance Levels).  Also, not all the SQL Server 2014 Transact-SQL statements are supported in Azure SQL Database (see Azure SQL Database Transact-SQL Reference) and it does not support SQL Server instance level features (such as, SQL Server Agent, Analysis Services, Integration Services, or Reporting Services).  But there are some features that land first in Azure SQL Database before on-prem SQL Server, such as Row-Level Security and Dynamic Data Masking.  Note that Azure SQL Database has built-in fault tolerance infrastructure capabilities that enable high availability as well as business continuity options (see Azure SQL Database Business Continuity).  So this option is only appropriate if you have a relatively small data warehouse that does not require full SQL support.

The other option, which will almost always be the correct choice for a large data warehouse, is to create an Azure VM that has SQL Server 2014 installed, resulting in Infrastructure-as-a-Service (IaaS).  This allows you to run SQL Server inside a virtual machine in the cloud.  Similar to Azure SQL Database, it is built on standardized hardware that is owned, hosted, and maintained by Microsoft.  When using SQL Server in a VM, you can either bring your own SQL Server license to Azure (by uploading a Windows Server VHD to Azure) or use one of the preconfigured SQL Server images in the Azure portal.  If going with a preconfigured image you should choose “SQL Server 2014 Enterprise Optimized for Data Warehousing on Windows Server 2012 R2” (see VM Images Optimized for Transactional and DW workloads in Azure VM Gallery), which will attach 15 data disks (12 disks for a 12TB data pool and 3 disks for a 3TB log pool).  You will also need to choose the virtual machine size for your VM.  Note you can set up high availability and disaster recovery solutions (see High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines).  The resulting VM will be very similar to an on-prem SQL Server solution except for the various hardware configurations that you have to choose from for your virtual machine size.

If you look at the Azure Virtual Machines Pricing for SQL Server, here are the options you would want to consider for your virtual machine size:

  • A7, 8 cores, 56GB memory, 605GB max disk size, 16 Persistent 1TB Data Disks Max, $3/hr SQL Enterprise
  • A9, 16 cores, 112GB RDMA memory, 382GB max disk size, 40Gbit/s InfiniBand, 16 Persistent 1TB Data Disks Max, $6/hr SQL Enterprise
  • D14, 16 cores, 112GB memory, 800GB max disk size (SSD), 32 Persistent 1TB Data Disks Max, cpu 60% faster than A-series, $6/hr SQL Enterprise
  • G5, 32 cores, 448GB memory, 6596GB max disk size (SSD), 64 Persistent 1TB Data Disks Max, cpu Xeon E5 v3, $12/hr SQL Enterprise.  See Largest VM in the Cloud

The “Persistent 1TB Data Disks” refers to attaching external data disks, but be aware that network bandwidth can be a bottleneck.  From Virtual Machine and Cloud Service Sizes for Azure you can see that for G5 you can attach 64 1TB data disks, yielding a potential total volume size of up to 64TB.  See How to Attach a Data Disk to a Windows Virtual Machine to attach a drive and give it a new drive letter, or attach multiple drives and combine them into a storage space, which makes multiple disks appear as one (see Windows Server 2012 Storage Virtualization Explained).

If this still is not enough disk size for your data warehouse, you will need to use an on-prem SQL Server solution or an MPP solution such as Microsoft’s Analytics Platform System.

There is a script you can download (see Deploy a SQL Server Data Warehouse in Windows Azure Virtual Machines) that creates a data-warehousing-optimized VM on Azure running SQL Server 2012 or SQL Server 2014 and also attaches empty disks to the VM to be used for SQL Server data and log files.


For info on how to connect to an Azure VM using SSMS, click here.  To connect SSMS to an Azure SQL database, click here.

If you are concerned about data throughput from on-prem to Azure, check out the ExpressRoute service.  ExpressRoute enables dedicated, private, high-throughput network connectivity between Azure datacenters and your on-premises IT environments.  Using ExpressRoute, you can connect your existing datacenters to Azure without having to flow any traffic over the public Internet, and it enables guaranteed network quality-of-service and the ability to use Azure as a natural extension of an existing private network or datacenter.

More info:

Introduction to Automating Deployment of SQL Server in Azure IaaS Virtual Machines

Understanding Azure SQL Database and SQL Server in Azure VMs

Inside Microsoft Azure SQL Database

Automating Image-Based Deployment of SQL Server on Azure IaaS VMs – Preparing OS Image

Posted in Azure, Data warehouse, SQLServerPedia Syndication | 6 Comments

Operational Data Store (ODS) Defined

I see a lot of confusion on what exactly an Operational Data Store (ODS) is.  While it can mean different things to different people, I’ll explain what I see as the most common definition.  First let me mention that an ODS is not a data warehouse or data mart.  A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting.  It acts as a central repository for many subject areas and contains the “single version of truth”.  A data mart serves the same purpose but comprises only one subject area.  Think of a data warehouse as containing multiple data marts.  See my other blogs that discuss this in more detail: Data Warehouse vs Data Mart, Building an Effective Data Warehouse Architecture, and The Modern Data Warehouse.

The purpose of an ODS is to integrate corporate data from different heterogeneous data sources in order to facilitate operational reporting in real-time or near real-time.  Usually data in the ODS will be structured similarly to the source systems, although during integration the data can be cleaned, denormalized, and have business rules applied to ensure data integrity.  This integration happens at the lowest granular level and occurs quite frequently throughout the day.  Normally an ODS will not be optimized for historical and trend analysis, as this is left to the data warehouse.  And an ODS is frequently used as a data source for the data warehouse.

To summarize the differences between an ODS and a data warehouse:

  • An ODS is targeted for the lowest granular queries whereas a data warehouse is usually used for complex queries against summary-level or aggregated data
  • An ODS is meant for operational reporting and supports current or near real-time reporting requirements whereas a data warehouse is meant for historical and trend analysis reporting usually on a large volume of data
  • An ODS contains only a short window of data, while a data warehouse contains the entire history of data
  • An ODS provides information for operational and tactical decisions on current or near real-time data while a data warehouse delivers feedback for strategic decisions leading to overall system improvements
  • In an ODS the frequency of data load could be every few minutes or hourly whereas in a data warehouse the frequency of data loads could be daily, weekly, monthly or quarterly

Major reasons for implementing an ODS include:

  • The limited reporting in the source systems
  • The desire to use a better and more powerful reporting tool than what the source systems offer
  • Only a few people have the security access needed to reach the source systems, and you want to allow others to generate reports
  • A company owns many retail stores each of which track orders in its own database and you want to consolidate the databases to get real-time inventory levels throughout the day
  • You need to gather data from various source systems to get a true picture of a customer, so you have the latest info if the customer calls customer service: customer info, support history, call logs, and order info.  Or medical data to get a true picture of a patient, so the doctor has the latest info throughout the day: outpatient department records, hospitalization records, diagnostic records, and pharmaceutical purchase records
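The consolidation idea in the last two bullets can be sketched in a few lines.  This is a minimal, hypothetical illustration (the sources, fields, and timestamps are invented): merge per-source records for one customer into a single current view, with the most recent value winning for each field.

```python
from datetime import datetime

# Hypothetical records about one customer, arriving from three source systems.
source_records = [
    {"source": "crm",     "customer_id": 7, "ts": "2015-05-01T09:00", "phone": "555-0100"},
    {"source": "support", "customer_id": 7, "ts": "2015-05-01T11:30", "last_ticket": "T-88"},
    {"source": "orders",  "customer_id": 7, "ts": "2015-05-01T10:15", "phone": "555-0199"},
]

def consolidate(records):
    """Latest-value-wins merge across heterogeneous sources."""
    view = {}
    # Process records in timestamp order so later values overwrite earlier ones.
    for rec in sorted(records, key=lambda r: datetime.strptime(r["ts"], "%Y-%m-%dT%H:%M")):
        for field, value in rec.items():
            if field not in ("source", "ts"):
                view[field] = value
    return view

print(consolidate(source_records))
# {'customer_id': 7, 'phone': '555-0199', 'last_ticket': 'T-88'}
```

A real ODS would run this kind of merge continuously throughout the day, so whoever takes the customer’s call sees the phone number from the orders system and the ticket from the support system in one place.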

More info:

Comparing Data Warehouse Design Methodologies for Microsoft SQL Server

Operational Data Stores (ODS)

The Operational Data Store

Defining the Purpose of the Operational Data Store

Operational data store – Implementation and best practices

Posted in Data warehouse, SQLServerPedia Syndication | 5 Comments