Relational databases vs Non-relational databases

I see a lot of confusion about the place and purpose of the many new database solutions (“NoSQL databases”) compared to the relational database solutions that have been around for many years.  So let me try to explain the differences and the best use cases for each.

First, let’s classify these database solutions into two groups:

1) Relational databases, which can also be called relational database management systems (RDBMS) or SQL databases.  The most popular of these are Microsoft SQL Server, Oracle Database, MySQL, and IBM DB2.  These RDBMS’s are mostly used in large enterprise scenarios, with the exception of MySQL, which is mostly used to store data for web applications, typically as part of the popular LAMP stack (Linux, Apache, MySQL, PHP/ Python/ Perl).

2) Non-relational databases, also called NoSQL databases, the most popular being MongoDB, DocumentDB, Cassandra, Couchbase, HBase, Redis, and Neo4j.  These databases are usually grouped into four categories: Key-value stores, Graph stores, Column stores, and Document stores (see Types of NoSQL databases).

All relational databases can be used to manage transaction-oriented applications (OLTP), and most non-relational databases in the Document store and Column store categories can also be used for OLTP, adding to the confusion.  OLTP databases can be thought of as “Operational” databases, characterized by frequent, short transactions that include updates, touch a small amount of data, and where concurrency of thousands of transactions is very important (examples include banking applications and online reservations).  Integrity of data is very important, so they support ACID transactions (Atomicity, Consistency, Isolation, Durability).  This is opposed to data warehouses, which are considered “Analytical” databases characterized by long, complex queries that touch a large amount of data and require a lot of resources.  Updates are infrequent.  An example is analysis of sales over the past year.

Relational databases usually work with structured data, while non-relational databases usually work with semi-structured data (e.g., XML, JSON).

Let’s look at each group in more detail:

Relational Databases

A relational database is organized based on the relational model of data, as proposed by E.F. Codd in 1970.  This model organizes data into one or more tables (or “relations”) of rows and columns, with a unique key for each row.  Generally, each entity type described in a database has its own table, with the rows representing instances of that type of entity and the columns representing values attributed to that instance.  Since each row in a table has its own unique key, rows in a table can be linked to rows in other tables by storing the unique key of the row to which it should be linked (this stored key is known as a “foreign key”).  Codd showed that data relationships of arbitrary complexity can be represented using this simple set of concepts.

Virtually all relational database systems use SQL (Structured Query Language) as the language for querying and maintaining the database.
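To make that concrete, here is a minimal sketch using Python’s built-in sqlite3 module (the table and column names are invented for illustration; any RDBMS would express the same idea in its own SQL dialect):

```python
import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Each entity type gets its own table with a unique key per row.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    )""")

# The orders table links back to customer via a foreign key.
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        total       REAL NOT NULL
    )""")

conn.execute("INSERT INTO customer VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO orders VALUES (100, 1, 59.99)")

# SQL expresses the relationship as a join at query time.
for row in conn.execute("""
        SELECT c.name, o.order_id, o.total
        FROM customer c
        JOIN orders o ON o.customer_id = c.customer_id"""):
    print(row)   # ('Jane Doe', 100, 59.99)
```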

The reasons for the dominance of relational databases are: simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data.

But to offer all of this, relational databases have to be incredibly complex internally.  For example, a relatively simple SELECT statement could have dozens of potential query execution paths, which a query optimizer evaluates at run time.  All of this is hidden from users, but under the hood, the RDBMS determines the best “execution plan” to answer requests by using things like cost-based algorithms.

For large databases, especially ones used for web applications, the main concern is scalability.  As more and more applications are created in environments that have massive workloads (e.g., Amazon), their scalability requirements can change very quickly and grow very large.  Relational databases scale well, but usually only when that scaling happens on a single server (“scale-up”).  When the capacity of that single server is reached, you need to “scale-out” and distribute that load across multiple servers, moving into so-called distributed computing.  This is when the complexity of relational databases starts to cause problems with their potential to scale.  If you try to scale to hundreds or thousands of servers, the complexities become overwhelming.  The characteristics that make relational databases so appealing are the very same ones that drastically reduce their viability as platforms for large distributed systems.

Non-relational databases

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.

Motivations for this approach include:

  1. Simplicity of design.  Not having to deal with the “impedance mismatch” between the object-oriented approach used to write applications and the schema-based tables and rows of a relational database.  For example, storing all the customer order info in one document, as opposed to having to join many tables together, results in less code to write, debug, and maintain (see the sketch after this list)
  2. Better “horizontal” scaling to clusters of machines, which solves the problem when the number of concurrent users skyrockets for applications that are accessible via the web and mobile devices.  Using documents makes it much easier to scale-out, as all the info for a customer order is contained in one place instead of being spread out over multiple tables.  NoSQL databases natively and automatically spread data across an arbitrary number of servers (auto-sharding), without requiring application changes or the application even being aware of the composition of the server pool.  Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption
  3. Finer control over availability.  Servers can be added or removed without application downtime.  Most NoSQL databases support data replication, storing multiple copies of data across the cluster or even across data centers, to ensure high availability and disaster recovery
  4. To easily capture all kinds of data (“Big Data”), including unstructured and semi-structured data.  This allows for a flexible database that can easily and quickly accommodate any new type of data and is not disrupted by content structure changes.  This is because document databases are schemaless, allowing you to freely add fields to JSON documents without having to first define schema changes (schema-on-read instead of schema-on-write).  You can have documents with a different number of fields than other documents.  For example, a patient record may or may not contain fields that list allergies
  5. Speed.  The data structures used by NoSQL databases (e.g., JSON documents) differ from those used by default in relational databases, making many operations faster in NoSQL than in relational databases because there are no tables to join (at the cost of increased storage space due to duplication of data – but storage space is so cheap nowadays this is usually not an issue).  In fact, most NoSQL databases do not even support joins
  6. Cost.  NoSQL databases usually use clusters of cheap commodity servers, while RDBMSs tend to rely on expensive proprietary servers and storage systems.  Also, the licenses for RDBMS systems can be quite expensive, while many NoSQL databases are open source and therefore free
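As a small, hypothetical illustration of the first two motivations (the field names are invented), here is a customer order modeled as a single JSON document rather than as rows spread across several normalized tables:

```python
import json

# Document model: everything about the order lives in one place, so reading it
# (or sharding it to another server) requires no joins across tables.
order_document = {
    "order_id": 1001,
    "customer": {"name": "Jane Doe", "email": "jane@example.com"},
    "items": [
        {"sku": "ABC-123", "qty": 2, "price": 19.99},
        {"sku": "XYZ-789", "qty": 1, "price": 5.49},
    ],
    "shipping_address": {"city": "Boston", "zip": "02110"},
}
print(json.dumps(order_document, indent=2))

# In the relational model the same information would typically be normalized into
# separate tables (orders, customers, order_items, addresses) joined by keys.
```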

The particular suitability of a given NoSQL database depends on the problem it must solve.

NoSQL databases are increasingly used in big data and real-time web applications.  They became popular with the introduction of the web, when databases went from a max of a few hundred users on an internal company application to thousands or millions of users on a web application.  NoSQL systems are also called “Not only SQL” to emphasize that they may also support SQL-like query languages.

Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance.  Some reasons that block adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and huge existing investments in SQL.  Also, most NoSQL stores lack true ACID transactions, or only support transactions in certain circumstances and at certain levels (e.g., document level).  Finally, RDBMSs are usually much simpler to use as they have GUIs, whereas many NoSQL solutions use a command-line interface.

Comparing the two

One of the most severe limitations of relational databases is that each field in a row can hold only a single, atomic value.  If we use a bank example, each aspect of a customer’s relationship with a bank is stored as separate row items in separate tables.  So the customer’s master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on.  All these tables are linked to each other through the use of relations such as primary keys and foreign keys.

Non-relational databases, specifically key-value stores, are radically different from this model.  Key-value pairs allow you to store several related items in one “row” of data in the same table.  We place the word “row” in quotes because a row here is not really the same thing as the row of a relational table.  For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan and investment details.  All data relating to one customer would be conveniently stored together as one record.

This seems an obviously superior method of storing data, but it has a major drawback: key-value stores, unlike relational databases, cannot enforce relationships between data items.  For instance, in our key-value database, the customer details (name, social security, address, account number, loan processing number, etc.) would all be stored as one data record (instead of being stored in several tables, as in the relational model).  The customer’s transactions (account withdrawals, account deposits, loan repayments, bank charges, etc.) would also be stored as another single data record.

In the relational model, there is a built-in and foolproof method of ensuring and enforcing business logic and rules at the database layer, for instance that a withdrawal is charged to the correct bank account, through primary keys and foreign keys.  In key-value stores, this responsibility falls squarely on the application logic, and many people are very uncomfortable leaving this crucial responsibility just to the application.  This is one reason why relational databases will continue to be used.
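Since a key-value store will not enforce that relationship for you, the responsibility moves into application code.  A minimal sketch (plain Python dicts stand in for the key-value store; the names and rules are invented) of what that looks like:

```python
# Plain dicts stand in for a key-value store: one record per account, one per transaction.
accounts = {"acct-42": {"customer": "Jane Doe", "balance": 500.00}}
transactions = {}

def record_withdrawal(txn_id, account_id, amount):
    # The "foreign key" check lives in application code, not in the database.
    account = accounts.get(account_id)
    if account is None:
        raise ValueError(f"unknown account {account_id}: only this code prevents a bad reference")
    if amount > account["balance"]:
        raise ValueError("insufficient funds")
    account["balance"] -= amount
    transactions[txn_id] = {"account_id": account_id, "type": "withdrawal", "amount": amount}

record_withdrawal("txn-1", "acct-42", 100.00)   # ok
# record_withdrawal("txn-2", "acct-99", 50.00)  # would raise: no such account
```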

However, when it comes to web-based applications that use databases, rigorously enforcing business logic is often not a top priority.  The highest priority is the ability to service large numbers of user requests, which are typically read-only queries.  For example, on a site like eBay, the majority of users simply browse and look through posted items (read-only operations).  Only a fraction of these users actually place bids or reserve the items (read-write operations).  And remember, we are talking about millions, sometimes billions, of page views per day.  The eBay site administrators are more interested in quick response time to ensure faster page loading for the site’s users, rather than the traditional priorities of enforcing business rules or ensuring a balance between reads and writes.

Relational-model databases can be tweaked and set up to run large-scale read-only operations through data warehousing, and thus potentially serve a large amount of users who are querying a large amount of data, especially when using relational MPP architectures like Analytics Platform System, Teradata, Oracle Exadata, or IBM Netezza, which all support scaling.  As mentioned before, data warehouses are distinct from typical databases in that they are used for more complex analysis of data.  This differs from the transactional (OLTP) database, whose main use is to support operational systems and offer day-to-day, small scale reporting.

However, the real challenge is the relational model’s lack of scalability when dealing with OLTP applications, or any solution with a lot of individual writes, which is the domain of relational SMP architectures.  This is where non-relational models can really shine.  They can easily distribute their data loads across dozens, hundreds and, in extreme cases (think Google search), even thousands of servers.  With each server handling only a small percentage of the total requests from users, response time is very good for each individual user.  Although this distributed computing model can be built for relational databases, it is a real pain to implement, especially when there are a lot of writes (i.e., OLTP), requiring techniques like sharding, which usually requires significant coding outside of the application’s business logic (a simple sharding sketch follows below).  This is because the relational model insists on data integrity at all levels, which must be maintained even as the data is accessed and modified by several different servers.  This is why the non-relational model is the architecture of choice for web applications such as cloud computing and social networking.
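As a rough sketch of the sharding technique mentioned above (not any particular product’s implementation; the server names are hypothetical), applications typically hash a key to decide which server owns a given customer’s data:

```python
SHARDS = ["db-server-01", "db-server-02", "db-server-03", "db-server-04"]  # hypothetical servers

def shard_for(customer_id: int) -> str:
    # Hash-based routing: each server handles only a slice of the customers.
    return SHARDS[hash(customer_id) % len(SHARDS)]

# Every read/write for a customer must be routed to the right shard, and any
# cross-shard operation (joins, multi-customer transactions) needs extra application code.
print(shard_for(12345))
```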

So in summary, RDBMSs do not scale horizontally for high transaction loads (millions of reads/writes), while NoSQL databases handle high transaction loads but at the cost of data integrity and joins.

Keep in mind many solutions will use a combination of relational and non-relational databases (see What is Polyglot Persistence?).

Also keep in mind that you may not need the performance of a non-relational database; instead, simply storing files in HDFS and using Apache Hive may be enough (Apache Hive is a data warehouse infrastructure built on top of Hadoop that provides data summarization, query, and analysis via an SQL-like language called HiveQL).  A brief sketch of querying through Hive follows.
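A hedged sketch of what that looks like from Python (using the PyHive library; the host, table, and query are hypothetical): a HiveQL query over files in HDFS is just SQL-like text submitted to HiveServer2.

```python
from pyhive import hive

# Hypothetical HiveServer2 endpoint; the table is assumed to be defined over files in HDFS.
conn = hive.Connection(host="hadoop-edge-node", port=10000, database="default")
cur = conn.cursor()

# HiveQL looks like SQL but is compiled to MapReduce/Tez/Spark jobs over the files.
cur.execute("""
    SELECT product, SUM(amount) AS total_sales
    FROM sales_raw
    WHERE sale_date >= '2015-01-01'
    GROUP BY product
""")
for row in cur.fetchall():
    print(row)
```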

And to end on a note that adds to the confusion, we have another category forming called NewSQL: NewSQL is a class of modern RDBMSs that seek to provide the same scalable performance of NoSQL systems for OLTP read-write workloads while still maintaining the ACID guarantees of a traditional relational database system.  The disadvantages are that they are not suited for OLAP-style queries and they are inappropriate for databases over a few terabytes.  Examples include VoltDB, NuoDB, MemSQL, SAP HANA, Splice Machine, Clustrix, and Altibase.

A picture showing the categories that many of the products fit into:

[Image: categories of relational, NoSQL, and NewSQL database products]

More info:

MySQL vs MongoDB

MySQL vs. MongoDB: Looking At Relational and Non-Relational Databases

10 things you should know about NoSQL databases

Introduction to Databases

Difference between SQL and NoSQL : Comparision

SQL vs NoSQL Database Differences Explained with few Example DB

NoSQL, NewSQL, or RDBMS: How To Choose

NewSQL – RDBMS on Steroids

NoSQL vs NewSQL Databases Choose the Right Tool for the Right Job


Hadoop and Microsoft

In my Introduction to Hadoop I talked about the basics of Hadoop.  In this post, I wanted to cover some of the more common Hadoop technologies and tools and show how they work together, in addition to showing how they work well with Microsoft technologies and tools.  So you don’t have to choose between going with Open-source software (OSS) and going with Microsoft.  Instead, they can be combined and work together very nicely.

In short, Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.  The base Apache Hadoop framework is composed of the following modules:

  • Hadoop Common – Contains libraries and utilities needed by other Hadoop modules
  • Hadoop Distributed File System (HDFS) – A distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
  • Hadoop MapReduce – A programming model for large scale data processing.  It is designed for batch processing.  Although the Hadoop framework is implemented in Java, MapReduce applications can be written in other programming languages (R, Python, C#, etc.), but Java is the most popular (a toy illustration of the model follows this list)
  • Hadoop YARN – YARN is a resource manager introduced in Hadoop 2 that was created by separating the processing engine and resource management capabilities of MapReduce as it was implemented in Hadoop 1 (see Hadoop 1.0 vs Hadoop 2.0).  YARN is often called the operating system of Hadoop because it is responsible for managing and monitoring workloads, maintaining a multi-tenant environment, implementing security controls, and managing high availability features of Hadoop
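To illustrate the MapReduce programming model mentioned above, here is a toy, single-process word count in Python showing the map, shuffle, and reduce steps (a real Hadoop job would distribute these steps across the cluster):

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit (key, value) pairs from each input record.
def map_phase(doc):
    for word in doc.split():
        yield (word, 1)

# Shuffle: group values by key (Hadoop does this across the cluster).
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce: combine all values for each key.
def reduce_phase(key, values):
    return (key, sum(values))

print([reduce_phase(k, v) for k, v in grouped.items()])
# e.g. [('the', 3), ('quick', 2), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 2)]
```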

The term “Hadoop” has come to refer not just to the base modules above, but also to the “ecosystem”, or collection of additional software packages that can be installed on top of or alongside Hadoop.  There are hundreds of such software packages (see the Apache Projects Directory) that are frequently updated, making it difficult to build a Hadoop solution that uses multiple software packages.  Making things much easier are companies that provide “Hadoop in a box”.  These are ready-to-use platforms that contain numerous Hadoop software packages pre-installed and where all the packages are tested to ensure they all work together.  These include Microsoft’s HDInsight, Hortonworks Data Platform (HDP), Cloudera’s CDH, and MapR’s Distribution.

Microsoft’s HDInsight is simply a managed Hadoop service on the cloud (via Microsoft Azure) built on Apache Hadoop that is available for Windows and Linux.  Microsoft Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed and Microsoft partner hosted datacenters.  Behind the covers HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution.  Although there are many software packages installed by default on a HDInsight distribution, you can add any additional packages if you wish via a Script Action (see Customize HDInsight clusters using Script Action).  With HDInsight you can easily build a Hadoop cluster in the cloud in minutes.  The cluster will consist of virtual machines with the data stored separately in Azure Storage Blobs instead of HDFS (see Why use Blob Storage with HDInsight on Azure).  The great thing about this is you can shut down the cluster and the data will remain (or you can store the data in the native HDFS file system that is local to the compute nodes but would lose the data if you deleted your cluster).  This can result in substantial savings as you can do things like shut down the cluster on weekends if it is not needed.  Since HDInsight is 100 percent Apache-based and not a special Microsoft version, this means as Hadoop evolves, Microsoft will embrace the newer versions.  Moreover, Microsoft is a major contributor to the Hadoop/Apache project and has provided a great deal of time and effort into improving various Hadoop tools.

Shown below are the software packages and their versions included in the Hortonworks Data Platform:

[Image: Hortonworks Data Platform software packages and versions]

HDInsight is currently using HDP 2.2 and will be upgraded in the next couple of months to HDP 2.3 (see What’s new in the Hadoop cluster versions provided by HDInsight?).  Microsoft also has a new partner program for Hadoop independent software vendors (ISVs) to expand their offerings on to HDInsight (see Azure HDInsight launches ISV Partner Program at WPC).

Pictured below is the HDInsight platform that shows how the most popular Hadoop software packages fit together:

  • Hive – A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.  The query language is similar to standard SQL statements and is called Hive Query Language (HQL).  It converts queries to MapReduce, Apache Tez and Spark jobs (all three execution engines can run in Hadoop YARN).  This makes querying data much easier as you are writing SQL instead of Java code
  • Storm – A distributed real-time computation system for processing fast, large streams of data, adding real-time data processing to Apache Hadoop
  • HBase – A scalable, distributed NoSQL database that supports structured data storage for large tables
  • Tez – Provides API’s to write YARN apps.  It is a replacement for MapReduce since it is much faster.  It will process data for both batch and interactive use-cases
  • Spark – Spark is intended to be a drop-in replacement for Hadoop MapReduce, providing the main benefit of improved performance.  It does not require YARN.  It is up to 100x faster than MapReduce, can also perform real-time analysis of data, and supports interactive processing.  It can run on HDFS as well as the Cassandra File System (CFS)

[Image: HDInsight platform showing how the most popular Hadoop software packages fit together]

Data in a Hadoop solution can be processed either in a batch or in real time.

For batch processing, your choices are MapReduce or Spark.  MapReduce was the first and only way for a long time, but now Spark has become the best choice.  Spark has a number of benefits over MapReduce, such as performance, a unified programming model (it can be used for both batch and real-time data processing), a richer and simpler API, and multiple datastore support.  A small Spark sketch follows.
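Here is a small Spark sketch using the PySpark API (the file path is hypothetical, and on a real cluster such as HDInsight the session setup would differ):

```python
from pyspark.sql import SparkSession

# Local SparkSession for illustration; on a cluster the master/URL would differ.
spark = SparkSession.builder.appName("wordcount-sketch").master("local[*]").getOrCreate()

# Read a text file (path is hypothetical) and run a word count as a Spark job.
lines = spark.read.text("/data/sample.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.take(10))
spark.stop()
```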

For real-time processing, the most popular tools to use are Kafka and Storm for event queuing, and Cassandra or HBase for storage.  These are all OSS solutions but there are many Microsoft equivalent products should you choose to go that route (with most of the Microsoft products easier to use than the OSS products).  The equivalent on the Microsoft stack is Azure Event Hub for Kafka, Azure Stream Analytics for Storm, and Azure blob storage/Azure Data Lake for HBase.

[Image: Azure Stream Analytics]

Note that Spark is quickly becoming the tool of choice over Storm.  The difference in the way they work is that Storm processes one event at a time, while Hadoop or Spark processes a batch at a time.  The batch can be large, the batch can be small, or the batch can be really small, which is called a micro-batch.  The paradigm in Spark that processes a micro-batch is called Spark Streaming.

Storm & Trident (Trident is a high-level abstraction for doing real-time computing on top of Storm) were previously the most popular solution for real-time streaming, but now the most popular is Spark Streaming (an abstraction on Spark to perform stateful stream processing).  Spark is also supported on HDInsight (see Announcing Spark for Azure HDInsight public preview).
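A minimal sketch of the Spark Streaming micro-batch idea using PySpark’s DStream API (the socket host and port are hypothetical):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)   # each micro-batch covers 5 seconds

# Listen on a (hypothetical) socket and count words per micro-batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```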

[Image: Spark Streaming]

I hope this has provided a clearer picture to you on which Hadoop technologies are the most popular for building solutions today and how Microsoft supports those technologies via HDInsight.  You can learn more about HDInsight and try it for free by clicking here.


Analytics Platform System (APS) AU4 released

The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), has just released an appliance update (AU4), which is sort of like a service pack, except that it includes many new features.  Below is what is new in this release:

AU4 continues to improve how SQL Server customers can migrate to APS and existing customers can upgrade by enhancing T-SQL Compatibility, adding support for next generation hardware from our IHV partners (Dell, Quanta, and HP), improved data return rates to power advanced analytics scenarios, and workload level upgrades reducing upgrade durations by 50%.  These features combine to offer additional functionality, reliability, and compatibility for appliance customers while reducing the downtime when moving to AU4.

What’s New in the Release:

TSQL compatibility improvements to reduce migration friction from SQL SMP

  • Implements scalar user-defined functions – CREATE/DROP FUNCTION statements (see the sketch after this list)
  • Implements PIVOT/UNPIVOT relational operators
  • Introduces Round Robin as a third table type for hybrid performance
  • Query plan stability favoring hash joins over nested loop joins in DW scenarios
  • Adds a new DBCC command – DBCC DROPCLEANBUFFERS
  • Merges the PDW Migration Advisor and Migration Utility into a single utility for DDL validation and creation with rules and best practices integration
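To give a feel for the first two T-SQL additions above (scalar UDFs and PIVOT), here is a hedged sketch submitted from Python via pyodbc; the connection string, tables, and function name are all hypothetical:

```python
import pyodbc

# Hypothetical connection to an APS appliance; adjust driver/server/credentials to your environment.
conn = pyodbc.connect("DRIVER={SQL Server Native Client 11.0};SERVER=aps-appliance,17001;"
                      "DATABASE=SalesDW;UID=user;PWD=password")
cur = conn.cursor()

# Scalar user-defined function (newly supported in AU4).
cur.execute("""
    CREATE FUNCTION dbo.ToFahrenheit (@celsius FLOAT) RETURNS FLOAT AS
    BEGIN
        RETURN (@celsius * 9.0 / 5.0) + 32.0
    END
""")
conn.commit()

# PIVOT relational operator (newly supported in AU4): quarterly sales as columns.
cur.execute("""
    SELECT product, [1] AS Q1, [2] AS Q2, [3] AS Q3, [4] AS Q4
    FROM (SELECT product, quarter, amount FROM dbo.Sales) AS src
    PIVOT (SUM(amount) FOR quarter IN ([1], [2], [3], [4])) AS pvt
""")
for row in cur.fetchall():
    print(row)
```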

Performance Improvement

  • 1.5x data return rate improvement for SELECT * queries improves external analytics (SAS, R) integration

Polybase/Hadoop Enhancements

  • Enabled Cloudera 5.1 and Hortonworks Data Platform 2.1 and 2.2 support
  • Extended partial aggregate pushdowns into Hadoop
  • Add Unit for HDInsight regions within the appliance

Install, Upgrade, and Servicing Improvements

  • Reduces end-to-end upgrade time from AU3 to AU4 by 50%
  • Top hitter upgrade reliability fixes
  • Platform update to SQL Server CU8 for latest stability and reliability improvements
  • Integrated Windows Server patches to minimize patching workflows
  • Upgrade support for appliances with HDInsight regions
  • Skip level upgrade support from previous versions to enable direct upgrades to AU4

Next Generation Hardware

  • Dell, Quanta, and HP (Gen9) next generation server support
    • 2x Intel Haswell processors, 256 GB (16x16GB) 2133MHz memory
  • Updated HP 5900 series switches including High Availability improvements
  • Mixed hardware version support allows extended appliance lifetime and expandability

More info:

APS AU4 Feature Review: 1.5x data return rate improvement

Analytics Platform System Appliance Update 4 Documentation and Client Tools


Cortana Analytics Suite

Yesterday at the Microsoft World Wide Partner Conference in Orlando Microsoft announced the Cortana Analytics Suite, which is a new package of data storage, information management, machine learning, and business intelligence software in a single convenient monthly subscription.  Microsoft’s Cortana personal digital assistant, until now available to consumers on mobile devices and PCs, is part of the new package, designed to give non-technical users a simple interface that can let them ask questions of data and receive answers in response.

In short, it is a fully managed big data and advanced analytics suite that transforms your data into intelligent action.

[Image: Cortana Analytics Suite]

Here’s a video demonstrating some capabilities of the new Cortana Analytics Suite.

It will be available later this fall as an integrated offering with a simple monthly subscription model that takes advantage of all the relevant analytics and big data capabilities to make better decisions as well as intelligently automate operations.  It includes a comprehensive set of cloud services:

Category | Capabilities | Service
Preconfigured solutions | Business scenarios | Recommendations, Forecasting, Churn, etc.
Personal Digital Assistant | Personal Digital Assistant | Cortana
Perceptual intelligence | Recognition of human interactions and intent | Face, Vision, Speech and Text Analytics
Dashboards and visualizations | Dashboards and visualizations | Power BI
Machine learning & analytics | Machine learning; Hadoop; Complex event processing | Azure Machine Learning; Azure HDInsight (Hadoop); Azure Stream Analytics
Big Data Stores | Big Data repository; Elastic data warehouse | Azure Data Lake; Azure SQL Data Warehouse
Information management | Data orchestration; Data catalog; Event ingestion | Azure Data Factory; Azure Data Catalog; Event Hubs

Note that all of these services are available now so you can start using them immediately.  There will be an on-premises version of Cortana Analytics sometime in the future via the Microsoft Azure Stack (which provides enterprises with the ability to create private clouds with the same bits Microsoft uses to build Azure itself).  Cortana Analytics includes capabilities to both discover and connect to a variety of data sources including SQL Server and Azure SQL Database.

For more information on Cortana Analytics Suite visit www.microsoft.com/cortanaanalytics

More info:

Announcing Cortana Analytics Suite and New Partner Investments at WPC 2015


Microsoft Azure Data Catalog

Just announced is the Microsoft Azure Data Catalog, which is an enterprise metadata catalog / portal for the self-service discovery of data sources.  It becomes available on Monday next week, July 13, 2015.  Check out this short video on it.  My response to this is – woo hoo!  I have been waiting years for Microsoft to come up with a tool to catalog metadata and I’m excited this day has finally arrived.

From the Microsoft blog post announcement:

Businesses of every size face the challenge of sifting through their myriad data sources and discovering the right ones for a given problem. Although businesses collect and store tons of data as part of their everyday activities, too often they fail to reap the full benefit of all the data that’s being gathered. Employees too often end up spending more time searching for data than they actually do working with the data itself.

To address these problems, Azure Data Catalog uses a crowdsourced approach. Any user, for instance an analyst, data scientist or data developer, can register, enrich, discover, understand and consume data sources. Every user is empowered to register the data sources that they use. Registration extracts the structural metadata from the data source and stores it in the cloud-based Catalog, while the data itself remains in the data source.

Crowdsourced annotations let users who are knowledgeable about the data assets registered in the Catalog to enrich the system at any time. This helps others understand the data more readily, including its intended purpose and how it’s being used within the business.

Azure Data Catalog also lets users discover data sources by searching and filtering. Users can then connect to data sources using any tool of their choice, and they can similarly work with the data that they need using the tools with which they are already familiar.

Azure Data Catalog bridges the gap between IT and the business – it encourages the community of data producers, data consumers and data experts to share their business knowledge while still allowing IT to maintain control and oversight over all the data sources in their constantly evolving systems.

[Image: Azure Data Catalog]

There will be open API’s to allow 3rd parties to integrate directly with the Data Catalog for both registration and discovery of data sources.  There will be both a free and standard edition (the free version has limits on the max number of users and max number of catalog objects).

The Azure Data Catalog is an evolution of the existing Data Catalog that today ships as a feature of Power BI for Office 365.  Soon the two catalogs will merge into a single service.  The exact timings and details for how the Azure Data Catalog might integrate into the Power BI service and Power BI Designer is still to be decided.

More info:

Announcing the Public Preview of Azure Data Catalog


What is Polyglot Persistence?

Polyglot Persistence is a fancy term that means when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application.  Different kinds of data are best handled by different data stores.  In short, it means picking the right tool for the right use case.  It’s the same idea behind Polyglot Programming, which is the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems.

Looking at a Polyglot Persistence example, an e-commerce platform will deal with many types of data (e.g., shopping cart, inventory, completed orders).  Instead of trying to store all this data in one database, which would require a lot of data conversion to make the format of the data all the same, store the data in the database best suited for that type of data.  So the e-commerce platform might look like this:

[Image: e-commerce platform using multiple data stores]

So we are using a mixture of RDBMS solutions (e.g., SQL Server) with NoSQL solutions, of which there are four types: Key-Value, Document, Graph, Column (see Types of NoSQL databases).  A guideline on the database type to use based on the functionality of the data:

Functionality | Considerations | Database Type
User Sessions | Rapid access for reads and writes.  No need to be durable. | Key-Value
Financial Data | Needs transactional updates.  Tabular structure fits data. | RDBMS
POS Data | Depending on size and rate of ingest.  Lots of writes, infrequent reads mostly for analytics. | RDBMS (if modest), Key-Value or Document (if ingest very high), or Column if analytics is key
Shopping Cart | High availability across multiple locations.  Can merge inconsistent writes. | Document (Key-Value maybe)
Recommendations | Rapidly traverse links between friends, product purchases, and ratings. | Graph (Column if simple)
Product Catalog | Lots of reads, infrequent writes.  Products make natural aggregates. | Document
Reporting | SQL interfaces well with reporting tools. | RDBMS, Column
Analytics | Large-scale analytics on a large cluster. | Column
User activity logs, CSR logs, Social Media analysis | High volume of writes on multiple nodes. | Key-Value or Document

With an application that uses many types of data, a web service can be created to send the data request to the appropriate database:

[Image: web service routing data requests to the appropriate database]
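A hedged sketch of that routing idea in Python (redis-py, pymongo, and pyodbc are real client libraries, but the connection details and the mapping of data kinds to stores are invented for illustration):

```python
import json
import redis                      # key-value store for user sessions
from pymongo import MongoClient   # document store for shopping carts / product catalog
import pyodbc                     # RDBMS for financial data

# One client per storage technology (connection details are hypothetical).
session_store = redis.Redis(host="redis-host", port=6379)
document_store = MongoClient("mongodb://mongo-host:27017")["ecommerce"]
rdbms = pyodbc.connect("DSN=FinanceDB")

def save(data_kind: str, key: str, payload: dict) -> None:
    """Route each write to the store best suited for that kind of data."""
    if data_kind == "session":
        session_store.set(key, json.dumps(payload), ex=1800)   # fast, non-durable
    elif data_kind in ("cart", "catalog"):
        document_store[data_kind].replace_one({"_id": key}, payload, upsert=True)
    elif data_kind == "order":
        cur = rdbms.cursor()
        cur.execute("INSERT INTO Orders (OrderId, Total) VALUES (?, ?)",
                    key, payload["total"])                      # transactional update
        rdbms.commit()
    else:
        raise ValueError(f"no store configured for {data_kind}")
```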

This will come at a cost in complexity, as each data storage solution means learning a new technology.  But the benefits will be worth it: when relational databases are used inappropriately, they cause a significant slowdown in application development and performance.  Another benefit is that many NoSQL databases are designed to operate over clusters and can handle large volumes of data, so they give you horizontal scaling (scale-out), as opposed to the limitation of most relational databases, which use vertical scaling (scale-up).

More info:

Polyglot Persistence – Two Great Tastes That Taste Great Together

Introduction to Polyglot Persistence: Using Different Data Storage Technologies for Varying Data Storage Needs

Webinar: What, Where, and How of Polyglot Persistence

Spring Polyglot Persistent Applications Part 1

The Rise of NoSQL and Polyglot Persistence

Polyglot Persistence: Choosing the Right Azure Storage Mix

What’s better for your big data application, SQL or NoSQL?

Difference between SQL and NoSQL : Comparision

Polyglot Persistence?

Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence

PolyglotPersistence


What is Microsoft Azure Stream Analytics?

Microsoft Azure Stream Analytics (ASA) is a fully managed cloud service for real-time processing of streaming data.  ASA makes it easy to set up real-time analytic computations on data flowing in from devices, sensors, web sites, applications and infrastructure systems.  It supports a powerful high-level SQL-like language that dramatically simplifies the logic to visualize, alert, or act in near real-time.  ASA makes it simpler to build a wide range of Internet-of-Things (IoT) applications such as real-time remote device management, and to monitor and gain analytic insights from connected devices of all types including mobile phones and connected cars.

It was made generally available on April 16, 2015 (read).

[Image: Azure Stream Analytics]

ASA supports two different types of inputs, either stream data or reference data, and two different input data sources, either Azure Event Hubs or files from Azure Blob Storage.

The ingestor, or data source, for stream analytics is usually an Azure Event Hub.  Event Hubs is a highly scalable publish-subscribe data integrator capable of consuming large volumes of events per second, enabling Azure to process vast amounts of data from connected applications or devices.  It provides a unified collection point for a broad array of platforms and devices, and as such, abstracts the complexity of ingesting multiple different input streams directly into the streaming analytics engine.  ASA has an Event Hubs adapter built into the offering.
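As a rough sketch of pushing events into Event Hubs from Python (using the azure-eventhub client library; the connection string and payload are hypothetical, and the SDK surface has changed over time):

```python
import json
from azure.eventhub import EventHubProducerClient, EventData

# Hypothetical connection string for an Event Hubs namespace.
CONN_STR = ("Endpoint=sb://my-namespace.servicebus.windows.net/;"
            "SharedAccessKeyName=send;SharedAccessKey=...")

producer = EventHubProducerClient.from_connection_string(CONN_STR, eventhub_name="sensor-readings")

# Each event is a small JSON payload; ASA can then run SQL-like, time-windowed queries over the stream.
batch = producer.create_batch()
batch.add(EventData(json.dumps({"deviceId": "sensor-01", "temperature": 72.4})))
batch.add(EventData(json.dumps({"deviceId": "sensor-02", "temperature": 68.9})))
producer.send_batch(batch)
producer.close()
```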

ASA supports five different types of outputs: Blob storage, Event Hub, Power BI, SQL Database or Table Storage.  ASA also enhances SQL by supporting groupings by time (see Windowing).  ASA provides a native connector for SQL Database for consuming events that are output from the stream.

One of the common presentation use cases for ASA is to analyze high volume streaming data in real-time and get the insight in a live dashboard (a dashboard that updates in real-time without the user having to refresh the browser).  You can build a live dashboard using Power BI as an output for your ASA job (see Azure Stream Analytics & Power BI: Live dashboard for real-time analytics of streaming data).

It is really easy to build an ASA solution, as it is all done through the Azure web portal.  There is no need to create a VM and remote into it.  It is also possible to easily integrate with Azure ML.

There are also a couple of tutorials you can check out if you want to build an end-to-end solution.

Microsoft has two other stream processing platforms: StreamInsight and Azure HDInsight Storm.

More info:

Stream Analytics documentation

Reference Architecture: Real-time event processing with Microsoft Azure Stream Analytics

Video An Introduction to Azure Stream Analytics

Video Gaining Real-Time IoT Insights using Azure Stream Analytics, AzureML and PowerBI

Azure Stream Analytics Team Blog

Video Azure Stream Analytics Demo

Building an IoT solution with Azure Event Hubs and Stream Analytics – Part 3

How to Process Google Data in Real Time with Azure Stream Analytics

Microsoft Azure Stream Analytics


How an MPP appliance solution can improve your future

Massive parallel processing (MPP) is the future for data warehousing.

So what is MPP?  SQL Server is a Symmetric Multiprocessing (SMP) solution, which essentially means it uses one server.  MPP provides scalability and query performance by running independent servers in parallel.  That is the quick definition.  For more details, read What MPP means to SQL Server Parallel Data Warehouse.

Microsoft has an MPP appliance called the Analytics Platform System (APS).  If you are building a new data warehouse that will be of any decent size in the next few years (i.e., 1TB or greater), it is really a “no brainer” to purchase an MPP solution over an SMP solution.

Looking at the value over time for an MPP appliance vs. an SMP solution:

Comparison of Cumulative Cash Flows of Customer Experience Project using a Big Data Solution (MPP) vs. a Traditional Data Warehouse Appliance
Source: Wikibon 2011

The financial metrics of the two approaches were overwhelmingly in favor of the Big Data Solution (MPP) approach:

  • Big Data Approach:
    • Cumulative 3-year Cash Flow – $152M,
    • Net Present Value – $138M,
    • Internal Rate of Return (IRR) – 524%,
    • Breakeven – 4 months.
  • Traditional DW Appliance Approach:
    • Cumulative 3-year Cash Flow – $53M,
    • Net Present Value – $46M,
    • Internal Rate of Return (IRR) – 74%,
    • Breakeven – 26 months.

The bottom line is that for big data projects, the traditional data warehouse approach is more expensive in IT resources, takes much longer to do, and provides a less attractive return-on-investment.

Getting into the specific reasons to choose MPP (i.e. APS) over SMP when building a data warehouse:

  • Be proactive instead of reactive (solve future performance problems now and not when they rear their ugly head)
  • The hardware cost to upgrade to powerful servers that are clustered (for high availability) can be more than a quarter rack of APS (which has high availability built-in)
  • You can use the same SQL Server EE licenses for APS
  • You can use your current premium support – just add hours for APS
  • So you may just need to purchase hardware, and there are three vendors to choose from: HP, Dell, and Quanta.  Each solution uses commodity hardware that can be de-racked and repurposed if for whatever reason the MPP solution did not work out
  • It makes data warehouse design easier as you don’t need the typical “band aids” and work-arounds required with an SMP solution to get the required performance: star schemas, aggregate tables, cubes, data marts, multiple data warehouse servers, etc.  This also reduces the complexity of ETL and therefore makes it easier to maintain
  • Data warehouse development is quicker because of APS speed: http://www.jamesserra.com/archive/2014/10/non-obvious-apspdw-benefits/
  • Knocks down any potential hardware barriers down the road, as it is a long-term solution where you can easily scale your hardware by sliding in a new rack instead of fork-lifting to a new server (i.e. purchase, build and tune, backup and restore to the new hardware, move over security, repoint users to the new server)
  • You get 30x-100x performance improvement over SMP and when scaling your hardware you get linear performance benefits (i.e. double your hardware you double your performance) as opposed to the 20-30% performance improvement when fork-lifting to a new server
  • You can use your existing SQL Server skillset, as APS is very much like SQL Server, so little new training is needed
  • You don’t have to say “no” anymore to end-users when they ask for more data for reasons like: we don’t have room on the SAN, we can’t give you the query performance, we are bumping up into our maintenance window
  • You need non-relational data down the road.  You can use Hadoop, which is a platform designed and optimized for new forms of data, and then use PolyBase for easy access.  You can create a data lake in Hadoop for the end-user to mine data and let you know what is useful
  • If your data warehouse has been around a while this may be a good time to re-engineer and development will be so much quicker with MPP
  • Even if you don’t have performance problems now but see the value in big data and analytics and want to be ready for it, especially IoT
  • Tons of additional benefits that you don’t get with SMP: See Parallel Data Warehouse (PDW) benefits made simple

If you are currently housing a data warehouse on SMP, it will almost always be worth the migration effort to switch to a MPP solution.  Remember that old saying: “pay me now or pay me later”!

But there are some reasons where MPP may not be a good fit or should be supplemented with a SMP solution and/or a SSAS cube:

  • High volume transactional workloads (OLTP).  MPP is for data warehousing (heavy reads and batch writes)
  • Small company with small budget
  • Thousands of concurrent users
  • Super-fast dashboard query performance
  • Chatty workload
  • 24/7 SLA
  • Need replication (true real-time updates)

Another benefit is APS is an appliance solution, meaning it is prebuilt with software, hardware, and networking components already installed.  Think of it as “Big data in a box”.  Some of the appliance benefits:

  • Deploy in hours, not weeks
  • Save time by implementing a turnkey solution complete with hardware and software
  • Gain confidence and peace of mind by deploying a pre-tested and tuned data warehouse optimized for your specific workload
  • Reduce operational costs and simplify management
  • Reduce energy costs and environmental impact through balanced infrastructure and performance engineering
  • Maximize reliability through the use of industry standard infrastructure and software components
  • React quickly with unprecedented database agility
  • No individualized patching of servers
  • Much lower overall TCO

More info:

Financial Comparison of Big Data MPP Solution and Data Warehouse Appliance


SQL Server 2016 public preview arriving this summer

SQL Server 2016 was recently announced.  Top new features include:

  • Always Encrypted protects data at rest and in motion.  With Always Encrypted, SQL Server can perform operations on encrypted data and, best of all, the encryption key resides with the application in the customer’s trusted environment.  Encryption and decryption of data happens transparently inside the application, which minimizes the changes that have to be made to existing applications.
  • Stretch Database allows you to dynamically stretch your warm and cold transactional data to Microsoft Azure, so your operational data is always at hand, no matter the size, and you benefit from the low cost of using Microsoft Azure.  You can use Always Encrypted with Stretch Database to extend your data in a more secure manner for greater peace of mind
  • Real-time Operational Analytics & In-Memory OLTP: For In-Memory OLTP, which customers today are using for up to 30x faster transactions than disk-based systems, you will now be able to apply this technology, tuned for transactional performance, to a significantly greater number of applications, as well as benefit from increased concurrency.  With these enhancements, you also have the unique capability to combine in-memory OLTP with the in-memory columnstore, delivering up to 100x faster queries for real-time operational analytics
  • Built-in Advanced Analytics, PolyBase and Mobile BI deliver advanced analytics as an additional workload in SQL Server with proven technology from Revolution Analytics.  Advanced analytics is more accessible and has increased performance for your advanced analytic workloads by bringing R processing closer to the data and building advanced analytic capabilities right into SQL Server.  Additionally, Microsoft has built PolyBase into SQL Server, expanding the power to extract value from unstructured and structured data using your existing T-SQL skills.
  • Additional security enhancements for Row-level Security and Dynamic Data Masking
  • Improvements to AlwaysOn for more robust availability and disaster recovery with multiple synchronous replicas and secondary load balancing
  • Native JSON support to offer better performance and support for the many types of your data (see the sketch after this list)
  • SQL Server Enterprise Information Management (EIM) tools and Analysis Services get an upgrade in performance, usability and scalability
  • Faster hybrid backups, high availability and disaster recovery scenarios to backup and restore your on-premises databases to Azure and place your SQL Server AlwaysOn secondaries in Azure
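As a tiny sketch of the native JSON support mentioned in the list above (FOR JSON and JSON_VALUE are SQL Server 2016 syntax; the connection string and table are hypothetical, submitted here from Python via pyodbc):

```python
import pyodbc

# Hypothetical connection to a SQL Server 2016 instance.
conn = pyodbc.connect("DRIVER={ODBC Driver 13 for SQL Server};SERVER=sql2016;"
                      "DATABASE=Sales;Trusted_Connection=yes")
cur = conn.cursor()

# FOR JSON turns a relational result set into a JSON document on the server.
cur.execute("SELECT TOP 5 CustomerID, OrderDate, TotalDue FROM Sales.Orders FOR JSON AUTO")
print(cur.fetchone()[0])   # e.g. [{"CustomerID":1,"OrderDate":"2016-01-05","TotalDue":59.99}, ...]

# JSON_VALUE (and OPENJSON/ISJSON) go the other direction, querying JSON text stored in a column.
cur.execute("SELECT JSON_VALUE('{\"city\":\"Boston\"}', '$.city')")
print(cur.fetchone()[0])   # Boston
```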

Check out the data sheet for more details.  You can sign up to be notified of the preview availability this summer.

More info:

SQL Server 2016 public preview coming this summer

‘Always Encrypted’ Leads New SQL Server 2016 Features

SQL Server 2016 is coming….

Reading the SQL Server 2016 Data Sheet

Video on new SQL Server 2016 features: The SQL Server Evolution and The SQL Server Evolution: Deep Dive

SQL Server 2016 Security Roadmap Session Notes #MSIgnite

SQL Server 2016

SSAS Related Enhancements in SQL Server 2016

Microsoft Breathes Life back into SQL Server BI On-Prem

Top 7 Features Coming to SQL Server 2016


Azure Data Lake

At the recent Microsoft Build Developer Conference, Executive Vice President Scott Guthrie announced the Azure Data Lake.  It is a new flavor of Azure Storage which can handle streaming data (low latency, high volume, short updates), is geo-distributed, data-locality aware and allows individual files to be sized at petabyte scale.

Azure Data Lake is built to solve for restrictions found in traditional analytics infrastructure and realize the idea of a “data lake” – a single place to store every type of data in its native format with no fixed limits on account size or file size, high throughput to increase analytic performance and native integration with the Hadoop ecosystem.

I have previously blogged about the benefits of a data lake (here).  To review, a data lake is an enterprise wide repository of every type of data collected in a single place prior to any formal definition of requirements or schema for the purposes of operational and exploratory analytics.  Data lakes remove many of the restrictions constraining traditional analytics infrastructure like the pre-definition of schema, the cost of storing large datasets, and the propagation of different silos where data is located.  Once captured in the data lake, different computational engines like Hadoop can be used to analyze and mine the raw data to discover new insights.  Data Lakes can also act as a lower cost data preparation location prior to moving curated data into a data warehouse.  In these cases, customers would load data into the data lake prior to defining any transformation logic.

Capabilities of the Azure Data Lake include:

  • HDFS for the cloud: Azure Data Lake is Microsoft’s implementation of a Hadoop File System compatible with HDFS that works with the Hadoop ecosystem including Azure HDInsight, Hortonworks, and Cloudera
  • Unlimited storage, petabyte files: Azure Data Lake has unbounded scale with no limits to how much data can be stored in a single account (Azure blobs have a 500TB limit per account).  Azure Data Lake can also store very large files in the petabyte-range with immediate read/write access and high throughput (Azure blobs have a 5TB limit for individual files)
  • Optimized for massive throughput: Azure Data Lake is built for running large analytic systems that require massive throughput to query and analyze petabytes of data.  You need only focus on the application logic and throughput can be tuned to meet the needs of the application
  • High frequency, low latency, read immediately: Azure Data Lake is built to handle high volumes of small writes at low latency making it optimized for near real-time scenarios like website analytics, Internet of Things (IoT), analytics from sensors, and others
  • Store data in its native format: Azure Data Lake is built as a distributed file store allowing you to store unstructured, semi-structured and structured data without transformation or schema definition.  This allows you to store all of your data and analyze them in their native format
  • Integration with Azure Active Directory: Azure Data Lake is integrated with Azure Active Directory for identity and access management over all of your data
  • Automatically replicates your data – 3 copies within a single data center
  • Uses Azure AD for security
  • Up and running in a few clicks: no hardware to purchase or install or tune or maintain.  Scale out on demand

Azure Data Lake can be addressed with Azure Storage APIs, and it is also compatible with the Hadoop Distributed File System (HDFS).  That means Hadoop clusters can use it directly, and PolyBase can access it as well.

[Image: Azure Data Lake]

Answers to common questions:

What are the differences between Azure Data Lake and Azure Storage?  In summary, this includes petabyte file sizes, high throughput, and built-in Hadoop integration.  Azure Storage is a generic store for many use cases whereas Azure Data Lake is optimized for doing big data analytics.

Will Azure Data Lake integrate with Azure HDInsight?  Yes, Azure HDInsight will be able to access Azure Data Lake as a data source similar to how it accesses Azure Blobs today.  This is available immediately at public preview allowing HDInsight customers to leverage a hyper scale big data repository in conjunction with Hadoop.

How is Azure Data Lake different from HDFS?  Azure Data Lake is an implementation of HDFS in the cloud and leverages the WebHDFS REST interface.  WebHDFS Rest APIs have a subset of the APIs available for HDFS.
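To show what that WebHDFS-compatible REST surface looks like, here is a hedged sketch using the requests library (the endpoint, folder paths, and authentication are hypothetical and simplified; real calls need an Azure AD OAuth token):

```python
import requests

# Hypothetical WebHDFS-style endpoint for an Azure Data Lake account.
BASE = "https://mydatalake.azuredatalakestore.net/webhdfs/v1"
HEADERS = {"Authorization": "Bearer <access-token>"}   # obtained via Azure AD in practice

# LISTSTATUS and OPEN are standard WebHDFS operations.
listing = requests.get(f"{BASE}/clickstream?op=LISTSTATUS", headers=HEADERS)
print(listing.json())                       # directory contents as JSON

data = requests.get(f"{BASE}/clickstream/2015/07/01/events.json?op=OPEN", headers=HEADERS)
print(data.content[:200])                   # first bytes of the file
```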

When is Microsoft Azure Data Lake available?  Today, Azure Data Lake is only available as a private preview.  Public preview will be available post-Build conference.  At public preview, Data Lake will be available at US East 2 data center.

What technologies can use the Azure Data Lake?  Any HDFS-compliant project can use Azure Data Lake (Spark, Storm, Flume, Sqoop, Kafka, R, etc.).  The idea is to put all your data in the Azure Data Lake and then later use any technology on top of it (Azure ML, Azure Data Factory, HDInsight, Azure Stream Analytics, DocumentDB, Hortonworks, Cloudera, Azure Event Hubs, Qubole, etc.).

Couldn’t this be done before with Azure Blob Storage?  The data lake is dramatically different from Azure Blob Storage in a few areas.  It is built specifically to run massive parallel queries, with a large improvement in throughput performance and scale and the ability to store large files.  Also, there is a much bigger vision for the data lake, and it will continue to differentiate more and more from blob storage.

You can sign up to be notified when the Azure Data Lake preview becomes available.

More info:

Microsoft BUILDs its cloud Big Data story

Azure Data Lake: Why you might want one

Microsoft Announces Azure SQL Database elastic database, Azure SQL Data Warehouse, Azure Data Lake

6 Key Features from Microsoft’s Azure Data Lake
