Cortana Analytics Suite

Yesterday at the Microsoft Worldwide Partner Conference in Orlando, Microsoft announced the Cortana Analytics Suite, a new package of data storage, information management, machine learning, and business intelligence software in a single convenient monthly subscription.  Microsoft’s Cortana personal digital assistant, until now available to consumers on mobile devices and PCs, is part of the new package, designed to give non-technical users a simple interface for asking questions of data and receiving answers in response.

In short, it is a fully managed big data and advanced analytics suite that transforms your data into intelligent action.

[Image: Cortana Analytics Suite]

Here’s a video demonstrating some capabilities of the new Cortana Analytics Suite.

It will be available later this fall as an integrated offering with a simple monthly subscription model, bringing together all the relevant analytics and big data capabilities so you can make better decisions and intelligently automate operations.  It includes a comprehensive set of cloud services:

  • Preconfigured solutions (business scenarios): Recommendations, Forecasting, Churn, etc.
  • Personal digital assistant: Cortana
  • Perceptual intelligence (recognition of human interactions and intent): Face, Vision, Speech and Text Analytics
  • Dashboards and visualizations: Power BI
  • Machine learning and analytics (machine learning, Hadoop, complex event processing): Azure Machine Learning, Azure HDInsight (Hadoop), Azure Stream Analytics
  • Big data stores (big data repository, elastic data warehouse): Azure Data Lake, Azure SQL Data Warehouse
  • Information management (data orchestration, data catalog, event ingestion): Azure Data Factory, Azure Data Catalog, Event Hubs

Note that all of the component services are available individually today, so you can start using them immediately.  There will be an on-premises version of Cortana Analytics sometime in the future via the Microsoft Azure Stack (which provides enterprises with the ability to create private clouds with the same bits Microsoft uses to build Azure itself).  Cortana Analytics includes capabilities to both discover and connect to a variety of data sources including SQL Server and Azure SQL Database.

For more information on the Cortana Analytics Suite, visit www.microsoft.com/cortanaanalytics.

More info:

Announcing Cortana Analytics Suite and New Partner Investments at WPC 2015


Microsoft Azure Data Catalog

Just announced is the Microsoft Azure Data Catalog, which is an enterprise metadata catalog / portal for the self-service discovery of data sources.  It becomes available on Monday next week, July 13, 2015.  Check out this short video on it.  My response to this is – woo hoo!  I have been waiting years for Microsoft to come up with a tool to catalog metadata and I’m excited this day has finally arrived.

From the Microsoft blog post announcement:

Businesses of every size face the challenge of sifting through their myriad data sources and discovering the right ones for a given problem. Although businesses collect and store tons of data as part of their everyday activities, too often they fail to reap the full benefit of all the data that’s being gathered. Employees too often end up spending more time searching for data than they actually do working with the data itself.

To address these problems, Azure Data Catalog uses a crowdsourced approach. Any user, for instance an analyst, data scientist or data developer, can register, enrich, discover, understand and consume data sources. Every user is empowered to register the data sources that they use. Registration extracts the structural metadata from the data source and stores it in the cloud-based Catalog, while the data itself remains in the data source.

Crowdsourced annotations let users who are knowledgeable about the data assets registered in the Catalog enrich the system at any time.  This helps others understand the data more readily, including its intended purpose and how it’s being used within the business.

Azure Data Catalog also lets users discover data sources by searching and filtering. Users can then connect to data sources using any tool of their choice, and they can similarly work with the data that they need using the tools with which they are already familiar.

Azure Data Catalog bridges the gap between IT and the business – it encourages the community of data producers, data consumers and data experts to share their business knowledge while still allowing IT to maintain control and oversight over all the data sources in their constantly evolving systems.


There will be open APIs to allow third parties to integrate directly with the Data Catalog for both registration and discovery of data sources.  There will be both a free and a standard edition (the free version has limits on the maximum number of users and catalog objects).

The Azure Data Catalog is an evolution of the existing Data Catalog that today ships as a feature of Power BI for Office 365.  Soon the two catalogs will merge into a single service.  The exact timings and details for how the Azure Data Catalog might integrate into the Power BI service and Power BI Designer are still to be decided.

More info:

Announcing the Public Preview of Azure Data Catalog


What is Polyglot Persistence?

Polyglot Persistence is a fancy term meaning that when storing data, it is best to use multiple data storage technologies, chosen based upon the way data is being used by individual applications or components of a single application.  Different kinds of data are best handled by different data stores.  In short, it means picking the right tool for the right use case.  It’s the same idea behind Polyglot Programming, which is the idea that applications should be written in a mix of languages to take advantage of the fact that different languages are suitable for tackling different problems.

Looking at a Polyglot Persistence example, an e-commerce platform will deal with many types of data (e.g. shopping cart, inventory, completed orders).  Instead of trying to store all this data in one database, which would require a lot of conversion to force it all into the same format, store each type of data in the database best suited for it.  So the e-commerce platform might look like this:

[Image: an e-commerce platform using a different data store for each type of data]

So we are using a mixture of RDBMS solutions (i.e. SQL Server) with NoSQL solutions, of which there are four types: Key-Value, Document, Graph, and Column (see Types of NoSQL databases).  Here is a guideline on the database type to use based on the functionality of the data:

  • User sessions (rapid access for reads and writes; no need to be durable): Key-Value
  • Financial data (needs transactional updates; tabular structure fits the data): RDBMS
  • POS data (depends on size and rate of ingest; lots of writes, infrequent reads mostly for analytics): RDBMS if modest, Key-Value or Document if ingest is very high, or Column if analytics is key
  • Shopping cart (high availability across multiple locations; can merge inconsistent writes): Document, or possibly Key-Value
  • Recommendations (rapidly traverse links between friends, product purchases, and ratings): Graph, or Column if simple
  • Product catalog (lots of reads, infrequent writes; products make natural aggregates): Document
  • Reporting (SQL interfaces well with reporting tools): RDBMS, Column
  • Analytics (large-scale analytics on a large cluster): Column
  • User activity logs, CSR logs, social media analysis (high volume of writes on multiple nodes): Key-Value or Document

With an application that uses many types of data, a web service can be created to send the data request to the appropriate database:

[Image: a web service routing each data request to the appropriate database]

This will come at a cost in complexity, as each data storage solution means learning a new technology.  But the benefits will be worth it: when relational databases are used inappropriately, they cause a significant slowdown in application development and performance.  Another benefit is that many NoSQL databases are designed to operate over clusters and can handle large volumes of data, so they give you horizontal scaling (scale-out) as opposed to the vertical scaling (scale-up) that limits most relational databases.
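
To make the contrast concrete, here is a minimal sketch (hypothetical schema and document) of how the same order data might be modeled relationally for the financial side and as a self-contained aggregate for a document store:

```sql
-- Relational model: orders as normalized rows, suited to transactional
-- updates in an RDBMS such as SQL Server (schema is hypothetical)
CREATE TABLE dbo.Orders (
    OrderId     INT           PRIMARY KEY,
    CustomerId  INT           NOT NULL,
    OrderDate   DATETIME2     NOT NULL,
    TotalAmount DECIMAL(10,2) NOT NULL
);

-- Document model: the same order as one self-contained aggregate, suited
-- to a document store (e.g. a shopping cart), shown here as JSON:
-- {
--   "orderId": 1001,
--   "customerId": 42,
--   "orderDate": "2015-07-14",
--   "items": [ { "sku": "BIKE-01", "qty": 1, "price": 299.99 } ]
-- }
```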

More info:

Polyglot Persistence – Two Great Tastes That Taste Great Together

Introduction to Polyglot Persistence: Using Different Data Storage Technologies for Varying Data Storage Needs

Webinar: What, Where, and How of Polyglot Persistence

Spring Polyglot Persistent Applications Part 1

The Rise of NoSQL and Polyglot Persistence

Polyglot Persistence: Choosing the Right Azure Storage Mix

What’s better for your big data application, SQL or NoSQL?

Difference between SQL and NoSQL : Comparision

Polyglot Persistence?

Data Access for Highly-Scalable Solutions: Using SQL, NoSQL, and Polyglot Persistence

PolyglotPersistence


What is Microsoft Azure Stream Analytics?

Microsoft Azure Stream Analytics (ASA) is a fully managed cloud service for real-time processing of streaming data.  ASA makes it easy to set up real-time analytic computations on data flowing in from devices, sensors, web sites, applications and infrastructure systems.  It supports a powerful high-level SQL-like language that dramatically simplifies the logic to visualize, alert, or act in near real-time.  ASA makes it simpler to build a wide range of Internet-of-Things (IoT) applications such as real-time remote device management, and to monitor and gain analytic insights from connected devices of all types including mobile phones and connected cars.

It was made generally available on April 16, 2015 (read the announcement).

[Image: Azure Stream Analytics overview]

ASA supports two different types of inputs, either stream data or reference data, and two different input data sources, either Azure Event Hubs or files from Azure Blob Storage.

The ingestor, or data source, for stream analytics is usually an Azure Event Hub.  Event Hubs is a highly scalable publish-subscribe data integrator capable of consuming large volumes of events per second, enabling Azure to process vast amounts of data from connected applications or devices.  It provides a unified collection point for a broad array of platforms and devices, and as such, abstracts the complexity of ingesting multiple different input streams directly into the streaming analytics engine.  ASA has an Event Hubs adapter built into the offering.

ASA supports five different types of outputs: Blob storage, Event Hub, Power BI, SQL Database or Table Storage.  ASA also enhances SQL by supporting groupings by time (see Windowing).  ASA provides a native connector for SQL Database for consuming events that are output from the stream.
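
To give a flavor of the SQL-like language, here is a minimal sketch of an ASA query, assuming a hypothetical Event Hub input named SensorInput (with DeviceId, Temperature, and EventTime fields) and an output named PowerBIOutput; it emits a per-device average over 30-second tumbling windows:

```sql
-- Average temperature per device over 30-second tumbling windows
-- (input, output, and field names are hypothetical)
SELECT
    DeviceId,
    AVG(Temperature) AS AvgTemperature,
    System.Timestamp AS WindowEnd
INTO
    PowerBIOutput
FROM
    SensorInput TIMESTAMP BY EventTime
GROUP BY
    DeviceId,
    TumblingWindow(second, 30)
```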

One of the common presentation use cases for ASA is to analyze high-volume streaming data in real-time and get the insight in a live dashboard (a dashboard that updates in real-time without the user having to refresh the browser).  You can build a live dashboard using Power BI as an output for your ASA job (see Azure Stream Analytics & Power BI: Live dashboard for real-time analytics of streaming data).

It is really easy to build an ASA solution as it is all done through the Azure web portal.  There is no need to create a VM and remote into it.  It is also possible to easily integrate with Azure ML.

There are also a couple of tutorials you can check out if you want to build an end-to-end solution.

Microsoft has two other stream processing platforms: StreamInsight and Azure HDInsight Storm.

More info:

Stream Analytics documentation

Reference Architecture: Real-time event processing with Microsoft Azure Stream Analytics

Video An Introduction to Azure Stream Analytics

Video Gaining Real-Time IoT Insights using Azure Stream Analytics, AzureML and PowerBI

Azure Stream Analytics Team Blog

Video Azure Stream Analytics Demo

Building an IoT solution with Azure Event Hubs and Stream Analytics – Part 3

How to Process Google Data in Real Time with Azure Stream Analytics

Microsoft Azure Stream Analytics


How an MPP appliance solution can improve your future

Massively parallel processing (MPP) is the future of data warehousing.

So what is MPP?  SQL Server is a Symmetric Multiprocessing (SMP) solution, which essentially means it uses one server.  MPP provides scalability and query performance by running independent servers in parallel.  That is the quick definition.  For more details, read What MPP means to SQL Server Parallel Data Warehouse.

Microsoft has an MPP appliance called the Analytics Platform System (APS).  If you are building a new data warehouse that will be of any decent size in the next few years (i.e. 1TB or greater), it is really a “no brainer” to purchase an MPP solution over an SMP solution.

Looking at the value over time for an MPP appliance vs. an SMP solution:

[Image: Comparison of Cumulative Cash Flows of a Customer Experience Project using a Big Data Solution (MPP) vs. a Traditional Data Warehouse Appliance.  Source: Wikibon 2011]

The financial metrics of the two approaches were overwhelmingly in favor of the Big Data Solution (MPP) approach:

  • Big Data Approach:
    • Cumulative 3-year Cash Flow – $152M,
    • Net Present Value – $138M,
    • Internal Rate of Return (IRR) – 524%,
    • Breakeven – 4 months.
  • Traditional DW Appliance Approach:
    • Cumulative 3-year Cash Flow – $53M,
    • Net Present Value – $46M,
    • Internal Rate of Return (IRR) – 74%,
    • Breakeven – 26 months.

The bottom line is that for big data projects, the traditional data warehouse approach is more expensive in IT resources, takes much longer to do, and provides a less attractive return-on-investment.

Getting into the specific reasons to choose MPP (i.e. APS) over SMP when building a data warehouse:

  • Be proactive instead of reactive (solve future performance problems now and not when they rear their ugly head)
  • The hardware cost to upgrade to powerful servers that are clustered (for high availability) can be more than a quarter rack of APS (which has high availability built-in)
  • You can use the same SQL Server EE licenses for APS
  • You can use your current premium support – just add hours for APS
  • So you may just need to purchase hardware, and there are three vendors to choose from: HP, Dell, and Quanta.  Each solution uses commodity hardware that can be de-racked and repurposed if for whatever reason the MPP solution does not work out
  • It makes data warehouse design easier as you don’t need the typical “band aids” and work-arounds required to get adequate performance from an SMP solution: star schemas, aggregate tables, cubes, data marts, multiple data warehouse servers, etc.  This also reduces the complexity of the ETL and therefore makes it easier to maintain
  • Data warehouse development is quicker because of APS speed: http://www.jamesserra.com/archive/2014/10/non-obvious-apspdw-benefits/
  • Knocks down any potential hardware barriers down the road, as it is a long-term solution where you can easily scale your hardware by sliding in a new rack instead of fork-lifting to a new server (i.e. purchase, build, and tune the new hardware; back up and restore to it; move over security; repoint users to the new server)
  • You get a 30x-100x performance improvement over SMP, and when scaling your hardware you get linear performance benefits (i.e. double your hardware and you double your performance), as opposed to the 20-30% performance improvement when fork-lifting to a new server
  • You can use your existing SQL Server skillset, as APS is very much like SQL Server, so little new training is needed
  • You don’t have to say “no” anymore when end-users ask for more data for reasons like: we don’t have room on the SAN, we can’t give you the query performance, we are bumping up against our maintenance window
  • You may need non-relational data down the road.  You can use Hadoop, which is a platform designed and optimized for new forms of data, and then use PolyBase for easy access.  You can create a data lake in Hadoop for end-users to mine data and let you know what is useful
  • If your data warehouse has been around a while, this may be a good time to re-engineer, and development will be much quicker with MPP
  • Even if you don’t have performance problems now, you may see the value in big data and analytics and want to be ready for them, especially IoT
  • Tons of additional benefits that you don’t get with SMP: See Parallel Data Warehouse (PDW) benefits made simple

If you are currently housing a data warehouse on SMP, it will almost always be worth the migration effort to switch to an MPP solution.  Remember that old saying: “pay me now or pay me later”!

But there are some reasons where MPP may not be a good fit or should be supplemented with an SMP solution and/or an SSAS cube:

  • High volume transactional workloads (OLTP).  MPP is for data warehousing (heavy reads and batch writes)
  • Small company with small budget
  • Thousands of concurrent users
  • Super-fast dashboard query performance
  • Chatty workload
  • 24/7 SLA
  • Need replication (true real-time updates)

Another benefit is that APS is an appliance solution, meaning it is prebuilt with software, hardware, and networking components already installed.  Think of it as “big data in a box”.  Some of the appliance benefits:

  • Deploy in hours, not weeks
  • Save time by implementing a turnkey solution complete with hardware and software
  • Gain confidence and peace of mind by deploying a pre-tested and tuned data warehouse optimized for your specific workload
  • Reduce operational costs and simplify management
  • Reduce energy costs and environmental impact through balanced infrastructure and performance engineering
  • Maximize reliability through the use of industry standard infrastructure and software components
  • React quickly with unprecedented database agility
  • No individualized patching of servers
  • Much lower overall TCO

More info:

Financial Comparison of Big Data MPP Solution and Data Warehouse Appliance


SQL Server 2016 public preview arriving this summer

SQL Server 2016 was recently announced.  Top new features include:

  • Always Encrypted protects data at rest and in motion.  With Always Encrypted, SQL Server can perform operations on encrypted data and, best of all, the encryption key resides with the application in the customer’s trusted environment.  Encryption and decryption of data happens transparently inside the application, which minimizes the changes that have to be made to existing applications.
  • Stretch Database allows you to dynamically stretch your warm and cold transactional data to Microsoft Azure, so your operational data is always at hand, no matter the size, and you benefit from the low cost of using Microsoft Azure.  You can use Always Encrypted with Stretch Database to extend your data in a more secure manner for greater peace of mind
  • Real-time Operational Analytics & In-Memory OLTP: For In-Memory OLTP, which customers today are using for up to 30x faster transactions than disk-based systems, you will now be able to apply this technology, tuned for transactional performance, to a significantly greater number of applications, as well as benefit from increased concurrency.  With these enhancements, you also have the unique capability to use the in-memory columnstore, delivering 100x faster queries, together with In-Memory OLTP for in-memory performance and real-time operational analytics
  • Built-in Advanced Analytics, PolyBase and Mobile BI deliver advanced analytics as an additional workload in SQL Server with proven technology from Revolution Analytics.  Advanced analytics is more accessible and has increased performance for your advanced analytic workloads by bringing R processing closer to the data and building advanced analytic capabilities right into SQL Server.  Additionally, Microsoft has built PolyBase into SQL Server, expanding the power to extract value from unstructured and structured data using your existing T-SQL skills.
  • Additional security enhancements for Row-level Security and Dynamic Data Masking
  • Improvements to AlwaysOn for more robust availability and disaster recovery with multiple synchronous replicas and secondary load balancing
  • Native JSON support to offer better performance and support for many types of data (see the sketch after this list)
  • SQL Server Enterprise Information Management (EIM) tools and Analysis Services get an upgrade in performance, usability and scalability
  • Faster hybrid backups, high availability and disaster recovery scenarios to backup and restore your on-premises databases to Azure and place your SQL Server AlwaysOn secondaries in Azure
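
As a taste of the native JSON support mentioned in the list above, here is a minimal sketch using the FOR JSON and OPENJSON syntax that SQL Server 2016 introduces (the table is hypothetical):

```sql
-- Return query results as JSON text (hypothetical table)
SELECT CustomerId, Name
FROM dbo.Customers
FOR JSON AUTO;

-- Shred JSON text back into rows with OPENJSON
SELECT *
FROM OPENJSON(N'[{"id":1,"name":"Contoso"},{"id":2,"name":"Fabrikam"}]')
WITH (id INT '$.id', name NVARCHAR(50) '$.name');
```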

Check out the data sheet for more details.  You can sign up to be notified when the preview becomes available this summer.

More info:

SQL Server 2016 public preview coming this summer

‘Always Encrypted’ Leads New SQL Server 2016 Features

SQL Server 2016 is coming….

Reading the SQL Server 2016 Data Sheet

Video on new SQL Server 2016 features: The SQL Server Evolution and The SQL Server Evolution: Deep Dive

SQL Server 2016 Security Roadmap Session Notes #MSIgnite

SQL Server 2016

SSAS Related Enhancements in SQL Server 2016

Microsoft Breathes Life back into SQL Server BI On-Prem

Top 7 Features Coming to SQL Server 2016


Azure Data Lake

At the recent Microsoft Build Developer Conference, Executive Vice President Scott Guthrie announced the Azure Data Lake.  It is a new flavor of Azure Storage which can handle streaming data (low latency, high volume, short updates), is geo-distributed, data-locality aware and allows individual files to be sized at petabyte scale.

Azure Data Lake is built to solve for restrictions found in traditional analytics infrastructure and realize the idea of a “data lake” – a single place to store every type of data in its native format with no fixed limits on account size or file size, high throughput to increase analytic performance and native integration with the Hadoop ecosystem.

I have previously blogged about the benefits of a data lake (here).  To review, a data lake is an enterprise-wide repository of every type of data collected in a single place prior to any formal definition of requirements or schema, for the purposes of operational and exploratory analytics.  Data lakes remove many of the restrictions constraining traditional analytics infrastructure, like the pre-definition of schema, the cost of storing large datasets, and the propagation of different silos where data is located.  Once captured in the data lake, different computational engines like Hadoop can be used to analyze and mine the raw data to discover new insights.  Data lakes can also act as a lower-cost data preparation location prior to moving curated data into a data warehouse.  In these cases, customers would load data into the data lake prior to defining any transformation logic.

Capabilities of the Azure Data Lake include:

  • HDFS for the cloud: Azure Data Lake is Microsoft’s implementation of a Hadoop File System compatible with HDFS that works with the Hadoop ecosystem including Azure HDInsight, Hortonworks, and Cloudera
  • Unlimited storage, petabyte files: Azure Data Lake has unbounded scale with no limits to how much data can be stored in a single account (Azure blobs have a 500TB limit per account).  Azure Data Lake can also store very large files in the petabyte-range with immediate read/write access and high throughput (Azure blobs have a 5TB limit for individual files)
  • Optimized for massive throughput: Azure Data Lake is built for running large analytic systems that require massive throughput to query and analyze petabytes of data.  You need only focus on the application logic and throughput can be tuned to meet the needs of the application
  • High frequency, low latency, read immediately: Azure Data Lake is built to handle high volumes of small writes at low latency making it optimized for near real-time scenarios like website analytics, Internet of Things (IoT), analytics from sensors, and others
  • Store data in its native format: Azure Data Lake is built as a distributed file store allowing you to store unstructured, semi-structured and structured data without transformation or schema definition.  This allows you to store all of your data and analyze it in its native format
  • Integration with Azure Active Directory: Azure Data Lake is integrated with Azure Active Directory for identity and access management over all of your data
  • Automatically replicates your data: three copies within a single data center
  • Up and running in a few clicks: no hardware to purchase or install or tune or maintain.  Scale out on demand

Azure Data Lake can be addressed with Azure Storage APIs, and it’s also compatible with the Hadoop Distributed File System (HDFS).  That means the same range of Hadoop clusters that work with PolyBase can also use Azure Data Lake directly.


Answers to common questions:

What are the differences between Azure Data Lake and Azure Storage?  In summary, this includes petabyte file sizes, high throughput, and built-in Hadoop integration.  Azure Storage is a generic store for many use cases whereas Azure Data Lake is optimized for doing big data analytics.

Will Azure Data Lake integrate with Azure HDInsight?  Yes, Azure HDInsight will be able to access Azure Data Lake as a data source similar to how it accesses Azure Blobs today.  This is available immediately at public preview allowing HDInsight customers to leverage a hyper scale big data repository in conjunction with Hadoop.

How is Azure Data Lake different from HDFS?  Azure Data Lake is an implementation of HDFS in the cloud and leverages the WebHDFS REST interface.  The WebHDFS REST APIs offer a subset of the APIs available for HDFS.

When is Microsoft Azure Data Lake available?  Today, Azure Data Lake is only available as a private preview.  Public preview will be available after the Build conference.  At public preview, Data Lake will be available in the US East 2 data center.

What technologies can use the Azure Data Lake?  Any HDFS-compliant project can use Azure Data Lake (Spark, Storm, Flume, Sqoop, Kafka, R, etc).  The idea is to put all your data in the Azure Data Lake and then later use any technology on top of it (Azure ML, Azure Data Factory, HDInsight, Azure Stream Analytics, DocumentDB, Hortonworks, Cloudera, Azure Event Hubs, Qubole, etc).

Couldn’t this be done before with Azure blob storage?  The Data Lake is dramatically different than Azure Blob Storage in a few areas.  It is built specifically to run massively parallel queries, with a large improvement in throughput, performance, and scale, and the ability to store large files.  Also, there is a much bigger vision for the Data Lake, and it will continue to differentiate itself more and more from Blob Storage.

You can sign up to be notified when the Azure Data Lake preview becomes available.

More info:

Microsoft BUILDs its cloud Big Data story

Azure Data Lake: Why you might want one

Microsoft Announces Azure SQL Database elastic database, Azure SQL Data Warehouse, Azure Data Lake

6 Key Features from Microsoft’s Azure Data Lake


Azure SQL Data Warehouse

Analytics Platform System (APS) is Microsoft’s massively parallel processing (MPP) data warehouse technology.  This has only been available as an on-prem solution (see video Overview of Microsoft Analytics Platform System).  Until now.  At the recent Microsoft Build Developer Conference, Executive Vice President Scott Guthrie announced the Azure SQL Data Warehouse (SQL DW).  This is a cloud data warehouse-as-a-service (DWaaS) that will compete with Amazon’s Redshift.

But it has some additional benefits over Redshift:

  • With Redshift you must scale your data warehouse by increasing both the compute and storage units.  With SQL DW, compute and storage are decoupled so you can scale them individually.  This is a very different economic model that can save customers a lot of money, as you don’t have to purchase additional storage when you just need more compute power, or vice-versa
  • The ability to pause compute when not in use so you only pay for storage, as opposed to Redshift, in which you are billed 24/7 for all the VMs that make up the nodes in your cluster
  • With Redshift you have to pick a pre-defined size and it can take hours to days to resize.  With SQL DW you can start small and grow or shrink in seconds (see the sketch after this list)
  • And keep in mind with Microsoft you can have a hybrid architecture that can use an on-prem APS combined with a SQL DW, allowing you to keep sensitive data on-prem and non-sensitive data in the cloud if you wish.  With Redshift you only have the option of keeping all your data in the cloud
  • There is also a lot more compatibility with SQL DW as it supports many features that Redshift does not, such as indexes, stored procs, SQL UDFs, partitioning, and constraints
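
To illustrate the decoupled compute model from the list above, here is a minimal sketch of creating and resizing a SQL DW database in T-SQL; the database name and DWU service objectives are illustrative, and pausing compute is expected to be done through the portal or PowerShell rather than T-SQL:

```sql
-- Create a SQL Data Warehouse database at a small compute level
CREATE DATABASE MyDW
    (EDITION = 'DataWarehouse', SERVICE_OBJECTIVE = 'DW400');

-- Later, scale compute up (or down) independently of storage
ALTER DATABASE MyDW
    MODIFY (SERVICE_OBJECTIVE = 'DW1000');
```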

SQL DW is built with the same technology as APS, except that instead of using SQL Server 2014 it uses version 12 of Azure SQL Database.  It also includes PolyBase.  PolyBase allows APS and SQL DW to query data in a Hadoop cluster, either directly or by pushing some of the work to Hadoop itself so the query is actually run using the Hadoop cluster’s CPUs.  The Hadoop data is made to look as if it were local to the data warehouse, so that end-users can use their existing skill sets to query it via SQL or any reporting tool that uses SQL (like Excel, SSRS, Power BI, etc).  PolyBase can integrate with Hadoop in this manner via a Microsoft HDInsight cluster that can either be inside APS or in the cloud, or via a Hortonworks or Cloudera cluster.
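
Here is a minimal sketch of what querying Hadoop through PolyBase looks like in T-SQL; the data source location, file layout, and table are all hypothetical:

```sql
-- Point at a Hadoop cluster (hypothetical namenode address)
CREATE EXTERNAL DATA SOURCE MyHadoop
WITH (TYPE = HADOOP, LOCATION = 'hdfs://mynamenode:8020');

-- Describe the layout of the delimited files sitting in Hadoop
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Expose the Hadoop files as a table that looks local
CREATE EXTERNAL TABLE dbo.WebLogs (
    LogDate  DATETIME2,
    Url      NVARCHAR(400),
    Visitors INT
)
WITH (LOCATION = '/weblogs/',
      DATA_SOURCE = MyHadoop,
      FILE_FORMAT = PipeDelimited);

-- End-users then query it with plain T-SQL, joins to local tables included
SELECT Url, SUM(Visitors) AS TotalVisitors
FROM dbo.WebLogs
GROUP BY Url;
```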


SQL DW will work with existing data tools including Power BI for data visualization, Azure Machine Learning for advanced analytics, Azure Data Factory for data orchestration and Azure HDInsight.

The preview for Azure SQL Data Warehouse will be available later this calendar year.  You can sign up to be notified when the Azure SQL Data Warehouse preview becomes available.

I will have a lot more blogs about this new service in the coming months, so stay tuned!

More info:

Microsoft BUILDs its cloud Big Data story

Microsoft Announces Azure SQL Database elastic database, Azure SQL Data Warehouse, Azure Data Lake

Introducing Azure SQL Data Warehouse

Short introduction video SQL Data Warehouse – YouTube

Top reasons why enterprises should choose Azure SQL Data Warehouse

Video on new SQL Server 2016 features and SQL DW (minute 59 with demo): The SQL Server Evolution

Video Microsoft Azure SQL Data Warehouse Overview

Video Azure SQL Data Warehouse: Deep Dive


TechEd is now Microsoft Ignite

In case you were wondering what happened to the TechEd conferences, Microsoft is now bringing together the best of its previously individual events – the Management Summit and the Exchange, SharePoint, Lync, Project, and TechEd conferences – and taking them to the next level, based on what customers and partners have asked for.

The new event is called Microsoft Ignite and is going on now (May 4–8, 2015, Chicago, IL).


Types of NoSQL databases

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.  NoSQL is often interpreted as Not-only-SQL to emphasize that they may also support SQL-like query languages.  Most NoSQL databases are designed to store large quantities of data in a fault-tolerant way.

NoSQL is simply the term that is used to describe a family of databases that are all non-relational.  While the technologies, data types, and use cases vary wildly among them, it is generally agreed that there are four types of NoSQL databases:

  • Key-value stores – These databases pair keys to values.  An analogy is a file system where the path acts as the key and the file contents act as the value.  There are usually no fields to update; instead, the entire value other than the key must be updated if changes are to be made.  This simplicity scales well, but it can limit the complexity of the queries and other advanced features.  Examples are: Dynamo, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS, HyperDex, Azure Table Storage (see Redis vs Azure)
  • Graph stores – These excel at dealing with interconnected data.  Graph databases consist of connections, or edges, between nodes.  Both nodes and their edges can store additional properties such as key-value pairs.  The strength of a graph database is in traversing the connections between the nodes.  But they generally require all data to fit on one machine, limiting their scalability.  Examples include: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog, Sesame
  • Column stores – Relational databases store all the data in a particular table’s rows together on-disk, making retrieval of a particular row fast.  Column-family databases generally serialize all the values of a particular column together on-disk, which makes retrieval of a large amount of a specific attribute fast.  This approach lends itself well to aggregate queries and analytics scenarios where you might run range queries over a specific field.  Examples include: Accumulo, Cassandra, Druid, HBase, Vertica
  • Document stores – These databases store records as “documents” where a document can generally be thought of as a grouping of key-value pairs (it has nothing to do with storing actual documents such as a Word document).  Keys are always strings, and values can be stored as strings, numbers, Booleans, arrays, and other nested key-value pairs.  Values can be nested to arbitrary depths.  In a document database, each document carries its own schema — unlike an RDBMS, in which every row in a given table must have the same columns.  Examples include: Lotus Notes, Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB, Qizx, Cloudant, Azure DocumentDB (see MongoDB vs. Azure DocumentDB and An Overview of Microsoft Azure DocumentDB).  A sample document query follows this list.
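
As an example of querying a document store, here is a minimal sketch in Azure DocumentDB’s SQL dialect, assuming a hypothetical collection of customer documents with a nested address:

```sql
-- Find customers in Seattle; 'c' ranges over the documents in the
-- collection, and dotted paths reach into the nested JSON
SELECT c.id, c.name, c.address.city
FROM c
WHERE c.address.city = "Seattle"
```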


The CAP Theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

[Image: CAP theorem diagram]

Since you can only pick two guarantees, here is a list of NoSQL systems broken out by the two that they support:

[Image: NoSQL systems grouped by the two CAP guarantees they support]

Here is a quick summary of the most popular NoSQL products by group:

  • Key-value stores
    • Riak – Offers high availability, fault tolerance, operational simplicity, and scalability.  Riak is one of the more sophisticated data stores.  It offers most of the features found in others, then adds more control over duplication.  Although the basic structure stores pairs of keys and values, the options for retrieving them and guaranteeing their consistency are quite rich.
    • Redis – Like CouchDB and MongoDB, Redis stores documents or rows made up of key-value pairs.  Unlike the rest of the NoSQL world, it stores more than just strings or numbers in the value.  It will also include sorted and unsorted sets of strings as a value linked to a key, a feature that lets it offer some sophisticated set operations to the user.  There’s no need for the client to download data to compute the intersection when Redis can do it at the server.  Redis is also known for keeping the data in memory and only writing out the list of changes every once in a while.  Some don’t even call it a database, preferring instead to focus on the positive by labeling it a powerful in-memory cache that also writes to disk.  Traditional databases are slower because they wait until the disk gets the information before signaling that everything is OK.  Redis waits only until the data is in memory, something that’s obviously faster but potentially dangerous if the power fades at the wrong moment.
  • Document stores
    • MongoDB – Is designed for scale, flexible data aggregation and to store files of any size.  It has rich querying, high availability and full indexing support and is fast being adopted by many businesses.  Uses GridFS instead of HDFS.  MongoDB is designed for OLTP workloads. It can do complex queries, but it’s not necessarily the best fit for reporting-style workloads.  Or if you need complex transactions, it’s not going to be a good choice.  However, MongoDB’s simplicity makes it a great place to start.  MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster.  MongoDB is built to store data as an object in a dynamic schema, instead of a tabular database like SQL.
    • Couchbase – Can be used both as a document database that stores JSON documents and as a pure key-value database.  Click here for how it compares to MongoDB.  CouchDB stores documents, each of which is made up of a set of pairs linking a key with a value.  The most radical change is in the query: instead of a basic query structure that’s pretty similar to SQL, CouchDB searches for documents with two functions to map and reduce the data.  One formats the document, and the other makes a decision about what to include.
  • Column stores
    • Cassandra – Born at Facebook and built on Amazon’s Dynamo and Google’s BigTable, it is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure.  It is essentially a hybrid between a key-value and a column-oriented (or tabular) database.  Most agree it is better than HBase and MongoDB.  It can be used for both OLTP and data warehousing.  It replaces HDFS with the Cassandra File System (CFS).  Cassandra does not support joins or subqueries and emphasizes denormalization.
    • HBase – Patterned after Google BigTable, HBase is designed to provide fast, tabular access to the high-scale data stored on HDFS.  It is well suited for sparse data sets, which are common in many big data use cases.  HBase offers two broad use cases. First, it gives developers database-style access to Hadoop-scale storage, which means they can quickly read from or write to specific subsets of data without having to wade through the entire data store.  Most users and data-driven applications are used to working with the tables, columns, and rows of a database, and that’s what HBase provides.  Second, HBase provides a transactional platform for running high-scale, real-time applications. In this role, HBase is an ACID-compliant database that can run transactional applications.  That’s what conventional relational databases like Microsoft SQL Server are mostly used for, but HBase can handle the incredible volume, variety, and complexity of data encountered on the Hadoop platform.  Like other NoSQL databases, it doesn’t require a fixed schema, so you can quickly add new data even if it doesn’t conform to a predefined model.  It can be used for lightweight OLTP.  Tables are de-normalized for speed (so no joins), but updates can be slow.  HBase does not use Hadoop’s MapReduce capabilities directly, though HBase can integrate with Hadoop to serve as a source or destination of MapReduce jobs
  • Graph
    • Neo4j – Neo4J lets you fill up the data store with nodes and then add links between the nodes that mean things. Social networking applications are its strength.  The code base comes with a number of common graph algorithms already implemented.  If you want to find the shortest path between two people — which you might for a site like LinkedIn — then the algorithms are waiting for you.
    • OrientDB – It is a document-based database, but the relationships are managed as in graph databases with direct connections between records.  It supports schema-less, schema-full and schema-mixed modes.  It has a strong security profiling system based on users and roles and supports SQL as a query language.  OrientDB uses a new indexing algorithm called MVRB-Tree, derived from the red–black tree and from the B+ tree; this reportedly has benefits of having both fast insertions and fast lookups.


But I’ll leave you with this note: Although NoSQL databases are becoming more popular, according to DB-Engines Ranking they only make up about 12% of the total database market when you include relational databases!

More info:

Visual Guide to NoSQL Systems

List Of NoSQL Databases

Thumbtack: NoSQL Database Comparison by Ben Engber

MongoDB, Cassandra, and HBase — the three NoSQL databases to watch

What is Apache Cassandra?

DB-Engines database popularity ranking

NoSQL showdown: MongoDB vs. Couchbase

NoSQL standouts: New databases for new applications

The Rise and Fall of the NoSQL Empire (2007–2013)

What’s better for your big data application, SQL or NoSQL?

Considerations for using NoSQL technology on your next IT project

Difference between SQL and NoSQL : Comparision

SQL vs NoSQL Database Differences Explained with few Example DB

NoSQL Introduction

Getting Acquainted with NoSQL on Windows Azure

Data Store Map (full PDF map) – 451 Research

Which NoSQL Solution is Right For You?
