Making sense of Microsoft technology

In my role as a Data Platform Solution Architect (DPSA) at Microsoft, part of my responsibility is to keep up with all the Microsoft on-prem and cloud data-related technology and trends, as well as non-Microsoft technology and trends in areas such as Hadoop and NoSQL.  I work with Microsoft clients by first understanding their current data-related architectures and then educating them on which technologies and products they should consider in order to update their current architectures or to build new solutions.  There is a lot of knowledge transfer as most clients are so busy keeping what they have running that they are not aware of many of the products Microsoft has and how they all work together (I often say “they don’t know what they don’t know”).  I like to think of it as I help them put all the pieces of the puzzle together.  And as I mentioned in my previous blog, I try to show the clients The art of possible with the cloud.

It is a daunting task keeping up with all the technology as it changes so often.  Even though I spend half my days learning, I can barely keep my head above water, and that is with me just focusing on data-related products and not all the other Azure products such as networking, web and mobile app services, media services, etc. (we have “cloud solution architects” that cover those products).  To narrow down the technologies a client should consider, I will learn about their environment and ask a bunch of questions.  To help readers of my blog learn about the Microsoft technologies and which one’s might be a good fit, I wanted to list a few documents and blog posts:

Azure Quick Start Guide by me.  This is a short overview with helpful links to most of the Azure data platform and analytics products

Microsoft BI and IM Design Guidance by Rod Colledge (Data Platform Solution Architect at Microsoft).  This document contains a detailed description of the data platform and analytics products for Azure and on-prem and includes example architectures.  This is an excellent document that will give you a true understanding of many of the Microsoft products and when best to use each

Ivan Kosyakov (Data Platform Technical Architect at Microsoft) blog: Decision Tree for Big Data solutions and Decision Tree for Machine Learning.  Also check out his glossary.  These are great blogs to help you narrow down which products to use based on your use case

Azure Info Hub: An excellent list of all the Azure products that is updated frequently.  Includes a short description of each product and the latest news, along with training videos, e-books, whitepapers, tools, and even StackOverflow discussions

Hear are other useful blogs and presentations of mine:


Azure SQL Database vs SQL Data Warehouse

Relational databases vs Non-relational databases

Why use a data lake?


Relational databases vs Non-relational databases

Should I move my database to the cloud?

How does Microsoft solve Big Data?

Posted in Azure, SQLServerPedia Syndication | Leave a comment

The art of possible with the cloud

One thing I try to do in my role with Microsoft is to get clients to think of possible use cases for building solutions in the Azure cloud.  To set off light bulbs in their heads.  Sometimes this is accomplished by showing them demo’s of existing solutions.  It’s that old saying “they don’t know what they don’t know”, so I try to open up their minds to ideas they never thought of.  There are so many technologies and tools that it is easy to miss some of them, so I thought I would list the most popular one’s for building solutions in Azure:

  • Power BI: The most well-known tool for building reports and dashboards, but what is not so well-known is there are “content packs” that allow you to connect to hundreds of services such as Salesforce, Microsoft Dynamics, and Google Analytics and automatically create reports and dashboards in seconds.  See Connect to services with content packs for Power BI (65 content packs as of 9/15/16).  Also, to give you ideas on the types of reports you can build in Power BI, check out these samples broken out by department or by industry.  In addition, there is a gallery of custom visuals built by the Power BI community that you can use.  You can keep up with the Power BI updates via these links: monthly Power BI Desktop updates, monthly mobile apps for Power BI, near-weekly Power BI Service updates, and Power BI Desktop data sources (63 data sources as of 9/15/16)
  • Azure Machine Learning (Azure ML): A very easy way for designing, developing, testing and deploying predictive models.  What exactly does that mean?  Check out these use cases to get a better idea
  • Cognitive Services: A way to add vision, speech, language, knowledge, and search capabilities to your applications using intelligence APIs and SDKs.  How cool is that!  There are a ton of API’s that you can check out here
  • HoloLens: Augmented reality that enables you to interact with high‑definition holograms in your world.  In short, HoloLens is a smart-glasses headset in which the live presentation of physical real-world elements is incorporated with that of virtual elements (“holograms”) such that they are perceived to exist together in a shared environment.  We are all going to want to build solutions with this!  Here is one video and another to show you the art of possible.  Then check out some use cases.  Then get blown away by checking out holoportation
  • Microsoft Bots: Bots are bits of software that use artificial intelligence to converse in human terms.  Imagine a bot in Skype that performs a variety of tasks, like adding items to your calendar, booking travel or hotel rooms, or even pre-populating conversations to friends with text.  They are easy to build with the Microsoft Bot Framework, and you can even chat with Spock
  • Cortana Intelligence Gallery: The Cortana Intelligence Suite (CIS) is a set of tools that allow you to build end-to-end “big data” solutions.  The Cortana Intelligence Gallery is a community of developers and data scientists, including Microsoft, who have shared their analytics solutions built on CIS.  It’s a great way to see the art of possible, as well as a way to quickly build a solution by using an existing solution in the gallery as a starting point.  The gallery is categorized by Solutions, Experiments, Machine Learning APIs, Notebooks, Competitions, Tutorials, and Collections.  You can also browse by industry
  • Azure IoT Suite preconfigured solutions: IoT is the internetworking of physical devices, vehicles, buildings and other items—embedded with electronics, software, sensors, actuators, and network connectivity that enable these objects to collect and exchange data (see case studies).  The Azure IoT Suite preconfigured solutions are implementations of common IoT solution patterns that you can deploy to Azure.  You can use the preconfigured solutions as a starting point for your own IoT solutions or to learn about common patterns in IoT solution design and development
  • Microsoft Open Source:  Who would have ever thought Microsoft would embrace open source?  Well they do, big time!  You can build solutions in Azure using Linux as part of the Open Cloud.  There are over a thousand pre-configured software images in the Virtual Machines Marketplace and hundreds of Azure Quickstart templates that you can use to create a VM, many which include open source software.  In a matter of minutes you can be building solutions such as a data lake in Hadoop, or combine open source software with Microsoft technologies to create a modern data warehouse
Posted in Azure, Power BI, SQLServerPedia Syndication | 2 Comments

SQL Server 2016 real-time operational analytics

SQL Server 2016 introduces a very cool new feature called real-time operational analytics, which is the ability to run both analytics (OLAP) and OLTP workloads on the same database tables at the same time.  This allows you to eliminate the need for ETL and a data warehouse in some cases (using one system for OLAP and OLTP instead of creating two separate systems).  This will help to reduce complexity, cost, and data latency.

Real-time operational analytics targets the scenario of a single data source such as an enterprise resource planning (ERP) application on which you can run both the operational and the analytics workload.  This does not replace the need for a separate data warehouse when you need to integrate data from multiple sources before running the analytics workload or when you require extreme analytics performance using pre-aggregated data such as cubes.

Real-time operational analytics uses an updatable nonclustered columnstore index (NCCI).  The columnstore index maintains a copy of the data, so the OLTP and OLAP workloads run against separate copies of the data.  This minimizes the performance impact of both workloads running at the same time.  SQL Server automatically maintains index changes so that OLTP changes are always up-to-date for analytics.  This makes it possible and practical to run analytics in real-time on up-to-date data. This works for both disk-based and memory-optimized tables.

To accomplish this, all you need to do is to create an NCCI on one or more tables that are needed for analytics.  SQL Server query optimizer automatically chooses NCCI for analytics queries while your OLTP workload continues to run using the same btree indexes as before.


The analytics query performance with real-time operational analytics will not be as fast as you can get with a dedicated data warehouse but the key benefit is the ability to do analytics in real-time.  Some businesses may choose to do real-time operational analytics while still maintaining a dedicated data warehouse for extreme analytics as well as incorporating data from other sources.

More info:

Get started with Columnstore for real time operational analytics

Real-Time Operational Analytics: DML operations and nonclustered columnstore index (NCCI) in SQL Server 2016

Real-Time Operational Analytics – Overview nonclustered columnstore index (NCCI)

Real-Time Operational Analytics Using In-Memory Technology

SQL Server 2016 Operational Analytics (video)

Real Time Operational Analytics in SQL Server 2016 (video)

Real-time Operational Analytics in SQL Server 2016 – Part 1

Posted in SQL Server 2016, SQLServerPedia Syndication | 1 Comment

Azure SQL Database new performance level

A new performance level for Azure SQL Database was recently announced, called P15.  This new offering is more than two times more powerful than the next best offering, P11.  P15 offers 4000 database transaction units (DTU) where P11 offered 1750 DTU’s.  Also, the max concurrent workers and concurrent logins increased from 2,400 to 6,400; the max concurrent sessions stayed the same at 32,000; and the max In-memory OLTP storage increased from 14GB to 32GB.

More info:

Azure SQL Database new premium performance level P15 generally available

Posted in Azure SQL Database, SQLServerPedia Syndication | Leave a comment

Azure SQL Database vs SQL Data Warehouse

I am sometimes asked to compare Azure SQL Database (SQL DB) to Azure SQL Data Warehouse (SQL DW).  The most important thing to remember is SQL DB is for OLTP (i.e. applications with individual updates, inserts, and deletes) and SQL DW is not as it’s strictly for OLAP (i.e. data warehouses).  So if your going to build a OLTP solution, you would choose SQL DB.  However, both products can be used for building a data warehouse (OLAP).  With that in mind, here is a list of the differences:

I have other blogs that cover SQL DB and SQL DW.

Posted in Azure SQL Database, Azure SQL DW, SQLServerPedia Syndication | 2 Comments

What is the Lambda Architecture?

Lambda architecture is a data-processing architecture designed to handle massive quantities of data (i.e. “Big Data”) by using both batch-processing and stream-processing methods.  This idea is to balance latency, throughput, scaling, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data.  The two view outputs may be joined before presentation.

This allows for a way to bridge the gap between the historical single version of the truth and the highly sought after “I want it now” real-time solution.  By combining traditional batch processing systems with stream consumption tools the needs of both can be achieved with one solution.

The high-level overview of the Lambda architecture is expressed here:

Untitled picture

A brief explanation of each layer:

Data Consumption: This is where you will import the data from all the various source systems, some of which may be streaming the data.  Others may only provide data once a day.

Stream Layer: It provides for incremental updating, making it the more complex layer.  It trades accuracy for low latency, looking at only recent data.  Data in here may be only seconds behind, but the trade-off is the data may not be clean.

Batch Layer: It looks at all the data at once and eventually corrects the data in the stream layer.  It is the single version of the truth, the trusted layer, where there is usually lots of ETL and a traditional data warehouse.  This layer is built using a predefined schedule, usually once or twice a day, including importing the data currently stored in the stream layer.

Presentation Layer: Think of it as the mediator, as it accepts queries and decides when to use the batch layer and when to use the speed layer.  Its preference would be the batch layer as that has the trusted data, but if you ask it for up-to-the-second data, it will pull from the stream layer.  So it’s a balance of retrieving what we trust versus what we want right now.

A lambda architecture solution using Azure tools might look like this, using a vehicle with IoT sensors as an example:


In the above diagram, Event Hubs is used to ingest millions of events in real-time.  Stream Analytics is used for 1) real-time aggregations on data and 2) spool data into long-term storage (SQL Data Warehouse) for batch.  Machine Learning is used in real-time for anomaly detection on tire pressure, oil level, engine temp, etc, to predict vehicles requiring maintenance.  The data in the Azure Data Lake Storage is used for rich analytics using HDInsight and Machine Learning, orchestrated by the Azure Data Factory (for e.g. aggressive driving analysis over past year).  Power BI and Cortana are used for the presentation layer, and the Azure Data Catalog is the metadata repository for all the data sets.

Using Hadoop technologies might provide a solution that looks like this:


Be aware this is a complicated architecture.  It will need a number of hardware resources and difference code bases for each layer, with each possibly using different technologies/tools.  The complexity of the code can be 3-4 times a traditional data warehouse architecture.  So you will have to weigh the costs versus the benefit of being able to use data slightly newer than a standard data warehouse solution.

More info:

The Lambda architecture: principles for architecting realtime Big Data systems

How to beat the CAP theorem

Questioning the Lambda Architecture

Lambda Architecture: Low Latency Data in a Batch Processing World

Lambda Architecture for the DWH

Lambda Architecture: Design Simpler, Resilient, Maintainable and Scalable Big Data Solutions

The Lambda Architecture and Big Data Quality

Posted in SQLServerPedia Syndication | 2 Comments

Multi-tenant databases in the cloud

For companies that sell an on-prem software solution and are looking to move that solution to the cloud, a challenge arises on how to architect that solution in the cloud.  For example, say you have a software solution that stores patient data for hospitals.  You sign up hospitals, install the hardware and software and the associated databases on-prem (at the hospital or a co-location facility), and load their patient data.  Think of each hospital as a “tenant”.  Now you want to move this solution to the cloud and get the many benefits that come with it, the biggest being the time to get a hospital up and running, which can go from months on-prem to hours in the cloud.  Now you have some choices: keep each hospital separate with their own VMs and databases (“single tenant”), or combining the data for each hospital into one database (“multi-tenant”).  For another example, you can simply be creating a PaaS application similar to Salesforce.  Here I’ll describe the various cloud strategies using Azure SQL Database, which is for OLTP applications, and Azure SQL Data Warehouse, which is for OLAP applications (see Azure SQL Database vs SQL Data Warehouse):

Separate Servers\VMs

You create VMs for each tenant, essentially doing a “lift and shift” of the current on-premise solution.  This provides the best isolation possible and it’s regularly done on-premises, but it’s also the one that doesn’t enable cutting costs, since each tenant has it’s own server, sql, license and so on.  Sometimes this is the only allowable option if you have in your client contract that their data will be virtual machine-isolated from other clients.  Some cons: table updates must be replicated across all the servers (i.e. updating reference tables), there is no resource sharing, and you need multiple backup strategies across all the servers.

Separate Databases

A new database is created and assigned when a tenant is provisioned.  You can land a number of the databases on each VM (i.e. each VM handles ten tenants), or create a database using Azure SQL Database.  This is often used in order if  you need to provide isolation for each customer, because we can associate different logins, permissions and so on to each database.  If using Azure SQL Database, be aware the database size limit is 1TB.  If you have a client database that will exceed that, you can use sharding (via Elastic Database Tools) or use cross-database queries (see Scaling Azure SQL Database and Cross-database queries in Azure SQL Database) with row-level security (see Multi-tenant applications with elastic database tools and row-level security).  The lower service tier for SQL Database has a max database size of 2GB, so you might be paying for storage that you don’t really use.  If using Azure SQL Data Warehouse, you have no limit on database size.  Some other cons: A different connection pool is required per database, updates must be replicated across all the databases, there is no resource sharing (unless using Elastic Database Pools) and you need multiple backup strategies across all the databases.

Separate Schemas

Also a very good way to achieve multi-tenancy but at the same time share some resources since everything is inside the same database, but the schemas used are different, having a separate schema for each tenant.  That allows you to even customize a specific tenant without affecting others.  And you save costs by only paying for one database (which can fit on SQL Data Warehouse not matter what the size) or a handful of databases if using SQL Database (i.e. ten tenants per database).  Some of the cons: You need to replicate all the database objects in every schema, so the number of objects can increase indefinitely, updates must be replicated across all the schemas, the connection pool for the database must maintain a different connection per tenant (or set of credentials), a different user is required per tenant (which is stored at server level) and you have to backup that user independently.

A variation of this using SQL Database is to split the tenants over multiple databases, but not to use separate schemas for performance reasons.  The is done by assigning a distinct set of tenants to each database using a partitioning strategy such as hash, range or list partitioning.  This data distribution strategy is oftentimes referred to as sharding.

Row Isolation

Everything is shared in this option, server, database and even schema.  All the data for the tenants are within the same tables in one database.  The only way they are differentiated is based on a TenantId or some other column that exists on the table level.  Another big benefit is code changes: with this option you only have one spot to change code (i.e. table structure).  With the other options you will have to roll out code changes to many spots.  You will need to use row-level security or something similar when you need to limit the results to an individual tenant.  Or you can create views or use stored procedures to filter tenants.  You also have the benefit of ease-of-use and performance when you need to aggregate results over multiple tenants.  Azure SQL Data Warehouse is a great solution for this, as there is no limit to the database size.

But be aware that there is a limit of 32 concurrent queries and 1,024 concurrent connections, so if you have thousands of users who will be hitting the database at the same time, you may want to create data marts in Azure SQL Database or create SSAS cubes.  This was a limit imposed since there is no resource governor or CPU query scheduler like there is in SQL Server.  But the benefit is each query gets its own resources and it won’t affect other queries (i.e. you don’t have to worry about a query taking all resources and blocking everyone else).  There are also resource classes that allow more memory and CPU cycles to be allocated to queries run by a given user so they run faster, with the trade-off that it reduces the number of concurrent queries that can run.

A great article that discusses the various multi-tenant models in detail and how multi-tenancy is supported with Azure SQL Database is Design Patterns for Multi-tenant SaaS Applications with Azure SQL Database.

As you can see, there are lot’s of options to consider!  It becomes a balance of cost, performance, ease-of-development, east-of-use, and security.

More info:

Tips & Tricks to Build Multi-Tenant Databases with SQL Databases

Multitenancy in SQL Azure

Choosing a Multi-Tenant Data Architecture

Multi-Tenant Data Architecture

Multi-Tenant Data Isolation in SQL Azure

Multi Tenancy and Windows Azure

Posted in Azure, Azure SQL Database, Azure SQL DW, SQLServerPedia Syndication | 1 Comment

Azure SQL Data Warehouse is now GA

The Azure SQL Data Warehouse (SQL DW), that I blogged about here, is now generally available.  Here is the official announcement.

In brief, SQL DW is a fully managed data-warehouse-as-a-service that you can provision in minutes and scale up in seconds.  With SQL DW, storage and compute scale independently.  You can dynamically deploy, grow, shrink, and even pause compute, allowing for cost savings.  Also, SQL DW uses the power and familiarity of T-SQL so you can integrate query results across relational data in your data warehouse and non-relational data in Azure blob storage or Hadoop using PolyBase.  SQL DW offers an availability SLA of 99.9%, the only public cloud data warehouse service that offers an availability SLA to customers.

SQL DW uses an elastic massively parallel processing (MPP) architecture built on top of the SQL Server 2016 database engine.  It allows you to interactively query and analyze data using existing SQL-based tools and business intelligence applications.  It uses column stores for high performance analytics and storage compression, a rich collection of aggregation capabilities of SQL Server, and state of the art query optimization capabilities.

Two customer case studies using SQL DW in production were just published: AGOOP and P:Cubed.

Also note that until recently, you had to use SSDT to connect to SQL DW.  But with the July 2016 update of SSMS, you can now connect to SQL DW using SSMS (see Finally, SSMS will talk to Azure SQL DW).

More info:

Azure SQL Data Warehouse Hits General Availability

Introduction to Azure SQL Data Warehouse (video)

Microsoft Azure SQL Data Warehouse Overview (video)

Azure SQL Data Warehouse Overview with Jason Strate (video)

A Developers Guide to Azure SQL Data Warehouse (video)

Posted in Azure SQL DW, SQLServerPedia Syndication | 2 Comments

Virtualization does not equal a private cloud

I see a lot of confusion when it comes to creating a private cloud.  Many seem to think that by virtualizing your servers in your local on-premise data center, you have created a private cloud.  But by itself, virtualization falls far short of the characteristics of a private cloud and is really just server consolidation and data center automation.  The original five cloud characteristics, as stated by the National Institute of Standards and Technology (NIST), are what defines a cloud:

  1. On-demand self-service: A user can provision a virtual machine (or other resources such as storage or networks) as needed automatically without intervention from IT
  2. Broad network access: Virtualization is available for all internal customers that are on the network through various client platforms (e.g., mobile phones, tablets, laptops, and workstations)
  3. Resource pooling (multitenancy): Computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand.  There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources (storage, processing, memory, and network bandwidth) but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter)
  4. Rapid elasticity (hyper-scaling): Resources can be scaled up or down manually or in some cases automatically to commensurate with demand.  To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.  There is no need for IT to provision VMs individually and install the OS and software
  5. Measured service (operations): As opposed to IT charging costs back to other departments based on traditional budgeting, costs are based on actual usage.  Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts).  Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service

Since virtualization only solves #3, a lot more should be done to create a private cloud.  Also, a cloud should also support Platform-as-a-service (PaaS) to allow for application innovation.  Fortunately there are products to add the other characteristics to give you a private cloud, such as Microsoft’s Azure Stack.  And of course you can always use a public cloud.

The main difference between a private cloud and a public cloud is in scaling.  A public cloud like Azure has no “ceiling” so resources can be added with no limits (i.e. hyper-scaling).  A private cloud has a ceiling and you may have to wait for more hardware to be installed before you can scale.  Comparing a public cloud vs a private cloud vs a hybrid approach will be the topic of another blog.

More info:

Private cloud vs. virtualization

Which Types of Workloads to Run in Public, Private Clouds

Posted in Azure, SQLServerPedia Syndication | 1 Comment

Azure SQL Data Warehouse pricing

The pricing for Azure SQL Data Warehouse (SQL DW) consists of a compute charge and a storage charge.  For compute, it is not based on hardware configuration but rather by data warehouse units (DWU).  DWU can be scaled up or down via a sliding bar in just a couple of minutes with no down time.  You pay for DWU blocks, based on up time (you can pause your SQL DW database and not have to pay for compute while paused).  When paused, storage is still available and can be used by other resources.

You must pay for storage, even when paused, but there is no limit to the amount of data you can put into storage.

Below is some examples of the compute pricing based in the East US region.  The pricing comes from the Azure pricing calculator:

SQL Data Warehouse, 100 DWU, $1,125/month
SQL Data Warehouse, 500 DWU, $5,625/month
SQL Data Warehouse, 1000 DWU, $11,250/month
SQL Data Warehouse, 1500 DWU, $16,875/month
SQL Data Warehouse, 2000 DWU, $22,500/month
SQL Data Warehouse, 3000 DWU, $33,750/month
SQL Data Warehouse, 6000 DWU, $67,500/month

Data warehouses will use Premium Disk storage (P30).  Below is some examples of the storage pricing based in the East US region:

SQL Data Warehouse, RA-GRS Page Blob, 1TB = $135/month
SQL Data Warehouse, RA-GRS Page Blob, 10TB = $1,351/month
SQL Data Warehouse, RA-GRS Page Blob, 100TB = $13,517/month
SQL Data Warehouse, RA-GRS Page Blob, 1PB = $135,170/month

Storage transactions are not billed; customers only pay for data stored, not storage transactions.  Inbound data transfers are free.  Outbound data transfers are charged at regular data transfer rates.  Note you could store lesser-accessed data in Azure Blob Storage and access it via PolyBase to save storage costs.

More info:

SQL Data Warehouse Pricing

Understanding Windows Azure Storage Billing – Bandwidth, Transactions, and Capacity

Understand your bill for Microsoft Azure

Posted in Azure SQL DW, SQLServerPedia Syndication | 1 Comment