Use cases of various products for a big data cloud solution

There is a tremendous number of cloud-based Microsoft products for building big data solutions.  It’s great that there are so many products to choose from, but it does lead to confusion about which products are best for particular use cases and how they all fit together.  My job as a Microsoft Cloud Solution Architect is to help companies learn about all the products and choose the best ones for building their solution.  Based on a recent architecture design session with a customer, I wanted to list the products and use cases we discussed for the big data solution they want to build in the cloud (focusing on compute and data storage products, not ingestion/ETL, real-time streaming, advanced analytics, or reporting; also, only PaaS solutions are included – no IaaS):

  • Azure Data Lake Store (ADLS): A high-throughput distributed file system built for cloud-scale storage.  It can ingest any data type, from videos and images to PDFs and CSVs.  This is the “landing zone” for all data.  It is HDFS-compatible, meaning all products that work against HDFS will also work against ADLS.  Think of ADLS as the place all other products will use as the source of their data.  All data will be sent here, including on-prem data, cloud-based data, and data from IoT devices.  This landing zone is typically called the Data Lake, and there are many great reasons for using one (see Data lake details, Why use a data lake?, and the presentation Big data architectures and the data lake)
  • Azure HDInsight (HDI): Under the covers, HDInsight is simply Hortonworks HDP 2.6, which contains 22 open source products such as Hadoop (Common, YARN, MapReduce), Spark, HBase, Storm, and Kafka.  You can use any of these, or install any other open source products, and they can all use the data in ADLS (HDInsight simply connects to ADLS and uses it as its storage source)
  • Azure Data Lake Analytics (ADLA): This is a distributed analytics service built on Apache YARN.  You submit a job to the service, and the service automatically runs it in parallel in the cloud, scaling to process data of any size.  Included with ADLA is U-SQL, which has a scalable distributed query capability enabling you to efficiently analyze data whether it is structured (CSV) or not (images), in Azure Data Lake Store and across Azure Blob Storage, SQL Server in Azure, Azure SQL Database, and Azure SQL Data Warehouse.  Note that U-SQL supports batch queries only (it does not support interactive queries) and does not handle persistence or indexing; see the U-SQL sketch after this list
  • Azure Analysis Services (AAS): This is a PaaS version of SQL Server Analysis Services (SSAS).  It allows you to create an Azure Analysis Services tabular model (i.e. a cube), which allows for much faster query and reporting processing compared to going directly against a database or data warehouse.  A key AAS feature is that it can be scaled to support high availability and high concurrency.  It also creates a semantic model over the raw data to make it much easier for business users to explore the data.  It pulls data from ADLS, aggregates it, and stores it in AAS.  The additional cost of adding a cube to your solution is the time to process the cube and slower performance for ad-hoc (not pre-determined) queries, but a cube brings additional benefits – see Why use a SSAS cube?
  • Azure SQL Data Warehouse (SQL DW): This is a SQL-based, fully managed, petabyte-scale cloud data warehouse.  It’s highly elastic: you can set it up in minutes and scale capacity in seconds.  You can scale compute and storage independently, which allows you to burst compute for complex analytical workloads.  It is an MPP technology that shines when used for ad-hoc queries in relational format.  Data must be copied from ADLS into SQL DW, but this can be done quickly using PolyBase.  Because compute and storage are separated, you can pause SQL DW to save costs (see SQL Data Warehouse reference architectures)
  • Azure Cosmos DB: This is a globally distributed, multi-model (key-value, graph, and document) database service.  It fits into the NoSQL camp by having a non-relational model (supporting schema-on-read and JSON documents) and working really well for large-scale OLTP solutions (it can also be used as a data warehouse when combined with Apache Spark – more on that in a later blog).  See Distributed Writes and the presentation Relational databases vs Non-relational databases.  Data must be imported into it from ADLS using Azure Data Factory
  • Azure Search: This is a search-as-a-service cloud solution that gives developers APIs and tools for adding a rich full-text search experience over your data.  You can store indexes in Azure Search with pointers to objects sitting in ADLS.  Azure Search is rarely used in data warehouse solutions but if queries are needed such as getting the number of records that contain “win”, then it may be appropriate.  Azure Search supports a pull model that crawls a supported data source such as Azure Blob Storage or Cosmos DB and automatically uploads the data into your index.  It also supports the push model for other data sources such as ADLS to programmatically send the data to Azure Search to make it available for searching.  Note that Azure Search is built on top of ElasticSearch and uses the Lucene query syntax
  • Azure Data Catalog: This is an enterprise-wide metadata catalog that makes data asset discovery straightforward.  It’s a fully managed service that lets you register, enrich, discover, understand, and consume data sources such as ADLS.  It is a single, central place for all of an organization’s users to contribute their knowledge and build a community and culture of data.  Without it you are in danger of a lot of data duplication and wasted effort
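To make the ADLA/U-SQL bullet above more concrete, here is a minimal U-SQL sketch of the kind of batch job you would submit to ADLA.  The file paths, column names, and schema are invented for illustration only; they are not from any real solution.

```usql
// Hypothetical input file and schema, used only to illustrate the U-SQL pattern
@searchlog =
    EXTRACT UserId int,
            Region string,
            Query  string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

// Batch aggregation; results are written back to ADLS rather than returned interactively
@queriesByRegion =
    SELECT Region,
           COUNT(*) AS QueryCount
    FROM @searchlog
    GROUP BY Region;

OUTPUT @queriesByRegion
    TO "/output/queries_by_region.csv"
    USING Outputters.Csv();
```

The job runs as a distributed batch across however many AUs you assign to it, and the output lands back in the data lake as files rather than in an interactive result set.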

Azure Blob storage can be used instead of ADLS or in combination with it.  When comparing the two, Blob storage has the advantage of lower cost, since there are now three Azure Blob storage tiers (Hot, Cool, and Archive), all of which are less expensive than ADLS.  The advantage of ADLS is that there are no limits on account size or file size (Blob storage has a 5 PB account limit and a 4.75 TB file limit).  ADLS is also faster, as files are auto-sharded/chunked, whereas in Blob storage they remain intact.  ADLS supports Azure Active Directory authentication, while Blob storage supports SAS keys.  ADLS also supports WebHDFS, while Blob storage does not (it supports WASB, which is a thin layer over Blob storage that exposes it as an HDFS file system).  Finally, while Blob storage is in all Azure regions, ADLS is only in two US regions (East, Central) and North Europe (other regions coming soon).  See Comparing Azure Data Lake Store and Azure Blob Storage.

Now that you have a high-level understanding of all the products, the next step is to determine the best combination to use to build a solution.  If you want to use Hadoop and don’t need a relational data warehouse the product choices may look like this:

Most companies will use a combination of HDI and ADLA.  The main advantages of ADLA over HDI are that there is nothing you have to manage (e.g. performance tuning); you only incur costs when running jobs, whereas HDI clusters are always running and incurring costs whether you are processing data or not; and you can scale individual queries independently of each other instead of having queries fight for resources in the same HDInsight cluster (predictable vs unpredictable performance).  In addition, ADLA is always available, so there is no startup time to create a cluster as with HDI.  HDI has the advantage that it comes with more products (e.g. Kafka) and you can customize it (e.g. install additional software), whereas with ADLA you cannot.  When submitting a U-SQL job to ADLA you specify the resources to use via Analytics Units (AUs).  Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM, and you can go as high as 450 AUs.  For HDI you can give more resources to your query by increasing the number of worker nodes in a cluster (limited by the region’s max core count per subscription, but you can contact billing support to increase your limit).

Most of the time a relational data warehouse will be part of your solution, with the biggest reasons being your existing staff’s familiarity with relational databases and the need to present an easier-to-understand presentation layer to end users so they can create their own reports (self-service BI).  A solution that adds a relational database may look like this:

The Data Lake technology can be ADLS or Blob storage, or even Cosmos DB.  The main reasons against using Cosmos DB as your Data Lake are cost and having to convert all files to JSON.  A good reason for using Cosmos DB as a Data Lake is that it enables you to have a single underlying datastore that serves both operational queries (low latency, high concurrency, low compute – direct from Cosmos DB) and analytical queries (high latency, low concurrency, high compute – via Spark on Cosmos DB).  By consolidating to a single data store you do not need to worry about the data consistency issues that come from maintaining multiple copies across multiple data stores.  Additionally, Cosmos DB has disaster recovery built in, easily allowing you to replicate data across Azure regions with automatic failover (see How to distribute data globally with Azure Cosmos DB), while ADLS requires replication and failover to be done manually (see Disaster recovery guidance for data in Data Lake Store).  Blob storage has disaster recovery built in via Geo-redundant storage (GRS), but failover must be initiated by Microsoft (see Redundancy Options in Azure Blob Storage).

An option to save costs is to put “hot” data in Cosmos DB, and warm/cold data in ADLS or Blob storage while using the same reporting tool, Power BI, to access the data from either of those sources as well as many others (see Data sources in Power BI Desktop and Power BI and Excel options for Hadoop).

If Cosmos DB is your data lake or used as your data warehouse (instead of SQL DW/DB in the picture above), you can perform ad-hoc queries using familiar SQL-like grammar over JSON documents (including aggregate functions like SUM) without requiring explicit schemas or creation of secondary indexes.  This is done via the REST API, JavaScript, .NET, Node.js, or Python.  Querying can also be done via Apache Spark on Azure HDInsight, which provides additional benefits such as faster performance and SQL statements such as GROUP BY (see Accelerate real-time big-data analytics with the Spark to Azure Cosmos DB connector).  Check out the Query Playground to run sample queries on Cosmos DB using sample data.  Note the query results are in JSON instead of rows and columns.
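As a rough illustration of that SQL-like grammar (the container alias c and the property names below are hypothetical, not from any real collection), here are two queries of the kind you could run in the Query Playground or through the SDKs:

```sql
-- Count documents whose description contains "win" (property names are made up)
SELECT VALUE COUNT(1)
FROM c
WHERE CONTAINS(c.description, "win")

-- Aggregate without declaring a schema or secondary index; note there is no GROUP BY,
-- which at the time of writing requires the Spark connector mentioned above
SELECT VALUE SUM(c.amount)
FROM c
WHERE c.region = "West"
```

As noted above, the results come back as JSON (a single-element array for these aggregates) rather than as rows and columns.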

You will need to determine if your solution will have dashboard and/or ad-hoc queries.  Your choice of products will depend on the need to support one or both of those query types.  For ad-hoc queries, you have to determine the acceptable performance for those queries, as that will determine whether you need an SMP or an MPP solution (see Introducing Azure SQL Data Warehouse).  For dashboard queries (e.g. from Power BI) it’s usually best to have those queries go against AAS to get top-notch performance, and because SQL DW has a 32-concurrent-query limit (and one dashboard can have a dozen or so queries).  Complex queries, sometimes referred to as “last mile” queries, may be too slow for an SMP solution (e.g. SQL Server, Azure SQL Database) and require an MPP solution (e.g. SQL DW).

The diagram above shows SQL DW or Azure SQL Database (SQL DB) as the data warehouse.  To decide which is the best option, see Azure SQL Database vs SQL Data Warehouse.  With a clustered columnstore index, SQL DB competes very well in the big data space, and with the addition of R/Python stored procedures it becomes one of the fastest-performing machine learning solutions available.  Be aware that the max database size for SQL DB is 4 TB, but there will soon be an option called SQL DB Managed Instance that supports a much larger max database size.  See the presentations Should I move my database to the cloud? and Introducing Azure SQL Database.
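As a hedged sketch of those two capabilities (the table and column names are invented, and in-database R in Azure SQL Database was still in preview at the time of writing), the T-SQL looks something like this:

```sql
-- Columnstore for fast analytic scans over a large fact table (names are hypothetical)
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
    ON dbo.FactSales;

-- In-database R: compute an aggregate in place instead of moving the data out
EXEC sp_execute_external_script
     @language = N'R',
     @script = N'OutputDataSet <- data.frame(AvgAmount = mean(InputDataSet$Amount))',
     @input_data_1 = N'SELECT Amount FROM dbo.FactSales';
```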

You will also need to determine if your solution will have batch and/or interactive queries.  All the products support batch queries, but ADLA does not support interactive queries (so you could not use the combination of Power BI and ADLA).  If you want to stay within the Hadoop world you can use the HDInsight cluster types Spark on HDInsight or HDInsight Interactive Query (Hive LLAP) for interactive queries against ADLS or Blob storage (see General availability of HDInsight Interactive Query – blazing fast queries on hyper-scale data), and you can use AtScale instead of AAS to build cubes/OLAP within Hadoop.  AtScale will work against data in ADLS and Blob storage via HDInsight.

Whether to have users report off of ADLS or off of a relational database and/or a cube is a balance between giving users data quickly and having them do the work to join, clean, and master the data (getting IT out of the way) versus having IT make multiple copies of the data and clean, join, and master it so it is easier for users to report off of.  The risk in the first case is that users repeat the clean/join/master work, do it wrong, and get different answers to the same question (falling into the old mistake of assuming the data lake does not need data governance and will magically make all the data come out properly – not understanding that HDFS is just a glorified file folder).  Another risk in the first case is performance, because the data is not laid out efficiently.  Most solutions incorporate both, allowing “power users” to access the data quickly via ADLS while all the other users access the data in a relational database or cube.

Digging deeper, if you want to run reports straight off of data in ADLS, be aware that security there is file-based, so you may want to create a cube to get row-level security and also for faster performance, since ADLS is a file system and does not have indexes (although you can use a product such as Jethro Data to create indexes within ADLS/HDFS).  Running reports off of ADLS compared to a database also has disadvantages such as limited support for concurrent users; lack of indexing, a metadata layer, a query optimizer, and memory management; no ACID support or data integrity; and security limitations.

The decision on which products to use is a balance: on one side, having multiple copies of the data, the additional cost that incurs, and the burden of maintaining and learning multiple products; on the other, less flexibility in reporting and slower performance.  Also, while incorporating more products into a solution means it takes longer to build, the huge benefit is that you “future proof” your solution to handle any data in the future, no matter the size, type, or frequency.

The bottom line is there are so many products with so many combinations of putting them together that a blog like this can only help so much – you may wind up needing a solution architect like me to help you make sense of it all 🙂

More info:

My presentation Choosing technologies for a big data solution in the cloud

Using Azure Analysis Services on Top of Azure Data Lake Store

Understanding WASB and Hadoop Storage in Azure

Presentation Data on Azure: The big picture

Presentation Understanding big data on Azure – structured, unstructured and streaming

Presentation Architect your big data solutions with SQL Data Warehouse and Azure Analysis Services

Presentation Building Petabyte scale Interactive Data warehouse in Azure HDInsight

Presentation Building modern data pipelines with Spark on Azure HDInsight

Presentation Azure Blob Storage: Scalable, efficient storage for PBs of unstructured data

Presentation Modernizing ETL with Azure Data Lake: Hyperscale, multi-format, multi-platform, and intelligent

Presentation Azure Cosmos DB: The globally distributed, multi-model database


What’s new in SQL Server 2017 presentation

I just uploaded a new presentation called “What’s new in SQL Server 2017”.  It covers all the new features in SQL Server 2017 (which went GA this past Monday), as well as details on upgrading and migrating to SQL Server 2017 or to Azure SQL Database.  Check out the slides (note there is a “download” button if you wish to have the PowerPoint presentation).

This is a good time to mention my other related presentations: Should I move my database to the cloud?, Implement SQL Server on an Azure VM, Introducing Azure SQL Database, and HA/DR options with SQL Server in Azure and hybrid.

Hope you find these useful!


Microsoft Ignite Announcements

Many product announcements were made this week at Microsoft Ignite, and I wanted to give a quick overview of all the data platform related announcements:

  • SQL Server 2017 on Linux, Windows, and Docker, generally available on October 2nd.  SQL Server 2017 is being released simultaneously for Windows and various flavors of Linux: Red Hat Enterprise Linux 7.3, SUSE Linux Enterprise Server 12, Ubuntu, and Docker (the official Docker image is based on Ubuntu 16.04).  The performance of SQL Server on Linux vs Windows is “basically the same”.  However, not everything has been ported: there are no Reporting Services, Analysis Services, Machine Learning Services, transactional replication, Stretch DB, or FileTable (see Unsupported features and services).  Management tools remain for the most part Windows-only, though command-line tools work.  The major new features are graph query support (see the sketch after this list), Python in Machine Learning Services, SSIS scale-out, and Adaptive Query Processing and Automatic Tuning for better query optimization.  Learn more and see What’s new in SQL Server 2017
  • Azure Database Migration Service (DMS) and Azure SQL Database Managed Instance, public preview.  The new Managed Instance offering within SQL Database provides near-complete SQL Server compatibility and network isolation for the easiest lift-and-shift to Azure.  DMS is a fully managed, first-party Azure service that enables customers to easily migrate their on-premises SQL Server databases to Azure SQL Database Managed Instance and SQL Server in Azure Virtual Machines with minimal to no downtime.  Customers can maximize existing license investments with discounted rates on Managed Instance using a new Azure Hybrid Benefit for SQL Server.  Sign up for news on availability
  • Azure Machine Learning, new capabilities in public preview.  Updates connect every element of the data science process with enhanced productivity and collaboration for AI developers and data scientists at any scale, enabling them to start building right away with their choice of tools and frameworks.  The updated platform includes an enhanced data cleansing and prepping tool called ML Workbench to start the modeling process sooner.  It is a client application that runs on Windows and Mac, is targeted at data scientists who are not users of Visual Studio, and integrates with popular open source data science toolkits such as Python scikit-learn, Jupyter Notebooks, and Matplotlib.  It integrates with the cloud by seamlessly moving the heavy lifting to GPU-powered VMs in Azure.  Other new capabilities include the Azure Machine Learning Experimentation service, which allows developers and data scientists to increase their rate of experimentation, and the Model Management service, which provides deployment, hosting, versioning, management, and monitoring for models in Azure, on-premises, and on IoT Edge devices.  These new features will help data scientists develop, deploy, and manage machine learning and AI models at any scale, wherever data lives: in the cloud, on-premises, and at the edge.  Learn more on the Azure Machine Learning page and Diving deep into what’s new with Azure Machine Learning
  • Microsoft Cognitive Services updates.  Includes general availability of the Text Analytics API, a cloud-based service for language processing such as sentiment analysis, key phrase extraction, and language detection.  In October, we will also make generally available Bing Custom Search, to create a customized search experience for a section of the web, and Bing Search APIs v7, for searching the entire web for more relevant results using Bing Web, News, Video & Image search.  Read the announcement blog post
  • Announcing the preview of Machine Learning Services with R support in Azure SQL Database.  You can evaluate this preview functionality in any server/database created in the West Central US Region.  More info
  • Azure Data Factory (ADF) – announcing new capabilities in public preview.  These new capabilities in ADF will enable you to build hybrid data integration at scale.  Now you can create, schedule, and orchestrate your ETL/ELT workflows, wherever your data lives, in the cloud or on any self-hosted network.  Meet security and compliance needs while taking advantage of extensive capabilities and paying only for what you use.  Accelerate your data integration with multiple data source connectors natively available in-service.  SQL Server Integration Services (SSIS) customers will benefit from easily lifting their SSIS packages into the cloud using new managed SSIS hosting capabilities in Data Factory.  We have taken the first steps to separate Control Flow and Data Flow within ADF to provide greater control over complex orchestrations that now facilitate looping, branching, and conditional structures within Control Flow.  We have added new flexibility to scheduling by enabling triggering with wall-clock timers or on-demand via event generation.  Parameters can now be defined and passed while invoking pipelines to enable incremental data loads.  If you want to move your SSIS workloads, you can create a data factory version 2, and provision an Azure-SSIS Integration Runtime (IR).  The Azure-SSIS IR is a fully managed cluster of Azure VMs (nodes) dedicated to run your SSIS packages in the cloud.  For step-by-step instructions, see the tutorial: deploy SSIS packages to Azure.  Full details of the release and features can be found on the Azure Data Factory service page. We encourage you to try these new capabilities, available at public preview pricing
  • Announcing the preview for the Azure Data Box.  A hardware appliance that companies can use to load their data for shipping to the closest Microsoft Azure data center.  The 45-lb box, which is tamper proof, holds up to 100 terabytes (TB) of data.  It plugs into a corporate network for downloads, and then into Azure’s own high-speed networks to upload its contents.  Companies will be able to rent it, fill it, and ship it while tracking its progress.  Data on the device will be encrypted throughout the journey.  More info
  • Introducing Azure Availability Zones for resiliency and high availability.  Availability Zones are fault-isolated locations within an Azure region, providing redundant power, cooling, and networking.  Availability Zones allow customers to run mission-critical applications with higher availability and fault tolerance to datacenter failures.  More info
  • Public preview: Virtual network service endpoints for Azure Storage and SQL Database.  You can now secure Azure Storage and Azure SQL Database to only your virtual networks, by using virtual network service endpoints.  Endpoints provide a direct connection from your virtual network to the Azure services, extending your virtual network’s private address space and identity to the services.  Traffic from your virtual network to the services will always remain on the Microsoft Azure network backbone.  More info
  • Intelligent insights for Azure SQL Database.  Azure SQL Database built-in intelligence continuously monitors database usage through artificial intelligence and detects disruptive events that cause poor performance.  Once detected, a detailed analysis is performed generating a diagnostic log with intelligent assessment of the issue.  This assessment consists of a root cause analysis of the database performance issue and where possible recommendations for performance improvements.  More info
  • Read replicas for Azure Database for MySQL.  Read replicas will allow customers using MySQL on-premises or on other cloud service providers to create replicas of their instance in Azure.  They can then choose to upgrade the replica to master in Azure Database for MySQL and connect their apps directly to the new database instance.  If you are interested in understanding the functionality of this private preview, visit the Azure blog for more information
  • Renamed R Server to Machine Learning Server.  Announced was the renaming of Microsoft R Server to Microsoft Machine Learning Server, and SQL Server R Services to SQL Server Machine Learning Services.  The additional language support aligns the Advanced Analytics workload with Machine Learning capabilities and a focus on AI.  With Python support in addition to R and the Microsoft ML libraries, we enhance Machine Learning capabilities and offer the ability to develop new intelligent applications combining the best of open source with the enterprise capabilities of SQL Server 2017.  More info
  • Azure SQL Database: Vulnerability Assessment.  SQL Vulnerability Assessment (currently in preview) is an easy to configure tool that can discover, track, and remediate potential database vulnerabilities.  Use it to proactively improve your database security.  More info
  • The Power BI team announced a much-awaited feature: automatic updates to Power BI Desktop.  Through the Windows Store, you can now install Power BI Desktop once and get updates automatically every month.  Read this blog post on http://aka.ms/biatmicrosoft to learn more
  • Faster compute optimized performance tier for Azure SQL Data Warehouse.  The compute optimized performance tier brings several benefits to your analytics workloads.  The first benefit can be seen through dramatically improved query performance.  Individual query execution times have improved by as much as 10x.  We’ve also seen some fantastic results with customer workloads and benchmarks where queries are completing twice as fast on average.  The compute and storage scalability has also been dramatically increased with this performance tier.  You can now provision 5x the computing power and store an unlimited amount of columnar data, empowering you to run your largest and most complex analytics workloads.  More info
  • Azure free account, now available.  A best-in-industry offer, the Azure free account helps customers try Azure.  It comes with 12-months free access to compute, storage, database, and networking services, along with 25+ always-free services, including Azure App Service and Functions.  It also includes a $200 credit allowing customers to try any Azure product for the first 30 days. More information at azure.com/free and Azure Free Account FAQ
  • Azure Stack, now shipping through Dell EMC, HPE, and Lenovo.  Azure Stack is an extension of Azure, allowing customers to uniquely meet hybrid requirements like compliance, latency, and true consistency as a part of their hybrid cloud strategy.  Cisco and Wortmann will start taking orders soon.  Customers can also buy Azure Stack as a managed service from Avanade, Rackspace, and several MSP partners.  Azure Stack certification for IT Professionals materials are available now, and certifications exams will start Q1 2018.  More information on azure.com/azurestack
  • Azure Reserved Virtual Machine Instances.  When available later in 2017, customers will be able to reserve virtual machines on Azure for a one- or three-year term with significant cost savings of up to 82% over pay-as-you-go prices when combined with Azure Hybrid Benefit and up to 72% on all VMs.  Customers select the VM type, term, and datacenter region, so the compute resources are available when and where needed.  Improve budgeting with a single up-front payment while maintaining the flexibility to exchange or cancel at any time.  Details on Azure.com
  • Native integration between Azure Cosmos DB and Azure Functions. We’re bringing the power of Azure Cosmos DB to our serverless offering, Azure Functions.  With this integration, developers can write serverless apps backed by Cosmos DB, with just a few lines of code.  They can innovate faster by reacting in real-time to changes happening in the database to drive more engaging and personalized customer experiences.  Using Azure Functions and Azure Cosmos DB, customers can create and deploy event-driven, planet-scale serverless apps with extremely low-latency access against very rich data.  Read the blog
  • GA of HDInsight Interactive Query (Hive LLAP).  This is an Azure HDInsight cluster type.  It supports in-memory caching, which makes Hive queries faster and much more interactive.  More info
  • Microsoft is now offering Blob storage accounts with up to 5PB (petabytes) of maximum capacity, a 10x increase.  Both incoming and outgoing data can now move at up to 50Gbps (gigabits per second) and users can expect 50,000 TPS/IOPS (transactions per second/input output operations per second) performance, a 2.5x jump.  More info
  • Announcing new Azure VM sizes for more cost-effective database workloads.  We are excited to announce the latest versions of our most popular VM sizes (DS, ES, GS, and MS), which constrain the vCPU count to one half or one quarter of the original VM size while maintaining the same memory, storage, and I/O bandwidth.  We have marked these new VM sizes with a suffix that specifies the number of active vCPUs to make them easier for you to identify.  For example, the current VM size Standard_GS5 comes with 32 vCPUs, 448 GB of memory, 64 disks (up to 256 TB), and 80,000 IOPS or 2 GB/s of I/O bandwidth.  The new VM sizes Standard_GS5-16 and Standard_GS5-8 come with 16 and 8 active vCPUs respectively, while maintaining the rest of the Standard_GS5 specs for memory, storage, and I/O bandwidth.  More info
  • New in Stream Analytics: output to Azure Functions, built-in anomaly detection, etc.  Announced the preview of several new and compelling capabilities in Azure Stream Analytics, including built-in inline machine-learning-based anomaly detection, egress to Azure Functions, support for compressed data formats, JavaScript user-defined aggregates, and support for CI/CD in Visual Studio tooling.  These new features will start rolling out over the course of the next several weeks.  More info
  • Announcing Azure Migrate.  A new service that provides the guidance, insights, and mechanisms needed to assist you in migrating on-premises virtual machines and servers to Azure.  More info
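To give a feel for the graph query support called out in the SQL Server 2017 bullet above, here is a tiny self-contained T-SQL sketch; the tables and data are made up purely for illustration:

```sql
-- Node and edge tables using the SQL Server 2017 graph feature (hypothetical schema)
CREATE TABLE dbo.Person (ID INT PRIMARY KEY, Name NVARCHAR(100)) AS NODE;
CREATE TABLE dbo.Likes (Since DATE) AS EDGE;

INSERT INTO dbo.Person VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO dbo.Likes ($from_id, $to_id, Since)
VALUES ((SELECT $node_id FROM dbo.Person WHERE ID = 1),
        (SELECT $node_id FROM dbo.Person WHERE ID = 2),
        '2017-09-01');

-- Traverse the graph: who does Alice like?
SELECT p2.Name
FROM dbo.Person AS p1, dbo.Likes AS l, dbo.Person AS p2
WHERE MATCH(p1-(l)->p2)
  AND p1.Name = 'Alice';
```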

Distributed Writes

In SQL Server, scaling out reads (i.e. using Active secondary replicas via AlwaysOn Availability Groups) is a lot easier than scaling out writes.  So what are your options when you have a tremendous amount of writes that scaling up will not handle, no matter how big your server is?  There are a number of options that allow you to write to many servers (instead of writing to one master server) that I’ll call distributed writes.  Here are some ideas:

The one option out of all the above that does not require coding and can support a large number of writes per second is Azure Cosmos DB.  All the other options can require significant coding and/or can only handle a limited number of writes per second.  This is because Cosmos DB uses documents (JSON) in which all the information needed is included, so no joins are needed and documents can be spread over multiple servers (see Partition and scale in Azure Cosmos DB and A technical overview of Azure Cosmos DB).  This is opposed to relational databases, which use multiple tables that must be joined; if the tables are on different nodes, the joins cause a lot of data shuffling, leading to performance problems.
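To illustrate the point, consider a denormalized order document (the shape and values below are entirely hypothetical).  Because the customer and line items are embedded, a write lands on a single partition and a read never has to join across tables or nodes:

```sql
-- Hypothetical denormalized order document stored in Cosmos DB:
-- { "id": "order-1001", "customerId": "customer-42",
--   "customer":  { "name": "Contoso" },
--   "lineItems": [ { "sku": "A-100", "quantity": 2, "price": 10.0 },
--                  { "sku": "B-200", "quantity": 1, "price": 25.5 } ] }
--
-- The JOIN below only unrolls the embedded array inside that same document;
-- it is not a cross-table or cross-node join:
SELECT VALUE SUM(li.quantity * li.price)
FROM o
JOIN li IN o.lineItems
WHERE o.id = "order-1001"
```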

To go into greater detail on the benefits of Cosmos DB over SQL Server for distributed writes:

  • Consistency
    • Peer2Peer SQL Replication introduces pains around data consistency and conflict resolution
  • Availability
    • Sharding with SQL introduces pains around maintaining availability when increasing/decreasing the degree of scale-out.  Frequently, downtime is involved due to the need to re-balance data across shards
    • SQL requires rigid schemas and indexes to be defined upfront.  Every time schema and index updates are needed, you will incur a heavy operational cost of running CREATE INDEX and ALTER TABLE scripts across all database shards and replicas.  Furthermore, this introduces availability issues as schemas are being altered.
  • Handling sustained heavy write ingestion
    • Putting queueing mechanisms in front of SQL only gives you a buffer for handling spikes in writes, but at the end of the day the database itself needs to support sustained heavy write ingestion in order to consume the buffered events.  What happens if events come into the buffer faster than you can drain it?  You will need a database specifically designed for heavy write ingestion

Azure Cosmos DB solves these by:

  • Providing 5 well-defined consistency models to help developers tune the right Consistency vs Performance tradeoffs for their scenario
  • Scaling on demand and supporting a flexible data model while maintaining high availability (99.99% availability SLA); scale-out and partition management are taken care of by the service on behalf of the user
  • Using log-structured techniques to be a truly latch-free database that can sustain heavy write ingestion with durable persistence

In the end, eliminating schema and index management and JOINs is a necessary byproduct of the scale-out that Azure Cosmos DB provides.

After the initial post of this blog, I received the question “Why not just use SQL 2016 in-Memory tables for heavy write systems (info)?” and received a great reply from a Cosmos DB product manager:

SQL in-memory is only viable when:

  • Request and data volume are small enough to fit on a single machine.  You still have the fundamental problem of hard limits due to scale-up model.
  • Scenario does not need durability, reliability, or availability – which are requirements for >99% of mission critical customer scenarios.

Durability

  • If data is kept only in memory, you experience data loss upon any intermittent failure that requires the computer to restart (e.g. an OS crash, a power outage, the OS deciding it wants to reboot to update, etc.).  For data to be durable, it needs to be persisted to disk, and to offer resiliency against disk failures, you will need to replicate it to a number of disks
  • For durable scenarios, memory only acts as a buffer to absorb spikes.  To achieve sustained write ingestion, you need to flush the buffer as fast as you fill it, so you now have a bottleneck on disk I/O unless you scale out
  • This is why they immediately have to clarify that this is for “applications where durability is not required”; durability is a requirement for >99% of data scenarios
  • Data loss and data corruption should be treated as a cardinal sin for any data product

Scale

  • This model is still a scale-up model, in which there are many hard limits
  • What happens when the data volume doesn’t fit in memory (which tends to be very small relative to disk storage)?  You need to scale out
  • What happens when the request volume exceeds what memory bandwidth can handle?  You need to scale out
  • This is why the throughput numbers in the blog are orders of magnitude smaller than what customers are doing every day on Cosmos DB, and the question of storage size is quietly ignored

Expensive

  • Memory is 100x more expensive than SSD.  Achieving high storage in a scale-out system will not only yield better scale and durability characteristics but also incur much lower costs for any large-scale scenario

More info:

Database Sharding


Azure Archive Blob Storage

Last week Microsoft released a public preview of a new service called Azure Archive Blob Storage, offering customers a lower-cost cloud storage solution for rarely accessed data.  This allows for storage tiering, where organizations can place their critical data on expensive, high-performance storage and then move it down the line as it winds up being accessed less frequently over time.

Last year Microsoft introduced Azure Cool Blob storage, which costs customers a penny per GB per month in some Azure regions.  Now, users have another, lower-cost option in Azure Archive Blob Storage, along with new blob-level tiering data lifecycle management capabilities.  So there are now three Azure Blob storage tiers: Hot, Cool, and Archive.

Azure Archive Blob Storage costs 0.18 cents per GB per month when the service is delivered through the East US 2 region (for comparison, in the same region Hot is 1.8 cents and Cool is 1.0 cents per GB per month).  Customers can expect a 99 percent availability SLA (service level agreement) when the service makes its way out of the preview stage.

Complementing the new service is a new Blob-level Tiering feature that will allow customers to change the access tier of blob storage objects among Hot, Cool or Archive.  Also in preview, it enables users to match costs to usage patterns without moving data between accounts.

Archive storage has the lowest storage cost and highest data retrieval costs compared to hot and cool storage.

While a blob is in archive storage, it cannot be read, copied, overwritten, or modified.  Nor can you take snapshots of a blob in archive storage.  However, you may use existing operations to delete, list, get blob properties/metadata, or change the tier of your blob.  To read data in archive storage, you must first change the tier of the blob to hot or cool.  This process is known as rehydration and can take up to 15 hours to complete for blobs less than 50 GB.  Additional time required for larger blobs varies with the blob throughput limit.

During rehydration, you may check the “archive status” blob property to confirm if the tier has changed.  The status reads “rehydrate-pending-to-hot” or “rehydrate-pending-to-cool” depending on the destination tier.  Upon completion, the “archive status” blob property is removed, and the “access tier” blob property reflects the hot or cool tier.

Example usage scenarios for the archive storage tier include:

  • Long-term backup, archival, and disaster recovery datasets
  • Original (raw) data that must be preserved, even after it has been processed into final usable form. (For example, Raw media files after transcoding into other formats)
  • Compliance and archival data that needs to be stored for a long time and is hardly ever accessed. (For example, Security camera footage, old X-Rays/MRIs for healthcare organizations, audio recordings, and transcripts of customer calls for financial services)

More info:

Announcing the public preview of Azure Archive Blob Storage and Blob-Level Tiering

Microsoft Unveils Cost-Cutting Archival Cloud Storage Option


Data Virtualization vs Data Warehouse

Data virtualization goes by a lot of different names: logical data warehouse, data federation, virtual database, and decentralized data warehouse.  Data virtualization allows you to integrate data from various sources, keeping the data in-place, so that you can generate reports and dashboards to create business value from the data.  It is an alternative to building a data warehouse, where you collect data from various sources and store a copy of the data in a new data store.

The main advantage of data virtualization is speed-to-market: you can build a solution in a fraction of the time it takes to build a data warehouse.  This is because you don’t need to design and build the data warehouse and the ETL to copy the data into it, and you also don’t need to spend as much time testing.  Copying the data means more hardware costs, more software licenses, more ETL flows to build and maintain, more data inconsistencies, and more data governance costs, so using data virtualization can also save you a lot of money.

Some of the more popular data virtualization products are Cisco Data Virtualization (previously called Composite Software), Denodo Platform for Data Virtualization and Informatica Data Virtualization.

Along the same lines as data virtualization vs the data warehouse is federated queries vs the data lake.  Technologies such as PolyBase, Metanautix, and U-SQL in Azure Data Lake Analytics provide federated queries.
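As a hedged example of what a federated query looks like with PolyBase in SQL Data Warehouse (the data source name, credential, file path, and schema below are placeholders, and the database-scoped credential setup for ADLS is omitted), you define an external table over files that stay in the data lake and then query it in place:

```sql
-- External data source pointing at ADLS (credential setup omitted; names are hypothetical)
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (TYPE = HADOOP,
      LOCATION = 'adl://mydatalake.azuredatalakestore.net',
      CREDENTIAL = ADLSCredential);

CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- Only this metadata lives in SQL DW; the data itself never leaves the lake
CREATE EXTERNAL TABLE dbo.SalesExternal
(SaleId INT, Region NVARCHAR(50), Amount DECIMAL(18, 2))
WITH (LOCATION = '/curated/sales/',
      DATA_SOURCE = AzureDataLakeStore,
      FILE_FORMAT = CsvFormat);

-- Federated query: executed against the files sitting in ADLS
SELECT Region, SUM(Amount) AS TotalSales
FROM dbo.SalesExternal
GROUP BY Region;
```

The same external table is also how PolyBase loads data quickly into SQL DW: a CREATE TABLE AS SELECT over dbo.SalesExternal materializes it into a distributed internal table.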

But there are some major drawbacks to data virtualization and federated queries, so you have to ask the following questions when you are thinking about using them:

  • Speed.  Is this something I could use for a Power BI dashboard where I wanted to slice and dice data with sub-second response times?  Or is this more for operational type reporting?
  • How much will this affect the performance of the source system?  Could a query consume all the resources of a server with a data source that I’m querying against?  Does it push down the query in the same way PolyBase does?
  • Do I need to install something on each server that contains a data source I want to use?
  • Does it use the indexes of each technology on the data store, or does it create its own indexes?
  • How is security handled for giving users access to each data source?
  • How is master data management handled (i.e. the same customer in multiple data sources but the customer name spelled differently)?
  • Where and how will the data be cleaned?

And there are some very valid reasons why a physical data warehouse is required:

  • Many production systems don’t keep track of historical data.  This data must be stored somewhere for historical analysis of the data. The physical data warehouse is, in this case, the most obvious solution
  • Accessing production systems directly for reporting and analytics can lead to too much interference on those systems and to performance degradation.  Note that this was once the reason why physical data warehouses were developed in the first place
  • Speed: a data warehouse is optimized for read access while a source system is usually optimized for writes
  • In building a data warehouse you will be restructuring, renaming, and joining data (i.e. creating star schemas) to make it easy for users to create reports
  • A data warehouse protects users against source system upgrades

A word about views: views, such as those in SQL Server, can be thought of as a “lightweight” data virtualization solution.  When users need access to operational data, views can be defined directly on the operational data store or the production databases.  But views have their own issues: operations for correcting and transforming data must be added to the view definitions, because the original operations are implemented in the physical data warehouse environment and are now bypassed, so they now have to be virtualized.
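For example, a view like the following (the table and column names are made up) layers light cleansing and joining directly over the operational tables, which is exactly the transformation work that would otherwise live in the physical data warehouse's ETL:

```sql
-- A "lightweight" virtualization layer over operational tables (names are hypothetical)
CREATE VIEW dbo.vwCustomerSales
AS
SELECT c.CustomerId,
       UPPER(LTRIM(RTRIM(c.CustomerName))) AS CustomerName,  -- cleansing now lives in the view
       s.OrderDate,
       s.Amount
FROM oltp.Customer AS c
JOIN oltp.Sales AS s
    ON s.CustomerId = c.CustomerId;
```

Every report that uses the view gets the corrected data, but the cleansing logic itself now has to be maintained, and kept consistent, inside the view definitions.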

More info:

IT pros reveal benefits, drawbacks of data virtualization software

Experts Reconsider the Data Warehouse

Clearly Defining Data Virtualization, Data Federation, and Data Integration

DATA VIRTUALIZATION

DATA FEDERATION

Mark Beyer, Father of the Logical Data Warehouse, Guest Post

The Logical Data Warehouse: Smart Consolidation for Smarter Warehousing

data federation technology (data virtualization technology or data federation services)

Logical Data Warehousing for Big Data

The Many Uses of Data Federation

Data Federation

Enterprise Data Management, Part 1

How Data Federation Can Co-exist with an EDW

Demystifying Data Federation for SOA

Federation Supplements The Data Warehouse – Not Either/Or, Never Was


Microsoft Big Data Certification Exams

I previously blogged about Microsoft certification changes, and since then some new Microsoft Big Data certification exams have been released:

Already live:

70-773: Analyzing Big Data with Microsoft R Server

70-774: Perform Cloud Data Science with Azure Machine Learning

70-775: Perform Data Engineering on Microsoft Azure HDInsight

Now in beta:

70-776: Perform Big Data Engineering with Microsoft Cloud Services (Azure SQL Data Warehouse, Azure Data Lake Analytics, Azure Data Factory, Azure Stream Analytics)

70-778: Visualizing Data with Power BI

A few other related notes:


Azure Analysis Services web designer

Microsoft has released a preview of the Azure Analysis Services web designer.  This is a browser-based experience that will allow developers to start creating and managing Azure Analysis Services (AAS) semantic models quickly and easily.  SQL Server Data Tools (SSDT) and SQL Server Management Studio (SSMS) will still be the primary tools for development, but this new designer gives you another option for creating a new model or to do things such as adding a new measure to a development or production AAS model.

A highly requested feature is that you can import a Power BI Desktop file (.pbix) into an Analysis Services database, and once imported you can reverse-engineer it to Visual Studio.  Note that for PBIX import, only Azure SQL Database, Azure SQL Data Warehouse, Oracle, and Teradata are supported at this time, and DirectQuery models are not yet supported for import (Microsoft will be adding new connection types for import every month).

This initial release includes three major capabilities: model creation, model editing, and querying.  The model can be created directly from Azure SQL Database or Azure SQL Data Warehouse or, as mentioned, imported from a Power BI Desktop PBIX file.  When creating from a database, you can choose which tables to include, and the tool will create a DirectQuery model against your data source (DirectQuery is where the query to obtain the model data is passed down into the data source and executed there).  You can then view the model metadata, edit the Tabular Model Scripting Language (TMSL) JSON, and add measures.  There are shortcuts to open the model in Power BI Desktop or Excel, or to open a Visual Studio project that is created from the model on the fly.  You can also run simple queries against the model to see the data or test a new measure without having to use SSMS.

Keep in mind that if you save changes to the TMSL, it is updating the model in the cloud.  It is a best practice to make changes to a development server and propagate changes to production with a tool such as BISM Normalizer.

More info:

Introducing the Azure Analysis Services web designer

Azure Analysis Services Web Designer


Azure SQL Data Sync 2.0

Azure SQL Data Sync has been ignored for quite some time, but has finally gotten an update (it’s in public preview).  This release includes several major improvements to the service including new Azure portal support, PowerShell and REST API support, and enhancements to security and privacy.

Azure SQL Data Sync is an Azure web service that provides data synchronization capabilities for SQL databases.  It allows data to be synchronized between on-premises SQL Server databases and Azure SQL databases; in addition, it can also keep multiple Azure SQL databases in sync (see Azure SQL Data Sync technical documentation).

SQL Data Sync targets the reference data replication scenario.  Its key capabilities are:

Sync between SQL Server (2005 SP2 and later) and Azure SQL databases, or between Azure SQL databases:

  • One-way and bi-directional sync
  • One-to-one and hub-spoke
  • Table filter and column filter
  • Scheduled and on-demand
  • Eventual consistency

Active Geo-Replication, in contrast, targets the geo-DR scenario for Azure SQL Database by replicating the database to another region.  It only supports one-way replication (secondaries are read-only), replication is at database granularity, there is no database or column/row filter support, and it is only available in the Premium service tier.

Comparing Data Sync to transactional replication: if you only need one-way replication, both solutions work.  Transactional replication can provide better latency and transactional consistency, but it requires more complex setup and maintenance.  It doesn’t require the latest SQL Server as publisher, as SQL Server 2012 or later will work.  Data Sync is an Azure SQL Database feature that requires less maintenance and is easier to set up, but it has a minimum five-minute latency and a more significant performance impact on the source database (trigger-based change tracking).  It is optimized for bi-directional sync.

Data Sync is now available in the new Azure portal.  If you would like to try Data Sync refer to this tutorial.  Existing users will be migrated to the new service starting June 1, 2017.  For more information on migration look at the blog post “Migrating to Azure SQL Data Sync 2.0.”

More info:

Azure SQL Data Sync Refresh

Sync data across multiple cloud and on-premises databases with SQL Data Sync

Getting Started with Azure SQL Data Sync (Preview)


4TB disk sizes for Azure IaaS VMs available

Microsoft has introduced two new disk sizes for Azure IaaS VMs: P40 (2 TB) and P50 (4 TB) for both managed and unmanaged Premium Disks, and S40 (2 TB) and S50 (4 TB) for both managed and unmanaged Standard Disks.  This enables customers to add 4x more disk storage capacity per VM.  Customers can now provision up to a total of 256 TB of disk storage on a GS5 VM using 64 disks of 4 TB capacity.  This means the max SQL Server database size on an Azure VM goes from 64 TB to 256 TB!

                   Premium Disks   Standard Disks
Managed Disks      P40, P50        S40, S50
Unmanaged Disks    P40, P50        Max up to 4,095 GB

The larger Premium Disks P40 and P50 will support your IO-intensive workloads and therefore offer higher provisioned disk performance.  The maximum Premium Disk IOPS and bandwidth are increased to 7,500 IOPS and 250 MBps respectively.  Standard Disks, of all sizes, offer up to 500 IOPS and 60 MBps.

                  P40          P50          S40              S50
Disk Size         2,048 GB     4,095 GB     2,048 GB         4,095 GB
Disk IOPS         7,500        7,500        Up to 500        Up to 500
Disk Bandwidth    250 MBps     250 MBps     Up to 60 MBps    Up to 60 MBps

You can create a larger disk or resize existing disks to larger disk sizes with your existing Azure tools through Azure Resource Manager (ARM) or the Azure Portal.  Azure Backup and Azure Site Recovery support for larger disks is coming soon (current timeline is before end of July).

You can visit the Managed Disk Pricing and unmanaged Disk Pricing pages for more details about pricing.

More info:

Azure increases the maximum size and performance of Azure Disks
