Public preview of Azure SQL Database Managed Instance

Microsoft has announced the public preview of Azure SQL Database Managed Instance.  I blogged about this before.  This will lead to a tidal wave of on-prem SQL Server database migrations to the cloud.  In summary:

Managed Instance is an expansion of the existing SQL Database service, providing a third deployment option alongside single databases and elastic pools. It is designed to enable database lift-and-shift to a fully-managed service, without re-designing the application.  SQL Database Managed Instance provides the broadest SQL Server engine compatibility and native virtual network (VNET) support so you can migrate your SQL Server databases to SQL Database without changing your apps.  It combines the rich SQL Server surface area with the operational and financial benefits of an intelligent, fully-managed service.
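
To make that concrete, the lift-and-shift path can be as simple as taking a native backup to Azure Blob Storage and restoring it on the Managed Instance.  Here is a minimal T-SQL sketch, assuming a hypothetical storage account, container, and SAS token:

```sql
-- On the Managed Instance: create a credential for the blob container
-- (the credential name must match the container URL; the SAS token is a placeholder)
CREATE CREDENTIAL [https://mystorage.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<sas-token-without-leading-question-mark>';

-- Restore the on-premises backup directly from Blob Storage
RESTORE DATABASE [MyOnPremDb]
FROM URL = 'https://mystorage.blob.core.windows.net/backups/MyOnPremDb.bak';
```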

Two other related items are also available:

  • Azure Hybrid Benefit for SQL Server on Azure SQL Database Managed Instance. The Azure Hybrid Benefit for SQL Server enables customers to use their SQL Server licenses with Software Assurance to save up to 30% on SQL Database Managed Instance. Exclusive to Azure, it provides an additional advantage for highly virtualized Enterprise Edition workloads with active Software Assurance: for every 1 core a customer owns on-premises, they will receive 4 vCores of Managed Instance General Purpose. This makes moving virtualized applications to Managed Instance highly cost-effective.
  • Database Migration Services for Azure SQL Database Managed Instance. Using the fully-automated Database Migration Service (DMS) in Azure, customers can easily lift and shift their on-premises SQL Server databases to a SQL Database Managed Instance. DMS is a fully managed, first-party Azure service that enables seamless migrations from heterogeneous database sources to Azure database platforms with minimal downtime. It provides customers with assessment reports that guide them through the changes required prior to performing a migration. When the customer is ready, DMS performs all the steps associated with the migration process.

More info:

Migrate your databases to a fully managed service with Azure SQL Database Managed Instance

What is Azure SQL Database Managed Instance?

Video: Introducing Azure SQL Database Managed Instance

Azure SQL Database Managed Instance – the Good, the Bad, the Ugly


It’s all about the use cases

There is no better way to see the art of the possible with the cloud than in use cases/customer stories and sample solutions/architectures.  Many of these are domain-specific, which resonates best with business decision makers:

Use cases/customer stories

Microsoft IoT customer stories: Explore Internet of Things (IoT) examples and IoT use cases to learn how Microsoft IoT is already transforming your industry.  The industries are broken out by: Manufacturing, Smart Infrastructure, Transportation, Retail, and Healthcare.

Customer stories: Dozens of customer stories of solutions built in Azure that you can filter on by language, industry, product, organization size, and region.

Case studies: See the amazing things people are doing with Azure broken out by industry, product, solution, and customer location.

Sample solutions/architectures

Azure solution architectures: These architectures help you design and implement secure, highly-available, performant and resilient solutions on Azure.

Pre-configured AI solutions: These serve as a great starting point when building an AI solution.  Broken out by Retail, Manufacturing, Banking, and Healthcare.

Internet of Things (IoT) solutions: Great IoT sample solutions, such as connected factory, remote monitoring, predictive maintenance, connected field service, connected vehicle, and smart buildings.


My latest presentations

I frequently present at user groups, and always try to create a brand new presentation to keep things interesting.  We all know technology changes quickly, so there is no shortage of topics!  There is a list of all my presentations with slide decks.  Here are the new presentations I created over the past year:

Differentiate Big Data vs Data Warehouse use cases for a cloud solution

It can be quite challenging to keep up with the frequent updates to the Microsoft products and to understand all their use cases and how the products fit together.  In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn’t, in order for you to position, design and deliver the proper adoption use cases for each with your customers.  We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS, as well as high-level concepts such as when to use a data lake.  We will also review the most common reference architectures (“patterns”) witnessed in customer adoption. (slides)

Introduction to Azure Databricks

Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project.  It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib).  It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark. (slides)

Azure SQL Database Managed Instance

Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer.  It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance).  Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (e.g. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc.  So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement over the current Singleton or Elastic Pool flavors, which can require substantial changes. (slides)
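
As a quick illustration of one of those additions, cross-database queries work on a Managed Instance with standard three-part names, something a singleton Azure SQL Database cannot do.  A sketch with made-up database and table names:

```sql
-- Join tables across two databases on the same Managed Instance
SELECT o.OrderId, c.CustomerName
FROM Sales.dbo.Orders AS o
JOIN Crm.dbo.Customers AS c
  ON c.CustomerId = o.CustomerId;
```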

What’s new in SQL Server 2017

Covers all the new features in SQL Server 2017, as well as details on upgrading and migrating to SQL Server 2017 or to Azure SQL Database. (slides)

Microsoft Data Platform – What’s included

The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products we have, since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it.  My goal is to help you not only understand each product but understand how they all fit together and their proper use cases, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source. (slides)

Learning to present and becoming good at it

Have you been thinking about presenting at a user group?  Are you being asked to present at your work?  Is learning to present one of the keys to advancing your career?  Or do you just think it would be fun to present but you are too nervous to try it?  Well, take the first step to becoming a presenter by attending this session and I will guide you through the process of learning to present and becoming good at it.  It’s easier than you think!  I am an introvert and was deathly afraid to speak in public.  Now I love to present and it’s actually my main function in my job at Microsoft.  I’ll share with you the journey that led me to speak at major conferences and the skills I learned along the way to become a good presenter and to get rid of the fear.  You can do it! (slides)

Microsoft cloud big data strategy

Think of big data as all data, no matter what the volume, velocity, or variety.  The simple truth is a traditional on-prem data warehouse will not handle big data.  So what is Microsoft’s strategy for building a big data solution?  And why is it best to have this solution in the cloud?  That is what this presentation will cover.  Be prepared to discover all the various Microsoft technologies and products, from collecting data, transforming it, and storing it, to visualizing it.  My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company’s big data solution. (slides)

Choosing technologies for a big data solution in the cloud

Has your company been building data warehouses for years using SQL Server?  And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”?  What technologies and tools should you use?  That is what this presentation will help you answer.  First we will level-set on what big data is, along with other definitions, cover questions to ask to help decide which technologies to use, go over the new technologies to choose from, and then compare the pros and cons of those technologies.  Finally we will show you common big data architecture solutions and help you answer questions such as: Where do I store the data?  Should I use a data lake?  Do I still need a cube?  What about Hadoop/NoSQL?  Do I need the power of MPP?  Should I build a “logical data warehouse”?  What is this lambda architecture?  And we’ll close by showing some architectures of real-world customer big data solutions.  Come to this session to get started down the path to making the proper technology choices in moving to the cloud. (slides)


Azure Data Architecture Guide (ADAG)

The Azure Data Architecture Guide has just been released!  Check it out:

Think of it as a menu or syllabus for data professionals: which service should you use, why, and when would you use it?  I had a small involvement in its creation, but a large number of people within Microsoft and from 3rd parties put it together over many months.  Hopefully you find this clears up some of the confusion caused by so many technologies and products.

“This guide presents a structured approach for designing data-centric solutions on Microsoft Azure.  It is based on proven practices derived from customer engagements.”

You can even download a PDF version (106 pages!).

The guide is structured around a basic pivot: the distinction between relational data and non-relational data.

Within each of these two main categories, the Data Architecture Guide contains the following sections:

  • Concepts. Overview articles that introduce the main concepts you need to understand when working with this type of data.
  • Scenarios. A representative set of data scenarios, including a discussion of the relevant Azure services and the appropriate architecture for the scenario.
  • Technology choices. Detailed comparisons of various data technologies available on Azure, including open source options.  Within each category, we describe the key selection criteria and a capability matrix, to help you choose the right technology for your scenario.

The table of contents is broken out into three top-level areas: Traditional RDBMS, Big data and NoSQL, and Cross-cutting concerns.


Conversations with Data Warehouse Experts – Podcast

In this podcast I talk with Mike Rabinovici of Dimodelo Solutions about data being the new currency, the importance of showing customers the art of the possible, and, last but not least, my go-to TV show.  Click here to listen.  Also check out the podcasts of other data warehouse experts.


Data Virtualization vs. Data Movement

I have blogged about Data Virtualization vs Data Warehouse and wanted to blog on a similar topic: Data Virtualization vs. Data Movement.

Data virtualization integrates data from disparate sources, locations and formats, without replicating or moving the data, to create a single “virtual” data layer that delivers unified data services to support multiple applications and users.

Data movement is the process of extracting data from source systems and bringing it into the data warehouse and is commonly called ETL, which stands for extraction, transformation, and loading.
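
As a tiny illustration of the “transform” and “load” steps (the table names are hypothetical), much of ETL boils down to SQL like this:

```sql
-- Transform + load: move cleansed source rows from staging into the warehouse
INSERT INTO dw.DimCustomer (CustomerKey, CustomerName, City)
SELECT s.CustomerId,
       UPPER(LTRIM(RTRIM(s.FullName))),   -- simple cleansing transform
       COALESCE(s.City, 'Unknown')        -- default missing values
FROM staging.Customer AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dw.DimCustomer AS d
                  WHERE d.CustomerKey = s.CustomerId);  -- skip rows already loaded
```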

If you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?

The most common scenario where you would want to do data movement is if you will aggregate/transform one time and query the results many times.  Another common scenario is if you will be joining data sets from multiple sources frequently and the performance needs to be super fast.  These turn out to be the scenarios for most data warehouse solutions.  But there could be cases where you will have many ad-hoc queries that don’t need to be super fast.  And you could certainly have a data warehouse that uses data movement for some tables and data virtualization for others.

Here is a comparison of both:

Other data virtualization benefits:

  • Provides complete data lineage from the source to the presentation layer
  • Additional data sources can be added without having to change transformation packages or staging tables
  • All data presented through the data virtualization software is available through a common SQL interface regardless of the source (e.g. flat files, Excel, mainframe, SQL Server, etc) – see the hypothetical query below
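
For instance, once the virtualization layer has registered each source, a single query can span all of them.  This is a purely hypothetical sketch – the view names are made up and the exact SQL surface varies by tool:

```sql
-- One SQL statement spanning three very different sources, each exposed
-- as a relational view by the virtualization layer
SELECT o.OrderId, c.CustomerName, p.ListPrice
FROM virt.Orders_Mainframe AS o
JOIN virt.Customers_SqlServer AS c ON c.CustomerId = o.CustomerId
JOIN virt.Prices_ExcelSheet AS p ON p.ProductId = o.ProductId;
```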

While this table gives some good benefits of data virtualization over data movement, it may not be enough to overcome the sacrifice in performance or other drawbacks listed at Data Virtualization vs Data Warehouse.  Also keep in mind the virtualization tool you choose may not support some of your data sources.

The better data virtualization tools provide such features as query optimization, query pushdown, and caching (e.g. Denodo) that may help with performance.  You may see tools with these features called “data virtualization” and tools without them called “data federation” (e.g. PolyBase).

More info:


Developing a Bi-Modal Logical Data Warehouse Architecture Using Data Virtualization


Reference architecture for enterprise reporting in Azure

As I mentioned in my recent blog Use cases of various products for a big data cloud solution, with so many products it can be difficult to know the best products to use when building a solution.  When it comes to building an enterprise reporting solution, there is a recently released reference architecture to help you in choosing the correct products.  It will also help you get started quickly as it includes an implementation component in Azure.  The blog post announcement is here.

This reference architecture is focused solely on reporting, for those use cases where you will have a lot of users building dashboards via Power BI and operational reports via SSRS.  You can certainly expand the capabilities to add more features such as machine learning, as well as enhancing the purpose of certain products, such as using Azure SQL Data Warehouse (SQL DW) to accept large ad-hoc queries from users.  The reference architecture is also for a batch-type environment (e.g. loading data every hour), not a real-time environment (e.g. handling thousands of events per second).

Key features and benefits include:

  • Pre-built based on selected and stable Azure components proven to work in enterprise BI and reporting scenarios
  • Easily configured and deployed to an Azure subscription within a few hours
  • Bundled with software to handle all the operational essentials for a full-fledged production system
  • Tested end-to-end against large workloads
  • You can operationalize the infrastructure using the steps in the User’s Guide, and explore component-level details in the Technical Guides.  Also, check out the FAQ

You can one-click deploy the infrastructure implementation from one of these two locations, which also go into details on each step in the above diagram:

The idea is you deploy a base architecture and then modify it as needed to fit your needs.  But the hard work of choosing the right products and building the starting architecture is done for you, reducing your risk and shortening development time.  However, this does not mean you should use these chosen products in every situation.  For example, if you are comfortable with Hadoop technologies you can use Azure Data Lake Store and HDInsight instead of SQL DW, or use Azure Analysis Services (AAS) instead of SQL Server Analysis Services (SSAS) in a VM (AAS did not support VNETs when this reference architecture was created).  But for many who just need an enterprise reporting solution, this will do the job with little modification.

Note the Cortana Intelligence Gallery has many other solutions, so be sure to check them out and avoid “reinventing the wheel”.


Is the traditional data warehouse dead?

There have been a number of enhancements to Hadoop recently when it comes to fast interactive querying with such products as Hive LLAP and Spark SQL which are being used over slower interactive querying options such as Tez/Yarn and batch processing options such as MapReduce (see Azure HDInsight Performance Benchmarking: Interactive Query, Spark and Presto).

This has led to a question I have started to see from customers: Do I still need a data warehouse or can I just put everything in a data lake and report off of that using Hive LLAP or Spark SQL?  Which leads to the argument: “Is the data warehouse dead?”

I think what is confusing is that the argument should not be over whether the “data warehouse” is dead, but whether the “traditional data warehouse” is dead, as the reasons a data warehouse is needed are greater than ever (e.g. integrating many sources of data, reducing reporting stress on production systems, data governance including cleaning, mastering, and security, historical analysis, a user-friendly data structure, minimizing silos, a single version of the truth, etc – see Why You Need a Data Warehouse).  A “traditional” data warehouse usually refers to a relational data warehouse built using SQL Server (if using Microsoft products), and a data lake usually refers to one built in Hadoop using Azure Data Lake Store (ADLS) and HDInsight (which has cluster types for Spark SQL and for Hive LLAP, also called Interactive Query).

I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution?  The short answer is you should use both.  The rest of this post will dig into the reasons why.

I touched on this ultimate question in a blog that is now a few years old, Hadoop and Data Warehouses, so this is a good time to provide an update.  I also touched on this topic in my blogs Use cases of various products for a big data cloud solution, Data lake details, Why use a data lake?, and What is a data lake?, and in my presentation Big data architectures and the data lake.

The main benefits I hear for a data lake-only approach: you don’t have to load data into another system (and therefore manage schemas across different systems), data load times can be expensive, data freshness challenges, the operational challenges of managing multiple systems, and cost.  While these are valid benefits, I don’t feel they are enough to warrant not having a relational data warehouse in your solution.

First let’s talk about cost and dismiss the incorrect assumption that Hadoop is cheaper: Hadoop can be 3x cheaper for data refinement, but building a data warehouse in Hadoop can be 3x more expensive due to the cost of writing complex queries and analysis (based on a WinterCorp report and my experiences).

Understand that a “big data” solution does not mean just using Hadoop-related technologies, but could mean a combination of Hadoop and relational technologies and tools.  Many clients will build their solution using just Microsoft products, while others use a combination of both Microsoft and open source (see Microsoft Products vs Hadoop/OSS Products).  Building a data warehouse solution on the cloud or migrating to the cloud is often the best idea (see To Cloud or Not to Cloud – Should You Migrate Your Data Warehouse?) and you can often migrate to the cloud without retooling technology and skills.

I have seen Hadoop adopters typically falling into two broad categories: those who see it as a platform for big data innovation, and those who dream of it providing the same capabilities as an enterprise data warehouse but at a cheaper cost.  Big data innovators are thriving on the Hadoop platform especially when used in combination with relational database technologies, mining and refining data at volumes that were never before possible.  However, most of those who expected Hadoop to replace their enterprise data warehouse have been greatly disappointed, and in response have been building complex architectures that typically do not end up meeting their business requirements.

As far as reporting goes, whether to have users report off of a data lake or off of a relational database and/or a cube is a balance between giving users data quickly, with them doing the work to join, clean and master it (getting IT out of the way), versus having IT make multiple copies of the data and clean, join and master it, making it easier for users to report off of the data but forcing them to wait for IT to do all this.  The risk in the first case is users repeating the clean/join/master process, doing it wrong, and getting different answers to the same question.  Another risk in the first case is slower performance because the data is not laid out efficiently.  Most solutions incorporate both: power users or data scientists access the data quickly via the data lake, while all the other users access the data in a relational database or cube, making self-service BI a reality (most users would not have the skills to access data in a data lake properly, or at all, so a cube is appropriate as it provides a semantic layer, among other advantages, to make report building very easy – see Why use a SSAS cube?).

Relational data warehouses continue to meet the information needs of users and continue to provide value.  Many people use them, depend on them, trust them, and don’t want them to be replaced with a data lake.  Data lakes offer a rich source of data for data scientists and self-service data consumers (“power users”) and serve analytics and big data needs well.  But not all data and information workers want to become power users.  The majority (at least 90%) continue to need well-integrated, systematically cleansed, easy-to-access relational data that includes a large body of time-variant history.  These people are best served with a data warehouse.

I can’t stress enough that if you need high-data-quality reports, you need to apply the exact same transformations to the same data to produce that report, no matter what your technical implementation is.  Whether you call it a data lake or a data warehouse, and whether you use an ETL tool or Python code, the development and maintenance effort is still there.  You need to avoid falling into the old mistake of thinking the data lake does not need data governance.  It’s not a place with unicorns and fairies that will magically make all the data come out properly – a data lake is just a glorified file folder.

Here are some of the reasons why it is not a good idea to have a data lake in Hadoop as your data warehouse and forgo a relational data warehouse:

  • Hadoop does not provide for very fast query reads in all use cases.  While Hadoop has come a long way in this area, Hive LLAP and Spark SQL have limits on what types of queries they support (e.g. not having full support for ANSI SQL, such as certain aggregate functions, which limits the range of users, tools, and applications that can access Hadoop data) and it still isn’t quite at the performance level that a relational database can provide
  • Hadoop lacks a sophisticated query optimizer, in-database operators, advanced memory management, concurrency, dynamic workload management and robust indexing strategies and therefore performs poorly for complex queries
  • Hadoop does not have the ability to place “hot” and “cold” data on a variety of storage devices with different levels of performance to reduce cost
  • Hadoop is not relational, as all the data is in files in HDFS, so there is always a conversion process to convert the data to a relational format if a reporting tool requires it in a relational format
  • Hadoop is not a database management system.  It does not have functionality such as update/delete of data, referential integrity, statistics, ACID compliance, data security, and the plethora of tools and facilities needed to govern corporate data assets
  • There is no metadata stored in HDFS, so another tool such as a Hive Metastore needs to be used to store that, adding complexity and slowing performance.  And most metastores only work with a limited number of tools, requiring multiple metastores
  • Finding expertise in Hadoop is very difficult: compare the small number of people who understand Hadoop and all its various versions and products with the large number of people who know SQL
  • Hadoop is super complex, with lots of integration with multiple technologies to make it work
  • Hadoop has many tools/technologies/versions/vendors (fragmentation), no standards, and it is difficult to make it a corporate standard.  See all the various Apache Hadoop technologies here
  • Some reporting tools don’t work against Hadoop
  • May require end-users to learn new reporting tools and Hadoop technologies to query the data
  • The newer Hadoop solutions (Tez, Spark, Hive LLAP, etc.) are still figuring themselves out.  Customers might not want to take the risk of investing in one of these solutions that may become obsolete (like MapReduce)
  • It might not save you much in costs: you still have to purchase hardware or pay for cloud consumption, support, licenses, training, and migration costs.  As relational databases scale up, support non-standard data types like JSON, and run functions written in Python, Perl, and Scala, it makes it even more difficult to replace them with a data lake as the migration costs alone would be substantial
  • If you need to combine relational data with Hadoop, you will need to move that relational data to Hadoop or invest in a technology such as PolyBase to query Hadoop data using SQL (see the sketch after this list)
  • Is your current IT experience and comfort level mostly around non-Hadoop technologies, like SQL Server?  Many companies have dozens or hundreds of employees that know SQL Server and not Hadoop so therefore would require a ton of training as Hadoop can be overwhelming

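To expand on the PolyBase point above, here is a hedged T-SQL sketch (the storage account, file layout, and columns are all hypothetical) of exposing Hadoop data as an external table that can then be queried, and joined to relational tables, with plain SQL:

```sql
-- Point PolyBase at Hadoop-compatible storage
CREATE EXTERNAL DATA SOURCE HadoopLake
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://data@mystorage.blob.core.windows.net');

CREATE EXTERNAL FILE FORMAT CsvFile
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- The data stays in the lake; SQL Server reads it on demand
CREATE EXTERNAL TABLE dbo.WebLogs
( ClientIp   VARCHAR(50),
  HitDate    DATE,
  UrlVisited VARCHAR(500) )
WITH (LOCATION = '/weblogs/',
      DATA_SOURCE = HadoopLake,
      FILE_FORMAT = CsvFile);

-- Now it can be queried (and joined to relational tables) like any other table
SELECT TOP 10 w.UrlVisited, COUNT(*) AS Hits
FROM dbo.WebLogs AS w
GROUP BY w.UrlVisited
ORDER BY Hits DESC;
```
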
As far as performance goes, it is greatly affected by the use of indexing: Hive with LLAP (or without it) doesn’t have indexing, so when you run a query, it reads all of the data (minus partition elimination).  Spark SQL, on the other hand, isn’t really an interactive environment – it’s fast-batch – so again, you’re not going to see the performance users expect from a relational database.  Also, a relational database still beats most competitors when performing complex, multi-way joins.  Given that most analytic queries are just that, a traditional data warehouse still might be the right choice.
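
To illustrate why partition elimination matters so much there, a hypothetical HiveQL sketch – partition pruning is essentially the only substitute Hive has for an index:

```sql
-- HiveQL: partition by date so queries can prune at the folder level
CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (sale_date STRING)
STORED AS ORC;

-- Touches only the one matching partition...
SELECT SUM(amount) FROM sales WHERE sale_date = '2018-03-01';

-- ...while a predicate on a non-partition column scans every file
SELECT SUM(amount) FROM sales WHERE order_id = 42;
```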

From a security standpoint, you would need to integrate Hive LLAP or Spark with Apache Ranger to support granular security definition at the column level, including data masking where appropriate.

Concurrency is another thing to think about – Hadoop clusters have to get VERY large to support hundreds or thousands of concurrent connections – remember, these systems aren’t designed for interactive usage – they are optimized for batch and we are trying to shoehorn interactivity on top of that.

A traditional relational data warehouse should be viewed as just one more data source available to a user on some very large federated data fabric.  It is just pre-compiled to run certain queries very fast.  And a data lake is another data source for the right type of people.  A data lake should not be blocked from all users so you don’t have to tell everyone “please wait three weeks while I mistranslate your query request into a new measure and three new dimensions in the data warehouse”.

Most data lake vendors assume data scientists or skilled data analysts are the principal users of the data.  So, they can feed these skilled data users the raw data.  But most business users get lost in that morass.  So, someone has to model the data so it makes sense to business users.  In the past, IT did this, but now data scientists and data analysts can do it using powerful, self-service tools.  But the real question is: does a data scientist or analyst think locally or globally?  Do they create a model that supports just their use case, or do they think more broadly about how this data set can support other use cases?  So it may be best to continue to let IT model and refine the data inside a relational data warehouse so that it is suitable for different types of business users.

I’m not saying your data warehouse can’t consist of just a Hadoop data lake, as it has been done at Google, the NY Times, eBay, Twitter, and Yahoo.  But are you as big as them?  Do you have their resources?  Do you generate data like them?  Do you want a solution that only 1% of the workforce has the skillset for?  Is your IT department radical or is it conservative?

I think a relational data warehouse still has an important place: performance, ease of access, security, integration with reporting components, and concurrency all lean towards using it, especially when performing the complex, multi-way joins that make up analytic queries – the sweet spot for a traditional data warehouse.

The bottom line is a majority of end users need the data in a relational data warehouse to easily do self-service reporting off of it.  A Hadoop data lake should not be a replacement for a data warehouse, but rather should augment/complement a data warehouse.

More info:

Is Hadoop going to Replace Data Warehouse?


The Demise of the Data Warehouse

Counterpoint: The Data Warehouse is Still Alive

The Future of the Data Warehouse

Whither the Data Warehouse? Reflections From Strata NYC 2017

Big Data Solutions Decision Tree

Dimensional Modeling and Kimball Data Marts in the Age of Big Data and Hadoop

Hadoop vs Data Warehouse: Apples & Oranges?



What is Azure Databricks?

Azure Databricks (documentation and user guide) was announced at Microsoft Connect, and with this post I’ll try to explain its use case.  At a high level, think of it as a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project.  It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib).  It has built-in integration with Azure Blob Storage, Azure Data Lake Storage (ADLS), Azure SQL Data Warehouse (SQL DW), Cosmos DB, Azure Event Hubs, Apache Kafka for HDInsight, and Power BI (see Spark Data Sources).  Think of it as an alternative to HDInsight (HDI) and Azure Data Lake Analytics (ADLA).

It differs from HDI in that HDI is a PaaS-like experience that allows working with many more OSS tools at a lower cost.  Databricks’ advantage is that it is a Software-as-a-Service-like experience (or Spark-as-a-service) that is easier to use, has native Azure AD integration (HDI security is via Apache Ranger and is Kerberos-based), has auto-scaling and auto-termination (like a pause/resume), has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.  Note that all clusters within the same workspace share data among all of those clusters.

Also note that with its built-in integration to SQL DW, it can write directly to SQL DW.  HDInsight cannot, and therefore requires more steps: when HDInsight processes data it must write it back to Blob Storage and then use Azure Data Factory (ADF) to move the data from Blob Storage to SQL DW.
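
As a hedged sketch of that direct write from a Databricks notebook (server, credentials, storage account, and table names are all placeholders), the SQL DW connector can be driven from Spark SQL, staging data through Blob Storage via the tempDir option:

```sql
-- Write query results straight into SQL DW via the Databricks sqldw connector
CREATE TABLE dw_sales_summary
USING com.databricks.spark.sqldw
OPTIONS (
  url 'jdbc:sqlserver://myserver.database.windows.net:1433;database=MyDW;user=dwuser;password=<placeholder>',
  forwardSparkAzureStorageCredentials 'true',
  dbTable 'dbo.SalesSummary',
  tempDir 'wasbs://tempdata@mystorage.blob.core.windows.net/tmp'  -- staging area the connector uses
)
AS SELECT region, SUM(amount) AS total_sales
   FROM curated_sales       -- a hypothetical table already in the workspace
   GROUP BY region;
```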

It is in limited public preview now: Sign up for the Azure Databricks limited preview

More info:

Microsoft makes Databricks a first-party service on Azure


Microsoft Launches Preview of Azure Databricks

A technical overview of Azure Databricks

Microsoft Azure Debuts a ‘Spark-as-a-Service’



Microsoft Connect(); announcements

Microsoft Connect(); is a developer event running Nov 15-17, where plenty of announcements are made.  Here is a summary of the data platform-related announcements:

  • Azure Databricks: In preview, this is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. It delivers one-click setup, streamlined workflows, and an interactive workspace, all integrated with Azure SQL Data Warehouse, Azure Storage, Azure Cosmos DB, Azure Active Directory, and Power BI.  More info
  • Azure Cosmos DB with Apache Cassandra API: In preview, this enables Cassandra developers to simply use the Cassandra API in Azure Cosmos DB and enjoy the benefits of Azure Cosmos DB with the familiarity of the Cassandra SDKs and tools, with no code changes to their application.  More info.  See all Cosmos DB announcements
  • Microsoft joins the MariaDB Foundation: Microsoft is a platinum sponsor – MariaDB is a community-developed fork of the MySQL relational database management system, and Microsoft will be actively contributing to MariaDB and the MariaDB community.  More info
  • Azure Database for MariaDB: An upcoming preview will bring fully managed service capabilities to MariaDB, further demonstrating Microsoft’s commitment to meeting customers and developers where they are by offering their favorite technologies on Azure.  More info
  • Azure SQL Database with Machine Learning Services: In preview, this provides support for machine learning models inside Azure SQL Database. This makes it seamless for data scientists and developers to create and train models in Azure Machine Learning and deploy them directly to Azure SQL Database to create predictions at blazing-fast speeds (see the sketch after this list)
  • Visual Studio Code Tools for AI: In preview, create, train, manage, and deploy AI models with all the productivity of Visual Studio and the power of Azure.  Works on Windows and MacOS.  More info
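
On the Machine Learning Services item above, in-database scoring goes through sp_execute_external_script, the same entry point SQL Server 2017 Machine Learning Services uses.  A hedged T-SQL sketch with invented model and table names:

```sql
-- Grab a previously trained, serialized model (hypothetical dbo.Models table)
DECLARE @model VARBINARY(MAX) =
  (SELECT TOP (1) ModelObject FROM dbo.Models ORDER BY TrainedOn DESC);

-- Score new rows in-database with R, without moving the data out
EXEC sp_execute_external_script
  @language = N'R',
  @script = N'
    mdl <- unserialize(as.raw(model))
    OutputDataSet <- data.frame(Score = predict(mdl, InputDataSet))',
  @input_data_1 = N'SELECT Feature1, Feature2 FROM dbo.NewCustomers',
  @params = N'@model varbinary(max)',
  @model = @model
WITH RESULT SETS ((Score FLOAT));
```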