PASS Summit Announcements: PolyBase enhancements

Microsoft usually has some interesting announcements at the PASS Summit, and this year was no exception.  I’m writing a set of blog posts covering the major announcements.  Next up are the PolyBase enhancements.

PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.  It allows you to run queries on external data in Hadoop or Azure blob storage.  The queries are optimized to push computation to Hadoop via MapReduce jobs.

By simply using Transact-SQL (T-SQL) statements, you can import and export data back and forth between relational tables in SQL Server and non-relational data stored in Hadoop or Azure blob storage.  You can also query the external data from within a T-SQL query and join it with relational data.
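
As a minimal sketch of what this looks like in T-SQL, the statements below define an external table over delimited files in Hadoop, join it with a relational table, and then import and export data. All object names, columns, addresses, ports, and paths (MyHadoopCluster, dbo.SensorReadings_External, dbo.Customers, 10.10.10.10, /data/sensors/ and so on) are hypothetical and only for illustration, and the sketch assumes PolyBase is already installed and configured for your Hadoop distribution.

```sql
-- Hypothetical Hadoop cluster: replace the addresses and ports with your own
-- name node and resource manager. Supplying RESOURCE_MANAGER_LOCATION is what
-- lets PolyBase consider pushing computation down to Hadoop.
CREATE EXTERNAL DATA SOURCE MyHadoopCluster
WITH (
    TYPE = HADOOP,
    LOCATION = 'hdfs://10.10.10.10:8020',
    RESOURCE_MANAGER_LOCATION = '10.10.10.10:8050'
);

-- Describes how the files are laid out (pipe-delimited text is an assumption).
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- The external table is only metadata; the data stays in Hadoop.
CREATE EXTERNAL TABLE dbo.SensorReadings_External (
    CustomerId   INT           NOT NULL,
    ReadingTime  DATETIME2     NOT NULL,
    Temperature  DECIMAL(5, 2) NOT NULL
)
WITH (
    LOCATION = '/data/sensors/',
    DATA_SOURCE = MyHadoopCluster,
    FILE_FORMAT = PipeDelimitedText
);

-- Query: join external data with a relational table (dbo.Customers is assumed
-- to exist). The hint is optional; without it the optimizer decides whether
-- to push work to Hadoop.
SELECT c.CustomerName, AVG(s.Temperature) AS AvgTemperature
FROM dbo.Customers AS c
JOIN dbo.SensorReadings_External AS s
    ON s.CustomerId = c.CustomerId
WHERE s.Temperature > 30
GROUP BY c.CustomerName
OPTION (FORCE EXTERNALPUSHDOWN);

-- Import: copy the external data into a new local table.
SELECT CustomerId, ReadingTime, Temperature
INTO dbo.SensorReadings_Local
FROM dbo.SensorReadings_External;

-- Export: writing to an external table must be enabled first.
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;

-- dbo.SensorReadings_Archive is a hypothetical local table with matching columns.
INSERT INTO dbo.SensorReadings_External
SELECT CustomerId, ReadingTime, Temperature
FROM dbo.SensorReadings_Archive;
```

The same pattern applies to Azure blob storage: only the data source definition changes, while the external table and the queries against it stay the same.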

The major use cases for PolyBase are:

  • Load data: Use Hadoop as an ETL tool to cleanse data before loading it into the data warehouse with PolyBase
  • Interactively Query: Analyze relational data with semi-structured data using split-based query processing
  • Age-out Data: Age-out data to HDFS and use it as ‘cold’ but query-able storage

The main benefits of PolyBase are:

  • New business insights across your data lake
  • Leverage existing skillsets and BI tools
  • Faster time to insights and simplified ETL process

PolyBase supports the following file formats: delimited text (UTF-8), Hive RCFile, Hive ORC, and Parquet.  Files compressed with gzip, zlib, or Snappy are also supported.

For more details, see the Introduction to PolyBase presentation, the PolyBase Guide, and the list of supported data sources.


PolyBase was first made available in Analytics Platform System in March 2013, and then in SQL Server 2016.  The announcement at the PASS Summit was that, in a preview early next year, PolyBase in SQL Server 2016 will support Teradata, Oracle, SQL Server, and MongoDB in addition to Hadoop and Azure blob storage.  Azure Data Lake Store will also be supported by PolyBase in Azure SQL Data Warehouse.

With SQL Server 2016, you can create a cluster of SQL Server instances to process large data sets from external data sources in a scale-out fashion for better query performance (see PolyBase scale-out groups).
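
A rough sketch of joining an instance to a scale-out group is shown below. The head node name HEADNODE01 is hypothetical, the port and instance name are the defaults and may differ in your environment, and the sketch assumes PolyBase is installed on every instance involved.

```sql
-- Run on each instance that should become a compute node.
-- Arguments: head node machine name (hypothetical here), the head node's
-- data movement service control channel port (16450 is the default), and the
-- head node's SQL Server instance name.
EXEC sp_polybase_join_group 'HEADNODE01', 16450, 'MSSQLSERVER';

-- After restarting the PolyBase services on the compute node, run this on the
-- head node to list the nodes that are part of the group.
SELECT * FROM sys.dm_exec_compute_nodes;
```

Queries against external tables are then submitted to the head node, which distributes the work across the compute nodes.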

In summary, the main reasons to use PolyBase are:

  • Ability to integrate SQL Server with data stored in HDFS or Azure blob storage
  • Commodity hardware and storage are cheap and easily distributed on HDFS, which increases data reliability at a low cost
  • An increasing number of different types of data (structured, unstructured, and semi-structured), each of which can be stored on the most suitable system and queried in one place
  • Increasing size of data and a strong aversion to deleting data, whether due to company culture or restrictions

More info:

Integrating Big Data and SQL Server 2016

PolyBase in SQL Server 2016 video

Polybase in SQL Server – Big Data Queried with T-SQL video

Why Polybase?

About James Serra

James is a big data and data warehousing solution architect at Microsoft. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 25 years of IT experience.