PolyBase explained

PolyBase is a new technology that integrates Microsoft’s MPP product, SQL Server Parallel Data Warehouse (PDW), with Hadoop.  It is designed to enable queries across relational data stored in PDW and in non-relational Hadoop data that is stored in the Hadoop Distributed File System (HDFS), bypassing Hadoop’s MapReduce distributed computing engine that is typically used to read data from HDFS.  You can create an external table in PDW that references Hadoop data (kinda like a linked server) and you can then query it with SQL, in essence adding structure to un-structured data.  So you can: 1) retrieve data from HDFS with a PDW query that will even allow that data to be joined to native PDW relational tables so that Hadoop and SQL PDW can be queried in tandem, with result sets that integrate data from each source (seamlessly joining structured and semi-structured data); 2) you can import data from HDFS to PDW; and 3) you can export data from PDW to HDFS (for example, as a backup strategy).

The biggest benefit with PolyBase is you don’t need to understand HDFS or MapReduce (typically written in Java) to access Hadoop, and there is no ETL needed.  And you can quickly and easily use a tool such as Power Pivot to connect to PDW and pull in data from PDW tables and external Hadoop tables.

Microsoft Technical Fellow David Dewitt is one of the principals behind PolyBase.  Some things to note:

  • When selecting data in Hadoop, the data is not stored in PDW – it uses a ShuffleMove/BroadcastMove/Round Robin to temporarily bring the data into PDW into temporary tables
  • PolyBase only works within PDW for now, but later it might be added to SQL Server (but there are no plans for that).  PolyBase relies on the Data Movement Service (DMS) in PDW, and DMS does not exist in SQL Server
  • It does not support DML operations
  • It may in the future be able to access other storage systems besides Hadoop
  • It only works for delimited text files
  • It requires Java RunTime environment (Oracle JRE)
  • It can connect to Hortonworks Data Platform (HDP) on Windows Server, HDP on Linux, Cloudera (CHD) on Linux
  • Soon PDW will have the ability to add a Hadoop scale-unit (compute nodes and storage) right into the PDW rack

In a future version of PolyBase the query optimizer will be able make a cost-based decision, when referencing data in an HDFS, to determine whether it should transform the query into a MapReduce job to be performed on the Hadoop cluster or if it should just process using the SQL server instances on the PDW.  Also, the optimizer will have the ability to move the workload of a query involving only PDW data to the Hadoop cluster.  This intelligence within the optimizer will allow it to split the workload between the two platforms and thus leverage the true capabilities of the Hadoop cluster.

polybase

So in summary, the main features of PolyBase are:

  • Simplicity: You can query data in Hadoop via regular SQL
  • Performance: Parallelized data reading and writing into Hadoop
  • Openness: Supports various Hadoop distributions
  • Integration: Works with Microsoft BI tools such as Power Pivot, Power View, SSRS, SSAS

Untitled picture

More info:

Seamless insights on structured and unstructured data with SQL Server 2012 Parallel Data Warehouse

Polybase: Hadoop Integration in SQL Server PDW V2

Microsoft’s PolyBase mashes up SQL Server and Hadoop

Insight through Integration: SQL Server 2012 Parallel Data Warehouse

PASS talk: Polybase: What, Why, How

PolyBase in APS – Yet another SQL over Hadoop solution?

About James Serra

James currently works for Microsoft specializing in big data and data warehousing using the Analytics Platform System (APS), a Massively Parallel Processing (MPP) architecture. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence/MDM architect and developer, specializing in the Microsoft BI stack. He is a SQL Server MVP with over 25 years of IT experience.
This entry was posted in Hadoop, PDW, SQLServerPedia Syndication. Bookmark the permalink.

8 Responses to PolyBase explained

  1. I watched the joint Hortonworks and Microsoft webinar a few days ago on how PDW has incorporated Hortonworks. When they got to the section where the presenter showed how a user can easily join existing PDW data with Hadoop Hive/HCatalog tables it was impressive. Really the future of how easy it will be to merge unstructured data in Hadoop with traditionally structured data. Now, how soon will we be able to get that with the basic SQL Server BI stack?

  2. Pingback: Introduction to Hadoop - SQL Server - SQL Server - Toad World

  3. Pingback: Introduction to Hadoop | James Serra's Blog

  4. Pingback: What is HDInsight? | James Serra's Blog

  5. Pingback: Parallel Data Warehouse (PDW) benefits made simple | James Serra's Blog

  6. Pingback: Parallel Data Warehouse (PDW) benefits made simple - SQL Server - SQL Server - Toad World

  7. Pingback: Parallel Data Warehouse (PDW) AU1 released - SQL Server - SQL Server - Toad World

  8. Pingback: What is the Microsoft Analytics Platform System (APS)? - SQL Server - SQL Server - Toad World

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>