PolyBase explained

PolyBase is a new technology that integrates Microsoft's MPP product, SQL Server Parallel Data Warehouse (PDW), with Hadoop.  It is designed to enable queries across relational data stored in PDW and non-relational data stored in the Hadoop Distributed File System (HDFS), bypassing MapReduce, the distributed computing engine typically used to read data from HDFS.  You can create an external table in PDW that references Hadoop data (somewhat like a linked server) and then query it with SQL, in essence adding structure to unstructured data.  So you can: 1) retrieve data from HDFS with a PDW query, including joins to native PDW relational tables, so that Hadoop and PDW can be queried in tandem with result sets that seamlessly integrate structured and semi-structured data from both sources; 2) import data from HDFS into PDW; and 3) export data from PDW to HDFS (for example, as a backup strategy).
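As a sketch of what the first capability looks like in T-SQL (the table names, columns, and HDFS location below are hypothetical, and the exact external-table syntax may vary by PDW version), you declare an external table over a delimited file in HDFS and then join it to a native PDW table:

```sql
-- Hypothetical external table over a pipe-delimited clickstream file in HDFS
CREATE EXTERNAL TABLE dbo.ClickStream (
    url        VARCHAR(500),
    event_date DATE,
    user_ip    VARCHAR(50)
)
WITH (
    LOCATION = 'hdfs://10.0.0.1:8020/data/clickstream/',  -- HDFS NameNode and path
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')               -- delimited text only
);

-- One SQL query joins the Hadoop data to a native PDW relational table
SELECT c.CustomerName, COUNT(*) AS page_hits
FROM dbo.ClickStream AS cs
JOIN dbo.Customer AS c
    ON cs.user_ip = c.LastKnownIP
GROUP BY c.CustomerName;
```

Once the external table exists, any PDW query tool can treat it like an ordinary table.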

The biggest benefit of PolyBase is that you don't need to understand HDFS or MapReduce (typically written in Java) to access Hadoop data, and no ETL is needed.  And you can quickly and easily use a tool such as Power Pivot to connect to PDW and pull in data from both PDW tables and external Hadoop tables.
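The import and export paths can be sketched the same way (again with hypothetical object names, and syntax that may differ by PDW version): importing is a CREATE TABLE AS SELECT from the external table, and exporting is an external table created from a SELECT over PDW data:

```sql
-- Import: materialize Hadoop data as a native, distributed PDW table
CREATE TABLE dbo.ClickStream_Local
WITH (DISTRIBUTION = HASH(user_ip))   -- PDW tables declare a distribution key
AS SELECT * FROM dbo.ClickStream;     -- dbo.ClickStream is the external table

-- Export: write the result of a PDW query out to HDFS as delimited text
CREATE EXTERNAL TABLE dbo.Orders_Archive
WITH (
    LOCATION = 'hdfs://10.0.0.1:8020/archive/orders/',
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
)
AS SELECT * FROM dbo.Orders WHERE order_date < '2012-01-01';
```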

Microsoft Technical Fellow David DeWitt is one of the principals behind PolyBase.  Some things to note:

  • When selecting data from Hadoop, the data is not permanently stored in PDW – PDW uses a ShuffleMove, BroadcastMove, or round-robin operation to bring the data into temporary tables in PDW
  • PolyBase only works within PDW for now; it might later be added to SQL Server, but there are no plans for that.  PolyBase relies on the Data Movement Service (DMS) in PDW, and DMS does not exist in SQL Server
  • It does not support DML operations
  • It may in the future be able to access other storage systems besides Hadoop
  • It only works for delimited text files
  • It requires the Java Runtime Environment (Oracle JRE)
  • It can connect to Hortonworks Data Platform (HDP) on Windows Server, HDP on Linux, and Cloudera (CDH) on Linux
  • Soon PDW will have the ability to add a Hadoop scale unit (compute nodes and storage) right into the PDW rack
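As an illustration of how the supported distributions in the list above get selected, later PolyBase implementations (SQL Server 2016, for example) expose a connectivity setting; the mechanism and level values shown here follow that convention and may differ on PDW itself:

```sql
-- Tell PolyBase which Hadoop distribution to talk to
-- (level values are illustrative; consult the product documentation)
EXEC sp_configure 'hadoop connectivity', 5;  -- e.g. a level mapping to HDP on Linux
RECONFIGURE;
```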

In a future version of PolyBase, the query optimizer will be able to make a cost-based decision, when referencing data in HDFS, about whether to transform part of the query into a MapReduce job performed on the Hadoop cluster or to process it using the SQL Server instances on the PDW appliance.  The optimizer will also have the ability to move the workload of a query involving only PDW data to the Hadoop cluster.  This intelligence will allow the optimizer to split a workload between the two platforms and thus leverage the true capabilities of the Hadoop cluster.


So in summary, the main features of PolyBase are:

  • Simplicity: You can query data in Hadoop via regular SQL
  • Performance: Parallelized data reading and writing into Hadoop
  • Openness: Supports various Hadoop distributions
  • Integration: Works with Microsoft BI tools such as Power Pivot, Power View, SSRS, SSAS


More info:

Seamless insights on structured and unstructured data with SQL Server 2012 Parallel Data Warehouse

Polybase: Hadoop Integration in SQL Server PDW V2

Microsoft’s PolyBase mashes up SQL Server and Hadoop

Insight through Integration: SQL Server 2012 Parallel Data Warehouse

PASS talk: Polybase: What, Why, How

PolyBase in APS – Yet another SQL over Hadoop solution?

About James Serra

James is a big data and data warehousing solution architect at Microsoft. Previously he was an independent consultant working as a Data Warehouse/Business Intelligence architect and developer. He is a prior SQL Server MVP with over 25 years of IT experience.
This entry was posted in Hadoop, PDW/APS, PolyBase, SQLServerPedia Syndication.

Responses to PolyBase explained

  1. I watched the joint Hortonworks and Microsoft webinar a few days ago on how PDW has incorporated Hortonworks. When they got to the section where the presenter showed how a user can easily join existing PDW data with Hadoop Hive/HCatalog tables, it was impressive. It really is the future of how easily unstructured data in Hadoop will merge with traditionally structured data. Now, how soon will we be able to get that with the basic SQL Server BI stack?

  2. Romeo Valencia says:

    In our product/solution we have decided to revise the existing data strategy and architecture. We have reached the point where we need a common gateway or endpoint where structured and unstructured data can be accessed seamlessly and homogenized. Our back-end data processing is all Hadoop-based infrastructure, and our OLTP/DWH databases currently run on the SQL Server 2008 R2 platform. According to our research, the new version of SQL Server, SQL Server 2014, offers a specific feature suited to our requirements: PolyBase. Our main objectives for the re-architecting are as follows: 1. minimal to zero revisions/impact to our front-end application; 2. optimal performance in accessing (including manipulating) both structured and unstructured data; 3. scalability and maintainability; 4. reuse of existing skill sets. What we are looking for: 1. some success stories of implementing this feature and, if need be, a presentation; 2. guidance in implementing the feature; 3. any other features that extend SQL Server to Hadoop.

  3. This is a follow-up to my first comment. Well, it appears that PolyBase will be coming to the Enterprise edition of SQL Server 2016. I've reviewed CTP 2.2, and PolyBase is a big part of it. This is great news. If anyone is interested, I've created a series of posts on how to set up and get started with PolyBase in SQL Server 2016.
    You can find the posts here: http://realizeddesign.blogspot.com/

  4. Cyed says:

    Hi Sir,

    Great article, sir. I have been waiting for this kind of material, which can help us fetch data from Hadoop without MapReduce, YARN, Pig, Hive, etc.
    Thank you for posting.