Microsoft SQL Server Parallel Data Warehouse (PDW) Explained

Microsoft SQL Server Parallel Data Warehouse (PDW), formally called by its code name “Project Madison”, is an edition of Microsoft’s SQL Server 2008 R2 that was released in December 2010.  PDW is Microsoft’s reworking of the DatAllegro Inc. massive parallel processing (MPP) product that Microsoft acquired in July 2008.  It only works with certain hardware (two so far), the first of which is HP Enterprise Data Warehouse Appliance (Dell Parallel Data Warehouse Appliance is the other, with a couple more to come in the near future: IBM and Bull).  This edition of SQL Server can’t be bought as an independent piece of software, it has to be bought along with the hardware.

So what is MPP?  Until now, SQL Server has been a Symmetric Multiprocessing (SMP) solution, which essentially means it uses one server.  MPP provides scalability and query performance by running independent servers in parallel.  That is the quick definition.  For more details, read What MPP means to SQL Server Parallel Data Warehouse.

MPP is also available from other companies such as EMC Greenplum, Teradata, Oracle ExadataHP Vertica, and IBM Netezza, but those use proprietary systems, where PDW can be used with commodity hardware, providing a much lower cost per terabyte.  But it’s still not in-expensive: The hardware and installation will cost around $2 million (not including software licenses), but gets you “200 times faster queries and 10 times the scalable than traditional Microsoft SQL Server deployments” (see press release).  PDW also comes with its own support model.

Microsoft has had clustering capabilities in SQL Server for a while, but the scalability part was lacking.  This is where PDW comes in.  Scalability in PDW means handling tens of terabytes of data and then moving to hundreds of terabytes worth (up to 600 TB).  At about 50 terabytes to 60 terabytes of data, clustering is needed; thereafter, clustering starts to approach its limits, and that is when you need to move to PDW.  Clustering brings concurrency to the system and reduces load, but it can’t reduce the time that a single query would take without any resource latency. To break this barrier, parallelism would be required to execute bits of the same request simultaneously and this is what exactly this setup would bring to the table.  PDW partitions large tables across multiple physical nodes, each having its own dedicated CPU, memory, storage, and each running its own instance of SQL Server in a parallel shared nothing design.  Tables can either be replicated, where a copy will be on each node (usually for dimension tables), or distributed, where portions of a table are uniformly distributed across all nodes (usually for fact tables).

One drawback to PDW is that it does not use SQL Server Management Studio, but uses a third-party tool called Nexus Chameleon (this third-party tool is needed because SSMS hasn’t been reworked to connect directly to the control node of the Parallel Data Warehouse).  It also uses its own query engine and not all features of SQL Server are supported.  So, you might not be able to use all your DBA tricks.  And you wouldn’t want to build a solution against SQL Server and then just hope to upsize it to Parallel Data Warehouse Edition.

PDW uses multiple servers within the appliance, virtualized as if they were one unified data warehousing resource available.  It can use up to 480 cores.  PDW Works by controlling several different physical servers each running their own instance of SQL Server 2008 R2. The database and it’s tables are spread across these physical servers but appear as one database and table(s) to the end-user. The data warehouse appliance or brain of the PDW manages query execution and the meta data for what is stored and processed on what portion of the PDW.

Microsoft IT’s experience with PDW showed when they migrated Information Security Consolidated Event Management (ICE) to PDW, they saw query performance improve to an average of 15-20 times faster in PDW, SSIS data load throughput of up to 285 GB/hour (with minimal query performance impact), and support for up to 12 TB/day in throughput in SSIS.  See video.

Part of the technology incorporated into PDW includes a parallel database copy that enables rapid data movement and consistency between PDW and data marts used by SSAS.

In short, PDW is ideal for large data warehouses and BI, but not for OLTP systems.  Write one check, and you get a complete soup-to-nuts data warehouse storage engine that includes everything from the servers, SAN, configuration, and training.

HP calls PDW by a different name: Enterprise Data Warehouse (EDW).  Here is the layout of the HP Enterprise Data Warehouse Appliance (full specs here with review and architecture overview and performance guide).  The architecture is hub-and-spoke and supports up to 47 servers, made up of a control rack and a data rack.  A one rack system has 17 servers, 22 processors/132 cores, and 125TB and can be scaled out to a four rack system with 47 servers, 82 processors/492 cores, and 500TB:

The future road map for PDW includes column store, petabyte scalability, real-time data warehousing, MDM, and data quality.

More Info:

Microsoft SQL Server Parallel Data Warehouse Edition

Parallel Data Warehouse News and Hadoop Interoperability Plans

Microsoft ships CTP of Hadoop Connectors for SQL Server and Parallel Data Warehouse

Bill Inmon’s Data Warehousing 2.0 and SQL Server: Architecture and Vision

Implementing a SQL Server PDW Using the Kimball Approach

TechEd Video All Up Data Warehouse: From SMP to Parallel Data Warehouse

TechEd Video Microsoft SQL Server 2008 R2 Parallel Data Warehouse Deep Dive

TechEd Video Highly Scalable BI and Data Warehouse Solutions from HP and Microsoft

TechEd Video Best Practices in Fast Track, Parallel Data Warehouse, and Traditional Data Warehouse Design

TechEd Video Parallel Data Warehouse Overview

TechEd Video Microsoft SQL Server Reference Architecture and Appliances

TechEd Video From Fast Track to Parallel Data Warehouse

Brian Mitchell’s SQLPDW blog

eClinic 10333- Introduction to Microsoft® SQL Server® 2008 R2 Parallel Data Warehouse

Parallel Data Warehouse has speed, space, special requirements

Using SQL Server to Build a Hub-and-Spoke Enterprise Data Warehouse Architecture

Microsoft SQL Server Data Warehousing

Pragmatic Works Video SQL Server Parallel Data Warehouseing (PDW)

Pragmatic Works Video Scaling a Data Warehouse with EDW

Microsoft SQL Server Parallel Data Warehouse for Newcomers

Video Building BI Solutions with Microsoft SQL Server PDW AU3

Designing, Developing, and Deploying the World’s First Parallel Data Warehouse Customer Solution

Introduction to PDW (Parallel Data Warehouse)

About James Serra

James is SQL Server MVP and a independent consultant with the title of Business Intelligence/Data Warehouse/Master Data Management Architect and Developer, specializing in the Microsoft SQL Server BI stack.
This entry was posted in Appliance, PDW, SQL Server, SQLServerPedia Syndication. Bookmark the permalink.

31 Responses to Microsoft SQL Server Parallel Data Warehouse (PDW) Explained

  1. Pingback: Microsoft SQL Server Reference Architecture and Appliances | James Serra's Blog

  2. Pingback: Microsoft's Data Warehouse offerings | James Serra's Blog

  3. Pingback: Microsoft Data Warehouse Offerings « Garrett Edmondson

  4. Pingback: Reduce heap table fragmentation | James Serra's Blog

  5. Pingback: products | Pearltrees

  6. Pingback: SQL Server Parallel Data Warehouse (PDW) Appliance Update 3 (AU3) | James Serra's Blog

  7. Pingback: SQL Server Parallel Data Warehouse (PDW) and SQL Server Fast Track Data Warehouse « The SQL Pro

  8. Pingback: PASS 2012 Announcements | James Serra's Blog

  9. Pingback: SQL Server Parallel Data Warehouse (PDW) 介绍 | 架构(architecture.riaos.com)

  10. Richard Schneider says:

    Hi James,

    I believe your following statement is somewhat incorrect.
    “MPP is also available from other companies such as EMC Greenplum, Teradata, Oracle Exadata, HP Vertica, and IBM Netezza, but those use proprietary systems, where PDW can be used with commodity hardware, providing a much lower cost per terabyte.”
    I believe Terada and Vertica utilize unmodified commodity hardware while MS PDW requires specially configured proprietary HP or Dell hardware.

    Thanks.
    Dick Schneider

    • Mark Stacey says:

      Hi Dick

      All the hardware in a Dell or HP PDW is commodity. On one of my clients, we repurposed a piece of old kit from a V1 appliance for other workloads after we upgraded.

      Configured specifically, yes. But not custom hardware like the Netezza FPGAs

  11. Pingback: Parallel Data Warehouse (PDW) Version 2 - SQL Server Blog - SQL Server - Telligent

  12. Pingback: Parallel Data Warehouse (PDW) Version 2 | James Serra's Blog

  13. Mj Redrowthu says:

    Hello,

    As you said, Dimension tables get replicated and Fact tables get distributed across the the nodes in PDW. Is the data stored in same manner in Hadoop HDFS?

  14. Pingback: SQL Servers Big Brother – Parallel Data Warehouse | uxcps information management blog

  15. Pingback: SQL Server Data Tools (SSDT) – January 2014 update | James Serra's Blog

  16. Pingback: SQL Server Data Tools (SSDT) – January 2014 update - SQL Server - SQL Server - Toad World

  17. Pingback: Why I just became a Microsoft Employee | James Serra's Blog

  18. Pingback: PolyBase explained | James Serra's Blog

  19. Pingback: PolyBase explained - SQL Server - SQL Server - Toad World

  20. Pingback: What is HDInsight? | James Serra's Blog

  21. Pingback: Parallel Data Warehouse « NGBI

  22. Pingback: Madison ermittelt weiter… « NGBI

  23. Pingback: Real-time query access with PDW | James Serra's Blog

  24. Pingback: Real-time query access with PDW - SQL Server - SQL Server - Toad World

  25. Pingback: What is HDInsight? - SQL Server - SQL Server - Toad World

  26. Pingback: Parallel Data Warehouse (PDW) benefits made simple - SQL Server - SQL Server - Toad World

  27. Pingback: Parallel Data Warehouse (PDW) benefits made simple | James Serra's Blog

  28. Pingback: Parallel Data Warehouse (PDW) AU1 released | James Serra's Blog

  29. Pingback: Parallel Data Warehouse (PDW) AU1 released - SQL Server - SQL Server - Toad World

  30. Pingback: Parallel Data Warehouse (SQL Servers Big Brother) Presentation | uxcps information management blog

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>