What is the Microsoft Analytics Platform System (APS)?

Analytics Platform System (APS) is simply a renaming of the Parallel Data Warehouse (PDW).  It is not really a new product, but rather a name change prompted by a new feature in Appliance Update 1 (AU1) of PDW: the ability to have an HDInsight region (a Hadoop cluster) inside the appliance.

So APS combines SQL Server and Hadoop into a single offering that Microsoft is touting as providing “big data in a box.”

Think of APS as the “evolution” of Microsoft’s current SQL Server Parallel Data Warehouse product.  Using PolyBase, it can now query data with SQL across the traditional data warehouse plus data stored in a Hadoop region, whether inside the appliance or in a separate Hadoop cluster.

More info:

The data platform for a new era


Parallel Data Warehouse (PDW) AU1 released

Microsoft’s PDW has seen a big boost in visibility and sales over the past year, due in part to frequent upgrades to the hardware and software.  About every six months there is an appliance update, which is sort of like a service pack, except that in addition to new features it usually also includes hardware upgrades (since the PDW is a hardware and software appliance).  Just released is Appliance Update 1 (AU1) for version 2 of PDW.  Details below, including improvements with HDInsight, Hadoop, and PolyBase:

First on-premises HDInsight region

  • Enables customers to load, query and analyze structured and unstructured data within a single appliance.
  • T-SQL compatibility over Hadoop via PolyBase.
  • Windows failover clustering and full hardware redundancy.
  • Unified management and monitoring experience across HDInsight and PDW.

PolyBase Enhancements

  • Better PolyBase performance
    • Compute push-down enables a 2x improvement.
    • Ability to create statistics over external tables via FULLSCAN or sampling (see the sketch below).
  • PolyBase T-SQL semantics – close alignment with T-SQL semantics for data types and conversions.
  • RCFile format support.
  • Compressed and uncompressed Hadoop data support.
  • Better manageability and user experience with new catalog views, DMVs, EXPLAIN, and SSDT support.
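
For example, the new statistics support uses the familiar CREATE STATISTICS syntax.  A minimal T-SQL sketch, assuming an existing external table dbo.WebLogs with columns UserId and Url (hypothetical names):

  -- Full-scan statistics over an external (Hadoop) table; all names are hypothetical
  CREATE STATISTICS stat_WebLogs_UserId ON dbo.WebLogs (UserId) WITH FULLSCAN;

  -- Or sampled statistics, trading accuracy for collection speed
  CREATE STATISTICS stat_WebLogs_Url ON dbo.WebLogs (Url) WITH SAMPLE 20 PERCENT;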

Hybrid Cloud Support

  • Import and export to Azure Blob storage – WASB/ASV.
  • Seamless querying across WASB and the HDInsight region (example below).
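
As an illustration, once Blob storage data is exposed as an external table it can be queried with plain T-SQL.  A hedged sketch using PolyBase-style syntax (the WASB URI, table, and columns are made up, and the exact AU1 syntax may differ):

  -- External table over delimited data in Azure Blob storage (WASB); all names are hypothetical
  CREATE EXTERNAL TABLE dbo.ClickStream (
      EventDate DATE,
      Url       VARCHAR(500),
      UserId    INT
  )
  WITH (
      LOCATION = 'wasbs://logs@mystorageaccount.blob.core.windows.net/clickstream/',
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
  );

  -- Query it like any other table
  SELECT TOP 10 Url, COUNT(*) AS Hits
  FROM dbo.ClickStream
  GROUP BY Url
  ORDER BY Hits DESC;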

PDW Performance Improvements

  • Parallel data load enables 2-8x better load performance.
  • Native type conversion for the loader improves load time of a single load by 75%, with multiple concurrent loads achieving 2x improvements, validated up to 9.4TB/hr.

Integrated Authentication (PDW)

  • Enables single sign-on for simplified user management and easier authentication.
  • Support for Kerberos and NTLM.
  • Windows authentication supported through SqlClient, ODBC, and OLE DB (see the connection string sketch below).
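
In practice this means a client connection string can use Windows authentication instead of a SQL login.  A minimal sketch, where the appliance address and port are placeholders:

  Server=10.10.10.10,17001;Database=MyDW;Integrated Security=SSPI;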

Transparent Data Encryption (PDW)

  • Protects data at rest by encrypting data pages on disk, transaction logs, and database backups (see the T-SQL sketch below).
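
On SMP SQL Server the TDE sequence looks like the following, and PDW exposes a similar T-SQL surface.  This is a sketch only; PDW-specific steps such as appliance key management may differ, and all names and passwords are placeholders:

  -- Create a master key and certificate in master, then encrypt the user database
  USE master;
  CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';
  CREATE CERTIFICATE TdeCert WITH SUBJECT = 'TDE certificate';

  USE MyDW;
  CREATE DATABASE ENCRYPTION KEY
      WITH ALGORITHM = AES_256
      ENCRYPTION BY SERVER CERTIFICATE TdeCert;

  ALTER DATABASE MyDW SET ENCRYPTION ON;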

Add Capacity (PDW)

  • Enables customers to expand to any supported topology.
  • All operations except data redistribution are online operations, reducing downtime.

Add Region (HDInsight)

  • Enables customers to add an HDInsight region to an existing PDW appliance.

Better upgrade experience

  • Upgrades can now be run remotely.
  • Reduces time to upgrade appliances by simplifying and minimizing manual steps.

PASS Summit 2012 session videos online

All 201 PASS Summit 2012 session recordings are now available for free to all PASS members!  To view them, log on to your myPASS account and watch the sessions from myRecordings.  While there, check out mine: Building an Effective Data Warehouse Architecture


Book: Reporting with Microsoft SQL Server 2012

I am happy to say I have published my first book!  It is called Reporting with Microsoft SQL Server 2012 (order).  Many thanks go to my co-author and friend Bill Anton.  Below is a brief overview, or check out the listing on the Packt Publishing site for a sample chapter and the table of contents.


Reporting with Microsoft SQL Server 2012 covers all the features of SSRS and Power View and provides step-by-step lab activities to help you develop reports very quickly.

Starting with the difference between standard and self-service reporting, this book covers the main features and functionality offered in SQL Server Reporting Services 2012, including a breakdown of the report components, the development experience, extensibility, and security.  You will also learn to set up and use Power View within Excel and SharePoint and connect to a tabular model as well as a multidimensional model.  The book provides real-life reporting scenarios that clarify when standard reporting, and therefore SSRS, is the best choice, and when self-service reporting, and therefore Power View, is the best choice.

This book will have you creating reports in SSRS and Power View in no time!


SQL Server 2014 released April 1st!

SQL Server 2014 has been released to manufacturing (RTM) and will be generally available April 1.  Here is the official announcement.  Here are my blog posts about some of the new features.

More info:

SQL Server 2014 RTM Announced for April 1 Release

SQL Server 2014 releases April 1


Parallel Data Warehouse (PDW) benefits made simple

I have heard people say that the Parallel Data Warehouse (PDW) is Microsoft’s best kept secret.  So let me give a 10,000-foot overview of what PDW is and its benefits.  Keep in mind that PDW is meant for data warehouses, not OLTP systems.

As opposed to a symmetric multiprocessor (SMP) system, which is one server where each CPU shares the same memory and disk, PDW is a massively parallel processing (MPP) solution, in which data is distributed among many independent servers running in parallel in a shared-nothing architecture, where each server operates self-sufficiently and controls its own memory and disk.  A query sent by a user is, behind the scenes, sent to each server, executed there, and the results are combined and returned to the user.

With most data warehouses on SMP systems, the bottleneck is disk I/O, not the CPU.  With PDW, the appliance is optimized so the CPUs are fed data from the disks as fast as they can accept it, in large part thanks to Direct Attached Storage (DAS), which is much faster for data warehouse workloads (see Performance Tuning SQL Server: Hardware Architectures Throughput Analysis).  While SANs can be great for OLTP applications, they are less optimal for data warehouses; in addition, they are costly and their performance is hard to predict.  A CPU core can consume data at roughly 250MB/sec, a rate that a typical SAN cannot sustain for every core at once.

That is a quick explanation of just one benefit of a PDW.  For more details on this benefit, read What MPP means to SQL Server Parallel Data Warehouse.  Here is a list of the many other benefits provided by a PDW over an SMP solution:

  • Query performance: Expect a 10x-100x increase, which matters because users now expect fast queries.  You can also expect a reasonably linear increase when adding more servers to your PDW.  PDW is not just an appliance for “big data”; it can be very useful for small sets of data that need performance.
  • Data loading performance: 10-40x faster due to parallel loading of data.  Data loading speed is 250GB/hr per compute node (a half rack of 4 compute nodes gives 1TB/hour, with minimal query performance impact).
  • Scalability (data growth): Start with only a quarter-rack (2 compute servers, 32 cores, 15TB of uncompressed capacity) and grow as needed, up to 7 racks (56 compute servers, 896 cores, 1.2PB of uncompressed capacity).  Using a conservative 5:1 compression ratio, data capacity ranges from 75TB to 6PB.  And there is no “forklifting” when you upgrade (backing up and restoring from the old server to the new server); instead, you add the new servers and the data is automatically redistributed.
  • Built-in high availability and failover: One fault-tolerant cluster across the whole appliance, a virtualized architecture with no dependency on SAN technologies, automatic VM migration on host failure, and fully redundant appliance components (disks, networking, etc.).
  • PolyBase: Combine relational with non-relational (Hadoop) data using SQL.  PolyBase hides all the complexity of using Hadoop, so most business users do not need to know anything about it.  See PolyBase explained for more details.  PolyBase can also push portions of the query processing down to the Hadoop cluster, and it moves data faster between the Hadoop and SQL worlds because of parallel data transfers.
  • Integration with cloud-born data (Windows Azure HDInsight, Windows Azure blob storage).  See What is HDInsight? for more info.
  • HDInsight integration into the PDW rack
  • Improved concurrency because of how quickly queries execute
  • Mixed workload support (i.e. no performance issues with queries when a data load is happening)
  • Less DBA maintenance: You don’t need to create indexes beyond a clustered columnstore index (see the table sketch after this list), you don’t need to archive/delete data to save space, management is simple (monitor hardware and software from System Center), and you avoid much of the routine monitoring/maintenance an SMP system requires (blocking, logs, query hints, wait states, I/O tuning, query optimization/tuning, index reorgs/rebuilds, managing filegroups, shrinking/expanding databases, managing physical servers, patching servers).  DBAs can spend more of their time as architects and not babysitters
  • Limited training needed: If you are already a Microsoft shop, using a PDW is not much different from using a SMP solution
  • Use familiar BI tools: If you are already a Microsoft shop, all your familiar tools (e.g. SSRS, PowerPivot, Excel, Power View) work fine against a PDW.  The only thing you do differently is enter the IP address and port number of the PDW in the connection string, so you will not have to rewrite and re-implement the many SSRS reports you have created over the years.  Plus, you can expand your report filters (e.g. increase the number of years) because performance is no longer a problem.
  • Improved data compression: 3x-15x more than an SMP system, with 5x being a conservative number, thanks to unique compression enabled by data distribution across compute nodes.
  • Ease of deployment (appliance vs. build-your-own): You can deploy in hours, not weeks, thanks to PDW being a turnkey solution complete with hardware and software, pre-tested and tuned for your data warehouse.
  • Data warehouse consolidation: With all the disk space and performance you get with a PDW, you can make it a true enterprise data warehouse by bringing all your sources, data marts, and other data warehouses into one place, including data sources you could not integrate before.  A true “single version of the truth”.
  • Easy support model: With a PDW you get an integrated support plan with a single Microsoft contact.  Whether the problem is hardware or software, you just call Microsoft, and they will work with the vendor if it is a hardware issue.
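
To make the distribution and clustered columnstore points concrete, typical PDW table DDL looks like the following.  A sketch with hypothetical table and column names:

  -- Fact table hash-distributed across the compute nodes, stored as clustered columnstore
  CREATE TABLE dbo.FactSales (
      DateKey     INT   NOT NULL,
      ProductKey  INT   NOT NULL,
      StoreKey    INT   NOT NULL,
      SalesAmount MONEY NOT NULL
  )
  WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX);

  -- Small dimension table replicated to every compute node so joins can run locally
  CREATE TABLE dbo.DimProduct (
      ProductKey  INT          NOT NULL,
      ProductName VARCHAR(100) NOT NULL
  )
  WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX);

Hash-distributing large fact tables and replicating small dimensions is the usual starting point, because it minimizes data movement between nodes at query time.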

If you answer “Yes” to a few of the questions below, a PDW may be right for you:

  • Is your data volume growth becoming unmanageable using currently implemented DW technologies? (>20-30% annually)
  • Is there a specific big data business need (e.g. social media analysis, fraud detection) in a high-priority industry (retail, financial services, public sector)?
  • Is your DW or storage spend consuming a disproportionate and increasing amount of your IT budget?
  • Do your business users need to find, combine, and refine structured and unstructured data? Internal and external sources?
  • In the near future, do you expect to need both on-premises and cloud-based BI capabilities?
  • Do you have a need to capture and analyze streaming data?  At what scale and velocity?
  • Do you currently collect, store, and analyze (or plan to) multiple forms of semi-structured and unstructured data (XML, JSON, CSV, etc.)?
  • Are you struggling to serve your business users’ analytics provisioning and data requests in a timely manner?
  • Are you experiencing data management issues such as security or compliance due to business owners (“shadow” IT) creating their own unmanaged data stores?
  • Are you trying to build, grow, and manage your next-generation DW without adding new headcount or talent (data scientists, external consultants, etc.)?

There are three vendors that sell PDW appliances: HP, Dell, and Quanta.

Interested in finding out more about PDW, maybe a demo?  If so, shoot me an email!

More info:

Parallel Data Warehouse (PDW) Version 2

Microsoft SQL Server Parallel Data Warehouse (PDW) Explained

Appliance: Parallel Data Warehouse (PDW)

Parallel Data Warehouse Solution Brief

Introduction to PDW (Parallel Data Warehouse)

Introduction to SQL Server 2012 Parallel Data Warehouse


Real-time query access with PDW

The Parallel Data Warehouse (PDW) officially supports Analysis Services as a data source, both the Multidimensional model (ROLAP and MOLAP modes) and the Tabular model (In-Memory and DirectQuery modes).  The big benefit of using ROLAP or DirectQuery is that you get real-time query access to the relational data source in PDW (as opposed to data only as of the last time the cube was processed) and you don’t have to process the cube (just make sure to use clustered columnstore indexes on the PDW tables to improve performance).  You create MDX queries when using ROLAP and DAX queries when using DirectQuery; both get translated to SQL when they hit PDW.

Keep in mind that the PDW is so fast when using clustered columnstore indexes that, if you have a properly defined star schema, you might not even need a cube, because results will be returned to the user quickly.  But there are other reasons besides performance why you might still want to use a cube (see Why use a SSAS cube?).

An SSAS cube that uses PDW as a data source behaves just like a cube over any other data source.  Performance is usually fast because of the clustered columnstore indexes, with the only caveat being that the SQL generated by DirectQuery to pull data from PDW is sometimes not that great (the SQL generated by ROLAP is usually pretty good).

The other thing to note about DirectQuery, which applies to any data source, is that you can’t use PerformancePoint or Excel PivotTables with DirectQuery.  This is because MDX queries are not supported for a tabular model in DirectQuery mode, only DAX, so you need a DAX client like Power View (PerformancePoint and Excel PivotTables generate MDX queries behind the scenes).  DirectQuery also does not cache results like ROLAP does, and there are some unsupported data types (geometry, xml, and nvarchar(max)).  Finally, some DAX functions are not supported in DirectQuery mode or might return different results (see Formula Compatibility in DirectQuery Mode), and two, EXACT and REPLACE, are not supported at all.  So ROLAP seems the better choice over DirectQuery for many situations.  But if you do go with a tabular model, you may want to look into using a hybrid mode (see Tabular query modes: DirectQuery vs In-Memory and Partitions and DirectQuery Mode).  Definitely go with a DirectQuery tabular model over an in-memory model if your database is 1TB or more.

One ROLAP limitation to note is that it does not support parent-child hierarchies.  On the plus side, DISTINCT COUNT performance for ROLAP queries is faster if you enable an optimization.  Some other ROLAP limitations against PDW:

  • Auto-cube refresh is not supported
  • Materialized views, also called Indexed views, are not supported
  • Proactive caching is supported only if you use the polling mechanisms provided by Analysis Services
  • Writeback is not supported

Some things I have learned when using ROLAP against a PDW:

  • Sometimes it is better to have your fact tables as ROLAP, but keep the dimensions as MOLAP
  • Think about using MOLAP for your historical partitions and ROLAP for just your current partition
  • Make sure the measures are BIGINT in the fact tables or MDX aggregates might not work; MDX aggregates use INT by default unless the source is BIGINT (see the sketch after this list)
  • PDW supports a maximum of 32 concurrent queries, so if you have a lot of users hitting the cube it may be better to move the data from the PDW to an SMP data mart and build the cube there
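
For the BIGINT point above, a view over the fact table can force the conversion before the cube ever sees the data.  A minimal sketch with hypothetical names:

  -- Cast measures to BIGINT so MDX aggregates do not overflow INT
  CREATE VIEW dbo.vFactSales AS
  SELECT DateKey,
         ProductKey,
         CAST(Quantity AS BIGINT) AS Quantity,
         SalesAmount
  FROM dbo.FactSales;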

More info:

Comparing DirectQuery and ROLAP for real-time access

Tabular model: Not ready for prime time?

PDW and SSAS

Parallel Data Warehouse (PDW) and ROLAP

Analysis Services ROLAP for SQL Server Data Warehouses

Columnstore vs. SSAS


Book: Expert Cube Development with SSAS Multidimensional Models

Those of you who use SQL Server Analysis Services (SSAS) are likely familiar with the “bible” for designing cubes: Expert Cube Development with Microsoft SQL Server 2008 Analysis Services by Marco Russo, Alberto Ferrari, and Chris Webb.  Well, a new edition has just been published, under a name changed to reflect that the subject matter is strictly multidimensional models: Expert Cube Development with SSAS Multidimensional Models.  No worries if you are looking for help with the tabular model, as the same authors have a book for that: Microsoft SQL Server 2012 Analysis Services: The BISM Tabular Model.

I was fortunate to be the technical reviewer for this new book, and I hope you find it as interesting and helpful as I did while reviewing it!  Click here to order.


Introduction to Hadoop

Hadoop is an open-source software framework from the Apache Software Foundation, capable of processing large amounts of heterogeneous data sets in a distributed fashion (via MapReduce) across clusters of commodity hardware, on top of a storage framework (HDFS).  Hadoop uses a simplified programming model, and the result is a reliable, shared storage and analysis system.

MapReduce is a software framework that allows developers to write programs that perform complex computations on massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.  MapReduce libraries have been written in many programming languages (most commonly Java), with different levels of optimization.  It works by breaking down a large, complex computation into multiple tasks, assigning those tasks to individual worker/slave nodes, and taking care of coordinating and consolidating the results.  A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).

Hadoop Distributed File System (HDFS) is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.  When data is pushed to HDFS, it is automatically split into multiple blocks that are stored and replicated across various DataNodes, ensuring high availability and fault tolerance.

The NameNode holds information about all the other nodes in the Hadoop cluster, the files present in the cluster, the constituent blocks of those files and their locations, and other information useful for the operation of the Hadoop cluster.  Each DataNode is responsible for holding the data.  The JobTracker keeps track of the individual tasks/jobs assigned to each of the nodes and coordinates the exchange of information and results.  The TaskTracker is responsible for running the task/computation assigned to it.


A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode.  A slave or worker node acts as both a DataNode and TaskTracker.


There are many other tools that work with Hadoop:

Hive is part of the Hadoop ecosystem and provides a SQL-like interface to Hadoop.  It is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files, using MapReduce code to extract data from the cluster.  Hive supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop, so it appeals to data analysts familiar with SQL.
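
For example, the student-counting MapReduce job described earlier collapses to a single HiveQL statement, which Hive compiles into MapReduce behind the scenes (the students table is hypothetical):

  -- Equivalent of the Map (group by first name) and Reduce (count per name) steps
  SELECT first_name, COUNT(*) AS name_count
  FROM students
  GROUP BY first_name;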

Pig enables you to write programs in a procedural language called Pig Latin that are compiled to MapReduce programs on the cluster.  It also provides fluent controls to manage data flow.  Pig is more of a scripting language, while Hive is more SQL-like.  With Pig you can write complex data transformations on a data set, such as aggregate, join, and sort, and it can be extended with user-defined functions written in Java that Pig calls directly.

While Hadoop is a natural choice for processing unstructured and semi-structured data like logs and files, there may be a need to process structured data stored in relational databases as well.  Sqoop (SQL-to-Hadoop) is a tool that allows you to import structured data from SQL Server and SQL Azure to HDFS and then use it in MapReduce and Hive jobs.  You can also use Sqoop to move data from HDFS to SQL Server.
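
A typical Sqoop import from SQL Server into HDFS looks something like the following sketch, where the server, database, credentials, and paths are all placeholders:

  sqoop import \
    --connect "jdbc:sqlserver://dbserver:1433;databaseName=Sales" \
    --username etl_user --password-file /user/etl/.sqlpw \
    --table Customers \
    --target-dir /data/sales/customers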

Finally, here is how Hadoop fits in with the Parallel Data Warehouse (PDW) and PolyBase.

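For example, PolyBase lets a single T-SQL statement join a relational PDW table to Hadoop data exposed as an external table.  A sketch with hypothetical names:

  -- DimCustomer lives in PDW; ClickStream is an external table backed by HDFS
  SELECT c.Region, COUNT(*) AS WebVisits
  FROM dbo.DimCustomer AS c
  JOIN dbo.ClickStream AS w
      ON c.CustomerKey = w.UserId
  GROUP BY c.Region;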

More info:

Big Data Basics – Part 3 – Overview of Hadoop

Video Hadoop Tutorial: Core Apache Hadoop

Hadoop Tutorial: Intro to HDFS

Hadoop: What it is, how it works, and what it can do

Hive Data Warehouse: Lessons Learned


What is HDInsight?

There are two flavors of HDInsight: Windows Azure HDInsight Service and Microsoft HDInsight Server for Windows (recently and quietly killed, but it lives on in a different form).  Both were developed in partnership with Hadoop software developer and distributor Hortonworks and were made generally available in October 2013.

Windows Azure HDInsight Service (try) is a service that deploys and provisions Apache Hadoop clusters in the Azure cloud, providing a software framework designed to manage, analyze, and report on big data.  It makes the HDFS/MapReduce software framework and related projects, such as Pig, Sqoop, and Hive, available in a simpler, more scalable, and cost-efficient environment.  It uses Windows Azure Blob storage as the default file system (or you can store data in the native Hadoop Distributed File System (HDFS) that is local to the compute nodes).


Microsoft HDInsight Server for Windows was killed shortly after it was released but lives on in two flavors: the Hortonworks Data Platform (HDP) (try) and Microsoft’s Parallel Data Warehouse (PDW).  Both are on-premises solutions.  HDP includes core Hadoop (meaning HDFS and MapReduce), plus Pig for MapReduce programming, the Hive data query infrastructure, Hortonworks’ recently introduced HCatalog table management service for access to Hadoop data, Sqoop for data movement, and the Ambari monitoring and management console.  All of the above have been reengineered to run on Windows, and all are open-source components that are compatible with Apache Hadoop and are being contributed back to the community.  With PDW, you can add an HDInsight region into the appliance; this region includes HDP and can be accessed via PolyBase.

The HDInsight Server is designed to work with (but does not include) Windows Server and Microsoft SQL Server.  On Windows Server, HDInsight integrates with Microsoft System Center for administrative control and with Active Directory for access control and security.


More info:

Microsoft Releases Hadoop On Windows

Hadoop and HDInsight: Big Data in Windows Azure

Video HDInsight: Introduction to Hadoop on Windows

Video Introduction To Windows Azure HDInsight Service

Let there be Windows Azure HDInsight

Working With Data in Windows Azure HDInsight Service

HDInsight patterns & practices Windows Azure Guidance

Hortonworks Makes HDP 2.0 for Windows Server Generally Available

Windows Azure HDInsight supports preview clusters of Hadoop 2.2

Windows Azure HDInsight Supporting Hadoop 2.2 in Public Preview
