What is Azure Data Factory?

The Azure Data Factory is a service designed to allow developers to integrate disparate data sources.  It is a platform somewhat like SSIS in the cloud to manage the data you have both on-prem and in the cloud.

It provides access to on-premises data in SQL Server and cloud data in Azure Storage (Blob and Tables) and Azure SQL Database.  Access to on-premises data is provided through a data management gateway that connects to on-premises SQL Server databases.

It is not a drag-and-drop interface like SSIS.  Instead, data processing is enabled initially through Hive, Pig and custom C# activities.  Such activities can be used to clean data, mask data fields, and transform data in a wide variety of complex ways.

You will author your activities, combine them into a pipeline, set an execution schedule and you’re done.  Data Factory also provides an up-to-the moment monitoring dashboard, which means you can deploy your data pipelines and immediately begin to view them as part of your monitoring dashboard.

Within the Azure Preview Portal, you get a visual layout of all of your pipelines and data inputs and outputs.  You can see all the relationships and dependencies of your data pipelines across all of your sources so you always know where data is coming from and where it is going.  You get a historical accounting of job execution, data production status, and system health in a single monitoring dashboard.

Data Factory provides customers with a central place to manage their processing of web log analytics, click stream analysis, social sentiment, sensor data analysis, geo-location analysis, etc.

Microsoft views Data Factory as a key tool for customers who are looking to have a hybrid story with SQL Server or who currently use Azure HDInsight, Azure SQL Database, Azure Blobs, and Power BI for Office 365.

In short, developers can use Data Factory to transform semi-structured, unstructured and structured data from on-premises and cloud sources into trusted information.

endtoendworkflow

More info:

Microsoft takes wraps off preview of its Azure Data Factory service

Data Factory Public Preview – build and manage information production pipelines

Video Azure Data Factory Overview

Introduction to Azure Data Factory Service

The Ins and Outs of Azure Data Factory – Orchestration and Management of Diverse Data

Posted in Cloud, SQLServerPedia Syndication | Leave a comment

Fast Track Data Warehouse Reference Architecture for SQL Server 2014

I have previously blogged about the Fast Track Data Warehouse, a reference configuration optimized for data warehousing (see Microsoft SQL Server Reference Architecture and Appliances).  That was version 3.0 of the reference configuration that was for SQL Server 2008 R2.  Released a couple of years ago was version 4.0 for SQL Server 2012 (download).  Now appearing is Fast Track Data Warehouse solutions for SQL Server 2014.  While a new reference architecture whitepaper is not available yet, these solutions can be ordered now. These are solutions validated by Microsoft for deploying SMP data warehouse on SQL Server 2014 EE:

3 solutions from HP (in collaboration with SanDisk):

HP DL380 Gen8 and HP LE PCIe Workload Accelerator 28TB/45 TB Data Warehouse Fast Track Reference Architecture

HP DL580 Gen 8 and HP LE PCIe Workload Accelerator 90TB Microsoft SQL Server Data Warehouse Fast Track Reference Architecture

One from Lenovo:

Lenovo

One from EMC:

EMC Solutions

More partners are expected soon.

Posted in Data warehouse, Fast Track, SQLServerPedia Syndication | 1 Comment

Should you move your data to the cloud?

Should you move your data to the cloud?  That is the question.  The answer is not simple.  While moving data to the cloud is all the rage, the fact is a large majority of the Fortune 500 companies are not keeping any data in the cloud at all.  At least not yet.  Why is that?  Well, some of those reasons include:

  • Security concerns (potential for compromised information, issues of privacy when data is stored on a public facility, might be more prone to outside security threats because its high-profile, some providers might not implement the same layers of protection you can achieve in-house)
  • Lack of operational control: Lack of access to servers (i.e. say you are hacked and want to get to security and system log files; if something goes wrong you have no way of controlling how and when a response is carried out; the provider can update software, change configuration settings, and allocate resources without your input or your blessing; you must conform to the environment and standards implemented by the provider)
  • Lack of ownership (an outside agency can get to data easier in the cloud data center that you don’t own vs getting to data in your onsite location that you own)
  • Compliance restrictions
  • Regulations (health, financial)
  • Legal restrictions (i.e. data can’t leave your country)
  • Company policies
  • You may be sharing resources on your server, as well as competing for system and network resources

If you can get past most or all of those reasons, the cloud offers many benefits:

  • Fire up a server quickly (abbreviated infrastructure implementation build-out times)
  • Grow as demand is needed (unlimited elastic scale).  Change hardware instantly
  • Reduce as demand lessons (pay for what you need)
  • Don’t need co-location space, so cost savings (space, power, etc)
  • No hardware costs
  • No commitment or long-term vendor lock
  • Allows companies to benefit from changes in the technology impacting the latest storage solutions
  • High availability and disaster recovery done for you
  • More frequent updates to OS, sql server, etc
  • Automatic updates
  • Automatic geography redundancy
  • Really helpful for proof-of-concept (POC) or development projects with a known lifespan

Also, there are some constraints of on-premise data:

  • Scale constrained to on-premise procurement
  • Capex up-front costs, although some companies may prefer this over a yearly operating expense (OpEx)
  • A staff of employees or consultants must be retained to administer and support the hardware and software in place
  • Expertise needed for tuning and deployment

These are just some quick bullet points to get you thinking about the pros and cons of moving to the cloud.  And the answer can be further muddied by adding into the discussion public cloud (Azure) versus private cloud (Cloud Platform System) – a discussion for later!

More info:

The Hybrid Cloud: Having your Cake

To Cloud or Not to Cloud, That is the Question

Posted in Cloud, SQLServerPedia Syndication | 1 Comment

Book: Microsoft SQL Server 2014 Business Intelligence Development Beginner’s Guide

I was fortunate to be a technical reviewer for a new book, SQL Server 2014 Business Intelligence Development by Reza Rad.  Below is the abstract.  It’s a great book for those of you trying to learn about all the various tools in the Microsoft BI stack.  Click here to check it out!

8888EN_mockupcover.jpg

Take advantage of the real power behind the BI components of SQL Server 2014, Excel 2013, and SharePoint 2013 with this hands-on book.  This book will give you a competitive advantage by helping you to quickly learn how to design and build BI system with Microsoft BI tools.  This book starts with designing a data warehouse with dimensional modeling, and then looks at creating data models based on SSAS multidimensional and Tabular technologies.  It will illustrate how to use SSIS for ETL, and MDS and DQS for data governance and data quality.  The book is packed with real-world examples that will give you a good understanding of the BI and DW components of SQL Server 2014, Excel 2013, and SharePoint 2013.

Posted in Business Intelligence, SQLServerPedia Syndication | Leave a comment

Analytics Platform System (APS) AU2 released

The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), has recently released an appliance update (AU2), which is sort of like a service pack, except that it includes many new features.  These appliance updates are made available frequently, about every 3-4 months.  Below is what is new in this release:

TSQL Compatibility improvements to reduce migration friction from SQL SMP

  • Adds full support for user defined schemas. This is one of the most requested features from customers migrating from SQL Server (CREATE SCHEMA, ALTER USER, GRANT/DENY, etc.)
  • Improves statistics collection capability and supportability by introducing filtered statistics, ability to control the sample size for sampled statistics, and support for DBCC SHOW_STATISTICS and STATS_DATE intrinsics
  • Adds support for a set of 51 mathematical, string ODBC, date/time, data and type analytic built in functions:
    ACOS, ATN2, COT, DEGREES, PI, RADIANS, ASCII, CHAR, CONCAT, DIFFERENCE, NCHAR, REVERSE, SPACE, STUFF, UNICODE, TERTIARY_WEIGHTS, PARSENAME, BIT_LENGTH, CURRENT_DATE, CURRENT_TIME, DAYNAME, DAYOFMONTH, DAYOFWEEK, HOUR, MINUTE, SECOND, MONTHNAME, QUARTER, WEEK, DATEFROMPARTS, DATETIME2FROMPARTS, DATETIMEFROMPARTS, DATETIMEOFFSETFROMPARTS, SMALLDATETIMEFROMPARTS, TIMEFROMPARTS, SYSDATETIME, SYSDATETIMEOFFSET, SYSUTCDATETIME, ISDATE, DATENAME, GETUTCDATE, EOMONTH, SWITCHOFFSET, TODATETIMEOFFSET, HASHBYTES, DATALENGTH, TYPE_ID, TYPE_NAME, TYPEPROPERTY, SQL_VARIANT_PROPERTY
  • Enables broader set of 3rd party ISV tools such as SAS and Tableau, as well as unblocking scenarios around managing data and analytical functions within the appliance

HDI Region Support for HDP 2.0 and Yarn

  • Enables users to write custom apps based on new data processing models (beyond MapReduce), enabling new scenarios such as interactive/ad hoc querying or search
  • Improves processing on the same hardware with predictable performance and better quality of service
  • On boards SCOM management pack for HDI, enabling System Center management of PDW+HDI
  • Updates the Microsoft .NET SDK for Hadoop to add value for both APS and Azure HDInsight clusters

Polybase HDP 2.0 and Query Hint Support

  • Enables all Polybase scenarios to work against both HDP 2.0 on Windows and Linux, pushing computation through the Yarn framework
  • Enables query hint support for Polybase push-down, enabling ability to force or disable push-down on a per-query basis

Install, Upgrade, and Servicing Improvements

  • Reduces end-to-end install time from 12 to under 9 hours (single scale unit)
  • Reduces end-to-end install time for virtual appliances from 7.5 to 4.5 hours
  • Reduces end-to-end servicing and patching time from 2 to 1 hour
  • Improves supportability with fast-fail and simplified troubleshooting steps
  • Improves performance by performing parallel and group patching.

Appliance Heartbeat Monitoring

  • Collects critical alerts and appliance heartbeats, storing appliance health data in Azure

Self-host Appliance and Customer Workload

  • On boards key first party workloads by creating APS-based internal data warehouse

Engineering Efficiency Improvements

  • Introduces train model, reducing release cycle to 120 days end-to-end
  • Enables more frequent test passes to reduce bug escape
  • Completely automates virtual appliance allocation

TFS and MSBuild Migration

  • Migrates codebase to Team Foundation Server (TFS) under PDW_Main
  • Converts codebase to MSBuild
Posted in PDW/APS, SQLServerPedia Syndication | 1 Comment

Non-obvious APS/PDW benefits

The Analytics Platform System (APS), which is a renaming of the Parallel Data Warehouse (PDW), has a lot of obvious benefits, which I discuss here.  For those of you who find your database is getting too big, or becoming too slow, or you need to integrate non-relational data, check out APS, Microsoft’s MPP solution.

But there are a lot of non-obvious benefits to using APS that I have listed below:

  • APS is a hardware platform that removes roadblocks for future needs. Be proactive instead of reactive!
  • Numerous capabilities (i.e. PolyBase, mixed workload support, scalability, HDInsight) to get you thinking about better ways to use your data
  • Much quicker development time due to speed of execution
  • Allows removal of ETL complexity (temp tables, aggregation tables, data marts, other band aids and workarounds)
  • Permits elimination of SSAS cubes or switch to ROLAP mode (so real-time data and no cube processing time)
  • Reduction or elimination of nightly maintenance windows. Instead use intra-day batch cycles
  • Tuning, redesigning ETL, upgrading hardware (more memory, Fusion IO cards), etc., for SMP to get 20-50% improvement versus 20-50x improvement with MPP
  • SMP is optimized for OLTP while APS is optimized for data warehouses
  • Allows a clear path to do predictive analytics via tools like Azure ML by having the disk space and processing power
  • Don’t think of it as just a solution for a large data warehouse but rather for any size warehouse that needs faster query performance
  • Faster query performance allows for adding more parameters to reports that also can be run real-time (instead of at night with fixed parameters) so business users can ask more sophisticated questions and execute ad-hoc queries.  This will result in a lot more self-service reporting
  • Have the space and performance to consolidate all your various data warehouses and data marts to one place
  • Decrease infrastructure costs through consolidation: reduced server count, reduction of duplicate data, decreasing licenses, viewer ETL pieces, less infrastructure management costs (hardware, SQL Server, OS)
Posted in PDW/APS, SQLServerPedia Syndication | 3 Comments

IaaS, PaaS, and SaaS explained

You might be reading a lot about Cloud computing and see three acronyms frequently: IaaS, PaaS, Saas.  Cloud providers offer their services according to these three fundamental models:

Infrastructure-as-a-service (IaaS)

This is the most basic model which is essentially your virtual machines in a cloud data center.  You set up, configure, and manage VMs that run in the data center infrastructure, and you put whatever you want on them.  A hypervisor such as Hyper-V runs the virtual machines as guests.  Pools of hypervisors installed at a data center can support large numbers of virtual machines and the ability to scale services up and down according to customers’ varying requirements.  Windows Azure, Hortonworks Data Platform, Amazon Elastic Compute Cloud (EC2)Rackspace, and Google Compute Engine are the most popular examples.

Traits of IaaS:

  • You Build/Upload Virtual Machines to a DC on the Internet – e.g. Windows Azure
  • You PAY for time/resources used and the software in your VM’s
  • Your virtual machines RUN on hardware shared with other organizations
  • You manage ALL aspects of the software stack inside your virtual machines
  • You perform OS updates and manage runtime and middleware
  • VM’s can be moved to/from the Cloud and your own data center
  • App development is unchanged

Platform-as-a-service (PaaS)

With PaaS, a provider delivers a computing platform, typically including operating system, programming language execution environment, database, and web server.  You don’t have to worry about OS updates or managing runtime and middleware.  The provider manages the hardware and software infrastructure and you just use the service.  It is usually a layer on top of IaaS.  Examples are Microsoft Azure SQL Database, HDInsight, AWS Elastic Beanstalk, Windows Azure BLOB Storage, and Google App Engine.

Using a Windows Azure BLOB Storage example:

  • You SUBSCRIBE to the service and create a unique name
  • You GIVE Blobs(Files) to the Storage Service – simple API or REST
  • The service provides resilience and scale, you don’t have to.
  • You ask for them back – you don’t care or know where they really are (which VM’s)
  • The service and the fabric controller make sure your data is stored so there is no single point of failure
  • You pay for the amount of storage you use – the service manages everything
  • The service can also geo-replicate, provide disaster recovery

Software-as-a-service (SaaS)

With SaaS, users are provided access to application software and databases. Cloud providers manage the infrastructure and platforms that run the applications.  SaaS is sometimes referred to as “on-demand software”.  Google Apps (which includes GMail), Salesforce, and Microsoft Office 365 are good examples.

Traits of SaaS:

  • Complete apps you use
  • Subscribe, on-board, normally pay for the # of users who use the app
  • No access to underlying platform
  • Software may support some customizations
  • Shared hardware, platform and finished software across multiple customers
  • A layer on top of PaaS

To summarize, when you have a data center on site, you manage everything.  When it’s infrastructure as a service, part of that stack is outsourced to a vendor.  With platform as a service, you’re responsible for the application and data – everything else is outsourced to the vendor.  With software as a service, you outsource everything.  Using a “pizza-as-a-service” analogy:

  • On Premise = you buy everything and make the pizza at home
  • IaaS = take and bake (pick up the pizza, you cook it at home)
  • PaaS = pizza delivered
  • SaaS = dining in the restaurant

More info:

Windows Azure – Write, Run or Use Software

But what can I *do* with Windows Azure? Create (Free) Websites and Applications

IaaS, PaaS and SaaS Terms Clearly Explained and Defined

Cloud Jargon Unwound: Distinguishing SaaS, IaaS and PaaS

What Is Cloud Computing?

Cloud Service Models (IaaS, SaaS, PaaS) + How Microsoft Office 365, Azure Fit In

Microsoft Azure for Enterprises

Cloud Models (IaaS, PaaS, SaaS) explained with examples

Pizza as a Service – On Prem, IaaS, PaaS and SaaS Explained through Pie (not Pi)

Posted in SQLServerPedia Syndication | Leave a comment

Presentation slides for Modern Data Warehousing

Thanks to everyone who attended my session “Modern Data Warehousing” at the Central New Jersey SQL User Group yesterday.  The abstract for my session is below.  I hope you enjoyed it!

Here is the PowerPoint presentation: Modern Data Warehousing

Modern Data Warehousing

The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data.  How can you prevent this from happening?  Enter the modern data warehouse, which is able to handle and excel with these new trends.  It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle “big data” and provide fast queries.  Is there one appliance that can support this modern data warehouse?  Yes!  It is the Analytics Platform System (APS) from Microsoft (formally called the Parallel Data Warehouse or PDW), which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1).  In this session I will dig into the details of the modern data warehouse and APS.  I will give an overview of the APS hardware and software architecture, identify what makes APS different, and demonstrate the increased performance.  In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.

Posted in Data warehouse, Presentation, Session, SQLServerPedia Syndication | 1 Comment

Modern Data Warehousing Presentation

I will be presenting the session “Modern Data Warehousing” tomorrow (Wednesday, August 13th) at the Central New Jersey SQL User Group at 6:30pm.  The abstract for my session is below.  I hope you can make it!

Modern Data Warehousing

The traditional data warehouse has served us well for many years, but new trends are causing it to break in four different ways: data growth, fast query expectations from users, non-relational/unstructured data, and cloud-born data.  How can you prevent this from happening?  Enter the modern data warehouse, which is able to handle and excel with these new trends.  It handles all types of data (Hadoop), provides a way to easily interface with all these types of data (PolyBase), and can handle “big data” and provide fast queries.  Is there one appliance that can support this modern data warehouse?  Yes!  It is the Analytics Platform System (APS) from Microsoft (formally called the Parallel Data Warehouse or PDW), which is a Massively Parallel Processing (MPP) appliance that has been recently updated (v2 AU1).  In this session I will dig into the details of the modern data warehouse and APS.  I will give an overview of the APS hardware and software architecture, identify what makes APS different, and demonstrate the increased performance.  In addition I will discuss how Hadoop, HDInsight, and PolyBase fit into this new modern data warehouse.

Posted in Data warehouse, Presentation, Session, SQLServerPedia Syndication | Leave a comment

24 Hours of PASS videos online

All the videos for the sessions at 24 Hours of PASS (Spring 2014) are available for free.  Check the session schedule for a complete list.  Also, many of the PASS virtual chapters have recordings of their meetings online.  Enjoy!

Posted in PASS, SQL Server, SQLServerPedia Syndication, Videos | Leave a comment