Types of NoSQL databases

A NoSQL database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.  NoSQL is often interpreted as "Not only SQL" to emphasize that these databases may also support SQL-like query languages.

NoSQL is simply the term used to describe a family of databases that are all non-relational.  While the technologies, data types, and use cases vary wildly among them, it is generally agreed that there are four types of NoSQL databases:

  • Key-value stores – These databases pair keys with values.  An analogy is a file system where the path acts as the key and the file contents act as the value.  There are usually no fields to update; instead, the entire value must be rewritten if changes are to be made.  This simplicity scales well, but it limits the complexity of queries and other advanced features.  Examples are: Dynamo, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike, OrientDB, MUMPS, HyperDex
  • Graph stores – These excel at dealing with interconnected data.  Graph databases consist of connections, or edges, between nodes.  Both nodes and their edges can store additional properties such as key-value pairs.  The strength of a graph database is in traversing the connections between the nodes.  But they generally require all data to fit on one machine, limiting their scalability.  Examples include: AllegroGraph, Neo4j, InfiniteGraph, OrientDB, Virtuoso, Stardog, Sesame
  • Column stores – Relational databases store all the data in a particular table’s rows together on-disk, making retrieval of a particular row fast.  Column-family databases generally serialize all the values of a particular column together on-disk, which makes retrieval of a large amount of a specific attribute fast.  This approach lends itself well to aggregate queries and analytics scenarios where you might run range queries over a specific field.  Examples include: Accumulo, Cassandra, Druid, HBase, Vertica
  • Document stores – These databases store records as "documents," where a document can generally be thought of as a grouping of key-value pairs (it has nothing to do with storing actual documents such as a Word document).  Keys are always strings, and values can be strings, numbers, Booleans, arrays, or other nested key-value pairs; values can be nested to arbitrary depths (see the sketch below).  In a document database, each document carries its own schema, unlike an RDBMS, in which every row in a given table must have the same columns.  Examples include: Lotus Notes, Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB, OrientDB, Qizx, Cloudant, DocumentDB
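To make the document model concrete, here is a minimal, illustrative Python sketch (the record and field names are invented for the example, not taken from any particular product) showing a single "document" as nested key-value pairs, alongside how the same record would sit in a key-value store, where the value is just one opaque blob behind one key:

```python
import json

# A "document": keys are strings, values can be strings, numbers,
# Booleans, arrays, or further nested key-value pairs.
order_document = {
    "order_id": "ORD-1001",
    "customer": {"name": "Jane Doe", "loyalty_member": True},
    "items": [
        {"sku": "A-100", "qty": 2, "price": 9.99},
        {"sku": "B-200", "qty": 1, "price": 24.50},
    ],
    "total": 44.48,
}

# In a key-value store, the same record is one opaque value behind one key;
# to change any field you rewrite the whole value.
key_value_store = {"order:ORD-1001": json.dumps(order_document)}

print(order_document["customer"]["name"])                        # field access in a document store
print(json.loads(key_value_store["order:ORD-1001"])["total"])    # whole-value round trip in a key-value store
```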

The CAP Theorem states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:

  • Consistency (all nodes see the same data at the same time)
  • Availability (a guarantee that every request receives a response about whether it succeeded or failed)
  • Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

[Figure: CAP theorem diagram showing the Consistency, Availability, and Partition tolerance trade-off]

Since you can only pick two guarantees, here is a list of NoSQL systems broken out by the two that they support:

[Figure: NoSQL systems grouped by which two CAP guarantees they support]

Here is a quick summary of the most popular NoSQL products by group:

  • Key-value stores
    • Riak – Offers high availability, fault tolerance, operational simplicity, and scalability.  Riak is one of the more sophisticated data stores.  It offers most of the features found in others, then adds more control over duplication.  Although the basic structure stores pairs of keys and values, the options for retrieving them and guaranteeing their consistency are quite rich.
    • Redis – Like CouchDB and MongoDB, Redis stores documents or rows made up of key-value pairs.  Unlike much of the NoSQL world, it stores more than just strings or numbers as values.  It can also hold sorted and unsorted sets of strings as the value linked to a key, a feature that lets it offer sophisticated set operations to the user.  There’s no need for the client to download data to compute an intersection when Redis can do it on the server (see the sketch after this list).  Redis is also known for keeping the data in memory and only writing out the list of changes every so often.  Some don’t even call it a database, preferring instead to focus on the positive by labeling it a powerful in-memory cache that also writes to disk.  Traditional databases are slower because they wait until the disk has the information before signaling that everything is OK.  Redis waits only until the data is in memory, which is obviously faster but potentially dangerous if the power fails at the wrong moment.
  • Document stores
    • MongoDB – Designed for scale, flexible data aggregation, and storing files of any size.  It has rich querying, high availability, and full indexing support, and is quickly being adopted by many businesses.  It uses GridFS instead of HDFS.  MongoDB is designed for OLTP workloads.  It can do complex queries, but it’s not necessarily the best fit for reporting-style workloads, and if you need complex transactions it’s not going to be a good choice.  However, MongoDB’s simplicity makes it a great place to start.  MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON), making the integration of data in certain types of applications easier and faster (see the pymongo sketch after this list).
    • Couchbase – Can be used as a document database that stores JSON documents or as a pure key-value database.  Click here for how it compares to MongoDB.  CouchDB, a closely related document database, stores documents, each of which is made up of a set of pairs that link a key with a value.  The most radical change is in the query model: instead of a basic query structure that’s pretty similar to SQL, CouchDB finds documents with two functions that map and reduce the data.  One formats the document, and the other decides what to include.
  • Column stores
    • Cassandra – Born at Facebook and built on ideas from Amazon’s Dynamo and Google’s BigTable, it is a distributed storage system for managing very large amounts of structured data spread across many commodity servers, while providing highly available service with no single point of failure.  It is essentially a hybrid between a key-value and a column-oriented (or tabular) database.  It is often compared favorably with HBase and MongoDB, though which is a better fit depends on the workload.  It can be used for both OLTP and data warehousing.  It replaces HDFS with the Cassandra File System (CFS).  Cassandra does not support joins or subqueries and emphasizes denormalization.
    • HBase – Patterned after Google BigTable, HBase is designed to provide fast, tabular access to the high-scale data stored on HDFS.  It is well suited for sparse data sets, which are common in many big data use cases.  HBase offers two broad use cases.  First, it gives developers database-style access to Hadoop-scale storage, which means they can quickly read from or write to specific subsets of data without having to wade through the entire data store.  Most users and data-driven applications are used to working with the tables, columns, and rows of a database, and that’s what HBase provides.  Second, HBase provides a transactional platform for running high-scale, real-time applications.  In this role, HBase offers ACID guarantees at the row level, which is enough to back many transactional applications.  That’s what conventional relational databases like Microsoft SQL Server are mostly used for, but HBase can handle the incredible volume, variety, and complexity of data encountered on the Hadoop platform.  Like other NoSQL databases, it doesn’t require a fixed schema, so you can quickly add new data even if it doesn’t conform to a predefined model.  It can be used for lightweight OLTP.  Tables are denormalized for speed (so no joins), but updates can be slow.  HBase does not use Hadoop’s MapReduce capabilities directly, though it can integrate with Hadoop to serve as a source or destination of MapReduce jobs.
  • Graph
    • Neo4j – Neo4j lets you fill the data store with nodes and then add meaningful links, or relationships, between them.  Social networking applications are its strength.  The code base comes with a number of common graph algorithms already implemented.  If you want to find the shortest path between two people, which you might for a site like LinkedIn, the algorithms are waiting for you.
    • OrientDB – It is a document-based database, but the relationships are managed as in graph databases with direct connections between records.  It supports schema-less, schema-full and schema-mixed modes.  It has a strong security profiling system based on users and roles and supports SQL as a query language.  OrientDB uses a new indexing algorithm called MVRB-Tree, derived from the red–black tree and from the B+ tree; this reportedly has benefits of having both fast insertions and fast lookups.
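To make the server-side set operations mentioned for Redis concrete, here is a minimal sketch using the redis-py client; the key names are invented and a Redis server on localhost:6379 is assumed, so treat it as an illustration rather than a recipe:

```python
import redis

# Assumes a Redis server listening on localhost:6379.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store two sets of strings directly as values.
r.sadd("likes:alice", "rock", "jazz", "blues")
r.sadd("likes:bob", "jazz", "blues", "country")

# The intersection is computed on the server; the client never
# downloads either full set to do the work itself.
common = r.sinter("likes:alice", "likes:bob")
print(common)  # e.g. {'jazz', 'blues'}
```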
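Similarly, the dynamic per-document schema described for MongoDB can be sketched in a few lines with the pymongo driver; the database, collection, and field names are invented and a local MongoDB instance is assumed:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
people = client["demo_db"]["people"]

# Two documents in the same collection with different shapes:
# no table definition or ALTER TABLE is needed first.
people.insert_one({"name": "Jane", "email": "jane@example.com"})
people.insert_one({"name": "Raj", "address": {"city": "Boston"}, "tags": ["vip"]})

# Query on a nested field; documents lacking it are simply not matched.
for doc in people.find({"address.city": "Boston"}):
    print(doc["name"])
```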


But I’ll leave you with this note: Although NoSQL databases are becoming more popular, according to DB-Engines Ranking they only make up about 12% of the total database market when you include relational databases!

More info:

Visual Guide to NoSQL Systems

List Of NoSQL Databases

Thumbtack: NoSQL Database Comparison by Ben Engber

MongoDB, Cassandra, and HBase — the three NoSQL databases to watch

What is Apache Cassandra?

DB-Engines database popularity ranking

NoSQL showdown: MongoDB vs. Couchbase

NoSQL standouts: New databases for new applications

The Rise and Fall of the NoSQL Empire (2007–2013)


What is a data lake?

A “data lake” is a storage repository, usually in Hadoop, that holds a vast amount of raw data in its native format until it is needed.  It’s a great place for investigating, exploring, experimenting, and refining data, in addition to archiving data.  Some characteristics of a data lake include:

  • A place to store unlimited amounts of data in any format inexpensively
  • Allows collection of data that you may or may not use later: “just in case”
  • A way to describe any large data pool in which the schema and data requirements are not defined until the data is queried: "just in time" or "schema on read" (see the sketch after this list)
  • Complements an Enterprise Data Warehouse (EDW) and can be seen as a data source for the EDW – capturing all data but only passing relevant data to the EDW
  • Frees up expensive EDW resources (storage and processing), especially for data refinement
  • Allows for data exploration to be performed without waiting for the EDW team to model and load the data
  • Some processing is better done on Hadoop than in ETL tools like SSIS
  • Also called bit bucket, staging area, landing zone or enterprise data hub (Cloudera)
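To illustrate the "schema on read" idea from the list above, here is a small self-contained Python sketch: raw files land in the lake exactly as produced, and a schema is applied only when the data is queried.  The file and field names are invented for the example:

```python
import csv
import io

# Raw data lands in the lake exactly as produced; nothing is modeled up front.
raw_landing_zone = {
    "clicks_2015_03_01.csv": "ts,user,url,extra\n1425168000,jane,/home,debug=1\n1425168060,raj,/cart,\n",
}

# A schema is declared only when someone needs to query the data ("just in time").
query_schema = {"ts": int, "user": str}

def read_with_schema(filename, schema):
    """Apply the requested schema while reading; unneeded columns are ignored."""
    reader = csv.DictReader(io.StringIO(raw_landing_zone[filename]))
    for row in reader:
        yield {col: cast(row[col]) for col, cast in schema.items()}

for record in read_with_schema("clicks_2015_03_01.csv", query_schema):
    print(record)   # e.g. {'ts': 1425168000, 'user': 'jane'}
```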

To use some pictures to show the benefit of a data lake, here is the traditional approach for a data warehouse environment:

[Figure: the traditional data warehouse approach]

But as we introduce larger data volumes into this environment along with non-relational data, we run into problems:

[Figure: the traditional architecture strained by larger data volumes and non-relational data]

The impact if we keep the current architecture:

  • Drop useful data by introducing ETL “bias”
  • Potentially insightful data is lost
  • Create latency as volumes increase and sources change
  • Duplicate data through staging environments to support ETL
  • Expensive “reactive” hardware to support processing scale requirements

So we take a new approach, where the non-relational data is copied to a data lake and refined, and then copied to the data warehouse.  Meanwhile, much of the relational data can keep being fed directly to the data warehouse using the current ETL, bypassing the data lake:

[Figure: non-relational data refined in a data lake and copied to the data warehouse, while relational ETL bypasses the lake]

By changing the architecture for the analyst’s needs, we get the following benefits:

  • Entire “universe” of data is captured and maintained
  • Mining of data via transformation on read leaves all data in place
  • Refineries leverage the power of the cloud and traditional technologies
  • Integration with traditional data warehousing methodologies
  • Scale can be pushed to cloud for more horsepower
  • Orchestration of data is a reality (less rigid, more flexible, operational)
  • Democratization of predictive analytics, data sets, services and reports

Note there are technologies, such as PolyBase, that allow end users to query data in a data lake using regular SQL, so they are not required to learn any Hadoop-related technologies.  In fact, PolyBase allows the end user to use SQL, or any reporting tool that uses SQL, to join data in a relational database with data in a Hadoop cluster.
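As a rough sketch of what that looks like from the end user's side, the Python snippet below uses pyodbc to run a single T-SQL query that joins an ordinary relational table to a PolyBase external table over files in Hadoop.  The connection string, table names, and the existence of the external table are all assumptions for the example:

```python
import pyodbc

# Connection details are placeholders for an environment with PolyBase enabled.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=Sales;Trusted_Connection=yes;"
)

# dbo.Customers is an ordinary relational table; dbo.WebClicks_External is assumed to be
# a PolyBase external table whose data actually lives as files in the Hadoop cluster.
sql = """
SELECT c.CustomerName, COUNT(*) AS Clicks
FROM dbo.Customers AS c
JOIN dbo.WebClicks_External AS w ON w.CustomerId = c.CustomerId
GROUP BY c.CustomerName;
"""

for name, clicks in conn.cursor().execute(sql):
    print(name, clicks)
```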

More info:

Hadoop and Data Warehouses

The Modern Data Warehouse

Analysts Warn of Data Lake ‘Fallacy’ in Big Data Analytics

Make Sure Your Data Lake is Both Just-in-Case and Just-in-Time

Top Five Differences between Data Lakes and Data Warehouses

Hadoop vs Data Warehouse: Apples & Oranges?

Martin Rennhackkamp: Data Lake, Data Lake vs Data Warehouse, Building a Data Lake, Revolt against the Data Lake as Staging Area


Creating a large data warehouse in Azure

Microsoft Azure provides two options for hosting your SQL Server-based data warehouse: Microsoft Azure SQL Database and SQL Server in an Azure Virtual Machine.  Which one is appropriate based on the size of the data warehouse?  And what are some hardware options for an Azure VM hosting a large data warehouse?

Let’s look at each option.

Microsoft Azure SQL Database is a Platform-as-a-Service (PaaS) offering, or more specifically a relational database-as-a-service.  It is built on standardized hardware and software that is owned, hosted, and maintained by Microsoft.  You can develop directly against the service using built-in features and functionality.  When using Azure SQL Database, you pay as you go, with options to scale up or out for greater power.  Azure SQL Database has a max database size of 500GB (see Azure SQL Database Service Tiers and Performance Levels).  Also, not all SQL Server 2014 Transact-SQL statements are supported in Azure SQL Database (see Azure SQL Database Transact-SQL Reference), and it does not support SQL Server instance-level features (such as SQL Server Agent, Analysis Services, Integration Services, or Reporting Services).  On the other hand, some features land in Azure SQL Database before they reach on-prem SQL Server, such as Row-Level Security and Dynamic Data Masking.  Note that Azure SQL Database has built-in fault-tolerance capabilities that enable high availability as well as business continuity options (see Azure SQL Database Business Continuity).  So this option is only appropriate if you have a relatively small data warehouse that does not require full SQL Server feature support.

The other option, which will almost always be the correct choice for a large data warehouse, is to create an Azure VM that has SQL Server 2014 installed, resulting in Infrastructure-as-a-Service (IaaS).  This allows you to run SQL Server inside a virtual machine in the cloud.  Similar to Azure SQL Database, it is built on standardized hardware that is owned, hosted, and maintained by Microsoft.  When using SQL Server in a VM, you can either bring your own SQL Server license to Azure (by uploading a Windows Server VHD to Azure) or use one of the preconfigured SQL Server images in the Azure portal.  If going with a preconfigured image, you should choose "SQL Server 2014 Enterprise Optimized for Data Warehousing on Windows Server 2012 R2" (see VM Images Optimized for Transactional and DW workloads in Azure VM Gallery), which will attach 15 data disks (12 disks for a 12TB data pool and 3 disks for a 3TB log pool).  You will also need to choose the virtual machine size for your VM.  Note you can set up high availability and disaster recovery solutions (see High Availability and Disaster Recovery for SQL Server in Azure Virtual Machines).  The resulting VM will be very similar to an on-prem SQL Server solution except for the various hardware configurations you have to choose from for your virtual machine size.

If you look at the Azure Virtual Machines Pricing for SQL Server, here are the options you would want to consider for your virtual machine size:

  • A7, 8 cores, 56GB memory, 605GB max disk size, 16 Persistent 1TB Data Disks Max, $3/hr SQL Enterprise
  • A9, 16 cores, 112GB RDMA memory, 382GB max disk size, 40Gbit/s InfiniBand, 16 Persistent 1TB Data Disks Max, $6/hr SQL Enterprise
  • D14, 16 cores, 112GB memory, 800GB max disk size (SSD), 32 Persistent 1TB Data Disks Max, CPU 60% faster than A-series, $6/hr SQL Enterprise
  • G5, 32 cores, 448GB memory, 6596GB max disk size (SSD), 64 Persistent 1TB Data Disks Max, Xeon E5 v3 CPU, $12/hr SQL Enterprise.  See Largest VM in the Cloud

The "Persistent 1TB Data Disks" figure refers to externally attached drives, but be aware that network bandwidth can be a bottleneck.  From Virtual Machine and Cloud Service Sizes for Azure you can see that for G5 you can add 64 1TB data disks, yielding a potential total volume size of up to 64TB.  See How to Attach a Data Disk to a Windows Virtual Machine to attach a drive and give it a new drive letter, or attach drives to use as a storage space, which is a way to make multiple disks appear as one (see Windows Server 2012 Storage Virtualization Explained).

If this still is not enough disk size for your data warehouse, you will need to use an on-prem SQL Server solution or an MPP solution such as Microsoft’s Analytics Platform System.

There is a script you can download (see Deploy a SQL Server Data Warehouse in Windows Azure Virtual Machines) that allows a user to create a *Data Warehousing* optimized VM on Azure running SQL Server 2012 or SQL Server 2014 and will also attach empty disks to the VM to be used for SQL Server data and log files.


For info on how to connect to an Azure VM using SSMS, click here.  To connect SSMS to an Azure SQL database, click here.

If you are concerned about data throughput from on-prem to Azure, check out the ExpressRoute service.  ExpressRoute enables dedicated, private, high-throughput network connectivity between Azure datacenters and your on-premises IT environments.  Using ExpressRoute, you can connect your existing datacenters to Azure without having to flow any traffic over the public Internet, and it enables guaranteed network quality of service and the ability to use Azure as a natural extension of an existing private network or datacenter.

More info:

Introduction to Automating Deployment of SQL Server in Azure IaaS Virtual Machines

Understanding Azure SQL Database and SQL Server in Azure VMs

Inside Microsoft Azure SQL Database

Automating Image-Based Deployment of SQL Server on Azure IaaS VMs – Preparing OS Image


Operational Data Store (ODS) Defined

I see a lot of confusion about what exactly an Operational Data Store (ODS) is.  While it can mean different things to different people, I’ll explain what I see as the most common definition.  First let me mention that an ODS is not a data warehouse or data mart.  A data warehouse is where you store data from multiple data sources to be used for historical and trend analysis reporting.  It acts as a central repository for many subject areas and contains the "single version of truth".  A data mart serves the same purpose but comprises only one subject area.  Think of a data warehouse as containing multiple data marts.  See my other blogs that discuss this in more detail: Data Warehouse vs Data Mart, Building an Effective Data Warehouse Architecture, and The Modern Data Warehouse.

The purpose of an ODS is to integrate corporate data from different heterogeneous data sources in order to facilitate operational reporting in real time or near real time.  Usually data in the ODS will be structured similarly to the source systems, although during integration the data can be cleaned and denormalized and business rules applied to ensure data integrity.  This integration happens at the lowest granular level and occurs quite frequently throughout the day.  Normally an ODS will not be optimized for historical and trend analysis, as this is left to the data warehouse, and an ODS is frequently used as a data source for the data warehouse.
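As a self-contained Python sketch of that integration pattern, the code below pulls current rows from two invented source systems and merges them into a single fine-grained ODS table, with in-memory sqlite3 databases standing in for the real systems:

```python
import sqlite3

# In-memory databases stand in for two heterogeneous source systems and the ODS.
store_a, store_b, ods = (sqlite3.connect(":memory:") for _ in range(3))

store_a.execute("CREATE TABLE orders (order_id TEXT, qty INTEGER)")
store_a.executemany("INSERT INTO orders VALUES (?, ?)", [("A-1", 2), ("A-2", 5)])
store_b.execute("CREATE TABLE orders (order_id TEXT, qty INTEGER)")
store_b.executemany("INSERT INTO orders VALUES (?, ?)", [("B-9", 1)])

ods.execute("CREATE TABLE ods_orders (source TEXT, order_id TEXT, qty INTEGER)")

def load_increment(source_name, source_conn):
    """Run frequently throughout the day: copy the current rows, tagged by source."""
    rows = source_conn.execute("SELECT order_id, qty FROM orders").fetchall()
    ods.executemany(
        "INSERT INTO ods_orders VALUES (?, ?, ?)",
        [(source_name, order_id, qty) for order_id, qty in rows],
    )

load_increment("store_a", store_a)
load_increment("store_b", store_b)
print(ods.execute("SELECT COUNT(*) FROM ods_orders").fetchone()[0])  # 3 rows, lowest grain, all sources
```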

To summarize the differences between an ODS and a data warehouse:

  • An ODS is targeted at the lowest-granularity queries, whereas a data warehouse is usually used for complex queries against summary-level or aggregated data
  • An ODS is meant for operational reporting and supports current or near real-time reporting requirements whereas a data warehouse is meant for historical and trend analysis reporting usually on a large volume of data
  • An ODS contains only a short window of data, while a data warehouse contains the entire history of data
  • An ODS provides information for operational and tactical decisions on current or near real-time data while a data warehouse delivers feedback for strategic decisions leading to overall system improvements
  • In an ODS the frequency of data load could be every few minutes or hourly whereas in a data warehouse the frequency of data loads could be daily, weekly, monthly or quarterly

Major reasons for implementing an ODS include:

  • The limited reporting in the source systems
  • The desire to use a better and more powerful reporting tool than what the source systems offer
  • Only a few people have the security to access the source systems and you want to allow others to generate reports
  • A company owns many retail stores, each of which tracks orders in its own database, and you want to consolidate the databases to get real-time inventory levels throughout the day
  • You need to gather data from various source systems to get a true picture of a customer so you have the latest info if the customer calls customer service.  Customer data such as customer info, support history, call logs, and order info.  Or medical data to get a true picture of a patient so the doctor has the latest info throughout the day: outpatient department records, hospitalization records, diagnostic records, and pharmaceutical purchase records

More info:

Comparing Data Warehouse Design Methodologies for Microsoft SQL Server

Operational Data Stores (ODS)

The Operational Data Store

Defining the Purpose of the Operational Data Store

Operational data store – Implementation and best practices


Presentation slides for “Building a Big Data Solution”

Thanks to everyone who attended my session “Building a Big Data Solution” (Building an Effective Data Warehouse Architecture with Hadoop, Cloud and MPP) for Pragmatic Works today.  The abstract for my session is below and the recording will be available here tomorrow.  I hope you enjoyed it!

Here is the PowerPoint presentation: Building a Big Data Solution

Building a Big Data Solution

As a follow-on to the presentation "Building an Effective Data Warehouse Architecture", this presentation will explain exactly what Big Data is and its benefits, including use cases.  We will discuss how Hadoop, the cloud, and massively parallel processing (MPP) are changing the way data warehouses are being built.  We will talk about hybrid architectures that combine on-premise data with data in the cloud, as well as relational data with non-relational (unstructured) data.  We will look at the benefits of MPP over SMP and how to integrate data from Internet of Things (IoT) devices.  You will learn what a modern data warehouse should look like and how the roles of a data lake and Hadoop fit in.  In the end you will have guidance on the best solution for your data warehouse going forward.


Power BI Made Simple

In an effort to understand Power BI and all the products it encompasses, I have made this slide deck to hopefully make things easy for you: Power BI Made Simple.

It is a presentation that covers all the products under the Power BI umbrella.  I give an overview of each product and how they relate to one another.  Microsoft has a lot of pieces to the puzzle, and I try to show how they all fit together!


What is Advanced Analytics?

Advanced Analytics, or Business Analytics, refers to future-oriented analysis that can be used to help drive changes and improvements in business practices.  It is made up of four phases:

Descriptive analytics: This phase is what is generally referred to as "business intelligence": large amounts of digital information are captured and then condensed into smaller, more useful nuggets of information.  Understanding the correlations between those nuggets to explain why something happened is often broken out as "diagnostic analytics."  In short, you are providing insight into what has happened in order to uncover trends and patterns.  An example is Netflix using historical sales and customer data to improve its recommendation engine.

Predictive analytics: Utilizes a variety of statistical, modeling, data mining, and machine learning techniques to study recent and historical data, thereby allowing analysts to make predictions, or forecasts, about the future.  In short, it helps model and forecast what might happen.  For example, taking sales data, social media data, and weather data to forecast the product demand for a certain region and to adjust production.  Or you can use predictive analytics to determine outcomes such as whether a customer will “leave or stay” or “buy or not buy.”
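As a toy illustration of the "leave or stay" example, the sketch below fits a logistic regression on a handful of made-up customer records and scores a new one; scikit-learn and the invented features are assumptions of the example, not something from the original post:

```python
from sklearn.linear_model import LogisticRegression

# Invented training data: [months_as_customer, support_tickets_last_90_days]
X = [[24, 0], [3, 4], [36, 1], [2, 5], [18, 2], [1, 6]]
y = [0, 1, 0, 1, 0, 1]  # 1 = customer left ("churned"), 0 = customer stayed

model = LogisticRegression()
model.fit(X, y)

# Score a new customer: estimated probability that they will leave.
new_customer = [[4, 3]]
print(model.predict_proba(new_customer)[0][1])
```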

Prescriptive analytics: Goes beyond predicting future outcomes by also suggesting actions to benefit from the predictions and showing the decision maker the implications of each decision option.  Prescriptive analytics not only anticipates what will happen and when it will happen, but also why it will happen.  The output is a decision derived using simulation and optimization.  In short, it seeks to determine the best solution or preferred course of action among various choices.  For example, airlines sift through millions of flight itineraries to set an optimal price at any given time based on supply and demand.  Also, prescriptive analytics in healthcare can be used to guide clinician actions by making treatment recommendations based on models that use relevant historical intervention and outcome data.

To summarize all four phases:

Descriptive: “What happened?”, Diagnostic: “Why did it happen?”, Predictive: “What will happen?”, Prescriptive: “What is the best outcome and how can we make it happen?”

[Figure: analytics maturity model, from descriptive to prescriptive analytics]

More info:

Defining Advanced Analytics

Predictive, Descriptive, Prescriptive Analytics

Big Data Analytics: Descriptive Vs. Predictive Vs. Prescriptive

Understanding Your Business With Descriptive, Predictive and Prescriptive Analytics

Descriptive, Predictive, and Prescriptive Analytics Explained

Business Analytics: Moving From Descriptive To Predictive Analytics

From insight to action: why prescriptive analytics is the next big step for big data

Prescriptive Analytics

Forecasting and Predictive Analytics

Business Analytics 101


Microsoft product roadmap now public

Ever wonder about Microsoft’s product roadmap?  With Microsoft rapidly releasing products and services, they realized the need to provide better transparency.  Well, wonder no more!  They have released the Cloud Platform Roadmap Site (visit here).

Roadmaps are given for: Microsoft Azure, Intune, Power BI, and Visual Studio Online; server offerings such as Windows Server, System Center, SQL Server and Visual Studio; and converged system appliance offerings such as Cloud Platform System, Analytics Platform System and StorSimple.

It’s a great way to see where Microsoft is focusing some of their development efforts and what technology is currently in development and coming within the next few months.  Check it out!

Here is the official announcement: Increasing Visibility for Our Customers: Announcing the Cloud Platform Roadmap Site.


Power BI Designer and Power BI Dashboard

Long has the question been asked, "Which Microsoft tool do I use for dashboards?"  SSRS, Excel, Power View, Report Builder, and PerformancePoint are all candidates.  But that has all changed, and the future of dashboarding is here: Power BI Designer.

The Power BI Designer is a new companion application for Power BI.  It is a standalone Windows desktop application that can be downloaded from the Power BI site or here.

Note that Power BI Designer is an on-prem tool, with a companion called Power BI Dashboard that is the web version, available via an "Introducing Power BI Dashboards (Preview) – Try it now" link on the front of your Power BI site.  Power BI Dashboard can use files created with Power BI Designer.  Think of Power BI Designer as the tool to create the reports, and Power BI Dashboard as the tool to create the dashboard that contains those reports.  However, Power BI Dashboard can create a limited number of reports on its own against other products (GitHub, Marketo, Microsoft Dynamics CRM, Salesforce, SendGrid, Zendesk), as well as against on-prem SSAS (see Using Power BI to access on-premise data), and it can also connect to an Excel workbook (where data can be automatically refreshed from OneDrive).

Power BI Designer combines Power Query, Power View (a slimmed-down version), and the Power Pivot data model into a seamless experience that lets you build your reports offline and then upload them to your Power BI site via Power BI Dashboard.  When you finish building your report, you save it in a new Power BI Designer file format called PBIX.


Note that Power BI Designer reports can’t be refreshed after they are uploaded; they must be refreshed manually in Power BI Designer and uploaded again (for now).  You can, however, edit the design of the report after it has been uploaded.

The Power BI add-ins (Power Query, Power Pivot, Power View, Power Map) for Excel 2013 are still available, and you can continue to use them to model your data and build reports.  The Power BI Designer is another option and also allows customers with an older version of Office to create reports.

Within the Power BI Designer is a true SSAS tabular data model: an SSAS tabular instance runs as a child process in local mode.

Users can create personalized dashboards to monitor their most important data.  A dashboard combines on-premises and cloud-born data in a single pane of glass, providing a consolidated view across the organization regardless of where the data lives.

Users can easily explore all their data using intuitive, natural language capabilities and receive answers in the form of charts and graphs. They can also explore data through detailed reports that target specific aspects of their business.  Visuals from these reports can also be pinned to their dashboards for continuous monitoring.  As part of this experience new visualizations have been added including combo charts, filled maps, gauges, tree maps, and funnel charts.

Power BI Dashboard provides “out of the box” connectivity to a number of popular SaaS applications.  In addition to the existing connection with Microsoft Dynamics CRM Online, customers can also connect to their data in Salesforce, Zendesk, Marketo, SendGrid, and GitHub with many more to come in the months ahead.  With an existing subscription to one of these services, customers can login from Power BI.  In addition to establishing a data connection, Power BI provides pre-built dashboards and reports for each of these applications.


There is also a new Power BI connector for SQL Server Analysis Services that allows customers to realize the benefits of a cloud-based BI solution without having to move their data to the cloud.  Customers can now create a secure connection to an "on-premises" SQL Server Analysis Services server from Power BI in the cloud.  When users view and explore dashboards and reports, Power BI will query the on-premise model using the user’s credentials.

There is still a way to go before Power BI Designer is the solution for any and all dashboards you want to create.  It will need to add all the features that are in PerformancePoint, such as KPIs.  And hopefully there will be a version that allows dashboards to be deployed to on-prem SharePoint.  But this is a great start and I’m excited to see what is next!

More info:

Power BI Designer Preview

Videos Power BI Designer

New Power BI features available for preview

Getting Started with Power BI Designer

Video Power BI Designer Preview

Unwrapping the Power BI vNext Public Preview

Getting Started with the Power BI Dashboards Public Preview


Using Power BI to access on-premise data

The following blog post describes how to use certain cloud-based Power BI for Office 365 products (Power View and Q&A) on a Power BI site to reference data that lives on-premises.  To do this, you will use certain on-premise Power BI products (Power Pivot and Power Query) to gather the data.  Note that you cannot currently upload Power Map reports to a Power BI site (instead, use maps in Power View; see Maps in Power View).

This blog applies to on-premise data that resides in SQL Server that is on an Azure VM or an Analytics Platform System (APS), as well as a normal server.  I’ll make note of any differences.

Download and install the Data Management Gateway (DMG) on the server that contains the on-premise data that resides in SQL Server.  If you are using APS, this will have already been done for you on a VM installed with AU3.  This VM will already have the DMG installed which allows a gateway to APS.  Next create a Power BI site if you have not already done so (see Add Power BI sites to your Office 365 site).  On your Power BI site, create a data gateway (see Data Management Gateway Introduction) and note the generated gateway key and then enter that key in the DMG.  This will link your DMG to the Power BI site (a DMG can only be linked to one Power BI site).  Then create a SQL Server data source on your Power BI site that uses the data gateway (see Data Sources via Power BI), enter the connection info to your server, and specify the SQL Server tables to expose. Take note of the created OData data feed URL.

Next you will open an on-premise Excel workbook, go to the Power Query tab, sign in to your Power BI site, and use either “Online Search” or “From Other Sources -> From OData Feed” and choose the table(s) to load (both will create an OData feed connection).  If loading the data into Excel, once loaded select the data and choose “Add to Data Model” on the PowerPivot tab.  Or, load the table directly to a data model.  Then create a Power View report.  Save the Excel workbook and upload it to your Power BI site.  You will now be able to view the Power View report on the Power BI site.  Then use the “Add to Q&A” option on the workbook on the Power BI site, and you will then be able to use Q&A.

Note that you are using data in the Power Pivot model in the workbook you uploaded to Power BI and not the data directly on SQL Server (so the Power View report is hitting Power Pivot, so if data changes in SQL Server, that won’t be reflected in Power View unless you setup the DMG to refresh the data from SQL Server as I explain below).  To refresh the data in the workbook in Power BI from SQL Server, use the “Schedule Data Refresh” option on the workbook in Power BI and create a daily refresh (there is no option to update more than once a day).  To refresh the data more frequently, you must do it manually by going to the setting tab on the “Schedule Data Refresh” page and clicking the button “refresh report now” (the status of the connection must be OK to see this button).  See Schedule data refresh for workbooks in Power BI for Office 365.

Before you use data refresh, be aware that the OData feed connection info that was in the Excel workbook when it was uploaded will not work for a data refresh.  This is because any data sources that are retrieved via Power Query using an OData feed are not supported for data refresh.  So what you need to do is get a proper connection string by importing data from SQL Server using an SQL Server connection: In on-premise Excel, go to the Power Query tab, click “From Database -> From SQL Server Database”.  Enter the connection info (server name) and then Excel will populate the Navigator with names of all the tables.  Then load one of the tables into the data model.  Now go to the Data tab, click Connections, select the properties of the Power Query connection, and on the Definition tab copy the connection string.  Next go to your Power BI site and configure a Power Query data source (see Scheduled Data Refresh for Power Query) where you will paste that connection string.  After adding the Power Query data source, any Excel workbooks that you want to refresh should use a connection to SQL Server (“From Database -> From SQL Server Database”) and not an OData feed (“Online Search” or “From Other Sources -> From OData Feed”).  Once uploaded to Power BI the workbooks can be refreshed as Power BI knows to discover your gateway and data sources.  You won’t need to create another Power Query data source as long as each workbook uses the same raw connection info (server name).

Now in beta for Power BI is a way to access on-premise SQL Server data from the Power BI site in real time.  This is accomplished by having Power BI connect to an SSAS tabular model that is on-premise and uses DirectQuery mode.  That SSAS tabular model is connected to SQL Server, so a Power View report or Q&A on the Power BI site is hitting SQL Server in real time.  Power BI is able to "find" the on-premise SSAS instance via an installation piece similar to the DMG, called the Power BI Analysis Services Connector.  Note that installing the Power BI Analysis Services Connector and the Data Management Gateway on the same computer is not supported.  Once the Power BI Analysis Services Connector is installed and connected to an SSAS server, the tabular model databases for the server instance you configured will appear when users select the "Get Data" option in Power BI Designer Preview (this is the only place the tabular models can be used).  The connector currently supports Analysis Services 2012 and 2014.  In the future, Analysis Services Multidimensional, SQL Server, and other data sources will be added.

Some notes:

The DMG can be used to publish raw data sources (connecting to data sources directly using the server name) or to expose them as OData feeds (using OData to find the data source saves users from having to know the server name, but with the disadvantage that workbooks that use an OData feed cannot be refreshed on a Power BI site).  You will want to create both a raw data source connection and an OData feed to the same data source so users can use both features: searching for data without needing connection info, and the ability to refresh workbooks on the Power BI site.

When using the OData feed URL in Excel, you will be asked for the method you want to use to access it.  Select the Organizational account and enter your Office 365 credentials.  Also make sure you are signed into Excel with the same account (see Power BI – Data Management Gateway).

If you are using an Azure VM for your SQL Server, when creating a data source in Power BI, the set credentials step does not work (you get the message "Failed to verify gateway 'InstanceName1' status.  Unable to connect to the remote server").  As a workaround, log in to the VM itself and configure the data source.  See Using Power BI Data Management Gateway on Non-Domain Azure VM.

If you are using an Azure VM for your SQL Server you will want to change the computer name of the Azure VM (see Using Power BI DMG on Non-Domain Azure VMs – August 2014 Update) and add endpoints in Azure Management Portal and firewall rules exceptions on your Azure VM (see Using Odata feed on Azure client whose Data Source is configured against DMG on Azure VM and How to Set Up Endpoints to a Virtual Machine).

More info:

How to set Power BI Schedule Data refresh with simple excel file Office 365

Power BI first impressions

Power BI Data Management Gateway 1.2 Changes the Game

Limitations for Power Query OData Feeds in Power BI

When is a Data Management Gateway Needed in Power BI?

Video Deep Dive on the Data Management Gateway in Power BI for connectivity to on premise and hybrid scenarios

Tip! How To Configure Power BI Preview Analysis Services
