Data virtualization goes by a lot of different names: logical data warehouse, data federation, virtual database, and decentralized data warehouse. Data virtualization allows you to integrate data from various sources, keeping the data in-place, so that you can generate reports and dashboards to create business value from the data. It is an alternative to building a data warehouse, where you collect data from various sources and store a copy of the data in a new data store.
The main advantage of data virtualization is speed-to-market, where we can build a solution in a fraction of the time it takes to build a data warehouse. This is because you don’t need to design and build the data warehouse and the ETL to copy the data into it, and also don’t need to spend as much time testing. Copying the data means more hardware costs, more software licenses, more ETL flows to build and maintain, more data inconsistencies and more data governance costs, so using data virtualization can also save you a lot of money. Other reasons for data virtualization include rapid prototyping for batch data movement, self-service analytics via a virtual sandbox, and regulatory constraints on moving data.
Some of the more popular data virtualization products are Cisco Data Virtualization (previously called Composite Software), Denodo Platform for Data Virtualization, Informatica Data Virtualization, Drermio, IBM Big SQL and Incorta.
Along the same lines of data virtualization vs data warehouse is federated queries vs data lake. Such technologies as PolyBase, Metanautix, and U-SQL in Azure Data Lake Analytics provide for federated queries.
But there are some major drawbacks to data virtualization and federated queries, so you have to ask the following questions when you are thinking about using it:
- Speed. Is this something I could use for a Power BI dashboard where I wanted to slice and dice data with sub-second response times? Or is this more for operational type reporting?
- How much will this affect the performance of the source system? Could a query consume all the resources of a server with a data source that I’m querying against? Does it push down the query in the same way PolyBase does?
- Do I need to install something on each server that contains a data source I want to use?
- Does it use the indexes of each technology on the data store, or does it create its own indexes?
- How is security handled for giving users access to each data source?
- How is master data management handled (i.e. the same customer in multiple data sources but the customer name spelled differently)?
- Where and how will the data be cleaned?
- Will reports break if the source system is changed?
And there are some very valid reasons why a physical data warehouse is required:
- Many production systems don’t keep track of historical data. This data must be stored somewhere for historical analysis of the data. The physical data warehouse is, in this case, the most obvious solution
- Accessing production systems directly for reporting and analytics can lead to too much interference on those systems and to performance degradation. Note that this was once the reason why physical data warehouses were developed in the first place
- Speed: I data warehouse is optimized for read access while a source system is usually optimized for writes
- In building a data warehouse you will be restructuring, renaming, and joining data (i.e. creating star schemas) to make it easy for users to create reports
- A data warehouse protects users against source system upgrades
A word about Views, such as those in SQL Server, can be thought of as a “lightweight” data virtualization solution: When users need access to operational data, views can be defined directly on the operational data store or the production databases. But views have their own issues: operations for correcting and transforming data must be added to the view definitions, because the original operations are implemented in the physical data warehouse environment and are now bypassed. They now have to be virtualized.