A number of terms have been coined to describe the process of collecting and compiling data: data integration, data federation, data virtualization... the list goes on. To an extent, the confusion across these terms is understandable, as they all serve a similar function: connecting to data sources and then publishing the data in some form. However, the similarities end there. Many leading analysts have been using the terms data federation and data virtualization interchangeably, leading to confusion in the marketplace. Data virtualization is a superset of the ten-year-old data federation technology and has come to include advanced capabilities such as performance optimization and self-service search and discovery. Put more simply, data virtualization is data federation on steroids. Allow me to explain further.
Let’s begin with data federation. Data federation provides distributed access to data residing in multiple systems, with the purpose of joining the data together as if it came from a single system. After amassing the data from the various sources, the federation layer delivers the combined result to the consuming systems. Sounds simple enough, but data federation has a number of problems, particularly regarding performance when accessing large volumes of data. To illustrate this inefficiency, let’s say your company sells 1,000 products over the course of a year, and a total of 1 million sales transactions are recorded during this period. We want to know how much revenue each product earned, so the system should return 1,000 rows of data: the name of each product and its total revenue. Data federation would move all 1 million transactions to the federation layer and compute the sum of revenue for each of the 1,000 products there. Obviously, moving a million rows of data over the wire takes time.
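To make the bottleneck concrete, here is a minimal Python sketch of the federation pattern the paragraph describes. The `sales` table, its columns, and the in-memory SQLite database standing in for a remote source system are all illustrative assumptions, not details from any particular product:

```python
import sqlite3
from collections import defaultdict

# Stand-in "source system": an in-memory SQLite database of sales
# transactions. Table and column names are hypothetical.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (product TEXT, revenue REAL)")
src.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("widget", 10.0), ("widget", 15.0), ("gadget", 7.5)],
)

# Classic federation: ship EVERY transaction row to the middle tier...
rows = src.execute("SELECT product, revenue FROM sales").fetchall()

# ...then aggregate in the federation layer itself.
totals = defaultdict(float)
for product, revenue in rows:
    totals[product] += revenue

print(dict(totals))
```

With 1 million real transactions, the `fetchall()` call is where the pain lives: every row crosses the network before any aggregation happens.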
Cue data virtualization
Data virtualization evolved from data federation by improving performance and adding advanced capabilities such as self-service search and discovery. In the example above, data virtualization would compute the sum of revenue within the source system itself and bring back only the 1,000 rows of results. Naturally, moving 1,000 rather than 1 million rows of data is much faster. But how exactly is this improved performance possible? Advanced data virtualization products like the Denodo Platform include dynamic query optimization techniques that determine the best query execution plan, delivering optimal performance. Beyond that, the Denodo Platform also includes self-service search and discovery capabilities: data virtualization acts as a central repository with knowledge of all enterprise data (even desktop files like Excel), which makes it a natural place for business analysts to “search” for and “discover” the data they are looking for.
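The same toy setup shows what pushing the aggregation down to the source looks like. Again, the `sales` table, its columns, and the SQLite stand-in are hypothetical; this is a sketch of the general pushdown idea, not the Denodo Platform's actual optimizer:

```python
import sqlite3

# Stand-in source system with hypothetical sample data.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (product TEXT, revenue REAL)")
src.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("widget", 10.0), ("widget", 15.0), ("gadget", 7.5)],
)

# Pushdown: the aggregation runs INSIDE the source system, so only one
# summary row per product ever crosses the wire.
results = src.execute(
    "SELECT product, SUM(revenue) FROM sales GROUP BY product"
).fetchall()

print(dict(results))
```

Only one summary row per product travels back, which is the 1,000-versus-1-million difference described above.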
Given data virtualization’s clear advancements over decade-old data federation, calling data virtualization “data federation” is fundamentally incorrect. Equating the two is as technologically off-base as calling the BlackBerry a modern smartphone, or saying a CD is the same as streaming music! I hope analysts will correct their terminology and stop spreading misinformation.