Beware of “Straw Man” Stories: Clearing up Misconceptions about Data Virtualization
In the last few years, data virtualization technology has experienced tremendous growth, emerging as a key component for enabling modern data architectures such as the logical data warehouse, data fabric, and data mesh. Gartner recently named it “a must-have data integration component” and estimated that it results in 45% cost savings in data integration, while Forrester has estimated 65% faster data delivery than ETL processes.
As adoption of the technology grows, it is naturally discussed more frequently by analysts and vendors. Unfortunately, the descriptions are not always accurate, and some materials, especially those promoting centralized, monolithic architectures, describe a “straw man” version of data virtualization. (The Oxford Languages dictionary defines a “straw man” argument as “An intentionally misrepresented proposition that is set up because it is easier to defeat than an opponent’s real argument.”) This “straw man” conception of data virtualization resembles the early data federation systems of the late 1990s and early 2000s, but it is very different from what solutions like the Denodo Platform can do today.
To avoid confusion, it is worth going through some of the main misconceptions:
Misconception #1: Data Virtualization is Synonymous with Data Federation
Data virtualization provides a unified data integration and delivery layer for the business. It abstracts data consumers from the location and native formats of the data by creating a semantic layer on top of multiple distributed data sources, without the need to replicate all of the data into a central repository. This semantic layer can be accessed in a secure and governed manner using a variety of access methods.
In contrast, data federation is a much more specific concept: it’s the ability to answer user queries by combining and transforming data in real time from several data sources.
While data federation is one of the key capabilities of a data virtualization platform (it is discussed in more detail below), it is not the only one and, arguably, not even the most important one. Most of the benefits of advanced data virtualization platforms like the Denodo Platform also apply when queries retrieve data from only one data source (which may be different for each query). Some of these benefits are:
- The ability to create and manage rich semantic layers, which expose data to different consumers in their chosen form, without needing to replicate data for them every time.
- A layered approach to defining canonical data views and metrics that can be reused across multiple use cases. This ensures consistency and interoperability, increases productivity, and fosters collaboration. It also supports top-down modelling, to set “schema contracts,” which developers need to comply with when they create new data products.
- Data access using any technology, such as SQL, REST, OData, SOAP, GraphQL and MDX, no matter which technologies are natively supported by the data sources.
- The ability to move the data used by the query to another system (e.g. because processing is cheaper there) without affecting your data consumers.
- The ability to enforce data quality, security, and governance policies from a single point, across multiple data sources and consumers. This way, the policies do not need to be implemented in multiple systems, and they can be specified at the semantic layer level, independently of the particular methods supported by the individual data sources.
- The automatic generation of documentation for any data products exposed through data virtualization, using standard formats such as OpenAPI.
- Data lineage and change impact analysis functionality for all data products across all data sources.
- A data catalog that enables business users and other data consumers to quickly discover, understand, and get access to data products offered by the data virtualization layer, effectively implementing a data marketplace for the business.
- Advanced caching and query acceleration capabilities out-of-the-box, to improve the performance of slow data sources.
- And many other benefits.
This list is not comprehensive, but it shows that there is much more to data virtualization than data federation. Any discussion ignoring these capabilities will probably be misleading.
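To make the idea of a semantic layer with single-point policy enforcement more concrete, here is a minimal Python sketch. It is not the Denodo Platform's API: the `VirtualView` class, the `crm_source` function, the column names, and the `pii_reader` role are all hypothetical names invented for this illustration. The view maps source-specific column names to canonical ones and masks a sensitive column for consumers that lack the required role, without copying or changing the source data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set


@dataclass
class VirtualView:
    """Hypothetical virtual view: canonical names plus a masking policy."""
    fetch: Callable[[], Iterable[dict]]   # reads rows from the underlying source
    column_map: Dict[str, str]            # canonical column -> source column
    # canonical column -> role a consumer needs to see the real value
    masked_columns: Dict[str, str] = field(default_factory=dict)

    def rows(self, roles: Set[str]) -> List[dict]:
        result = []
        for source_row in self.fetch():
            row = {}
            for canonical, source_column in self.column_map.items():
                value = source_row[source_column]
                required_role = self.masked_columns.get(canonical)
                if required_role is not None and required_role not in roles:
                    value = "***"  # policy applied in the view, not in the source
                row[canonical] = value
            result.append(row)
        return result


def crm_source() -> List[dict]:
    # Stand-in for a real source with its own native column names.
    return [{"CUST_NM": "Ada", "EMAIL_ADDR": "ada@example.com"}]


customers = VirtualView(
    fetch=crm_source,
    column_map={"name": "CUST_NM", "email": "EMAIL_ADDR"},
    masked_columns={"email": "pii_reader"},
)

analyst_rows = customers.rows(roles={"analyst"})
steward_rows = customers.rows(roles={"pii_reader"})
```

Both consumers query the same view and see the same canonical column names; only the masking differs, and it is defined once rather than in every source.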
Misconception #2: Data Virtualization Introduces Additional Workload on Your Operational Systems
This is a common misunderstanding about logical analytics architectures. In these architectures, the data virtualization platform is used to provide unified semantics, security, and governance on top of multiple analytics systems like data warehouses, data marts, operational data stores, data lakes, and some types of SaaS applications or NoSQL data sources. This common semantic layer is needed because modern analytic needs, in large organizations, are too complex and diverse to be resolved by a single system, so companies maintain multiple analytics systems on-premises and in the cloud.
In this scenario, the data virtualization engine rarely (if ever) directly hits the operational systems used in the transactional business processes. Instead, it hits the analytics systems that specialize in resolving analytic queries over large data volumes.
There may still be specific cases where you need to limit the workload that data virtualization introduces in a given system, or to accelerate a slow data source. For those situations, the Denodo Platform also includes throttling capabilities, which limit the number and type of queries sent to a data source, as well as caching/acceleration techniques (which are discussed below).
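As a rough illustration of what throttling means in practice, the following Python sketch caps the number of queries that can run concurrently against one source. It only covers the "number of queries" side of throttling, and it is not how the Denodo Platform implements or configures it: `ThrottledSource`, `slow_source`, and `max_concurrent` are hypothetical names used for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledSource:
    """Allows at most `max_concurrent` queries to hit the source at once."""

    def __init__(self, run_query, max_concurrent):
        self._run_query = run_query
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak_active = 0  # highest concurrency actually observed

    def query(self, sql):
        with self._slots:  # blocks while all slots are in use
            with self._lock:
                self._active += 1
                self.peak_active = max(self.peak_active, self._active)
            try:
                return self._run_query(sql)
            finally:
                with self._lock:
                    self._active -= 1


def slow_source(sql):
    # Stand-in for a source that takes a while to answer.
    time.sleep(0.01)
    return f"result of {sql}"


source = ThrottledSource(slow_source, max_concurrent=2)

# Eight consumers ask at once, but the source never sees more than two queries.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(source.query, [f"q{i}" for i in range(8)]))
```

All eight queries are still answered; the extra ones simply wait for a free slot instead of adding load to the source.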
Misconception #3: Data Virtualization Always Needs to Transfer Large Data Volumes over the Network
The data sources used in analytics architectures will usually contain very large data volumes. This can lead one to think that, especially when federating data from several data sources in the same query, data virtualization platforms will always need to retrieve large data volumes through the network, heavily affecting query performance.
To understand why this is wrong, it is useful to describe, in greater detail, how query execution works in the Denodo Platform. The Denodo execution engine can be thought of as a coordinator that pushes most of the work involved in resolving a query down to the data sources. If the query can be resolved with data from a single data source, all of the work will be pushed down to it. If the query needs data from several data sources, the Denodo Platform automatically rewrites it in such a way that each data source performs all the calculations it can (including joins and GROUP BY aggregations) on its own data and returns the results to the Denodo Platform. Therefore, in the final step, the Denodo Platform only needs to read and integrate the precalculated partial results from each source.
This means that the amount of data that needs to be read back into the data virtualization system is reduced by orders of magnitude as compared to the traditional techniques used in early incarnations of data federation tools. The techniques to automatically rewrite user queries to maximize query pushdown were pioneered by Denodo, and represent one of the key aspects that explain the recent rise of data virtualization usage in analytics scenarios. Detailed explanations of some of these techniques can be found here and here.
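The effect of pushing aggregation down to the sources can be sketched in a few lines of Python. This is a toy model, not the Denodo optimizer: two in-memory SQLite databases stand in for separate data sources, and the table names, column names, and helper functions are assumptions made for this example. The query is "total sales by region", where sales live in one source and the customer-to-region mapping lives in another.

```python
import sqlite3
from collections import defaultdict


def make_sources():
    """Two in-memory SQLite databases standing in for separate data sources."""
    sales = sqlite3.connect(":memory:")
    sales.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
    sales.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [(cid, 10.0) for cid in (1, 2, 3) for _ in range(1000)],
    )
    crm = sqlite3.connect(":memory:")
    crm.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
    crm.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "EMEA"), (2, "EMEA"), (3, "APAC")],
    )
    return sales, crm


def total_by_region(sales_rows, customer_rows):
    """Final integration step done by the coordinator."""
    region_of = dict(customer_rows)
    totals = defaultdict(float)
    for customer_id, amount in sales_rows:
        totals[region_of[customer_id]] += amount
    return dict(totals)


def naive_federation(sales, crm):
    # Early-federation style: pull every raw row, then do all the work centrally.
    sales_rows = sales.execute("SELECT customer_id, amount FROM sales").fetchall()
    customer_rows = crm.execute("SELECT customer_id, region FROM customers").fetchall()
    return total_by_region(sales_rows, customer_rows), len(sales_rows) + len(customer_rows)


def with_pushdown(sales, crm):
    # Rewritten plan: the sales source pre-aggregates per customer, so only
    # one partial result per customer crosses the network.
    sales_rows = sales.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"
    ).fetchall()
    customer_rows = crm.execute("SELECT customer_id, region FROM customers").fetchall()
    return total_by_region(sales_rows, customer_rows), len(sales_rows) + len(customer_rows)


sales_db, crm_db = make_sources()
naive_result, naive_rows = naive_federation(sales_db, crm_db)
pushdown_result, pushdown_rows = with_pushdown(sales_db, crm_db)
print(naive_result == pushdown_result, naive_rows, pushdown_rows)
```

With this toy data, both plans return the same totals, but the naive plan moves 3,003 rows to the coordinator while the pushdown plan moves 6. The gap widens as the fact table grows, because the partial result depends on the number of customers, not the number of sales.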
The optimization of real-time queries is complemented by support for massively parallel processing (MPP) and caching/acceleration techniques, which are discussed next.
Misconception #4: Data Virtualization Does Not Use MPP Capabilities
Some descriptions of data virtualization claim that it uses a non-parallel approach for data processing. Let’s describe why this is wrong.
First, as discussed under Misconception #3, most of the query execution work is pushed down to the data sources. Also, as noted under Misconception #2, the data sources used by the Denodo Platform are typically specialized engines for analytical queries, which use MPP. Therefore, the work pushed down to these sources will benefit from it.
To combine data from several data sources, the Denodo Platform may still need to combine the partial calculations obtained from each one. For many queries, this process is too lightweight to benefit from MPP processing. For those of you who are familiar with how an MPP engine works, the role of the Denodo Platform in this context has some similarities with the coordinator in an MPP system, and it’s worth noting that the coordinator in those systems does not run in parallel. But for queries for which it does make sense, the Denodo Platform can use Presto, Impala, or Spark to benefit from MPP in this stage as well, as illustrated, for instance, here.
Misconception #5: Data Virtualization Must Retrieve All Data in Real Time
The default query execution mode used by the Denodo Platform is obtaining the required data in real time from the data sources. This will often perform well because, as we have seen in previous points, the Denodo optimizer pushes down most of the processing to data sources that are specialized in solving these types of queries. That is why this is the most common execution strategy used by Denodo customers.
However, advanced data virtualization platforms like the Denodo Platform also support additional execution methods to further improve performance and/or to deal with slow data sources:
- Denodo supports smart caching and query acceleration to provide sub-second query response times by caching/replicating small datasets, which are orders of magnitude smaller than the original data. Artificial intelligence techniques are also used to analyze past workloads and automatically recommend which specific data subsets can be cached/materialized to accelerate performance with minimal replication.
- When needed, the Denodo Platform can also replicate specific virtual datasets in full. This can be useful for specific cases, such as providing data scientists with a data copy they can modify and play with, without affecting the original data. The Denodo Platform enables this without needing to write any ETL code, and it keeps track of the lineage of the copies, so no ungoverned silos are created, even in those cases. Incremental refresh options are available to keep the copies up-to-date.
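The basic mechanics of caching a small result set in front of a slow source can be sketched as follows. This is a deliberately simple time-based cache written for illustration; it does not reflect how the Denodo Platform implements caching or incremental refresh, and `CachedView`, `ttl_seconds`, and the injected `clock` are hypothetical names chosen for this example.

```python
import time


class CachedView:
    """Serves repeated requests from a local copy until the copy expires."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch          # function that queries the slow source
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the example is deterministic
        self._entries = {}           # key -> (expires_at, rows)
        self.source_hits = 0         # how many times the source was actually queried

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh copy: the source is not touched
        self.source_hits += 1
        rows = self._fetch(key)
        self._entries[key] = (now + self._ttl, rows)
        return rows


# A fake clock lets us step time forward without sleeping.
fake_now = [0.0]
view = CachedView(
    fetch=lambda region: [(region, 42)],
    ttl_seconds=60,
    clock=lambda: fake_now[0],
)

first = view.get("EMEA")     # goes to the source
second = view.get("EMEA")    # served from the cached copy
fake_now[0] = 61.0
third = view.get("EMEA")     # copy expired, goes to the source again
```

Consumers call `get` the same way whether the data comes from the source or from the cached copy, which is the property that lets the replication decision change without affecting them.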
Modern data virtualization platforms like the Denodo Platform enable you to decide the best degree of data replication for each use case. Rather than replicating all of the data for each new use case as in traditional/monolithic approaches, you can decide between a range of options from zero replication, to partial, to full replication. And that decision is transparent to the data consumers, so you can change it at any time without affecting them.
Data virtualization adoption has accelerated in recent years, as the technology has become a key component of data analytics architectures such as the logical data warehouse, data fabric, and data mesh.
When evaluating the role that data virtualization can play in modern analytics architectures, it is crucial to avoid the trap of treating it as a synonym for old-style data federation.
Modern data virtualization platforms like the Denodo Platform provide unified semantics, security, and governance on top of multiple analytics systems, creating a data marketplace that enables business users and applications to access any data using any access technology. Such solutions use advanced optimizations to minimize network traffic even with extremely large datasets, they can leverage MPP, and they include advanced caching and acceleration powered by AI to provide sub-second query responses.
Don’t get fooled by “straw man” descriptions of data virtualization, and take a look at the real thing.
- Beware of “Straw Man” Stories: Clearing up Misconceptions about Data Virtualization - November 11, 2021