Some still think it makes sense to compare ETL tools with data virtualization servers. Let me be clear, it really doesn’t. Unquestionably, both tool categories belong to the class of data integration tools, like data replication tools, data wrangling tools, and enterprise services buses; but this doesn’t imply they should be compared in detail.
Two things always have at least a few characteristics in common by which they can be compared. We can compare elephants with trucks based on weight for example, and computers with toasters based on their power consumption. But what will such comparisons tell us? Sometimes it does make sense. For example, when loading airplanes, it must be important to know the weights of trucks and elephants, but you never compare the two to determine which one you need for a specific use case.
So, ETL tools and data virtualization can be compared based on common characteristics such as price, data filtering capabilities, resource consumption and so on, but it’s a waste of time, because the tool categories have different use cases. ETL tools are specialized in extracting data from some source systems, transforming and aggregating that data, and finally storing the result in some target database. In fact, that’s what the letters stand for: Extract, Transform, and Load. They lift and shift the data. Because of the copying, the data used for reporting is never the most up-to-date data. There is always a time delay between the moment the data is inserted in the source system and when it can be accessed by a report. ETL is very well suited to operate in an architecture in which this need exists to replicate data before it’s used.
The essence of data virtualization is abstraction, or call it decoupling. It’s about decoupling data producers and data stores. By doing this, changes made on the data production and data storage side won’t impact the reports, and vice versa. Abstraction leads to a more agile environment for delivering data to data consumers, let me give a few examples:
Centralization of specifications: Specifications required to extract, integrate, filter, aggregate, and calculate data are commonly spread across reporting tools, ETL programs, and database servers. This makes lineage and impact analysis complex. If we use data virtualization servers properly, most of these transformation specifications are defined in its repository. This centralized storage of specifications simplifies lineage and impact analysis, improves productivity and maintenance, and leads to more transparency.
Decrease of data latency: With ETL, if the data required for reporting and analysis must be transformed first, the transformed data must first be physically stored before it can be used resulting in an increase of data latency. With data virtualization the data can be transformed when the data is being used. There is no need to store a transformed copy of the data. It can be physically stored though, but that’s an option, it’s not mandatory as with ETL.
Database migration: In my previous blog, I indicated that with data virtualization database servers can be replaced easily without impacting the reports. Lately, so many new powerful data storage and data processing technologies have been introduced. To be able to leverage the power of these new products, a migration is required. Data virtualization simplifies such a migration, because it operates as an abstraction layer between the reports and the databases.
Evidently, data virtualization has characteristics comparable to ETL and there is overlap in functionality, which is they both belong to the data integration category. But if you compare two types of products, you probably haven’t specified your precise requirements clearly. Again, if you’re comparing elephants and cars in detail, you probably haven’t specified your use case in detail either.
Don’t see data virtualization as a replacement for ETL, because that’s not what it is; they can co-exist. If there is a need to replicate data, ETL is the preferred approach. When an abstraction layer is needed to increase the agility of a system, data virtualization is the right choice.
- The Data Lakehouse: Blending Data Warehouses and Data Lakes - April 21, 2022
- Use the Cloud More Creatively - January 28, 2022
- Data Minimization as Design Guideline for New Data Architectures - May 6, 2021