Comparing ETL with Data Virtualization Makes No Sense

Reading Time: 3 minutes

Some still think it makes sense to compare ETL tools with data virtualization servers. Let me be clear, it really doesn’t. Unquestionably, both tool categories belong to the class of data integration tools, like data replication tools, data wrangling tools, and enterprise services buses; but this doesn’t imply they should be compared in detail.

Two things always have at least a few characteristics in common by which they can be compared. We can compare elephants with trucks based on weight for example, and computers with toasters based on their power consumption. But what will such comparisons tell us? Sometimes it does make sense. For example, when loading airplanes, it must be important to know the weights of trucks and elephants, but you never compare the two to determine which one you need for a specific use case.

So, ETL tools and data virtualization can be compared based on common characteristics such as price, data filtering capabilities, resource consumption and so on, but it’s a waste of time, because the tool categories have different use cases. ETL tools are specialized in extracting data from some source systems, transforming and aggregating that data, and finally storing the result in some target database. In fact, that’s what the letters stand for: Extract, Transform, and Load. They lift and shift the data. Because of the copying, the data used for reporting is never the most up-to-date data. There is always a time delay between the moment the data is inserted in the source system and when it can be accessed by a report. ETL is very well suited to operate in an architecture in which this need exists to replicate data before it’s used.

The essence of data virtualization is abstraction, or call it decoupling. It’s about decoupling data producers and data stores. By doing this, changes made on the data production and data storage side won’t impact the reports, and vice versa. Abstraction leads to a more agile environment for delivering data to data consumers, let me give a few examples:

Centralization of specifications: Specifications required to extract, integrate, filter, aggregate, and calculate data are commonly spread across reporting tools, ETL programs, and database servers. This makes lineage and impact analysis complex. If we use data virtualization servers properly, most of these transformation specifications are defined in its repository. This centralized storage of specifications simplifies lineage and impact analysis, improves productivity and maintenance, and leads to more transparency.

Decrease of data latency: With ETL, if the data required for reporting and analysis must be transformed first, the transformed data must first be physically stored before it can be used resulting in an increase of data latency. With data virtualization the data can be transformed when the data is being used. There is no need to store a transformed copy of the data. It can be physically stored though, but that’s an option, it’s not mandatory as with ETL.

Database migration: In my previous blog, I indicated that with data virtualization database servers can be replaced easily without impacting the reports. Lately, so many new powerful data storage and data processing technologies have been introduced. To be able to leverage the power of these new products, a migration is required. Data virtualization simplifies such a migration, because it operates as an abstraction layer between the reports and the databases.

Evidently, data virtualization has characteristics comparable to ETL and there is overlap in functionality, which is they both belong to the data integration category. But if you compare two types of products, you probably haven’t specified your precise requirements clearly. Again, if you’re comparing elephants and cars in detail, you probably haven’t specified your use case in detail either.

Don’t see data virtualization as a replacement for ETL, because that’s not what it is; they can co-exist. If there is a need to replicate data, ETL is the preferred approach. When an abstraction layer is needed to increase the agility of a system, data virtualization is the right choice.

Author
Recent Posts

Rick F. van der Lans

Rick F. van der Lans is an independent analyst, consultant, author, and lecturer specializing in data warehousing, business Intelligence, data virtualization, and database technology. He is an internationally acclaimed speaker and has lectured worldwide for the last 25 years. His popular IT books have been translated into many languages and have sold over 100,000 copies. Last summer, Rick published a new book entitled “Data Virtualization for Business Intelligence Systems”. See www.r20.nl for more details.

Latest posts by Rick F. van der Lans (see all)

Metadata, the Neglected Stepchild of IT - December 8, 2022
Zero-Copy Architectures Are the Future - October 27, 2022
The Data Lakehouse: Blending Data Warehouses and Data Lakes - April 21, 2022

7 Comments


Ralf

May 9, 2019 at 7:54 am

Dear Rick,
interesting article. What does this mean for the concept of a ‘logical data warehouse’ which you introduced a couple of years ago?
Kind Regards,
Ralf
- Reply
  
  Rick van der Lans
  
  May 9, 2019 at 11:14 am
  
  Hi Ralf,
  In a logical data warehouse we may be using data virtualization technology and ETL technology. DV would be used for data abstraction and because of that it would be the heart of the logical data warehouse. In that same architecture ETL may be used, for example, to copy data from data sources to a datahub.
  Best regards,
  Rick van der Lans

Jeremy Vincent

May 30, 2019 at 1:41 pm

Rick, I think I understand DV but how is data sucked out of and placed back into legacy apps like SAP, ORACLE, Siemens Team Centre etc? (We have hundreds of these various legacy apps).
Is there a useful guide to the market of DV products?
The Decoupling effect is because data is taken from legacy systems on demand, correct? Or does the DV layer provide a form of cache?
How does such an architecture cope with massive peak input/output with the legacy platforms many of which were never designed/architected to work in such a manner.
Do DV products come with ADAPTERs or do these have to be custom developed?

thanks
- Reply
  
  Rick van der Lans
  
  June 3, 2019 at 6:52 am
  
  Hi Jeremy,
  Good questions. I have tried to answer them all:
  how is data sucked out of and placed back into legacy apps like SAP, ORACLE, Siemens Team Centre etc? (We have hundreds of these various legacy apps).
  >>> Through APIs or by retrieving data directtly from their databases. Note that DV products do not have adapters/connectors for every app developed on this planet, but commonly the well-known ones are supported. SDKs exist to develop adapters yourself.
  Is there a useful guide to the market of DV products?
  >>> Not that I am aware of, athough my seminars on this topic contain a lot of material on the more popular products.
  The Decoupling effect is because data is taken from legacy systems on demand, correct?
  >>> Yes, that’s one way of decoupling.
  Or does the DV layer provide a form of cache?
  >>> Yes, it does.
  How does such an architecture cope with massive peak input/output with the legacy platforms many of which were never designed/architected to work in such a manner.
  >>> DV is always restricted by what the underlying apps can do for you.
  Do DV products come with ADAPTERs or do these have to be custom developed?
  >>> Yes.
  - Reply
    
    jeremy
    
    June 5, 2019 at 10:32 am
    
    Thanks RIck – much appreciated.
    
    One final question if I may
    
    DV layers in full stack mobile/web development environments – is it a common practice? Is there a trend? It strikes me as a very very good option – to use a DV layer.
    
    In this way users can access Enterprise System (Legacy) via native user sign-ons to do the low level transactional, OLTP, record type stuff (what the app was originally intended to do) and the DV layer can ‘cache’ data to / from them allowing 2nd and 3rd order derivative apps (WEB/mobile) etc. to be built fully decoupled and independent from the heavyweight architectures/skill sets of the legacy apps themselves. Therefore making our dev team far, far quicker and agile.
    
    Is this a correct way to ‘see’ things. I am trying to get a bunch of people on board who have even less understanding of Enerprise Architecture than I do !
    thanks

Rick van der Lans

June 6, 2019 at 6:18 am

Hi Jeremy,
Yes, that would be a good use case for data virtualization. the DV layer would make live very easy for the developers.
- Reply
  
  jeremy
  
  June 19, 2019 at 8:43 am
  
  Thanks and greatly appreciated

Comparing ETL with Data Virtualization Makes No Sense

Related Posts

About me

7 Comments

Leave a Reply Cancel reply