Myths in Data Virtualization Performance
Update: Interested in data virtualization performance? Check out our most recent blog posts, “Performance of Data Virtualization in Logical Data Warehouse Scenarios” and “Physical vs Logical Data Warehouse Performance: The Numbers.”
The topic I probably get asked most about is Data Virtualization performance. I would like to address several common misconceptions on this subject.
Many of the remarks I often hear can be summarized in the following (false) statements:
- “In data intensive scenarios, Data Virtualization requires moving big amounts of data through the network”.
- “Moving data through the network is slow”.
- The (again false) corollary is that “Data Virtualization queries are slow in data-intensive scenarios” such as those involving Big Data.
Before we jump into details, here is a video which can provide you with a good overview about the topic:
The reality is that data-intensive scenarios require processing large amounts of data, but not necessarily transferring them. Actually, query execution can usually be organized in such a way that you only need to transfer the results of the data analysis from each involved data source. And, almost by definition, these results should be a short summary of the analyzed data, and much lower in size. Let me illustrate this idea with an example.
Let’s say you work for a big retailer which sells around ten thousand products. You have a Data Warehouse with sales information from the last year, stored in a facts table with, say, 200 million rows. Let’s also assume you periodically offload sales data corresponding to previous years to a Hadoop-based repository running Hive (or a similar system), and that you currently have information about one billion sales stored there.
Now let’s suppose you want to run a report to obtain the total amount of sales of each product in the last two years. This requires integrating product data with sales information from both the Data Warehouse and the Hadoop-based system. Notice the size of the final report is only ten thousand rows (one row for each product).
Common (and false) “wisdom” about Data Virtualization performance assumes that this report would be executed using the naive query plan illustrated in Figure 1: the Data Virtualization tool pushes down to the data sources the filtering conditions (sales in the last two years), retrieves all the rows verifying those conditions and then computes the report entirely by itself. Although pushing-down the filter conditions allows retrieval of “only” 150 million rows instead of the total of one billion rows from Hadoop, the overall plan still requires transferring roughly 350 million rows, so execution will probably be painfully slow.
Actually, it is true that many Data Virtualization tools will generate a naive execution plan like this! Nevertheless, a first-class Data Virtualization tool can do much better. Figure 2 illustrates the query plan Denodo would automatically generate to create the aforementioned report. Each data source computes the total amount of sales by product on its own data, and transfers the partial results to Denodo. Denodo then computes the final result for each product by adding the partial sales amount computed by each data source.
As you can see, we have gone from moving 350 million rows to roughly 20 thousand; only twice the size of the report results. This volume of records can be sent through modern Gigabit local networks in a matter of seconds. (By the way, it is worth checking this table, originally compiled by Jeff Dean of Jeff-Dean Facts fame, to remind us how fast networks have become in recent years, up to the point of surpassing disk read times in many cases).
Intelligent query rewritings like the one shown in this example are only one among many optimization techniques available in Denodo that can be used for virtualizing data-intensive scenarios, and which go far beyond the conventional query optimization techniques used in databases and less sophisticated Data Virtualization systems. Other examples are:
- Automatic Data Movement, which automatically transfers small amounts of data from one data source to a temporal table in another data source, to maximize query push-down to the target data source and minimize overall network traffic.
- Incremental Queries, which retrieve from the data source only the data changed since the last time the data was refreshed.
- … and many more!
Also recommended: Performance of Data Virtualization in Logical Data Warehouse Scenarios
- No Single Data Repository Can Be Your Silver Bullet - April 14, 2021
- Unlocking the Potential of Machine Learning in a Data Lake - March 27, 2019
- 4 Key Takeaways from the Gartner Magic Quadrant for Data Integration Tools - August 2, 2018