Spark and Data Virtualization

Apache Spark and data virtualization both enable us to extract data from heterogeneous data sources, integrate and process that data, and make it available to a wide range of reporting and analytics tools. Both do so without storing the extracted result before it can be processed and analyzed, and both support on-demand data integration. It’s therefore not surprising that the two are sometimes seen as competing technologies. But that is a common misconception.

They only look like competitors if we compare them at a very high level, not if we study them in more detail. Let’s look at some of the features they share, then examine how they differ, and finally explain how they can cooperate.

Spark and Data Virtualization: Features in Common

Apache Spark does support typical data virtualization features. For example, it can extract data from a wide range of sources, and data from multiple sources can be integrated easily. Like a data virtualization server, Spark presents all these heterogeneous data sources as one logical database. Spark also enables applications to access data using a variety of languages, including SQL, Java, R, Python, and Scala, so applications are not forced into a specific language or API.

Differentiated Features between Spark and Data Virtualization

However, data virtualization supports some key features that Spark does not. First, data virtualization supports many more query pushdown capabilities. Data virtualization servers try to push as much query processing as possible “into” the data source, which makes sense with powerful platforms such as Hadoop, Snowflake, and Teradata. Especially when a lot of data must be accessed to compute a small result, it is much more efficient to let the data source do most of the query processing: this minimizes network delay and exploits the full query power of the data source. How much of a query can be pushed down is determined by the capabilities of the data source.

Another feature not offered by Spark is distributed join optimization. Before data can be integrated, it must first be loaded into memory from the data sources; the join is then executed within Spark itself, and no distributed join optimization techniques are applied. This is one of the areas in which data virtualization servers excel, with join optimization techniques such as injection joins, ship joins, and parallel joins.

Spark works best when data is loaded in memory, but memory has its limitations. In this era of big data, it is not always practical to load all of the data into memory: extracting, transmitting, and loading it all can take a very long time. Sometimes big data is just too big to move, and it is better to push the processing to the source.

Spark also lacks some key features for managing the data environment. The first is view lineage. In both Spark and data virtualization servers, views can be defined to transform, join, filter, and aggregate data. When many of those views are defined, a mechanism is needed that shows how they are all linked together: the view lineage. This also helps users understand the impact of changing the definition of one of the underlying views. The second is a catalog: unlike data virtualization, Spark offers no catalog in which all the views and tables can be documented, defined, and tagged so that data can be easily searched and discovered.

Best of Both Worlds

Clearly, the two technologies have their differences, and they should not be considered competitors. It makes much more sense to regard them as cooperators. For example, data virtualization can use the in-memory power of Spark to temporarily cache data and speed up queries. Or Spark can serve as a data source to a data virtualization implementation: data can be streamed with Kafka into Spark, and that data can then be made available through data virtualization to support real-time BI dashboards. A data virtualization server can also act as a data source for Spark. This would enable Spark to access even more data sources and, maybe more importantly, to execute distributed joins. Data virtualization can also extend the number of data sources that Spark can access; for example, it would allow Spark to access non-relational data sources, such as SAP and Google Search.

It’s understandable that the two technologies are sometimes seen as competitors. But though they do have some overlapping features, it is more accurate to think of them as cooperating technologies that both benefit from being deployed together. They can augment each other to create an environment in which data can be unlocked, combined, and analyzed in any possible way more quickly and more easily than ever before. Combining Spark and data virtualization has a synergistic effect.

Rick F. van der Lans