Simplifying Big Data Integration with Data Virtualization
The strength of Hadoop and NoSQL systems in supporting big data systems is indisputable. These big data technologies are designed, built, and optimized specifically for large-scale and high-performance environments. Their support for massively parallel hardware architectures and cheap commodity hardware makes them the obvious choice for a wide range of big data use cases.
However, there is a price to pay with respect to productivity, complexity, query support, and proprietary interfaces.
NoSQL products, such as Apache HBase, CouchDB, MongoDB, and Riak, are ideal for developing transactional big data systems, whereas NoSQL graph servers, such as AllegroGraph, Neo4j, and VertexDB, are designed specifically for doing graph analytics on large and complex graph structures. All of them can handle a massive workload on big data. But they can’t support every big data use case imaginable. They are specialized database servers aimed at a limited set of use cases, in the same way that race cars are specialized cars for one use case: driving very fast. In this respect, this specialization distinguishes them from traditional database servers that have a more general purpose.
Few standards exist in the world of big data technology. For example, each NoSQL product speaks its own language, has its own API, and supports its own database concepts, although some similarities exist. These proprietary interfaces increase the heterogeneity even more and amplify the level of specialization.
Specialization implies that when organizations need several big data systems with different use cases, the “best” big data technology must be selected for each one separately, whatever “best” means: the best price/performance ratio or possibly the fastest option. Working with multiple “best options” leads to what is called polyglot persistency, in which different applications deploy different storage technologies.
Although there is a lot of good to say about polyglot persistency, it can result in an integration nightmare. Eventually, data from all these different systems has to be integrated for reporting and analytics, which requires applications and tools to deal with this multitude of APIs and languages.
This integration process can be simplified and accelerated by deploying data virtualization servers. They can hide the technical and proprietary aspects of the big data and traditional technologies, and they allow report developers and analysts to use their favorite APIs, such as SQL or REST. So, instead of forcing developers to deal with several technical APIs, data virtualization servers present this heterogeneous set of databases as one integrated database. This improves productivity, simplifies maintenance, creates independence of specific big data technologies, and improves the time to market for reports and analytics. In other words, they make working with big data technologies easier by hiding the polyglot persistent environment. They lower the technical adoption hurdle of big data systems and allow the existing skill set of current developers to be used.
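The idea of presenting several stores as one integrated database can be sketched in a few lines of Python. This is only a toy illustration, not how any particular data virtualization product works: `ORDER_DOCUMENTS`, `build_relational_source`, and `VirtualLayer` are hypothetical names, the "NoSQL" source is a plain list of dictionaries, and for brevity the facade copies rows into an in-memory SQLite database, whereas a real server would typically translate each query and push it down to the underlying sources.

```python
import sqlite3

# Hypothetical stand-in for a NoSQL document collection.
ORDER_DOCUMENTS = [
    {"order_id": 1, "customer_id": 10, "amount": 250.0},
    {"order_id": 2, "customer_id": 11, "amount": 75.5},
    {"order_id": 3, "customer_id": 10, "amount": 120.0},
]


def build_relational_source():
    """Hypothetical stand-in for a traditional SQL database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(10, "Acme"), (11, "Globex")]
    )
    conn.commit()
    return conn


class VirtualLayer:
    """Toy facade that exposes both sources as tables behind one SQL interface."""

    def __init__(self, documents, relational_conn):
        self._db = sqlite3.connect(":memory:")
        # Expose the document collection as a relational table.
        self._db.execute(
            "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
        )
        self._db.executemany(
            "INSERT INTO orders VALUES (:order_id, :customer_id, :amount)",
            documents,
        )
        # Expose the relational source under the same interface.
        self._db.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
        self._db.executemany(
            "INSERT INTO customers VALUES (?, ?)",
            relational_conn.execute("SELECT customer_id, name FROM customers"),
        )
        self._db.commit()

    def query(self, sql, params=()):
        """Run one SQL statement against the integrated view."""
        return self._db.execute(sql, params).fetchall()


layer = VirtualLayer(ORDER_DOCUMENTS, build_relational_source())
revenue_per_customer = layer.query(
    "SELECT c.name, SUM(o.amount) "
    "FROM orders o JOIN customers c ON c.customer_id = o.customer_id "
    "GROUP BY c.name ORDER BY c.name"
)
```

The report developer writes a single SQL join and never sees that the orders and the customers live in different kinds of stores.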
Besides making them easier to use, data virtualization servers can also speed up reporting and analytics on big data technologies. Most of these technologies are fast at processing transactions, but slow at reporting and analytics; it’s not their forte. With data virtualization, big data stored in, for example, NoSQL products can easily be copied and cached in a SQL database server designed and optimized for querying. Queries are automatically redirected from the NoSQL product to that SQL engine, dramatically improving query performance. Note that data virtualization servers automatically manage and refresh all the cached data.
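A minimal sketch of this caching and redirection, under stated assumptions: `CachedView` is a hypothetical class, the slow source is any callable returning `(event_id, kind, value)` tuples, SQLite plays the role of the query-optimized SQL engine, and a simple time-to-live decides when the cache is refreshed. Actual products use their own refresh policies.

```python
import sqlite3
import time


class CachedView:
    """Caches rows from a slow source in SQLite and answers queries from the cache."""

    def __init__(self, fetch_rows, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch_rows = fetch_rows  # callable returning (event_id, kind, value) tuples
        self._ttl = ttl_seconds
        self._clock = clock
        self._loaded_at = None
        self.refresh_count = 0
        self._db = sqlite3.connect(":memory:")
        self._db.execute(
            "CREATE TABLE events (event_id INTEGER, kind TEXT, value REAL)"
        )

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is not None and now - self._loaded_at < self._ttl:
            return  # cache is still fresh, do not touch the source
        with self._db:  # replace the cached rows in one transaction
            self._db.execute("DELETE FROM events")
            self._db.executemany(
                "INSERT INTO events VALUES (?, ?, ?)", self._fetch_rows()
            )
        self._loaded_at = now
        self.refresh_count += 1

    def query(self, sql, params=()):
        """Redirect the query to the SQL cache instead of the original source."""
        self._refresh_if_stale()
        return self._db.execute(sql, params).fetchall()
```

Every call to `query` is served by the SQL cache; the original source is contacted only when the cached copy is older than `ttl_seconds`.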
Another aspect that makes it worthwhile to place data virtualization in between big data stores and the reports is data security. The data security features supported by data virtualization servers are much more extensive and fine-grained than those of the NoSQL products. With data virtualization one integrated data security layer can be defined on top of the polyglot persistent environment.
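As an illustration of one security layer on top of several stores, the sketch below applies a per-role policy to rows regardless of where they came from. The roles, the `POLICIES` table, and the `secure_rows` function are all hypothetical and are not taken from any specific product; real servers offer far richer rules.

```python
# Hypothetical per-role policies: which columns are visible and which rows pass.
POLICIES = {
    "analyst": {
        "columns": {"region", "amount"},
        "row_filter": lambda row: True,
    },
    "eu_manager": {
        "columns": {"customer", "region", "amount"},
        "row_filter": lambda row: row["region"] == "EU",
    },
}


def secure_rows(role, rows):
    """Apply the role's row filter and column mask to rows from any source."""
    policy = POLICIES.get(role)
    if policy is None:
        raise PermissionError(f"no policy defined for role {role!r}")
    allowed = policy["columns"]
    return [
        {key: value for key, value in row.items() if key in allowed}
        for row in rows
        if policy["row_filter"](row)
    ]


# Rows as they might arrive from two different underlying stores.
rows_from_nosql = [{"customer": "Acme", "region": "EU", "amount": 250.0}]
rows_from_sql = [{"customer": "Globex", "region": "US", "amount": 75.5}]
all_rows = rows_from_nosql + rows_from_sql

analyst_view = secure_rows("analyst", all_rows)
manager_view = secure_rows("eu_manager", all_rows)
```

Because the policy is enforced in one place, it does not matter which store a row originated from or which security features that store offers.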
To summarize, data virtualization servers strengthen the use of big data technologies. The latter group consists of fast but highly specialized database servers. Each of them excels at supporting a small set of use cases. And where they are weak, data virtualization servers can fill in the gaps. They can simplify the use of big data, ease the integration of multiple big data technologies, improve query performance, and enrich the data security level.