Avoiding the Swamp: Data Virtualization and Data Lakes
Over the last couple of years, successful deployments in large companies have established Hadoop as the paradigm for Big Data scenarios. One of the most popular architectures in this new ecosystem is the so-called “data lake”.
In short, the idea of a data lake is to store large volumes of data from multiple locations in an inexpensive Hadoop cluster, taking advantage of relational technologies like Hive, Impala or HAWQ. Data does not go through the heavy transformations and modeling involved in defining a data warehouse. Instead, data is usually replicated, preserving the format of its original sources.
Data lakes are a great solution for some scenarios, but also have some inherent problems. I recommend taking a look at this Gartner article about the “Data Lake Fallacy” or just Google “data swamp” to get an idea of some misuses of the data lake concept.
Some of the key ingredients of the data lake will also sound familiar to data virtualization enthusiasts:
- Access to disparate data using a single layer
- Preservation of the format and structure of the original sources
However, unlike data virtualization, a data lake is by definition a data replication solution. As we will explore, this is the root of some problems raised by data lake critics. Let’s highlight some aspects that make the two approaches different, as well as benefits of a combined scenario where data virtualization works together with a data lake, offering the best of both worlds.
In a data lake architecture you will create a copy of your data. Data virtualization is focused on accessing data directly from the sources. This has several implications:
- Governance aspects: When data is copied, control is lost.
  - Changes in one place may not be reflected in the other, which can lead to inconsistent results. Some data is especially sensitive to this, for example golden records and other MDM information.
  - Lineage information is lost. Tracing where information originated and which transformations were applied becomes more complicated.
- Stale data: Copied data will not be up to date.
  - Workarounds include Change Data Capture (CDC) techniques and tools like Kafka for a Hadoop ecosystem, but these solutions add considerable complexity to the system.
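The stale-data point can be illustrated with a minimal sketch. It assumes a single in-memory SQLite database standing in for both the source system and the lake; the table and view names (`source_customers`, `lake_customers`, `virtual_customers`) are hypothetical and chosen only for illustration. A copied table plays the role of the replicated data, and a plain SQL view plays the role of a virtual layer that reads the source at query time.

```python
import sqlite3

# Assumption: one in-memory database stands in for the source system and the lake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO source_customers VALUES (1, 'old@example.com')")

# Replication: the "lake" holds a physical copy taken when the load job ran.
conn.execute("CREATE TABLE lake_customers AS SELECT * FROM source_customers")

# Virtualization: the view stores only the query, not the data.
conn.execute("CREATE VIEW virtual_customers AS SELECT * FROM source_customers")

# The source changes after the load job has finished.
conn.execute("UPDATE source_customers SET email = 'new@example.com' WHERE id = 1")

lake_email = conn.execute(
    "SELECT email FROM lake_customers WHERE id = 1"
).fetchone()[0]
virtual_email = conn.execute(
    "SELECT email FROM virtual_customers WHERE id = 1"
).fetchone()[0]

print(lake_email)     # old@example.com (stale copy)
print(virtual_email)  # new@example.com (read from the source at query time)
```

The copy stays stale until something reloads it or propagates the change, which is the gap that CDC pipelines are meant to close.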
One of the main values of data virtualization is the agility to react to change. For example:
- Adding a new data source:
  - In data virtualization, this just means defining a new data source connection. It is a matter of minutes before you can start using that new data.
  - In a persisted approach like a data lake, you have to set up the load jobs using external tools, and then move the data for the entire schema before running your first query. This can take from hours to days.
- Changing a transformation:
  - In data virtualization, a transformation is just a rule that is applied to data at execution time, so a change takes effect immediately.
  - In a persisted approach, you have to modify your load jobs and run them again to replace the existing data.
- Modifications of structure in the sources:
  - A new column added to a source can be detected immediately by a data virtualization solution.
  - Again, persisted approaches need to reload the data.
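The "changing a transformation" case above can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular data virtualization product is implemented: an in-memory SQLite database stands in for the source, a SQL view stands in for the virtual layer, and the helper names (`define_virtual_view`, `run_load_job`, `amounts`) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
conn.executemany("INSERT INTO source_orders VALUES (?, ?)", [(1, 1250), (2, 400)])


def define_virtual_view(expr):
    """Virtual approach: store only the rule; it is applied at query time."""
    conn.execute("DROP VIEW IF EXISTS orders_virtual")
    conn.execute(
        f"CREATE VIEW orders_virtual AS SELECT id, {expr} AS amount FROM source_orders"
    )


def run_load_job(expr):
    """Persisted approach: apply the rule once and store the result."""
    conn.execute("DROP TABLE IF EXISTS orders_copy")
    conn.execute(
        f"CREATE TABLE orders_copy AS SELECT id, {expr} AS amount FROM source_orders"
    )


def amounts(relation):
    """Return the amount column of a table or view, ordered by id."""
    query = f"SELECT amount FROM {relation} ORDER BY id"
    return [row[0] for row in conn.execute(query)]


# Initial rule: expose the raw value in cents.
define_virtual_view("amount_cents")
run_load_job("amount_cents")

# The rule changes: expose the amount in currency units instead.
new_rule = "amount_cents / 100.0"
define_virtual_view(new_rule)

virtual_after_change = amounts("orders_virtual")  # new rule already in effect
copy_before_reload = amounts("orders_copy")       # still produced by the old rule

run_load_job(new_rule)                            # the copy has to be rebuilt
copy_after_reload = amounts("orders_copy")

print(virtual_after_change)  # [12.5, 4.0]
print(copy_before_reload)    # [1250, 400]
print(copy_after_reload)     # [12.5, 4.0]
```

With two rows the reload is instantaneous; the cost in a real persisted approach comes from re-running the job over the full data volume.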
The security model of the Hadoop ecosystem is still a bit rudimentary compared to traditional RDBMS security models, which data virtualization follows. Furthermore, the option to pass credentials through to the sources at execution time allows data virtualization to leverage existing security infrastructures.
Global Cost of Operation
Managing a Hadoop cluster is a complex task, made more complex if you add other components like Kafka to the mix. A virtualized approach is inherently easier to manage and operate.
Data lakes are a great approach to deal with some analytics scenarios. Storage in the Hadoop cluster is cheap compared with large MPPs, and new technologies like Spark or Impala add the speed of processing that the first wave of Hadoop solutions didn’t have.
However, as highlighted above, they also have significant limitations. In that sense, data lakes are no different from other persisted approaches, like data marts and warehouses, that are based on ETL-like processes. Data virtualization was designed with those aspects in mind, as a more agile method for data integration.
In complex enterprise scenarios, both technologies work well together, addressing different problems and concerns with different tools. As an example, a virtual layer can be used to combine data from the data lake (where heavy processing of large datasets is pushed down) with golden records from the MDM, which are more sensitive to stale copies. The advanced optimizers of modern data virtualization tools like Denodo make sure that processing is done where it is most convenient, leveraging existing hardware and processing power in a way that is transparent to the end user. Security and governance in the virtual layer also add significant value to the combined solution.
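The combined scenario can be sketched as follows. This is a toy illustration and does not represent Denodo's optimizer or API: two separate in-memory SQLite databases stand in for the data lake and the MDM system, and the function `page_views_by_customer` plays the role of a hypothetical virtual view. The aggregation over the large table runs inside the "lake", so only one row per customer crosses the wire, while the golden records are read from the "MDM" at query time rather than copied.

```python
import sqlite3

# Assumption: two independent systems, each stood in for by its own database.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE clickstream (customer_id INTEGER, page TEXT)")
lake.executemany(
    "INSERT INTO clickstream VALUES (?, ?)",
    [(1, "/home"), (1, "/pricing"), (2, "/home"), (1, "/docs")],
)

mdm = sqlite3.connect(":memory:")
mdm.execute("CREATE TABLE golden_customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
mdm.executemany(
    "INSERT INTO golden_customers VALUES (?, ?)",
    [(1, "Acme Corp"), (2, "Globex")],
)


def page_views_by_customer():
    """Hypothetical virtual view combining both systems at query time."""
    # Pushed down: the lake aggregates the large table and returns
    # a single row per customer.
    counts = lake.execute(
        "SELECT customer_id, COUNT(*) FROM clickstream GROUP BY customer_id"
    ).fetchall()
    # Read live from the MDM: no copy of the golden records is kept.
    names = dict(
        mdm.execute("SELECT customer_id, name FROM golden_customers").fetchall()
    )
    # The virtual layer only joins two small result sets.
    return sorted((names[customer_id], views) for customer_id, views in counts)


result = page_views_by_customer()
print(result)  # [('Acme Corp', 3), ('Globex', 1)]

# A correction to a golden record is visible on the next query.
mdm.execute("UPDATE golden_customers SET name = 'Acme Corporation' WHERE customer_id = 1")
result_after_update = page_views_by_customer()
print(result_after_update)  # [('Acme Corporation', 3), ('Globex', 1)]
```

Here the decision about where each step runs is hard-coded; in a real virtual layer that choice would be made by the query optimizer.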
In short, a Hadoop cluster, a data warehouse and a data virtualization layer play different roles that complement each other in a global enterprise data architecture.