Rethinking the Data Lake with Data Virtualization
In this post, I will introduce the idea of the logical data lake, a logical architecture in which a physical data lake augments its capabilities by working in tandem with a virtual layer. In subsequent posts in this series, I’ll cover architecting the logical data lake, the logical data lake for data scientists, and the logical data lake for business users.
Introducing the Logical Data Lake
The world of big data is like a crazy rollercoaster ride. Hadoop distributions have grown in complexity over the years; currently, the maturity and number of projects in the Hadoop ecosystem cover the needs of a comprehensive list of use cases. Gartner predicts, however, that Hadoop distributions will not make it to the plateau of productivity.
At the same time, new offerings by major cloud vendors blend the concepts of SaaS with big data. For example, the lines that distinguish HDFS, Amazon S3, and Azure Data Lake Storage are becoming finer. Next-generation cloud MPPs like Snowflake and Redshift are almost indistinguishable from SQL-on-Hadoop systems like Spark or Presto (think Qubole or Databricks, to name a few). Clearly, we live in interesting times for data management.
But in the midst of this constantly evolving world, there is one concept in particular that is at the center of most discussions: the data lake. This is a place where all data can be found, with almost infinite storage and massive processing power. However, despite their clear benefits, data lakes have been plagued by criticism. See, for example, these articles from Gartner (2014), Forbes (2016), and concepts like “data swamps,” to understand some of the challenges with data lakes.
Let’s review three of those challenges:
- The principle of “load first, ask later.” Good governance is key to a usable data lake, but this strategy can easily lead to an ungoverned data lake, with multiple uncontrolled copies of the same data, stale versions, and unused tables. Additionally, due to data restrictions and local laws, not all data can be replicated into the lake.
- High expectations of raw data. It is usually a mistake to create a data lake and define the data pipelines that feed it before determining the expected results and benefits. It is best to begin by asking what data should go in the lake, for what purpose, and with what granularity.
- Complexity and talent. Managing an on-premises Hadoop cluster and fine-tuning a cloud-based system are both complex tasks. When these tasks end up in the hands of non-technical business users who were promised unparalleled power and infinite storage, you have a recipe for disaster. Acquiring the right talent to successfully operate and use the cluster is difficult and costly. In many cases, data lakes are only used by data scientists, which reduces their potential.
These challenges affect data lake ROI, delaying projects, limiting their value, increasing their operational costs, and leading to frustration due to the initially high expectations.
Data Virtualization and Data Lakes
Data virtualization can overcome each of these challenges. In fact, data virtualization shares many ideas with data lakes, as both architectures begin with the premise of making all data available to end users. In both architectures, the broad access to large data volumes is used to better support BI, analytics, and other evolving trends like machine learning (ML) and AI. However, the implementation details of these two approaches are radically different.
The idea to combine both approaches was first described by Mark Beyer from Gartner in 2012 and has gained traction in recent years as a way to minimize the drawbacks of fully persisted architectures. The logical data lake is a mixed approach centered on a physical data lake with a virtual layer on top, which offers many advantages.
The premises of a logical data lake are simple:
- It uses a logical approach to provide access to all data assets, regardless of location and format, without replication. Copying data becomes an option, not a necessity.
- It allows for the definition of complex, derived models that use data from any of the connected systems, keeping track of their lineage, transformations, and definitions.
- It is centered around a big data system (the physical data lake), and it can leverage its processing power and storage capabilities in a smarter way.
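To make these premises a little more concrete, here is a minimal Python sketch of what a virtual layer does conceptually. It is not how the Denodo Platform or any real product is implemented; the `VirtualLayer` class, its methods, and the source names `crm.customers` and `lake.orders` are all hypothetical, and in-memory lists stand in for the connected systems. The point is only that base views read data in place, derived views are definitions rather than copies, and lineage can be traced back to the sources.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Sequence

Row = Dict[str, object]


@dataclass
class View:
    """A named view: a function that produces rows, plus the views it depends on."""
    name: str
    fetch: Callable[[], Iterable[Row]]
    depends_on: List[str] = field(default_factory=list)


class VirtualLayer:
    """Hypothetical, minimal virtual layer: every view is resolved on demand."""

    def __init__(self) -> None:
        self._views: Dict[str, View] = {}

    def connect_source(self, name: str, fetch: Callable[[], Iterable[Row]]) -> None:
        # A base view reads from the source in place; nothing is copied up front.
        self._views[name] = View(name, fetch)

    def define_view(self, name: str, depends_on: Sequence[str],
                    combine: Callable[..., Iterable[Row]]) -> None:
        # A derived view is only a definition; rows are computed when queried.
        def fetch() -> Iterable[Row]:
            inputs = [self.query(dep) for dep in depends_on]
            return combine(*inputs)

        self._views[name] = View(name, fetch, list(depends_on))

    def query(self, name: str) -> List[Row]:
        return list(self._views[name].fetch())

    def lineage(self, name: str) -> List[str]:
        # Walk the dependencies down to the base sources.
        view = self._views[name]
        if not view.depends_on:
            return [name]
        sources: List[str] = []
        for dep in view.depends_on:
            for src in self.lineage(dep):
                if src not in sources:
                    sources.append(src)
        return sources


# Stand-ins for two connected systems (assumed data, for illustration only).
crm_rows = [{"customer_id": 1, "name": "Acme"}, {"customer_id": 2, "name": "Globex"}]
lake_rows = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 75.0},
]


def revenue_by_customer(customers: List[Row], orders: List[Row]) -> List[Row]:
    totals: Dict[object, float] = {}
    for order in orders:
        key = order["customer_id"]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return [{"name": c["name"], "revenue": totals.get(c["customer_id"], 0.0)}
            for c in customers]


layer = VirtualLayer()
layer.connect_source("crm.customers", lambda: crm_rows)
layer.connect_source("lake.orders", lambda: lake_rows)
layer.define_view("revenue_by_customer",
                  ["crm.customers", "lake.orders"], revenue_by_customer)

result = layer.query("revenue_by_customer")
sources = layer.lineage("revenue_by_customer")
```

Here, `result` combines data from both systems at query time, and `sources` lists the systems the derived view draws on. A real virtualization engine would add query optimization and pushdown to the sources, which this sketch deliberately leaves out.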
For more information on logical data lakes, see this detailed paper by Rick Van der Lans (April 2018), from R20 Consulting; watch this webinar by Philip Russom (June 2017), from TDWI; or read this “Technical Professional Advice” paper by Henry Cook from Gartner (April 2018).
These capabilities are fundamental to understanding how a logical data lake can address the major drawbacks of traditional data lakes, and overcome the previously mentioned challenges:
- Load first, ask later. With a logical architecture, data is not necessarily persisted, but connected. This means that access to most data does not require any initial investment to bring data into the logical data lake.
o The data virtualization layer can access data in its original location, which means multiple copies of the same data are not necessary.
o In cases when direct source access is not optimal for performance reasons, data virtualization technologies like the Denodo Platform can easily load the data into the physical lake, making the transition completely seamless.
o A similar approach can be taken for higher level cleansing and transformations. If needed, data can be easily persisted. Otherwise, Denodo’s engine will orchestrate the calculation on-demand. This will leverage the data lake engine as a sort of ETL process, or more accurately, an ELT process.
- High expectations of raw data. In a logical system, raw data can stay in the original sources, and only useful data needs to be brought into the system. Data can be curated, transformed, aggregated, and combined within the logical model so that only the parts that are needed are eventually persisted into the data lake.
o This is a typical approach for external data, especially coming from SaaS applications, and for specialized time series stores, where the useful level of granularity is usually coarser than the raw data (e.g., aggregations by minute or by hour).
o In addition, for data that doesn’t have storage at the source (e.g. devices, sensors), streaming data to the physical data lake for persistence is the best option. Edge computing techniques and streaming analytics can reduce the amount of data that needs to be stored. Adding a virtual layer to the architecture doesn’t force you to access all data from the sources; it’s an additional tool to use only when it makes sense.
- Complexity and talent. With a virtual layer, business users don’t need to interact with the back-end system directly.
o The virtual layer offers a simpler-to-use SQL engine that abstracts the complexities of the backend for reads, data uploads, and the processing of complex queries.
o In addition, a user-friendly data catalog enables easy access to the data model, data lineage, descriptions, and data previews.
o Requests to IT are reduced, and usage of the data lake is broadened to a larger non-technical audience.
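The first two points in the list above (compute on demand, persist only when it helps, and keep only the useful granularity) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Denodo's engine: `OnDemandView`, `aggregate_by_minute`, and `read_sensor_source` are hypothetical names, the sensor readings are made up, and an in-memory list stands in for the physical lake.

```python
from collections import defaultdict
from datetime import datetime
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Reading = Tuple[datetime, float]


def aggregate_by_minute(readings: Iterable[Reading]) -> List[Reading]:
    """Average raw readings into one row per minute."""
    buckets: Dict[datetime, List[float]] = defaultdict(list)
    for ts, value in readings:
        buckets[ts.replace(second=0, microsecond=0)].append(value)
    return [(minute, sum(vals) / len(vals)) for minute, vals in sorted(buckets.items())]


class OnDemandView:
    """Hypothetical view: computed from the source unless a persisted copy exists."""

    def __init__(self, compute: Callable[[], List[Reading]]) -> None:
        self._compute = compute
        self._persisted: Optional[List[Reading]] = None

    def persist(self) -> None:
        # Stand-in for loading the result into the physical lake.
        self._persisted = list(self._compute())

    def rows(self) -> List[Reading]:
        # Consumers call rows() the same way in both cases.
        if self._persisted is not None:
            return list(self._persisted)
        return list(self._compute())


# Made-up raw readings and a counter that tracks how often the source is hit.
raw = [
    (datetime(2018, 4, 1, 12, 0, 5), 10.0),
    (datetime(2018, 4, 1, 12, 0, 45), 20.0),
    (datetime(2018, 4, 1, 12, 1, 10), 30.0),
]
calls = {"n": 0}


def read_sensor_source() -> List[Reading]:
    calls["n"] += 1
    return raw


per_minute = OnDemandView(lambda: aggregate_by_minute(read_sensor_source()))

on_demand = per_minute.rows()   # computed from the source at query time
per_minute.persist()            # one more source read to materialize the result
persisted = per_minute.rows()   # served from the persisted copy, no source read
```

The consumer code does not change when the view is persisted, which is the sense in which the transition is seamless, and only the per-minute aggregates (not the raw readings) end up stored.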
As we can see, a logical data lake can shorten development cycles and reduce operational costs when compared to a traditional physical lake. It also helps to broaden adoption, increasing the ROI of the data lake investment.
Here are links to two stories of companies that have successfully implemented logical data lakes:
- Managing Oil Production, Pricing and Distribution with Data Virtualization: Anadarko Petroleum centralizes data-driven insight generation by having data scientists and business users access information using data virtualization. The effort drove an increase in oil production, pricing, and distribution, and improved fleet management.
- Simplified Data Management with Hadoop and Data Virtualization: Vizient leveraged data virtualization to enable business users to discover data using the familiar SQL, abstracting their direct access to Hadoop.
But how does a logical data lake deal with large data volumes? How is it configured and used? We will get into those details in the next post in this series.