“Le roi est mort, vive le roi.” This phrase—the king is dead, long live the king—marks the transfer of power from one monarch to the next. In times of potentially troublesome change, the apparent paradox and inner poetry of these words promise certainty and continuity. The phrase “la reine est morte, vive le roi” (when a king takes over from a queen, as recently experienced in the UK) lacks the same rhythm and authority. This past year also feels like a time of major upheaval in decision-making support. “The data warehouse is dead, long live the data lakehouse!” proclaim certain vendors. I say it’s time we revamped the ending: “Long live the data warehouse!”
For almost a decade, the data warehouse was battered by big data and the data lake—at least in marketing terms—and survived. It now appears threatened once again, this time by cloud, the internet of things (IoT), and the current business belief that all decision making must be nearly instantaneous.
These trends have led to the emergence of two distinct architectural design patterns (ADPs) that claim to replace the warehouse. The data lakehouse is said to take the best of the data lake and data warehouse and combine them in a cloud-based environment, and that is the topic of this post. The second pattern, data mesh, aims to displace the warehouse entirely—at least in its most extreme form. I will unpack this claim in an upcoming post. Data fabric is sometimes also promoted as a data warehouse killer. However, a closer examination shows that it is actually a modern reprise of the well-accepted logical data warehouse pattern, of which data virtualization is a key part.
One More Time: What Is a Data Warehouse and Why Do I Need One?
A data warehouse is an architectural design pattern. It is more than a piece of software, however large or complex. It is more than a combination of software packages, although many major pieces are needed—such as a relational database and metadata tools; population tools, including those for ETL (extract, transform, and load) and streaming; real-time access via data virtualization; and data access and manipulation software. Most importantly, a data warehouse requires data governance and management techniques, design and operations methodologies, and supporting organizational structures.
My definition of a data warehouse has evolved since I first described it in the 1980s and published the first architecture in 1988. I define a data warehouse as a system to create a semantically and temporally consistent and reconciled, cross-enterprise, logical/physical set of business data (schema on write) from multiple, disparate sources, and make that data available to businesspeople for analytics and decision making in a readily understood and usable (self-service) manner.
Note that this definition calls out some challenging tradeoffs. Semantic and temporal consistency can never be fully achieved simultaneously in the real world due to the laws of physics and the contrariness of organizations. Logical data sets always require some physical instantiation(s) despite the cloud’s confusing terminology. Business understanding of data is far from consistent across different departments and uses, implying that the long sought-after vision of a single version of the truth must be balanced against multiple visions of reality across the business.
Such tradeoffs have made it exceedingly difficult to build complete, successful data warehouses in large enterprises. Such failures have, in turn, led to the drive to define new ADPs that can address these underlying problems. Unfortunately, the thinking underpinning these ADPs is often too firmly embedded in technological issues and solutions, a problem that is very clear in the data lakehouse.
So, What Is a Data Lakehouse?
Beneath the somewhat metaphorical definition of a lakehouse as implementing the best of lake and warehouse lies a conveniently circumscribed understanding of both the warehouse definition above and what the data lake has become in the 2020s. The initial data lake of a decade ago was little more than a data dump: raw, unprocessed, “unstructured” data stored in any format you liked, with minimal governance and with the informational context created only at query/analysis time, an approach known as schema on read. The original data lake was diametrically opposed to the warehouse: there can literally be no best of both worlds in this case.
However, the data lake has evolved into a cloud-based store of largely semi-structured and structured data (think CSV files) arriving from external IoT sources and clickstream-like data from e-commerce sites in real time through streaming interfaces, often requiring early consolidation with internally sourced data. This scenario demands significant data management, data reconciliation (both semantic and temporal), well-defined metadata, and a schema-on-write model. These “modern” data lake requirements do require adherence to data warehouse principles and the use of relational database and data management software that is still relatively immature in the cloud. The best-of-both-worlds argument in this case is essentially the data management of the warehouse combined with the timeliness, elasticity, and cost-effectiveness of cloud-based lakes. For a deeper dive into the defined characteristics of the lakehouse, please see my November 2021 post on the Data Virtualization Blog.
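To make the contrast between schema on read and schema on write concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample CSV feed, the `SCHEMA` mapping, and all function names are hypothetical and are not taken from any lakehouse or warehouse product.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw IoT feed as it might land in a lake; one reading is malformed.
RAW_CSV = """device_id,reading,recorded_at
d-1,21.5,2022-10-18T09:00:00
d-2,not-a-number,2022-10-18T09:00:05
d-3,19.0,2022-10-18T09:00:10
"""

# Hypothetical schema: column name -> parser that raises ValueError on bad input.
SCHEMA = {
    "device_id": str,
    "reading": float,
    "recorded_at": datetime.fromisoformat,
}


def parse_row(row):
    """Apply SCHEMA to one raw row; raises ValueError if a value does not fit."""
    return {name: parse(row[name]) for name, parse in SCHEMA.items()}


def load_schema_on_write(raw_text):
    """Validate at load time: conforming rows are stored typed, others are rejected."""
    stored, rejected = [], []
    for row in csv.DictReader(io.StringIO(raw_text)):
        try:
            stored.append(parse_row(row))
        except ValueError:
            rejected.append(row)
    return stored, rejected


def load_schema_on_read(raw_text):
    """Store every row as raw strings; no structure is imposed at load time."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def average_reading_on_read(raw_rows):
    """Each query interprets the raw strings itself and decides what to skip."""
    values = []
    for row in raw_rows:
        try:
            values.append(float(row["reading"]))
        except ValueError:
            continue  # this query's own choice; another query might decide differently
    return sum(values) / len(values)


stored, rejected = load_schema_on_write(RAW_CSV)
raw_rows = load_schema_on_read(RAW_CSV)
```

In the schema-on-write path, the decision about the malformed reading is made once, at load time, and every later consumer sees the same typed rows. In the schema-on-read path, all three rows are kept and each query has to repeat that decision, which is where inconsistent answers across the business can creep in.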
Is a Data Lakehouse Really a Data Warehouse?
The short answer is “yes,” at least in terms of technical functionality. Simply put, a data lakehouse uses cloud object storage as its data store and builds database, data preparation, transactional processing, and data access functions (all components of data warehousing) on top of that store, using the elasticity and compute-storage separation principles of the cloud (the data lake side).
Bill Inmon, the erstwhile father of the data warehouse, has even repurposed the title and much of the thinking of his 1992 book, Building the Data Warehouse, to write Building the Data Lakehouse in 2021. In it, the lakehouse is described almost entirely in terms of legacy warehouse function that has been extended using cloud concepts and open formats to handle textual and analog/IoT-sourced data. He writes, “The unique ability of the lakehouse [is] to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse…” This sentence alone and, indeed, the book in its entirety—lacking as it is in much technical or architectural depth—serve to emphasize that the defining characteristics of a data lakehouse revolve around data management, governance, and organizational issues. The same characteristics as a data warehouse.
The longer answer to the above question is “not yet.” The data lakehouse is probably in about the same evolutionary stage as data warehousing was in 1992. The focus of lakehouse vendors is strongly on the required underlying software function to ingest, clean, store, integrate, manage, and access data in the cloud environment as reliably and efficiently as traditional relational database technology has been able to for many years. If lakehouse implementers learn quickly and well from the experience of data warehouse developers over the past 35 years, what the data lakehouse will deliver will be nothing more than a good data warehouse built on cloud technology. By my reckoning, and that of Gartner’s latest Hype Cycle for Data Management, that could take from two to five years.
In the meantime, vive le data warehouse!