“Le roi est mort, vive le roi.” This phrase—the king is dead, long live the king—marks the transfer of power from one monarch to the next. In times of potentially troublesome change, the apparent paradox and inner poetry of these words promise certainty and continuity. The phrase “la reine est morte, vive le roi” (when a king takes over from a queen, as recently experienced in the UK) lacks the same rhythm and authority. This past year also feels like a time of major upheaval in decision-making support. “The data warehouse is dead, long live the data lakehouse!” proclaim certain vendors. I say it’s time we revamped the ending: “Long live the data warehouse!”
For almost a decade, the data warehouse was battered by big data and the data lake—at least in marketing terms—and survived. It now appears threatened once again, this time by cloud, the internet of things (IoT), and the current business belief that all decision making must be nearly instantaneous.
These trends have led to the emergence of two distinct architectural design patterns (ADPs) that claim to replace the warehouse. The data lakehouse is said to take the best of the data lake and data warehouse and combine them in a cloud-based environment, and that is the topic of this post. The second pattern, data mesh, aims to displace the warehouse entirely—at least in its most extreme form. I will unpack this claim in an upcoming post. Data fabric is sometimes also promoted as a data warehouse killer. However, a closer examination shows that it is actually a modern reprise of the well-accepted logical data warehouse pattern, of which data virtualization is a key part.
One More Time: What Is a Data Warehouse and Why Do I Need One?
A data warehouse is an architectural design pattern. It is more than a piece of software, however large or complex. It is more than a combination of software packages, although many major pieces are needed—such as a relational database and metadata tools; population tools, including those for ETL (extract, transform, and load) and streaming; real-time access via data virtualization; and data access and manipulation software. Most importantly, a data warehouse requires data governance and management techniques, design and operations methodologies, and supporting organizational structures.
My definition of data warehouse has evolved since I first described it in the 1980s and published the first architecture in 1988. I define a data warehouse as a system to create a semantically and temporally consistent and reconciled, cross-enterprise, logical/physical set of business data (schema on write) from multiple, disparate sources, and make that data available to businesspeople for analytics and decision making in a readily understood and usable (self-service) manner.
Note that this definition calls out some challenging tradeoffs. Semantic and temporal consistency can never be fully achieved simultaneously in the real world due to the laws of physics and the contrariness of organizations. Logical data sets always require some physical instantiation(s) despite the cloud’s confusing terminology. Business understanding of data is far from consistent across different departments and uses, implying that the long sought-after vision of a single version of the truth must be balanced against multiple visions of reality across the business.
Such tradeoffs have made it exceedingly difficult to build complete, successful data warehouses in large enterprises. Such failures have, in turn, led to the drive to define new ADPs that can address these underlying problems. Unfortunately, the thinking underpinning these ADPs is often too firmly embedded in technological issues and solutions, a problem that is very clear in the data lakehouse.
So, What Is a Data Lakehouse?
Beneath the somewhat metaphorical definition of a lakehouse as implementing the best of lake and warehouse lies some conveniently circumscribed understanding of the warehouse definition above and what the data lake has become in the 2020s. The initial data lake of a decade ago was little more than a data dump: raw, unprocessed, “unstructured” data stored in any format you liked with minimal governance, and the ability to create the informational context at query/analysis time, known as schema on read. The original data lake was diametrically opposed to the warehouse: there can literally be no best of both worlds in this case.
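The schema-on-read versus schema-on-write distinction above can be sketched in a few lines of Python. This is an illustrative sketch only; the schema, column names, and sample data are invented for the example:

```python
import csv
import io

# Hypothetical warehouse-style schema: types are enforced when data is written.
SCHEMA = {"order_id": int, "amount": float, "region": str}

def schema_on_write(raw_rows):
    """Validate and convert each record against the schema at load time.
    Bad rows are rejected before they ever reach the store."""
    table, rejects = [], []
    for row in raw_rows:
        try:
            table.append({col: typ(row[col]) for col, typ in SCHEMA.items()})
        except (KeyError, ValueError):
            rejects.append(row)
    return table, rejects

def schema_on_read(raw_text, parse):
    """Lake-style alternative: store the raw text untouched; each query
    supplies its own parser, deferring interpretation (and failure) to
    analysis time."""
    return [parse(line) for line in raw_text.splitlines()]

raw = "order_id,amount,region\n1,19.99,EMEA\n2,oops,APAC\n"
rows = list(csv.DictReader(io.StringIO(raw)))

loaded, rejected = schema_on_write(rows)        # one clean row, one reject
lazy = schema_on_read(raw, parse=str.upper)     # raw text kept; parsing chosen per query
```

The point of the sketch is the trade-off: schema on write pays the cost of reconciliation once, up front, while schema on read repays it on every query — which is why the original lake and the warehouse could not simply share a "best of both."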
However, the data lake has evolved into a cloud-based store of largely semi-structured and structured data (think CSV files), arriving in real time from external IoT sources and from clickstream-like e-commerce feeds through streaming interfaces, and often requiring early consolidation with internally sourced data. This scenario demands significant data management; data reconciliation, both semantic and temporal; well-defined metadata; and a schema-on-write model. These “modern” data lake requirements demand adherence to data warehouse principles and the use of relational database and data management software that is still relatively immature in the cloud. The best-of-both-worlds argument in this case is essentially the data management of the warehouse combined with the timeliness, elasticity, and cost-effectiveness of cloud-based lakes. For a deeper dive into the defined characteristics of the lakehouse, please see my November 2021 post on the Data Virtualization Blog.
Is a Data Lakehouse Really a Data Warehouse?
The short answer is “yes,” at least in terms of technical functionality. Simply put, a data lakehouse uses cloud object storage as a data store, and builds database, data preparation, transactional processing, and data access (all components of data warehousing) on top of that store, using the elastic and compute-storage separation principles of the cloud (data lake).
Bill Inmon, the erstwhile father of the data warehouse, has even repurposed the title and much of the thinking of his 1992 book, Building the Data Warehouse, to write Building the Data Lakehouse in 2021. In it, the lakehouse is described almost entirely in terms of legacy warehouse function that has been extended using cloud concepts and open formats to handle textual and analog/IoT-sourced data. He writes, “The unique ability of the lakehouse [is] to manage data in an open environment, blend all varieties of data from all parts of the enterprise, and combine the data science focus of the data lake with the end user analytics of the data warehouse…” This sentence alone and, indeed, the book in its entirety—lacking as it is in much technical or architectural depth—serve to emphasize that the defining characteristics of a data lakehouse revolve around data management, governance, and organizational issues: the same characteristics as a data warehouse.
The longer answer to the above question is “not yet.” The data lakehouse is probably in about the same evolutionary stage as data warehousing was in 1992. The focus of lakehouse vendors is strongly on the required underlying software function to ingest, clean, store, integrate, manage, and access data in the cloud environment as reliably and efficiently as traditional relational database technology has been able to for many years. If lakehouse implementers learn quickly and well from the experience of data warehouse developers over the past 35 years, what the data lakehouse will deliver will be nothing more than a good data warehouse built on cloud technology. By my reckoning, and that of Gartner’s latest Hype Cycle for Data Management, that could take from two to five years.
In the meantime, vive le data warehouse!
I don’t agree with your conclusions very much! To say “…what the data lakehouse will deliver will be nothing more than a good data warehouse built on cloud technology…” is simply not true right now.
No vendor lock-in, open table formats, time travel, table schema evolution, and freedom of choice of the Engine are the first features that come to my mind that no data warehouse has today, despite 35 years of “evolution.” So there is a new king around and it’s here to stay. Long live the Data Lakehouse.
Thanks for your considered reply to my post. Although I suspect that we may have to agree to differ on the conclusion, your choice of issues to discuss points to very different interpretations of what a data warehouse – or lakehouse – is fundamentally about. To your individual points:
1. Vendor lock-in is a software acquisition topic that applies far wider than data warehouse/lakehouse. It is an old debate between commercial and open source software. You can be just as locked in to your choice of a set of open source projects as to a vendor. If the projects you choose lose out to competing projects over time, you will end up migrating, just as you would if you chose a vendor who goes bust, favours short-term profit over longer-term relationships, or whatever…
2. Open table formats are a set of technical approaches to supporting various aspects of the file-level data management that typically underpins a relational database. Being open, they have the same competitive pros and cons just mentioned in point 1. Many if not all of the issues they address are well known in computer science and have been addressed over the years in relational databases and their underlying technologies; in many cases this is “reinventing the wheel.”
3. Time travel in its various forms has also long been offered in mature relational databases and tackled with varying degrees of ease and success in data warehouses. For a comprehensive discussion, see Tom Johnston’s “Bitemporal Data: Theory and Practice.” Although a little out of date now, I discussed it on TDWI Upside in 2017 – http://bit.ly/2C5Nk2N
4. Table schema evolution is an important aspect of modern data warehouses/lakehouses. As provided through open table formats, it is an interesting technology-level approach to addressing the problem of changes in business needs over time. There are many more challenging issues at the modelling level that still need to be addressed!
5. Freedom of choice of the Engine. Perhaps I misunderstand your point, but once you have chosen your Engine, I suspect you no longer have freedom of choice!
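On point 3, the bitemporal idea behind time travel — the heart of Johnston’s treatment — can be sketched in plain Python. This is an illustrative model only, not any particular database’s feature; the record layout and sample data are invented:

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # sentinel for "still current"

@dataclass
class Version:
    """One bitemporal version of a row: valid time records when the fact
    was true in the business world; transaction time records when the
    database believed it."""
    key: str
    value: str
    valid_from: date
    valid_to: date   # exclusive
    tx_from: date
    tx_to: date      # exclusive

def as_of(rows, key, valid_at, tx_at):
    """Time-travel query: what did the database say on tx_at about the
    state of `key` at business time valid_at?"""
    for r in rows:
        if (r.key == key
                and r.valid_from <= valid_at < r.valid_to
                and r.tx_from <= tx_at < r.tx_to):
            return r.value
    return None

# A price recorded on Jan 10, then retroactively corrected on Feb 1.
history = [
    Version("sku-1", "9.99", date(2024, 1, 1), FOREVER, date(2024, 1, 10), date(2024, 2, 1)),
    Version("sku-1", "8.99", date(2024, 1, 1), FOREVER, date(2024, 2, 1), FOREVER),
]

then = as_of(history, "sku-1", date(2024, 1, 15), date(2024, 1, 20))  # "9.99"
now = as_of(history, "sku-1", date(2024, 1, 15), date(2024, 3, 1))    # "8.99"
```

The two query results show the essential warehouse problem: the same business moment yields different answers depending on when you ask, which is exactly the semantic-versus-temporal reconciliation issue, regardless of whether the store is a relational database or an open table format.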
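On point 4, the way open table formats typically handle schema evolution — old data files keep the schema they were written with, and the reader projects them onto the current schema — can be sketched as follows. The schema versions, defaults, and records here are assumptions for illustration, not any specific format’s behavior:

```python
# Current schema as (column, default) pairs; v2 adds a column to v1.
SCHEMA_V1 = [("order_id", 0), ("amount", 0.0)]
SCHEMA_V2 = SCHEMA_V1 + [("channel", "unknown")]  # column added later

def read_with_schema(record, current_schema):
    """Project a stored record onto the current schema: columns missing
    from records written under an earlier schema get the declared default,
    so old files never need rewriting."""
    return {col: record.get(col, default) for col, default in current_schema}

old_file = [{"order_id": 1, "amount": 19.99}]                    # written under v1
new_file = [{"order_id": 2, "amount": 5.00, "channel": "web"}]   # written under v2

table = [read_with_schema(r, SCHEMA_V2) for r in old_file + new_file]
# The v1 row surfaces with channel="unknown"; the v2 row keeps channel="web".
```

This resolves the physical problem neatly; the harder modelling-level question — whether "channel" means the same thing across departments and over time — is the one the warehouse definition above insists on answering first.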
While all these may be interesting and important detailed design points, they are rather far from the fundamental definition of a data warehouse that I explored in my post. The data lakehouse, in my opinion, is trying to address many of the same technical problems that the data warehouse tackled with varying degrees of success over the past 35 years and is, in many cases, reinventing solutions to the same underlying technical problems—driven in large part by the migration to the cloud rather than by the fundamental business needs of decision-making support. In that sense, it adds little to the evolution of architectural or methodological thinking needed as a result of ongoing digital transformation.