Does a Logical Data Warehouse Architecture Need a Physical Data Warehouse?
If you want to know why a Physical Data Warehouse is no longer suitable for newer data types, register to the webinar “The Big BI Dilemma – Bimodal Logical Data Warehouse to the Rescue!” by Rick van der Lans on November 30th.
A question I get very often at seminars and workshops on the logical data warehouse architecture is whether a physical data warehouse is still needed? And quite often, this question is followed by this somewhat assertive question: Can we really run an analytical workload via a data virtualization server on our operational systems?
The answer to the first question is a typical IT answer: It depends. Many logical data warehouse architectures are developed without a traditional physical data warehouse; all the data sources are accessed directly. But sometimes a physical data warehouse is still required. This may sound like a rather paradoxical statement, because it implies that an architecture containing the word logical in its name contains a physical data warehouse. Therefore, let me give a detailed explanation.
In the logical data warehouse architecture, all the data consumers (reports, analytics, and queries) do not retrieve data by accessing specific data sources directly, but through some intelligent gateway system. This system makes all the data sources look like one logical database. In most logical data warehouse architectures this system is developed with a data virtualization server, such as the Denodo Platform. A data virtualization server hides from the data consumers whether a data source is a SQL database server, a NoSQL system, an Hadoop file, an Excel spreadsheet, a SOAP or REST service, a cloud application, a social media system, or anything else. In addition, the data consumer doesn’t see whether that data source acts as a data warehouse, a data mart, a data lake, a personal file, a staging area, or, and that’s important, an operational system. In fact, a data virtualization doesn’t even know whether a specific data source is a data warehouse or an operational system. Again, to the data consumers running their reports or doing their self-service analytics, all the data sources together look like one integrated data source.
For most of the data sources there is no need to develop a physical data warehouse into which data is first copied from the real data source. That need doesn’t even exist for many of the operational systems. In fact, if all the operational systems can be accessed live, then a physical data warehouse is not at all needed, which is the preferred approach. Unfortunately, for organizations with a large legacy of operational systems this is an unrealistic assumption.
There are several reasons why a physical data warehouse (and possibly also a staging area) is needed in this case. The first and most important reason is that some operational systems don’t keep track of historical data. Updates just overwrite old date, and deleted data is really removed from the data source. However, historical data may be needed for trend analysis and, therefore, has to be stored somewhere. And that’s where the physical data warehouse may come in.
A second reason is that adding an analytical workload on an operational system can lead to performance degradation and interference, possibly causing issues in the operational environment and hurting the business. In this case, the old-fashioned approach of copying the data to a physical data warehouse may be a solution.
But evidently, the goal when designing a logical data warehouse architecture is to minimize the use of a physical data warehouse; only when the limitations of the operational systems force us to do so. Note that for many (newer) operational systems a physical data warehouse is not needed in a logical data warehouse architecture. Data virtualization servers have sufficient features on board to delicately access these operational systems and causing no performance degradation or other operational problems.
A question we have to ask ourselves is whether we should still name it a data warehouse if we use a physical data warehouse this way, because it doesn’t fit the standard definition of data warehouse any longer. Maybe a more vague term, such as data repository, is more fitting.
So, to come back to the original question, depending on the limitations of some operational systems, a logical data warehouse architecture may still need a downsized physical data warehouse!