Does Data Always Need to End Up in a Centralized Repository?

This is an age-old question, and one that keeps coming back.

As far back as 2011, Gartner proposed the concept of a logical data warehouse as a way to overcome some of the challenges organizations typically face when attempting to centralize data, such as the growing proliferation of data copies, which adds risk from a data governance perspective and cost from a storage perspective. The idea of trying to keep all of the data in one location might sound appealing, but it just does not make sense.

But you may ask why exactly that does not make sense. If all of the data were in one location, say a data lake, then surely our lives would be easier. We would benefit from a simplified IT infrastructure, we would be able to access data more quickly, and we would always know where our data is.

Surely this would be a good thing, right?

Arguably, yes. However, that approach ignores many factors that work against the utopia of a single location for all data. Data lakes, for example, are designed to handle unstructured data, but this means that it is hard to govern data in a data lake. Data lakes tend to be built on the consumption principle, which means that you need to add in tools, like query optimization technology, to better manage performance. Data lakes also encourage the mushrooming of storage needs; just because you can store something easily does not necessarily mean you should. These drawbacks were identified by Dan Woods in his 2016 Forbes article, “Why Data Lakes are Evil.”

OK, so if you can’t store everything in a data lake, then what about a lakehouse?

Once the world realized that data lakes were not the magic answer to all of our data storage, data science, and analytical needs, the data lakehouse emerged. In the simplest terms, a data lakehouse is a combination of a data warehouse (for your structured data) and a data lake (for your unstructured data).

You may argue that that sounds ideal, and you might ask why it wouldn’t work. Fundamentally, you would inherit all of the problems of both the data warehouse and the data lake.

The Forces of Data Anti-Gravity

You may ask yourself, “but don’t I still need to get all of my data into one location?”

The real question is whether that is actually feasible. Consider the idea of data gravity. Way back in the 1990s, organizations tried to centralize all of their data into a single location. We can call that “data gravity.” The main forces behind it have always been cost, ease of use, and governance. It certainly sounds appealing to have one source of truth and one place to go for all analytical and data consumption needs.

There are competing forces at work here, though, which we can call “data anti-gravity.” The three core areas of data anti-gravity are:

  • Geography – For regulatory reasons, it might simply not be possible to migrate all of the data to a single location. This is especially true for multinational organizations, but increasingly, even if an organization only trades in one geography, part of its supply chain, or third-party organizations it works with, may well be located outside of its primary region.
  • Technology – Data is often stored in legacy systems, for which building data pipelines and exports is often strategically and financially unpalatable. This is particularly true for well-established organizations with a rich depth of historic data.
  • Ownership – Creating a data-sharing culture is always a good thing. However, operational data ownership may prevent data from being merged into a single repository.

The Simple Answer

A single repository for all analytical data is, again, a utopian ideal. In reality, data anti-gravity, the need for real-time data access, and other factors such as cost tend to drive organizations down the path of a logical approach to data management and integration, as Gartner observed.

Logical data fabric, powered by data virtualization, provides the foundation for a logical data warehouse and other modern configurations. It enables organizations to provide real-time data access across the organization (regardless of the location of the individual data sources), and it serves the needs of hybrid, multi-cloud, and increasingly, inter-cloud environments.
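To make the idea of data virtualization a little more concrete, here is a minimal sketch in Python. It is not how any particular data virtualization product works. The two in-memory SQLite databases simply stand in for two separate systems (a hypothetical CRM and a hypothetical legacy order system), and the function name `revenue_by_customer` is invented for illustration. The point is only that the combined view is computed at request time from the sources where they live, and no copy of the data is kept in a central store.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory database standing in for one independent source system."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    conn.commit()
    return conn


# Hypothetical source 1: a CRM system that owns customer records.
crm = make_source(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)",
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Acme"), (2, "Globex")],
)

# Hypothetical source 2: a legacy order system that owns order records.
orders = make_source(
    "CREATE TABLE orders (customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 75.0)],
)


def revenue_by_customer(crm_conn, orders_conn):
    """A 'virtual view': query each source on demand and join the results in memory."""
    totals = dict(
        orders_conn.execute(
            "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
        )
    )
    customers = crm_conn.execute("SELECT id, name FROM customers ORDER BY id")
    return {name: totals.get(cid, 0.0) for cid, name in customers}


result = revenue_by_customer(crm, orders)
print(result)
```

Because nothing is copied, a new order written to the order system shows up the next time the view is requested. A real logical data fabric adds much more on top of this, such as query optimization, security, and governance, none of which this sketch attempts.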

So, does data always need to end up in a centralized repository? The simple answer is, “No.”