When A Data Catalog Is Not Just a Data Catalog

Before use, there is always awareness, but before awareness, there is always knowability. These are the elements that, followed in sequence, enable the use of data and information to generate value, value that can be realized in accordance with the purpose (or the intentionality, as John Searle would put it in “Intentionality: An Essay in the Philosophy of Mind”) of those who use it.

When we talk about data, the journey towards the coveted goal of generating value translates into three needs: knowing what information assets are available, being aware of what these assets represent, and finally using them to produce new knowledge about the world and to achieve the wisdom of knowing how to use that knowledge to make decisions that move us forward according to our plans. (Here, I am referring to the Data – Information – Knowledge – Wisdom pyramid, which posits a hierarchy of knowledge.)

If we accept the complexity of this path, we must start the journey where knowability, awareness, and use can take place, without our being forced to look elsewhere, without our having to change our means of transport depending on our intermediate destination. We need someone to provide us with an integrated ticket for the entire journey and not force us to purchase individual tickets for every single stage along the way.

The Integrated Ticket

This ticket is nothing but the data catalog enabled by data virtualization, which automatically collects and describes all available information. A data catalog is not – and should not be – merely a descriptive collection of logical-linguistic elements, in relation to each other, but a “living” catalog that adds to the descriptive component (the logical or intentional component) the connection with the occurrences of the data’s heritage (the physical or extensional component).

With these characteristics, a data catalog can be the pivot point for the use of data, offering a path without fractures. The traveler is the data consumer, who starts from a need, searches for data that can address that need, and comes to understand that data in terms of what it represents and where it originates. The consumer may then combine different data sets to represent precisely what is being investigated, and finally uses the data. This last step completes the logical-physical (or intentional-extensional) connection, which sanctions the passage from the definition of the concept, of which a data point is a representation, to the set of occurrences of that concept that have been collected.

Data catalogs have the potential to extend control from where the data is to where the data is used, overseeing the entire data lifecycle in compliance with existing rules. We must not forget that data democracy, which aims to give data access to those who need it, and to do so in compliance with a set of rules, is quite different from anarchy, in which data would simply be thrown into a common space, and anyone could do what they want, without any constraint or control. In a data democracy, a data catalog would act as a “data constitution.”

Data Catalog and Data Virtualization

Of course, data catalogs are not the exclusive prerogative of data virtualization and could be a part – whether foundational or not – of any data integration strategy, whether traditional or innovative. However, it is precisely in data virtualization that a data catalog can unleash all its power and do so elegantly, especially by virtue of the logical/physical separation on which it is based. This enables a clear line of separation between meaning, signifier, and referent (Ferdinand de Saussure – “Cours de linguistique générale” – 1916), while still keeping them linked, so that one can seamlessly move from one to the other.

Classical methods of data integration are based on creating replicas of the data, and the data catalog, when present, is built precisely on these replicas (as Ferdinand de Saussure might put it, duplicating the referents only to be able to describe them). In data virtualization, by contrast, the data catalog comes to life without creating physical replicas of the data, because representing the meaning of data requires access only to its definition (the logical component), not to its realizations (the physical component). Besides the obvious benefit of saving resources, this approach enables a more elegant and natural conceptualization of the world that the data intends to represent.
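
To make the logical/physical separation concrete, here is a minimal sketch in Python. It is not how any particular data virtualization product works. The names CatalogEntry, DataCatalog, and fetch_customers, as well as the "crm" source, are hypothetical and introduced only for illustration. The point it tries to show is that searching and understanding touch only the definitions, while the occurrences are fetched from the source only at the moment of use, with no replica stored in the catalog.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class CatalogEntry:
    """Logical description of a data asset plus a lazy link to its occurrences."""
    name: str
    description: str
    origin: str
    # Hypothetical accessor: called only when the data is actually used.
    fetch: Callable[[], Iterable[dict]]


class DataCatalog:
    """Illustrative catalog that stores definitions, never copies of the data."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        # Search inspects only the logical component; no source is queried.
        kw = keyword.lower()
        return [
            e for e in self._entries.values()
            if kw in e.name.lower() or kw in e.description.lower()
        ]

    def rows(self, name: str) -> List[dict]:
        # Use resolves the physical component on demand, straight from the source.
        return list(self._entries[name].fetch())


# Stand-in for a real source system; it records each access so we can observe it.
fetch_calls: List[str] = []


def fetch_customers() -> List[dict]:
    fetch_calls.append("customers")
    return [{"id": 1, "country": "IT"}, {"id": 2, "country": "ES"}]


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="customers",
    description="Active customers with country of residence",
    origin="crm (hypothetical source)",
    fetch=fetch_customers,
))

found_names = [e.name for e in catalog.search("country")]
calls_after_search = len(fetch_calls)   # discovery did not touch the source
rows = catalog.rows("customers")
calls_after_use = len(fetch_calls)      # the source is read only at the point of use
```

In this sketch the catalog entry carries the meaning and the origin of the data, and the fetch function stands in for the link to the underlying source. A replica-based design would instead copy the rows first and describe the copy afterwards.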

More than Just a Data Catalog

Data catalogs play a central role. They are the point from which we trigger our hunger for data, where we understand our data, where we discover its origin, and finally, where we use it to do what we need to do. Often, we can operate autonomously when we use a data catalog, because data catalogs do what they do while hiding all of the underlying complexities that users don’t need to know or care about. As data consumers, we are the drivers of the car, not the engineers who built it or the mechanics who eventually have to repair it.

Andrea Zinno