Zero Km Data
I see a strong analogy between what inspired the “Zero Km Food” movement, which started in Italy but then spread to other countries, and the way in which data can be managed in its lifecycle from creation, through detection, to consumption. Zero Km Food affirms the importance of food’s original territory, as it emphasizes food that has travelled zero kilometers from producer to the final consumer.
Elementary data is like a product of nature, in that it can be consumed as it is or transformed to produce more complex, articulated forms. Just as we believe that it is right to reduce the environmental impact of food production, we should ask ourselves how we can follow similar principles in our use of data. To do so, we must first create parallels between the digital and the physical worlds.
The Need for Zero Km Data
In the digital world, roads are replaced by network connections, and product storage warehouses are replaced by storage systems, which also organize the data. For example, a data warehouse represents data in its own characteristic way: it models, aggregates, and synthesizes the data before making it available.
In the digital world, “producers” can be anything that can produce data, be it a sensor or an application; “transformers” are applications or people who, by exploiting elementary data, create new data, which can be put on the market to be, once again, used as it is, or further transformed. The final consumer is none other than the data consumer, who can use the available data for his or her own purposes.
What might “Zero Km Data” entail, how could we implement it, and, most importantly, why should we do so?
The Requirements of Zero Km Data
First, storage should be kept to a minimum. Above all, it should not serve merely as a staging area for transport: a dataset should be duplicated only when the copy is genuinely needed, not just because it makes delivery to the recipient easier. Compared with the corresponding process for the products of the earth, the digital world carries an additional burden, since moving data is almost always equivalent to duplicating it, and every copy consumes additional resources.
Second, any storage that is not functional to a product's lifecycle (unlike, say, the storage that wine needs in order to ferment) causes the product to deteriorate and lose freshness. This happens in both the physical world and the virtual one: data replication always introduces a delay between the latest version of the data and the copies held at the various intermediate points.
The third element to consider is transport, which should be limited, in both distance and quantity of goods transported, to what consumers actually need. Physical roads and digital networks both have finite capacity, and over-committing them is always a waste of resources. Logistics should be managed according to the expected use of what is transported, delegating any necessary transformations to the right actors along the chain, which in turn reduces what needs to be moved.
Finally, a last point, which unfortunately has not yet been effectively implemented in the physical world, is that Zero Km products should be immediately accessible, or at the very least they should be able to be purchased with a minimum of effort. In the digital world, where there is no real physical movement of the various actors (it is the data that reaches them and not the other way around), we can see this aspect as the possibility of having a single point of access to data, a sort of data marketplace, as opposed to a scenario in which consumers need to pass through a set of individual data sources to access the data they need.
Taken together, these considerations suggest the following requirements for Zero Km Data:
- There should be no limitation on accessing data, as every dataset has its own potential value.
- All data should be accessible from one place, sparing consumers the burden of searching in multiple places and the risk of not finding what they need.
- For each single attribute of each dataset, it must be possible to reconstruct the entire chain of lineage, to maximize the trust that consumers can place in the data.
- Data should only be duplicated when necessary and not merely to satisfy transport and delivery requirements.
- The data that travels on the network should only be data that is functional to the operations that must be applied on it.
How to Implement Zero Km Data
Data virtualization fulfills all of the above points:
- By the availability of connectors for all possible data sources.
- By the ability to establish a single point of access, from which it is also possible to define a unified semantic model so that there is full awareness of what data is available and what the data means.
- By the availability of the complete data lineage, enabling stakeholders to know the origin of each attribute of every single dataset.
- By the logical-physical separation of data, which makes it possible to access data only when it needs to be used and to easily investigate the meaning of data through the logical component.
- By the availability of a query engine that knows how to delegate to the data sources those operations that can be best performed by them, reducing travel on the network.
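The last point, delegating operations to the data source, can be illustrated with a minimal sketch. It is not how any particular data virtualization product works; it only contrasts two ways of answering the same question. An in-memory SQLite database stands in for a remote source, and the table `readings`, the sensor names, and the function names are all hypothetical, chosen for illustration. Each function returns the answer together with the number of rows that had to leave the source.

```python
import sqlite3


def make_source():
    """Build a stand-in for a remote data source (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    rows = [("s1", 10.0), ("s1", 14.0), ("s2", 3.0), ("s2", 5.0), ("s3", 8.0)]
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return conn


def average_without_pushdown(conn, sensor):
    """Pull every row to the consumer, then filter and aggregate locally."""
    rows = conn.execute("SELECT sensor, value FROM readings").fetchall()
    values = [value for name, value in rows if name == sensor]
    return sum(values) / len(values), len(rows)


def average_with_pushdown(conn, sensor):
    """Let the source filter and aggregate; only the result travels."""
    rows = conn.execute(
        "SELECT AVG(value) FROM readings WHERE sensor = ?", (sensor,)
    ).fetchall()
    return rows[0][0], len(rows)


if __name__ == "__main__":
    source = make_source()
    for strategy in (average_without_pushdown, average_with_pushdown):
        average, rows_moved = strategy(source, "s1")
        print(f"{strategy.__name__}: average={average}, rows moved={rows_moved}")
```

Both strategies return the same average for sensor `s1`, but the first moves all five rows to the consumer while the second moves a single row. On a real network, with millions of rows, that difference is the traffic a delegating query engine avoids.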
The high availability of bandwidth, the high computational capacity of modern computers, and the low cost of storage should not be excuses for neglecting an inherently efficient data management process, one that does not rely exclusively on the abundance of underlying resources. Relying on that abundance may be adequate today, but not tomorrow, so data management solutions should be chosen judiciously, in the spirit of encouraging maximum efficiency and economy.