Epistemology of Data Virtualization
Data are the eyes through which we look at reality. They are one means, though not the only one, by which a company can aspire to full awareness of itself and of the context in which it operates. Venturing toward the boundary between technology and philosophy, we might say that data are the measure of the company's own immanence.
Data are the fuel of knowledge: they feed it, but they are not enough to create it. They are also democratic, in that everyone can read them differently, according to the intentionality that defines, for each of us, our being-in-the-world¹.
If everything comes from data² and lives in data, and if we want data to be the fuel we need, then they, and any software solution that gives access to them, must satisfy some principles:
- Data must be reachable, wherever they are and whatever their format;
- Data must be manipulable, so that elementary data can be combined into increasingly rich information constructs, generating knowledge and wisdom³;
- Data must be aggregable, so that starting from them we can define increasingly articulated concepts, with the ultimate goal of defining and making available the appropriate corporate ontology, that is, a formalization of what the company is and does;
- Data must be separable into their intensional and extensional components, so that the signified and the signifier⁴ are clearly distinguishable. In other words, there must be a clear separation between the meaning of the data and its realization;
- Data must be easy to consult, by people and applications, without any need to know the syntax that governs their native format. In other words, every datum must be representable in a standard format that is reasonably well known, easy to read and, obviously, semantically equivalent to the native format.
Some of these principles are met by various families of solutions that, with different objectives, facilitate access to data (the term “access” is used here in a very general sense). There is, however, one particular type, known as data virtualization, for which these principles should rise to the status of axioms: without them, the whole system proposed by such a solution would be incomplete, if not incoherent.
Virtualizing data, in fact, cannot and should not be limited to making them accessible regardless of their technical and syntactic aspects. If that were the case, the chain that represents the data life cycle would be strong in one link and weak in all the others, since ease of access would not be matched by a similar ease of use.
Virtualizing data therefore means not only making them accessible regardless of where they are and how they are represented but, above all, separating the datum as occurrence from the datum as meaning, so that we can operate on data without having to access their realizations.
It is in fact the possibility of combining and aggregating data, in their intensional component, that allows a data virtualization solution to deliver its value, making it possible to model, incrementally, the ontology that represents the company as such. This is a crucial step toward the self-awareness that is an indispensable guide for all the internal and external actions the company will take.
If the ontology is the essence of the company, we cannot assume, desirable as it might be, that everyone working in the company interprets it in the same way. The concepts that together define the boundaries of the reference context are subject to interpretation by individuals, each with their own role and responsibilities. It is therefore legitimate, and indeed unavoidable, to assume the existence of private interpretative models, which are a sort of reification of the ontology: abstract in itself, it is made concrete in the specific context in which each person operates.
Obviously, we are not suggesting as many ontologies as there are individuals working in a company, which would probably be neither sensible nor manageable. We should, however, at least allow for this possibility for groups of particular relevance (Sales, Marketing, Customer Care, …), in order to guarantee them the representational freedom that is a prerequisite for the correct management of their tasks.
To recapitulate, a data virtualization solution, in addition to making data easily accessible, must allow their manipulation and, as a minimum, make it possible to:
- Create a level of abstraction between the datum as occurrence and the datum as meaning, realizing the separation between the extensional representation (the occurrences of the datum) and the intensional one (the definition of the datum);
- Create a level of abstraction above the previous one, where the corporate ontology can be defined through operations on data that combine and enrich them;
- Create a further level of abstraction, where the corporate ontology can be adapted to the specific needs of the different business units, with the obvious constraint that such adaptation must not create inconsistencies with the corporate ontology, but only specializations of it.
Certainly, once the third level described above has been made available, it should be possible to create further levels, iteratively and always by combining and aggregating the underlying ones. The only limit is the reasonableness of the operation, which must never be an end in itself but must clearly aim at making the meaning of the modeled context as clear as possible.
Another important point is that the process we have called, perhaps inappropriately, abstraction is not really abstraction in the classical sense of the term: a progressive refinement that moves from elementary concepts to their generalization, following a typically tree-like structure. Here we mean instead the possibility of arbitrarily combining the concepts (data) modeled in the lower levels, without any particular topological constraints.
Moreover, operating at such levels requires knowing what is represented at each of them or, in other words, having full access to what is gradually created, be it elementary data or more complex concepts built by combining them. We need a data virtualization system that allows us to know what we know, that is, an autoepistemic system: one that displays its contents clearly and legibly, allowing users to investigate them, understand them and, therefore, use them for their own purposes.
Finally, all the operations carried out in the levels described above must necessarily operate on the intensional layer, without creating significant perturbations in the extensional one, so as not to cause unnecessary computational load on the systems where the data are physically maintained.
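To make the layering concrete, here is a minimal sketch in Python of views defined purely in intensional terms and stacked in three levels. Every name in it (`View`, `read_orders`, `customer_revenue`, `key_accounts`) is a hypothetical illustration and not the API of any real product. The sketch assumes that each view carries a meaning and a derivation rule, and that the source is read only when occurrences are explicitly requested.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Optional

Row = Dict[str, object]


@dataclass
class View:
    """Intensional definition of a datum: its name, meaning and derivation."""
    name: str
    meaning: str
    parents: List["View"] = field(default_factory=list)
    # Base views read from a physical system through `source`.
    source: Optional[Callable[[], Iterable[Row]]] = None
    # Derived views build their rows from the rows of their parents.
    combine: Optional[Callable[..., Iterable[Row]]] = None

    def describe(self, depth: int = 0) -> str:
        """Show what the system knows about this view without reading data."""
        lines = ["  " * depth + f"{self.name}: {self.meaning}"]
        lines.extend(parent.describe(depth + 1) for parent in self.parents)
        return "\n".join(lines)

    def rows(self) -> Iterator[Row]:
        """Produce the occurrences; sources are touched only from here."""
        if self.source is not None:
            yield from self.source()
        else:
            yield from self.combine(*(p.rows() for p in self.parents))


fetches = {"count": 0}  # counts how often the physical source is read


def read_orders() -> Iterable[Row]:
    """Stand-in for a physical source system (hypothetical data)."""
    fetches["count"] += 1
    return [
        {"customer": "acme", "amount": 120.0},
        {"customer": "acme", "amount": 80.0},
        {"customer": "globex", "amount": 50.0},
    ]


def total_by_customer(order_rows: Iterable[Row]) -> List[Row]:
    totals: Dict[str, float] = {}
    for row in order_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return [{"customer": c, "revenue": t} for c, t in totals.items()]


def key_accounts_only(revenue_rows: Iterable[Row]) -> List[Row]:
    return [r for r in revenue_rows if r["revenue"] >= 100.0]


# Level 1: the datum is defined separately from its occurrences.
orders = View("orders", "one row per order placed", source=read_orders)
# Level 2: a corporate concept built by combining lower-level data.
customer_revenue = View("customer_revenue", "total order amount per customer",
                        parents=[orders], combine=total_by_customer)
# Level 3: a specialization for one business unit (here, Sales).
key_accounts = View("key_accounts", "customers with revenue of at least 100",
                    parents=[customer_revenue], combine=key_accounts_only)

description = key_accounts.describe()        # reads definitions only
fetches_after_modeling = fetches["count"]    # the source is still untouched
result = list(key_accounts.rows())           # the source is read here, once
```

In this sketch, `describe` plays the autoepistemic role: it reports the meaning and lineage of a concept using definitions alone. Because `parents` is a plain list, a view may combine any lower-level views, so the structure is not constrained to a tree.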
Having hopefully clarified the importance of being able to manipulate data in order to compose informative constructs that are increasingly relevant to each user's purposes and to the value expected from them, it is equally important that during such operations the data can be inspected, in their extensional component, with the minimum possible effort and without having to understand the different syntaxes in which they are represented.
From this point of view, it is important to understand that modeling and representation are two distinct moments, temporally and conceptually. The former is the act of including a given datum or concept in the data virtualization system, in accordance with the adopted formalism; the latter gives evidence of what the system contains, in terms of the occurrences of the data on which it operates. Basically, while modeling aims to define the meaning of data, representation gives a snapshot of them here and now.
This representation, for obvious reasons of usability, should take place in a single format that is powerful enough to avoid any semantic loss with respect to the original formats, reasonably well known and widespread and, finally, capable of a clear and unambiguous reading of data occurrences.
The availability of a single representation format therefore guarantees, on the one hand, the same expressiveness as the native formats and, on the other, the necessary simplicity in the operations performed on such data, reducing as far as possible the interpretative effort required of their readers.
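As an illustration of a single representation format, the following Python sketch brings occurrences from two native formats (CSV and JSON) into one common shape: a list of typed records. The sample data, the `SCHEMA` mapping and the function names are assumptions made for the example; a real system would need a far richer type system to rule out semantic loss.

```python
import csv
import io
import json
from typing import Callable, Dict, List

Record = Dict[str, object]

# Hypothetical intensional definition: column names and their types.
SCHEMA: Dict[str, Callable[[object], object]] = {"id": int, "name": str}

# Hypothetical occurrences of the same concept in two native formats.
CSV_SOURCE = "id,name\n1,Acme\n2,Globex\n"
JSON_SOURCE = '[{"id": 3, "name": "Initech"}]'


def conform(raw: Dict[str, object]) -> Record:
    """Cast a raw row to the common format described by SCHEMA."""
    return {column: cast(raw[column]) for column, cast in SCHEMA.items()}


def from_csv(text: str) -> List[Record]:
    """Read CSV occurrences; CSV has no types, so SCHEMA supplies them."""
    return [conform(row) for row in csv.DictReader(io.StringIO(text))]


def from_json(text: str) -> List[Record]:
    """Read JSON occurrences and align them to the same schema."""
    return [conform(item) for item in json.loads(text)]


# The reader sees one format, whatever the native syntax was.
customers = from_csv(CSV_SOURCE) + from_json(JSON_SOURCE)
```

The reader of `customers` never needs to know that two syntaxes were involved; the schema, not the native format, determines how each occurrence is read.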
If we then want to take a further step and extend representation to the whole data virtualization system, we can say that it should:
- Give evidence of the meaning of the data in their intensional terms;
- Show how the data relate to each other and what the meaning of these relationships is;
- Give access to the manifestations of the data, that is, to their extensional component, so that we can move fluidly between intensionality and extensionality, in one direction or the other.
Having described access to data, their representation and the possibility of manipulating them through the separation of the intensional and extensional components, all that remains is to say a few words about data consumption, that is, how everything that has been produced can be made available to potential users, be they people who need to analyze the data or applications that need the data to perform the tasks for which they were built.
In this case, unlike in the modeling phases, consumption will necessarily concern the extensional component of the data, since it is precisely their occurrences, the facts, that allow us to analyze what happened and what is happening, and to foresee what could happen.
Consumption must be simple, in the sense of adapting, in its formats, to as many usage scenarios as possible, be they human or application-driven. It must also be performant, fetching data occurrences in a short time by taking advantage of every possible optimization and, even better, of the characteristics of the source systems, delegating to them, when possible, the execution of particular commands, in the spirit of “leave things to those who can do them best”.
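Delegating work to the source can be sketched as follows. The example uses Python's standard `sqlite3` module with an in-memory database as a stand-in for a source system; the table, its rows and the two function names are hypothetical. It contrasts fetching every occurrence and computing locally with pushing the filter and the aggregation down to the source.

```python
import sqlite3

# An in-memory database plays the role of a source system (hypothetical data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 50.0)],
)


def total_without_pushdown(conn: sqlite3.Connection, customer: str) -> float:
    """Fetch every occurrence, then filter and sum in the virtualization layer."""
    rows = conn.execute("SELECT customer, amount FROM orders").fetchall()
    return sum(amount for name, amount in rows if name == customer)


def total_with_pushdown(conn: sqlite3.Connection, customer: str) -> float:
    """Let the source filter and aggregate; only one value travels back."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE customer = ?",
        (customer,),
    ).fetchone()
    return row[0]


local_total = total_without_pushdown(source, "acme")
pushed_total = total_with_pushdown(source, "acme")
```

Both functions return the same total, but the second moves a single value instead of the whole table, which is the kind of saving that delegation to the source aims for.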
Modeling, separation, representation and consumption are therefore the key elements of any solution that wants to position itself in the context of data virtualization. They are four key elements, but obviously not the only ones, since such a solution will also have to meet additional requirements, typically non-functional, such as those related to security, performance and overall system governance. By conscious choice, these requirements are not examined in depth here.
In conclusion, we could say that there is nothing new under the sun, except the usual need for a solution that allows us to work on data quickly, economically, manageably and safely, in order to extract the value that resides in them⁵. That value is a precious mineral for companies and, like all minerals, it often does not appear on the surface but emerges only through analysis, inspection, coring and field operations, which allow the distillation needed to obtain its most precious part.
¹Martin Heidegger – “Being and Time” – 1927.
²If we place ourselves in the perspective of materialism, everything actually comes from lived reality, of which data are only a synthesis, a measure, an abstraction. For the purposes of this document, however, the simplification should be more than acceptable.
³By “wisdom” we mean here the ability of a company to be effective in the context in which it operates. What normally allows one to move from knowledge to wisdom is pragmatism, which can be read here as the capacity of knowledge to direct behavior.
⁴Ferdinand de Saussure – “Cours de linguistique générale” – 1916.
⁵It could be argued that data, in themselves, have no value and acquire it only through the way they are used, exactly like a brick, which has value only through the house it helps to build. For the purposes of this document, however, this simplification can be considered legitimate.