Data Virtualization and the Fulfilling of Ted Codd’s Dream

When E.F. (Ted) Codd developed the relational model in the early 1970s, he had a dream. He wanted applications to be independent of the data storage and data access layer. Changes made to one should not force changes to the other. This independence would improve productivity and flexibility. At that time, database servers did not offer this level of independence; changing the data storage layer had a direct impact on the application code. It looks as if data virtualization servers are the perfect implementation of Codd’s dream.

Ted Codd opened one of his groundbreaking articles with the sentence “Future users of large data banks must be protected from having to know how the data is organized.” A little further in the article he writes “application programs should remain unaffected when the internal representation of data is changed”. This concept of abstraction or decoupling is fundamental to the relational model. In 1981, when receiving the ACM Turing Award, he published an article in which he named this concept the data independence objective. Applications must remain independent of the data storage and access layer. If either the application or the data structure changes, the change should not affect the other.

Another groundbreaking article was written by David Parnas. In 1972 he introduced the concept of information hiding. By information hiding he meant that application structures must remain independent of data storage structures. Changing one should not force changes to the other. In other words, he had come to the same conclusion as Codd. Parnas looked at this concept from the application perspective and Codd from the data perspective.

This concept of data independence or information hiding is exactly what data virtualization technology is all about. It completely decouples how and where data is stored from the way it is used. If applications prefer to access data in Hadoop Parquet files using SQL, they can; if they want to use SQL to retrieve data from a NoSQL database server, they can; and if they want to access a SQL database through a JSON/REST interface, they can as well. Maybe more importantly, a migration of data from a SQL database server to Hadoop can be hidden completely by a data virtualization server. This is exactly what Codd and Parnas had in mind with their concepts of data independence and information hiding.
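To make the decoupling concrete, here is a minimal sketch in Python. It is not how any particular data virtualization server works; the names VirtualView, SqlSource, DocumentSource, and customers_in are hypothetical and exist only for this illustration. An in-memory SQLite table stands in for a SQL database server and a list of nested dictionaries stands in for a NoSQL store. The point is only that the application function keeps working, unchanged, when the storage behind the view is swapped.

```python
from __future__ import annotations

import sqlite3


class SqlSource:
    """Customers stored in a SQL table (in-memory SQLite as a stand-in)."""

    def __init__(self) -> None:
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE customer (id INTEGER, name TEXT, city TEXT)"
        )
        self.conn.executemany(
            "INSERT INTO customer VALUES (?, ?, ?)",
            [(1, "Ann", "Delft"), (2, "Bob", "Utrecht")],
        )

    def fetch_customers(self) -> list[dict]:
        rows = self.conn.execute(
            "SELECT id, name, city FROM customer ORDER BY id"
        )
        return [{"id": r[0], "name": r[1], "city": r[2]} for r in rows]


class DocumentSource:
    """The same customers as nested documents (stand-in for a NoSQL store)."""

    def __init__(self) -> None:
        self.docs = [
            {"_id": 1, "profile": {"name": "Ann", "address": {"city": "Delft"}}},
            {"_id": 2, "profile": {"name": "Bob", "address": {"city": "Utrecht"}}},
        ]

    def fetch_customers(self) -> list[dict]:
        # Flatten the nested documents into the same row shape as SqlSource.
        ordered = sorted(self.docs, key=lambda d: d["_id"])
        return [
            {
                "id": d["_id"],
                "name": d["profile"]["name"],
                "city": d["profile"]["address"]["city"],
            }
            for d in ordered
        ]


class VirtualView:
    """Hypothetical virtual view: the only thing the application sees."""

    def __init__(self, source) -> None:
        self._source = source

    def rebind(self, source) -> None:
        # Simulates a migration: the storage changes, the view does not.
        self._source = source

    def rows(self) -> list[dict]:
        return self._source.fetch_customers()


def customers_in(view: VirtualView, city: str) -> list[str]:
    """Application code. It knows nothing about where the rows come from."""
    return [row["name"] for row in view.rows() if row["city"] == city]


if __name__ == "__main__":
    view = VirtualView(SqlSource())
    print(customers_in(view, "Delft"))
    view.rebind(DocumentSource())  # data "migrated" to another store
    print(customers_in(view, "Delft"))
```

Both calls to customers_in return the same result, although the data has moved from a table to nested documents in between. A real data virtualization server does far more (query pushdown, caching, security), but the contract towards the application is the same idea.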

As Alberto Pan (CTO of Denodo Technology) rightly noted, “This illustrates one fundamental principle of computer science which is at the root of data virtualization: applications should be independent of the complexities of accessing data. They should not have to bother about where data is located, how it is accessed or what is its native format. They should also be independent of changes in any of those aspects.”

SQL database servers were and are good at separating data storage from data usage, because a SQL query specifies the requested data, but not how the data must be retrieved. This is up to the database server to decide. But they don’t support the level of data independence that data virtualization servers deliver.
Besides, current NoSQL database servers offer a much lower level of data independence than we are used to with SQL database servers. In these products a closer tie exists between data usage and data storage. Data virtualization can improve the level of data independence by operating in between the application and the NoSQL products.
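The separation of data usage from data storage that SQL database servers already offer can be seen in a small experiment. The sketch below uses Python's built-in sqlite3 module; the table, the data, and the index name are made up for this illustration. The query text never changes, yet after an index is added the database server is free to choose a different access path, which SQLite reports through EXPLAIN QUERY PLAN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Ann", 10.0), (2, "Bob", 25.0), (3, "Ann", 5.0)],
)

# The application's query: it states WHAT is wanted, not HOW to fetch it.
QUERY = "SELECT SUM(amount) FROM orders WHERE customer = ?"


def run(connection: sqlite3.Connection) -> float:
    """Execute the unchanged application query."""
    return connection.execute(QUERY, ("Ann",)).fetchone()[0]


def plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's description of the access path it chose."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, ("Ann",)).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " ".join(row[-1] for row in rows)


total_before, plan_before = run(conn), plan(conn)

# A storage-level change: the application query is not touched.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

total_after, plan_after = run(conn), plan(conn)

print(total_before, "|", plan_before)
print(total_after, "|", plan_after)
```

The result of the query is the same before and after the index is created, while the reported plan differs. The exact wording of the plan text depends on the SQLite version, so it is best treated as diagnostic output rather than something to rely on.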

In a way, the current generation of data virtualization servers is fulfilling Codd’s and Parnas’s dreams. Data independence (aka decoupling, abstraction, and information hiding) may not be the most popular concept being discussed in the IT industry today, but it is still a significant one because it is directly related to productivity and flexibility. Therefore, it deserves more attention when designing data architectures than it receives today.

Rick F. van der Lans