The business value of applying data science in organizations is incontestable. Prescriptive and descriptive models can help improve business and decision-making processes. Data science work can be divided into analytical work and data preparation work. Examples of data preparation activities are organizing access to data sources; extracting source data; transforming, integrating, and cleansing data; and removing outliers. Studies and practice have shown that, unfortunately, data scientists spend a large part of their time on data preparation.
In fact, some say that they spend approximately 80% of their time on data preparation and, thus, only 20% on the real analytical work. Data scientists would be more productive if the data preparation phase were shortened, because it would allow them to spend more time on creating new data science models.
Many technologies and techniques can shorten the data preparation phase. This article describes how one of them, data virtualization, can help to shorten it.
Integrating Data from Source Systems
In most situations, the data needed for creating models has to come from several source systems. For each system, data scientists need to organize access, deal with a specific security mechanism, extract data, and so on. Deploying data virtualization simplifies and speeds up access to all the source systems, because to the data scientist it looks as if all the data is stored in one system. It also eases the integration of data from different source systems; integration becomes a simple join.
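As a rough illustration of "integration becomes a simple join", the sketch below uses Python's built-in sqlite3 module as a stand-in for a data virtualization server. Two separate in-memory databases play the role of two source systems. The schema names (crm, erp), the table and column names, and the sample rows are all assumptions made for this example; a real data virtualization product would connect to the actual sources instead of attached SQLite databases.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Create one connection that exposes two separate databases.

    Each attached in-memory database stands in for an independent
    source system (hypothetical names: crm and erp).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("ATTACH DATABASE ':memory:' AS erp")

    conn.execute(
        "CREATE TABLE crm.customers (customer_id INTEGER PRIMARY KEY, name TEXT)"
    )
    conn.execute(
        "CREATE TABLE erp.orders "
        "(order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
    )
    # Made-up sample data, for illustration only.
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    conn.executemany(
        "INSERT INTO erp.orders VALUES (?, ?, ?)",
        [(10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0)],
    )
    return conn


def revenue_per_customer(conn: sqlite3.Connection) -> list:
    """Integrate both sources with a single join through one interface."""
    query = """
        SELECT c.name, SUM(o.amount) AS revenue
        FROM crm.customers AS c
        JOIN erp.orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(revenue_per_customer(build_virtual_layer()))
```

The point of the sketch is only that the data scientist writes one query against one connection, without handling each source's access mechanism separately.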
One Language to Bind Them All
Different source systems may support different interfaces, languages, and database concepts, such as those of Hadoop systems, SQL databases, proprietary cloud applications, and NoSQL systems, and some sources may simply be CSV files. Without a unifying layer, data scientists need to understand them all in detail. When data virtualization is deployed, however, all the systems can be accessed through one and the same interface. Regardless of what the real interface or language of a source system is, data scientists can select a single interface through which to access them all.
The data in source systems almost always needs to be processed, filtered, masked, validated, transformed, and so on, before it can be used for analytics. Normally, this is done by taking the source data, applying all the required operations, and then storing the result in a separate database or file. This leads to redundant data that needs to be managed and secured. With data virtualization, all those operations can also be defined, but are executed on demand and without having to store the data. This takes away a lot of the hassle that comes with storing redundant data.
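The idea of defining operations once and executing them on demand can be sketched with an ordinary SQL view, again using sqlite3 as a stand-in for a data virtualization server. The table, the view name clean_patients, the masking rule, and the age range used as a validity filter are assumptions chosen for this example. Nothing is copied: the view is evaluated each time it is queried, so new source rows show up without a reload.

```python
import sqlite3


def make_source() -> sqlite3.Connection:
    """Create a hypothetical source table plus a view that filters and masks."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (id INTEGER, email TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO patients VALUES (?, ?, ?)",
        [(1, "ann@example.com", 34), (2, "bob@example.com", -1)],
    )
    # The view stores only the definition, not the data.
    # It masks the local part of the e-mail address and drops
    # rows with an implausible age (assumed valid range: 0 to 120).
    conn.execute(
        """
        CREATE VIEW clean_patients AS
        SELECT id,
               '***' || substr(email, instr(email, '@')) AS email_masked,
               age
        FROM patients
        WHERE age BETWEEN 0 AND 120
        """
    )
    return conn


def read_clean(conn: sqlite3.Connection) -> list:
    """Query the view; the filter and mask run at query time."""
    return conn.execute("SELECT * FROM clean_patients ORDER BY id").fetchall()


if __name__ == "__main__":
    conn = make_source()
    print(read_clean(conn))
    # A new source row is visible through the view immediately.
    conn.execute("INSERT INTO patients VALUES (3, 'cy@example.com', 51)")
    print(read_clean(conn))
```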
Sharing of Specifications
Everything defined for one data scientist can be shared with others. For example, specifications for integrating two source systems, masking data, securing data, filtering data, and transforming data values need to be defined only once. All data scientists can reuse these specifications. With data virtualization, even colleagues using other data science tools can share these specifications.
Cache When Necessary
If needed, for example for performance reasons or to minimize interference with the source systems, data can be physically copied to another data storage technology using the caching technology supported by data virtualization. This changes neither the interface nor the applications of the data scientists.
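The following is a minimal sketch of why caching does not change the interface. The class CachedView, its rows method, and the time-to-live refresh policy are hypothetical names and choices made for this example, not the API of any data virtualization product. The caller always calls rows(); whether the data comes from the source or from the stored copy is decided inside the class.

```python
import time
from typing import Callable, List, Optional, Tuple

Row = Tuple


class CachedView:
    """Hypothetical view wrapper that keeps a copy of the source rows.

    The source is only queried when no copy exists yet or when the
    copy is older than ttl_seconds (an assumed refresh policy).
    """

    def __init__(
        self,
        fetch_from_source: Callable[[], List[Row]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_from_source
        self._ttl = ttl_seconds
        self._clock = clock
        self._cached: Optional[List[Row]] = None
        self._loaded_at = 0.0

    def rows(self) -> List[Row]:
        """Return the rows; same call whether or not the cache is used."""
        now = self._clock()
        if self._cached is None or now - self._loaded_at >= self._ttl:
            # Cache miss or expired copy: go back to the source system.
            self._cached = list(self._fetch())
            self._loaded_at = now
        # Return a copy so callers cannot modify the cached rows.
        return list(self._cached)


if __name__ == "__main__":
    source_calls = []

    def slow_source() -> List[Row]:
        source_calls.append(1)  # count how often the source is hit
        return [(1, "Acme"), (2, "Globex")]

    view = CachedView(slow_source, ttl_seconds=60.0)
    view.rows()
    view.rows()
    print("source queried", len(source_calls), "time(s)")
```

A time-based refresh is only one possible policy; scheduled or event-driven refreshes would fit behind the same rows() call.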
The Power of Metadata
Without descriptive and defining metadata, data is useless. To interpret data correctly, data scientists need access to metadata. With data virtualization, every data scientist can enter, access, and easily search for metadata. Data and metadata can also be combined and presented to all data scientists.
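To make entering and searching metadata concrete, here is a small sketch of a searchable catalog. The ColumnMetadata and MetadataCatalog classes, their fields, and the sample entries are assumptions invented for this example; real data virtualization servers provide their own catalog features.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ColumnMetadata:
    """Descriptive metadata for one column of a virtual view (hypothetical)."""

    view: str
    column: str
    description: str
    unit: str = ""


class MetadataCatalog:
    """Hypothetical in-memory catalog that every data scientist can search."""

    def __init__(self) -> None:
        self._entries: List[ColumnMetadata] = []

    def add(self, entry: ColumnMetadata) -> None:
        """Register the metadata for one column."""
        self._entries.append(entry)

    def search(self, term: str) -> List[ColumnMetadata]:
        """Case-insensitive match on column name or description."""
        needle = term.lower()
        return [
            entry
            for entry in self._entries
            if needle in entry.column.lower()
            or needle in entry.description.lower()
        ]


if __name__ == "__main__":
    catalog = MetadataCatalog()
    catalog.add(
        ColumnMetadata("sales", "revenue", "Net revenue per order", unit="EUR")
    )
    catalog.add(
        ColumnMetadata("sales", "order_date", "Date the order was placed")
    )
    for hit in catalog.search("revenue"):
        print(hit.view, hit.column, hit.description, hit.unit)
```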
More Time for Analytics
Due to the above features, data virtualization shortens the data preparation phase for data scientists. Many specifications are much easier to define and to share with data virtualization. But there is still work to be done. For example, how the data virtualization server needs to connect to a source system still needs to be studied and defined. I recommend that this technical work be left to data engineers, because they have more experience with it.
Currently, many data scientists are not familiar with data virtualization. If you want to shorten the data preparation phase, I strongly recommend that you study this technology and see how it can benefit you.