Data Virtualization is the CDO’s Best Friend

According to CIO magazine, the first chief data officer (CDO) was appointed at Capital One in 2002, and since then the role has become widespread, driven by the recent explosion of big data.

The CDO role has a variety of definitions, but Wikipedia’s is as good as any: “a corporate officer responsible for enterprise-wide governance and utilization of information as an asset, via data processing, analysis, data mining, information trading, and other means.”

This definition also illustrates a real problem with the name. The CDO is responsible for information as an asset but could not be called the chief information officer (CIO) because that title was already taken. The distinction is important. Data is best thought of as information from which the context has been stripped, as I described in my book, Business unIntelligence. That context includes meaning, usage, ownership, and more—all significant considerations for success as a CDO.

How the CDO Became a Very Important Person

In its raw form, big data, coming mainly from external sources, is mostly loaded into data lakes—loosely structured and lightly defined data stores based on the extended Hadoop ecosystem. Often, data lakes are poorly managed, filled with duplicate and ill-defined data, leading to the moniker “data swamps,” a term used by Michael Stonebraker as far back as 2014.

Over the same period, businesses began to recognize the value of running analytics on social media and Internet-of-Things data—if only the data in the lake could be trusted. Suddenly, data governance, which data warehouse experts had been recommending for years, gained respectability and value. Not only that, but businesses finally accepted that responsibility for data quality rested with them, and that they needed focus and drive from the executive level. Within a few years, the CDO role had become highly desirable (although, of course, it can never match the sexiness of the data scientist!).

However, appointing a CDO is only the start of a long journey to good data governance. A significant proportion of that journey will be consumed by organizational issues, methodology questions, and a struggle to keep (or gain) business commitment to the value of quality data, even when that conflicts with shorter-term financial gain. Yet even with such a focus, CDOs need tools and technologies to support the undertaking if they are to have any hope of success. But which tools and technologies will provide the most effective support?

The Answer is a Logical Data Warehouse with Data Virtualization

Getting a handle on data quality in a data lake that is becoming more swamp-like with every new onboarding of external data is a difficult challenge. A data warehouse, in which governance is traditionally well-embedded, would be a better starting point from which to expand efforts. In addition, a data warehouse is a key source of core business information—such as customer IDs, product catalogs, and key account information—one that is vital to the process of contextualizing and cleansing the data in the lake.

But when the data lake and the data warehouse are in different locations and use different technologies, how can one combine data from the two stores for cleansing and contextualizing? The traditional answer would have been to copy the data from the lake into the warehouse and do the work there. Of course, this cannot work: there is too much data in the lake, arriving too quickly and in too many formats (the three Vs of big data), for this approach to be feasible. The only possibility for such data is to “play it where it lies.” That means accessing and joining data, invisibly to the requester, across multiple stores using a data virtualization tool, such as the Denodo Platform. This approach is also called a logical data warehouse, a term popularized by Gartner and first introduced as far back as 2012.
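To make the idea concrete, here is a minimal sketch, written in plain Python rather than in the Denodo Platform itself, of what such a virtual join looks like: each store is queried in place, a filter is pushed down to the source, and the results are combined at query time with no bulk copy into the warehouse. The store contents, the names, and the virtual_customer_activity view are purely illustrative.

# Conceptual sketch only (not the Denodo Platform API), illustrating the
# "play it where it lies" idea using nothing but the standard library.
import sqlite3

# The "warehouse": governed customer master data, simulated with SQLite.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customer (customer_id TEXT, name TEXT, segment TEXT)")
warehouse.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("C001", "Acme Ltd", "Enterprise"), ("C002", "Bloggs & Co", "SMB")],
)

# The "lake": raw clickstream events, simulated with an in-memory list.
lake_events = [
    {"customer_id": "C001", "event": "page_view", "ts": "2020-05-01T10:00:00"},
    {"customer_id": "C002", "event": "download",  "ts": "2020-05-01T10:05:00"},
    {"customer_id": "C001", "event": "signup",    "ts": "2020-05-01T10:07:00"},
]

def virtual_customer_activity(segment):
    """A 'virtual view': each source is queried in place and the results
    are joined on the fly, invisibly to the requester."""
    # Push the filter down to the warehouse instead of copying everything.
    rows = warehouse.execute(
        "SELECT customer_id, name FROM customer WHERE segment = ?", (segment,)
    ).fetchall()
    customers = {cid: name for cid, name in rows}
    # Join lake events against the governed customer list at query time.
    return [
        {"customer": customers[e["customer_id"]], **e}
        for e in lake_events
        if e["customer_id"] in customers
    ]

print(virtual_customer_activity("Enterprise"))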

The approach just described is a real-time process that occurs while a businessperson (or an app) runs a query that retrieves data from two or more disparate stores and joins the results into a single response. How, you may ask, does that help a CDO to drive data governance or ensure data quality? The run-time process does not; it is the prior design phase, enabling queries to be split up and distributed, that supports the CDO.

Data Virtualization Design Is the Secret to Good Governance

Data virtualization implementations are built on models of the source data stores, and their relationships, at both logical (meaning, usage, ownership, etc.) and physical (data structure, location, access, etc.) levels. Such modelling forms the low-level, detailed foundation for practical data governance. The output is a map that enables and delivers governed, real-time responses at run time using data virtualization.
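As a rough illustration of what such a map might record (the field names and source details below are hypothetical, not the Denodo Platform’s actual metadata catalog), each source can be described at both levels and the two descriptions tied together into a single mapping:

# Toy metadata model, assumed for illustration only.
from dataclasses import dataclass

@dataclass
class LogicalModel:
    business_name: str   # what the data means to the business
    definition: str      # agreed meaning, e.g. a glossary entry
    owner: str           # accountable data owner, a key concern for the CDO
    permitted_use: str   # usage and sharing constraints

@dataclass
class PhysicalModel:
    system: str          # the store in which the data physically lives
    location: str        # connection string, path, or URI
    structure: str       # table, Parquet file, JSON stream, ...
    access_method: str   # JDBC, REST, HDFS, ...

@dataclass
class SourceMapping:
    """One entry in the map that drives governed, real-time access."""
    logical: LogicalModel
    physical: PhysicalModel

customer_master = SourceMapping(
    logical=LogicalModel(
        business_name="Customer",
        definition="A party holding at least one active contract",
        owner="Head of Sales Operations",
        permitted_use="Internal analytics; no export without consent",
    ),
    physical=PhysicalModel(
        system="Enterprise data warehouse",
        location="jdbc:postgresql://dwh.example.com/edw",
        structure="Table: crm.customer",
        access_method="JDBC",
    ),
)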

Such modelling is largely the same as that required to build extract, transform, and load (ETL) systems in traditional data warehousing. So why are ETL systems not also the secret to good governance? Ironically, the answer lies in the fact that ETL systems run as offline processes, in which quality can be ensured behind the scenes before the data is made available to the business. With data virtualization, any errors in the modelling phase directly impact run-time results, and business executives hear about the resulting problems very early.

The CDO’s Best Friend

Data virtualization thus ensures, and indeed forces, closer attention to data quality and governance up front, in the design phase of every application that combines externally and internally sourced data, in effect across the entire scope of digital transformation. The CDO gains significant benefit both from the early and high visibility of data quality issues and from the extra governance work consequently undertaken in the design phase.

On the often long and tortuous journey of data governance, data virtualization is the CDO’s tool, companion, and best friend.

Barry Devlin