Integrating Data Quality into the Data Flow
I have long been an advocate of integrating data quality techniques as a core component of an organization’s end-to-end data processes. When I began evangelizing data quality management about twenty years ago, the common practices for data quality involved post-processing application of data standardization and quality tools that (presumably) “cleansed” values within a selected data set. I had (and still have, in fact) some issues with that approach, mostly because the expectations do not align with key data quality best practices:
- Changing the data: The data cleansing process, by definition, takes a data set in the form in which it was provided and changes its values to whatever a third party’s product has deemed to be “correct.” While there may be some circumstances where modifying data is acceptable, there are many more in which only a limited set of individuals is authorized to modify the data. For example, I once worked with a health care insurance company that used Postal Service tools to identify when a member’s address had changed. However, only the members themselves had the authority to update their own personal data (including delivery address). Therefore, even though the company knew that the data was out of date, it had no right to change that data, only the directive to attempt to contact the members through alternate means and ask them to update their own data.
- Presumption of correctness: Just because a product updates a value to what it deems a “correct” value does not mean the result is actually correct. I once met a gentleman whose first name was “Micheal,” which data quality tools constantly and incorrectly changed to “Michael.”
- Inconsistency with source: When data cleansing is applied post-process, the resulting data set’s values are no longer comparable to the original data set. This inconsistency with the source can confuse those who accessed the data set at different points in the process.
- Lack of synchronization among equals: When the same data set is used as a source for multiple back-end applications, each consumer application might apply different data quality rules and end up with different results. This leads to inconsistency across downstream consumers.
In many cases, it is this last issue that is the most insidious, mostly because no one is aware that different data consumers are working from inconsistent views.
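To make this last issue concrete, here is a minimal sketch in Python of two consumers reading the same source but applying their own rules. Everything in it is hypothetical and for illustration only: the field names, the `NAME_CORRECTIONS` lookup table, and the two consumer functions are assumptions, not part of any particular product.

```python
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical "correction" table used by one consumer's cleansing tool.
NAME_CORRECTIONS = {"Micheal": "Michael"}


def consumer_a_view(records: List[Record]) -> List[Record]:
    """Consumer A rewrites first names using its correction table."""
    return [
        {**r, "first_name": NAME_CORRECTIONS.get(r["first_name"], r["first_name"])}
        for r in records
    ]


def consumer_b_view(records: List[Record]) -> List[Record]:
    """Consumer B only trims whitespace and leaves names as provided."""
    return [{**r, "first_name": r["first_name"].strip()} for r in records]


def diverging_ids(view_a: List[Record], view_b: List[Record]) -> List[str]:
    """Return member IDs whose records differ between the two views."""
    b_by_id = {r["member_id"]: r for r in view_b}
    return [r["member_id"] for r in view_a if r != b_by_id.get(r["member_id"])]


source = [
    {"member_id": "M1", "first_name": "Micheal"},
    {"member_id": "M2", "first_name": "Dana"},
]
mismatches = diverging_ids(consumer_a_view(source), consumer_b_view(source))
print(mismatches)
```

Neither consumer would notice the discrepancy on its own; it only surfaces when someone compares the two views side by side.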
In the best scenario, you are able to embed data validation into the points where data enters your processing systems. Data quality validation rules can be collected from among the downstream consumers and merged together into a common set of validity expectations applied when a data set is acquired, or where the data are first created. Data instances that do not comply with the validity rules can be pushed back to the origination point for remediation. This addresses each of my concerns:
- you are not changing the data yourself, but asking the owner to do it;
- you are not relying on a product to correct the data;
- there won’t be an issue of inconsistency with the source; and
- all the downstream consumers will see the same values.
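The entry-point approach can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: the rule names, the record fields, and the two consumers (`billing_rules`, `mailing_rules`) are invented for the example, and merging by rule name assumes that two consumers using the same name mean the same rule.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Callable[[Record], bool]


def _member_id_present(r: Record) -> bool:
    return bool(r.get("member_id", "").strip())


def _zip_is_5_digits(r: Record) -> bool:
    z = r.get("zip", "")
    return len(z) == 5 and z.isascii() and z.isdigit()


def _address_present(r: Record) -> bool:
    return bool(r.get("address", "").strip())


# Hypothetical validity rules contributed by two downstream consumers.
billing_rules: Dict[str, Rule] = {
    "member_id_present": _member_id_present,
    "zip_is_5_digits": _zip_is_5_digits,
}
mailing_rules: Dict[str, Rule] = {
    "zip_is_5_digits": _zip_is_5_digits,
    "address_present": _address_present,
}


def merge_rules(*rule_sets: Dict[str, Rule]) -> Dict[str, Rule]:
    """Merge consumer rule sets into one common set, keyed by rule name."""
    merged: Dict[str, Rule] = {}
    for rule_set in rule_sets:
        merged.update(rule_set)
    return merged


def validate_at_entry(
    records: List[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into accepted ones and ones to push back to the source.

    Rejected records are returned unmodified, paired with the names of the
    rules they failed, so the originator can remediate them.
    """
    accepted: List[Record] = []
    rejected: List[Tuple[Record, List[str]]] = []
    for record in records:
        failed = [name for name, rule in rules.items() if not rule(record)]
        if failed:
            rejected.append((record, failed))
        else:
            accepted.append(record)
    return accepted, rejected


incoming = [
    {"member_id": "M1", "zip": "20910", "address": "1 Main St"},
    {"member_id": "M2", "zip": "2091", "address": ""},
]
accepted, rejected = validate_at_entry(incoming, merge_rules(billing_rules, mailing_rules))
```

Note that nothing here corrects a value: a failing record goes back to its originator along with the list of rules it did not satisfy, and only accepted records flow on to the consumers.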
However, in many cases, we will not have access to the original owner of ingested data, requiring a different thought process to try to address the same concerns. One idea is to introduce a buffer between the data sets and the consumers and integrate the data validation into that buffer. In this approach, the common data validation rules can be accumulated between the points of access and the points of delivery. Data instances that don’t meet the standards can be augmented with a flag indicating which rules were not obeyed, a limited set of standards can be applied consistently, and the resulting output data set will be the same for all the consumers.
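A buffer of this kind might look something like the sketch below. It is only an illustration under assumptions: `BufferedRecord`, `through_buffer`, the rule names, and the standardizers are all hypothetical, and validating the standardized view rather than the raw source values is a design choice of this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Record = Dict[str, str]


@dataclass(frozen=True)
class BufferedRecord:
    source: Record               # values exactly as received
    view: Record                 # values after the limited standardization
    violations: Tuple[str, ...]  # names of the rules the view failed


# A deliberately limited set of standardizations (hypothetical):
# formatting only, no "correction" of content such as names.
STANDARDIZERS: Dict[str, Callable[[str], str]] = {
    "state": lambda v: v.strip().upper(),
    "zip": str.strip,
}

# Common validation rules shared by all consumers (hypothetical).
RULES: Dict[str, Callable[[Record], bool]] = {
    "zip_is_5_digits": lambda r: len(r.get("zip", "")) == 5
    and r["zip"].isascii()
    and r["zip"].isdigit(),
    "state_is_2_letters": lambda r: len(r.get("state", "")) == 2
    and r["state"].isalpha(),
}


def through_buffer(
    record: Record,
    rules: Dict[str, Callable[[Record], bool]] = RULES,
    standardizers: Dict[str, Callable[[str], str]] = STANDARDIZERS,
) -> BufferedRecord:
    """Standardize a copy of the record and flag any rules it violates."""
    view = dict(record)
    for field_name, standardize in standardizers.items():
        if field_name in view:
            view[field_name] = standardize(view[field_name])
    violations = tuple(name for name, rule in rules.items() if not rule(view))
    return BufferedRecord(source=dict(record), view=view, violations=violations)


raw = {"first_name": "Micheal", "state": " md ", "zip": "2091"}
result = through_buffer(raw)
print(result.view, result.violations)
```

Because every consumer reads through the same function, each one receives the same standardized view and the same flags, while the source values remain available for comparison.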
Data virtualization provides this buffer. Not only can data virtualization provide a common model for representation, but the data validation steps can also be applied when accumulating the results of federated queries, and the limited standardization can be part of the rendering process for presenting the combined result set. While this does not address all of my issues, it critically enforces a synchronized view presented to all the data consumers and institutes a level of control over the types of modifications and corrections that are applied.
By David Loshin, August 24, 2018