IT excels at copying data. It is well known that organizations store data in volumes that continue to grow, but most of this data is not new or original; much of it is copied. For example, data about a specific customer can be stored in a transactional system, a staging area, a data warehouse, several data marts, and a data lake. Even within one database, data can be stored several times to support different data consumers. Additionally, redundant copies are kept in development and test environments. Business users copy data as well, for example from central databases to private files and spreadsheets. The growth of data within organizations is enormous, but a large part of it consists of non-unique, redundant data.
In addition to all these intra-organizational forms of data redundancy, there is also inter-organizational data redundancy. Organizations exchange data with each other, and in almost all of these situations the receiving organizations store the data in their own systems, resulting in yet more copies. Between government organizations especially, a lot of data is sent back and forth.
Some redundant data is not stored in its original form but in an aggregated or compressed form. For example, many data marts contain pre-aggregated data, such as individual sales transactions rolled up by the minute.
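To make this concrete, here is a minimal sketch with hypothetical sample data: individual sales transactions rolled up into per-minute totals, the kind of derived, redundant copy a data mart typically stores.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical individual sales transactions (timestamp, amount) as they
# might live in the source system.
transactions = [
    (datetime(2021, 5, 6, 9, 30, 12), 19.99),
    (datetime(2021, 5, 6, 9, 30, 45), 5.00),
    (datetime(2021, 5, 6, 9, 31, 3), 12.50),
]

# Aggregate by the minute: truncate each timestamp and sum the amounts.
per_minute = defaultdict(float)
for ts, amount in transactions:
    per_minute[ts.replace(second=0, microsecond=0)] += amount

# The mart now holds two one-minute buckets; the individual transactions
# are gone, so this is a compressed, redundant derivative of the source.
```

Note that the aggregation is lossy: once only the per-minute copy exists, the original transactions cannot be reconstructed from it.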
There used to be compelling reasons to store data multiple times, but database server performance and network speeds have improved enormously, making redundant storage needless in most cases. Unfortunately, new data architectures are still designed to store data redundantly; some data infrastructures are simply overloaded with data lakes, data hubs, data warehouses, and data marts. We think too casually about copying data and storing it redundantly. We create redundant data too easily, which leads to the following five drawbacks:
- Change Control Issues
First, when source data is changed, how can we guarantee that all of the copies, whether stored in internal IT systems, in business users’ files, or by other organizations, are changed accordingly and consistently? This is very difficult, especially when there is no overview of which copies exist, internally and externally.
- Compliance Struggles
Second, storing data in multiple systems complicates complying with GDPR. As JT Sison writes: “An important principle in the European Union’s General Data Protection Regulation (GDPR) is data minimization. Data processing should only use as much data as is required to successfully accomplish a given task. Additionally, data collected for one purpose cannot be repurposed without further consent.” Implementing the “right to be forgotten” becomes much more complex when customer data is scattered across many databases and files. And if data has been stored that many times, do we know where all the copies are?
- Manual Labor Woes
Third, the more often data is stored redundantly, the less flexible the data architecture becomes. If the source data definition changes, all of the copies need to change as well. For example, if the data type of a column changes, or if the codes used to identify certain values change, updating all of the copies is an enormous, largely manual undertaking.
- Data-Access Slowdowns
The fourth drawback relates to data latency. Accessing copied data means working with non-real-time data. The copy frequency largely determines the data latency: the lower the frequency, the higher the latency. In some situations the copying process itself takes hours, which increases the latency further. Data latency becomes very high when data is copied multiple times before it becomes available for use.
- Data Quality Challenges
Finally, copying data can also lead to incorrect data, causing data quality issues. Processes responsible for copying data can crash midway, leaving the target data store in an unknown state. To fix this, data may need to be unloaded and reloaded, and this repeated unloading, fixing, and reloading can itself introduce incorrect data: some data may be loaded twice, not loaded at all, or loaded incorrectly. Each copying process is a potential data quality leak.
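One common mitigation for this last drawback is to run each copy job as a single transaction, so that a mid-copy crash rolls everything back instead of leaving a partial load, and rerunning the job cannot load rows twice. The sketch below uses SQLite and a hypothetical `sales` table purely for illustration; it is a minimal example, not a production pipeline.

```python
import sqlite3

def copy_table(src: sqlite3.Connection, dst: sqlite3.Connection) -> None:
    """Refresh dst's sales table from src's sales table atomically.

    The whole load runs inside one transaction: if the process crashes
    midway, the rollback leaves the target in its previous, known state
    instead of an unknown, partially loaded one.
    """
    rows = src.execute("SELECT id, amount FROM sales").fetchall()
    with dst:  # commits on success, rolls back on any exception
        dst.execute("DELETE FROM sales")  # full refresh keeps retries idempotent
        dst.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", rows)

# Demo with in-memory databases standing in for source system and data mart.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
dst.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
src.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.99), (2, 5.00)])

copy_table(src, dst)
copy_table(src, dst)  # rerunning is safe: target still matches the source
```

The full-refresh-in-one-transaction pattern trades load time for safety; incremental loads need more careful bookkeeping to stay idempotent.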
The Way Forward
Data minimization should be a guiding principle for any new data architecture. This applies to redundant data storage within an organization, but also to inter-organizational data flows. Data should not be sent to other organizations periodically; organizations should request the data when they need it, without having to store it themselves. In this respect, data should be available just like video on demand: we do not download and copy a video just to view it; when we want to watch it, it is simply streamed to us. Similarly, data should not be unnecessarily copied, and data minimization should be our guiding principle. Data infrastructures should offer data-on-demand to internal and external users.
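As an illustration of the on-demand idea, here is a minimal sketch in which a consumer always queries the data owner's service instead of keeping its own copy. The `DataProvider` class and its methods are hypothetical stand-ins for whatever data API the owning organization exposes.

```python
class DataProvider:
    """Hypothetical data service: the single authoritative store,
    queried on demand by consumers who never store their own copy."""

    def __init__(self) -> None:
        self._customers = {1: {"name": "Acme", "city": "Utrecht"}}

    def get_customer(self, customer_id: int) -> dict:
        # Always served from the source, so consumers see current data
        # and there is no data latency from copy schedules.
        return dict(self._customers[customer_id])

    def update_city(self, customer_id: int, city: str) -> None:
        self._customers[customer_id]["city"] = city

provider = DataProvider()
before = provider.get_customer(1)["city"]   # "Utrecht"
provider.update_city(1, "Amsterdam")
after = provider.get_customer(1)["city"]    # "Amsterdam": no stale copy to refresh
```

Because there is only one store, the change-control, compliance, and latency drawbacks above largely disappear: an update or an erasure at the source is immediately what every consumer sees.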
- Data Minimization as Design Guideline for New Data Architectures - May 6, 2021