Too often the design of new data architectures is based on old principles: they are still very data-store-centric. They consist of many physical data stores in which data is stored repeatedly and redundantly. Over time, new types of data stores, such as data lakes, data hubs, and data lakehouses, have been introduced, but these are still data stores into which data must be copied. In fact, when a data lake is added to a data architecture, it’s quite common to introduce not one data store but a whole set of them, called zones or tiers. Having numerous data stores also means that many programs must be developed, maintained, and managed to copy data between them.
Modern Forms of Data Usage
Organizations want to do more with data and support new forms of data usage, from the most straightforward ones to the most complex and demanding ones. These new, more demanding use cases are driven by initiatives described by such phrases as “becoming data-driven” and “digital transformation.” Most organizations have evaluated their current ICT systems and have found them unable to adequately support these new forms of data usage. Conclusion: a new data architecture is required.
Legacy Design Principles Will Not Suffice
As indicated, the inclination is still to develop data processing infrastructures based on a data architecture that is centered around data stores and copying data. This repeated copying and storing of data reminds me of designing systems in the mainframe era. In these legacy architectures, data was also copied and transformed step by step in a batch-like manner. Shouldn’t the goal be to minimize data stores, data redundancy, and data copying processes?
Data-store-centric thinking exhibits many problems. First, the more often data is physically copied before it’s available for consumption, the higher the data latency. Second, each copying process may introduce a data quality problem. Third, physical databases can be time-consuming to change, resulting in inflexible data architectures. Fourth, from a GDPR perspective, storing customer data, for example, in several databases makes compliance harder. Fifth, such architectures are not very transparent, leading to report results that are less trusted by business users. And so on.
New data architectures should be designed to be flexible, extensible, easy to change, and scalable. They should offer low data latency (for the business users who need it) with high data quality, deliver highly trusted reporting results, and enable easy enforcement of GDPR and comparable regulations.
Agile Architecture for Today’s Data Usage
During the design of any new data architecture, the focus should be less on storing data (repeatedly) and more on processing and using the data. When designing a new data architecture, deploy virtual solutions where possible. Data virtualization enables data to be processed with less need to store the processed data before it can be consumed by business users. Some IT specialists might be worried about the performance of a virtual solution, but if we look at the performance of some newer database servers, that worry is unnecessary.
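To make the difference between a physical and a virtual solution concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only an illustration: a plain SQL view stands in for a data virtualization layer, and the table, view, and column names (orders, sales_by_region_copy, sales_by_region_view) are hypothetical. The copied table holds a snapshot that goes stale until a copy program runs again, whereas the view is computed from the source data at query time.

```python
import sqlite3


def build_demo():
    """Create an in-memory source table plus a physical copy and a virtual view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (region, amount) VALUES (?, ?)",
        [("EU", 100.0), ("EU", 50.0), ("US", 70.0)],
    )
    # Physical approach: the aggregated data is copied into a second data store.
    conn.execute(
        "CREATE TABLE sales_by_region_copy AS "
        "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
    )
    # Virtual approach: only the definition is stored, not the data.
    conn.execute(
        "CREATE VIEW sales_by_region_view AS "
        "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
    )
    return conn


def totals(conn, relation):
    """Return {region: total} for the given (hypothetical) table or view name."""
    rows = conn.execute(
        f"SELECT region, total FROM {relation} ORDER BY region"
    ).fetchall()
    return dict(rows)


conn = build_demo()

# New source data arrives after the copy was made.
conn.execute("INSERT INTO orders (region, amount) VALUES ('US', 30.0)")

copy_totals = totals(conn, "sales_by_region_copy")  # stale until re-copied
view_totals = totals(conn, "sales_by_region_view")  # reflects the new order
```

In this sketch the copy still reports the old US total, while the view already includes the new order. A real data virtualization server adds much more (federation across sources, caching, security), so treat this purely as a picture of the principle.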
Most organizations need new data architectures to support the fast-growing demands for data usage. Don’t design a data architecture based on old architectural principles. Don’t make it data-store-centric. Focus on the flexibility of the architecture. Prefer a virtual solution over a physical one. This will enable ICT systems to keep up with the speed of business more easily while providing better, faster support for new forms of data usage.