Organizations are rethinking their current data architectures, and most of them find it a challenge. One reason is that they don’t do this every day. Another is that insights into how to design data architectures have changed over time. This article covers two important new insights for designing data architectures.
Insight 1: Data architectures and technology go hand in hand.
In the old days, a new data architecture was first designed conceptually and independently of the technology and products. Only when the conceptual architecture was finished did we select the right products. In a way, we assumed that products were interchangeable and that we would always be able to find products to fit the architecture. Perhaps, at the time, many products were interchangeable. Just remember the homogeneity of the SQL database servers and ETL tools. The differences were in the details.
In the last decade, however, many new technologies have been introduced that have very powerful features and are not only highly scalable, but also quite unique. Take the popular Snowflake SQL engine. If you take a quick glance at this product, you may think it’s just another SQL product. But if you look more closely, you will see that it offers some unique features. Or, take the Fivetran ETL product that dictates a large part of your data architecture. These products are not interchangeable at all. To really exploit their strengths, the data architecture has to be based on these tools.
Therefore, the recommendation for designing a new data architecture is to solve two puzzles simultaneously: which architecture fits best, and which products should be used to implement it?
Insight 2: Minimize data-store-centric thinking.
Many architects focus on data stores when they design data architectures. Just look at most diagrams depicting new data architectures — they are overloaded with data stores with names like data hub, data lake, data warehouse, data mart, staging area, and data lake house. It is as if they form the main building blocks of a data architecture.
Reduce this focus on data stores. The focus should be on which data processing specifications (DPSs) are needed and how they should be implemented. DPSs indicate how the data needs to be filtered, aggregated, cleansed, integrated, pseudonymized, masked, or calculated when it is transmitted from a source system to a dashboard, Java app, or data science tool. In fact, every data architecture is about extracting data from data sources, processing it, and then making it available for data consumption.
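To make the notion of a DPS concrete, here is a minimal Python sketch of one specification that cleanses and aggregates data on its way from a source to a consumer, plus a separate pseudonymization step. Everything in it is an assumption made for illustration: the sample rows, the field names, and the function names are hypothetical, and the truncated hash is only a stand-in for a real pseudonymization method.

```python
import hashlib
from collections import defaultdict

# Hypothetical source rows; the field names are assumptions for illustration.
ORDERS = [
    {"customer": "alice@example.com", "country": "NL", "amount": 120.0},
    {"customer": "bob@example.com", "country": "NL", "amount": 80.0},
    {"customer": "carol@example.com", "country": "DE", "amount": None},
    {"customer": "dave@example.com", "country": "DE", "amount": 50.0},
]


def cleanse(rows):
    """Drop rows whose amount is missing."""
    return [r for r in rows if r["amount"] is not None]


def pseudonymize(rows):
    """Replace the customer identifier with a truncated SHA-256 digest.

    A plain hash is only a placeholder here, not a recommended method.
    """
    return [
        {**r, "customer": hashlib.sha256(r["customer"].encode()).hexdigest()[:12]}
        for r in rows
    ]


def aggregate_by_country(rows):
    """Sum the amounts per country."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["country"]] += r["amount"]
    return dict(totals)


def revenue_per_country(rows):
    """The DPS itself: cleanse first, then aggregate."""
    return aggregate_by_country(cleanse(rows))
```

Calling revenue_per_country(ORDERS) yields the totals a dashboard would show, while pseudonymize(cleanse(ORDERS)) could feed a data science tool. Note that no data store appears anywhere in the specification.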
We need to focus on the DPSs, because they represent our intellectual property. Our focus should be on how and where we implement the DPSs. For example, in many classic data warehouse architectures, DPSs are scattered across the entire environment. They are implemented within ETL programs, database stored procedures, little Python scripts, and BI tools. In data lakes for data scientists, they are implemented in hundreds of little Python or R programs. This is far from ideal because it is bad for productivity, maintenance, correctness, consistency, and auditability.
It’s important that we design architectures and select tools that enable us to centralize the DPSs. It should be possible to define them once and reuse them as many times as possible, and they should also be both documentable and transparent.
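The idea of defining a DPS once, documenting it, and reusing it can be sketched as a small central registry. This is an illustration under stated assumptions, not a description of any product: the registry, the dps decorator, the run and catalog helpers, and the sample data are all hypothetical names introduced here.

```python
# Hypothetical central registry: each DPS is defined once, with documentation.
DPS_REGISTRY = {}


def dps(name):
    """Register a function as a named data processing specification."""
    def register(func):
        if name in DPS_REGISTRY:
            raise ValueError(f"DPS {name!r} is already defined")
        DPS_REGISTRY[name] = func
        return func
    return register


@dps("active_customers")
def active_customers(rows):
    """Keep only customers whose status is 'active'."""
    return [r for r in rows if r["status"] == "active"]


@dps("mask_email")
def mask_email(rows):
    """Hide the local part of every email address."""
    return [{**r, "email": "***@" + r["email"].split("@", 1)[1]} for r in rows]


def run(names, rows):
    """Apply the registered DPSs in the given order."""
    for name in names:
        rows = DPS_REGISTRY[name](rows)
    return rows


def catalog():
    """Return name -> documentation, so the specifications stay transparent."""
    return {name: func.__doc__ for name, func in DPS_REGISTRY.items()}


# Hypothetical sample data.
CUSTOMERS = [
    {"email": "alice@example.com", "status": "active"},
    {"email": "bob@example.com", "status": "closed"},
]

# Two consumers reuse the same definitions instead of re-implementing them.
dashboard_rows = run(["active_customers", "mask_email"], CUSTOMERS)
data_science_rows = run(["active_customers"], CUSTOMERS)
```

Because both consumers go through the registry, a change to active_customers reaches the dashboard and the data science tool at the same time, and catalog() gives auditors one place to read what each specification does.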
In fact, how and where we implement DPSs determines whether a data store has the role of staging area, data lake, or data warehouse. In other words, the DPSs determine the nature of the data stores. For example, when the data that enters a specific data store has been heavily cleansed, it’s probably not a data lake. Or, when data is loaded into a data mart, it is commonly aggregated and the data structure is transformed into a star schema. It is these DPSs that make the data store a data mart.
Data virtualization for centralizing data processing specifications.
One technology that helps to implement and document DPSs is data virtualization. It can be seen as a mechanism to centralize their implementation. Therefore, it’s also a technology that influences the design of a data architecture. A data architecture with data virtualization looks very different from one without it.
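The core mechanism can be approximated with an ordinary SQL view: the DPS is defined once as a view, and every consumer queries that view instead of a copy of the data. The sketch below uses Python's built-in sqlite3 module with a single in-memory database as a stand-in. That is a simplifying assumption, since a real data virtualization server would resolve such views across separate source systems. The table names, view name, and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Two tables stand in for two separate source systems.
    CREATE TABLE crm_customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE erp_orders (customer_id INTEGER, amount REAL);

    INSERT INTO crm_customers VALUES (1, 'Alice', 'NL'), (2, 'Bob', 'DE');
    INSERT INTO erp_orders VALUES (1, 120.0), (1, 80.0), (2, 50.0);

    -- The DPS is defined once, as a view; no data is copied.
    CREATE VIEW revenue_per_customer AS
    SELECT c.id, c.name, c.country, SUM(o.amount) AS revenue
    FROM crm_customers AS c
    JOIN erp_orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name, c.country;
""")

# Consumer 1: a dashboard reads the view directly.
dashboard = conn.execute(
    "SELECT name, revenue FROM revenue_per_customer ORDER BY name"
).fetchall()

# Consumer 2: a report reuses the same view and aggregates further.
per_country = conn.execute(
    "SELECT country, SUM(revenue) FROM revenue_per_customer "
    "GROUP BY country ORDER BY country"
).fetchall()
```

Both consumers depend on one definition of revenue per customer, and because the view is evaluated at query time, new rows in the source tables show up without any reload step.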
When designing new data architectures, some of the old principles no longer apply. Designing one has become a complex puzzle that needs to be solved in a holistic way. Data architecture, technology, and security considerations should all be addressed simultaneously. The focus should be on centralizing the implementation of data processing specifications to meet ever-changing business requirements for data processing and usage.