Why Data Mesh Needs Data Virtualization
“Data mesh” is a new data analytics paradigm proposed by Zhamak Dehghani. It is designed to move organizations from monolithic architectures, such as the data warehouse and the data lake, to more decentralized architectures. As long-time supporters of logical and distributed architectures, we at Denodo share many of the data mesh principles. In this post, I will briefly summarize them and discuss why data virtualization is a key foundation for data mesh.
What is Data Mesh?
The data mesh paradigm arises from the insight that centralized, monolithic data architectures suffer from some inherent problems:
- A lack of business understanding in the data team. Too frequently, centralized data teams need to deal with data they do not fully understand to solve business problems that they also do not completely understand. This forces continuous back-and-forth between the data team and the business groups, slowing down the process and affecting the quality of the final result.
- The lack of flexibility of centralized data platforms. Centralizing all data into a single platform may be problematic, because the needs of big organizations are too diverse to be addressed by a single platform: one size never fits all.
- Slow data provisioning and response to changes. Every new data request from a business unit requires ingesting the data into the centralized system and changing the pipelines at every stage of the platform. This makes the system rigid and brittle when changes happen.
Data mesh aims to solve these problems by making organizational units (called “domains”) responsible for managing and exposing their own data to the rest of the organization. Domains have a better understanding of how their data should be used, which means fewer iterations before business needs are met and higher-quality results. This also removes the bottleneck of the centralized infrastructure and gives domains the autonomy to use the best tools for their particular situation.
Nevertheless, this also introduces some obvious risks such as creating data silos, duplicated effort across domains, and the lack of unified governance. To address these risks, data mesh introduces several additional concepts:
- Data as a product. The data exposed by the different domains must be easily discoverable, understandable, and usable by other units.
- Self-serve data platforms. Building and managing data infrastructure is complex. Not all domains will have the proper resources, and duplication of effort needs to be avoided. There should be a platform that domains can use in a self-service way to automate or simplify tasks such as data integration and transformation, implementation of security policies, data lineage, and identity management.
- Federated computational governance. To ensure that the data products created by the different domains can interoperate with each other, some level of standardization is needed. This covers the semantics of entities common across domains (e.g. customer and product entities) as well as technical aspects such as data product addressability and identity management. Some security policies may also need to be applied globally. When possible, all these standards and policies should be enforced automatically.
Given data virtualization’s long-time focus on providing unified data access, data security, and a data governance layer on top of distributed and heterogeneous data systems, it is well suited to supporting data mesh concepts. Let’s look into the details.
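To make the "data as a product" idea concrete, here is a minimal Python sketch of a product descriptor and an in-memory catalog. The class names, the fields, and the `<domain>.<name>` addressing convention are assumptions made for illustration; they are not part of the data mesh definition or of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for a data product published by a domain."""
    domain: str
    name: str
    description: str
    owner: str
    tags: tuple = ()

    @property
    def address(self) -> str:
        # Assumed addressing convention: "<domain>.<name>".
        return f"{self.domain}.{self.name}"


class Catalog:
    """In-memory stand-in for a company-wide data catalog."""

    def __init__(self):
        self._products = {}

    def register(self, product: DataProduct) -> None:
        # "Understandable": refuse products that lack a description.
        if not product.description.strip():
            raise ValueError(f"{product.address} needs a description")
        # "Addressable": every address must be unique.
        if product.address in self._products:
            raise ValueError(f"{product.address} is already registered")
        self._products[product.address] = product

    def search(self, term: str) -> list:
        # "Discoverable": match on name, description, or an exact tag.
        term = term.lower()
        return sorted(
            p.address
            for p in self._products.values()
            if term in p.name.lower()
            or term in p.description.lower()
            or term in (t.lower() for t in p.tags)
        )
```

A domain would call `register` once per product, and consumers in other units would use `search` to find products by keyword.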
Creating Data Products with Data Virtualization
A data mesh can be built using data virtualization, leveraging an architecture such as the one shown in this image:
Data virtualization enables domains to quickly implement data products by creating virtual models on top of any data source. Thanks to the ease of use of data virtualization and the way it minimizes data replication, creating data products is much faster than with traditional alternatives. It is also faster to iterate through multiple data product versions until business needs are met (Gartner estimates productivity savings above 45% when using data virtualization).
Virtual models implement a semantic layer, exposing data in a business-friendly form while decoupling consumers from complexities such as the data location and the native source formats. In advanced data virtualization solutions like the Denodo Platform, data products can be made accessible through any method such as SQL, REST, OData, GraphQL, or MDX, without the developer needing to write any code. Data products can also be automatically registered in a global, company-wide data catalog that acts like a data marketplace for the business.
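The idea of a virtual model can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how the Denodo Platform works internally: the two source functions, their column names, and the `VirtualView` class are all hypothetical. The point is that the view stores only a definition (where to fetch and how to rename columns), and the sources are read when a consumer asks for rows.

```python
from typing import Callable, Iterable, Iterator

Row = dict


def crm_customers() -> Iterable[Row]:
    """Stand-in for a CRM source with its own native column names."""
    return [
        {"cust_id": 1, "cust_nm": "Acme"},
        {"cust_id": 2, "cust_nm": "Globex"},
    ]


def erp_orders() -> Iterable[Row]:
    """Stand-in for an ERP source stored elsewhere, in another format."""
    return [
        {"ord_no": "A-1", "cust_ref": 1, "amt": 120.0},
        {"ord_no": "A-2", "cust_ref": 1, "amt": 80.0},
        {"ord_no": "B-1", "cust_ref": 2, "amt": 50.0},
    ]


class VirtualView:
    """Holds a definition, not data; the source is read on every call."""

    def __init__(self, fetch: Callable[[], Iterable[Row]], mapping: dict):
        self._fetch = fetch
        self._mapping = mapping  # native column -> business-friendly name

    def rows(self) -> Iterator[Row]:
        for row in self._fetch():
            yield {new: row[old] for old, new in self._mapping.items()}


def join(left: VirtualView, right: VirtualView, key: str) -> Iterator[Row]:
    """Inner join of two views on a shared business-level key."""
    index = {r[key]: r for r in right.rows()}
    for l in left.rows():
        match = index.get(l[key])
        if match is not None:
            yield {**match, **l}


customers = VirtualView(
    crm_customers, {"cust_id": "customer_id", "cust_nm": "customer_name"}
)
orders = VirtualView(
    erp_orders,
    {"ord_no": "order_id", "cust_ref": "customer_id", "amt": "amount"},
)


def customer_orders() -> list:
    """A derived data product built on top of the two base views."""
    return list(join(orders, customers, "customer_id"))
```

Consumers of `customer_orders()` see only business-level names such as `customer_name` and `amount`; if a domain later swaps the CRM source, only the `fetch` function and the mapping change.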
Maintaining Domains’ Autonomy
Another key benefit provided by data virtualization in this architecture is that domains can autonomously select and evolve the data sources that implement their products. For instance, many business units will already have domain-specific data analytics systems (e.g. data marts) that they can reuse with almost no effort and without introducing new skills into their teams. They can also directly reuse applications specifically tailored to their domains (e.g. SaaS applications). When needed, they can leverage the caching/acceleration capabilities of the data virtualization tool to ensure adequate performance and avoid interfering with other internal processes running on those systems. For further isolation and autonomy across domains, the data virtualization servers used by each domain can also be scaled independently.
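As a rough illustration of why caching protects an operational source, the following Python sketch puts a time-to-live cache in front of a fetch function. It is a generic technique with hypothetical names (`CachedSource`, `ttl_seconds`), not a description of how any specific data virtualization tool implements caching.

```python
import time
from typing import Any, Callable


class CachedSource:
    """Serves reads from memory and hits the source only after the TTL expires."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._value = None
        self._loaded_at = None

    def read(self) -> Any:
        now = self._clock()
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        if expired:
            # Only this branch touches the underlying system.
            self._value = self._fetch()
            self._loaded_at = now
        return self._value
```

With a TTL of a few minutes, a burst of dashboard queries results in a single read against the domain's system, at the cost of serving data that may be up to one TTL old.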
Of course, domains can still choose to go through the data warehouse/data lake process for some data products when they have the appropriate skills. For instance, a central data lake infrastructure may be a good fit for products requiring machine learning. But not all domains need to do it for all of their data products.
Even when a domain does take that route, the resulting products can be accessed through the unified virtual data layer for consistency and governance. The organization can then also benefit from additional capabilities such as a semantic layer, data cataloging, and data access through multiple technologies.
Federated Computational Governance
Data virtualization also lends itself naturally to implementing the federated governance principle. First, the layered nature of virtual models makes it easy to reuse definitions across domains. This in turn enables the definition of common entities with a consistent representation across all data products, ensuring their interoperability. It also enables developers to easily reuse the data products of other domains, avoiding duplicated effort.
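One way to picture this layered reuse is a shared entity definition that each domain extends rather than redefines. The Python sketch below is an assumption-laden illustration: the schema format (field name to Python type), the `extend` helper, and the domain-specific fields are all hypothetical.

```python
# Hypothetical shared definition of the "customer" entity,
# agreed once and reused by every domain.
CUSTOMER = {"customer_id": int, "customer_name": str}


def extend(base: dict, **extra) -> dict:
    """Build a domain-level schema on top of a shared base definition."""
    return {**base, **extra}


# Each domain adds its own fields but inherits the common ones unchanged.
SALES_CUSTOMER = extend(CUSTOMER, lifetime_value=float)
SUPPORT_CUSTOMER = extend(CUSTOMER, open_tickets=int)


def violations(row: dict, schema: dict) -> list:
    """List the ways a row fails to conform to a schema."""
    problems = []
    for name, expected in schema.items():
        if name not in row:
            problems.append(f"missing {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    return problems
```

Because both domain schemas are derived from `CUSTOMER`, a consumer can combine sales and support products on `customer_id` knowing the field has the same name and type in each.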
The data virtualization layer also enables organizations to automate the enforcement of global data security policies (e.g. masking the salary data in all data products unless the user has a certain HR role) and provides a single point to enforce other standardizations across domains (e.g. naming conventions, addressability, and versioning).
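The salary example can be sketched as a single function applied to every result set before it leaves the shared layer. This is a minimal Python illustration, assuming a hypothetical policy table and role name (`hr_analyst`); real platforms express such policies in their own configuration rather than in application code.

```python
from typing import Iterable

# Hypothetical global policy: column name -> role allowed to see it unmasked.
MASKING_POLICY = {"salary": "hr_analyst"}


def apply_masking(
    rows: Iterable[dict],
    user_roles: Iterable[str],
    policy: dict = MASKING_POLICY,
) -> list:
    """Return copies of the rows with protected columns set to None
    unless the user holds the role the policy requires."""
    roles = set(user_roles)
    masked = []
    for row in rows:
        masked.append(
            {
                col: (val if col not in policy or policy[col] in roles else None)
                for col, val in row.items()
            }
        )
    return masked
```

Because the policy is keyed by column name rather than by data product, a new product that exposes a `salary` column is covered without its domain writing any extra code.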
Finally, the Denodo Platform equips data products with out-of-the-box functionality such as data lineage tracking, self-documentation, change impact analysis, identity management, SSO, and more. Because these features are standardized for consistency and interoperability, the development of data products is simplified.
Summarizing Data Mesh
Data mesh is a new decentralized paradigm for data analytics that aims to remove bottlenecks and bring data decisions closer to those who understand the data. To minimize data silos, avoid duplication of effort, and ensure consistency, the data mesh paradigm proposes a unified infrastructure enabling domains to create and share data products while enforcing standards for interoperability, quality, governance, and security.
Data virtualization solutions like the Denodo Platform have been designed precisely to provide a unified, governed, and secure data layer on top of multiple distributed data systems, so they are a natural fit for implementing data mesh principles.
- Why Data Mesh Needs Data Virtualization - August 19, 2021