Since its first incarnation almost 35 years ago in my IBM Systems Journal article, the data warehouse (DW) has remained a key architectural pattern for decision-making support. By decision-making support, I mean everything from simple reporting and querying to AI-based predictive analytics. Of course, the first DW architecture was designed for queries and reports. A variety of additional concepts of varying breadth and quality—such as data mining, the logical data warehouse, and data lakes—have expanded the scope of the original DW thinking or have sought to displace it entirely. None have succeeded in killing off the DW. Now, over the past couple of years, three new adversaries have emerged: data fabric, data mesh, and the data lakehouse.
Will any or all of them kill off the data warehouse? It’s a fun question, but the wrong one! A better question—and one with a more useful answer—is: how can they complement DW thinking, and how could they improve decision-making support in an era of rapidly expanding digital business? We’ll look at each of these new approaches in this series. In this post, we’ll focus on data fabric.
What is a Data Fabric?
Data fabric in its current meaning dates to 2016 and in particular to a Forrester Wave report on “Big Data Fabric.” The Big has been replaced by Enterprise in the latest, 2020 version of the report, reflecting the industry-wide shift from big data to all data. A 2019 definition describes data fabric as “a distributed data management platform, where the sole objective is to combine various types of data storage, access, preparation, analytics, and security tools in a fully compliant manner, so that data management tasks become easy and smooth.” That definition shows the broad scope of the framework, but it also hints at how hard the framework is to specify in practice.
However, Denodo’s Ravi Shankar explained it well in his 2017 article, describing six functional layers:
- Data ingestion from every potential source, of every possible type and structure of data and information
- Processing and persistence in any form of data store, such as a cloud-based repository, Hadoop, database system, or file
- Orchestration of cleansing and transformation processes, to enable integration with other data from multiple sources
- Data discovery, using data modelling, data virtualization, and other tools, to enable data to be accessed and integrated correctly and usefully across different sources or “silos”
- Data management and intelligence to define and enforce data governance and security
- Data access, delivering data to businesspeople or their applications
This list immediately and clearly exposes the fundamental meaning of data fabric. In every way, it can be compared directly with the concept of the logical data warehouse promoted by Gartner, among others, since the early 2010s. Furthermore, data virtualization lies at its heart in layer 4, providing the technology to access and join data across multiple sources. The remaining layers reprise the basic functionality of a data warehouse or, to a lesser extent, a data lake.
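To make the data virtualization idea concrete, here is a minimal sketch in Python, assuming two hypothetical sources: an in-memory SQLite database of customers and a CSV text of orders. All names (`customers`, `ORDERS_CSV`, `revenue_by_customer`) are invented for illustration, and no real data virtualization product works this simply. The point is only that the join happens at access time, with no integrated copy of the data being persisted.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): an operational database holding customers.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2 (hypothetical): a file-based store holding orders as CSV.
ORDERS_CSV = "customer_id,amount\n1,120.0\n1,80.0\n2,45.5\n"


def read_orders():
    """Read orders from the CSV source at query time (no persisted copy)."""
    for row in csv.DictReader(io.StringIO(ORDERS_CSV)):
        yield int(row["customer_id"]), float(row["amount"])


def revenue_by_customer():
    """Virtual view: join the two sources on demand and aggregate."""
    totals = {}
    for customer_id, amount in read_orders():
        totals[customer_id] = totals.get(customer_id, 0.0) + amount
    result = {}
    for cid, name in db.execute("SELECT id, name FROM customers ORDER BY id"):
        result[name] = totals.get(cid, 0.0)
    return result


print(revenue_by_customer())  # {'Acme': 200.0, 'Globex': 45.5}
```

A production virtualization layer would add query pushdown, caching, and security, but the access pattern is the same: the consumer asks one question and the layer resolves it across the sources.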
What’s New with Data Fabric in 2021?
Given that data fabric has been around for five years, one might ask why it has seemingly become one of the flavors du jour now. For example, data fabric is listed as one of Gartner’s Top 10 Data and Analytics Trends for 2021. The logical data warehouse and data virtualization are well known and widely implemented. So, what is new?
The key realization expressed in the data fabric concept is that the real-time integration of data envisaged in layers 3 and 4 above is actually highly complex to define and manage over time in an environment where data sources change rapidly and unpredictably in both content and structure.
This leads to the inclusion of “active metadata” to drive AI algorithms that can simplify and automate the design and operation of integration and discovery functions. Active metadata means that metadata can change and grow automatically as the environment evolves. This is achieved via the provision of advanced analytics over a “connected knowledge graph”—a deep ontology of all the information/data in the environment stored and managed in a graph database/engine.
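As a rough illustration of active metadata, the sketch below keeps a toy knowledge graph as a set of (subject, predicate, object) triples and refreshes it whenever a source's schema is re-observed. Everything here is an assumption made for illustration: the `MetadataGraph` class, the dataset names, and the `has_column` predicate are hypothetical, and simple column-name matching stands in for the AI-driven analytics a real product would run over a graph database.

```python
from collections import defaultdict


class MetadataGraph:
    """Toy knowledge graph stored as (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()

    def observe_schema(self, dataset, columns):
        """Refresh the graph from a dataset's current schema ("active" update)."""
        # Drop stale column edges for this dataset, then add the current ones.
        self.triples = {
            t for t in self.triples
            if not (t[0] == dataset and t[1] == "has_column")
        }
        for col in columns:
            self.triples.add((dataset, "has_column", col))

    def joinable(self):
        """Infer candidate join keys: columns shared by two or more datasets."""
        by_column = defaultdict(set)
        for subject, predicate, obj in self.triples:
            if predicate == "has_column":
                by_column[obj].add(subject)
        return {col: sorted(ds) for col, ds in by_column.items() if len(ds) > 1}


graph = MetadataGraph()
graph.observe_schema("crm.customers", ["customer_id", "name"])
graph.observe_schema("lake.orders", ["order_id", "cust"])
before = graph.joinable()  # no shared column yet, so no join is suggested

# The orders source changes structure; re-observing it updates the metadata.
graph.observe_schema("lake.orders", ["order_id", "customer_id"])
after = graph.joinable()

print(before)  # {}
print(after)   # {'customer_id': ['crm.customers', 'lake.orders']}
```

The design choice worth noting is that the integration hint (the candidate join key) is derived from the metadata rather than defined by hand, so it follows the sources as they change.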
As Forrester’s 2020 report phrases it, this amounts to “a unified, intelligent, and integrated end-to-end platform to support new and emerging use cases.” The key word here is intelligent: the emphasis is on using artificial intelligence to automate multiple aspects of defining and operating the environment, including “process integration, transformation, preparation, curation, security, governance, and orchestration to enable analytics and insights quickly.”
Architectural and Product Considerations
Data fabric emphasizes the ongoing shift to a hybrid, multi-cloud/on-premises data processing platform. As with most platforms defined by big analyst firms, such as Gartner and Forrester, data fabric is framed around the offerings of a variety of software vendors, who then compete to include (usually by acquisition) the additional function demanded by the platform. Such function accretion enables larger data management vendors to support the entire architecture.
However, implementers of data fabric should focus more closely on the key functional enablers and the vendors that offer the best support for them. The first of these is data virtualization, which was central to the logical data warehouse and is even more important in data fabric.
The second key functional enabler is comprehensive, active metadata. This has been a long-term challenge for data warehousing and, even more so, for data lakes. Many metadata management and data catalog products have emerged over recent years, often focusing on the metadata collection issue, and many have since been acquired by larger data management vendors. More recently, knowledge graph approaches have taken center stage. Vendors such as Cambridge Semantics and Stardog have been capitalizing on the popularity of knowledge graphs as used by Google and Facebook and are positioning themselves as key to data fabric implementation.
Data fabric, as an architectural platform, has a solid historic foundation and rich product-set availability in its core storage and data virtualization functional areas. However, its key novel functionality, the use of AI and knowledge graphs over active metadata to automate management, is still an emerging product area.
In subsequent posts, I’ll be looking at the two other emerging architectural patterns: data mesh and the data lakehouse.