Data Exploration and Self Service BI: Welcome to the DataWeb
Data science disciplines have evolved rapidly in recent years. Fueled by the rise of big data and the Internet of Things, advanced analysis of large data sets is driving science, business, and marketing decisions. At the same time, the exponential growth of data makes finding information harder than ever. Although good governance and well-documented models can help, identifying what data is useful for solving a particular problem can be challenging. Add the fact that interesting data may be spread across multiple siloed locations, and frustration is guaranteed.
In this post, we’ll look at how the abstraction capabilities of data virtualization can ease this challenge. Abstraction is one of the key ingredients that made data virtualization evolve beyond old-school federation. The fact that data virtualization is a layer that sits above a variety of data sources, abstracting the underlying technology, is already a big plus in a data exploration scenario: the end user doesn’t need to master multiple technologies or request different credentials to start the search. However, we will explore abstraction from a different perspective: how data virtualization can also abstract the way you ask for data.
Data exploration is one of those cases where writing SQL queries may not be the best approach. Sometimes you simply don’t know where valuable information may be: in what table, in what field, or in what system. As an analogy, take the everyday scenario of searching the internet for information. Most readers will open a browser, go to Google, search, and follow result links to related pages until they find the desired information, then read the details. We are so used to “googling things” that we may not realize it, but we are actually using several different query paradigms and technologies here:
- Keyword-based search against an index powered by Google.
- Navigation to related resources using HTTP links (aka “The Web”).
- Reading a piece of text to extract the valuable details.
In Denodo, we have built our data virtualization model in a way that lets you take a similar approach, with a small but important twist: all of these access methods leverage the same data model and security settings, so your results will be consistent regardless of the method. The options that the Denodo Platform provides are:
- Keyword search based on an index built atop your data.
- RESTful web services with HTTP navigation to related resources based on a network of relationships.
- SQL queries.
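To make the second option concrete, here is a minimal, self-contained sketch of link-based navigation between related resources, using an in-memory dictionary as a stand-in for the views a virtualization layer would expose over HTTP. All paths, view names, fields, and link keys below are hypothetical illustrations of the pattern, not Denodo’s actual API:

```python
# A toy stand-in for RESTful resources: each "row" is addressable by a path
# and carries named links to related rows (the "network of relationships").
# Every name here is made up for illustration purposes.
RESOURCES = {
    "/sales_order/42": {
        "customer": "Acme Corp",
        "_links": {"product": "/product/7"},
    },
    "/product/7": {
        "name": "Widget",
        "_links": {"production_lot": "/lot/99"},
    },
    "/lot/99": {"lot_id": 99, "_links": {}},
}

def follow(path, relation):
    """Fetch the resource at `path` and follow one of its named links."""
    return RESOURCES[RESOURCES[path]["_links"][relation]]

# From a sales order, hop to its product, then to the production lot.
product = follow("/sales_order/42", "product")
lot = follow("/product/7", "production_lot")
```

The point of the pattern is that the client never constructs joins by hand; it simply follows the relationships the model already declares.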
This completely changes the game when you are looking for information. You can start using your data like the Web! Open your browser and search for a specific piece of information in a Google-like fashion, get the results as rows from different tables in your model, navigate through a network of links to related information using the RESTful services, and finally create a report or dashboard with SQL using the data you need.
For example, imagine that a customer has reported a problem with a particular component of one of your products. You need to analyze the impact of the issue in case it affects other customers.
Using Denodo’s capabilities, you start with a keyword-based search on the customer’s name that returns a few sales orders. One of them seems to be related to the product in question. From the details of that sales order, you retrieve the product ID, its bill of materials, and the production lot associated with this particular device. From there, you can jump to the testing results from the factory to check whether some parameter slipped through quality testing. If something is wrong, you can quickly pinpoint the impact of the issue by identifying which sales orders contained pieces from this lot, and contain the problem by removing affected products from storage before they reach the market. Individual emails and replacement products could then be sent to each affected customer to mitigate the social media impact. The results can also be presented to management immediately in an ad hoc dashboard created with a reporting tool on top of the same data.
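The investigation above can be sketched end to end against toy in-memory tables. The table names, fields, and the search function are hypothetical stand-ins for the indexed views a data virtualization model would expose; they are not a real API:

```python
# Toy stand-ins for two virtual views; all names and values are invented
# for this sketch of the keyword-search -> navigate -> trace workflow.
SALES_ORDERS = [
    {"order_id": 1, "customer": "Acme Corp", "lot_id": 99},
    {"order_id": 2, "customer": "Beta LLC",  "lot_id": 99},
    {"order_id": 3, "customer": "Gamma Inc", "lot_id": 50},
]
TEST_RESULTS = {99: {"voltage_ok": False}, 50: {"voltage_ok": True}}

def keyword_search(term):
    """Step 1: keyword search over indexed rows (here: a linear scan)."""
    return [o for o in SALES_ORDERS if term in o["customer"]]

def affected_orders(lot_id):
    """Step 3: trace every order that shipped parts from a suspect lot."""
    return [o["order_id"] for o in SALES_ORDERS if o["lot_id"] == lot_id]

# Step 1: the customer's name leads to their sales order.
order = keyword_search("Acme")[0]
# Step 2: navigate to the production lot's factory test results.
lot_failed = not TEST_RESULTS[order["lot_id"]]["voltage_ok"]
# Step 3: if the lot failed testing, find every impacted order.
impact = affected_orders(order["lot_id"]) if lot_failed else []
```

In a real deployment, each step would be a search query, a link traversal, or a SQL query against the same governed model, rather than Python over dictionaries.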
This approach to data exploration can be very valuable in many scenarios, such as marketing research, fraud detection, supply chain defect investigations, and all kinds of “what if” simulations. These are scenarios where data is spread across different systems, the initial inputs (names, SSNs, SKU numbers, etc.) can point to multiple tables, the schemas are complex and not well understood, and, in general, the process is more of an investigation.
Thanks in large part to the agility that data virtualization adds to this process, Denodo has become a fundamental part of the BI stack for many companies. And more often than not, users are surprised at how well data virtualization performs; just take a look at the performance articles our CTO has written lately.
From the initial features that, years ago, enabled the concept of Data-as-a-Service on top of your virtual data models, data virtualization has evolved into a more mature, integrated platform. Flexibility in how you access your data assets has been a driving point in Denodo’s roadmap and one of the key values that data virtualization brings to the table. Data exploration is just another good example of how powerful this approach can be. The concept of end-user self-service, which encompasses not only the example used here but also all the security constraints, resource allocation management, and metadata governance that support it, is one of the core pieces of data virtualization’s future.
Pablo Álvarez