Using AI to Further Accelerate Denodo Platform Performance

Using AI to Further Accelerate Denodo Platform Performance

Data virtualization  has a privileged position in modern architectures for data discovery and use cases such as data fabric and logical data warehousing. Data virtualization provides unified data access, data integration, and a delivery layer, bridging the gap between distributed IT data management infrastructure and business users and applications. This enables data virtualization systems like the Denodo Platform to collect active metadata about what is happening across the whole data ecosystem, enabling stakeholders to answer the “five W” questions about data access: who is accessing what data, when, where, why, and how. This offers multiple opportunities for automatically deriving intelligence to optimize every step in the data discovery, integration, and preparation process.

In this post, I will describe one of the latest functionalities introduced by Denodo in this area: using artificial intelligence (AI) to provide automatic recommendations for query acceleration based on past activity.

Query Optimization and Acceleration in the Denodo Platform

In the past, we have mentioned that  the query optimizer plays an important role in a data virtualization solution. This is because, for each query the virtual platform receives, the query optimizer decides what is the best strategy to obtain the final result, and the impact of that decision on the execution time can be many orders of magnitude.

In fact, it is because of its powerful query optimizer that the Denodo Platform achieves a level of performance comparable to that of a physical data warehouse. The Denodo Platform 8.0 optimizer includes a variety of different techniques that minimize data transfer and post-processing, such as aggregation push-down and on-the-fly data movement (See this post for more information about some of these strategies). In addition, it includes a new feature called Smart Query Acceleration, which dynamically selects pre-stored data to avoid performing some of the same data combinations. This new optimization is based on the fact that in analytic scenarios, queries usually follow the same pattern: they combine data from facts and dimension tables to calculate aggregated results. In most of these cases, it is possible to calculate the final aggregation based on a dataset that is already partially aggregated. Smart Query Acceleration consists of: 1) pre-computing interesting data aggregates called “summaries,” that then 2) the query optimizer can automatically decide to use in future queries.

Using AI for Query Acceleration

This post describes the Smart Query Acceleration feature in detail, including a sample scenario and the performance gains on different queries. In this post, I will focus on an advanced tool included in Denodo Platform 8.0 that makes use of AI techniques to infer which are the best datasets to precompute and store in summaries based on the real workload of the customer.

To do that, I will use the same simplified scenario from the above-mentioned post, consisting of a retailer company with distributed data about sales and stores:

Let’s say some employees use a BI tool connected to the Denodo Platform to create dashboards combining data from these systems. The Denodo Platform administrator has little control over the queries reaching the system, as users can “play” with the data in a self-service manner. Therefore, it is a good idea to use Smart Query Acceleration to accelerate queries from this tool. To choose the summary or summaries to create, the administrator has been monitoring the activity of the system for a while. The following are some examples of queries that reach the Denodo Platform:

SELECT d_year, sum(ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales

FROM store_sales INNER JOIN analytics.date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk

GROUP BY d_year

SELECT d_year , d_qoy, store.s_city, store.s_store_name, sum(ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales

FROM analytics.store_sales INNER JOIN analytics.date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk

INNER JOIN analytics.store AS store ON store_sales.ss_store_sk = store.s_store_sk

GROUP BY d_year , d_qoy, store.s_city, store.s_store_name

SELECT store.s_city, store.s_store_name, sum(ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales

FROM analytics.store_sales INNER JOIN analytics.date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk

INNER JOIN analytics.store AS store ON store_sales.ss_store_sk = store.s_store_sk

WHERE d_year = 2003 AND d_qoy = 1

GROUP BY store.s_city, store.s_store_name

select c_email_address, sum(ss_net_paid_inc_tax) AS total_sales, SUM(ss_list_price – ss_sales_price) as discount

from ttm_store_sales JOIN analytics.date_dim ON ttm_store_sales.ss_sold_date_sk = date_dim.d_date_sk

JOIN customer ON(ss_customer_sk = c_customer_sk)

WHERE d_year = 2003 and c_preferred_cust_flag = ‘Y’

GROUP BY c_email_address

Considering that this is just a tiny, simplified representation of a real workload, you can imagine how hard it might be to manually analyze, find patterns in, and decide the best datasets to persist in summaries.

Fortunately, Denodo Platform 8.0 includes the Summary Recommendations tool, which makes this process much easier. This tool analyzes information about real queries and automatically recommends summaries to accelerate users’ workload.

The Denodo Administrator just needs to specify a date range to consider (and optionally adjust other parameters like the minimum duration of the queries to focus on for the analysis) and the Summaries Recommendations tool will return a ranking with the most promising summaries, based on the analysis of the workload, in the specified date range.

Finally, the Denodo Administrator can review the recommendations and confirm the ones to create.

How it Works

The Summary Recommendations tool reads information about the queries in the selected workload; extracts relevant features for each query block like the duration, views used in the block, the size of those views, among others; and infers a new set of query blocks that, if persisted as summaries, would maximize the performance of that selected workload:

To get the final result, it first decides what are the most relevant view sets, then it generates a collection of summary candidates using those interesting view sets, and finally, it evaluates the impact of each candidate to decide the final recommendations.

In our sample scenario, the tool would suggest the summary used in the previous post, defined as follows:

SELECT ss_store_sk AS store_id, ss_sold_date_sk AS date_id, sum(ss_net_paid_inc_tax) AS total_sales, count(*) AS count_sales

FROM store_sales

GROUP BY ss_store_sk, ss_sold_date_sk

That would accelerate all queries shown in the previous list, obtaining the high performance gains you can see in the following table:

Putting AI to Work

Because the Denodo Platform sits in a strategic position between users and stored data, it can extract valuable knowledge about user-data interactions. AI helps in making use of this knowledge to provide valuable recommendations in different areas, from data management to query optimizations.

In particular, the Summary Recommendations tool uses AI techniques to apply fine performance-tuning, customized to each particular use case. As a result, it improves the user experience with minimum intervention from system administrators and avoiding the cost of additional resources.

Leave a Reply

Your email address will not be published. Required fields are marked *