Intelligent Caching in Data Virtualization

Caching is one of the most important capabilities of a Data Virtualization platform because it helps deliver the right combination of high performance, low information latency, minimal source impact, and reduced cost from needless data replication. In fact, the caching options, and how flexibly they can be configured to work in tandem with real-time query optimization and scheduled batch operations, are among the top differentiators between standard federation products and best-of-breed data virtualization platforms such as the Denodo Platform. This is because Data Virtualization is used today as an integral information fabric or data services platform in different scenarios to meet different objectives (real-time BI, EDW extension, data abstraction layer, application data access, secure data services, etc.), and the caching capabilities must be powerful and flexible enough to meet these needs.

Caching serves many purposes

Caching can be useful for several reasons: to manage real-time performance across disparate sources with varying latencies, to minimize data movement based on frequent query patterns, to reduce or manage the impact of data virtualization on source systems, and, finally, to mitigate the problems of a source system being only intermittently available.

Caching for performance

When using the Denodo Platform to integrate various data sources and publish the derived data entities to consuming applications, you might find that some of your data sources are slower than others and degrade overall performance. This might be because those data sources are inherently slower, or because they are already heavily used and therefore respond more slowly. For example, getting data from web services, from flat files that must be parsed, or from web sites (using ITPilot) is typically slower than querying data in a relational database or data warehouse. If you are combining data from these different sources into a derived view within the Denodo Platform, the slower sources can reduce the overall performance of queries on the derived view in certain cases. In these situations, the cache in the Denodo Platform can be used to reduce the performance bottlenecks.

You can configure the Denodo Platform to cache data from the slow data sources and use the cached data in response to any queries against those data sources. To return to the above example, if data from the web service is cached and this cached data is used for subsequent queries against the web service (and, by extension, for queries against any derived view that uses the web service data), performance can increase dramatically because the latency of the web service invocations is removed from the execution path. The cache in the Denodo Platform should be used judiciously, however: caching every data source is just another form of data replication, and it also means that the data retrieved for every query is the cached data and not the live data from the originating data source. Used for selected base views, caching can dramatically improve the performance of queries on the base view and also of queries on any derived views that use the base view. It is important to note that, for the usage pattern of performance improvement, our recommendation is to cache the data from base views, and only when necessary for performance reasons. If you cache data from derived views, you could be caching data not only from the ‘slow’ data source but also from other data sources that have perfectly acceptable performance characteristics.
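The base-view recommendation can be illustrated with a minimal Python sketch. It is not Denodo Platform code or configuration: the `CachedBaseView` class, the `derived_view` function, and the row fields (`currency`, `rate`, `amount`) are hypothetical names chosen for illustration. Only the slow source is wrapped in a cache with a time-to-live, while the fast relational rows are passed in live.

```python
import time
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


class CachedBaseView:
    """Hypothetical wrapper that caches the rows of one slow source."""

    def __init__(
        self,
        fetch: Callable[[], List[Row]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # the slow call, e.g. a web service
        self._ttl = ttl_seconds      # how long a cached copy stays valid
        self._clock = clock          # injectable so the sketch is testable
        self._rows: Optional[List[Row]] = None
        self._loaded_at = 0.0

    def rows(self) -> List[Row]:
        """Return cached rows, reloading from the source once the TTL expires."""
        now = self._clock()
        if self._rows is None or now - self._loaded_at >= self._ttl:
            self._rows = self._fetch()
            self._loaded_at = now
        return self._rows


def derived_view(orders: List[Row], rates: CachedBaseView) -> List[Row]:
    """Join live relational rows with the cached slow-source rows."""
    rate_by_currency = {r["currency"]: r["rate"] for r in rates.rows()}
    return [
        {**order, "amount_usd": order["amount"] * rate_by_currency[order["currency"]]}
        for order in orders
    ]
```

Because the cache sits on the base view, every derived view built on it benefits, and the fast sources are still read live.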

Caching to optimize frequent user queries

When many users repeatedly request the same data, the results of these queries can be cached. Subsequent queries that match the original query, or that ask for a subset of it, can be served from the cache using post-processing. The real-time needs of such queries must be analyzed to determine the time-to-live in the cache. The cache patterns may also be regional or departmental in a federated data virtualization deployment with multiple Denodo Platform servers. For example, the retail store inventory status for European, Asian, and American stores may be cached on distributed regional Denodo Platform servers and shared among them. In contrast with the performance improvement scenario, caching for frequent user queries can be applied to higher-level derived views in the integration tree, and not just to base views.
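The idea of answering a subset query from a cached result can be sketched in a few lines of Python. This is an illustrative assumption about how such post-processing might look, not the Denodo Platform's implementation: `QueryResultCache`, `query`, and the `run_live` callback are hypothetical names, and the cache is keyed by view name only.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]


class QueryResultCache:
    """Hypothetical cache holding the full result of a frequently run view."""

    def __init__(
        self,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[Row]]] = {}

    def get(self, view: str) -> Optional[List[Row]]:
        """Return the cached rows, or None if missing or past the time-to-live."""
        entry = self._entries.get(view)
        if entry is None:
            return None
        stored_at, rows = entry
        if self._clock() - stored_at >= self._ttl:
            del self._entries[view]
            return None
        return rows

    def put(self, view: str, rows: List[Row]) -> None:
        self._entries[view] = (self._clock(), rows)


def query(
    view: str,
    predicate: Callable[[Row], bool],
    cache: QueryResultCache,
    run_live: Callable[[str], List[Row]],
) -> List[Row]:
    """Serve a query from the cached view result, filtering as post-processing."""
    rows = cache.get(view)
    if rows is None:
        rows = run_live(view)       # only reached on a miss or after expiry
        cache.put(view, rows)
    return [row for row in rows if predicate(row)]
```

A narrower follow-up query, such as one that adds a low-stock condition to a regional filter, is answered by filtering the same cached rows, so the sources are not touched again until the time-to-live expires.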

Caching to minimize source system impact

Organizations that expose their source systems to data virtualization are both excited and alarmed at first. A multitude of worrying questions can spring to mind: What happens if anyone and everyone starts querying my operational systems in real time? What will be the performance impact on my operational users who depend on these systems? This is where intelligent caching combined with role-based security or custom policies can help. While all users can be exposed to consistent canonical views of disparate data, the Denodo Platform can apply different SLAs to different users. Based on granular user and role-based security (discussed in other articles), as well as custom policies that can be parameterized by any external input such as network traffic, source load, or time of day, the Denodo Platform can serve a real-time view of data to certain priority users and partially cached data to others. Cache refreshes can also be triggered by event messages sent to the Denodo Platform once changes to the sources pass a certain threshold. In this way, intelligent caching is able to minimize source impact while meeting differentiated user needs.
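As a rough illustration of these two mechanisms, the Python sketch below shows a policy that decides between the live source and the cache, and a counter that triggers a refresh once enough change events arrive. Everything in it is assumed for the example rather than taken from the Denodo Platform: the role names, the 0.8 load limit, the 08:00 to 18:00 business hours, and the names `RequestContext`, `use_live_source`, and `ChangeTriggeredRefresh`.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical roles that always receive real-time data.
PRIORITY_ROLES = frozenset({"trader", "operations"})


@dataclass(frozen=True)
class RequestContext:
    role: str           # role of the requesting user
    source_load: float  # assumed load metric for the source, 0.0 to 1.0
    hour: int           # local hour at the source, 0 to 23


def use_live_source(
    ctx: RequestContext,
    max_load: float = 0.8,
    business_hours: range = range(8, 18),
) -> bool:
    """Return True to query the source live, False to serve cached data."""
    if ctx.role in PRIORITY_ROLES:
        return True                      # priority users always get live data
    if ctx.source_load >= max_load:
        return False                     # protect a heavily loaded source
    return ctx.hour not in business_hours  # others go live only off-hours


class ChangeTriggeredRefresh:
    """Calls a refresh callback once enough source changes have been reported."""

    def __init__(self, threshold: int, refresh: Callable[[], None]) -> None:
        self._threshold = threshold
        self._refresh = refresh
        self._pending = 0

    def on_change_event(self, changed_rows: int = 1) -> bool:
        """Record an event message; return True if it triggered a refresh."""
        self._pending += changed_rows
        if self._pending >= self._threshold:
            self._refresh()
            self._pending = 0
            return True
        return False
```

Keeping the routing decision in one small function makes the SLA rules easy to review and change without touching the views themselves.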

Caching to protect against intermittent system availability

The Denodo Platform can provide access to a wide variety of data sources and, because these sources vary so much in nature, they will have different availability profiles. Even data sources within the organization will differ in availability. For example, an operational database might be configured for 24/7 availability with high-availability clustering and redundancy, whereas a data source in a regional sales office might only be available during local office hours. When the data sources are external to the organization, often owned and controlled by a totally different entity, the question of system availability becomes more pressing. Caching data from these sources within the Denodo Platform can help mitigate the risk of the source data not being available. If feasible, the data can be cached and queries for this data can be served from the cache rather than from the actual data source, which may or may not be available. Whenever the data source is available, the cache can be refreshed from the source to keep the cached data up to date.
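The serve-from-cache, refresh-when-available pattern might look like the following Python sketch. It is a simplified assumption, not Denodo Platform behavior: `SourceUnavailable` and `AvailabilityCache` are hypothetical names, and the source is represented by a plain callable that raises when it is down.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]


class SourceUnavailable(Exception):
    """Raised by the hypothetical fetch callable when the source is down."""


class AvailabilityCache:
    """Serves queries from a cached copy; refreshes only when the source is up."""

    def __init__(self, fetch: Callable[[], List[Row]]) -> None:
        self._fetch = fetch
        self._rows: Optional[List[Row]] = None

    def refresh(self) -> bool:
        """Try to reload from the source; keep the old copy if it is down."""
        try:
            fresh = self._fetch()
        except SourceUnavailable:
            return False
        self._rows = fresh
        return True

    def rows(self) -> List[Row]:
        """Return the cached copy, loading it once if the cache is still empty."""
        if self._rows is None and not self.refresh():
            raise SourceUnavailable("no cached copy and the source is down")
        return self._rows
```

In this sketch `refresh()` would be called on a schedule or when the source signals that it is reachable, so a failed refresh never discards data that consumers can still use.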

Suresh Chandrasekaran
