Logical Architectures for Big Data Analytics
If you check the reference logical architectures for big data analytics proposed by Forrester and Gartner, or ask your colleagues building big data analytics platforms for their companies (typically under the ‘enterprise data lake’ tag), they will all tell you that modern analytics need a plurality of systems: one or several Hadoop clusters, in-memory processing systems, streaming tools, NoSQL databases, analytical appliances and operational data stores, among others (see Figure 1 for an example architecture).
This is not surprising, since different data processing tasks need different tools. For instance: real-time queries have different requirements than batch jobs, and the optimal way to execute queries for reporting is very different from the way to execute a machine learning process. Therefore, all these on-going big data analytics initiatives are actually building logical architectures, where data is distributed across several systems.
Figure 1: The Logical Architectures of an Enterprise Big Data Analytics Platform
This will not change anytime soon. As Gartner’s Ted Friedmann said in a recent tweet, ‘the world is getting more distributed and it is never going back the other way’. The ‘all the data in the same place’ mantra of the big ‘data warehouse’ projects of the 90’s and 00’s never happened: even in those simpler times, fully replicating all relevant data for a large company in a single system proved unfeasible. The analytics projects of today will not succeed in such task in a much more complex world of big data and cloud.
That is why the aforementioned reference logical architectures for big data analytics include a ‘unifying’ component to act as the interface between the consuming applications and the different systems. This component should provide: data combination capabilities, a single entry point to apply security and data governance policies, and should isolate applications from the changes in the underlying infrastructure (which, in the case of big data analytics, is constantly evolving). Figure 2 shows the revised logical architecture for the example in Figure 1 (in this case, with Denodo acting as the ‘unifying component’).
Figure 2: Denodo as the Unifying Component in the Enterprise Big Data Analytics Platform
Four types of software products have been usually proposed for implementing the ‘unifying component’: BI tools, enterprise data warehouse federation capabilities, enterprise service buses, and data virtualization . Nevertheless, in our experience, only data virtualization is a viable solution in practice and, actually, that is the option recommended by leading analyst firms. To understand why, let me compare data virtualization to each of the other alternatives.
Data Virtualization vs BI Tools
Some BI tools support performing joins across several data sources so, in theory, they could act as the ‘unifying component’, at least for reporting tasks. Nevertheless, these tools lack advanced distributed query optimization capabilities. For instance, they typically execute distributed joins by retrieving all data from the sources (see for instance what IBM says about distributed joins in Cognos here), and do not perform any type of distributed cost-based optimization.
In big data analytics scenarios, such approach may require transferring billions of rows through the network, resulting in poor performance. In my previous posts (see for instance here and here), I explained the main optimization techniques Denodo implements to achieve very good performance for distributed queries in big data scenarios: BI tools do not implement any of them.
Another problem with using BI tools as the “unifying” component in your big data analytics architecture is tool ‘lock-in’: other data consuming applications cannot benefit from the integration capabilities provided by the BI tool. In turn, data virtualization tools expose unified data views through standard interfaces any consuming application can use, such as JDBC, ODBC, ADO.NET, REST or SOAP.
Of course, BI tools do have a very important role to play in big data logical architectures but, not surprisingly, it is in the reporting arena, not in the integration one.
Federation at Enterprise Data Warehouses vs Data Virtualization
Some big data and enterprise data warehouse (EDW) vendors have recognized the key role that data virtualization can play in the logical architectures for big data analytics, and are trying to jump into the bandwagon by including simple data federation capabilities. Nevertheless, they support a limited set of data sources, lack high-productivity modeling tools and, most importantly, use optimization techniques inherited from conventional databases and classical federation technologies. These techniques may be useful for operational applications, but will result in poor performance when dealing with large data volumes. Therefore, although they can be a viable option for simple reports where almost all data is stored physically in the EDW, they will not scale for more demanding cases.
Enterprise Service Bus vs Data Virtualization
Some companies aim to expose part of the data in their data lakes as a set of data services. ESBs have been marketed for years as a way to create service layers, so it may seem natural to use them as the ‘unifying’ component. Nevertheless, there are three key problems that we consider that make this approach unfeasible in practice:
- ESBs are designed to process-oriented tasks, which are very different from data oriented tasks. This means they lack out of the box components for many common data combination/ data transformation tasks.
- ESBs do not support ad-hoc queries. Long story short: you cannot point your favorite BI tool to an ESB and start creating ad-hoc queries and reports.
This is because ESBs perform integration through procedural workflows. Procedural workflows are like program code: they declare step-by-step how to access and transform each piece of data. This means you can create a workflow to perform a certain pre-defined data transformation, but you cannot specify new queries on the fly over the same data. Therefore, every new query needed by any application, and every slight variation over existing queries (e.g. aggregating results by a different criteria) will require a new workflow created and maintained by the team in charge of the ESB.
In turn data virtualization tools, in the same way as databases, use a declarative approach: the tool exposes a set of generic data relations (e.g. ’customer’, ‘sales’, ‘support_tickets’…) and users and applications send arbitrary queries (e.g.using SQL) to obtain the desired data. New information needs over the existing relations do not require any additional work. Users and applications simply issue the queries they want (as long as they have the required privileges).
- ESBs do not have any automatic query optimization capabilities. As explained in the previous point, the creator of ESB workflows needs to decide each step of the data combination process, without any type of automatic guidance. This means manually implementing complex optimization strategies. Even worse, as you will know if you are familiarized with the internals of query optimization, the best execution strategy for an operator (e.g. a join) can change radically if you add or remove a single filter to your query. It is simply impossible to expect a manually-crafted workflow to take into account all the possible cases and execution strategies. It is like going back in time to 1970, before databases existed, when software code had to painfully specify step by step the way to optimize joins and group by operations.
In turn, data virtualization systems like Denodo use cost-based optimization techniques which consider all the possible execution strategies for each query and automatically implement the one with less estimated cost.
And finally, Data Virtualization vs …. Data Virtualization
Not all data virtualization systems are created equal. If you choose a DV vendor which does not implement the right optimization techniques for big data scenarios, you will be unable to obtain adequate performance for many queries.
At risk of repeating myself, my advice is very simple: when evaluating DV vendors and big data integration solutions, don’t be satisfied with generic claims about “ease of use” and “high performance”: ask for the details and test the different products in your environment, with real data and real queries, to make the final decision.
- No Single Data Repository Can Be Your Silver Bullet - April 14, 2021
- Unlocking the Potential of Machine Learning in a Data Lake - March 27, 2019
- 4 Key Takeaways from the Gartner Magic Quadrant for Data Integration Tools - August 2, 2018