Data Virtualization or SQL-on-Hadoop for Logical Data Architectures?
The concepts of logical data warehouse and logical data lake are getting traction. Organizations are starting to understand that these forms of data warehouses and data lakes are more practical and flexible than their more traditional physical counterparts.
The technology for building logical data architectures is available in the form of mature data virtualization platforms. These products can hide how and where data is stored for the data scientists and other users. To them, all the data sources look like one integrated database.
Currently, vendors of so-called SQL-on-Hadoop engines are also presenting their products as the right technology for building logical data architectures. The reason is that they too offer SQL interfaces on Hadoop files and other storage technologies. This allows users to deploy their favorite reporting or analytical tool to access the data. These tools use the SQL interface of a SQL-on-Hadoop engine to access data stored in, for example, Parquet files, or flat Hadoop files in which values are separated by commas. In other words, through these SQL-on-Hadoop engines, users can easily analyze big data.
But the question is whether SQL-on-Hadoop engines support all the required functionality to develop a logical data architecture? Being able to run SQL queries on Hadoop is not sufficient. The following key features for developing logical data architectures are missing in most SQL-on-Hadoop engines:
- Allowing applications and users to access all the data through another interface than SQL, but through, for example, REST/JSON, XQuery, MDX, SOAP/XML, or JMS.
- Allowing all types of data sources to be accessed, including Hadoop files, SQL databases, flat files, spreadsheets, packaged applications, social media networks, and messaging devices. Note that not all the data is stored in Hadoop files.
- Detailed lineage and impact analysis capabilities that show from which data sources results are derived and which transformations have been applied to the data.
- A searchable data catalog that contains definitions, descriptions, tags, and classifications of all the available data. Such a feature is indispensable for more investigative and experimental users, such as data scientists.
- Advanced query optimization techniques for federated queries across a homogeneous set of data sources. Joins combining data stored in Hadoop with data stored in a SQL database should be executed as efficiently as possible.
- Advanced query pushdown and parallel processing capabilities to handle complex queries on real big data. It’s important that query processing is executed as close to where data is stored as possible, regardless of the data storage technology used.
- Centralized data security features for specifying on a granular level i.e. which user is allowed to do what, with which data.
In general, SQL-on-Hadoop engines are very valuable, high-performance technologies for making Hadoop data available through a SQL interface. That’s probably the reason why data virtualization servers themselves use this technology in some cases to efficiently access Hadoop files. Unfortunately, there is key functionality missing to make them suitable for building entire logical data warehouse and logical data lake architectures. They form an important puzzle piece to develop a logical data warehouse or logical data lake, but they are not the complete puzzle.
For more information on data virtualization technology and developing logical data lakes, see the whitepaper.
- Data Minimization as Design Guideline for New Data Architectures - May 6, 2021
- Data Fabrics Need to Coexist with Data Warehouses and Other Database-Centric Technologies - March 29, 2021
- Streamlining External Data Access to Enrich Analytics - December 23, 2020