Big Data Warehouses require Hybrid Data Storage
There’s no stopping the big data revolution. The potential business advantages are clear. Big data systems enable advanced forms of analytics that allow organizations to optimize business processes, increase customer delight, improve customer care and product development, and personalize products and services.
However, do organizations use the right technology to develop such big data systems?
To implement big data systems, numerous organizations have selected NoSQL systems, such as MongoDB and Hadoop (including the commercial Hadoop distributions from Cloudera and MapR). These NoSQL technologies make it possible to reach levels of scalability and availability that may be hard or impossible to reach with classic SQL systems.
However, they have their limitations. For example, Hadoop HDFS together with MapReduce and Hive forms a powerful software stack. It allows us to run complex queries on massive amounts of data and still get acceptable performance. It’s excellent for statistics, data mining, and data science types of work. However, because of its batch-oriented nature and its lack of a mature query optimizer, among other things, it’s not the right stack for ad-hoc query environments.
This means that today we can’t develop a big data warehouse by storing all the data in one Hadoop data store and running all types of reporting and analytical workloads on it. SQL systems are still more suitable for certain types of reports and analytics. So when a data warehouse environment has to support analytics on big data plus all the other, more classic forms of reporting, it may have to consist of both NoSQL and SQL storage systems. In other words, a big data warehouse for big analytics may need a hybrid data storage system.
In a data warehouse environment that deploys a hybrid data storage system, business analysts shouldn’t have to know in which of the two data stores their data is stored. This should be completely hidden from them. A data virtualization server can help out here. It can make the hybrid solution look like one big logical database. This transparency is needed for productivity, for maintenance, and for easily moving data from one data store to another.
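To make the idea of one logical database more concrete, here is a minimal sketch in Python. It is a toy illustration, not how any particular data virtualization product works. The class LogicalDatabase, the table names customers and clickstream, and the sample rows are all made up for this example; an in-memory SQLite database stands in for the SQL store, and a plain list of documents stands in for the NoSQL store.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List


class LogicalDatabase:
    """Toy stand-in for a data virtualization layer (hypothetical class)."""

    def __init__(self) -> None:
        # Maps a logical table name to a function that returns its rows.
        self._catalog: Dict[str, Callable[[], Iterable[dict]]] = {}

    def register(self, table: str, fetch_rows: Callable[[], Iterable[dict]]) -> None:
        # Registering a table again simply repoints it to another store.
        self._catalog[table] = fetch_rows

    def rows(self, table: str) -> List[dict]:
        # Callers name the table only; the catalog decides which store answers.
        return list(self._catalog[table]())


# SQL side: an in-memory SQLite database stands in for the relational store.
sql_store = sqlite3.connect(":memory:")
sql_store.row_factory = sqlite3.Row
sql_store.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
sql_store.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")]
)

# NoSQL side: a list of documents stands in for the big data store.
clickstream_docs = [
    {"customer_id": 1, "page": "/home"},
    {"customer_id": 1, "page": "/pricing"},
    {"customer_id": 2, "page": "/home"},
]

db = LogicalDatabase()
db.register(
    "customers",
    lambda: (dict(r) for r in sql_store.execute("SELECT id, name FROM customers")),
)
db.register("clickstream", lambda: clickstream_docs)


def page_views_per_customer(logical_db: LogicalDatabase) -> Dict[str, int]:
    """Analyst-side report: combines both tables without knowing where they live."""
    names = {c["id"]: c["name"] for c in logical_db.rows("customers")}
    counts: Dict[str, int] = {}
    for event in logical_db.rows("clickstream"):
        name = names[event["customer_id"]]
        counts[name] = counts.get(name, 0) + 1
    return counts


print(page_views_per_customer(db))
```

The report function never mentions SQLite or the document list. If the customers table were later moved to the other store, only the register call would change and the report would keep working, which is the kind of transparency described above.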