The Data Lake: A High-Level Perspective of Challenges and a Solution
The concept of big data is funny to me. It is funny because there isn’t any small data when it comes to large corporations. 99% of the companies we work with have big data, and it is getting bigger. Companies’ ability to store data has doubled every 40 months for 40 years, and they are managing to fill all that storage. Every day over 2.5 exabytes of data is generated, and some of that is even useful! With these petabytes, exabytes, even zettabytes of data floating around, it is no wonder the data lake has become a popular approach. This blog will explore why data lakes are popular, what some of the potential pitfalls are, and how data virtualization fits in.
There is a lot of data out there and it comes from a variety of sources. Rather than leave this data spread out across different systems, databases, and structures, it is tempting to throw all these sources and rows into a consolidated repository. When this repository can hold large volumes of raw data from any source and in any form, the data lake is born and it is easy to see the value. The first problem with this approach is replication of data between the source system and the lake itself. Replication is expensive and also creates quality challenges, because every update has to be applied to both copies or they drift out of sync. Governance is another issue with this approach. Having a single place to go to find your data is convenient but also makes restricting access to specific data sets a challenge.
Lastly, although a key value proposition of a data lake is that it will ‘de-silo’ data sources, it can often become a silo itself. This happens in two ways. Often the process of building a lake falls into one department’s hands. That department builds the lake to best suit its own needs, usually unintentionally, and only that group can use it to its full potential. The other way is when your lake becomes a swamp. We commonly see companies that either didn’t take the time or didn’t spend the resources to load their data lake the right way, which leads to a mixture of data types, sources, and copies that even a highly trained architect can barely muddle through, let alone a business user. These swamps contain all the required data and are a single centralized source, but the jumble of mixed-up data is almost useless.
Data virtualization has the capability to solve all these issues with a single agile and cost-effective approach. Your data lake, or maybe data swamp, can be transformed into a valuable logical data lake by creating an abstraction layer with data virtualization. This layer enables users to combine sources of information or configure new databases and clusters faster than with traditional methods and without any replication of data. For governance issues, the Denodo semantic layer offers security and governance options down to the user level that ensure compliance. Finally, even the nastiest of data swamps can be cut through with ease using Denodo’s Metadata Catalog and Self-Service BI tools, which give utility to even the least technically savvy business users. If you are looking at implementing a data lake at your company, or maybe you already have and are experiencing some of these challenges, look at data virtualization to realize the full potential of a data lake.
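For readers who want to see what "an abstraction layer without replication" means in practice, here is a minimal sketch in Python. It is not Denodo's API, and every name in it (`VirtualView`, `crm_customers`, `billing_totals`, `customer_revenue`) is hypothetical. The two source systems are stubbed as in-memory lists. The point is only that the view reads its sources at query time and keeps no copy of its own.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Source = Callable[[], Iterable[Row]]


class VirtualView:
    """Joins two sources on a shared key at query time; it stores no rows itself."""

    def __init__(self, left: Source, right: Source, key: str) -> None:
        self.left = left
        self.right = right
        self.key = key

    def query(self) -> List[Row]:
        # Read both sources fresh on every call, so there is no copy to keep in sync.
        right_by_key = {row[self.key]: row for row in self.right()}
        return [
            {**row, **right_by_key[row[self.key]]}
            for row in self.left()
            if row[self.key] in right_by_key
        ]


# Hypothetical source systems: a CRM and a billing database, stubbed as lists.
crm_customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
billing_totals = [
    {"customer_id": 1, "total": 1200},
    {"customer_id": 2, "total": 450},
]

customer_revenue = VirtualView(
    lambda: crm_customers, lambda: billing_totals, key="customer_id"
)

before = customer_revenue.query()
billing_totals[1]["total"] = 900  # an update made only in the source system
after = customer_revenue.query()
```

Because the view holds no data, the update made in the billing source shows up in `after` without any second copy having to be refreshed. A real virtualization layer would add query pushdown, caching, and access control on top of this idea.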