Aligning Data Virtualization, Streaming, and ETL
What do data virtualization, streaming, and extract, transform, and load (ETL) processes have in common? They all deliver data from disparate sources to the business intelligence (BI), analytics, and other applications used by businesspeople, data scientists, and analysts. Like utilities, these tools operate in the background until something serious goes wrong, like a burst pipe or an electrical short.
However, other subtler problems can arise when using these tools, which relate to the quality and consistency of the data delivered. These issues arise from the data modelling and preparation that all three tools require upfront. They may arise separately and individually in each of these tools but can be most damaging when they arise across the tools when multiple modes of data delivery are being used in the same environment. With the explosion of data driven by external sources, feeding a plethora of data warehouses and lakes, many organizations are now using all three modalities of data delivery.
For clarity and simplicity, this is how I distinguish between these three tools:
- ETL tools deliver data off-line, usually overnight, and often in batches—large or small. Sometimes ETL processes are used during business hours to deliver smaller batches of data. In either case, the data can usually be inspected before it is used and cleansed, if necessary.
- Streaming delivers data continuously from sources that pump out data nonstop. In most cases, such data is most useful when it is fresh and can be analyzed immediately, even before storing it (if required) on disk.
- Data virtualization also delivers data in real time, but only when requested by a user or application.
The differences, therefore, are of delivery timing (from immediate to intermittent), request type (planned, automated, or on demand), and data checking before use (planned or just not done). Given these differences, it should be clear that if data quality or consistency issues arise, they are most easily detected and fixed in ETL tools, less so in data virtualization implementations, and present significant challenges in streaming approaches. When these tools are used individually, developers make provisions for quality and consistency checks, as well as fixes, if the individual tool makes it possible. However, where multiple tools are used together, more care is needed.
Data Quality and Consistency Across Data Delivery Tools
Let’s look at an imaginary energy company called Bright Sparks. As part of a green initiative, they have installed smart meters for half of their customers, to better manage, and ultimately reduce, energy consumption.
Bright Sparks has a traditional data warehouse/data marts environment that supports a billing inquiries call center. A traditional ETL system combines data from multiple operational billing systems based on either manual or electronic bimonthly meter readings. Significant data modelling and preparation work is undertaken in the design and delivery of the ETL system to overcome the challenges of historically fragmented customer, location, and meter identification data. The effort has paid off; customer satisfaction levels have improved dramatically. It’s just the basics: linking the right meter to the right location to the right customer. But change is coming.
Customers with smart meters who opt in to a new green billing scheme will be able to save money by switching to off-peak electricity usage. The larger the percentage of off-peak use they guarantee, the lower the unit price for both on- and off-peak they will receive, as long as they meet their targets, with a sliding scale of penalties if they miss. The billing department thought the scheme too complex, but the engineers won. As a result, meter data is now streamed continuously into a new data lake environment and subjected to ongoing analysis to calculate the correct discounted rate for billing. Note that getting this billing right depends on getting the same meter/location/customer relationship aligned with the definitions that first went into the ETL system.
As you might imagine, this green scheme is guaranteed to cause some considerable spike in questions to the call center, where reps will now have to combine data from the traditional data warehouse with that from the data lake in order to get a comprehensive picture of a customer’s billing situation. This is a job for data virtualization, as I described in an earlier blog post. The definition of the data virtualization model presents yet a third opportunity to ensure alignment with the meter/location/customer relationship already defined for ETL and streaming.
Semantic vs. Temporal Consistency
Whether alignment is between three data delivery systems or, as is more commonly the case, between two, designers must consider two very different types of consistency requirements.
Semantic consistency speaks to the meaning and structure of the data fields available in different sources. Customer or product IDs in different source systems often have inconsistent definitions or structures, overlapping or mismatched value sets, or other problems that arise from the original design choices made for the source systems. In the case of internal data sources, a review of design documentation followed by an examination of the source system data and code will identify likely issues. External data sources may be more of a challenge, given the common lack of adequate documentation and no access to source code.
Temporal consistency issues, although less common than the semantic variety, are more challenging. They arise because different data elements are created and changed at different times. Traditional financial systems, for example, often have end of day jobs that reconcile the day’s changes across multiple accounts. Until such jobs run, data within and across the systems is inconsistent. As a result, ETL jobs run after this reconciliation. If the financial system is accessed during the day via a data virtualization tool, its designers must be aware of the “unfinished nature” of the data that is returned and the implications of joining it with data that was previously sourced via ETL processes.
Data virtualization, ETL, and streaming are complementary data delivery approaches that provide data to BI, analytics, and similar applications. They differ in the timeliness, freshness, and consistency of the data delivered, characteristics that are balanced and traded off in system design. When two or more of these tools are employed in a solution, designers should take particular care to ensure that semantic and temporal consistency concerns are addressed, so that end users see combined data results that are intrinsically correct.