Delta Lake: Powering Change Data Capture (CDC) Through Lakehouse Architecture
Data warehouses have been central to business intelligence and decision-making systems for decades, playing a crucial role there. With time, organizations started collecting huge amounts of data from a variety of data sources, and the need to deal with unstructured data, semi-structured data and data with high variety, velocity and volume increased. Hence, people began building data lakes for these use cases.
Though a data lake meets all of the above needs, it lacks some of the core features of a data warehouse like transactions, data quality checks and consistency. In order to meet all needs, organizations started implementing and maintaining both systems: a data lake and a data warehouse.
The data warehouse powers business intelligence (BI), data reporting and similar use cases, whereas the data lake powers SQL analytics, real-time monitoring, data science and machine learning.
Need:
Over the years, organizations started struggling to maintain both a data lake and a data warehouse: keeping a single source of truth for the data, the cost of multiple data pipelines, and so on. The importance of a unified architecture serving use cases from BI to AI grew within organizations.
Data Lakehouse
A lakehouse is a combination of data warehouses and data lakes: it implements data structures and data management features similar to those in a data warehouse, for unstructured and semi-structured data, directly on the kind of low-cost storage used for data lakes, to serve use cases from BI to AI.
It helps eliminate redundant ETL jobs and reduce data redundancy, and it saves the time and effort of administering multiple platforms.
Key Features:
- Transaction support
- Schema enforcement and governance
- Standardized storage formats
- BI support
- Storage is decoupled from compute
- Support for diverse data types & workloads
- Support for streaming
Many of the lakehouse architectural features have been implemented in proprietary services like Databricks Delta and Azure Synapse, and in managed services like Google BigQuery and AWS Redshift. Adoption of the lakehouse architecture is gaining pace with open-source projects like Delta Lake, Apache Iceberg and Apache Hudi.
Delta Lake
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It provides all the key features of the lakehouse architecture.
Why Delta Lake:
- ACID Transactions
- Scalable Metadata Handling with Spark
- Time Travel — Access and revert to earlier versions of data
- Open Format — Stores data in parquet format
- Unified Batch & Streaming Source & Sink
- Schema Evolution
- Fast performance with Apache Spark
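As a quick illustration of the open format and time travel features listed above, here is a minimal PySpark sketch; the path and column names are hypothetical, and a Spark session already configured for Delta Lake is assumed (see the configuration sketch under Implementation Details below).

```python
from pyspark.sql import SparkSession

# Assumes a SparkSession already configured for Delta Lake
spark = SparkSession.builder.getOrCreate()

# Write a DataFrame as a Delta table: data lands as Parquet files plus a transaction log
df = spark.range(0, 100).withColumnRenamed("id", "user_id")
df.write.format("delta").mode("overwrite").save("/tmp/delta/users")

# Time travel: read an earlier version of the same table
old_df = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/tmp/delta/users")
)
old_df.show()
```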
CDC Use Case
We came across a complex use case where we were deriving insights for our customers' users through numerous analytics and machine learning jobs. With more than a hundred customers and multiple types of user insights for millions of users, the complexity was high.
Each computed user insight corresponds to a downstream activation task: engaging users through various messaging channels, product recommendations, user targeting etc. (fig-1)
Multiple data science and analytics jobs compute user insights on a daily/hourly basis and write them into the data lake. Irrespective of whether a given user's insights had changed, all user insights were being pushed to the downstream activation jobs. The cost of an activation job is proportional to the data volume. With time, the data size grew and caused issues in maintaining the activation jobs. Also, activating unchanged insights for a user was not driving much ROI and was compromising customer experience. (fig-1)
We decided to capture only the insights that observed a change in the last day/job run. This threw a challenge at us: find a store that gives
1. Reliable transactional capabilities (for writer side)
2. Highly performant data store with minimal IO (for reader side)
After experimenting with a few open-source file formats, we decided to go ahead with Delta Lake, as it met all our primary needs and was proven at scale.
Implementation Details:
We used Spark to write to and read from Delta Lake.
Necessary Spark config for Delta Lake (a configuration sketch follows this list):
- Set the Spark SQL extensions to the Delta Spark session extension (Doc)
- If using AWS S3 as the underlying storage for Delta Lake, apply the S3-specific configuration (Doc)
- Include the hadoop-aws JAR in the classpath; Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class
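A configuration sketch along these lines, assuming Spark 3.x with S3A storage; the exact packages and the S3 LogStore setting depend on the Delta Lake release, so treat the values below as placeholders and refer to the linked docs for authoritative values.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-cdc-writer")
    # Delta Lake SQL extensions
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # S3A filesystem from the hadoop-aws JAR (must be on the classpath)
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    # Single-driver LogStore for S3 in older Delta releases; newer releases use
    # io.delta.storage.S3SingleDriverLogStore from the delta-storage artifact
    .config("spark.delta.logStore.class",
            "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .getOrCreate()
)
```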
Writer Jobs:
Writer Spark jobs used the Delta Lake MERGE operation to write data into Delta Lake. We used partitioning on the basis of customer, type of computed insight and date of computation.
Here’s an illustrative sketch of such a MERGE operation:
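This sketch uses the Delta Lake Python API; the table path and the columns customer_id, insight_type, user_id, insight_value and updated_at are hypothetical stand-ins for the actual schema.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Hypothetical names: the target Delta table of user insights and the freshly
# computed insights produced by the analytics / ML jobs
insights_table = DeltaTable.forPath(spark, "s3a://bucket/delta/user_insights")
new_insights_df = spark.read.parquet("s3a://bucket/staging/computed_insights")

(
    insights_table.alias("target")
    .merge(
        new_insights_df.alias("source"),
        "target.customer_id = source.customer_id AND "
        "target.insight_type = source.insight_type AND "
        "target.user_id = source.user_id"
    )
    # Update only rows whose insight value actually changed, stamping the change time
    .whenMatchedUpdate(
        condition="target.insight_value <> source.insight_value",
        set={
            "insight_value": "source.insight_value",
            "updated_at": F.current_timestamp(),
        },
    )
    # Insert insights for users not seen before
    .whenNotMatchedInsert(
        values={
            "customer_id": "source.customer_id",
            "insight_type": "source.insight_type",
            "user_id": "source.user_id",
            "insight_value": "source.insight_value",
            "updated_at": F.current_timestamp(),
        },
    )
    .execute()
)
```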
Reader job:
The Spark reader job reads only the insights that changed in the last job run (one possible sketch follows).
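One way to do this, consistent with the MERGE sketch above, is to filter on the updated_at column stamped by the writer; the paths, columns and last-run bookkeeping below are hypothetical.

```python
from pyspark.sql import functions as F

# last_run_ts would normally come from the job's own bookkeeping / state store
last_run_ts = "2021-01-01 00:00:00"

changed_insights_df = (
    spark.read.format("delta")
    .load("s3a://bucket/delta/user_insights")
    # Only rows the writer touched since the previous run
    .where(F.col("updated_at") > F.to_timestamp(F.lit(last_run_ts)))
)

# Only the changed insights flow to the downstream activation jobs
changed_insights_df.write.mode("overwrite").parquet("s3a://bucket/activation/changed_insights")
```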
Thus we could reduce the input data size towards the downstream activation jobs by around 35%-40%, keeping only the relevant records after change data capture.
Best Practices:
- Reduce the search space for matches by adding known constraints in the match condition, e.g. events.date = current_date() AND events.country = 'INDIA'
- Compact files into larger files to improve read throughput (see the sketch after this list)
- Control the shuffle partitions for writes through spark.sql.shuffle.partitions
- Repartition output data before write by setting spark.delta.merge.repartitionBeforeWrite=true
- Run the vacuum operation to remove files that are no longer referenced by a Delta table and are older than the retention threshold. (Check out vacuum optimization from my previous blog)
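A sketch of the compaction and vacuum steps from the list above, with hypothetical paths and partition values:

```python
from delta.tables import DeltaTable

path = "s3a://bucket/delta/user_insights"
partition_filter = "customer_id = 'acme' AND insight_type = 'churn_risk'"

# Compact one partition into fewer, larger files without changing the data
(
    spark.read.format("delta")
    .load(path)
    .where(partition_filter)
    .repartition(16)
    .write
    .format("delta")
    .option("dataChange", "false")        # mark the commit as compaction-only
    .option("replaceWhere", partition_filter)
    .mode("overwrite")
    .save(path)
)

# Remove files no longer referenced by the table and older than the retention
# threshold (in hours); the default retention is 7 days
DeltaTable.forPath(spark, path).vacuum(168)
```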
Current lakehouses address all the lakehouse features at a lower cost, but their performance can still lag behind specialized systems. These performance gaps will be addressed as the technology continues to mature and develop.