Delta Lake: Powering Change Data Capture (CDC) Through Lakehouse Architecture

Swapnil Chougule
Mar 20, 2021

Data warehouses have been used in business intelligence and decision-making systems for decades and have a long history of playing a crucial role there. With time, organizations started collecting huge amounts of data from a variety of data sources. Eventually, the need to deal with unstructured data, semi-structured data and data with high variety, velocity and volume increased. Hence, people began building datalakes for these use cases.

Though a datalake meets all the above-mentioned needs, it lacks some of the core features of a data warehouse, like transactions, data quality checks and consistency. In order to meet all needs, organizations started implementing and maintaining both systems: the datalake and the data warehouse.

The data warehouse powers business intelligence (BI), data reporting and similar use cases, whereas the datalake powers SQL analytics, real-time monitoring, data science, machine learning and more.

Need:

Over the years, organizations started struggling to maintain both a datalake and a data warehouse: keeping a single source of truth for the data, paying for multiple data pipelines, and so on. The importance of a unified architecture serving use cases from BI to AI increased within organizations.

Data Lakehouse

A lakehouse is a combination of data warehouses and data lakes: it implements data structures and data management features similar to those of a data warehouse, for unstructured and semi-structured data, directly on the kind of low-cost storage used for data lakes, in order to serve use cases from BI to AI.

It helps eliminate redundant ETL jobs and reduce data redundancy, and it saves the time and effort of administering multiple platforms.

Key Features:

  1. Transaction support
  2. Schema enforcement and governance
  3. Standardized storage formats
  4. BI support
  5. Storage is decoupled from compute
  6. Support for diverse data types & workloads
  7. Support for streaming

Many lakehouse architectural features have been implemented in proprietary services like Databricks Delta and Azure Synapse, and in managed services like Google BigQuery and AWS Redshift. Adoption of the lakehouse architecture is picking up pace with open-source projects like Delta Lake, Apache Iceberg and Apache Hudi.

Delta Lake

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It provides all the key features of the lakehouse architecture.

Why Delta Lake:

  1. ACID Transactions
  2. Scalable Metadata Handling with Spark
  3. Time Travel — Access and revert to earlier versions of data
  4. Open Format — Stores data in parquet format
  5. Unified Batch & Streaming Source & Sink
  6. Schema Evolution
  7. Fast performance with Apache Spark
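
As a quick illustration of the Time Travel and Schema Evolution features above, here is a minimal Scala sketch; the table path, the version number and the newColumnsDF DataFrame are illustrative assumptions, not details from this post:

    // Time travel: read the table as it was at an earlier version
    val oldSnapshot = spark.read.format("delta")
      .option("versionAsOf", 5)                  // illustrative version number
      .load("s3a://my-bucket/insights")          // illustrative path

    // Schema evolution: let new columns in the incoming DataFrame be added to the table schema
    newColumnsDF.write.format("delta")
      .option("mergeSchema", "true")
      .mode("append")
      .save("s3a://my-bucket/insights")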

CDC Use Case

We came across a complex use case where we were deriving insights for our customers' users through numerous analytics and machine learning jobs. With more than a hundred customers and multiple types of user insights for millions of users, the complexity was high.

Each computed insight for a user corresponds to a downstream activation task: engaging users through various messaging channels, product recommendations, user targeting etc. (fig-1)

fig-1: Old Architecture

Multiple data science and analytics jobs compute user insights on a daily/hourly basis and write them into the datalake. Irrespective of whether a given user's insights had changed, all user insights were being pushed to the downstream activation jobs. The cost of an activation job is proportional to the data volume, so as the data size grew over time, maintaining the activation jobs became problematic. Activating unchanged insights for a user also drove little ROI and compromised the customer experience. (fig-1)

We decided to capture only the insights that changed in the last day/job run. This posed the challenge of finding a data store that offers:
1. Reliable transactional capabilities (for the writer side)

2. Highly performant reads with minimal IO (for the reader side)

After experimenting with a few open-source file formats, we decided to go ahead with Delta Lake, as it met all our primary needs and had been proven at scale.

Fig-2: Lakehouse Architecture

Implementation Details:

We used Spark to write to and read from Delta Lake.

Necessary Spark config for Delta Lake (a minimal sketch follows the list):

  1. Set the Spark SQL extensions to the Delta Spark session extension (Doc)

  2. If using AWS S3 as the underlying storage for Delta Lake, configure the S3-compatible log store (Doc)

  3. Include the hadoop-aws JAR in the classpath. Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class.
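
A minimal sketch of this configuration in Scala for the Spark 3.x / Delta Lake 0.8.x era; the app name is illustrative and S3 credential setup is omitted:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("insights-delta-job")             // illustrative name
      // 1. Delta SQL extensions (plus the Delta catalog on Spark 3.x)
      .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
      .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
      // 2. S3 as underlying storage: single-driver log store, per the Delta Lake S3 docs of that era
      .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
      // 3. S3A filesystem class provided by the hadoop-aws JAR on the classpath
      .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
      .getOrCreate()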

Writer Jobs:

Writer Spark jobs used the Delta Lake MERGE operation to write data into Delta Lake, partitioned by customer, type of computed insight and date of computation.

Here's a snippet for the MERGE operation:
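
A minimal Scala sketch of such a MERGE; the table path, the newInsightsDF DataFrame and the column names (customer_id, insight_type, user_id, insight_value) are illustrative, not the actual schema:

    import io.delta.tables.DeltaTable

    val insights = DeltaTable.forPath(spark, "s3a://my-bucket/insights")   // illustrative path

    insights.as("t")
      .merge(
        newInsightsDF.as("s"),                   // output of the upstream analytics/ML job (illustrative)
        "t.customer_id = s.customer_id AND t.insight_type = s.insight_type AND t.user_id = s.user_id")
      .whenMatched("t.insight_value <> s.insight_value")   // update only rows whose insight actually changed
      .updateAll()
      .whenNotMatched()
      .insertAll()
      .execute()

Conditioning the whenMatched clause on an actual value change avoids rewriting rows whose insights did not change.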

Reader job:

The Spark reader job reads only the insights that changed since the last job run.
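
One possible sketch of this in Scala uses Delta time travel to diff the current table against the version recorded at the previous run; the path, the way the previous version is tracked and the output location are illustrative assumptions, not necessarily how the actual job worked:

    // Current state of the insights table
    val current = spark.read.format("delta")
      .load("s3a://my-bucket/insights")

    // State of the table as of the previous run (version tracked by the pipeline, e.g. in job metadata)
    val previousVersion = 41                     // illustrative
    val previous = spark.read.format("delta")
      .option("versionAsOf", previousVersion)
      .load("s3a://my-bucket/insights")

    // Rows that are new or changed since the previous run
    val changedInsights = current.except(previous)

    changedInsights.write.mode("overwrite")
      .parquet("s3a://my-bucket/changed-insights")   // handed to the downstream activation jobs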

Thus we could reduce the input data size to the downstream activation jobs by around 35%-40%, keeping only the relevant (changed) records after change data capture.

Best Practices:

  • Reduce the search space for matches by adding known constraints to the match condition, e.g. events.date = current_date() AND events.country = 'INDIA'
  • Compact files into larger files to improve read throughput
  • Control the shuffle partitions for writes through spark.sql.shuffle.partitions
  • Repartition output data before write by setting spark.delta.merge.repartitionBeforeWrite=true
  • Run the vacuum operation to remove files that are no longer referenced by a Delta table and are older than the retention threshold; a sketch of the compaction and vacuum steps follows this list. (Check out vacuum optimization from my previous blog)
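
A minimal Scala sketch of the compaction and vacuum steps, with an illustrative path and date partition; the number of files and the retention period are tuning choices:

    import io.delta.tables.DeltaTable

    val path = "s3a://my-bucket/insights"        // illustrative

    // Compaction: rewrite one partition into fewer, larger files.
    // dataChange = false marks the commit as a pure layout change for downstream consumers.
    spark.read.format("delta")
      .load(path)
      .where("date = '2021-03-20'")
      .repartition(16)                           // target number of files
      .write.format("delta")
      .option("dataChange", "false")
      .option("replaceWhere", "date = '2021-03-20'")
      .mode("overwrite")
      .save(path)

    // Vacuum: delete files no longer referenced by the table and older than the retention threshold (in hours)
    DeltaTable.forPath(spark, path).vacuum(168)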

Current lakehouses deliver all of these features at a lower cost, but their performance can still lag behind specialized systems. These gaps will be addressed as the technology continues to mature and develop.
