Explaining DataBricks Delta Lake

Binil
5 min read · Aug 19, 2019

Delta Lake is an open source storage layer that brings reliability to Data Lakes. A Data Lake is a centralized repository that allows you to store data at any scale. Mostly this data is in raw format; the structure or schema of the data is not defined when the data is captured. This means you can store all of your data without careful up-front design or the need to know what questions you might need answered in the future. One problem with a Data Lake is that it lacks reliability, and bad data can end up in it. Delta Lake is a storage layer that sits on top of the Data Lake. Delta Lake looks at the data coming into the Data Lake and makes sure that it adheres to the schema you specified. This way, data that makes it into Delta Lake is correct and reliable. Delta Lake can handle both batch and streaming data, and it also optimizes performance when compared to a plain Data Lake.

Delta Lake

Key features of the Delta lake are the following:

ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and due to the lack of transactions, data engineers have to go through a tedious process to ensure data integrity. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level. ACID stands for Atomicity, Consistency, Isolation and Durability.

Scalable Metadata Handling: In big data, even the metadata itself can be “big data”. Delta Lake treats metadata just like data, leveraging Spark’s distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.

Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
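
For example, a previous snapshot of a Delta table can be read back using the versionAsOf or timestampAsOf reader options. A minimal sketch, assuming a Delta table already exists at the hypothetical path "/data":

# Read the current version of the table
current_df = spark.read.format("delta").load("/data")

# Time travel: read the table as it was at version 0
v0_df = spark.read.format("delta").option("versionAsOf", 0).load("/data")

# Or read the table as it was at a given point in time
old_df = spark.read.format("delta").option("timestampAsOf", "2019-08-01").load("/data")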

Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
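
As an illustration, a Delta table path can serve as both a Structured Streaming source and sink. A minimal sketch, assuming the hypothetical paths "/data", "/data_copy" and "/checkpoints/data_copy":

# Use a Delta table as a streaming source
stream_df = spark.readStream.format("delta").load("/data")

# Use another Delta table as the streaming sink; the checkpoint location
# tracks progress so the query can be restarted safely
query = stream_df.writeStream \
    .format("delta") \
    .option("checkpointLocation", "/checkpoints/data_copy") \
    .start("/data_copy")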

Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.

Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
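
For example, an append that adds a new column can evolve the table schema automatically by enabling the mergeSchema option. A sketch, assuming "df_with_new_column" is a DataFrame with one extra column relative to the hypothetical table at "/data":

# mergeSchema evolves the table schema to include the new column
# instead of rejecting the write
df_with_new_column.write \
    .format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .save("/data")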

100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.

Delta Lake Batch Operations:

Create: Creating a Delta table is as easy as changing the format while performing a write.

The command below saves a DataFrame named “df” as a Delta table in the folder “/data”.

df.write.format("delta").save("/data")

Append: Append is used to add data to an existing Delta table.

df.write.format("delta").mode("append").save(Path)

Upsert: Upsert is a combination of update and insert. The operation tries to insert a row, and if the row already exists, it updates the row instead.
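
Delta Lake exposes upserts through the MERGE operation. Below is a minimal sketch using the DeltaTable Python API, assuming a target Delta table at the hypothetical path "/data", a source DataFrame named "updates_df", and that "id" is the key column used for matching:

from delta.tables import DeltaTable

# Load the target Delta table
target = DeltaTable.forPath(spark, "/data")

# Upsert: update rows that match on id, insert the rest
target.alias("t") \
    .merge(updates_df.alias("s"), "t.id = s.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()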

Managed Delta Lake:

Delta Lake, when managed and queried via the DataBricks platform, includes additional features and optimizations.

These include:

OPTIMIZE: This compacts many smaller files into larger files. Often data is distributed across many small files, sometimes spread across different machines or even different geographies. These small files make data reads inefficient, as a disproportionate amount of time goes into opening the files rather than actually reading the data. With OPTIMIZE, small files are compacted together into new larger files of up to 1 GB. The 1 GB limit was chosen by DataBricks as a trade-off between query speed and the run-time performance of OPTIMIZE itself. How frequently to run OPTIMIZE needs to be determined by trading off better end-user query performance against the cost of compute hours, as OPTIMIZE is a time-consuming activity. An example invocation is sketched below.
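
A minimal sketch of invoking OPTIMIZE from a notebook via spark.sql, assuming a Delta table registered under the hypothetical name "events" and partitioned by a "date" column:

# Compact the small files of the whole table into larger files (up to ~1 GB)
spark.sql("OPTIMIZE events")

# Optionally restrict compaction to a subset of partitions
spark.sql("OPTIMIZE events WHERE date >= '2019-08-01'")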

Data Skipping: Data Skipping is a mechanism for speeding up queries that contain filters. Consider a data set partitioned by country. A query using

WHERE country = 'INDIA'

wouldn’t access data that resides in partitions corresponding to country != 'INDIA'.
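
As a sketch, the partition layout that enables this skipping can be created with partitionBy at write time, assuming the DataFrame "df" has a "country" column and the hypothetical path "/data_by_country":

# Write the table partitioned by country so that queries filtering on
# country can skip unrelated partitions entirely
df.write \
    .format("delta") \
    .partitionBy("country") \
    .save("/data_by_country")

# Only the partition for country = 'INDIA' is read for this query
india_df = spark.read.format("delta") \
    .load("/data_by_country") \
    .where("country = 'INDIA'")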

ZORDER: Z-ordering is another technique to speed up queries. It colocates related information in the same set of files by mapping multidimensional data to one dimension while preserving the locality of the data points. Z-ordering sorts the data within a partition based on the specified Z-order column, using the algorithm described here. The Z-order column should be different from the partition column.

The diagram below explains the functioning of ZORDER; refer to this blog for more details. As shown in the diagram, ZORDER improves execution speed by decreasing the number of files scanned and the occurrence of false positives.

Legend:

  • Gray dot = data point e.g., chessboard square coordinates
  • Gray box = data file; in this example, we aim for files of 4 points each
  • Yellow box = data file that’s read for the given query
  • Green dot = data point that passes the query’s filter and answers the query
  • Red dot = data point that’s read, but doesn’t satisfy the filter; “false positive”

The syntax for ZORDER in Delta Lake is shown below:

OPTIMIZE {table_name} ZORDER BY (col1, col2)

The command ensures that data with the same col1 value is colocated, and within that, data with the same col2 value is colocated. Colocating data in this way minimizes the reading of data that doesn’t satisfy the filter criteria.

VACUUM:

The VACUUM command is used to clean up invalid files which are no longer needed. Invalid files are, for example, the small files that have been merged into larger files by the OPTIMIZE command. The syntax of the VACUUM command is

VACUUM {table_name} RETAIN {number_of_hours} HOURS

The number_of_hours parameter is the retention interval, specified in hours.
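
A minimal sketch of running VACUUM from Python, assuming a Delta table at the hypothetical path "/data" and a retention interval of 168 hours (7 days):

from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/data")

# Remove files that are no longer referenced by the table and are
# older than the 168-hour retention interval
table.vacuum(168)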

Delta Lake Architecture:

Delta Lake has three stages of Data enrichment.

Bronze Tables: contain the raw data ingested from various sources like json files, data from RDBMS systems, IoT data etc.

Silver Tables: provide a more refined view of the Bronze data.

Gold Tables: provide business-level aggregates often used for reporting and dashboarding.

Delta Lake Architecture
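
A minimal sketch of this Bronze → Silver → Gold flow using Delta tables; the paths, columns, and transformations are illustrative assumptions only:

# Bronze: ingest raw JSON as-is into a Delta table
raw_df = spark.read.json("/raw/events")
raw_df.write.format("delta").mode("append").save("/bronze/events")

# Silver: clean and refine the Bronze data
bronze_df = spark.read.format("delta").load("/bronze/events")
silver_df = bronze_df.dropDuplicates().filter("event_type IS NOT NULL")
silver_df.write.format("delta").mode("overwrite").save("/silver/events")

# Gold: business-level aggregates for reporting and dashboards
gold_df = spark.read.format("delta").load("/silver/events") \
    .groupBy("country", "event_type") \
    .count()
gold_df.write.format("delta").mode("overwrite").save("/gold/event_counts")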

References:

www.databricks.com

www.amazon.com

https://venturebeat.com/2019/04/24/databricks-launches-delta-lake-an-open-source-data-lake-reliability-project/

