Serdar Yegulalp
Senior Writer

Delta Lake gives Apache Spark data sets new powers

news
Apr 24, 2019 | 2 mins
Apache Spark | Data Management | Machine Learning

A new open source project from Databricks adds ACID transactions, versioning, and schema enforcement to Spark data sources that don't have them


Databricks, the company founded by the original developers of Apache Spark, has released Delta Lake, an open source storage layer for Spark that provides ACID transactions and other data-management functions for machine learning and other big data work.

Many kinds of data work need features like ACID transactions or schema enforcement for consistency, metadata management for security, and the ability to work with discrete versions of data. Such features don’t come standard with every data source out there, so Delta Lake supplies them for any Spark DataFrame data source.

Delta Lake can be used as a drop-in replacement for accessing storage systems like HDFS. Data ingested into Spark through Delta Lake is stored in Parquet format in a cloud storage service of your choice. Developers can use their choice of Java, Python, or Scala to access Delta Lake’s API set.
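For developers who already write DataFrames to Parquet, the switch is largely a matter of changing the format string. Here is a minimal PySpark sketch; the table path and sample data are illustrative, and it assumes Spark was launched with the Delta Lake package on its classpath:

```python
from pyspark.sql import SparkSession

# Assumes Spark was started with the Delta Lake package available,
# e.g. spark-submit --packages io.delta:delta-core_2.11:<version>
spark = SparkSession.builder.appName("delta-demo").getOrCreate()

# Any DataFrame source will do; this one is built in memory for illustration.
events = spark.createDataFrame(
    [(1, "click"), (2, "view")],
    ["id", "action"],
)

# Writing with format("delta") lays the data down as Parquet files
# plus a transaction log that provides the ACID guarantees.
events.write.format("delta").save("/tmp/delta/events")

# Reading back uses the same DataFrame API, just with the "delta" format.
spark.read.format("delta").load("/tmp/delta/events").show()
```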

Delta Lake supports most of the existing Spark SQL DataFrame functions for reading and writing data. It also supports Spark Structured Streaming as a source or destination, although not the DStream API. Every read and write through Delta Lake has an ACID transaction guarantee, so that multiple writers will have their writes serialized and multiple readers will see consistent snapshots.
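Used as a streaming source and sink, that might look roughly like the sketch below; it continues from the batch example above, and the checkpoint and table paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-streaming-demo").getOrCreate()

# A Delta table can serve as a Structured Streaming source...
stream = spark.readStream.format("delta").load("/tmp/delta/events")

# ...and as a streaming sink, with each micro-batch committed transactionally.
query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/events_copy")
    .start("/tmp/delta/events_copy")
)

query.awaitTermination()
```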

Reading a specific version of a data set—what the Delta Lake documentation calls “time travel”—works by simply reading a DataFrame with an associated time stamp or version ID. Delta Lake also ensures the schema of the DataFrame being written matches the table it’s being written to; if there’s a mismatch, it throws an exception rather than change the schema. (Spark’s file APIs will replace the table in such a case.)
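In code, a time-travel read is just another DataFrame read with an extra option. The option names below (versionAsOf and timestampAsOf) follow the Delta Lake documentation; the version number, date, and paths are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel-demo").getOrCreate()

# Read the table as it looked at a specific commit version...
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

# ...or as it looked at a point in time.
snapshot = (
    spark.read.format("delta")
    .option("timestampAsOf", "2019-04-24")
    .load("/tmp/delta/events")
)

# Schema enforcement: appending a DataFrame whose schema doesn't match the
# table raises an exception instead of silently rewriting the table's schema.
bad = spark.createDataFrame([(3, "click", "oops")], ["id", "action", "unexpected"])
# bad.write.format("delta").mode("append").save("/tmp/delta/events")  # would fail
```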

Future releases of Delta Lake may support more of Spark’s public API set, although DataFrameReader/Writer are the main focus for now.


Serdar Yegulalp is a senior writer at InfoWorld, covering software development and operations tools, machine learning, containerization, and reviews of products in those categories. Before joining InfoWorld, Serdar wrote for the original Windows Magazine, InformationWeek, the briefly resurrected Byte, and a slew of other publications. When he's not covering IT, he's writing SF and fantasy published under his own personal imprint, Infinimata Press.
