Apache Parquet paves the way for better Hadoop data storage

Newly graduated from the Apache Incubator, the Parquet project allows column-stored data to be handled at high speed

Apache Parquet, which provides columnar storage in Hadoop, is now a top-level project of the Apache Software Foundation (ASF), paving the way for more advanced use in the Hadoop ecosystem.

Already adopted by Netflix and Twitter, Parquet began in 2013 as a co-production between engineers at Twitter and Cloudera to allow complex data to be encoded efficiently in bulk.

Databases traditionally store information in rows and are optimized for working with one record at a time. Columnar storage systems serialize and store data by column, so a query that scans a few fields across a large data set can read only those columns instead of every full record.
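
To make the difference concrete, here is a minimal Python sketch of the two layouts. It is not how Parquet stores data on disk; the sample records and field names (`user_id`, `country`, `watch_minutes`) are invented for illustration, and `to_columns` is a hypothetical helper.

```python
# Hypothetical sample records; the field names are illustrative only.
records = [
    {"user_id": 1, "country": "US", "watch_minutes": 42},
    {"user_id": 2, "country": "DE", "watch_minutes": 17},
    {"user_id": 3, "country": "US", "watch_minutes": 63},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(records)

# Row layout: totaling one field still walks every full record.
total_from_rows = sum(row["watch_minutes"] for row in records)

# Column layout: the same total reads one list and ignores the other fields.
total_from_columns = sum(columns["watch_minutes"])
```

Both totals are the same; what changes is how much unrelated data has to be read to compute them, which is where a columnar format gains its advantage on wide tables.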

Hadoop was built for managing large sets of data, so a columnar store is a natural complement. Most Hadoop projects can read and write Parquet data; the Hive, Pig, and Drill projects already do, as does conventional MapReduce.

As another benefit, per-column data compression further accelerates performance in Parquet. A column of text is compressed differently from a column holding only integers, so compressing columns separately lets each one use the scheme that suits its data. Parquet also implements column compression so that it’s “future-proofed to allow adding more encodings as they are invented and implemented.”
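
The idea that different columns suit different encodings can be sketched in a few lines of Python. This is a toy illustration, not Parquet's actual encoders or file format: it applies run-length encoding to a repetitive text column and delta encoding to a steadily increasing integer column. The function names and sample values are assumptions made for the example.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode: collapse each run of equal values into (value, count)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


def delta_encode(numbers):
    """Keep the first number, then store each difference from the previous one."""
    if not numbers:
        return []
    return [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]


def delta_decode(deltas):
    """Rebuild the original numbers by accumulating the differences."""
    out = []
    total = 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Hypothetical columns: a low-cardinality text field and a rising integer field.
country = ["US", "US", "US", "DE", "DE", "US"]
timestamps = [1000, 1005, 1010, 1015, 1020, 1025]

encoded_country = rle_encode(country)
encoded_timestamps = delta_encode(timestamps)
```

Here the text column shrinks to three (value, count) pairs and the integer column becomes one starting value followed by small, repeated differences. Neither trick would help if the values of the two columns were interleaved row by row.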

Early adopters and project leads have used Parquet for some time and built functionality around it. Cloudera, the project’s co-progenitor, uses Parquet as a native data storage format for its Impala analytics database project, and MapR has added data self-description functions to Parquet. Netflix — never one to shy away from a forward-looking technology (such as Cassandra) — has 7 petabytes of warehoused data in Parquet format, according to the ASF.

Parquet isn’t the only way to store columnar data in Hadoop, but it’s shaping up as the leader. Hive has its own columnar-data format, called ORC, although it’s mainly intended as an extension to Hive rather than as a general data store for Hadoop.

Hortonworks, a Cloudera competitor (in more ways than one), claimed earlier in Parquet’s lifecycle that ORC compresses data more efficiently than Parquet. IBM ran its own performance comparisons in September 2014 and found that while ORC used the least HDFS storage, Parquet had the best overall query and analysis times, the metrics that typically matter most to Hadoop users.
