Newly graduated from the Apache Incubator, the Parquet project allows column-stored data to be handled at high speed.

Apache Parquet, which provides columnar storage in Hadoop, is now a top-level Apache Software Foundation (ASF)-sponsored project, paving the way for its more advanced use in the Hadoop ecosystem. Already adopted by Netflix and Twitter, Parquet began in 2013 as a co-production between engineers at Twitter and Cloudera to allow complex data to be encoded efficiently in bulk.

Databases traditionally store information in rows and are optimized for working with one record at a time. Columnar storage systems serialize and store data by column, meaning that searches across large data sets and reads of large sets of data can be highly optimized. Hadoop was built for managing large sets of data, so a columnar store is a natural complement. Most Hadoop projects can read and write data to and from Parquet; the Hive, Pig, and Drill projects already do this, as well as conventional MapReduce.

As another benefit, per-column data compression further accelerates performance in Parquet. A textual data column is compressed differently than a column loaded with only integer data, and being able to compress columns separately provides its own performance boost. Parquet also implements column compression so that it’s “future-proofed to allow adding more encodings as they are invented and implemented.”

Early adopters and project leads have used Parquet for some time and built functionality around it. Cloudera, the project’s co-progenitor, uses Parquet as a native data storage format for its Impala analytics database project, and MapR has added data self-description functions to Parquet. Netflix, never one to shy away from a forward-looking technology (such as Cassandra), has 7 petabytes of warehoused data in Parquet format, according to the ASF.

Parquet isn’t the only way to store columnar data in Hadoop, but it’s shaping up as the leader.
Hive has its own columnar-data format, called ORC, although it’s mainly intended as an extension to Hive rather than as a general data store for Hadoop. Hortonworks, a Cloudera competitor (in more ways than one), claimed earlier in Parquet’s lifecycle that ORC compresses data more efficiently than Parquet. And IBM ran its own performance comparisons in September 2014 and found that while ORC used the least amount of HDFS storage, Parquet had the best overall query and analysis time, which are the metrics that typically matter most for Hadoop users.
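
For readers who want to see the difference between row and column storage in concrete terms, here is a minimal Python sketch. It does not use Parquet's actual file format or any Parquet library: the sample records, the field names, and the helper functions are all hypothetical, and zlib simply stands in for whatever codec a real columnar store would apply. The sketch only illustrates the idea that grouping values by column lets a query read a single field without touching the rest of each record, and lets each column be compressed on its own.

```python
import json
import zlib

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"user_id": i, "country": "US" if i % 2 == 0 else "DE", "clicks": i % 7}
    for i in range(1000)
]


def to_rows(records):
    """Row layout: the fields of each record are stored together."""
    return [tuple(r.values()) for r in records]


def to_columns(records):
    """Column layout: all values of one field are stored together."""
    return {name: [r[name] for r in records] for name in records[0]}


rows = to_rows(records)
columns = to_columns(records)

# Summing one field in the row layout walks every full record...
total_from_rows = sum(row[2] for row in rows)
# ...while the column layout reads only the one list it needs.
total_from_columns = sum(columns["clicks"])


def compress_column(values):
    """Compress a single column independently (zlib is a stand-in codec)."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


# Each column is compressed separately, so a text column and an integer
# column could in principle use different encodings.
compressed = {name: compress_column(vals) for name, vals in columns.items()}

# Decompressing one column does not require reading the others.
restored = {
    name: json.loads(zlib.decompress(blob)) for name, blob in compressed.items()
}
```

Both layouts hold the same information; only the grouping differs. A real columnar format adds metadata, typed encodings, and chunking on top of this basic idea.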