Steven Nuñez
Contributing Editor

Review: Apache Hive brings real-time queries to Hadoop

Reviews
Jul 23, 2014 | 11 mins
Data Management, Data Warehousing, Hadoop

Hive's SQL-like query language and vastly improved speed on huge data sets make it the perfect partner for an enterprise data warehouse

Apache Hive is a tool built on top of Hadoop for analyzing large, unstructured data sets using a SQL-like syntax, making Hadoop accessible to legions of existing BI and corporate analytics users. Developed by Facebook engineers and contributed to the Apache Software Foundation as an open source project, Hive is now at the forefront of big data analysis in commercial environments.

Hive, like the rest of the Hadoop ecosystem, is a fast-moving target. This review covers version 0.13, which addresses several shortcomings in previous versions. It also brings a significant speed boost to SQL-like queries across large-scale Hadoop clusters, building on new capabilities for interactive query introduced in prior releases. 


Hive is fundamentally an operational data store that’s also suitable for analyzing large, relatively static data sets where query time is not important. Hive makes an excellent addition to an existing data warehouse, but it is not a replacement. Instead, using Hive to augment a data warehouse is a great way to leverage existing investments while keeping up with the data deluge.

A typical data warehouse includes many expensive hardware and software components such as RAID or SAN storage, optimized ETL (extract, transform, load) procedures for cleaning and inserting data, specialized connectors to ERP and other back-end systems, and schemas designed around the questions an enterprise wants to ask such as sales by geography, product, or channel. The warehouse ecosystem is optimized around bringing enriched data to the CPU to answer the classes of questions the schema was designed for.

By contrast, a Hive data store brings together vast amounts of unstructured data — such as log files, customer tweets, email messages, geo-data, and CRM interactions — and stores them in an unstructured format on cheap commodity hardware. Hive allows analysts to project a databaselike structure on this data, to resemble traditional tables, columns, and rows, and to write SQL-like queries over it. This means that different schemas may be projected over the same data sets, depending on the nature of the query, allowing the user to ask questions that weren’t envisioned when the data was gathered.
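
To make that concrete, here is a sketch, using invented table names and an invented HDFS path, of how two different schemas might be projected over the same directory of raw, tab-delimited log files:

```sql
-- Hypothetical example: two schemas over the same raw files in HDFS.
-- Creating an EXTERNAL table neither moves nor copies data; the schema
-- is applied when the files are read.

-- A narrow projection for simple traffic counts
CREATE EXTERNAL TABLE weblogs_traffic (
  event_time STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- A wider projection, over the same files, for session analysis
CREATE EXTERNAL TABLE weblogs_sessions (
  event_time STRING,
  url        STRING,
  user_id    STRING,
  referrer   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- Query the narrow projection like any SQL table
SELECT url, COUNT(*) AS hits
FROM weblogs_traffic
GROUP BY url;
```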

Hive queries traditionally had high latency, and even small queries could take some time to run because they were transformed into map-reduce jobs and submitted to the cluster to be run in batch mode. This latency wasn’t usually a problem, because the overhead for query planning and starting up the map-reduce job was dwarfed by the processing time for the query itself, at least when running on the very large data sets Hive was designed for. However, users soon found that such long-running queries were inconvenient and troublesome to run in a multi-user environment, where a single job could dominate the cluster.

InfoWorld Scorecard: Apache Hive 0.13
  • Scalability (20.0%): 10.0
  • Value (10.0%): 10.0
  • Management (25.0%): 7.0
  • Availability (20.0%): 8.0
  • Performance (25.0%): 7.0
  • Overall Score: 8.1

In order to address these shortcomings, the Hive community began a program (sometimes called the “Stinger” initiative) to improve query speed, with the goal of making Hive suitable for real-time, interactive queries and exploration. These improvements were delivered in three phases in versions 0.11, 0.12, and 0.13 of Hive.

Finally, although HiveQL, the query language, is based on SQL-92, it differs from SQL in some important ways due to its running on top of Hadoop. For instance, DDL (Data Definition Language) commands need to account for the fact that tables exist in a multi-user file system that supports multiple storage formats. Nevertheless, SQL users will find the HiveQL language familiar and should not have any problems adapting to it.
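
For instance, a Hive table definition usually declares where its files live, how they are stored, and how partitions map to directories, details a SQL-92 DDL never mentions. The statements below are a hedged sketch with invented names and paths:

```sql
-- An EXTERNAL table points at a shared HDFS location and declares its
-- file format; dropping it removes only the metadata, not the files.
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  user_id   STRING,
  url       STRING
)
PARTITIONED BY (view_date STRING)
STORED AS TEXTFILE
LOCATION '/data/warehouse/page_views';

-- Partitions are registered explicitly, because other tools may have
-- written the underlying directories outside of Hive.
ALTER TABLE page_views ADD PARTITION (view_date = '2014-07-01')
  LOCATION '/data/warehouse/page_views/2014-07-01';
```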

Hive platform architecture

From the top down, Hive looks much like any other relational database. Users write SQL queries and submit them for processing, using either a command-line tool that interacts directly with the database engine or third-party tools that communicate with the database via JDBC or ODBC. The picture looks something like this:

The Hive architecture.

By using the JDBC and ODBC drivers, available for Mac and Windows, data workers can connect their favorite SQL client to Hive to browse, query, and create tables. For power users, there is still the original thick-client CLI that interacts directly with the Hive driver. This client is the most robust, but it requires direct access to Hadoop and therefore is most suitable for local network operations where firewalls, DNS, and network topology aren't an issue.

HCatalog, the table and metadata management layer built on the Hive metastore and previously a separate Hadoop project, has been rolled into the Hive distribution. Backed by its own relational database, the metastore saves schemas that are defined in Hive, simplifying new queries, as well as making the schemas available to other tools in the Hadoop tool chain such as Pig.

Working with Hive

As noted above, Hive speaks a simple, SQL-like language called HiveQL that supports data definition and manipulation statements. Any SQL user should find working with Hive a familiar experience. HiveQL was designed to ease the transition from SQL and to get data analysts up and running on Hadoop right away.
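
A small example, with table and column names invented for illustration, shows how close HiveQL feels to everyday SQL:

```sql
-- Top 10 product/region combinations by revenue (hypothetical sales table)
SELECT region,
       product,
       SUM(amount)                 AS total_sales,
       COUNT(DISTINCT customer_id) AS customers
FROM   sales
WHERE  sale_date >= '2014-01-01'
GROUP  BY region, product
ORDER  BY total_sales DESC
LIMIT  10;
```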

Most BI and SQL developer tools can connect to Hive as easily as to any other database. Using the ODBC connector, users can import data and use tools like PowerPivot for Excel to explore and analyze data, making big data accessible across the organization.

There are a few significant differences between HiveQL and standard SQL. Hive 0.13 was designed to perform full-table scans across petabyte-scale data sets using the YARN and Tez infrastructure, so some features normally found in a relational database aren't available to the Hive user. These include transactions, cursors, prepared statements, row-level updates and deletes, and the ability to cancel a running query.

The absence of these features won’t significantly affect data analysis, but it might affect your ability to use existing SQL queries on a Hive cluster. Queries might need to be written a bit differently than you would for an engine that supports the full SQL language, but a seasoned user of traditional databases should have no trouble writing Hive queries. Many traditional SQL editing environments now support Hive via connectors, and Hive tables can be accessed from many SQL editors, including those from Oracle and Microsoft.
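
As one hedged illustration of that kind of rewrite (the table and columns are hypothetical), the common workaround for the missing row-level UPDATE in Hive 0.13 is to regenerate the table, or an affected partition, with the corrected values:

```sql
-- Traditional SQL:
--   UPDATE customers SET status = 'inactive' WHERE last_order < '2013-01-01';

-- Hive 0.13 style: rewrite the data with the new values in one pass
-- (here into a fresh table; overwriting a partition works the same way).
CREATE TABLE customers_cleaned LIKE customers;

INSERT OVERWRITE TABLE customers_cleaned
SELECT customer_id,
       name,
       last_order,
       CASE WHEN last_order < '2013-01-01' THEN 'inactive'
            ELSE status END AS status
FROM customers;
```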

One major difference for a database user is the need to be aware of storage details. In a traditional database environment, the database engine controls all reads and writes to the database. In Hive, the database tables are stored as files in the Hadoop Distributed File System (HDFS), where other applications could have modified them. Although this can be a good thing, it means that Hive can never be certain if the data being read matches the schema. Therefore, Hive enforces schema on read. If the underlying data format has changed, Hive will do its best to compensate, but you will likely get unexpected results.
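
A small, hypothetical example of schema on read in action: adding a column changes only the metadata, so rows coming from older files simply return NULL for the new column instead of failing:

```sql
-- The files already in HDFS are not rewritten by this statement.
ALTER TABLE weblogs_sessions ADD COLUMNS (device_type STRING);

-- Rows read from files written before the change fall into the NULL group.
SELECT device_type, COUNT(*) AS sessions
FROM weblogs_sessions
GROUP BY device_type;
```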

The Hive user must be aware of two other aspects of data storage: file formats and compression. Tuning Hive queries can involve making the underlying map-reduce jobs run more efficiently by optimizing the number, type, and size of the files backing the database tables. Hive’s default storage format is text, which has the advantage of being usable by other tools. The disadvantage, however, is that queries over raw text files can’t be easily optimized. 

Hive can read and write several file formats and decompress many of them on the fly. Storage requirements and query efficiency can differ dramatically among these file formats, as can be seen in the figure below (courtesy of Hortonworks). File formats are an active area of research in the Hadoop community. Efficient file formats both reduce storage costs and increase query efficiency.
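
A common pattern, sketched here with invented table names, is to land raw data in a plain-text staging table and then copy it into an ORC table with compression enabled for cheaper storage and faster queries:

```sql
-- ORC columnar storage with Snappy compression (ZLIB is the default codec).
CREATE TABLE page_views_orc (
  view_time STRING,
  user_id   STRING,
  url       STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Copy from the text staging table; Hive writes the ORC files transparently.
INSERT OVERWRITE TABLE page_views_orc
SELECT view_time, user_id, url
FROM   page_views_text;
```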

File formats and file sizes on HDFS. (Image courtesy of Hortonworks.)

Processing large data sets is often a multistep process, and HiveQL includes language constructs for specifying an ETL pipeline. Depending on the nature of the problem, a job often requires storing intermediate results in temporary tables, and moving several terabytes of data into and out of HDFS for each step can be impractical.
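
One such construct is the multi-table insert, which scans the source data once and fans the results out to several destinations with no temporary table in between; the table and column names in this sketch are invented:

```sql
-- Single pass over the source, two outputs.
FROM raw_events e
INSERT OVERWRITE TABLE clicks
  SELECT e.event_time, e.user_id, e.url    WHERE e.event_type = 'click'
INSERT OVERWRITE TABLE purchases
  SELECT e.event_time, e.user_id, e.amount WHERE e.event_type = 'purchase';
```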

Hive also provides three types of UDFs (user-defined functions) that can be used within queries for custom processing. For example, say you want to run a calculation that isn't covered by the built-in SQL functions. Without a UDF, you would have to dump a temporary table to disk, run a second tool (such as Pig or a custom Java program) for your custom processing, and possibly produce a third table in HDFS to be analyzed by Hive. With a UDF, your custom logic can be embedded directly in your SQL query, saving those steps and leveraging the query planning and caching infrastructure of Hive. (See Hivemall for a neat example of enabling machine learning in Hive via UDFs.)

Because UDFs run as part of the Hive query and have direct access to the data, they run efficiently and eliminate intermediate steps in the pipeline. UDFs must be written in Java, so they aren’t for the average SQL programmer, but they fill an important niche in the Hive toolkit. Without them, certain classes of problems would be much more difficult to solve.
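
Wiring a UDF into a query is straightforward once the Java class exists; the jar path, class name, and table below are hypothetical, but the pattern of registering the function and then calling it like a built-in is the standard one:

```sql
-- Register the (hypothetical) UDF jar and give the function a name.
ADD JAR /tmp/geo-udfs.jar;
CREATE TEMPORARY FUNCTION ip_to_country AS 'com.example.hive.udf.IpToCountry';

-- Use it like any built-in function; the logic runs inside the Hive query.
SELECT ip_to_country(client_ip) AS country,
       COUNT(*)                 AS requests
FROM   access_logs
GROUP  BY ip_to_country(client_ip);
```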

Hive query performance

Hive 0.13 is the final piece in the Stinger initiative, a community effort to improve the performance of Hive, and there's no question the effort was successful. I tested each of the major speed features from the 0.11 to 0.13 releases to see how much they added to performance. The most significant feature of 0.13 is the ability to run queries on the new Tez execution framework.

In my testing, I saw query times drop by half when run on Tez, and on queries that could be cached, times dropped another 30 percent. On larger data sets, the speedup was even more dramatic. Using the ORC (Optimized Row Columnar) file format, a feature introduced in 0.11, reduced query times by about 15 percent. Vectorization, a Microsoft contribution to Hive 0.13, sped it up by about 10 percent. Enabling cost-based query optimization, another new feature in 0.13, provided a 20 percent boost.
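
For reference, these features are switched on through session-level settings along the lines of the sketch below. The property names are typical of Hive 0.13-era documentation but may vary by version and distribution, so treat them as assumptions to verify; the cost-based optimizer also depends on table and column statistics being available. The table and column names are hypothetical.

```sql
SET hive.execution.engine=tez;               -- run queries on Tez rather than MapReduce
SET hive.vectorized.execution.enabled=true;  -- process rows in batches (requires ORC storage)
SET hive.cbo.enable=true;                    -- cost-based query optimization

-- Gather the statistics the optimizer relies on.
ANALYZE TABLE page_views_orc COMPUTE STATISTICS;
ANALYZE TABLE page_views_orc COMPUTE STATISTICS FOR COLUMNS view_time, user_id, url;
```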

I should point out that these tests were run on my laptop, on small data sets, using ad-hoc queries, and don’t represent a real-world measurement of the new performance capabilities in Hive 0.13. What is clear is that each of these features can be a valuable tool in a performance-tuning toolkit, especially for large-scale queries.

The amount of data to be collected and analyzed isn’t going to diminish in coming years. The most commonly used metric in the enterprise data warehouse business — the price per terabyte — will only increase in importance. Although it’s easy to calculate and understand, like most simple metrics it can also lead buyers astray without the complete picture in mind. Nevertheless, nearly everyone agrees that the cost of data storage in Hive is vastly lower (think 100 times lower) than in a traditional data warehouse.

Current users of traditional data warehouses should evaluate Hadoop to store unstructured data for analysis and inclusion in the data warehouse. Using Hive, it’s possible to execute petabyte-scale queries to refine and cleanse data for later incorporation into data warehouse analytics. Hadoop and Hive could also be used in the reverse scenario: to off-load data summaries that would otherwise need to be stored in the data warehouse at much greater cost. Finally, Hive could be used to run experimental analytics on unstructured data that, if they prove their worth, could be moved into the data warehouse.

Organizations or departments without a data warehouse can start with Hive to get a feel for the value of data analytics while keeping startup costs to a minimum. Although Hive doesn’t offer a complete data warehouse solution, it does make a great, low-cost, large-scale operational data store with a fair set of analytics tools. If your analytics needs outgrow those satisfied by Hive, many traditional data warehouse vendors offer connectors and tools to bring the data into the warehouse, preserving your investments.

Until the playing field levels, companies making the best decisions — decisions based on data and analytics — will have a competitive advantage. Hive offers near linear scalability in query processing, an order of magnitude better price/performance ratio than traditional enterprise data warehouses, and a low barrier to entry. With 10TB enterprise data warehouse solutions costing around $1 million, managing large unstructured data sets with Hive makes a lot of sense.

Apache Hive at a glance

Pros
  • Good compatibility with SQL
  • Multiple schemas can be projected on the same data — no ETL required
  • Works with popular query tools that connect via JDBC and ODBC
  • Able to query very large data sets interactively
Cons
  • DDL requires a bit of learning and different thinking
  • Tuning Hive is a different skill set than tuning a RDBMS
Platforms: Works with Hadoop 0.20.x, 0.23.x.y, 1.x.y, 2.x.y
Cost: Free open source under the Apache License, Version 2.0

This article, “Review: Apache Hive brings real-time queries to Hadoop,” was originally published at InfoWorld.com.


Steve Nuñez is a technologist turned executive currently working as a management consultant, helping senior executives apply artificial intelligence in a practical, cost-effective manner. He takes an incremental approach to AI adoption, emphasizing the organizational change required for analytics and AI to become part of the company DNA.

Before moving to consulting, Steve led the professional services and technical pre-sales organizations in Asia Pacific for MapR, a “big data unicorn” acquired by HP Enterprise. While leading the field organization, Steve served clients including Toyota, Bank of China, Philips, Samsung, and the government of India in its bio ID program.

Steve has been a contributing editor and reviewer for InfoWorld since 1999.
