Steven Nuñez
Contributing Editor

Review: Apache Hive brings real-time queries to Hadoop

Reviews
Jul 23, 2014 | 11 mins
Data Management, Data Warehousing, Hadoop

Hive's SQL-like query language and vastly improved speed on huge data sets make it the perfect partner for an enterprise data warehouse

Apache Hive is a tool built on top of Hadoop for analyzing large, unstructured data sets using a SQL-like syntax, making Hadoop accessible to legions of existing BI and corporate analytics users. Developed by Facebook engineers and contributed to the Apache Software Foundation as an open source project, Hive is now at the forefront of big data analysis in commercial environments.

Hive, like the rest of the Hadoop ecosystem, is a fast-moving target. This review covers version 0.13, which addresses several shortcomings in previous versions. It also brings a significant speed boost to SQL-like queries across large-scale Hadoop clusters, building on new capabilities for interactive query introduced in prior releases. 


Hive is fundamentally an operational data store that’s also suitable for analyzing large, relatively static data sets where query time is not important. Hive makes an excellent addition to an existing data warehouse, but it is not a replacement. Instead, using Hive to augment a data warehouse is a great way to leverage existing investments while keeping up with the data deluge.

A typical data warehouse includes many expensive hardware and software components such as RAID or SAN storage, optimized ETL (extract, transform, load) procedures for cleaning and inserting data, specialized connectors to ERP and other back-end systems, and schemas designed around the questions an enterprise wants to ask such as sales by geography, product, or channel. The warehouse ecosystem is optimized around bringing enriched data to the CPU to answer the classes of questions the schema was designed for.

By contrast, a Hive data store brings together vast amounts of unstructured data — such as log files, customer tweets, email messages, geo-data, and CRM interactions — and stores them in an unstructured format on cheap commodity hardware. Hive allows analysts to project a databaselike structure on this data, to resemble traditional tables, columns, and rows, and to write SQL-like queries over it. This means that different schemas may be projected over the same data sets, depending on the nature of the query, allowing the user to ask questions that weren’t envisioned when the data was gathered.
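
To make that concrete, here is a sketch, using invented table names and an invented HDFS path, of how two different schemas might be projected over the same directory of raw, tab-delimited log files:

```sql
-- Hypothetical example: two schemas over the same raw files in HDFS.
-- Creating an EXTERNAL table neither moves nor copies data; the schema
-- is applied when the files are read.

-- A narrow projection for simple traffic counts
CREATE EXTERNAL TABLE weblogs_traffic (
  event_time STRING,
  url        STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- A wider projection, over the same files, for session analysis
CREATE EXTERNAL TABLE weblogs_sessions (
  event_time STRING,
  url        STRING,
  user_id    STRING,
  referrer   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw/weblogs';

-- Query the narrow projection like any SQL table
SELECT url, COUNT(*) AS hits
FROM weblogs_traffic
GROUP BY url;
```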

Hive queries traditionally had high latency, and even small queries could take some time to run because they were transformed into map-reduce jobs and submitted to the cluster to be run in batch mode. This latency wasn’t usually a problem, because the overhead for query planning and starting up the map-reduce job was dwarfed by the processing time for the query itself, at least when running on the very large data sets Hive was designed for. However, users soon found that such long-running queries were inconvenient and troublesome to run in a multi-user environment, where a single job could dominate the cluster.

InfoWorld Scorecard: Apache Hive 0.13
  • Scalability (20.0%): 10.0
  • Value (10.0%): 10.0
  • Management (25.0%): 7.0
  • Availability (20.0%): 8.0
  • Performance (25.0%): 7.0
  • Overall Score: 8.1

In order to address these shortcomings, the Hive community began a program (sometimes called the “Stinger” initiative) to improve query speed, with the goal of making Hive suitable for real-time, interactive queries and exploration. These improvements were delivered in three phases in versions 0.11, 0.12, and 0.13 of Hive.

Finally, although HiveQL, the query language, is based on SQL-92, it differs from SQL in some important ways due to its running on top of Hadoop. For instance, DDL (Data Definition Language) commands need to account for the fact that tables exist in a multi-user file system that supports multiple storage formats. Nevertheless, SQL users will find the HiveQL language familiar and should not have any problems adapting to it.
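
For instance, a Hive table definition usually declares where its files live, how they are stored, and how partitions map to directories, details a SQL-92 DDL never mentions. The statements below are a hedged sketch with invented names and paths:

```sql
-- An EXTERNAL table points at a shared HDFS location and declares its
-- file format; dropping it removes only the metadata, not the files.
CREATE EXTERNAL TABLE page_views (
  view_time STRING,
  user_id   STRING,
  url       STRING
)
PARTITIONED BY (view_date STRING)
STORED AS TEXTFILE
LOCATION '/data/warehouse/page_views';

-- Partitions are registered explicitly, because other tools may have
-- written the underlying directories outside of Hive.
ALTER TABLE page_views ADD PARTITION (view_date = '2014-07-01')
  LOCATION '/data/warehouse/page_views/2014-07-01';
```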

Hive platform architecture

From the top down, Hive looks much like any other relational database. Users write SQL queries and submit them for processing, using either a command-line tool that interacts directly with the database engine or third-party tools that communicate with the database via JDBC or ODBC. The picture looks something like this:

The Hive architecture.

By using the JDBC and ODBC drivers, available for Mac and Windows, data workers can connect their favorite SQL client to Hive to browse, query, and create tables. For power users, there is still the original thick-client CLI that interacts directly with the Hive driver. This client is the most robust, but it requires direct access to Hadoop and therefore is most suitable for local network operations where firewalls, DNS, and network topology aren't an issue.

HCatalog, the table and metadata management layer built on the Hive metastore and previously a separate Hadoop project, has been rolled into the Hive distribution. Backed by its own relational database, the metastore saves schemas that are defined in Hive, simplifying new queries, as well as making the schemas available to other tools in the Hadoop tool chain such as Pig.

Working with Hive

As noted above, Hive speaks a simple, SQL-like language called HiveQL that supports data definition and manipulation statements. Any SQL user should find working with Hive a familiar experience. HiveQL was designed to ease the transition from SQL and to get data analysts up and running on Hadoop right away.
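
A small example, with table and column names invented for illustration, shows how close HiveQL feels to everyday SQL:

```sql
-- Top 10 product/region combinations by revenue (hypothetical sales table)
SELECT region,
       product,
       SUM(amount)                 AS total_sales,
       COUNT(DISTINCT customer_id) AS customers
FROM   sales
WHERE  sale_date >= '2014-01-01'
GROUP  BY region, product
ORDER  BY total_sales DESC
LIMIT  10;
```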

Most BI and SQL developer tools can connect to Hive as easily as to any other database. Using the ODBC connector, users can import data and use tools like PowerPivot for Excel to explore and analyze data, making big data accessible across the organization.

There are a few significant differences between HiveQL and standard SQL. Hive 0.13 was designed to perform full-table scans across petabyte-scale data sets using the YARN and Tez infrastructure, so some features normally found in a relational database aren't available to the Hive user. These include transactions, cursors, prepared statements, row-level updates and deletes, and the ability to cancel a running query.

The absence of these features won’t significantly affect data analysis, but it might affect your ability to use existing SQL queries on a Hive cluster. Queries might need to be written a bit differently than you would for an engine that supports the full SQL language, but a seasoned user of traditional databases should have no trouble writing Hive queries. Many traditional SQL editing environments now support Hive via connectors, and Hive tables can be accessed from many SQL editors, including those from Oracle and Microsoft.
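
As one hedged illustration of that kind of rewrite (the table and columns are hypothetical), the common workaround for the missing row-level UPDATE in Hive 0.13 is to regenerate the table, or an affected partition, with the corrected values:

```sql
-- Traditional SQL:
--   UPDATE customers SET status = 'inactive' WHERE last_order < '2013-01-01';

-- Hive 0.13 style: rewrite the data with the new values in one pass
-- (here into a fresh table; overwriting a partition works the same way).
CREATE TABLE customers_cleaned LIKE customers;

INSERT OVERWRITE TABLE customers_cleaned
SELECT customer_id,
       name,
       last_order,
       CASE WHEN last_order < '2013-01-01' THEN 'inactive'
            ELSE status END AS status
FROM customers;
```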

One major difference for a database user is the need to be aware of storage details. In a traditional database environment, the database engine controls all reads and writes to the database. In Hive, the database tables are stored as files in the Hadoop Distributed File System (HDFS), where other applications could have modified them. Although this can be a good thing, it means that Hive can never be certain if the data being read matches the schema. Therefore, Hive enforces schema on read. If the underlying data format has changed, Hive will do its best to compensate, but you will likely get unexpected results.
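
A small, hypothetical example of schema on read in action: adding a column changes only the metadata, so rows coming from older files simply return NULL for the new column instead of failing:

```sql
-- The files already in HDFS are not rewritten by this statement.
ALTER TABLE weblogs_sessions ADD COLUMNS (device_type STRING);

-- Rows read from files written before the change fall into the NULL group.
SELECT device_type, COUNT(*) AS sessions
FROM weblogs_sessions
GROUP BY device_type;
```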

The Hive user must be aware of two other aspects of data storage: file formats and compression. Tuning Hive queries can involve making the underlying map-reduce jobs run more efficiently by optimizing the number, type, and size of the files backing the database tables. Hive’s default storage format is text, which has the advantage of being usable by other tools. The disadvantage, however, is that queries over raw text files can’t be easily optimized. 

Hive can read and write several file formats and decompress many of them on the fly. Storage requirements and query efficiency can differ dramatically among these file formats, as can be seen in the figure below (courtesy of Hortonworks). File formats are an active area of research in the Hadoop community. Efficient file formats both reduce storage costs and increase query efficiency.
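
A common pattern, sketched here with invented table names, is to land raw data in a plain-text staging table and then copy it into an ORC table with compression enabled for cheaper storage and faster queries:

```sql
-- ORC columnar storage with Snappy compression (ZLIB is the default codec).
CREATE TABLE page_views_orc (
  view_time STRING,
  user_id   STRING,
  url       STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Copy from the text staging table; Hive writes the ORC files transparently.
INSERT OVERWRITE TABLE page_views_orc
SELECT view_time, user_id, url
FROM   page_views_text;
```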

File formats and file sizes on HDFS. (Image courtesy of Hortonworks.)

Processing large data sets is often a multistep process, and HiveQL includes language constructs for specifying an ETL pipeline. Depending on the nature of the problem, a job often requires storing intermediate results in temporary tables, and moving several terabytes of data into and out of HDFS for each step can be impractical.
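
One such construct is the multi-table insert, which scans the source data once and fans the results out to several destinations with no temporary table in between; the table and column names in this sketch are invented:

```sql
-- Single pass over the source, two outputs.
FROM raw_events e
INSERT OVERWRITE TABLE clicks
  SELECT e.event_time, e.user_id, e.url    WHERE e.event_type = 'click'
INSERT OVERWRITE TABLE purchases
  SELECT e.event_time, e.user_id, e.amount WHERE e.event_type = 'purchase';
```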

Hive also provides three types of UDFs (user-defined functions) that can be used within queries for custom processing. For example, say you want to run a calculation that isn't covered by the built-in SQL functions. Without a UDF, you would have to dump a temporary table to disk, run a second tool (such as Pig or a custom Java program) for your custom processing, and possibly produce a third table in HDFS to be analyzed by Hive. With a UDF, your custom logic can be embedded directly in your SQL query, saving those steps and leveraging the query planning and caching infrastructure of Hive. (See Hivemall for a neat example of enabling machine learning in Hive via UDFs.)

Because UDFs run as part of the Hive query and have direct access to the data, they run efficiently and eliminate intermediate steps in the pipeline. UDFs must be written in Java, so they aren’t for the average SQL programmer, but they fill an important niche in the Hive toolkit. Without them, certain classes of problems would be much more difficult to solve.
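
Wiring a UDF into a query is straightforward once the Java class exists; the jar path, class name, and table below are hypothetical, but the pattern of registering the function and then calling it like a built-in is the standard one:

```sql
-- Register the (hypothetical) UDF jar and give the function a name.
ADD JAR /tmp/geo-udfs.jar;
CREATE TEMPORARY FUNCTION ip_to_country AS 'com.example.hive.udf.IpToCountry';

-- Use it like any built-in function; the logic runs inside the Hive query.
SELECT ip_to_country(client_ip) AS country,
       COUNT(*)                 AS requests
FROM   access_logs
GROUP  BY ip_to_country(client_ip);
```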

Hive query performance

Hive 0.13 is the final piece in the Stinger initiative, a community effort to improve the performance of Hive, and there's no question the effort was successful. I tested each of the major speed features from the 0.11 to 0.13 releases to see how much they added to performance. The most significant feature of 0.13 is the ability to run queries on the new Tez execution framework.

In my testing, I saw query times drop by half when run on Tez, and on queries that could be cached, times dropped another 30 percent. On larger data sets, the speedup was even more dramatic. Using the ORC (Optimized Row Columnar) file format, a feature introduced in 0.11, reduced query times by about 15 percent. Vectorization, a Microsoft contribution to Hive 0.13, sped it up by about 10 percent. Enabling cost-based query optimization, another new feature in 0.13, provided a 20 percent boost.
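
For reference, these features are switched on through session-level settings along the lines of the sketch below. The property names are typical of Hive 0.13-era documentation but may vary by version and distribution, so treat them as assumptions to verify; the cost-based optimizer also depends on table and column statistics being available. The table and column names are hypothetical.

```sql
SET hive.execution.engine=tez;               -- run queries on Tez rather than MapReduce
SET hive.vectorized.execution.enabled=true;  -- process rows in batches (requires ORC storage)
SET hive.cbo.enable=true;                    -- cost-based query optimization

-- Gather the statistics the optimizer relies on.
ANALYZE TABLE page_views_orc COMPUTE STATISTICS;
ANALYZE TABLE page_views_orc COMPUTE STATISTICS FOR COLUMNS view_time, user_id, url;
```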

I should point out that these tests were run on my laptop, on small data sets, using ad-hoc queries, and don’t represent a real-world measurement of the new performance capabilities in Hive 0.13. What is clear is that each of these features can be a valuable tool in a performance-tuning toolkit, especially for large-scale queries.

The amount of data to be collected and analyzed isn’t going to diminish in coming years. The most commonly used metric in the enterprise data warehouse business — the price per terabyte — will only increase in importance. Although it’s easy to calculate and understand, like most simple metrics it can also lead buyers astray without the complete picture in mind. Nevertheless, nearly everyone agrees that the cost of data storage in Hive is vastly lower (think 100 times lower) than in a traditional data warehouse.

Current users of traditional data warehouses should evaluate Hadoop to store unstructured data for analysis and inclusion in the data warehouse. Using Hive, it’s possible to execute petabyte-scale queries to refine and cleanse data for later incorporation into data warehouse analytics. Hadoop and Hive could also be used in the reverse scenario: to off-load data summaries that would otherwise need to be stored in the data warehouse at much greater cost. Finally, Hive could be used to run experimental analytics on unstructured data that, if they prove their worth, could be moved into the data warehouse.

Organizations or departments without a data warehouse can start with Hive to get a feel for the value of data analytics while keeping startup costs to a minimum. Although Hive doesn’t offer a complete data warehouse solution, it does make a great, low-cost, large-scale operational data store with a fair set of analytics tools. If your analytics needs outgrow those satisfied by Hive, many traditional data warehouse vendors offer connectors and tools to bring the data into the warehouse, preserving your investments.

Until the playing field levels, companies making the best decisions — decisions based on data and analytics — will have a competitive advantage. Hive offers near linear scalability in query processing, an order of magnitude better price/performance ratio than traditional enterprise data warehouses, and a low barrier to entry. With 10TB enterprise data warehouse solutions costing around $1 million, managing large unstructured data sets with Hive makes a lot of sense.

Apache Hive at a glance

Pros
  • Good compatibility with SQL
  • Multiple schemas can be projected on the same data — no ETL required
  • Works with popular query tools that connect via JDBC and ODBC
  • Able to query very large data sets interactively
Cons
  • DDL requires a bit of learning and different thinking
  • Tuning Hive is a different skill set than tuning a RDBMS
Platforms: Works with Hadoop 0.20.x, 0.23.x.y, 1.x.y, 2.x.y
Cost: Free open source under the Apache License, Version 2.0

This article, “Review: Apache Hive brings real-time queries to Hadoop,” was originally published at InfoWorld.com.


Steve Nuñez is a technologist turned executive currently working as a management consultant, helping senior executives apply artificial intelligence in a practical, cost-effective manner. He takes an incremental approach to AI adoption, emphasizing the organizational change required for analytics and AI to become part of the company DNA.

Before moving to consulting, Steve led the professional services and technical pre-sales organizations in Asia Pacific for MapR, a “big data unicorn” acquired by HP Enterprise. While leading the field organization, Steve served clients including Toyota, Bank of China, Philips, Samsung, and the government of India in its bio ID program.

Steve has been a contributing editor and reviewer for InfoWorld since 1999.
