Bossie Awards 2014: The best open source big data tools

Feature | Sep 29, 2014 | 13 mins
Business Intelligence, Data Management, Hadoop

InfoWorld's top picks in distributed data processing, data analytics, machine learning, NoSQL databases, and the Hadoop ecosystem

The best open source big data tools

Although Hadoop is more popular than ever, MapReduce seems to be running out of friends. Everyone wants the answer faster, faster, now, and often in response to SQL queries. This year’s Bossies in big data track important new developments in the Hadoop stack, underscore a maturing NoSQL space, and highlight a number of useful tools for data wrangling, data analysis, and machine learning.

IPython

Two important elements of scientific work, data science included, are sharing and validating conclusions. You don’t want to be the first cold fusion data scientist, after all. IPython’s notebooks provide an environment where researchers can document and automate the data analysis workflow. A notebook is a single place where researchers can share code, documentation, ideas, and data visualizations, all accessible from a browser-based environment.

It’s difficult to overstate the importance of these features in modern data science. Building and managing data products has become a complicated business requiring input from multiple parts of the organization, such as operations and monitoring, as well as a way to share analytics knowledge. To be sure, IPython is much more than IPython Notebooks. It includes support for multiple languages in data pipelines, parallel computing capabilities, and much more.
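
That parallel computing support is easy to tap from a notebook or script. Here is a minimal sketch, assuming a cluster of IPython engines has already been started (for example, with the ipcluster command); note that in later releases this API moved to the separate ipyparallel package:

```python
# A minimal sketch of IPython's parallel computing support. It assumes
# a running cluster of engines (e.g. started with "ipcluster start").
from IPython.parallel import Client

rc = Client()          # connect to the running cluster
view = rc[:]           # a direct view over all engines
# Fan the computation out across the engines and gather the results.
squares = view.map_sync(lambda x: x ** 2, range(10))
print(squares)         # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```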

— Steven Nunez

Pandas

Pandas is a Python DSL (domain-specific language) for the manipulation of tabular data. It has roots in the hedge fund industry and emphasizes high performance and ease of use.

Although a relative newcomer to the scene, Pandas has a large and growing community. In many respects it is similar to the R language and can be used for many of the same data wrangling tasks, though it lacks the comprehensive and well-tested set of libraries R enjoys. On the positive side, Pandas development began in 2008, late enough to take the best ideas from R and several of its packages and incorporate them into the base Pandas package. For data science shops using Python, Pandas is the best Swiss Army knife out there.
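
Here is a minimal sketch of the R-style wrangling Pandas is built for; the fund data and column names are invented for illustration:

```python
import pandas as pd

# A small, invented data set of daily fund returns.
returns = pd.DataFrame({
    "fund": ["alpha", "alpha", "beta", "beta"],
    "date": pd.to_datetime(["2014-09-01", "2014-09-02",
                            "2014-09-01", "2014-09-02"]),
    "return_pct": [0.4, -0.1, 0.2, 0.3],
})

# Group-and-aggregate: the bread-and-butter wrangling task R users know.
summary = returns.groupby("fund")["return_pct"].agg(["mean", "std"])
print(summary)
```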

— Steven Nunez

RCloud

RCloud, from AT&T Labs, was created to address the need for a collaborative data analysis environment for R. Similar to IPython, RCloud allows researchers to analyze large data sets and share their results across an organization. For example, one group of data scientists might use RCloud to document the data workflow for semantic analysis of Web documents. This notebook could be annotated and reused by the machine learning group in the company.

Conceptually, RCloud is similar to an R CRAN (Comprehensive R Archive Network) package, but augmented by wiki-like collaboration features. Notebooks and code are stored in GitHub. Although a relative newcomer with little documentation outside of AT&T, RCloud shows a lot of promise.

— Steven Nunez

R Project

A specialized computer language for statistical analysis, R continues to evolve to meet new challenges. Since displacing Lisp-Stat in the early 2000s, R has been the de facto statistical processing language, with thousands of high-quality algorithms readily available from the Comprehensive R Archive Network (CRAN); a large, vibrant community; and a healthy ecosystem of supporting tools and IDEs. The 3.0 release of R removed the memory limitations that previously plagued the language: 64-bit builds can now allocate as much RAM as the host operating system will allow.

Traditionally, R has focused on problems that fit in local RAM, making use of multiple cores, but with the rise of big data, several options have emerged for processing large-scale data sets. These include packages that can be installed into a standard R environment, as well as integrations with big data systems like Hadoop and Spark (RHive and SparkR, respectively).

— Steven Nunez

Hadoop

No technology in recent memory has made as big or as quick an impact as Hadoop. Hadoop encompasses many topics: HDFS, YARN, MapReduce, HBase, Hive, Pig, and a growing ecosystem of tools and engines. In the last year Hadoop has gone from being that thing everyone is talking about to being that thing even fairly conservative companies are deploying. Whether you’re trying to spark information sharing in your organization, mine data for new uses, replace expensive data warehousing technology, or offload ETL processing, Hadoop is the platform of technologies you should be looking at today.

— Andrew C. Oliver

Hive

Hive enables interactive queries over petabytes of data using familiar SQL semantics. Version 0.13 delivered on the community’s three-part vision for the project: speed, scale, and SQL compliance.

Hive preserves existing investments in SQL tools, skills, and processes and allows them to be applied to large-scale data sets. Tez integration, vector-based execution, and a cost-based optimizer significantly improve interactive query performance for users, while also lowering resource requirements. Hive 0.13 is the de facto SQL interface to big data, with a large and growing community behind it, including Microsoft, which contributed several key pieces of SQL Server technology.
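
Because Hive speaks ordinary SQL, reaching it from existing scripts is straightforward. Here is a minimal sketch using the third-party PyHive client; the host, port, and the "clicks" table are assumptions:

```python
# A hedged sketch of querying Hive from Python via the third-party
# PyHive package; assumes a HiveServer2 endpoint on localhost and a
# hypothetical "clicks" table.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT country, COUNT(*) FROM clicks GROUP BY country")
for country, n in cursor.fetchall():
    print(country, n)
```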

— Steven Nunez

Hivemall

Hivemall is a scalable machine learning library from the National Institute of Advanced Industrial Science and Technology in Japan. It provides machine learning algorithms and feature-engineering functions. Hive’s SQL semantics are great for slicing and dicing large data sets before they are fed into Hivemall’s machine learning algorithms.

Hivemall scales well — in terms of both the number of features and the number of instances — and it outperforms most other Hadoop-based machine learning frameworks. Hivemall is a useful addition to Hive for any data scientist familiar with SQL. Installation is as simple as a one-line command that adds the Hivemall JAR to Hive’s classpath.
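
Here is a sketch of what that looks like in practice, reusing the PyHive connection style shown under Hive above; the JAR path, the training_data table, and the logress() call are assumptions that vary by Hivemall version:

```python
# A hedged sketch of installing and using Hivemall; the JAR path,
# training_data table, and logress() usage are assumptions, and
# Hivemall's UDFs must also be registered (CREATE TEMPORARY FUNCTION).
from pyhive import hive

cursor = hive.Connection(host="localhost", port=10000).cursor()
cursor.execute("ADD JAR /path/to/hivemall.jar")  # the one-line install

# Train a logistic regression model in plain HiveQL, then average the
# per-mapper weights into a final model table.
cursor.execute("""
    CREATE TABLE model AS
    SELECT feature, AVG(weight) AS weight
    FROM (
        SELECT logress(features, label) AS (feature, weight)
        FROM training_data
    ) t
    GROUP BY feature
""")
```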

— Steven Nunez

Cloudera Impala

There’s no shortage of real-time SQL engines for Hadoop: Hive on Tez (Stinger), Facebook’s Presto, Shark on Spark. It makes sense. Businesses have long relied on SQL skill sets, and they have BI speed demands that batch processing can’t satisfy.

Cloudera Impala still leads the pack for interactive SQL queries on Hadoop. Its speed, maturity, and overall ecosystem (which includes Sentry for role-based authorization) bring robust SQL performance to Hadoop’s scalability, while tapping HDFS and HBase directly to sidestep costly ETL. And the momentum hasn’t slowed, with Amazon endorsing Impala by adding it to its Elastic MapReduce service this year.

Impala still outpaces Hive on overall performance and on real-time queries. Although efforts to modernize Hive are under way, Impala remains the strongest choice for real-time data analysis with SQL applications built on Hadoop.
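
Like Hive, Impala is reachable from ordinary SQL clients. Here is a minimal sketch using Cloudera’s impyla package; the host and the "clicks" table are assumptions (21050 is Impala’s usual daemon port):

```python
# A hedged sketch of querying Impala from Python with the impyla
# package; the host and the "clicks" table are assumptions.
from impala.dbapi import connect

conn = connect(host="impala-host", port=21050)
cursor = conn.cursor()
cursor.execute("SELECT country, COUNT(*) FROM clicks GROUP BY country")
print(cursor.fetchall())
```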

— James R. Borck

MongoDB

At this point, the reasons for adopting MongoDB are plain and simple. If you’re looking for a cross-platform, document-oriented database that will make data integration faster and easier, you’ve come to the right spot. Also, it’s open source. With the enterprise-oriented improvements in the recent 2.6 release and document-level locking coming soon, you can expect even broader and faster adoption of MongoDB. With big clients like eBay, Craigslist, and the New York Times, this is no longer “just” some database for hipster hackers and startups.
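
Part of that easy integration is the driver experience. Here is a minimal sketch with the pymongo driver; the database and collection names are invented, and a local mongod is assumed:

```python
# A minimal pymongo sketch; assumes a mongod running on localhost.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client.bossies  # hypothetical database name

# Documents are schemaless dicts; no table definition is needed up front.
db.winners.insert_one({"name": "MongoDB", "category": "big data",
                       "year": 2014})
for doc in db.winners.find({"year": 2014}):
    print(doc["name"])
```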

— Andrew C. Oliver

Cassandra

Apache Cassandra is a NoSQL database that offers a compelling mix of simplicity, functionality, and good design. Cassandra implements a Bigtable-like data model, using a storage system inspired by Amazon’s Dynamo for data partitioning. The result is an easy-to-configure system that linearly scales from small clusters to multiple data centers, while efficiently distributing data across nodes built from commodity hardware.

Cassandra’s “ring” topology is flexible in the way it places data in the cluster, with several useful strategies available out of the box. With few moving parts (a single JVM per node), native file system for storage, linear scalability, and high availability, Cassandra is a top contender for a NoSQL solution.
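
Here is a minimal sketch using the DataStax Python driver; the keyspace, table, and single local node are assumptions:

```python
# A hedged sketch using the DataStax Python driver against one local
# node; the keyspace and table are invented for illustration.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id int PRIMARY KEY, name text)
""")
session.execute("INSERT INTO demo.users (user_id, name) VALUES (1, 'Ada')")
for row in session.execute("SELECT name FROM demo.users"):
    print(row.name)
```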

— Steven Nunez

Neo4j

Neo4j is a NoSQL database with an almost organic feel. It’s a graph database, so it doesn’t see the world as documents or as rows and columns, but as connected objects — nodes connected by relationships, in Neo4j’s language. Nodes can represent people or restaurants or cities or stellar objects. Relationships can be “knows” or “has eaten at” or “lives in” or “is X light years from.”

The community edition gives you everything you need to run an embedded version, or a client-server system. Drivers exist for nearly every language you can think of, but if you want to manipulate the database at a higher level, you can use Cypher, Neo4j’s equivalent to SQL. The Neo4j website’s interactive documentation is some of the best we’ve seen. Even if the concepts are new, you’ll have no problem getting up to speed with this exceptional graph database.
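
Here is a taste of Cypher, sent through the third-party py2neo driver; graph.run() is the py2neo v3+ call (the API varies across versions), and a local Neo4j server reachable over Bolt is assumed:

```python
# A hedged sketch of Cypher via py2neo; the API differs across py2neo
# versions, and a local Neo4j server is assumed.
from py2neo import Graph

graph = Graph("bolt://localhost:7687")
graph.run('CREATE (p:Person {name: "Alice"})'
          '-[:HAS_EATEN_AT]->(r:Restaurant {name: "Chez Panisse"})')
results = graph.run(
    "MATCH (p:Person)-[:HAS_EATEN_AT]->(r:Restaurant) "
    "RETURN p.name AS person, r.name AS restaurant")
for record in results:
    print(record["person"], "has eaten at", record["restaurant"])
```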

— Rick Grehan

Pentaho

This granddaddy of open source BI covers everything from data mining to data visualization with a feature-rich set of enterprise-grade tools, including a J2EE BI server, data transformation facilities, and a graphical report designer. Yet Pentaho adroitly walks the line between comprehensive and overwhelming. It meets the big data processing and predictive analytics needs of large organizations, yet smoothly handles SMB reporting tasks.

Pentaho’s free community edition lacks the more advanced features such as mobile support, interactive dashboards, and data visualizations. To realize the full potential of Pentaho, you’ll need to upgrade to Professional or Enterprise editions. But with Kettle for ETL, the Mondrian OLAP engine for analysis, and Weka-based data mining and machine learning, Pentaho connects all of the dots between data integration, business analytics, and meaningful decisions.

— James R. Borck

Talend Open Studio for Big Data

Analyzing large volumes of data requires moving large volumes of data. Talend Open Studio for Big Data brings all of the Apache integration tools, a centralized repository, and ETL (extract, transform, load) components to big jobs, along with connectors capable of assimilating most modern “big data” ecosystems and tools, including Hadoop, HBase, Hive, Sqoop, Pig, and NoSQL databases such as Cassandra and MongoDB.

Using Talend’s visual mapping tools, you can graphically construct large-scale data transformations, replete with flow testing and debugging, and directly leverage the power of a Hadoop cluster to complete the job. You’ll need to pay for features like dashboards and metadata management. But when it comes to cutting through the complexities of big data integration — particularly syncing on-premises systems with cloud-based data sources and analytics — Talend Open Studio may be the sharpest tool in the shed.

— James R. Borck

Spark

Spark, the new kid on the Hadoop block, is an in-memory distributed computation engine that’s getting a lot of attention, and for good reasons. Spark offers direct and elegant expression of data transformations in a functional paradigm, using APIs modeled on Scala’s collections. Part of Spark’s power is its single underlying data representation, the RDD (Resilient Distributed Dataset), a distributed collection of items. By specializing RDDs, Spark can power SQL queries, train machine learning models, process real-time data streams, and build complex graph structures, all in a unified processing stack. Programmers can mix and match these programming models using the same basic Scala semantics. One paradigm to rule them all.
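
The Python API exposes the same RDD model. Here is a minimal word-count sketch; the input file is a hypothetical stand-in for any text in HDFS or on local disk:

```python
# A minimal PySpark sketch of chained RDD transformations.
from pyspark import SparkContext

sc = SparkContext("local", "wordcount")
lines = sc.textFile("input.txt")  # hypothetical input file

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))  # lazy transformations

print(counts.take(10))  # an action finally triggers execution
sc.stop()
```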

— Steven Nunez

Storm

Storm is a distributed, real-time computation system that makes unbounded streams of data easy to process. Companies like WebMD, The Weather Channel, and Groupon are already taking advantage of this scalable, fault-tolerant, and effortless technology. Most of Hadoop is measured in minutes. Storm is about the milliseconds. You can do low-latency streaming one object at a time or in so-called microbatches. The questions remain: When will we see 1.0 and when will Storm be done “incubating”?

— Andrew C. Oliver

Kafka

This open source tool was originally developed by LinkedIn and is used by Web companies to speed messages from Web applications to their designated data systems. Kafka is able to handle hundreds of megabytes of reads and writes per second from thousands of clients. It also boasts a contemporary cluster-centric design that offers durability, elasticity, and transparency. Kafka is frequently paired with Storm for a streaming pub-sub type of model — kind of like JMS, only actually distributed and scalable.
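
Here is a minimal pub-sub sketch with the third-party kafka-python client; the broker address and topic name are assumptions:

```python
# A hedged kafka-python sketch; the broker and topic are assumptions.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"page": "/home"}')
producer.flush()

consumer = KafkaConsumer("clickstream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # read a single message, then stop, for the demo
```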

— Andrew C. Oliver

Tez

Tez provides a distributed execution framework for data applications. Built on top of YARN, another Apache project that enables applications to share cluster resources, Tez frees application developers from error-prone tasks like scheduling, resource management, and fault tolerance in a distributed cluster. This means that higher-level applications like Cascading, Pig, and Hive can focus on their domain-specific problems while Tez performs the heavy lifting within the cluster.

Tez lets developers focus less on managing the cluster and more on their business logic, so they can write applications that perform better, are more reliable, and make more efficient use of important cluster resources like network I/O, CPU, and RAM.

— Steven Nunez

Mahout

Mahout is a computation-engine-independent collection of algorithms for classification, clustering, and collaborative filtering, along with the underlying math required to implement them. It focuses on highly scalable machine learning.

Currently at release 0.9, with the 1.0 release around the corner, the project is moving from Hadoop’s MapReduce execution paradigm to a domain-specific language for linear algebra that is optimized for execution on Spark clusters. New MapReduce contributions are no longer accepted. Given the iterative nature of machine learning, this makes good sense and shows that the community isn’t afraid to do what’s right when technology changes.

— Steven Nunez

KNIME

Build predictive analytics applications without coding? Yes you can, thanks to this handy toolkit from Swiss analytics vendor KNIME. KNIME’s Eclipse-based visual workbench lets you drag and configure nodes on a canvas, then wire them together into workflows. KNIME supplies thousands of nodes to handle everything from data source connectivity (databases, Hive, Excel, flat files) to SQL queries, ETL transforms, data mining (supporting Weka and R), and visualization.

This year brought welcome connectors to Twitter and Google Analytics, as well as support for OpenStreetMap visualizations. The KNIME engine is open source with no crippling restrictions on data volume, memory size, or processing cores. The company also offers affordable commercial tiers that add collaboration, authentication, and performance enhancements for cluster environments.

— James R. Borck

Cascading

The learning curve for writing Hadoop applications can be steep. Cascading is an SDK that brings a functional programming paradigm to Hadoop data workflows. With the 3.0 release, Cascading provides an easier way for application developers to take advantage of next-generation Hadoop features like YARN and Tez.

The SDK provides a rich set of commonly used ETL patterns that abstract away much of the complexity of Hadoop, increasing the robustness of applications and making it simpler for Java developers to utilize their skills in a Hadoop environment. Connectors for common third-party applications are available, enabling Cascading applications to tap into databases, ERP, and other enterprise data sources.

— Steven Nunez

Read about more open source winners

InfoWorld’s Best of Open Source Awards for 2014 celebrate more than 130 excellent open source projects in nearly every corner of computing. Follow these links to more open source winners:

Bossie Awards 2014: The best open source applications

Bossie Awards 2014: The best open source application development software

Bossie Awards 2014: The best open source big data tools

Bossie Awards 2014: The best open source data center and cloud software

Bossie Awards 2014: The best open source desktop and mobile software

Bossie Awards 2014: The best open source networking and security software