The best open source databases and data analytics tools

InfoWorld picks the best open source software for large-scale search, SQL, NoSQL, and streaming analytics.

We've watched a parade of NoSQL databases march through the Bossies over the last few years. Well, SQL has had enough! Our 2017 winners include two examples of scale-out SQL databases, along with the usual gang of analytics-oriented projects from the Apache Software Foundation, a name that has become synonymous with big data.

CockroachDB

CockroachDB is a cloud-native SQL database for building global, scalable cloud services that survive disasters. Built on a transactional and strongly consistent key-value store, CockroachDB scales horizontally; survives disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention; supports strongly consistent ACID transactions; and provides a familiar SQL API for structuring, manipulating, and querying data. CockroachDB was inspired by Google's Spanner and F1 technologies.

— Martin Heller

MapD

MapD is a GPU-powered in-memory database and visualization platform designed for lightning-fast, immersive data exploration that eliminates the disconnect between analyst and data. By bringing the power of GPU supercomputing to data analytics, MapD can query and visualize billions of records in milliseconds. The MapD Core database takes advantage of the parallelism of GPUs, with a state-of-the-art query compilation engine, to achieve orders-of-magnitude speedups for analytic SQL queries, powering exploration of big data sets with near-zero latency.

— Martin Heller

CrateDB

CrateDB is a distributed SQL database built on top of a NoSQL foundation. It combines the familiarity of SQL with the scalability and data flexibility of NoSQL. CrateDB provides packages and executables that will work on any operating system capable of running Java, including Linux, macOS, and Windows. Typical use cases for CrateDB are IoT and sensor data, log and event analytics, time series, geospatial search, and machine learning.

— Martin Heller

Greenplum

Greenplum is an MPP (massively parallel processing) SQL database with 100-percent ANSI SQL compliance. Greenplum offers machine learning, an advanced parallel query engine, and a highly scalable distributed architecture. Previously a proprietary product, Greenplum went open source in 2015; the first major open source release arrived in September 2017.

— Steven Nunez

Apache Spark

Apache Spark is still on fire. Version 2.2, released in July, brought a large number of new features to Core (which now includes the SQL query engine), improvements to the Kafka streaming interface, and additional algorithms in MLlib and GraphX. The SparkR library, which supports distributed machine learning from R, also sees enhancements, especially in the area of SQL integration. And the list of bug fixes and general improvements is long, making this release one of the best yet.

— Steven Nunez
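To give a flavor of the DataFrame and SQL APIs noted above, here is a minimal PySpark sketch. It assumes a local Spark 2.2+ installation; the events.json input file and its user_id field are hypothetical stand-ins.

    # Minimal PySpark sketch: load JSON into a DataFrame, query it with SQL.
    # Assumes a local Spark 2.2+ installation; events.json and its user_id
    # field are hypothetical stand-ins.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bossie-demo").getOrCreate()

    events = spark.read.json("events.json")   # hypothetical input file
    events.createOrReplaceTempView("events")  # expose it to the SQL engine

    top_users = spark.sql(
        "SELECT user_id, COUNT(*) AS n FROM events "
        "GROUP BY user_id ORDER BY n DESC LIMIT 10"
    )
    top_users.show()

    spark.stop()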
Apache Solr

Built on the Lucene index technology, Apache Solr is the distributed index/document database that could, would, and does. Whether you need a simple index or to handle complex embedded documents, Solr is there for you. Finding things in a mountain of text is obviously Solr's strength, but modern Solr lets you do much more, including executing SQL in a distributed fashion and even running graph queries. The engine has continued to develop with new "point" types, which execute numeric queries with better performance and smaller indexes.

— Andrew C. Oliver

Apache Arrow

Apache Arrow is a high-speed, columnar, cross-system data layer that speeds up big data. By storing data in memory using Arrow, applications can skip the costly serialization and deserialization steps that create bottlenecks in today's data processing pipelines. Involving developers from many Apache big data projects (Cassandra, Parquet, Kudu, Spark, and Storm, among others), the Apache Arrow project will be processing large chunks of the world's data in the coming months and years.

— Ian Pointer

Apache Kudu

Apache Kudu is poised to become a major component of big data infrastructure. Kudu is optimized for situations where large amounts of data see frequent updates and where analytics are needed on a timely basis. This has been a challenge with traditional Apache Hadoop architectures, normally resulting in a complex HDFS-plus-HBase solution that can lead to code and data synchronization issues. Kudu promises better and simpler architectures for IoT, time series, and streaming machine learning workloads.

— Ian Pointer

Apache Zeppelin

Apache Zeppelin is a sort of Rosetta Stone for developers, analysts, and data scientists alike. A slew of interpreters allows you to pull from different data stores and analyze in multiple languages. This means you can pull data from your Apache Solr index, cross-reference it with your Oracle database, and analyze the results in Spark with code you wrote in Scala. Then your statistician can analyze your DataFrame in R and visualize the results, before your data scientist uses a favorite Python library to do some deep learning, all from the same notebook!

— Andrew C. Oliver

R Project

The R programming language should need little introduction. In 2017 we find adoption continuing to grow, with support from Microsoft, Oracle, and IBM, as well as several smaller players. CRAN, the Comprehensive R Archive Network, contains nearly every statistical computing algorithm of significance, ready to run, along with sophisticated graphics and a scripting language that makes it easy to save and rerun analyses on updated data sets.

— Steven Nunez

Apache Kafka

Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, fast, and runs in production at thousands of companies. Kafka lets you publish and subscribe to streams of records, much like a message queue or enterprise messaging system. It lets you store streams of records in a fault-tolerant way, and it lets you process streams of records as they occur.

— Martin Heller
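As a rough illustration of that publish/subscribe model, here is a minimal sketch using the third-party kafka-python client (not part of Kafka itself); the broker address and the events topic are hypothetical.

    # Minimal publish/subscribe sketch with the third-party kafka-python
    # client (pip install kafka-python). Broker address and "events" topic
    # are hypothetical stand-ins.
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", b"page_view:42")  # append a record to the stream
    producer.flush()

    consumer = KafkaConsumer(
        "events",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",  # replay the stream from the start
        consumer_timeout_ms=5000,      # stop iterating once caught up
    )
    for record in consumer:
        print(record.offset, record.value)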
Cruise Control

Apache Kafka is a powerful and stable distributed streaming platform, but it can be a pain to manage. It handles failures in an automated fashion, but that often leads to unbalanced loads across the system. LinkedIn noticed that its SREs spent a ton of time on Kafka resource monitoring and rebalancing, so it developed Cruise Control to lighten their burden. Cruise Control was open-sourced only in late August, but it is already a promising timesaver for anyone running Kafka in production.

— Jonathan Freeman

JanusGraph

JanusGraph is a distributed graph database built on top of a column-family database. Unlike other popular open source graph databases, it supports large graphs and distributed processing natively. JanusGraph has a ton of features, integrates with the likes of Apache Spark and Apache Solr, and supports Apache TinkerPop's implementation of the Gremlin graph query language. If you have a graph-shaped problem and data that lends itself to a graph structure, JanusGraph has answers.

— Andrew C. Oliver

Apache TinkerPop

Powering nearly all of the popular graph processing frameworks, including Spark, Neo4j, and Titan, TinkerPop allows users to model their problem domain as a graph and analyze it using a graph traversal language. For some domains, especially those that involve relationships, this technique is very effective. Graph computing is rapidly coming of age, and TinkerPop is leading the way in open source implementations.

— Steven Nunez
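To show what a Gremlin traversal looks like in practice, here is a minimal sketch using the gremlinpython client against a TinkerPop-enabled server such as JanusGraph; the server URL and the person/knows schema are hypothetical.

    # Minimal Gremlin traversal sketch using gremlinpython
    # (pip install gremlinpython) against a Gremlin Server such as JanusGraph.
    # The server URL and the person/knows schema are hypothetical stand-ins.
    from gremlin_python.structure.graph import Graph
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
    g = Graph().traversal().withRemote(conn)

    # Who does alice know? Follow outgoing "knows" edges one hop.
    names = g.V().has("person", "name", "alice").out("knows").values("name").toList()
    print(names)

    conn.close()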