NoSQL buyer’s guide: How to choose the right NoSQL database

Feb 05, 2024 | 15 mins

Buyers have plenty of choice in NoSQL databases, so how do you choose? Here are five questions that could help you narrow it down.


NoSQL databases explained

NoSQL databases arose in response to the limitations of using SQL (Structured Query Language) for database queries. NoSQL databases store and manage data in ways that enable high operational speed and a level of flexibility not found in traditional relational database management systems (RDBMSs).

A recent report by Allied Market Research notes the demand for NoSQL databases is on the rise. In 2022, the worldwide NoSQL market generated $7.3 billion in sales, and is estimated to generate $86.3 billion by 2032 — an annual growth rate of 28% for that period. Key factors driving global NoSQL market growth, according to the report, are the exploding demand for big data analytics, a need for more scalable and flexible enterprise database solutions, and the ubiquity of cloud computing platforms and technology.


In this buyer’s guide

  • NoSQL databases explained
  • What to look for in NoSQL databases
  • Leading vendors for NoSQL databases
  • Essential reading

If your enterprise is considering migrating to NoSQL, you may wonder how to choose the best NoSQL database for your data storage needs. With more than two dozen open source and commercial NoSQL databases available, you have plenty of options to choose from.

What to look for in NoSQL databases

There are five key questions to ask before choosing a NoSQL database.

1. Is NoSQL the right choice? Before choosing a NoSQL database, it’s important to be certain that NoSQL is the best choice for your needs. Carl Olofson, a research vice president at IDC, says “back-office transaction processing, high-touch interactive application data management, and streaming data capture” are all good reasons for choosing NoSQL.

Even with these needs in mind, it is important to consider whether NoSQL is the wrong fit for your enterprise, because choosing NoSQL over a traditional relational database management system (RDBMS) involves trade-offs. “The first decision you need to make is why do you need a NoSQL database system,” says Craig Mullins, president and principal consultant at Mullins Consulting. “You need to first understand why an existing relational DBMS cannot fulfill your use case. Relational/SQL database systems are widely installed, and most organizations have existing systems and applications deployed on RDBMSs with skilled technicians to manage them.”

An alternative to replacing the RDBMS, says Mullins, is polyglot persistence — using multiple data storage technologies in a single system to meet different data storage needs. Rather than “force-fitting everything into a relational mindset,” polyglot persistence lets developers and administrators “choose the appropriate data technology for each use case,” he says.
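As a rough illustration of polyglot persistence, the sketch below sends each kind of data to a different store inside one application. Every class and method name here (KeyValueStore, DocumentStore, GraphStore, Shop) is a hypothetical in-memory stand-in invented for the example, not any vendor's API.

```python
class KeyValueStore:
    """Hypothetical stand-in for a key-value database (session data)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class DocumentStore:
    """Hypothetical stand-in for a document database (product catalog)."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, doc):
        self._docs[doc_id] = dict(doc)

    def find(self, **criteria):
        # Return every document whose fields match all the criteria.
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in criteria.items())]


class GraphStore:
    """Hypothetical stand-in for a graph database (recommendations)."""

    def __init__(self):
        self._edges = set()

    def relate(self, src, relation, dst):
        self._edges.add((src, relation, dst))

    def related(self, src, relation):
        return sorted(d for s, r, d in self._edges
                      if s == src and r == relation)


class Shop:
    """One application, three stores, each chosen per use case."""

    def __init__(self):
        self.sessions = KeyValueStore()
        self.catalog = DocumentStore()
        self.recommendations = GraphStore()

    def view_product(self, session_id, sku):
        # Fast-changing session state goes to the key-value store;
        # the product itself is read from the document store.
        self.sessions.put(session_id, {"last_viewed": sku})
        return self.catalog.find(sku=sku)
```

In a real deployment each stand-in would be replaced by a client for an actual database, and an existing RDBMS could remain one of the stores.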

NoSQL’s core strength is likely its decentralized, scalable, fault-tolerant design, Mullins says. “Most NoSQL database technology is implemented to scale and survive outages,” he says. “Additionally, most NoSQL options are lightweight and require less overhead than a relational DBMS, in terms of CPU and support.”

2. Which NoSQL data model do you need? The four main types of NoSQL data models are key-value, document, column store, and graph. Each one fits a different use case. Mullins summarizes the strengths of each type as follows:

  • A key-value database is well suited to the high-availability, low-latency requirements of applications such as retail and mobile.
  • A document database is best suited for event logging, online shopping, content management, and in-depth analytical processing.
  • A column store database is good for event logging, content management, and counting and/or categorizing for analytics. Column stores can also be set up to automatically expire data.
  • A graph database is well-suited for applications where data elements are interconnected and the number of relationships between them is undetermined. Examples in this use case include social media networks, recommendation engines, logistics and routing, location-aware systems, public transportation links, and network topologies.
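To make the four models more concrete, here is a minimal sketch that represents the same made-up customer in each shape using plain Python data structures. The field names, keys, and the neighbors helper are assumptions chosen for illustration; real products add indexing, storage, and query layers on top.

```python
import json

# Key-value: an opaque value fetched by exactly one key.
key_value = {"customer:42": json.dumps({"name": "Ada", "city": "Boston"})}

# Document: a nested, self-describing record that can be queried by field.
document = {
    "_id": 42,
    "name": "Ada",
    "address": {"city": "Boston", "state": "MA"},
    "orders": [{"sku": "A-1", "qty": 2}],
}

# Column store (wide column): row key -> column family -> column -> value.
wide_column = {
    "customer:42": {
        "profile": {"name": "Ada", "city": "Boston"},
        "orders": {"2024-02-05": "A-1 x2"},
    }
}

# Graph: nodes plus explicit, typed relationships (edges) between them.
nodes = {
    "customer:42": {"label": "Customer", "name": "Ada"},
    "product:A-1": {"label": "Product"},
}
edges = [("customer:42", "BOUGHT", "product:A-1")]


def neighbors(node_id, relation):
    """Follow one relationship type outward from a node."""
    return [dst for src, rel, dst in edges
            if src == node_id and rel == relation]
```

The key-value form can only be read back whole, the document form exposes nested fields, the wide-column form groups columns into families per row, and the graph form answers relationship questions such as neighbors("customer:42", "BOUGHT").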

“Choosing the right model is essential,” says Noel Yuhanna, a vice president and principal analyst at Forrester Research. “The document model is the most popular, including the ability to store JavaScript Object Notation (JSON) documents optimally. The graph model focuses on interconnected data, while the key-value model focuses on a simple key-value pair retrieval, which is not as widely used.”

What data will be stored and how it will be accessed are essential in deciding which data model to choose, Yuhanna says. “Also, some vendor products support all models, which is the multi-model database, offering the flexibility of having multiple models.”

3. What is the latency requirement? Is the acceptable latency measured in milliseconds, fractions of a second, seconds, minutes, or more?

“If the latency requirement is extremely small, as for a streaming data capture or real-time data-sharing application, one should look at a key-value store,” Olofson says. “Likewise if the data is a simple list or matrix.”

If the data is highly changeable in form and includes defined fields, a JSON document database might be more appropriate, Olofson says. This is also true for a high-touch interactive application, which is typically changed frequently to adjust for shifting requirements of the application and user.

“If the latency requirement is not so great and complex combinations must be supported, including bill-of-materials structures or complex groups of interrelated data, then one might consider a graph DBMS,” Olofson says.
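One way to turn a latency requirement into something testable is to time a representative operation many times and compare percentiles against a budget. The sketch below is only an outline under stated assumptions: the function names, the choice of p50 and p99, and the budget parameter are all invented for this example, and the operation passed in would be a call to whichever candidate database is being evaluated.

```python
import statistics
import time


def collect_samples_ms(operation, runs=1000):
    """Time `operation` repeatedly and return latencies in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples):
    """Return median and tail latency from a list of samples."""
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p99": cuts[98], "max": max(samples)}


def within_budget(summary, p99_budget_ms):
    """True if the 99th percentile fits the stated latency budget."""
    return summary["p99"] <= p99_budget_ms


# Usage with an in-memory dict standing in for a key-value store.
store = {"user:1": "Ada"}
summary = summarize(collect_samples_ms(lambda: store.get("user:1")))
```

Tail percentiles matter more than averages here, because a requirement such as "single-digit milliseconds" is usually broken by the slowest requests rather than the typical ones.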

4. How important are scalability and data consistency? NoSQL databases can break down data into segments — or shards — which can be useful for large deployments running hundreds of terabytes, Yuhanna says.

“Sharding is an essential capability for NoSQL to scale databases,” Yuhanna says. “Customers often look for NoSQL solutions that can automatically expand and shrink nodes in horizontally scaled clusters, allowing applications to scale dynamically.”
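The basic idea behind sharding can be sketched in a few lines: hash each key and use the result to pick a shard. This is a simplified illustration rather than how any particular product does it, and the function names are made up for the example.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard number using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def distribute(keys, num_shards):
    """Group keys by the shard that would store them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


placement = distribute([f"user:{n}" for n in range(1000)], num_shards=4)
```

A caveat on this design: with plain modulo hashing, changing the number of shards moves most keys to a different shard. Databases that expand and shrink clusters automatically generally rely on schemes such as consistent hashing or range partitioning to limit that data movement.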

Unlike relational databases, which focus on ensuring data consistency for every transaction using ACID [Atomicity, Consistency, Isolation, and Durability] compliance, with NoSQL, “you can choose data consistency to be eventually consistent or even relaxed,” Yuhanna says. “With eventual consistency, you can scale quickly and deliver high performance.”
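The trade-off behind eventual consistency can be shown with a toy model in which a write is acknowledged as soon as the primary has it, and replicas catch up later. The class below is a hypothetical simulation written for this article, not the replication protocol of any real database.

```python
from collections import deque


class EventuallyConsistentStore:
    """Toy model: writes are acknowledged before replicas are updated."""

    def __init__(self, num_replicas=2):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self.pending = deque()  # writes not yet applied to the replicas

    def write(self, key, value):
        # Acknowledge immediately; replication happens later.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_replica(self, index, key):
        # May return stale data (or None) until replicate() has run.
        return self.replicas[index].get(key)

    def replicate(self):
        # Apply queued writes to every replica in their original order.
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica[key] = value


store = EventuallyConsistentStore()
store.write("cart:7", ["A-1"])
stale = store.read_replica(0, "cart:7")   # replica has not caught up yet
store.replicate()
fresh = store.read_replica(0, "cart:7")   # replica now matches the primary
```

Writes return quickly because they do not wait for replicas, which is where the performance benefit comes from; the cost is the window in which a replica read can miss the latest write.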

5. How do you want to deploy it? Some NoSQL databases can run on-premises, some run only in the cloud, and others run in a hybrid cloud environment, Yuhanna says.

“Also, some NoSQL has native integration with cloud architectures, such as running on serverless and Kubernetes environments,” Yuhanna says. “We have seen serverless as an essential factor for customers, especially those who want to deliver good performance and scale for their applications, but also want to simplify infrastructure management through automation.”

Leading vendors for NoSQL databases

Asking yourself and your organization the five questions just explained will help you choose the right NoSQL database for your needs. Now, let’s look at some of the leading NoSQL databases on the market today; all support distributed database scenarios.

Aerospike: Aerospike is an open source distributed, real-time, high-performance NoSQL database designed for applications that cannot tolerate downtime and need high read and write throughput.

Aerospike is a multimodel NoSQL and graph database that supports simultaneous data models, has unlimited scale, and enables organizations to act in real time across billions of transactions. According to the product documentation, Aerospike uses massive parallelism and a unified storage model to ensure the smallest possible server footprint.

The platform ingests and acts on streaming data at the edge and can combine edge data with data from systems of record, third-party sources, data warehouses, or data lakes for operational, transactional, or analytical workloads. Aerospike can run on premises or as a cloud-managed service.

Amazon DocumentDB: Amazon DocumentDB is a fast, scalable, highly available, and fully managed document database service that supports MongoDB workloads. Amazon DocumentDB is designed from the ground up to give you the performance, scalability, and availability you need when operating mission-critical MongoDB workloads at scale.

Amazon DocumentDB implements the Apache 2.0 open source MongoDB 3.6 API by emulating the responses that a MongoDB client expects from a MongoDB server, allowing you to use your existing MongoDB drivers and tools with DocumentDB. The database service in the Amazon cloud also uses a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64 TB per database cluster.

In Amazon DocumentDB, the storage and compute are decoupled, allowing each to scale independently. Developers can increase the read capacity to millions of requests per second by adding up to 15 low-latency read replicas in minutes, regardless of the size of the data. Amazon DocumentDB is designed for 99.99% availability and replicates six copies of your data across three AWS Availability Zones.

Amazon DynamoDB: Amazon DynamoDB is a serverless, NoSQL, fully managed database service that provides single-digit millisecond response times at any scale. A strong selling point of this database is that it enables organizations to develop and run applications while only paying for what they use.

This cloud-based service offers encryption at rest to protect sensitive data. It also lets users create database tables that can store and retrieve any amount of data and serve any level of request traffic. Users can scale a table’s throughput capacity up or down without downtime or performance degradation, according to AWS. Developers and admins can use the AWS Management Console to monitor resource utilization and performance metrics.

DynamoDB also provides on-demand backup capability, allowing users to create full backups of tables for long-term retention and for regulatory compliance needs.

Cassandra: Apache Cassandra is a highly available distributed data store that values availability and partition tolerance over consistency. The design of Cassandra combines the partitioning and replication of the Amazon Dynamo key-value store with the log-structured ColumnFamily data model of Google Bigtable. Cassandra scales linearly as you add nodes.

Consistency is not completely lost in Cassandra; it’s a tradeoff against latency. The user can specify the consistency level of each read and write, ranging from requiring only one node, through requiring a cluster quorum, to requiring all nodes. Another intermediate option is to require a local quorum, which is a way to attain consistency within a data center without waiting for remote nodes to update.
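The arithmetic behind those consistency levels is simple enough to sketch. The helper names below are invented for this example; ONE, QUORUM, and ALL are Cassandra consistency level names, and a quorum is a majority of the replicas for a piece of data. The sketch ignores the local quorum variant, which applies the same majority rule within a single data center.

```python
def quorum(replication_factor: int) -> int:
    """A majority of replicas: more than half."""
    return replication_factor // 2 + 1


def acks_required(level: str, replication_factor: int) -> int:
    """How many replicas must respond at a given consistency level."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    r = acks_required(read_level, replication_factor)
    w = acks_required(write_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, quorum reads combined with quorum writes always overlap on at least one replica, while reading and writing at ONE does not guarantee overlap. That is the latency-versus-consistency dial described above.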

Couchbase: Couchbase Server, distributed by Couchbase Inc., is a multimodel database platform that supports JSON documents. It’s an open source NoSQL key-value and document database with built-in cache. It’s suitable for enterprises that need a database that can deliver performance, multimodel support, scale, and automation.

Organizations use the platform to support social media and mobile applications, content and metadata stores, e-commerce transactions, and other applications. It provides full support for documents, a flexible data model, indexing, full-text search, and MapReduce for real-time analytics.

CouchDB: Apache CouchDB is an open source document model database with a query engine, replication, and conflict resolution. It uses a RESTful HTTP API for queries as well as updates. CouchDB is implemented in Erlang.

The CouchDB file layout and commitment system feature all ACID properties. On-disk, CouchDB never overwrites committed data or associated structures, ensuring the database file is always in a consistent state. This is a “crash-only” design where the CouchDB server does not go through a shutdown process; rather, it’s simply terminated.

CouchDB read operations use a multiversion concurrency control (MVCC) model where each client sees a consistent snapshot of the database from the beginning to the end of the read operation. Documents are indexed in B-trees by their name (DocID) and a Sequence ID.
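The snapshot idea behind MVCC can be illustrated with a small append-only store in which every write gets a new sequence number and a reader only sees versions that existed when its snapshot was taken. This is a conceptual sketch with hypothetical class names, not CouchDB's on-disk B-tree implementation.

```python
class VersionedStore:
    """Append-only store: old versions are never overwritten."""

    def __init__(self):
        self.versions = {}   # key -> list of (sequence, value)
        self.sequence = 0

    def write(self, key, value):
        self.sequence += 1
        self.versions.setdefault(key, []).append((self.sequence, value))

    def snapshot(self):
        # A snapshot just remembers the sequence number at creation time.
        return Snapshot(self, self.sequence)


class Snapshot:
    """A consistent, read-only view as of one sequence number."""

    def __init__(self, store, sequence):
        self.store = store
        self.sequence = sequence

    def read(self, key):
        # Walk versions newest-first and return the latest visible one.
        for seq, value in reversed(self.store.versions.get(key, [])):
            if seq <= self.sequence:
                return value
        return None


store = VersionedStore()
store.write("doc", "v1")
reader = store.snapshot()
store.write("doc", "v2")          # does not disturb the open snapshot
```

Because readers never block on writers and committed versions are never modified, the reader above keeps seeing "v1" for as long as it holds its snapshot.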

CouchDB is a peer-based distributed database system. It allows users and servers to access and update the same shared data while disconnected. Those changes can then be replicated bi-directionally later. CouchDB allows for any number of conflicting documents to exist simultaneously in the database, with each database instance deterministically deciding which document is the “winner” and which are conflicts. When distributed edit conflicts occur, every database replica sees the same winning revision and each has the opportunity to resolve the conflict.
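What "deterministically deciding" means in practice is that every replica applies the same rule to the same set of conflicting revisions, so all of them arrive at the same winner without coordinating. The sketch below is a simplified approximation that assumes revision IDs of the form "N-hash" and a rule that prefers non-deleted revisions, then the longest revision history, then the highest hash in sort order. Treat the rule details and the pick_winner name as assumptions for illustration rather than a specification of CouchDB's behavior.

```python
def pick_winner(leaf_revisions):
    """Choose one revision from a set of conflicting leaf revisions.

    Each revision is a dict with a "rev" string such as "2-abc" and an
    optional "deleted" flag. The same inputs always give the same winner,
    regardless of the order in which they are supplied.
    """
    def sort_key(revision):
        generation, _, digest = revision["rev"].partition("-")
        alive = not revision.get("deleted", False)
        return (alive, int(generation), digest)

    return max(leaf_revisions, key=sort_key)


conflicts = [
    {"rev": "2-9c65", "body": {"qty": 3}},
    {"rev": "2-e1f2", "body": {"qty": 5}},
]
winner = pick_winner(conflicts)
losers = [r for r in conflicts if r is not winner]
```

The losing revisions are kept rather than discarded, which is what gives each replica the opportunity to resolve the conflict later.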

DataStax: DataStax Astra DB is a fully managed, cloud-native, database as a service built on Apache Cassandra. It scales dynamically and accelerates application development via a range of APIs and programming language options, so developers can build real-time applications fast and scale them without limits, according to the company. Among other improvements, DataStax eliminates the need to run repair scripts and eliminates the cluster outages that can occur when manual repairs fail; automatically keeps DataStax Enterprise nodes from overloading with client or replica requests; and uses a thread-per-core architecture that improves throughput up to 2X for both read and write operations.

Developers can readily ensure data security with Astra DB’s built-in security mechanisms such as Private Link, IP access controls, single sign-on, application tokens, and data encryption. Astra DB’s serverless architecture (built on microservices and API-first principles) scales automatically based on demand.

FaunaDB: FaunaDB is a distributed, strongly consistent online transaction processing (OLTP) NoSQL database that is ACID-compliant and offers a multimodel interface. It has an active-active architecture and can span clouds as well as continents.

FaunaDB supports document, relational, graph, and temporal datasets from a single query. In addition to its own FQL query language, the company has announced support for GraphQL now, plus Cassandra Query Language (CQL) and SQL in the future.

Google Bigtable: Bigtable from Google is an enterprise-grade NoSQL database service with low single-digit millisecond latency, limitless scale, and 99.999% availability, according to the company. It supports multitenant, mixed operational, and real-time analytical workloads.

Google says Bigtable is a key-value and wide-column store, ideal for fast access to structured, semi-structured, or unstructured data. Latency-sensitive workloads such as personalization are also a good fit for the platform. Bigtable automatically scales resources to adapt to server traffic, handling the associated sharding, replication, and query processing as needed.

MarkLogic: MarkLogic Server is a multimodel database that combines document, semantic graph, geospatial, and relational models into a single, scalable, operational database, according to MarkLogic. It provides native storage for JSON, XML, text, semantic/Resource Description Framework (RDF) triples, geospatial, and binaries, with unified search-and-query interface capabilities.

The database has a search engine built into its core, providing a single platform to load data from silos and search across all the data. As such, it does not require a bolt-on search engine for full-text search. MarkLogic Server also offers enterprise data security controls such as data loss prevention.

Microsoft Azure Cosmos DB: Azure Cosmos DB is a Microsoft Azure database service that supports multiple NoSQL models and a variety of data formats, including JSON and binary data. Microsoft says the database is also fully managed, with Microsoft Azure handling all the underlying infrastructure so that developers can focus on their applications and data.

Azure Cosmos DB offers security tools such as data encryption and data access controls. It features automatic and instant scalability, and open source APIs for MongoDB, Cassandra, and other NoSQL engines.

MongoDB: MongoDB, maintained by MongoDB Inc., is a cross-platform, document-oriented database. Since October 2018 the server has been published under the Server Side Public License (SSPL), with earlier releases under the GNU Affero General Public License; the drivers are published under the Apache License.

It uses JSON-like documents with optional schemas, and incorporates operational best practices learned from optimizing thousands of deployments at organizations of all sizes. The cloud-based offering, MongoDB Atlas, can handle database management, setup and configuration, software patching, monitoring, and backups. It operates as a distributed database cluster. Key features and capabilities include fully managed backup, point-in-time recovery, a real-time performance panel, and customizable alerting.

Redis: Redis Enterprise, sponsored by Redis Labs, is an open source, key-value NoSQL in-memory database that supports both relaxed and strong consistency, a flexible schema-less model, high availability, and ease of deployment.

The platform supports key-value; a variety of data structures such as lists, sets, bitmaps, and hashes; and a variety of models through pluggable modules such as search, graph, JSON, and XML. Redis Enterprise includes a real-time indexing, querying, and full-text search engine available on-premises and as a managed service in the cloud.

Yandex ClickHouse: ClickHouse is an open source, column-oriented online analytical processing (OLAP) database management system that manages extremely large volumes of data, including non-aggregated data, in a stable and sustainable manner, and allows generating custom data reports online in real time. The system is linearly scalable and can be scaled up to store and process trillions of rows and petabytes of data.

ClickHouse is designed to work on regular hard drives, which means the cost per gigabyte of data storage is low, but SSD and additional RAM are also fully used if available. (By contrast, SAP HANA can only work in RAM.) ClickHouse does parallel processing on multiple cores.

In ClickHouse, data can reside on different shards. Each shard can be a group of replicas that are used for fault tolerance. The query is processed on all the shards in parallel.
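The pattern described here, where each shard computes a partial result in parallel and the partials are then combined, is often called scatter-gather. The sketch below shows the shape of it with in-memory lists standing in for shards; the function names are hypothetical and nothing here reflects ClickHouse internals.

```python
from concurrent.futures import ThreadPoolExecutor


def count_on_shard(rows, predicate):
    """Partial aggregate computed locally on one shard."""
    return sum(1 for row in rows if predicate(row))


def distributed_count(shards, predicate):
    """Run the same query on every shard in parallel, then merge."""
    workers = max(1, len(shards))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda rows: count_on_shard(rows, predicate),
                            shards)
        return sum(partials)


# Three shards, each holding a slice of the data.
shards = [[1, 2, 3], [4, 5], [6]]
even_rows = distributed_count(shards, lambda value: value % 2 == 0)
```

Counts and sums merge trivially; aggregates such as averages need each shard to return intermediate state (a sum and a count) rather than a finished answer.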

ClickHouse supports a declarative query language based on SQL that is identical to the SQL standard in many cases. Supported queries include GROUP BY, ORDER BY, subqueries in FROM, IN, and JOIN clauses, and scalar subqueries. Dependent subqueries and window functions are not supported.

Although ClickHouse does support data inserts and mutations, it was not designed for OLTP. Yandex recommends inserting data in packets of at least 1,000 rows, or no more than a single request per second. No locks are taken when new data is ingested.
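On the application side, that recommendation usually translates into buffering rows and sending them in bulk. The sketch below shows one way to do that; the BatchingInserter class and its send_batch callback are hypothetical, with the callback standing in for whatever bulk insert call the chosen client library provides.

```python
class BatchingInserter:
    """Buffer rows and hand them off in batches of at least `min_rows`."""

    def __init__(self, send_batch, min_rows=1000):
        self.send_batch = send_batch   # callable that performs one insert
        self.min_rows = min_rows
        self.buffer = []

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.min_rows:
            self.flush()

    def flush(self):
        # Send whatever is buffered, for example at shutdown or on a timer.
        if self.buffer:
            self.send_batch(self.buffer)
            self.buffer = []


sent_batches = []
inserter = BatchingInserter(sent_batches.append, min_rows=1000)
for n in range(2500):
    inserter.add({"event_id": n})
inserter.flush()   # push the final partial batch
```

A production version would typically also flush on a timer so that a slow trickle of rows is not held back indefinitely, while still staying under roughly one insert request per second.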

YugaByte: YugaByte DB is an open source, transactional, high-performance database for planet-scale applications that supports three API sets: YCQL, compatible with Apache CQL; YEDIS, compatible with Redis; and PostgreSQL.

YugaWare is the orchestration layer for YugaByte DB Enterprise Edition. YugaWare makes quick work of spinning up and tearing down distributed clusters on Amazon Web Services, Google Cloud Platform, and Microsoft Azure. YugaByte DB implements multi-version concurrency control (MVCC), which it uses for nonlocking reads.

YugaByte Enterprise supports read replicas, multicloud clusters, and comprehensive monitoring and alerting without any configuration. It also features in-flight and at-rest encryption, one-click distributed backups and restores for clusters of any size, and auto-tiering of cold data to cheaper storage.

Essential reading

Martin Heller
Contributor

Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi. Disclosure: He also writes for Hewlett-Packard’s TechBeacon marketing website.
