jonathan_freeman
Contributing Editor

Couchbase 4.0 review: The Swiss Army knife of NoSQL

reviews
Feb 17, 2016 | 14 mins
Databases, NoSQL Databases, Software Development

Hybrid document-oriented, key-value database brings easy, ad hoc queries into the mix with a SQL-like query language

Couchbase Server, similar to MongoDB and RethinkDB, is a document-oriented distributed database, but that description sells it a good deal short. Couchbase is what you get when a distributed key-value store and a document database join forces — literally.

With direct and immediate ties to both Membase and CouchDB, Couchbase Server takes the best of both worlds and jams them into a single product. It’s even added structured queries to the mix. With the recent release of version 4.0, the open source database takes a big leap forward in usability with the introduction of the SQL-like N1QL query language.

Understanding persistence in Couchbase is much easier if you approach it as a key-value store rather than as a document database. With a key-value store, it’s obvious that in order to store and retrieve a value, you also need to provide a key. It’s also obvious that the key doesn’t have to be repeated somewhere inside the value you’re storing. However, if you’re coming from a document database like MongoDB, it may seem strange to pull the unique identifier out of the object in order to store it.

The benefit here is that you’re really getting two kinds of databases at once. You can still take advantage of the key-value functionality that’s been a part of Couchbase from the beginning, while utilizing the document storage and retrieval that was incorporated in the 2.0 release.
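To make the key-outside-the-value idea concrete, here is a minimal sketch that uses a plain Python dict as a stand-in for a bucket. The key names and the get_document helper are made up for illustration; this is not the Couchbase SDK.

```python
import json

# A toy in-memory "bucket": every value is stored under a caller-supplied key.
bucket = {}

# Key-value style: the value is an opaque string as far as the store cares.
bucket["session::42"] = "opaque-token-abc"

# Document style: the value is JSON, but the identifier still lives in the key,
# not inside the document body.
bucket["user::jfreeman"] = json.dumps({"first": "Jonathan", "last": "Freeman"})


def get_document(key):
    """Fetch a value by key and parse it as a JSON document."""
    return json.loads(bucket[key])


user = get_document("user::jfreeman")
print(user["last"])
print("id" in user)  # the key is not repeated inside the stored document
```

Both access styles share the same lookup path; the only difference is whether you treat the value as opaque or as JSON.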

If I had a N1QL for every view…

Before diving into what N1QL is, let’s get some context for why it’s here. Formerly, the way you got data out of Couchbase was either by direct key lookup or by writing an incremental map-reduce script (a “view”). This was a huge drawback because it seriously limited the ability to run ad hoc queries with acceptable performance.

If you decided you were interested in sifting through data by last name instead of Social Security number, for example, you’d have to write up a map-reduce job and wait for Couchbase to do the equivalent of a full table scan in order to populate your view. (Subsequent requests for that data would be speedy, but the first time for any custom view was painful.) On top of that, if you wanted to do anything else with the data (such as roll up the results on city and state), you’d be stuck doing it in your application code or writing another map-reduce job.
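The cost profile described above can be sketched in a few lines of Python. This is a toy model of the idea, not how Couchbase views are written (those are JavaScript functions run by the server); the document shapes and function names are assumptions for illustration.

```python
from collections import defaultdict

docs = {
    "ssn::111": {"last": "Smith", "city": "Chicago", "state": "IL"},
    "ssn::222": {"last": "Jones", "city": "Chicago", "state": "IL"},
    "ssn::333": {"last": "Smith", "city": "Austin", "state": "TX"},
}


def map_by_last_name(key, doc):
    """Map step: emit (last name, document key) for a document."""
    yield doc["last"], key


def map_by_city_state(key, doc):
    """A second map step, needed for the city/state roll-up."""
    yield (doc["city"], doc["state"]), 1


def build_view(documents, map_fn):
    """Initial build: every document is visited, like a full table scan."""
    view = defaultdict(list)
    for key, doc in documents.items():
        for emitted_key, value in map_fn(key, doc):
            view[emitted_key].append(value)
    return view


def update_view(view, key, doc, map_fn):
    """Incremental update: only the changed document is mapped again."""
    for emitted_key, value in map_fn(key, doc):
        view[emitted_key].append(value)


by_last = build_view(docs, map_by_last_name)  # slow the first time

docs["ssn::444"] = {"last": "Lee", "city": "Chicago", "state": "IL"}
update_view(by_last, "ssn::444", docs["ssn::444"], map_by_last_name)  # cheap

# The roll-up requires its own map function plus a reduce (here, a sum).
rollup = {k: sum(v) for k, v in build_view(docs, map_by_city_state).items()}
```

Note that the roll-up needed a second map function and a second full pass, which is exactly the friction N1QL is meant to remove.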

N1QL aims to eliminate that pain by overlaying a partial SQL implementation on top of the otherwise NoSQL model. N1QL not only gives programmers another option for querying their data, but its familiar dialect opens the field to less technical folk who have experience and comfort in the world of SQL. Enabling your business analysts to explore the data more comfortably reduces iteration time to valuable results and frees up engineers who would otherwise be fielding those requests by writing custom views.

N1QL supports a seriously wide range of SQL syntax, from simple SELECT and WHERE statements to nested queries, GROUP BY and HAVING aggregations, and even JOIN. Here is an example of a valid N1QL query:

SELECT u.screen_name as sn, count(*) as num_tweets
FROM `tweets` tw
JOIN `users` u ON KEYS tw.user_id
WHERE tw.text LIKE "%javascript%"
GROUP BY u.screen_name
HAVING count(*) > 5

The above query finds tweets containing the word “javascript,” joins them with data from the “users” bucket, groups tweets by user, and only returns groups that have more than five tweets per user. Note that only INNER and LEFT OUTER joins are supported, and one side of the join has to be on a bucket key. Those limitations purportedly improve performance, but I’d still be hesitant to add any join-heavy query logic to my application code. If you know you’ll need it, you’re better off writing a view and side-stepping the costlier query.
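If the join restriction seems abstract, the following Python sketch mimics what the query does against two dicts standing in for the “tweets” and “users” buckets. The sample data and the run_query function are invented for illustration; the point is that the join is nothing more than a direct key lookup on the “users” side.

```python
from collections import Counter

# Bucket key -> document. The join can only reach users through these keys.
users = {
    "u1": {"screen_name": "alice"},
    "u2": {"screen_name": "bob"},
}

tweets = {f"t{i}": {"user_id": "u1", "text": "I love javascript"} for i in range(6)}
tweets["t6"] = {"user_id": "u2", "text": "javascript is fine"}
tweets["t7"] = {"user_id": "u1", "text": "coffee time"}
tweets["t8"] = {"user_id": "u3", "text": "javascript, unknown user"}  # no such user


def run_query(tweets, users, needle="javascript", min_tweets=5):
    counts = Counter()
    for tw in tweets.values():
        u = users.get(tw["user_id"])  # JOIN ... ON KEYS: one lookup by bucket key
        if u is None:                 # inner join: unmatched tweets are dropped
            continue
        if needle in tw["text"]:      # WHERE tw.text LIKE "%javascript%"
            counts[u["screen_name"]] += 1  # GROUP BY u.screen_name, count(*)
    # HAVING count(*) > 5
    return [
        {"sn": sn, "num_tweets": n}
        for sn, n in sorted(counts.items())
        if n > min_tweets
    ]


result = run_query(tweets, users)
```

Only alice clears the HAVING threshold here, with six matching tweets.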

Because the nesting of documents and document elements is common practice in document databases, N1QL includes new operators to help navigate these structures. NEST and UNNEST gather documents and split them out, respectively, while helper functions like array_length() allow you to work with embedded arrays.
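As a rough mental model, the two operators behave something like the Python functions below. This is my own approximation of the semantics under simplifying assumptions (inner variants only, references held as an array of keys), and the playlist data is made up.

```python
def unnest(documents, array_field):
    """Split out: yield one row per array element, paired with its parent."""
    for key, doc in documents.items():
        for element in doc.get(array_field, []):
            yield {"id": key, "parent": doc, "element": element}


def nest(left_docs, keys_field, right_docs):
    """Gather: collect the documents a left document points at into one array."""
    for key, doc in left_docs.items():
        gathered = [right_docs[k] for k in doc.get(keys_field, []) if k in right_docs]
        if gathered:  # like an inner join, skip left documents with no matches
            yield {"id": key, "left": doc, "nested": gathered}


playlists = {
    "p1": {"name": "road trip", "track_ids": ["t1", "t2"]},
    "p2": {"name": "empty", "track_ids": []},
}
tracks = {"t1": {"title": "So What"}, "t2": {"title": "Naima"}}

unnested = list(unnest(playlists, "track_ids"))
nested = list(nest(playlists, "track_ids", tracks))

# The Python analogue of array_length() on the embedded array.
lengths = {key: len(doc["track_ids"]) for key, doc in playlists.items()}
```

The playlist with an empty array disappears from both results, which is worth remembering when row counts look smaller than expected.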

Indexing, updates, and storage engines

When it comes to relational databases, SQL query performance can be greatly improved by indexing properly ahead of time, and the same is true for N1QL. To support indexes for N1QL queries, Couchbase Server now comes with an Index service. The Index service is a new component that allows you to create and manage indexes on buckets of data. You can create an index by specifying the fields on which to index, as well as N1QL expressions and an optional WHERE clause to filter which documents get sent to the indexer.
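The effect of that optional WHERE clause is easiest to see in a toy model. The PartialIndex class below is a hypothetical Python sketch, not the Index service’s implementation: it keeps sorted (field value, document key) pairs, and only for documents that pass the filter.

```python
import bisect


class PartialIndex:
    """Sorted (field value, document key) entries for documents passing a filter."""

    def __init__(self, field, predicate=lambda doc: True):
        self.field = field
        self.predicate = predicate  # plays the role of the index's WHERE clause
        self.entries = []

    def add(self, key, doc):
        # Documents that fail the filter never reach the index.
        if self.field in doc and self.predicate(doc):
            bisect.insort(self.entries, (doc[self.field], key))

    def lookup(self, value):
        """Return document keys whose indexed field equals value."""
        start = bisect.bisect_left(self.entries, (value, ""))
        keys = []
        for indexed_value, key in self.entries[start:]:
            if indexed_value != value:
                break
            keys.append(key)
        return keys


users = {
    "u1": {"last": "Smith", "active": True},
    "u2": {"last": "Smith", "active": False},
    "u3": {"last": "Jones", "active": True},
}

active_by_last = PartialIndex("last", predicate=lambda d: d.get("active") is True)
for key, doc in users.items():
    active_by_last.add(key, doc)

smiths = active_by_last.lookup("Smith")
```

The inactive Smith is never indexed, so the index stays smaller, but a query that needs inactive users cannot use it.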

A key point to keep in mind: When you create an index, it exists on a single instance of the index service. If that instance goes down, you’re out of luck. There is no automated replication or sharding of indexes; unless you’ve manually created the index on multiple nodes (which you can and should do), you’re back to full bucket scans for potentially complex queries. Compare indexed queries to writing incremental map-reduce views for complex queries. Although map-reduce views are sharded, distributed, and replicated with your data, getting results for a view requires scatter-gather operations across the network. Because an index resides completely on a single node, you can avoid any scatter-gather operations if you have a covering index.

Like Cassandra and some other popular data stores, Couchbase employs an append-only write model. This model favors immutability by never performing in-place updates. Instead, updated documents are added to the end of a file, which is subsequently read from the end. The most recent document wins, and old versions of the same document are invalid.

The append-only write model lends itself to the problem of files growing forever, since every possible change results in more bits at the end of the file. As a result, a cleanup step, often referred to as compaction, is needed to prevent the disk from filling up. The existing file is rewritten without all of the stale documents, and when the new file catches up with the old file, the database starts using the new one and the old one can be deleted.
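Here is a minimal in-memory sketch of the append-only model and its compaction step. The AppendOnlyStore class is a teaching toy that uses a Python list in place of a file; it ignores deletes, concurrency, and everything else a real storage engine has to handle.

```python
class AppendOnlyStore:
    """Toy append-only store: a list of (key, value) records, newest last."""

    def __init__(self):
        self.log = []

    def put(self, key, value):
        # Never update in place; always append a new record.
        self.log.append((key, value))

    def get(self, key):
        # Read from the end: the most recent record for a key wins.
        for record_key, value in reversed(self.log):
            if record_key == key:
                return value
        raise KeyError(key)

    def compact(self):
        # Rewrite the log with only the latest record per key, then swap it in.
        latest = {}
        for record_key, value in self.log:
            latest[record_key] = value
        self.log = list(latest.items())


store = AppendOnlyStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)  # the record ("a", 1) is now stale

size_before = len(store.log)
store.compact()
size_after = len(store.log)
```

Three records shrink to two after compaction, and reads return the same values before and after.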

Figure: Couchbase compaction. Instead of performing in-place updates, Couchbase appends updated documents to the end of a file. At some point, it writes a new version of the file that omits the stale documents, a process called compaction.

Compaction works pretty well, but it has drawbacks. First, you need enough disk space to hold an extra copy of all of your data; otherwise, compaction will fail. More important, with a write-heavy use case, the new file may never catch up. Couchbase mitigates these issues by performing compaction not at the database or bucket level, but at the vBucket level, which is 1/1,024 of a bucket. By reducing the size of the file to be compacted, you can perform incremental compaction with lower disk resource requirements and a lower probability of compaction failure due to heavy writes. Compacting at the vBucket level is a big step toward preventing compaction failures, but it’s easy to negate its mitigations with a poorly devised data model.
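A back-of-the-envelope sketch shows why the smaller unit helps. Both functions below are illustrative assumptions: vbucket_for uses a plain CRC32 modulo 1,024 (the real client hashing differs in detail), and compaction_headroom_bytes assumes data is spread evenly, one unit is compacted at a time, and the worst case is a rewritten copy as large as the original.

```python
import zlib

NUM_VBUCKETS = 1024


def vbucket_for(key):
    """Simplified key-to-vBucket mapping (assumption: plain CRC32 mod 1,024)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def compaction_headroom_bytes(bucket_bytes, per_vbucket=True):
    """Worst-case extra disk needed while one compaction unit is rewritten."""
    if per_vbucket:
        return bucket_bytes / NUM_VBUCKETS
    return bucket_bytes


GIB = 2**30
bucket_size = 1024 * GIB  # a 1 TiB bucket

whole_bucket = compaction_headroom_bytes(bucket_size, per_vbucket=False)
one_vbucket = compaction_headroom_bytes(bucket_size, per_vbucket=True)
```

Under these assumptions the headroom drops from the full size of the bucket to about 1 GiB. The even-spread assumption is the catch: a data model that concentrates writes on a few keys concentrates them on a few vBuckets as well.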

Couchbase runs two different services that demand disk usage; as a result, it uses two different storage engines. The data service, responsible for basic CRUD operations and views, uses Couchstore to persist to disk. The index service, responsible for maintaining index data from the GSI (global secondary index), works with ForestDB for persistence.

Couchstore, which has been around since the Couchbase 2.0 release, is what the data service uses to handle direct document access and the maintenance and storage of views. It employs a slightly modified B-Tree data structure, which ensures consistency and performance across lookups, updates, and deletes. There are drawbacks due to the append-only strategy, such as no sibling-chaining for sequential lookups and a slightly costlier update algorithm, but it’s not too different from a standard B-Tree in practice. On disk, the Snappy library is used to compress data, similar to the default in MongoDB 3.0’s WiredTiger. However, whereas compression is pluggable in MongoDB, Snappy is the only option for Couchbase. Lucky for us, Snappy is a solid compression library.

ForestDB is relatively new by comparison, dating from its original beta release in October 2014. Used by the Index service to maintain the GSI, ForestDB is accessed exclusively through the Query service that receives and parses N1QL queries. It employs a “Hierarchical B+-Tree based Trie,” which is a “trie” (a tree data structure whose keys are strings) of B+ trees optimized for shallow depth and disk access. The benefits gained by this sort of data structure are primarily realized when you need efficient access to variable length strings, which is exactly what the index service is going to be doing.
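To get a feel for why a trie suits variable-length string keys, consider the toy below. It is a loose sketch, not ForestDB’s actual structure: I am assuming keys are cut into fixed-size chunks, and a plain dict stands in for the B+ tree at each level.

```python
_MISSING = object()


class Node:
    def __init__(self):
        self.children = {}      # chunk -> Node; stands in for one B+ tree
        self.value = _MISSING


class ChunkTrie:
    """Toy trie keyed on fixed-size chunks of a string key."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.root = Node()

    def _chunks(self, key):
        size = self.chunk_size
        return [key[i:i + size] for i in range(0, len(key), size)]

    def put(self, key, value):
        node = self.root
        for chunk in self._chunks(key):
            node = node.children.setdefault(chunk, Node())
        node.value = value

    def get(self, key):
        node = self.root
        for chunk in self._chunks(key):
            node = node.children.get(chunk)
            if node is None:
                raise KeyError(key)
        if node.value is _MISSING:
            raise KeyError(key)
        return node.value

    def depth(self, key):
        """Number of levels visited to reach a key."""
        return len(self._chunks(key))


trie = ChunkTrie(chunk_size=4)
trie.put("user::alice", 1)
trie.put("user::alina", 2)
```

The two keys share the “user” and “::al” levels and only diverge at the last chunk, so long keys with common prefixes cost little extra space, and lookup depth grows with key length rather than with the number of keys.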

This sort of data structure would also be useful in place of the B-Tree used in Couchstore, and in fact there are plans to replace Couchstore with ForestDB in the future. Developers at Couchbase have been working on it for over a year, but a target release number has yet to be announced. As with Couchstore, Snappy is used for compression.

InfoWorld Scorecard: Couchbase Server 4.0
Administration (20%): 8
Ease of use (20%): 8
Scalability (20%): 9
Installation and setup (15%): 9
Documentation (15%): 8
Value (10%): 8
Overall Score (100%): 8.5

Scaling, specialization, and spatial views

All Couchbase nodes in a cluster are identical by default. Each node contains several services that power indexing, querying, and data management, as well as cluster administration tools. Having one node type and all nodes performing all duties in a basic cluster makes administration easy: Simply add another node of the same type to the cluster and you’re most of the way there. While this simple approach was the only way of scaling in previous versions of Couchbase Server, the deployment story has changed with the introduction of the Index and Query services. These two services, both released in Couchbase 4.0 to support N1QL, have a different resource footprint than the preexisting Data service, and having to scale them together could be inefficient.

Enter Multidimensional Scaling (MDS), a feature available only in the Enterprise Edition. MDS is simply a way to specialize nodes in your cluster in order to scale them independently. Each of the core services — Data, Query, and Index — can be turned on or off for a particular node in the cluster, and you can then optimize for the type of workload each node needs to perform. The Data service thrives on horizontal scaling, the Index service on fast disk access, and the Query service on beefy CPUs. Throw them all on their own nodes and you can optimize to the extreme.
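The difference between the two deployment styles can be pictured as a mapping from nodes to the services they run. The node names and counts below are hypothetical, and this is a sketch of the topology idea only, not Couchbase configuration syntax.

```python
SERVICES = {"data", "index", "query"}

# Symmetric cluster: every node runs every service.
symmetric = {f"node{i}": set(SERVICES) for i in range(1, 4)}

# Multidimensionally scaled cluster: each node runs one service and can be
# sized for it (more data nodes, fast disks for index, big CPUs for query).
specialized = {
    "data1": {"data"},
    "data2": {"data"},
    "data3": {"data"},
    "index1": {"index"},
    "query1": {"query"},
}


def nodes_running(cluster, service):
    """List the nodes in a cluster that run the given service."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service}")
    return sorted(name for name, services in cluster.items() if service in services)


query_nodes = nodes_running(specialized, "query")
data_nodes = nodes_running(specialized, "data")
```

In the specialized layout you can add a fourth data node without paying for another index or query node.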

This type of scaling is transparent to developers, so they don’t have to worry about their code breaking as features are promoted from a single-instance development environment to a symmetrically scaled QA environment to a multidimensionally scaled production environment. The downside of MDS is that you incur some administrative overhead for specializing nodes, but it’s designed so that you can isolate that cost to specific environments without cascading effects.

An appealing feature of Couchbase is Cross Datacenter Replication (XDCR). This is a mechanism for sending data between otherwise separate Couchbase clusters. XDCR supports unidirectional or bidirectional data flow, and it can send all the data or only a filtered subset of data. Major use cases include disaster recovery, data locality, and performing development tests on production data in a sandbox.
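Conceptually, a filtered one-way replication looks something like the sketch below. The replicate function is a toy, and matching a regular expression against document keys is my assumption about how a filter might be expressed; conflict resolution for bidirectional flows is not modeled at all.

```python
import re


def replicate(source, target, key_pattern=None):
    """Copy documents from source to target, optionally filtered by key pattern."""
    matcher = re.compile(key_pattern) if key_pattern else None
    copied = 0
    for key, doc in source.items():
        if matcher is None or matcher.search(key):
            target[key] = doc
            copied += 1
    return copied


production = {
    "user::1": {"name": "alice"},
    "user::2": {"name": "bob"},
    "log::1": {"msg": "started"},
}

# Disaster recovery: send everything to the backup cluster.
backup = {}
copied_all = replicate(production, backup)

# Sandbox testing: send only the user documents.
sandbox = {}
copied_users = replicate(production, sandbox, key_pattern=r"^user::")
```

The same mechanism covers both the backup and the sandbox use cases; only the filter changes.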

Setting up XDCR is pretty straightforward in the administrative UI that runs on all Couchbase nodes. Keep in mind that some XDCR features, like data filtering and encryption, are only available in the Couchbase Server Enterprise Edition.

In nearly every domain, valuable information comes from context, and one of the most important contexts is location. Couchbase 4.0 finally flips on the GA flag for the spatial views feature that has been marked experimental for the last two major versions. As the name suggests, spatial queries require you to write an old-fashioned view. Spatial views rely on location information defined in the GeoJSON format, and they support a handful of geometry types from Points to MultiPolygons.

However, note that only the bounding box of more complex geometry is indexed under the covers. If you’re trying to do sophisticated queries with MultiPoints or GeometryCollections, you’ll more than likely have to augment the database query with custom logic to handle the current constraints.

Figure: Couchbase dashboard. Couchbase’s admin UI, available on every node in the cluster, puts admin functions and performance metrics within easy reach.

Speaking of constraints, you can only query for a geospatial index by bounding box. This is sufficient in some cases, but radius queries and polygon queries are necessary in many others.
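One common workaround for the missing radius query is to ask the database for the bounding box around the circle and then filter the candidates in application code. The sketch below simulates both halves in plain Python; the sample points are invented, and bbox_around uses an approximation that breaks down near the poles and the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two (lon, lat) points in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def bbox_around(lon, lat, radius_km):
    """Bounding box (west, south, east, north) enclosing a circle."""
    dlat = math.degrees(radius_km / EARTH_RADIUS_KM)
    dlon = math.degrees(radius_km / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)


def bbox_query(points, bbox):
    """Stand-in for the spatial view: all it can answer is a bounding box."""
    west, south, east, north = bbox
    return [
        name for name, (lon, lat) in points.items()
        if west <= lon <= east and south <= lat <= north
    ]


def radius_query(points, lon, lat, radius_km):
    """Bounding-box query first, exact distance check in application code."""
    candidates = bbox_query(points, bbox_around(lon, lat, radius_km))
    return [n for n in candidates if haversine_km(lon, lat, *points[n]) <= radius_km]


points = {
    "loop": (-87.63, 41.88),
    "corner": (-87.52, 41.965),  # inside the box, outside the circle
    "far": (-87.0, 41.0),
}

in_box = bbox_query(points, bbox_around(-87.63, 41.88, 10))
in_radius = radius_query(points, -87.63, 41.88, 10)
```

The “corner” point is the reason the second pass exists: it sits inside the box but roughly 13 km from the center, so the box alone would return a false positive.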

Beyond simply creating a view on a geolocation, these spatial views often combine locations with other numeric fields to create n-dimensional views. The resulting view can then be used to query on a geographic bounding box as well as other bounding values, such as a date range or log level (so long as you encoded log levels into a numeric value in the view).
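An n-dimensional query is really a range check on every dimension at once. The following Python sketch uses made-up log events and a hypothetical numeric encoding for log levels; it illustrates the idea rather than the view engine’s implementation.

```python
from datetime import date

# Log levels must be encoded as numbers to take part in a range query.
LOG_LEVELS = {"DEBUG": 10, "INFO": 20, "WARN": 30, "ERROR": 40}


def emit_point(doc):
    """Encode an event as (lon, lat, day number, numeric log level)."""
    return (doc["lon"], doc["lat"], doc["day"].toordinal(), LOG_LEVELS[doc["level"]])


def query(view, ranges):
    """Return keys whose point falls inside the (low, high) range on every axis."""
    return sorted(
        key for key, point in view.items()
        if all(low <= value <= high for value, (low, high) in zip(point, ranges))
    )


events = {
    "e1": {"lon": -87.63, "lat": 41.88, "day": date(2016, 2, 10), "level": "ERROR"},
    "e2": {"lon": -87.63, "lat": 41.88, "day": date(2016, 2, 12), "level": "INFO"},
    "e3": {"lon": -97.74, "lat": 30.27, "day": date(2016, 2, 11), "level": "ERROR"},
    "e4": {"lon": -87.63, "lat": 41.88, "day": date(2016, 1, 5), "level": "WARN"},
}
view = {key: emit_point(doc) for key, doc in events.items()}

ranges = [
    (-88.0, -87.0),                                               # longitude
    (41.0, 42.0),                                                 # latitude
    (date(2016, 2, 1).toordinal(), date(2016, 2, 29).toordinal()),  # February 2016
    (LOG_LEVELS["WARN"], LOG_LEVELS["ERROR"]),                    # WARN or worse
]
hits = query(view, ranges)
```

Only “e1” satisfies all four ranges; each of the other events falls outside exactly one dimension.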

In addition to the major features already discussed, Couchbase 4.0 includes a handful of smaller features that may prove useful. Auditing of administrative users will prove handy for those working on HIPAA compliance, as will LDAP integration, but only if you pay for the Enterprise Edition. Additionally, since the early October release of Couchbase 4.0, a minor update has brought performance enhancements, better platform coverage, and more notably, the general availability of the INSERT, UPDATE, and DELETE functionality in N1QL.

Managing Couchbase Server

Each node in a Couchbase cluster has an administrative UI running on it, so you can hop in anywhere and start tinkering. The UI has a cluster overview tab, which has basic cluster-wide aggregations, followed by a Server Nodes tab where you can drill into detailed semi-real-time metrics. Here you can monitor everything from HTTP requests per second to average scan latency on an index.

Next up is a tab for Data Buckets, where you can create buckets, trigger manual compaction, and view and modify documents. The Views tab provides an editor for playing around with views in development and tools to “publish” those views to production.

The Settings tab has one gem worth pointing out: Alerts. A quick configuration allows the Couchbase cluster to use an email server you provide to send you alerts about significant events in the cluster. You choose which ones you care about from a list; examples include “Disk space used for persistent storage has reached at least 90% of capacity” and “Node was not auto-failed-over as there are not enough nodes in the cluster running the same service.” The UI generally feels a bit old school, but it offers useful information and plenty of functionality. If you need more, there’s always the command line.

There are three options regarding Couchbase Server editions and licensing. The Community Edition releases, which have no restrictions and can be used by anyone for free, lack some advanced features and don’t incorporate bug fixes in a timely manner. They’re also less rigorously tested than the Enterprise Edition. Couchbase Enterprise unlocks certain features like XDCR filtering and rack awareness, and it includes regular updates, bug fixes, and 24/7 support. The Enterprise Edition starts at $5,600 per node per year and scales up with the technical service SLAs needed. The last option is that you build the server yourself, without fees or support. The source code for every component and feature, including those found only in the Enterprise Edition, is open and available on GitHub. You’ll find the instructions in the Couchbase documentation.

Couchbase Server shares similarities with a number of other databases, but there’s nothing else quite like it. Couchbase is a document database with a promising new query model, a key-value store with an odd map-reduce mechanism, and a distributed replacement for Memcached. Couchbase Server has undergone so many changes in the past few years that it almost defies description.

Sometimes Couchbase looks like SQL bolted onto a document database bolted onto a key-value store. Then you catch a glimpse of underlying elegance, like how the original product was extended to support the Indexing service, XDCR, and Kafka connectors, each in effectively identical ways. All in all, it’s a rich technological foundation that offers both developers and operations teams a great deal of flexibility. As such, it also presents them with a great deal of complexity. As always, when there are many ways to do something, you need to pay close attention to what you’re doing.

Jonathan Freeman is a software developer, consultant, and jazz musician living in Chicago. Through consulting, he's enjoyed working in various domains, from finance to healthcare to video games. While he specializes in JavaScript, both browser and server side, he also takes a keen interest in modern data stores (particularly graph databases) and distributed computing platforms.
