martin_heller
Contributor

Buyer’s guide: How to choose a cloud data warehouse

how-to
Dec 13, 202113 mins
Cloud ComputingData ArchitectureDatabases

Modern data warehouses can query structured data and semi-structured data simultaneously, and even combine historical data and streaming live data for analysis.

cloud computing data center
Credit: Thinkstock

Cloud data warehouses explained

Enterprise data warehouses, or EDWs, are unified databases for all historical data across an enterprise, optimized for analytics. These days, organizations implementing data warehouses often consider creating the data warehouse in the cloud rather than on premises. Many also consider using data lakes that support queries instead of traditional data warehouses. A third consideration is whether you want to combine historical data with streaming live data.

A data warehouse is an analytic, usually relational, database created from two or more data sources, typically to store historical data, which may have a scale of petabytes. Data warehouses often have significant compute and memory resources for running complicated queries and generating reports, and are often the data sources for business intelligence (BI) systems and machine learning.

[ Download our editors’ PDF cloud data warehouse enterprise buyer’s guide today! ]

In this buyer’s guide

  • Cloud data warehouses explained
  • What to look for in cloud data warehouses
  • Leading vendors for cloud data warehouses
  • Essential reading

The write throughput requirements of transactional operational databases limit the number and kind of indexes you can create (more indexes mean more writes and updates per record added, and more possible contention). This in turn slows down analytic queries against the operation database. Once you have exported your data into a data warehouse, you can index everything you care about in the data warehouse for good analytic query performance, without affecting the write performance of the separate online transaction processing (OLTP) database.

Data marts contain data oriented toward a specific business line. Data marts may be dependent on the data warehouse, independent of the data warehouse (i.e., drawn from an operational database or external source), or a hybrid of the two.

Data lakes, which store files of data in its native format, are essentially “schema on read,” meaning that any application that reads data from the lake will need to impose its own types and relationships on the data. Traditional data warehouses, on the other hand, are “schema on write,” meaning that data types, indexes, and relationships are imposed on the data as it is stored in the data warehouse.

Modern data warehouses can often handle structured data and semi-structured data and query them simultaneously. In addition, modern data warehouses can often query historical data and streamed recent data simultaneously.

A data warehouse can be implemented on-premises, in the cloud, or as a hybrid. Historically, data warehouses were always on-premises, but the capital cost and lack of scalability of on-premises servers in data centers were sometimes issues. On-premises EDW installations grew when vendors started offering data warehouse appliances. Now, however, the trend is to move all or part of your data warehouse to the cloud to take advantage of the inherent scalability of cloud data warehouses, and the ease of connecting to other cloud services.

The downside of putting petabytes of data in the cloud is the operational cost, both for cloud data storage and for cloud data warehouse compute and memory resources. You might think that the time to upload petabytes of data to the cloud would be a huge barrier, but the hyperscale cloud vendors now offer high-capacity, disk-based data transfer services.

What to look for in cloud data warehouses

As you evaluate cloud data warehouses, look for administrative simplicity, high scalability, high performance, good integrations, and reasonable cost. Ask for customer references, especially for large deployments, and do your own proof of concept. Look explicitly at the vendor’s current and planned machine learning capabilities, because much of the business value of data warehouses comes from obtaining and applying predictive analytics.

There are six types of capabilities every cloud data warehouse should provide. This section explains them.

Speed and scalability: Data warehouses are designed so that analytical queries can run fast. For old on-premises data warehouses, reports with multiple queries based on historical data were typically run overnight. For modern cloud data warehouses, the performance requirements are stiffer, as analysts expect to run queries based on historical plus streaming data interactively, and then dig deeper with more queries.

Cloud data warehouses are usually designed to scale CPU capacity as needed, so that interactive queries against petabytes of data can return answers in minutes. Some cloud data warehouses can increase the CPU resources while a query is running without restarting the query, and reduce them again when the data warehouse is idle. Aggressive up-scaling and down-scaling can be a good strategy to get high performance when needed for low overall cost.

Columnar versus row storage: Row-oriented databases organize data by record. They typically try to store one database row in one block of storage, so that the whole row can be retrieved with a single read operation. Row-oriented databases are efficient for both reading and writing rows. Most transactional databases are row-oriented and use b-tree indexes.

Column-oriented databases organize data by field, and try to store all the data associated with a field together. Columnar databases are efficient for reading and computing on columns. Most data warehouses store data in columns, compress their data heavily, and use LSM-tree indexes. The original paper describing C-Store, a read-optimized column-oriented database, was published in 2005. The C-Store paper laid the groundwork for most modern columnar store data warehouses, including Amazon Redshift, Google BigQuery, and Snowflake.

Some databases combine row and columnar storage. They use row storage for OLTP, and columnar storage for analytic queries. A few databases can query data in columnar storage and row storage together, which speeds up queries where not all fields can fit into columnar storage.

In-memory storage and layered storage: What’s faster than a compressed columnar store on disk? A compressed columnar store in memory. What can handle more data than a columnar store in memory? A layered storage system that backs memory with PMEM, such as Intel Optane, which is faster than flash and cheaper than DRAM. Additional layers would be flash and spinning disks. The hard part of a scheme like this is implementing the multilevel caching without slowing down retrievals or allowing unnecessary cache flushing in the faster layers.

ETL versus ELT: Extract, transform, and load (ETL) tools pull the data, perform any desired mappings and transformations, and load the data into the data storage layer. By contrast, extract, load, and transform (ELT) tools store the data first and transform it later. When you use ELT tools, it is common to also use a data lake.

Clustered and distributed cloud data warehouses: Because data warehouses are read-mostly databases, it is easier to cluster them than to cluster OLTP databases. It is also easier to distribute data warehouses geographically without incurring high write latency. Once your data warehouse has a clustered architecture, it is easy to add nodes to the cluster to increase processing capacity and return results faster.

Cloud UI for admin and queries: Just about every cloud data warehouse has its own user interface for administration and queries. Some are more usable than others. Administration is simpler than query building. Adding a node (or setting a maximum number of nodes for autoscaling) can be as easy as pressing one button. Some cloud data warehouses offer a graphical query builder, which is useful for SQL novices. Many cloud data warehouses offer a history pane for past queries and their answers.

Leading vendors for cloud data warehouses

The 13 products listed here alphabetically either are cloud data warehouses or provide the functionality of data warehouses while building on a different base architecture, such as data lakes.

You could argue that Ahana, Delta Lake, and Qubole are built on data lakes rather than starting as data warehouses, but you could also argue that they provide much the same functionality as unquestioned data warehouses such as AWS Redshift, Azure Synapse, and Google BigQuery. Because all these products add heterogenous federated query engines, the functional distinction between data lakes and data warehouses tends to blur.

Amazon Redshift: Using Amazon Redshift, you can query and combine exabytes of structured and semistructured data across your data warehouse, operational database, and data lake using standard SQL. Redshift lets you easily save the results of your queries back to your S3 data lake using open formats, such as Apache Parquet, so that you can do additional analytics from other analytics services such as Amazon EMR, Amazon Athena, and Amazon SageMaker.

Delta Lake: Developed by Databricks and now a project of the Linux Foundation, Delta Lake is an open source project that enables building a “lakehouse” architecture on top of existing storage systems such as Amazon S3, Microsoft Azure Data Lake Storage, Google Cloud Storage, and HDFS. It adds ACID transactions, metadata handling, data versioning, schema enforcement, and schema evolution to data lakes. Databricks Lakehouse Platform uses Delta Lake, MLflow, and Spark in a cloud service that runs on AWS, Google Cloud, and Microsoft Azure to combine the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes.

Google BigQuery: Google BigQuery is a serverless, petabyte-scale, cloud data warehouse with an internal BI engine, internal machine learning accessible via SQL extensions, and integrations across all Google Cloud services including Vertex AI and TensorFlow. BigQuery Omni extends BigQuery to analyze data across clouds, using Anthos. Data QnA provides a natural language front end to BigQuery. Connected Sheets allow users to analyze billions of rows of live BigQuery data in Google Sheets. BigQuery can process federated queries including external data sources in object storage (Google Cloud Storage) for Parquet and Optimized Row Columnar (ORC) file formats, transactional databases (such as Google Cloud Bigtable and Google Cloud SQL), or spreadsheets in Google Drive.

IBM Ahana Cloud for Presto: Ahana Cloud for Presto turns a data lake on Amazon S3 into what is effectively a data warehouse, without moving any data. SQL queries run quickly even when joining multiple heterogeneous data sources.

Presto is an open source, distributed SQL query engine for running interactive analytic queries against data sources of all sizes. Presto allows querying data where it lives, including Hive, Cassandra, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources. Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse.

Ahana Cloud for Presto runs on Amazon Web Services (AWS), has a fairly simple user interface, and has end-to-end cluster life cycle management. It runs in Kubernetes and is highly scalable. It has a built-in catalog and easy integration with data sources, catalogs, and dashboarding tools. The default Ahana query interface is Apache Superset. You can also use Jupyter or Zeppelin notebooks, especially if you are doing machine learning.

Ahana claims to have 3X the performance of other Presto services, including Amazon Elastic MapReduce and Amazon Athena.

Microsoft Azure Synapse Analytics: Azure Synapse Analytics is an analytics service that brings together data integration, data warehousing, and big data analytics. It lets you ingest, explore, prepare, manage, and serve data for immediate BI and machine learning needs, and query data using either serverless or dedicated resources at scale. Azure Synapse can run queries using Spark or SQL engines. It has deep integration with Azure Machine Learning, Azure Cognitive Services, and Power BI.

Opentext Vertica: Vertica provides a unified analytics warehouse across major public clouds and on-premises data centers and integrates data in cloud object storage and HDFS without forcing you to move any of your data. Vertica offers two deployment options. Vertica in Enterprise Mode runs on industry-standard servers with tightly coupled storage, delivering the highest performance for use cases that demand consistent compute capacity. Vertica in Eon Mode has a cloud-native architecture that separates compute from storage, enabling simplified management for variable workloads with the flexibility to apply specific compute resources to shared storage for different business use cases. Vertica in Eon Mode is available on AWS and Google Cloud but is not limited to public cloud deployments.

Oracle Autonomous Data Warehouse: Oracle Autonomous Data Warehouse is a cloud data warehouse service that automates provisioning, configuring, securing, tuning, scaling, and backing up of the data warehouse. It includes tools for self-service data loading, data transformations, business models, automatic insights, and built-in converged database capabilities that enable simpler queries across multiple data types and machine learning analysis. It’s available in both the Oracle public cloud and customers’ data centers with Oracle Cloud@Customer.

Qubole: Qubole is a simple, open, and secure data lake platform for machine learning, streaming, and ad hoc analytics. It is available on the AWS, Azure, Google, and Oracle clouds. Qubole helps you to ingest datasets from a data lake, build schemas with Hive, query the data with Hive, Presto, Quantum, or Spark, and continue to your data engineering and data science. You can work with Qubole data in Zeppelin or Jupyter notebooks and Airflow workflows.

Rockset: Rockset is an operational analytics database. It occupies a niche between transactional databases and data warehouses. Rockset can analyze gigabytes to terabytes of recent, real-time, and streaming data, and has the indexes to make most queries run in milliseconds. Rockset builds a converged index on structured and semi-structured data from OLTP databases, streams, and data lakes in real time, and exposes a RESTful SQL interface.

Snowflake: Snowflake is a dynamically scalable enterprise data warehouse designed for the cloud. It runs on AWS, Azure, and Google Cloud. Snowflake features storage, compute, and global services layers that are physically separated but logically integrated. Data workloads scale independently from one another, making Snowflake an appropriate platform for data warehousing, data lakes, data engineering, data science, modern data sharing, and developing data applications.

Teradata Vantage: Teradata Vantage is a connected multicloud data platform for enterprise analytics that unifies data lakes, data warehouses, analytics, and new data sources and types. Vantage runs on public clouds (such as AWS, Azure, and Google Cloud), hybrid multicloud environments, on-premises with Teradata IntelliFlex, or on commodity hardware with Broadcom VMware.

Yandex ClickHouse: ClickHouse is an open source, column-oriented, OLAP database management system that manages extremely large volumes of data, including nonaggregated data, and allows generating custom data reports online in real time. The system is linearly scalable and can be scaled up to store and process trillions of rows and petabytes of data. ClickHouse is designed to work on regular hard drives, which means the cost per GB of data storage is low, but SSD and additional RAM are also fully used if available.

In ClickHouse, data can reside on different shards. Each shard can be a group of replicas that are used for fault tolerance. The query is processed on all the shards in parallel. ClickHouse uses asynchronous multi-master replication. After being written to any available replica, data is distributed to all of the remaining replicas in the background. ClickHouse is available as a cloud service from Alibaba, SberCloud, Altinity (on AWS), Tencent, and, of course, Yandex.

Yellowbrick Data Warehouse: Yellowbrick Data Warehouse is a modern, massively parallel processing, analytic database designed for the most demanding batch, real-time, interactive, and mixed workloads. Yellowbrick lets you provision data warehouses wherever needed—in private data centers, multiple public clouds, and the network edge. Yellowbrick promotes its use for data lake augmentation and data warehouse modernization.

Essential reading

martin_heller
Contributor

Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi. Disclosure: He also writes for Hewlett-Packard’s TechBeacon marketing website.

More from this author