Buyer’s guide: How to choose a cloud machine learning platform

12 capabilities every cloud machine learning platform should provide to support the complete machine learning lifecycle—and which cloud machine learning platforms provide them.

Machine learning platforms explained

To create effective machine learning and deep learning models, you need copious amounts of data, a way to clean the data and perform feature engineering on it, and a way to train models on your data in a reasonable amount of time. Then you need a way to deploy your models, monitor them for drift over time, and retrain them as needed.

You can do all of that on-premises if you have invested in compute resources and accelerators such as GPUs, but you may find that if your resources are adequate, they are also idle much of the time. On the other hand, it can sometimes be more cost-effective to run the entire pipeline in the cloud, using large amounts of compute resources and accelerators as needed, and then releasing them.

The major cloud providers — and a number of minor clouds too — have put significant effort into building out their machine learning platforms to support the complete machine learning life cycle, from planning a project to maintaining a model in production. How do you determine which of these clouds will meet your needs?

What to look for in machine learning platforms

There are 12 capabilities every end-to-end machine learning platform should provide. This section explains them and identifies how various machine learning platforms support them.

Be close to your data: If you have the large amounts of data needed to build precise models, you don’t want to ship it halfway around the world. The issue isn’t distance as such, but time: Data transmission latency is ultimately limited by the speed of light, even on a perfect network with infinite bandwidth, so long distances mean latency.

The ideal case for very large data sets is to build the model where the data already resides, so that no mass data transmission is needed. Multiple databases support that.

The next best case is for the data to be on the same high-speed network as the model-building software, which typically means within the same data center. Even moving the data from one data center to another within a cloud availability zone can introduce a significant delay if you have terabytes or more. You can mitigate this by doing incremental updates.

The worst case would be if you have to move big data long distances over paths with constrained bandwidth and high latency. The trans-Pacific cables going to Australia are particularly egregious in this respect.
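To put rough numbers on the cases above, here is a back-of-the-envelope sketch. The 2/3 propagation factor for fiber, the 12,000 km path, and the 1 Gbps link are illustrative assumptions, not measurements of any real route.

```python
# Back-of-the-envelope estimate of data-movement time.
# All inputs are illustrative assumptions, not measurements.

SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in a vacuum
FIBER_FACTOR = 2 / 3                # assumed: light in fiber travels at roughly 2/3 c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip latency over a fiber path of the given length."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


def transfer_hours(size_terabytes: float, bandwidth_gbps: float) -> float:
    """Time to move a data set at a sustained bandwidth, ignoring protocol overhead."""
    bits = size_terabytes * 1e12 * 8
    seconds = bits / (bandwidth_gbps * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 12,000 km path and a hypothetical 1 Gbps sustained link.
    print(f"{min_round_trip_ms(12_000):.0f} ms minimum round trip")
    print(f"{transfer_hours(10, 1):.1f} hours to move 10 TB at 1 Gbps")
```

Under these assumptions the round trip can never be faster than roughly 120 ms, and moving 10 TB takes about 22 hours, which is why building the model where the data lives, or updating incrementally, is so attractive.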

The major cloud providers have been addressing this issue in multiple ways. One is to add machine learning and deep learning to their database services. For example:

Amazon Redshift ML is designed to make it easy for Structured Query Language (SQL) users to create, train, and deploy machine learning models using SQL commands against Amazon Redshift, a managed, petabyte-scale data warehouse service.
Google’s BigQuery ML lets you create and execute machine learning models in BigQuery, Google Cloud’s managed, petabyte-scale data warehouse, also using SQL queries.
IBM Db2 Warehouse on Cloud includes a wide set of in-database SQL analytics that includes some basic machine learning functionality, plus in-database support for Python and R.
Microsoft SQL Server Machine Learning Services supports Java, Python, R, the Predict T-SQL command, and the rx_Predict stored procedure in the SQL Server RDBMS, and Spark MLlib in Microsoft SQL Server Big Data Clusters.
Oracle Cloud Infrastructure (OCI) Data Science is a managed and serverless platform for data science teams to build, train, and manage machine learning models using Oracle Cloud Infrastructure including Oracle Autonomous Database and Oracle Autonomous Data Warehouse.

Another way cloud providers have addressed this issue is to bring their cloud services to customer data centers as well as to satellite points of presence (often in large metropolitan areas) that are closer to customers than full-blown availability zones.

Amazon Web Services (AWS) calls these AWS Outposts and AWS Local Zones.
Google Cloud calls them network edge locations, Google Distributed Cloud Virtual, and Anthos on-premises.
Microsoft Azure calls them Azure Stack Edge nodes and Azure Arc.

Support an ETL or ELT pipeline: Extract, transform, and load (ETL) and extract, load, and transform (ELT) are two data pipeline configurations that are common in the database world. Machine learning and deep learning amplify the need for these, especially the transform portion. ELT gives you more flexibility when your transformations need to change, as the load phase is usually the most time-consuming for big data.

In general, data in the wild is noisy. That needs to be filtered. Additionally, data in the wild has varying ranges: One variable might have a maximum in the millions, while another might have a range of –0.1 to –0.001. For machine learning, variables must be transformed to standardized ranges to keep the ones with large ranges from dominating the model. Exactly which standardized range depends on the algorithm used for the model.
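As a minimal sketch of the range problem just described, the following rescales two made-up variables, one in the millions and one between –0.1 and –0.001. Min-max scaling and standardization are two common choices; which one fits depends on the algorithm, and the sample values here are invented for illustration.

```python
from statistics import mean, pstdev


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly into the range [low, high]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Constant column: map everything to the lower bound.
        return [low for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up example columns with very different ranges.
revenue = [1_200_000.0, 3_500_000.0, 800_000.0, 2_000_000.0]
rate = [-0.1, -0.05, -0.001, -0.02]

revenue_scaled = min_max_scale(revenue)
rate_scaled = min_max_scale(rate)
revenue_standardized = standardize(revenue)
```

After scaling, both columns occupy the same range, so neither dominates simply because of its units.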

AWS Glue is an Apache Spark-based serverless ETL engine. AWS also offers Amazon EMR, a big data platform that can run Apache Spark, and Amazon Redshift Spectrum, which supports ELT from an Amazon S3-based data lake.
Google Cloud Data Fusion, Dataflow, and Dataproc are useful for ETL and ELT.
Microsoft’s Azure Data Factory and Azure Synapse can do both ETL and ELT.
Third-party self-service ETL/ELT products such as Trifacta can also be used on the clouds.

Support an online environment for model building: The conventional wisdom used to be that you should import your data to your desktop for model building. The sheer quantity of data needed to build good machine learning and deep learning models changes the picture: You can download a small sample of data to your desktop for exploratory data analysis and model building, but for production models you need to have access to the full data.

Web-based development environments such as Jupyter Notebooks, JupyterLab, and Apache Zeppelin are well suited for model building. If your data is in the same cloud as the notebook environment, you can bring the analysis to the data, minimizing the time-consuming movement of data. Notebooks can also be used for ELT as part of the pipeline.

Amazon SageMaker lets you build, train, and deploy machine learning and deep learning models for any use case with fully managed infrastructure, tools, and workflows. SageMaker Studio is based on JupyterLab.
Google Cloud Vertex AI lets you build, deploy, and scale machine learning models faster, with pretrained models and custom tooling within a unified artificial intelligence platform. Through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. Vertex AI also integrates with widely used open source frameworks such as PyTorch, Scikit-learn, and TensorFlow, and it supports all machine learning frameworks and artificial intelligence branches via custom containers for training and prediction.
Microsoft Azure Machine Learning is an end-to-end, scalable, trusted AI platform with experimentation and model management. Azure Machine Learning Studio includes Jupyter Notebooks, a drag-and-drop machine learning pipeline designer, and an automated machine learning (AutoML) facility. Azure Databricks is an Apache Spark-based analytics platform. Azure Data Science Virtual Machines make it easy for advanced data scientists to set up machine learning and deep learning development environments.

Support scale-up and scale-out training: The compute and memory requirements of notebooks are generally minimal, except for training models. It helps a lot if a notebook can spawn training jobs that run on multiple large virtual machines or containers. It also helps a lot if the training can access accelerators such as GPUs, TPUs, and FPGAs; these can turn days of training into hours.

Amazon SageMaker supports a wide range of virtual machine (VM) sizes; GPUs and other accelerators including Nvidia A100s, Habana Gaudi, and AWS Trainium; a model compiler; and distributed training using either data parallelism or model parallelism.
Google Cloud Vertex AI supports a wide range of VM sizes; GPUs and other accelerators including Nvidia A100s and Google TPUs; and distributed training using either data parallelism or model parallelism, with an optional reduction server.
Microsoft Azure Machine Learning supports a wide range of VM sizes; GPUs and other accelerators including Nvidia A100s and Intel FPGAs; and distributed training using either data parallelism or model parallelism.
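The idea behind data parallelism, which all three platforms list, can be sketched in a few lines. This is a single-process toy, not any vendor's API: the two "workers" are just list slices, and the averaging step stands in for the all-reduce that a real distributed framework would perform.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, learning_rate):
    """One synchronous step: every worker computes a gradient on its own shard,
    then the gradients are averaged before a single shared update."""
    local_grads = [gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(local_grads) / len(local_grads)
    return w - learning_rate * avg_grad


# Toy data generated from y = 3x, split evenly across two hypothetical workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[:4], data[4:]]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, learning_rate=0.01)
```

Because the shards are the same size, the averaged gradient equals the gradient over the full data set, so the result matches single-machine training while each worker touches only half the data. Model parallelism, by contrast, splits the model itself across devices and is not shown here.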

Support AutoML and automated feature engineering: Not everyone is good at picking machine learning models, selecting features (the variables that are used by the model), and engineering new features from the raw observations. Even if you’re good at those tasks, they are time-consuming and can be automated to a large extent.

AutoML systems often try many models to see which result in the best objective function values, for example the minimum squared error for regression problems. The best AutoML systems can also perform feature engineering, and use their resources effectively to pursue the best possible models with the best possible sets of features.
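A heavily simplified sketch of that search loop follows: fit several candidate models, score each on held-out data by mean squared error, and keep the winner. The three candidates and the data points are invented for illustration; real AutoML systems search far larger spaces and also tune hyperparameters and features.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mu = mean(ys)
    return lambda x: mu


def fit_linear(xs, ys):
    """Candidate: ordinary least squares line through the training points."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def fit_nearest(xs, ys):
    """Candidate: predict the target of the nearest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Objective function: mean squared error on the given data."""
    return mean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])


def auto_select(candidates, train, valid):
    """Fit every candidate on the training data and rank by validation error."""
    scores = {name: mse(fit(*train), *valid) for name, fit in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Made-up data that roughly follows y = 2x + 1.
train = ([0.0, 1.0, 2.0, 3.0, 4.0], [1.0, 3.1, 4.9, 7.2, 9.0])
valid = ([0.5, 1.5, 2.5, 3.5], [2.0, 4.0, 6.0, 8.0])

candidates = {"mean": fit_mean, "linear": fit_linear, "nearest": fit_nearest}
best_name, scores = auto_select(candidates, train, valid)
```

On this toy data the linear candidate has the lowest validation error, so it is the one selected.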

Amazon SageMaker Autopilot provides AutoML and hyperparameter tuning, which can use Hyperband as a search strategy.
Google Cloud Vertex AI supplies AutoML, and so do Google’s specialized AutoML services for structured data, sight, and language, although Google tends to lump AutoML in with transfer learning in some cases.
Microsoft Azure Machine Learning and Azure Databricks both provide AutoML, as does Apache Spark in Azure HDInsight.
DataRobot, Dataiku, and H2O.ai Driverless AI all offer AutoML with automated feature engineering and hyperparameter tuning.

Support the best machine learning and deep learning frameworks: Most data scientists have favorite frameworks and programming languages for machine learning and deep learning.

For those who prefer Python, Scikit-learn is often a favorite for machine learning, while Keras, MXNet, PyTorch, and TensorFlow are often top picks for deep learning.
In Scala, Spark MLlib tends to be preferred for machine learning.
In R, there are many native machine learning packages, and a good interface to Python.
In Java, H2O.ai rates highly, as do Java-ML and Deep Java Library.

The cloud machine learning and deep learning platforms tend to have their own collection of algorithms, and they often support external frameworks in at least one language or as containers with specific entry points. In some cases you can integrate your own algorithms and statistical methods with the platform’s AutoML facilities, which is quite convenient.

Some cloud platforms also offer their own tuned versions of major deep learning frameworks. For example, AWS has an optimized version of TensorFlow that it claims can achieve nearly linear scalability for deep neural network training. Similarly, Google Cloud offers TensorFlow Enterprise.

Offer pretrained models and support transfer learning: Not everyone wants to spend the time and compute resources to train their own models, nor should they when pretrained models are available. For example, the ImageNet data set is huge, and training a state-of-the-art deep neural network against it can take weeks, so it makes sense to use a pretrained model for it when you can.

On the other hand, pretrained models may not always identify the objects you care about. Transfer learning can help you customize the last few layers of the neural network for your specific data set without the time and expense of training the full network.
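The mechanics can be illustrated with a toy in plain Python. This is only a conceptual sketch: the "pretrained" weights are made-up constants standing in for a real network body, and the head is a single logistic layer. In practice you would do this with a deep learning framework and a real pretrained model.

```python
import math

# Stand-in for a pretrained network body. These values are hypothetical
# and are never updated during training ("frozen").
FROZEN_WEIGHTS = ((1.0, -1.0), (0.5, 0.5))


def frozen_features(x):
    """Frozen layers: a fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=2000, lr=0.1):
    """Train only the final logistic layer on top of the frozen features."""
    feats = [frozen_features(x) for x in samples]  # computed once; the base is fixed
    weights, bias = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            error = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) - y
            weights = [w - lr * error * fi for w, fi in zip(weights, f)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    f = frozen_features(x)
    return 1 if sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) >= 0.5 else 0


# Tiny made-up task: label is 1 when the first input exceeds the second.
samples = [(2.0, 0.0), (3.0, 1.0), (0.0, 2.0), (1.0, 3.0)]
labels = [1, 1, 0, 0]
head_weights, head_bias = train_head(samples, labels)
```

Only `head_weights` and `head_bias` change during training, which is why this approach needs far less data and compute than training the full network.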

All major deep learning frameworks and cloud service providers support transfer learning at some level. There are differences; one major difference is that Azure can customize some kinds of models with tens of labeled exemplars, versus hundreds or thousands for some of the other platforms.

Offer tuned, pretrained AI services: The major cloud platforms offer strong, tuned AI services for many applications, not just image identification. Examples include language translation, speech to text, text to speech, forecasting, and recommendations.

These services have already been trained and tested on more data than is usually available to businesses. They are also already deployed on service endpoints with enough computational resources, including accelerators, to ensure good response times under worldwide load.

The differences among the services offered by the big three tend to be down in the weeds. One area of active development is services for edge computing, including machine learning that resides on devices such as cameras and communicates with the cloud.

Manage your experiments: The only way to find the best model for your data set is to try everything, whether manually or using AutoML. That leaves another problem: managing your experiments.

A good cloud machine learning platform will give you a way to see and compare the objective function values of each experiment for both the training and test data, as well as the size of the model and the confusion matrix. Being able to graph all of that is a definite plus.
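As a rough sketch of what such tracking records, the following keeps the training loss, test loss, and model size for each run, ranks the runs, and builds a confusion matrix. The `Experiment` class, run names, and numbers are all hypothetical; the platforms and third-party tools store far more detail.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    name: str
    train_loss: float
    test_loss: float
    model_size_mb: float


def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def leaderboard(experiments):
    """Rank runs by test loss, breaking ties in favor of the smaller model."""
    return sorted(experiments, key=lambda e: (e.test_loss, e.model_size_mb))


# Made-up runs for illustration.
runs = [
    Experiment("baseline", 0.42, 0.45, 12.0),
    Experiment("deep-net", 0.05, 0.38, 240.0),  # large train/test gap: possible overfit
    Experiment("boosted", 0.20, 0.24, 35.0),
]

best = leaderboard(runs)[0]
matrix = confusion_matrix(
    actual=["cat", "dog", "dog", "cat"],
    predicted=["cat", "dog", "cat", "cat"],
    labels=["cat", "dog"],
)
```

Comparing the training and test losses side by side is what exposes the overfit run, which looks best on training data alone.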

In addition to the experiment tracking built into Amazon SageMaker, Google Cloud Vertex AI, and Microsoft Azure Machine Learning, you can use third-party products such as Neptune.ai, Weights & Biases, Sacred plus Omniboard, and MLflow. Most of these are free for at least personal use, and some are open source.

Support model deployment for prediction: Once you have a way of picking the best experiment given your criteria, you also need an easy way to deploy the model. If you deploy multiple models for the same purpose, you’ll also need a way to apportion traffic among them for A/B testing.
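One common way to apportion traffic is to hash a stable request key into a bucket, so the same user always reaches the same model. The sketch below assumes a 90/10 split and hypothetical variant names; managed endpoints typically handle this for you through configuration.

```python
import hashlib


def assign_variant(request_key: str, weights: dict) -> str:
    """Deterministically map a request key to a model variant.

    weights maps variant name -> share of traffic; shares are assumed to sum to 1.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for variant, share in weights.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return variant  # guard against floating-point rounding at the top end


split = {"model-a": 0.9, "model-b": 0.1}  # hypothetical variant names and shares
counts = {"model-a": 0, "model-b": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", split)] += 1
```

Hashing rather than drawing a random number per request keeps each user's experience consistent for the duration of the test.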

One sticking point is the cost of deploying an endpoint. There have been many changes in model deployment over the last few years. Ideally, low-volume prediction endpoints should be serverless, and high-volume prediction endpoints should be clustered and/or use accelerators.

Monitor prediction performance and data drift: Unfortunately for model builders, data can change over time. That means you can’t deploy a model and forget it. Instead, you need to monitor the data submitted for predictions over time for drift. When the data starts changing significantly from the baseline of your original training data set, you’ll need to retrain your model.
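One simple way to quantify drift for a single numeric feature is the Population Stability Index (PSI), sketched below. PSI is one technique among several and is not necessarily what any of the cloud services use; the bin count and the 0.25 alert threshold are common rules of thumb, assumed here for illustration.

```python
import math


def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the baseline range land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    expected, actual = proportions(baseline), proportions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Made-up data: the training baseline, and production data shifted upward.
baseline = [i / 100 for i in range(1000)]
shifted = [v + 3 for v in baseline]

DRIFT_THRESHOLD = 0.25  # assumed rule-of-thumb cutoff
needs_retraining = psi(baseline, shifted) > DRIFT_THRESHOLD
```

A scheduled job could compute this per feature against the training baseline and flag the model for retraining when the threshold is crossed.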

Amazon SageMaker Model Monitor does an especially good job of this, although it is restricted to tabular data.
Google Cloud Vertex AI Model Monitoring detects skew and drifts for tabular AutoML and tabular custom-trained models.
Microsoft Azure Machine Learning has a data-drift detection package.

Control costs: Finally, you need ways to control the costs incurred by your models. Deploying models for production inference often accounts for 90% of the cost of deep learning, while the training accounts for only 10% of the cost.

The best way to control prediction costs depends on your load and the complexity of your model. If you have a high load, you might be able to use an accelerator to avoid adding more virtual machine instances. If you have a variable load, you might be able to dynamically change your size or number of instances or containers as the load goes up or down. And if you have a low or occasional load, you might be able to use a very small instance with a partial accelerator to handle the predictions.
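The variable-load case lends itself to a quick calculation. The sketch below compares scaling the instance count with the load against provisioning for the peak; the load profile, per-instance capacity, and price are made-up numbers, not any provider's rates.

```python
import math


def instances_needed(requests_per_second, capacity_per_instance,
                     min_instances=1, max_instances=20):
    """Smallest instance count that covers the load, clamped to a configured range."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(needed, max_instances))


def hourly_cost(instance_count, price_per_instance_hour):
    return instance_count * price_per_instance_hour


# Hypothetical hourly load profile (requests per second) and assumed pricing.
load_profile = [5, 40, 220, 900, 300, 10]
CAPACITY = 100  # assumed requests per second one instance can serve
PRICE = 0.50    # assumed price per instance-hour

plan = [instances_needed(rps, CAPACITY) for rps in load_profile]
autoscaled_cost = sum(hourly_cost(n, PRICE) for n in plan)
fixed_cost = hourly_cost(max(plan), PRICE) * len(load_profile)
```

With these numbers, scaling with the load costs a third of what a fleet sized for the peak would cost over the same six hours.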

Leading vendors for machine learning platforms

The preceding section noted the key vendors and products for each of the critical aspects of a machine learning platform. This section summarizes the providers’ offerings as a quick reference. The big three cloud providers provide the bulk of the tools in their platforms.

Amazon Web Services: Amazon Redshift ML enables the creation of machine learning models via SQL for the Redshift data warehouse service. AWS Outposts and AWS Local Zones provide nearer access for such services. For ETL and ELT, AWS Glue is an Apache Spark-based serverless ETL engine. AWS also offers Amazon EMR, a big data platform that can run Apache Spark, and Amazon Redshift Spectrum, which supports ELT from an Amazon S3-based data lake. For model building, Amazon SageMaker lets you build, train, and deploy machine learning and deep learning models; SageMaker Studio is based on the popular JupyterLab web-based development environment. For scale-up training, Amazon SageMaker supports a wide range of virtual machine (VM) sizes; GPUs and other accelerators including Nvidia A100s, Habana Gaudi, and AWS Trainium. For AutoML, Amazon SageMaker Autopilot provides AutoML and hyperparameter tuning. For model prediction performance, AWS offers Amazon SageMaker Model Monitor, although it is restricted to tabular data.

Google Cloud: BigQuery ML enables the creation of machine learning models via SQL for the BigQuery data warehouse. Google Distributed Cloud Virtual provides nearer access for such services, and Anthos supports on-premises usage. Google Cloud Data Fusion, Dataflow, and Dataproc are useful for ETL and ELT. Google Cloud Vertex AI lets you build, deploy, and scale machine learning models faster. For model building, through Vertex AI Workbench, Vertex AI is natively integrated with BigQuery, Dataproc, and Spark. Vertex AI also integrates with widely used open source frameworks such as PyTorch, Scikit-learn, and TensorFlow. For scale-up training, Google Cloud Vertex AI supports a wide range of VM sizes; GPUs and other accelerators including Nvidia A100s and Google TPUs. For AutoML, Google has Google Cloud Vertex AI, though Google tends to lump AutoML in with transfer learning in some cases. For model prediction performance, Google has Google Cloud Vertex AI Model Monitoring to detect skew and drifts for tabular AutoML and tabular custom-trained models.

IBM: Db2 Warehouse on Cloud includes a wide set of in-database SQL analytics that includes some basic machine learning functionality, plus in-database support for Python and R.

Microsoft Azure: SQL Server Machine Learning Services supports the creation of machine learning models for Microsoft SQL Server Big Data Clusters via several languages: Java, Python, R, the Predict T-SQL command, and the rx_Predict stored procedure. Microsoft provides Azure Stack Edge and Azure Arc for nearer access. Microsoft’s Azure Data Factory and Azure Synapse can do both ETL and ELT. For model building, Azure Machine Learning Studio includes Jupyter Notebooks, a drag-and-drop machine learning pipeline designer, and an AutoML facility. Azure Databricks is an Apache Spark-based analytics platform. Azure Data Science Virtual Machines make it easy for advanced data scientists to set up machine learning and deep learning development environments. For scale-up training, Microsoft Azure Machine Learning supports a wide range of VM sizes; GPUs and other accelerators including Nvidia A100s and Intel FPGAs. For AutoML, Microsoft has Azure Machine Learning, Azure Databricks, and Apache Spark in Azure HDInsight. For model prediction performance, Microsoft Azure Machine Learning has a data-drift detection package.

Oracle: Oracle Cloud Infrastructure (OCI) Data Science is a managed and serverless platform for data science teams to build, train, and manage machine learning models in an Oracle environment.

AutoML for automated feature engineering and hyperparameter tuning: In addition to those from AWS, Google Cloud, and Microsoft Azure, you can consider stand-alone tools from DataRobot, Dataiku, and H2O.ai Driverless AI.

Machine learning and deep learning frameworks: For those who prefer Python, Scikit-learn is often a favorite for machine learning, while Keras, MXNet, PyTorch, and TensorFlow are often top picks for deep learning. In Scala, Spark MLlib tends to be preferred for machine learning. In R, there are many native machine learning packages, and a good interface to Python. In Java, H2O.ai rates highly, as do Java-ML and Deep Java Library.
