martin_heller
Contributor

What is model quantization? Smaller, faster LLMs

feature
May 31, 2024 | 8 mins
Artificial Intelligence | Generative AI | Software Development

Reducing the precision of model weights can make deep neural networks run faster in less GPU memory, while preserving model accuracy.


If ever there were a salient example of a counter-intuitive technique, it would be quantization of neural networks. Quantization reduces the precision of the weights and other tensors in neural network models, often drastically. It’s no surprise that reducing the precision of weights and other parameters from, say, 32-bit floats to 8-bit integers makes the model run faster and allows it to run on less powerful processors with far less memory. The stunning, counter-intuitive finding is that quantization can be done while largely preserving the accuracy of the model.

Why do we need quantization? The current large language models (LLMs) are enormous. The best models need to run on a cluster of server-class GPUs; gone are the days when you could run a state-of-the-art model locally on one GPU and get quick results. Quantization not only makes it possible to run an LLM on a single GPU, it also allows you to run it on a CPU or on an edge device.

Post-training quantization

Post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy.

TensorFlow Lite documentation

Given how mature TensorFlow Lite is compared to, say, the Gen AI model du jour (probably Mistral AI’s Codestral, which was released the day I wrote this), it’s worth looking at how TensorFlow Lite implements quantization. First of all, TensorFlow Lite implements three options for quantization:

Technique | Benefits | Hardware
Dynamic range quantization | 4x smaller, 2x to 3x speedup | CPU
Full integer quantization | 4x smaller, 3x+ speedup | CPU, Edge TPU, Microcontrollers
Float16 quantization | 2x smaller, GPU acceleration | CPU, GPU

In the decision tree that accompanies this table, the TensorFlow Lite documenters outline the considerations for choosing a quantization technique. It’s worth reading through the logic. In a nutshell, the best post-training quantization method for your use case will depend on your hardware support for integer or floating point operations and whether you can provide a representative data set for calibration.

Dynamic range quantization

Then they explain why dynamic range quantization is the usual starting point: It provides reduced memory usage and faster computation without requiring you to provide a representative data set for calibration. Dynamic range quantization statically quantizes only the weights from floating point to integer at conversion time, which provides 8 bits of precision. Additionally, “dynamic-range” operators dynamically quantize activations based on their range to 8 bits and perform computations with 8-bit weights and activations. The outputs are still stored as floating-point values.
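To make that concrete, here is a minimal sketch of dynamic range quantization using the TensorFlow Lite converter; it assumes you already have a SavedModel on disk, and the path “saved_model_dir” is just a placeholder for your own model.

```python
import tensorflow as tf

# Minimal sketch: dynamic range quantization with the TFLite converter.
# "saved_model_dir" is a placeholder path for your own SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")

# Optimize.DEFAULT enables post-training quantization; with no
# representative dataset supplied, only the weights are quantized to int8.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```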

Full integer quantization

Full integer quantization can speed things up even more than dynamic range quantization, but you need to provide a representative data set for calibration (typically a few hundred samples) and run a few inference cycles so that you can capture the range of all floating-point tensors in the model. Those include not only model weights and biases, but also model input, activations (outputs of intermediate layers), and model output. Full integer quantization is essentially mandatory on integer-only devices, such as 8-bit microcontrollers, and integer-only accelerators, such as the Coral Edge TPU.
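A hedged sketch of that workflow with the TensorFlow Lite converter follows; the “calibration_images” collection is a stand-in for a few hundred of your own representative samples, not a real dataset.

```python
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# A representative dataset: a few hundred samples drawn from real inputs.
# "calibration_images" is a placeholder for your own data.
def representative_dataset():
    for sample in calibration_images[:300]:
        # The converter expects a list of input tensors per call.
        yield [np.expand_dims(sample, axis=0).astype(np.float32)]

converter.representative_dataset = representative_dataset

# Force integer-only operators, inputs, and outputs, which is what
# integer-only devices such as the Coral Edge TPU require.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
```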

Float16 quantization

Float16 quantization reduces model size by up to half, since all weights become half of their original size, and causes minimal loss in accuracy. It also supports some “delegates” (i.e., on-device accelerators such as a GPU) that can operate directly on float16 data. On the downside, float16 quantization does not reduce latency as much as quantization to fixed-point math. In addition, a float16 quantized model will “dequantize” the weight values to float32 when run on a CPU, which, together with the speed increase from the GPU itself, is a good reason to use a GPU delegate instead.
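The float16 path is the simplest of the three to set up. A minimal sketch, again assuming a placeholder SavedModel path:

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Restricting supported types to float16 halves the stored weights;
# a GPU delegate can run them directly, while a CPU dequantizes to float32.
converter.target_spec.supported_types = [tf.float16]

tflite_model = converter.convert()
```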

Quantization and model accuracy

As you might expect, accuracy may be an issue when you quantize a model. You can evaluate the accuracy of a quantized model against the original model, and decide whether the quantized model is sufficiently accurate for your purposes. For example, TensorFlow Lite offers three executables for checking the accuracy of quantized models. You might also consider MQBench, a benchmark and framework for evaluating quantization algorithms under real-world hardware deployments that uses PyTorch.
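One simple form of such a check is to run a held-out batch through both the original Keras model and the quantized TensorFlow Lite model and compare their accuracy and agreement. The sketch below is illustrative rather than one of the official tools; it assumes a quantized model that still takes float inputs (dynamic range or float16), and x_val and y_val are placeholders for your own validation data and labels.

```python
import numpy as np
import tensorflow as tf

def compare_models(keras_model, tflite_model_bytes, x_val, y_val):
    # Run the quantized model through the TFLite interpreter.
    interpreter = tf.lite.Interpreter(model_content=tflite_model_bytes)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    original_preds = np.argmax(keras_model.predict(x_val, verbose=0), axis=1)

    quantized_preds = []
    for sample in x_val:
        # Assumes float inputs; a full-integer model with int8 I/O would
        # also need scaling by the input's quantization parameters.
        interpreter.set_tensor(inp["index"],
                               np.expand_dims(sample, 0).astype(inp["dtype"]))
        interpreter.invoke()
        quantized_preds.append(np.argmax(interpreter.get_tensor(out["index"])))
    quantized_preds = np.array(quantized_preds)

    print("original accuracy:   ", np.mean(original_preds == y_val))
    print("quantized accuracy:  ", np.mean(quantized_preds == y_val))
    print("prediction agreement:", np.mean(original_preds == quantized_preds))
```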

If the degradation in accuracy from post-training quantization is too high, then one alternative is to use quantization aware training.

Quantization aware training

Quantization aware training (QAT) models the effects of quantization during training or fine-tuning, and produces a model with float32 weights that can then be quantized to integer weights and activations. The resulting quantized model is usually more accurate than a model produced by post-training quantization (PTQ) without taking quantization into account during training.

One quick way of understanding how and why QAT works is to look at when activation ranges are computed. For post-training dynamic quantization, the range for each activation is computed on the fly at run time. For post-training static quantization (called full integer quantization above), the range for each activation is computed in advance at quantization time, using observers to record the values of activations. For quantization aware training, the range for each activation is computed at training time, following the same idea as post-training static quantization. The twist is that in QAT, “fake quantize” operators are used instead of observers, not only to record values but also to simulate the error induced by quantization, so that the model can adapt to it.
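If you are in the TensorFlow ecosystem, one concrete way to try QAT is the TensorFlow Model Optimization toolkit, which wraps a Keras model with fake-quantize operators. The sketch below assumes you already have a trained Keras “model” and training and validation datasets (“train_ds” and “val_ds”); all three names are placeholders.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the model so that fake-quantize ops simulate int8 quantization
# during fine-tuning; the weights themselves remain float32.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

# A short fine-tuning run lets the float32 weights adapt to the
# error that int8 quantization will introduce.
q_aware_model.fit(train_ds, validation_data=val_ds, epochs=2)

# Convert the QAT model into an actually quantized TFLite model.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_tflite_model = converter.convert()
```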

1-bit LLMs

The obvious endpoint of the quantization trend is a reductio ad absurdum: 1-bit quantization. Surprisingly, 1-bit quantized models (introduced in the BitNet paper) actually work, and 1.58-bit models (we’ll explain that fraction of a bit momentarily) are even better. Both kinds of models were developed by a group from Microsoft Research and the Chinese Academy of Sciences.

First, 1-bit quantized models. Lest you get the wrong idea, no, BitNet 1-bit transformer models don’t reduce all the tensors in the model to 1 bit willy-nilly. Weights, and only weights, are binarized to either -1 or 1, after centralization to zero mean, and then the binarized weights are scaled to reduce the error introduced by binarization.

Activations are quantized to b-bit precision (the original paper used 8-bit precision) after some scaling and clipping. The model is modified to use BitLinear layers instead of nn.Linear layers, and a LayerNorm function is applied to the input of each BitLinear layer. In other words, a lot of work goes on to make the 1-bit quantized models competitive with the original models in accuracy, while being much smaller and faster.
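To make the weight binarization step concrete, here is a small NumPy sketch that follows the recipe described above: centralize the weights to zero mean, take the sign, and keep a per-tensor scaling factor. The function name and details are mine, for illustration, not the paper’s reference code.

```python
import numpy as np

def binarize_weights(w: np.ndarray):
    alpha = w.mean()                # centralize to zero mean
    w_binary = np.sign(w - alpha)   # each weight becomes -1 or +1
    w_binary[w_binary == 0] = 1     # np.sign(0) is 0; push it to +1
    beta = np.abs(w).mean()         # scale to reduce binarization error
    return w_binary, beta           # effective weight ~ beta * w_binary
```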

Now, about that 1.58 bit number. The paper “The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits” introduces a 1-bit LLM variant, called BitNet b1.58, in which every single weight of the LLM is ternary {-1, 0, 1}. The authors say that’s 1.58 bits, but they don’t show their calculation; it’s simply the information content of a three-valued weight, log2(3) ≈ 1.58 bits.

According to the paper, BitNet b1.58 “matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption.” The authors go on to say that “it enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.”

The architecture of BitNet b1.58 starts with BitNet, but uses a different quantization function to get to the ternary weight representation, and uses activation scaling to the range [−Qb,Qb] per token instead of the range [0,Qb]. To be more compatible with LLaMA-type models, BitNet b1.58 adds RMSNorm, SwiGLU, and rotary embedding.
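Here is a small NumPy sketch of a ternary quantization function in the spirit of what the paper describes: scale the weight matrix by its mean absolute value, then round each entry and clip it to {-1, 0, 1}. Again, this is an illustration under those assumptions, not the authors’ reference implementation.

```python
import numpy as np

def ternarize_weights(w: np.ndarray, eps: float = 1e-5):
    gamma = np.abs(w).mean()                          # per-tensor scale
    w_ternary = np.clip(np.round(w / (gamma + eps)), -1, 1)
    return w_ternary, gamma                           # effective weight ~ gamma * w_ternary
```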

The authors compared BitNet b1.58 to a reproduced FP16 LLaMA LLM in various sizes, trained from scratch on the RedPajama data set for 100 billion tokens. Their conclusion was that “BitNet b1.58 starts to match full precision LLaMA LLM at 3B model size in terms of perplexity, while being 2.71 times faster and using 3.55 times less GPU memory. In particular, BitNet b1.58 with a 3.9B model size is 2.4 times faster, consumes 3.32 times less memory, but performs significantly better than LLaMA LLM 3B.”

Smaller and faster LLMs

As we’ve seen, quantization can help solve some of the biggest problems with large language models: LLMs are too big and too slow to run on normal hardware, instead requiring clusters of GPUs in the cloud. Various quantization techniques help to different degrees, but the exciting and unexpected “one-bit” models (meaning both 1-bit binary and 1.58-bit ternary quantizations) are starting to break the logjam of escalating model sizes.


Martin Heller is a contributing editor and reviewer for InfoWorld. Formerly a web and Windows programming consultant, he developed databases, software, and websites from his office in Andover, Massachusetts, from 1986 to 2010. More recently, he has served as VP of technology and education at Alpha Software and chairman and CEO at Tubifi. Disclosure: He also writes for Hewlett-Packard’s TechBeacon marketing website.
