LLM Architecture

Quantization

What is Quantization?

Quantization reduces neural network memory usage and accelerates inference by converting model weights from high-precision floating point to lower-precision integer representations. A model stored in 16-bit floats (FP16) consumes half the memory of 32-bit (FP32), while 4-bit quantization achieves 8x compression. This enables running large language models on consumer GPUs and mobile devices that lack the memory for full-precision weights, democratizing access to powerful AI models.

How does Quantization work?

Quantization maps continuous floating-point values to a discrete set of integers. Post-training quantization (PTQ) applies this conversion after training by analyzing weight distributions and choosing scaling factors that minimize information loss. The simplest approach applies uniform quantization per-tensor, while more sophisticated methods use per-channel or per-group scaling for better accuracy.
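
A minimal sketch of per-tensor symmetric quantization, using NumPy; the function names and the simple round-to-nearest scheme are illustrative assumptions rather than any particular library's API:

    import numpy as np

    def quantize_per_tensor(weights: np.ndarray, num_bits: int = 8):
        # Symmetric uniform quantization: one scaling factor for the whole tensor.
        qmax = 2 ** (num_bits - 1) - 1              # 127 for signed 8-bit
        scale = np.abs(weights).max() / qmax
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        # Recover approximate float weights for computation.
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, scale = quantize_per_tensor(w)
    mean_error = np.abs(w - dequantize(q, scale)).mean()

Per-channel and per-group variants compute a separate scale for each output channel or each small block of weights, so a single outlier cannot stretch the scale for the entire tensor.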

Quantization-aware training (QAT) simulates low-precision arithmetic during training, allowing the model to adapt its weights to quantization constraints. This typically preserves more accuracy than PTQ but requires retraining.
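
A sketch of the "fake quantization" trick QAT relies on, assuming PyTorch: the forward pass uses rounded weights so the loss sees quantization error, while a straight-through estimator passes gradients to the underlying full-precision weights (the module below is illustrative, not a specific framework's QAT API):

    import torch
    import torch.nn as nn

    class FakeQuantLinear(nn.Module):
        # Linear layer that simulates signed 8-bit weight quantization during training.
        def __init__(self, in_features, out_features, num_bits=8):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
            self.qmax = 2 ** (num_bits - 1) - 1

        def forward(self, x):
            scale = self.weight.abs().max() / self.qmax
            w_q = torch.round(self.weight / scale).clamp(-self.qmax - 1, self.qmax) * scale
            # Straight-through estimator: forward uses quantized weights,
            # backward treats the rounding as identity so gradients reach self.weight.
            w_ste = self.weight + (w_q - self.weight).detach()
            return nn.functional.linear(x, w_ste)

    layer = FakeQuantLinear(512, 512)
    loss = layer(torch.randn(8, 512)).sum()
    loss.backward()                              # gradients flow to the full-precision weights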

GPTQ and AWQ are popular post-training quantization methods for LLMs, and GGUF is a widely used file format (from the llama.cpp ecosystem) for distributing quantized models. GPTQ uses calibration data to minimize layer-wise reconstruction error. AWQ (Activation-aware Weight Quantization) protects salient weights that disproportionately affect activations. Mixed-precision quantization keeps sensitive layers (embeddings, attention) at higher precision while aggressively compressing feed-forward layers.
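
These methods build on group-wise scaling: each small block of weights (a group size of 128 is a common choice) gets its own scale, so an outlier only distorts its local group. A rough NumPy sketch of the idea, not GPTQ's or AWQ's actual algorithm:

    import numpy as np

    def quantize_groupwise(row: np.ndarray, group_size: int = 128, num_bits: int = 4):
        # One scale per contiguous group of weights; assumes len(row) % group_size == 0.
        qmax = 2 ** (num_bits - 1) - 1              # 7 for signed 4-bit
        groups = row.reshape(-1, group_size)
        scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
        q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
        return q, scales                            # real formats pack two 4-bit values per byte

    row = np.random.randn(4096).astype(np.float32)
    q, scales = quantize_groupwise(row)             # 32 groups, each with its own scale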

The precision-performance tradeoff varies by model size: larger models tolerate aggressive quantization better because they have more redundant parameters.

Why does Quantization matter?

Quantization determines whether a model fits on available hardware. A 70B parameter model requires 140GB in FP16 (multiple GPUs) but only 35GB in 4-bit (single GPU). This directly impacts inference cost, latency, and deployment flexibility. For production AI services, quantization often provides 2-4x throughput improvement with negligible quality loss.
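
The arithmetic behind those figures, as a quick sketch (weights only; activations, the KV cache, and per-group scale metadata add overhead on top):

    def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
        # Approximate weight storage in gigabytes (1 GB = 1e9 bytes).
        return num_params * bits_per_weight / 8 / 1e9

    params = 70e9
    print(weight_memory_gb(params, 16))   # FP16:  140.0 GB
    print(weight_memory_gb(params, 4))    # 4-bit:  35.0 GB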

Best practices for Quantization

  • Benchmark quantized models on your specific evaluation set, not just general benchmarks, since sensitivity varies by domain
  • Use calibration datasets representative of production traffic when applying post-training quantization methods
  • Keep embedding and output layers at higher precision since they disproportionately impact generation quality
  • Monitor perplexity degradation across quantization levels to find the optimal precision-accuracy tradeoff for your use case (a minimal check is sketched after this list)
  • Test edge cases and long-form generation specifically, as quantization errors can compound over many generation steps
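
One way to run that perplexity check, sketched with Hugging Face transformers; the checkpoint names and sample file are placeholders, and loading a quantized checkpoint may require the corresponding backend to be installed:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(checkpoint: str, text: str) -> float:
        # Perplexity of a causal LM on a text sample (lower is better).
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForCausalLM.from_pretrained(checkpoint)
        model.eval()
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss    # mean cross-entropy over tokens
        return math.exp(loss.item())

    sample = open("domain_sample.txt").read()               # text representative of your traffic
    print(perplexity("your-org/model-fp16", sample))        # placeholder checkpoints
    print(perplexity("your-org/model-gptq-4bit", sample))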

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.