LLM Architecture

Quantization

What is Quantization?

Quantization reduces neural network memory usage and accelerates inference by converting model weights from high-precision floating point to lower-precision integer representations. A model stored in 16-bit floats (FP16) consumes half the memory of 32-bit (FP32), while 4-bit quantization achieves 8x compression. This enables running large language models on consumer GPUs and mobile devices that lack the memory for full-precision weights, democratizing access to powerful AI models.

How does Quantization work?

Quantization maps continuous floating-point values to a discrete set of integers. Post-training quantization (PTQ) applies this conversion after training by analyzing weight distributions and choosing scaling factors that minimize information loss. The simplest approach applies uniform quantization per-tensor, while more sophisticated methods use per-channel or per-group scaling for better accuracy.
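
A minimal sketch of per-tensor symmetric quantization, using NumPy; the function names and the simple round-to-nearest scheme are illustrative assumptions rather than any particular library's API:

    import numpy as np

    def quantize_per_tensor(weights: np.ndarray, num_bits: int = 8):
        # Symmetric uniform quantization: one scaling factor for the whole tensor.
        qmax = 2 ** (num_bits - 1) - 1              # 127 for signed 8-bit
        scale = np.abs(weights).max() / qmax
        q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        # Recover approximate float weights for computation.
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)
    q, scale = quantize_per_tensor(w)
    mean_error = np.abs(w - dequantize(q, scale)).mean()

Per-channel and per-group variants compute a separate scale for each output channel or each small block of weights, so a single outlier cannot stretch the scale for the entire tensor.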

Quantization-aware training (QAT) simulates low-precision arithmetic during training, allowing the model to adapt its weights to quantization constraints. This typically preserves more accuracy than PTQ but requires retraining.
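
A sketch of the "fake quantization" trick QAT relies on, assuming PyTorch: the forward pass uses rounded weights so the loss sees quantization error, while a straight-through estimator passes gradients to the underlying full-precision weights (the module below is illustrative, not a specific framework's QAT API):

    import torch
    import torch.nn as nn

    class FakeQuantLinear(nn.Module):
        # Linear layer that simulates signed 8-bit weight quantization during training.
        def __init__(self, in_features, out_features, num_bits=8):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
            self.qmax = 2 ** (num_bits - 1) - 1

        def forward(self, x):
            scale = self.weight.abs().max() / self.qmax
            w_q = torch.round(self.weight / scale).clamp(-self.qmax - 1, self.qmax) * scale
            # Straight-through estimator: forward uses quantized weights,
            # backward treats the rounding as identity so gradients reach self.weight.
            w_ste = self.weight + (w_q - self.weight).detach()
            return nn.functional.linear(x, w_ste)

    layer = FakeQuantLinear(512, 512)
    loss = layer(torch.randn(8, 512)).sum()
    loss.backward()                              # gradients flow to the full-precision weights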

GPTQ and AWQ are popular post-training quantization methods for LLMs, and GGUF is a widely used file format (from the llama.cpp ecosystem) for distributing quantized models. GPTQ uses calibration data to minimize layer-wise reconstruction error. AWQ (Activation-aware Weight Quantization) protects salient weights that disproportionately affect activations. Mixed-precision quantization keeps sensitive layers (embeddings, attention) at higher precision while aggressively compressing feed-forward layers.
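
These methods build on group-wise scaling: each small block of weights (a group size of 128 is a common choice) gets its own scale, so an outlier only distorts its local group. A rough NumPy sketch of the idea, not GPTQ's or AWQ's actual algorithm:

    import numpy as np

    def quantize_groupwise(row: np.ndarray, group_size: int = 128, num_bits: int = 4):
        # One scale per contiguous group of weights; assumes len(row) % group_size == 0.
        qmax = 2 ** (num_bits - 1) - 1              # 7 for signed 4-bit
        groups = row.reshape(-1, group_size)
        scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
        q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
        return q, scales                            # real formats pack two 4-bit values per byte

    row = np.random.randn(4096).astype(np.float32)
    q, scales = quantize_groupwise(row)             # 32 groups, each with its own scale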

The precision-performance tradeoff varies by model size: larger models tolerate aggressive quantization better because they have more redundant parameters.

Why does Quantization matter?

Quantization determines whether a model fits on available hardware. A 70B parameter model requires 140GB in FP16 (multiple GPUs) but only 35GB in 4-bit (single GPU). This directly impacts inference cost, latency, and deployment flexibility. For production AI services, quantization often provides 2-4x throughput improvement with negligible quality loss.
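
The arithmetic behind those figures, as a quick sketch (weights only; activations, the KV cache, and per-group scale metadata add overhead on top):

    def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
        # Approximate weight storage in gigabytes (1 GB = 1e9 bytes).
        return num_params * bits_per_weight / 8 / 1e9

    params = 70e9
    print(weight_memory_gb(params, 16))   # FP16:  140.0 GB
    print(weight_memory_gb(params, 4))    # 4-bit:  35.0 GB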

Best practices for Quantization

  • Benchmark quantized models on your specific evaluation set, not just general benchmarks, since sensitivity varies by domain
  • Use calibration datasets representative of production traffic when applying post-training quantization methods
  • Keep embedding and output layers at higher precision since they disproportionately impact generation quality
  • Monitor perplexity degradation across quantization levels to find the optimal precision-accuracy tradeoff for your use case (a minimal check is sketched after this list)
  • Test edge cases and long-form generation specifically, as quantization errors can compound over many generation steps
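
One way to run that perplexity check, sketched with Hugging Face transformers; the checkpoint names and sample file are placeholders, and loading a quantized checkpoint may require the corresponding backend to be installed:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(checkpoint: str, text: str) -> float:
        # Perplexity of a causal LM on a text sample (lower is better).
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForCausalLM.from_pretrained(checkpoint)
        model.eval()
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss    # mean cross-entropy over tokens
        return math.exp(loss.item())

    sample = open("domain_sample.txt").read()               # text representative of your traffic
    print(perplexity("your-org/model-fp16", sample))        # placeholder checkpoints
    print(perplexity("your-org/model-gptq-4bit", sample))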

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.