Inference is the process of running a trained machine learning model on new input data to generate predictions, classifications, or text outputs in real time. Unlike training (which adjusts model weights over many iterations), inference uses fixed weights to process a single input and produce an output. For language models, inference means generating a response to a prompt — the operation that occurs every time you interact with an AI assistant.
For language models, inference proceeds in two phases: prefill and decode. During prefill, the model processes the entire input prompt in parallel, computing the internal representations for all input tokens simultaneously. During decode, the model generates output tokens one at a time, with each new token depending on all previous tokens.
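The two phases can be sketched as a toy generation loop. This is an illustrative stand-in, not a real model API: `toy_model` and `generate` are hypothetical names, and the "model" is just a deterministic scoring function.

```python
def toy_model(tokens):
    # Stand-in for a forward pass: in a real model this would be a
    # neural network producing next-token probabilities.
    vocab_size = 50
    return (sum(tokens) + len(tokens)) % vocab_size

def generate(prompt_tokens, max_new_tokens):
    # Prefill: a real model processes all prompt tokens in one parallel
    # pass, caching their attention states (the KV cache) for reuse.
    tokens = list(prompt_tokens)

    # Decode: output tokens are produced strictly one at a time;
    # each step conditions on every token generated so far.
    for _ in range(max_new_tokens):
        next_token = toy_model(tokens)
        tokens.append(next_token)
    return tokens

print(generate([3, 7, 11], 4))  # prompt of 3 tokens, then 4 decode steps
```

The asymmetry is visible in the structure: prefill is one pass over the whole prompt, while decode is a sequential loop whose length equals the number of output tokens.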
Prefill is compute-bound (limited by GPU processing speed), while decode is memory-bound (limited by how fast KV cache values can be read from GPU memory). This distinction drives hardware selection: prefill benefits from more compute units, while decode benefits from higher memory bandwidth.
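A back-of-envelope calculation shows why decode is memory-bound. At batch size 1, every decode step must stream all model weights from GPU memory, so memory bandwidth sets a hard ceiling on tokens per second. The figures below (a 7B-parameter model at 16-bit, ~1 TB/s of memory bandwidth) are illustrative assumptions, not measurements of any particular GPU.

```python
# Assumed figures for a rough decode-speed ceiling.
weight_bytes = 7e9 * 2   # 7B parameters x 2 bytes (16-bit weights)
bandwidth = 1.0e12       # ~1 TB/s HBM bandwidth (assumed)

# Each decode step reads all weights once, so the upper bound is
# bandwidth divided by bytes read per token (ignoring KV cache reads).
max_tokens_per_sec = bandwidth / weight_bytes
print(f"~{max_tokens_per_sec:.0f} tokens/sec ceiling")
```

This is why quantization helps decode directly: halving the bytes per weight roughly doubles the bandwidth-limited ceiling, independent of compute speed.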
Inference optimization is a critical engineering discipline. Techniques include batching (processing multiple requests simultaneously to amortize fixed costs), quantization (reducing weight precision from 16-bit to 4-bit to fit larger models on smaller GPUs), speculative decoding (using a small model to draft tokens that a large model verifies), and KV cache management (reusing computed attention states across requests with shared prefixes).
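Of these techniques, speculative decoding is the least obvious, so here is a minimal sketch of its accept/reject logic using deterministic (greedy) toy models. `draft_model`, `target_model`, and `speculative_step` are hypothetical stand-ins, not a real library API; a production implementation verifies all draft tokens in a single parallel forward pass, which is where the speedup comes from.

```python
def target_model(tokens):
    # Stand-in for the expensive "large" model (greedy next token).
    return (sum(tokens) + 1) % 10

def draft_model(tokens):
    # Stand-in for the cheap "small" model: agrees with the target
    # except when the running sum is even.
    s = sum(tokens)
    return (s + 1) % 10 if s % 2 else (s + 2) % 10

def speculative_step(tokens, k=4):
    # 1. Draft k tokens cheaply, one at a time.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft))
    proposed = draft[len(tokens):]

    # 2. Verify the drafts against the target model (in a real system,
    #    all k verifications happen in one parallel pass).
    accepted = []
    context = list(tokens)
    for t in proposed:
        expected = target_model(context)
        if t != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            break
        accepted.append(t)
        context.append(t)
    return tokens + accepted
```

When the draft model agrees with the target, one step yields up to k tokens for a single (parallel) pass of the large model; when it disagrees, output is identical to plain decoding, so correctness is preserved.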
Inference cost and speed determine whether an AI application is economically viable and responsive enough for users. A model that costs $0.01 per request with 500 ms latency enables different use cases than one costing $0.50 per request with 10-second latency. The gap between a research demo and a production system is almost entirely about inference optimization.
The inference market is massive: enterprises collectively spend billions annually on inference compute, dwarfing training costs. For every dollar spent training a model, $10-100 is spent running inference over its lifetime — making inference efficiency the dominant factor in AI economics.
Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.