LLM Infrastructure

Inference

Inference is the process of running a trained machine learning model on new input data to generate predictions, classifications, or text outputs in real time.

What is Inference?

Inference is the process of running a trained machine learning model on new input data to generate predictions, classifications, or text outputs in real time. Unlike training (which adjusts model weights over many iterations), inference uses fixed weights to process a single input and produce an output. For language models, inference means generating a response to a prompt — the operation that occurs every time you interact with an AI assistant.

How does Inference work?

For language models, inference proceeds in two phases: prefill and decode. During prefill, the model processes the entire input prompt in parallel, computing the internal representations for all input tokens simultaneously. During decode, the model generates output tokens one at a time, with each new token depending on all previous tokens.
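The split is easy to see in a bare-bones generation loop. The sketch below uses Hugging Face transformers with GPT-2 and greedy sampling purely as an illustration (not a production setup): the whole prompt goes through one parallel forward pass, and then tokens are generated one at a time against the growing KV cache.

```python
# Minimal sketch of the two inference phases (illustrative model and sampling).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the entire prompt in one parallel forward pass,
    # producing next-token logits and a KV cache entry for every prompt token.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode: generate one token at a time; each step feeds in only the newest
    # token and reuses (and extends) the KV cache from all previous steps.
    for _ in range(20):
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))
```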

Prefill is compute-bound (limited by GPU arithmetic throughput), while decode is memory-bound (limited by how fast model weights and KV cache values can be read from GPU memory). This distinction drives hardware selection: prefill benefits from more compute units, while decode benefits from higher memory bandwidth.
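A back-of-envelope calculation shows why decode is bandwidth-limited: at batch size 1, every model weight must be streamed from GPU memory for each generated token, so memory bandwidth divided by model size gives a rough upper bound on per-request throughput. The numbers below (a 7B-parameter model in FP16 on a GPU with roughly 2 TB/s of bandwidth) are illustrative assumptions, not a benchmark.

```python
# Back-of-envelope decode throughput for a single request, assuming decode is
# purely memory-bandwidth-bound and ignoring KV cache reads (illustrative numbers).
params = 7e9                # 7B-parameter model (assumed)
bytes_per_param = 2         # FP16 weights
hbm_bandwidth = 2.0e12      # ~2 TB/s, roughly an A100-80GB-class GPU (assumed)

bytes_per_token = params * bytes_per_param        # every weight read once per token
max_tokens_per_s = hbm_bandwidth / bytes_per_token
print(f"upper bound: ~{max_tokens_per_s:.0f} tokens/s per request")  # ~143 tokens/s
```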

Inference optimization is a critical engineering discipline. Techniques include batching (processing multiple requests simultaneously to amortize fixed costs), quantization (reducing weight precision from 16-bit to 4-bit to fit larger models on smaller GPUs), speculative decoding (using a small model to draft tokens that a large model verifies), and KV cache management (reusing computed attention states across requests with shared prefixes).
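As one concrete example, the toy sketch below mirrors the shape of speculative decoding: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is accepted. Both "models" here are stand-in functions over a tiny vocabulary, and a real implementation verifies all drafted positions in a single parallel forward pass rather than one at a time.

```python
# Toy sketch of greedy speculative decoding (draft, verify, correct).
import random

def draft_next(context):
    # Cheap draft model (stand-in): a deterministic toy rule over a 100-token vocab.
    return (sum(context) * 7 + 3) % 100

def target_next(context):
    # Expensive target model (stand-in): agrees with the draft ~80% of the time.
    guess = draft_next(context)
    return guess if random.random() < 0.8 else (guess + 1) % 100

def speculative_step(context, k=4):
    # 1) Draft: the small model proposes k tokens autoregressively.
    draft, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) Verify: accept the longest prefix the target model agrees with.
    #    (A real system scores all k positions in one parallel forward pass.)
    accepted, ctx = [], list(context)
    for token in draft:
        if target_next(ctx) == token:
            accepted.append(token)
            ctx.append(token)
        else:
            break

    # 3) If a draft token was rejected, emit the target model's own token instead,
    #    so every step makes progress even when the draft is wrong.
    if len(accepted) < k:
        accepted.append(target_next(list(context) + accepted))
    return accepted

print(speculative_step([1, 2, 3]))
```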

Why does Inference matter?

Inference cost and speed determine whether an AI application is economically viable and responsive enough for users. A model that costs $0.01 per request with 500 ms latency enables different use cases than one costing $0.50 per request with 10-second latency. The gap between a research demo and a production system is largely a matter of inference optimization.

The inference market is massive: enterprises collectively spend billions annually on inference compute, dwarfing training costs. For every dollar spent training a model, $10-100 is spent running inference over its lifetime — making inference efficiency the dominant factor in AI economics.

Best practices for Inference

  • Measure time-to-first-token and tokens-per-second separately, as they reflect different optimization opportunities (see the sketch after this list)
  • Use prompt caching to avoid reprocessing identical system prompts across requests
  • Implement request batching during high-traffic periods to maximize GPU utilization and reduce per-request cost
  • Set appropriate timeout limits so slow inference does not cascade into application-wide latency spikes
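
A minimal measurement harness for the first practice might look like the sketch below. Here stream_tokens is a placeholder for whatever streaming client you use, not a specific SDK call; the fake stream at the bottom just exercises the timing logic.

```python
# Sketch: measure time-to-first-token (TTFT) and decode tokens/s separately,
# assuming a streaming client that yields tokens as they arrive.
import time

def measure_stream(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now          # end of prefill, start of decode
        count += 1
    end = time.perf_counter()

    ttft = first_token_at - start                                   # dominated by prefill
    decode_rate = (count - 1) / (end - first_token_at) if count > 1 else 0.0
    return ttft, decode_rate

# Fake stream for illustration: waits 0.3 s, then emits 20 tokens at ~50 tok/s.
def fake_stream(prompt):
    time.sleep(0.3)
    for _ in range(20):
        time.sleep(0.02)
        yield "tok"

ttft, tps = measure_stream(fake_stream, "hello")
print(f"TTFT: {ttft*1000:.0f} ms, decode: {tps:.0f} tokens/s")
```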

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.