LLM Infrastructure

Inference

Inference is the process of running a trained machine learning model on new input data to generate predictions, classifications, or text outputs in real time.

What is Inference?

Inference is the process of running a trained machine learning model on new input data to generate predictions, classifications, or text outputs in real time. Unlike training (which adjusts model weights over many iterations), inference uses fixed weights to process a single input and produce an output. For language models, inference means generating a response to a prompt — the operation that occurs every time you interact with an AI assistant.

How does Inference work?

For language models, inference proceeds in two phases: prefill and decode. During prefill, the model processes the entire input prompt in parallel, computing the internal representations for all input tokens simultaneously. During decode, the model generates output tokens one at a time, with each new token depending on all previous tokens.
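The split is easy to see in a bare-bones generation loop. The sketch below uses Hugging Face transformers with GPT-2 and greedy sampling purely as an illustration (not a production setup): the whole prompt goes through one parallel forward pass, and then tokens are generated one at a time against the growing KV cache.

```python
# Minimal sketch of the two inference phases (illustrative model and sampling).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the entire prompt in one parallel forward pass,
    # producing next-token logits and a KV cache entry for every prompt token.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

    generated = [next_token]
    # Decode: generate one token at a time; each step feeds in only the newest
    # token and reuses (and extends) the KV cache from all previous steps.
    for _ in range(20):
        out = model(next_token, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))
```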

Prefill is compute-bound (limited by GPU arithmetic throughput), while decode is memory-bound (limited by how fast model weights and KV cache values can be read from GPU memory). This distinction drives hardware selection: prefill benefits from more compute units, while decode benefits from higher memory bandwidth.
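A back-of-envelope calculation shows why decode is bandwidth-limited: at batch size 1, every model weight must be streamed from GPU memory for each generated token, so memory bandwidth divided by model size gives a rough upper bound on per-request throughput. The numbers below (a 7B-parameter model in FP16 on a GPU with roughly 2 TB/s of bandwidth) are illustrative assumptions, not a benchmark.

```python
# Back-of-envelope decode throughput for a single request, assuming decode is
# purely memory-bandwidth-bound and ignoring KV cache reads (illustrative numbers).
params = 7e9                # 7B-parameter model (assumed)
bytes_per_param = 2         # FP16 weights
hbm_bandwidth = 2.0e12      # ~2 TB/s, roughly an A100-80GB-class GPU (assumed)

bytes_per_token = params * bytes_per_param        # every weight read once per token
max_tokens_per_s = hbm_bandwidth / bytes_per_token
print(f"upper bound: ~{max_tokens_per_s:.0f} tokens/s per request")  # ~143 tokens/s
```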

Inference optimization is a critical engineering discipline. Techniques include batching (processing multiple requests simultaneously to amortize fixed costs), quantization (reducing weight precision from 16-bit to 4-bit to fit larger models on smaller GPUs), speculative decoding (using a small model to draft tokens that a large model verifies), and KV cache management (reusing computed attention states across requests with shared prefixes).
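As one concrete example, the toy sketch below mirrors the shape of speculative decoding: a cheap draft model proposes a few tokens, the target model checks them, and the longest agreeing prefix is accepted. Both "models" here are stand-in functions over a tiny vocabulary, and a real implementation verifies all drafted positions in a single parallel forward pass rather than one at a time.

```python
# Toy sketch of greedy speculative decoding (draft, verify, correct).
import random

def draft_next(context):
    # Cheap draft model (stand-in): a deterministic toy rule over a 100-token vocab.
    return (sum(context) * 7 + 3) % 100

def target_next(context):
    # Expensive target model (stand-in): agrees with the draft ~80% of the time.
    guess = draft_next(context)
    return guess if random.random() < 0.8 else (guess + 1) % 100

def speculative_step(context, k=4):
    # 1) Draft: the small model proposes k tokens autoregressively.
    draft, ctx = [], list(context)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    # 2) Verify: accept the longest prefix the target model agrees with.
    #    (A real system scores all k positions in one parallel forward pass.)
    accepted, ctx = [], list(context)
    for token in draft:
        if target_next(ctx) == token:
            accepted.append(token)
            ctx.append(token)
        else:
            break

    # 3) If a draft token was rejected, emit the target model's own token instead,
    #    so every step makes progress even when the draft is wrong.
    if len(accepted) < k:
        accepted.append(target_next(list(context) + accepted))
    return accepted

print(speculative_step([1, 2, 3]))
```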

Why does Inference matter?

Inference cost and speed determine whether an AI application is economically viable and responsive enough for users. A model that costs $0.01 per request with 500 ms latency enables different use cases than one costing $0.50 per request with 10-second latency. The gap between a research demo and a production system is largely a matter of inference optimization.

The inference market is massive: enterprises collectively spend billions annually on inference compute, dwarfing training costs. For every dollar spent training a model, $10-100 is spent running inference over its lifetime — making inference efficiency the dominant factor in AI economics.

Best practices for Inference

  • Measure time-to-first-token and tokens-per-second separately, as they reflect different optimization opportunities (see the sketch after this list)
  • Use prompt caching to avoid reprocessing identical system prompts across requests
  • Implement request batching during high-traffic periods to maximize GPU utilization and reduce per-request cost
  • Set appropriate timeout limits so slow inference does not cascade into application-wide latency spikes
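
A minimal measurement harness for the first practice might look like the sketch below. Here stream_tokens is a placeholder for whatever streaming client you use, not a specific SDK call; the fake stream at the bottom just exercises the timing logic.

```python
# Sketch: measure time-to-first-token (TTFT) and decode tokens/s separately,
# assuming a streaming client that yields tokens as they arrive.
import time

def measure_stream(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in stream_tokens(prompt):
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now          # end of prefill, start of decode
        count += 1
    end = time.perf_counter()

    ttft = first_token_at - start                                   # dominated by prefill
    decode_rate = (count - 1) / (end - first_token_at) if count > 1 else 0.0
    return ttft, decode_rate

# Fake stream for illustration: waits 0.3 s, then emits 20 tokens at ~50 tok/s.
def fake_stream(prompt):
    time.sleep(0.3)
    for _ in range(20):
        time.sleep(0.02)
        yield "tok"

ttft, tps = measure_stream(fake_stream, "hello")
print(f"TTFT: {ttft*1000:.0f} ms, decode: {tps:.0f} tokens/s")
```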

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.