A context window is the maximum number of tokens a language model can process in a single input-output interaction, encompassing both the prompt and the generated response. It represents the model's working memory — everything the model can "see" at once when generating a response. Information outside the context window is invisible to the model and cannot influence its output.
Language models process text as sequences of tokens (roughly three-quarters of an English word each, on average). The context window defines the maximum sequence length the model's architecture supports. When you send a prompt, the model processes all tokens in the window using self-attention, which lets each token attend to every other token it is allowed to see (in decoder-only models, every preceding token).
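The "roughly three-quarters of a word per token" rule of thumb can be sketched as a quick estimator. This is a hedged heuristic, not a real tokenizer; actual token counts depend on the model's BPE vocabulary and can differ substantially for code, non-English text, or unusual formatting.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose.

    Combines two common heuristics: ~4/3 tokens per word and
    ~1 token per 4 characters. Treat the result as a ballpark
    figure only; use the model's own tokenizer for exact counts.
    """
    word_estimate = len(text.split()) * 4 // 3
    char_estimate = len(text) // 4
    # Average the two heuristics for a slightly more stable guess.
    return (word_estimate + char_estimate) // 2

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))
```

For production use, the provider's tokenizer (or token-counting API endpoint) should replace this heuristic, since billing and window limits are enforced on exact token counts.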
As of 2026, context windows range from 8K tokens (small local models) to over 1 million tokens (Claude, Gemini). A 200K-token window can hold approximately 150,000 words — equivalent to a 500-page book. However, model performance often degrades with very long contexts, particularly for information located in the middle of the window (the "lost in the middle" phenomenon).
The context window is shared between input and output: a model with a 200K-token window given a 190K-token prompt can generate at most 10K tokens of response.
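The shared-budget arithmetic can be sketched in a few lines. The 200K window size here is just the example figure from the text, not any specific provider's limit.

```python
CONTEXT_WINDOW = 200_000  # example window size from the text above

def max_output_tokens(prompt_tokens: int, window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the response after the prompt fills part of the window.

    Returns 0 if the prompt alone already exceeds the window
    (in practice the API would reject such a request).
    """
    return max(window - prompt_tokens, 0)

print(max_output_tokens(190_000))  # 10_000 tokens remain for the response
```

Production code typically also reserves headroom below this ceiling (system prompts, tool definitions, safety margins) rather than spending the entire remainder on output.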
Context window size determines what tasks a model can handle. Small windows force aggressive summarization or chunking strategies, while large windows enable analyzing entire codebases, processing lengthy legal documents, or maintaining long conversation histories without losing earlier context.
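The chunking strategy mentioned above can be sketched minimally: split a long document into fixed-size pieces with some overlap so that information near a boundary appears in two adjacent chunks. For brevity this sketch approximates tokens by whitespace-separated words; a real pipeline would chunk on actual token counts.

```python
def chunk_words(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size chunks.

    chunk_size and overlap are in words here (a stand-in for tokens).
    Consecutive chunks share `overlap` words so context that straddles
    a boundary is not lost entirely.
    """
    words = text.split()
    step = chunk_size - overlap  # advance less than a full chunk each time
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

doc = " ".join(str(i) for i in range(2500))
chunks = chunk_words(doc)
print(len(chunks))  # 3 chunks for a 2500-word document
```

Each chunk is then summarized or processed independently, with results merged afterward; overlap size is a tuning knob trading redundancy (cost) against boundary losses.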
However, larger context windows come with trade-offs: inference cost scales linearly with input tokens, and latency increases with window utilization. Production systems must balance the benefits of more context against cost and speed requirements — making context window management a core infrastructure concern.
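The linear cost scaling can be made concrete with a back-of-the-envelope calculator. The per-token rates below are hypothetical placeholders, not any provider's actual pricing.

```python
# Hypothetical rates in dollars per million tokens, for illustration only.
INPUT_RATE_PER_MTOK = 3.00
OUTPUT_RATE_PER_MTOK = 15.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request: both terms scale linearly
    with their token counts, so doubling the prompt roughly
    doubles the input portion of the bill."""
    return (input_tokens * INPUT_RATE_PER_MTOK
            + output_tokens * OUTPUT_RATE_PER_MTOK) / 1_000_000

print(round(request_cost(100_000, 1_000), 4))  # 0.315 at these placeholder rates
```

This is why context management (trimming histories, summarizing, caching repeated prefixes) is treated as an infrastructure concern rather than an afterthought: the prompt is usually the dominant term at long context lengths.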
Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. Now focused on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.