78 terms covering AI agents, LLMs, and developer infrastructure. Each definition is self-contained and quotable.
A/B testing compares two or more variants of a system by randomly assigning users to groups and measuring statistically significant differences in predefined outcome metrics.
MLOps
An agent harness is the runtime environment that manages an AI agent's execution loop, tool access, permission boundaries, memory persistence, and conversation state.
Developer Tools
An agent loop is the iterative cycle of observe, reason, act, and evaluate that an AI agent repeats until it completes a task or reaches a termination condition.
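The agent loop cycle can be sketched in a few lines of Python. This is a minimal illustration, not any particular framework's API: `run_agent_loop` and the four callbacks are hypothetical names, and the "task" is a toy counter standing in for real model calls and tools.

```python
from typing import Callable


def run_agent_loop(
    observe: Callable[[], str],
    reason: Callable[[str], str],
    act: Callable[[str], str],
    is_done: Callable[[str], bool],
    max_steps: int = 10,
) -> list[str]:
    """Repeat observe -> reason -> act -> evaluate until done or out of steps."""
    results = []
    for _ in range(max_steps):       # termination condition: step limit
        observation = observe()      # observe the current state
        plan = reason(observation)   # decide what to do next
        result = act(plan)           # carry out the decision
        results.append(result)
        if is_done(result):          # evaluate: has the task completed?
            break
    return results


# Toy task (assumption for illustration): count up to 3.
state = {"n": 0}


def observe() -> str:
    return f"n={state['n']}"


def reason(observation: str) -> str:
    return "increment"


def act(plan: str) -> str:
    state["n"] += 1
    return f"n={state['n']}"


def is_done(result: str) -> bool:
    return result == "n=3"


history = run_agent_loop(observe, reason, act, is_done)
```

The `max_steps` cap is the fallback termination condition: without it, a loop whose `is_done` check never passes would run forever.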
AI Agent Development
Agent memory is the system that enables AI agents to persist, retrieve, and reason over information across conversation turns and sessions, providing continuity beyond the immediate context window.
AI Agent Development
Agent observability is the practice of instrumenting AI agent systems to capture traces, metrics, and logs across the full execution lifecycle, enabling debugging, performance optimization, and reliability monitoring.
MLOps
Agent orchestration is the coordination layer that manages how multiple AI agents communicate, share context, delegate tasks, and resolve conflicts within a system.
AI Agent Development
Agentic AI refers to artificial intelligence systems that autonomously plan, execute, and adapt multi-step tasks toward a goal without requiring human intervention at each step.
AI Agent Development
Agentic RAG is a retrieval-augmented generation pattern where an AI agent iteratively decides what to retrieve, evaluates retrieval quality, and reformulates queries until it has sufficient context to answer accurately.
AI Agent Development
AI agent memory is the system that persists information across interactions, enabling agents to recall past context, learn from experience, and maintain continuity between sessions.
AI Agent Development
AI alignment is the research field dedicated to ensuring artificial intelligence systems reliably pursue goals that match human intentions, values, and ethical principles.
AI Safety
An AI coding agent is an autonomous software development assistant that can read codebases, write code, run tests, debug errors, and commit changes with minimal human direction.
Developer Tools
AI guardrails are programmatic constraints and validation layers that prevent AI systems from generating harmful, off-topic, or policy-violating outputs during production use.
AI Safety
An attention mechanism allows neural networks to dynamically focus on relevant parts of the input when producing each element of the output, weighting information by learned importance.
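As a rough sketch of how an attention mechanism weights its input, the following computes scaled dot-product attention for a single query in plain Python. The vectors are made up for illustration; in a trained model the queries, keys, and values come from learned projections, which is where the "learned importance" enters.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Similarity of the query to every key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Output is the weighted average of the value vectors.
    width = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return output, weights


# Toy inputs (assumption): the query points in the same direction as the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention([5.0, 0.0], keys, values)
```

Because the query aligns with the first key, most of the weight lands on the first value vector and the output is dominated by it.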
LLM Architecture
A canary release gradually routes a small percentage of production traffic to a new version while monitoring for errors before expanding to all users.
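One common way to pick the canary's share of traffic is to hash each user into a stable bucket. The sketch below assumes that approach; `choose_version` and the version labels are hypothetical names, and real rollouts usually do this in a load balancer or service mesh rather than in application code.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def choose_version(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new version."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


# The same user always gets the same answer for a given percentage.
version_for_alice = choose_version("alice", canary_percent=10)
```

Hashing keeps each user on one version between requests, and raising `canary_percent` only ever moves users from stable to canary, never back.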
DevOps/CI-CD
Chain of thought is a prompting technique that instructs language models to produce intermediate reasoning steps before arriving at a final answer, improving accuracy on complex tasks.
Prompt Engineering
Constitutional AI is an alignment technique where a language model critiques and revises its own outputs according to a set of written principles, reducing reliance on human feedback for safety training.
AI Safety
Container orchestration automates the deployment, scaling, networking, and lifecycle management of containerized applications across clusters of machines.
Cloud Infrastructure
A content delivery network (CDN) distributes cached copies of web content across geographically dispersed servers to reduce latency and improve load times for users worldwide.
Cloud Infrastructure
Context compression is a set of techniques that reduce the token count of prompts while preserving semantic content, enabling more information to fit within a model's fixed context window.
LLM Infrastructure
Context engineering is the practice of designing and optimizing the information provided to a language model to maximize the relevance, accuracy, and efficiency of its outputs.
LLM Infrastructure
A context window is the maximum number of tokens a language model can process in a single input-output interaction, encompassing both the prompt and the generated response.
LLM Infrastructure
Continuous batching is an inference serving technique that dynamically adds and removes requests from a running batch at each generation step, maximizing GPU utilization without waiting for all requests to complete.
LLM Infrastructure
Continuous deployment automatically releases every code change that passes automated testing directly to production without manual approval gates.
DevOps/CI-CD
A data pipeline is an automated sequence of processing steps that ingests, transforms, validates, and delivers data from source systems to destination systems for analysis or model training.
MLOps
Direct Preference Optimization (DPO) is a training method that aligns language models to human preferences by directly optimizing on preference pairs without requiring a separate reward model.
Model Training
Edge computing processes data at or near the source of data generation rather than in a centralized data center, reducing latency and bandwidth consumption.
Cloud Infrastructure
An embedding is a dense numerical vector representation of text, images, or other data that captures semantic meaning in a format suitable for mathematical comparison and retrieval.
Machine Learning
Experiment tracking systematically records machine learning training runs including hyperparameters, metrics, artifacts, and code versions to enable comparison and reproducibility.
MLOps
Feature flags are conditional switches in code that enable or disable functionality at runtime without deploying new code, decoupling deployment from feature release.
DevOps/CI-CD
A feature store is a centralized platform that manages the storage, transformation, and serving of machine learning features, ensuring consistency between training and inference pipelines.
MLOps
Fine-tuning is the process of further training a pre-trained language model on a domain-specific dataset to improve its performance on targeted tasks without training from scratch.
Machine Learning
FlashAttention is an IO-aware attention algorithm that computes exact attention with reduced GPU memory reads/writes through tiling and kernel fusion, enabling faster training and inference for long sequences.
LLM Architecture
Function calling is an LLM capability that allows models to generate structured JSON arguments for predefined functions, enabling AI to interact with external systems and APIs.
AI Agent Development
Function grounding is the process of connecting language model outputs to executable code and real-world systems, ensuring that model-generated actions produce verifiable, deterministic results.
AI Agent Development
Generative engine optimization is the practice of structuring web content to maximize its likelihood of being cited, quoted, or referenced by AI systems when generating answers.
Search & Discovery
GitOps is an operational framework that uses Git repositories as the single source of truth for declarative infrastructure and application configuration with automated reconciliation.
DevOps/CI-CD
A guardrail framework is a software layer that validates, filters, and constrains language model inputs and outputs to enforce safety policies, prevent misuse, and ensure response quality in production systems.
AI Safety
Hallucination in AI refers to model outputs that are fluent and confident but factually incorrect, unsupported by training data, or inconsistent with provided context.
AI Safety
Human-in-the-loop is an agent design pattern where the system pauses execution at designated checkpoints to request human approval, correction, or guidance before proceeding with consequential actions.
AI Agent Development
Inference is the process of running a trained machine learning model on new input data to generate predictions, classifications, or text outputs, either in real time or in batch.
LLM Infrastructure
Infrastructure as Code (IaC) manages and provisions computing infrastructure through machine-readable configuration files rather than manual processes or interactive tools.
Cloud Infrastructure
A large language model is a neural network trained on massive text datasets that generates human-like text by predicting the most probable next tokens in a sequence.
Machine Learning
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains small rank-decomposed weight matrices alongside frozen base model weights, enabling model customization with minimal compute and memory.
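The savings from LoRA come from simple arithmetic: a full update to a weight matrix of shape d_out x d_in has d_out * d_in parameters, while the low-rank pair B (d_out x r) and A (r x d_in) has r * (d_out + d_in). The sketch below works through one case; the 4096 x 4096 layer size and rank 8 are illustrative assumptions, not figures from any specific model.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full update parameters, LoRA adapter parameters) for one matrix."""
    full = d_out * d_in
    # B has shape (d_out, rank) and A has shape (rank, d_in).
    adapter = rank * (d_out + d_in)
    return full, adapter


full, adapter = lora_param_counts(d_out=4096, d_in=4096, rank=8)
trainable_fraction = adapter / full
```

For this layer the adapter holds 65,536 trainable parameters against 16,777,216 in the frozen matrix, about 0.39 percent.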
Model Training
An MCP server is a lightweight program that exposes tools, resources, and prompts to AI applications through the Model Context Protocol's standardized client-server interface.
Developer Tools
Mixture of Experts (MoE) is a neural network architecture that routes each input to a subset of specialized sub-networks, enabling massive model capacity with efficient per-token computation.
LLM Architecture
Model Context Protocol is an open standard that defines how AI applications connect to external data sources and tools through a unified client-server interface.
Developer Tools
Model distillation transfers knowledge from a large teacher model to a smaller student model by training the student to match the teacher's output distributions rather than hard labels.
LLM Architecture
Model evaluation is the systematic process of measuring language model performance against benchmarks, human judgments, and task-specific metrics to determine fitness for production deployment.
MLOps
A model gateway is an API proxy layer that sits between applications and LLM providers, providing unified access, load balancing, fallback routing, cost tracking, and policy enforcement across multiple models.
LLM Infrastructure
A model registry is a centralized repository that stores, versions, and manages machine learning model artifacts along with their metadata, lineage, and deployment status.
MLOps
Model routing is the dynamic selection of which language model handles each request based on task complexity, cost constraints, latency requirements, or content classification.
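A rule-based router is the simplest form model routing can take. The sketch below is only an illustration: the model names, the 500 ms latency cutoff, and the 200-word complexity heuristic are all assumptions, and production routers often use a trained classifier instead of fixed rules.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt: str
    max_latency_ms: int


def route(request: Request) -> str:
    """Pick a (hypothetical) model name from latency and complexity signals."""
    if request.max_latency_ms < 500:
        return "small-fast-model"       # tight latency budget wins first
    long_prompt = len(request.prompt.split()) > 200
    wants_reasoning = "step by step" in request.prompt.lower()
    if long_prompt or wants_reasoning:
        return "large-reasoning-model"  # crude proxy for task complexity
    return "mid-size-model"             # default balances cost and quality


chosen = route(Request(prompt="Explain step by step how TLS works", max_latency_ms=5000))
```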
LLM Infrastructure
Model serving deploys trained machine learning models as production services that accept inference requests and return predictions with low latency and high availability.
MLOps
A multi-agent system is an architecture where multiple specialized AI agents collaborate, communicate, and coordinate to solve problems that exceed any single agent's capabilities.
AI Agent Development
Multi-modal agents are AI systems that perceive and act across multiple data types (text, images, audio, video, and code), using vision-language models to understand and interact with graphical interfaces.
AI Agent Development
Multimodal AI refers to systems that can process, understand, and generate content across multiple data types including text, images, audio, and video within a unified model.
Machine Learning
Prefix caching is a self-hosted inference optimization that stores KV cache states for common prompt prefixes on the serving infrastructure, so that requests sharing a prefix reuse it without recomputation.
LLM Infrastructure
Prompt caching is an inference optimization where API providers store and reuse precomputed KV cache states for repeated prompt prefixes, reducing latency and cost for requests sharing common context.
LLM Infrastructure
Prompt engineering is the practice of crafting and refining instructions given to language models to elicit accurate, relevant, and properly formatted outputs for specific tasks.
Machine Learning
Prompt injection is an attack technique where malicious instructions are embedded in user inputs or external data to override a language model's system prompt and alter its intended behavior.
AI Safety
ReAct is an agent prompting pattern that interleaves reasoning traces with action execution, enabling language models to plan, act, and observe iteratively to solve complex tasks.
AI Agent Development
Red teaming in AI involves systematically probing AI systems for vulnerabilities, biases, and failure modes by simulating adversarial attacks and edge-case scenarios.
AI Safety
Retrieval-augmented generation is an architecture that enhances language model outputs by retrieving relevant documents from external knowledge sources and including them in the model's context.
Search & Discovery
RLHF (Reinforcement Learning from Human Feedback) trains AI models to align with human preferences by using human judgment as a reward signal to fine-tune model behavior.
AI Safety
Semantic caching stores LLM responses indexed by the semantic meaning of queries rather than exact string matches, enabling cache hits for paraphrased questions that would miss traditional caches.
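The lookup behind semantic caching can be sketched as a nearest-neighbor search with a similarity threshold. Everything here is a stand-in: `toy_embed` uses word counts instead of a real embedding model, the 0.8 threshold is arbitrary, and a real system would search a vector index rather than a Python list.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))

    def get(self, query: str):
        """Return the response of the most similar stored query, or None."""
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link on the login page.")
paraphrase_hit = cache.get("how can i reset my password")
unrelated_miss = cache.get("what is the weather today")
```

The paraphrase differs by one word and still clears the threshold, while an exact-match cache keyed on the query string would miss it. The threshold is the main tuning knob: too low returns wrong answers, too high behaves like an exact-match cache.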
LLM Infrastructure
Serverless computing is a cloud execution model where the provider dynamically allocates resources and bills only for actual compute time used during function invocations.
Cloud Infrastructure
Speculative decoding is an inference acceleration technique that uses a smaller draft model to propose several tokens ahead, then verifies them together in a single forward pass of the larger target model, keeping the tokens the target model agrees with.
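The propose-then-verify step of speculative decoding can be shown with a toy greedy version. This is a simplified sketch: both "models" are lookup tables over a fixed sentence, verification is written as a loop where a real system scores all proposed positions in one batched forward pass, and sampling-based variants use a probabilistic accept/reject rule rather than exact matching.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round: draft proposes k tokens, target keeps the agreeing prefix."""
    # 1. The cheap draft model proposes k tokens, one after another.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The target model checks each proposed position in order.
    accepted = []
    for token in proposed:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Toy "models" (assumption): each returns the next word of a fixed sentence.
TARGET = "the cat sat on the mat".split()
DRAFT = "the cat sat in the mat".split()  # disagrees at position 3


def target_next(tokens):
    return TARGET[len(tokens)] if len(tokens) < len(TARGET) else "<eos>"


def draft_next(tokens):
    return DRAFT[len(tokens)] if len(tokens) < len(DRAFT) else "<eos>"


new_tokens = speculative_step([], draft_next, target_next, k=4)
```

The round yields four tokens for the price of one target verification, and the result is identical to what the target model would have produced on its own.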
LLM Infrastructure
Structured content is information organized with consistent formatting, semantic markup, and machine-readable metadata that enables automated processing by search engines and AI systems.
Search & Discovery
Structured output is a language model capability that constrains generation to produce valid JSON, XML, or other schema-conforming formats, ensuring reliable parsing by downstream systems.
AI Agent Development
TensorRT-LLM is NVIDIA's open-source library that optimizes large language model inference through kernel fusion, quantization, and hardware-specific compilation for maximum GPU utilization.
LLM Infrastructure
A token budget is the allocated limit on input and output tokens for a language model request, used to control costs, latency, and context window utilization.
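One way to enforce an input token budget is to keep the system prompt and then add conversation messages from newest to oldest until the budget runs out. The sketch below assumes that policy; `fit_to_budget` is a hypothetical helper, and counting whitespace-separated words is only a rough stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace words approximate tokens; real tokenizers differ.
    return len(text.split())


def fit_to_budget(system_prompt: str, messages: list[str], input_budget: int) -> list[str]:
    """Keep the system prompt plus the most recent messages that fit."""
    remaining = input_budget - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > remaining:
            break                       # older messages are dropped
        kept.append(message)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))


history = ["first message here now", "second one", "third message is longer"]
trimmed = fit_to_budget("You are helpful", history, input_budget=10)
```

With a budget of 10, the system prompt takes 3, the two newest messages take 6, and the oldest message no longer fits.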
LLM Infrastructure
Tool orchestration is the coordination layer that manages how AI agents discover, select, invoke, and compose multiple tools to complete complex multi-step tasks autonomously.
AI Agent Development
Tool use is a capability that allows language models to invoke external functions, APIs, or services by generating structured calls that are executed by the host application.
AI Agent Development
The transformer is a neural network architecture that uses self-attention mechanisms to process sequential data in parallel, forming the foundation of most modern large language models.
LLM Architecture
A vector database is a specialized storage system designed to index, store, and perform fast similarity searches over high-dimensional embedding vectors at scale.
Search & Discovery
vLLM is an open-source LLM serving engine that uses PagedAttention to efficiently manage GPU memory for KV caches, enabling high-throughput inference with continuous batching.
LLM Infrastructure