Your agent failed in prod and you can't reproduce it. Compare LangSmith, Langfuse, and Phoenix on tracing, evals, self-hosting, and cost.

TL;DR: LangSmith, Langfuse, and Phoenix are the three leading LLM observability platforms for tracing, debugging, and evaluating AI agents in production. LangSmith is the polished, proprietary standard built by the LangChain team. Langfuse is the open-source, OpenTelemetry-native platform that teams self-host for data control. Phoenix, from Arize, is the free, notebook-first tool with the strongest built-in evaluation engine. Choose LangSmith for managed convenience and LangGraph integration, Langfuse for self-hosted production at scale, and Phoenix for fast local debugging and offline evals.
You built an agent. It passed every test in your notebook. You shipped it. Three days later a user reports it returned a confidently wrong answer, and you have no idea why. The agent made eleven tool calls, retrieved four documents, summarized two of them, and produced a paragraph of plausible nonsense. Your application logs show a single line: 200 OK, 4.2s.
This is the day-2 problem of agent engineering, and it is the reason "Your Agent Failed in Prod. Good Luck Reproducing It" resonates with every developer who has shipped an LLM system. Traditional logging was designed for deterministic software, where the same input produces the same output and a stack trace points at the failing line. Agents break both assumptions. They are non-deterministic, so you cannot simply re-run the request and watch it fail again. And they fail semantically — the code executes perfectly while the meaning of the output is wrong.
LLM observability platforms exist to close this gap. They capture the full execution tree of an agent run — every prompt, every model response, every tool invocation, every retrieved chunk, with token counts, costs, and latencies attached — and let you replay, inspect, and score it after the fact. The three tools that dominate this space in 2026 are LangSmith, Langfuse, and Arize Phoenix. They overlap heavily in capability but differ sharply in philosophy, licensing, and operational model. This guide compares them on the dimensions that actually decide which one belongs in your stack.
Application Performance Monitoring (APM) tools like Datadog, New Relic, and Honeycomb answer questions about system health: latency percentiles, error rates, throughput, resource saturation. They are indispensable, and most of them now have LLM add-ons. But they were built around the assumption that a "good" request is a fast request that returns a 200.
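LLM observability adds a second axis that APM never had to consider: output quality. A request can be fast, cheap, and return HTTP 200 while producing an answer that is factually wrong, off-policy, or hallucinated. To capture this, LLM observability platforms add three capabilities that classic APM lacks: capture of the full semantic content of each step (prompts, model responses, tool calls, retrieved documents), token and cost accounting per generation, and evaluation through human review, heuristics, or LLM-as-judge scoring.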
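Token and cost accounting, which these platforms attach to every generation, comes down to simple arithmetic once token counts are recorded. A minimal sketch, assuming a hypothetical model name and made-up per-million-token prices (neither comes from any vendor's price list):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens. The values used below are placeholders."""
    input_per_million: float
    output_per_million: float


# Hypothetical price table; real prices vary by provider and change often.
PRICES = {"example-model": Price(input_per_million=1.00, output_per_million=4.00)}


def generation_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single model call."""
    price = PRICES[model]
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


def trace_cost(generations: list[dict]) -> float:
    """Sum the cost of every generation recorded in one agent run."""
    return sum(
        generation_cost(g["model"], g["input_tokens"], g["output_tokens"])
        for g in generations
    )


# Two model calls from one illustrative agent run.
run = [
    {"model": "example-model", "input_tokens": 2_000, "output_tokens": 500},
    {"model": "example-model", "input_tokens": 6_000, "output_tokens": 1_000},
]
print(f"run cost: ${trace_cost(run):.4f}")
```

Rolling costs up per trace rather than per request is what lets you see that one agent run with eleven tool calls costs far more than a single completion.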
With that framing, here is how the three platforms stack up.
LangSmith is the observability and evaluation platform built by the LangChain team. It is the most mature and polished of the three, and it is proprietary — there is no open-source LangSmith you can git clone. You use it as a hosted SaaS, or, on the Enterprise plan, as a self-managed deployment inside your own cloud.
Its biggest strength is integration. If you build with LangChain or LangGraph, tracing is nearly automatic: set two environment variables and every chain, tool, and agent step is captured with zero code changes. But LangSmith is not LangChain-only. Its @traceable decorator instruments any Python or TypeScript function, and wrap_openai transparently traces direct OpenAI SDK calls without touching the rest of your code.
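Conceptually, a tracing decorator of this kind records each call as a span and nests it under whichever span is currently running. The sketch below shows that idea using only the standard library. The names `traced`, `Span`, and `ROOT_SPANS` are hypothetical, and this is not how LangSmith's @traceable is implemented.

```python
import contextvars
import functools
import time
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    """One node in the execution tree of an agent run."""
    name: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0
    children: list = field(default_factory=list)


# Tracks the span that is currently executing, so nested calls find their parent.
_current_span: contextvars.ContextVar = contextvars.ContextVar(
    "current_span", default=None
)
ROOT_SPANS: list = []  # top-level spans, one per traced entry point


def traced(fn):
    """Record every call to `fn` as a span nested under the active span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        span = Span(name=fn.__name__, inputs={"args": args, "kwargs": kwargs})
        parent = _current_span.get()
        (parent.children if parent is not None else ROOT_SPANS).append(span)
        token = _current_span.set(span)
        start = time.perf_counter()
        try:
            span.output = fn(*args, **kwargs)
            return span.output
        except Exception as exc:
            span.error = repr(exc)  # keep the failure on the span, then re-raise
            raise
        finally:
            span.duration_s = time.perf_counter() - start
            _current_span.reset(token)
    return wrapper


@traced
def retrieve(query: str) -> list:
    return [f"doc about {query}"]


@traced
def answer(query: str) -> str:
    docs = retrieve(query)
    return f"Based on {len(docs)} document(s): ..."


answer("refund policy")
root = ROOT_SPANS[0]
print(root.name, "->", [child.name for child in root.children])
```

A real SDK would also ship each finished span to a backend; the part worth noticing here is that the parent-child structure falls out of the call stack without any changes to the function bodies.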
Tracing is enabled purely through environment variables, which means you can turn it on in production with a configuration change rather than a code change.
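As a rough sketch of that configuration: the variable names below (`LANGSMITH_TRACING` and `LANGSMITH_API_KEY`) are the ones I understand current LangSmith SDKs to read, so treat them as an assumption and confirm against the official docs. The helper `enable_langsmith_tracing` is hypothetical. In a real deployment you would normally set these in the service's environment rather than in code.

```python
import os
from typing import MutableMapping


def enable_langsmith_tracing(
    api_key: str,
    env: MutableMapping[str, str] = os.environ,
) -> None:
    """Set the two tracing variables in `env` (defaults to this process)."""
    # Assumed names; older SDK releases read LANGCHAIN_TRACING_V2 and
    # LANGCHAIN_API_KEY instead.
    env["LANGSMITH_TRACING"] = "true"
    env["LANGSMITH_API_KEY"] = api_key


# Demo against a plain dict so the real environment is left untouched.
demo_env: dict = {}
enable_langsmith_tracing("<your-api-key>", env=demo_env)
print(demo_env)
```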
Recognizing that teams do not want to be locked into its proprietary SDK, LangSmith now also accepts traces over OpenTelemetry, so you can point any OTel exporter at its ingest endpoint.
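One way to do that is through the standard OTLP exporter environment variables, which most OpenTelemetry SDKs read at startup. The sketch below builds them. `OTEL_EXPORTER_OTLP_ENDPOINT` and `OTEL_EXPORTER_OTLP_HEADERS` come from the OpenTelemetry specification, while the endpoint URL and the `x-api-key` header name are assumptions about LangSmith's ingest API that you should verify in its documentation. The helper `otlp_env` is hypothetical.

```python
from urllib.parse import quote


def otlp_env(endpoint: str, headers: dict[str, str]) -> dict[str, str]:
    """Build the standard OTLP exporter environment variables."""
    # The headers variable is a comma-separated list of key=value pairs;
    # values are percent-encoded so spaces and commas survive parsing.
    header_value = ",".join(
        f"{key}={quote(value, safe='')}" for key, value in headers.items()
    )
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_HEADERS": header_value,
    }


# Assumed endpoint and header name; confirm both before relying on them.
config = otlp_env(
    "https://api.smith.langchain.com/otel",
    {"x-api-key": "<your-api-key>"},
)
for name, value in config.items():
    print(f"{name}={value}")
```

Because these variables are vendor-neutral, switching backends later means changing the endpoint and headers, not the instrumentation.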
LangSmith's evaluation suite is excellent: managed datasets, an evaluate() runner, LLM-as-judge evaluators, pairwise comparison, and annotation queues for human review. Its main trade-offs are that the full platform is closed-source and that self-hosting is reserved for Enterprise contracts — a real constraint for teams in regulated industries who need data residency on a startup budget.
Langfuse is the open-source answer to LangSmith. Its core is MIT-licensed, and self-hosting is a first-class, documented path rather than an enterprise afterthought. This is the single biggest reason teams choose it: when your traces contain user PII, medical data, or proprietary prompts, "send everything to a third-party SaaS" is often a non-starter. With Langfuse you run the whole platform inside your own VPC.
Instrumentation mirrors the others. The @observe() decorator wraps any function, and Langfuse ships drop-in replacements for popular SDKs so you can trace existing code by changing a single import.
Langfuse is built on OpenTelemetry semantics from the ground up, so traces from any OTel-instrumented service flow in natively. Its evaluation model is dataset-driven: you run a task over a dataset of inputs and expected outputs, then score the results.
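The shape of a dataset-driven evaluation loop can be sketched without any SDK. Everything below (`run_offline_experiment`, the evaluator, the stubbed task) is hypothetical and only illustrates the pattern; it is not Langfuse's API, and a real task would call your agent instead of a lookup table.

```python
from statistics import mean
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 when the output equals the expected answer, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_offline_experiment(
    dataset: list[dict],
    task: Callable[[str], str],
    evaluators: dict[str, Callable[[str, str], float]],
) -> tuple[list[dict], dict[str, float]]:
    """Run `task` on every dataset item and score each output."""
    rows = []
    for item in dataset:
        output = task(item["input"])
        scores = {
            name: evaluate(output, item["expected"])
            for name, evaluate in evaluators.items()
        }
        rows.append({"input": item["input"], "output": output, "scores": scores})
    # Average each metric across the dataset (empty dataset -> empty summary).
    summary = (
        {name: mean(row["scores"][name] for row in rows) for name in evaluators}
        if rows
        else {}
    )
    return rows, summary


# Stub standing in for a real agent call.
CANNED = {"2+2": "4", "capital of France": "Lyon"}
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
rows, summary = run_offline_experiment(
    dataset, task=lambda q: CANNED[q], evaluators={"exact_match": exact_match}
)
print(summary)
```

Keeping per-item rows alongside the summary matters in practice: the aggregate tells you a change regressed, and the rows tell you which inputs to open in the trace view.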
The catch with self-hosting is operational weight. A production Langfuse deployment is not a single container — it needs PostgreSQL (transactional data), ClickHouse (the high-volume trace/analytics store), Redis (queueing), and S3-compatible object storage (large payloads). That is a real stack to run, monitor, and back up. For teams that want Langfuse without the ops burden, Langfuse Cloud offers a free Hobby tier and paid usage tiers — the same open-source product, managed.
Phoenix, from Arize, optimizes for a different moment: the developer staring at a broken agent who wants a trace view right now, with no account, no cloud, and no infrastructure. It is fully open source and free, and it is the fastest of the three to stand up. One install and one function call give you a local tracing server with a UI.
Phoenix is built on OpenInference, Arize's open standard that extends OpenTelemetry with semantic conventions for LLM spans. Because it is OTel-native, the same traces can later be shipped to Arize's commercial cloud platform (Arize AX) or any OTel backend, so you instrument once and decide the backend later. For containerized setups, the entire server ships as a single image.
Where Phoenix genuinely pulls ahead is evaluation. Its phoenix.evals library ships battle-tested, prebuilt evaluators (hallucination, Q&A correctness, RAG relevance, toxicity, faithfulness) that run as LLM-as-judge classifiers over a dataframe of traces.
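The core of an LLM-as-judge classifier is small: fill a prompt template per row, ask a model for a label, and snap the reply to a fixed set of allowed labels. The sketch below illustrates that pattern with a plain list of dicts standing in for the dataframe and a fake judge standing in for the model call. All names are hypothetical, and this is not the phoenix.evals API.

```python
from typing import Callable

TEMPLATE = (
    "You are checking an answer against reference text.\n"
    "Reference: {reference}\n"
    "Answer: {answer}\n"
    "Reply with exactly one word: factual or hallucinated."
)
ALLOWED_LABELS = ("factual", "hallucinated")


def classify(
    rows: list[dict],
    judge: Callable[[str], str],
    template: str = TEMPLATE,
    allowed: tuple = ALLOWED_LABELS,
) -> list[dict]:
    """Label each row by asking `judge` and constraining its reply."""
    labeled = []
    for row in rows:
        raw = judge(template.format(**row)).strip().lower()
        # Anything outside the allowed labels is flagged rather than trusted.
        label = raw if raw in allowed else "unparseable"
        labeled.append({**row, "label": label})
    return labeled


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call; a crude substring check for the demo."""
    reference = prompt.split("Reference: ")[1].split("\n")[0]
    answer = prompt.split("Answer: ")[1].split("\n")[0]
    return "Factual" if answer in reference else " Hallucinated "


rows = [
    {"reference": "Refunds are accepted within 30 days.", "answer": "within 30 days"},
    {"reference": "Refunds are accepted within 30 days.", "answer": "within 90 days"},
]
for result in classify(rows, fake_judge):
    print(result["answer"], "->", result["label"])
```

Constraining the output to a closed label set is what makes judge results aggregatable; free-text verdicts are hard to count or compare across runs.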
This dataframe-centric workflow makes Phoenix a natural fit for offline experiments, regression testing in CI, and research notebooks. Its trade-off is the mirror image of Langfuse's strength: Phoenix is lighter and less oriented toward being a multi-tenant, always-on production backend with long-term retention, team RBAC, and SLAs. Many teams use it precisely as a local-and-CI eval tool alongside a heavier production tracer.
The instrumentation philosophies converge more than they diverge in 2026 (all three are OTel-compatible), but the developer ergonomics differ.
The strategic takeaway: because every platform now ingests OpenTelemetry, instrumentation is no longer a one-way door. Emit OTel/OpenInference spans from your application, and you can switch backends — or run two in parallel — without rewriting your agent. Avoid coding directly against a single vendor's proprietary span format unless you are certain you will never migrate.
Tracing tells you what happened; evaluation tells you whether it was good. This is where your choice matters most, because the eval loop is where you actually spend your debugging time.
A practical pattern many teams adopt: use Phoenix in CI for fast, free, offline regression evals on every pull request, and use Langfuse or LangSmith in production for continuous online evaluation and long-term trace retention.
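The CI half of that pattern usually ends in a gate that compares the current eval scores with a stored baseline and fails the build on a meaningful drop. A minimal sketch, assuming scores are already computed as metric-to-float mappings; the function names, metric names, scores, and the 0.02 tolerance are all illustrative choices:

```python
def regression_failures(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """List every metric that dropped more than `tolerance` below baseline."""
    failures = []
    for metric, base in baseline.items():
        score = current.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from current run")
        elif score < base - tolerance:
            failures.append(
                f"{metric}: {score:.3f} is below baseline {base:.3f} "
                f"by more than {tolerance}"
            )
    return failures


def gate(baseline: dict[str, float], current: dict[str, float]) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    failures = regression_failures(baseline, current)
    for line in failures:
        print("REGRESSION", line)
    return 1 if failures else 0


# Hypothetical scores; a CI script would pass the result to sys.exit().
exit_code = gate(
    baseline={"faithfulness": 0.90, "relevance": 0.85},
    current={"faithfulness": 0.84, "relevance": 0.86},
)
print("exit code:", exit_code)
```

The tolerance exists because LLM-judged scores are noisy between runs; without it the gate fails on random variation and teams learn to ignore it.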
For teams with strict data-residency or compliance requirements, where trace data lives and who operates the platform can override every other consideration.
Pricing tiers shift frequently, so treat these as the model rather than a quote — always confirm current numbers on each vendor's pricing page.
The headline: Phoenix is free, Langfuse is free if you self-host, and LangSmith trades dollars for not having to run anything. For a solo developer or early-stage team, Phoenix or self-hosted Langfuse costs nothing but engineering time. For a funded team that values managed convenience and is comfortable with a SaaS, LangSmith's per-seat pricing is often worth it.
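There is no universal winner; the right choice maps to your constraints. Choose LangSmith for managed convenience and LangGraph integration, Langfuse for self-hosted production at scale, and Phoenix for fast local debugging and offline evals.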
And remember the meta-point: instrument with OpenTelemetry / OpenInference rather than a proprietary SDK where you can. That keeps the decision reversible — you can start with Phoenix locally, add Langfuse in production, and evaluate LangSmith later, all without re-instrumenting your agent. The frameworks you build on (see our AI Agent Frameworks Complete Guide and the AgentCore vs LangChain comparison) will keep changing; a vendor-neutral observability layer is the part of your stack worth protecting.
Traditional monitoring (APM) measures system health — latency, error rates, throughput, and resource usage — and assumes a fast request returning HTTP 200 is a good request. LLM observability adds a quality axis that APM lacks. It captures the full semantic content of each agent step (prompts, model responses, tool calls, retrieved documents), accounts for token usage and cost per generation, and supports evaluation through human review, heuristics, or LLM-as-judge scoring. This lets you catch outputs that are fast and cheap but factually wrong or hallucinated — failures invisible to traditional monitoring.
Langfuse's core is open source under an MIT license and free to self-host, so you pay only for the infrastructure you run it on. A production self-hosted deployment is not a single container, though — it requires PostgreSQL for transactional data, ClickHouse for high-volume trace and analytics storage, Redis for queueing, and S3-compatible object storage for large payloads. If you prefer not to operate that stack, Langfuse Cloud offers a free Hobby tier and paid usage-based tiers of the same product.
All three are framework-agnostic, so none of them requires LangChain. LangSmith integrates most deeply with LangChain and LangGraph but also traces arbitrary functions via its @traceable decorator and wrap_openai wrapper, and it accepts OpenTelemetry traces from any source. Langfuse uses an @observe() decorator plus drop-in SDK wrappers and is built on OpenTelemetry semantics. Phoenix uses OpenInference, an open standard that extends OpenTelemetry with semantic conventions for LLM spans, with auto-instrumentation for OpenAI, LlamaIndex, and many other libraries. Because all three speak OpenTelemetry, you can instrument once and switch backends later.
For built-in, ready-to-run evaluations, Phoenix leads with its phoenix.evals library of prebuilt LLM-as-judge evaluators for hallucination, correctness, relevance, faithfulness, and toxicity — ideal for offline and CI-based scoring. LangSmith offers the most polished managed evaluation workflow, including hosted datasets, a one-call evaluate() runner, pairwise comparison, and human annotation queues. Langfuse supports both dataset-driven offline experiments via run_experiment and online evaluators that score sampled production traffic, with the privacy advantage of running entirely on self-hosted infrastructure. Many teams combine Phoenix in CI with Langfuse or LangSmith in production.
Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.