AI Engineering, Observability · 17 min read

LangSmith vs Langfuse vs Phoenix: LLM Observability

Your agent failed in prod and you can't reproduce it. Compare LangSmith, Langfuse, and Phoenix on tracing, evals, self-hosting, and cost.

LangSmith vs Langfuse vs Phoenix: Which LLM Observability Platform Should You Use?

TL;DR: LangSmith, Langfuse, and Phoenix are the three leading LLM observability platforms for tracing, debugging, and evaluating AI agents in production. LangSmith is the polished, proprietary standard built by the LangChain team. Langfuse is the open-source, OpenTelemetry-native platform that teams self-host for data control. Phoenix, from Arize, is the free, notebook-first tool with the strongest built-in evaluation engine. Choose LangSmith for managed convenience and LangGraph integration, Langfuse for self-hosted production at scale, and Phoenix for fast local debugging and offline evals.

Key Takeaways

  • LLM observability is the "day-2" problem of agent engineering: once an agent runs autonomously across dozens of steps, you cannot debug it without a trace of every prompt, tool call, token count, and latency measurement.
  • All three platforms now speak OpenTelemetry, so instrumentation is no longer a lock-in decision — you can emit traces once and route them to a different backend later.
  • LangSmith offers the smoothest managed experience and deepest LangChain/LangGraph integration, but self-hosting is gated behind its Enterprise plan.
  • Langfuse is the most production-ready open-source option, with a first-class self-hosting story (Postgres + ClickHouse + Redis + object storage) and an MIT-licensed core.
  • Phoenix is the fastest path to a working trace view — one `pip install` and a `register()` call — and ships the most complete library of prebuilt LLM-as-judge evaluators.
  • Evaluation, not just tracing, is the real differentiator in 2026: the platform that makes it cheap to run hallucination, correctness, and faithfulness evals on production traces wins your debugging loop.

Why Does Your Agent Fail Silently in Production?

You built an agent. It passed every test in your notebook. You shipped it. Three days later a user reports it returned a confidently wrong answer, and you have no idea why. The agent made eleven tool calls, retrieved four documents, summarized two of them, and produced a paragraph of plausible nonsense. Your application logs show a single line: 200 OK, 4.2s.

This is the day-2 problem of agent engineering, and it is the reason "Your Agent Failed in Prod. Good Luck Reproducing It" resonates with every developer who has shipped an LLM system. Traditional logging was designed for deterministic software, where the same input produces the same output and a stack trace points at the failing line. Agents break both assumptions. They are non-deterministic, so you cannot simply re-run the request and watch it fail again. And they fail semantically — the code executes perfectly while the meaning of the output is wrong.

LLM observability platforms exist to close this gap. They capture the full execution tree of an agent run — every prompt, every model response, every tool invocation, every retrieved chunk, with token counts, costs, and latencies attached — and let you replay, inspect, and score it after the fact. The three tools that dominate this space in 2026 are LangSmith, Langfuse, and Arize Phoenix. They overlap heavily in capability but differ sharply in philosophy, licensing, and operational model. This guide compares them on the dimensions that actually decide which one belongs in your stack.

What Is LLM Observability, and How Does It Differ from Traditional APM?

Application Performance Monitoring (APM) tools like Datadog, New Relic, and Honeycomb answer questions about system health: latency percentiles, error rates, throughput, resource saturation. They are indispensable, and most of them now have LLM add-ons. But they were built around the assumption that a "good" request is a fast request that returns a 200.

LLM observability adds a second axis that APM never had to consider: output quality. A request can be fast, cheap, and return HTTP 200 while producing an answer that is factually wrong, off-policy, or hallucinated. To capture this, LLM observability platforms add three capabilities that classic APM lacks:

  1. Hierarchical, semantic traces. A single agent request becomes a tree of nested spans — the parent run, child LLM generations, tool calls, and retriever lookups — each annotated with the actual prompt text, the model's response, the model name, and token usage. You can expand any node and read exactly what the model saw.
  2. Cost and token accounting. Every generation span records input and output tokens and computes cost per model. This rolls up to per-trace, per-user, and per-feature cost dashboards — the data you need to find the one prompt that is quietly burning your budget. (See our guide on context engineering for how cache design turns those token counts into 10x cost swings.)
  3. Evaluation and scoring. This is the defining feature. Platforms let you attach scores to traces — from human annotators, heuristic checks, or LLM-as-judge evaluators — so you can measure whether outputs are actually good, not just fast. Evals run both offline (against a fixed dataset, in CI) and online (sampling live production traffic).
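
To make the trace-tree and cost roll-up ideas concrete, here is a minimal, library-free sketch. The `Span` class, the `example-model` name, and the prices are all hypothetical; real platforms record far more per span and maintain their own price tables.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical per-1K-token prices in USD, for illustration only.
PRICE_PER_1K = {"example-model": {"input": 0.001, "output": 0.002}}

@dataclass
class Span:
    name: str
    kind: str  # "agent", "llm", "tool", or "retriever"
    model: Optional[str] = None
    input_tokens: int = 0
    output_tokens: int = 0
    children: list["Span"] = field(default_factory=list)

    def own_cost(self) -> float:
        """Cost of this span alone; spans without a model cost nothing here."""
        if self.model is None:
            return 0.0
        price = PRICE_PER_1K[self.model]
        return (self.input_tokens * price["input"]
                + self.output_tokens * price["output"]) / 1000

    def total_cost(self) -> float:
        """Roll cost up from every descendant to this span."""
        return self.own_cost() + sum(c.total_cost() for c in self.children)

    def render(self, depth: int = 0) -> list[str]:
        """Return the tree as indented lines, one per span."""
        line = f"{'  ' * depth}{self.kind}:{self.name} ${self.total_cost():.4f}"
        return [line] + [row for c in self.children for row in c.render(depth + 1)]

trace = Span("answer_question", "agent", children=[
    Span("search_docs", "retriever"),
    Span("plan", "llm", model="example-model", input_tokens=1000, output_tokens=500),
    Span("summarize", "llm", model="example-model", input_tokens=2000, output_tokens=250),
])
print("\n".join(trace.render()))
```

The parent span's cost is the sum of its children's, which is the same roll-up that feeds per-trace and per-user cost dashboards.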

With that framing, here is how the three platforms stack up.

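How Do LangSmith, Langfuse, and Phoenix Compare at a Glance?

  • LangSmith: proprietary; hosted SaaS, with self-hosting on the Enterprise plan only; strongest on LangChain/LangGraph integration and managed evaluation workflows.
  • Langfuse: open source (MIT-licensed core); self-hosted or Langfuse Cloud; strongest as an always-on production backend where data must stay in your own infrastructure.
  • Phoenix: open source and free; runs locally from a `pip install` or a single container; strongest on prebuilt LLM-as-judge evaluators and fast local debugging.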

What Is LangSmith and Who Is It For?

LangSmith is the observability and evaluation platform built by the LangChain team. It is the most mature and polished of the three, and it is proprietary: there is no open-source LangSmith you can `git clone`. You use it as a hosted SaaS, or, on the Enterprise plan, as a self-managed deployment inside your own cloud.

Its biggest strength is integration. If you build with LangChain or LangGraph, tracing is nearly automatic: set two environment variables and every chain, tool, and agent step is captured with zero code changes. But LangSmith is not LangChain-only. Its `@traceable` decorator instruments any Python or TypeScript function, and `wrap_openai` transparently traces direct OpenAI SDK calls without touching the rest of your code.

Tracing is enabled purely through environment variables, which means you can turn it on in production without changing application code.
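
A minimal sketch of that configuration, written in Python so it can sit at the top of an entry point before any LangChain import. In practice you would usually set these in your deployment environment instead. The variable names are assumptions based on current LangSmith documentation (older SDK versions used `LANGCHAIN_`-prefixed names), the key is a placeholder, and `tracing_enabled` is a hypothetical helper.

```python
import os

# Assumption: names follow current LangSmith docs; older releases used
# LANGCHAIN_TRACING_V2 and LANGCHAIN_API_KEY. Verify for your SDK version.
os.environ["LANGSMITH_TRACING"] = "true"

# Placeholder value; setdefault avoids overwriting a real key already set.
os.environ.setdefault("LANGSMITH_API_KEY", "<your-langsmith-api-key>")

# Optional, also an assumption: group traces under a named project.
os.environ.setdefault("LANGSMITH_PROJECT", "my-agent")

def tracing_enabled() -> bool:
    """Hypothetical helper: report whether the tracing flag is switched on."""
    return os.environ.get("LANGSMITH_TRACING", "").lower() == "true"

print("LangSmith tracing enabled:", tracing_enabled())
```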

Recognizing that teams do not want to be locked into its proprietary SDK, LangSmith now also accepts traces over OpenTelemetry. You can point any OTel exporter at its ingest endpoint.
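
As a hedged sketch, the standard OpenTelemetry exporter variables can carry that configuration. The variable names come from the OpenTelemetry specification; the endpoint URL and the `x-api-key` header are assumptions drawn from LangSmith's OTel documentation and should be confirmed for your region and plan. `langsmith_otlp_env` is a hypothetical helper.

```python
import os

def langsmith_otlp_env(api_key: str) -> dict[str, str]:
    """Build OpenTelemetry exporter variables aimed at LangSmith.

    Endpoint and header name are assumptions; check LangSmith's docs.
    """
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": "https://api.smith.langchain.com/otel",
        "OTEL_EXPORTER_OTLP_HEADERS": f"x-api-key={api_key}",
    }

env = langsmith_otlp_env("<your-langsmith-api-key>")  # placeholder key
os.environ.update(env)  # any OTLP exporter started afterwards picks these up
for name in env:
    print(name, "is set")
```

Because these are exporter settings rather than application code, pointing the same spans at a different backend is a configuration change.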

LangSmith's evaluation suite is excellent: managed datasets, an `evaluate()` runner, LLM-as-judge evaluators, pairwise comparison, and annotation queues for human review. Its main trade-offs are that the full platform is closed-source and that self-hosting is reserved for Enterprise contracts, a real constraint for teams in regulated industries who need data residency on a startup budget.

What Is Langfuse and Why Do Teams Self-Host It?

Langfuse is the open-source answer to LangSmith. Its core is MIT-licensed, and self-hosting is a first-class, documented path rather than an enterprise afterthought. This is the single biggest reason teams choose it: when your traces contain user PII, medical data, or proprietary prompts, "send everything to a third-party SaaS" is often a non-starter. With Langfuse you run the whole platform inside your own VPC.

Instrumentation mirrors the others. The `@observe()` decorator wraps any function, and Langfuse ships drop-in replacements for popular SDKs so you can trace existing code by changing a single import (for example, `from langfuse.openai import openai`).
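
A minimal sketch of the decorator pattern, assuming the Langfuse Python SDK v3 import path. The `retrieve` and `answer` functions are hypothetical stand-ins, and a no-op fallback keeps the example runnable when Langfuse is not installed. Sending real traces also requires Langfuse credentials in the environment, which are not shown here.

```python
try:
    # Assumption: SDK v3 exposes the decorator at the top level
    # (v2 used `from langfuse.decorators import observe`).
    from langfuse import observe
except ImportError:
    # Fallback so the sketch runs without Langfuse: a no-op decorator
    # that supports both @observe and @observe(...).
    def observe(func=None, **_kwargs):
        def wrap(f):
            return f
        return wrap(func) if callable(func) else wrap

@observe()
def retrieve(question: str) -> list[str]:
    """Hypothetical retriever; a real one would query a vector store."""
    return [f"doc about {question}"]

@observe()
def answer(question: str) -> str:
    """Hypothetical agent step; the nested call becomes a child span."""
    docs = retrieve(question)
    return f"Answer to {question!r} based on {len(docs)} document(s)"

print(answer("refund policy"))
```

Decorating both the outer and inner function is what produces the nested trace: the call to `retrieve` is recorded under the `answer` span.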

Langfuse is built on OpenTelemetry semantics from the ground up, so traces from any OTel-instrumented service flow in natively. Its evaluation model is dataset-driven: you define a dataset, a task to run over each item, and evaluators that score each output.
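
The sketch below illustrates the dataset-driven pattern in plain Python. It deliberately does not call Langfuse: `run_local_experiment`, `task`, and `exact_match` are hypothetical names, and Langfuse's own `run_experiment` has its own signature that you should take from its documentation.

```python
from statistics import mean
from typing import Callable

# Hypothetical dataset: each item pairs an input with an expected output.
dataset = [
    {"input": "2 + 2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
    {"input": "3 * 3", "expected_output": "9"},
]

def task(item: dict) -> str:
    """Stand-in for the agent under test; a real task would call your LLM."""
    canned = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return canned[item["input"]]

def exact_match(item: dict, output: str) -> float:
    """Heuristic evaluator: 1.0 when the output equals the expected answer."""
    return 1.0 if output == item["expected_output"] else 0.0

def run_local_experiment(data, task_fn, evaluators: dict[str, Callable]) -> dict:
    """Run the task over every item and average each evaluator's scores."""
    rows = []
    for item in data:
        output = task_fn(item)
        scores = {name: fn(item, output) for name, fn in evaluators.items()}
        rows.append({"input": item["input"], "output": output, **scores})
    summary = {name: mean(r[name] for r in rows) for name in evaluators}
    return {"rows": rows, "summary": summary}

result = run_local_experiment(dataset, task, {"exact_match": exact_match})
print(result["summary"])
```

The per-row scores are what let you drill from a falling aggregate down to the specific inputs that regressed.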

The catch with self-hosting is operational weight. A production Langfuse deployment is not a single container — it needs PostgreSQL (transactional data), ClickHouse (the high-volume trace/analytics store), Redis (queueing), and S3-compatible object storage (large payloads). That is a real stack to run, monitor, and back up. For teams that want Langfuse without the ops burden, Langfuse Cloud offers a free Hobby tier and paid usage tiers — the same open-source product, managed.

What Is Arize Phoenix and How Does Its Eval Engine Work?

Phoenix, from Arize, optimizes for a different moment: the developer staring at a broken agent who wants a trace view right now, with no account, no cloud, and no infrastructure. It is fully open source and free, and it is the fastest of the three to stand up. One install and one function call give you a local tracing server with a UI.
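
A minimal sketch of that setup, assuming `pip install arize-phoenix` plus the OpenInference instrumentation package for each library you want traced. `start_local_tracing` and the project name are hypothetical; the imports are deferred so the snippet loads even where Phoenix is not installed.

```python
def start_local_tracing(project_name: str = "my-agent"):
    """Launch a local Phoenix UI and register an OpenTelemetry tracer provider.

    Assumes the arize-phoenix package is installed; imports are deferred so
    defining this function has no side effects.
    """
    import phoenix as px
    from phoenix.otel import register

    session = px.launch_app()  # starts the local server and UI
    tracer_provider = register(
        project_name=project_name,  # hypothetical project name
        auto_instrument=True,       # wire up installed OpenInference integrations
    )
    return session, tracer_provider
```

Call `session, provider = start_local_tracing()` once at startup, before the libraries you want traced make their first call.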

Phoenix is built on OpenInference, Arize's open standard that extends OpenTelemetry with semantic conventions for LLM spans. Because it is OTel-native, the same traces can later be shipped to Arize's commercial cloud platform (Arize AX) or any OTel backend: instrument once, decide the backend later. For containerized setups, the entire server ships as a single image.

Where Phoenix genuinely pulls ahead is evaluation. Its `phoenix.evals` library ships battle-tested, prebuilt evaluators (hallucination, Q&A correctness, RAG relevance, toxicity, faithfulness) that run as LLM-as-judge classifiers over a dataframe of traces.
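
To show what an LLM-as-judge classifier does, here is a library-free sketch of the pattern rather than Phoenix's own API. The prompt, the label set, and `stub_judge` (which stands in for a real model call so the example runs offline) are all hypothetical.

```python
from collections import Counter
from typing import Callable

RAILS = ("factual", "hallucinated")  # the only labels accepted from the judge

PROMPT = (
    "Reference:\n{reference}\n\nAnswer:\n{output}\n\n"
    "Is the answer supported by the reference? Reply 'factual' or 'hallucinated'."
)

def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM call (hypothetical).

    Marks an answer factual only if it appears verbatim in the reference,
    which is far cruder than a real judge model.
    """
    reference = prompt.split("Reference:\n")[1].split("\n\nAnswer:\n")[0]
    output = prompt.split("\n\nAnswer:\n")[1].split("\n\nIs the answer")[0]
    return "Factual." if output.lower() in reference.lower() else "HALLUCINATED"

def classify(rows: list[dict], judge: Callable[[str], str]) -> list[dict]:
    """Label each row, snapping free-form judge text onto the rails."""
    labeled = []
    for row in rows:
        raw = judge(PROMPT.format(**row)).strip().lower().rstrip(".")
        label = raw if raw in RAILS else "unparseable"
        labeled.append({**row, "label": label})
    return labeled

rows = [
    {"reference": "Refunds are issued within 14 days.", "output": "within 14 days"},
    {"reference": "Refunds are issued within 14 days.", "output": "within 90 days"},
]
labels = Counter(r["label"] for r in classify(rows, stub_judge))
print(dict(labels))
```

Constraining the judge to a fixed label set is the key design choice: anything off the rails is flagged as unparseable instead of being silently counted.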

This dataframe-centric workflow makes Phoenix a natural fit for offline experiments, regression testing in CI, and research notebooks. Its trade-off is the mirror image of Langfuse's strength: Phoenix is lighter and less oriented toward being a multi-tenant, always-on production backend with long-term retention, team RBAC, and SLAs. Many teams use it precisely as a local-and-CI eval tool alongside a heavier production tracer.

How Does Tracing Setup Differ Across the Three?

The instrumentation philosophies converge more than they diverge in 2026 — all three are OTel-compatible — but the developer ergonomics differ:

  • LangSmith is environment-variable-driven. If you are on LangChain, you write no tracing code; otherwise you sprinkle `@traceable` and use `wrap_openai`. The path of least resistance leads to its SaaS.
  • Langfuse centers on the `@observe()` decorator and import-swap SDK wrappers (`from langfuse.openai import openai`). It is explicit and framework-agnostic, with OTel as the foundation rather than an add-on.
  • Phoenix leans hardest into auto-instrumentation: `register(auto_instrument=True)` detects installed OpenInference integrations and wires them up with zero per-function decoration.

The strategic takeaway: because every platform now ingests OpenTelemetry, instrumentation is no longer a one-way door. Emit OTel/OpenInference spans from your application, and you can switch backends — or run two in parallel — without rewriting your agent. Avoid coding directly against a single vendor's proprietary span format unless you are certain you will never migrate.

How Do You Run Evaluations on Each Platform?

Tracing tells you what happened; evaluation tells you whether it was good. This is where your choice matters most, because the eval loop is where you actually spend your debugging time.

  • Phoenix has the deepest out-of-the-box library. Prebuilt evaluators for hallucination, correctness, relevance, and toxicity mean you can score traces in minutes without writing prompt templates. Its `ClassificationEvaluator` and `llm_classify` are designed for high-throughput, dataframe-based scoring — ideal for offline batch evals and CI gates.
  • LangSmith offers the most polished managed eval workflow: hosted datasets, a one-call `evaluate()` runner, built-in and custom evaluators, pairwise/preference scoring, and human annotation queues with reviewer assignment. If your evaluation process involves a team of human reviewers, LangSmith's UI is the most refined.
  • Langfuse sits in between, with dataset-driven `run_experiment` for offline evals and configurable online evaluators that score a sampled percentage of live production traces. Because it is self-hostable, you can run LLM-as-judge evals on sensitive data without it ever leaving your network.

A practical pattern many teams adopt: use Phoenix in CI for fast, free, offline regression evals on every pull request, and use Langfuse or LangSmith in production for continuous online evaluation and long-term trace retention.
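
The CI half of that pattern reduces to a threshold check on eval results. A minimal sketch, assuming your eval run yields one label per example; the `gate` and `pass_rate` helpers, the labels, and the 90% threshold are hypothetical.

```python
def pass_rate(labels: list[str], passing: str = "factual") -> float:
    """Fraction of evaluated examples whose label counts as a pass."""
    if not labels:
        raise ValueError("no eval results to gate on")
    return sum(label == passing for label in labels) / len(labels)

def gate(labels: list[str], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    rate = pass_rate(labels)
    print(f"pass rate {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1

# Hypothetical labels, as an offline eval run might produce them.
example_labels = ["factual"] * 19 + ["hallucinated"] * 1
exit_code = gate(example_labels)
# In a CI script you would finish with: sys.exit(exit_code)
```

Raising on an empty result set matters here: a broken eval job that produces no labels should fail the build, not pass it vacuously.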

Which Platform Wins on Self-Hosting and Data Privacy?

For teams with strict data-residency or compliance requirements, this dimension can override every other consideration.

  • Langfuse is the clear winner for production self-hosting. It is open source, explicitly designed to run in your own infrastructure, and documented for Docker, Docker Compose, Kubernetes, and Helm. The cost is operational: you run and maintain Postgres, ClickHouse, Redis, and object storage.
  • Phoenix is the easiest to self-host for development and lighter production. A single container or `pip install` gets you running, backed by SQLite locally or Postgres for persistence. It is less of a full-fledged, multi-tenant production backend than Langfuse but dramatically simpler to operate.
  • LangSmith can be self-hosted, but only on its Enterprise plan. For most startups and individual teams, LangSmith effectively means "send your traces to LangChain's cloud." If that is acceptable for your data, it is the smoothest experience; if it is not, LangSmith is off the table at smaller scales.

How Much Does Each Platform Cost?

Pricing tiers shift frequently, so treat the summary below as a description of each pricing model rather than a quote, and always confirm current numbers on each vendor's pricing page.

The headline: Phoenix is free, Langfuse is free if you self-host, and LangSmith trades dollars for not having to run anything. For a solo developer or early-stage team, Phoenix or self-hosted Langfuse costs nothing but engineering time. For a funded team that values managed convenience and is comfortable with a SaaS, LangSmith's per-seat pricing is often worth it.

Which LLM Observability Tool Should You Choose?

There is no universal winner — the right choice maps to your constraints:

  • Choose LangSmith if you build primarily with LangChain or LangGraph, want the most polished managed UI, value human-in-the-loop annotation workflows, and are comfortable sending traces to a third-party cloud (or can afford the Enterprise self-host plan).
  • Choose Langfuse if you need production-grade observability and must self-host for privacy, compliance, or cost reasons. It is the best open-source platform for a real, always-on production backend — provided you can operate its data stack.
  • Choose Phoenix if you want the fastest possible start, the strongest built-in evaluation library, and a free tool for local debugging and CI-based offline evals. It pairs naturally with a heavier production tracer.

And remember the meta-point: instrument with OpenTelemetry / OpenInference rather than a proprietary SDK where you can. That keeps the decision reversible — you can start with Phoenix locally, add Langfuse in production, and evaluate LangSmith later, all without re-instrumenting your agent. The frameworks you build on (see our AI Agent Frameworks Complete Guide and the AgentCore vs LangChain comparison) will keep changing; a vendor-neutral observability layer is the part of your stack worth protecting.

Frequently Asked Questions

What is the difference between LLM observability and traditional monitoring?

Traditional monitoring (APM) measures system health — latency, error rates, throughput, and resource usage — and assumes a fast request returning HTTP 200 is a good request. LLM observability adds a quality axis that APM lacks. It captures the full semantic content of each agent step (prompts, model responses, tool calls, retrieved documents), accounts for token usage and cost per generation, and supports evaluation through human review, heuristics, or LLM-as-judge scoring. This lets you catch outputs that are fast and cheap but factually wrong or hallucinated — failures invisible to traditional monitoring.

Is Langfuse really free, and what does self-hosting actually require?

Langfuse's core is open source under an MIT license and free to self-host, so you pay only for the infrastructure you run it on. A production self-hosted deployment is not a single container, though — it requires PostgreSQL for transactional data, ClickHouse for high-volume trace and analytics storage, Redis for queueing, and S3-compatible object storage for large payloads. If you prefer not to operate that stack, Langfuse Cloud offers a free Hobby tier and paid usage-based tiers of the same product.

Can I use these observability tools without LangChain?

Yes. All three are framework-agnostic. LangSmith integrates most deeply with LangChain and LangGraph but also traces arbitrary functions via its `@traceable` decorator and `wrap_openai` wrapper, and it accepts OpenTelemetry traces from any source. Langfuse uses an `@observe()` decorator plus drop-in SDK wrappers and is built on OpenTelemetry semantics. Phoenix uses OpenInference, Arize's open standard that extends OpenTelemetry with semantic conventions for LLM spans, with auto-instrumentation for OpenAI, LlamaIndex, and many other libraries. Because all three speak OpenTelemetry, you can instrument once and switch backends later.

Which platform is best for evaluating AI agents, not just tracing them?

For built-in, ready-to-run evaluations, Phoenix leads with its `phoenix.evals` library of prebuilt LLM-as-judge evaluators for hallucination, correctness, relevance, faithfulness, and toxicity, which suits offline and CI-based scoring. LangSmith offers the most polished managed evaluation workflow, including hosted datasets, a one-call `evaluate()` runner, pairwise comparison, and human annotation queues. Langfuse supports both dataset-driven offline experiments via `run_experiment` and online evaluators that score sampled production traffic, with the privacy advantage of running entirely on self-hosted infrastructure. Many teams combine Phoenix in CI with Langfuse or LangSmith in production.

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI. He shares hands-on technical guides and framework comparisons at fp8.co.

Cite this Article

Aaron. "LangSmith vs Langfuse vs Phoenix: LLM Observability." fp8.co, June 10, 2026. https://fp8.co/articles/LangSmith-vs-Langfuse-vs-Phoenix-LLM-Agent-Observability

Related Articles

AgentCore vs LangChain: 2026 Framework Guide

Compare AgentCore and LangChain for AI agents. Architecture, pricing, and deployment trade-offs explained with code.

AI Engineering, Agent Frameworks

Context Engineering for AI Agents: 6 Techniques That Cut Our Costs 10x

One misplaced timestamp invalidated our entire KV cache and 10x'd our bill. Here are 6 context engineering patterns from Manus and production agent teams that prevent exactly this -- with code examples.

AI Engineering, Agent Frameworks

6 AI Agent Frameworks Tested: Only 2 Production-Ready in 2026

LangChain, AgentCore, LangGraph, CrewAI, AutoGen, and Strands compared on architecture, scaling, and production readiness. See which frameworks actually work for enterprise agents.

AI Agent Development, Framework Comparison