LLM Observability Tools Compared
/ Arvid Andersson
Once you ship LLM features to production, you need visibility into what's happening: what prompts are being sent, how much each call costs, whether quality is holding up, and where failures occur. LLM observability tools fill this gap. This post compares the leading options and what each does well.
Why LLM observability matters
Traditional application monitoring tracks request times and error rates. LLM applications add new dimensions: token costs that can spike unexpectedly, prompt quality that degrades over time, hallucinations that are hard to catch without systematic evaluation, and complex chains where a failure in one step cascades. LLM observability tools capture the full trace of each request, from prompt to completion, and give you the data to debug, optimize, and evaluate.
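To make the new dimensions concrete, here is a minimal sketch of the kind of per-request trace record these tools capture. The field names and per-token prices are illustrative assumptions, not any specific tool's schema or a provider's real price list:

```python
from dataclasses import dataclass, field
import time

# Illustrative per-1K-token prices (assumed, not real provider pricing).
PRICE_PER_1K = {"gpt-4o-mini": {"input": 0.00015, "output": 0.0006}}

@dataclass
class LLMTrace:
    """One logged LLM call: prompt, completion, tokens, timing, cost."""
    model: str
    prompt: str
    completion: str = ""
    input_tokens: int = 0
    output_tokens: int = 0
    started_at: float = field(default_factory=time.time)
    ended_at: float = 0.0

    @property
    def cost_usd(self) -> float:
        # Cost = tokens / 1000 * price-per-1K, summed over input and output.
        p = PRICE_PER_1K[self.model]
        return (self.input_tokens / 1000) * p["input"] + (self.output_tokens / 1000) * p["output"]

    @property
    def latency_s(self) -> float:
        return self.ended_at - self.started_at

trace = LLMTrace(model="gpt-4o-mini", prompt="Summarize this ticket...",
                 input_tokens=1200, output_tokens=300)
trace.completion = "The customer reports..."
trace.ended_at = trace.started_at + 0.8
print(f"${trace.cost_usd:.6f} in {trace.latency_s:.1f}s")
```

Aggregating records like this over time is what lets these tools answer "which feature spiked our bill last Tuesday" rather than just "requests were slow."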
Open-source options
Langfuse is the most established open-source option. It provides tracing, prompt management, evaluation, and cost tracking. You can self-host it or use their cloud offering. The integration works with most LLM frameworks (LangChain, LlamaIndex, OpenAI SDK) through decorators or callbacks. The self-hosted option is attractive for teams with strict data residency requirements.
Pezzo and Lunary are also open-source, with more focused feature sets. Pezzo emphasizes prompt management and versioning. Lunary focuses on agent tracing and user analytics, with built-in guardrails for content filtering.
Platform-integrated tools
LangSmith is built by the LangChain team and integrates tightly with the LangChain ecosystem. If your application already uses LangChain, LangSmith gives you tracing, evaluation datasets, and prompt playgrounds with minimal setup. The trade-off is vendor lock-in to the LangChain stack.
Braintrust takes a different angle, focusing on evaluation and experimentation. It treats LLM outputs as data that should be scored, compared, and iterated on. Braintrust is a good fit for teams that want rigorous, data-driven prompt optimization rather than just logging.
Proxy-based monitoring
Helicone works as a proxy that sits between your application and the LLM provider. You change your API base URL to point at Helicone, and it logs every request with zero code changes. This approach is appealing for teams that want observability without modifying their application code. Helicone tracks costs, latency, and usage patterns across providers.
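The proxy pattern is easiest to see side by side. In the sketch below, the request body is identical and only the base URL plus one auth header change; the gateway URL and `Helicone-Auth` header name are taken from Helicone's docs at time of writing, so treat them as assumptions to verify. Neither request is actually sent:

```python
import json
import urllib.request

payload = json.dumps({
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello"}],
}).encode()

# Direct call to the provider.
direct = urllib.request.Request(
    "https://api.openai.com/v1/chat/completions",
    data=payload,
    headers={"Authorization": "Bearer $OPENAI_API_KEY"},
)

# Same request through the proxy: only the host and one extra header differ.
proxied = urllib.request.Request(
    "https://oai.helicone.ai/v1/chat/completions",  # assumed Helicone gateway URL
    data=payload,
    headers={
        "Authorization": "Bearer $OPENAI_API_KEY",
        "Helicone-Auth": "Bearer $HELICONE_API_KEY",  # assumed header name
    },
)

print(direct.full_url, "->", proxied.full_url)
```

Because the proxy sees every request and response in full, it can log tokens, cost, and latency without any instrumentation in your application code.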
Agent-focused observability
As AI agents become more common, tools like AgentOps specialize in agent-specific observability. AgentOps records every LLM call, tool invocation, and decision point in an agent session, letting you replay runs step-by-step. It integrates with frameworks like CrewAI, AutoGen, and OpenAI Agents SDK. If you're building autonomous agents rather than simple LLM chains, agent-specific tooling gives you better visibility into multi-step reasoning and tool use.
ML platforms with LLM support
Weights & Biases started as an experiment tracking platform for ML training and has expanded into LLM observability with its Weave product. If your team already uses W&B for model training, adding LLM tracing keeps everything in one platform. Arize AI takes a similar approach, combining traditional ML monitoring with LLM-specific features like prompt tracing and embedding drift detection. Their open-source Phoenix library provides local tracing and evaluation without sending data to a cloud service.
AI gateways
Portkey combines an AI gateway with observability. It sits between your application and LLM providers, handling request routing, fallbacks, and load balancing while logging every request. This gives you both reliability features (automatic retries, provider failover) and observability (cost tracking, latency monitoring) in one layer. Useful for teams running multiple LLM providers in production.
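The failover behavior a gateway like Portkey provides can be sketched generically. This is an illustration of the pattern (retries per provider, then fall over to the next), not Portkey's actual API; the stub providers stand in for real LLM calls:

```python
import time

class ProviderError(Exception):
    """Transient provider failure, e.g. a rate limit or timeout."""

def call_with_failover(providers, prompt, retries_per_provider=2, backoff_s=0.0):
    """Try each (name, call) provider in order; retry transient errors
    with exponential backoff before failing over to the next provider."""
    last_err = None
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(prompt)
            except ProviderError as err:
                last_err = err
                time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError(f"all providers failed: {last_err}")

# Stub providers standing in for real LLM APIs.
def flaky_primary(prompt):
    raise ProviderError("rate limited")

def backup(prompt):
    return f"echo: {prompt}"

provider, result = call_with_failover(
    [("primary", flaky_primary), ("backup", backup)], "hi"
)
print(provider, result)  # backup echo: hi
```

A gateway does this in one shared layer and logs every attempt, which is why the reliability and observability features naturally live together.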
Evaluation and testing
Several tools in this space go beyond logging to offer evaluation capabilities. Humanloop provides prompt management with built-in A/B testing and evaluation workflows. Giskard focuses on AI testing, scanning models for vulnerabilities and quality issues before deployment. Patronus AI specializes in automated evaluation with LLM-as-judge approaches.
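The LLM-as-judge approach mentioned above boils down to scoring one model's output with another model against a rubric. A stdlib-only sketch with a stubbed judge (a real implementation would call an actual model and use a carefully designed rubric prompt; the score format here is an assumption):

```python
def judge_stub(judge_prompt: str) -> str:
    # Stand-in for a real judge-model call; replies in the expected format.
    return "score: 4" if "Paris" in judge_prompt else "score: 1"

def llm_as_judge(question: str, answer: str, judge=judge_stub) -> int:
    """Ask a judge model to rate an answer 1-5 and parse the numeric score."""
    rubric = (
        "Rate the answer from 1 (wrong) to 5 (correct and complete).\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with 'score: N'."
    )
    reply = judge(rubric)
    return int(reply.split("score:")[1].strip())

examples = [
    ("What is the capital of France?", "Paris."),
    ("What is the capital of France?", "Lyon."),
]
for q, a in examples:
    print(q, "->", llm_as_judge(q, a))
```

Running a harness like this over a fixed dataset before and after a prompt change is what turns "the new prompt feels better" into a measurable comparison.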
Comparison
| Tool | Best for | Open source | Key feature |
|---|---|---|---|
| Langfuse | General-purpose LLM tracing | Yes | Self-hostable, broad integrations |
| LangSmith | LangChain users | No | Deep LangChain integration |
| Helicone | Zero-code integration | Yes | Proxy-based, no code changes |
| Braintrust | Evaluation and experimentation | No | Scoring and comparison workflows |
| AgentOps | AI agent monitoring | Yes | Session replay, agent tracing |
| Humanloop | Prompt management + evaluation | No | A/B testing, prompt versioning |
| Weights & Biases | ML teams adding LLM observability | No | Unified ML + LLM experiment tracking |
| Arize AI | ML monitoring + LLM tracing | Partial (Phoenix) | Embedding drift, prompt tracing |
| Portkey | AI gateway + observability | Yes | Request routing, provider failover |
How to choose
If you want open-source and self-hosting, start with Langfuse. If you're already using LangChain, LangSmith is the path of least resistance. If you want the simplest possible setup, Helicone's proxy approach gets you logging without touching your code. For teams focused on prompt quality and evaluation, Braintrust or Humanloop offer more structured workflows. If you're building agents, AgentOps gives you the right level of detail for multi-step execution traces. If your team already uses Weights & Biases or Arize for ML monitoring, their LLM features keep everything in one place. And if you need an AI gateway with built-in observability, Portkey covers both routing and logging.
Most of these tools offer free tiers, so the practical approach is to try two or three with your actual workload before committing.