LLM Observability Tools Compared (2026)
By Arvid Andersson
Once you ship LLM features to production, you need visibility into what's happening: what prompts are being sent, how much each call costs, whether quality is holding up, and where failures occur. LLM observability tools fill this gap. This post compares the leading options and what each does well.
At a glance
Feature sets evolve month to month. Verified April 2026.
- Self-hosted, open-source tracing. Langfuse. MIT-licensed, self-hostable, broad integrations (LangChain, OpenAI SDK, LiteLLM). Acquired by ClickHouse in January 2026, MIT license unchanged. Trade-off: long-term roadmap is now tied to a database company.
- Zero-code integration. Helicone. Proxy-based, change one line (base URL). Logs every request with cost tracking across 300+ models. Trade-off: proxy adds a network hop, and all traffic routes through it.
- LangChain-native. LangSmith. Deepest integration with LangChain, includes prompt playground and agent deployment (Fleet). Trade-off: proprietary, and the tight LangChain coupling is a lock-in risk if you switch frameworks.
- Evaluation-first. Braintrust. Custom database for AI traces, scoring and comparison workflows, Loop agent for automated evals. $80M Series B (Feb 2026). Trade-off: less focus on real-time production monitoring, more on offline evaluation.
- AI gateway with built-in observability. Portkey. Fully open-sourced gateway (March 2026) with routing, fallbacks, caching, and logging. Trade-off: adds a gateway layer, which is more infrastructure to manage.
- Agent-specific observability. AgentOps. Session replay, time-travel debugging for multi-step agents. Integrates with CrewAI, AutoGen, OpenAI Agents SDK. Trade-off: narrow focus, less useful for simple LLM call monitoring.
- Budget-friendly open-source. Opik (by Comet). Apache-2.0, built-in evaluation metrics, agent optimizer SDK. Backed by an established MLOps company. Trade-off: newer in the LLM observability space, smaller community than Langfuse.
- Agent testing and simulation. LangWatch. Open-source, Netherlands-based. Runs thousands of synthetic conversations across scenarios, languages, and edge cases. DSPy integration for automated prompt optimization. Trade-off: smaller community than Langfuse or Phoenix, less established track record.
Why LLM observability matters
Traditional application monitoring tracks request times and error rates. LLM applications add new dimensions: token costs that can spike unexpectedly, prompt quality that degrades over time, hallucinations that are hard to catch without systematic evaluation, and complex chains where a failure in one step cascades. LLM observability tools capture the full trace of each request, from prompt to completion, and give you the data to debug, optimize, and evaluate.
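To make that concrete, here is an illustrative, tool-agnostic sketch of the kind of per-request record these tools capture, using the OpenAI Python SDK. The field names are hypothetical, not any vendor's schema:

```python
# Illustrative only: the kind of per-request record an LLM
# observability tool captures. Field names are hypothetical.
import time
import uuid

from openai import OpenAI

client = OpenAI()

def traced_completion(model: str, prompt: str) -> dict:
    start = time.time()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return {
        "trace_id": str(uuid.uuid4()),
        "model": model,
        "prompt": prompt,
        "completion": response.choices[0].message.content,
        "input_tokens": response.usage.prompt_tokens,
        "output_tokens": response.usage.completion_tokens,
        "latency_s": round(time.time() - start, 3),
        # Cost attribution: token counts * per-model rates.
    }
```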
Open-source options
Langfuse is the most established open-source option. It provides tracing, prompt management, evaluation, and cost tracking, with integrations across most LLM frameworks (LangChain, LlamaIndex, OpenAI SDK). ClickHouse acquired Langfuse in January 2026 alongside a $400M Series D. The MIT license and self-hosting remain unchanged. Self-hosting is attractive for teams with strict data residency requirements.
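As a sketch of what instrumentation looks like, Langfuse's Python SDK exposes an `@observe` decorator that records a function's inputs, outputs, and timing as a trace. The import path below matches the v2-era SDK; newer versions expose `observe` elsewhere, so check the current docs:

```python
from langfuse.decorators import observe  # v2 SDK import path
from openai import OpenAI

client = OpenAI()

@observe()  # records this call's inputs, outputs, and timing as a trace
def answer(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```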
Opik (by Comet) is an Apache-2.0 option with built-in evaluation metrics for hallucination, moderation, and relevance. It includes an agent optimizer SDK for automated prompt tuning. Backed by Comet, an established MLOps company, so the project has staying power.
Lunary is also open-source (Apache-2.0), focused on chatbot and RAG evaluation with the cheapest paid tier in the category at $20/user/month. Good for smaller teams that want observability without a large budget.
Platform-integrated tools
LangSmith is built by the LangChain team and integrates tightly with the LangChain ecosystem. If your application already uses LangChain, LangSmith gives you tracing, evaluation datasets, and prompt playgrounds with minimal setup. The trade-off is vendor lock-in to the LangChain stack.
Braintrust takes a different angle, focusing on evaluation and experimentation. It treats LLM outputs as data that should be scored, compared, and iterated on. After raising an $80M Series B in February 2026 (customers include Notion, Replit, Cloudflare, and Ramp), the team built Brainstore, a custom database designed for querying complex AI traces from multi-step agents. Braintrust is a good fit for teams that want rigorous, data-driven prompt optimization rather than just logging.
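A minimal sketch of Braintrust's evaluation pattern, following its documented `Eval` entry point; the project name and `call_my_model` function are hypothetical placeholders:

```python
from braintrust import Eval
from autoevals import Levenshtein  # off-the-shelf string-similarity scorer

def call_my_model(prompt: str) -> str:
    # Hypothetical stand-in for your actual LLM call.
    return "4"

# Scores each (input, expected) pair and logs a comparable experiment.
Eval(
    "arithmetic-demo",  # hypothetical project name
    data=lambda: [{"input": "What is 2+2?", "expected": "4"}],
    task=call_my_model,
    scores=[Levenshtein],
)
```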
Proxy-based monitoring
Helicone works as a proxy that sits between your application and the LLM provider. You change your API base URL to point at Helicone, and it logs every request with zero code changes. This approach is appealing for teams that want observability without modifying their application code. Helicone tracks costs, latency, and usage patterns across providers.
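The one-line change looks roughly like this with the OpenAI Python SDK; the base URL and `Helicone-Auth` header follow Helicone's documented pattern, but verify both against the current docs:

```python
import os

from openai import OpenAI

# Same SDK, same calls; only the base URL (and an auth header) change.
client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
    },
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}],
)
```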
Agent-focused observability
As AI agents become more common, tools like AgentOps specialize in agent-specific observability. AgentOps records every LLM call, tool invocation, and decision point in an agent session, letting you replay runs step-by-step. It integrates with frameworks like CrewAI, AutoGen, and OpenAI Agents SDK. If you're building autonomous agents rather than simple LLM chains, agent-specific tooling gives you better visibility into multi-step reasoning and tool use.
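Setup is typically a single init call; the sketch below assumes a recent AgentOps SDK, where supported frameworks are auto-instrumented after `agentops.init`:

```python
import os

import agentops

# One call before the agent runs; supported frameworks (CrewAI,
# AutoGen, OpenAI Agents SDK) are auto-instrumented, and each run
# appears in the dashboard as a replayable session.
agentops.init(api_key=os.environ["AGENTOPS_API_KEY"])
```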
HoneyHive covers similar ground with distributed tracing and drift detection for production agents, plus flexible deployment options (SaaS, hybrid, or self-hosted). Useful for teams that need on-premises deployment alongside agent observability.
ML platforms with LLM support
Weights & Biases started as an experiment tracking platform for ML training and has expanded into LLM observability with its Weave product. If your team already uses W&B for model training, adding LLM tracing keeps everything in one platform. Arize AI takes a similar approach, combining traditional ML monitoring with LLM-specific features like prompt tracing and embedding drift detection. Their open-source Phoenix library provides local tracing and evaluation without sending data to a cloud service.
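As a sketch, Weave's tracing is decorator-based: `weave.init` names the project and `@weave.op` marks functions to trace. Exact APIs may differ across versions, so treat this as illustrative:

```python
import weave
from openai import OpenAI

weave.init("my-llm-app")  # project name; traces land in the W&B UI

client = OpenAI()

@weave.op()  # logs inputs, outputs, and the call hierarchy
def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.choices[0].message.content
```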
AI gateways
Portkey combines an AI gateway with observability. It sits between your application and LLM providers, handling request routing, fallbacks, and load balancing while logging every request. In March 2026, Portkey fully open-sourced its production gateway, including features that previously required a SaaS subscription: circuit breakers, semantic caching, budget limits, and real-time metrics. The gateway processes over 1 trillion tokens daily across 24,000+ organizations. Portkey also added an MCP Gateway for governing AI agents across enterprise tools. Useful for teams running multiple LLM providers in production who want both reliability and observability in one open-source layer.
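If you run the open-source gateway locally, pointing the OpenAI SDK at it looks roughly like this; the default port and `x-portkey-provider` header are assumptions based on Portkey's docs at the time of writing, so verify before relying on them:

```python
import os

from openai import OpenAI

# A locally run gateway (e.g. `npx @portkey-ai/gateway`) listens on
# port 8787 by default; the upstream provider is chosen per request
# via a header. Port and header name: assumptions from current docs.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="http://localhost:8787/v1",
    default_headers={"x-portkey-provider": "openai"},
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```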
Evaluation and testing
Several tools in this space go beyond logging to offer evaluation capabilities. Giskard focuses on AI testing, scanning models for vulnerabilities and quality issues before deployment. Patronus AI specializes in automated evaluation with LLM-as-judge approaches. (Note: Humanloop, previously listed here for prompt management and A/B testing, was acqui-hired by Anthropic in August 2025 and shut down in September 2025. Separately, Traceloop, whose open-source OpenLLMetry library provides OpenTelemetry instrumentation used by several tools on this page, was acquired by ServiceNow in March 2026.)
LangWatch bridges evaluation and observability with agent simulation. It runs thousands of synthetic conversations across scenarios, languages, and edge cases, then feeds results back into prompt optimization via DSPy integration. Netherlands-based and open-source. Useful for teams that want to test agent behavior systematically before shipping, not just monitor after the fact.
Comparison
| Tool | Best for | Open source | Key feature |
|---|---|---|---|
| Langfuse | General-purpose LLM tracing | Yes (MIT) | Self-hostable, acquired by ClickHouse (Jan 2026) |
| LangSmith | LangChain users | No | Deep LangChain integration, Fleet agent deployment |
| Helicone | Zero-code integration | Yes | Proxy-based, cost tracking across 300+ models |
| Braintrust | Evaluation and experimentation | No | Custom trace database, Loop agent for automated evals |
| AgentOps | AI agent monitoring | Yes | Session replay, time-travel debugging |
| Weights & Biases (Weave) | ML teams adding LLM observability | Yes (Apache-2.0) | Unified ML + LLM experiment tracking |
| Arize AI (Phoenix) | ML monitoring + LLM tracing | Yes | Embedding clustering, OpenTelemetry-native |
| Portkey | AI gateway + observability | Yes | Fully open-sourced gateway, request routing |
| Opik (Comet) | Open-source eval + tracing | Yes (Apache-2.0) | Built-in metrics, agent optimizer SDK |
| Lunary | Chatbot and RAG teams | Yes (Apache-2.0) | Cheapest paid tier ($20/user/mo) |
| HoneyHive | Production agent monitoring | No | Flexible deployment (SaaS, hybrid, self-hosted) |
| LangWatch | Agent testing + simulation | Yes | Synthetic conversation testing, DSPy integration |
How to choose
If you want open-source and self-hosting, start with Langfuse. If you're already using LangChain, LangSmith is the path of least resistance. If you want the simplest possible setup, Helicone's proxy approach gets you logging without touching your code. For teams focused on prompt quality and evaluation, Braintrust offers structured scoring and comparison workflows. If you're building agents, AgentOps or HoneyHive give you the right level of detail for multi-step execution traces. If your team already uses Weights & Biases or Arize for ML monitoring, their LLM features keep everything in one place. If you need an AI gateway with built-in observability, Portkey's fully open-source gateway covers both routing and logging. For budget-conscious teams building chatbots or RAG pipelines, Lunary's $20/user/month tier is the cheapest entry point. Opik is worth evaluating if you want open-source with built-in evaluation metrics and prompt optimization. And LangWatch is a good fit for teams that want to test agent behavior with synthetic conversations before deploying.
Most of these tools offer free tiers, so the practical approach is to try two or three with your actual workload before committing.
See how these tools fit together
Observability is one layer. These stacks show how to combine it with inference, frameworks, and vector databases for a complete setup.
Frequently asked questions
What is the best open-source LLM observability tool?
Langfuse is the most established, with MIT licensing, self-hosting, and broad framework support. ClickHouse acquired it in January 2026, giving the project a well-funded backer. Arize Phoenix and Opik (by Comet) are strong alternatives, both Apache-2.0: Phoenix has the larger community (9,000+ GitHub stars), while Opik includes built-in evaluation metrics. All three are fully self-hostable with no feature gates.
What is the best LLM monitoring tool for AI agents?
AgentOps specializes in agent observability with session replay and time-travel debugging for multi-step agents. It integrates with CrewAI, AutoGen, and OpenAI Agents SDK. For broader observability that also covers agents, LangSmith and Braintrust both handle multi-step traces. The choice depends on whether agent debugging is your primary need or one of several.
What are alternatives to LangSmith?
Langfuse is the closest open-source alternative, offering tracing, prompt management, and evaluations without LangChain lock-in. Helicone is simpler (proxy-based, one-line setup). Braintrust is stronger on evaluation workflows. Portkey combines gateway routing with observability. All of these work with any LLM framework, not just LangChain.
Which LLM observability tools support OpenTelemetry?
Langfuse, Arize Phoenix, and Traceloop's OpenLLMetry library are all built on OpenTelemetry. OpenLLMetry, whose maintainer Traceloop was acquired by ServiceNow in March 2026, provides OTel instrumentation that many other tools integrate with. If your team already has an OTel pipeline, these tools plug into it directly.
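As a sketch, OpenLLMetry setup is a single init call that installs OTel instrumentation for supported LLM SDKs; the API shown matches recent versions of the `traceloop-sdk` package:

```python
from traceloop.sdk import Traceloop

# Installs OpenTelemetry instrumentation for supported LLM SDKs.
# Spans export to whatever OTLP endpoint you configure via the
# standard OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
Traceloop.init(app_name="my-llm-app")
```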
How much do LLM observability tools cost?
Most tools offer free tiers suitable for prototyping: Langfuse (50k events/month), Helicone (10k requests/month), Braintrust (1 GB/month), Arize Phoenix (25k spans/month). Paid tiers range from $20/user/month (Lunary) to $249/month (Braintrust Pro). Open-source tools (Langfuse, Phoenix, Opik, Lunary) can be self-hosted at infrastructure cost only.
Do I need LLM observability if I already use Datadog or New Relic?
Traditional APM tools track request latency and error rates but miss LLM-specific dimensions: token costs, prompt quality, hallucination rates, and multi-step chain failures. LLM observability tools capture the full prompt-to-completion trace with cost attribution. Some teams run both, using Datadog for infrastructure monitoring and a dedicated LLM tool for prompt-level visibility.
Browse all Observability & Analytics tools on Infrabase.ai