Lesson 15: Observability with Arize Phoenix

Topics Covered
  • Why Observability Matters: Debugging non-deterministic, multi-step systems.
  • Arize Phoenix Overview: Open-source tracing for LLM applications.
  • Tracing Agent Runs: Visualizing the full execution chain.
  • Span Analysis: Understanding latency, tokens, and costs.
  • Evaluations: Measuring agent quality programmatically.
  • Production Monitoring: Alerts, dashboards, and continuous improvement.

Agentic systems are notoriously hard to debug. The same prompt can produce different results. Tool calls chain unpredictably. Failures happen deep in multi-step workflows. Traditional logging isn't enough—you need tracing that captures the full execution graph. Arize Phoenix is an open-source observability platform purpose-built for LLM applications. In this lesson, you'll instrument your agents, trace their behavior, and build dashboards for production monitoring.

Synopsis

1. The Observability Challenge

  • Why agentic systems are hard to debug
  • Non-deterministic behavior: same input, different outputs
  • Multi-step failures: where did it go wrong?
  • The cost problem: tracking token usage across chains
  • Traditional logging vs distributed tracing

2. Arize Phoenix Overview

  • What is Arize Phoenix (open-source LLM observability)
  • Phoenix vs alternatives (LangSmith, Weights & Biases, Helicone)
  • Core concepts: traces, spans, projects
  • Local vs hosted deployment
  • The Phoenix UI

3. Setting Up Phoenix

  • Installing Phoenix locally
  • Launching the Phoenix server
  • Configuring your application to send traces
  • The OpenTelemetry foundation
  • First trace: verifying the setup (see the sketch below)
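
To make the setup concrete, here is a minimal local-mode sketch. It assumes the arize-phoenix package is installed (pip install arize-phoenix); the project name is a placeholder, and exact signatures can vary between Phoenix versions.

    import phoenix as px
    from phoenix.otel import register

    # Start the local Phoenix server; the UI is served at http://localhost:6006
    # by default. In production you would point at a long-running deployment
    # instead of launching the server in-process.
    session = px.launch_app()
    print(session.url)

    # Register an OpenTelemetry tracer provider that exports spans to Phoenix.
    # "agent-lessons" is a placeholder project name.
    tracer_provider = register(
        project_name="agent-lessons",
        endpoint="http://localhost:6006/v1/traces",
    )

Once this runs, any instrumented call in the same process shows up as a trace in the UI, which is how you verify the setup end to end.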

4. Instrumenting LangChain

  • Auto-instrumentation with phoenix.otel (see the sketch below)
  • What gets captured: LLM calls, tool executions, retrievals
  • Adding custom spans for business logic
  • Trace context propagation
  • Filtering and sampling traces
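
The sketch below covers both halves of this section, assuming the openinference-instrumentation-langchain package is installed: one call turns on auto-instrumentation for every LangChain component, and the plain OpenTelemetry API adds a custom span around business logic (score_lead is a hypothetical function, not part of any library).

    from openinference.instrumentation.langchain import LangChainInstrumentor
    from opentelemetry import trace
    from phoenix.otel import register

    # One-time setup: every LangChain LLM call, tool execution, and retrieval
    # in this process is now captured as a span.
    tracer_provider = register(project_name="agent-lessons")  # placeholder name
    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

    # Custom span for business logic that LangChain knows nothing about.
    tracer = trace.get_tracer(__name__)

    def score_lead(lead: dict) -> float:
        # Hypothetical scoring function, wrapped in its own span so it shows
        # up alongside the LLM and tool spans in the trace waterfall.
        with tracer.start_as_current_span("score_lead") as span:
            span.set_attribute("lead.source", lead.get("source", "unknown"))
            score = 0.9 if lead.get("source") == "referral" else 0.4
            span.set_attribute("lead.score", score)
            return score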

5. Instrumenting LangGraph

  • Tracing stateful workflows (see the sketch below)
  • Visualizing graph execution paths
  • Node-level performance analysis
  • Checkpoint and state visibility
  • Debugging conditional branches
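
Because LangGraph executes through LangChain's callback system, the same LangChainInstrumentor captures graph runs with no extra wiring: each node appears as a child span under the run. A toy two-node graph (no LLM required) is enough to see the execution path in the waterfall view; node and project names below are placeholders.

    from typing import TypedDict
    from langgraph.graph import StateGraph, START, END
    from openinference.instrumentation.langchain import LangChainInstrumentor
    from phoenix.otel import register

    LangChainInstrumentor().instrument(
        tracer_provider=register(project_name="agent-lessons")
    )

    class State(TypedDict):
        text: str

    def clean(state: State) -> State:
        return {"text": state["text"].strip()}

    def shout(state: State) -> State:
        return {"text": state["text"].upper()}

    builder = StateGraph(State)
    builder.add_node("clean", clean)
    builder.add_node("shout", shout)
    builder.add_edge(START, "clean")
    builder.add_edge("clean", "shout")
    builder.add_edge("shout", END)
    graph = builder.compile()

    # Each node becomes its own span, so conditional branches and node-level
    # latency are visible per run.
    print(graph.invoke({"text": "  hello  "}))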

6. Instrumenting Other Frameworks

  • Pydantic AI instrumentation
  • CrewAI and AutoGen tracing
  • Agno's built-in observability vs Phoenix
  • Custom instrumentation for any framework (see the sketch below)
  • OpenInference semantic conventions
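
Each supported framework gets the same one-line instrumentor pattern; CrewAI is shown here as one example (assuming openinference-instrumentation-crewai is installed). The second half sketches hand-rolled instrumentation for a framework with no instrumentor, using OpenInference's semantic-convention attribute names so Phoenix renders the span correctly; the span name and values are placeholders.

    from openinference.instrumentation.crewai import CrewAIInstrumentor
    from openinference.semconv.trace import (
        OpenInferenceSpanKindValues,
        SpanAttributes,
    )
    from opentelemetry import trace
    from phoenix.otel import register

    tracer_provider = register(project_name="agent-lessons")  # placeholder
    CrewAIInstrumentor().instrument(tracer_provider=tracer_provider)

    # Manual instrumentation: any framework can be traced by emitting spans
    # tagged with OpenInference attributes.
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("my_agent_step") as span:
        span.set_attribute(
            SpanAttributes.OPENINFERENCE_SPAN_KIND,
            OpenInferenceSpanKindValues.AGENT.value,
        )
        span.set_attribute(SpanAttributes.INPUT_VALUE, "user question here")
        span.set_attribute(SpanAttributes.OUTPUT_VALUE, "agent answer here")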

7. Understanding Traces and Spans

  • Trace anatomy: root span, child spans, attributes
  • Span types: LLM, retriever, tool, chain, agent
  • Reading the waterfall view
  • Identifying bottlenecks (see the sketch below)
  • Correlating traces with user sessions
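
Beyond reading the waterfall in the UI, spans can be pulled into a pandas DataFrame for ad-hoc analysis. A bottleneck-hunting sketch, assuming a local Phoenix at the default endpoint (column names follow OpenInference conventions and can differ across Phoenix versions):

    import phoenix as px

    client = px.Client()  # defaults to http://localhost:6006
    spans = client.get_spans_dataframe()

    # Compute per-span latency and surface the slowest operations.
    spans["latency_s"] = (
        spans["end_time"] - spans["start_time"]
    ).dt.total_seconds()
    slowest = spans.sort_values("latency_s", ascending=False)
    print(slowest[["name", "span_kind", "latency_s"]].head(10))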

8. Cost and Token Analysis

  • Tracking token usage per span
  • Calculating costs across providers (see the sketch below)
  • Identifying expensive operations
  • Optimization opportunities
  • Budget alerts and limits
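
A token-accounting sketch built on the same spans DataFrame. The attribute columns are OpenInference conventions, and the per-token prices are placeholders; real cost tracking needs a price table per provider and model.

    import phoenix as px

    # The filter string restricts the export to LLM spans.
    spans = px.Client().get_spans_dataframe("span_kind == 'LLM'")

    prompt_toks = spans["attributes.llm.token_count.prompt"].fillna(0).sum()
    completion_toks = (
        spans["attributes.llm.token_count.completion"].fillna(0).sum()
    )

    PROMPT_PRICE = 2.50 / 1_000_000       # placeholder $/token
    COMPLETION_PRICE = 10.00 / 1_000_000  # placeholder $/token
    cost = prompt_toks * PROMPT_PRICE + completion_toks * COMPLETION_PRICE
    print(
        f"{prompt_toks:.0f} prompt + {completion_toks:.0f} completion "
        f"tokens ~= ${cost:.2f}"
    )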

9. Evaluations: Measuring Quality

  • What are evaluations (automated quality checks)
  • Built-in evaluators: relevance, hallucination, toxicity
  • Running evaluations on traces (see the sketch below)
  • Custom evaluators for your domain
  • Evaluation datasets and benchmarks
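
An end-to-end evaluation sketch using phoenix.evals: pull question/answer/reference rows off your traces, run the built-in hallucination evaluator with an LLM judge, and log the results back so they appear next to each span in the UI. It assumes an OpenAI API key is configured, and exact signatures (for example, OpenAIModel's model argument) vary by Phoenix version.

    import phoenix as px
    from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
    from phoenix.session.evaluation import get_qa_with_reference
    from phoenix.trace import SpanEvaluations

    client = px.Client()
    # DataFrame of input / output / reference rows extracted from traces.
    qa_df = get_qa_with_reference(client)

    judge = OpenAIModel(model="gpt-4o-mini")  # placeholder judge model
    [hallucination_df] = run_evals(
        dataframe=qa_df,
        evaluators=[HallucinationEvaluator(judge)],
        provide_explanation=True,
    )

    # Attach results to the original spans so they are visible in the UI.
    client.log_evaluations(
        SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df)
    )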

10. Building Dashboards

  • Phoenix's built-in dashboards
  • Key metrics for agent health
  • Latency percentiles (p50, p95, p99; see the sketch below)
  • Error rates and failure patterns
  • Custom queries and visualizations
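
Phoenix's UI charts these metrics, but the same spans DataFrame supports custom views. A sketch computing latency percentiles and error rate client-side with pandas (status_code values and column names may vary by version):

    import phoenix as px

    spans = px.Client().get_spans_dataframe("span_kind == 'LLM'")
    latency = (spans["end_time"] - spans["start_time"]).dt.total_seconds()

    for p in (0.50, 0.95, 0.99):
        print(f"p{int(p * 100)}: {latency.quantile(p):.2f}s")

    error_rate = (spans["status_code"] == "ERROR").mean()
    print(f"error rate: {error_rate:.1%}")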

11. Alerting and Production Monitoring

  • Setting up alerts on key metrics (see the sketch below)
  • Anomaly detection for agent behavior
  • Integration with PagerDuty, Slack, etc.
  • On-call playbooks for agent failures
  • Continuous improvement workflows
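
One simple pattern is a scheduled job that polls Phoenix and pushes to a chat or paging webhook; the sketch below posts to a Slack incoming webhook when the error rate crosses a threshold. The webhook URL, threshold, and message text are all placeholders.

    import os

    import requests
    import phoenix as px

    THRESHOLD = 0.05  # placeholder: alert above a 5% error rate
    SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

    spans = px.Client().get_spans_dataframe()
    error_rate = (spans["status_code"] == "ERROR").mean()

    if error_rate > THRESHOLD:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={
                "text": f"Agent error rate {error_rate:.1%} "
                        f"exceeds {THRESHOLD:.0%}"
            },
            timeout=10,
        )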

12. Debugging Common Issues

  • Case study: agent stuck in loops (trace analysis; see the sketch below)
  • Case study: wrong tool selection (span inspection)
  • Case study: hallucination in responses (evaluation)
  • Case study: cost explosion (token analysis)
  • Building a debugging checklist
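
As a taste of the first case study, a loop-hunting sketch: traces with an unusually high LLM-call count are the prime suspects for agents stuck in loops. The threshold of 15 calls is an arbitrary placeholder; tune it to your agent's normal depth.

    import phoenix as px

    spans = px.Client().get_spans_dataframe("span_kind == 'LLM'")
    calls_per_trace = spans.groupby("context.trace_id").size()

    suspects = calls_per_trace[calls_per_trace > 15]
    # Open these trace IDs in the Phoenix UI to inspect the repeated steps.
    print(suspects.sort_values(ascending=False).head())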

Additional Resources