Lesson 15: Observability with Arize Phoenix

Topics Covered
  • Why Observability Matters: Debugging non-deterministic, multi-step systems.
  • Arize Phoenix Overview: Open-source tracing for LLM applications.
  • Tracing Agent Runs: Visualizing the full execution chain.
  • Span Analysis: Understanding latency, tokens, and costs.
  • Evaluations: Measuring agent quality programmatically.
  • Production Monitoring: Alerts, dashboards, and continuous improvement.

Agentic systems are notoriously hard to debug. The same prompt can produce different results. Tool calls chain unpredictably. Failures happen deep in multi-step workflows. Traditional logging isn't enough—you need tracing that captures the full execution graph. Arize Phoenix is an open-source observability platform purpose-built for LLM applications. In this lesson, you'll instrument your agents, trace their behavior, and build dashboards for production monitoring.

Synopsis

1. The Observability Challenge

  • Why agentic systems are hard to debug
  • Non-deterministic behavior: same input, different outputs
  • Multi-step failures: where did it go wrong?
  • The cost problem: tracking token usage across chains
  • Traditional logging vs distributed tracing

2. Arize Phoenix Overview

  • What is Arize Phoenix (open-source LLM observability)
  • Phoenix vs alternatives (LangSmith, Weights & Biases, Helicone)
  • Core concepts: traces, spans, projects
  • Local vs hosted deployment
  • The Phoenix UI

3. Setting Up Phoenix

  • Installing Phoenix locally
  • Launching the Phoenix server
  • Configuring your application to send traces
  • The OpenTelemetry foundation
  • First trace: verifying the setup (see the sketch below)
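
To make the setup concrete, here is a minimal local-mode sketch. It assumes the arize-phoenix package is installed (pip install arize-phoenix); the project name is a placeholder, and exact signatures can vary between Phoenix versions.

    import phoenix as px
    from phoenix.otel import register

    # Start the local Phoenix server; the UI is served at http://localhost:6006
    # by default. In production you would point at a long-running deployment
    # instead of launching the server in-process.
    session = px.launch_app()
    print(session.url)

    # Register an OpenTelemetry tracer provider that exports spans to Phoenix.
    # "agent-lessons" is a placeholder project name.
    tracer_provider = register(
        project_name="agent-lessons",
        endpoint="http://localhost:6006/v1/traces",
    )

Once this runs, any instrumented call in the same process shows up as a trace in the UI, which is how you verify the setup end to end.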

4. Instrumenting LangChain

  • Auto-instrumentation with phoenix.otel (see the sketch below)
  • What gets captured: LLM calls, tool executions, retrievals
  • Adding custom spans for business logic
  • Trace context propagation
  • Filtering and sampling traces
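
The sketch below covers both halves of this section, assuming the openinference-instrumentation-langchain package is installed: one call turns on auto-instrumentation for every LangChain component, and the plain OpenTelemetry API adds a custom span around business logic (score_lead is a hypothetical function, not part of any library).

    from openinference.instrumentation.langchain import LangChainInstrumentor
    from opentelemetry import trace
    from phoenix.otel import register

    # One-time setup: every LangChain LLM call, tool execution, and retrieval
    # in this process is now captured as a span.
    tracer_provider = register(project_name="agent-lessons")  # placeholder name
    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

    # Custom span for business logic that LangChain knows nothing about.
    tracer = trace.get_tracer(__name__)

    def score_lead(lead: dict) -> float:
        # Hypothetical scoring function, wrapped in its own span so it shows
        # up alongside the LLM and tool spans in the trace waterfall.
        with tracer.start_as_current_span("score_lead") as span:
            span.set_attribute("lead.source", lead.get("source", "unknown"))
            score = 0.9 if lead.get("source") == "referral" else 0.4
            span.set_attribute("lead.score", score)
            return score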

5. Instrumenting LangGraph

  • Tracing stateful workflows (see the sketch below)
  • Visualizing graph execution paths
  • Node-level performance analysis
  • Checkpoint and state visibility
  • Debugging conditional branches
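
Because LangGraph executes through LangChain's callback system, the same LangChainInstrumentor captures graph runs with no extra wiring: each node appears as a child span under the run. A toy two-node graph (no LLM required) is enough to see the execution path in the waterfall view; node and project names below are placeholders.

    from typing import TypedDict
    from langgraph.graph import StateGraph, START, END
    from openinference.instrumentation.langchain import LangChainInstrumentor
    from phoenix.otel import register

    LangChainInstrumentor().instrument(
        tracer_provider=register(project_name="agent-lessons")
    )

    class State(TypedDict):
        text: str

    def clean(state: State) -> State:
        return {"text": state["text"].strip()}

    def shout(state: State) -> State:
        return {"text": state["text"].upper()}

    builder = StateGraph(State)
    builder.add_node("clean", clean)
    builder.add_node("shout", shout)
    builder.add_edge(START, "clean")
    builder.add_edge("clean", "shout")
    builder.add_edge("shout", END)
    graph = builder.compile()

    # Each node becomes its own span, so conditional branches and node-level
    # latency are visible per run.
    print(graph.invoke({"text": "  hello  "}))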

6. Instrumenting Other Frameworks

  • Pydantic AI instrumentation
  • CrewAI and AutoGen tracing
  • Agno's built-in observability vs Phoenix
  • Custom instrumentation for any framework (see the sketch below)
  • OpenInference semantic conventions
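
Each supported framework gets the same one-line instrumentor pattern; CrewAI is shown here as one example (assuming openinference-instrumentation-crewai is installed). The second half sketches hand-rolled instrumentation for a framework with no instrumentor, using OpenInference's semantic-convention attribute names so Phoenix renders the span correctly; the span name and values are placeholders.

    from openinference.instrumentation.crewai import CrewAIInstrumentor
    from openinference.semconv.trace import (
        OpenInferenceSpanKindValues,
        SpanAttributes,
    )
    from opentelemetry import trace
    from phoenix.otel import register

    tracer_provider = register(project_name="agent-lessons")  # placeholder
    CrewAIInstrumentor().instrument(tracer_provider=tracer_provider)

    # Manual instrumentation: any framework can be traced by emitting spans
    # tagged with OpenInference attributes.
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("my_agent_step") as span:
        span.set_attribute(
            SpanAttributes.OPENINFERENCE_SPAN_KIND,
            OpenInferenceSpanKindValues.AGENT.value,
        )
        span.set_attribute(SpanAttributes.INPUT_VALUE, "user question here")
        span.set_attribute(SpanAttributes.OUTPUT_VALUE, "agent answer here")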

7. Understanding Traces and Spans

  • Trace anatomy: root span, child spans, attributes
  • Span types: LLM, retriever, tool, chain, agent
  • Reading the waterfall view
  • Identifying bottlenecks (see the sketch below)
  • Correlating traces with user sessions
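
Beyond reading the waterfall in the UI, spans can be pulled into a pandas DataFrame for ad-hoc analysis. A bottleneck-hunting sketch, assuming a local Phoenix at the default endpoint (column names follow OpenInference conventions and can differ across Phoenix versions):

    import phoenix as px

    client = px.Client()  # defaults to http://localhost:6006
    spans = client.get_spans_dataframe()

    # Compute per-span latency and surface the slowest operations.
    spans["latency_s"] = (
        spans["end_time"] - spans["start_time"]
    ).dt.total_seconds()
    slowest = spans.sort_values("latency_s", ascending=False)
    print(slowest[["name", "span_kind", "latency_s"]].head(10))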

8. Cost and Token Analysis

  • Tracking token usage per span
  • Calculating costs across providers (see the sketch below)
  • Identifying expensive operations
  • Optimization opportunities
  • Budget alerts and limits
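
A token-accounting sketch built on the same spans DataFrame. The attribute columns are OpenInference conventions, and the per-token prices are placeholders; real cost tracking needs a price table per provider and model.

    import phoenix as px

    # The filter string restricts the export to LLM spans.
    spans = px.Client().get_spans_dataframe("span_kind == 'LLM'")

    prompt_toks = spans["attributes.llm.token_count.prompt"].fillna(0).sum()
    completion_toks = (
        spans["attributes.llm.token_count.completion"].fillna(0).sum()
    )

    PROMPT_PRICE = 2.50 / 1_000_000       # placeholder $/token
    COMPLETION_PRICE = 10.00 / 1_000_000  # placeholder $/token
    cost = prompt_toks * PROMPT_PRICE + completion_toks * COMPLETION_PRICE
    print(
        f"{prompt_toks:.0f} prompt + {completion_toks:.0f} completion "
        f"tokens ~= ${cost:.2f}"
    )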

9. Evaluations: Measuring Quality

  • What are evaluations (automated quality checks)
  • Built-in evaluators: relevance, hallucination, toxicity
  • Running evaluations on traces (see the sketch below)
  • Custom evaluators for your domain
  • Evaluation datasets and benchmarks
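
An end-to-end evaluation sketch using phoenix.evals: pull question/answer/reference rows off your traces, run the built-in hallucination evaluator with an LLM judge, and log the results back so they appear next to each span in the UI. It assumes an OpenAI API key is configured, and exact signatures (for example, OpenAIModel's model argument) vary by Phoenix version.

    import phoenix as px
    from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals
    from phoenix.session.evaluation import get_qa_with_reference
    from phoenix.trace import SpanEvaluations

    client = px.Client()
    # DataFrame of input / output / reference rows extracted from traces.
    qa_df = get_qa_with_reference(client)

    judge = OpenAIModel(model="gpt-4o-mini")  # placeholder judge model
    [hallucination_df] = run_evals(
        dataframe=qa_df,
        evaluators=[HallucinationEvaluator(judge)],
        provide_explanation=True,
    )

    # Attach results to the original spans so they are visible in the UI.
    client.log_evaluations(
        SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_df)
    )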

10. Building Dashboards

  • Phoenix's built-in dashboards
  • Key metrics for agent health
  • Latency percentiles (p50, p95, p99; see the sketch below)
  • Error rates and failure patterns
  • Custom queries and visualizations
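
Phoenix's UI charts these metrics, but the same spans DataFrame supports custom views. A sketch computing latency percentiles and error rate client-side with pandas (status_code values and column names may vary by version):

    import phoenix as px

    spans = px.Client().get_spans_dataframe("span_kind == 'LLM'")
    latency = (spans["end_time"] - spans["start_time"]).dt.total_seconds()

    for p in (0.50, 0.95, 0.99):
        print(f"p{int(p * 100)}: {latency.quantile(p):.2f}s")

    error_rate = (spans["status_code"] == "ERROR").mean()
    print(f"error rate: {error_rate:.1%}")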

11. Alerting and Production Monitoring

  • Setting up alerts on key metrics (see the sketch below)
  • Anomaly detection for agent behavior
  • Integration with PagerDuty, Slack, etc.
  • On-call playbooks for agent failures
  • Continuous improvement workflows
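
One simple pattern is a scheduled job that polls Phoenix and pushes to a chat or paging webhook; the sketch below posts to a Slack incoming webhook when the error rate crosses a threshold. The webhook URL, threshold, and message text are all placeholders.

    import os

    import requests
    import phoenix as px

    THRESHOLD = 0.05  # placeholder: alert above a 5% error rate
    SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

    spans = px.Client().get_spans_dataframe()
    error_rate = (spans["status_code"] == "ERROR").mean()

    if error_rate > THRESHOLD:
        requests.post(
            SLACK_WEBHOOK_URL,
            json={
                "text": f"Agent error rate {error_rate:.1%} "
                        f"exceeds {THRESHOLD:.0%}"
            },
            timeout=10,
        )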

12. Debugging Common Issues

  • Case study: agent stuck in loops (trace analysis; see the sketch below)
  • Case study: wrong tool selection (span inspection)
  • Case study: hallucination in responses (evaluation)
  • Case study: cost explosion (token analysis)
  • Building a debugging checklist
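
As a taste of the first case study, a loop-hunting sketch: traces with an unusually high LLM-call count are the prime suspects for agents stuck in loops. The threshold of 15 calls is an arbitrary placeholder; tune it to your agent's normal depth.

    import phoenix as px

    spans = px.Client().get_spans_dataframe("span_kind == 'LLM'")
    calls_per_trace = spans.groupby("context.trace_id").size()

    suspects = calls_per_trace[calls_per_trace > 15]
    # Open these trace IDs in the Phoenix UI to inspect the repeated steps.
    print(suspects.sort_values(ascending=False).head())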

Additional Resources