Lesson 15: Observability with Arize Phoenix
Topics Covered
- Why Observability Matters: Debugging non-deterministic, multi-step systems.
- Arize Phoenix Overview: Open-source tracing for LLM applications.
- Tracing Agent Runs: Visualizing the full execution chain.
- Span Analysis: Understanding latency, tokens, and costs.
- Evaluations: Measuring agent quality programmatically.
- Production Monitoring: Alerts, dashboards, and continuous improvement.
Agentic systems are notoriously hard to debug. The same prompt can produce different results. Tool calls chain unpredictably. Failures happen deep in multi-step workflows. Traditional logging isn't enough—you need tracing that captures the full execution graph. Arize Phoenix is an open-source observability platform purpose-built for LLM applications. In this lesson, you'll instrument your agents, trace their behavior, and build dashboards for production monitoring.
Synopsis
1. The Observability Challenge
- Why agentic systems are hard to debug
- Non-deterministic behavior: same input, different outputs
- Multi-step failures: where did it go wrong?
- The cost problem: tracking token usage across chains
- Traditional logging vs distributed tracing
2. Arize Phoenix Overview
- What is Arize Phoenix (open-source LLM observability)
- Phoenix vs alternatives (LangSmith, Weights & Biases, Helicone)
- Core concepts: traces, spans, projects
- Local vs hosted deployment
- The Phoenix UI
3. Setting Up Phoenix
- Installing Phoenix locally
- Launching the Phoenix server
- Configuring your application to send traces (see the setup sketch after this list)
- The OpenTelemetry foundation
- First trace: verifying the setup
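A minimal local-setup sketch, assuming arize-phoenix is installed (pip install arize-phoenix) and nothing else is using the default port; the project name is just a placeholder:

```python
import phoenix as px
from phoenix.otel import register

# Launch a local Phoenix server; the UI is served at http://localhost:6006 by default.
session = px.launch_app()

# Register an OpenTelemetry tracer provider that exports spans to Phoenix.
# "agent-course" is an arbitrary project label for grouping traces.
tracer_provider = register(project_name="agent-course")

print(f"Phoenix UI: {session.url}")
```

Run your application once, open the printed URL, and confirm the first trace appears before going further.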
4. Instrumenting LangChain
- Auto-instrumentation with phoenix.otel (see the sketch after this list)
- What gets captured: LLM calls, tool executions, retrievals
- Adding custom spans for business logic
- Trace context propagation
- Filtering and sampling traces
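A sketch of auto-instrumentation, assuming the openinference-instrumentation-langchain and langchain-openai packages are installed and reusing the tracer_provider registered in the setup sketch; the model name is only a placeholder:

```python
from openinference.instrumentation.langchain import LangChainInstrumentor
from langchain_openai import ChatOpenAI

# Hook the LangChain instrumentor into the tracer provider registered earlier;
# chain runs, LLM calls, tool executions, and retriever calls become spans.
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

llm = ChatOpenAI(model="gpt-4o-mini")
llm.invoke("What is distributed tracing?")  # now visible as a trace in the Phoenix UI
```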
5. Instrumenting LangGraph
- Tracing stateful workflows (see the sketch after this list)
- Visualizing graph execution paths
- Node-level performance analysis
- Checkpoint and state visibility
- Debugging conditional branches
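Because LangGraph runs on the LangChain runtime, the same instrumentor captures graph executions; a toy graph, sketched under that assumption, shows how each node surfaces as its own span:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    # A real agent node would call an LLM or tool here; kept trivial for the trace.
    return {"answer": f"You asked: {state['question']}"}

graph = StateGraph(State)
graph.add_node("answer", answer_node)
graph.add_edge(START, "answer")
graph.add_edge("answer", END)
app = graph.compile()

# With the LangChain instrumentor active, this invocation produces a root span
# for the graph run with one child span per executed node.
app.invoke({"question": "Why did the agent loop?"})
```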
6. Instrumenting Other Frameworks
- Pydantic AI instrumentation
- CrewAI and AutoGen tracing
- Agno's built-in observability vs Phoenix
- Custom instrumentation for any framework (sketch after this list)
- OpenInference semantic conventions
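For a framework without an auto-instrumentor, plain OpenTelemetry spans work; the attribute keys below follow the OpenInference conventions as commonly documented, but treat them as assumptions to verify against the spec:

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-custom-agent")  # hypothetical instrumentation name

def run_tool(query: str) -> str:
    # Wrap the framework call in a span; the span-kind attribute tells Phoenix
    # to render it as a TOOL span in the trace waterfall.
    with tracer.start_as_current_span("search_tool") as span:
        span.set_attribute("openinference.span.kind", "TOOL")
        span.set_attribute("input.value", query)
        result = f"results for {query}"  # placeholder for the real tool call
        span.set_attribute("output.value", result)
        return result
```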
7. Understanding Traces and Spans
- Trace anatomy: root span, child spans, attributes
- Span types: LLM, retriever, tool, chain, agent
- Reading the waterfall view
- Identifying bottlenecks (see the query sketch after this list)
- Correlating traces with user sessions
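A quick bottleneck query, sketched on the assumption that the spans dataframe exposes name, start_time, and end_time columns (exact column names can differ across Phoenix versions):

```python
import phoenix as px

# Pull all collected spans into a pandas DataFrame.
df = px.Client().get_spans_dataframe()
df["latency_s"] = (df["end_time"] - df["start_time"]).dt.total_seconds()

# Average latency by span name: the slowest operations are the first
# candidates to inspect in the waterfall view.
print(df.groupby("name")["latency_s"].mean().sort_values(ascending=False).head(10))
```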
8. Cost and Token Analysis
- Tracking token usage per span
- Calculating costs across providers (see the estimate sketch after this list)
- Identifying expensive operations
- Optimization opportunities
- Budget alerts and limits
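A rough cost estimate from traced token counts, assuming OpenInference token-count attributes are present on LLM spans; the per-token prices are illustrative only and must be replaced with your provider's actual rates:

```python
import phoenix as px

df = px.Client().get_spans_dataframe()
llm_spans = df[df["span_kind"] == "LLM"]

# Illustrative prices in USD per token; substitute real provider pricing.
PROMPT_PRICE = 0.15 / 1_000_000
COMPLETION_PRICE = 0.60 / 1_000_000

prompt_tokens = llm_spans["attributes.llm.token_count.prompt"].fillna(0)
completion_tokens = llm_spans["attributes.llm.token_count.completion"].fillna(0)
cost = (prompt_tokens * PROMPT_PRICE + completion_tokens * COMPLETION_PRICE).sum()
print(f"Estimated spend across traced LLM calls: ${cost:.4f}")
```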
9. Evaluations: Measuring Quality
- What are evaluations (automated quality checks)
- Built-in evaluators: relevance, hallucination, toxicity
- Running evaluations on traces (see the sketch after this list)
- Custom evaluators for your domain
- Evaluation datasets and benchmarks
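A hallucination-evaluation sketch using Phoenix's LLM-as-judge helpers; it assumes an OpenAI key is configured and that the dataframe carries the input/reference/output columns the built-in template expects (you may need to rename columns first), and the judge model is a placeholder choice:

```python
import phoenix as px
from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

spans_df = px.Client().get_spans_dataframe()

# Classify each traced response as factual or hallucinated.
eval_df = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(eval_df["label"].value_counts())
```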
10. Building Dashboards
- Phoenix's built-in dashboards
- Key metrics for agent health
- Latency percentiles (p50, p95, p99; see the query sketch after this list)
- Error rates and failure patterns
- Custom queries and visualizations
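Agent-health numbers can also be computed directly from the spans dataframe; this sketch assumes root spans are identifiable by an empty parent_id and that status_code uses the OK/ERROR convention:

```python
import phoenix as px

df = px.Client().get_spans_dataframe()
roots = df[df["parent_id"].isna()]  # one root span per trace

latency_s = (roots["end_time"] - roots["start_time"]).dt.total_seconds()
print("p50/p95/p99 (s):", latency_s.quantile([0.5, 0.95, 0.99]).round(2).tolist())

error_rate = (roots["status_code"] == "ERROR").mean()
print(f"Error rate: {error_rate:.1%}")
```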
11. Alerting and Production Monitoring
- Setting up alerts on key metrics (see the sketch after this list)
- Anomaly detection for agent behavior
- Integration with PagerDuty, Slack, etc.
- On-call playbooks for agent failures
- Continuous improvement workflows
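A minimal alerting sketch: a scheduled job that recomputes the error rate and posts to a Slack incoming webhook; the webhook environment variable and the 5% threshold are hypothetical and should be tuned to your traffic:

```python
import os

import phoenix as px
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical incoming-webhook URL
ERROR_RATE_THRESHOLD = 0.05  # arbitrary cutoff; tune for your workload

def check_and_alert() -> None:
    df = px.Client().get_spans_dataframe()
    roots = df[df["parent_id"].isna()]
    if roots.empty:
        return
    error_rate = (roots["status_code"] == "ERROR").mean()
    if error_rate > ERROR_RATE_THRESHOLD:
        requests.post(SLACK_WEBHOOK_URL, json={
            "text": f"Agent error rate at {error_rate:.1%} (threshold {ERROR_RATE_THRESHOLD:.0%})"
        })

# Run from cron, a scheduler, or your monitoring job.
check_and_alert()
```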
12. Debugging Common Issues
- Case study: agent stuck in loops (trace analysis; see the query sketch after this list)
- Case study: wrong tool selection (span inspection)
- Case study: hallucination in responses (evaluation)
- Case study: cost explosion (token analysis)
- Building a debugging checklist
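As a starting point for the loop case study, a query sketch that counts LLM calls per trace and flags outliers; the trace-id column name and the cutoff of 15 calls are assumptions, not fixed rules:

```python
import phoenix as px

df = px.Client().get_spans_dataframe()
llm_calls_per_trace = df[df["span_kind"] == "LLM"].groupby("context.trace_id").size()

# Traces with an unusually high number of LLM calls are the first place to
# look for an agent stuck in a tool-call loop.
suspects = llm_calls_per_trace[llm_calls_per_trace > 15]
print(suspects.sort_values(ascending=False).head())
```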