Failures

Inspect errors, retries, and failed execution branches.

Overview

The Failures view groups all traces that need attention into three categories: failed traces, traces with retries, and slow requests. Data is fetched from the GET /failures endpoint, which applies server-side filtering and returns pre-categorized results.
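As a rough illustration of how a client might call this endpoint, the sketch below uses only the Python standard library. The base URL, the function names, and the assumption that the response body is JSON are all hypothetical; the response schema is not described here, so the result is returned undecoded beyond JSON parsing.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical base URL; point this at wherever your server is running.
BASE_URL = "http://localhost:8000"


def build_failures_request(base_url: str = BASE_URL) -> Request:
    """Build a GET request for the /failures endpoint."""
    return Request(f"{base_url.rstrip('/')}/failures", method="GET")


def fetch_failures(base_url: str = BASE_URL, timeout: float = 10.0):
    """Send the request and decode the body, assuming it is JSON."""
    with urlopen(build_failures_request(base_url), timeout=timeout) as resp:
        return json.load(resp)
```

Because filtering happens server-side, a client along these lines should not need to re-derive the categories itself.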

Failure Categories

| Category | Definition | Filter Logic |
|---|---|---|
| Failed | trace.status == failed | Any trace where status is explicitly failed |
| Retries | trace.retry_count > 0 | Traces with one or more retried steps |
| Slow requests | trace.slow_request == true | Traces with total latency >= 1500ms |
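The three definitions above can be read as independent predicates, so a single trace may land in more than one category. A minimal sketch, assuming a trace is a plain dict with the status, retry_count, and slow_request fields named in the table (the function name and the category label strings are hypothetical, not the server's actual code):

```python
from typing import Any


def failure_categories(trace: dict[str, Any]) -> list[str]:
    """Return every Failures category a trace falls into (possibly none)."""
    categories = []
    if trace.get("status") == "failed":
        categories.append("failed")
    if trace.get("retry_count", 0) > 0:
        categories.append("retries")
    if trace.get("slow_request", False):
        categories.append("slow_requests")
    return categories


# A failed trace with two retries that finished quickly.
trace = {"status": "failed", "retry_count": 2, "slow_request": False}
print(failure_categories(trace))  # prints ['failed', 'retries']
```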

How Failures Are Detected

normalize_trace_document() infers trace status by applying the following rules in order. The first rule that matches determines the status:

  1. If the payload explicitly sets status to a valid value (success/warning/failed), use it
  2. If any step has success=False, status becomes failed
  3. If failure_reason is set or retry_count > 0, status becomes warning
  4. Otherwise, status is success
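The ordering of these rules matters: an explicit valid status wins even when a step has failed. The sketch below illustrates that precedence. It is not the actual body of normalize_trace_document(); the function name infer_status and the payload layout (a "steps" list of dicts with a "success" key) are assumptions made for the example.

```python
from typing import Any

VALID_STATUSES = {"success", "warning", "failed"}


def infer_status(payload: dict[str, Any]) -> str:
    """Apply the four rules in order; the first match wins."""
    explicit = payload.get("status")
    # Rule 1: an explicit, valid status is used as-is.
    if isinstance(explicit, str) and explicit in VALID_STATUSES:
        return explicit
    # Rule 2: any step with success=False marks the trace as failed.
    if any(step.get("success") is False for step in payload.get("steps", [])):
        return "failed"
    # Rule 3: a failure reason or any retries downgrade to warning.
    if payload.get("failure_reason") or payload.get("retry_count", 0) > 0:
        return "warning"
    # Rule 4: nothing went wrong.
    return "success"


print(infer_status({"steps": [{"success": True}, {"success": False}]}))  # prints failed
print(infer_status({"retry_count": 1}))  # prints warning
```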

Retry Detection

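Retries are detected by _infer_retry_count(), which counts repeated tool names in the step list. Every occurrence of a tool name after its first counts as one retry, so a tool that appears three times contributes two retries. Steps without a tool_name are grouped under the name "agent", so multiple such steps also count as retries: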

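Retry inference

```python
from collections import defaultdict
from typing import Any


def _infer_retry_count(steps: list[dict[str, Any]]) -> int:
    retries = 0
    # Number of times each tool name has been seen so far.
    tool_attempts: defaultdict[str, int] = defaultdict(int)
    for step in steps:
        # Steps without a tool_name are grouped under "agent".
        tool_name = step.get("tool_name", "agent")
        tool_attempts[tool_name] += 1
        # Every occurrence after the first counts as one retry.
        if tool_attempts[tool_name] > 1:
            retries += 1
    return retries
```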
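To make the counting rule concrete, the following worked example uses an equivalent formulation based on collections.Counter: the retry count is the sum, over all tool names, of occurrences minus one. The function name count_retries and the sample steps are illustrative only and are not part of the codebase.

```python
from collections import Counter
from typing import Any


def count_retries(steps: list[dict[str, Any]]) -> int:
    """Count every occurrence of a tool name beyond its first."""
    names = Counter(step.get("tool_name", "agent") for step in steps)
    return sum(count - 1 for count in names.values())


steps = [
    {"tool_name": "search"},
    {"tool_name": "search"},  # second "search": running total 1
    {"tool_name": "fetch"},
    {"tool_name": "search"},  # third "search": running total 2
    {},                       # no tool_name, treated as "agent"
    {},                       # second "agent": running total 3
]
print(count_retries(steps))  # prints 3
```

Note that repeats do not need to be adjacent to count, which is why a legitimate second call to the same tool later in a trace is also reported as a retry.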

Slow Request Threshold

The slow request threshold is 1500ms, defined as SLOW_TRACE_THRESHOLD_MS. If a trace's total latency equals or exceeds this value, slow_request is set to true. The CLI uses the same value as the upper boundary when color-coding latency: green below 900ms, yellow from 900ms to 1499ms, and red at 1500ms and above.
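The boundaries are easy to get off by one, so here is a small sketch of the comparison logic. SLOW_TRACE_THRESHOLD_MS is the constant named above; WARN_LATENCY_THRESHOLD_MS and both function names are hypothetical and chosen only for this example.

```python
SLOW_TRACE_THRESHOLD_MS = 1500   # named in this document
WARN_LATENCY_THRESHOLD_MS = 900  # hypothetical name for the 900ms boundary


def is_slow_request(latency_ms: float) -> bool:
    """A trace is slow when latency equals or exceeds the threshold."""
    return latency_ms >= SLOW_TRACE_THRESHOLD_MS


def latency_color(latency_ms: float) -> str:
    """Map a latency to the CLI color bucket described above."""
    if latency_ms >= SLOW_TRACE_THRESHOLD_MS:
        return "red"
    if latency_ms >= WARN_LATENCY_THRESHOLD_MS:
        return "yellow"
    return "green"


for ms in (899, 900, 1499, 1500):
    print(ms, latency_color(ms), is_slow_request(ms))
# prints:
# 899 green False
# 900 yellow False
# 1499 yellow False
# 1500 red True
```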

Troubleshooting Failures

| Observation | Common Cause | Action |
|---|---|---|
| Status is failed | Exception in traced function or step with success=False | Inspect the failure_reason field and the step detail |
| Status is warning | Retries occurred during execution | Review which steps were retried and check the error output |
| slow_request is true | Total latency >= 1500ms | Optimize slow steps or increase the threshold |
| Retry count is unexpected | Duplicate tool names in step list | Check for unintended repeated tool invocations |