Outcome focus: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.
Tags: agents, evaluation, reliability, enterprise AI, observability
The workflow passed the demo and failed the first realistic ticket.
That is the multi-agent failure pattern I trust least. In the demo, the planner decomposes the task, the researcher finds context, the tool agent calls the right system, and the writer returns a polished answer. Everyone can see the promise.
Then a real ticket arrives with missing context, ambiguous ownership, stale documentation, a tool timeout, and a user who expects the system to know when to stop.
The agents do not fail dramatically. They drift. One agent asks another to confirm something the first agent could have checked. A tool call gets retried without preserving the original intent. The writer produces a confident answer with one missing prerequisite. The workflow looks productive while the actual task remains unfinished.
That is why I evaluate multi-agent systems as workflows, not chat transcripts.
The task taxonomy comes first
Before scoring anything, define what kind of work the agents are doing. A workflow for support triage should not be scored like a workflow for code migration.
Here is a taxonomy I use as a starting point:
| Task class | Example | Main risk |
|---|---|---|
| Retrieval and synthesis | Summarize a policy, collect product context | Missing or stale evidence |
| Action with approval | Draft refund, propose CRM update | Overbroad permission or skipped review |
| Multi-step analysis | Investigate failed data job | Bad handoff or premature conclusion |
| Code or config change | Modify workflow, open pull request | Incorrect patch or missing regression test |
| Long-running follow-up | Monitor status and resume later | Lost state or repeated work |
Each class needs different release thresholds. A retrieval workflow can tolerate more latency than a production automation workflow. A code-change workflow needs artifact diff checks. An approval workflow needs permission and audit checks.
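One way to keep those per-class differences from living only in people's heads is a small lookup from task class to release profile. This is a minimal sketch: the names `TaskClass`, `ReleaseProfile`, and `profile_for` are hypothetical, and every latency number is a placeholder chosen only to show that classes can differ, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class TaskClass(Enum):
    RETRIEVAL_AND_SYNTHESIS = "retrieval_and_synthesis"
    ACTION_WITH_APPROVAL = "action_with_approval"
    MULTI_STEP_ANALYSIS = "multi_step_analysis"
    CODE_OR_CONFIG_CHANGE = "code_or_config_change"
    LONG_RUNNING_FOLLOW_UP = "long_running_follow_up"


@dataclass(frozen=True)
class ReleaseProfile:
    # Placeholder budget in seconds; real budgets come from the team, not this sketch.
    p95_latency_budget_seconds: float
    needs_artifact_diff_check: bool = False
    needs_permission_audit: bool = False


# Hypothetical profiles: one entry per task class in the taxonomy table.
RELEASE_PROFILES = {
    TaskClass.RETRIEVAL_AND_SYNTHESIS: ReleaseProfile(120.0),
    TaskClass.ACTION_WITH_APPROVAL: ReleaseProfile(30.0, needs_permission_audit=True),
    TaskClass.MULTI_STEP_ANALYSIS: ReleaseProfile(90.0),
    TaskClass.CODE_OR_CONFIG_CHANGE: ReleaseProfile(300.0, needs_artifact_diff_check=True),
    TaskClass.LONG_RUNNING_FOLLOW_UP: ReleaseProfile(600.0),
}


def profile_for(task_class: str) -> ReleaseProfile:
    """Look up the release profile; an unknown class name raises ValueError."""
    return RELEASE_PROFILES[TaskClass(task_class)]
```

Raising on an unknown class is deliberate in this sketch: a workflow that cannot name its task class has no thresholds to be gated on.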
Failure classes
The evaluation set should label failures in a way that points to a fix.
| Failure class | What it looks like | Fix direction |
|---|---|---|
| Handoff loss | Planner passes a vague subtask and downstream agent invents context | Structured handoff contract |
| Tool misuse | Right tool, wrong arguments, wrong object scope | Tool schema and fixture coverage |
| Permission overreach | Agent can read or mutate more than task requires | Narrow manifest or scoped credentials |
| Completion ambiguity | Workflow stops before the real done condition | Explicit acceptance criteria |
| Retry drift | Retry changes the original user intent | Retry state and idempotency key |
| Evidence gap | Final answer cites no source or unverifiable context | Evidence requirements in output contract |
| Latency spiral | Agents debate or duplicate work | Role boundaries and stop rules |
The labels matter. "Bad answer" is too blunt. "Handoff loss after planner decomposition" gives the engineer something to change.
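The labels pay off once they are machine-readable. As a minimal sketch, the failure table can be encoded as an enum plus a fix-direction lookup, so that a batch of labeled failures turns into a ranked list of things to change. `FailureClass`, `FIX_DIRECTION`, and `fix_priorities` are hypothetical names; the label values and fix directions are copied from the table.

```python
from collections import Counter
from enum import Enum


class FailureClass(Enum):
    HANDOFF_LOSS = "handoff_loss"
    TOOL_MISUSE = "tool_misuse"
    PERMISSION_OVERREACH = "permission_overreach"
    COMPLETION_AMBIGUITY = "completion_ambiguity"
    RETRY_DRIFT = "retry_drift"
    EVIDENCE_GAP = "evidence_gap"
    LATENCY_SPIRAL = "latency_spiral"


FIX_DIRECTION = {
    FailureClass.HANDOFF_LOSS: "Structured handoff contract",
    FailureClass.TOOL_MISUSE: "Tool schema and fixture coverage",
    FailureClass.PERMISSION_OVERREACH: "Narrow manifest or scoped credentials",
    FailureClass.COMPLETION_AMBIGUITY: "Explicit acceptance criteria",
    FailureClass.RETRY_DRIFT: "Retry state and idempotency key",
    FailureClass.EVIDENCE_GAP: "Evidence requirements in output contract",
    FailureClass.LATENCY_SPIRAL: "Role boundaries and stop rules",
}


def fix_priorities(labels: list[FailureClass]) -> list[tuple[str, int, str]]:
    """Return (label, count, fix direction) rows, most frequent failure first."""
    counts = Counter(labels)
    return [(fc.value, n, FIX_DIRECTION[fc]) for fc, n in counts.most_common()]
```

Ranking by raw count is only one possible policy. A team might reasonably weight permission overreach above everything else regardless of frequency.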
A demo that failed evaluation
The sanitized workflow looked simple: triage a failed data pipeline alert.
The agents were:
- planner,
- log reader,
- dependency checker,
- runbook summarizer,
- incident writer.
The demo case worked. The alert pointed to one failing job, the logs included the exception, and the runbook had a matching section. The workflow produced a clean incident summary.
The evaluation case added realistic friction:
- the failing job was downstream of the actual broken source table,
- the runbook had two similar sections,
- the log reader found a retry failure but not the first failure,
- the dependency checker had read-only access,
- the incident writer needed to recommend escalation only if the freshness SLA was breached.
The workflow failed because no agent owned the done condition. The writer summarized the retry failure and said the job was recovering. The actual upstream table remained stale.
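A structured handoff that refuses to travel without a done condition and an owner would have surfaced this gap earlier. The sketch below is one possible shape, not the contract the workflow actually used: the `Handoff` fields and the `unresolved_fields` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    from_agent: str
    to_agent: str
    subtask: str
    done_condition: str          # what "finished" means for the whole task
    done_condition_owner: str    # the single agent accountable for checking it
    evidence_refs: list[str] = field(default_factory=list)


# Fields that must be filled before a downstream agent may start work.
REQUIRED_FIELDS = (
    "from_agent",
    "to_agent",
    "subtask",
    "done_condition",
    "done_condition_owner",
)


def unresolved_fields(handoff: Handoff) -> list[str]:
    """Return the required fields that are empty or whitespace-only."""
    return [name for name in REQUIRED_FIELDS if not getattr(handoff, name).strip()]
```

A handoff from the planner to the log reader with an empty `done_condition` would be rejected here instead of being silently interpreted downstream.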
Evaluation artifact
The useful artifact was a fixture file per evaluation case.
```yaml
id: pipeline-alert-stale-upstream-001
task_class: multi_step_analysis
user_request: "Triage the failed nightly revenue mart run and tell me whether to escalate."
fixtures:
  alert: revenue_mart_failed.json
  logs:
    - revenue_mart_retry.log
    - upstream_orders_first_failure.log
  runbook: revenue_pipeline_runbook.md
  dependency_graph: revenue_dependencies.json
expected:
  required_findings:
    - upstream_orders table caused the downstream failure
    - freshness SLA is breached
    - escalation to data platform on-call is required
  forbidden_findings:
    - revenue mart has recovered
    - no escalation required
thresholds:
  max_wall_clock_seconds: 90
  max_tool_calls: 18
  required_evidence_count: 3
```

This fixture does three important things:
- It defines the task class.
- It names the done condition.
- It makes latency and tool count part of the release conversation.
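To show how a fixture like this can drive a pass/fail decision, here is a minimal checker. It is a sketch under several assumptions: the fixture is represented as a plain dict with `thresholds` as a top-level key, a separate grading step (not shown) has already normalized the workflow's output into the same finding strings the fixture uses, and `RunResult` and `check_run` are hypothetical names.

```python
from dataclasses import dataclass

# The expectations and thresholds from the fixture, as a plain dict.
FIXTURE = {
    "id": "pipeline-alert-stale-upstream-001",
    "expected": {
        "required_findings": [
            "upstream_orders table caused the downstream failure",
            "freshness SLA is breached",
            "escalation to data platform on-call is required",
        ],
        "forbidden_findings": [
            "revenue mart has recovered",
            "no escalation required",
        ],
    },
    "thresholds": {
        "max_wall_clock_seconds": 90,
        "max_tool_calls": 18,
        "required_evidence_count": 3,
    },
}


@dataclass
class RunResult:
    findings: set[str]          # assumed to be normalized by a grader upstream
    wall_clock_seconds: float
    tool_calls: int
    evidence_count: int


def check_run(fixture: dict, run: RunResult) -> list[str]:
    """Return every reason the run fails the fixture; an empty list means pass."""
    problems = []
    expected, limits = fixture["expected"], fixture["thresholds"]
    for finding in expected["required_findings"]:
        if finding not in run.findings:
            problems.append(f"missing required finding: {finding}")
    for finding in expected["forbidden_findings"]:
        if finding in run.findings:
            problems.append(f"forbidden finding present: {finding}")
    if run.wall_clock_seconds > limits["max_wall_clock_seconds"]:
        problems.append("wall clock budget exceeded")
    if run.tool_calls > limits["max_tool_calls"]:
        problems.append("tool call budget exceeded")
    if run.evidence_count < limits["required_evidence_count"]:
        problems.append("not enough evidence cited")
    return problems
```

Exact string matching keeps the sketch short. In practice the hard part is the grader that maps free-text answers onto these findings, and that step deserves its own tests.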
Scoring
I score workflow behavior at three layers.
| Layer | Question it answers | Example threshold |
|---|---|---|
| Outcome | Did the workflow complete the task? | 95% pass on critical regression set |
| Process | Did handoffs, tools, permissions, and evidence behave correctly? | No critical permission or evidence failures |
| Cost | Did latency and tool count stay within budget? | p95 under target for task class |
Outcome is not enough. A workflow can get the right answer by reading too much data, calling unsafe tools, or relying on an accident in the fixture. Process scoring catches those.
Cost is not only money. Latency changes user trust. If a workflow takes three minutes to triage an alert that an engineer can classify in thirty seconds, it has to be more accurate, more complete, or more auditable to earn the time.
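The three layers can be computed from per-case results in a few lines. This is a sketch with hypothetical names (`CaseResult`, `layer_scores`), and it assumes the nearest-rank definition of p95; other percentile definitions give slightly different numbers on small regression sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    task_completed: bool
    critical_process_failures: int   # permission or evidence failures in this case
    latency_seconds: float


def p95_nearest_rank(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


def layer_scores(results: list[CaseResult]) -> dict[str, float]:
    """Summarize outcome, process, and cost for one evaluation run."""
    if not results:
        raise ValueError("cannot score an empty evaluation run")
    return {
        "outcome_pass_rate": sum(r.task_completed for r in results) / len(results),
        "process_critical_failures": sum(r.critical_process_failures for r in results),
        "cost_p95_latency_seconds": p95_nearest_rank([r.latency_seconds for r in results]),
    }
```

Keeping the three numbers separate, rather than folding them into one score, preserves the point of the table: a high pass rate should not be able to hide a process failure.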
Release thresholds
I use thresholds like these before letting a multi-agent workflow graduate from prototype to controlled production:
| Gate | Threshold |
|---|---|
| Critical task pass rate | 95% or higher on pinned regression set |
| Catastrophic failure | Zero known cases of permission overreach or destructive wrong action |
| Evidence completeness | 90% of final answers cite required evidence |
| Handoff integrity | No unresolved required field in structured handoffs |
| p95 latency | Within task-class budget |
| Tool error recovery | Retries preserve intent and do not duplicate mutation |
| Human approval | Required approval appears before sensitive action |
These numbers are not universal. The point is that they exist before the demo becomes a release argument.
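Written as code, the gate is a list of checks that returns reasons rather than a single boolean, so a blocked release comes with its explanation. This is a minimal sketch: `GateMetrics` and `release_blockers` are hypothetical names, the 95% and 90% values are the example thresholds from the table, and how each metric is measured is left as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateMetrics:
    critical_pass_rate: float                  # fraction, on the pinned regression set
    catastrophic_failures: int                 # permission overreach or destructive wrong action
    evidence_completeness: float               # fraction of answers citing required evidence
    unresolved_handoff_fields: int
    p95_latency_seconds: float
    latency_budget_seconds: float              # budget for this task class
    retries_that_drift_or_duplicate: int       # retries that change intent or repeat a mutation
    sensitive_actions_without_prior_approval: int


def release_blockers(m: GateMetrics) -> list[str]:
    """Return the reasons this release is blocked; an empty list means it may ship."""
    checks = [
        (m.critical_pass_rate >= 0.95, "critical task pass rate below 95%"),
        (m.catastrophic_failures == 0, "known catastrophic failure"),
        (m.evidence_completeness >= 0.90, "evidence completeness below 90%"),
        (m.unresolved_handoff_fields == 0, "unresolved required handoff field"),
        (m.p95_latency_seconds <= m.latency_budget_seconds, "p95 latency over task-class budget"),
        (m.retries_that_drift_or_duplicate == 0, "retry changed intent or duplicated a mutation"),
        (m.sensitive_actions_without_prior_approval == 0, "sensitive action ran before approval"),
    ]
    return [reason for passed, reason in checks if not passed]
```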
The tradeoff
The tradeoff is slower iteration in exchange for reliability that is measured rather than theatrical.
The fast path is to keep improving prompts from observed demos. That feels productive because every demo teaches something. It also lets the team overfit to a handful of polished examples.
The disciplined path pins failures. Every time the system fails, the case becomes part of the regression set. That creates drag. It also creates memory. Without that memory, agent workflows regress quietly.
What I would instrument
At runtime, I want traces that answer these questions:
- Which agent accepted the task?
- What did it believe the done condition was?
- Which tools did it call?
- Which evidence reached the final answer?
- Which handoff fields were missing or invented?
- Which permission scope was used?
- How many retries happened and why?
- Did the workflow stop because it was done, timed out, or gave up?
If the trace cannot answer those, the workflow cannot be improved safely.
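One way to enforce that is to give each question a field in the trace record and to check which fields were never recorded. The sketch below assumes `None` means "not recorded" while an empty list means "recorded, and there were none". `WorkflowTrace`, `STOP_REASONS`, and `unanswered_questions` are hypothetical names, not part of any tracing library.

```python
from dataclasses import dataclass, fields
from typing import Optional

# The three ways the workflow is allowed to stop.
STOP_REASONS = {"done", "timed_out", "gave_up"}


@dataclass
class WorkflowTrace:
    accepting_agent: Optional[str] = None
    believed_done_condition: Optional[str] = None
    tool_calls: Optional[list[str]] = None
    evidence_in_final_answer: Optional[list[str]] = None
    handoff_fields_missing_or_invented: Optional[list[str]] = None
    permission_scope: Optional[str] = None
    retry_reasons: Optional[list[str]] = None   # one entry per retry, stating why
    stop_reason: Optional[str] = None


def unanswered_questions(trace: WorkflowTrace) -> list[str]:
    """Return the trace fields that cannot answer their question."""
    missing = [f.name for f in fields(trace) if getattr(trace, f.name) is None]
    # A stop reason outside the known set does not answer the question either.
    if trace.stop_reason is not None and trace.stop_reason not in STOP_REASONS:
        missing.append("stop_reason")
    return missing
```

Running this check in the evaluation harness turns "the trace cannot answer those" from a review comment into a failing test.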
What to do differently
Do not ask whether the agents are impressive. Ask what class of task they are allowed to own and what failure would make that ownership unsafe.
Then build the evaluation set around that failure.
Multi-agent systems do not become reliable because the roles have good names. They become reliable when the workflow has a task taxonomy, failure labels, fixtures, thresholds, traces, and a release gate that can say no.