Evaluating Multi-Agent Workflows for Enterprise Reliability

Outcome focus: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.

The workflow passed the demo and failed the first realistic ticket.

That is the multi-agent failure pattern I trust least. In the demo, the planner decomposes the task, the researcher finds context, the tool agent calls the right system, and the writer returns a polished answer. Everyone can see the promise.

Then a real ticket arrives with missing context, ambiguous ownership, stale documentation, a tool timeout, and a user who expects the system to know when to stop.

The agents do not fail dramatically. They drift. One agent asks another to confirm something the first agent could have checked. A tool call gets retried without preserving the original intent. The writer produces a confident answer with one missing prerequisite. The workflow looks productive while the actual task remains unfinished.

That is why I evaluate multi-agent systems as workflows, not chat transcripts.

The task taxonomy comes first

Before scoring anything, define what kind of work the agents are doing. A workflow for support triage should not be scored like a workflow for code migration.

Here is a taxonomy I use as a starting point:

Task class	Example	Main risk
Retrieval and synthesis	Summarize a policy, collect product context	Missing or stale evidence
Action with approval	Draft refund, propose CRM update	Overbroad permission or skipped review
Multi-step analysis	Investigate failed data job	Bad handoff or premature conclusion
Code or config change	Modify workflow, open pull request	Incorrect patch or missing regression test
Long-running follow-up	Monitor status and resume later	Lost state or repeated work

Each class needs different release thresholds. A retrieval workflow can tolerate more latency than a production automation workflow. A code-change workflow needs artifact diff checks. An approval workflow needs permission and audit checks.
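One way to keep per-class thresholds out of people's heads is to store them next to the taxonomy. The sketch below is illustrative only: `TaskClass`, `ReleaseProfile`, `RELEASE_PROFILES`, and `profile_for` are hypothetical names, and every number and check name is a placeholder to be replaced with values your team has agreed on.

```python
from dataclasses import dataclass
from enum import Enum


class TaskClass(Enum):
    # Values mirror the taxonomy table; the snake_case spelling is an assumption.
    RETRIEVAL_AND_SYNTHESIS = "retrieval_and_synthesis"
    ACTION_WITH_APPROVAL = "action_with_approval"
    MULTI_STEP_ANALYSIS = "multi_step_analysis"
    CODE_OR_CONFIG_CHANGE = "code_or_config_change"
    LONG_RUNNING_FOLLOW_UP = "long_running_follow_up"


@dataclass(frozen=True)
class ReleaseProfile:
    p95_budget_seconds: float  # placeholder latency budget for the class
    extra_checks: tuple        # names of class-specific checks (hypothetical)


# Placeholder numbers, not recommendations.
RELEASE_PROFILES = {
    TaskClass.RETRIEVAL_AND_SYNTHESIS: ReleaseProfile(120.0, ("evidence",)),
    TaskClass.ACTION_WITH_APPROVAL: ReleaseProfile(60.0, ("permission", "audit")),
    TaskClass.MULTI_STEP_ANALYSIS: ReleaseProfile(90.0, ("handoff", "evidence")),
    TaskClass.CODE_OR_CONFIG_CHANGE: ReleaseProfile(90.0, ("artifact_diff", "regression_test")),
    TaskClass.LONG_RUNNING_FOLLOW_UP: ReleaseProfile(90.0, ("state_resume",)),
}


def profile_for(task_class_name: str) -> ReleaseProfile:
    """Look up the release profile; an unknown class name raises ValueError."""
    return RELEASE_PROFILES[TaskClass(task_class_name)]
```

Failing loudly on an unknown class is deliberate in this sketch: a workflow that cannot be classified should not inherit someone else's thresholds.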

Failure classes

The evaluation set should label failures in a way that points to a fix.

Failure class	What it looks like	Fix direction
Handoff loss	Planner passes a vague subtask and downstream agent invents context	Structured handoff contract
Tool misuse	Right tool, but wrong arguments or wrong object scope	Tool schema and fixture coverage
Permission overreach	Agent can read or mutate more than task requires	Narrow manifest or scoped credentials
Completion ambiguity	Workflow stops before the real done condition	Explicit acceptance criteria
Retry drift	Retry changes the original user intent	Retry state and idempotency key
Evidence gap	Final answer cites no source or unverifiable context	Evidence requirements in output contract
Latency spiral	Agents debate or duplicate work	Role boundaries and stop rules

The labels matter. "Bad answer" is too blunt. "Handoff loss after planner decomposition" gives the engineer something to change.
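A label is easier to act on when it carries both the failure class and the stage where it surfaced. A minimal sketch, assuming hypothetical names (`FailureClass`, `FIX_DIRECTIONS`, `FailureLabel`) and copying the fix directions from the table above:

```python
from dataclasses import dataclass
from enum import Enum


class FailureClass(Enum):
    HANDOFF_LOSS = "handoff loss"
    TOOL_MISUSE = "tool misuse"
    PERMISSION_OVERREACH = "permission overreach"
    COMPLETION_AMBIGUITY = "completion ambiguity"
    RETRY_DRIFT = "retry drift"
    EVIDENCE_GAP = "evidence gap"
    LATENCY_SPIRAL = "latency spiral"


# Fix directions taken from the failure-class table.
FIX_DIRECTIONS = {
    FailureClass.HANDOFF_LOSS: "Structured handoff contract",
    FailureClass.TOOL_MISUSE: "Tool schema and fixture coverage",
    FailureClass.PERMISSION_OVERREACH: "Narrow manifest or scoped credentials",
    FailureClass.COMPLETION_AMBIGUITY: "Explicit acceptance criteria",
    FailureClass.RETRY_DRIFT: "Retry state and idempotency key",
    FailureClass.EVIDENCE_GAP: "Evidence requirements in output contract",
    FailureClass.LATENCY_SPIRAL: "Role boundaries and stop rules",
}


@dataclass(frozen=True)
class FailureLabel:
    failure_class: FailureClass
    stage: str  # free-text stage name, e.g. "planner decomposition" (assumption)

    def describe(self) -> str:
        return f"{self.failure_class.value} after {self.stage}"

    def fix_direction(self) -> str:
        return FIX_DIRECTIONS[self.failure_class]
```

For example, `FailureLabel(FailureClass.HANDOFF_LOSS, "planner decomposition").describe()` yields a label an engineer can triage without rereading the transcript.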

A demo that failed evaluation

The sanitized workflow looked simple: triage a failed data pipeline alert.

The agents were a planner, a log reader, a dependency checker, a runbook summarizer, and an incident writer.

The demo case worked. The alert pointed to one failing job, the logs included the exception, and the runbook had a matching section. The workflow produced a clean incident summary.

The evaluation case added realistic friction:

the failing job was downstream of the actual broken source table,
the runbook had two similar sections,
the log reader found a retry failure but not the first failure,
the dependency checker had read-only access,
the incident writer needed to recommend escalation only if the freshness SLA was breached.

The workflow failed because no agent owned the done condition. The writer summarized the retry failure and said the job was recovering. The actual upstream table remained stale.
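One way to force ownership of the done condition is to make it a required field in every handoff and reject handoffs that leave it blank. The following is a sketch under assumptions: the `Handoff` fields and the `unresolved_fields` helper are hypothetical, not the contract used in the workflow described here.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Handoff:
    # All fields are required; an empty string counts as unresolved.
    from_agent: str
    to_agent: str
    subtask: str
    done_condition: str        # what "finished" means for the whole task
    done_condition_owner: str  # the agent accountable for checking it


def unresolved_fields(handoff: Handoff) -> list:
    """Return the names of required fields that are empty or whitespace."""
    return [f.name for f in fields(handoff) if not getattr(handoff, f.name).strip()]


vague = Handoff(
    from_agent="planner",
    to_agent="log reader",
    subtask="check the logs",
    done_condition="",
    done_condition_owner="",
)
print(unresolved_fields(vague))
```

In the failed case above, a check like this would have flagged the planner's handoff before the writer ever summarized the retry failure.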

Evaluation artifact

The useful artifact was a fixture file per evaluation case.

eval-pipeline-alert.yaml

id: pipeline-alert-stale-upstream-001
task_class: multi_step_analysis
user_request: "Triage the failed nightly revenue mart run and tell me whether to escalate."
fixtures:
  alert: revenue_mart_failed.json
  logs:
    - revenue_mart_retry.log
    - upstream_orders_first_failure.log
  runbook: revenue_pipeline_runbook.md
  dependency_graph: revenue_dependencies.json
expected:
  required_findings:
    - upstream_orders table caused the downstream failure
    - freshness SLA is breached
    - escalation to data platform on-call is required
  forbidden_findings:
    - revenue mart has recovered
    - no escalation required
thresholds:
  max_wall_clock_seconds: 90
  max_tool_calls: 18
  required_evidence_count: 3

This fixture does three important things:

It defines the task class.
It names the done condition.
It makes latency and tool count part of the release conversation.
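A fixture like this only helps if something checks a run against it. The sketch below mirrors the `expected` and `thresholds` sections as a Python dict to stay standard-library only. `RunResult` and `check_case` are hypothetical names, and the sketch assumes a grader has already normalized the workflow's findings into the same strings the fixture uses, which is the hard part in practice.

```python
from dataclasses import dataclass

# Mirrors the expected/thresholds sections of eval-pipeline-alert.yaml.
FIXTURE = {
    "expected": {
        "required_findings": [
            "upstream_orders table caused the downstream failure",
            "freshness SLA is breached",
            "escalation to data platform on-call is required",
        ],
        "forbidden_findings": [
            "revenue mart has recovered",
            "no escalation required",
        ],
    },
    "thresholds": {
        "max_wall_clock_seconds": 90,
        "max_tool_calls": 18,
        "required_evidence_count": 3,
    },
}


@dataclass(frozen=True)
class RunResult:
    findings: frozenset  # normalized finding strings (assumption: produced by a grader)
    evidence_count: int
    wall_clock_seconds: float
    tool_calls: int


def check_case(fixture: dict, run: RunResult) -> list:
    """Return human-readable problems; an empty list means the case passed."""
    expected, limits = fixture["expected"], fixture["thresholds"]
    problems = []
    for finding in expected["required_findings"]:
        if finding not in run.findings:
            problems.append(f"missing required finding: {finding}")
    for finding in expected["forbidden_findings"]:
        if finding in run.findings:
            problems.append(f"forbidden finding present: {finding}")
    if run.wall_clock_seconds > limits["max_wall_clock_seconds"]:
        problems.append("wall clock over budget")
    if run.tool_calls > limits["max_tool_calls"]:
        problems.append("too many tool calls")
    if run.evidence_count < limits["required_evidence_count"]:
        problems.append("not enough evidence cited")
    return problems
```

Returning a list of problems rather than a single boolean keeps the failure labels available for whoever has to fix the workflow.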

Scoring

I score workflow behavior at three layers.

Layer	What it checks	Example threshold
Outcome	Did the workflow complete the task?	95% pass on critical regression set
Process	Did handoffs, tools, permissions, and evidence behave correctly?	No critical permission or evidence failures
Cost	Did latency and tool count stay within budget?	p95 under target for task class

Outcome is not enough. A workflow can get the right answer by reading too much data, calling unsafe tools, or relying on an accident in the fixture. Process scoring catches those.

Cost is not only money. Latency changes user trust. If a workflow takes three minutes to triage an alert that an engineer can classify in thirty seconds, it has to be more accurate, more complete, or more auditable to earn the time.
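The three layers can be kept separate in the score itself, so that a correct answer cannot hide a process failure. A minimal sketch with hypothetical names (`CaseScore`, `layer_verdicts`) and budgets passed in by the caller:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScore:
    outcome_ok: bool          # did the workflow complete the task?
    process_failures: tuple   # labels such as "permission overreach"
    wall_clock_seconds: float
    tool_calls: int


def layer_verdicts(score: CaseScore, max_seconds: float, max_tool_calls: int) -> dict:
    """Report each layer independently instead of collapsing to one number."""
    return {
        "outcome": score.outcome_ok,
        "process": len(score.process_failures) == 0,
        "cost": (
            score.wall_clock_seconds <= max_seconds
            and score.tool_calls <= max_tool_calls
        ),
    }


# Right answer, reached by reading more than the task required.
lucky = CaseScore(True, ("permission overreach",), 45.0, 10)
print(layer_verdicts(lucky, max_seconds=90, max_tool_calls=18))
```

Here the outcome layer passes while the process layer fails, which is the situation an outcome-only score would miss.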

Release thresholds

I use thresholds like these before letting a multi-agent workflow graduate from prototype to controlled production:

Gate	Threshold
Critical task pass rate	95% or higher on pinned regression set
Catastrophic failure	Zero known cases of permission overreach or destructive wrong action
Evidence completeness	90% of final answers cite required evidence
Handoff integrity	No unresolved required field in structured handoffs
p95 latency	Within task-class budget
Tool error recovery	Retries preserve intent and do not duplicate mutation
Human approval	Required approval appears before sensitive action

These numbers are not universal. The point is that they exist before the demo becomes a release argument.
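A gate that can say no is easier to defend when it is a function rather than a meeting. The sketch below encodes five of the gates from the table, leaving out tool error recovery and human approval, which need trace data to check. `CaseResult`, `p95_nearest_rank`, and `release_blockers` are hypothetical names, the percentages are the example thresholds above, and the p95 uses the nearest-rank method, one of several common definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    passed: bool
    catastrophic: bool        # permission overreach or destructive wrong action
    evidence_complete: bool
    handoffs_resolved: bool   # no unresolved required handoff field
    latency_seconds: float


def p95_nearest_rank(values):
    """95th percentile by nearest rank; expects a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def release_blockers(results, latency_budget_seconds):
    """Return the list of failed gates; an empty list means the release may proceed."""
    if not results:
        return ["regression set is empty"]
    n = len(results)
    blockers = []
    # Integer comparisons avoid floating-point edge cases at exactly 95% / 90%.
    if sum(r.passed for r in results) * 100 < 95 * n:
        blockers.append("critical task pass rate below 95%")
    if any(r.catastrophic for r in results):
        blockers.append("catastrophic failure present")
    if sum(r.evidence_complete for r in results) * 100 < 90 * n:
        blockers.append("evidence completeness below 90%")
    if not all(r.handoffs_resolved for r in results):
        blockers.append("unresolved required handoff field")
    if p95_nearest_rank([r.latency_seconds for r in results]) > latency_budget_seconds:
        blockers.append("p95 latency over budget")
    return blockers
```

Treating an empty regression set as a blocker is a design choice in this sketch: a gate with nothing to check should not pass by default.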

The tradeoff

The tradeoff is slower iteration in exchange for reliability that is real rather than theatrical.

The fast path is to keep improving prompts from observed demos. That feels productive because every demo teaches something. It also lets the team overfit to a handful of polished examples.

The disciplined path pins failures. Every time the system fails, the case becomes part of the regression set. That creates drag. It also creates memory. Without that memory, agent workflows regress quietly.

What I would instrument

At runtime, I want traces that answer these questions:

Which agent accepted the task?
What did it believe the done condition was?
Which tools did it call?
Which evidence reached the final answer?
Which handoff fields were missing or invented?
Which permission scope was used?
How many retries happened and why?
Did the workflow stop because it was done, timed out, or gave up?

If the trace cannot answer those, the workflow cannot be improved safely.
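Those questions map naturally onto a trace record with one field per question. The sketch below is an assumption about shape, not a real tracing schema: `WorkflowTrace`, its field names, and `unanswered` are hypothetical. `None` means the question was never recorded, while an empty list means it was recorded and the answer was "none".

```python
from dataclasses import dataclass, fields
from typing import Optional

# The three stop reasons named in the question list above.
STOP_REASONS = {"done", "timed_out", "gave_up"}


@dataclass
class WorkflowTrace:
    accepting_agent: Optional[str] = None
    believed_done_condition: Optional[str] = None
    tool_calls: Optional[list] = None
    evidence_in_answer: Optional[list] = None
    handoff_field_issues: Optional[list] = None  # missing or invented fields
    permission_scope: Optional[str] = None
    retry_reasons: Optional[list] = None         # one entry per retry
    stop_reason: Optional[str] = None


def unanswered(trace: WorkflowTrace) -> list:
    """Return the names of questions the trace cannot answer."""
    missing = [f.name for f in fields(trace) if getattr(trace, f.name) is None]
    # An unrecognized stop reason is treated as no answer at all.
    if trace.stop_reason is not None and trace.stop_reason not in STOP_REASONS:
        missing.append("stop_reason")
    return missing
```

Running `unanswered` over the traces from an evaluation run gives a quick read on whether the instrumentation is good enough to debug the next failure.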

What to do differently

Do not ask whether the agents are impressive. Ask what class of task they are allowed to own and what failure would make that ownership unsafe.

Then build the evaluation set around that failure.

Multi-agent systems do not become reliable because the roles have good names. They become reliable when the workflow has a task taxonomy, failure labels, fixtures, thresholds, traces, and a release gate that can say no.