What Complexity Science Teaches About AI Evaluation

Outcome focus: Improved evaluation strategy by testing full decision paths, interaction effects, feedback loops, and second-order behavior instead of isolated component scores.

The retrieval score improved and the product got worse.

That is the kind of AI evaluation failure complexity science helps explain. A component metric moves in the right direction. The team celebrates. Then the product behavior degrades because the component does not live alone.

In a retrieval-augmented support assistant, better recall can mean more evidence. It can also mean longer context, more conflicting snippets, slower responses, and a generator that hedges instead of deciding. The component improved locally. The system degraded globally.

AI evaluation fails when teams assume independent parts inside a coupled system.

The concrete example

Imagine a support assistant for internal agents. It answers policy questions, cites sources, and recommends escalation when confidence is low.

The team evaluates three components:

Component	Local metric	Looks good when
Retriever	Recall@k	More relevant documents appear in top results
Generator	Rubric score	Answer is complete, polite, and grounded
Escalation classifier	Precision/recall	Uncertain cases route to review

Those are useful metrics. They are not enough.

The team changes the retriever from top-5 to top-12 because recall improves. In isolation, the move is defensible. In the full workflow, several things change:

average context length increases,
conflicting policy versions appear more often,
answer latency rises,
generator confidence drops on edge cases,
escalation volume increases,
human reviewers start ignoring escalations because too many are low-value.

The measured retrieval metric improved. The system in operation got worse.

A local retrieval change can create a feedback loop that changes generator behavior, human review, and future evaluation labels.
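One way to make that coupling visible is to report a downstream risk signal next to the retrieval metric. The sketch below is a minimal illustration under stated assumptions: the `Doc` type, the `has_version_conflict` helper, and the five-document ranking are hypothetical, and the cutoffs are 2 and 5 rather than 5 and 12 only to keep the toy data short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    policy: str    # which policy the document belongs to
    current: bool  # False means an archived version


def recall_at_k(ranked, relevant_ids, k):
    """Fraction of relevant documents that appear in the top k results."""
    top = {d.doc_id for d in ranked[:k]}
    return len(top & relevant_ids) / len(relevant_ids)


def has_version_conflict(ranked, k):
    """True if the top k holds both a current and an archived version of one policy."""
    versions_seen = {}
    for d in ranked[:k]:
        versions_seen.setdefault(d.policy, set()).add(d.current)
    return any(len(flags) == 2 for flags in versions_seen.values())


# Hypothetical ranking for a single refund question (illustrative only).
ranked = [
    Doc("refund-v3", "refund", True),
    Doc("shipping-v2", "shipping", True),
    Doc("refund-faq", "refund", True),
    Doc("refund-v2", "refund", False),  # archived version of the refund policy
    Doc("escalation-v1", "escalation", True),
]
relevant = {"refund-v3", "refund-faq", "escalation-v1"}

for k in (2, 5):
    print(k, round(recall_at_k(ranked, relevant, k), 2), has_version_conflict(ranked, k))
```

On this toy ranking, recall rises from one third to 1.0 as the cutoff deepens, and the same change pulls the archived refund policy into the context. The retrieval metric alone would only report the first half of that.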

What complexity adds

Complexity science is not a decorative metaphor here. It offers a few practical warnings.

First, local optimization can degrade global behavior. Better component scores do not guarantee better product outcomes.

Second, feedback loops change the system after release. A support assistant does not only answer questions. It changes how humans ask, when they escalate, and which examples become future evaluation data.

Third, interaction effects matter. The retriever, generator, tool layer, approval flow, and human reviewer are not independent variables.

Fourth, the system can cross thresholds. A small increase in escalation volume may be harmless until reviewer capacity is saturated. After that, review quality collapses quickly.
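The threshold effect can be sketched with a deliberately crude queue model. Everything here is an assumption for illustration: arrivals and reviewer capacity are constant per day, the numbers are made up, and `backlog_over_time` is a hypothetical helper, not a model of any real review queue.

```python
def backlog_over_time(arrivals_per_day, capacity_per_day, days):
    """Return the end-of-day review backlog for each day of a simple queue."""
    backlog = 0
    history = []
    for _ in range(days):
        # Unreviewed escalations carry over; the backlog never drops below zero.
        backlog = max(0, backlog + arrivals_per_day - capacity_per_day)
        history.append(backlog)
    return history


below_capacity = backlog_over_time(arrivals_per_day=95, capacity_per_day=100, days=5)
above_capacity = backlog_over_time(arrivals_per_day=105, capacity_per_day=100, days=5)
print(below_capacity)  # [0, 0, 0, 0, 0]
print(above_capacity)  # [5, 10, 15, 20, 25]
```

The two runs differ by ten escalations a day. Below capacity the backlog stays at zero; above it, the backlog grows every day without bound. A monitor that only tracks escalation volume would see a small change, while a monitor on backlog would see the regime shift.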

The tradeoff

The tradeoff is evaluation speed versus evaluation realism.

Component tests are fast. They are cheap to run. They give clean signals. I use them.

Full decision-path tests are slower. They require realistic fixtures, tool mocks, latency budgets, and human review assumptions. They are harder to maintain.

The mistake is choosing only one. The practical evaluation stack needs all three:

component tests for tight engineering feedback,
slice tests for full user journeys,
longitudinal monitoring for feedback loops after release.

If budget forces me to pick one for a high-risk AI feature, I choose the slice test. It is less elegant and more useful.

Evaluation artifact: a decision-path slice

Here is the kind of slice I want for the support assistant:

Step	Evaluation question	Failure signal
User question	Is the request realistic and ambiguous enough?	Only toy prompts pass
Retrieval	Are current and conflicting policies visible?	Missing governing doc or stale doc wins
Generation	Does the answer cite the right source and state uncertainty?	Confident answer with weak evidence
Escalation	Does uncertainty route correctly?	Too many low-value escalations or missed high-risk cases
Human handoff	Does the reviewer receive useful context?	Reviewer repeats the model's work
Feedback	Does the resolved case update eval labels?	Future evals learn from noisy or missing labels

This slice evaluates behavior the user actually experiences. It also exposes second-order costs: latency, reviewer load, and feedback quality.
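A slice like this can be recorded as a small data structure so that each test case reports where along the path it failed. The sketch below is one possible shape, not a prescribed harness: `SliceResult`, the step names in `STEPS`, and the case id are all hypothetical, and the pass/fail judgments are assumed to come from whatever checks or reviewers the team already uses.

```python
from dataclasses import dataclass, field

# Steps in path order, mirroring the rows of the slice table.
STEPS = (
    "user_question",
    "retrieval",
    "generation",
    "escalation",
    "human_handoff",
    "feedback",
)


@dataclass
class SliceResult:
    case_id: str
    failures: dict = field(default_factory=dict)  # step name -> failure signal

    def record(self, step, passed, failure_signal):
        """Store the failure signal for a step that did not pass."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        if not passed:
            self.failures[step] = failure_signal

    def first_failure(self):
        """Return (step, signal) for the earliest failing step, or None."""
        for step in STEPS:
            if step in self.failures:
                return step, self.failures[step]
        return None

    @property
    def passed(self):
        return not self.failures


# Hypothetical outcome for one test case.
result = SliceResult("refund-conflict-01")
result.record("retrieval", True, "stale doc wins")
result.record("human_handoff", False, "reviewer repeats the model's work")
result.record("generation", False, "confident answer with weak evidence")
print(result.first_failure())
```

Reporting the earliest failing step matters because later failures are often consequences of it. In this example the handoff problem is recorded first, but the slice still points at generation as the place to look.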

Emergent failure story

The emergent failure I watch for is reviewer fatigue.

At first, an escalation classifier that is slightly overcautious looks safe. It catches more uncertain cases. Nobody wants the model to guess on sensitive policy questions.

Then the queue grows. Reviewers start skimming. Low-risk escalations crowd out high-risk ones. Feedback quality drops because reviewers leave shorter notes. The next evaluation set inherits weaker labels. The model appears harder to improve even though the original classifier change looked conservative.

This is why "safe" local behavior can become unsafe system behavior.

Practical design

The evaluation design I would ship has four layers.

Layer	Purpose	Cadence
Unit evals	Retriever, generator, classifier, and tool behavior	Every change
Slice evals	Full decision paths with realistic fixtures	Every release
Stress evals	Capacity, latency, conflicting evidence, stale docs	Before major launch
Production monitors	Drift, escalation load, feedback quality, user correction	Ongoing

Each layer answers a different question. Unit evals ask "did the part behave?" Slice evals ask "did the path work?" Stress evals ask "where does it break?" Production monitors ask "what changed after users adapted?"

A small stress test

The stress test I would add first is conflicting evidence under load.

Take twenty support questions where the current policy and an archived policy both look relevant. Run the workflow with normal retrieval depth, then with a higher recall setting. Track four things:

Signal	Why it matters
Correct current-source citation	The system found the governing evidence
Answer latency	Extra context did not break the user experience
Escalation decision	The model knew when conflict required review
Reviewer correction rate	Humans did not have to repair most answers

This test is intentionally small. It is not a benchmark for bragging. It is a pressure test for interaction effects.

If higher recall improves citation coverage but doubles reviewer corrections, the retrieval change is not a clear win. If latency rises but correction rate falls on high-risk cases, the tradeoff may be acceptable. The useful answer depends on the product's operating constraint.
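The comparison between the two retrieval settings can be sketched as a signal-by-signal difference. This is a minimal illustration under stated assumptions: `RunRecord`, `summarize`, and `compare` are hypothetical names, the escalation signal is scored against an assumed reviewer label (`should_escalate`), and the four records per setting are placeholders standing in for the twenty questions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    cited_current_source: bool  # answer cited the governing policy
    latency_s: float            # end-to-end answer latency in seconds
    escalated: bool             # workflow routed the case to review
    should_escalate: bool       # assumed reviewer label for the case
    reviewer_corrected: bool    # reviewer had to repair the answer


def rate(flags):
    """Share of True values in a non-empty sequence of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


def summarize(records):
    """Collapse per-question records into the four tracked signals."""
    return {
        "current_citation_rate": rate(r.cited_current_source for r in records),
        "mean_latency_s": sum(r.latency_s for r in records) / len(records),
        "escalation_accuracy": rate(r.escalated == r.should_escalate for r in records),
        "correction_rate": rate(r.reviewer_corrected for r in records),
    }


def compare(baseline, candidate):
    """Signal-by-signal difference: candidate minus baseline."""
    base, cand = summarize(baseline), summarize(candidate)
    return {name: cand[name] - base[name] for name in base}


# Placeholder records for each retrieval setting (illustrative only).
normal_depth = [
    RunRecord(True, 2.0, False, False, False),
    RunRecord(False, 2.0, False, True, True),
    RunRecord(True, 2.0, True, True, False),
    RunRecord(False, 2.0, False, False, True),
]
higher_recall = [
    RunRecord(True, 3.0, True, False, False),
    RunRecord(True, 3.0, True, True, False),
    RunRecord(True, 3.0, True, True, False),
    RunRecord(False, 3.0, True, False, True),
]
deltas = compare(normal_depth, higher_recall)
print(deltas)
```

The function returns the deltas and stops there on purpose. Whether a gain in citation rate is worth a loss in latency or escalation accuracy is a product decision, so the sketch leaves the acceptance rule to whoever owns the operating constraint.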

That is the complexity lens in practice: measure the behavior that emerges when the parts interact, not only the part that changed. The smallest useful test is the one that forces the coupling into view before a user finds it.

What to do differently

Do not stop at component accuracy.

When an AI system touches a real decision, map the path from user request to answer, tool call, human handoff, and feedback. Then test that path as a system.

The question is not whether retrieval, generation, and routing each look good in isolation. The question is what behavior emerges when they interact under load, ambiguity, stale context, and human capacity limits.