What Complexity Science Teaches About AI Evaluation

A practical AI evaluation essay showing how locally strong retrieval, reasoning, and tool-use components can interact into globally weak product behavior.

By Jovani Pink · March 7, 2026 · 6 min · Systems & Complexity Notes

Outcome focus: Improved evaluation strategy by testing full decision paths, interaction effects, feedback loops, and second-order behavior instead of isolated component scores.

The retrieval score improved and the product got worse.

That is the kind of AI evaluation failure complexity science helps explain. A component metric moves in the right direction. The team celebrates. Then the product behavior degrades because the component does not live alone.

In a retrieval-augmented support assistant, better recall can mean more evidence. It can also mean longer context, more conflicting snippets, slower responses, and a generator that hedges instead of deciding. The component improved locally. The system degraded globally.

AI evaluation fails when teams assume independent parts inside a coupled system.

The concrete example

Imagine a support assistant for internal agents. It answers policy questions, cites sources, and recommends escalation when confidence is low.

The team evaluates three components:

Component | Local metric | Looks good when
Retriever | Recall@k | More relevant documents appear in top results
Generator | Rubric score | Answer is complete, polite, and grounded
Escalation classifier | Precision/recall | Uncertain cases route to review

Those are useful metrics. They are not enough.

The team changes the retriever from top-5 to top-12 because recall improves. In isolation, the move is defensible. In the full workflow, several things change:

  • average context length increases,
  • conflicting policy versions appear more often,
  • answer latency rises,
  • generator confidence drops on edge cases,
  • escalation volume increases,
  • human reviewers start ignoring escalations because too many are low-value.

The measured retrieval metric improved. The system in operation got worse.

A local retrieval change can create a feedback loop that changes generator behavior, human review, and future evaluation labels.
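A minimal sketch of the top-5 versus top-12 change, using a made-up ranking rather than real data. The `Doc` fields, the policy names, and the assumption that relevance labels are topical (so an archived version still counts as a hit) are all illustrative. It shows how the same change can raise recall and put conflicting policy versions into the context at the same time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    policy: str     # which policy the document describes (hypothetical names)
    version: int    # higher means newer
    relevant: bool  # topical relevance label; archived versions can still be "relevant"


def recall_at_k(ranked, k):
    """Fraction of all relevant documents that appear in the top k."""
    total_relevant = sum(d.relevant for d in ranked)
    if total_relevant == 0:
        return 0.0
    return sum(d.relevant for d in ranked[:k]) / total_relevant


def conflicting_policies(ranked, k):
    """Policies that appear in the top k with more than one version."""
    versions = {}
    for d in ranked[:k]:
        versions.setdefault(d.policy, set()).add(d.version)
    return sorted(p for p, seen in versions.items() if len(seen) > 1)


# Illustrative ranking for one question, best match first.
RANKED = [
    Doc("refunds", 3, True),
    Doc("shipping", 2, False),
    Doc("returns", 1, True),
    Doc("billing", 1, False),
    Doc("warranty", 2, False),
    # Positions 6 to 12 only reach the generator at top-12.
    Doc("refunds", 1, True),  # archived version that still looks relevant
    Doc("privacy", 1, False),
    Doc("discounts", 1, False),
    Doc("escalation", 2, True),
    Doc("onboarding", 1, False),
    Doc("travel", 1, False),
    Doc("security", 1, False),
]

for k in (5, 12):
    print(k, recall_at_k(RANKED, k), conflicting_policies(RANKED, k))
```

On this toy ranking, recall rises from 0.5 at top-5 to 1.0 at top-12, and the refunds policy now appears in two versions. Tracking the second number next to the first is the point of the sketch.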

What complexity adds

Complexity science is not a decorative metaphor here. It offers a few practical warnings.

First, local optimization can degrade global behavior. Better component scores do not guarantee better product outcomes.

Second, feedback loops change the system after release. A support assistant does not only answer questions. It changes how humans ask, when they escalate, and which examples become future evaluation data.

Third, interaction effects matter. The retriever, generator, tool layer, approval flow, and human reviewer are not independent variables.

Fourth, the system can cross thresholds. A small increase in escalation volume may be harmless until reviewer capacity is saturated. After that, review quality collapses quickly.
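The threshold effect can be sketched with a deliberately simple queue model. The function name, the fixed daily arrival rate, and the capacity figure are assumptions for illustration, not measurements from any real review team.

```python
def simulate_backlog(arrivals_per_day, capacity_per_day, days):
    """Return the end-of-day review backlog for each simulated day.

    Assumes constant daily arrivals and constant reviewer capacity,
    which is a simplification: real queues fluctuate.
    """
    backlog = 0
    history = []
    for _ in range(days):
        # Unreviewed cases carry over; the backlog cannot go below zero.
        backlog = max(0, backlog + arrivals_per_day - capacity_per_day)
        history.append(backlog)
    return history


# Hypothetical team that can review 100 escalations per day.
print(simulate_backlog(90, 100, 5))   # below capacity
print(simulate_backlog(110, 100, 5))  # slightly above capacity
```

Below capacity the backlog stays at zero no matter how close arrivals get to the limit. Just above capacity it grows every day without bound. Because this model ignores day-to-day variance, real queues will probably start degrading somewhat before the nominal limit.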

The tradeoff

The tradeoff is evaluation speed versus evaluation realism.

Component tests are fast. They are cheap to run. They give clean signals. I use them.

Full decision-path tests are slower. They require realistic fixtures, tool mocks, latency budgets, and human review assumptions. They are harder to maintain.

The mistake is choosing only one. A practical evaluation stack needs both kinds of test, plus monitoring after release:

  1. component tests for tight engineering feedback,
  2. slice tests for full user journeys,
  3. longitudinal monitoring for feedback loops after release.

If I can only afford one for a high-risk AI feature, I choose the slice test. It is less elegant and more useful.

Evaluation artifact: a decision-path slice

Here is the kind of slice I want for the support assistant:

Step | Evaluation question | Failure signal
User question | Is the request realistic and ambiguous enough? | Only toy prompts pass
Retrieval | Are current and conflicting policies visible? | Missing governing doc or stale doc wins
Generation | Does answer cite the right source and state uncertainty? | Confident answer with weak evidence
Escalation | Does uncertainty route correctly? | Too many low-value escalations or missed high-risk cases
Human handoff | Does reviewer receive useful context? | Reviewer repeats the model's work
Feedback | Does the resolved case update eval labels? | Future evals learn from noisy or missing labels

This slice evaluates behavior the user actually experiences. It also exposes second-order costs: latency, reviewer load, and feedback quality.
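One way to make part of the slice executable is to record what a single run produced and check each step in path order. This is a sketch under assumptions: `SliceCase`, `PathRecord`, their fields, and the document ids are hypothetical, and only the retrieval, generation, escalation, and handoff rows are covered. The user-question and feedback rows need human judgment or data from after the case is resolved.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class SliceCase:
    question: str
    governing_doc_id: str  # the policy document that should decide the answer
    should_escalate: bool


@dataclass
class PathRecord:
    """What one run of the workflow produced for one question."""
    retrieved_ids: List[str]
    cited_id: Optional[str]
    escalated: bool
    handoff_note: str


Check = Callable[[SliceCase, PathRecord], bool]

# Checks run in path order so the report reads like the user's journey.
STEP_CHECKS: List[Tuple[str, Check]] = [
    ("retrieval", lambda c, r: c.governing_doc_id in r.retrieved_ids),
    ("generation", lambda c, r: r.cited_id == c.governing_doc_id),
    ("escalation", lambda c, r: r.escalated == c.should_escalate),
    # A handoff only matters when the case was escalated.
    ("handoff", lambda c, r: (not r.escalated) or bool(r.handoff_note.strip())),
]


def failed_steps(case: SliceCase, record: PathRecord) -> List[str]:
    """Names of the steps whose check failed, in path order."""
    return [name for name, check in STEP_CHECKS if not check(case, record)]


case = SliceCase("Can a customer get a refund after 45 days?", "refunds-v3", False)
record = PathRecord(
    retrieved_ids=["refunds-v1", "refunds-v3"],
    cited_id="refunds-v1",  # the stale document won
    escalated=False,
    handoff_note="",
)
print(failed_steps(case, record))
```

In this example retrieval passes because the governing document is present, and the failure shows up only at generation, where the stale version is cited. A retrieval-only metric would not see it.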

Emergent failure story

The emergent failure I watch for is reviewer fatigue.

At first, an escalation classifier that is slightly overcautious looks safe. It catches more uncertain cases. Nobody wants the model to guess on sensitive policy questions.

Then the queue grows. Reviewers start skimming. Low-risk escalations crowd out high-risk ones. Feedback quality drops because reviewers leave shorter notes. The next evaluation set inherits weaker labels. The model appears harder to improve even though the original classifier change looked conservative.

This is why "safe" local behavior can become unsafe system behavior.

Practical design

The evaluation design I would ship has four layers.

Layer | Purpose | Cadence
Unit evals | Retriever, generator, classifier, and tool behavior | Every change
Slice evals | Full decision paths with realistic fixtures | Every release
Stress evals | Capacity, latency, conflicting evidence, stale docs | Before major launch
Production monitors | Drift, escalation load, feedback quality, user correction | Ongoing

Each layer answers a different question. Unit evals ask "did the part behave?" Slice evals ask "did the path work?" Stress evals ask "where does it break?" Production monitors ask "what changed after users adapted?"

A small stress test

The stress test I would add first is conflicting evidence under load.

Take twenty support questions where the current policy and an archived policy both look relevant. Run the workflow with normal retrieval depth, then with a higher recall setting. Track four things:

Signal | Why it matters
Correct current-source citation | The system found the governing evidence
Answer latency | Extra context did not break the user experience
Escalation decision | The model knew when conflict required review
Reviewer correction rate | Humans did not have to repair most answers

This test is intentionally small. It is not a benchmark for bragging. It is a pressure test for interaction effects.

If higher recall improves citation coverage but doubles reviewer corrections, the retrieval change is not a clear win. If latency rises but correction rate falls on high-risk cases, the tradeoff may be acceptable. The useful answer depends on the product's operating constraint.
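A sketch of how the four signals could be compared across the two retrieval settings. The `RunResult` fields and the numbers are invented for illustration, and four results per setting stand in for the twenty questions to keep the example short.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RunResult:
    cited_current_source: bool
    latency_s: float
    escalation_correct: bool
    reviewer_corrected: bool


def summarize(results):
    """Aggregate the four stress-test signals for one retrieval setting."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "citation_rate": sum(r.cited_current_source for r in results) / n,
        "median_latency_s": median(r.latency_s for r in results),
        "escalation_accuracy": sum(r.escalation_correct for r in results) / n,
        "correction_rate": sum(r.reviewer_corrected for r in results) / n,
    }


def compare(baseline, candidate):
    """Candidate minus baseline for each signal."""
    b, c = summarize(baseline), summarize(candidate)
    return {key: c[key] - b[key] for key in b}


# Made-up results: normal retrieval depth versus a higher recall setting.
normal_depth = [
    RunResult(True, 1.0, True, False),
    RunResult(False, 2.0, True, False),
    RunResult(True, 3.0, False, True),
    RunResult(False, 4.0, True, False),
]
higher_recall = [
    RunResult(True, 2.0, True, True),
    RunResult(True, 3.0, True, False),
    RunResult(True, 4.0, False, True),
    RunResult(True, 5.0, True, False),
]

print(compare(normal_depth, higher_recall))
```

With these invented numbers the citation rate rises by 0.5, median latency rises by one second, escalation accuracy is unchanged, and the reviewer correction rate doubles from 0.25 to 0.5. The code reports the deltas only. Whether that trade is acceptable is a product decision.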

That is the complexity lens in practice: measure the behavior that emerges when the parts interact, not only the part that changed. The smallest useful test is the one that forces the coupling into view before a user finds it.

What to do differently

Do not stop at component accuracy.

When an AI system touches a real decision, map the path from user request to answer, tool call, human handoff, and feedback. Then test that path as a system.

The question is not whether retrieval, generation, and routing each look good in isolation. The question is what behavior emerges when they interact under load, ambiguity, stale context, and human capacity limits.
