Outcome focus: Improved evaluation strategy by testing full decision paths, interaction effects, feedback loops, and second-order behavior instead of isolated component scores.
Tags: complexity, AI evaluation, systems thinking, product, decision intelligence
The retrieval score improved and the product got worse.
That is the kind of AI evaluation failure complexity science helps explain. A component metric moves in the right direction. The team celebrates. Then the product behavior degrades because the component does not live alone.
In a retrieval-augmented support assistant, better recall can mean more evidence. It can also mean longer context, more conflicting snippets, slower responses, and a generator that hedges instead of deciding. The component improved locally. The system degraded globally.
AI evaluation fails when teams assume independent parts inside a coupled system.
The concrete example
Imagine a support assistant for internal agents. It answers policy questions, cites sources, and recommends escalation when confidence is low.
The team evaluates three components:
| Component | Local metric | Looks good when |
|---|---|---|
| Retriever | Recall@k | More relevant documents appear in top results |
| Generator | Rubric score | Answer is complete, polite, and grounded |
| Escalation classifier | Precision/recall | Uncertain cases route to review |
Those are useful metrics. They are not enough.
The team changes the retriever from top-5 to top-12 because recall improves. In isolation, the move is defensible. In the full workflow, several things change:
- average context length increases,
- conflicting policy versions appear more often,
- answer latency rises,
- generator confidence drops on edge cases,
- escalation volume increases,
- human reviewers start ignoring escalations because too many are low-value.
The measured retrieval metric improved. The system as a whole, including the people operating it, got worse.
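One way to make that gap visible is to record workflow-level signals next to the retrieval metric for every change. The sketch below is an illustration only: `WorkflowSnapshot`, its field names, the 5% tolerance, and the numbers in the example are hypothetical placeholders, not measurements from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowSnapshot:
    """Hypothetical summary of one configuration of the support assistant."""
    recall_at_k: float         # local retriever metric, higher is better
    avg_context_tokens: float  # assumed cost proxy, lower is better
    p95_latency_s: float       # lower is better
    escalation_rate: float     # share of answers sent to review, lower is better


# Assumption of this sketch: these workflow signals are all "lower is better".
LOWER_IS_BETTER = ("avg_context_tokens", "p95_latency_s", "escalation_rate")


def regressions(before, after, tolerance=0.05):
    """Names of workflow signals that got worse by more than `tolerance` (relative)."""
    worse = []
    for name in LOWER_IS_BETTER:
        old, new = getattr(before, name), getattr(after, name)
        if new > old * (1 + tolerance):
            worse.append(name)
    return worse


def local_win_system_loss(before, after):
    """True when recall improved but at least one workflow signal regressed."""
    return after.recall_at_k > before.recall_at_k and bool(regressions(before, after))


# Placeholder numbers, chosen only to show the shape of the check.
top_5 = WorkflowSnapshot(0.78, 2400, 3.1, 0.08)
top_12 = WorkflowSnapshot(0.86, 5600, 4.9, 0.15)

print(regressions(top_5, top_12))
print(local_win_system_loss(top_5, top_12))
```

A check like this does not decide whether the change should ship. It only stops a recall gain from being reported without the workflow costs that came with it.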
What complexity adds
Complexity science is not a decorative metaphor here. It offers a few practical warnings.
First, local optimization can degrade global behavior. Better component scores do not guarantee better product outcomes.
Second, feedback loops change the system after release. A support assistant does not only answer questions. It changes how humans ask, when they escalate, and which examples become future evaluation data.
Third, interaction effects matter. The retriever, generator, tool layer, approval flow, and human reviewer are not independent variables.
Fourth, the system can cross thresholds. A small increase in escalation volume may be harmless until reviewer capacity is saturated. After that, review quality collapses quickly.
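The threshold point can be illustrated with a deliberately simple queue model. It assumes a fixed daily review capacity and a constant daily escalation volume; the function name and all numbers are hypothetical, and real review queues are noisier than this.

```python
def backlog_after(days, daily_escalations, daily_review_capacity):
    """Unreviewed escalations left after `days`, starting from an empty queue."""
    backlog = 0
    for _ in range(days):
        # Reviewers clear up to their capacity each day; the rest carries over.
        backlog = max(0, backlog + daily_escalations - daily_review_capacity)
    return backlog


# Capacity is fixed at 100 reviews per day; only the escalation volume changes.
for load in (90, 100, 110, 120):
    print(load, backlog_after(10, load, 100))
# 90 0
# 100 0
# 110 100
# 120 200
```

Below capacity, extra escalations leave no trace in the backlog. Once volume passes capacity, the same small increase accumulates every day, which is why a load metric that looks flat in testing can turn quickly in production.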
The tradeoff
The tradeoff is evaluation speed versus evaluation realism.
Component tests are fast. They are cheap to run. They give clean signals. I use them.
Full decision-path tests are slower. They require realistic fixtures, tool mocks, latency budgets, and human review assumptions. They are harder to maintain.
The mistake is choosing only one. A practical evaluation stack needs all three:
- component tests for tight engineering feedback,
- slice tests for full user journeys,
- longitudinal monitoring for feedback loops after release.
If I can only afford one for a high-risk AI feature, I choose the slice test. It is less elegant and more useful.
Evaluation artifact: a decision-path slice
Here is the kind of slice I want for the support assistant:
| Step | Evaluation question | Failure signal |
|---|---|---|
| User question | Is the request realistic and ambiguous enough? | Only toy prompts pass |
| Retrieval | Are current and conflicting policies visible? | Missing governing doc or stale doc wins |
| Generation | Does the answer cite the right source and state uncertainty? | Confident answer with weak evidence |
| Escalation | Does uncertainty route correctly? | Too many low-value escalations or missed high-risk cases |
| Human handoff | Does reviewer receive useful context? | Reviewer repeats the model's work |
| Feedback | Does the resolved case update eval labels? | Future evals learn from noisy or missing labels |
This slice evaluates behavior the user actually experiences. It also exposes second-order costs: latency, reviewer load, and feedback quality.
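A slice like this can be expressed as one test case plus a per-step check. The sketch below is a minimal version under stated assumptions: `SliceCase`, `PathTrace`, and every field name are hypothetical, and the trace would come from running the real workflow against fixtures. The first table row, whether the question is realistic, is a fixture-design concern and is not checked in code here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SliceCase:
    """One hypothetical fixture: a question plus what a correct path looks like."""
    question: str
    governing_doc_id: str
    stale_doc_ids: frozenset
    should_escalate: bool


@dataclass(frozen=True)
class PathTrace:
    """What one end-to-end run produced (hypothetical fields)."""
    retrieved_doc_ids: tuple
    cited_doc_id: Optional[str]
    stated_uncertainty: bool
    escalated: bool
    handoff_summary: str
    label_written: bool


def check_slice(case, trace):
    """Return a pass/fail flag for each step of the decision path."""
    conflict_visible = any(d in case.stale_doc_ids for d in trace.retrieved_doc_ids)
    return {
        "retrieval": case.governing_doc_id in trace.retrieved_doc_ids,
        # Assumption: when a stale policy is in context, the answer must flag uncertainty.
        "generation": (trace.cited_doc_id == case.governing_doc_id
                       and (trace.stated_uncertainty or not conflict_visible)),
        "escalation": trace.escalated == case.should_escalate,
        # A handoff only needs context if an escalation actually happened.
        "handoff": (not trace.escalated) or bool(trace.handoff_summary.strip()),
        "feedback": trace.label_written,
    }


case = SliceCase("Can I refund after 45 days?", "policy-2024",
                 frozenset({"policy-2022"}), should_escalate=True)
trace = PathTrace(("policy-2022", "policy-2024"), "policy-2022",
                  stated_uncertainty=False, escalated=False,
                  handoff_summary="", label_written=True)

results = check_slice(case, trace)
failed = [step for step, ok in results.items() if not ok]
print(failed)  # ['generation', 'escalation']
```

In this example retrieval passes on its own terms, since the governing document is present, while the path still fails at generation and escalation. That is the kind of result a retriever-only test cannot produce.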
Emergent failure story
The emergent failure I watch for is reviewer fatigue.
At first, an escalation classifier that is slightly overcautious looks safe. It catches more uncertain cases. Nobody wants the model to guess on sensitive policy questions.
Then the queue grows. Reviewers start skimming. Low-risk escalations crowd out high-risk ones. Feedback quality drops because reviewers leave shorter notes. The next evaluation set inherits weaker labels. The model appears harder to improve even though the original classifier change looked conservative.
This is why "safe" local behavior can become unsafe system behavior.
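The crowding effect can be put into rough numbers. The sketch below assumes reviewers have a fixed capacity and pick cases from the queue without prioritizing by risk, which is a simplification of skimming behavior; the function and the figures are hypothetical.

```python
def expected_high_risk_reviewed(high_risk, low_value, capacity):
    """Expected high-risk cases reviewed when picks are uniform over the queue."""
    total = high_risk + low_value
    if total <= capacity:
        # Everything in the queue gets reviewed.
        return float(high_risk)
    # Mean of a draw without replacement: each case has a capacity/total chance.
    return high_risk * capacity / total


# Ten high-risk cases per day, fifty reviews per day, growing low-value volume.
for low_value in (20, 40, 90, 190):
    print(low_value, expected_high_risk_reviewed(10, low_value, 50))
# 20 10.0
# 40 10.0
# 90 5.0
# 190 2.5
```

The number of high-risk cases never changes in this model. Only the low-value volume grows, and that alone is enough to cut the share of high-risk cases that get a human look.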
Practical design
The evaluation design I would ship has four layers.
| Layer | Purpose | Cadence |
|---|---|---|
| Unit evals | Retriever, generator, classifier, and tool behavior | Every change |
| Slice evals | Full decision paths with realistic fixtures | Every release |
| Stress evals | Capacity, latency, conflicting evidence, stale docs | Before major launch |
| Production monitors | Drift, escalation load, feedback quality, user correction | Ongoing |
Each layer answers a different question. Unit evals ask "did the part behave?" Slice evals ask "did the path work?" Stress evals ask "where does it break?" Production monitors ask "what changed after users adapted?"
A small stress test
The stress test I would add first is conflicting evidence under load.
Take twenty support questions where the current policy and an archived policy both look relevant. Run the workflow with normal retrieval depth, then with a higher recall setting. Track four things:
| Signal | Why it matters |
|---|---|
| Correct current-source citation | The system found the governing evidence |
| Answer latency | Extra context did not break the user experience |
| Escalation decision | The model knew when conflict required review |
| Reviewer correction rate | Humans did not have to repair most answers |
This test is intentionally small. It is not a benchmark for bragging. It is a pressure test for interaction effects.
If higher recall improves citation coverage but doubles reviewer corrections, the retrieval change is not a clear win. If latency rises but correction rate falls on high-risk cases, the tradeoff may be acceptable. The useful answer depends on the product's operating constraint.
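A comparison of the two runs can be as small as the sketch below. `CaseResult` and its fields are hypothetical names for the four signals in the table, and the eight results are placeholder data that stand in for the twenty-question set, not outcomes from a real run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """Outcome of one support question in one retrieval setting (hypothetical)."""
    cited_current_source: bool
    latency_s: float
    escalation_correct: bool
    reviewer_corrected: bool


def rate(flags):
    """Share of True values in a non-empty list of booleans."""
    return sum(flags) / len(flags)


def summarize(results):
    """Collapse per-case results into the four tracked signals."""
    return {
        "citation_rate": rate([r.cited_current_source for r in results]),
        "mean_latency_s": sum(r.latency_s for r in results) / len(results),
        "escalation_accuracy": rate([r.escalation_correct for r in results]),
        "correction_rate": rate([r.reviewer_corrected for r in results]),
    }


def deltas(baseline, candidate):
    """Candidate minus baseline for each signal."""
    return {name: candidate[name] - baseline[name] for name in baseline}


# Placeholder results: four cases per setting instead of twenty.
normal_depth = [
    CaseResult(True, 2.0, True, False),
    CaseResult(True, 2.0, True, False),
    CaseResult(False, 3.0, True, False),
    CaseResult(False, 3.0, False, True),
]
higher_recall = [
    CaseResult(True, 4.0, True, False),
    CaseResult(True, 4.0, True, False),
    CaseResult(True, 5.0, False, True),
    CaseResult(False, 5.0, False, True),
]

change = deltas(summarize(normal_depth), summarize(higher_recall))
print(change)
```

The output is left as raw deltas on purpose. Whether a citation gain justifies a latency or correction cost is a product decision, so the sketch reports the movement and does not encode a verdict.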
That is the complexity lens in practice: measure the behavior that emerges when the parts interact, not only the part that changed. The smallest useful test is the one that forces the coupling into view before a user finds it.
What to do differently
Do not stop at component accuracy.
When an AI system touches a real decision, map the path from user request to answer, tool call, human handoff, and feedback. Then test that path as a system.
The question is not whether retrieval, generation, and routing each look good in isolation. The question is what behavior emerges when they interact under load, ambiguity, stale context, and human capacity limits.