DSPy + RAG Evaluation Ops in Production

Outcome focus: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.

DSPy is useful when it stops being an experiment and becomes part of the release workflow.

The first mistake is treating prompt optimization like a one-time tuning step. A RAG system changes whenever documents change, chunking changes, embeddings change, prompts change, tools change, or the product asks different questions. A good prompt program can become bad without anyone touching the model.

That is why I treat DSPy and RAG evaluation as ops, not notebook work.

The production problem

A RAG system can fail in two different places:

retrieval did not bring the right evidence,
generation did not use the evidence correctly.

If those are scored together, the fix becomes guesswork.

A bad answer might need better chunking, better metadata filters, a revised retriever, a changed prompt program, stricter citation rules, or a different abstention behavior. The evaluation pipeline should tell you which layer failed.
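
A minimal sketch of that attribution, assuming each case has already been scored once for retrieval and once for the answer. The `CaseResult` shape and the `failing_layer` name are hypothetical, not part of DSPy or any other library.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one evaluation case, scored per layer (hypothetical shape)."""
    case_id: str
    retrieval_ok: bool  # the required evidence was retrieved
    answer_ok: bool     # the final answer met the expected behavior


def failing_layer(result: CaseResult) -> str:
    """Name the layer to investigate first for a single case."""
    if result.retrieval_ok and result.answer_ok:
        return "none"
    if not result.retrieval_ok:
        # Without the right evidence, generation cannot be judged fairly,
        # even when the answer happens to look correct.
        return "retrieval"
    # Evidence was present but the answer still failed.
    return "generation"
```

The ordering is a design choice: a retrieval miss is reported first because a correct-looking answer built on the wrong evidence is still a failure waiting to happen.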

Evaluation dimensions

I start with a small table of dimensions instead of one blended score.

Dimension	Question	Example metric
Retrieval recall	Did the right source appear?	gold doc in top-k
Retrieval precision	Were irrelevant chunks avoided?	relevant chunk ratio
Grounding	Did answer use retrieved evidence?	citation support score
Completeness	Did answer address required facts?	rubric pass/fail
Abstention	Did system refuse when evidence was missing?	correct abstention rate
Format	Did answer match product contract?	schema validation
Regression	Did known cases stay fixed?	pinned case pass rate

This is the artifact I want every RAG release to have. If a metric is not tied to a decision, delete it. If a release can regress without a metric catching it, add the metric.
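
The two retrieval rows are the easiest to compute, so here is a rough sketch of them. It assumes retrieved results are an ordered list of source ids and that gold and relevant ids are known per case; the function names are mine, not a library API.

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold sources that appear in the top-k retrieved ids."""
    if not gold:
        # No gold source (for example an abstention case): treat as vacuously met.
        return 1.0
    top_k = set(retrieved[:k])
    return len(gold & top_k) / len(gold)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved ids that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for source in top_k if source in relevant) / len(top_k)
```

Grounding, completeness, and abstention usually need a rubric or a judge, so I would not pretend they reduce to set arithmetic like this.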

Golden set shape

The golden set should include more than happy-path questions.

rag-golden-case.yaml

id: refund-policy-conflict-003
question: "Can a premium customer get a refund after 45 days if the product was never activated?"
expected_behavior:
  answer_must_include:
    - standard refund window
    - premium exception rule
    - activation status condition
  answer_must_not_include:
    - unconditional approval
    - policy older than 2025-10
retrieval:
  required_sources:
    - policy/refunds/current.md
    - policy/accounts/premium-exceptions.md
  forbidden_sources:
    - policy/refunds/archive-2024.md
labels:
  difficulty: conflict_resolution
  risk: medium
  requires_abstention: false

The exact schema can change. The principle should not: the case names retrieval expectations and answer behavior separately.
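
To show what "separately" means in practice, here is a sketch that checks one case in two independent steps. The case is written as a Python dict mirroring part of the YAML above so the example stays self-contained. The `mentions` callable is an assumption: in a real pipeline it would be a rubric or judge, and the substring version below is only a stand-in.

```python
from typing import Callable

# Trimmed copy of the golden case above, as a plain dict.
CASE = {
    "id": "refund-policy-conflict-003",
    "expected_behavior": {
        "answer_must_include": ["standard refund window", "premium exception rule"],
        "answer_must_not_include": ["unconditional approval"],
    },
    "retrieval": {
        "required_sources": ["policy/refunds/current.md"],
        "forbidden_sources": ["policy/refunds/archive-2024.md"],
    },
}


def check_retrieval(case: dict, retrieved_sources: list[str]) -> dict:
    """Compare retrieved source ids against required and forbidden sources."""
    spec = case["retrieval"]
    retrieved = set(retrieved_sources)
    missing = [s for s in spec.get("required_sources", []) if s not in retrieved]
    forbidden = [s for s in spec.get("forbidden_sources", []) if s in retrieved]
    return {"passed": not missing and not forbidden,
            "missing": missing, "forbidden_hit": forbidden}


def check_answer(case: dict, answer: str,
                 mentions: Callable[[str, str], bool]) -> dict:
    """Check answer behavior using a caller-supplied judge."""
    expected = case["expected_behavior"]
    absent = [t for t in expected.get("answer_must_include", [])
              if not mentions(answer, t)]
    unwanted = [t for t in expected.get("answer_must_not_include", [])
                if mentions(answer, t)]
    return {"passed": not absent and not unwanted,
            "absent": absent, "unwanted": unwanted}


def substring_mentions(answer: str, topic: str) -> bool:
    """Placeholder judge: literal, case-insensitive substring match."""
    return topic.lower() in answer.lower()
```

Because the two checks return separate results, a case can fail retrieval while passing the answer check, which is exactly the signal a blended score hides.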

DSPy in the loop

DSPy is strongest when the program structure is explicit. The release flow I like treats every optimized program as a candidate and compares it against pinned cases before promotion. That comparison is what makes DSPy optimization production-safe.

The important move is comparing the candidate against the current released program, not just against a static target. A candidate that improves average score while breaking high-risk pinned cases should not ship.
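
A sketch of that comparison, assuming both programs were already run over the same golden set and reduced to a pass/fail flag per case id. I am deliberately not using any DSPy API here; how those flags get produced is up to your evaluation harness, and `compare_runs` is a hypothetical helper.

```python
def compare_runs(baseline: dict[str, bool],
                 candidate: dict[str, bool]) -> dict[str, list[str]]:
    """Split the cases both runs share into fixed, regressed, and unchanged."""
    shared = sorted(baseline.keys() & candidate.keys())
    return {
        # Failed under the released program, passes under the candidate.
        "fixed": [c for c in shared if not baseline[c] and candidate[c]],
        # Passed under the released program, fails under the candidate.
        "regressed": [c for c in shared if baseline[c] and not candidate[c]],
        # Same outcome in both runs.
        "unchanged": [c for c in shared if baseline[c] == candidate[c]],
    }
```

The `regressed` list is the one to read first. A candidate can fix more cases than it breaks and still be unshippable if one of the broken cases is pinned as high risk.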

Regression thresholds

I would start with thresholds like these:

Gate	Release rule
High-risk pinned cases	100% pass
Overall golden set	Candidate must not drop more than 1 point from baseline
Retrieval required-source recall	No regression on high-risk cases
Unsupported citation	0 allowed on high-risk cases
Format contract	100% valid output schema
Abstention cases	No false confidence on missing-evidence cases

The numbers are not universal. The shape is. Some cases are allowed to trade off. Some are not.
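
Here is one way the table could turn into a gate, covering four of the six rows. It assumes scores are on a 0 to 100 scale so that "1 point" means one unit on that scale, and it assumes each case outcome already carries the flags below. `CaseOutcome` and `gate_failures` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class CaseOutcome:
    """Per-case result for the candidate program (hypothetical shape)."""
    case_id: str
    high_risk: bool
    passed: bool
    schema_valid: bool
    unsupported_citations: int


def gate_failures(outcomes: list[CaseOutcome],
                  baseline_score: float,
                  candidate_score: float,
                  max_drop: float = 1.0) -> list[str]:
    """Return the release rules the candidate violates; empty means it may ship."""
    failures = []
    if any(o.high_risk and not o.passed for o in outcomes):
        failures.append("high-risk pinned case failed")
    if baseline_score - candidate_score > max_drop:
        failures.append("overall score dropped more than allowed")
    if any(o.high_risk and o.unsupported_citations > 0 for o in outcomes):
        failures.append("unsupported citation on high-risk case")
    if any(not o.schema_valid for o in outcomes):
        failures.append("invalid output schema")
    return failures
```

Returning the list of violated rules rather than a single boolean keeps the release discussion about which rule failed, not about whether the gate is too strict.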

Versioning

RAG releases need versioned artifacts:

prompt program version,
model id,
retriever configuration,
embedding model,
chunking strategy,
document snapshot,
golden set version,
evaluation result id.

Without those, "quality got worse" becomes archaeology.

Here is the minimum release record I want:

rag-release-record.json

{
  "release": "support-rag-2026-02-24",
  "programVersion": "dspy-support-answer-v12",
  "model": "gpt-class-prod",
  "retriever": "hybrid-v4",
  "embeddingModel": "text-embedding-current",
  "documentSnapshot": "policy-docs-2026-02-21",
  "goldenSet": "support-rag-golden-v7",
  "passedHighRiskCases": true,
  "overallScoreDelta": 0.7,
  "knownFailures": [
    "low confidence on legacy billing edge case"
  ]
}

This is not heavy process. It is the minimum information needed to debug a regression next month.
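
The payoff shows up when two records are compared. A small sketch, assuming records are loaded as plain dicts with the field names from the JSON above; `changed_fields` is a hypothetical helper.

```python
def changed_fields(previous: dict, current: dict,
                   ignore: tuple[str, ...] = ("release",)) -> dict:
    """Map each differing field to its (previous, current) values."""
    keys = (previous.keys() | current.keys()) - set(ignore)
    return {
        key: (previous.get(key), current.get(key))
        for key in sorted(keys)
        if previous.get(key) != current.get(key)
    }
```

If quality drops between two releases and the only changed field is `documentSnapshot`, the prompt program is probably not the first place to look.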

The tradeoff

The tradeoff is slower prompt iteration in exchange for release confidence.

The fast path is to keep the notebook open, optimize, inspect a few answers, and ship the prompt. That is fine for exploration. It is weak production practice.

The release path adds friction:

golden cases have to be maintained,
labels have to be reviewed,
document snapshots have to be captured,
known failures have to be recorded.

I accept that friction because RAG systems fail quietly. A bad answer can look polished. A missing citation can look authoritative. A stale document can sound current.

Failure examples

The failure examples I want in the eval set are not exotic.

Failure	Case pattern
Stale source wins	Current and archived policy both match query
Missing evidence	User asks a question outside documented policy
Conflicting snippets	Two departments define same term differently
Over-answering	System gives operational instruction without approval
Citation mismatch	Answer is right, cited evidence does not support it
Format drift	Answer breaks the downstream JSON contract

The best regression cases usually come from production misses. Every time support corrects the assistant, the correction should become a candidate golden case.
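
A sketch of that intake step, assuming a correction arrives with the original question, the corrected answer, and a ticket id. The output follows the golden case shape shown earlier, with one addition of mine: a `review` block that marks the draft as unreviewed. Expectations are left empty on purpose, because a human has to decide what the case should require.

```python
import re


def draft_golden_case(question: str, corrected_answer: str, ticket_id: str) -> dict:
    """Turn a production correction into an unreviewed candidate golden case."""
    slug = re.sub(r"[^a-z0-9]+", "-", ticket_id.lower()).strip("-")
    return {
        "id": f"prod-miss-{slug}",
        "question": question,
        # Left empty: a reviewer fills these in from the corrected answer.
        "expected_behavior": {"answer_must_include": [], "answer_must_not_include": []},
        "retrieval": {"required_sources": [], "forbidden_sources": []},
        "labels": {"difficulty": "unlabeled", "risk": "unreviewed",
                   "requires_abstention": None},
        # Hypothetical extra block, not part of the schema above.
        "review": {"status": "candidate", "corrected_answer": corrected_answer},
    }
```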

How I would run the release

The release meeting should be short because the artifacts are already prepared.

I would bring four views:

View	Question it answers
Candidate vs current	Did the new program improve the right cases?
High-risk pinned cases	Did any non-negotiable behavior regress?
Retrieval failures	Are misses caused by source selection, chunking, or filters?
Generation failures	Are misses caused by reasoning, format, abstention, or citation use?

The decision is not "ship the best score." The decision is whether the candidate is safer and more useful than the current release for the product's actual risk profile.

If the candidate improves average score but fails one high-risk refund-policy case, I do not ship it. If it slightly lowers average score but fixes three recurring high-risk misses and keeps routine cases stable, I would consider it. Aggregate metrics should inform the release, not overrule the product risk.

The rejected pattern is prompt roulette: tweak, inspect ten answers, ship. It feels fast until nobody can explain which version introduced the regression. A slower release record is cheaper than reconstructing quality history from chat screenshots and memory.

What to do differently

Do not ask whether DSPy improved the prompt.

Ask what changed, which cases improved, which cases regressed, whether retrieval or generation caused the delta, and whether the product can tolerate the known failures.

That is the difference between prompt optimization and RAG evaluation ops.