DSPy + RAG Evaluation Ops in Production

How to turn DSPy and RAG evaluation into a production release loop with golden sets, retrieval checks, generation rubrics, regression thresholds, and versioned prompt programs.

By Jovani Pink | February 24, 2026 | 6 min | Platform & AI Engineering

Outcome focus: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.

DSPy is useful when it stops being an experiment and becomes part of the release workflow.

The first mistake is treating prompt optimization like a one-time tuning step. A RAG system changes whenever documents change, chunking changes, embeddings change, prompts change, tools change, or the product asks different questions. A good prompt program can become bad without anyone touching the model.

That is why I treat DSPy and RAG evaluation as ops, not notebook work.

The production problem

A RAG system can fail in two different places:

  1. retrieval did not bring the right evidence,
  2. generation did not use the evidence correctly.

If those are scored together, the fix becomes guesswork.

A bad answer might need better chunking, better metadata filters, a revised retriever, a changed prompt program, stricter citation rules, or a different abstention behavior. The evaluation pipeline should tell you which layer failed.
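As a minimal sketch of that layer attribution, assume each evaluated case already carries two independent verdicts, one for retrieval and one for generation. The `CaseResult` type, the `failing_layer` helper, and the second case id are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    retrieval_ok: bool   # required evidence appeared in the retrieved set
    generation_ok: bool  # answer passed its rubric


def failing_layer(result: CaseResult) -> str:
    """Name the layer to investigate first for one evaluated case."""
    if not result.retrieval_ok:
        # Without the right evidence, a generation verdict is not trustworthy,
        # even when the answer happens to look correct.
        return "retrieval"
    if not result.generation_ok:
        return "generation"
    return "none"


results = [
    CaseResult("refund-policy-conflict-003", retrieval_ok=True, generation_ok=False),
    CaseResult("hypothetical-case-002", retrieval_ok=False, generation_ok=False),
]
by_layer = {r.case_id: failing_layer(r) for r in results}
```

Checking retrieval first is a design choice in this sketch: a retrieval miss is reported even when the answer looks right, because a correct answer built on the wrong evidence is unlikely to stay correct.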

Evaluation dimensions

I start with a small table of dimensions instead of one blended score.

| Dimension | Question | Example metric |
| --- | --- | --- |
| Retrieval recall | Did the right source appear? | gold doc in top-k |
| Retrieval precision | Were irrelevant chunks avoided? | relevant chunk ratio |
| Grounding | Did answer use retrieved evidence? | citation support score |
| Completeness | Did answer address required facts? | rubric pass/fail |
| Abstention | Did system refuse when evidence was missing? | correct abstention rate |
| Format | Did answer match product contract? | schema validation |
| Regression | Did known cases stay fixed? | pinned case pass rate |

This is the artifact I want every RAG release to have. If a metric is not tied to a decision, delete it. If a release can regress without a metric catching it, add the metric.
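The two retrieval rows can be computed with very little code. The sketch below assumes retrieval results are an ordered list of source paths and that relevance labels are a set of paths; the function names are illustrative and not part of DSPy or any other library.

```python
def required_source_recall(retrieved: list[str], required: set[str], k: int) -> float:
    """Fraction of required sources that appear in the top-k retrieved sources."""
    if not required:
        return 1.0  # nothing was required, so nothing can be missed
    return len(required & set(retrieved[:k])) / len(required)


def relevant_chunk_ratio(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved sources that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(source in relevant for source in top_k) / len(top_k)


# Ranked retrieval output for one question (illustrative ordering).
retrieved = [
    "policy/refunds/current.md",
    "policy/refunds/archive-2024.md",
    "policy/accounts/premium-exceptions.md",
]
gold = {"policy/refunds/current.md", "policy/accounts/premium-exceptions.md"}

recall_at_2 = required_source_recall(retrieved, gold, k=2)  # one of two gold docs
recall_at_3 = required_source_recall(retrieved, gold, k=3)  # both gold docs
precision_at_3 = relevant_chunk_ratio(retrieved, gold, k=3)  # two of three relevant
```

Keeping the two numbers separate matters here: at k=3 recall is perfect while the archived policy still occupies a slot, which only the precision figure reveals.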

Golden set shape

The golden set should include more than happy-path questions.

rag-golden-case.yaml
id: refund-policy-conflict-003
question: "Can a premium customer get a refund after 45 days if the product was never activated?"
expected_behavior:
  answer_must_include:
    - standard refund window
    - premium exception rule
    - activation status condition
  answer_must_not_include:
    - unconditional approval
    - policy older than 2025-10
retrieval:
  required_sources:
    - policy/refunds/current.md
    - policy/accounts/premium-exceptions.md
  forbidden_sources:
    - policy/refunds/archive-2024.md
labels:
  difficulty: conflict_resolution
  risk: medium
  requires_abstention: false

The exact schema can change. The principle should not: the case names retrieval expectations and answer behavior separately.
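One way to honor that separation is to check the two halves of a case with two different functions. The sketch below assumes the case has already been loaded into a dict with the same keys as the YAML above. The `mentions` callable is a placeholder for whatever rubric judge is actually used; the substring version shown is a deliberately naive stand-in, since entries such as "premium exception rule" are concepts rather than literal strings.

```python
from typing import Callable

# The golden case above as a dict (only the fields the checks need).
CASE = {
    "id": "refund-policy-conflict-003",
    "expected_behavior": {
        "answer_must_include": [
            "standard refund window",
            "premium exception rule",
            "activation status condition",
        ],
        "answer_must_not_include": ["unconditional approval", "policy older than 2025-10"],
    },
    "retrieval": {
        "required_sources": [
            "policy/refunds/current.md",
            "policy/accounts/premium-exceptions.md",
        ],
        "forbidden_sources": ["policy/refunds/archive-2024.md"],
    },
}


def check_retrieval(case: dict, retrieved_sources: list[str]) -> dict:
    """Compare retrieved sources against the case's retrieval expectations."""
    spec, got = case["retrieval"], set(retrieved_sources)
    return {
        "missing_required": sorted(set(spec["required_sources"]) - got),
        "forbidden_present": sorted(set(spec["forbidden_sources"]) & got),
    }


def check_answer(case: dict, answer: str, mentions: Callable[[str, str], bool]) -> dict:
    """Compare the answer against the case's expected behavior."""
    spec = case["expected_behavior"]
    return {
        "missing_facts": [c for c in spec["answer_must_include"] if not mentions(answer, c)],
        "forbidden_claims": [c for c in spec["answer_must_not_include"] if mentions(answer, c)],
    }


def naive_mentions(answer: str, concept: str) -> bool:
    # Assumption: literal substring match; a real rubric needs a stronger judge.
    return concept.lower() in answer.lower()


retrieval_report = check_retrieval(
    CASE, ["policy/refunds/current.md", "policy/refunds/archive-2024.md"]
)
answer_report = check_answer(CASE, "The standard refund window applies.", naive_mentions)
```

Each report lists what is missing or forbidden rather than returning a single boolean, so a failed case already says which expectation to look at.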

DSPy in the loop

DSPy is strongest when the program structure is explicit. The release flow I like is:

[Figure: release flow. DSPy optimization becomes production-safe when candidates are compared against pinned cases before promotion.]

The important move is comparing the candidate against the current released program, not just against a static target. A candidate that improves average score while breaking high-risk pinned cases should not ship.
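A small sketch of that comparison, assuming each program has already been scored per case on the same golden set and that a score at or above `pass_mark` counts as a pass. The function name, the `hypothetical-faq-*` case ids, and the scores are made up for illustration.

```python
def compare_programs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    pinned: set[str],
    pass_mark: float = 1.0,
) -> dict:
    """Compare a candidate program with the released one, case by case."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no shared cases to compare")
    mean_delta = sum(candidate[c] - baseline[c] for c in shared) / len(shared)
    # Pinned cases that passed on the released program but fail on the candidate.
    broken = [
        c for c in shared
        if c in pinned and baseline[c] >= pass_mark and candidate[c] < pass_mark
    ]
    return {"mean_delta": mean_delta, "broken_pinned": broken, "blocked": bool(broken)}


baseline_scores = {
    "refund-policy-conflict-003": 1.0,
    "hypothetical-faq-001": 0.0,
    "hypothetical-faq-002": 0.0,
}
candidate_scores = {
    "refund-policy-conflict-003": 0.0,
    "hypothetical-faq-001": 1.0,
    "hypothetical-faq-002": 1.0,
}
report = compare_programs(
    baseline_scores, candidate_scores, pinned={"refund-policy-conflict-003"}
)
```

In this example the candidate has the higher average and is still blocked, because the one pinned case went from pass to fail.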

Regression thresholds

I would start with thresholds like these:

| Gate | Release rule |
| --- | --- |
| High-risk pinned cases | 100% pass |
| Overall golden set | Candidate must not drop more than 1 point from baseline |
| Retrieval required-source recall | No regression on high-risk cases |
| Unsupported citation | 0 allowed on high-risk cases |
| Format contract | 100% valid output schema |
| Abstention cases | No false confidence on missing-evidence cases |

The numbers are not universal. The shape is. Some cases are allowed to trade off. Some are not.
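The gates above translate directly into a function that returns the list of failed gates. This is a sketch under assumptions: the `EvalSummary` fields are hypothetical, the overall score is assumed to be on a 0 to 100 scale so that "1 point" has a meaning, and the example numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    overall_score: float                  # 0-100 scale (assumption)
    high_risk_pass_rate: float            # 0.0-1.0 over high-risk pinned cases
    high_risk_source_recall: float        # 0.0-1.0 required-source recall
    high_risk_unsupported_citations: int  # count on high-risk cases
    schema_valid_rate: float              # 0.0-1.0 over all outputs
    false_confident_abstentions: int      # answered despite missing evidence


def failed_gates(
    baseline: EvalSummary, candidate: EvalSummary, max_score_drop: float = 1.0
) -> list[str]:
    """Return the names of the release gates the candidate fails."""
    failures = []
    if candidate.high_risk_pass_rate < 1.0:
        failures.append("high-risk pinned cases")
    if candidate.overall_score < baseline.overall_score - max_score_drop:
        failures.append("overall golden set")
    if candidate.high_risk_source_recall < baseline.high_risk_source_recall:
        failures.append("retrieval required-source recall")
    if candidate.high_risk_unsupported_citations > 0:
        failures.append("unsupported citation")
    if candidate.schema_valid_rate < 1.0:
        failures.append("format contract")
    if candidate.false_confident_abstentions > 0:
        failures.append("abstention cases")
    return failures


released = EvalSummary(82.0, 1.0, 1.0, 0, 1.0, 0)
candidate = EvalSummary(81.5, 1.0, 1.0, 1, 1.0, 0)
blocking = failed_gates(released, candidate)
```

Here the half-point drop stays inside the allowed trade-off, while the single unsupported citation on a high-risk case blocks the release on its own.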

Versioning

RAG releases need versioned artifacts:

  • prompt program version,
  • model id,
  • retriever configuration,
  • embedding model,
  • chunking strategy,
  • document snapshot,
  • golden set version,
  • evaluation result id.

Without those, "quality got worse" becomes archaeology.

Here is the minimum release record I want:

rag-release-record.json
{
  "release": "support-rag-2026-02-24",
  "programVersion": "dspy-support-answer-v12",
  "model": "gpt-class-prod",
  "retriever": "hybrid-v4",
  "embeddingModel": "text-embedding-current",
  "documentSnapshot": "policy-docs-2026-02-21",
  "goldenSet": "support-rag-golden-v7",
  "passedHighRiskCases": true,
  "overallScoreDelta": 0.7,
  "knownFailures": [
    "low confidence on legacy billing edge case"
  ]
}

This is not heavy process. It is the minimum information needed to debug a regression next month.
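To show how such a record pays off during debugging, here is a sketch that checks a record for missing fields and diffs two records. The field names come from the JSON above; the helper names and every value in the previous record are hypothetical.

```python
# Fields from the release record that identify what was shipped.
VERSIONED_FIELDS = (
    "programVersion",
    "model",
    "retriever",
    "embeddingModel",
    "documentSnapshot",
    "goldenSet",
)


def missing_fields(record: dict) -> list[str]:
    """Versioned fields that are absent or empty in a release record."""
    return [field for field in VERSIONED_FIELDS if not record.get(field)]


def changed_fields(previous: dict, current: dict) -> dict[str, tuple]:
    """Map each versioned field that differs to its (previous, current) values."""
    return {
        field: (previous.get(field), current.get(field))
        for field in VERSIONED_FIELDS
        if previous.get(field) != current.get(field)
    }


current = {
    "release": "support-rag-2026-02-24",
    "programVersion": "dspy-support-answer-v12",
    "model": "gpt-class-prod",
    "retriever": "hybrid-v4",
    "embeddingModel": "text-embedding-current",
    "documentSnapshot": "policy-docs-2026-02-21",
    "goldenSet": "support-rag-golden-v7",
}
# Hypothetical earlier release, invented for the comparison.
previous = dict(
    current,
    release="hypothetical-earlier-release",
    programVersion="dspy-support-answer-v11",
    documentSnapshot="hypothetical-older-snapshot",
)

delta = changed_fields(previous, current)
```

When a regression shows up, the diff narrows the search to the artifacts that actually changed between the two releases, in this example the prompt program and the document snapshot.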

The tradeoff

The tradeoff is slower prompt iteration in exchange for release confidence.

The fast path is to keep the notebook open, optimize, inspect a few answers, and ship the prompt. That is fine for exploration. It is weak production practice.

The release path adds friction:

  • golden cases have to be maintained,
  • labels have to be reviewed,
  • document snapshots have to be captured,
  • known failures have to be recorded.

I accept that friction because RAG systems fail quietly. A bad answer can look polished. A missing citation can look authoritative. A stale document can sound current.

Failure examples

The failure examples I want in the eval set are not exotic.

| Failure | Case pattern |
| --- | --- |
| Stale source wins | Current and archived policy both match query |
| Missing evidence | User asks a question outside documented policy |
| Conflicting snippets | Two departments define same term differently |
| Over-answering | System gives operational instruction without approval |
| Citation mismatch | Answer is right, cited evidence does not support it |
| Format drift | Answer breaks the downstream JSON contract |

The best regression cases usually come from production misses. Every time support corrects the assistant, the correction should become a candidate golden case.
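A sketch of that intake step, assuming a support correction arrives with the original question, the points the corrected answer had to make, and the sources support relied on. The output follows the golden case schema shown earlier; the `status` field, the function name, and the example inputs are hypothetical additions.

```python
def draft_golden_case(
    case_id: str,
    question: str,
    corrected_points: list[str],
    sources_used: list[str],
    risk: str = "medium",
) -> dict:
    """Turn one support correction into a draft golden case for human review."""
    return {
        "id": case_id,
        "question": question,
        "expected_behavior": {
            "answer_must_include": list(corrected_points),
            "answer_must_not_include": [],  # filled in by the reviewer
        },
        "retrieval": {
            "required_sources": list(sources_used),
            "forbidden_sources": [],  # filled in by the reviewer
        },
        "labels": {"risk": risk, "requires_abstention": False},
        # Hypothetical field: drafts stay out of release gates until reviewed.
        "status": "candidate",
    }


draft = draft_golden_case(
    case_id="hypothetical-correction-001",
    question="Can a premium customer get a refund after 45 days if the product was never activated?",
    corrected_points=["premium exception rule"],
    sources_used=["policy/accounts/premium-exceptions.md"],
)
```

The draft is marked as a candidate on purpose: labels still have to be reviewed before the case is allowed to block a release.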

How I would run the release

The release meeting should be short because the artifacts are already prepared.

I would bring four views:

| View | Question it answers |
| --- | --- |
| Candidate vs current | Did the new program improve the right cases? |
| High-risk pinned cases | Did any non-negotiable behavior regress? |
| Retrieval failures | Are misses caused by source selection, chunking, or filters? |
| Generation failures | Are misses caused by reasoning, format, abstention, or citation use? |
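The retrieval and generation views are simple tallies once each failed case has been labeled with a layer and a cause. The sketch below assumes those labels already exist, for example from manual triage; the case ids and counts are invented.

```python
from collections import Counter, defaultdict


def failure_views(failures: list[tuple[str, str, str]]) -> dict[str, Counter]:
    """Group (case_id, layer, cause) records into per-layer cause counts."""
    views: dict[str, Counter] = defaultdict(Counter)
    for _case_id, layer, cause in failures:
        views[layer][cause] += 1
    return dict(views)


failures = [
    ("case-a", "retrieval", "source selection"),
    ("case-b", "retrieval", "chunking"),
    ("case-c", "generation", "citation use"),
    ("case-d", "retrieval", "chunking"),
]
views = failure_views(failures)
top_retrieval_cause = views["retrieval"].most_common(1)
```

A tally like this keeps the meeting focused on the dominant cause in each layer rather than on individual answers.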

The decision is not "ship the best score." The decision is whether the candidate is safer and more useful than the current release for the product's actual risk profile.

If the candidate improves average score but fails one high-risk refund-policy case, I do not ship it. If it slightly lowers average score but fixes three recurring high-risk misses and keeps routine cases stable, I would consider it. Aggregate metrics should inform the release, not overrule the product risk.

The rejected pattern is prompt roulette: tweak, inspect ten answers, ship. It feels fast until nobody can explain which version introduced the regression. A slower release record is cheaper than reconstructing quality history from chat screenshots and memory.

What to do differently

Do not ask whether DSPy improved the prompt.

Ask what changed, which cases improved, which cases regressed, whether retrieval or generation caused the delta, and whether the product can tolerate the known failures.

That is the difference between prompt optimization and RAG evaluation ops.
