Tag: evaluation
4 entries tagged "evaluation": 4 posts, 0 links.
Posts
A scikit-learn NPS ordinal classifier with SMOTE, probability calibration, utility-based thresholding, and PSI drift checks. The focus is on the parts that make the model useful to the retention team, not just accurate on a dashboard.
Outcome: Shipped a calibrated multiclass NPS model with a utility-driven operating threshold and a PSI-based drift loop, giving the retention team a per-customer detractor probability they can act on and a rule for when to retrain.
How to turn DSPy and RAG evaluation into a production release loop with golden sets, retrieval checks, generation rubrics, regression thresholds, and versioned prompt programs.
Outcome: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.
A practical evaluation loop for multi-agent workflows that catches demo-friendly failures in task handoff, tool use, permissions, latency, and completion criteria before release.
Outcome: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.
How imbalanced classifiers can keep a strong AUC while producing probabilities that break thresholds, alerts, and cost-sensitive decisions in production.
Outcome: Defined a production calibration gate that logs Brier score, ECE, reliability diagrams, cost-sensitive thresholds, run metadata, and promotion criteria for imbalanced classifiers.