Tag: evaluation

4 entries tagged "evaluation" — 4 posts, 0 links.

Posts

Apr 22, 2026 · 8 min · Platform & AI

Building an NPS Classifier You Can Actually Act On

A scikit-learn NPS ordinal classifier with SMOTE, probability calibration, utility-based thresholding, and PSI drift checks. The parts that make it useful to the retention team, not just accurate on a dashboard.

Outcome: Shipped a calibrated multiclass NPS model with a utility-driven operating threshold and a PSI-based drift loop, giving the retention team a per-customer detractor probability they can act on and a rule for when to retrain.

Feb 24, 2026 · 6 min · Platform & AI

DSPy + RAG Evaluation Ops in Production

How to turn DSPy and RAG evaluation into a production release loop with golden sets, retrieval checks, generation rubrics, regression thresholds, and versioned prompt programs.

Outcome: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.

Feb 10, 2026 · 6 min · Platform & AI

Evaluating Multi-Agent Workflows for Enterprise Reliability

A practical evaluation loop for multi-agent workflows that catches, before release, the failures a polished demo tends to hide: task handoff, tool use, permissions, latency, and completion criteria.

Outcome: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.

Jan 15, 2026 · 11 min · Platform & AI

When 0.3 Does Not Mean 30 Percent

How imbalanced classifiers can keep a strong AUC while producing probabilities that break thresholds, alerts, and cost-sensitive decisions in production.

Outcome: Defined a production calibration gate that logs Brier score, ECE, reliability diagrams, cost-sensitive thresholds, run metadata, and promotion criteria for imbalanced classifiers.
