Tag: evaluation

5 entries tagged "evaluation" — 5 posts, 0 links.

Posts

Jun 16, 2026 — 7 min — Platform & AI

Python Architecture for AI and Data Systems in 2026

A Python architecture map for AI, data, and backend teams that need notebooks, prompts, evaluations, services, repositories, and infrastructure to stop collapsing into one folder.

Outcome: Defined a production Python layout for AI and data systems that separates experimentation, evaluation, domain logic, infrastructure adapters, and deployable service code.

python ai engineering data engineering software architecture evaluation

Apr 22, 2026 — 8 min — Platform & AI

Building an NPS Classifier You Can Actually Act On

A scikit-learn NPS ordinal classifier with SMOTE, probability calibration, utility-based thresholding, and PSI drift checks, focused on the parts that make it useful to the retention team rather than just accurate on a dashboard.

Outcome: Shipped a calibrated multiclass NPS model with a utility-driven operating threshold and a PSI-based drift loop, giving the retention team a per-customer detractor probability they can act on and a rule for when to retrain.

ml nps classification calibration drift evaluation
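
For readers unfamiliar with the PSI drift check named in this entry, here is a minimal sketch of a Population Stability Index calculation. It is not taken from the post: the function name `psi`, the ten equal-width bins, the `eps` floor for empty bins, and the assumption that scores lie in [0, 1] are all illustrative choices.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a current sample.

    Assumes every score lies in [0, 1] and uses equal-width bins
    (both are illustrative assumptions, not the post's method).
    """

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # A score of exactly 1.0 is clamped into the last bin.
            idx = min(int(v * n_bins), n_bins - 1)
            counts[idx] += 1
        total = len(values)
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / total, eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]          # hypothetical training scores
shifted = [min(v + 0.3, 0.99) for v in baseline]  # hypothetical drifted scores

print(psi(baseline, baseline))  # identical samples give no drift
print(psi(baseline, shifted))   # a shifted sample gives a large PSI
```

A common rule of thumb treats a PSI above roughly 0.25 as a significant shift. The retraining rule the post actually uses may differ.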

Feb 24, 2026 — 6 min — Platform & AI

DSPy + RAG Evaluation Ops in Production

How to turn DSPy and RAG evaluation into a production release loop with golden sets, retrieval checks, generation rubrics, regression thresholds, and versioned prompt programs.

Outcome: Promoted the note into an essay by defining a repeatable RAG evaluation workflow that separates retrieval quality from generation quality and blocks prompt-program regressions before release.

dspy rag evaluation mlops agents

Feb 10, 2026 — 6 min — Platform & AI

Evaluating Multi-Agent Workflows for Enterprise Reliability

A practical evaluation loop for multi-agent workflows that catches demo-friendly failures in task handoff, tool use, permissions, latency, and completion criteria before release.

Outcome: Established a repeatable evaluation workflow that gates multi-agent releases on task completion, handoff quality, tool correctness, latency, and recoverability instead of demo impressions.

agents evaluation reliability enterprise ai observability

Jan 15, 2026 — 11 min — Platform & AI

When 0.3 Does Not Mean 30 Percent

How imbalanced classifiers can keep a strong AUC while producing probabilities that break thresholds, alerts, and cost-sensitive decisions in production.

Outcome: Defined a production calibration gate that logs Brier score, ECE, reliability diagrams, cost-sensitive thresholds, run metadata, and promotion criteria for imbalanced classifiers.

ml calibration classification evaluation probability reliability
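
The title's claim can be illustrated with a small sketch of two of the metrics named in the outcome, the Brier score and expected calibration error (ECE). This is not the post's code: the function names, the ten equal-width bins, and the toy data are assumptions made for illustration.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between mean predicted probability and observed
    positive rate, computed over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((y, p))

    n = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_prob = sum(p for _, p in members) / len(members)
        pos_rate = sum(y for y, _ in members) / len(members)
        ece += (len(members) / n) * abs(pos_rate - mean_prob)
    return ece


# Toy data: ten cases all scored 0.3.
scores = [0.3] * 10
miscalibrated = [1] + [0] * 9     # only 10% are actually positive
calibrated = [1, 1, 1] + [0] * 7  # 30% positive, matching the score

print(expected_calibration_error(miscalibrated, scores))  # about 0.2
print(expected_calibration_error(calibrated, scores))     # about 0.0
print(brier_score(miscalibrated, scores))                 # about 0.13
```

In the miscalibrated case a score of 0.3 corresponds to a 10% positive rate, so a threshold or cost rule that reads 0.3 as "30 percent" would be off by a factor of three. A ranking metric such as AUC does not measure that gap.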
