Outcome focus: Shipped a calibrated multiclass NPS model with a utility-driven operating threshold and a PSI-based drift loop, giving the retention team a per-customer detractor probability they can act on and a rule for when to retrain.
mlnpsclassificationcalibrationdriftevaluation
Teams ship NPS classifiers and then something uncomfortable happens. The model reports 72 percent accuracy, someone on retention asks what to do with it, and no one can answer. The metric is honest but it is tied to nothing the business can action. The probabilities the model emits are not trustworthy enough to threshold either, so no one wants to put a customer-facing workflow behind them.
The code below is what I land on when I want an NPS multiclass model a retention team can use. It handles the three things that usually go wrong: class imbalance, uncalibrated probabilities, and no operating point tied to business value. It also leaves hooks for drift detection, because NPS feedback shifts with product changes, incidents, and seasonality.
The problem, stated precisely#
NPS is ordinal with three buckets (detractor, passive, promoter) and the distribution is heavily imbalanced. In a typical SaaS panel I see something like 15 / 55 / 30, which means a classifier that predicts "passive" for everyone scores 55 percent accuracy and is operationally useless.
Three things have to be true for the output to be useful:
- The classifier cannot ignore detractors. That means class-weighted training plus some form of minority resampling during fit.
- The probabilities have to be calibrated. If the model says
P(detractor) = 0.7, roughly seven of ten of those should in fact be detractors. - The operating threshold has to come from expected business value, not from default 0.5. An unactioned detractor has a cost. A false-positive retention gesture has a different, usually smaller, cost.
The code#
# requirements: scikit-learn, imbalanced-learn, lightgbm, joblib, pandas, numpy, pydantic
import json
import joblib
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import (
f1_score,
average_precision_score,
precision_recall_curve,
classification_report,
)
from imblearn.pipeline import make_pipeline as imb_pipeline
from imblearn.over_sampling import SMOTE
from pydantic import BaseModel
# 0. Load data. Replace with your parquet or SQL export.
def load_data():
rng = np.random.RandomState(0)
n = 2000
X = rng.normal(size=(n, 20))
# 0 = detractor, 1 = passive, 2 = promoter
y = rng.choice([0, 1, 2], size=n, p=[0.15, 0.55, 0.30])
dfX = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[1])])
dfX["y"] = y
return dfX
df = load_data()
X, y = df.drop(columns=["y"]), df["y"]
X_train, X_hold, y_train, y_hold = train_test_split(
X, y, stratify=y, test_size=0.2, random_state=42
)
# 1. Imbalance-aware pipeline: scaler, SMOTE (fit-time only), forest.
smote = SMOTE(random_state=42, sampling_strategy="auto")
base_clf = RandomForestClassifier(
n_estimators=200, class_weight="balanced", random_state=42
)
base_pipeline = imb_pipeline(StandardScaler(), smote, base_clf)
# 2. Calibrate the entire pipeline via 5-fold CV. Each fold refits the
# pipeline on the training split, so SMOTE runs inside training only
# and calibration sees genuine held-out predictions.
model = CalibratedClassifierCV(estimator=base_pipeline, method="isotonic", cv=5)
model.fit(X_train, y_train)
probs = model.predict_proba(X_hold)
preds = np.argmax(probs, axis=1)
# 3. Optional: LightGBM with a focal-style sample weight for harder examples.
try:
import lightgbm as lgb
lgbm = lgb.LGBMClassifier(
objective="multiclass",
num_class=3,
n_estimators=400,
class_weight="balanced",
).fit(X_train, y_train)
p_train = lgbm.predict_proba(X_train)
p_true = p_train[np.arange(len(y_train)), y_train]
gamma, alpha = 2.0, 0.75
focal_w = alpha * ((1 - p_true) ** gamma) + 1.0
lgbm_focal = lgb.LGBMClassifier(
objective="multiclass", num_class=3, n_estimators=400
).fit(X_train, y_train, sample_weight=focal_w)
except ImportError:
lgbm = lgbm_focal = None
# 4. Metrics. Macro F1 treats classes fairly. Per-class PR-AUC is
# the right rarity-sensitive metric for the detractor head.
f1_macro = f1_score(y_hold, preds, average="macro")
Yb = label_binarize(y_hold, classes=[0, 1, 2])
ap_per_class = [
average_precision_score(Yb[:, c], probs[:, c]) for c in range(probs.shape[1])
]
pr_auc_macro = float(np.mean(ap_per_class))
pr_curves = {}
for c in range(probs.shape[1]):
P, R, T = precision_recall_curve(Yb[:, c], probs[:, c])
pr_curves[f"class_{c}"] = {
"precision": P.tolist(),
"recall": R.tolist(),
"thresholds": T.tolist(),
}
report = classification_report(
y_hold,
preds,
target_names=["detractor", "passive", "promoter"],
output_dict=True,
)
# 5. Pick the detractor threshold that maximizes expected utility.
# benefit_tp and cost_fp come from the retention team, not the ML team.
def choose_threshold_by_utility(
p_det, y_true, thresholds=np.linspace(0, 1, 101), benefit_tp=150, cost_fp=10
):
best = {"threshold": None, "utility": -np.inf}
for t in thresholds:
act = (p_det >= t).astype(int)
tp = int(((act == 1) & (y_true == 0)).sum())
fp = int(((act == 1) & (y_true != 0)).sum())
util = tp * benefit_tp - fp * cost_fp
if util > best["utility"]:
best = {
"threshold": float(t),
"utility": int(util),
"tp": tp,
"fp": fp,
}
return best
best_thresh = choose_threshold_by_utility(probs[:, 0], y_hold)
# 6. Drift signals: PSI on top features, plus predicted-probability summary.
def psi(expected, actual, buckets=10):
expected, actual = np.asarray(expected), np.asarray(actual)
breaks = np.nanpercentile(expected, np.linspace(0, 100, buckets + 1))
eps = 1e-8
total = 0.0
for i in range(buckets):
lb, ub = breaks[i], breaks[i + 1]
e = max(((expected >= lb) & (expected <= ub)).mean(), eps)
a = max(((actual >= lb) & (actual <= ub)).mean(), eps)
total += (e - a) * np.log(e / a)
return total
psi_report = {c: psi(X_train[c].values, X_hold[c].values) for c in X_train.columns[:5]}
pred_proba_stats = (
pd.DataFrame(probs, columns=["p_detractor", "p_passive", "p_promoter"])
.describe()
.to_dict()
)
# 7. Save artifacts for CI gating and reproducible promotion.
class FeatureMeta(BaseModel):
name: str
dtype: str
mean: float | None = None
std: float | None = None
version: str
metrics = {
"model_version": "v1",
"model_type": "calibrated_random_forest_pipeline",
"f1_macro": float(f1_macro),
"pr_auc_macro": pr_auc_macro,
}
joblib.dump(model, "model_pipeline_v1.joblib")
with open("eval_report_v1.json", "w") as f:
json.dump(
{
"metrics": metrics,
"report": report,
"pr_curves": pr_curves,
"psi": psi_report,
"pred_proba_stats": pred_proba_stats,
},
f,
)
with open("features_v1.json", "w") as f:
json.dump(
[
FeatureMeta(
name=c,
dtype=str(X_train[c].dtype),
mean=float(X_train[c].mean()),
std=float(X_train[c].std()),
version="v1",
).model_dump()
for c in X_train.columns
],
f,
)
print("f1_macro", f1_macro)
print("pr_auc_macro", pr_auc_macro)
print("best_threshold_for_detractor_action", best_thresh)Walk through#
Imbalance-aware training#
imbalanced-learn's Pipeline runs SMOTE inside .fit() only. Inference never touches the oversampler, so the prediction distribution is not contaminated. Combined with class_weight="balanced" on the forest, the detractor class stops getting ignored by the loss.
I default to a random forest because it is fast, behaves well on tabular features, and gives stable results without much tuning. The LightGBM block with focal-style sample weights is the upgrade I reach for when the forest plateaus. It is a fifteen-line patch and usually moves detractor recall by a few points. The "focal weight" is a heuristic inspired by focal loss: boost the loss on hard or rare examples by multiplying by (1 - p_true) ** gamma.
Calibrated probabilities#
CalibratedClassifierCV wraps a base classifier and learns a monotonic mapping (isotonic regression here) from its raw scores to empirically calibrated probabilities. You need this if you plan to threshold the output. An uncalibrated random forest will routinely emit 0.9-confidence predictions with the actual hit rate closer to 0.7.
One wiring note that bit me the first time I did this. The common snippet on Stack Overflow is:
calibrated = CalibratedClassifierCV(estimator=clf, method="isotonic", cv="prefit")
calibrated.fit(X_train, y_train)That pattern only works if clf has already been fit on a separate fold. Otherwise you are calibrating on training noise. The cleanest pattern is cv=5 on an unfit estimator, which is what the code above does. The CalibratedClassifierCV wrapper also accepts a Pipeline, so you can calibrate the whole SMOTE-aware path in one call and keep the imbalance handling correct inside every CV fold.
Utility-based thresholding#
This is the short block that changes what the whole model means.
best_thresh = choose_threshold_by_utility(
probs[:, 0], y_hold, benefit_tp=150, cost_fp=10
)For every candidate threshold, score true positives at benefit_tp (what a caught churn-likely detractor is worth) and false positives at cost_fp (the cost of reaching out to someone who was not actually a detractor). The threshold that maximizes expected utility is your operating point. It is almost never 0.5. On the NPS panels I have seen, the detractor threshold usually lands between 0.22 and 0.35.
The benefit_tp and cost_fp numbers have to come from the retention team, not the ML team. A retention lead can usually quote the value of catching a churning customer and the rough cost of a false-positive outreach within an hour. That conversation is the most important part of the project and it has nothing to do with the model.
Drift signals#
Population Stability Index is the simplest signal that is almost always worth computing. Bucket the reference distribution into deciles, check how much mass the new data puts into each decile, and sum a weighted log ratio.
| PSI | Interpretation |
|---|---|
| < 0.10 | Noise. Do not act. |
| 0.10–0.25 | Investigate. Feature worth a look. |
| > 0.25 | Meaningful shift. Trigger retraining review. |
Prediction-probability summary stats over rolling windows catch a different failure mode. When the average P(detractor) quietly creeps up over three weeks, something in the input distribution or the world has changed and the operating threshold may no longer be optimal even though every feature looks stable.
What ships with the model#
The three artifacts at the end are deliberate:
model_pipeline_v1.joblibis the full scikit pipeline including scaler and sampler. Inference is a one-line load.eval_report_v1.jsoncaptures the metric snapshot at promotion time. CI gates compare the next version's report against this baseline and refuse to promote if macro F1 drops or detractor PR-AUC falls by more than ten percent.features_v1.jsonis the feature contract. The drift checks read it to know which columns to monitor and what their training-time statistics were.
What this buys you#
The team stops arguing about accuracy. They are looking at two numbers per week: PR-AUC on the detractor class and PSI on the top five features. When either trips a threshold, the retraining pipeline runs. When neither does, the model keeps operating at the utility-optimal threshold from its last promotion.
That is the shape of an NPS model that earns its seat in production.