When 0.3 Does Not Mean 30 Percent

Outcome focus: Defined a production calibration gate that logs Brier score, ECE, reliability diagrams, cost-sensitive thresholds, run metadata, and promotion criteria for imbalanced classifiers.

A model reports P(churn) = 0.30.

The retention team reads that as "roughly thirty out of a hundred similar customers will churn." They set an outreach rule, send offers, and start planning capacity around the score distribution.

Weeks later, the check comes back ugly. Around that score band, only eleven out of a hundred churned. The AUC looked fine. Average Precision looked better than the baseline. The confusion matrix at 0.5 looked fine enough to pass a demo.

The number the team acted on did not mean what it said.

That is a calibration failure. The model may still rank customers correctly. High scores may still be riskier than low scores. But once a business workflow treats a score as a probability, the model has made a stronger promise than "higher means higher." It has promised magnitude.

For imbalanced problems like fraud, churn, defects, complaints, adverse events, support escalations, and severe detractor prediction, this promise breaks constantly. Rare positives make ranking and probability quality drift apart. A model can be useful for ordering cases and still unusable for thresholds, expected value, alert volume, or human-review routing.

This post is the practical gate I would put around that problem: measure calibration, visualize it, calibrate post-hoc when needed, choose a threshold from real false-positive and false-negative costs, and fail promotion when probability quality regresses.

The official scikit-learn docs I checked for this update are on 1.8.0. A few current details matter: CalibratedClassifierCV now documents sigmoid, isotonic, and temperature calibration; already-fitted classifiers should be wrapped with FrozenEstimator; and brier_score_loss has scale_by_half="auto" so binary Brier scores stay in the customary [0, 1] range while multiclass scores keep the [0, 2] form.

The Promotion Gate

The operating loop is short: measure, visualize, calibrate, set a threshold, gate promotion.

Calibration becomes useful when it is wired into promotion, not when it is left as a notebook chart.

The gate has one job: stop a model from replacing production when its probabilities are less reliable, even if its ranking metric improved.

The tradeoff is that some models with better AUC will not ship immediately. That can feel frustrating when the leaderboard improved. I am comfortable with that delay because the downstream workflow is not acting on a leaderboard. It is acting on probability bands and thresholds.

Calibration In Plain English

Calibration asks whether predicted probabilities mean what they say.

If a model assigns 0.30 to a large group of similar cases, about 30 percent of those cases should be positive over time. If only 11 percent are positive, the model is overconfident. If 47 percent are positive, it is underconfident.
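
The check behind that statement is small enough to sketch. This is a minimal illustration on synthetic data, not the retention team's actual audit; the helper name observed_rate_in_band and the band edges are my own assumptions.

```python
import numpy as np


def observed_rate_in_band(y_true, y_prob, low, high):
    """Return (count, observed positive rate) for scores in [low, high)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    mask = (y_prob >= low) & (y_prob < high)
    if not mask.any():
        return 0, float("nan")
    return int(mask.sum()), float(y_true[mask].mean())


# Synthetic illustration: every case is scored 0.30,
# but outcomes are drawn with a true rate of 0.11.
rng = np.random.default_rng(0)
scores = np.full(5_000, 0.30)
outcomes = rng.random(5_000) < 0.11

count, rate = observed_rate_in_band(outcomes, scores, 0.25, 0.35)
print(f"{count} cases scored near 0.30, observed rate {rate:.3f}")
```

If the printed rate sits far from the band's center, the score is not behaving like a probability in that band, whatever the ranking metrics say.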

The scikit-learn probability calibration guide makes the same point with predict_proba: a well-calibrated binary classifier should produce scores that can be interpreted as confidence levels. The guide also gives an important warning: proper scoring rules such as Brier score and log loss assess calibration, resolution, and uncertainty together. A lower Brier score is useful, but it is not a pure calibration metric by itself.

That is why I log three kinds of evidence together:

Evidence	What it tells you	What it can hide
AUC / Average Precision	Whether the model ranks positives above negatives	Whether the probability values are believable
Brier / log loss	Whether probability predictions are numerically good overall	Whether the gain came from better ranking rather than better calibration
Reliability diagram / ECE	Whether probability bands line up with observed outcomes	Whether the model ranks cases well enough to be useful

Do not replace AUC with calibration metrics. Pair them.
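
A quick way to convince yourself that ranking and probability quality are separate is to apply a strictly monotone transform to a set of scores. The sketch below uses synthetic data and a square-root stretch that I picked purely for illustration: ROC-AUC should be unchanged because the ordering is unchanged, while the Brier score gets worse.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic rare-event problem: the "true" probabilities are known by construction.
true_prob = rng.beta(1, 12, size=20_000)
y = (rng.random(20_000) < true_prob).astype(int)

calibrated = true_prob
stretched = np.sqrt(true_prob)  # strictly monotone, so the ranking is identical

auc_calibrated = roc_auc_score(y, calibrated)
auc_stretched = roc_auc_score(y, stretched)
brier_calibrated = brier_score_loss(y, calibrated)
brier_stretched = brier_score_loss(y, stretched)

print(f"AUC   calibrated={auc_calibrated:.4f}  stretched={auc_stretched:.4f}")
print(f"Brier calibrated={brier_calibrated:.4f}  stretched={brier_stretched:.4f}")
```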

Why Imbalance Makes This Worse

Imbalanced classification gives you a small positive class and a large opportunity to fool yourself.

A fraud model with two percent positives can get high accuracy by mostly saying no. A churn model can improve AUC while still stretching the probability scale. A random forest with class weights can recover recall while producing probability estimates that are too conservative near zero and one. scikit-learn's calibration guide specifically notes that random forests often have trouble predicting probabilities near 0 and 1 because averaging high-variance trees pulls estimates away from the extremes.

I have seen teams ship the ranking part and discover the calibration failure only after the business asks a very normal question:

How many cases will this rule create next week?

If the score distribution is not calibrated, alert volume, expected value, staffing, SLA planning, and intervention budgets all get built on sand.
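
Here is a rough sketch of why planning breaks. With calibrated scores, the sum of the probabilities is a reasonable forecast of how many positives to expect. The inflation factor of 2.5 below is a made-up stand-in for an overconfident model, and the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known true probabilities for a synthetic population of 50,000 cases.
true_prob = rng.beta(1, 12, size=50_000)
outcomes = rng.random(50_000) < true_prob

# Hypothetical overconfident model: same ranking, inflated magnitudes.
inflated = np.clip(true_prob * 2.5, 0.0, 1.0)

forecast_calibrated = float(true_prob.sum())
forecast_inflated = float(inflated.sum())
actual = int(outcomes.sum())

print(f"actual positives:        {actual}")
print(f"forecast from calibrated: {forecast_calibrated:.0f}")
print(f"forecast from inflated:   {forecast_inflated:.0f}")
```

Both score sets rank cases identically, so a ranking metric would not separate them. Only the calibrated one gives a staffing number you could defend.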

Calibrate On Data The Base Model Did Not Train On

The clean scikit-learn baseline is CalibratedClassifierCV with cross-validation. The wrapper trains base-estimator clones on each training fold and fits calibrators on held-out fold predictions. That matters because calibration trained on the base model's own training predictions will usually be too optimistic.

calibrate.py

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
 
base = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    class_weight="balanced",
)
 
model = CalibratedClassifierCV(
    estimator=base,
    method="sigmoid",
    cv=5,
    ensemble="auto",
)
 
model.fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]
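
To see why held-out predictions matter, compare a random forest's probabilities on its own training rows with out-of-fold probabilities for the same rows. This is a small synthetic sketch, separate from the pipeline above; the dataset sizes and forest settings are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import cross_val_predict

X, y = make_classification(
    n_samples=3_000,
    n_features=20,
    n_informative=8,
    weights=[0.9, 0.1],
    flip_y=0.02,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Probabilities for rows the forest has already memorized.
in_sample = forest.fit(X, y).predict_proba(X)[:, 1]

# Probabilities for the same rows, each predicted by a clone that never saw them.
out_of_fold = cross_val_predict(forest, X, y, cv=5, method="predict_proba")[:, 1]

brier_in_sample = brier_score_loss(y, in_sample)
brier_out_of_fold = brier_score_loss(y, out_of_fold)

print(f"Brier on training rows:    {brier_in_sample:.4f}")
print(f"Brier on out-of-fold rows: {brier_out_of_fold:.4f}")
```

A calibrator fitted on the in-sample scores would learn a mapping for a score distribution the model never produces on new data.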

For binary imbalanced work, I usually start with method="sigmoid" because it is constrained and stable. The current scikit-learn docs also note that sigmoid can be preferred for very uncalibrated classifiers on very imbalanced datasets because the extra intercept helps shift a classifier biased toward the majority class.

method="isotonic" is more flexible. It can fix nonlinear calibration curves, but it can overfit when the calibration set is small. The scikit-learn docs warn against isotonic when there are too few calibration samples, and the practical version is even sharper: count positives, not rows. A calibration set with 50,000 rows and 200 positives is still thin for a flexible curve.

method="temperature" is now documented in scikit-learn 1.8. It is especially relevant for multiclass models and neural-network-style logits. Guo, Pleiss, Sun, and Weinberger's paper On Calibration of Modern Neural Networks is still the standard reference: modern neural networks can be poorly calibrated, and temperature scaling is a simple one-parameter post-processing method that often works well.
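
For intuition, here is a hand-rolled sketch of the one-parameter idea for binary logits: divide the logits by a temperature T and pick T by minimizing negative log-likelihood on held-out data. This is not scikit-learn's implementation, and fit_temperature is a name I made up; it only assumes NumPy and SciPy.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def fit_temperature(logits, y_true):
    """Fit a single temperature T > 0 so that sigmoid(logits / T) minimizes NLL."""
    logits = np.asarray(logits, dtype=float)
    y_true = np.asarray(y_true, dtype=float)

    def nll(log_t):
        z = logits / np.exp(log_t)
        # Numerically stable binary cross-entropy on logits: log(1 + e^z) - y * z
        return float(np.mean(np.logaddexp(0.0, z) - y_true * z))

    # Search over log(T) so that T stays positive; bounds cover T in about [0.05, 20].
    result = minimize_scalar(nll, bounds=(-3.0, 3.0), method="bounded")
    return float(np.exp(result.x))


# Synthetic check: build logits that are overconfident by a factor of 3.
rng = np.random.default_rng(0)
true_logits = rng.normal(-2.0, 1.5, size=20_000)
y = rng.random(20_000) < 1.0 / (1.0 + np.exp(-true_logits))
overconfident_logits = 3.0 * true_logits

temperature = fit_temperature(overconfident_logits, y)
print(f"fitted temperature: {temperature:.2f}")
```

A fitted temperature above 1 softens the scores and below 1 sharpens them. With no intercept term, this cannot correct a model that is shifted toward the majority class, which is the case where sigmoid calibration has the advantage noted above.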

Already-fitted classifiers are still possible, but current scikit-learn guidance is to wrap them with FrozenEstimator and calibrate on separate data:

frozen_estimator_calibration.py

from sklearn.calibration import CalibratedClassifierCV
from sklearn.frozen import FrozenEstimator
 
base.fit(X_train, y_train)
 
calibrated = CalibratedClassifierCV(
    estimator=FrozenEstimator(base),
    method="sigmoid",
)
calibrated.fit(X_calibration, y_calibration)

That pattern is useful when the base model came from a different training job. It is also a footgun if X_calibration is not truly disjoint from the rows the base model was trained on.
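
One way to make the disjointness checkable is to split row identifiers three ways up front and keep them with the run. The split proportions and variable names below are assumptions for illustration, and the labels are synthetic.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.02).astype(int)  # synthetic labels, about 2% positive
row_ids = np.arange(len(y))

# First cut: 60% for training the base model, 40% held back.
train_ids, rest_ids = train_test_split(
    row_ids, test_size=0.4, stratify=y, random_state=0
)
# Second cut: split the held-back rows into calibration and validation halves.
calib_ids, val_ids = train_test_split(
    rest_ids, test_size=0.5, stratify=y[rest_ids], random_state=0
)

splits = {"train": train_ids, "calibration": calib_ids, "validation": val_ids}

# Any shared row id means the calibrator or the evaluation is contaminated.
overlap = (
    (set(train_ids) & set(calib_ids))
    | (set(train_ids) & set(val_ids))
    | (set(calib_ids) & set(val_ids))
)

# Count positives, not rows: this is what limits the calibrator.
positives = {name: int(y[ids].sum()) for name, ids in splits.items()}
print(f"overlapping rows: {len(overlap)}, positives per split: {positives}")
```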

Measure ECE, But Do Not Worship It

Expected calibration error is useful because it compresses a reliability diagram into one number. The usual implementation bins predictions, compares the observed positive rate to the average predicted probability in each bin, weights by bin size, and sums the gaps.
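
Written out, that description corresponds to the usual binned estimator. The symbols are my notation: B bins, n_b predictions in bin b, N predictions in total, \bar{y}_b the observed positive rate in bin b, and \bar{p}_b the mean predicted probability in bin b.

```latex
\mathrm{ECE} \;=\; \sum_{b=1}^{B} \frac{n_b}{N} \,\bigl|\, \bar{y}_b - \bar{p}_b \,\bigr|
```

Empty bins have n_b = 0 and contribute nothing to the sum.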

ECE is not a universal truth. It changes with bin count, binning strategy, sample size, and class prevalence. Use it as a consistent gate within your own pipeline, not as a cross-team leaderboard metric.
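
The binning sensitivity is easy to demonstrate. The sketch below scores the same synthetic predictions with equal-width bins and with quantile bins; binned_ece is a hypothetical helper of mine that takes explicit bin edges, and the 1.5x inflation is an arbitrary miscalibration.

```python
import numpy as np


def binned_ece(y_true, y_prob, edges):
    """ECE for arbitrary increasing bin edges (first and last edge bound the range)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Use interior edges only, so every score lands in a bin from 0 to len(edges) - 2.
    bin_ids = np.digitize(y_prob, edges[1:-1], right=True)
    ece = 0.0
    for b in np.unique(bin_ids):
        mask = bin_ids == b
        ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return float(ece)


rng = np.random.default_rng(0)
true_prob = rng.beta(1, 20, size=20_000)
y = rng.random(20_000) < true_prob
scores = np.clip(true_prob * 1.5, 0.0, 1.0)  # mildly overconfident scores

uniform_edges = np.linspace(0.0, 1.0, 16)                     # 15 equal-width bins
quantile_edges = np.quantile(scores, np.linspace(0.0, 1.0, 16))  # 15 equal-count bins

ece_uniform = binned_ece(y, scores, uniform_edges)
ece_quantile = binned_ece(y, scores, quantile_edges)
print(f"ECE uniform bins:  {ece_uniform:.4f}")
print(f"ECE quantile bins: {ece_quantile:.4f}")
```

On imbalanced data most scores pile into the lowest equal-width bins, so the two strategies weight the curve differently. Whichever you pick, fix it and record it in the metric name, as ece_15_bins does.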

calibration_metrics.py

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
)
 
 
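def expected_calibration_error(y_true, y_prob, n_bins=15):
    # Accept lists or pandas objects as well as NumPy arrays.
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)

    # Equal-width bins. With right=True, a score of exactly 0.0 maps to
    # index 0 and is clipped into the first bin; 1.0 lands in the last bin.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, bins, right=True)
    bin_ids = np.clip(bin_ids, 1, n_bins) - 1

    ece = 0.0

    for i in range(n_bins):
        mask = bin_ids == i
        if not np.any(mask):
            continue  # empty bins contribute nothing
        bin_weight = mask.mean()  # share of all predictions in this bin
        observed_rate = y_true[mask].mean()
        predicted_rate = y_prob[mask].mean()
        ece += bin_weight * abs(observed_rate - predicted_rate)

    return float(ece)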
 
 
metrics = {
    "roc_auc": float(roc_auc_score(y_val, proba)),
    "average_precision": float(average_precision_score(y_val, proba)),
    "brier": float(brier_score_loss(y_val, proba, scale_by_half="auto")),
    "ece_15_bins": expected_calibration_error(y_val, proba, n_bins=15),
}
 
prob_true, prob_pred = calibration_curve(
    y_val,
    proba,
    n_bins=15,
    strategy="uniform",
)

The calibration_curve docs call out an easy-to-miss detail: bins with no samples are not returned, so the output arrays can be shorter than n_bins. That matters when you log curves. Store the returned prob_true and prob_pred arrays directly instead of assuming every bin exists.
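
A logging sketch that respects that detail might look like the following. The record layout and field names are my own; the only library behavior relied on is that calibration_curve returns one point per non-empty bin.

```python
import json

import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.beta(1, 20, size=5_000)  # synthetic scores bunched near zero
y_true = (rng.random(5_000) < y_prob).astype(int)

n_bins = 15
prob_true, prob_pred = calibration_curve(
    y_true, y_prob, n_bins=n_bins, strategy="uniform"
)

# Store the points as returned instead of assuming n_bins entries.
curve_record = {
    "strategy": "uniform",
    "n_bins_requested": n_bins,
    "n_bins_returned": int(len(prob_true)),
    "points": [
        {"mean_predicted": float(p), "observed_rate": float(t)}
        for p, t in zip(prob_pred, prob_true)
    ],
}

payload = json.dumps(curve_record, indent=2)
print(payload)
```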

Thresholding Is A Decision Problem

scikit-learn's decision-threshold guide separates two problems:

learning a statistical model that estimates probabilities
making a decision from those probabilities

The default binary classification rule in scikit-learn predicts positive when the positive-class probability from predict_proba exceeds 0.5. That default is almost never the business optimum on imbalanced problems.

If a false negative costs ten times more than a false positive, the threshold should usually move down. If the review team can only handle 300 alerts per day, capacity becomes part of the threshold policy. If an intervention annoys customers, false positives need a real cost.
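
The arithmetic behind "move down" is worth writing once. If the scores are calibrated, flagging a case costs (1 - p) * fp_cost in expectation and not flagging it costs p * fn_cost, so flagging wins when p >= fp_cost / (fp_cost + fn_cost). The helpers below are a sketch under that calibration assumption; their names and the way they combine cost with capacity are mine.

```python
import numpy as np


def cost_optimal_threshold(fp_cost, fn_cost):
    """Break-even probability. Only meaningful when scores are calibrated."""
    return fp_cost / (fp_cost + fn_cost)


def capacity_threshold(y_prob, max_alerts):
    """Score of the max_alerts-th highest case (ties can exceed the cap)."""
    y_prob = np.asarray(y_prob, dtype=float)
    if max_alerts >= len(y_prob):
        return 0.0
    return float(np.sort(y_prob)[-max_alerts])


def combined_threshold(y_prob, fp_cost, fn_cost, max_alerts):
    """Use the stricter of the cost-based and capacity-based thresholds."""
    return max(
        cost_optimal_threshold(fp_cost, fn_cost),
        capacity_threshold(y_prob, max_alerts),
    )


# A 10:1 cost ratio puts the break-even probability at 1/11, far below 0.5.
break_even = cost_optimal_threshold(fp_cost=1.0, fn_cost=10.0)

# Evenly spread illustrative scores; capacity of 300 alerts for the batch.
scores = np.linspace(0.0, 1.0, 1001)
capped = combined_threshold(scores, fp_cost=1.0, fn_cost=10.0, max_alerts=300)
print(f"break-even: {break_even:.4f}, capacity-limited threshold: {capped:.3f}")
```

When capacity binds, the threshold is no longer cost-optimal. That gap is worth logging, because it is the argument for more reviewers.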

Here is the plain manual sweep:

threshold_sweep.py

import numpy as np
 
 
def choose_threshold_by_cost(y_true, y_prob, fp_cost, fn_cost):
    thresholds = np.linspace(0.0, 1.0, 501)
    rows = []
 
    for threshold in thresholds:
        pred = y_prob >= threshold
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        expected_cost = fp * fp_cost + fn * fn_cost
        rows.append(
            {
                "threshold": float(threshold),
                "false_positives": fp,
                "false_negatives": fn,
                "expected_cost": float(expected_cost),
            }
        )
 
    return min(rows, key=lambda row: row["expected_cost"]), rows
 
 
chosen, sweep = choose_threshold_by_cost(
    y_true=y_val,
    y_prob=proba,
    fp_cost=1.0,
    fn_cost=10.0,
)

Current scikit-learn also has TunedThresholdClassifierCV, which tunes a decision threshold with internal cross-validation to maximize a metric. It is useful when your threshold objective can be expressed as a scorer. For a simple cost table, I still like the manual sweep because it leaves the cost model visible to the people who own the consequences.

The threshold should be persisted with the model artifact. A probability model without its operating threshold is only half of a decision system.
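
One lightweight way to do that is a small policy file saved next to the model. The DecisionPolicy class, its field names, and the file names below are all hypothetical; the point is only that the threshold and the costs that justified it travel together.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass(frozen=True)
class DecisionPolicy:
    threshold: float
    fp_cost: float
    fn_cost: float
    model_file: str  # hypothetical name of the serialized model artifact

    def decide(self, probability):
        """Apply the operating threshold to a calibrated probability."""
        return probability >= self.threshold


def save_policy(policy, path):
    Path(path).write_text(json.dumps(asdict(policy), indent=2) + "\n")


def load_policy(path):
    return DecisionPolicy(**json.loads(Path(path).read_text()))


# Round-trip through a temporary directory to show the record is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    policy_path = Path(tmp) / "decision_policy.json"
    save_policy(DecisionPolicy(0.09, 1.0, 10.0, "model.joblib"), policy_path)
    restored = load_policy(policy_path)

print(restored)
```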

Runnable Example

This script trains a toy imbalanced classifier, calibrates probabilities, logs calibration metrics, writes a reliability diagram, chooses a cost-sensitive threshold, and persists a metrics record that a CI gate can compare against.

calibration_run.py

import hashlib
import json
import time
from pathlib import Path
 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
 
ARTIFACT_DIR = Path("calibration_artifacts")
ARTIFACT_DIR.mkdir(exist_ok=True)
 
 
def expected_calibration_error(y_true, y_prob, n_bins=15):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, bins, right=True)
    bin_ids = np.clip(bin_ids, 1, n_bins) - 1
 
    ece = 0.0
    for i in range(n_bins):
        mask = bin_ids == i
        if not np.any(mask):
            continue
        ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return float(ece)
 
 
def choose_threshold_by_cost(y_true, y_prob, fp_cost, fn_cost):
    thresholds = np.linspace(0.0, 1.0, 501)
    best = None
 
    for threshold in thresholds:
        pred = y_prob >= threshold
        fp = int(((pred == 1) & (y_true == 0)).sum())
        fn = int(((pred == 0) & (y_true == 1)).sum())
        expected_cost = fp * fp_cost + fn * fn_cost
 
        row = {
            "threshold": float(threshold),
            "false_positives": fp,
            "false_negatives": fn,
            "expected_cost": float(expected_cost),
        }
 
        if best is None or row["expected_cost"] < best["expected_cost"]:
            best = row
 
    return best
 
 
X, y = make_classification(
    n_samples=20_000,
    n_features=20,
    n_informative=8,
    n_redundant=4,
    weights=[0.98, 0.02],
    flip_y=0.01,
    random_state=42,
)
 
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.3,
    stratify=y,
    random_state=42,
)
 
base = RandomForestClassifier(
    n_estimators=300,
    random_state=42,
    class_weight="balanced",
    n_jobs=-1,
)
 
model = CalibratedClassifierCV(
    estimator=base,
    method="sigmoid",
    cv=5,
    ensemble="auto",
)
model.fit(X_train, y_train)
 
proba = model.predict_proba(X_val)[:, 1]
 
metrics = {
    "roc_auc": float(roc_auc_score(y_val, proba)),
    "average_precision": float(average_precision_score(y_val, proba)),
    "brier": float(brier_score_loss(y_val, proba, scale_by_half="auto")),
    "ece_15_bins": expected_calibration_error(y_val, proba, n_bins=15),
}
 
prob_true, prob_pred = calibration_curve(
    y_val,
    proba,
    n_bins=15,
    strategy="uniform",
)
 
plt.plot(prob_pred, prob_true, marker="o", label="calibrated model")
plt.plot([0, 1], [0, 1], "--", color="gray", label="ideal")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed positive rate")
plt.title("Reliability diagram")
plt.legend()
plt.savefig(ARTIFACT_DIR / "reliability.png", dpi=160, bbox_inches="tight")
plt.close()
 
threshold = choose_threshold_by_cost(
    y_true=y_val,
    y_prob=proba,
    fp_cost=1.0,
    fn_cost=10.0,
)
 
dataset_hash = hashlib.sha256(
    np.ascontiguousarray(X_val).tobytes()
    + np.ascontiguousarray(y_val).tobytes()
).hexdigest()
 
run = {
    "timestamp": time.time(),
    "model": "RandomForestClassifier + CalibratedClassifierCV(sigmoid)",
    "calibration": {
        "method": "sigmoid",
        "cv": 5,
    },
    "metrics": metrics,
    "threshold": threshold,
    "n_val": int(len(y_val)),
    "positive_rate_val": float(y_val.mean()),
    "dataset_hash": dataset_hash,
}
 
(ARTIFACT_DIR / "run_metrics.json").write_text(
    json.dumps(run, indent=2) + "\n",
)
 
print(json.dumps(run, indent=2))

For scikit-learn versions before 1.7, remove scale_by_half="auto" from brier_score_loss, because the parameter does not exist there. On current stable releases, keeping it explicit makes binary and multiclass Brier behavior easier to reason about.
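
If the same script has to run on both sides of that version boundary, one option is to check the function signature instead of the version string. The wrapper name binary_brier is my own; the approach only assumes that inspect.signature can see the parameter when it exists.

```python
import inspect

from sklearn.metrics import brier_score_loss


def binary_brier(y_true, y_prob):
    """Binary Brier score that passes scale_by_half only when it is supported."""
    kwargs = {}
    if "scale_by_half" in inspect.signature(brier_score_loss).parameters:
        kwargs["scale_by_half"] = "auto"
    return float(brier_score_loss(y_true, y_prob, **kwargs))


# Hand-checkable case: squared errors 0.01, 0.04, 0.09, 0.00 average to 0.035.
score = binary_brier([0, 1, 0, 0], [0.1, 0.8, 0.3, 0.0])
print(f"Brier: {score:.3f}")
```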

A Tiny CI Gate

The candidate metrics file should be compared to the current production baseline. Here is a minimal gate.

calibration_gate.py

import json
import sys
from pathlib import Path
 
 
def load(path):
    return json.loads(Path(path).read_text())
 
 
baseline = load("baseline_run_metrics.json")
candidate = load("calibration_artifacts/run_metrics.json")
 
base_metrics = baseline["metrics"]
cand_metrics = candidate["metrics"]
 
base_cost = baseline["threshold"]["expected_cost"]
cand_cost = candidate["threshold"]["expected_cost"]
 
failures = []
 
if cand_metrics["ece_15_bins"] > base_metrics["ece_15_bins"] + 0.02:
    failures.append("ECE regressed by more than 0.02 absolute")
 
if cand_metrics["brier"] > base_metrics["brier"] * 1.05:
    failures.append("Brier regressed by more than 5 percent relative")
 
if cand_cost > base_cost * 1.10:
    failures.append("Expected cost regressed by more than 10 percent")
 
if failures:
    print("\n".join(failures))
    sys.exit(1)
 
print("Calibration gate passed")

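Those thresholds are intentionally plain. Tune them for your domain, but do not let them become vague. A promotion gate needs numbers small enough to catch real regressions and large enough to tolerate normal run-to-run noise. One caveat: expected_cost here is a total over the validation set, so the cost comparison is only meaningful when baseline and candidate were scored on the same validation data. The dataset_hash in each run record makes that checkable.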
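
To get a feel for what "normal noise" means on your validation set, a bootstrap over the validation rows is a reasonable first pass. This is a sketch on synthetic data; bootstrap_metric_spread and the 200-resample default are my own choices, not part of the gate above.

```python
import numpy as np


def brier(y_true, y_prob):
    return float(np.mean((y_prob - y_true) ** 2))


def bootstrap_metric_spread(y_true, y_prob, metric, n_boot=200, seed=0):
    """Return the 2.5th and 97.5th percentiles of a metric over row resamples."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    values = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        values[i] = metric(y_true[idx], y_prob[idx])
    return float(np.percentile(values, 2.5)), float(np.percentile(values, 97.5))


# Synthetic validation set with about 2% positives.
rng = np.random.default_rng(42)
y_prob = rng.beta(1, 49, size=6_000)
y_true = (rng.random(6_000) < y_prob).astype(float)

low, high = bootstrap_metric_spread(y_true, y_prob, brier)
print(f"Brier point estimate: {brier(y_true, y_prob):.5f}")
print(f"Bootstrap 95% range:  {low:.5f} to {high:.5f}")
```

If the gate's tolerance is narrower than that range, the gate will fail on noise. If it is many times wider, it will miss real regressions.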

Operational Checklist

Log ROC-AUC and Average Precision for ranking.
Log Brier, log loss if useful, ECE, and the reliability diagram for probability quality.
Keep the calibration set disjoint from the data the base model was trained on.
Start with method="sigmoid" for binary imbalanced data; try isotonic only when positive calibration examples are plentiful.
Consider method="temperature" for multiclass logits and neural-network-style outputs.
Choose thresholds from false-positive cost, false-negative cost, capacity, or policy constraints.
Persist the chosen threshold, cost model, metrics, calibration method, dataset hash, package versions, and reliability plot.
Fail CI when ECE, Brier, or expected cost regresses beyond a written tolerance.
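
The run record in the runnable example does not yet carry package versions or the cost model. A sketch of the extra fields follows; the key names are assumptions of mine, and the package list should match whatever your pipeline actually imports.

```python
import json
import platform
from importlib import metadata


def package_versions(names):
    """Look up installed versions; record None for packages that are missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


provenance = {
    "python": platform.python_version(),
    "packages": package_versions(["scikit-learn", "numpy", "matplotlib"]),
    # The costs that produced the stored threshold, so reviewers can audit them.
    "cost_model": {"fp_cost": 1.0, "fn_cost": 10.0},
}

print(json.dumps(provenance, indent=2))
```

Merging a dictionary like this into the run record before writing run_metrics.json keeps the gate's inputs reproducible.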

Calibration is not something to glance at after training. It is part of the contract between a model score and the people or systems that act on it.

If the score is only used for ranking, say so and optimize ranking. If the score is used as a probability, make it earn that interpretation before it reaches production.
