Outcome focus: Defined a calibration workflow that separates ranking from probability quality, uses scikit-learn calibration correctly, and carries thresholds and monitoring into production.
Tags: machine learning, calibration, mlops, classification, model evaluation
A classifier can have a good AUC and still lie with confidence.
That sentence sounds dramatic, but it is one of the most common failures in operational machine learning.
The model ranks cases well. High-risk users tend to score above low-risk users. The ROC-AUC looks healthy. Average Precision is better than the baseline. The team is happy.
Then someone uses the score as a probability.
They treat 0.70 as roughly 70 percent likely. They compute expected value. They set an outreach threshold. They promise an SLA. They decide which cases should be automated and which should go to human review.
Weeks later, it turns out the model's ranking was fine, but the probability scale was wrong.
That is a calibration problem.
Calibration is not an academic add-on. It is an operating control. If people are going to act on probabilities, those probabilities need to mean what they say.
This post is the practical playbook. The companion pieces When 0.3 Does Not Mean 30 Percent and A scikit-learn Pipeline for Calibrated Decisions go deeper on calibration failure and end-to-end pipeline design.
Ranking and probability are different promises
Discrimination asks:
Can the model put positives above negatives?
Calibration asks:
When the model says 30 percent, does the event happen roughly 30 percent of the time?
Those are different promises.
A model can rank well and be miscalibrated. A model can be calibrated but not very useful for ranking. In many real workflows, you need both.
If the business only wants the top 1,000 cases each week, ranking may be enough. AUC and Average Precision can support that conversation.
If the business wants to set thresholds by expected cost, estimate capacity, automate low-risk cases, or compare risk across segments, probability quality matters.
This is where Brier score, log loss, and reliability diagrams come in.
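To make the difference concrete, here is a small sketch on synthetic data. The scores and the distortion are assumptions made up for illustration, not output from a real model. A strictly increasing distortion keeps the ordering of cases, so ROC-AUC should not move, while Brier score and log loss get worse.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic, calibrated-by-construction scores: each label is drawn with
# probability equal to its score.
p_true = rng.uniform(0.02, 0.98, size=20_000)
y = rng.binomial(1, p_true)

# Hypothetical overconfident model: a strictly increasing distortion that
# pushes scores toward 0 and 1 without changing their order.
p_distorted = p_true**3 / (p_true**3 + (1 - p_true) ** 3)

for name, p in [("calibrated", p_true), ("distorted", p_distorted)]:
    print(
        f"{name:>10}  AUC={roc_auc_score(y, p):.4f}  "
        f"Brier={brier_score_loss(y, p):.4f}  LogLoss={log_loss(y, p):.4f}"
    )
```

Same ranking, same AUC, worse probabilities. That is the gap the ranking metrics cannot see.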
The baseline scikit-learn pattern
Here is a minimal calibration example using the current scikit-learn API. Older examples often pass base_estimator=...; that parameter was renamed to estimator=... in scikit-learn 1.2, and the old name was removed in 1.4.
```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (roughly 80/20).
X, y = make_classification(
    n_samples=5000,
    n_features=20,
    weights=[0.8, 0.2],
    random_state=42,
)

# Stratify so the test set keeps the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0,
    stratify=y,
)

base = LogisticRegression(max_iter=1000)

# Cross-validated calibration: calibrators are fit on held-out folds.
calibrated = CalibratedClassifierCV(
    estimator=base,
    method="sigmoid",
    cv=5,
)
calibrated.fit(X_train, y_train)

# Probability of the positive class on data the model has not seen.
proba = calibrated.predict_proba(X_test)[:, 1]

print("ROC-AUC:", roc_auc_score(y_test, proba))
print("Brier:", brier_score_loss(y_test, proba))
print("LogLoss:", log_loss(y_test, proba))
```
This is not production code by itself.
It is the smallest honest example of the control.
CalibratedClassifierCV fits the base classifier and calibrates its outputs using held-out folds. That matters because calibration should not be fit on the same predictions the classifier trained on. With cv=5, scikit-learn clones the estimator five times, trains each clone on four folds, and fits a calibrator on that clone's predictions for the remaining fold. By default, predict_proba then averages the calibrated outputs of those five pairs.
That gives you a less optimistic probability mapping.
Sigmoid or isotonic
There are two common calibration methods.
sigmoid is Platt scaling. It fits a logistic mapping from raw model scores to probabilities. It is constrained and usually more stable when calibration data is limited.
isotonic is more flexible. It fits a monotonic calibration curve. It can correct more complicated miscalibration patterns, but it can overfit when there are too few calibration examples.
The important phrase is not "too few rows."
It is "too few positive examples."
On imbalanced data, a dataset can look large while still giving the calibrator only a small number of positive cases in each fold. That is where isotonic can become a beautiful overfit.
My default:
- Use sigmoid first.
- Try isotonic when there are enough positives and the reliability diagram shows nonlinear miscalibration.
- Validate on a time-based holdout when the production problem has a time dimension.
Measure ranking and calibration together
Do not replace AUC with calibration metrics.
Use both.
A useful evaluation set should include:
- ROC-AUC for ranking across thresholds.
- Average Precision for ranking under class imbalance.
- Brier score for probability quality.
- Log loss for probability quality with sharper penalty for confident mistakes.
- Reliability diagram for visual calibration.
- Segment-level calibration for important business cuts.
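Most of that list fits in one small function. The sketch below is a hypothetical helper (evaluation_report is my name, not a scikit-learn function) that returns the ranking and probability metrics together, plus the points of a reliability diagram from calibration_curve. The demo data is synthetic and calibrated by construction.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    log_loss,
    roc_auc_score,
)


def evaluation_report(y_true, proba, n_bins=10):
    """Hypothetical helper: ranking and calibration metrics in one dict."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    # calibration_curve returns the observed positive rate and the mean
    # predicted probability for each bin, in that order.
    observed, predicted = calibration_curve(
        y_true, proba, n_bins=n_bins, strategy="quantile"
    )
    return {
        "roc_auc": float(roc_auc_score(y_true, proba)),
        "average_precision": float(average_precision_score(y_true, proba)),
        "brier": float(brier_score_loss(y_true, proba)),
        "log_loss": float(log_loss(y_true, proba)),
        # (mean predicted, observed rate) pairs: the reliability diagram points.
        "reliability": [
            (round(float(p), 3), round(float(o), 3))
            for p, o in zip(predicted, observed)
        ],
    }


rng = np.random.default_rng(0)
proba = rng.uniform(0.01, 0.99, size=20_000)
y_true = rng.binomial(1, proba)

report = evaluation_report(y_true, proba)
for key, value in report.items():
    print(key, value)
```

I use quantile bins here so every bin holds a similar number of cases; uniform bins are the other option calibration_curve offers and can leave the high-probability bins nearly empty on imbalanced data.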
The segment point matters.
Overall calibration can hide bad local behavior. A model can be calibrated across the full population while overestimating risk for one state, channel, plan tier, store group, or customer segment.
If the business will act differently by segment, inspect calibration by segment.
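A segment check can be as plain as comparing mean predicted probability with the observed rate per segment. The sketch below uses a hypothetical helper and made-up segment names ("online", "store"); the data is constructed so the overall gap is near zero while each segment is off by about ten points in opposite directions.

```python
import numpy as np


def segment_calibration(y_true, proba, segment):
    """Hypothetical helper: predicted vs. observed rate overall and per segment."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    segment = np.asarray(segment)

    def row(name, mask):
        predicted = float(proba[mask].mean())
        observed = float(y_true[mask].mean())
        return {
            "segment": name,
            "n": int(mask.sum()),
            "mean_predicted": predicted,
            "observed_rate": observed,
            # Positive gap: the model overestimates risk in this segment.
            "gap": predicted - observed,
        }

    rows = [row("ALL", np.ones(len(y_true), dtype=bool))]
    for name in np.unique(segment):
        rows.append(row(str(name), segment == name))
    return rows


rng = np.random.default_rng(1)
n = 10_000
score = rng.uniform(0.2, 0.8, size=2 * n)
segment = np.array(["online"] * n + ["store"] * n)
# Synthetic miscalibration: true risk is higher than the score online and
# lower than the score in store, so the errors cancel in aggregate.
true_risk = score + np.where(segment == "online", 0.1, -0.1)
y_true = rng.binomial(1, true_risk)

table = segment_calibration(y_true, score, segment)
for entry in table:
    print(entry)
```

A single mean gap per segment is a coarse check. It will miss a segment that is overconfident at both ends, so for high-stakes segments a per-segment reliability diagram is the stronger tool.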
Thresholding comes after calibration
Thresholds are not model metrics.
They are policy decisions.
A threshold can be chosen by:
- expected false positive and false negative cost
- team capacity
- regulatory risk
- customer experience constraints
- intervention budget
- SLA commitments
- precision or recall floor
If probabilities are calibrated, threshold selection becomes a business conversation. If probabilities are not calibrated, threshold selection becomes theater with math on top.
Here is a plain cost sweep:
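```python
import numpy as np


def choose_threshold(y_true, proba, fp_cost, fn_cost):
    """Sweep thresholds and return the one with the lowest total cost."""
    # Coerce inputs so the boolean masks below also work for lists.
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)

    best = None
    for threshold in np.linspace(0.01, 0.99, 99):
        pred = proba >= threshold
        fp = ((pred == 1) & (y_true == 0)).sum()
        fn = ((pred == 0) & (y_true == 1)).sum()
        cost = fp * fp_cost + fn * fn_cost
        candidate = {
            "threshold": float(threshold),
            "cost": float(cost),
            "false_positives": int(fp),
            "false_negatives": int(fn),
        }
        # Strict comparison keeps the lowest threshold when costs tie.
        if best is None or candidate["cost"] < best["cost"]:
            best = candidate
    return best
```
That function is simple on purpose.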
The hard part is agreeing on the cost model.
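Calibration also gives the sweep a sanity check. Under a simplifying assumption of fixed per-case costs and no payoff for correct decisions, acting on a case with probability p is worth it when p * fn_cost >= (1 - p) * fp_cost, which rearranges to p >= fp_cost / (fp_cost + fn_cost). The sketch below compares that break-even value with an empirical sweep on synthetic, calibrated-by-construction scores. The helper names and the cost values are illustrative.

```python
import numpy as np


def break_even_threshold(fp_cost, fn_cost):
    """Analytic threshold when probabilities are calibrated and costs are fixed."""
    return fp_cost / (fp_cost + fn_cost)


def total_cost(y_true, proba, threshold, fp_cost, fn_cost):
    """Total cost of acting on every case at or above the threshold."""
    pred = proba >= threshold
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    return float(fp * fp_cost + fn * fn_cost)


rng = np.random.default_rng(0)
# Calibrated by construction: each label is drawn with its own score.
proba = rng.uniform(0.0, 1.0, size=200_000)
y_true = rng.binomial(1, proba)

fp_cost, fn_cost = 1.0, 4.0  # illustrative: a miss costs 4x a false alarm
grid = np.linspace(0.01, 0.99, 99)
costs = [total_cost(y_true, proba, t, fp_cost, fn_cost) for t in grid]

empirical = float(grid[int(np.argmin(costs))])
analytic = break_even_threshold(fp_cost, fn_cost)
print(f"analytic={analytic:.2f}  empirical={empirical:.2f}")
```

If the swept threshold on your validation data sits far from the break-even value, either the cost model is incomplete or the probabilities are not calibrated. Both are worth knowing before launch.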
Production artifacts
When a calibrated model ships, persist more than the estimator.
Persist:
- fitted preprocessing pipeline
- fitted calibrated classifier
- package versions
- feature schema
- calibration method
- selected threshold
- threshold rationale
- training data snapshot
- evaluation results
- reliability diagrams
- segment calibration table
- golden batch
The threshold deserves a first-class record. Six months later, the team should know whether the threshold came from expected cost, capacity, risk policy, or a stakeholder preference.
If the threshold was political, write that down too.
Ambiguity ages badly.
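One way to make the threshold a first-class record is a small manifest stored next to the model file. This is a sketch only: the ThresholdRecord fields, the manifest layout, and every value in it are hypothetical placeholders, not a standard format.

```python
import json
from dataclasses import asdict, dataclass
from importlib import metadata
from typing import Optional


@dataclass
class ThresholdRecord:
    """Hypothetical record of a threshold and where it came from."""

    value: float
    calibration_method: str
    source: str  # e.g. "expected_cost", "capacity", "risk_policy", "stakeholder"
    rationale: str
    fp_cost: Optional[float] = None
    fn_cost: Optional[float] = None


def package_versions(names):
    """Installed versions for the given distributions, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


record = ThresholdRecord(
    value=0.2,
    calibration_method="sigmoid",
    source="expected_cost",
    rationale="Illustrative: a missed case is assumed to cost 4x a false alarm.",
    fp_cost=1.0,
    fn_cost=4.0,
)

manifest = {
    "threshold": asdict(record),
    "package_versions": package_versions(["scikit-learn", "numpy"]),
}
payload = json.dumps(manifest, indent=2, sort_keys=True)
print(payload)
```

Plain JSON is deliberate. The record should be readable by someone who cannot unpickle the model.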
Monitoring
Calibration needs monitoring because the world moves.
Track:
- score distribution
- probability bucket counts
- decision rate at the threshold
- feature drift
- label arrival delay
- Brier score when labels arrive
- log loss when labels arrive
- calibration curve when labels arrive
- segment-level calibration
- override rate
- business KPI lift
For rare positive classes, label delay can be long. That does not remove the need for monitoring. It means you need leading indicators while waiting for labels: input drift, score drift, unknown category rate, and operational adoption.
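Two of those leading indicators need no labels at all: the share of scores in each probability bucket and the decision rate at the threshold. Here is a minimal sketch with hypothetical helper names; the reference and live scores are simulated from two Beta distributions purely to show a shift.

```python
import numpy as np


def score_snapshot(proba, threshold, n_buckets=10):
    """Label-free summary of a batch of scores."""
    proba = np.asarray(proba)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    counts, _ = np.histogram(proba, bins=edges)
    return {
        "bucket_share": counts / counts.sum(),
        "decision_rate": float((proba >= threshold).mean()),
    }


def max_share_shift(reference, live):
    """Largest absolute change in any bucket's share between two snapshots."""
    return float(np.max(np.abs(reference["bucket_share"] - live["bucket_share"])))


rng = np.random.default_rng(0)
threshold = 0.3  # illustrative
reference = score_snapshot(rng.beta(2, 8, size=50_000), threshold)
live = score_snapshot(rng.beta(3, 7, size=50_000), threshold)

shift = max_share_shift(reference, live)
print("reference decision rate:", round(reference["decision_rate"], 3))
print("live decision rate:", round(live["decision_rate"], 3))
print("max bucket share shift:", round(shift, 3))
```

What counts as too much shift is a policy choice, so I have left any alert limit out of the sketch. A moving decision rate is usually the first thing operations will notice, because it changes their workload.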
When labels arrive, close the loop.
If 0.70 no longer behaves like 70 percent, the model has lost part of its contract.
Failure modes
The first failure mode is optimizing AUC and reporting probabilities.
Ranking quality does not guarantee probability quality.
The second is calibrating on training predictions.
That fits the calibrator to noise and makes deployment look safer than it is.
The third is using isotonic calibration with too few positives.
Flexibility without data is just overfitting with a nicer name.
The fourth is class weighting without calibration checks.
Class weighting can help rare-class recall and damage probability meaning.
The fifth is shipping a threshold without a threshold policy.
The model says risk. The business decides action.
The sixth is measuring calibration only once.
Calibration is not a launch task. It is a maintenance task.
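The fourth failure mode is the easiest to demonstrate. The sketch below fits the same logistic regression with and without class_weight="balanced" on synthetic data with roughly 10 percent positives, then compares the mean predicted probability with the base rate.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.9, 0.1], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

base_rate = float(y_test.mean())
summary = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(max_iter=1000, class_weight=class_weight)
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    summary[label] = {
        "mean_predicted": float(proba.mean()),
        "brier": float(brier_score_loss(y_test, proba)),
    }

print("base rate:", round(base_rate, 3))
for label, stats in summary.items():
    print(label, {key: round(value, 3) for key, value in stats.items()})
```

On data like this, the weighted model's average prediction should land well above the base rate. That is fine for ranking and recall. It is a problem the moment someone reads the score as a probability, which is why a weighted model needs a calibration step or at least a calibration check before its scores are used that way.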
The point
A reliable probability is a promise.
It says the model's score can be used for more than ranking. It can support expected value, capacity planning, risk controls, and thresholded workflows.
That promise has to be earned.
Fit the calibrator correctly.
Measure Brier and log loss alongside AUC.
Look at reliability diagrams.
Check segments.
Choose thresholds after calibration.
Monitor the probability scale after launch.
That is how a classifier becomes part of an operating system instead of another impressive notebook.
Related notes
- When 0.3 Does Not Mean 30 Percent
- A scikit-learn Pipeline for Calibrated Decisions
- Plain-Language Machine Learning Metrics for Real Decisions
- The Preprocessing Boundary Between scikit-learn and PyTorch