Outcome focus: Defined an end-to-end scikit-learn classification pipeline that keeps preprocessing, imbalance handling, probability calibration, evaluation, thresholding, and production artifacts aligned.
Tags: machine learning, scikit-learn, calibration, classification, mlops
Most tabular classifiers do not fail because the algorithm is exotic.
They fail because the pipeline is loose.
The notebook imputes missing values before the split. The production job encodes categories differently. The model handles class imbalance by changing weights, but nobody checks whether the probabilities still mean what they say. The team picks a threshold because 0.5 feels neutral. A dashboard shows AUC, while the actual workflow depends on whether 0.27 behaves like 27 percent.
That is not a modeling problem in isolation.
It is a contract problem.
For many business classifiers, especially risk, propensity, churn, NPS, eligibility, fraud, and outreach models, the real artifact is not only the estimator. The artifact is:
- the preprocessing graph
- the classifier
- the calibration method
- the evaluation protocol
- the operating threshold
- the feature schema
- the persistence format
- the monitoring plan
If those pieces drift apart, the model can keep returning numbers while the decision system quietly gets worse.
This is the scikit-learn pattern I reach for when the data is mixed, the positive class is rare, and the output probability has to support a real decision.
It is not fancy. That is what makes it survive a year in production.
The shape of the problem
Assume a binary classifier with structured data.
Some features are numeric:
- age
- recent cost
- visits in the last 90 days
- volatility measures
- rejection counts
- time-windowed behavior
Some features are categorical:
- state
- channel
- plan tier
- store id
- product group
- customer segment
The positive class is rare. Maybe only 3 percent of customers are severe detractors. Maybe only 5 percent churn. Maybe only 2 percent have the operational event the team wants to prevent.
The business does not only need labels.
It needs ranked risk and probability-like scores.
That means the pipeline has to do more than maximize accuracy.
It has to produce a score that can be trusted enough to drive thresholded action.
The baseline pattern
Here is the current scikit-learn version of the pattern.
One small detail matters: use estimator= for CalibratedClassifierCV. Older examples often use base_estimator=, but scikit-learn deprecated that parameter in 1.2 and current docs use estimator.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# X (feature DataFrame) and y (binary target) are assumed to be loaded already.

num_cols = [
    "age",
    "rx_cost_30d",
    "visits_90d",
    "copay_volatility",
]
cat_cols = [
    "store_id",
    "plan_tier",
    "state",
    "channel",
]

# Numeric branch: median imputation, then scaling.
num_prep = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)

# Categorical branch: most-frequent imputation, then one-hot encoding.
cat_prep = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)

prep = ColumnTransformer(
    transformers=[
        ("num", num_prep, num_cols),
        ("cat", cat_prep, cat_cols),
    ],
    remainder="drop",
)

base_clf = LogisticRegression(
    class_weight="balanced",
    max_iter=2000,
    solver="lbfgs",
)

# Calibration wraps the classifier, so it is refit inside every outer fold too.
calibrated_clf = CalibratedClassifierCV(
    estimator=base_clf,
    method="sigmoid",
    cv=5,
)

pipe = Pipeline(
    steps=[
        ("prep", prep),
        ("model", calibrated_clf),
    ]
)

cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42,
)

scoring = {
    "roc_auc": "roc_auc",
    "average_precision": "average_precision",
    "brier": "neg_brier_score",
}

results = cross_validate(
    pipe,
    X,
    y,
    cv=cv,
    scoring=scoring,
    n_jobs=-1,
    return_train_score=False,
)

summary = {
    metric: results[f"test_{metric}"].mean()
    for metric in scoring
}
# The scorer returns the negated Brier score; flip the sign for reporting.
summary["brier"] = -summary["brier"]
print(summary)

This is not the end of the project.
It is the beginning of a clean baseline.
Why the preprocessing belongs inside the pipeline
The ColumnTransformer is the quiet hero here.
It makes preprocessing part of the model contract.
Numeric columns get median imputation and scaling. Categorical columns get most-frequent imputation and one-hot encoding. The fitted imputation values, scaler parameters, one-hot category mapping, and feature ordering stay attached to the pipeline.
That matters because preprocessing leakage is easy.
If you impute before splitting, your validation fold has already influenced the training transform. If you one-hot encode outside the pipeline, feature ordering can change between training and serving. If you fill missing values differently in production, the model may receive legal inputs with shifted meaning.
None of this has to crash.
It can just be wrong.
Keeping preprocessing inside the pipeline means each cross-validation fold fits its own preprocessing on the training portion only. The held-out fold is transformed by that fold's fitted preprocessing graph.
That gives you a more honest estimate.
It also gives production one artifact to load.
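A minimal sketch of the leakage, using synthetic data (the array, the shift, and the split point are made up for illustration): an imputer fit on all rows learns a different median than one fit on the training rows only. That difference is information from the held-out rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)

# Synthetic single feature; the last 20 rows (the "validation" rows) are shifted upward.
values = rng.normal(loc=50.0, scale=5.0, size=100)
values[80:] += 20.0
X = values.reshape(-1, 1)
X[::10] = np.nan  # knock out every tenth value

X_train, X_valid = X[:80], X[80:]

# Leaky: the median is computed over training and validation rows together.
leaky_median = SimpleImputer(strategy="median").fit(X).statistics_[0]

# Leakage-safe: the median is computed over training rows only,
# which is what a pipeline does inside each cross-validation fold.
safe_median = SimpleImputer(strategy="median").fit(X_train).statistics_[0]

print(f"median fit on all rows:      {leaky_median:.2f}")
print(f"median fit on training rows: {safe_median:.2f}")
```

The gap is small here. The point is that it exists at all, and nothing in the code raises an error about it.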
This connects directly to the broader artifact-contract point in The Preprocessing Boundary Between scikit-learn and PyTorch. Whether the downstream model is PyTorch or scikit-learn, feature semantics need to travel with the model.
Why handle_unknown="ignore" matters
Categorical data changes.
New stores appear. New plan tiers get added. New channels show up. A state value arrives in a different case. A product code appears that was not in the training set.
If the encoder is strict, production can fail on an unseen category.
OneHotEncoder(handle_unknown="ignore") avoids that failure by producing zeros for unknown categories in that feature group.
That does not mean unknown categories are harmless.
It means the pipeline degrades instead of crashing.
You should still monitor unknown category rates. A sudden rise can mean a schema change, upstream data issue, new business behavior, or a training set that is no longer representative.
The operational rule is:
Do not let unseen categories crash serving.
Do not ignore them in monitoring.
Why class imbalance changes the conversation
Class imbalance is not just an annoyance.
It changes what "good" means.
If 3 percent of examples are positive, a model can be 97 percent accurate by predicting every case as negative. That model is useless for finding the people who need action.
class_weight="balanced" is a strong baseline because it adjusts class weights inversely to class frequency. scikit-learn documents the formula as:
n_samples / (n_classes * np.bincount(y))

That makes mistakes on the rare class count more during training.
But class weighting has a cost.
It often improves ranking and recall, but it can distort probability estimates. The model was trained under a reweighted loss, while the real-world base rate is still the real-world base rate.
That is why calibration matters.
The model may rank risk correctly while reporting probabilities that are too high or too low.
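The sketch below illustrates both points on synthetic data. The dataset comes from `make_classification` with an assumed 97/3 class split, so the exact numbers mean nothing beyond the illustration. It checks the balanced-weight formula against `compute_class_weight`, then compares the mean predicted probability of a weighted and an unweighted logistic regression with the actual base rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced problem (illustrative only).
X, y = make_classification(
    n_samples=5000,
    n_features=10,
    n_informative=5,
    weights=[0.97, 0.03],
    random_state=0,
)

# The documented formula, and scikit-learn's own helper for the same thing.
manual_weights = len(y) / (2 * np.bincount(y))
sklearn_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
print("balanced weights:", manual_weights)

plain = LogisticRegression(max_iter=2000).fit(X, y)
weighted = LogisticRegression(class_weight="balanced", max_iter=2000).fit(X, y)

base_rate = y.mean()
mean_plain = plain.predict_proba(X)[:, 1].mean()
mean_weighted = weighted.predict_proba(X)[:, 1].mean()

print(f"base rate:                    {base_rate:.3f}")
print(f"mean probability, unweighted: {mean_plain:.3f}")
print(f"mean probability, balanced:   {mean_weighted:.3f}")
```

The balanced model's average score sits well above the base rate. The ranking may be fine, but the raw number is no longer a frequency.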
If the downstream workflow only needs the top 500 cases every week, ranking may be enough. If the workflow uses expected value, SLA risk, clinical escalation, retention spend, support triage, or cost-sensitive thresholds, probability quality matters.
I wrote more about that distinction in When 0.3 Does Not Mean 30 Percent.
Why calibration belongs in the design
scikit-learn's calibration docs define a well-calibrated classifier plainly: when a classifier assigns probability near 0.8, roughly 80 percent of those samples should belong to the positive class in the long run.
That is the whole idea.
AUC does not tell you that.
AUC tells you whether positives tend to score higher than negatives. Average Precision tells you more about ranking under imbalance. Brier score gives you a probability-quality signal. Calibration curves and reliability diagrams show whether the score means what it says.
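A quick way to see the gap between ranking and calibration: apply a monotonic distortion to calibrated scores. This sketch uses simulated probabilities (the Beta distribution and the fourth-root distortion are arbitrary choices for illustration). The ordering is unchanged, so AUC stays the same, while the Brier score gets much worse.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

rng = np.random.default_rng(0)

# Simulated "true" probabilities for a rare positive class, and labels drawn from them.
true_prob = rng.beta(1.0, 15.0, size=20000)
y = rng.binomial(1, true_prob)

# A strictly increasing distortion: same ranking, inflated probabilities.
distorted = true_prob ** 0.25

auc_calibrated = roc_auc_score(y, true_prob)
auc_distorted = roc_auc_score(y, distorted)
brier_calibrated = brier_score_loss(y, true_prob)
brier_distorted = brier_score_loss(y, distorted)

print(f"AUC:   calibrated={auc_calibrated:.4f}  distorted={auc_distorted:.4f}")
print(f"Brier: calibrated={brier_calibrated:.4f}  distorted={brier_distorted:.4f}")
```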
In this pipeline, CalibratedClassifierCV wraps the base classifier.
With cv=5, scikit-learn fits copies of the estimator on training folds and calibrates them on held-out folds. That keeps calibration from being fit on the same predictions the classifier trained on.
There are two common methods:
- sigmoid, also called Platt scaling, fits a logistic mapping from model score to probability.
- isotonic fits a more flexible monotonic calibration curve.
My default is sigmoid unless I have enough calibration data and a reason to use isotonic.
The scikit-learn docs warn that isotonic calibration can overfit with too few calibration samples. That is especially important on imbalanced data. "I have 50,000 rows" is not the relevant number. "I have enough positive examples inside calibration folds" is the relevant number.
If positives are rare, isotonic can learn a jagged curve from noise.
Use flexibility when the data can support it.
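One way to make "enough positives inside calibration folds" concrete is to count them before choosing a method. This is a sketch with a hypothetical helper name (`positives_per_fold`) and a made-up label vector. What count is "enough" is a judgment call, not something the code decides.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def positives_per_fold(y, n_splits=5, seed=42):
    """Count positive labels in each held-out (calibration) fold."""
    y = np.asarray(y)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # The split only depends on y, so a placeholder feature matrix is enough.
    placeholder_X = np.zeros((len(y), 1))
    return [int(y[test_idx].sum()) for _, test_idx in cv.split(placeholder_X, y)]


# 2,000 rows at a 3 percent positive rate: only 60 positives overall.
y = np.array([1] * 60 + [0] * 1940)
counts = positives_per_fold(y)
print(counts)
```

Two thousand rows sounds workable. A dozen or so positives per calibration fold is the number isotonic regression would actually be learning from.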
A version with isotonic calibration
If you have enough data, the code change is small:
calibrated_clf = CalibratedClassifierCV(
    estimator=base_clf,
    method="isotonic",
    cv=5,
)

The evaluation burden is larger.
When switching from sigmoid to isotonic, compare:
- Brier score
- log loss
- reliability diagram
- calibration by score bucket
- calibration by segment
- threshold stability
- performance on a time-based holdout
Do not switch because the training chart looks smoother.
Switch because the calibrated probabilities are better on data that simulates the way the model will be used.
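For the "calibration by score bucket" item, a reliability table is often enough. The sketch below wraps `sklearn.calibration.calibration_curve` in a hypothetical helper (`reliability_table`) and runs it on simulated scores. In practice you would feed it held-out probabilities from the sigmoid and isotonic variants and compare the gaps side by side.

```python
import numpy as np
from sklearn.calibration import calibration_curve


def reliability_table(y_true, proba, n_bins=10):
    """Mean predicted probability vs. observed positive rate per score bucket."""
    # Quantile bins keep roughly equal row counts per bucket, which helps
    # when most scores are small because the positive class is rare.
    observed, predicted = calibration_curve(
        y_true, proba, n_bins=n_bins, strategy="quantile"
    )
    return [
        {
            "mean_predicted": float(p),
            "observed_rate": float(o),
            "gap": float(o - p),
        }
        for o, p in zip(observed, predicted)
    ]


# Simulated, well-calibrated scores (illustrative only).
rng = np.random.default_rng(0)
proba = rng.beta(1.0, 15.0, size=20000)
y_true = rng.binomial(1, proba)

table = reliability_table(y_true, proba)
for row in table:
    print(row)
```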
Cross-validation is not the same as the final artifact
cross_validate tells you how the pipeline behaves across folds.
It does not produce the final model you deploy.
After evaluation, fit the full pipeline on the training set that you have approved for release:
pipe.fit(X_train, y_train)

Then evaluate once on a final untouched test set or a time-based holdout:

proba = pipe.predict_proba(X_test)[:, 1]
metrics = {
    "roc_auc": roc_auc_score(y_test, proba),
    "average_precision": average_precision_score(y_test, proba),
    "brier": brier_score_loss(y_test, proba),
}
print(metrics)

For operational models, I prefer time-based holdouts when the data has natural time.
Random folds answer one question:
How well does this model generalize across examples drawn from the same period?
Time-based holdouts answer another:
How well does this model survive the future?
Most business models care more about the second question.
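A time-based holdout can be as plain as a cutoff date. This sketch assumes a hypothetical `event_date` column and an arbitrary cutoff. The only rule it enforces is that every training row precedes every holdout row.

```python
import pandas as pd


def time_based_split(df, time_col, cutoff):
    """Rows strictly before the cutoff train the model; the rest are the holdout."""
    cutoff = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff]
    holdout = df[df[time_col] >= cutoff]
    return train, holdout


# Toy events table (column names and dates are illustrative).
events = pd.DataFrame(
    {
        "event_date": pd.to_datetime(
            ["2025-01-15", "2025-02-10", "2025-03-05", "2025-04-20"]
        ),
        "visits_90d": [2, 5, 1, 7],
        "label": [0, 1, 0, 0],
    }
)

train_df, holdout_df = time_based_split(events, "event_date", "2025-03-01")
print(len(train_df), "training rows,", len(holdout_df), "holdout rows")
```

Features built from time windows need the same discipline: a training row should only see data that existed before its own date.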
Thresholding comes after calibration
A calibrated model still does not decide what action to take.
It produces a probability.
The threshold is a policy.
For example, if a false negative is 10 times more expensive than a false positive, the threshold should reflect that. If an operations team can only review 1,000 cases per week, the threshold may be set by capacity. If outreach has a hard cost, the threshold should be tied to expected value.
Here is a simple cost sweep:
import numpy as np

def choose_threshold(
    y_true,
    proba,
    false_positive_cost,
    false_negative_cost,
):
    thresholds = np.linspace(0.01, 0.99, 99)
    best = None
    for threshold in thresholds:
        pred = proba >= threshold
        false_positives = ((pred == 1) & (y_true == 0)).sum()
        false_negatives = ((pred == 0) & (y_true == 1)).sum()
        cost = (
            false_positives * false_positive_cost
            + false_negatives * false_negative_cost
        )
        candidate = {
            "threshold": threshold,
            "cost": cost,
            "false_positives": int(false_positives),
            "false_negatives": int(false_negatives),
        }
        if best is None or candidate["cost"] < best["cost"]:
            best = candidate
    return best

This is intentionally plain.
The important part is not the function.
The important part is that threshold selection happens after probability calibration and is tied to a real operating constraint.
Do not let 0.5 become product strategy by accident.
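Two other threshold policies from the examples above can be sketched just as plainly. Both function names are hypothetical. The first assumes calibrated probabilities, constant per-case costs, and zero cost for correct decisions: acting is worthwhile when p * false_negative_cost >= (1 - p) * false_positive_cost, which rearranges to p >= false_positive_cost / (false_positive_cost + false_negative_cost). The second ignores cost and sets the threshold by review capacity.

```python
import numpy as np


def expected_cost_threshold(false_positive_cost, false_negative_cost):
    """Break-even probability, valid only if the probabilities are calibrated."""
    return false_positive_cost / (false_positive_cost + false_negative_cost)


def capacity_threshold(proba, capacity):
    """Score of the capacity-th highest case; flag everything at or above it."""
    proba = np.asarray(proba, dtype=float)
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    if capacity >= proba.size:
        return 0.0  # capacity covers every case
    ranked = np.sort(proba)[::-1]
    # Ties at the cutoff score can push the flagged count above capacity.
    return float(ranked[capacity - 1])


# A false negative that costs 10x a false positive puts the break-even near 0.09.
cost_cutoff = expected_cost_threshold(false_positive_cost=1.0, false_negative_cost=10.0)
print(f"cost-based threshold: {cost_cutoff:.3f}")

scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3])
capacity_cutoff = capacity_threshold(scores, capacity=2)
print(f"capacity-based threshold: {capacity_cutoff}")
```

Neither number is close to 0.5, and neither came from the model. They came from the decision.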
What to persist
For a pure scikit-learn pipeline, joblib is common:
import joblib

joblib.dump(pipe, "classifier_pipeline.joblib")

But the model file alone is not enough.
Persist metadata:
{
  "model_name": "risk_classifier",
  "model_version": "2026-04-27",
  "positive_class": 1,
  "numeric_columns": [
    "age",
    "rx_cost_30d",
    "visits_90d",
    "copay_volatility"
  ],
  "categorical_columns": [
    "store_id",
    "plan_tier",
    "state",
    "channel"
  ],
  "calibration_method": "sigmoid",
  "threshold": 0.31,
  "threshold_policy": "minimum expected cost on time-based validation set",
  "sklearn_version": "1.8.0",
  "training_data_snapshot": "warehouse.table@2026-04-01",
  "feature_contract_hash": "..."
}

Also persist a golden batch:
- a small input table
- expected transformed feature count
- expected probability range
- expected predictions
- expected thresholded actions
The golden batch catches boring but dangerous failures: column order changes, missing feature changes, category handling changes, package version changes, and threshold drift.
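Here is one possible shape for those checks. Everything in it is an assumption for illustration: the helper names `feature_contract_hash` and `check_golden_batch`, the choice of SHA-256 over a sorted JSON payload, and the tiny stand-in model. A real contract hash might also cover dtypes and allowed categories.

```python
import hashlib
import json

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def feature_contract_hash(numeric_columns, categorical_columns):
    """Hash of column names and order; changes if the feature schema changes."""
    contract = {
        "numeric": list(numeric_columns),
        "categorical": list(categorical_columns),
    }
    payload = json.dumps(contract, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def check_golden_batch(model, golden_inputs, expected_proba, threshold, atol=1e-6):
    """Compare current outputs on the golden batch with the recorded ones."""
    proba = model.predict_proba(golden_inputs)[:, 1]
    expected_proba = np.asarray(expected_proba, dtype=float)
    return {
        "probabilities_match": bool(np.allclose(proba, expected_proba, atol=atol)),
        "actions_match": bool(
            np.array_equal(proba >= threshold, expected_proba >= threshold)
        ),
    }


# Stand-in model and data; in practice this is the loaded production pipeline.
X = pd.DataFrame(
    {"age": [25, 40, 60, 35, 52, 70], "visits_90d": [0, 3, 8, 1, 5, 9]}
)
y = np.array([0, 0, 1, 0, 1, 1])
model = LogisticRegression(max_iter=1000).fit(X, y)

golden_inputs = X.iloc[:3]
# Recorded once at release time and stored next to the model file.
expected = model.predict_proba(golden_inputs)[:, 1]

report = check_golden_batch(model, golden_inputs, expected, threshold=0.31)
contract_hash = feature_contract_hash(["age", "visits_90d"], [])
print(report, contract_hash[:12])
```

Run the check in CI and again at service startup, after the artifact has been loaded in the serving environment.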
Production monitoring
At serving time, monitor more than accuracy.
Minimum:
- input row counts
- missingness by feature
- unknown category rates
- score distribution
- probability bucket distribution
- decision rate at the chosen threshold
- calibration when labels arrive
- Brier score when labels arrive
- Average Precision when labels arrive
- latency and failure rate
For imbalanced systems, segment metrics matter.
Overall calibration can look acceptable while one segment is badly miscalibrated. If the model supports decisions across states, stores, plans, channels, or customer groups, inspect those cuts.
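A segment cut can be a single groupby once labels arrive. This sketch uses a hypothetical helper (`calibration_by_segment`) and a toy scored table with invented column names. The numbers are chosen so the overall gap is zero while each state is off by 25 points in opposite directions.

```python
import pandas as pd


def calibration_by_segment(df, segment_col, label_col="label", proba_col="proba"):
    """Observed positive rate vs. mean predicted probability per segment."""
    report = df.groupby(segment_col).agg(
        n=(label_col, "size"),
        observed_rate=(label_col, "mean"),
        mean_predicted=(proba_col, "mean"),
    )
    report["gap"] = report["mean_predicted"] - report["observed_rate"]
    # Largest absolute gap first, so the worst segment is at the top.
    return report.sort_values("gap", key=lambda s: s.abs(), ascending=False)


# Toy scored data with labels (illustrative only).
scored = pd.DataFrame(
    {
        "state": ["TX"] * 4 + ["OH"] * 4,
        "label": [0, 0, 0, 1, 0, 1, 0, 1],
        "proba": [0.5, 0.5, 0.5, 0.5, 0.25, 0.25, 0.25, 0.25],
    }
)

overall_gap = scored["proba"].mean() - scored["label"].mean()
by_state = calibration_by_segment(scored, "state")

print(f"overall gap: {overall_gap:+.3f}")
print(by_state)
```

With rare positives, small segments will show noisy gaps, so read the `n` column before reacting.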
The post Plain-Language Machine Learning Metrics for Real Decisions goes deeper on why AUC, Average Precision, and Brier answer different questions.
When to change the model family
Logistic regression is a good baseline.
It is not a religion.
If relationships are nonlinear, interactions matter, or categorical structure is too rich for a linear model, gradient boosted trees may be stronger. LightGBM, XGBoost, CatBoost, HistGradientBoosting, or RandomForest-style models can improve ranking and capture interactions.
The pipeline thinking still applies.
You still need:
- leakage-safe preprocessing
- class imbalance strategy
- probability calibration
- honest validation
- threshold policy
- persistence metadata
- monitoring
For high-cardinality categorical features, plain one-hot encoding can become unwieldy. Options include:
- grouping rare categories
- hashing
- target encoding with leakage-safe cross-fitting
- model families with stronger categorical support
- learned embeddings in a neural model
Do not swap encoders casually.
High-cardinality features are leakage magnets.
If the encoder learns target statistics, it must be fit inside cross-validation in a way that prevents each row from seeing its own label.
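A hand-rolled sketch of that rule, assuming a hypothetical helper named `out_of_fold_target_encode`: each row is encoded with category means computed on the other folds, so its own label never contributes. Recent scikit-learn releases also ship a `TargetEncoder` that cross-fits during `fit_transform`; check the docs for your installed version before relying on it.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def out_of_fold_target_encode(categories, y, n_splits=5, seed=0):
    """Encode each row with target means learned from the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    y = pd.Series(y, dtype=float).reset_index(drop=True)
    encoded = np.empty(len(y), dtype=float)

    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, apply_idx in folds.split(categories):
        fit_y = y.iloc[fit_idx]
        category_means = fit_y.groupby(categories.iloc[fit_idx]).mean()
        # Categories missing from the fitting folds fall back to the fold's mean.
        fallback = fit_y.mean()
        encoded[apply_idx] = (
            categories.iloc[apply_idx].map(category_means).fillna(fallback).to_numpy()
        )
    return encoded


# Toy store ids and labels (illustrative only).
stores = ["a", "b", "a", "c", "b", "a", "c", "b", "a", "c"]
labels = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]

encoded = out_of_fold_target_encode(stores, labels)

# Flipping a row's own label must not change that row's encoding.
flipped = list(labels)
flipped[0] = 0
encoded_flipped = out_of_fold_target_encode(stores, flipped)
print(encoded[0], encoded_flipped[0])
```

This covers the training rows. At serving time the encoder would use means fit on the full training set, and the whole step still belongs inside the outer cross-validation loop.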
Feature interpretation
For a logistic regression baseline, coefficients can be useful after preprocessing, but one-hot expansion makes interpretation more verbose.
You can inspect fitted feature names:
feature_names = pipe.named_steps["prep"].get_feature_names_out()

Then align coefficients from the calibrated model carefully. Because CalibratedClassifierCV fits multiple estimator-calibrator pairs when cv creates an ensemble, coefficient extraction is less direct than with a single logistic regression.
For model explanation, decide what you need:
- global drivers
- local explanations
- segment-level behavior
- operational reason codes
- compliance-friendly narratives
SHAP can help, especially for tree models, but explanation tooling should be validated too. A beautiful explanation chart that follows a leaky feature is still wrong.
The failure modes
The first failure mode is preprocessing outside the pipeline.
It feels harmless until validation becomes optimistic or production transforms differ.
The second failure mode is optimizing AUC and shipping probabilities.
AUC is ranking. It is not calibration.
The third failure mode is using class_weight="balanced" and assuming probabilities are still calibrated.
Class weighting can help the rare class. It can also shift probability meaning.
The fourth failure mode is isotonic calibration on too few positives.
Flexible calibration can memorize noise.
The fifth failure mode is thresholding before the business decision is defined.
Thresholds should reflect cost, capacity, risk, or policy. They should not be inherited from a default.
The sixth failure mode is saving only the model object.
Production needs the pipeline, metadata, feature contract, threshold, versions, and golden batch.
The pattern in one sentence
Fit the whole decision pipeline, not just the classifier.
The preprocessing has to be leakage-safe.
The imbalance strategy has to match the metric.
The calibration has to make probabilities mean what they say.
The threshold has to match the business action.
The artifact has to carry the schema and metadata into production.
That is the difference between a model that scores records and a model that can support decisions.
Related notes
- When 0.3 Does Not Mean 30 Percent
- Plain-Language Machine Learning Metrics for Real Decisions
- The Preprocessing Boundary Between scikit-learn and PyTorch
- From Algorithms to AI Systems
- NPS Classifier Calibration and Drift