A scikit-learn Pipeline for Calibrated Decisions

Outcome focus: Defined an end-to-end scikit-learn classification pipeline that keeps preprocessing, imbalance handling, probability calibration, evaluation, thresholding, and production artifacts aligned.

Most tabular classifiers do not fail because the algorithm is exotic.

They fail because the pipeline is loose.

The notebook imputes missing values before the split. The production job encodes categories differently. The model handles class imbalance by changing weights, but nobody checks whether the probabilities still mean what they say. The team picks a threshold because 0.5 feels neutral. A dashboard shows AUC, while the actual workflow depends on whether 0.27 behaves like 27 percent.

That is not a modeling problem in isolation.

It is a contract problem.

For many business classifiers, especially risk, propensity, churn, NPS, eligibility, fraud, and outreach models, the real artifact is not only the estimator. The artifact is:

the preprocessing graph
the classifier
the calibration method
the evaluation protocol
the operating threshold
the feature schema
the persistence format
the monitoring plan

If those pieces drift apart, the model can keep returning numbers while the decision system quietly gets worse.

This is the scikit-learn pattern I reach for when the data is mixed, the positive class is rare, and the output probability has to support a real decision.

It is not fancy. That is what makes it survive a year in production.

The shape of the problem#

Assume a binary classifier with structured data.

Some features are numeric:

age
recent cost
visits in the last 90 days
volatility measures
rejection counts
time-windowed behavior

Some features are categorical:

state
channel
plan tier
store id
product group
customer segment

The positive class is rare. Maybe only 3 percent of customers are severe detractors. Maybe only 5 percent churn. Maybe only 2 percent have the operational event the team wants to prevent.

The business does not only need labels.

It needs ranked risk and probability-like scores.

That means the pipeline has to do more than maximize accuracy.

It has to produce a score that can be trusted enough to drive thresholded action.

The baseline pattern#

Here is the current scikit-learn version of the pattern.

One small detail matters: use estimator= for CalibratedClassifierCV. Older examples often use base_estimator=, but scikit-learn deprecated that parameter in 1.2 and removed it in 1.4, so current docs use estimator.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    brier_score_loss,
    roc_auc_score,
)
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
 
num_cols = [
    "age",
    "rx_cost_30d",
    "visits_90d",
    "copay_volatility",
]
 
cat_cols = [
    "store_id",
    "plan_tier",
    "state",
    "channel",
]
 
num_prep = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)
 
cat_prep = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
)
 
prep = ColumnTransformer(
    transformers=[
        ("num", num_prep, num_cols),
        ("cat", cat_prep, cat_cols),
    ],
    remainder="drop",
)
 
base_clf = LogisticRegression(
    class_weight="balanced",
    max_iter=2000,
    solver="lbfgs",
)
 
calibrated_clf = CalibratedClassifierCV(
    estimator=base_clf,
    method="sigmoid",
    cv=5,
)
 
pipe = Pipeline(
    steps=[
        ("prep", prep),
        ("model", calibrated_clf),
    ]
)
 
cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42,
)
 
scoring = {
    "roc_auc": "roc_auc",
    "average_precision": "average_precision",
    "brier": "neg_brier_score",
}
 
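# X is assumed to be a pandas DataFrame holding num_cols and cat_cols, and y
# the binary target. Both should exclude the final test set used later.
results = cross_validate(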
    pipe,
    X,
    y,
    cv=cv,
    scoring=scoring,
    n_jobs=-1,
    return_train_score=False,
)
 
summary = {
    metric: results[f"test_{metric}"].mean()
    for metric in scoring
}
 
summary["brier"] = -summary["brier"]
 
print(summary)

This is not the end of the project.

It is the beginning of a clean baseline.

Why the preprocessing belongs inside the pipeline#

The ColumnTransformer is the quiet hero here.

It makes preprocessing part of the model contract.

Numeric columns get median imputation and scaling. Categorical columns get most-frequent imputation and one-hot encoding. The fitted imputation values, scaler parameters, one-hot category mapping, and feature ordering stay attached to the pipeline.

That matters because preprocessing leakage is easy.

If you impute before splitting, your validation fold has already influenced the training transform. If you one-hot encode outside the pipeline, feature ordering can change between training and serving. If you fill missing values differently in production, the model may receive legal inputs with shifted meaning.

None of this has to crash.

It can just be wrong.

Keeping preprocessing inside the pipeline means each cross-validation fold fits its own preprocessing on the training portion only. The held-out fold is transformed by that fold's fitted preprocessing graph.

That gives you a more honest estimate.

It also gives production one artifact to load.
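To make the leakage concrete, here is a minimal sketch with made-up toy values (not data from any real project). It compares the fill value a median imputer learns when it sees every row against the value it learns from the training rows only, which is what a Pipeline does inside each fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column: three "training" rows followed by two "validation" rows.
values = np.array([[1.0], [2.0], [np.nan], [100.0], [200.0]])
train, valid = values[:3], values[3:]

# Leaky: the imputer sees the validation rows before the split.
leaky = SimpleImputer(strategy="median").fit(values)

# Leakage-safe: the imputer is fit on the training rows only.
safe = SimpleImputer(strategy="median").fit(train)

leaky_fill = float(leaky.statistics_[0])  # median of 1, 2, 100, 200
safe_fill = float(safe.statistics_[0])    # median of 1, 2

print(leaky_fill, safe_fill)
```

The leaky imputer fills the missing training value with a number that only exists because of the validation rows. Nothing crashes. The estimate is just less honest.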

This connects directly to the broader artifact-contract point in The Preprocessing Boundary Between scikit-learn and PyTorch. Whether the downstream model is PyTorch or scikit-learn, feature semantics need to travel with the model.

Why `handle_unknown="ignore"` matters#

Categorical data changes.

New stores appear. New plan tiers get added. New channels show up. A state value arrives in a different case. A product code appears that was not in the training set.

If the encoder is strict, production can fail on an unseen category.

OneHotEncoder(handle_unknown="ignore") avoids that failure by producing zeros for unknown categories in that feature group.

That does not mean unknown categories are harmless.

It means the pipeline degrades instead of crashing.

You should still monitor unknown category rates. A sudden rise can mean a schema change, upstream data issue, new business behavior, or a training set that is no longer representative.

The operational rule is:

Do not let unseen categories crash serving.

Do not ignore them in monitoring.

Why class imbalance changes the conversation#

Class imbalance is not just an annoyance.

It changes what "good" means.

If 3 percent of examples are positive, a model can be 97 percent accurate by predicting every case as negative. That model is useless for finding the people who need action.

class_weight="balanced" is a strong baseline because it adjusts class weights inversely to class frequency. scikit-learn documents the formula as:

n_samples / (n_classes * np.bincount(y))

That makes mistakes on the rare class count more during training.
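As a worked example of that formula, the sketch below uses a toy target with 97 negatives and 3 positives and checks the manual calculation against scikit-learn's compute_class_weight helper. The label counts are illustrative only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target: 3 percent positive.
y = np.array([0] * 97 + [1] * 3)

# The documented formula, written out by hand.
manual = len(y) / (2 * np.bincount(y))

# The same weights from scikit-learn's helper.
weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y,
)

print(dict(zip([0, 1], weights.round(3))))
```

Each positive ends up weighted roughly 32 times as heavily as each negative (100 / 6 against 100 / 194), which is the ratio of the class counts, 97 to 3.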

But class weighting has a cost.

It often improves recall on the rare class, but it can distort probability estimates. The model was trained under a reweighted loss, so its scores behave as if positives were far more common than they are, while the real-world base rate is still the real-world base rate.

That is why calibration matters.

The model may rank risk correctly while reporting probabilities that are too high or too low.
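A quick way to see the distortion is to compare the average predicted probability with the base rate. This is a sketch on synthetic data from make_classification, so the exact numbers are not meaningful. It uses in-sample predictions on purpose, only to isolate the effect of the weights.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with roughly 3 percent positives.
X, y = make_classification(
    n_samples=20000,
    n_features=8,
    n_informative=4,
    weights=[0.97],
    random_state=0,
)
base_rate = y.mean()

plain = LogisticRegression(max_iter=2000).fit(X, y)
weighted = LogisticRegression(class_weight="balanced", max_iter=2000).fit(X, y)

plain_mean = plain.predict_proba(X)[:, 1].mean()
weighted_mean = weighted.predict_proba(X)[:, 1].mean()

print(f"base rate            {base_rate:.3f}")
print(f"unweighted mean      {plain_mean:.3f}")
print(f"balanced-weight mean {weighted_mean:.3f}")
```

The unweighted model's average score should sit close to the base rate, while the balanced model's average score should sit well above it. The ranking may be fine in both cases. The meaning of the number is not.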

If the downstream workflow only needs the top 500 cases every week, ranking may be enough. If the workflow uses expected value, SLA risk, clinical escalation, retention spend, support triage, or cost-sensitive thresholds, probability quality matters.

I wrote more about that distinction in When 0.3 Does Not Mean 30 Percent.

Why calibration belongs in the design#

scikit-learn's calibration docs define a well-calibrated classifier plainly: when a classifier assigns probability near 0.8, roughly 80 percent of those samples should belong to the positive class in the long run.

That is the whole idea.

AUC does not tell you that.

AUC tells you whether positives tend to score higher than negatives. Average Precision tells you more about ranking under imbalance. Brier score gives you a probability-quality signal. Calibration curves and reliability diagrams show whether the score means what it says.
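The numbers behind a reliability diagram come from sklearn.calibration.calibration_curve. The sketch below prints them as a table for an uncalibrated, class-weighted model on synthetic data. The dataset and the random split are stand-ins, so treat the output as illustrative.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20000,
    n_features=8,
    n_informative=4,
    weights=[0.95],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(class_weight="balanced", max_iter=2000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

# Quantile bins keep the row count per bin roughly equal,
# which helps when most scores cluster near zero.
observed, predicted = calibration_curve(
    y_test, proba, n_bins=10, strategy="quantile"
)

for mean_pred, obs_rate in zip(predicted, observed):
    print(f"mean predicted {mean_pred:.3f}   observed rate {obs_rate:.3f}")
```

For a well-calibrated model the two columns track each other. Here the predicted column should run higher than the observed column, which is the class-weighting effect showing up bucket by bucket.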

In this pipeline, CalibratedClassifierCV wraps the base classifier.

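With cv=5, scikit-learn fits copies of the estimator on training folds and calibrates each one on its held-out fold, then averages the calibrated probabilities across those pairs at prediction time (the default ensemble=True behavior). That keeps each calibrator from being fit on rows its classifier was trained on.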

There are two common methods:

sigmoid, also called Platt scaling, fits a logistic mapping from model score to probability.
isotonic fits a more flexible monotonic calibration curve.

My default is sigmoid unless I have enough calibration data and a reason to use isotonic.

The scikit-learn docs warn that isotonic calibration can overfit with too few calibration samples. That is especially important on imbalanced data. "I have 50,000 rows" is not the relevant number. "I have enough positive examples inside calibration folds" is the relevant number.

If positives are rare, isotonic can learn a jagged curve from noise.

Use flexibility when the data can support it.
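Before choosing isotonic, it helps to count the positives each calibration fold will actually see. This sketch uses a toy target of 1,000 rows at 3 percent positive and the same StratifiedKFold settings as the baseline. The feature matrix is a placeholder because only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy target: 30 positives in 1,000 rows.
y = np.array([1] * 30 + [0] * 970)
X = np.zeros((len(y), 1))  # placeholder features

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The held-out part of each split is what the calibrator is fit on.
positives_per_fold = [
    int(y[calibration_idx].sum())
    for _, calibration_idx in cv.split(X, y)
]

print(positives_per_fold)
```

A thousand rows sounds like plenty, but each calibrator here sees only a handful of positives. That is a thin basis for a flexible step function.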

A version with isotonic calibration#

If you have enough data, the code change is small:

calibrated_clf = CalibratedClassifierCV(
    estimator=base_clf,
    method="isotonic",
    cv=5,
)

The evaluation burden is larger.

When switching from sigmoid to isotonic, compare:

Brier score
log loss
reliability diagram
calibration by score bucket
calibration by segment
threshold stability
performance on a time-based holdout

Do not switch because the training chart looks smoother.

Switch because the calibrated probabilities are better on data that simulates the way the model will be used.

Cross-validation is not the same as the final artifact#

cross_validate tells you how the pipeline behaves across folds.

It does not produce the final model you deploy.

After evaluation, fit the full pipeline on the training set that you have approved for release:

pipe.fit(X_train, y_train)

Then evaluate once on a final untouched test set or a time-based holdout:

proba = pipe.predict_proba(X_test)[:, 1]
 
metrics = {
    "roc_auc": roc_auc_score(y_test, proba),
    "average_precision": average_precision_score(y_test, proba),
    "brier": brier_score_loss(y_test, proba),
}
 
print(metrics)

For operational models, I prefer time-based holdouts when the data has natural time.

Random folds answer one question:

How well does this model generalize across examples drawn from the same period?

Time-based holdouts answer another:

How well does this model survive the future?

Most business models care more about the second question.
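A time-based holdout can be as simple as a date cutoff. In this sketch the column name event_date, the cutoff, and the toy rows are all hypothetical. The point is only that every training row precedes every holdout row.

```python
import pandas as pd

# Toy frame: one row per month, January through October 2025.
df = pd.DataFrame({
    "event_date": pd.date_range("2025-01-01", periods=10, freq="MS"),
    "visits_90d": range(10),
    "label": [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
})

cutoff = pd.Timestamp("2025-08-01")

# Train on everything before the cutoff, hold out everything from it onward.
train = df[df["event_date"] < cutoff]
holdout = df[df["event_date"] >= cutoff]

print(len(train), len(holdout))
print(train["event_date"].max(), holdout["event_date"].min())
```

For cross-validation rather than a single split, scikit-learn's TimeSeriesSplit follows the same idea of always validating on later rows than it trains on.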

Thresholding comes after calibration#

A calibrated model still does not decide what action to take.

It produces a probability.

The threshold is a policy.

For example, if a false negative is 10 times more expensive than a false positive, the threshold should reflect that. If an operations team can only review 1,000 cases per week, the threshold may be set by capacity. If outreach has a hard cost, the threshold should be tied to expected value.

Here is a simple cost sweep:

import numpy as np
 
def choose_threshold(
    y_true,
    proba,
    false_positive_cost,
    false_negative_cost,
):
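    # Coerce inputs so lists and pandas Series compare element-wise.
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    thresholds = np.linspace(0.01, 0.99, 99)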
    best = None
 
    for threshold in thresholds:
        pred = proba >= threshold
 
        false_positives = ((pred == 1) & (y_true == 0)).sum()
        false_negatives = ((pred == 0) & (y_true == 1)).sum()
 
        cost = (
            false_positives * false_positive_cost
            + false_negatives * false_negative_cost
        )
 
        candidate = {
            "threshold": threshold,
            "cost": cost,
            "false_positives": int(false_positives),
            "false_negatives": int(false_negatives),
        }
 
        if best is None or candidate["cost"] < best["cost"]:
            best = candidate
 
    return best

This is intentionally plain.

The important part is not the function.

The important part is that threshold selection happens after probability calibration and is tied to a real operating constraint.

Do not let 0.5 become product strategy by accident.
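Two of the policies mentioned earlier have closed forms that make a useful cross-check on any sweep. The sketch below is built on assumptions: the function names are hypothetical, the expected-cost rule assumes calibrated probabilities and zero cost for correct decisions, and the capacity rule assumes a capacity of at least one case.

```python
import numpy as np


def expected_cost_threshold(false_positive_cost, false_negative_cost):
    # Acting is worth it when p * FN_cost > (1 - p) * FP_cost,
    # which rearranges to p > FP_cost / (FP_cost + FN_cost).
    return false_positive_cost / (false_positive_cost + false_negative_cost)


def capacity_threshold(proba, capacity):
    # Score of the capacity-th highest case. Flagging proba >= this value
    # selects exactly `capacity` cases when there are no ties at the cutoff.
    ranked = np.sort(np.asarray(proba))[::-1]
    capacity = min(capacity, len(ranked))
    return float(ranked[capacity - 1])


scores = np.array([0.02, 0.40, 0.11, 0.75, 0.05, 0.31])

# A false negative that costs 10 times a false positive.
t_cost = expected_cost_threshold(1.0, 10.0)

# A team that can review three cases.
t_capacity = capacity_threshold(scores, capacity=3)
flagged = int((scores >= t_capacity).sum())

print(round(t_cost, 4), t_capacity, flagged)
```

With a 10-to-1 cost ratio the expected-cost threshold is 1/11, about 0.09, which is nowhere near 0.5. If an empirical sweep on calibrated scores lands far from that value, it is worth asking whether the probabilities or the cost assumptions are off.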

What to persist#

For a pure scikit-learn pipeline, joblib is common:

import joblib
 
joblib.dump(pipe, "classifier_pipeline.joblib")

But the model file alone is not enough.

Persist metadata:

{
  "model_name": "risk_classifier",
  "model_version": "2026-04-27",
  "positive_class": 1,
  "numeric_columns": [
    "age",
    "rx_cost_30d",
    "visits_90d",
    "copay_volatility"
  ],
  "categorical_columns": [
    "store_id",
    "plan_tier",
    "state",
    "channel"
  ],
  "calibration_method": "sigmoid",
  "threshold": 0.31,
  "threshold_policy": "minimum expected cost on time-based validation set",
  "sklearn_version": "1.8.0",
  "training_data_snapshot": "warehouse.table@2026-04-01",
  "feature_contract_hash": "..."
}
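Writing the metadata next to the model file can be sketched like this. The stand-in pipeline, the file names, and the hashing scheme for feature_contract_hash (SHA-256 over the ordered column names) are assumptions for illustration, not a fixed convention.

```python
import hashlib
import json
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_columns = ["age", "rx_cost_30d", "visits_90d", "copay_volatility"]

# Stand-in pipeline so the sketch runs on its own.
pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(
    [[30, 10.0, 1, 0.1], [60, 80.0, 4, 0.9], [45, 20.0, 2, 0.3], [70, 95.0, 6, 0.8]],
    [0, 1, 0, 1],
)

metadata = {
    "model_name": "risk_classifier",
    "feature_columns": feature_columns,
    "threshold": 0.31,
    "sklearn_version": sklearn.__version__,
    # Hypothetical contract hash: SHA-256 over the ordered column names.
    "feature_contract_hash": hashlib.sha256(
        "|".join(feature_columns).encode("utf-8")
    ).hexdigest(),
}

out_dir = Path(tempfile.mkdtemp())
model_path = out_dir / "classifier_pipeline.joblib"
meta_path = out_dir / "classifier_pipeline.json"

joblib.dump(pipe, str(model_path))
meta_path.write_text(json.dumps(metadata, indent=2))

# At load time, check the recorded version before unpickling the model.
loaded_meta = json.loads(meta_path.read_text())
if loaded_meta["sklearn_version"] != sklearn.__version__:
    raise RuntimeError("scikit-learn version differs from the training environment")
loaded_pipe = joblib.load(str(model_path))
```

Recording sklearn.__version__ at save time matters because pickled estimators are not guaranteed to load correctly under a different scikit-learn version.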

Also persist a golden batch:

a small input table
expected transformed feature count
expected probability range
expected predictions
expected thresholded actions

The golden batch catches boring but dangerous failures: column order changes, missing feature changes, category handling changes, package version changes, and threshold drift.
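A golden-batch check can be a small function that runs in CI and at deploy time. The helper name check_golden_batch, the tolerance, and the toy pipeline below are hypothetical. In practice the expected values would be recorded once at release and stored with the artifact.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def check_golden_batch(
    pipeline, golden_inputs, expected_proba, threshold, expected_actions, atol=1e-6
):
    """Raise if probabilities or thresholded actions differ from the record."""
    proba = pipeline.predict_proba(golden_inputs)[:, 1]
    if not np.allclose(proba, expected_proba, atol=atol):
        raise AssertionError("golden batch probabilities drifted")
    actions = (proba >= threshold).astype(int)
    if not np.array_equal(actions, expected_actions):
        raise AssertionError("golden batch actions changed")
    return True


# Stand-in training data and pipeline so the sketch runs on its own.
train = pd.DataFrame({
    "age": [30, 60, 45, 70, 25, 65],
    "visits_90d": [1, 4, 2, 6, 0, 5],
})
labels = [0, 1, 0, 1, 0, 1]
pipe = make_pipeline(StandardScaler(), LogisticRegression()).fit(train, labels)

# Recorded at release time.
golden_inputs = train.iloc[:3]
threshold = 0.31
expected_proba = pipe.predict_proba(golden_inputs)[:, 1]
expected_actions = (expected_proba >= threshold).astype(int)

# Run again after every rebuild, dependency bump, or redeploy.
ok = check_golden_batch(
    pipe, golden_inputs, expected_proba, threshold, expected_actions
)
print(ok)
```

The check is deliberately strict about probabilities and actions separately, since a small numeric drift can leave the scores close while flipping a decision near the threshold.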

Production monitoring#

At serving time, monitor more than accuracy.

Minimum:

input row counts
missingness by feature
unknown category rates
score distribution
probability bucket distribution
decision rate at the chosen threshold
calibration when labels arrive
Brier score when labels arrive
Average Precision when labels arrive
latency and failure rate

For imbalanced systems, segment metrics matter.

Overall calibration can look acceptable while one segment is badly miscalibrated. If the model supports decisions across states, stores, plans, channels, or customer groups, inspect those cuts.
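The segment cut is a plain groupby once labels have arrived. This sketch uses a made-up scored table with hypothetical column names (state, proba, label), built so that the overall numbers agree while the segments do not.

```python
import pandas as pd

# Toy scored records with outcomes attached.
scored = pd.DataFrame({
    "state": ["TX", "TX", "TX", "TX", "CA", "CA", "CA", "CA"],
    "proba": [0.10, 0.20, 0.30, 0.40, 0.10, 0.20, 0.30, 0.40],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

overall_predicted = scored["proba"].mean()
overall_observed = scored["label"].mean()

by_segment = scored.groupby("state").agg(
    n=("label", "size"),
    mean_predicted=("proba", "mean"),
    observed_rate=("label", "mean"),
)
# Positive gap: the model under-predicts for the segment.
by_segment["gap"] = by_segment["observed_rate"] - by_segment["mean_predicted"]

print(overall_predicted, overall_observed)
print(by_segment)
```

Overall, the mean prediction and the observed rate are both 0.25. By state, one segment is over-predicted by 25 points and the other under-predicted by 25 points, and the errors cancel in the total.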

The post Plain-Language Machine Learning Metrics for Real Decisions goes deeper on why AUC, Average Precision, and Brier answer different questions.

When to change the model family#

Logistic regression is a good baseline.

It is not a religion.

If relationships are nonlinear, interactions matter, or categorical structure is too rich for a linear model, gradient-boosted trees or other tree ensembles may be stronger. LightGBM, XGBoost, CatBoost, scikit-learn's HistGradientBoostingClassifier, or random-forest-style models can improve ranking and capture interactions.

The pipeline thinking still applies.

You still need:

leakage-safe preprocessing
class imbalance strategy
probability calibration
honest validation
threshold policy
persistence metadata
monitoring

For high-cardinality categorical features, plain one-hot encoding can become unwieldy. Options include:

grouping rare categories
hashing
target encoding with leakage-safe cross-fitting
model families with stronger categorical support
learned embeddings in a neural model

Do not swap encoders casually.

High-cardinality features are leakage magnets.

If the encoder learns target statistics, it must be fit inside cross-validation in a way that prevents each row from seeing its own label.

Feature interpretation#

For a logistic regression baseline, coefficients can be useful after preprocessing, but one-hot expansion makes interpretation more verbose.

You can inspect fitted feature names:

feature_names = pipe.named_steps["prep"].get_feature_names_out()

Then align coefficients from the calibrated model carefully. Because CalibratedClassifierCV fits multiple estimator-calibrator pairs when cv creates an ensemble, coefficient extraction is less direct than with a single logistic regression.

For model explanation, decide what you need:

global drivers
local explanations
segment-level behavior
operational reason codes
compliance-friendly narratives

SHAP can help, especially for tree models, but explanation tooling should be validated too. A beautiful explanation chart that follows a leaky feature is still wrong.

The failure modes#

The first failure mode is preprocessing outside the pipeline.

It feels harmless until validation becomes optimistic or production transforms differ.

The second failure mode is optimizing AUC and shipping probabilities.

AUC is ranking. It is not calibration.

The third failure mode is using class_weight="balanced" and assuming probabilities are still calibrated.

Class weighting can help the rare class. It can also shift probability meaning.

The fourth failure mode is isotonic calibration on too few positives.

Flexible calibration can memorize noise.

The fifth failure mode is thresholding before the business decision is defined.

Thresholds should reflect cost, capacity, risk, or policy. They should not be inherited from a default.

The sixth failure mode is saving only the model object.

Production needs the pipeline, metadata, feature contract, threshold, versions, and golden batch.

The pattern in one sentence#

Fit the whole decision pipeline, not just the classifier.

The preprocessing has to be leakage-safe.

The imbalance strategy has to match the metric.

The calibration has to make probabilities mean what they say.

The threshold has to match the business action.

The artifact has to carry the schema and metadata into production.

That is the difference between a model that scores records and a model that can support decisions.
