When the Model Should Say It Doesn't Know: Conformal Prediction Sets with MAPIE

Outcome focus: Added coverage-guaranteed prediction sets and operational abstention gates to a classification pipeline, cutting acted-upon error rate without retraining the model.

A model that is confidently wrong is worse than a model that abstains.

The default output of a classifier is a single predicted label — the argmax of the softmax. That label carries no coverage guarantee. The probability score next to it is not a reliable frequency estimate; most models are trained to maximize accuracy, not to produce calibrated confidence, so "0.82 probability" does not mean the model is right 82% of the time. When someone downstream treats that number as a signal and acts on it, they have no real information about what fraction of those decisions will be correct.

The question worth asking before you ship a classifier into any decision-making process: what happens when the model is wrong and confident? Is there a mechanism for the system to say "I don't know"?

Conformal prediction is one rigorous answer. Instead of a single label, it outputs a prediction set such as {cat, fox} with a provable coverage guarantee: for whatever target you set, the true label will be inside the set at least that fraction of the time, averaged over random test examples. On easy inputs the set is small (often just one label). On hard inputs it grows. The size of the set is doing real work: a set of five labels at 95% coverage tells you the model cannot narrow this input down, while a set of one label at the same coverage tells you it can.

Why a confidence threshold is not enough

The naive alternative to conformal prediction is to threshold on the softmax maximum: if max(softmax) < 0.7, abstain. It is intuitive and easy to implement. It also fails in ways that are hard to detect.

Modern neural networks are frequently overconfident. A ResNet trained on ImageNet will assign 0.9+ confidence to out-of-distribution inputs with no hesitation. Platt scaling or temperature scaling can improve calibration, but even a well-calibrated model does not give you a coverage guarantee. If you set a threshold to achieve 95% coverage on your validation set, there is nothing preventing that coverage from dropping to 88% on next quarter's data if the input distribution shifts. You have a heuristic, not a guarantee.

Conformal prediction is distribution-free and finite-sample valid. The guarantee holds for any model, any data distribution, any sample size, without asymptotic assumptions. The only requirement is that the calibration set and the test set are exchangeable, which is satisfied when both are drawn independently from the same distribution. That is a real assumption, and violating it (covariate shift at deployment) will erode coverage, but you get to reason about that explicitly rather than discover it in production metrics.

The calibration split

The coverage guarantee lives in the calibration set. The logic is: for each calibration example, compute a nonconformity score measuring how surprising the true label is given the model's output. Then set a threshold at the (1-α)(1 + 1/n_cal) empirical quantile of those scores, which in practice means taking the ⌈(n_cal + 1)(1-α)⌉-th smallest score. At test time, predict every label whose nonconformity score is at or below that threshold.
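To make the quantile rule concrete, here is a minimal sketch in plain Python. The function names and the uniform stand-in scores are illustrative assumptions, not part of MAPIE; the only point is that the finite-sample correction yields coverage at or just above 1 - α when calibration and test scores come from the same distribution.

```python
import math
import random


def conformal_threshold(cal_scores, alpha):
    """Return the finite-sample-corrected (1 - alpha) quantile of the scores."""
    n = len(cal_scores)
    # Rank of the order statistic to use: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # threshold (every label is included) keeps the guarantee.
        return math.inf
    return sorted(cal_scores)[rank - 1]


def simulate_coverage(n_cal=200, alpha=0.05, n_trials=5000, seed=0):
    """Fraction of trials where a fresh score falls at or below the threshold."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(n_trials):
        # Stand-in scores: any continuous distribution works, as long as the
        # calibration and test scores are drawn from the same one.
        cal_scores = [rng.random() for _ in range(n_cal)]
        test_score = rng.random()
        if test_score <= conformal_threshold(cal_scores, alpha):
            covered += 1
    return covered / n_trials


print(f"Simulated coverage: {simulate_coverage():.3f}")
```

With 200 calibration scores and α = 0.05 the simulated coverage should land close to 0.95. A smaller calibration set does not invalidate the guarantee; it only makes the threshold coarser.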

APS (Adaptive Prediction Sets) defines the nonconformity score in a way that adapts to example difficulty. Sort the classes by softmax probability descending. Sum the probabilities until you have accumulated enough to include the true label. That cumulative sum is the nonconformity score for that calibration example. Hard examples — where the true label is buried low in the softmax ranking — have high nonconformity scores. The threshold you extract from the calibration set automatically adjusts for this.
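A tiny worked example of that score, as a sketch. This is the deterministic version exactly as described above; library implementations may add a randomization step, and the function name and probabilities here are made up for illustration.

```python
def aps_score(probs, true_label):
    """Cumulative probability mass needed to reach the true label.

    probs: class probabilities for one example (they should sum to 1).
    true_label: integer index of the correct class.
    """
    # Rank classes from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda k: probs[k], reverse=True)
    total = 0.0
    for k in ranked:
        total += probs[k]
        if k == true_label:
            return total
    raise ValueError("true_label is not a valid class index")


easy = [0.90, 0.07, 0.03]  # true class 0 is ranked first
hard = [0.50, 0.30, 0.20]  # true class 2 is ranked last

print(aps_score(easy, true_label=0))  # mass of the top class only
print(aps_score(hard, true_label=2))  # mass of all three classes
```

The easy example scores low because the true label is reached immediately; the hard one scores high because every other class has to be accumulated first.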

The calibration split has to be clean: no calibration examples in training, no training examples in calibration. Computing nonconformity scores on examples the model was fit on breaks the exchangeability assumption, because the model looks better on data it has already seen, and the guarantee collapses. This is the thing that most tutorials gloss over.

conformal_setup.py

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from mapie.classification import MapieClassifier
import numpy as np
 
# X and y are assumed to be your feature matrix and integer label vector.
# Three-way split: train the model on train, compute nonconformity scores on cal,
# evaluate coverage claims on test. Never let cal examples touch the fit.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=0
)
 
base = LogisticRegression(max_iter=1000, C=1.0)
base.fit(X_train, y_train)
 
# cv="prefit" tells MAPIE the base estimator is already trained.
# The fit() call on (X_cal, y_cal) only computes nonconformity scores.
# Note: MapieClassifier is the MAPIE 0.x class name; later releases reorganized the API.
mapie = MapieClassifier(estimator=base, method="aps", cv="prefit")
mapie.fit(X_cal, y_cal)

cv="prefit" is the right flag when you train the base model separately, which you almost always do if the model is a PyTorch network rather than a scikit-learn estimator. MAPIE then uses the calibration fit only to compute and store the nonconformity scores; the quantile for a given alpha is taken from those scores at predict time.

Reading prediction sets

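When called with an alpha, mapie.predict returns two things: the argmax prediction (same as calling base.predict) and the prediction sets as a boolean array of shape (n_samples, n_classes, n_alpha), where True means that label is in the set. With a single alpha the last axis has length one, so slice it off before counting labels.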

prediction_sets.py

# alpha=0.05 targets 95% marginal coverage.
# include_last_label=True includes the label that pushes cumulative prob over threshold;
# exclude it for slightly smaller sets at the cost of a hair under the target coverage.
y_point, y_set = mapie.predict(X_test, alpha=0.05, include_last_label=True)
# Sets come back with shape (n_samples, n_classes, n_alpha); keep the single alpha
# so that y_set is a 2-D boolean matrix.
y_set = y_set[:, :, 0]
 
n_classes = y_set.shape[1]
set_sizes = y_set.sum(axis=1)
 
print(f"Mean set size:        {set_sizes.mean():.2f}")
print(f"Single-label sets:    {(set_sizes == 1).mean():.1%}")
print(f"Full-set abstentions: {(set_sizes == n_classes).mean():.1%}")
 
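# Empirical coverage check: should land close to or above 0.95 on a held-out test set
# drawn from the same distribution as the calibration set (expect some sampling noise).
# Indexing by y_test assumes labels are integers 0..n_classes-1 in column order.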
in_set = y_set[np.arange(len(y_test)), y_test]
print(f"Empirical coverage:   {in_set.mean():.3f}")

The empirical coverage line is the sanity check you always run first. On a finite test set it will fluctuate around the target by an amount that depends on test-set size, but if it falls clearly below 0.95, by more than sampling noise can explain, either the calibration and test distributions are different or something in the split is contaminated. I have seen calibration contamination from stratified splits that overlapped with training data: numpy's random state was reset mid-script and the same indices got reused. The coverage check catches it immediately.
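A cheap complement to the coverage check is to assert that the splits are disjoint before fitting anything. The helper below is a hypothetical sketch, not something MAPIE or scikit-learn provides; it assumes you split an array of row indices (rather than the data itself) so that there is something to compare.

```python
def assert_disjoint_splits(**splits):
    """Raise ValueError if any example index appears in more than one split."""
    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(splits[first]) & set(splits[second])
            if shared:
                raise ValueError(
                    f"{len(shared)} indices appear in both {first!r} and "
                    f"{second!r}, e.g. {sorted(shared)[:5]}"
                )


# Hypothetical row indices for each split.
train_idx = range(0, 700)
cal_idx = range(700, 850)
test_idx = range(850, 1000)

# Passes silently when no index is shared between splits.
assert_disjoint_splits(train=train_idx, cal=cal_idx, test=test_idx)
```

Running this right after the split turns a silent contamination into a loud failure, before any coverage number is computed.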

Mean set size tells you whether the model is actually resolving predictions or hedging. A mean set size of 1.2 means the model is confident on most examples. A mean set size of 4.0 on a ten-class problem means the model is genuinely uncertain on most of the data and you should think about whether the feature set is adequate.
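The mean can hide a lot, so it helps to look at the actual sets and the full size distribution. A minimal sketch in plain Python, using a hand-written boolean matrix and made-up class names in place of real MAPIE output:

```python
from collections import Counter

# Hypothetical class names and a hand-written boolean set matrix
# (rows = examples, columns = classes), standing in for MAPIE output.
class_names = ["cat", "dog", "fox"]
y_set = [
    [True, False, False],
    [True, False, True],
    [False, True, False],
    [True, True, True],
]


def labels_in_set(row, names):
    """Names of the classes whose column is True for one example."""
    return [name for name, included in zip(names, row) if included]


def set_size_distribution(rows):
    """Map each set size to the fraction of examples with that size."""
    counts = Counter(sum(row) for row in rows)
    return {size: counts[size] / len(rows) for size in sorted(counts)}


for row in y_set:
    print(labels_in_set(row, class_names))
print(set_size_distribution(y_set))
```

A distribution with most of its mass at size one and a thin tail of large sets is a very different situation from one where every example gets two or three labels, even if the means are similar.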

Temperature scaling before conformal

Conformal prediction sets are valid without calibrated probabilities — the guarantee holds regardless of how well-calibrated the underlying softmax is. But calibration matters for how useful the sets are in practice.

Poorly calibrated probabilities produce uneven set sizes. If the model is systematically overconfident, the nonconformity scores will be squeezed into a narrow range for most examples and separate only for the worst cases. The resulting sets will be deceptively small on examples where the model is confidently wrong. Temperature scaling before building the conformal wrapper gives more informative set sizes without breaking the coverage guarantee.

Temperature scaling fits a single scalar T on the calibration set by minimizing cross-entropy on raw logits with model weights frozen. T > 1 softens the distribution — reduces overconfidence. T < 1 sharpens it. The argmax never changes.
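The effect of T is easy to see on a single example. This is a sketch with made-up logits and a hypothetical helper name, using only the standard library:

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.5]  # made-up logits for one example

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    top = probs.index(max(probs))
    print(f"T={t}: argmax={top}, max prob={max(probs):.3f}")
```

The top class stays the same at every temperature; only the amount of probability mass assigned to it changes, shrinking as T grows.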

temperature_scaling.py

import torch
import torch.nn as nn
 
class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        # Initialize to T=1 (no rescaling). log_T=0 → exp(0)=1.
        self.log_T = nn.Parameter(torch.zeros(1))
 
    def forward(self, logits):
        return logits / self.log_T.exp()
 
    @property
    def temperature(self):
        return self.log_T.exp().item()
 
 
def fit_temperature(model, cal_loader, device="cuda"):
    model.eval()
    scaler = TemperatureScaler().to(device)
    nll = nn.CrossEntropyLoss()
 
    # Collect all calibration logits in one pass with weights frozen.
    logits_list, labels_list = [], []
    with torch.no_grad():
        for xb, yb in cal_loader:
            logits_list.append(model(xb.to(device)).cpu())
            labels_list.append(yb)
 
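    # Move the collected tensors to the scaler's device; mixing CPU logits with
    # a CUDA parameter would raise a device-mismatch error.
    logits_all = torch.cat(logits_list).to(device)
    labels_all = torch.cat(labels_list).to(device)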
 
    # LBFGS converges in a few steps for a single scalar parameter.
    opt = torch.optim.LBFGS([scaler.log_T], lr=0.01, max_iter=100)
 
    def closure():
        opt.zero_grad()
        loss = nll(scaler(logits_all), labels_all)
        loss.backward()
        return loss
 
    opt.step(closure)
    print(f"Calibrated temperature: {scaler.temperature:.3f}")
    return scaler

Check ECE (Expected Calibration Error) before and after to confirm calibration actually improved. ECE bins examples by confidence and measures the gap between mean confidence and actual accuracy in each bin. A model with ECE of 0.12 before and 0.03 after temperature scaling has materially more trustworthy probability estimates.

ece.py

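import numpy as np
from scipy.special import softmax


def ece(probs, y_true, n_bins=15):
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    correct = (predictions == y_true)
    score = 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        # Open on the left, closed on the right, so a confidence of exactly 1.0
        # lands in the last bin instead of being dropped.
        mask = (confidences > lo) & (confidences <= hi)
        if mask.sum() > 0:
            bin_acc = correct[mask].mean()
            bin_conf = confidences[mask].mean()
            # Weight each bin's |accuracy - confidence| gap by its share of examples.
            score += mask.mean() * abs(bin_acc - bin_conf)
    return score


# Assumed inputs: raw_logits is the (n_test, n_classes) array of test-set logits
# from the base model, and T_optimal is the fitted temperature (scaler.temperature).
probs_uncal = softmax(raw_logits, axis=1)
probs_cal   = softmax(raw_logits / T_optimal, axis=1)

print(f"ECE before: {ece(probs_uncal, y_test):.4f}")
print(f"ECE after:  {ece(probs_cal,   y_test):.4f}")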

One wiring note: the temperature scaler has to be fit on the calibration set, and the calibration set then gets used again to build the conformal nonconformity scores via MAPIE. In practice that is fine: you are running two separate fits on the same held-out data, neither one involves gradient updates to the base model, and the nonconformity scores are computed on data the model never trained on. Strictly speaking, T is itself fitted on those examples, so if you want the guarantee with no asterisk, fit T on a separate slice of held-out data; with a single scalar the effect on coverage is typically too small to see in the empirical check.

Risk-coverage curves and AURC

Prediction sets tell you about coverage at a fixed alpha. Risk-coverage curves ask a broader question: what is the tradeoff between how often you predict versus how often you are right when you do?

The idea is simple. Rank all test examples by a confidence score — 1 over set size works well, or calibrated max softmax probability. Then sweep a threshold from most confident to least. At each threshold, compute coverage (fraction of examples above threshold) and risk (error rate on those examples). Plot them. A model that is genuinely selective looks like a curve that drops sharply toward low risk at low coverage — the examples it is most confident about really are the ones it gets right. A miscalibrated or overconfident model looks nearly flat — confidence is not predictive of correctness.

AURC (Area Under the Risk-Coverage curve) summarizes this in one number. Lower is better. AURC is zero only when the model makes no errors at all; for any real model the best a confidence score can do is rank every error last, which leaves a small positive floor set by the overall error rate. A random confidence assignment has AURC approximately equal to the model's overall error rate, meaning confidence does not help at all.

risk_coverage.py

import numpy as np


def risk_coverage_curve(y_true, y_pred, confidence_scores):
    # Sort from most to least confident; a stable sort keeps tied examples
    # (common with set-size confidence) in their original order.
    order = np.argsort(-np.asarray(confidence_scores), kind="stable")
    errors = np.asarray(y_true)[order] != np.asarray(y_pred)[order]
    n = len(errors)
    coverages = np.arange(1, n + 1) / n
    # Running error rate over the k most confident examples.
    risks = np.cumsum(errors) / np.arange(1, n + 1)
    return coverages, risks


y_point, y_set = mapie.predict(X_test, alpha=0.05, include_last_label=True)
# Sets have shape (n_samples, n_classes, n_alpha); keep the single alpha.
set_sizes = y_set[:, :, 0].sum(axis=1)

# Use inverse set size as confidence: smaller set = more confident
conf_set  = 1.0 / set_sizes.clip(min=1)
# Use calibrated max softmax as an alternative confidence signal
# (probs_cal must be the calibrated probabilities for X_test).
conf_prob = probs_cal.max(axis=1)

cov_set,  risk_set  = risk_coverage_curve(y_test, y_point, conf_set)
cov_prob, risk_prob = risk_coverage_curve(y_test, y_point, conf_prob)

# NumPy 2.0 renamed np.trapz to np.trapezoid; use whichever your version provides.
aurc_set  = np.trapz(risk_set,  cov_set)
aurc_prob = np.trapz(risk_prob, cov_prob)

print(f"AURC (set size confidence): {aurc_set:.4f}")
print(f"AURC (max prob confidence): {aurc_prob:.4f}")

Comparing the two AURC numbers tells you whether set size or calibrated probability is the better confidence signal for your model. On well-calibrated models they tend to be close. On overconfident models, set size usually wins because the conformal procedure already corrected for the calibration problem. I have seen cases where the two AURC values differ by 40% — which usually means the base probabilities are poorly calibrated and temperature scaling has not fully resolved it.

Operational gates

The point of all this is to wire it into a decision. In practice, selective prediction means the pipeline either passes an example through to a downstream action or routes it to a human review queue. The gate is a coverage target translated into an abstention rule.

Two ways to frame the gate:

Set-size gate: abstain if set size exceeds a threshold. This is natural with MAPIE because the alpha parameter controls the coverage target directly. At alpha=0.05, examples where the model is genuinely uncertain will have large sets. A gate of "abstain if set size > 2" lets you tune coverage and precision together.

operational_gate.py

abstain_threshold = 2  # tune against validation risk-coverage curve
 
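y_point, y_set = mapie.predict(X_test, alpha=0.05, include_last_label=True)
# Sets have shape (n_samples, n_classes, n_alpha); take the single alpha so that
# set_sizes is 1-D and can be used as a boolean mask on y_point below.
set_sizes = y_set[:, :, 0].sum(axis=1)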
 
act_mask     = set_sizes <= abstain_threshold
abstain_mask = ~act_mask
 
acted_coverage = act_mask.mean()
acted_accuracy = (y_point[act_mask] == y_test[act_mask]).mean()
abstain_rate   = abstain_mask.mean()
 
print(f"Acted on:   {acted_coverage:.1%} of examples")
print(f"Accuracy:   {acted_accuracy:.3f} on acted-on examples")
print(f"Abstained:  {abstain_rate:.1%}")

Coverage target gate: read the risk-coverage curve and find the coverage level where risk drops below your tolerance. Then set the threshold at the confidence score value that corresponds to that coverage level. This is more principled but requires computing the curve on a validation set before deploying.
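One way to turn that recipe into code, as a sketch: walk the validation examples from most to least confident and keep the lowest threshold whose acted-on error rate is still within tolerance. The function name, the toy data, and the tie-handling rule are my own assumptions here.

```python
def threshold_for_risk(confidences, correct, max_risk):
    """Lowest confidence threshold whose acted-on error rate is <= max_risk.

    confidences: one score per validation example (higher = more confident).
    correct: booleans, True where the point prediction matched the label.
    Returns (threshold, coverage), or (None, 0.0) if no threshold qualifies.
    The gate is then: act when confidence >= threshold.
    """
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    best = (None, 0.0)
    errors = 0
    for k, (conf, ok) in enumerate(ranked, start=1):
        if not ok:
            errors += 1
        # Only cut between distinct confidence values, so tied examples
        # are always accepted or rejected together.
        at_boundary = k == n or ranked[k][0] < conf
        if at_boundary and errors / k <= max_risk:
            best = (conf, k / n)
    return best


# Made-up validation results, already paired score-by-score.
val_conf = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.50]
val_ok = [True, True, True, True, False, True, False, False]

threshold, coverage = threshold_for_risk(val_conf, val_ok, max_risk=0.2)
print(f"Act when confidence >= {threshold}, covering {coverage:.0%} of examples")
```

Pick the threshold on validation data only, then confirm the resulting acted-on error rate on the test split; choosing and evaluating on the same examples makes the gate look better than it is.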

The right choice between them depends on what downstream means. If the cost of a wrong prediction and the cost of a missed prediction are roughly equivalent, tune toward a coverage target. If wrong predictions are expensive (a medical flag, a fraud decision, a compliance trigger) and abstentions are cheap (routed to review), set the abstention threshold conservatively and accept a lower coverage rate.

Either way, the numbers you should report alongside model accuracy are acted-on accuracy and abstention rate, together. Acted-on accuracy alone is nearly meaningless on a selective-prediction system: a model that abstains on 80% of examples can trivially achieve high accuracy on the 20% it answers. The full picture requires all three numbers.

What ships with this

The three artifacts worth keeping after this is set up:

Calibrated base model with temperature scalar and ECE before/after. Lets you verify calibration did not degrade when the model is retrained.
MAPIE fitted wrapper serialized with joblib. The calibration nonconformity scores are stored inside, and they represent the coverage contract with the calibration set. When the model is retrained, refit the wrapper on a fresh calibration split.
Risk-coverage curve and AURC logged at each promotion. If AURC increases on the next model version, confidence has become less predictive of correctness even if accuracy improved. That is a regression worth flagging.

Conformal prediction is not magic and the guarantee is not unconditional. It holds when calibration and test distributions are exchangeable. It erodes under covariate shift. Monitoring set-size distribution over time is a reasonable early signal that the input distribution has moved — if mean set size starts climbing, the model is encountering examples that look harder than the calibration data did. That is worth investigating before the coverage guarantee becomes optimistic.
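A minimal sketch of that monitor, assuming you log set sizes at calibration time and again over a recent production window. The function name and the 25% tolerance are hypothetical; tune the tolerance to how noisy your traffic is.

```python
from statistics import mean


def set_size_drift(baseline_sizes, recent_sizes, max_relative_increase=0.25):
    """Compare recent mean set size against the calibration-time baseline.

    Returns (relative_change, alert). alert is True when the mean set size
    grew by more than max_relative_increase (a hypothetical tolerance).
    """
    baseline = mean(baseline_sizes)
    recent = mean(recent_sizes)
    relative_change = (recent - baseline) / baseline
    return relative_change, relative_change > max_relative_increase


# Made-up set sizes: logged at calibration time vs. a recent production window.
baseline_sizes = [1, 1, 1, 2, 1, 1, 2, 1, 1, 1]  # mean 1.2
recent_sizes = [1, 2, 2, 3, 1, 2, 2, 1, 3, 1]    # mean 1.8

change, alert = set_size_drift(baseline_sizes, recent_sizes)
print(f"Mean set size change: {change:+.0%}, alert={alert}")
```

An alert here does not prove that coverage has dropped, since labels are usually not available yet. It is a prompt to pull a labeled sample and rerun the empirical coverage check.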

A model that knows what it does not know does not solve the hard problem of distribution shift. But it makes the hard problem visible instead of silent.