The Preprocessing Boundary Between scikit-learn and PyTorch

Outcome focus: Defined an artifact contract that keeps column preprocessing, feature order, model weights, metadata, and inference behavior synchronized across batch and serving environments.

Most tabular model failures are not dramatic.

They are small mismatches.

The training notebook imputes missing values one way. The batch job fills them another way. A categorical encoder sees a new value in production. A REST service sends columns in a different order. A Snowpark UDF uses a slightly different version of scikit-learn. A PyTorch model receives a tensor with the right shape but the wrong feature meaning.

Nothing crashes.

The predictions are just wrong.

That is why the preprocessing boundary matters. In tabular machine learning, the model is rarely only the neural network. The model is the feature contract plus the learned weights. If the preprocessing graph is not shipped with the PyTorch model, the system is depending on memory, documentation, and luck.

I do not like luck at the model boundary.

The pattern I trust is simple:

Use scikit-learn for column-aware preprocessing.
Use PyTorch for the tensor model.
Fit the preprocessing graph once on training data.
Persist the preprocessing graph and model weights as separate artifacts.
Persist metadata that describes feature order, model input dimension, versions, and artifact hashes.
Load all artifacts together inside one inference wrapper.

That keeps training and inference attached to the same contract.

Why scikit-learn and PyTorch belong together

PyTorch is excellent at tensor computation.

It is not where I want to express most tabular column logic.

For tabular data, scikit-learn has the right vocabulary: ColumnTransformer, Pipeline, SimpleImputer, StandardScaler, OneHotEncoder, and friends. These tools know about columns, categorical variables, missing values, dense and sparse arrays, and repeatable transform graphs.

PyTorch should not have to care whether region was one-hot encoded before or after channel, or whether income was imputed with a median fit only on the training split.

That is preprocessing's job.

The PyTorch model should receive a numeric tensor whose feature order is stable. The scikit-learn graph should own the messy part: raw columns to model-ready matrix.

This division keeps both sides honest.

scikit-learn handles column semantics. PyTorch handles representation learning, loss functions, gradients, and inference.

The artifact contract

The production unit should include at least four artifacts.

The first is the fitted preprocessor. This is the ColumnTransformer or Pipeline after fit. It contains learned imputation values, scaler means and variances, one-hot category mappings, and column ordering.

The second is the PyTorch state_dict. PyTorch's own docs recommend saving and loading model weights through state_dict for inference instead of saving the entire model object. That keeps the class definition in code and the learned parameters in the artifact.

The third is metadata. This is not optional. The metadata should store the model input dimension, feature names, target name, package versions, training data snapshot identifier, artifact hashes, and any serving-time assumptions.

The fourth is evaluation evidence. A small golden batch with expected outputs can catch deployment mismatch before users do.

The key idea is that inference should never have to guess.

It should not infer input dimension by pushing a fake row through the preprocessor. It should not assume feature names from current code. It should not trust that the serving environment happens to match the notebook.

Make the contract explicit.
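One way to make the contract explicit is to parse the metadata into a typed object that refuses to load when fields are missing or inconsistent. The sketch below is illustrative only: the ArtifactContract name is hypothetical, and the keys are assumed to match the metadata.json written by the training example later in this post.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ArtifactContract:
    """Hypothetical typed view of metadata.json; not part of any library."""

    input_dim: int
    feature_names: Tuple[str, ...]
    num_cols: Tuple[str, ...]
    cat_cols: Tuple[str, ...]
    target_col: str

    @classmethod
    def from_dict(cls, meta: dict) -> "ArtifactContract":
        required = ("input_dim", "feature_names", "num_cols", "cat_cols", "target_col")
        missing = [key for key in required if key not in meta]
        if missing:
            raise ValueError(f"metadata is missing keys: {missing}")

        names = tuple(meta["feature_names"])
        # The model width and the feature list must describe the same matrix.
        if len(names) != meta["input_dim"]:
            raise ValueError(
                f"input_dim={meta['input_dim']} but {len(names)} feature names"
            )
        if len(set(names)) != len(names):
            raise ValueError("feature names are not unique")

        return cls(
            input_dim=int(meta["input_dim"]),
            feature_names=names,
            num_cols=tuple(meta["num_cols"]),
            cat_cols=tuple(meta["cat_cols"]),
            target_col=str(meta["target_col"]),
        )
```

Calling ArtifactContract.from_dict(json.loads(...)) at startup turns a quietly wrong metadata file into an immediate error.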

Persistence and safety

scikit-learn's own model persistence docs are clear about the tradeoffs.

Pickle-based formats such as pickle, joblib, and cloudpickle are flexible, but loading them can execute arbitrary code. That is fine only when the artifact source is trusted and verified. skops.io is more secure than pickle-based formats because it does not blindly load arbitrary code. It allows inspection of untrusted types before load.

That does not make skops.io magic.

The skops docs still warn that the feature is under development and that loading files from untrusted sources requires caution. In production, this means the artifact pipeline should be controlled. Artifacts should come from CI or a governed training job. They should be hashed. They should be reviewed before deployment.

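For PyTorch, torch.save and torch.load also use pickle under the hood. PyTorch's current docs recommend saving model weights with model.state_dict() and loading with weights_only=True, which restricts unpickling to tensors and other plain container types. Since PyTorch 2.6 that is the default for torch.load; passing it explicitly keeps older versions on the same path.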

So the safe-ish baseline is:

skops.io for the fitted scikit-learn preprocessing graph.
PyTorch state_dict for model weights.
Metadata JSON for non-learned configuration.
Hashes and package versions for reproducibility.

Do not load arbitrary model artifacts from places you do not trust.

The model file is code-adjacent.
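A hash check before any deserialization is one way to enforce that. The sketch below uses only the standard library. The function names are hypothetical, and it assumes the expected digest comes from somewhere the artifact author cannot silently rewrite, such as a deployment manifest.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise before loading if the bytes on disk are not the expected ones."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"hash mismatch for {path.name}: got {actual}")
```

Run verify_artifact on the preprocessor and the weights before calling skio.load or torch.load, so a tampered or mismatched file is rejected while it is still just bytes.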

A minimal training example

This example uses scikit-learn for preprocessing and PyTorch for a binary classifier.

import hashlib
import json
from pathlib import Path
 
import pandas as pd
import skops.io as skio
import torch
import torch.nn as nn
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
 
 
ARTIFACT_DIR = Path("artifacts")
ARTIFACT_DIR.mkdir(exist_ok=True)
 
num_cols = ["age", "income"]
cat_cols = ["region", "channel"]
target_col = "label"
 
preprocessor = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler(with_mean=True, with_std=True)),
                ],
            ),
            num_cols,
        ),
        (
            "cat",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="most_frequent")),
                    (
                        "onehot",
                        OneHotEncoder(handle_unknown="ignore", sparse_output=False),
                    ),
                ],
            ),
            cat_cols,
        ),
    ],
    remainder="drop",
)
 
df = pd.DataFrame(
    {
        "age": [44, None, 31, 52, 39],
        "income": [70_000, 55_000, None, 88_000, 61_000],
        "region": ["NE", "SW", "NE", "MW", "SW"],
        "channel": ["web", "store", "web", "phone", "store"],
        "label": [1, 0, 1, 0, 1],
    },
)
 
X_train = preprocessor.fit_transform(df.drop(columns=[target_col]))
y_train = torch.tensor(df[target_col].values, dtype=torch.float32).view(-1, 1)
 
 
class MLP(nn.Module):
    def __init__(self, d_in: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
 
 
input_dim = X_train.shape[1]
model = MLP(input_dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
 
X_tensor = torch.tensor(X_train, dtype=torch.float32)
 
model.train()
for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_tensor), y_train)
    loss.backward()
    optimizer.step()
 
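feature_names = preprocessor.get_feature_names_out().tolist()

skio.dump(preprocessor, ARTIFACT_DIR / "preprocessor.skops")
torch.save(model.state_dict(), ARTIFACT_DIR / "model_state.pt")


def sha256_file(name: str) -> str:
    return hashlib.sha256((ARTIFACT_DIR / name).read_bytes()).hexdigest()


import sklearn  # imported here only to record the version used for fitting

metadata = {
    "model_class": "MLP",
    "input_dim": input_dim,
    "feature_names": feature_names,
    "num_cols": num_cols,
    "cat_cols": cat_cols,
    "target_col": target_col,
    "torch_version": torch.__version__,
    "sklearn_version": sklearn.__version__,
    # Hashes of the learned artifacts, so a loader can verify what it reads.
    "artifact_sha256": {
        name: sha256_file(name)
        for name in ["preprocessor.skops", "model_state.pt"]
    },
}

(ARTIFACT_DIR / "metadata.json").write_text(json.dumps(metadata, indent=2))

# metadata.json cannot contain its own hash; record it wherever deployments are tracked.
print("metadata.json", sha256_file("metadata.json"))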

The important part is not the toy model.

The important part is that the fitted preprocessing graph, the model weights, and the metadata are created together.

A safer inference wrapper

The inference wrapper should load all artifacts on startup, validate the feature contract, and make prediction calls boring.

import json
from pathlib import Path
 
import pandas as pd
import skops.io as skio
import torch
import torch.nn as nn
 
 
class MLP(nn.Module):
    def __init__(self, d_in: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1),
        )
 
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
 
 
class TabularTorchInference:
    def __init__(self, artifact_dir: str = "artifacts"):
        self.artifact_dir = Path(artifact_dir)
        self.metadata = json.loads((self.artifact_dir / "metadata.json").read_text())
 
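        # Passing get_untrusted_types() straight into trusted= would approve whatever
        # the file contains, so compare against a reviewed allowlist first. It is empty
        # here on the assumption that this preprocessor only uses types skops trusts
        # by default; add reviewed type names if yours does not.
        reviewed_types: set = set()
        untrusted_types = skio.get_untrusted_types(
            file=self.artifact_dir / "preprocessor.skops",
        )
        unexpected = sorted(set(untrusted_types) - reviewed_types)
        if unexpected:
            raise ValueError(f"Unreviewed types in preprocessor: {unexpected}")
        self.preprocessor = skio.load(
            self.artifact_dir / "preprocessor.skops",
            trusted=untrusted_types,
        )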
 
        self.feature_names = self.metadata["feature_names"]
        loaded_feature_names = self.preprocessor.get_feature_names_out().tolist()
        if loaded_feature_names != self.feature_names:
            raise ValueError("Preprocessor feature names do not match metadata.")
 
        self.model = MLP(d_in=self.metadata["input_dim"])
        state = torch.load(
            self.artifact_dir / "model_state.pt",
            map_location="cpu",
            weights_only=True,
        )
        self.model.load_state_dict(state)
        self.model.eval()
 
    @torch.no_grad()
    def predict_proba(self, df: pd.DataFrame):
        required_cols = self.metadata["num_cols"] + self.metadata["cat_cols"]
        missing = sorted(set(required_cols) - set(df.columns))
        if missing:
            raise ValueError(f"Missing required columns: {missing}")
 
        X = self.preprocessor.transform(df[required_cols])
        X_tensor = torch.tensor(X, dtype=torch.float32)
        logits = self.model(X_tensor).squeeze(1)
        return torch.sigmoid(logits).cpu().numpy()

In a real service, I would add structured logging, request validation, model version headers, and a golden-batch startup check. But the shape is right.

Load artifacts once. Validate the contract. Transform raw rows through the fitted graph. Convert to tensor. Run the model in eval mode under torch.no_grad.
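The golden-batch startup check mentioned above can be small. This is a minimal sketch, assuming the golden rows and their expected probabilities were saved next to the other artifacts at training time. The function name and the tolerance are illustrative choices, not a standard.

```python
import math
from typing import Callable, Sequence


def check_golden_batch(
    predict: Callable[[Sequence[dict]], Sequence[float]],
    rows: Sequence[dict],
    expected: Sequence[float],
    abs_tol: float = 1e-5,
) -> None:
    """Fail startup if predictions on stored rows differ from stored outputs."""
    actual = list(predict(rows))
    if len(actual) != len(expected):
        raise RuntimeError(f"expected {len(expected)} outputs, got {len(actual)}")
    for i, (got, want) in enumerate(zip(actual, expected)):
        # Absolute tolerance: float results can differ slightly across builds.
        if not math.isclose(got, want, rel_tol=0.0, abs_tol=abs_tol):
            raise RuntimeError(f"golden row {i}: got {got:.6f}, expected {want:.6f}")
```

With the wrapper above, predict could be a small adapter such as lambda rows: service.predict_proba(pd.DataFrame(rows)). A failure here means the deployed path from raw row to probability is not the one that was evaluated.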

Feature order is the contract

The most fragile part of this setup is feature order.

The PyTorch model does not know that column 17 means cat__region_NE. It only sees a float at position 17. If the preprocessing output order changes, the model may still receive a tensor with the expected dimension, but the meaning of the tensor has changed.

That is one of the worst failure modes because it can be silent.

Persisting preprocessor.get_feature_names_out() gives the inference wrapper something to check. If the loaded preprocessor produces a different feature list from metadata, fail early.
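To see why a shape check is not enough, here is a toy sketch in plain Python. A three-weight linear scorer stands in for the network, and the feature names follow the num__/cat__ prefix style that ColumnTransformer produces. The values and the first_mismatch helper are made up for illustration.

```python
def first_mismatch(expected: list, actual: list):
    """Return (index, expected_name, actual_name) for the first difference, or None."""
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return i, want, got
    if len(expected) != len(actual):
        i = min(len(expected), len(actual))
        want = expected[i] if i < len(expected) else None
        got = actual[i] if i < len(actual) else None
        return i, want, got
    return None


# Toy linear scorer standing in for the network: one weight per feature position.
weights = [2.0, -1.0, 0.5]
trained_order = ["num__age", "cat__region_NE", "cat__region_SW"]
swapped_order = ["cat__region_SW", "cat__region_NE", "num__age"]
row = {"num__age": 1.0, "cat__region_NE": 1.0, "cat__region_SW": 0.0}


def score(order: list) -> float:
    # The weights are positional; only the order decides which value meets which weight.
    return sum(w * row[name] for w, name in zip(weights, order))


print(score(trained_order))  # 1.0
print(score(swapped_order))  # -0.5, same length, different meaning
print(first_mismatch(trained_order, swapped_order))  # (0, 'num__age', 'cat__region_SW')
```

Both orders produce a vector of length three, so a dimension check passes. Only the name comparison notices that position 0 no longer means age.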

I would also store:

Raw input columns.
Feature names after preprocessing.
Input dimension.
Target name.
Training data snapshot ID.
Library versions.
Artifact hashes.
Model version.
Training run ID.

This metadata is cheap compared with a bad deployment.

Unknown categories and missing values

Tabular inference breaks when production data behaves differently from training data.

The example uses OneHotEncoder(handle_unknown="ignore"). That means a new category at inference time does not crash the transform. It gets encoded as all zeros for that categorical feature's known levels.

That may be the right choice.

It may also hide drift.

If a new channel value appears for 30 percent of production traffic, silently ignoring it is not good enough. The model is now operating on a category it never learned. The service should monitor unknown category rates and missing value rates.

The same goes for imputers. Median imputation is not neutral. It is a modeling assumption. If missingness changes in production, predictions can change.

The preprocessor keeps training and inference consistent.

It does not make production data healthy.
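Monitoring those rates does not need much machinery. The sketch below assumes the known category sets were exported at training time, for example from the fitted OneHotEncoder's categories_ attribute, and that missing values arrive as None. The function name and the output layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Set


def drift_rates(
    rows: Iterable[Mapping[str, object]],
    known_categories: Mapping[str, Set[str]],
) -> Dict[str, Dict[str, float]]:
    """Per-column share of missing values and of categories unseen in training."""
    missing, unknown = Counter(), Counter()
    total = 0
    for row in rows:
        total += 1
        for col, known in known_categories.items():
            value = row.get(col)
            if value is None:
                missing[col] += 1  # the imputer will fill this in
            elif value not in known:
                unknown[col] += 1  # the encoder will emit all zeros for this
    if total == 0:
        return {}
    return {
        col: {"missing": missing[col] / total, "unknown": unknown[col] / total}
        for col in known_categories
    }
```

Emitting these numbers per batch or per time window gives an alert something to fire on long before accuracy metrics arrive.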

Batch, REST, Snowpark, and Vertex AI

The same artifact pattern works in several deployment shapes.

For batch jobs, load the artifacts once per job, transform the batch, run the PyTorch model, and write predictions. Airflow, Dagster, dbt-adjacent Python jobs, and scheduled containers all fit this path.

For REST services, load the artifacts on startup. FastAPI is a natural wrapper. Validate input with Pydantic, transform with scikit-learn, infer with PyTorch, and return probabilities plus model version.

For Snowflake or Snowpark, there are two common patterns. One is to load the skops preprocessor and PyTorch model inside a Python UDF or stored procedure. That keeps scoring close to the data but requires careful package and artifact management. The other is to do preprocessing in Snowflake and call a model endpoint for inference. That reduces in-database model complexity but creates a service dependency.

For GCP, I would store artifacts in GCS and deploy a custom Vertex AI Prediction container or Cloud Run service. The container loads the preprocessor, metadata, and model weights at startup. Artifact paths and model version become environment configuration.

The core contract does not change.

Raw columns go into the fitted preprocessor. Numeric tensors go into PyTorch. Metadata verifies the handoff.
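For the batch shape, the loop around the wrapper can stay generic. This sketch assumes a predict callable that was built once per job from the loaded artifacts. The function name, the chunk size, and the output keys are illustrative assumptions.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence


def score_in_chunks(
    rows: Iterable[dict],
    predict: Callable[[List[dict]], Sequence[float]],
    model_version: str,
    chunk_size: int = 1000,
) -> Iterator[dict]:
    """Yield one output record per input row, tagged with the model version."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        scores = predict(chunk)
        if len(scores) != len(chunk):
            raise RuntimeError("predict returned a different number of rows")
        for row, score in zip(chunk, scores):
            # Every prediction carries the identity of the artifacts behind it.
            yield {**row, "score": float(score), "model_version": model_version}
```

The same predict callable can sit behind a REST handler. Only the loop around it changes between deployment shapes.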

When to move beyond this pattern

This pattern is strongest for tabular ML with rich preprocessing.

It is not always the right final form.

If ultra-low latency matters, the Python preprocessing graph may become a bottleneck. You may convert the PyTorch model to TorchScript or ONNX, but the preprocessing still has to be handled. If you cannot represent preprocessing in the serving runtime, you still need an upstream transform step.

If broad runtime portability matters, ONNX may be attractive. scikit-learn's docs describe ONNX as useful when serving without a Python environment is the priority, while also noting that not all models are supported. End-to-end ONNX can work when preprocessing can be represented as ONNX operators or reliably applied upstream.

If edge deployment matters, all-Torch preprocessing may be worth the pain. That can make TorchScript or mobile deployment cleaner, but it gives up some of scikit-learn's column-aware convenience.

If explainability and governance matter more than latency, keeping the scikit-learn preprocessing graph visible may be an advantage.

The choice depends on where the system must be stable.

Failure modes

The first failure mode is refitting preprocessing at inference time. That is a different model.

The second is rebuilding preprocessing logic by hand in the service. It will drift.

The third is not persisting feature names. The tensor shape can match while the feature meaning changes.

The fourth is loading artifacts unsafely. pickle, joblib, cloudpickle, torch.load, and even safer alternatives need a trust model.

The fifth is saving the whole PyTorch model object instead of the state_dict. That couples the artifact to code structure more tightly than necessary.

The sixth is forgetting model.eval(). Dropout and batch norm behave differently in training mode.

The seventh is ignoring unknown category rates. handle_unknown="ignore" prevents crashes, not drift.

The eighth is version mismatch. scikit-learn's docs are direct: loading models across different scikit-learn versions is unsupported and inadvisable.

The ninth is no golden batch. A small expected-output test would catch many deployment mistakes.

The tenth is no artifact identity. If the service cannot say which preprocessor and weights produced a prediction, the system is not auditable.
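The version-mismatch failure is easy to check mechanically. This is a rough sketch, assuming the training job recorded versions keyed by distribution name (for example "scikit-learn" rather than "sklearn"). The function name is hypothetical, and exact string equality is a deliberately strict choice.

```python
from importlib import metadata as importlib_metadata
from typing import Callable, Dict, List


def version_mismatches(
    recorded: Dict[str, str],
    installed: Callable[[str], str] = importlib_metadata.version,
) -> List[str]:
    """Compare versions recorded at training time with the running environment."""
    problems = []
    for package, want in recorded.items():
        try:
            got = installed(package)
        except importlib_metadata.PackageNotFoundError:
            problems.append(f"{package}: recorded {want}, not installed")
            continue
        if got != want:
            problems.append(f"{package}: recorded {want}, installed {got}")
    return problems
```

Whether a non-empty result should block startup or only log a warning is a policy decision. For the scikit-learn preprocessor, blocking is the safer default given the docs' stance on cross-version loading.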

The point

The right production boundary is not "scikit-learn or PyTorch."

It is scikit-learn for the column contract and PyTorch for the tensor model.

That boundary works when it is explicit. The fitted preprocessor is an artifact. The model weights are an artifact. The metadata is an artifact. The golden batch is an artifact. Together, they define the thing being deployed.

The model is not just model.pt.

The model is the whole path from raw input row to probability.

When that path is serialized, versioned, validated, and loaded as one unit, training and inference have a fighting chance of staying in sync.

That is the difference between a notebook result and a model service.
