Outcome focus: Gave peers a review-ready vocabulary for inspecting ML systems by connecting core terms to design choices, failure modes, and release questions.
Tags: machine learning, model evaluation, feature engineering, neural networks, mlops
The most expensive machine learning confusion usually hides inside ordinary words.
"Feature" sounds like a column. Sometimes it is a column. Sometimes it is a derived aggregate, an embedding, a timestamped entity value, or a leaky proxy for the label.
"Accuracy" sounds like quality. Sometimes it is quality. Sometimes it is a class-imbalance trap.
"Validation" sounds like testing. It is not the final test.
"Regression" sounds like a model family. It can mean a task type, a linear method, or the confusingly named logistic regression, which is usually a classifier.
I have watched model reviews slow down because the team was using correct words at the wrong layer. The data scientist meant "the validation fold improved." The product owner heard "the production decision is safer." The engineer meant "the preprocessing pipeline is fit only on training data." Someone else heard "we normalized the whole dataset before the split." Those are different claims.
A glossary helps, but only if it behaves like an operating tool. The point is not to memorize terms. The point is to ask sharper review questions before a model becomes a decision system.
Google's Machine Learning Glossary is a solid reference for formal definitions. This post is narrower. It translates common ML terms into the questions I want peers to ask in a design review, experiment review, or release gate.
The Map
Most of the terms fit into one loop.
The map matters because terms change meaning depending on where they sit. A feature in exploratory analysis is not yet a production feature. A model score on the validation set is not yet a release result. A learning rate is not learned by the model in the same way a weight is. An embedding is not magic semantic dust; it is a vector representation with training assumptions and failure modes.
The Short Operating Glossary
Use this table as the model-review version.
| Term | Operational meaning | Review question |
|---|---|---|
| Feature | Input signal used by the model | Is it available, stable, legal, and timestamp-correct at prediction time? |
| Label or target | Outcome the model learns to predict | Does the label match the decision, or only a convenient measurement? |
| Feature engineering | Transformation from raw data to model-ready signal | Is the transformation fit only on training data and reproducible in serving? |
| Train set | Data used to learn parameters | Does it represent the production population without future leakage? |
| Validation set | Data used to tune choices | Has the team avoided overfitting the design to this set? |
| Test set | Final held-out evidence | Has it stayed untouched until the model and threshold were chosen? |
| Classification | Predicts a class or class probability | Are false positives and false negatives priced separately? |
| Regression | Predicts a continuous value | Is closeness of the number actually the decision, or should it be ranking/classification? |
| Parameter | Learned internal value | Which values are learned from data, and which are set by the team? |
| Hyperparameter | Configuration set before or around training | What search budget and objective justify the chosen value? |
| Loss | Training objective minimized by the optimizer | Does the loss align with the release metric? |
| Metric | Evaluation summary | Does the metric reflect the action the model will drive? |
| Regularization | Complexity penalty or training constraint | Which overfitting failure is it meant to reduce? |
| Embedding | Dense vector representation | What data trained the representation, and what similarity does it encode? |
This table is deliberately practical. A term is not understood until the team can use it to reject a bad design.
Features Are Not Just Columns
A feature is an input signal. In a table, it might look like a column: age, monthly_spend, contract_type, support_ticket_count. In text, it might be token IDs or a sentence embedding. In images, it may be pixel values or intermediate representations learned by a convolutional network. In recommendation systems, it may be a learned user or item vector.
The review question is availability.
If a churn model uses support_ticket_count_last_7d, the team has to prove that value exists at the moment of prediction. If the production workflow scores users every morning at 8 AM, the feature cannot depend on tickets that close at 5 PM unless the prediction is explicitly retrospective.
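A point-in-time check can make that proof concrete. The sketch below is illustrative only: the ticket events, the helper name, and the choice to count tickets by opening time are assumptions, not part of any real pipeline.

```python
from datetime import datetime, timedelta

# Hypothetical ticket events: (user_id, opened_at). Illustrative data only.
tickets = [
    ("u1", datetime(2024, 5, 1, 9, 30)),
    ("u1", datetime(2024, 5, 6, 16, 45)),
    ("u1", datetime(2024, 5, 7, 14, 0)),  # opened after the 8 AM scoring run
]

def support_ticket_count_last_7d(events, user_id, as_of):
    """Count tickets opened in the 7 days up to, but not including, `as_of`."""
    window_start = as_of - timedelta(days=7)
    return sum(
        1
        for uid, opened_at in events
        if uid == user_id and window_start <= opened_at < as_of
    )

# What the 8 AM scoring job can actually see.
scoring_time = datetime(2024, 5, 7, 8, 0)
count_at_scoring = support_ticket_count_last_7d(tickets, "u1", scoring_time)

# What an end-of-day backfill would report for the same calendar day.
count_end_of_day = support_ticket_count_last_7d(
    tickets, "u1", datetime(2024, 5, 8, 0, 0)
)
print(count_at_scoring, count_end_of_day)
```

If training rows are built from the end-of-day view while serving uses the 8 AM view, the model trains on a value it will never receive in production.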
Feature types matter because they imply preprocessing:
| Feature type | Common handling | Risk |
|---|---|---|
| Numerical | impute, scale, cap outliers, transform | Unit mismatch, outliers, leakage through global scaling |
| Categorical | one-hot encode, ordinal encode, hash, embed | Unknown categories, fake ordering, high cardinality |
| Text | tokenize, embed, summarize, classify | Prompt/data drift, truncation, domain mismatch |
| Image | resize, normalize, augment, encode | Distribution shift from capture conditions |
| Time-based | windows, lags, freshness checks | Future leakage and stale features |
The scikit-learn preprocessing guide is useful because it treats preprocessing as part of the estimator graph, not a loose notebook habit. That is the right instinct. Feature engineering should be attached to the model artifact and release path.
Feature Engineering Is a Contract
Feature engineering is selecting, transforming, creating, and validating features so the model can learn useful structure.
The common moves are familiar:
- fill missing values with a learned imputation rule,
- encode categories,
- scale numerical features,
- bucket continuous values,
- create interaction terms,
- extract text sentiment or embeddings,
- aggregate behavior over time windows,
- reduce dimensionality with methods such as PCA.
The mistake is doing these transformations before the split or outside the deployable artifact.
If you standardize a numerical column using the mean and standard deviation of the entire dataset before splitting, the training features now carry statistics from the test distribution. The model may not see test labels directly, but the evaluation is still contaminated. The same applies to imputation, feature selection, target encoding, PCA, and any transformation that learns from data.
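A toy example shows how far the two sets of statistics can drift apart. The numbers are made up and the helper name is hypothetical; the only point is that the scaler's mean and standard deviation come from training rows and are then reused unchanged.

```python
from statistics import mean, pstdev

train = [10.0, 12.0, 11.0, 13.0]
test = [40.0, 42.0]  # held-out rows with a shifted distribution

# Leaky: statistics computed over train + test before the split.
leaky_mean = mean(train + test)
leaky_std = pstdev(train + test)

# Correct: statistics learned from training rows only, then reused as-is.
train_mean = mean(train)
train_std = pstdev(train)

def standardize(values, mu, sigma):
    """Apply already-learned scaling parameters to new values."""
    return [(v - mu) / sigma for v in values]

print("leaky mean:", round(leaky_mean, 2), "train-only mean:", round(train_mean, 2))
print("test rows with train-only scaling:", standardize(test, train_mean, train_std))
```

With train-only scaling the test rows show up as extreme values, which is the honest signal. The leaky version quietly pulls them toward the center.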
Here is the minimal shape I expect in tabular reviews:
```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# X is assumed to be a pandas DataFrame with the columns named below,
# and y the matching binary label; neither is defined in this snippet.
numeric_features = ["monthly_spend", "support_ticket_count"]
categorical_features = ["contract_type", "region"]

# Hold out 20% of the rows as the final test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    stratify=y,
    random_state=42,
)

# Split the remaining 80% so the overall ratio is 60% train, 20% validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val,
    y_train_val,
    test_size=0.25,
    stratify=y_train_val,
    random_state=42,
)

# Learned transformations live inside the pipeline, so they are fit on X_train only.
preprocess = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            numeric_features,
        ),
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore"),
            categorical_features,
        ),
    ]
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]
)

model.fit(X_train, y_train)
val_scores = model.predict_proba(X_val)[:, 1]
print(average_precision_score(y_val, val_scores))

# Use X_test only after the model, threshold, and preprocessing choices are fixed.
```

The important detail is not the specific model. It is that the transformation graph is fit on X_train, compared on X_val, and saved as part of the production artifact.
Train, Validation, and Test Are Different Promises
The train/validation/test split is not administrative. It protects the team from believing its own experiments too early.
The scikit-learn cross-validation docs make the core warning plain: testing a model on the same data used to learn it is a methodological mistake because a model can memorize seen samples and fail on unseen data.
The three-way split exists because model development has multiple decisions:
| Split | Used for | Do not use it for |
|---|---|---|
| Training | learning weights, tree splits, embeddings, imputation values | claiming final generalization |
| Validation | tuning hyperparameters, thresholds, feature sets, architecture | repeated indefinite fishing without accounting for it |
| Test | final unbiased estimate after decisions are fixed | model selection, threshold shopping, narrative rescue |
Cross-validation helps when data is scarce, but it does not remove the need for final evidence. It spreads validation across folds so model selection is less dependent on one lucky split. The final test set should still stay out of the tuning loop.
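The fold mechanics are simple enough to sketch by hand. This is a minimal, unshuffled version with a hypothetical helper name, meant to be run on the train-plus-validation rows only; a real project would more likely use a library splitter.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, val_idx) pairs; each row is validated exactly once."""
    # Spread any remainder across the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

# n_rows counts only the rows left after the test set was held out.
folds = list(k_fold_indices(n_rows=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Any preprocessing that learns from data has to be refit inside each fold on that fold's training indices, for the same reason as in the single-split case.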
For time-dependent data, random splits are often wrong. A churn model, demand forecast, fraud model, or pricing model usually needs a time-aware split. If tomorrow's pattern leaks into yesterday's training row, the evaluation is theater.
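A time-aware split can be as plain as two cutoff dates. The rows, field names, and cutoffs below are invented for illustration; the property worth checking is that every training row is older than every validation row, and every validation row is older than every test row.

```python
from datetime import date

# Hypothetical snapshot rows; field names are illustrative.
rows = [
    {"snapshot_date": date(2024, 1, 15), "churned": 0},
    {"snapshot_date": date(2024, 2, 15), "churned": 1},
    {"snapshot_date": date(2024, 3, 15), "churned": 0},
    {"snapshot_date": date(2024, 4, 15), "churned": 1},
    {"snapshot_date": date(2024, 5, 15), "churned": 0},
]

def time_split(records, val_start, test_start):
    """Split by date so later periods never appear in earlier partitions."""
    train = [r for r in records if r["snapshot_date"] < val_start]
    val = [r for r in records if val_start <= r["snapshot_date"] < test_start]
    test = [r for r in records if r["snapshot_date"] >= test_start]
    return train, val, test

train, val, test = time_split(rows, date(2024, 4, 1), date(2024, 5, 1))
print(len(train), len(val), len(test))
```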
Classification and Regression Are Decision Shapes
Classification predicts a category or class probability. Regression predicts a continuous value.
The easy examples are simple:
| Task | Shape | Example output |
|---|---|---|
| Spam detection | binary classification | spam or not_spam |
| Ticket routing | multiclass classification | billing, technical, account |
| House price estimate | regression | $485000 |
| Delivery time estimate | regression | 43 minutes |
The hard cases are decision-shaped.
An NPS score is numeric, but the business may only need to identify likely detractors for outreach. A fraud score is probabilistic, but the business may operate it as a ranked review queue. A demand forecast is regression, but the operational question may be whether inventory crosses a reorder threshold.
Task type should follow the action. If the action is "estimate the value," use regression language. If the action is "choose a class," use classification language. If the action is "rank a work queue," report ranking and threshold behavior instead of pretending the hard class label is the whole product.
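For the work-queue case, precision and recall at the queue's capacity are usually more informative than a hard-label accuracy. A minimal sketch, with made-up scores and a hypothetical capacity of three reviews:

```python
def precision_recall_at_k(scores, labels, k):
    """Precision and recall over the k highest-scored items."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:k])
    total_positives = sum(labels)
    precision = hits / k
    recall = hits / total_positives if total_positives else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 0, 1, 1, 0, 0]  # 1 = a case the reviewers would want to see

precision, recall = precision_recall_at_k(scores, labels, k=3)
print(round(precision, 2), round(recall, 2))
```

Here two of the three reviewed items are true positives, and one positive sits below the cut. That is the trade the queue owner is actually making.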
Linear and Logistic Regression Are Not Twins
Linear regression predicts a continuous value as a weighted sum of features plus a bias:
```
prediction = w1 * x1 + w2 * x2 + ... + wn * xn + b
```

The weights are learned coefficients. The bias is the intercept. Training usually minimizes an error such as mean squared error, where large misses are penalized heavily.
Logistic regression uses a linear score too, but it maps that score through the logistic sigmoid function to estimate a probability for classification:
```
z = w1 * x1 + w2 * x2 + ... + wn * xn + b
probability = 1 / (1 + exp(-z))
```

Then a threshold turns the probability into a class decision. The default 0.5 threshold is only a default. It is not a law.
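One way to replace the default is to pick the threshold that minimizes expected cost on validation data. The sketch below assumes a false negative costs five times a false positive; that ratio, the logits, and the candidate grid are placeholders, not recommendations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_cost(probs, labels, threshold, cost_fp, cost_fn):
    """Total cost of the mistakes made at a given threshold."""
    cost = 0.0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp
        elif predicted == 0 and y == 1:
            cost += cost_fn
    return cost

# Hypothetical validation-set logits and labels.
logits = [-2.0, -0.5, 0.2, 1.5, -0.3, 2.5]
labels = [0, 0, 0, 1, 1, 1]
probs = [sigmoid(z) for z in logits]

candidates = [0.3, 0.4, 0.5, 0.6, 0.7]
best = min(
    candidates,
    key=lambda t: expected_cost(probs, labels, t, cost_fp=1.0, cost_fn=5.0),
)
print("chosen threshold:", best)
```

With these assumed costs the search prefers a threshold below 0.5, because missing a positive is priced higher than a false alarm.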
The scikit-learn linear models docs cover both model families, which is useful but also part of the naming trap. In review, I ask:
- Are we predicting a number or a class probability?
- Is the threshold chosen from business cost, not convenience?
- Are the coefficients interpretable after preprocessing?
- Is regularization part of the model?
Logistic regression is often a strong baseline because it is simple, fast, and inspectable. A neural network should have to beat it on the metric that matters, not only on glamour.
Neural Networks Learn Representations
A neural network is a function built from layers of learned parameters and activation functions.
The simple feed-forward picture is:
- The input layer receives features.
- Hidden layers transform those features through weighted sums and activations.
- The output layer produces a score, probability, class distribution, or continuous value.
- A loss function compares the output with the target.
- Training updates the weights and biases to reduce that loss.
Deep neural networks are powerful because intermediate layers can learn representations that are hard to hand-engineer. That is why they work well for images, audio, language, and other high-dimensional patterns.
The tradeoff is review difficulty. A linear model's coefficients are easier to inspect. A deep model may perform better but require more careful evaluation, calibration checks, interpretability tooling, drift monitoring, and operational fallback.
Do not ask only whether a neural network is "more accurate." Ask what complexity it buys and how the team will notice when the learned representation stops matching production data.
Activations, Sigmoid, Softmax, and Logits
An activation function adds non-linearity. Without non-linear activations, stacked linear layers collapse into another linear transformation.
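The collapse is easy to verify with two small weight matrices. This sketch omits biases for brevity and uses arbitrary numbers; the helper names are mine.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def relu(v):
    return [max(0.0, x) for x in v]

W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

two_linear_layers = matvec(W2, matvec(W1, x))
one_merged_layer = matvec(matmul(W2, W1), x)   # a single equivalent layer
with_relu = matvec(W2, relu(matvec(W1, x)))    # the non-linearity breaks the equivalence

print(two_linear_layers, one_merged_layer, with_relu)
```

The first two results match, so the second layer added parameters but no expressive power. Only the ReLU version computes something a single linear layer could not.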
Common activations have different jobs:
| Function | Common use | Output |
|---|---|---|
| ReLU | hidden layers | max(0, x), so never negative |
| Sigmoid | binary output probability | value between 0 and 1 |
| Tanh | hidden layers, some recurrent nets | value between -1 and 1 |
| Softmax | multiclass output | probabilities that sum to 1 |
The raw model score before sigmoid or softmax is often called a logit. Logits are useful for training and calibration work, but most users should not see them. A logit of 2.0 is not "twice as confident" as a logit of 1.0 in plain product language.
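Running the numbers makes the point. A small sketch of sigmoid and softmax applied to example logits (the logit values are arbitrary):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

p1 = sigmoid(1.0)  # roughly 0.73
p2 = sigmoid(2.0)  # roughly 0.88
class_probs = softmax([2.0, 1.0, 0.1])

print(round(p1, 3), round(p2, 3))
print([round(p, 3) for p in class_probs])
```

Doubling the logit from 1.0 to 2.0 moves the probability from about 0.73 to about 0.88, not to twice the value.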
Sigmoid is especially easy to overread. It maps a score into the interval (0, 1), but the output is only a trustworthy probability if the model is calibrated for the data and use case. A model can output 0.83 and still be poorly calibrated.
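A quick reliability table is one way to check this. The sketch below bins predictions and compares the mean predicted probability with the observed positive rate in each bin; the data is fabricated to show a miscalibrated model, and the bin count is an arbitrary choice.

```python
def reliability_table(probs, labels, n_bins=5):
    """Return (mean_predicted, observed_rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        table.append((mean_pred, observed, len(members)))
    return table

# Hypothetical validation predictions and outcomes.
probs = [0.82, 0.85, 0.83, 0.81, 0.12, 0.15, 0.18, 0.11]
labels = [1, 0, 0, 1, 0, 0, 0, 1]

table = reliability_table(probs, labels)
for mean_pred, observed, count in table:
    print(f"predicted {mean_pred:.2f}  observed {observed:.2f}  n={count}")
```

In this toy data the model says about 0.83 for a group where only half the cases are positive. The score still ranks cases, but it should not be read as a probability without recalibration.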
Loss, Gradient Descent, Backpropagation, and Optimizers
These terms are related, but they are not interchangeable.
| Term | Meaning | Plain review phrasing |
|---|---|---|
| Loss function | The objective minimized during training | What mistake is the model punished for? |
| Gradient | Direction and rate of loss change with respect to parameters | Which way should the parameters move? |
| Backpropagation | Efficient gradient computation through the network | How are gradients assigned across layers? |
| Gradient descent | Update rule that moves parameters against the gradient | How does the model step toward lower loss? |
| Optimizer | Concrete update algorithm | Which stepping strategy are we using? |
PyTorch's autograd tutorial describes backpropagation as the common neural-network training algorithm where parameters are adjusted using gradients of the loss. PyTorch's optimizer docs then cover the implementations that perform those updates.
SGD, Adam, Adagrad, and RMSprop are optimizers. They differ in how they use gradients, momentum-like history, and adaptive learning rates. Adam is a common default because it works well across many problems, but it is not proof that the training setup is sound.
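To keep the terms apart, it helps to see them in a model small enough to do by hand. This is a sketch of plain gradient descent on a one-feature logistic model, with the gradient derived by the chain rule (the single-layer case of what backpropagation automates). The data, learning rate, and step count are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def loss_and_grads(w, b):
    """Mean binary cross-entropy and its gradients with respect to w and b."""
    n = len(xs)
    loss, grad_w, grad_b = 0.0, 0.0, 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        grad_w += (p - y) * x  # chain rule: dLoss/dz = p - y, dz/dw = x
        grad_b += p - y        # dz/db = 1
    return loss / n, grad_w / n, grad_b / n

w, b, learning_rate = 0.0, 0.0, 0.5
initial_loss, _, _ = loss_and_grads(w, b)
for _ in range(50):
    _, grad_w, grad_b = loss_and_grads(w, b)
    w -= learning_rate * grad_w  # gradient descent: step against the gradient
    b -= learning_rate * grad_b
final_loss, _, _ = loss_and_grads(w, b)

print(round(initial_loss, 3), round(final_loss, 3), round(w, 3))
```

An optimizer such as Adam would replace only the two update lines; the loss and the gradient computation stay the same.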
The review question is alignment. Does the loss function match the operational metric? If the product cares about recall within the top 50 items of a review queue, a generic cross-entropy loss may still work, but the release gate needs ranking metrics too. If the business cares about large under-forecasts more than large over-forecasts, a symmetric regression loss may be the wrong objective.
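The quantile (pinball) loss is one common way to express that asymmetry. The sketch below compares it with squared error on a single forecast; the 0.9 quantile is an assumed business choice, not a default.

```python
def squared_error(actual, forecast):
    """Symmetric: a miss of 10 costs the same in either direction."""
    return (actual - forecast) ** 2

def pinball_loss(actual, forecast, quantile):
    """Quantile loss: a quantile above 0.5 penalizes under-forecasts more."""
    error = actual - forecast
    return quantile * error if error >= 0 else (quantile - 1.0) * error

actual = 100.0
under = pinball_loss(actual, 90.0, quantile=0.9)   # forecast too low by 10
over = pinball_loss(actual, 110.0, quantile=0.9)   # forecast too high by 10

print(squared_error(actual, 90.0), squared_error(actual, 110.0))
print(round(under, 2), round(over, 2))
```

Squared error scores both misses identically, while the pinball loss at 0.9 charges the under-forecast about nine times as much.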
Weights, Biases, Parameters, and Hyperparameters
Weights and biases are learned parameters.
Hyperparameters are chosen outside the normal parameter-learning process. Examples include learning rate, batch size, regularization strength, number of trees, tree depth, number of layers, hidden dimension, dropout rate, and training epochs.
This difference matters for accountability.
If a weight is wrong, the model learned it from data under the given objective. If the learning rate is wrong, the team configured the training process poorly. If the regularization strength (often written lambda) is too high, the team forced the model to be too simple. If tree depth is too high, the team may have invited memorization.
The model learns parameters. The team owns hyperparameters.
Learning Rate and Convergence
Learning rate controls how large each optimization step is.
Too high, and training can overshoot the minimum, oscillate, or diverge. Too low, and training can crawl, stall on plateaus, or waste compute without reaching a useful solution. Schedules and adaptive optimizers can help, but they do not remove the need to inspect training curves.
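A one-dimensional quadratic is enough to reproduce all three behaviors. The function, starting point, and learning rates below are chosen only to make the contrast visible.

```python
def run_gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(w) = w**2, whose gradient is 2 * w and whose minimum is w = 0."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w

too_high = run_gradient_descent(learning_rate=1.1)     # each step overshoots further
too_low = run_gradient_descent(learning_rate=0.001)    # barely moves in 20 steps
reasonable = run_gradient_descent(learning_rate=0.1)   # settles near the minimum

print(round(too_high, 2), round(too_low, 2), round(reasonable, 4))
```

Real loss surfaces are not this tidy, which is why the curves matter more than any rule of thumb about the right value.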
Convergence means training has approached a stable region where loss or model parameters stop changing meaningfully. It does not automatically mean the model is good. A model can converge to a bad solution. A model can converge on training data while validation performance gets worse.
I want to see at least:
- training loss curve,
- validation loss or metric curve,
- learning-rate schedule,
- early stopping rule,
- final checkpoint selection rule.
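The last two items can be written down as an explicit rule rather than left to judgment. A minimal sketch of patience-based early stopping, with invented loss curves and a hypothetical helper name:

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

# Training loss keeps falling while validation loss turns upward after epoch 2.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_losses = [0.95, 0.70, 0.62, 0.64, 0.69, 0.75]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print("keep checkpoint from epoch", best_epoch, "and stop at epoch", stop_epoch)
```

The checkpoint to ship is the one from the best validation epoch, not the last one written before the job ended.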
If the only evidence is "the run finished," the team has not reviewed convergence. It has reviewed job completion.
Overfitting, Underfitting, Bias, and Variance
Overfitting means the model learned the training data too specifically. It performs well on training data and poorly on unseen data.
Underfitting means the model is too weak, too constrained, or poorly specified to learn the real pattern. It performs poorly on both training and validation data.
The bias-variance tradeoff explains the pressure between simplicity and flexibility:
| Failure | Training performance | Validation/test performance | Typical cause |
|---|---|---|---|
| High bias / underfit | poor | poor | model too simple, features weak, objective wrong |
| High variance / overfit | strong | weak | model too complex, data too small, leakage, too much tuning |
| Better fit | strong enough | strong enough | complexity matches signal and data |
TensorFlow's overfit and underfit tutorial is a useful practical reference because it connects the concept to training curves and regularization techniques.
The failure I trust least is a model that looks excellent on training data and "acceptable" on validation after weeks of manual tuning. That validation set may have quietly become part of the training process through human iteration. At that point, the final test set or a fresh backtest has to carry more weight.
Regularization Is a Complexity Budget
Regularization reduces overfitting by penalizing or constraining model complexity.
Common forms:
| Technique | What it discourages | Common effect |
|---|---|---|
| L1 regularization | large sum of absolute weights | sparse weights, implicit feature selection |
| L2 regularization | large sum of squared weights | smaller distributed weights |
| Elastic Net | L1 and L2 together | sparsity plus stability |
| Dropout | reliance on specific neurons | more robust neural representations |
| Early stopping | training past validation improvement | simpler checkpoint |
| Data augmentation | memorizing narrow examples | broader invariances |
Regularization is not a purity ritual. It is a response to a failure mode. If the model underfits, stronger regularization can make it worse. If the features are leaky, regularization may hide the leak instead of fixing it. If the label is noisy, regularization can help, but the better answer may be label repair.
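The underfitting risk is visible even in a one-weight model. The sketch below computes L1 and L2 penalties and the closed-form ridge weight for a no-intercept fit; the data is exactly y = 2x, so any shrinkage away from 2.0 is caused by the penalty alone. Function names and lambda values are illustrative.

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

def ridge_weight_1d(xs, ys, lam):
    """Closed-form weight for y ~ w * x (no intercept) with penalty lam * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x

w_none = ridge_weight_1d(xs, ys, lam=0.0)
w_mild = ridge_weight_1d(xs, ys, lam=1.0)
w_heavy = ridge_weight_1d(xs, ys, lam=100.0)

print(round(w_none, 3), round(w_mild, 3), round(w_heavy, 3))
print(l1_penalty([2.0, -3.0], lam=0.5), l2_penalty([2.0, -3.0], lam=0.5))
```

At lambda = 100 the weight falls to roughly 0.25, far from the true slope, which is what forcing the model to be too simple looks like in numbers.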
The review question is: which complexity are we trying to reduce, and what evidence says it is the right one?
Ensembles Trade Simplicity for Stability
Ensemble learning combines multiple models.
The scikit-learn ensemble guide covers the major families:
| Ensemble type | Mechanism | Example |
|---|---|---|
| Bagging | train models independently on bootstrapped samples | Random Forest |
| Boosting | train models sequentially to correct prior errors | Gradient Boosting, AdaBoost |
| Voting | combine predictions from different estimators | majority vote or averaged probability |
| Stacking | train a meta-model on base model predictions | blended classifier/regressor |
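Voting is the simplest of these to sketch. The example below shows hard voting on labels and soft voting on probabilities for three hypothetical base models; the outputs are made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Hard voting: the most common class label wins."""
    return Counter(predictions).most_common(1)[0][0]

def average_probability(probabilities):
    """Soft voting: average each model's positive-class probability."""
    return sum(probabilities) / len(probabilities)

# Hypothetical outputs from three base models for one transaction.
model_labels = ["fraud", "ok", "fraud"]
model_probs = [0.90, 0.45, 0.55]

hard = majority_vote(model_labels)
soft = average_probability(model_probs)
print(hard, round(soft, 3))
```

Soft voting only makes sense if the base models' probabilities are on comparable, reasonably calibrated scales.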
Ensembles often improve predictive performance and stability. They can also make inference heavier, debugging harder, and explanation less direct.
The tradeoff is acceptable when the operational gain is real. A fraud model that catches materially more high-risk cases may justify a boosted tree ensemble. A small internal prioritization tool may be better served by a regularized logistic regression that everyone can inspect.
Ask what the ensemble buys:
- higher accuracy,
- lower variance across splits,
- better calibration,
- robustness to feature noise,
- easier handling of non-linear interactions.
If the answer is "it is what won the notebook experiment," keep digging.
Embeddings Are Learned Coordinate Systems
Embeddings map discrete or high-dimensional objects into dense vectors.
Words, documents, products, users, categories, images, and code snippets can all be represented as embeddings. The promise is that useful relationships become geometric: similar things land near each other, directions encode patterns, and downstream models can consume dense vectors instead of sparse IDs.
Embeddings are central in NLP and recommendation systems. A product recommender might learn user and item embeddings. A semantic search system might embed queries and documents into the same vector space. A tabular model might embed high-cardinality categories instead of exploding them into thousands of one-hot columns.
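The geometric claim is easy to sketch with cosine similarity. The three-dimensional vectors below are hand-written stand-ins (real embeddings are learned and much wider), and the zero-vector fallback for unseen items is just one possible policy.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical item embeddings; illustrative values only.
item_vectors = {
    "running_shoes": [0.9, 0.1, 0.0],
    "trail_shoes": [0.8, 0.2, 0.1],
    "coffee_maker": [0.0, 0.1, 0.9],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for items never seen in training

def lookup(item_id):
    return item_vectors.get(item_id, UNKNOWN)

similar = cosine_similarity(lookup("running_shoes"), lookup("trail_shoes"))
different = cosine_similarity(lookup("running_shoes"), lookup("coffee_maker"))
cold_start = cosine_similarity(lookup("running_shoes"), lookup("new_item"))

print(round(similar, 3), round(different, 3), cold_start)
```

The cold-start result is the one to look at in review: a brand-new item is similar to nothing, so whatever consumes these scores needs an explicit plan for it.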
The review questions are practical:
- What data trained the embedding model?
- Is similarity in this space the similarity the product needs?
- Are embeddings frozen or fine-tuned?
- How are new users, products, categories, or terms handled?
- How often are vectors refreshed?
- Which downstream metric proves the representation helped?
Being dense does not make an embedding self-explanatory. It needs lineage.
The Peer Review Checklist
When a team says a model is ready, I want this vocabulary to collapse into a short checklist.
| Area | Question |
|---|---|
| Decision | What action does the model change? |
| Label | Does the target measure that action honestly? |
| Features | Are all features available at prediction time? |
| Split | Does the split match the production timeline or grouping? |
| Preprocessing | Is every learned transformation fit only on training data? |
| Baseline | Did a simple model set the floor? |
| Task type | Is the model solving regression, classification, ranking, or something else? |
| Loss | Does the training objective punish the right mistakes? |
| Metrics | Do metrics reflect the release decision and class balance? |
| Tuning | Was the validation set protected from endless iteration? |
| Overfitting | Do training and validation curves show healthy generalization? |
| Regularization | Is complexity constrained for a named reason? |
| Threshold | Is the operating threshold tied to cost or capacity? |
| Artifact | Are preprocessing, model weights, metadata, and evaluation evidence versioned together? |
| Monitoring | Which feature, prediction, and outcome signals will catch drift? |
That checklist is the real glossary.
What to Do Differently
Do not let ML terminology stay at definition level.
In a model review, every term should attach to a decision. Features attach to availability and leakage. Splits attach to generalization. Regression and classification attach to the action. Loss attaches to training pressure. Metrics attach to release judgment. Regularization attaches to overfitting risk. Embeddings attach to representation lineage.
Once the words are tied to decisions, the conversation gets much cleaner.
The model is no longer "good" because the notebook score improved. It is good only if the feature contract is stable, the split is honest, the objective is aligned, the validation process survived scrutiny, the test result is meaningful, and the production workflow can notice when the assumptions stop holding.