Machine Learning Terms That Make Model Reviews Better

Outcome focus: Gave peers a review-ready vocabulary for inspecting ML systems by connecting core terms to design choices, failure modes, and release questions.

The most expensive machine learning confusion usually hides inside ordinary words.

"Feature" sounds like a column. Sometimes it is a column. Sometimes it is a derived aggregate, an embedding, a timestamped entity value, or a leaky proxy for the label.

"Accuracy" sounds like quality. Sometimes it is quality. Sometimes it is a class-imbalance trap.

"Validation" sounds like testing. It is not the final test.

"Regression" sounds like a model family. It can mean a task type, a linear method, or the confusingly named logistic regression, which is usually a classifier.

I have watched model reviews slow down because the team was using correct words at the wrong layer. The data scientist meant "the validation fold improved." The product owner heard "the production decision is safer." The engineer meant "the preprocessing pipeline is fit only on training data." Someone else heard "we normalized the whole dataset before the split." Those are different claims.

A glossary helps, but only if it behaves like an operating tool. The point is not to memorize terms. The point is to ask sharper review questions before a model becomes a decision system.

Google's Machine Learning Glossary is a solid reference for formal definitions. This post is narrower. It translates common ML terms into the questions I want peers to ask in a design review, experiment review, or release gate.

The Map

Most of the terms fit into one loop.

ML terminology becomes more useful when each term is attached to a lifecycle boundary.

The map matters because terms change meaning depending on where they sit. A feature in exploratory analysis is not yet a production feature. A model score on the validation set is not yet a release result. A learning rate is not learned by the model in the same way a weight is. An embedding is not magic semantic dust; it is a vector representation with training assumptions and failure modes.

The Short Operating Glossary

Use this table as the model-review version.

Term	Operational meaning	Review question
Feature	Input signal used by the model	Is it available, stable, legal, and timestamp-correct at prediction time?
Label or target	Outcome the model learns to predict	Does the label match the decision, or only a convenient measurement?
Feature engineering	Transformation from raw data to model-ready signal	Is the transformation fit only on training data and reproducible in serving?
Train set	Data used to learn parameters	Does it represent the production population without future leakage?
Validation set	Data used to tune choices	Has the team avoided overfitting the design to this set?
Test set	Final held-out evidence	Has it stayed untouched until the model and threshold were chosen?
Classification	Predicts a class or class probability	Are false positives and false negatives priced separately?
Regression	Predicts a continuous value	Is closeness of the number actually the decision, or should it be ranking/classification?
Parameter	Learned internal value	Which values are learned from data, and which are set by the team?
Hyperparameter	Configuration set before or around training	What search budget and objective justify the chosen value?
Loss	Training objective minimized by the optimizer	Does the loss align with the release metric?
Metric	Evaluation summary	Does the metric reflect the action the model will drive?
Regularization	Complexity penalty or training constraint	Which overfitting failure is it meant to reduce?
Embedding	Dense vector representation	What data trained the representation, and what similarity does it encode?

This table is deliberately practical. A term is not understood until the team can use it to reject a bad design.

Features Are Not Just Columns

A feature is an input signal. In a table, it might look like a column: age, monthly_spend, contract_type, support_ticket_count. In text, it might be token IDs or a sentence embedding. In images, it may be pixel values or intermediate representations learned by a convolutional network. In recommendation systems, it may be a learned user or item vector.

The review question is availability.

If a churn model uses support_ticket_count_last_7d, the team has to prove that value exists at the moment of prediction. If the production workflow scores users every morning at 8 AM, the feature cannot depend on tickets that close at 5 PM that same day unless the prediction is explicitly retrospective.
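The availability rule can be made concrete with a point-in-time feature function. This is a minimal sketch, not the pipeline behind this post: the function signature, the ticket timestamps, and the choice to count tickets by open time are all assumptions for illustration.

```python
from datetime import datetime, timedelta


def support_ticket_count_last_7d(ticket_opened_at, prediction_time):
    """Count tickets opened in the 7 days before prediction_time.

    Only events strictly before prediction_time are counted, which mirrors
    what a scoring job could actually know at that moment.
    """
    window_start = prediction_time - timedelta(days=7)
    return sum(
        1 for opened in ticket_opened_at
        if window_start <= opened < prediction_time
    )


# Hypothetical ticket history for one user.
tickets = [
    datetime(2024, 3, 1, 9, 0),    # older than the 7-day window
    datetime(2024, 3, 6, 14, 30),  # inside the window
    datetime(2024, 3, 10, 7, 45),  # inside the window, before the 8 AM run
    datetime(2024, 3, 10, 17, 0),  # same day, but after the 8 AM run
]

scoring_run = datetime(2024, 3, 10, 8, 0)
feature_value = support_ticket_count_last_7d(tickets, scoring_run)
print(feature_value)
```

The 5 PM ticket is invisible to the 8 AM run. If a training row for that morning counts it anyway, the training feature and the serving feature are not the same feature.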

Feature types matter because they imply preprocessing:

Feature type	Common handling	Risk
Numerical	impute, scale, cap outliers, transform	Unit mismatch, outliers, leakage through global scaling
Categorical	one-hot encode, ordinal encode, hash, embed	Unknown categories, fake ordering, high cardinality
Text	tokenize, embed, summarize, classify	Prompt/data drift, truncation, domain mismatch
Image	resize, normalize, augment, encode	Distribution shift from capture conditions
Time-based	windows, lags, freshness checks	Future leakage and stale features

The scikit-learn preprocessing guide is useful because it treats preprocessing as part of the estimator graph, not a loose notebook habit. That is the right instinct. Feature engineering should be attached to the model artifact and release path.

Feature Engineering Is a Contract

Feature engineering is selecting, transforming, creating, and validating features so the model can learn useful structure.

The common moves are familiar:

fill missing values with a learned imputation rule,
encode categories,
scale numerical features,
bucket continuous values,
create interaction terms,
extract text sentiment or embeddings,
aggregate behavior over time windows,
reduce dimensionality with methods such as PCA.

The mistake is doing these transformations before the split or outside the deployable artifact.

If you standardize a numerical column using the mean and standard deviation of the entire dataset before splitting, the scaler has already seen the validation and test distributions, and everything trained downstream inherits that information. The model may not see labels directly, but the evaluation is still contaminated. The same applies to imputation, feature selection, target encoding, PCA, and any transformation that learns from data.

Here is the minimal shape I expect in tabular reviews:

split-then-fit-preprocessing.py

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
 
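# Assumption: X is a pandas DataFrame that contains the columns listed below,
# and y is the matching binary label (for example, churned or not).
numeric_features = ["monthly_spend", "support_ticket_count"]
categorical_features = ["contract_type", "region"]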
 
# Hold out 20% as the final test set before any fitting happens.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    stratify=y,
    random_state=42,
)
 
# 25% of the remaining 80% gives a 60/20/20 train/validation/test split.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val,
    y_train_val,
    test_size=0.25,
    stratify=y_train_val,
    random_state=42,
)
 
preprocess = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            numeric_features,
        ),
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore"),
            categorical_features,
        ),
    ]
)
 
model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]
)
 
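# Fit preprocessing and classifier together, using training rows only.
model.fit(X_train, y_train)
# Score the validation rows; column 1 is the positive-class probability.
val_scores = model.predict_proba(X_val)[:, 1]
print(average_precision_score(y_val, val_scores))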
 
# Use X_test only after the model, threshold, and preprocessing choices are fixed.

The important detail is not the specific model. It is that the transformation graph is fit on X_train only, evaluated on X_val, and saved with the classifier as a single production artifact.

Train, Validation, and Test Are Different Promises

The train/validation/test split is not administrative. It protects the team from believing its own experiments too early.

The scikit-learn cross-validation docs make the core warning plain: testing a model on the same data used to learn it is a methodological mistake because a model can memorize seen samples and fail on unseen data.

The three-way split exists because model development has multiple decisions:

Split	Used for	Do not use it for
Training	learning weights, tree splits, embeddings, imputation values	claiming final generalization
Validation	tuning hyperparameters, thresholds, feature sets, architecture	open-ended metric fishing that nobody accounts for
Test	final unbiased estimate after decisions are fixed	model selection, threshold shopping, narrative rescue

Cross-validation helps when data is scarce, but it does not remove the need for final evidence. It spreads validation across folds so model selection is less dependent on one lucky split. The final test set should still stay out of the tuning loop.
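To show what "spreads validation across folds" means mechanically, here is a minimal k-fold index sketch. It assumes the rows are already shuffled, ignores stratification, and would run on the development data only, with the test set held out beforehand. The helper name is hypothetical; in practice a library splitter does this job.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold, in order."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=10, n_folds=3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Every row is used for validation exactly once and never sits in its own training fold. Any preprocessing still has to be refit inside each fold, or the leak comes back.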

For time-dependent data, random splits are often wrong. A churn model, demand forecast, fraud model, or pricing model usually needs a time-aware split. If tomorrow's pattern leaks into yesterday's training row, the evaluation is theater.
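A time-aware split can be as simple as two cutoff dates. The sketch below assumes each row carries an `as_of` date (a hypothetical field name) marking when its features were observed; the rows and cutoffs are made up.

```python
from datetime import date


def time_split(rows, train_end, val_end):
    """Split rows chronologically by their 'as_of' date.

    train:      as_of <  train_end
    validation: train_end <= as_of < val_end
    test:       as_of >= val_end
    """
    train = [r for r in rows if r["as_of"] < train_end]
    val = [r for r in rows if train_end <= r["as_of"] < val_end]
    test = [r for r in rows if r["as_of"] >= val_end]
    return train, val, test


# Hypothetical monthly snapshots for one year.
rows = [{"as_of": date(2024, month, 1), "churned": month % 2} for month in range(1, 13)]

train, val, test = time_split(rows, date(2024, 9, 1), date(2024, 11, 1))
print(len(train), len(val), len(test))
```

The split only guarantees ordering between sets. Features and labels inside each row still have to respect their own observation time.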

Classification and Regression Are Decision Shapes

Classification predicts a category or class probability. Regression predicts a continuous value.

The easy examples are simple:

Task	Shape	Example output
Spam detection	binary classification	`spam` or `not_spam`
Ticket routing	multiclass classification	`billing`, `technical`, `account`
House price estimate	regression	`$485000`
Delivery time estimate	regression	`43 minutes`

The hard cases are decision-shaped.

An NPS score is numeric, but the business may only need to identify likely detractors for outreach. A fraud score is probabilistic, but the business may operate it as a ranked review queue. A demand forecast is regression, but the operational question may be whether inventory crosses a reorder threshold.

Task type should follow the action. If the action is "estimate the value," use regression language. If the action is "choose a class," use classification language. If the action is "rank a work queue," report ranking and threshold behavior instead of pretending the hard class label is the whole product.

Linear and Logistic Regression Are Not Twins

Linear regression predicts a continuous value as a weighted sum of features plus a bias:

prediction = w1 * x1 + w2 * x2 + ... + wn * xn + b

The weights are learned coefficients. The bias is the intercept. Training usually minimizes an error such as mean squared error, which squares each miss so that large errors are penalized far more than small ones.

Logistic regression uses a linear score too, but it maps that score through the logistic sigmoid function to estimate a probability for classification:

z = w1 * x1 + w2 * x2 + ... + wn * xn + b
probability = 1 / (1 + exp(-z))

Then a threshold turns the probability into a class decision. The default 0.5 threshold is only a default. It is not a law.
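Here is a small sketch of the score, the sigmoid, and a threshold chosen from cost instead of habit. The validation probabilities, labels, and the 1:5 cost ratio are invented for illustration; a real review would plug in the team's own costs.

```python
import math


def predict_probability(weights, bias, features):
    """Logistic regression: linear score passed through the sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def total_cost(threshold, probabilities, labels, cost_fp, cost_fn):
    """Sum the cost of false positives and false negatives at a threshold."""
    cost = 0.0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 0:
            cost += cost_fp
        elif predicted == 0 and y == 1:
            cost += cost_fn
    return cost


# Hypothetical validation scores and true labels.
val_probs = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.90]
val_labels = [0, 0, 1, 0, 1, 1, 1]

# Assumption: a missed positive costs five times as much as a false alarm.
candidates = [i / 20 for i in range(1, 20)]
best_threshold = min(
    candidates,
    key=lambda t: total_cost(t, val_probs, val_labels, cost_fp=1.0, cost_fn=5.0),
)

print(best_threshold)
print(total_cost(0.5, val_probs, val_labels, cost_fp=1.0, cost_fn=5.0))
print(total_cost(best_threshold, val_probs, val_labels, cost_fp=1.0, cost_fn=5.0))
```

On this toy data the cheapest threshold sits well below 0.5, because missing a positive is priced higher than raising a false alarm. The threshold should be picked on validation data and then frozen before the test set is touched.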

The scikit-learn linear models docs cover both model families, which is useful but also part of the naming trap. In review, I ask:

Are we predicting a number or a class probability?
Is the threshold chosen from business cost, not convenience?
Are the coefficients interpretable after preprocessing?
Is regularization part of the model?

Logistic regression is often a strong baseline because it is simple, fast, and inspectable. A neural network should have to beat it on the metric that matters, not only on glamour.

Neural Networks Learn Representations

A neural network is a function built from layers of learned parameters and activation functions.

The simple feed-forward picture is:

The input layer receives features.
Hidden layers transform those features through weighted sums and activations.
The output layer produces a score, probability, class distribution, or continuous value.
A loss function compares the output with the target.
Training updates the weights and biases to reduce that loss.

Deep neural networks are powerful because intermediate layers can learn representations that are hard to hand-engineer. That is why they work well for images, audio, language, and other high-dimensional patterns.

The tradeoff is review difficulty. A linear model's coefficients are easier to inspect. A deep model may perform better but require more careful evaluation, calibration checks, interpretability tooling, drift monitoring, and operational fallback.

Do not ask only whether a neural network is "more accurate." Ask what complexity it buys and how the team will notice when the learned representation stops matching production data.

Activations, Sigmoid, Softmax, and Logits

An activation function adds non-linearity. Without non-linear activations, stacked linear layers collapse into another linear transformation.
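The collapse claim is easy to check by hand. This sketch uses two small made-up weight matrices and omits biases for brevity; it compares two stacked linear layers against the single matrix product, then adds a ReLU in between.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu(vector):
    return [max(0.0, v) for v in vector]


# Hypothetical weights for two layers, no biases.
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

two_linear_layers = matvec(W2, matvec(W1, x))
one_combined_layer = matvec(matmul(W2, W1), x)
with_relu = matvec(W2, relu(matvec(W1, x)))

print(two_linear_layers, one_combined_layer, with_relu)
```

The first two results match, so the second layer added no expressive power. With the ReLU in between, the output can no longer be reproduced by one matrix for all inputs.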

Common activations have different jobs:

Function	Common use	Output
ReLU	hidden layers	`max(0, x)`, so never negative
Sigmoid	binary output probability	value between 0 and 1
Tanh	hidden layers, some recurrent nets	value between -1 and 1
Softmax	multiclass output	probabilities that sum to 1

The raw model score before sigmoid or softmax is often called a logit. Logits are useful for training and calibration work, but most users should not see them. A logit of 2.0 is not "twice as confident" as a logit of 1.0 in plain product language.

Sigmoid is especially easy to overread. It maps a score into the interval (0, 1), but the output is only a trustworthy probability if the model is calibrated for the data and use case. A model can output 0.83 and still be poorly calibrated.
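A few lines make the logit point concrete. This is a plain-Python sketch of sigmoid and softmax with made-up logits; it says nothing about calibration, which needs real outcomes to check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    # Subtract the max logit for numerical stability; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


p1 = sigmoid(1.0)  # roughly 0.73
p2 = sigmoid(2.0)  # roughly 0.88
print(p1, p2, p2 / p1)

class_probs = softmax([2.0, 1.0, 0.1])
print(class_probs, sum(class_probs))
```

Doubling the logit moved the probability from about 0.73 to about 0.88, not to double. Softmax outputs sum to 1 by construction, which is a property of the function, not evidence that the model is right.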

Loss, Gradient Descent, Backpropagation, and Optimizers

These terms are related, but they are not interchangeable.

Term	Meaning	Plain review phrasing
Loss function	The objective minimized during training	What mistake is the model punished for?
Gradient	Direction and rate of loss change with respect to parameters	Which way should the parameters move?
Backpropagation	Efficient gradient computation through the network	How are gradients assigned across layers?
Gradient descent	Update rule that moves parameters against the gradient	How does the model step toward lower loss?
Optimizer	Concrete update algorithm	Which stepping strategy are we using?

PyTorch's autograd tutorial describes backpropagation as the common neural-network training algorithm where parameters are adjusted using gradients of the loss. PyTorch's optimizer docs then cover the implementations that perform those updates.

SGD, Adam, Adagrad, and RMSprop are optimizers. They differ in how they use gradients, momentum-like history, and adaptive learning rates. Adam is a common default because it works well across many problems, but it is not proof that the training setup is sound.
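The terms in the table fit into one loop. The sketch below is plain full-batch gradient descent on a one-feature linear model with mean squared error, written by hand so the loss, the gradient, and the update rule are each visible. The data and learning rate are invented; a framework would compute the gradients with backpropagation and hand the update to an optimizer.

```python
def mse_and_gradients(w, b, xs, ys):
    """Return the MSE loss and its gradients with respect to w and b."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errors) / n
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = 2.0 * sum(errors) / n
    return loss, grad_w, grad_b


# Toy data generated from y = 2x + 1, with no noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0          # parameters: learned
learning_rate = 0.05     # hyperparameter: chosen by the team
losses = []

for step in range(2000):
    loss, grad_w, grad_b = mse_and_gradients(w, b, xs, ys)
    losses.append(loss)
    # Gradient descent: move each parameter against its gradient.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, losses[0], losses[-1])
```

The loop recovers a weight near 2 and a bias near 1. Swapping in Adam or RMSprop changes only the two update lines, which is why the optimizer choice is separate from the loss choice.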

The review question is alignment. Does the loss function match the operational metric? If the product cares about recall within the top 50 items of a review queue, a generic cross-entropy loss may still work, but the release gate needs ranking metrics too. If the business cares about large under-forecasts more than large over-forecasts, a symmetric regression loss may be the wrong objective.
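A ranking metric for the release gate does not need much code. This is a minimal sketch of recall within the top k scored items; the function name, scores, and labels are made up for illustration.

```python
def recall_at_k(scores, labels, k):
    """Fraction of all positives that land in the k highest-scored items."""
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:k])
    return found / total_positives


# Hypothetical review queue: model scores and whether each case was truly positive.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 0, 1, 0, 1, 0, 0, 1]

for k in (3, 5, 8):
    print(k, recall_at_k(scores, labels, k))
```

If reviewers can only work the top 3 cases, half of the positives in this toy queue are never seen, whatever the cross-entropy loss says.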

Weights, Biases, Parameters, and Hyperparameters

Weights and biases are learned parameters.

Hyperparameters are chosen outside the normal parameter-learning process. Examples include learning rate, batch size, regularization strength, number of trees, tree depth, number of layers, hidden dimension, dropout rate, and training epochs.

This difference matters for accountability.

If a weight is wrong, the model learned it from data under the given objective. If the learning rate is wrong, the team configured the training process poorly. If the regularization strength (often written as lambda) is too high, the team forced the model to be too simple. If tree depth is too high, the team may have invited memorization.

The model learns parameters. The team owns hyperparameters.

Learning Rate and Convergence

Learning rate controls how large each optimization step is.

Too high, and training can overshoot the minimum, oscillate, or diverge. Too low, and training can crawl, stall on plateaus, or waste compute without reaching a useful solution. Schedules and adaptive optimizers can help, but they do not remove the need to inspect training curves.
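The three regimes show up even on the simplest possible loss. This sketch runs gradient descent on f(x) = x², a toy objective chosen only because its behavior is easy to reason about; the specific learning rates are illustrative.

```python
def minimize_quadratic(learning_rate, steps=50, start=10.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2 * x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x


too_high = minimize_quadratic(1.1)     # each step overshoots and grows
too_low = minimize_quadratic(0.001)    # barely moves in 50 steps
reasonable = minimize_quadratic(0.1)   # shrinks steadily toward 0

print(too_high, too_low, reasonable)
```

Each update multiplies x by (1 - 2 * learning_rate), so a rate of 1.1 flips the sign and grows the magnitude every step, while 0.001 shrinks it by only 0.2% per step. Real loss surfaces are messier, which is why the curves still need to be inspected.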

Convergence means training has approached a stable region where loss or model parameters stop changing meaningfully. It does not automatically mean the model is good. A model can converge to a bad solution. A model can converge on training data while validation performance gets worse.

I want to see at least:

training loss curve,
validation loss or metric curve,
learning-rate schedule,
early stopping rule,
final checkpoint selection rule.

If the only evidence is "the run finished," the team has not reviewed convergence. It has reviewed job completion.
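An early stopping rule and a checkpoint selection rule can both be stated in a few lines, which makes them reviewable. The sketch below is one common patience-based variant, not the only one; the function name, the `min_delta` parameter, and the loss values are assumptions.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a patience-based rule.

    Training stops once `patience` epochs pass without the validation loss
    improving on the best value by more than `min_delta`. The checkpoint to
    keep is the one from best_epoch, not the last one.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses, one per epoch.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.61, 0.65]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

Here the run stops at epoch 6 and the checkpoint from epoch 3 is the one to ship. Writing the rule down also exposes that the validation set is steering training, which is one more reason the test set has to stay separate.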

Overfitting, Underfitting, Bias, and Variance

Overfitting means the model learned the training data too specifically. It performs well on training data and poorly on unseen data.

Underfitting means the model is too weak, too constrained, or poorly specified to learn the real pattern. It performs poorly on both training and validation data.

The bias-variance tradeoff explains the pressure between simplicity and flexibility:

Failure	Training performance	Validation/test performance	Typical cause
High bias / underfit	poor	poor	model too simple, features weak, objective wrong
High variance / overfit	strong	weak	model too complex, data too small, leakage, too much tuning
Better fit	strong enough	strong enough	complexity matches signal and data

TensorFlow's overfit and underfit tutorial is a useful practical reference because it connects the concept to training curves and regularization techniques.

The failure I trust least is a model that looks excellent on training data and "acceptable" on validation after weeks of manual tuning. That validation set may have quietly become part of the training process through human iteration. At that point, the final test set or a fresh backtest has to carry more weight.

Regularization Is a Complexity Budget

Regularization reduces overfitting by penalizing or constraining model complexity.

Common forms:

Technique	What it discourages	Common effect
L1 regularization	sum of absolute weights	sparse weights, implicit feature selection
L2 regularization	sum of squared weights	smaller distributed weights
Elastic Net	L1 and L2 together	sparsity plus stability
Dropout	reliance on specific neurons	more robust neural representations
Early stopping	training past validation improvement	simpler checkpoint
Data augmentation	memorizing narrow examples	broader invariances

Regularization is not a purity ritual. It is a response to a failure mode. If the model underfits, stronger regularization can make it worse. If the features are leaky, regularization may hide the leak instead of fixing it. If the label is noisy, regularization can help, but the better answer may be label repair.
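To see the budget at work, take the smallest case: one feature, no intercept, squared error plus an L2 penalty. Minimizing sum((w*x - y)²) + strength * w² gives w = sum(x*y) / (sum(x²) + strength). The sketch below uses that closed form on made-up data, alongside the two penalty terms from the table.

```python
def l1_penalty(weights, strength):
    """L1 term: strength times the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """L2 term: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def ridge_weight(xs, ys, strength):
    """Closed-form minimizer of sum((w*x - y)**2) + strength * w**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + strength)


# Toy data where the unregularized answer is exactly w = 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

for strength in [0.0, 1.0, 10.0, 100.0]:
    print(strength, ridge_weight(xs, ys, strength))

print(l1_penalty([0.5, -2.0, 0.0], 0.1), l2_penalty([0.5, -2.0, 0.0], 0.1))
```

With no penalty the weight is 2. As the strength grows the weight shrinks toward 0, even though the data never changed. That is the underfitting risk in miniature: past some point the penalty, not the signal, decides the answer.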

The review question is: which complexity are we trying to reduce, and what evidence says it is the right one?

Ensembles Trade Simplicity for Stability

Ensemble learning combines multiple models.

The scikit-learn ensemble guide covers the major families:

Ensemble type	Mechanism	Example
Bagging	train models independently on bootstrapped samples	Random Forest
Boosting	train models sequentially to correct prior errors	Gradient Boosting, AdaBoost
Voting	combine predictions from different estimators	majority vote or averaged probability
Stacking	train a meta-model on base model predictions	blended classifier/regressor

Ensembles often improve predictive performance and stability. They can also make inference heavier, debugging harder, and explanation less direct.
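The two voting styles in the table can disagree, which is worth knowing before debugging one. This sketch combines three hypothetical model outputs for a single case; the probabilities and helper names are invented.

```python
from collections import Counter


def hard_vote(class_predictions):
    """Majority vote over predicted class labels."""
    return Counter(class_predictions).most_common(1)[0][0]


def soft_vote(positive_probabilities, threshold=0.5):
    """Average the positive-class probabilities, then apply a threshold."""
    mean_probability = sum(positive_probabilities) / len(positive_probabilities)
    label = 1 if mean_probability >= threshold else 0
    return label, mean_probability


# Hypothetical positive-class probabilities from three base models.
model_probabilities = [0.45, 0.48, 0.95]
hard_labels = [1 if p >= 0.5 else 0 for p in model_probabilities]

hard_result = hard_vote(hard_labels)
soft_result, mean_probability = soft_vote(model_probabilities)
print(hard_result, soft_result, mean_probability)
```

Two lukewarm "no" votes beat one confident "yes" under majority voting, while averaging lets the confident model win. Neither is correct by default; it depends on whether the base models' probabilities deserve to be trusted.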

The tradeoff is acceptable when the operational gain is real. A fraud model that catches materially more high-risk cases may justify a boosted tree ensemble. A small internal prioritization tool may be better served by a regularized logistic regression that everyone can inspect.

Ask what the ensemble buys:

higher accuracy,
lower variance across splits,
better calibration,
robustness to feature noise,
easier handling of non-linear interactions.

If the answer is "it is what won the notebook experiment," keep digging.

Embeddings Are Learned Coordinate Systems

Embeddings map discrete or high-dimensional objects into dense vectors.

Words, documents, products, users, categories, images, and code snippets can all be represented as embeddings. The promise is that useful relationships become geometric: similar things land near each other, directions encode patterns, and downstream models can consume dense vectors instead of sparse IDs.

Embeddings are central in NLP and recommendation systems. A product recommender might learn user and item embeddings. A semantic search system might embed queries and documents into the same vector space. A tabular model might embed high-cardinality categories instead of exploding them into thousands of one-hot columns.
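"Similar things land near each other" usually means a distance or similarity function over vectors. The sketch below uses cosine similarity on hand-written 3-dimensional vectors. Real embeddings are learned and far wider; these numbers and item names are invented to show the mechanics only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(item, table):
    """Return the other item whose vector is most similar to `item`."""
    others = [name for name in table if name != item]
    return max(others, key=lambda name: cosine_similarity(table[item], table[name]))


# Hypothetical item embeddings.
embeddings = {
    "running_shoes": [0.9, 0.1, 0.0],
    "trail_shoes": [0.8, 0.3, 0.1],
    "espresso_machine": [0.0, 0.2, 0.9],
}

print(nearest("running_shoes", embeddings))
print(cosine_similarity(embeddings["running_shoes"], embeddings["espresso_machine"]))
```

The lookup is trivial. The hard part is whether the geometry was trained on data that makes "near" mean what the product needs.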

The review questions are practical:

What data trained the embedding model?
Is similarity in this space the similarity the product needs?
Are embeddings frozen or fine-tuned?
How are new users, products, categories, or terms handled?
How often are vectors refreshed?
Which downstream metric proves the representation helped?

Being dense does not make an embedding self-explanatory. It needs lineage.

The Peer Review Checklist

When a team says a model is ready, I want this vocabulary to collapse into a short checklist.

Area	Question
Decision	What action does the model change?
Label	Does the target measure that action honestly?
Features	Are all features available at prediction time?
Split	Does the split match the production timeline or grouping?
Preprocessing	Is every learned transformation fit only on training data?
Baseline	Did a simple model set the floor?
Task type	Is the model solving regression, classification, ranking, or something else?
Loss	Does the training objective punish the right mistakes?
Metrics	Do metrics reflect the release decision and class balance?
Tuning	Was the validation set protected from endless iteration?
Overfitting	Do training and validation curves show healthy generalization?
Regularization	Is complexity constrained for a named reason?
Threshold	Is the operating threshold tied to cost or capacity?
Artifact	Are preprocessing, model weights, metadata, and evaluation evidence versioned together?
Monitoring	Which feature, prediction, and outcome signals will catch drift?

That checklist is the real glossary.

What to Do Differently

Do not let ML terminology stay at definition level.

In a model review, every term should attach to a decision. Features attach to availability and leakage. Splits attach to generalization. Regression and classification attach to the action. Loss attaches to training pressure. Metrics attach to release judgment. Regularization attaches to overfitting risk. Embeddings attach to representation lineage.

Once the words are tied to decisions, the conversation gets much cleaner.

The model is no longer "good" because the notebook score improved. It is good only if the feature contract is stable, the split is honest, the objective is aligned, the validation process survived scrutiny, the test result is meaningful, and the production workflow can notice when the assumptions stop holding.