Machine Learning Terms That Make Model Reviews Better

A practical ML terminology guide for model reviews where feature definitions, data splits, task type, optimization behavior, overfitting risk, regularization, ensembles, and embeddings need to be discussed precisely.

By Jovani Pink | February 4, 2026 | 18 min | Platform & AI Engineering

Outcome focus: Gave peers a review-ready vocabulary for inspecting ML systems by connecting core terms to design choices, failure modes, and release questions.

The most expensive machine learning confusion usually hides inside ordinary words.

"Feature" sounds like a column. Sometimes it is a column. Sometimes it is a derived aggregate, an embedding, a timestamped entity value, or a leaky proxy for the label.

"Accuracy" sounds like quality. Sometimes it is quality. Sometimes it is a class-imbalance trap.

"Validation" sounds like testing. It is not the final test.

"Regression" sounds like a model family. It can mean a task type, a linear method, or the confusingly named logistic regression, which is usually a classifier.

I have watched model reviews slow down because the team was using correct words at the wrong layer. The data scientist meant "the validation fold improved." The product owner heard "the production decision is safer." The engineer meant "the preprocessing pipeline is fit only on training data." Someone else heard "we normalized the whole dataset before the split." Those are different claims.

A glossary helps, but only if it behaves like an operating tool. The point is not to memorize terms. The point is to ask sharper review questions before a model becomes a decision system.

Google's Machine Learning Glossary is a solid reference for formal definitions. This post is narrower. It translates common ML terms into the questions I want peers to ask in a design review, experiment review, or release gate.

The Map

Most of the terms fit into one loop.

Figure: ML terminology becomes more useful when each term is attached to a lifecycle boundary.

The map matters because terms change meaning depending on where they sit. A feature in exploratory analysis is not yet a production feature. A model score on the validation set is not yet a release result. A learning rate is not learned by the model in the same way a weight is. An embedding is not magic semantic dust; it is a vector representation with training assumptions and failure modes.

The Short Operating Glossary

Use this table as the model-review version.

| Term | Operational meaning | Review question |
|---|---|---|
| Feature | Input signal used by the model | Is it available, stable, legal, and timestamp-correct at prediction time? |
| Label or target | Outcome the model learns to predict | Does the label match the decision, or only a convenient measurement? |
| Feature engineering | Transformation from raw data to model-ready signal | Is the transformation fit only on training data and reproducible in serving? |
| Train set | Data used to learn parameters | Does it represent the production population without future leakage? |
| Validation set | Data used to tune choices | Has the team avoided overfitting the design to this set? |
| Test set | Final held-out evidence | Has it stayed untouched until the model and threshold were chosen? |
| Classification | Predicts a class or class probability | Are false positives and false negatives priced separately? |
| Regression | Predicts a continuous value | Is closeness of the number actually the decision, or should it be ranking/classification? |
| Parameter | Learned internal value | Which values are learned from data, and which are set by the team? |
| Hyperparameter | Configuration set before or around training | What search budget and objective justify the chosen value? |
| Loss | Training objective minimized by the optimizer | Does the loss align with the release metric? |
| Metric | Evaluation summary | Does the metric reflect the action the model will drive? |
| Regularization | Complexity penalty or training constraint | Which overfitting failure is it meant to reduce? |
| Embedding | Dense vector representation | What data trained the representation, and what similarity does it encode? |

This table is deliberately practical. A term is not understood until the team can use it to reject a bad design.

Features Are Not Just Columns

A feature is an input signal. In a table, it might look like a column: age, monthly_spend, contract_type, support_ticket_count. In text, it might be token IDs or a sentence embedding. In images, it may be pixel values or intermediate representations learned by a convolutional network. In recommendation systems, it may be a learned user or item vector.

The review question is availability.

If a churn model uses support_ticket_count_last_7d, the team has to prove that value exists at the moment of prediction. If the production workflow scores users every morning at 8 AM, the feature cannot depend on tickets that close at 5 PM that same day unless the prediction is explicitly retrospective.
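A minimal sketch of what "available at prediction time" can look like in code. The ticket table, its column names, and the helper function are hypothetical stand-ins, not part of any real pipeline; the only point is that the feature is computed relative to an explicit scoring timestamp.

```python
import pandas as pd

# Hypothetical ticket log; table and column names are illustrative assumptions.
tickets = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2],
        "created_at": pd.to_datetime(
            [
                "2026-01-20 09:00",  # older than 7 days at scoring time
                "2026-02-03 14:00",  # inside the window
                "2026-02-04 17:00",  # after the 8 AM scoring run, not yet known
                "2026-02-02 10:00",  # inside the window
            ]
        ),
    }
)


def support_ticket_count_last_7d(tickets: pd.DataFrame, as_of: pd.Timestamp) -> pd.Series:
    """Count tickets per user in the 7 days before `as_of`, ignoring anything later."""
    window_start = as_of - pd.Timedelta(days=7)
    in_window = (tickets["created_at"] >= window_start) & (tickets["created_at"] < as_of)
    return tickets[in_window].groupby("user_id").size()


scoring_time = pd.Timestamp("2026-02-04 08:00")
counts = support_ticket_count_last_7d(tickets, scoring_time)
print(counts.to_dict())
```

The 5 PM ticket is excluded because it does not exist yet when the 8 AM job runs. Training rows should be built with the same `as_of` logic, one timestamp per historical scoring run.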

Feature types matter because they imply preprocessing:

| Feature type | Common handling | Risk |
|---|---|---|
| Numerical | impute, scale, cap outliers, transform | Unit mismatch, outliers, leakage through global scaling |
| Categorical | one-hot encode, ordinal encode, hash, embed | Unknown categories, fake ordering, high cardinality |
| Text | tokenize, embed, summarize, classify | Prompt/data drift, truncation, domain mismatch |
| Image | resize, normalize, augment, encode | Distribution shift from capture conditions |
| Time-based | windows, lags, freshness checks | Future leakage and stale features |

The scikit-learn preprocessing guide is useful because it treats preprocessing as part of the estimator graph, not a loose notebook habit. That is the right instinct. Feature engineering should be attached to the model artifact and release path.

Feature Engineering Is a Contract

Feature engineering is selecting, transforming, creating, and validating features so the model can learn useful structure.

The common moves are familiar:

  • fill missing values with a learned imputation rule,
  • encode categories,
  • scale numerical features,
  • bucket continuous values,
  • create interaction terms,
  • extract text sentiment or embeddings,
  • aggregate behavior over time windows,
  • reduce dimensionality with methods such as PCA.

The mistake is fitting these transformations before the split, or keeping them outside the deployable artifact.

If you standardize a numerical column using the mean and standard deviation of the entire dataset before splitting, the scaled training features already encode statistics from the test distribution. The model may not see test labels directly, but the evaluation is still contaminated. The same applies to imputation, feature selection, target encoding, PCA, and any other transformation that learns from data.
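To make the contamination concrete, here is a small sketch with invented numbers. The last two values play the role of held-out rows; everything else about the setup is an assumption for illustration.

```python
import numpy as np

# Toy column: the last two rows stand in for held-out test data.
values = np.array([10.0, 12.0, 11.0, 13.0, 50.0, 55.0])
train, test = values[:4], values[4:]

# Leaky: statistics computed on everything, including the test rows.
leaky_mean, leaky_std = values.mean(), values.std()

# Honest: statistics computed on the training rows only.
train_mean, train_std = train.mean(), train.std()

print(f"leaky mean={leaky_mean:.2f}, train-only mean={train_mean:.2f}")

# The same training rows end up with different scaled values depending on
# which statistics were used, so the test rows shaped what the model sees.
leaky_train_scaled = (train - leaky_mean) / leaky_std
honest_train_scaled = (train - train_mean) / train_std

# Test rows are transformed with the training statistics, never their own.
honest_test_scaled = (test - train_mean) / train_std
```

The gap between the two means is exaggerated here on purpose. In real data the shift is usually smaller, which is exactly why it goes unnoticed.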

Here is the minimal shape I expect in tabular reviews:

split-then-fit-preprocessing.py

```python
# Assumes X is a pandas DataFrame that contains the columns listed below and
# y is a binary label aligned with X. Neither is defined in this snippet.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["monthly_spend", "support_ticket_count"]
categorical_features = ["contract_type", "region"]

# Hold out 20% of the rows as the final test set.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X,
    y,
    test_size=0.20,
    stratify=y,
    random_state=42,
)

# Split the remaining 80% into 60% train and 20% validation (0.25 * 0.80 = 0.20).
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val,
    y_train_val,
    test_size=0.25,
    stratify=y_train_val,
    random_state=42,
)

# Preprocessing lives inside the pipeline, so it is fit only on rows passed to fit().
preprocess = ColumnTransformer(
    transformers=[
        (
            "num",
            Pipeline(
                steps=[
                    ("imputer", SimpleImputer(strategy="median")),
                    ("scaler", StandardScaler()),
                ]
            ),
            numeric_features,
        ),
        (
            "cat",
            OneHotEncoder(handle_unknown="ignore"),
            categorical_features,
        ),
    ]
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        ("classifier", LogisticRegression(max_iter=1000, class_weight="balanced")),
    ]
)

# Imputer, scaler, encoder, and classifier all learn from training rows only.
model.fit(X_train, y_train)

# Validation rows are transformed with the training statistics, then scored.
val_scores = model.predict_proba(X_val)[:, 1]
print(average_precision_score(y_val, val_scores))

# Use X_test only after the model, threshold, and preprocessing choices are fixed.
```

The important detail is not the specific model. It is that the transformation graph is fit on X_train, compared on X_val, and saved as part of the production artifact.
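One way to keep preprocessing and model together is to persist the fitted pipeline as a single object. The sketch below uses joblib, which ships alongside scikit-learn; the synthetic data and the file name are assumptions for illustration, and a real release path would add versioning and metadata around it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; real training rows would come from the split.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

artifact = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
artifact.fit(X_demo, y_demo)

# Persist the fitted scaler and the classifier together as one file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_pipeline.joblib")  # hypothetical file name
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# The restored pipeline applies the same fitted scaler before predicting.
same_scores = np.allclose(
    artifact.predict_proba(X_demo), restored.predict_proba(X_demo)
)
print(same_scores)
```

If serving code re-implements the scaler by hand instead of loading it, the contract is already broken, even when the numbers happen to match today.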

Train, Validation, and Test Are Different Promises

The train/validation/test split is not administrative. It protects the team from believing its own experiments too early.

The scikit-learn cross-validation docs make the core warning plain: testing a model on the same data used to learn it is a methodological mistake because a model can memorize seen samples and fail on unseen data.

The three-way split exists because model development has multiple decisions:

| Split | Used for | Do not use it for |
|---|---|---|
| Training | learning weights, tree splits, embeddings, imputation values | claiming final generalization |
| Validation | tuning hyperparameters, thresholds, feature sets, architecture | repeated indefinite fishing without accounting for it |
| Test | final unbiased estimate after decisions are fixed | model selection, threshold shopping, narrative rescue |

Cross-validation helps when data is scarce, but it does not remove the need for final evidence. It spreads validation across folds so model selection is less dependent on one lucky split. The final test set should still stay out of the tuning loop.
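A sketch of that arrangement, assuming synthetic data and two arbitrary candidate values for the regularization setting `C`. The test rows are carved off first and never passed to the cross-validation call.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out the final test set first; it never enters the tuning loop.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Compare candidate hyperparameter values using folds of the development data.
for C in (0.01, 1.0):
    scores = cross_val_score(
        LogisticRegression(C=C, max_iter=1000),
        X_dev,
        y_dev,
        cv=5,
        scoring="average_precision",
    )
    print(f"C={C}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The per-fold spread is worth reading alongside the mean. A candidate that wins by less than the fold-to-fold standard deviation has not clearly won.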

For time-dependent data, random splits are often wrong. A churn model, demand forecast, fraud model, or pricing model usually needs a time-aware split. If tomorrow's pattern leaks into yesterday's training row, the evaluation is theater.
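A minimal cutoff-based split, assuming a hypothetical event table with an `event_date` column. The dates, labels, and cutoff are invented for illustration.

```python
import pandas as pd

# Hypothetical event table; column names and values are illustrative assumptions.
events = pd.DataFrame(
    {
        "event_date": pd.date_range("2025-01-01", periods=10, freq="7D"),
        "label": [0, 1, 0, 0, 1, 0, 1, 1, 0, 1],
    }
).sort_values("event_date")

# Rows before the cutoff train the model; rows on or after it evaluate it.
cutoff = pd.Timestamp("2025-02-15")
train = events[events["event_date"] < cutoff]
holdout = events[events["event_date"] >= cutoff]

# Guard against future rows leaking into training.
assert train["event_date"].max() < holdout["event_date"].min()
print(len(train), len(holdout))
```

For repeated backtests, scikit-learn's `TimeSeriesSplit` provides an expanding-window version of the same idea. Either way, features must also be computed as of each row's own timestamp, or the split alone does not prevent leakage.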

Classification and Regression Are Decision Shapes

Classification predicts a category or class probability. Regression predicts a continuous value.

The easy examples are simple:

| Task | Shape | Example output |
|---|---|---|
| Spam detection | binary classification | spam or not_spam |
| Ticket routing | multiclass classification | billing, technical, account |
| House price estimate | regression | $485000 |
| Delivery time estimate | regression | 43 minutes |

The hard cases are decision-shaped.

An NPS score is numeric, but the business may only need to identify likely detractors for outreach. A fraud score is probabilistic, but the business may operate it as a ranked review queue. A demand forecast is regression, but the operational question may be whether inventory crosses a reorder threshold.

Task type should follow the action. If the action is "estimate the value," use regression language. If the action is "choose a class," use classification language. If the action is "rank a work queue," report ranking and threshold behavior instead of pretending the hard class label is the whole product.
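For the ranked-queue case, one common way to report ranking behavior is precision at k. The sketch below uses invented labels and scores, and the choice of k=3 is an assumed review capacity.

```python
import numpy as np


def precision_at_k(y_true: np.ndarray, scores: np.ndarray, k: int) -> float:
    """Fraction of true positives among the k highest-scored items."""
    top_k = np.argsort(-scores)[:k]
    return float(y_true[top_k].mean())


# Toy labels and scores; values are illustrative only.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.2, 0.6, 0.4, 0.1, 0.3])

# If reviewers can only handle 3 cases a day, queue quality at k=3 is what matters.
print(precision_at_k(y_true, scores, k=3))
```

Here two of the top three cases are true positives. A global accuracy number would say nothing about that, because it also counts the cases no reviewer will ever open.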

Linear and Logistic Regression Are Not Twins

Linear regression predicts a continuous value as a weighted sum of features plus a bias:

prediction = w1 * x1 + w2 * x2 + ... + wn * xn + b

The weights are learned coefficients. The bias is the intercept. Training usually minimizes an error such as mean squared error, where large misses are penalized heavily.

Logistic regression uses a linear score too, but it maps that score through the logistic sigmoid function to estimate a probability for classification:

z = w1 * x1 + w2 * x2 + ... + wn * xn + b
probability = 1 / (1 + exp(-z))

Then a threshold turns the probability into a class decision. The default 0.5 threshold is only a default. It is not a law.
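A sketch of choosing the threshold from cost instead of accepting 0.5. The linear scores, labels, and the 1:5 cost ratio between false positives and false negatives are all assumptions made up for this example.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def total_cost(y_true, probs, threshold, cost_fp, cost_fn):
    """Cost of the class decisions produced by one threshold."""
    predicted = probs >= threshold
    false_positives = np.sum(predicted & (y_true == 0))
    false_negatives = np.sum(~predicted & (y_true == 1))
    return cost_fp * false_positives + cost_fn * false_negatives


# Toy linear scores (z) and labels; the cost figures are illustrative assumptions.
z = np.array([-2.0, -1.0, -0.5, 0.2, 0.8, 1.5])
y_true = np.array([0, 0, 1, 0, 1, 1])
probs = sigmoid(z)

thresholds = np.linspace(0.1, 0.9, 9)
costs = [total_cost(y_true, probs, t, cost_fp=1.0, cost_fn=5.0) for t in thresholds]
best = thresholds[int(np.argmin(costs))]

cost_at_default = total_cost(y_true, probs, 0.5, cost_fp=1.0, cost_fn=5.0)
print(f"cost at 0.5: {cost_at_default}, best threshold: {best:.1f}")
```

With misses priced five times higher than false alarms, the cheapest threshold on this toy data sits well below 0.5. The threshold search belongs on validation data, not on the test set.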

The scikit-learn linear models docs cover both model families, which is useful but also part of the naming trap. In review, I ask:

  • Are we predicting a number or a class probability?
  • Is the threshold chosen from business cost, not convenience?
  • Are the coefficients interpretable after preprocessing?
  • Is regularization part of the model?

Logistic regression is often a strong baseline because it is simple, fast, and inspectable. A neural network should have to beat it on the metric that matters, not only on glamour.

Neural Networks Learn Representations

A neural network is a function built from layers of learned parameters and activation functions.

The simple feed-forward picture is:

  1. The input layer receives features.
  2. Hidden layers transform those features through weighted sums and activations.
  3. The output layer produces a score, probability, class distribution, or continuous value.
  4. A loss function compares the output with the target.
  5. Training updates the weights and biases to reduce that loss.

Deep neural networks are powerful because intermediate layers can learn representations that are hard to hand-engineer. That is why they work well for images, audio, language, and other high-dimensional patterns.

The tradeoff is review difficulty. A linear model's coefficients are easier to inspect. A deep model may perform better but require more careful evaluation, calibration checks, interpretability tooling, drift monitoring, and operational fallback.

Do not ask only whether a neural network is "more accurate." Ask what complexity it buys and how the team will notice when the learned representation stops matching production data.

Activations, Sigmoid, Softmax, and Logits

An activation function adds non-linearity. Without non-linear activations, stacked linear layers collapse into another linear transformation.
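The collapse is easy to verify numerically. This sketch uses random weights with arbitrary shapes; nothing about the sizes is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)

# Two stacked linear layers with no activation in between.
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=3)
W2, b2 = rng.normal(size=(2, 3)), rng.normal(size=2)
stacked = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
single = W @ x + b
print(np.allclose(stacked, single))

# With ReLU between the layers, the network is no longer a single linear map.
with_relu = W2 @ np.maximum(0.0, W1 @ x + b1) + b2
```

However many linear layers are stacked, the product of their weight matrices is still one matrix. The non-linearity is what lets depth add expressive power.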

Common activations have different jobs:

| Function | Common use | Output shape |
|---|---|---|
| ReLU | hidden layers | max(0, x) |
| Sigmoid | binary output probability | value between 0 and 1 |
| Tanh | hidden layers, some recurrent nets | value between -1 and 1 |
| Softmax | multiclass output | probabilities that sum to 1 |

The raw model score before sigmoid or softmax is often called a logit. Logits are useful for training and calibration work, but most users should not see them. A logit of 2.0 is not "twice as confident" as a logit of 1.0 in plain product language.
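A short sketch of how logits map to probabilities. The logit values are arbitrary examples.

```python
import numpy as np


def sigmoid(logit: float) -> float:
    return 1.0 / (1.0 + np.exp(-logit))


def softmax(logits: np.ndarray) -> np.ndarray:
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


# Doubling the logit does not double the probability.
print(round(sigmoid(1.0), 3), round(sigmoid(2.0), 3))

# Softmax turns a vector of logits into probabilities that sum to 1.
class_probs = softmax(np.array([2.0, 1.0, 0.1]))
print(class_probs.round(3), class_probs.sum())
```

A logit of 1.0 maps to roughly 0.73 and a logit of 2.0 to roughly 0.88, so the relationship is clearly not proportional.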

Sigmoid is especially easy to overread. It maps a score into the interval (0, 1), but the output is only a trustworthy probability if the model is calibrated for the data and use case. A model can output 0.83 and still be poorly calibrated.
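A deliberately extreme sketch of a calibration check, using scikit-learn's `calibration_curve`. The data is invented: every example gets a score of 0.83, but only 60 of 100 are actually positive.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Toy case: the model says 0.83 for 100 examples, but only 60 are positive.
y_prob = np.full(100, 0.83)
y_true = np.array([1] * 60 + [0] * 40)

# prob_true is the observed positive rate per bin;
# prob_pred is the mean predicted score in that bin.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)
print(prob_true, prob_pred)
```

The score claims 83% while the observed rate is 60%. A real check would use held-out predictions spread across many bins, but the comparison is the same: predicted score against observed frequency.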

Loss, Gradient Descent, Backpropagation, and Optimizers

These terms are related, but they are not interchangeable.

| Term | Meaning | Plain review phrasing |
|---|---|---|
| Loss function | The objective minimized during training | What mistake is the model punished for? |
| Gradient | Direction and rate of loss change with respect to parameters | Which way should the parameters move? |
| Backpropagation | Efficient gradient computation through the network | How are gradients assigned across layers? |
| Gradient descent | Update rule that moves parameters against the gradient | How does the model step toward lower loss? |
| Optimizer | Concrete update algorithm | Which stepping strategy are we using? |

PyTorch's autograd tutorial describes backpropagation as the common neural-network training algorithm where parameters are adjusted using gradients of the loss. PyTorch's optimizer docs then cover the implementations that perform those updates.

SGD, Adam, Adagrad, and RMSprop are optimizers. They differ in how they use gradients, momentum-like history, and adaptive learning rates. Adam is a common default because it works well across many problems, but it is not proof that the training setup is sound.
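To keep the terms apart, here is plain gradient descent on a one-feature linear model, written with NumPy. The gradients are derived by hand for this tiny case; in a neural network, backpropagation is what computes them automatically. The data, learning rate, and step count are assumptions chosen so the example converges.

```python
import numpy as np

# Toy data generated from y = 3x + 1, so the right answer is known.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 * x + 1.0

w, b = 0.0, 0.0        # parameters, learned by the loop
learning_rate = 0.05   # hyperparameter, chosen outside the loop

for step in range(2000):
    prediction = w * x + b
    error = prediction - y
    loss = np.mean(error ** 2)         # loss function: mean squared error

    grad_w = np.mean(2.0 * error * x)  # gradient of the loss with respect to w
    grad_b = np.mean(2.0 * error)      # gradient of the loss with respect to b

    w -= learning_rate * grad_w        # gradient descent: step against the gradient
    b -= learning_rate * grad_b

print(round(float(w), 3), round(float(b), 3))
```

An optimizer such as Adam replaces the two update lines with a more elaborate rule. The loss and the gradients stay the same.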

The review question is alignment. Does the loss function match the operational metric? If the product cares about recall within the top 50 cases of a review queue, a generic cross-entropy loss may still work, but the release gate needs ranking metrics too. If the business cares about large under-forecasts more than large over-forecasts, a symmetric regression loss may be the wrong objective.
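One asymmetric alternative is the quantile (pinball) loss. It is offered here as an illustration of the idea, not as the right objective for any particular forecast; the quantile of 0.9 and the numbers are assumptions.

```python
import numpy as np


def pinball_loss(y_true: np.ndarray, y_pred: np.ndarray, quantile: float) -> float:
    """Quantile (pinball) loss: a quantile above 0.5 makes under-forecasts cost more."""
    diff = y_true - y_pred
    return float(np.mean(np.maximum(quantile * diff, (quantile - 1.0) * diff)))


actual = np.array([100.0])
under = np.array([80.0])   # forecast 20 units too low
over = np.array([120.0])   # forecast 20 units too high

under_cost = pinball_loss(actual, under, quantile=0.9)
over_cost = pinball_loss(actual, over, quantile=0.9)
print(under_cost, over_cost)
```

Both forecasts miss by 20 units, so mean squared error would score them identically. At quantile 0.9 the under-forecast costs about nine times more than the over-forecast.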

Weights, Biases, Parameters, and Hyperparameters

Weights and biases are learned parameters.

Hyperparameters are chosen outside the normal parameter-learning process. Examples include learning rate, batch size, regularization strength, number of trees, tree depth, number of layers, hidden dimension, dropout rate, and training epochs.

This difference matters for accountability.

If a weight is wrong, the model learned it from data under the given objective. If the learning rate is wrong, the team configured the training process poorly. If the regularization strength (often written as lambda) is too high, the team forced the model to be too simple. If tree depth is too high, the team may have invited memorization.

The model learns parameters. The team owns hyperparameters.
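In scikit-learn the split is visible in the API. Constructor arguments are hyperparameters, and attributes with a trailing underscore are learned during `fit`. The data below is synthetic and the value of `C` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Hyperparameters: set by the team before training.
clf = LogisticRegression(C=0.5, max_iter=1000)
print(clf.get_params()["C"])

# Parameters: learned from data during fit.
clf.fit(X, y)
print(clf.coef_.shape, clf.intercept_.shape)
```

A review can ask for both lists separately: what `get_params()` returned for the released model, and where the fitted coefficients are stored.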

Learning Rate and Convergence

Learning rate controls how large each optimization step is.

Too high, and training can overshoot the minimum, oscillate, or diverge. Too low, and training can crawl, stall on plateaus, or waste compute without reaching a useful solution. Schedules and adaptive optimizers can help, but they do not remove the need to inspect training curves.
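Both failure modes show up even on the simplest possible objective. This sketch runs gradient descent on f(x) = x**2; the three learning rates are picked to illustrate the regimes, not as recommendations.

```python
def minimize_quadratic(learning_rate: float, steps: int = 50, start: float = 1.0) -> float:
    """Run gradient descent on f(x) = x**2 and return the final x."""
    x = start
    for _ in range(steps):
        gradient = 2.0 * x  # derivative of x**2
        x -= learning_rate * gradient
    return x


too_high = minimize_quadratic(learning_rate=1.1)    # each step overshoots further
too_low = minimize_quadratic(learning_rate=0.001)   # barely moves in 50 steps
reasonable = minimize_quadratic(learning_rate=0.1)  # approaches the minimum at 0

print(abs(too_high), too_low, abs(reasonable))
```

The high rate ends thousands of units from the minimum, the low rate is still close to where it started, and the middle rate lands near zero. Real loss surfaces are messier, which is why the curves need to be inspected and not assumed.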

Convergence means training has approached a stable region where loss or model parameters stop changing meaningfully. It does not automatically mean the model is good. A model can converge to a bad solution. A model can converge on training data while validation performance gets worse.

I want to see at least:

  • training loss curve,
  • validation loss or metric curve,
  • learning-rate schedule,
  • early stopping rule,
  • final checkpoint selection rule.

If the only evidence is "the run finished," the team has not reviewed convergence. It has reviewed job completion.
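An early stopping rule and a checkpoint selection rule can be as small as the sketch below. The patience-based rule is one common convention, not the only one, and the validation losses are invented.

```python
from typing import List, Tuple


def early_stop(val_losses: List[float], patience: int) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a patience-based early stopping rule."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch  # no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: improvement stalls after epoch 3.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.58, 0.57, 0.61]
best_epoch, stop_epoch = early_stop(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

Training stops at epoch 6, but the checkpoint to keep is the one from epoch 3. Writing the rule down makes it reviewable; "we stopped when it looked flat" does not.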

Overfitting, Underfitting, Bias, and Variance

Overfitting means the model learned the training data too specifically. It performs well on training data and poorly on unseen data.

Underfitting means the model is too weak, too constrained, or poorly specified to learn the real pattern. It performs poorly on both training and validation data.

The bias-variance tradeoff explains the pressure between simplicity and flexibility:

| Failure | Training performance | Validation/test performance | Typical cause |
|---|---|---|---|
| High bias / underfit | poor | poor | model too simple, features weak, objective wrong |
| High variance / overfit | strong | weak | model too complex, data too small, leakage, too much tuning |
| Better fit | strong enough | strong enough | complexity matches signal and data |

TensorFlow's overfit and underfit tutorial is a useful practical reference because it connects the concept to training curves and regularization techniques.
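The train/validation gap can be reproduced with a decision tree on synthetic data. The dataset, the label noise setting (`flip_y=0.2`), and the depths are assumptions chosen to make the pattern visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise, so memorizing the training set cannot generalize.
X, y = make_classification(n_samples=600, n_features=20, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 4, None):  # None lets the tree grow until it fits the training set
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    val_acc = tree.score(X_val, y_val)
    print(
        f"max_depth={depth}: train={train_acc:.2f}, "
        f"val={val_acc:.2f}, gap={train_acc - val_acc:.2f}"
    )
```

The unconstrained tree reaches perfect training accuracy by memorizing noisy labels. The number to read is the gap between the two columns, not the training score.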

The failure I trust least is a model that looks excellent on training data and "acceptable" on validation after weeks of manual tuning. That validation set may have quietly become part of the training process through human iteration. At that point, the final test set or a fresh backtest has to carry more weight.

Regularization Is a Complexity Budget

Regularization reduces overfitting by penalizing or constraining model complexity.

Common forms:

| Technique | What it discourages | Common effect |
|---|---|---|
| L1 regularization | sum of absolute weights | sparse weights, implicit feature selection |
| L2 regularization | sum of squared weights | smaller distributed weights |
| Elastic Net | L1 and L2 together | sparsity plus stability |
| Dropout | reliance on specific neurons | more robust neural representations |
| Early stopping | training past validation improvement | simpler checkpoint |
| Data augmentation | memorizing narrow examples | broader invariances |

Regularization is not a purity ritual. It is a response to a failure mode. If the model underfits, stronger regularization can make it worse. If the features are leaky, regularization may hide the leak instead of fixing it. If the label is noisy, regularization can help, but the better answer may be label repair.
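The different effects of L1 and L2 penalties can be seen by counting zeroed coefficients. This sketch uses scikit-learn's `Lasso` (L1) and `Ridge` (L2) on synthetic data; the penalty strength of 5.0 is an arbitrary assumption, and the two `alpha` values are not on the same scale across the two estimators.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 5 of the 20 features carry signal.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=5.0).fit(X, y)  # L2 penalty

lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))
print(f"zero coefficients: lasso={lasso_zeros}, ridge={ridge_zeros}")
```

L1 drives uninformative coefficients exactly to zero, while L2 shrinks them without eliminating them. Which behavior is wanted depends on the failure being addressed, which is the point of the review question.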

The review question is: which complexity are we trying to reduce, and what evidence says it is the right one?

Ensembles Trade Simplicity for Stability

Ensemble learning combines multiple models.

The scikit-learn ensemble guide covers the major families:

| Ensemble type | Mechanism | Example |
|---|---|---|
| Bagging | train models independently on bootstrapped samples | Random Forest |
| Boosting | train models sequentially to correct prior errors | Gradient Boosting, AdaBoost |
| Voting | combine predictions from different estimators | majority vote or averaged probability |
| Stacking | train a meta-model on base model predictions | blended classifier/regressor |

Ensembles often improve predictive performance and stability. They can also make inference heavier, debugging harder, and explanation less direct.

The tradeoff is acceptable when the operational gain is real. A fraud model that catches materially more high-risk cases may justify a boosted tree ensemble. A small internal prioritization tool may be better served by a regularized logistic regression that everyone can inspect.

Ask what the ensemble buys:

  • higher accuracy,
  • lower variance across splits,
  • better calibration,
  • robustness to feature noise,
  • easier handling of non-linear interactions.

If the answer is "it is what won the notebook experiment," keep digging.
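One way to turn "what does the ensemble buy" into evidence is to compare a single model and the ensemble across the same folds. The sketch below uses synthetic data and default settings, so the specific numbers mean nothing beyond illustrating the comparison.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)

candidates = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std across folds={scores.std():.3f}")
```

Reporting both the mean and the spread across folds separates "higher accuracy" from "lower variance across splits". Inference latency and memory belong in the same comparison before the ensemble is accepted.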

Embeddings Are Learned Coordinate Systems

Embeddings map discrete or high-dimensional objects into dense vectors.

Words, documents, products, users, categories, images, and code snippets can all be represented as embeddings. The promise is that useful relationships become geometric: similar things land near each other, directions encode patterns, and downstream models can consume dense vectors instead of sparse IDs.

Embeddings are central in NLP and recommendation systems. A product recommender might learn user and item embeddings. A semantic search system might embed queries and documents into the same vector space. A tabular model might embed high-cardinality categories instead of exploding them into thousands of one-hot columns.
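A toy sketch of "similar things land near each other", using cosine similarity. The three vectors are written by hand and stand in for learned embeddings; real vectors would come from a trained model and have many more dimensions.

```python
import numpy as np

# Hand-written toy vectors standing in for learned embeddings; values are invented.
embeddings = {
    "running shoes": np.array([0.9, 0.1, 0.0]),
    "trail sneakers": np.array([0.8, 0.2, 0.1]),
    "coffee grinder": np.array([0.0, 0.1, 0.9]),
}


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def nearest(query: str) -> str:
    """Return the other item whose vector is closest to the query item's vector."""
    others = [name for name in embeddings if name != query]
    return max(
        others,
        key=lambda name: cosine_similarity(embeddings[query], embeddings[name]),
    )


print(nearest("running shoes"))
```

The lookup only works for items already in the table. A new product has no vector at all, which is why cold-start handling needs its own answer in review.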

The review questions are practical:

  • What data trained the embedding model?
  • Is similarity in this space the similarity the product needs?
  • Are embeddings frozen or fine-tuned?
  • How are new users, products, categories, or terms handled?
  • How often are vectors refreshed?
  • Which downstream metric proves the representation helped?

Being dense does not make an embedding self-explanatory. It needs lineage.

The Peer Review Checklist

When a team says a model is ready, I want this vocabulary to collapse into a short checklist.

| Area | Question |
|---|---|
| Decision | What action does the model change? |
| Label | Does the target measure that action honestly? |
| Features | Are all features available at prediction time? |
| Split | Does the split match the production timeline or grouping? |
| Preprocessing | Is every learned transformation fit only on training data? |
| Baseline | Did a simple model set the floor? |
| Task type | Is the model solving regression, classification, ranking, or something else? |
| Loss | Does the training objective punish the right mistakes? |
| Metrics | Do metrics reflect the release decision and class balance? |
| Tuning | Was the validation set protected from endless iteration? |
| Overfitting | Do training and validation curves show healthy generalization? |
| Regularization | Is complexity constrained for a named reason? |
| Threshold | Is the operating threshold tied to cost or capacity? |
| Artifact | Are preprocessing, model weights, metadata, and evaluation evidence versioned together? |
| Monitoring | Which feature, prediction, and outcome signals will catch drift? |

That checklist is the real glossary.

What to Do Differently

Do not let ML terminology stay at definition level.

In a model review, every term should attach to a decision. Features attach to availability and leakage. Splits attach to generalization. Regression and classification attach to the action. Loss attaches to training pressure. Metrics attach to release judgment. Regularization attaches to overfitting risk. Embeddings attach to representation lineage.

Once the words are tied to decisions, the conversation gets much cleaner.

The model is no longer "good" because the notebook score improved. It is good only if the feature contract is stable, the split is honest, the objective is aligned, the validation process survived scrutiny, the test result is meaningful, and the production workflow can notice when the assumptions stop holding.
