Outcome focus: Defined a practical feature review loop that prevents teams from dropping useful nonlinear signals or keeping redundant features just because a correlation heatmap looked convincing.
machine learningfeature selectioncorrelationscikit-learnmodel evaluation
Correlation is a good first look and a bad final answer.
That is the feature-selection mistake I keep seeing in tabular machine learning work. A team opens the notebook, runs a correlation matrix, drops highly correlated inputs, keeps the features most correlated with the target, and calls the feature review done.
The move is understandable. Correlation is fast. It is easy to plot. It gives a clean heatmap. It catches obvious duplicates. It helps with linear models. It gives the team a way to talk about redundancy before the modeling work turns into a pile of guesses.
But the heatmap answers a narrow question.
Pearson correlation asks how strongly two numeric variables move together in a linear way. It does not tell you whether a feature has a U-shaped relationship with the target. It does not tell you whether two weak features become strong together. It does not tell you whether a categorical feature matters. It does not prove causality. It does not prove a feature is safe to use in production.
I have seen the bad version of this enough times that I now treat correlation as a screen, not a strategy.
The first version of a feature review I built around correlation was tidy and wrong. The target relationship for one variable was nonlinear: risk increased at both low and high values, but looked flat in the middle. Pearson correlation was close to zero. A tree model used the feature immediately. The heatmap did not lie. It answered the wrong question.
That is the lesson: a feature screen should be layered. Start with correlation. Then test variance, missingness, target association, redundancy, model reliance, leakage, and domain meaning.
The Feature Review Loop#
This is the review loop I trust more than a single heatmap.
The loop has one rule: no method gets to be the only judge.
| Check | What it catches | What it misses |
|---|---|---|
| Data quality screen | constant values, missingness, impossible values, stale features | predictive structure |
| Pearson correlation | linear numeric relationships and obvious redundancy | nonlinear, categorical, interaction effects |
| Spearman correlation | monotonic rank relationships | non-monotonic structure |
| Mutual information | general dependency between feature and target | feature interactions and causal meaning |
| Chi-squared test | categorical/count feature association with class labels | continuous features unless binned |
| L1 models | sparse linear signal | nonlinear signal and correlated-feature instability |
| Tree importance | nonlinear splits and interactions | bias toward high-cardinality or correlated features |
| Permutation importance | what a fitted model relies on for a metric | intrinsic value independent of the model |
| Domain review | leakage, availability, policy, business meaning | hidden statistical structure |
That table is the whole argument.
What Correlation Is Good For#
Correlation is still worth running.
The SciPy pearsonr docs define Pearson correlation as a measure of the linear relationship between two datasets. The statistic ranges from -1 to 1, with the endpoints indicating exact linear relationships.
That makes it useful for four things.
First, correlation is a quick redundancy check. If total_spend_30d and total_spend_31d have Pearson correlation of 0.999, they probably carry nearly the same signal. Keeping both may be harmless for a random forest, but it can make linear coefficients unstable and explanations noisy.
Second, correlation is useful when the model family cares about linear structure. Linear regression, logistic regression, linear SVMs, and generalized linear models all benefit from understanding linear relationships and multicollinearity.
Third, correlation is a fast target screen for numeric features. It can reveal obvious signal, especially in early exploratory work.
Fourth, a correlation heatmap is an excellent conversation artifact. It lets the team see feature families, duplicates, derived variables, and possible leakage candidates.
import pandas as pd
numeric_cols = X_train.select_dtypes(include="number").columns
feature_corr = X_train[numeric_cols].corr(method="pearson")
target_corr = (
X_train[numeric_cols]
.assign(target=y_train)
.corr(method="pearson")["target"]
.drop("target")
.sort_values(key=lambda s: s.abs(), ascending=False)
)
redundant_pairs = (
feature_corr.abs()
.where(~pd.DataFrame(
[[i <= j for j in range(len(feature_corr.columns))]
for i in range(len(feature_corr.index))],
index=feature_corr.index,
columns=feature_corr.columns,
))
.stack()
.loc[lambda s: s >= 0.95]
.sort_values(ascending=False)
)
print(target_corr.head(10))
print(redundant_pairs.head(20))This is a screen, not a ruling. It says "look here."
It should not say "drop everything below this line."
Pearson, Spearman, and the Shape of Signal#
Pearson and Spearman answer different questions.
Pearson is about linear relationship. Spearman is about monotonic rank relationship. The SciPy spearmanr docs describe Spearman correlation as a nonparametric measure of monotonicity.
A monotonic relationship does not have to be linear. If risk rises as usage rises, but not at a constant slope, Spearman may show a stronger relationship than Pearson.
screen = pd.DataFrame(
{
"pearson": X_train[numeric_cols].corrwith(y_train, method="pearson"),
"spearman": X_train[numeric_cols].corrwith(y_train, method="spearman"),
}
)
screen["abs_gap"] = (screen["spearman"].abs() - screen["pearson"].abs()).abs()
screen = screen.sort_values("abs_gap", ascending=False)
print(screen.head(15))Features with a large Pearson/Spearman gap deserve a plot before they deserve a decision.
import matplotlib.pyplot as plt
feature = "days_since_last_login"
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(X_train[feature], y_train, alpha=0.15)
ax.set_xlabel(feature)
ax.set_ylabel("target")
ax.set_title("Feature-target shape check")
plt.show()For binary targets, a raw scatter may not show much. In that case, bin the feature and inspect target rate by bin.
feature = "days_since_last_login"
bins = pd.qcut(X_train[feature], q=10, duplicates="drop")
target_rate = y_train.groupby(bins, observed=True).mean()
counts = y_train.groupby(bins, observed=True).size()
print(pd.DataFrame({"target_rate": target_rate, "n": counts}))This catches the pattern correlation misses: a feature whose risk is low in the middle and high at both extremes can have near-zero Pearson and Spearman correlation. The model may still need it.
The U-Shaped Failure#
Here is a toy example that makes the failure obvious.
import numpy as np
from sklearn.feature_selection import mutual_info_regression
rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=2_000)
y = x**2 + rng.normal(0, 0.03, size=x.size)
pearson = np.corrcoef(x, y)[0, 1]
spearman = pd.Series(x).corr(pd.Series(y), method="spearman")
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=42)[0]
print(f"pearson={pearson:.3f}")
print(f"spearman={spearman:.3f}")
print(f"mutual_information={mi:.3f}")The expected shape of the output is:
pearson near 0
spearman near 0
mutual_information clearly above 0The feature is useful. It is just not useful in the way Pearson asks.
That is why "low correlation with the target" is not enough to discard a feature. It may mean no signal. It may mean nonlinear signal. It may mean signal only through interaction. It may mean the wrong metric for the feature type.
Mutual Information Is the Next Filter#
Mutual information is useful when the shape may be nonlinear.
The scikit-learn mutual_info_classif docs describe mutual information as a non-negative dependency measure: zero means independence, higher values mean higher dependency. The feature selection guide notes that mutual information methods can capture any kind of statistical dependency, while requiring more samples for accurate estimation.
That makes MI a good second screen.
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import OrdinalEncoder
X_mi = X_train.copy()
cat_cols = X_mi.select_dtypes(include=["object", "category", "bool"]).columns
num_cols = X_mi.columns.difference(cat_cols)
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
X_mi[cat_cols] = encoder.fit_transform(X_mi[cat_cols].astype(str))
discrete_mask = X_mi.columns.isin(cat_cols)
mi_scores = mutual_info_classif(
X_mi,
y_train,
discrete_features=discrete_mask,
random_state=42,
)
mi_rank = (
pd.Series(mi_scores, index=X_mi.columns, name="mutual_info")
.sort_values(ascending=False)
)
print(mi_rank.head(20))Two cautions matter.
First, MI is still univariate in this usage. It scores each feature against the target independently. A feature that matters only through interaction can still look weak.
Second, encoding matters. Treating a continuous variable as discrete or a discrete variable as continuous can give poor estimates. scikit-learn calls this out directly in the mutual_info_classif notes.
Use MI to widen the screen. Do not turn it into a new superstition.
Chi-Squared for Categorical and Count Features#
For classification with non-negative categorical or count-style features, the chi-squared test is another fast filter.
The scikit-learn chi2 docs describe it as computing chi-squared statistics between each non-negative feature and class. It is common for term counts, one-hot encoded categories, booleans, and frequency-like features.
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
categorical_cols = X_train.select_dtypes(include=["object", "category", "bool"]).columns
cat_selector = Pipeline(
steps=[
(
"onehot",
OneHotEncoder(handle_unknown="ignore"),
),
(
"select",
SelectKBest(score_func=chi2, k=50),
),
]
)
cat_selector.fit(X_train[categorical_cols], y_train)The caveat is in the input requirement: chi-squared expects non-negative features. If a continuous feature can be negative, do not feed it directly into chi2. Bin it or use a different score.
Redundancy Is Not Only Pairwise Correlation#
Highly correlated features can be redundant. But redundancy can also be model-specific.
For linear models, redundant features can destabilize coefficients. If two variables carry the same signal, the model may split weight between them differently across samples. The predictions may be fine while interpretation becomes fragile.
Variance Inflation Factor is one way to inspect multicollinearity in regression-style design matrices. The statsmodels variance_inflation_factor docs describe VIF as a measure for multicollinearity of the design matrix and note a common recommendation that values above 5 indicate high collinearity.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
numeric = X_train[numeric_cols]
numeric_ready = SimpleImputer(strategy="median").fit_transform(numeric)
numeric_ready = StandardScaler().fit_transform(numeric_ready)
vif = pd.Series(
[
variance_inflation_factor(numeric_ready, i)
for i in range(numeric_ready.shape[1])
],
index=numeric_cols,
name="vif",
).sort_values(ascending=False)
print(vif.head(20))Do not apply VIF mechanically to every model family. It is most natural when you care about coefficient stability and linear inference. For a boosted tree or random forest, correlated features create different problems: importance dilution, redundant split candidates, and harder explanations.
For broad feature redundancy, I like clustering correlations rather than dropping features pair by pair.
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
corr = spearmanr(X_train[numeric_cols].fillna(X_train[numeric_cols].median())).correlation
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)
distance = 1 - np.abs(corr)
linkage = hierarchy.ward(squareform(distance))
cluster_ids = hierarchy.fcluster(linkage, t=0.25, criterion="distance")
clusters = (
pd.Series(cluster_ids, index=numeric_cols, name="cluster")
.reset_index(names="feature")
.sort_values(["cluster", "feature"])
)
print(clusters.head(30))This mirrors the approach in scikit-learn's multicollinearity example, where the docs cluster features using Spearman rank-order correlations and choose one feature from each cluster.
The reviewer still has to choose the survivor. Do not automatically keep the first alphabetical column. Prefer the feature with better availability, clearer meaning, lower leakage risk, lower missingness, better stability, and stronger validation behavior.
Low Variance and Missingness Are Feature Checks Too#
Some features do not need a model to reject them.
A column that is constant has no predictive value. A column that is 99.9 percent missing may still be useful in rare cases, but it needs a reason. A feature that changes only after the prediction moment is leakage. A feature that exists in training but not in serving is a production bug waiting for a release date.
scikit-learn's feature selection guide starts with VarianceThreshold, which removes features whose variance does not meet a threshold.
from sklearn.feature_selection import VarianceThreshold
missing_rate = X_train.isna().mean().sort_values(ascending=False)
constant_cols = X_train.columns[X_train.nunique(dropna=False) <= 1]
numeric_for_variance = X_train[numeric_cols].fillna(X_train[numeric_cols].median())
variance_selector = VarianceThreshold(threshold=0.0)
variance_selector.fit(numeric_for_variance)
low_variance_cols = numeric_for_variance.columns[
~variance_selector.get_support()
]
print("highest missingness")
print(missing_rate.head(20))
print("constant columns")
print(list(constant_cols))
print("zero-variance numeric columns")
print(list(low_variance_cols))The operational review questions are blunt:
- Is the feature available when the prediction is made?
- Is the feature stable across time?
- Is the feature missing for a business reason?
- Does missingness itself carry signal?
- Is the feature allowed under privacy, policy, or fairness constraints?
At this point, domain knowledge beats generic selection methods.
Embedded Methods: Let the Model Help, Carefully#
Embedded methods select features as part of training.
L1-regularized models are the classic linear example. LASSO and L1 logistic regression can drive some coefficients to zero. scikit-learn's feature selection docs explain that L1-penalized linear models produce sparse solutions and can be used with SelectFromModel.
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
preprocess = ColumnTransformer(
transformers=[
(
"num",
Pipeline(
steps=[
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler()),
]
),
numeric_cols,
),
(
"cat",
OneHotEncoder(handle_unknown="ignore"),
categorical_cols,
),
]
)
l1_selector = SelectFromModel(
LogisticRegression(
penalty="l1",
solver="saga",
C=0.1,
max_iter=5_000,
class_weight="balanced",
)
)
pipeline = Pipeline(
steps=[
("preprocess", preprocess),
("select", l1_selector),
(
"model",
LogisticRegression(
max_iter=5_000,
class_weight="balanced",
),
),
]
)
pipeline.fit(X_train, y_train)The important detail is that the selector lives inside the Pipeline. Feature selection has to happen inside cross-validation or after the train split, not on the full dataset before evaluation.
Tree-based models provide another embedded signal. Random forests, extra trees, and gradient boosting can rank features by split-based importance. These methods can capture nonlinear structure and interactions better than simple correlation.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
tree_model = Pipeline(
steps=[
("preprocess", preprocess),
(
"model",
RandomForestClassifier(
n_estimators=300,
min_samples_leaf=20,
class_weight="balanced",
random_state=42,
),
),
]
)
tree_model.fit(X_train, y_train)
print(tree_model.score(X_val, y_val))Some tree estimators expose feature_importances_, but not all model families do. And impurity-based importances have known caveats, especially around high-cardinality features and correlated predictors. Treat them as evidence, not truth.
Permutation Importance Answers a Better Question#
Permutation importance asks: if I shuffle this feature and break its relationship to the target, how much does the fitted model's score degrade?
The scikit-learn permutation importance guide describes it as a model inspection technique that is useful for non-linear or opaque estimators. It is model-agnostic because it can be applied to any fitted estimator.
import pandas as pd
from sklearn.inspection import permutation_importance
result = permutation_importance(
tree_model,
X_val,
y_val,
scoring="average_precision",
n_repeats=10,
random_state=42,
n_jobs=-1,
)
perm = (
pd.DataFrame(
{
"feature": X_val.columns,
"importance_mean": result.importances_mean,
"importance_std": result.importances_std,
}
)
.sort_values("importance_mean", ascending=False)
)
print(perm.head(20))The caveat is important: permutation importance tells you how important a feature is to a particular trained model on a particular dataset and metric. It does not tell you intrinsic feature value.
The scikit-learn docs also warn that low importance in a bad model does not mean the feature is useless. Evaluate model performance first, then interpret importance.
Correlated features can also make permutation importance look strange. If two features are redundant, shuffling one may not hurt much because the model can still recover signal from the other. scikit-learn's correlated-features example demonstrates exactly that: a random forest can score well while permutation importance makes individual correlated features look unimportant.
That is why I often inspect importance at the cluster level, not only the single-column level.
Wrapper Methods When the Feature Set Is Small Enough#
Wrapper methods train models across feature subsets. They are expensive, but useful when the feature set is not enormous and the decision cost justifies the compute.
Recursive Feature Elimination repeatedly trains a model, ranks features, removes weaker ones, and continues until a target number remains. RFECV adds cross-validation to choose the number of features.
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
rfecv = RFECV(
estimator=LogisticRegression(max_iter=5_000, class_weight="balanced"),
step=5,
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
scoring="average_precision",
n_jobs=-1,
)
rfecv_pipeline = Pipeline(
steps=[
("preprocess", preprocess),
("select", rfecv),
("model", LogisticRegression(max_iter=5_000, class_weight="balanced")),
]
)
rfecv_pipeline.fit(X_train, y_train)Use wrapper methods deliberately. They can overfit the validation procedure if the search space is large and the evaluation set is weak. They are not a shortcut around a held-out test set.
A Practical Feature Review Artifact#
The artifact I like is a feature review table, not a single score.
review = pd.DataFrame(index=X_train.columns)
review["dtype"] = X_train.dtypes.astype(str)
review["missing_rate"] = X_train.isna().mean()
review["unique_count"] = X_train.nunique(dropna=False)
review["pearson_target"] = pd.NA
review.loc[numeric_cols, "pearson_target"] = X_train[numeric_cols].corrwith(
y_train,
method="pearson",
)
review["spearman_target"] = pd.NA
review.loc[numeric_cols, "spearman_target"] = X_train[numeric_cols].corrwith(
y_train,
method="spearman",
)
review["mutual_info"] = pd.Series(mi_scores, index=X_mi.columns)
review["notes"] = ""
review.loc[review["missing_rate"] > 0.80, "notes"] += "high_missing; "
review.loc[review["unique_count"] <= 1, "notes"] += "constant; "
print(
review.sort_values(
["mutual_info", "missing_rate"],
ascending=[False, True],
).head(30)
)Then add human columns:
| Column | Why it matters |
|---|---|
available_at_prediction_time | prevents leakage |
source_system | helps ownership and monitoring |
business_meaning | prevents meaningless encoded leftovers |
policy_risk | flags sensitive, proxy, or restricted data |
stability_window | checks whether the signal survives time |
selected_for_model | records the final choice |
reason | prevents future archaeology |
This table becomes a model review artifact. It is much more useful than a screenshot of a heatmap.
How I Would Decide#
I would not use one ranking.
I would use a decision matrix:
| Situation | Preferred check |
|---|---|
| Numeric redundancy | Pearson/Spearman correlation, clustered redundancy review |
| Linear model coefficient stability | VIF, regularization, coefficient sensitivity |
| Numeric target with possible nonlinear shape | plots, binned response, mutual information, tree model |
| Categorical feature vs class target | chi-squared on encoded non-negative features, MI, target-rate review |
| Many sparse text/count features | chi-squared, MI, L1 model, cross-validated pipeline |
| Feature interactions likely | tree model, permutation importance, partial dependence/ICE, domain review |
| High-stakes model | feature contract, leakage audit, stability backtest, fairness/proxy review |
The selection decision should be made after validation performance, not before. If dropping a correlated feature makes the model easier to explain and does not hurt validation performance, drop it. If keeping a redundant feature improves robustness because one source is sometimes delayed, maybe keep both but define fallback behavior. If a feature scores highly but cannot exist at serving time, reject it.
That last point matters most.
A feature can be statistically useful and operationally invalid.
What Not to Do#
Do not compute feature selection on the full dataset before splitting. That leaks validation and test information into the model development process.
Do not drop every feature with low Pearson correlation. You will lose nonlinear and interaction signal.
Do not keep every highly correlated feature because "the model will figure it out." Some models will, some will not, and explanations can become unstable even when predictions survive.
Do not interpret feature importance as causality. A model can rely on a proxy for an operational process without proving that changing the feature changes the outcome.
Do not ignore categorical variables because the correlation matrix did not include them.
Do not treat one feature-selection method as an oracle. Feature selection is evidence aggregation.
The Close#
Correlation earns its place in the workflow. I still run it early.
But I do not let it make the final feature decision.
The better workflow is layered: data quality, Pearson and Spearman, redundancy clustering, mutual information, chi-squared tests where appropriate, embedded selection, permutation importance, wrapper methods when justified, and domain review before release.
The last step is not statistical. It is operational.
Before a feature ships, the team should be able to say:
- it is available at prediction time,
- it is legal and appropriate to use,
- it has a stable source,
- it improves validation behavior,
- it does not only duplicate a better feature,
- it has a monitored contract,
- and we know what decision it is supposed to improve.
That is the difference between a correlation heatmap and feature engineering judgment.