Feature Effects
You trained a model. predict(x) gives an answer. But the model is a function of many inputs, and you usually care about one of them at a time. What does this feature do to this prediction? The honest answer is surprisingly hard to pin down. Let's build it from first principles.
The question
A trained classifier is a function f: X → P(class). Your data X has many columns — petal length, petal width, whatever. What you actually want isn't the joint prediction f(x). You want the partial relationship: if I fix one feature and vary it across its range, what does the model do on average?
That average is what the phrase feature effect means. It isn't free — you have to construct it.
Why you can't just plot the data
The obvious thing is to scatter P(class=setosa) against petal_length using whatever pairs show up in the training set. Simple.
It lies. Your features are correlated. In iris, petal_length tracks closely with petal_width. A scatter of outputs vs petal_length is showing you the combined effect of both, plus every other confound in the data. It describes what the model happens to predict on the rows it has seen — it doesn't answer the counterfactual: if I nudged only petal_length, what would change?
To get at that, we have to construct rows the model has never seen.
Partial dependence: the recipe
Here's the move. Suppose you want to know what the model predicts when petal_length = 1.5, holding everything else "natural." You probably don't have a row with exactly that value — but you have a whole dataset full of "natural." So use it.
- Take your training set.
- Copy it. Overwrite the `petal_length` column with `1.5` on every row.
- Ask the model for its probability on every one of those synthetic rows.
- Average the probabilities. That number is your partial dependence at `petal_length = 1.5`.
Then repeat for 1.6, 1.7, ... across the range. You've built a curve.
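The four steps above, for a single grid value, fit in a few lines. A minimal sketch: `pd_at` is a hypothetical helper name, and `toy_proba` is a stand-in for your model's `predict_proba`.

```python
import numpy as np

def pd_at(model_proba, X, j, v):
    """Partial dependence of the class probabilities at feature j pinned to v.

    model_proba: callable (n, d) -> (n, k) class probabilities.
    X: the training matrix whose other columns supply the "natural" values.
    """
    X_pinned = X.copy()
    X_pinned[:, j] = v                         # overwrite one column, leave the rest alone
    return model_proba(X_pinned).mean(axis=0)  # average over all rows

# Toy illustration: a fake two-class "model" that depends on both columns.
def toy_proba(X):
    p = 1.0 / (1.0 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
    return np.column_stack([1 - p, p])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
point = pd_at(toy_proba, X, j=0, v=1.5)  # one point on the PD curve for feature 0
```

Each call like this produces one point; sweeping `v` across a grid produces the curve.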
Each grid value v becomes one point on the final curve: overwrite the column, score the whole dataset, average the probabilities. The real values of every other feature are what make the average meaningful. Conceptually you're marginalizing over the other features: each row keeps its own natural values for the rest of the columns, so averaging lets the real distribution of those features stand in for the population. "If the population kept all its other characteristics but was forced to have this one feature pinned at v, what would the model predict on average?" That's the partial dependence at v.
Written compactly:

$$\mathrm{PD}_j(v) \;=\; \mathbb{E}_{x_{-j}}\!\left[ f(v,\, x_{-j}) \right] \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f\!\left(v,\, x^{(i)}_{-j}\right)$$

where $x_{-j}$ denotes every feature except $j$, and the expectation is taken over the empirical distribution of your data. In practice: pick a grid of values for $v$ (typically 20–30 across the feature's observed range), run the algorithm above for each point, plot.
One batch, not a loop
Naively, you'd call `model.predict_proba` once per grid point, each call scoring N rows. That's grid_n calls, each paying its own round-trip of overhead.

Better: stack everything into one call. Build a batch of grid_n × N synthetic rows: N copies with the column pinned at `grid[0]`, then N with the column pinned at `grid[1]`, and so on. One call, fully vectorized. For grid_n = 20 and N = 120 that's 2,400 rows in one shot versus twenty round trips. This matters enormously when the model lives behind a worker or a network boundary.
```python
import numpy as np

# X_train: (N, num_features). grid: (grid_n,). j: the feature index we're sweeping.
N = X_train.shape[0]

# Tile the whole training set grid_n times, then overwrite column j per block.
batch = np.tile(X_train, (grid_n, 1))            # (grid_n * N, num_features)
for g, v in enumerate(grid):
    batch[g * N : (g + 1) * N, j] = v

probas = model.predict_proba(batch)              # (grid_n * N, num_classes)
pd = probas.reshape(grid_n, N, -1).mean(axis=1)  # (grid_n, num_classes)
```
Importance: does this feature matter at all?
Partial dependence shows shape. It doesn't tell you whether the model actually uses this feature — a perfectly flat curve could mean "doesn't matter" or could be hiding interactions.
The cleaner question is: if I break this feature's relationship to the label, how much worse does the model do?
Permutation importance answers this. Score the model on the validation set; call the accuracy A. Shuffle one column (only that column, so every other value in each row stays put), score again to get A'. The drop A − A' is that feature's importance.
Shuffling is subtle but right. You could drop the column and retrain, but then you're measuring a different model. Shuffling keeps the model identical and only destroys the column's information content at inference time. It's the closest thing to an in-place ablation.
- Shuffle the column in a copy, not in place on the original. Don't drop rows, don't touch other columns.
- Seed the shuffle. It's a Monte Carlo estimate; without a seed your importances jitter between runs.
- Clip the drop at zero. Noise can make a useless feature look slightly beneficial once in a while — that's sampling, not signal.
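Those three rules fit in a few lines. A sketch: `perm_importance` is a hypothetical helper, and `score_fn` stands in for scoring your model on a matrix.

```python
import numpy as np

def perm_importance(score_fn, X_val, y_val, j, seed=0):
    """Permutation importance of one feature, following the rules above."""
    rng = np.random.default_rng(seed)  # seeded: it's a Monte Carlo estimate
    base = score_fn(X_val, y_val)
    shuffled = X_val.copy()            # shuffle a copy, never the original
    rng.shuffle(shuffled[:, j])        # only column j moves
    return max(0.0, base - score_fn(shuffled, y_val))  # clip noise at zero

# Toy check: the label depends only on column 0, so column 1 should score zero.
def score_fn(X, y):
    return float(np.mean((X[:, 0] > 0) == y))

rng = np.random.default_rng(1)
X_val = rng.normal(size=(200, 2))
y_val = X_val[:, 0] > 0
imp = [perm_importance(score_fn, X_val, y_val, j) for j in (0, 1)]
```

Breaking column 0 roughly halves the accuracy; breaking column 1 changes nothing, and the clip keeps it at exactly zero.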
Reading a feature effect plot
Once you've run partial dependence for every feature and sorted them by permutation importance, the plot per feature looks like this:
- A flat line at some value means the model's output for that class is insensitive to this feature in that range. It doesn't mean the model ignores the feature everywhere — just there.
- A steep rise or fall is a region where the model is sensitive. Small nudges move predictions.
- Lines crossing each other (or the 0.5 mark in binary) are decision boundaries — the point along this axis where the predicted label flips.
- Overlaying binned predicted-vs-actual on the same axis is a calibration check. If the model predicts 80% setosa in a bin where only 40% actually are, curve and dots disagree, and you've found a miscalibration zone.
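The calibration overlay from the last bullet is just binned averaging. A sketch, with `binned_calibration` a hypothetical helper:

```python
import numpy as np

def binned_calibration(probas, y_true, n_bins=10):
    """Binned predicted-vs-actual for one class: the dots to overlay on the plot.

    probas: predicted probability of the class per row. y_true: 0/1 membership.
    Returns (bin_centers, mean_predicted, actual_fraction) for non-empty bins.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probas, edges) - 1, 0, n_bins - 1)
    centers, mean_pred, frac = [], [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            centers.append((edges[b] + edges[b + 1]) / 2)
            mean_pred.append(probas[mask].mean())
            frac.append(y_true[mask].mean())
    return np.array(centers), np.array(mean_pred), np.array(frac)

# Synthetic sanity check: labels drawn from the predicted probabilities
# are perfectly calibrated, so the two curves should nearly coincide.
rng = np.random.default_rng(0)
p = rng.uniform(size=20000)
y = (rng.uniform(size=20000) < p).astype(float)
centers, pred, actual = binned_calibration(p, y)
```

A real miscalibration zone shows up as bins where `pred` and `actual` pull apart.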
Things to watch out for
- Correlated features break PD's honesty. Pinning `petal_length = 1.5` on a row with `petal_width = 4.0` constructs a combo that never occurs in nature. The model happily extrapolates, but the extrapolation may be meaningless. Accumulated Local Effects (ALE) is the principled fix; that's another article.
- Categorical features need category sweeps, not grids. Iterate over the levels. Don't interpolate between "red" and "blue."
- High-cardinality or sparse features often have a meaningful region that's much narrower than their nominal range. Use quantiles for the grid (e.g. `[p5, p95]`) instead of `[min, max]`.
- Non-IID data breaks permutation importance. Shuffling assumes rows are exchangeable. Time-series, grouped, or nested data need block permutations.
- Importance is relative to the model's training, not the world. A weak model shows weak importances even on features that genuinely matter. These are insights about your model, not causal claims about reality.
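For the non-IID caveat, the smallest useful fix is a block permutation: shuffle contiguous blocks of the column instead of individual rows, so short-range serial structure survives inside each block. A sketch, with `block_shuffle` a hypothetical helper:

```python
import numpy as np

def block_shuffle(col, block_len, seed=0):
    """Shuffle a 1-D column in contiguous blocks (for time-series data).

    Rows within a block keep their order; only whole blocks move.
    block_len should exceed the serial correlation you want to preserve.
    """
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(len(col) / block_len))
    blocks = [col[b * block_len : (b + 1) * block_len] for b in range(n_blocks)]
    rng.shuffle(blocks)
    return np.concatenate(blocks)

out = block_shuffle(np.arange(12), block_len=4, seed=0)
```

Swap this in for the plain column shuffle when computing importance on serially correlated data; for grouped data, make each group its own block.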
End-to-end, sixty lines
Once the batching trick clicks, the whole algorithm is short enough to hold in your head:
```python
import numpy as np
from sklearn.metrics import accuracy_score


def feature_effects(model, X_train, X_val, y_val, features, grid_n=20, seed=0):
    """Return (importance, effects) for a fitted sklearn-compatible classifier.

    importance[j] -- permutation importance of feature j (drop in accuracy).
    effects[j]    -- (grid_n, num_classes) partial-dependence curve for feature j.
    """
    rng = np.random.default_rng(seed)
    base_acc = accuracy_score(y_val, model.predict(X_val))

    # --- permutation importance on the validation set ---
    importance = {}
    for j in features:
        shuffled = X_val.copy()
        rng.shuffle(shuffled[:, j])  # shuffle column j in place (on the copy)
        drop = base_acc - accuracy_score(y_val, model.predict(shuffled))
        importance[j] = max(0.0, drop)  # clip noise at zero

    # --- partial dependence on the training distribution ---
    effects = {}
    N = X_train.shape[0]
    for j in features:
        lo, hi = np.quantile(X_train[:, j], [0.05, 0.95])
        grid = np.linspace(lo, hi, grid_n)
        batch = np.tile(X_train, (grid_n, 1))
        for g, v in enumerate(grid):
            batch[g * N : (g + 1) * N, j] = v
        probas = model.predict_proba(batch)
        effects[j] = probas.reshape(grid_n, N, -1).mean(axis=1)

    return importance, effects
```
Port it to any language where your model exposes predict_proba on a batch. The whole trick is the batching and the averaging; everything else is bookkeeping.
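If you'd rather not maintain this yourself, scikit-learn ships a reference implementation of the importance half (`sklearn.inspection.permutation_importance`; `partial_dependence` and `PartialDependenceDisplay` cover the PD side). A quick cross-check on iris; it scores the training set purely to keep the snippet self-contained, where in practice you'd pass a held-out set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Reference permutation importance: n_repeats averages over several shuffles,
# random_state seeds them (same rules as the hand-rolled version above).
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
ranked = np.argsort(result.importances_mean)[::-1]
```

On iris the petal features carry almost all the signal, so they should land at the top of `ranked`, which is a good smoke test for any hand-rolled implementation.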