Why Raw Stats Are Not Enough
A pitcher's ERA tells you what happened. A pitcher's expected ERA based on contact quality, adjusted for park and defense, tells you what should have happened. The gap between those two numbers predicts future performance better than either one alone. This is feature engineering in a nutshell: transforming raw statistics into derived variables that capture predictive signal while stripping away noise.
AI models do not understand baseball. They understand numbers in columns. The feature engineer's job is to encode baseball knowledge into those numbers so the model can learn relationships that reflect genuine on-field dynamics rather than statistical artifacts. A well-engineered feature set can make a simple model perform like a complex one. A poorly engineered feature set can make even the most sophisticated model produce mediocre results.
Rolling Window Features
Baseball performance fluctuates constantly. A hitter who slashes .350/.420/.600 over ten games might be a .260 hitter running hot, or a genuine .300 hitter finally healthy. Rolling window features, which compute statistics over sliding time windows (last 7, 15, 30, 60 games), help the model distinguish between these scenarios by capturing both recent form and longer-term baseline performance.
Different window lengths carry different signals. Short windows (7-10 games) are sensitive to hot and cold streaks but noisy due to small sample sizes. Medium windows (30-45 games) balance recency with reliability. Long windows (90+ games or season-to-date) are stable but slow to reflect genuine changes in player performance. Feeding all three windows to the model lets it learn which timeframe matters most for each type of prediction.
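A minimal pandas sketch of multi-window rolling features, assuming a hypothetical game log with player_id, game_date, hits, and at_bats columns:

```python
import pandas as pd

# Hypothetical game log: one row per player per game, sorted by date.
games = pd.DataFrame({
    "player_id": [1] * 8,
    "game_date": pd.date_range("2024-04-01", periods=8),
    "hits":    [1, 0, 2, 1, 3, 0, 1, 2],
    "at_bats": [4, 3, 5, 4, 4, 3, 4, 5],
}).sort_values(["player_id", "game_date"])

for window in (7, 15, 30):
    # shift(1) excludes the current game so each feature uses only prior data
    # (no target leakage); min_periods permits partial windows early in a season.
    rolling_hits = games.groupby("player_id")["hits"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=3).sum()
    )
    rolling_ab = games.groupby("player_id")["at_bats"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=3).sum()
    )
    games[f"avg_last_{window}"] = rolling_hits / rolling_ab
```

Summing hits and at-bats separately before dividing weights each game by its plate appearances, rather than averaging per-game averages.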
Weighted rolling averages go a step further by giving more importance to recent observations. An exponentially weighted moving average of a pitcher's strikeout rate, for example, responds quickly to a genuine velocity change while smoothing over random game-to-game variation. The decay parameter controls this tradeoff and is itself a tunable hyperparameter.
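The same pattern extends to exponential weighting via pandas' ewm. The halflife values and column names below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical per-start strikeout rates for one pitcher.
starts = pd.DataFrame({
    "pitcher_id": [7] * 6,
    "game_date": pd.date_range("2024-04-01", periods=6, freq="5D"),
    "k_rate": [0.22, 0.25, 0.31, 0.28, 0.33, 0.35],
})

# The halflife (in appearances) is the decay hyperparameter: smaller values
# react faster to a real change but pass through more game-to-game noise.
for halflife in (3, 10):
    starts[f"k_rate_ewm_hl{halflife}"] = starts.groupby("pitcher_id")["k_rate"].transform(
        lambda s: s.shift(1).ewm(halflife=halflife, min_periods=2).mean()
    )
```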
Interaction Features
Baseball is full of interactions: factors whose combined effect differs from their individual effects. A right-handed power hitter facing a left-handed sinkerball pitcher at Coors Field is a fundamentally different situation than the same hitter facing the same pitcher at Oracle Park. Interaction features explicitly encode these combinations, allowing the model to learn matchup-specific and context-specific effects.
Pitcher-batter platoon interactions are among the best established: left-handed batters historically perform better against right-handed pitchers, and vice versa. But the magnitude of this platoon advantage varies dramatically by player. Some batters show no meaningful platoon split. Others are nearly helpless against same-side pitching. Interaction features capture these individual-level variations rather than applying a one-size-fits-all platoon adjustment.
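A small sketch of platoon interaction features, assuming hypothetical matchup rows with handedness codes and per-split wOBA columns:

```python
import pandas as pd

# Hypothetical matchups: batter side, pitcher hand, and each batter's
# career wOBA against left- and right-handed pitching.
matchups = pd.DataFrame({
    "bat_side":     ["L", "R", "L"],
    "pitch_hand":   ["R", "R", "L"],
    "woba_vs_lhp":  [0.310, 0.355, 0.290],
    "woba_vs_rhp":  [0.345, 0.330, 0.340],
})

# Categorical cross: lets the model learn matchup-state effects directly.
matchups["platoon_state"] = matchups["bat_side"] + "_vs_" + matchups["pitch_hand"]

# Player-specific platoon feature: select the split that applies to this
# matchup instead of a league-wide, one-size-fits-all adjustment.
matchups["woba_vs_hand"] = matchups["woba_vs_lhp"].where(
    matchups["pitch_hand"] == "L", matchups["woba_vs_rhp"]
)
```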
Environmental interactions matter too. Exit velocity translates to distance differently at different altitudes and temperatures. A model that adjusts batted ball data for game-day environmental conditions extracts signal that a raw exit velocity feature misses entirely. The interaction between batted ball quality and environmental conditions is where some of the most reliable predictive edges live.
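Environmental interactions can be encoded as simple products. The column names and centering choice below are illustrative assumptions:

```python
import pandas as pd

# Hypothetical batted balls with game-day conditions.
bb = pd.DataFrame({
    "exit_velo":   [104.0, 104.0, 95.0],   # mph
    "temp_f":      [88, 55, 72],           # game-time temperature
    "altitude_ft": [5200, 0, 600],         # ballpark elevation
})

# Multiplicative interaction terms: the model can learn that the same exit
# velocity carries farther in warm, thin air. Centering the environmental
# variables keeps the main exit_velo effect interpretable.
bb["ev_x_temp"] = bb["exit_velo"] * (bb["temp_f"] - bb["temp_f"].mean())
bb["ev_x_alt"]  = bb["exit_velo"] * (bb["altitude_ft"] - bb["altitude_ft"].mean())
```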
Park and Context Adjustments
Every MLB ballpark has a unique run-scoring environment determined by dimensions, altitude, climate, and playing surface. A 3.50 ERA in Colorado is not the same as a 3.50 ERA in Miami. Park adjustment factors normalize performance across venues, allowing the model to compare players on a level playing field.
Park factors are not static. They vary by handedness (some parks favor left-handed hitters more than right-handed hitters), by batted ball type (fly ball parks versus ground ball parks), and even by time of day and month (the marine layer in San Francisco suppresses offense during night games). Granular park factors capture these nuances, while crude single-number park factors miss them.
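A sketch of applying granular, handedness-specific park factors, using a hypothetical lookup table where 100 equals league average:

```python
import pandas as pd

# Hypothetical handedness-specific run park factors (100 = league average).
park_factors = pd.DataFrame({
    "park":     ["COL", "COL", "SF", "SF"],
    "bat_side": ["L", "R", "L", "R"],
    "pf_runs":  [114, 112, 93, 96],
})

# Two left-handed hitters with identical raw run production.
lines = pd.DataFrame({
    "park":     ["COL", "SF"],
    "bat_side": ["L", "L"],
    "wrc":      [120, 120],
})

# Join the granular factor and normalize: identical raw numbers grade out
# very differently once the venue is accounted for.
adj = lines.merge(park_factors, on=["park", "bat_side"])
adj["wrc_park_adj"] = adj["wrc"] / (adj["pf_runs"] / 100)
```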
Beyond physical park effects, context features encode situational information: day versus night, series game number, travel distance from previous series, days of rest for the starting pitcher, and bullpen workload over the preceding week. Each of these factors introduces measurable variance in game outcomes, and encoding them as features gives the model the opportunity to learn their effects.
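Several of these context features fall directly out of the schedule. A sketch computing days of rest from a hypothetical game log:

```python
import pandas as pd

# Hypothetical starter game log with context columns.
starts = pd.DataFrame({
    "pitcher_id": [7, 7, 7],
    "game_date": pd.to_datetime(["2024-04-01", "2024-04-06", "2024-04-14"]),
    "is_night": [True, False, True],   # day/night flag, usable as-is
})

# Days of rest since the pitcher's previous appearance; the first start has
# no prior game, so the model sees a missing value (or a season-opener flag).
starts["days_rest"] = starts.groupby("pitcher_id")["game_date"].diff().dt.days
```

Travel distance and bullpen workload follow the same pattern: join the schedule or relief log, then difference or aggregate over the preceding days.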
Derived Skill Metrics
Some of the most powerful features are not found in any box score. They are computed from pitch-level and batted ball data to isolate specific skills. A pitcher's "stuff" rating, derived from pitch velocity, movement, extension, and release consistency, predicts future strikeout rate better than past strikeout rate itself. A hitter's barrel rate and chase rate predict future power production better than traditional slugging percentage.
These derived metrics attempt to measure underlying ability rather than realized outcomes. Outcomes are influenced by defense, luck, sequencing, and context. Ability is more stable and more predictive. The feature engineer's goal is to construct variables that measure ability as cleanly as possible, stripping away the variance introduced by factors outside the player's control.
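As one concrete example, a barrel-rate feature can be computed from batted-ball data. The sketch below uses a rough approximation of Statcast's barrel definition, not the official rule, and hypothetical column names:

```python
import pandas as pd

# Hypothetical batted-ball events for one hitter.
bb = pd.DataFrame({
    "batter_id":    [10, 10, 10, 10],
    "exit_velo":    [101.0, 88.0, 105.0, 99.0],
    "launch_angle": [27.0, 12.0, 24.0, 45.0],
})

def is_barrel(ev: float, la: float) -> bool:
    # Rough approximation of Statcast's barrel rule: qualifying launch-angle
    # band starts at 26-30 degrees at 98 mph and widens as exit velocity rises.
    if ev < 98:
        return False
    spread = min(ev - 98, 18)
    return (26 - spread) <= la <= (30 + spread)

bb["barrel"] = [is_barrel(ev, la) for ev, la in zip(bb["exit_velo"], bb["launch_angle"])]
barrel_rate = bb.groupby("batter_id")["barrel"].mean()
```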
Feature Selection: Less Can Be More
It is tempting to throw every conceivable feature into the model and let it sort out what matters. This approach usually backfires. Too many features increase the risk of overfitting, where the model finds spurious correlations in the training data that do not generalize to new games. They also slow training, complicate interpretation, and can introduce multicollinearity that destabilizes coefficient estimates.
Effective feature selection involves identifying the subset of features that carries the most predictive signal with the least redundancy. Techniques include recursive feature elimination (iteratively removing the least important features), mutual information analysis (measuring how much each feature tells you about the outcome), and permutation importance testing (measuring how much model performance degrades when a feature's values are randomized).
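A sketch of two of these techniques using scikit-learn, run here on synthetic data standing in for an engineered feature matrix:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import mutual_info_regression
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 candidate features, only 6 of which carry signal.
X, y = make_regression(n_samples=500, n_features=20, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Mutual information: how much each feature alone says about the target.
mi = mutual_info_regression(X_train, y_train, random_state=0)
print(np.argsort(mi)[::-1][:6])                     # top features by MI

# Permutation importance: how much held-out performance drops when a
# feature's values are shuffled, breaking its link to the target.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(np.argsort(perm.importances_mean)[::-1][:6])  # top features by permutation
```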
The best feature sets are compact, interpretable, and robust. They include features that capture genuinely different dimensions of the prediction problem, without redundancy. Getting to this point requires iterative experimentation: building features, testing them, discarding the ones that add more noise than signal, and refining the survivors. It is less glamorous than building new model architectures, but it typically has a larger impact on prediction quality.