The Full Pipeline: Data to Decision
Building a profitable MLB betting model isn't just about training an algorithm; it's about building an end-to-end system that starts with raw data and ends with actionable predictions. Here's our complete pipeline:
Phase 1: Data Collection & Cleaning
We pull from multiple sources because no single API has everything we need:
- MLB Statcast API: Pitch-by-pitch data including velocity, spin rate, release point, exit velocity, launch angle, and batted ball location
- Baseball-Reference scraping: Historical game logs, player splits (home/away, vs LHP/RHP), and advanced metrics
- FanGraphs: Sabermetric stats (wOBA, FIP, wRC+, xFIP) and plate discipline metrics
- Weather Underground API: Historical and forecasted weather data for every ballpark
- Odds aggregators: Historical closing lines from multiple sportsbooks to track sharp money
- MLB Injury Reports: Official IL updates and estimated return timelines
Data cleaning is unglamorous but critical. We've built custom scripts (sketched in code after this list) to:
- Handle missing values (forward-fill for time series, median imputation for cross-sectional)
- Remove outliers with the IQR method (drop anything beyond 3x the interquartile range)
- Standardize team names across different APIs (LAD vs. Los Angeles Dodgers vs. Dodgers)
- Validate game outcomes against multiple sources (MLB.com, ESPN, Baseball-Reference)
- Detect and correct data entry errors (impossible stats like 200 mph exit velocity)
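As promised above, here's a minimal sketch of what that cleaning pass can look like in pandas. The column names, team aliases, and DataFrame shape are illustrative, not our production schema:

```python
import pandas as pd

# Canonical team codes: every API variant maps to one key (illustrative subset)
TEAM_ALIASES = {"LAD": "LAD", "Los Angeles Dodgers": "LAD", "Dodgers": "LAD"}

def clean_games(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Standardize team names across sources
    df["team"] = df["team"].map(TEAM_ALIASES).fillna(df["team"])

    # Forward-fill time-series fields within each team's chronological history
    df = df.sort_values("game_date")
    df["rolling_era"] = df.groupby("team")["rolling_era"].ffill()

    # Median imputation for cross-sectional fields
    df["exit_velocity"] = df["exit_velocity"].fillna(df["exit_velocity"].median())

    # Drop rows beyond 3x the interquartile range (our outlier rule)
    q1, q3 = df["exit_velocity"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df["exit_velocity"].between(q1 - 3 * iqr, q3 + 3 * iqr)]
```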
Data Quality Nightmare: In 2023, we discovered a bug where Statcast was incorrectly tagging some sliders as curveballs. This poisoned three months of training data before we caught it. Lesson: never trust a single source, always validate.
Phase 2: Feature Engineering
This is where the real work happens. Raw stats are a starting point, but derived features are where the predictive power comes from. We've engineered 180+ features across these categories:
Pitcher Performance Features (45 features)
- Rolling velocity averages (5-game, 10-game, 20-game) by pitch type; see the sketch after this list
- Spin rate trends and standard deviation over last 30 days
- Release point consistency (lower variance = better command)
- Usage rates by pitch type (has mix changed recently?)
- Whiff rate, chase rate, and contact quality metrics by pitch
- Days rest and cumulative pitch count over last 14 days
- Park-adjusted ERA and FIP with rolling windows
- Platoon splits (performance vs RHB and LHB separately)
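Here's the rolling-window sketch referenced above. `release_speed` and `pitch_type` follow Statcast naming, but `pitcher_id`, the specific windows, and the leakage guard are our illustrative choices:

```python
import pandas as pd

def rolling_velocity_features(pitches: pd.DataFrame) -> pd.DataFrame:
    # Collapse pitch-level rows to one average velocity per pitcher/pitch/game
    per_game = (pitches
                .groupby(["pitcher_id", "pitch_type", "game_date"], as_index=False)
                ["release_speed"].mean()
                .sort_values("game_date"))

    grouped = per_game.groupby(["pitcher_id", "pitch_type"])["release_speed"]
    for window in (5, 10, 20):
        # shift(1) keeps the current game out of its own feature (no leakage)
        per_game[f"velo_avg_{window}g"] = grouped.transform(
            lambda s, w=window: s.shift(1).rolling(w, min_periods=3).mean())
    return per_game
```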
Hitter/Lineup Features (52 features)
- Weighted lineup OPS giving more weight to top-of-order bats (see the sketch after this list)
- Hard-hit rate and barrel rate for starting lineup
- Chase rate and walk rate (discipline metrics)
- Performance vs specific pitch types (e.g., lineup vs fastballs 95+ mph)
- Recent form indicators (last 7 days, last 15 days)
- Platoon advantage percentage (how many favorable matchups)
- Speed metrics (sprint speed as a proxy for stolen-base and extra-base-hit potential)
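The weighted lineup OPS referenced above, as a toy function. The slot weights here are placeholders that mimic the declining plate-appearance share down the batting order, not our tuned values:

```python
# Hypothetical weights for batting-order slots 1-9: leadoff bats get the
# most plate appearances, so they count more toward the lineup aggregate
SLOT_WEIGHTS = [1.15, 1.12, 1.09, 1.06, 1.03, 1.00, 0.97, 0.94, 0.91]

def weighted_lineup_ops(lineup_ops: list[float]) -> float:
    """lineup_ops: OPS for the nine starters in batting-order sequence."""
    weighted = sum(w * ops for w, ops in zip(SLOT_WEIGHTS, lineup_ops))
    return weighted / sum(SLOT_WEIGHTS)
```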
Matchup & Situational Features (38 features)
- Head-to-head history when sufficient sample size exists
- Division game indicator (intra-division games have different dynamics than inter-division ones)
- Travel distance and time zone changes
- Rest days for each team
- Series position (Game 1 vs Game 3 of series)
- Recent win/loss streaks and momentum indicators
- Playoff implications (magic number, elimination scenarios)
Bullpen & Relief Features (22 features)
- Bullpen availability (who pitched recently, who's rested)
- Bullpen ERA, FIP, and WHIP over various windows
- High-leverage performance metrics
- Expected workload (starter's average IP affects bullpen usage)
- Closer save/blown save records
Environmental & Park Features (23 features)
- Temperature, humidity, barometric pressure at game time
- Wind speed and direction, decomposed into blowing-out/blowing-in and left/right crosswind components (see the sketch after this list)
- Park factors for HR, runs, hits (adjusted for weather)
- Roof status for retractable roof stadiums
- Field conditions (wet, dry, recent rain)
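The wind decomposition mentioned above looks roughly like this. The sign conventions, and the assumption that wind direction is reported as the compass bearing the wind blows toward, are ours for illustration:

```python
import math

def wind_components(speed_mph: float, wind_dir_deg: float, cf_bearing_deg: float):
    """Split wind into components relative to the park's center-field line.

    wind_dir_deg: compass bearing the wind blows toward (assumed convention)
    cf_bearing_deg: compass bearing from home plate out to center field
    """
    rel = math.radians(wind_dir_deg - cf_bearing_deg)
    out_to_cf = speed_mph * math.cos(rel)  # + blowing out to CF, - blowing in
    crosswind = speed_mph * math.sin(rel)  # + toward the RF line, - toward LF
    return out_to_cf, crosswind
```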
Feature Engineering Win: We created a "Pitcher Fatigue Index" combining days rest, recent pitch count, and velocity trends. This single feature improved model accuracy by 2.3% because it catches fatigued pitchers before they blow up. The betting markets often miss this.
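We won't publish the production formula, but here's an illustrative sketch of the shape of an index like this; every weight, cap, and threshold below is a placeholder, not our real coefficients:

```python
def pitcher_fatigue_index(days_rest: int,
                          pitches_last_14d: int,
                          velo_delta_mph: float) -> float:
    """0-100 scale, higher = more fatigued. velo_delta_mph is the recent
    average velocity minus the season baseline (negative when slipping)."""
    rest_term = max(0.0, (5 - days_rest) / 5)           # short rest -> up to 1.0
    workload_term = min(1.0, pitches_last_14d / 250)    # heavy 14-day workload
    velo_term = min(1.0, max(0.0, -velo_delta_mph / 2.0))  # each lost mph hurts
    return 100 * (0.40 * rest_term + 0.35 * workload_term + 0.25 * velo_term)
```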
Model Architectures: The Four-Model Ensemble
We run four different models and combine their predictions. Here's the technical breakdown:
Model 1: Random Forest (Primary Workhorse)
Architecture (sketched in code below):
- 500 decision trees, each trained on random subsets of data (bagging)
- Max depth: 15 levels to prevent overfitting
- Min samples per leaf: 10 (keeps leaf-level estimates from resting on tiny samples)
- Feature subsampling: sqrt(n_features) at each split
- Out-of-bag scoring for validation without a separate holdout set
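That configuration maps almost line-for-line onto scikit-learn; here's a sketch (the `n_jobs` and `random_state` values are incidental):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # 500 bagged trees
    max_depth=15,          # cap depth to prevent overfitting
    min_samples_leaf=10,   # no leaf smaller than 10 samples
    max_features="sqrt",   # sqrt(n_features) candidates at each split
    oob_score=True,        # out-of-bag validation, no separate holdout
    n_jobs=-1,
    random_state=42,
)
# rf.fit(X_train, y_train); rf.oob_score_ then reports the OOB accuracy
```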
Why It Works:
Random Forests excel at capturing non-linear relationships. For example, temperature's effect on run scoring isn't linear—games at 55°F and 95°F both suppress scoring, but 75°F is optimal. A linear model can't capture this, but decision trees can split on multiple temperature ranges.
The ensemble nature (500 trees voting) reduces variance dramatically. Individual trees overfit, but averaging their predictions smooths out noise. It's like polling—one poll might be wrong, but averaging 500 polls gets closer to truth.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 58.7%
- Precision: 59.2%
- Recall: 61.4%
- F1 Score: 0.603

Betting Performance:
- Win Rate (ML): 58.7%
- ROI: 17.2%
- Units Profit: +18.4U
- Sample Size: 214 picks
Model 2: Neural Network (Pattern Finder)
Architecture (sketched in code below):
- Input layer: 180 features (fully connected)
- Hidden layer 1: 128 neurons, ReLU activation, 30% dropout
- Hidden layer 2: 64 neurons, ReLU activation, 25% dropout
- Hidden layer 3: 32 neurons, ReLU activation, 20% dropout
- Output layer: 3 neurons (home win / away win / over/under), softmax activation
- Loss function: Categorical cross-entropy
- Optimizer: Adam with learning rate 0.001
- Training: 100 epochs with early stopping (patience=10)
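In Keras the spec above comes out roughly like this; it's a sketch of the listed layers and optimizer, not our full training harness:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(180,)),                    # 180 engineered features
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.30),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.20),
    tf.keras.layers.Dense(3, activation="softmax"),  # 3-way output head
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])
```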
Why It Works:
Neural networks discover complex, non-obvious interactions between features. For instance, the model learned that high wind at Coors Field (high altitude, thinner air) has a different effect than high wind at sea-level parks. It also identified that certain pitcher-hitter matchups (fastball velocity + batter swing speed + launch angle tendency) predict outcomes better than analyzing each variable independently.
The dropout layers prevent overfitting by randomly "turning off" neurons during training, forcing the network to learn robust features that don't depend on any single pathway.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 56.9%
- Precision: 57.3%
- Training Loss: 0.623
- Validation Loss: 0.641

Betting Performance:
- Win Rate (Totals): 63.2%
- ROI: 22.1%
- Units Profit: +9.7U
- Sample Size: 87 picks
Neural Network Specialty: This model crushes totals predictions. Its ability to model complex scoring dynamics (how pitcher velocity + weather + bullpen fatigue interact) makes it our go-to for over/under bets. When the neural net has high confidence on a total, we listen.
Model 3: Gradient Boosting (XGBoost)
Architecture (sketched in code below):
- Boosting rounds: 300 trees (with early stopping)
- Learning rate: 0.05 (slower learning = better generalization)
- Max depth: 6 (shallower than Random Forest to reduce overfitting)
- Subsample: 0.8 (use 80% of data per tree)
- Colsample_bytree: 0.8 (use 80% of features per tree)
- Objective: binary:logistic for moneyline, reg:squarederror for totals
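The same hyperparameters expressed through XGBoost's scikit-learn wrapper, shown here with the moneyline (binary) objective; the eval metric is our illustrative choice:

```python
from xgboost import XGBClassifier

xgb_ml = XGBClassifier(
    n_estimators=300,        # boosting rounds; early stopping can end sooner
    learning_rate=0.05,      # small steps generalize better
    max_depth=6,             # shallow trees, corrected iteratively
    subsample=0.8,           # 80% of rows per tree
    colsample_bytree=0.8,    # 80% of features per tree
    objective="binary:logistic",
    eval_metric="logloss",
)
# xgb_ml.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```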
Why It Works:
XGBoost builds trees sequentially, where each new tree corrects mistakes from previous trees. It's like having 300 experts where each one focuses on fixing what the previous experts got wrong. This iterative error correction makes it deadly accurate for specific scenarios.
XGBoost particularly excels at run line predictions because it captures the nuance of "close game" vs "blowout" dynamics. It learned that certain pitcher-offense matchups tend toward one-run games, while others frequently produce blowouts.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 57.4%
- AUC-ROC Score: 0.614
- Log Loss: 0.652

Betting Performance:
- Win Rate (RL): 57.1%
- ROI: 14.8%
- Units Profit: +7.2U
- Sample Size: 96 picks
Model 4: Logistic Regression (Baseline & Sanity Check)
Architecture (sketched in code below):
- Standard logistic regression with L2 regularization (ridge)
- Regularization strength: C=1.0
- Solver: liblinear (good for small-to-medium datasets)
- Feature selection: keep the top 50 features by absolute correlation with the outcome
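A compact sketch of this baseline, assuming a feature matrix `X` and binary outcome vector `y`; the correlation screen is written as a plain NumPy loop for clarity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_baseline(X: np.ndarray, y: np.ndarray, k: int = 50):
    """L2-regularized logistic regression on the k most correlated features."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    top_k = np.argsort(corr)[-k:]      # indices of the k strongest features
    model = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
    model.fit(X[:, top_k], y)
    return model, top_k  # model.coef_ holds the interpretable coefficients
```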
Why We Use It:
Logistic regression is intentionally simple. It can't capture complex interactions, but that's the point. If our fancy neural network predicts something wildly different from logistic regression, we investigate why. Often, the neural net is overfitting or picking up spurious correlations.
Plus, logistic regression gives interpretable coefficients. We can see exactly which features drive predictions: "A 1 mph increase in fastball velocity = 2.3% higher win probability." That helps us understand what the models are learning.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 55.2%
- Coefficient Stability: High

Betting Performance:
- Win Rate: 55.2%
- ROI: 9.4%
- Units Profit: +4.1U
- Sample Size: 87 picks
Ensemble Method: Combining Predictions
Here's where it gets interesting. We don't just pick the "best" model—we combine all four using weighted averaging. Here's our approach:
Dynamic Weighting Strategy
Model weights aren't static. They adjust based on recent performance and bet type:
| Model | Moneyline Weight | Run Line Weight | Totals Weight |
|---|---|---|---|
| Random Forest | 40% | 30% | 25% |
| Neural Network | 25% | 20% | 45% |
| XGBoost | 25% | 40% | 20% |
| Logistic Regression | 10% | 10% | 10% |
These weights were optimized using historical out-of-sample data from 2020-2024. We tested 10,000+ weight combinations to find the optimal balance.
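Mechanically, the blend is just a weighted average of the four models' probabilities, keyed by bet type. A sketch using the weights from the table (the model keys and example probabilities are made up):

```python
# Column weights from the table above; each bet type's weights sum to 1.0
WEIGHTS = {
    "moneyline": {"rf": 0.40, "nn": 0.25, "xgb": 0.25, "logit": 0.10},
    "run_line":  {"rf": 0.30, "nn": 0.20, "xgb": 0.40, "logit": 0.10},
    "totals":    {"rf": 0.25, "nn": 0.45, "xgb": 0.20, "logit": 0.10},
}

def ensemble_probability(preds: dict[str, float], market: str) -> float:
    """preds: each model's home-win (or over) probability, keyed by model."""
    weights = WEIGHTS[market]
    return sum(weights[m] * p for m, p in preds.items())

# Example: all four models lean home team in a moneyline spot
p = ensemble_probability(
    {"rf": 0.61, "nn": 0.57, "xgb": 0.60, "logit": 0.55}, "moneyline")  # ≈ 0.59
```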
Confidence Score Calculation
Our confidence score (0-100) considers:
- Model Agreement (40% weight): When all four models agree, confidence spikes. Disagreement lowers it.
- Prediction Strength (30% weight): How confident is each model? A 60% probability is weaker than 80%.
- Historical Performance (20% weight): How has this exact bet type performed recently?
- Data Quality (10% weight): Do we have complete data? Missing features hurt confidence.
```python
# Weighted blend of the four confidence components (each scored 0-100)
confidence = (model_agreement * 0.40 +
              prediction_strength * 0.30 +
              recent_performance * 0.20 +
              data_quality * 0.10)

# Tiered bet sizing: stake scales with confidence; below 55 we pass
if confidence >= 85:
    bet_size = 3.0   # units
elif confidence >= 70:
    bet_size = 1.0
elif confidence >= 55:
    bet_size = 0.5
else:
    bet_size = 0.0   # no bet
```
Ensemble Advantage: In 2025, the ensemble outperformed any individual model by 3.7% in win rate and 6.2% in ROI. Diversification works in modeling just like it does in investing. Different models catch different edges.
Validation & Testing Protocols
Anyone can build a model that performs great on training data. The trick is building one that performs well on new data. Here's how we validate:
1. Train-Test Split (Temporal)
We split data chronologically, never randomly. Training on 2010-2023, testing on 2024, then deploying for 2025. This simulates real-world usage where you're predicting the future, not randomly sampling the past.
Common Mistake: Many "backtests" use random splits, which leaks future information into training. A model might see Game 2 of a series in training and Game 1 in testing—that's cheating. Always split temporally for time-series data.
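In code, the split is nothing fancier than filtering on date. This sketch assumes a `games` DataFrame with a datetime `game_date` column:

```python
# Everything through 2023 trains; 2024 is the untouched test season
train = games.loc[games["game_date"].dt.year <= 2023]
test = games.loc[games["game_date"].dt.year == 2024]
# Never shuffle: a random split would leak future games into training
```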
2. Walk-Forward Validation
We retrain models monthly using only data available at that point in time. For example, our June 2025 model was trained on games through May 2025. This ensures we're not using future data to predict the past.
3. Cross-Validation (K-Fold, Temporal)
We run 5-fold temporal cross-validation: split the data into five sequential chunks, then train on the earlier chunks and test on the next one, expanding the training window with each fold. This gives confidence intervals around our performance metrics.
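scikit-learn's `TimeSeriesSplit` implements exactly this expanding-window scheme; a sketch assuming chronologically sorted arrays `X` and `y` and a scikit-learn `model` refit each fold:

```python
from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on all earlier chunks and tests on the next one
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    model.fit(X[train_idx], y[train_idx])   # refit from scratch each fold
    acc = model.score(X[test_idx], y[test_idx])
    print(f"fold {fold}: accuracy = {acc:.3f}")
```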
4. Out-of-Sample Testing (2024 Holdout)
The entire 2024 season was held out during model development. We used it once for final validation before deploying for 2025. Results: 56.8% accuracy, confirming the models generalize.
5. Real-Money Tracking (2025 Season)
The ultimate test. We've tracked every pick since Opening Day 2025 with real betting lines (not closing lines, but lines available when we post picks). Current record: 198-144-3 (57.9% excluding pushes), +24.8 units profit.
What We're Working On Next
Model development never stops. Here's what's in our pipeline:
Planned Improvements (Winter 2025-2026)
- Transformer Architecture: Experimenting with attention mechanisms for sequential data (pitch-by-pitch modeling)
- Real-Time Updates: Models that adjust during games based on pitch velocity, score, and in-game events
- Injury Impact Modeling: Quantifying lineup changes when key players are injured or rested
- Umpire-Specific Models: Different umpires have different strike zones; modeling this could add 1-2% edge
- Betting Market Integration: Using line movement as a feature (not just validation) to identify sharp money
- First 5 Innings Specialization: Building dedicated F5 models, since bullpen quality barely factors into the first five innings
Machine learning for sports betting is an arms race. As markets get sharper, we need better models. The moment we stop improving is the moment our edge disappears.
Ready to See the Models in Action?
Daily picks with full transparency—see which models agree, confidence scores, and reasoning