The Full Pipeline: Data to Decision
Building a profitable MLB betting model isn't just about training an algorithm; it's about building an end-to-end system that starts with raw data and ends with actionable predictions. Here's our complete pipeline:
Phase 1: Data Collection & Cleaning
We pull from multiple sources because no single API has everything we need:
- MLB Statcast API: Pitch-by-pitch data including velocity, spin rate, release point, exit velocity, launch angle, and batted ball location
- Baseball-Reference scraping: Historical game logs, player splits (home/away, vs LHP/RHP), and advanced metrics
- FanGraphs: Sabermetric stats (wOBA, FIP, wRC+, xFIP) and plate discipline metrics
- Weather Underground API: Historical and forecasted weather data for every ballpark
- Odds aggregators: Historical closing lines from multiple sportsbooks to track sharp money
- MLB Injury Reports: Official IL updates and estimated return timelines
Data cleaning is unglamorous but critical. We've built custom scripts (sketched in code after this list) to:
- Handle missing values (forward-fill for time series, median imputation for cross-sectional)
- Remove outliers with the IQR method (drop anything beyond 3x the interquartile range)
- Standardize team names across different APIs (LAD vs. Los Angeles Dodgers vs. Dodgers)
- Validate game outcomes against multiple sources (MLB.com, ESPN, Baseball-Reference)
- Detect and correct data entry errors (impossible stats like 200 mph exit velocity)
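As promised above, here's a minimal sketch of what that cleaning pass can look like in pandas. The column names, team aliases, and DataFrame shape are illustrative, not our production schema:

```python
import pandas as pd

# Canonical team codes: every API variant maps to one key (illustrative subset)
TEAM_ALIASES = {"LAD": "LAD", "Los Angeles Dodgers": "LAD", "Dodgers": "LAD"}

def clean_games(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Standardize team names across sources
    df["team"] = df["team"].map(TEAM_ALIASES).fillna(df["team"])

    # Forward-fill time-series fields within each team's chronological history
    df = df.sort_values("game_date")
    df["rolling_era"] = df.groupby("team")["rolling_era"].ffill()

    # Median imputation for cross-sectional fields
    df["exit_velocity"] = df["exit_velocity"].fillna(df["exit_velocity"].median())

    # Drop rows beyond 3x the interquartile range (our outlier rule)
    q1, q3 = df["exit_velocity"].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df["exit_velocity"].between(q1 - 3 * iqr, q3 + 3 * iqr)]
```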
Data Quality Nightmare: In 2023, we discovered a bug where Statcast was incorrectly tagging some sliders as curveballs. This poisoned three months of training data before we caught it. Lesson: never trust a single source, always validate.
Phase 2: Feature Engineering
This is where the real work happens. Raw stats are a starting point, but derived features are where the predictive power comes from. We've engineered 180+ features across these categories:
Pitcher Performance Features (45 features)
- Rolling velocity averages (5-game, 10-game, 20-game) by pitch type; see the sketch after this list
- Spin rate trends and standard deviation over last 30 days
- Release point consistency (lower variance = better command)
- Usage rates by pitch type (has mix changed recently?)
- Whiff rate, chase rate, and contact quality metrics by pitch
- Days rest and cumulative pitch count over last 14 days
- Park-adjusted ERA and FIP with rolling windows
- Platoon splits (performance vs RHB and LHB separately)
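Here's the rolling-window sketch referenced above. `release_speed` and `pitch_type` follow Statcast naming, but `pitcher_id`, the specific windows, and the leakage guard are our illustrative choices:

```python
import pandas as pd

def rolling_velocity_features(pitches: pd.DataFrame) -> pd.DataFrame:
    # Collapse pitch-level rows to one average velocity per pitcher/pitch/game
    per_game = (pitches
                .groupby(["pitcher_id", "pitch_type", "game_date"], as_index=False)
                ["release_speed"].mean()
                .sort_values("game_date"))

    grouped = per_game.groupby(["pitcher_id", "pitch_type"])["release_speed"]
    for window in (5, 10, 20):
        # shift(1) keeps the current game out of its own feature (no leakage)
        per_game[f"velo_avg_{window}g"] = grouped.transform(
            lambda s, w=window: s.shift(1).rolling(w, min_periods=3).mean())
    return per_game
```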
Hitter/Lineup Features (52 features)
- Weighted lineup OPS giving more weight to top-of-order bats (see the sketch after this list)
- Hard-hit rate and barrel rate for starting lineup
- Chase rate and walk rate (discipline metrics)
- Performance vs specific pitch types (e.g., lineup vs fastballs 95+ mph)
- Recent form indicators (last 7 days, last 15 days)
- Platoon advantage percentage (how many favorable matchups)
- Speed metrics (sprint speed as a proxy for stolen-base and extra-base-hit potential)
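The weighted lineup OPS referenced above, as a toy function. The slot weights here are placeholders that mimic the declining plate-appearance share down the batting order, not our tuned values:

```python
# Hypothetical weights for batting-order slots 1-9: leadoff bats get the
# most plate appearances, so they count more toward the lineup aggregate
SLOT_WEIGHTS = [1.15, 1.12, 1.09, 1.06, 1.03, 1.00, 0.97, 0.94, 0.91]

def weighted_lineup_ops(lineup_ops: list[float]) -> float:
    """lineup_ops: OPS for the nine starters in batting-order sequence."""
    weighted = sum(w * ops for w, ops in zip(SLOT_WEIGHTS, lineup_ops))
    return weighted / sum(SLOT_WEIGHTS)
```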
Matchup & Situational Features (38 features)
- Head-to-head history when sufficient sample size exists
- Division game indicator (intra-division games have different dynamics than inter-division ones)
- Travel distance and time zone changes
- Rest days for each team
- Series position (Game 1 vs Game 3 of series)
- Recent win/loss streaks and momentum indicators
- Playoff implications (magic number, elimination scenarios)
Bullpen & Relief Features (22 features)
- Bullpen availability (who pitched recently, who's rested)
- Bullpen ERA, FIP, and WHIP over various windows
- High-leverage performance metrics
- Expected workload (starter's average IP affects bullpen usage)
- Closer save/blown save records
Environmental & Park Features (23 features)
- Temperature, humidity, barometric pressure at game time
- Wind speed and direction, decomposed into blowing-out/blowing-in and left/right crosswind components (see the sketch after this list)
- Park factors for HR, runs, hits (adjusted for weather)
- Roof status for retractable roof stadiums
- Field conditions (wet, dry, recent rain)
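The wind decomposition mentioned above looks roughly like this. The sign conventions, and the assumption that wind direction is reported as the compass bearing the wind blows toward, are ours for illustration:

```python
import math

def wind_components(speed_mph: float, wind_dir_deg: float, cf_bearing_deg: float):
    """Split wind into components relative to the park's center-field line.

    wind_dir_deg: compass bearing the wind blows toward (assumed convention)
    cf_bearing_deg: compass bearing from home plate out to center field
    """
    rel = math.radians(wind_dir_deg - cf_bearing_deg)
    out_to_cf = speed_mph * math.cos(rel)  # + blowing out to CF, - blowing in
    crosswind = speed_mph * math.sin(rel)  # + toward the RF line, - toward LF
    return out_to_cf, crosswind
```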
Feature Engineering Win: We created a "Pitcher Fatigue Index" combining days rest, recent pitch count, and velocity trends. This single feature improved model accuracy by 2.3% because it catches fatigued pitchers before they blow up. The betting markets often miss this.
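We won't publish the production formula, but here's an illustrative sketch of the shape of an index like this; every weight, cap, and threshold below is a placeholder, not our real coefficients:

```python
def pitcher_fatigue_index(days_rest: int,
                          pitches_last_14d: int,
                          velo_delta_mph: float) -> float:
    """0-100 scale, higher = more fatigued. velo_delta_mph is the recent
    average velocity minus the season baseline (negative when slipping)."""
    rest_term = max(0.0, (5 - days_rest) / 5)           # short rest -> up to 1.0
    workload_term = min(1.0, pitches_last_14d / 250)    # heavy 14-day workload
    velo_term = min(1.0, max(0.0, -velo_delta_mph / 2.0))  # each lost mph hurts
    return 100 * (0.40 * rest_term + 0.35 * workload_term + 0.25 * velo_term)
```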
Model Architectures: The Four-Model Ensemble
We run four different models and combine their predictions. Here's the technical breakdown:
Model 1: Random Forest (Primary Workhorse)
Architecture (sketched in code below):
- 500 decision trees, each trained on random subsets of data (bagging)
- Max depth: 15 levels to prevent overfitting
- Min samples per leaf: 10 (keeps leaf-level estimates from resting on tiny samples)
- Feature subsampling: sqrt(n_features) at each split
- Out-of-bag scoring for validation without a separate holdout set
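That configuration maps almost line-for-line onto scikit-learn; here's a sketch (the `n_jobs` and `random_state` values are incidental):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,      # 500 bagged trees
    max_depth=15,          # cap depth to prevent overfitting
    min_samples_leaf=10,   # no leaf smaller than 10 samples
    max_features="sqrt",   # sqrt(n_features) candidates at each split
    oob_score=True,        # out-of-bag validation, no separate holdout
    n_jobs=-1,
    random_state=42,
)
# rf.fit(X_train, y_train); rf.oob_score_ then reports the OOB accuracy
```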
Why It Works:
Random Forests excel at capturing non-linear relationships. For example, temperature's effect on run scoring isn't linear—games at 55°F and 95°F both suppress scoring, but 75°F is optimal. A linear model can't capture this, but decision trees can split on multiple temperature ranges.
The ensemble nature (500 trees voting) reduces variance dramatically. Individual trees overfit, but averaging their predictions smooths out noise. It's like polling—one poll might be wrong, but averaging 500 polls gets closer to truth.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 58.7%
- Precision: 59.2%
- Recall: 61.4%
- F1 Score: 0.603

Betting Performance:
- Win Rate (ML): 58.7%
- ROI: 17.2%
- Units Profit: +18.4U
- Sample Size: 214 picks
Model 2: Neural Network (Pattern Finder)
Architecture (sketched in code below):
- Input layer: 180 features (fully connected)
- Hidden layer 1: 128 neurons, ReLU activation, 30% dropout
- Hidden layer 2: 64 neurons, ReLU activation, 25% dropout
- Hidden layer 3: 32 neurons, ReLU activation, 20% dropout
- Output layer: 3 neurons (home win / away win / over/under), softmax activation
- Loss function: Categorical cross-entropy
- Optimizer: Adam with learning rate 0.001
- Training: 100 epochs with early stopping (patience=10)
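In Keras the spec above comes out roughly like this; it's a sketch of the listed layers and optimizer, not our full training harness:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(180,)),                    # 180 engineered features
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.30),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.20),
    tf.keras.layers.Dense(3, activation="softmax"),  # 3-way output head
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                              restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, callbacks=[early_stop])
```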
Why It Works:
Neural networks discover complex, non-obvious interactions between features. For instance, the model learned that high wind at Coors Field (high altitude, thinner air) has a different effect than high wind at sea-level parks. It also identified that certain pitcher-hitter matchups (fastball velocity + batter swing speed + launch angle tendency) predict outcomes better than analyzing each variable independently.
The dropout layers prevent overfitting by randomly "turning off" neurons during training, forcing the network to learn robust features that don't depend on any single pathway.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 56.9%
- Precision: 57.3%
- Training Loss: 0.623
- Validation Loss: 0.641

Betting Performance:
- Win Rate (Totals): 63.2%
- ROI: 22.1%
- Units Profit: +9.7U
- Sample Size: 87 picks
Neural Network Specialty: This model crushes totals predictions. Its ability to model complex scoring dynamics (how pitcher velocity + weather + bullpen fatigue interact) makes it our go-to for over/under bets. When the neural net has high confidence on a total, we listen.
Model 3: Gradient Boosting (XGBoost)
Architecture (sketched in code below):
- Boosting rounds: 300 trees (with early stopping)
- Learning rate: 0.05 (slower learning = better generalization)
- Max depth: 6 (shallower than Random Forest to reduce overfitting)
- Subsample: 0.8 (use 80% of data per tree)
- Colsample_bytree: 0.8 (use 80% of features per tree)
- Objective: binary:logistic for moneyline, reg:squarederror for totals
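The same hyperparameters expressed through XGBoost's scikit-learn wrapper, shown here with the moneyline (binary) objective; the eval metric is our illustrative choice:

```python
from xgboost import XGBClassifier

xgb_ml = XGBClassifier(
    n_estimators=300,        # boosting rounds; early stopping can end sooner
    learning_rate=0.05,      # small steps generalize better
    max_depth=6,             # shallow trees, corrected iteratively
    subsample=0.8,           # 80% of rows per tree
    colsample_bytree=0.8,    # 80% of features per tree
    objective="binary:logistic",
    eval_metric="logloss",
)
# xgb_ml.fit(X_train, y_train, eval_set=[(X_val, y_val)])
```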
Why It Works:
XGBoost builds trees sequentially, where each new tree corrects mistakes from previous trees. It's like having 300 experts where each one focuses on fixing what the previous experts got wrong. This iterative error correction makes it deadly accurate for specific scenarios.
XGBoost particularly excels at run line predictions because it captures the nuance of "close game" vs "blowout" dynamics. It learned that certain pitcher-offense matchups tend toward one-run games, while others frequently produce blowouts.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 57.4%
- AUC-ROC Score: 0.614
- Log Loss: 0.652

Betting Performance:
- Win Rate (RL): 57.1%
- ROI: 14.8%
- Units Profit: +7.2U
- Sample Size: 96 picks
Model 4: Logistic Regression (Baseline & Sanity Check)
Architecture (sketched in code below):
- Standard logistic regression with L2 regularization (ridge)
- Regularization strength: C=1.0
- Solver: liblinear (good for small-to-medium datasets)
- Feature selection: keep the top 50 features by absolute correlation with the outcome
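A compact sketch of this baseline, assuming a feature matrix `X` and binary outcome vector `y`; the correlation screen is written as a plain NumPy loop for clarity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_baseline(X: np.ndarray, y: np.ndarray, k: int = 50):
    """L2-regularized logistic regression on the k most correlated features."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    top_k = np.argsort(corr)[-k:]      # indices of the k strongest features
    model = LogisticRegression(penalty="l2", C=1.0, solver="liblinear")
    model.fit(X[:, top_k], y)
    return model, top_k  # model.coef_ holds the interpretable coefficients
```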
Why We Use It:
Logistic regression is intentionally simple. It can't capture complex interactions, but that's the point. If our fancy neural network predicts something wildly different from logistic regression, we investigate why. Often, the neural net is overfitting or picking up spurious correlations.
Plus, logistic regression gives interpretable coefficients. We can see exactly which features drive predictions: "A 1 mph increase in fastball velocity = 2.3% higher win probability." That helps us understand what the models are learning.
Performance Metrics (2025):

Classification Accuracy:
- Overall Accuracy: 55.2%
- Coefficient Stability: High

Betting Performance:
- Win Rate: 55.2%
- ROI: 9.4%
- Units Profit: +4.1U
- Sample Size: 87 picks
Ensemble Method: Combining Predictions
Here's where it gets interesting. We don't just pick the "best" model—we combine all four using weighted averaging. Here's our approach:
Dynamic Weighting Strategy
Model weights aren't static. They adjust based on recent performance and bet type:
| Model | Moneyline Weight | Run Line Weight | Totals Weight |
|---|---|---|---|
| Random Forest | 40% | 30% | 25% |
| Neural Network | 25% | 20% | 45% |
| XGBoost | 25% | 40% | 20% |
| Logistic Regression | 10% | 10% | 10% |
These weights were optimized using historical out-of-sample data from 2020-2024. We tested 10,000+ weight combinations to find the optimal balance.
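Mechanically, the blend is just a weighted average of the four models' probabilities, keyed by bet type. A sketch using the weights from the table (the model keys and example probabilities are made up):

```python
# Column weights from the table above; each bet type's weights sum to 1.0
WEIGHTS = {
    "moneyline": {"rf": 0.40, "nn": 0.25, "xgb": 0.25, "logit": 0.10},
    "run_line":  {"rf": 0.30, "nn": 0.20, "xgb": 0.40, "logit": 0.10},
    "totals":    {"rf": 0.25, "nn": 0.45, "xgb": 0.20, "logit": 0.10},
}

def ensemble_probability(preds: dict[str, float], market: str) -> float:
    """preds: each model's home-win (or over) probability, keyed by model."""
    weights = WEIGHTS[market]
    return sum(weights[m] * p for m, p in preds.items())

# Example: all four models lean home team in a moneyline spot
p = ensemble_probability(
    {"rf": 0.61, "nn": 0.57, "xgb": 0.60, "logit": 0.55}, "moneyline")  # ≈ 0.59
```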
Confidence Score Calculation
Our confidence score (0-100) considers:
- Model Agreement (40% weight): When all four models agree, confidence spikes. Disagreement lowers it.
- Prediction Strength (30% weight): How confident is each model? A 60% probability is weaker than 80%.
- Historical Performance (20% weight): How has this exact bet type performed recently?
- Data Quality (10% weight): Do we have complete data? Missing features hurt confidence.
```python
# Weighted blend of the four confidence components (each scored 0-100)
confidence = (model_agreement * 0.40 +
              prediction_strength * 0.30 +
              recent_performance * 0.20 +
              data_quality * 0.10)

# Tiered bet sizing: stake scales with confidence; below 55 we pass
if confidence >= 85:
    bet_size = 3.0   # units
elif confidence >= 70:
    bet_size = 1.0
elif confidence >= 55:
    bet_size = 0.5
else:
    bet_size = 0.0   # no bet
```
Ensemble Advantage: In 2025, the ensemble outperformed any individual model by 3.7% in win rate and 6.2% in ROI. Diversification works in modeling just like it does in investing. Different models catch different edges.
Validation & Testing Protocols
Anyone can build a model that performs great on training data. The trick is building one that performs well on new data. Here's how we validate:
1. Train-Test Split (Temporal)
We split data chronologically, never randomly. Training on 2010-2023, testing on 2024, then deploying for 2025. This simulates real-world usage where you're predicting the future, not randomly sampling the past.
Common Mistake: Many "backtests" use random splits, which leaks future information into training. A model might see Game 2 of a series in training and Game 1 in testing—that's cheating. Always split temporally for time-series data.
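In code, the split is nothing fancier than filtering on date. This sketch assumes a `games` DataFrame with a datetime `game_date` column:

```python
# Everything through 2023 trains; 2024 is the untouched test season
train = games.loc[games["game_date"].dt.year <= 2023]
test = games.loc[games["game_date"].dt.year == 2024]
# Never shuffle: a random split would leak future games into training
```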
2. Walk-Forward Validation
We retrain models monthly using only data available at that point in time. For example, our June 2025 model was trained on games through May 2025. This ensures we're not using future data to predict the past.
3. Cross-Validation (K-Fold, Temporal)
We run 5-fold temporal cross-validation: split the data into five sequential chunks, then train on the earlier chunks and test on the next one, expanding the training window with each fold. This gives confidence intervals around our performance metrics.
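scikit-learn's `TimeSeriesSplit` implements exactly this expanding-window scheme; a sketch assuming chronologically sorted arrays `X` and `y` and a scikit-learn `model` refit each fold:

```python
from sklearn.model_selection import TimeSeriesSplit

# Each fold trains on all earlier chunks and tests on the next one
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    model.fit(X[train_idx], y[train_idx])   # refit from scratch each fold
    acc = model.score(X[test_idx], y[test_idx])
    print(f"fold {fold}: accuracy = {acc:.3f}")
```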
4. Out-of-Sample Testing (2024 Holdout)
The entire 2024 season was held out during model development. We used it once for final validation before deploying for 2025. Results: 56.8% accuracy, confirming the models generalize.
5. Real-Money Tracking (2025 Season)
The ultimate test. We've tracked every pick since Opening Day 2025 with real betting lines (not closing lines, but lines available when we post picks). Current record: 198-144-3 (57.9% excluding pushes), +24.8 units profit.
What We're Working On Next
Model development never stops. Here's what's in our pipeline:
Planned Improvements (Winter 2025-2026)
- Transformer Architecture: Experimenting with attention mechanisms for sequential data (pitch-by-pitch modeling)
- Real-Time Updates: Models that adjust during games based on pitch velocity, score, and in-game events
- Injury Impact Modeling: Quantifying lineup changes when key players are injured or rested
- Umpire-Specific Models: Different umpires have different strike zones; modeling this could add 1-2% edge
- Betting Market Integration: Using line movement as a feature (not just validation) to identify sharp money
- First 5 Innings Specialization: Building dedicated F5 models, since bullpen quality barely factors into the first five innings
Machine learning for sports betting is an arms race. As markets get sharper, we need better models. The moment we stop improving is the moment our edge disappears.
Ready to See the Models in Action?
Daily picks with full transparency—see which models agree, confidence scores, and reasoning