The Problem with Win-Loss Records
A model goes 57-43 over 100 picks. Is it good? The honest answer: you cannot tell from that number alone. A model that only picks heavy favorites and goes 57-43 may be performing below expectation. A model that consistently picks underdogs and goes 57-43 may be performing brilliantly. The win-loss record strips away the context that determines whether the model demonstrated genuine predictive skill or simply benefited from variance.
Win-loss records also have huge confidence intervals in small samples. Over 100 picks, a model with true 55% accuracy could easily post records ranging from 48-52 to 62-38 due to random variation. The standard error on a 100-trial binomial proportion is about 5 percentage points. This means a model needs hundreds or thousands of tracked predictions before its win-loss record reliably reflects its true skill level.
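To make that variance concrete, here is a minimal sketch in plain Python (the function name and the 55%/100-pick inputs are just the example from above) of the binomial standard error and a two-standard-error range:

```python
import math

def record_spread(p_true: float, n: int, z: float = 2.0) -> tuple[float, float, float]:
    """Standard error of a win rate observed over n picks, plus a +/- z*SE range."""
    se = math.sqrt(p_true * (1 - p_true) / n)   # binomial standard error of the proportion
    return se, p_true - z * se, p_true + z * se

se, lo, hi = record_spread(0.55, 100)
print(f"SE ~ {se:.3f}; two standard errors span {lo:.2f} to {hi:.2f}")
# SE ~ 0.050; a true 55% model can plausibly post anything from roughly 45-55 to 65-35
```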
ROI: Return on Investment
ROI measures total profit or loss relative to the total amount risked. Unlike win-loss records, ROI accounts for the odds of each prediction. A model that consistently identifies small underdogs can have a modest win rate but strong ROI because its wins pay more than its losses cost. Conversely, a model that exclusively picks heavy favorites might have an impressive win rate but negative ROI because the few losses cost more than the many wins earn.
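A minimal sketch of that contrast, assuming a simple record format (stake, decimal odds, and a won flag, chosen here purely for illustration):

```python
def roi(picks: list[dict]) -> float:
    """Net profit divided by the total amount risked."""
    staked = sum(p["stake"] for p in picks)
    # A win returns stake * (odds - 1) in profit; a loss costs the full stake.
    profit = sum(p["stake"] * (p["odds"] - 1) if p["won"] else -p["stake"] for p in picks)
    return profit / staked

underdogs = [  # 2-2 record on +120 underdogs (decimal odds 2.20)
    {"stake": 100, "odds": 2.20, "won": True},
    {"stake": 100, "odds": 2.20, "won": True},
    {"stake": 100, "odds": 2.20, "won": False},
    {"stake": 100, "odds": 2.20, "won": False},
]
favorites = [  # 3-1 record on -333 favorites (decimal odds 1.30)
    {"stake": 100, "odds": 1.30, "won": True},
    {"stake": 100, "odds": 1.30, "won": True},
    {"stake": 100, "odds": 1.30, "won": True},
    {"stake": 100, "odds": 1.30, "won": False},
]
print(f"Underdog model, 2-2: {roi(underdogs):+.1%}")   # +10.0%
print(f"Favorite model, 3-1: {roi(favorites):+.1%}")   # -2.5%
```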
ROI is the metric that matters most for practical evaluation because it directly measures whether the model's predictions generate value. However, ROI is also subject to significant variance in small samples. A single large underdog win can swing ROI dramatically in either direction. Evaluating ROI reliably requires hundreds of tracked predictions across a range of pick types and confidence levels.
Brier Score: The Gold Standard
The Brier score is the most commonly used metric for evaluating probabilistic predictions. It measures the mean squared error between predicted probabilities and actual outcomes. If the model predicted 70% for an event that occurred, the Brier score contribution for that prediction is (0.70 - 1.0)^2 = 0.09. If the event did not occur, the contribution is (0.70 - 0.0)^2 = 0.49.
Brier scores range from 0 (perfect predictions) to 1 (worst possible predictions). A model that assigns 50% to everything achieves a Brier score of 0.25, so any score below 0.25 indicates the model carries more information than a coin flip, though not necessarily more than simple baselines such as the home-field base rate. The lower the score, the better the predictions. In practice, MLB prediction models achieve Brier scores between 0.23 and 0.24 for game outcomes, reflecting the inherent difficulty of predicting individual baseball games.
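A minimal sketch of the computation, reproducing the single-prediction contributions above and the 0.25 coin-flip baseline:

```python
def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

print(f"{brier_score([0.70], [1]):.2f}")              # 0.09, 70% prediction, event occurred
print(f"{brier_score([0.70], [0]):.2f}")              # 0.49, 70% prediction, event did not occur
print(f"{brier_score([0.5] * 4, [1, 0, 1, 1]):.2f}")  # 0.25, the 50%-for-everything baseline
```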
The Brier score can be decomposed into three components: calibration (do predicted probabilities match observed frequencies?), resolution (how much do the predictions vary from the base rate?), and uncertainty (how predictable is the outcome inherently?). This decomposition is diagnostic: it tells you not just whether the model is good, but why it is good or bad. A model with good resolution but poor calibration makes bold predictions that are systematically biased, a fixable problem. A model with good calibration but poor resolution is honest but uninformative, a harder problem.
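A sketch of that decomposition using forecast bins. The bin count and equal-width binning are choices made here for illustration, and the identity Brier = calibration - resolution + uncertainty holds exactly only when all forecasts within a bin are identical; with real-valued forecasts it is an approximation.

```python
from collections import defaultdict

def brier_decomposition(probs, outcomes, n_bins=10):
    """Murphy decomposition: Brier ~= calibration - resolution + uncertainty.

    'Calibration' here is the reliability term (lower is better);
    resolution is higher for bolder, more informative forecasts;
    uncertainty depends only on the base rate of the outcome.
    """
    n = len(probs)
    base_rate = sum(outcomes) / n
    bins = defaultdict(list)
    for p, o in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, o))

    calibration = resolution = 0.0
    for members in bins.values():
        k = len(members)
        avg_forecast = sum(p for p, _ in members) / k   # mean prediction in the bin
        obs_freq = sum(o for _, o in members) / k       # observed frequency in the bin
        calibration += k * (avg_forecast - obs_freq) ** 2
        resolution += k * (obs_freq - base_rate) ** 2
    uncertainty = base_rate * (1 - base_rate)
    return calibration / n, resolution / n, uncertainty
```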
Log Loss: Punishing Confident Mistakes
Log loss is another proper scoring rule that penalizes confident wrong predictions more heavily than cautious wrong predictions. If a model says 95% and the event does not occur, the log loss penalty is enormous. If it says 55% and the event does not occur, the penalty is small. This asymmetry incentivizes models to be well-calibrated: confident when justified and cautious when uncertain.
Formally, log loss is the average negative log-likelihood of the observed outcomes under the predicted probabilities. It is also the standard training objective for classification models, so most models are literally optimized to minimize it during training. Evaluating on the same metric aligns training and evaluation objectives, which is theoretically desirable.
The practical difference between Brier score and log loss is the severity of the penalty for overconfidence. Log loss penalizes extreme predictions that turn out wrong much more harshly than the Brier score does. For MLB prediction, where the true probabilities rarely exceed 70-75% and individual game outcomes are inherently noisy, log loss punishes the kind of overconfident predictions that are most common among poorly constructed models.
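A quick worked comparison of the two penalties for the wrong-side predictions mentioned above, using the standard per-prediction definitions in plain Python:

```python
import math

def brier_term(p: float, outcome: int) -> float:
    return (p - outcome) ** 2

def log_loss_term(p: float, outcome: int) -> float:
    """Negative log-likelihood of a single 0/1 outcome under probability p."""
    return -math.log(p if outcome == 1 else 1 - p)

# The event does not occur: compare a 95% call to a 55% call.
for p in (0.95, 0.55):
    print(f"p={p:.2f}: Brier {brier_term(p, 0):.2f}, log loss {log_loss_term(p, 0):.2f}")
# p=0.95: Brier 0.90, log loss 3.00
# p=0.55: Brier 0.30, log loss 0.80
# The log loss penalty grows without bound as p approaches 1; the Brier penalty is capped at 1.
```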
Calibration Analysis
Beyond aggregate metrics, evaluating a model's calibration curve provides granular insight into where the model performs well and where it struggles. Some models are well-calibrated near 50% (close games) but poorly calibrated at the extremes (lopsided matchups). Others are well-calibrated overall but show systematic bias in specific contexts (home versus road games, day versus night games, divisional versus interleague matchups).
Conditional calibration analysis, examining calibration within subgroups, is particularly revealing. A model might have excellent aggregate Brier scores but be systematically overconfident on home favorites and underconfident on road underdogs. This pattern would be invisible in aggregate metrics but clearly actionable: the model's predictions need context-specific adjustments to be fully reliable.
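A minimal sketch of a conditional calibration table, assuming a simple record layout ('prob', 'won', plus whatever context field you want to condition on; none of these names come from a particular library):

```python
from collections import defaultdict

def calibration_table(records, n_bins=10, group_key=None):
    """Average forecast vs. observed win frequency per probability bin,
    optionally split by a context field such as 'is_home'."""
    groups = defaultdict(list)
    for r in records:
        groups[r.get(group_key, "all") if group_key else "all"].append(r)

    tables = {}
    for group, rows in groups.items():
        bins = defaultdict(list)
        for r in rows:
            bins[min(int(r["prob"] * n_bins), n_bins - 1)].append(r)
        tables[group] = {
            b: {
                "n": len(rs),
                "avg_forecast": sum(r["prob"] for r in rs) / len(rs),
                "observed_freq": sum(r["won"] for r in rs) / len(rs),
            }
            for b, rs in sorted(bins.items())
        }
    return tables

# calibration_table(records)                      -> aggregate calibration curve
# calibration_table(records, group_key="is_home") -> separate curves for home and road picks
```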
Evaluation Methodology: Preventing Self-Deception
The most common evaluation mistake is testing the model on data it has already seen during training. This produces unrealistically good performance numbers that evaporate when the model encounters new data. Proper evaluation requires strict temporal separation: the model is trained on data from one period and evaluated on data from a subsequent, non-overlapping period.
Walk-forward analysis is the gold standard for time-series prediction evaluation. The model is trained on all data up to date T, makes predictions for date T+1, then retrains on all data up to T+1, makes predictions for T+2, and so on. This exactly replicates the real-world prediction scenario: at every point, the model only has access to data that was historically available.
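A sketch of that loop, with an assumed game-record format ('date', feature fields, outcome) and an assumed train_model(history) callable that returns an object with a predict_proba(game) method; neither interface comes from the text.

```python
from datetime import date, timedelta

def walk_forward(games, train_model, start: date, end: date):
    """At each date, train only on games that have already finished,
    then predict that date's slate without ever looking past the prediction date."""
    predictions = []
    day = start
    while day <= end:
        history = [g for g in games if g["date"] < day]   # strictly before the prediction date
        slate = [g for g in games if g["date"] == day]
        if history and slate:
            model = train_model(history)                  # retrain on all data available so far
            predictions.extend((g, model.predict_proba(g)) for g in slate)
        day += timedelta(days=1)
    return predictions
```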
Sample size requirements are frequently underestimated. Due to the high variance of individual game outcomes, even a genuinely skilled model needs several hundred tracked predictions before its performance metrics stabilize enough to draw confident conclusions. Evaluating a model on thirty or fifty predictions is essentially rolling dice, and any conclusions drawn from such small samples should be held with minimal confidence.
Finally, evaluation should always compare against a meaningful baseline. Beating a coin flip (50%) is trivial; simply picking the home team already gets you to roughly 53-54%. The relevant question is whether the model outperforms simple, readily available alternatives. Common baselines include the home team always wins, the higher-ranked team always wins, and market-implied probabilities derived from consensus lines. Only a model that consistently outperforms these baselines over large samples has demonstrated genuine predictive skill.
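As a closing sketch, one way to run that comparison, assuming each tracked record carries the model's probability, a vig-removed market probability, and the outcome (the field names and the 54% home-field prior are illustrative):

```python
def baseline_report(records):
    """Brier score of the model vs. simple baselines on the same set of games."""
    def brier(prob_for):
        return sum((prob_for(r) - r["home_won"]) ** 2 for r in records) / len(records)

    return {
        "model":                  brier(lambda r: r["model_prob"]),
        "coin flip (50%)":        brier(lambda r: 0.5),
        "home-field prior (54%)": brier(lambda r: 0.54),
        "market consensus":       brier(lambda r: r["market_prob"]),
    }
```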