Probability Is Not Certainty
When an AI model assigns a 65% win probability to a team, it is making a precise statement: "In situations like this one, across many similar games, this team wins approximately 65% of the time." It is explicitly not saying "this team will win." It is not even saying "this team will probably win." It is quantifying the degree of uncertainty inherent in the prediction.
This distinction sounds academic but has massive practical implications. A 65% probability means the other team wins 35% of the time. That is not a rare event. It happens more than one in three times. If you evaluate a model by whether its top prediction wins every game, you are misunderstanding what the model claims to do. The correct evaluation is whether events assigned 65% probability actually occur 65% of the time, consistently, across hundreds of predictions.
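To make that evaluation concrete, here is a minimal sketch in Python. The arrays `probs` and `outcomes` are hypothetical placeholders for a model's predicted probabilities and the observed results; the idea is simply to gather every prediction near 65% and check how often the predicted outcome actually occurred.

```python
import numpy as np

# Hypothetical data: predicted win probabilities and actual outcomes (1 = win).
probs = np.array([0.66, 0.64, 0.65, 0.63, 0.67, 0.65, 0.64, 0.66])
outcomes = np.array([1, 0, 1, 1, 0, 1, 1, 0])

# Select predictions close to 65% and compare the claimed rate to reality.
mask = np.abs(probs - 0.65) < 0.03
print(f"Predictions near 65%: {mask.sum()}")
print(f"Observed win rate in that group: {outcomes[mask].mean():.2%}")
```

With only a handful of games the observed rate will bounce around; the comparison only becomes meaningful across hundreds of predictions, which is exactly the point.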
Calibration: The True Measure of Probability Quality
Calibration measures the alignment between stated probabilities and observed outcomes. A perfectly calibrated model has this property: among all predictions where it said "70% probability," the predicted outcome occurred exactly 70% of the time. Among all predictions where it said "55%," the outcome occurred 55% of the time. And so on across the entire probability spectrum.
Calibration is visualized using reliability diagrams. The x-axis shows the model's predicted probability (binned into groups like 50-55%, 55-60%, etc.), and the y-axis shows the actual observed frequency of the predicted outcome within each bin. A perfectly calibrated model produces points along the diagonal line. Points above the diagonal indicate underconfidence (the model says 60% but the outcome actually happens 70% of the time). Points below the diagonal indicate overconfidence (the model says 70% but the outcome only happens 60% of the time).
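The underlying reliability table takes only a few lines to compute. The sketch below uses NumPy with simulated data; the bin count, the uniform probabilities, and the simulated outcomes are illustrative assumptions, not output from any real model.

```python
import numpy as np

def reliability_table(probs, outcomes, n_bins=10):
    """Bin predicted probabilities and compare each bin's mean prediction
    to the observed frequency of the outcome within that bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if in_bin.any():
            rows.append((probs[in_bin].mean(),      # x: mean predicted probability
                         outcomes[in_bin].mean(),   # y: observed frequency
                         in_bin.sum()))             # bin size
    return rows

# Simulated example: outcomes drawn so the probabilities are calibrated by construction.
rng = np.random.default_rng(0)
probs = rng.uniform(0.3, 0.8, size=1000)
outcomes = (rng.uniform(size=1000) < probs).astype(int)

for pred, obs, n in reliability_table(probs, outcomes):
    print(f"predicted {pred:.2f} -> observed {obs:.2f}  (n={n})")
```

Plotting the (predicted, observed) pairs from this table against the diagonal gives the reliability diagram described above.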
Most raw model outputs are not well-calibrated. They need post-hoc calibration: a transformation applied after the model makes its predictions that adjusts the output probabilities to better match observed frequencies. Platt scaling and isotonic regression are the two most common calibration methods. Platt scaling fits a logistic curve to the model's raw outputs, while isotonic regression fits a non-parametric monotone function. Both improve calibration, often dramatically.
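The sketch below applies both methods using scikit-learn. The raw scores and outcomes are simulated (with deliberate miscalibration baked in), and in practice the calibration set should be held out from the data used to train the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Hypothetical held-out data: raw (uncalibrated) model probabilities and outcomes.
rng = np.random.default_rng(1)
raw = rng.uniform(0.05, 0.95, size=2000)
outcomes = (rng.uniform(size=2000) < raw ** 1.5).astype(int)  # simulated overconfidence

# Platt scaling: fit a logistic curve to the raw scores.
platt = LogisticRegression().fit(raw.reshape(-1, 1), outcomes)
platt_probs = platt.predict_proba(raw.reshape(-1, 1))[:, 1]

# Isotonic regression: fit a non-parametric, monotone mapping from raw to calibrated.
iso = IsotonicRegression(out_of_bounds="clip").fit(raw, outcomes)
iso_probs = iso.predict(raw)
```

Feeding `platt_probs` or `iso_probs` back through the reliability table from the previous sketch shows how much either transformation tightens the points toward the diagonal.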
Confidence vs. Accuracy: They Are Not the Same Thing
A model can be confident and wrong. A model can be uncertain and right. These are independent dimensions. Confidence describes how far the model's probabilities depart from an even split, that is, how peaked or spread out its probability distribution is. Accuracy describes whether the predicted outcome occurred.
A model that assigns every game a 50/50 probability is perfectly calibrated (assuming each team actually wins about half the time in its sample), but completely uninformative. It has no confidence and adds no value. A model that assigns every game a 99% probability to the home team is extremely confident but terribly calibrated, because home teams do not win 99% of the time.
The useful model is both calibrated and sharp: it assigns high probabilities to events that actually occur at high rates, and low probabilities to events that occur at low rates. Sharpness measures how far the model's predictions deviate from the base rate (roughly 50% for game outcomes). A sharp and calibrated model provides genuinely useful information. A calibrated but not sharp model is truthful but uninteresting. A sharp but uncalibrated model is boldly wrong.
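One way to make both dimensions measurable is sketched below: a simple expected-calibration-error summary for calibration, and mean deviation from the base rate for sharpness. These particular formulas and the 50% base rate are illustrative choices, not definitions from the text.

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between predicted probability and observed
    frequency across bins (a common one-number summary of calibration)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & (probs < hi)
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - outcomes[in_bin].mean())
            ece += gap * in_bin.mean()   # weight each bin by its share of predictions
    return ece

def sharpness(probs, base_rate=0.5):
    """Mean absolute deviation of the predictions from the base rate."""
    return np.abs(probs - base_rate).mean()

# The "always 50/50" model: calibrated in a balanced sample, but zero sharpness.
flat = np.full(1000, 0.5)
print(sharpness(flat))   # 0.0 -- no information beyond the base rate
```

A useful model scores low on the first metric and meaningfully above zero on the second.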
Uncertainty Quantification: Knowing What You Don't Know
Beyond the point probability (e.g., 63%), sophisticated models also estimate their own uncertainty. This meta-prediction, a prediction about the prediction, answers the question: "How confident is the model in this specific probability estimate?"
A model might output 63% with tight uncertainty bounds (the true probability is very likely between 60% and 66%) or 63% with wide uncertainty bounds (the true probability could reasonably be anywhere from 52% to 74%). These convey very different amounts of information. The first suggests the model has strong, reliable signal. The second suggests the model is guessing within a wide range.
Uncertainty quantification is implemented through techniques like Monte Carlo dropout (running the prediction multiple times with random neuron deactivation and observing the spread), bootstrap aggregation (observing variance across ensemble members), and Bayesian neural networks (maintaining probability distributions over model parameters rather than point estimates). Each approach measures a slightly different type of uncertainty, and combining them provides the most complete picture.
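As a sketch of the Monte Carlo dropout idea, the PyTorch snippet below keeps dropout active at inference time and reads the spread of repeated forward passes as an uncertainty estimate. The architecture, feature size, and dropout rate are arbitrary placeholders, and the untrained weights are only there to make the example runnable.

```python
import torch
import torch.nn as nn

# Small network with dropout; kept in train mode so dropout stays active
# at inference and repeated forward passes produce a spread of probabilities.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1), nn.Sigmoid(),
)

x = torch.randn(1, 20)   # one game's feature vector (hypothetical)
model.train()            # keep dropout active (Monte Carlo dropout)

with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(100)]).squeeze()

print(f"mean probability: {samples.mean().item():.3f}")
print(f"spread (std):     {samples.std().item():.3f}")
```

A bootstrap ensemble works the same way at this stage: instead of 100 dropout passes, you collect one prediction from each ensemble member and examine the spread.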
Epistemic vs. Aleatoric Uncertainty
There are two fundamentally different sources of uncertainty in any prediction. Epistemic uncertainty comes from the model's limited knowledge. If a pitcher is making their MLB debut, the model has little data on this specific pitcher and its predictions carry high epistemic uncertainty. This type of uncertainty can, in principle, be reduced by gathering more data.
Aleatoric uncertainty comes from inherent randomness in the outcome. Even with perfect knowledge of both teams' abilities, the specific outcome of any single baseball game is genuinely random to a significant degree. A perfectly hit ball can be caught or can find a gap based on defensive positioning that varies from pitch to pitch. This type of uncertainty cannot be reduced by any amount of additional data or model sophistication.
Understanding which type of uncertainty dominates a given prediction is practically useful. High epistemic uncertainty (unfamiliar matchup, limited data) means the model's probability estimate is unreliable and should be treated with skepticism. High aleatoric uncertainty (two genuinely evenly matched teams) means the outcome is inherently unpredictable regardless of model quality. Both situations warrant caution, but for different reasons.
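One common way to separate the two, assuming you have an ensemble of predicted probabilities for the same game, is an entropy decomposition: the average entropy of the individual members approximates the aleatoric part, and the extra entropy created by their disagreement approximates the epistemic part. The ensemble values below are made up to illustrate the two regimes.

```python
import numpy as np

def binary_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def decompose_uncertainty(member_probs):
    """Split predictive uncertainty for a binary outcome across an ensemble:
    total      = entropy of the averaged prediction,
    aleatoric ~= mean entropy of the individual members,
    epistemic ~= total - aleatoric (driven by member disagreement)."""
    member_probs = np.asarray(member_probs)
    total = binary_entropy(member_probs.mean())
    aleatoric = binary_entropy(member_probs).mean()
    return total, aleatoric, total - aleatoric

# Members agree on a near coin flip: uncertainty is almost entirely aleatoric.
print(decompose_uncertainty([0.52, 0.49, 0.51, 0.50]))
# Members disagree sharply: the epistemic term dominates.
print(decompose_uncertainty([0.05, 0.95, 0.10, 0.90]))
```

Both example ensembles produce a near 50/50 averaged probability, yet the decomposition tells very different stories about why the prediction is uncertain.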
Practical Implications
The correct way to use AI probabilities is as inputs into a decision framework, not as standalone predictions. A 63% probability is useful when compared against an implied probability from other sources. If two independent analyses estimate 63% and 55% for the same event, the disagreement is informative. If they agree at 63%, the convergence strengthens confidence in that estimate.
Probabilities also enable portfolio thinking. Rather than evaluating each prediction in isolation, a calibrated probability model allows you to assess the expected outcome across a set of predictions. Some individual predictions will be wrong, by design, but if the probabilities are calibrated, the aggregate outcomes will match expectations. This shift from "was this prediction right?" to "are these probabilities reliable in aggregate?" is the conceptual leap that separates sophisticated analysis from naive outcome-counting.
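A small sketch of that aggregate view: sum the probabilities to get the expected number of wins for a slate, attach a rough spread under the assumption that the probabilities are calibrated and the games independent, and compare against the actual count. The slate below is hypothetical.

```python
import numpy as np

# Hypothetical slate of calibrated win probabilities and the actual results.
probs = np.array([0.63, 0.71, 0.55, 0.48, 0.66, 0.59, 0.52, 0.74])
actual = np.array([1, 1, 0, 1, 1, 0, 0, 1])

expected_wins = probs.sum()                    # what the model implies in aggregate
spread = np.sqrt((probs * (1 - probs)).sum())  # rough +/- range if probs are calibrated

print(f"expected wins: {expected_wins:.1f} +/- {spread:.1f}")
print(f"actual wins:   {actual.sum()}")
```

If the actual count routinely lands far outside that range across many slates, the probabilities are not reliable in aggregate, regardless of how many individual picks happened to win.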