WHAT CALIBRATION ACTUALLY MEASURES
In our pre-tournament testing, confidence scores cluster at round numbers (60%, 65%, 70%, 75%, 80%) regardless of matchup. This suggests models are treating confidence as a stylistic output rather than a genuine probability estimate.
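As a minimal sketch of how that clustering can be checked (the records and field names below are hypothetical, not our pipeline's actual format):

```python
from collections import Counter

# Hypothetical pre-tournament predictions; the records and field names are
# illustrative only.
predictions = [
    {"matchup": "1 vs 16", "stated_confidence": 95},
    {"matchup": "5 vs 12", "stated_confidence": 70},
    {"matchup": "6 vs 11", "stated_confidence": 75},
    {"matchup": "8 vs 9",  "stated_confidence": 60},
    {"matchup": "7 vs 10", "stated_confidence": 65},
]

# Tally how often each stated value appears. Heavy clustering on multiples
# of five is the signature of a stylistic number, not a probability estimate.
counts = Counter(p["stated_confidence"] for p in predictions)
on_round_numbers = sum(n for value, n in counts.items() if value % 5 == 0)
print(f"{on_round_numbers}/{len(predictions)} predictions land on a multiple of 5")
```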
Our calibration chart plots stated confidence (x-axis) against actual win rate (y-axis). A perfectly calibrated model traces the diagonal. Most AI models show overconfidence: stating 80% on predictions they get right only 65% of the time.
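A minimal sketch of how such a curve can be computed, assuming each prediction is reduced to a stated confidence percentage and whether the pick won (the names and bin width are illustrative):

```python
from collections import defaultdict

def calibration_curve(predictions, bin_width=5):
    """Bin predictions by stated confidence (in percent) and compare each
    bin's average stated confidence with its actual win rate."""
    bins = defaultdict(list)
    for stated_pct, won in predictions:                 # e.g. (80, True)
        bins[stated_pct // bin_width * bin_width].append((stated_pct, won))

    curve = []
    for bin_start in sorted(bins):
        group = bins[bin_start]
        avg_stated = sum(p for p, _ in group) / len(group)
        win_rate = 100 * sum(w for _, w in group) / len(group)
        curve.append((avg_stated, win_rate))            # equal values sit on the diagonal
    return curve

# Toy data: picks stated at 80% that win only 65% of the time.
sample = [(80, True)] * 13 + [(80, False)] * 7
print(calibration_curve(sample))                        # [(80.0, 65.0)]
```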
WHERE MISCALIBRATION SHOWS UP
The heavy-favorite first-round games show one side of the problem. When a model picks a 1-seed over a 16-seed, the outcome is nearly certain (1-seeds have historically won roughly 99% of these games), yet the stated confidence hedges down to 90–95%, a round number rather than an estimate of the actual upset risk.
The overconfidence shows up in the mid-seed matchups (5 vs. 12, 6 vs. 11), where models frequently state 70–75% confidence on picks that historically run much closer to coin flips.
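One rough way to flag these spots is to compare a model's stated confidence against the historical base rate for the seed matchup. The base rates below are approximate placeholders; swap in exact figures from whichever historical dataset you trust:

```python
# Approximate historical win rates for the favored seed (illustrative values).
BASE_RATES = {
    (1, 16): 0.99,
    (5, 12): 0.65,
    (6, 11): 0.62,
}

def confidence_gap(favored_seed, underdog_seed, stated_confidence):
    """Positive gap: stated confidence exceeds the historical base rate
    (overconfident). Negative gap: the model is hedging below it."""
    return stated_confidence - BASE_RATES[(favored_seed, underdog_seed)]

print(confidence_gap(5, 12, 0.75))   # roughly +0.10: overconfident on a near toss-up
print(confidence_gap(1, 16, 0.92))   # roughly -0.07: hedging on a near-lock
```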
HOW TO USE THE CALIBRATION DATA
Don't use raw AI confidence scores to weight bracket bets. Watch the calibration curves as the tournament progresses, and favor the model whose confidence-accuracy curve stays closest to the diagonal in the early rounds. That model's stated confidence is most trustworthy for later rounds.
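One simple way to score "closest to the diagonal" is the mean absolute gap between stated confidence and actual win rate, weighted by how many picks land in each confidence bin (an expected-calibration-error style number). A sketch under the same assumed (stated %, won) record format as above:

```python
from collections import defaultdict

def calibration_error(predictions, bin_width=5):
    """Weighted mean absolute gap between stated confidence and actual win
    rate per bin. Lower is better; 0 means the curve sits on the diagonal."""
    bins = defaultdict(list)
    for stated_pct, won in predictions:
        bins[stated_pct // bin_width * bin_width].append((stated_pct, won))

    total = sum(len(group) for group in bins.values())
    error = 0.0
    for group in bins.values():
        avg_stated = sum(p for p, _ in group) / len(group)
        win_rate = 100 * sum(w for _, w in group) / len(group)
        error += abs(avg_stated - win_rate) * len(group) / total
    return error

# Rank models on early-round results; trust the lowest-error model's stated
# confidence most when weighting later-round picks.
early_rounds = {
    "model_a": [(80, True)] * 13 + [(80, False)] * 7,   # stated 80%, won 65%
    "model_b": [(70, True)] * 14 + [(70, False)] * 6,   # stated 70%, won 70%
}
ranking = sorted(early_rounds, key=lambda m: calibration_error(early_rounds[m]))
print(ranking)   # ['model_b', 'model_a']
```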
See this tracked in real time as the tournament plays out.