AI MARCH MADNESS 2026

ABOUT THIS PROJECT

WHAT IS AI MARCH MADNESS?

AI March Madness 2026 is a tournament tracker that measures how well different AI models predict NCAA basketball games - not just whether they pick winners, but how they reason, which sources they cite, how their confidence shifts, and what that reveals about their reliability.

We run the same queries across three AI models (GPT-4o, Gemini 2.5 Pro, and Perplexity Sonar Pro) at three time points before each game: 24 hours out, 6 hours out, and 1 hour before tip-off.

We then track the results, score each model like a bracket pool, and analyze the source patterns, confidence language, and prediction stability that separate accurate models from overconfident ones.

METHODOLOGY

QUERY PROTOCOL

Each model receives the same structured prompt: "Who will win [Team A] vs [Team B] in the [round] of the 2026 NCAA Tournament, and why?" Prompts are sent at T-24h, T-6h, and T-1h before tip-off.
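
A rough sketch of how the query protocol could look in code (the type and function names below are illustrative, not the project's actual schema):

```typescript
// Illustrative types for the query protocol.
type CollectionWindow = "T-24h" | "T-6h" | "T-1h";

interface GameQuery {
  teamA: string;
  teamB: string;
  round: string; // e.g. "Sweet 16"
}

// The structured prompt, sent identically to every model.
function buildPrompt(q: GameQuery): string {
  return `Who will win ${q.teamA} vs ${q.teamB} in the ${q.round} of the 2026 NCAA Tournament, and why?`;
}

// Hours before tip-off at which each collection window fires.
const WINDOW_OFFSETS: Record<CollectionWindow, number> = { "T-24h": 24, "T-6h": 6, "T-1h": 1 };
```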

SOURCE TRACKING

We record every source cited by each model in its response. Sources are categorized by type (major media, analytics, social, team official) and tracked across rounds to identify citation trends and source fingerprints per model.
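
A minimal sketch of domain-based categorization, assuming a hand-curated lookup table (the example domains here are illustrative):

```typescript
// Source categories used for citation tracking.
type SourceType = "major_media" | "analytics" | "social" | "team_official" | "other";

// Hand-curated domain-to-category table (example entries only).
const DOMAIN_CATEGORIES: Record<string, SourceType> = {
  "espn.com": "major_media",
  "cbssports.com": "major_media",
  "kenpom.com": "analytics",
  "barttorvik.com": "analytics",
  "x.com": "social",
  "reddit.com": "social",
};

// Map a cited URL to its category by hostname.
function categorizeSource(url: string): SourceType {
  const host = new URL(url).hostname.replace(/^www\./, "");
  return DOMAIN_CATEGORIES[host] ?? "other";
}
```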

ACCURACY SCORING

Correct picks score 1 point. Bracket scoring weights later rounds more heavily (Sweet 16 = 2×, Elite 8 = 4×, Final Four = 8×, Championship = 16×). Flip tracking records any T-1h pick that differs from the T-24h pick.
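
In code, the scoring and flip rules could look like this (a sketch assuming the final T-1h pick is the one that gets scored):

```typescript
// Round weights, mirroring standard bracket pool scoring.
const ROUND_WEIGHTS: Record<string, number> = {
  "Round of 64": 1,
  "Round of 32": 1,
  "Sweet 16": 2,
  "Elite 8": 4,
  "Final Four": 8,
  "Championship": 16,
};

interface ScoredPrediction {
  round: string;
  pickT24h: string;    // pick at the first collection window
  pickT1h: string;     // final pick (assumed here to be the one scored)
  actualWinner: string;
}

// Points earned: the round weight if the final pick is correct, else 0.
function bracketPoints(p: ScoredPrediction): number {
  return p.pickT1h === p.actualWinner ? (ROUND_WEIGHTS[p.round] ?? 1) : 0;
}

// A "flip" is any T-1h pick that differs from the T-24h pick.
function isFlip(p: ScoredPrediction): boolean {
  return p.pickT1h !== p.pickT24h;
}
```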

CALIBRATION ANALYSIS

We categorize each model's confidence language into 5 tiers and track actual accuracy at each tier. A well-calibrated model's actual win rate should track closely with its stated confidence level.
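
One way to compute the per-tier comparison, assuming confidence language has already been mapped to the five numeric buckets listed in the glossary:

```typescript
// One prediction outcome: stated confidence bucket plus whether the pick won.
interface Outcome {
  confidence: number; // stated confidence bucket, in percent (55..95)
  correct: boolean;
}

interface BucketStats { stated: number; actual: number; n: number; }

// Group outcomes by confidence bucket and compute the actual win rate
// in each; a well-calibrated model has actual ≈ stated at every bucket.
function calibrationCurve(outcomes: Outcome[]): BucketStats[] {
  const buckets = new Map<number, { wins: number; total: number }>();
  for (const o of outcomes) {
    const b = buckets.get(o.confidence) ?? { wins: 0, total: 0 };
    b.total += 1;
    if (o.correct) b.wins += 1;
    buckets.set(o.confidence, b);
  }
  return [...buckets.entries()].map(([stated, { wins, total }]) => ({
    stated,
    actual: (100 * wins) / total,
    n: total,
  }));
}
```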

BIAS REGISTER

Analyst citations are manually reviewed against known affiliation databases. We flag cases where an analyst has a documented connection (played, coached, family, hometown) to a team they picked. This data is human-verified.
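
A hypothetical shape for a bias-register entry (field names are illustrative; the flagging itself stays manual, as described above):

```typescript
// The documented connection types tracked in the register.
type AffiliationType = "played" | "coached" | "family" | "hometown";

// One human-verified flag linking an analyst to a team they picked.
interface BiasFlag {
  analyst: string;
  team: string;             // the team the flagged analyst picked
  affiliation: AffiliationType;
  evidenceUrl: string;      // link documenting the connection
  verifiedBy: string;       // the human reviewer who confirmed it
}
```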

PROMPT SENSITIVITY

For selected games, we run 5 prompt variations per model to measure how much phrasing changes the predicted outcome. Consistency score = % of variations that produce the same pick as the baseline prompt.
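
The consistency score is a direct percentage; a sketch:

```typescript
// Consistency score: the share of prompt variants that agree with the
// baseline pick, as a percentage.
function consistencyScore(baselinePick: string, variantPicks: string[]): number {
  const matches = variantPicks.filter((pick) => pick === baselinePick).length;
  return (100 * matches) / variantPicks.length;
}

// e.g. consistencyScore("Duke", ["Duke", "Duke", "UNC", "Duke", "Duke"]) → 80
```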

DATA COLLECTION

All predictions are collected automatically via scheduled cron jobs that query each AI model through the OpenRouter API. Sources, citations, and confidence levels are extracted from each response and stored in a Supabase database. Data updates continuously throughout the tournament.
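
A simplified sketch of one collection call, assuming an OpenAI-compatible OpenRouter request and a Supabase table named "predictions" (the table and column names are guesses, not the project's actual schema):

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

// Query one model for one game at one collection window, then store the result.
async function collectPrediction(model: string, prompt: string, window: string, gameId: string): Promise<void> {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();

  // Store the raw response; the pick, confidence language, and citations
  // are extracted downstream.
  await supabase.from("predictions").insert({
    game_id: gameId,
    model,
    collection_window: window,
    raw_response: data.choices[0].message.content,
    collected_at: new Date().toISOString(),
  });
}
```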

FREQUENTLY ASKED QUESTIONS

What is AI March Madness 2026?
AI March Madness 2026 is a live experiment that tracks how three AI models - GPT-4o, Gemini 2.5, and Perplexity Sonar Pro - predict every game of the 2026 NCAA Men's Basketball Tournament. We measure accuracy, source transparency, confidence calibration, and prediction drift across all 67 tournament games.
Which AI models are tracked?
We track three search-enabled AI models: OpenAI's GPT-4o (via gpt-4o-search-preview), Google's Gemini 2.5 Pro (with Google grounding), and Perplexity's Sonar Pro. All three have real-time web search capabilities, which lets us analyze both their predictions and the sources they cite.
How are predictions collected?
Each model receives the same structured prompt at three time windows before every game: 24 hours out (T-24h), 6 hours out (T-6h), and 1 hour before tip-off (T-1h). This triple-snapshot approach lets us measure prediction drift and flip rates as new information becomes available.
What is prediction drift?
Prediction drift measures how an AI model's pick changes between collection windows. A "flip" occurs when the model switches its predicted winner between T-24h and T-1h. High flip rates without corresponding news events suggest unstable reasoning priors.
What is confidence calibration?
Calibration measures whether an AI model's stated confidence matches its actual accuracy. A well-calibrated model that says "80% confident" should be correct roughly 80% of the time. Most AI models show overconfidence - they state high confidence yet miss more often than that confidence implies.
How is accuracy scored?
Correct picks earn 1 point. Bracket scoring weights later rounds more heavily: Sweet 16 picks are worth 2 points, Elite 8 worth 4 points, Final Four worth 8 points, and the Championship pick is worth 16 points. This mirrors standard bracket pool scoring.
What is source intelligence?
Source intelligence tracks every URL each AI model cites in its prediction responses. We categorize sources by type (major media, analytics, social media, team official sites) and rank domains by citation frequency. This reveals each model's evidence base and potential biases.
What is prompt sensitivity testing?
Prompt sensitivity tests whether rephrasing a prediction question changes the AI's answer. We run five prompt variants per game per model. A model that changes its pick based on synonymous rephrasing has unstable priors - its prediction is more a function of phrasing than genuine analysis.

GLOSSARY

T-24h / T-6h / T-1h
Collection windows before each game. T-24h means 24 hours before tip-off, T-6h means 6 hours out, T-1h means 1 hour out.
Flip
When an AI model switches its predicted winner between collection windows. Tracked from T-24h to T-1h.
Flip Rate
The percentage of games where a model changes its pick between collection windows. Lower flip rates generally correlate with higher accuracy.
Calibration Score
A measure of how well a model's stated confidence matches actual outcomes. Perfect calibration = 0 (confidence equals accuracy at every level).
Confidence Bucket
Groupings of predictions by stated confidence level (55%, 65%, 75%, 85%, 95%) used to calculate calibration curves.
Bracket Score
Weighted accuracy score where later-round correct picks are worth more: Round of 64 = 1pt, Round of 32 = 1pt, Sweet 16 = 2pt, Elite 8 = 4pt, Final Four = 8pt, Championship = 16pt.
Source Fingerprint
The unique pattern of citation domains each AI model relies on. GPT-4o tends toward major sports media; Perplexity favors analytics domains.
Upset
A game where a lower-seeded team (higher seed number) defeats a higher-seeded team. Classic upsets include 12-over-5 and 11-over-6 matchups.
Seed Bias
The tendency of AI models to systematically favor higher-seeded (lower number) teams regardless of underlying team quality metrics.
Consistency Score
In prompt sensitivity testing, the percentage of prompt variants that produce the same pick as the baseline prompt. Higher = more stable reasoning.
Ensemble Pick
A consensus prediction derived from combining multiple AI models. When models with uncorrelated errors agree, the ensemble pick historically outperforms any single model.
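
A simple majority vote is one way to form an ensemble pick; a sketch (the project may combine models differently):

```typescript
// Majority vote over the models' picks for one game; returns null when
// there is no strict majority. Illustrative only.
function ensemblePick(picks: string[]): string | null {
  const counts = new Map<string, number>();
  for (const pick of picks) counts.set(pick, (counts.get(pick) ?? 0) + 1);
  let best: string | null = null;
  let bestCount = 0;
  for (const [team, count] of counts) {
    if (count > bestCount) {
      best = team;
      bestCount = count;
    }
  }
  return bestCount > picks.length / 2 ? best : null;
}
```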

Get the weekly AI accuracy report.

Model rankings, source shifts, confidence gaps, and upset analysis. Every Sunday during the tournament.

No spam. Unsubscribe any time. Data-only, no hot takes.

LIVE PICKS
Predictions will appear here once collection begins · Tournament starts March 19