Prompt Sensitivity Testing - AI March Madness 2026
Prompt sensitivity testing measures whether rephrasing a prediction question changes an AI model's answer. For selected games, each model answers five prompt variants to test prediction stability: baseline, probabilistic, advancement framing, contrarian, and minimal.
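The exact variant wordings are not reproduced in this section, but the five framings map naturally onto templates. A minimal sketch in Python: the baseline string follows the documented format in the glossary below, while the other four wordings are illustrative assumptions, not the project's actual prompts.

```python
# Hypothetical templates for the five prompt variants. Only the baseline
# wording comes from the documented format; the other four phrasings are
# illustrative assumptions about how each framing might be worded.
PROMPT_VARIANTS = {
    "baseline": (
        "Who will win {team_a} vs {team_b} in the {round} "
        "of the 2026 NCAA Tournament?"
    ),
    "probabilistic": (
        "What is the percentage chance that {team_a} beats {team_b} "
        "in the {round} of the 2026 NCAA Tournament?"
    ),
    "advancement": (
        "Which team advances to the next round of the 2026 NCAA "
        "Tournament: {team_a} or {team_b}?"
    ),
    "contrarian": (
        "Most brackets favor the higher seed here. Looking past the "
        "consensus, who wins {team_a} vs {team_b} in the {round}?"
    ),
    "minimal": "{team_a} or {team_b}?",
}
```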
A model that changes its predicted winner under synonymous rephrasing has unstable priors: its prediction depends more on how the question is asked than on genuine analytical reasoning. Consistency scores range from 0% (no variant agrees with the baseline pick) to 100% (the same pick across all five variants).
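The metric itself reduces to a simple ratio over the four non-baseline variants. A minimal sketch, assuming each model response has already been parsed down to a plain team name; `consistency_score` is a hypothetical helper, not project code.

```python
def consistency_score(baseline_pick: str, variant_picks: list[str]) -> float:
    """Percentage of non-baseline variants whose predicted winner matches
    the baseline pick: 0% when no variant agrees, 100% when all do."""
    if not variant_picks:
        return 0.0
    matches = sum(pick == baseline_pick for pick in variant_picks)
    return 100.0 * matches / len(variant_picks)

# e.g. baseline picks Duke; 3 of the 4 rephrasings agree -> 75.0
assert consistency_score("Duke", ["Duke", "Duke", "UConn", "Duke"]) == 75.0
```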
Results show significant variance in Gemini 2.5 when the framing shifts from binary (winner/loser) to probabilistic (percentage chance). GPT-4o is more stable across phrasings. Perplexity shows the highest variance, likely because its web search behavior changes with query phrasing.
- Consistency Score: Percentage of prompt variants that produce the same predicted winner as the baseline prompt. Higher is more stable.
- Prompt Variant: A rephrased version of the standard prediction question, testing whether the model's answer changes with different wording.
- Baseline Prompt: The standard prediction question format: 'Who will win [Team A] vs [Team B] in the [round] of the 2026 NCAA Tournament?'
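Combining the two sketches above, a hypothetical end-to-end harness for one game could look like the following; `ask_model` is a stand-in for whatever client actually queries each model and extracts a team name from its reply.

```python
def ask_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to `model` and parse the
    predicted winner (a team name) out of the response."""
    raise NotImplementedError  # replace with a real model client

def test_sensitivity(model: str, team_a: str, team_b: str, rnd: str) -> float:
    """Ask one model all five variants for one game, then score stability."""
    picks = {
        name: ask_model(model, tmpl.format(team_a=team_a, team_b=team_b, round=rnd))
        for name, tmpl in PROMPT_VARIANTS.items()
    }
    baseline = picks.pop("baseline")
    return consistency_score(baseline, list(picks.values()))
```

Popping the baseline pick before scoring keeps the input to exactly the four rephrasings, which matches the 0% to 100% range described above.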
[Chart: Prompt Sensitivity. How predictions change with different prompt phrasings; stable models score higher.]