HOW WE TEST IT
Our prompt sensitivity tests use five phrasings for each game, varying framing (winner/advancement/probability), detail level (seeds only vs. seeds + recent record), and format requested (name only vs. name + confidence).
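For concreteness, here is a minimal sketch of how five such variants could be built for one matchup. The function name, team names, seeds, and records are illustrative placeholders, and the exact wording is an assumption rather than our production prompts.

```python
def prompt_variants(team_a: str, team_b: str, seeds: str, record: str) -> list[str]:
    """Return five phrasings of the same matchup question, varying
    framing, detail level, and requested answer format."""
    return [
        # 1. Binary framing, seeds only, name-only answer
        f"{seeds}. Who wins: {team_a} or {team_b}? Answer with the team name only.",
        # 2. Advancement framing, seeds only, name-only answer
        f"{seeds}. Which team advances to the next round? Name only.",
        # 3. Probabilistic framing, seeds only
        f"{seeds}. What is the % chance {team_a} beats {team_b}?",
        # 4. Binary framing, seeds + recent record, name-only answer
        f"{seeds}. {record}. Who wins: {team_a} or {team_b}? Name only.",
        # 5. Binary framing, seeds + record, name + confidence
        f"{seeds}. {record}. Pick a winner and state your confidence (0-100%).",
    ]

# Illustrative matchup; the data here is made up for the example.
variants = prompt_variants(
    "Duke", "Vermont",
    seeds="Duke is the 4 seed, Vermont the 13 seed",
    record="Duke is 9-1 in its last 10; Vermont is 8-2",
)
```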
WHAT WE FOUND
Results from pre-tournament testing show significant variance in Gemini's picks when framing shifts from binary (winner/loser) to probabilistic (% chance). GPT-4o is more stable across phrasings. Perplexity shows the highest variance overall, likely because its web search behavior changes with how the query is phrased.
READING THE CONSISTENCY SCORE
The Prompt Sensitivity page tracks this in real time during the tournament. A high consistency score means the model's pick was stable across all five prompt variants. Low consistency is a red flag for that specific prediction: it suggests the pick is more a function of phrasing than of genuine analytical confidence.
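One plausible way to turn five answers into a single score is the share of variants that agree with the most common pick. The sketch below shows that calculation; it is an assumption for illustration, not necessarily the site's exact metric.

```python
from collections import Counter

def consistency_score(picks: list[str]) -> float:
    """Fraction of prompt variants agreeing with the modal pick.
    1.0 = all five variants named the same team; 0.2 = five different answers."""
    modal_count = Counter(picks).most_common(1)[0][1]
    return modal_count / len(picks)

# e.g. four of five variants picked the same team:
print(consistency_score(["Duke", "Duke", "Duke", "Vermont", "Duke"]))  # 0.8
```

Agreement with the modal pick is the simplest option; the probabilistic variant could instead be scored by comparing stated win probabilities across phrasings.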