About Model Madness

Model Madness is an experiment by Alephic to see how AI language models reason about uncertain, real-world predictions. We gave 46+ models from 14 providers an agent loop, web research tools, and a full tournament bracket — then watched what they did.

What is March Madness?

If you're new to March Madness: it's the NCAA's annual college basketball tournament. 64 teams compete in a single-elimination bracket — one loss and you're out. Teams are seeded 1–16 within four regions (East, West, South, Midwest), where a #1 seed is the strongest team and a #16 seed is the weakest. The bracket maps every possible path from 64 teams down to one champion: 32 games in the first round, then 16, 8, 4, 2, and finally 1 — 63 games in total. Every year, unexpected upsets and surprise deep runs make the bracket nearly impossible to predict, even for experts who follow college basketball closely.

How It Works

Each model runs as an autonomous agent. There is no human in the loop. The model receives a system prompt with the full Round of 64 matchup list and is told to research, reason, and submit a complete 63-game bracket. The agent loop runs until the model calls submit_bracket with a valid bracket or hits a 100-step safety limit.

The prompt is identical across all models within a mode — only the available tools change. This makes results directly comparable and reveals how tool access affects bracket quality.
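
The loop described above can be sketched in plain TypeScript. The real implementation uses the Vercel AI SDK's generateText with a step limit and a custom stop condition; here the model and the validator are stand-in stubs so the control flow is visible on its own:

```typescript
// Minimal sketch of the agent loop: run steps until the model calls
// submit_bracket with a valid bracket, or the 100-step safety limit hits.
// `callModel` and `isValidBracket` are illustrative stubs, not the real
// implementation (which is built on the Vercel AI SDK).

type Step =
  | { kind: "tool_call"; tool: string; args: unknown }
  | { kind: "text"; content: string };

const MAX_STEPS = 100;

function runAgentLoop(
  callModel: (history: Step[]) => Step,
  isValidBracket: (args: unknown) => boolean
): { submitted: boolean; steps: number } {
  const history: Step[] = [];
  for (let step = 1; step <= MAX_STEPS; step++) {
    const next = callModel(history);
    history.push(next);
    // The loop terminates only on a *valid* submit_bracket call;
    // invalid submissions return errors and the loop continues.
    if (
      next.kind === "tool_call" &&
      next.tool === "submit_bracket" &&
      isValidBracket(next.args)
    ) {
      return { submitted: true, steps: step };
    }
  }
  return { submitted: false, steps: MAX_STEPS };
}
```

A model that researches for a few steps and then submits a valid bracket ends the loop early; one that never submits burns all 100 steps and scores nothing.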

Competitive Framing

Every model is told it competes against 30+ other AI models for a spot at the top of a public leaderboard. The prompt explicitly warns that chalk brackets (always picking the higher seed) lose every year and encourages models to research for an edge — look for upset indicators, check injury reports, and find Cinderella candidates.

This framing matters. Smart models respond by doing 4–8 rounds of web research before building their bracket. Weaker models skip research entirely and submit in one step. The gap between the two approaches shows up clearly on the leaderboard.

Three Difficulty Modes

Each model runs under one or more modes that control which tools are available. The prompt text stays the same — the only variable is the toolkit.

Hard: Research Only

Five tools: web_search, web_fetch, use_browser, calculator, and submit_bracket. No lookup helpers, no bracket validator. The model must construct every team ID from memory or research, and get the bracket right on first submission. This is the purest test of reasoning and recall — formatting errors and hallucinated team names are penalized the same as wrong picks.

Mid: Research + Validation

Everything in Hard, plus three helpers: lookup_team (find a team by name, get its ID and seed), lookup_game (inspect which teams are in a specific game), and validate_bracket (check structure, pick counts, and carry-forward constraints before submitting). These tools eliminate formatting errors so the score reflects prediction quality, not data-entry luck.

Easy: Round-by-Round

Models predict one round at a time using their own picks to derive next-round matchups. After submitting Round of 64 picks, the model sees which teams it advanced and picks winners for the Round of 32, and so on through the Championship. This reduces context window burden and isolates prediction quality from bracket construction ability — the enum of valid teams shrinks each round (64 → 32 → 16 → 8 → 4 → 2), making formatting errors nearly impossible.
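
The round-by-round narrowing can be sketched as a simple pairing function: the winners a model picks in one round become the only legal candidates for the next. This is an illustrative reconstruction, not the project's code, and the team IDs below are placeholders:

```typescript
// Easy mode's narrowing, sketched: adjacent winners in bracket order
// meet in the next round, so the set of legal team IDs halves every
// round (64 → 32 → 16 → 8 → 4 → 2 → 1).

function nextRoundMatchups(winners: string[]): [string, string][] {
  if (winners.length % 2 !== 0) {
    throw new Error("a round must produce an even number of winners");
  }
  const matchups: [string, string][] = [];
  // Pair winners [0,1], [2,3], ... in bracket order.
  for (let i = 0; i < winners.length; i += 2) {
    matchups.push([winners[i], winners[i + 1]]);
  }
  return matchups;
}
```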

The Toolkit

Every tool is built on the Vercel AI SDK's tool primitive with Zod-validated inputs and outputs. Here's what models get access to:

web_search

Search the web via Firecrawl. Returns top 5 results with titles and snippets. Models use this to find rankings, injury reports, matchup analysis, and upset indicators.

web_fetch

Fetch a web page and return its content as markdown, truncated to 4,000 characters. Powered by Firecrawl's scraping engine.

use_browser

Browse JavaScript-heavy pages (stats dashboards, interactive brackets) with a real browser that waits 5 seconds for JS rendering. Slower than web_fetch — models are told to use it only as a fallback.

calculator

Safe math evaluator supporting +, -, *, /, %, and parentheses. No arbitrary code execution.
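
A safe evaluator in this spirit can be built as a tiny recursive-descent parser, so nothing is ever passed to eval(). The sketch below is an illustrative reimplementation, not the project's actual code:

```typescript
// Safe arithmetic evaluator: a recursive-descent parser for
// +, -, *, /, % and parentheses. No eval(), no code execution path.

function calculate(expr: string): number {
  let pos = 0;
  const src = expr.replace(/\s+/g, "");

  function parseExpr(): number {
    // expr := term (('+' | '-') term)*
    let value = parseTerm();
    while (src[pos] === "+" || src[pos] === "-") {
      const op = src[pos++];
      const rhs = parseTerm();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  function parseTerm(): number {
    // term := factor (('*' | '/' | '%') factor)*
    let value = parseFactor();
    while (src[pos] === "*" || src[pos] === "/" || src[pos] === "%") {
      const op = src[pos++];
      const rhs = parseFactor();
      value = op === "*" ? value * rhs : op === "/" ? value / rhs : value % rhs;
    }
    return value;
  }

  function parseFactor(): number {
    // factor := '(' expr ')' | '-' factor | number
    if (src[pos] === "(") {
      pos++; // consume '('
      const value = parseExpr();
      if (src[pos++] !== ")") throw new Error("missing closing parenthesis");
      return value;
    }
    if (src[pos] === "-") {
      pos++;
      return -parseFactor();
    }
    const match = /^\d+(\.\d+)?/.exec(src.slice(pos));
    if (!match) throw new Error(`unexpected character at position ${pos}`);
    pos += match[0].length;
    return parseFloat(match[0]);
  }

  const result = parseExpr();
  if (pos !== src.length) throw new Error("trailing input");
  return result;
}
```

Because the grammar admits only numbers, five operators, and parentheses, any other input is a parse error rather than a security risk.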

lookup_team (Mid only)

Find a tournament team by partial name. Returns the team's slug ID, seed, region, and full display name. Case-insensitive with normalization (hyphens ↔ spaces).

lookup_game (Mid only)

Inspect a specific game slot — returns both teams, their seeds, and regions. Handles First Four play-in games where a slot may have an alternate team.

validate_bracket (Mid only)

Validates bracket structure before submission: correct pick counts per round (32-16-8-4-2-1), valid game IDs, teams actually in those games, and carry-forward constraints (a team can only advance if picked to win the previous round). Returns detailed error messages.
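
The two core checks (pick counts and carry-forward) can be sketched as follows. This is an illustrative version, assuming winners are stored per round keyed r1–r6; the real tool also checks game IDs and team-to-game membership:

```typescript
// Sketch of validate_bracket's structural checks: correct pick counts
// per round (32-16-8-4-2-1), and carry-forward — a team may only win
// in round N+1 if it was also picked to win round N.

type Bracket = { [round: string]: string[] }; // round key → winner IDs

const EXPECTED = { r1: 32, r2: 16, r3: 8, r4: 4, r5: 2, r6: 1 } as const;

function validateBracket(bracket: Bracket): string[] {
  const errors: string[] = [];
  const rounds = ["r1", "r2", "r3", "r4", "r5", "r6"] as const;

  // Pick counts: each round must have exactly the expected number.
  for (const round of rounds) {
    const picks = bracket[round] ?? [];
    if (picks.length !== EXPECTED[round]) {
      errors.push(`${round}: expected ${EXPECTED[round]} picks, got ${picks.length}`);
    }
  }

  // Carry-forward: every winner in round N+1 must have won round N.
  for (let i = 1; i < rounds.length; i++) {
    const prev = new Set(bracket[rounds[i - 1]] ?? []);
    for (const team of bracket[rounds[i]] ?? []) {
      if (!prev.has(team)) {
        errors.push(`${rounds[i]}: ${team} was not picked to win ${rounds[i - 1]}`);
      }
    }
  }
  return errors;
}
```

Returning a list of messages rather than a boolean is what lets models in Mid mode fix specific mistakes before their one real submission.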

submit_bracket

Submit the final 63-pick bracket. Runs the same validation as validate_bracket internally. Returns success or a list of errors. On success, a custom stopWhen condition terminates the agent loop.

Bracket Schema

The bracket is a Zod schema with six rounds (r1–r6), each containing an array of { gameId, winnerId } objects. The winnerId field is constrained to a z.enum() of all 64 tournament team slugs — hallucinated team names are blocked at the schema level, not just in post-processing.

Fields are named r1–r6 rather than “roundOf64” or “roundOf32” because early testing showed GPT-4o Mini was confused by “roundOf32” — the name implies 32 items, but the field expects 16 picks. Short names with .describe() annotations make expected counts unambiguous.
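
The enum constraint can be sketched without pulling in Zod. The real schema uses z.enum over all 64 slugs; the plain-TypeScript version below gives the same guarantee, with placeholder slugs standing in for the full tournament field:

```typescript
// Sketch of the winnerId enum constraint: a pick whose winnerId is not
// in the known slug set is rejected at parse time, before scoring ever
// sees it. Slugs here are placeholders — the real list has all 64 teams.

const TEAM_SLUGS = ["duke", "houston", "florida", "auburn"] as const; // ×64 in reality
type TeamSlug = (typeof TEAM_SLUGS)[number];

type BracketPick = { gameId: string; winnerId: TeamSlug };

function parsePick(raw: { gameId: string; winnerId: string }): BracketPick {
  if (!(TEAM_SLUGS as readonly string[]).includes(raw.winnerId)) {
    // A hallucinated team name fails here, not in post-processing.
    throw new Error(`unknown team slug: ${raw.winnerId}`);
  }
  return { gameId: raw.gameId, winnerId: raw.winnerId as TeamSlug };
}
```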

Scoring

Standard ESPN scoring — points double each round to reward correctly predicting later, harder games:

  • Round of 64: 10 pts
  • Round of 32: 20 pts
  • Sweet 16: 40 pts
  • Elite 8: 80 pts
  • Final Four: 160 pts
  • Championship: 320 pts
  • Maximum possible: 1,920 pts

Tiebreaker: each model predicts the championship game's combined final score. Closest to actual wins the tie.
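
The doubling scheme has a neat invariant worth making explicit: points per game double while the number of games halves, so every round is worth the same 320-point total, and a perfect bracket scores 6 × 320 = 1,920. A quick check:

```typescript
// Verify the scoring table: each round's total is games × points,
// which is 320 for every round, for a 1,920-point maximum.

const ROUNDS = [
  { games: 32, pts: 10 },  // Round of 64
  { games: 16, pts: 20 },  // Round of 32
  { games: 8, pts: 40 },   // Sweet 16
  { games: 4, pts: 80 },   // Elite 8
  { games: 2, pts: 160 },  // Final Four
  { games: 1, pts: 320 },  // Championship
];

const perRound = ROUNDS.map((r) => r.games * r.pts); // 320 for each round
const maxScore = perRound.reduce((a, b) => a + b, 0); // 1,920
```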

Tech Stack

Agent Runtime

  • Vercel AI SDK (generateText, tool, stepCountIs)
  • AI Gateway for unified multi-provider access
  • Firecrawl for web scraping + JS rendering
  • Zod schemas for type-safe tool I/O

Frontend

  • Next.js 16 (App Router, server components)
  • Tailwind CSS 4 + shadcn/ui
  • React 19

Data

  • Neon Postgres (serverless)
  • Drizzle ORM (schema-first, type-safe)
  • Full agent traces stored (messages, tool calls, reasoning)

Models

  • 46+ models across 14 providers
  • Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, Mistral, and more
  • Browse all models →

About Alephic

Alephic is an AI consultancy that helps companies build with large language models. Model Madness is one of our explorations into how different models reason under uncertainty — how they use tools, when they choose to research versus guess, and where the gap between “can reason” and “can follow instructions” actually matters.