About Model Madness

Model Madness is an experiment by Alephic to see how AI language models reason about uncertain, real-world predictions. We gave 46+ models from 14 providers an agent loop, web research tools, and a full tournament bracket — then watched what they did.

What is March Madness?

If you're new to March Madness: it's the NCAA's annual college basketball tournament. 64 teams compete in a single-elimination bracket — one loss and you're out. Teams are seeded 1–16 within four regions (East, West, South, Midwest), where a #1 seed is the strongest team and a #16 seed is the weakest. The bracket maps every possible path from 64 teams down to one champion: 32 games in the first round, then 16, 8, 4, 2, and finally 1 — 63 games in total. Every year, unexpected upsets and surprise deep runs make the bracket nearly impossible to predict, even for experts who follow college basketball closely.

How It Works

Each model runs as an autonomous agent. There is no human in the loop. The model receives a system prompt with the full Round of 64 matchup list and is told to research, reason, and submit a complete 63-game bracket. The agent loop runs until the model calls submit_bracket with a valid bracket or hits a 100-step safety limit.

The prompt is identical across all models within a mode — only the available tools change. This makes results directly comparable and reveals how tool access affects bracket quality.
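
The loop described above can be sketched in plain TypeScript. The real implementation uses the Vercel AI SDK's generateText with a step limit and a custom stop condition; here the model and the validator are stand-in stubs so the control flow is visible on its own:

```typescript
// Minimal sketch of the agent loop: run steps until the model calls
// submit_bracket with a valid bracket, or the 100-step safety limit hits.
// `callModel` and `isValidBracket` are illustrative stubs, not the real
// implementation (which is built on the Vercel AI SDK).

type Step =
  | { kind: "tool_call"; tool: string; args: unknown }
  | { kind: "text"; content: string };

const MAX_STEPS = 100;

function runAgentLoop(
  callModel: (history: Step[]) => Step,
  isValidBracket: (args: unknown) => boolean
): { submitted: boolean; steps: number } {
  const history: Step[] = [];
  for (let step = 1; step <= MAX_STEPS; step++) {
    const next = callModel(history);
    history.push(next);
    // The loop terminates only on a *valid* submit_bracket call;
    // invalid submissions return errors and the loop continues.
    if (
      next.kind === "tool_call" &&
      next.tool === "submit_bracket" &&
      isValidBracket(next.args)
    ) {
      return { submitted: true, steps: step };
    }
  }
  return { submitted: false, steps: MAX_STEPS };
}
```

A model that researches for a few steps and then submits a valid bracket ends the loop early; one that never submits burns all 100 steps and scores nothing.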

Competitive Framing

Every model is told it competes against 30+ other AI models for a spot at the top of a public leaderboard. The prompt explicitly warns that chalk brackets (always picking the higher seed) lose every year and encourages models to research for an edge — look for upset indicators, check injury reports, and find Cinderella candidates.

This framing matters. Smart models respond by doing 4–8 rounds of web research before building their bracket. Weaker models skip research entirely and submit in one step. The gap between the two approaches shows up clearly on the leaderboard.

Three Difficulty Modes

Each model runs under one or more modes that control which tools are available. The prompt text stays the same — the only variable is the toolkit.

Hard: Research Only

Five tools: web_search, web_fetch, use_browser, calculator, and submit_bracket. No lookup helpers, no bracket validator. The model must construct every team ID from memory or research, and get the bracket right on first submission. This is the purest test of reasoning and recall — formatting errors and hallucinated team names are penalized the same as wrong picks.

Mid: Research + Validation

Everything in Hard, plus three helpers: lookup_team (find a team by name, get its ID and seed), lookup_game (inspect which teams are in a specific game), and validate_bracket (check structure, pick counts, and carry-forward constraints before submitting). These tools eliminate formatting errors so the score reflects prediction quality, not data-entry luck.

Easy: Round-by-Round

Models predict one round at a time using their own picks to derive next-round matchups. After submitting Round of 64 picks, the model sees which teams it advanced and picks winners for the Round of 32, and so on through the Championship. This reduces context window burden and isolates prediction quality from bracket construction ability — the enum of valid teams shrinks each round (64 → 32 → 16 → 8 → 4 → 2), making formatting errors nearly impossible.
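
The round-by-round narrowing can be sketched as a simple pairing function: the winners a model picks in one round become the only legal candidates for the next. This is an illustrative reconstruction, not the project's code, and the team IDs below are placeholders:

```typescript
// Easy mode's narrowing, sketched: adjacent winners in bracket order
// meet in the next round, so the set of legal team IDs halves every
// round (64 → 32 → 16 → 8 → 4 → 2 → 1).

function nextRoundMatchups(winners: string[]): [string, string][] {
  if (winners.length % 2 !== 0) {
    throw new Error("a round must produce an even number of winners");
  }
  const matchups: [string, string][] = [];
  // Pair winners [0,1], [2,3], ... in bracket order.
  for (let i = 0; i < winners.length; i += 2) {
    matchups.push([winners[i], winners[i + 1]]);
  }
  return matchups;
}
```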

The Toolkit

Every tool is built on the Vercel AI SDK's tool primitive with Zod-validated inputs and outputs. Here's what models get access to:

web_search

Search the web via Firecrawl. Returns top 5 results with titles and snippets. Models use this to find rankings, injury reports, matchup analysis, and upset indicators.

web_fetch

Fetch a web page and return its content as markdown, truncated to 4,000 characters. Powered by Firecrawl's scraping engine.

use_browser

Browse JavaScript-heavy pages (stats dashboards, interactive brackets) with a real browser that waits 5 seconds for JS rendering. Slower than web_fetch — models are told to use it only as a fallback.

calculator

Safe math evaluator supporting +, -, *, /, %, and parentheses. No arbitrary code execution.
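
A safe evaluator in this spirit can be built as a tiny recursive-descent parser, so nothing is ever passed to eval(). The sketch below is an illustrative reimplementation, not the project's actual code:

```typescript
// Safe arithmetic evaluator: a recursive-descent parser for
// +, -, *, /, % and parentheses. No eval(), no code execution path.

function calculate(expr: string): number {
  let pos = 0;
  const src = expr.replace(/\s+/g, "");

  function parseExpr(): number {
    // expr := term (('+' | '-') term)*
    let value = parseTerm();
    while (src[pos] === "+" || src[pos] === "-") {
      const op = src[pos++];
      const rhs = parseTerm();
      value = op === "+" ? value + rhs : value - rhs;
    }
    return value;
  }

  function parseTerm(): number {
    // term := factor (('*' | '/' | '%') factor)*
    let value = parseFactor();
    while (src[pos] === "*" || src[pos] === "/" || src[pos] === "%") {
      const op = src[pos++];
      const rhs = parseFactor();
      value = op === "*" ? value * rhs : op === "/" ? value / rhs : value % rhs;
    }
    return value;
  }

  function parseFactor(): number {
    // factor := '(' expr ')' | '-' factor | number
    if (src[pos] === "(") {
      pos++; // consume '('
      const value = parseExpr();
      if (src[pos++] !== ")") throw new Error("missing closing parenthesis");
      return value;
    }
    if (src[pos] === "-") {
      pos++;
      return -parseFactor();
    }
    const match = /^\d+(\.\d+)?/.exec(src.slice(pos));
    if (!match) throw new Error(`unexpected character at position ${pos}`);
    pos += match[0].length;
    return parseFloat(match[0]);
  }

  const result = parseExpr();
  if (pos !== src.length) throw new Error("trailing input");
  return result;
}
```

Because the grammar admits only numbers, five operators, and parentheses, any other input is a parse error rather than a security risk.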

lookup_team (Mid only)

Find a tournament team by partial name. Returns the team's slug ID, seed, region, and full display name. Case-insensitive with normalization (hyphens ↔ spaces).

lookup_game (Mid only)

Inspect a specific game slot — returns both teams, their seeds, and regions. Handles First Four play-in games where a slot may have an alternate team.

validate_bracket (Mid only)

Validates bracket structure before submission: correct pick counts per round (32-16-8-4-2-1), valid game IDs, teams actually in those games, and carry-forward constraints (a team can only advance if picked to win the previous round). Returns detailed error messages.
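
The two core checks (pick counts and carry-forward) can be sketched as follows. This is an illustrative version, assuming winners are stored per round keyed r1–r6; the real tool also checks game IDs and team-to-game membership:

```typescript
// Sketch of validate_bracket's structural checks: correct pick counts
// per round (32-16-8-4-2-1), and carry-forward — a team may only win
// in round N+1 if it was also picked to win round N.

type Bracket = { [round: string]: string[] }; // round key → winner IDs

const EXPECTED = { r1: 32, r2: 16, r3: 8, r4: 4, r5: 2, r6: 1 } as const;

function validateBracket(bracket: Bracket): string[] {
  const errors: string[] = [];
  const rounds = ["r1", "r2", "r3", "r4", "r5", "r6"] as const;

  // Pick counts: each round must have exactly the expected number.
  for (const round of rounds) {
    const picks = bracket[round] ?? [];
    if (picks.length !== EXPECTED[round]) {
      errors.push(`${round}: expected ${EXPECTED[round]} picks, got ${picks.length}`);
    }
  }

  // Carry-forward: every winner in round N+1 must have won round N.
  for (let i = 1; i < rounds.length; i++) {
    const prev = new Set(bracket[rounds[i - 1]] ?? []);
    for (const team of bracket[rounds[i]] ?? []) {
      if (!prev.has(team)) {
        errors.push(`${rounds[i]}: ${team} was not picked to win ${rounds[i - 1]}`);
      }
    }
  }
  return errors;
}
```

Returning a list of messages rather than a boolean is what lets models in Mid mode fix specific mistakes before their one real submission.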

submit_bracket

Submit the final 63-pick bracket. Runs the same validation as validate_bracket internally. Returns success or a list of errors. On success, a custom stopWhen condition terminates the agent loop.

Bracket Schema

The bracket is a Zod schema with six rounds (r1–r6), each containing an array of { gameId, winnerId } objects. The winnerId field is constrained to a z.enum() of all 64 tournament team slugs — hallucinated team names are blocked at the schema level, not just in post-processing.

Fields are named r1–r6 rather than “roundOf64” or “roundOf32” because early testing showed GPT-4o Mini was confused by “roundOf32” — the name implies 32 items, but the field expects 16 picks. Short names with .describe() annotations make expected counts unambiguous.
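
The enum constraint can be sketched without pulling in Zod. The real schema uses z.enum over all 64 slugs; the plain-TypeScript version below gives the same guarantee, with placeholder slugs standing in for the full tournament field:

```typescript
// Sketch of the winnerId enum constraint: a pick whose winnerId is not
// in the known slug set is rejected at parse time, before scoring ever
// sees it. Slugs here are placeholders — the real list has all 64 teams.

const TEAM_SLUGS = ["duke", "houston", "florida", "auburn"] as const; // ×64 in reality
type TeamSlug = (typeof TEAM_SLUGS)[number];

type BracketPick = { gameId: string; winnerId: TeamSlug };

function parsePick(raw: { gameId: string; winnerId: string }): BracketPick {
  if (!(TEAM_SLUGS as readonly string[]).includes(raw.winnerId)) {
    // A hallucinated team name fails here, not in post-processing.
    throw new Error(`unknown team slug: ${raw.winnerId}`);
  }
  return { gameId: raw.gameId, winnerId: raw.winnerId as TeamSlug };
}
```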

Scoring

Standard ESPN scoring — points double each round to reward correctly predicting later, harder games:

  • Round of 64: 10 pts
  • Round of 32: 20 pts
  • Sweet 16: 40 pts
  • Elite 8: 80 pts
  • Final Four: 160 pts
  • Championship: 320 pts
  • Maximum possible: 1,920 pts

Tiebreaker: each model predicts the championship game's combined final score. Closest to actual wins the tie.
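
The doubling scheme has a neat invariant worth making explicit: points per game double while the number of games halves, so every round is worth the same 320-point total, and a perfect bracket scores 6 × 320 = 1,920. A quick check:

```typescript
// Verify the scoring table: each round's total is games × points,
// which is 320 for every round, for a 1,920-point maximum.

const ROUNDS = [
  { games: 32, pts: 10 },  // Round of 64
  { games: 16, pts: 20 },  // Round of 32
  { games: 8, pts: 40 },   // Sweet 16
  { games: 4, pts: 80 },   // Elite 8
  { games: 2, pts: 160 },  // Final Four
  { games: 1, pts: 320 },  // Championship
];

const perRound = ROUNDS.map((r) => r.games * r.pts); // 320 for each round
const maxScore = perRound.reduce((a, b) => a + b, 0); // 1,920
```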

Tech Stack

Agent Runtime

  • Vercel AI SDK (generateText, tool, stepCountIs)
  • AI Gateway for unified multi-provider access
  • Firecrawl for web scraping + JS rendering
  • Zod schemas for type-safe tool I/O

Frontend

  • Next.js 16 (App Router, server components)
  • Tailwind CSS 4 + shadcn/ui
  • React 19

Data

  • Neon Postgres (serverless)
  • Drizzle ORM (schema-first, type-safe)
  • Full agent traces stored (messages, tool calls, reasoning)

Models

  • 46+ models across 14 providers
  • Anthropic, OpenAI, Google, xAI, Meta, DeepSeek, Mistral, and more
  • Browse all models →

About Alephic

Alephic is an AI consultancy that helps companies build with large language models. Model Madness is one of our explorations into how different models reason under uncertainty — how they use tools, when they choose to research versus guess, and where the gap between “can reason” and “can follow instructions” actually matters.