Tests are green, but no steering, no verification, no recovery. Direction collapses the total.
LeetCode doesn't prove much anymore.
The work changed. You point an agent at a real problem, catch it when it is confidently wrong, and check what it actually shipped. That judgment is the skill, and it is the thing kodwai scores.
// three reasons the old test fails
- 01 the format is stale
Built for a different era
Whiteboard puzzles and LeetCode grinds were designed for engineers working alone with nothing but an editor. Point an agent at one and it clears the puzzle in seconds. You learn nothing about the engineer.
- 02 green is not the same as good
Passing tests proves little
A single careless prompt can make the suite go green and still show no judgment at all. No verification, no decomposition, no recovery when the agent goes confidently wrong. The checkmark hides everything that matters.
- 03 the real work is unmeasured
Nothing measures how you work
You spend your day directing an agent: writing the spec, catching hallucinations, checking what actually shipped. That is the skill that decides who is good now, and until kodwai, nothing put a number on it.
From pick to scored, in five steps.
No sandbox, nothing to install that fights you. You work on your own machine with your own agent, and Kodwai scores the whole session.
- 01
Pick a challenge
Browse real, ticket-sized problems across the categories you ship in. Filter by difficulty and pick one that looks like the work you actually do.
- 02
Run the CLI
Start it from your terminal and choose your agent. We download PROBLEM.md, starter files and tests, init a git repo, and start the timer.
Claude Code
Cursor
Codex
- 03
Solve on your machine
Work the problem with your own agent in your own editor. No sandbox to fight, no artificial constraints, just how you really build.
- 04
Submit (CLI)
One command packages your code, git history, test runs, agent transcript, and the time you took, then ships it for scoring.
- 05
Get your score
Direction, Outcome, and Lift land with per-signal evidence, so you can see why each axis scored the way it did. Then you are on the leaderboard.
One command pulls the problem, sets your agent, inits git, and starts the clock. Your agent opens right where you left off. Then you build it your way.
See exactly how you vibe code.
Same challenge, two developers. A careless one-shot prompt can pass the tests. It still scores low, because passing tests is not skill. kodwai reads the whole session, so the score rewards how you drive.
Tests green and the agent was steered, verified, and hardened. Direction carries the score.
kodwai reads the whole session: the prompts, the recovery, the test runs, the commits. The score is dominated by Direction, the part a one-shot cannot fake.
Problems worth shipping.
15 live challenges across 10 categories and three difficulties. Each one is scoped like a real ticket, not a riddle. Pick the track that looks like the work you actually do.
Bookshelf REST API
Junior Backend Engineer interview. Build a small REST API from scratch with CRUD, filters, validation, persist...
Multi-Currency Wallet Ledger with Idempotent Transfers
Senior Backend Engineer interview. Build the core double-entry ledger for a multi-currency wallet: atomic tran...
Process / Task Orchestrator-Lite
Platform / Infra interview. Build a task orchestrator that runs a DAG with dependency ordering, a global concu...
What the score actually measures.
A one-shot “solve this” prompt clears the tests, so passing tests is not enough. The score is dominated by how you direct the agent, the part a careless prompt cannot fake.
sample run · rate limiter
Direction: how you steer, verify, and decompose
Outcome: what actually shipped, and whether it holds
Lift: the edges a one-shot prompt misses
“before we move on, write a test that fires 1k concurrent requests and assert no tokens leak past the window”
Why it scored. You forced the agent to prove the concurrency claim instead of trusting it. Cited from turn 14, 41s before the first commit.
scored 0 to 100 · direction 50 · outcome 35 · lift 15 · every signal cites its evidence
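For a sense of what the quoted prompt asks the agent to produce, here is a minimal sketch of such a test in Python. Everything in it is an assumption made for illustration: the `FixedWindowLimiter` class, the limit of 100 per 60-second window, and the frozen clock are hypothetical and are not taken from the actual challenge. The clock is frozen so that all 1,000 requests land in one window and the assertion is deterministic.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FixedWindowLimiter:
    """Hypothetical fixed-window limiter: at most `limit` requests per window."""

    def __init__(self, limit, window_seconds, clock):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injected so the test controls time
        self._lock = threading.Lock()
        self._window_start = clock()
        self._used = 0

    def allow(self):
        # The lock makes the check-and-increment atomic; without it,
        # concurrent callers could both pass the check and overspend.
        with self._lock:
            now = self.clock()
            if now - self._window_start >= self.window_seconds:
                self._window_start = now
                self._used = 0
            if self._used < self.limit:
                self._used += 1
                return True
            return False


def count_allowed(limiter, requests, workers=32):
    """Fire `requests` calls from a thread pool and count how many were allowed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: limiter.allow(), range(requests)))
    return sum(results)


# Frozen clock: every request falls inside the same window.
limiter = FixedWindowLimiter(limit=100, window_seconds=60, clock=lambda: 0.0)
allowed = count_allowed(limiter, requests=1000)
assert allowed == 100, f"tokens leaked past the window: {allowed}"
```

The point is the assertion, not the limiter: the claim "this is safe under concurrency" becomes something the test run either confirms or refutes.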
Rank, earn, and prove it.
Every scored run moves you up the global leaderboard and builds a public profile you can send to anyone.
Badges that stack up.
shareable to X & LinkedIn
Milestones, streaks, skill and agent badges land automatically as you submit. Your profile at kodwai.com/developers/you shows your score, your rank, the badges you have earned, and the agents you drive. Built to send to anyone, including a hiring manager instead of a take-home.
Numbers we are happy to stand behind.
No vanity metrics. Just what the platform is, what it costs you, and how honestly it measures the way you actually work.
Frequently asked questions.
Everything worth knowing before your first run. Still curious? The answer is one message away.
What is vibe coding, and how do you score it? // scoring
Vibe coding is building real software by directing an AI agent instead of typing every line yourself. Kodwai scores the session across three axes: Direction (how you steer, verify, and decompose), Outcome (what actually shipped and whether it passes), and Lift (the edge cases a one-shot prompt misses). Every signal cites its own evidence from your transcript, commits, and test runs.
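The sample run lists the axes as direction 50, outcome 35, lift 15 on a 0 to 100 scale. As a rough illustration of why a green test suite alone scores low, here is a minimal sketch that assumes those numbers are the maximum points per axis and that each axis is graded as a fraction between 0 and 1. Both the combination rule and the example fractions are assumptions for illustration, not Kodwai's published formula or real results.

```python
# Max points per axis, as listed on the sample run card.
WEIGHTS = {"direction": 50, "outcome": 35, "lift": 15}


def total_score(fractions):
    """Combine per-axis fractions (0.0 to 1.0) into a 0 to 100 total.

    Assumption: the total is a plain weighted sum. The real scoring
    may combine the axes differently.
    """
    total = 0.0
    for axis, weight in WEIGHTS.items():
        fraction = fractions.get(axis, 0.0)
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{axis} must be between 0 and 1, got {fraction}")
        total += weight * fraction
    return round(total, 1)


# Hypothetical sessions: both pass every test (outcome = 1.0).
one_shot = total_score({"direction": 0.1, "outcome": 1.0, "lift": 0.2})
steered = total_score({"direction": 0.9, "outcome": 1.0, "lift": 0.8})

print(one_shot)  # 43.0
print(steered)   # 92.0
```

Under these assumptions, a session with a perfect Outcome but almost no Direction stays below half marks, because Outcome can contribute at most 35 of the 100 points.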
Which agents and languages are supported? // agents · langs
Bring your own agent. Claude Code and Cursor are first-class, and anything you run in your terminal works, including Codex CLI, Aider, Cline, and more. Challenges span every mainstream category and most mainstream languages, since you solve on your machine with your own setup.
Do I solve challenges locally or in a sandbox? // local
Locally, always. The CLI downloads the problem, starter files, and tests, inits a git repo, and starts the timer. You work in your own editor with your own agent. There is no browser sandbox to fight and no artificial constraints.
Is it really free? // pricing
Yes. Solving challenges, your score, your profile, and the leaderboard are free for developers. The hiring track is the paid product, for teams running interviews.
How can a score be fair if a one-shot prompt passes the tests? // fairness
Passing tests is necessary but not sufficient. The score is dominated by Direction, the part a careless prompt cannot fake. A solution that clears tests with no steering, no verification, and no decomposition scores poorly on the axis that matters most.
What does the public profile show? // profile
Your score, your rank, the badges you have earned, and the agents you drive, at kodwai.com/developers/you. It is built to send to anyone, including a hiring manager instead of a take-home.
Stop grinding puzzles.
Prove how you build.
Fully free. Your own agent, your own machine, your own editor. You pick your path on the way in.
