🥋 Katabench

Your code runs on our servers.
Here's exactly what happens.

No magic and no hand-waving: every submission takes the same journey, with multiple independent safety layers along the way. That's also why the metrics are trustworthy: they're measured on the server, next to the code, where the browser can't fake them.

The pipeline

The journey of a submission

From your keystroke to a graded report: five stops, each one assuming the previous could be wrong.

✍️ 01

Write

C# in a full editor, right in your browser. Run the samples as often as you like.

🛡️ 02

Screen

Every submission is checked before anything runs. Hostile code never executes.

⚙️ 03

Compile

Built server-side with the same compiler you use locally, with full diagnostics on errors.

📦 04

Run, sealed

Executed in a disposable, fully isolated sandbox. One per submission.

📊 05

Measure & grade

Time, memory, and correctness, all measured on the server, where they can't be faked.

Seconds, end to end: the snippet goes out, the graded report comes back, with per-test timings, budgets, memory, and diffs.
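To make "measured on the server" concrete, here is a minimal C# sketch of timing one test and counting the bytes it allocates. It is an illustration only, not Katabench's actual harness: `MeasureSketch`, `Run`, and the budget parameter are hypothetical names, and the sketch only checks the budget after the test finishes rather than stopping a test that runs too long.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, for illustration only.
public static class MeasureSketch
{
    // Runs one test and reports elapsed time, bytes allocated on this thread,
    // and whether the elapsed time stayed within the given budget.
    public static (TimeSpan Elapsed, long AllocatedBytes, bool WithinBudget) Run(
        Action test, TimeSpan timeBudget)
    {
        long allocatedBefore = GC.GetAllocatedBytesForCurrentThread();
        Stopwatch clock = Stopwatch.StartNew();

        test();

        clock.Stop();
        long allocated = GC.GetAllocatedBytesForCurrentThread() - allocatedBefore;

        // The budget is checked after the fact; a real grader would also
        // need a way to stop a test that never returns.
        return (clock.Elapsed, allocated, clock.Elapsed <= timeBudget);
    }
}
```

Because the numbers are taken in the same process that runs the code, nothing the browser sends can change them.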

Containment

Inside the sandbox

We treat every submission as hostile, including yours. That's a feature: the same walls that contain malicious code make the grading fair and the timings clean.

our execution environment
containment log

$ submit Solution.cs

▸ sandbox spawned    fresh · single-use

# security hardening, applied to every run

✓ network isolated

✓ file system read-only

✓ cpu · memory · processes capped

▸ tests complete    148 ms

▸ sandbox destroyed    nothing persists

This exact lifecycle runs for every submission: spawn, seal, run, measure, destroy. Your code never runs on a long-lived, shared server.

The grading

How each track grades you

Chart: run time plotted against input size, with a fixed time budget. O(n²): timeout ✗. O(n): passes ✓.
Both solutions are correct. Only one survives the hidden suite.

Track 01

⚡ Algorithms

Correct gets you halfway. The hidden suite scales the input until complexity decides the outcome.

  • Visible sample cases plus a hidden suite with inputs large enough that complexity decides the outcome.
  • Per-test time budgets: correct-but-slow times out where the efficient solution passes with room to spare.
  • Allocation tracking on every run, and on some puzzles a hard allocation budget, where the copy-and-reverse approach fails and the in-place one passes.
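To illustrate the two distinctions above, here is a small C# sketch with two pairs of methods that return the same answers at very different cost. The puzzles themselves are not shown on this page, so the duplicate-detection example and every name here (`ComplexitySketch` and its methods) are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical examples, for illustration only.
public static class ComplexitySketch
{
    // O(n^2): compares every pair. Correct, but the work grows quadratically.
    public static bool HasDuplicateQuadratic(int[] values)
    {
        for (int i = 0; i < values.Length; i++)
        {
            for (int j = i + 1; j < values.Length; j++)
            {
                if (values[i] == values[j]) return true;
            }
        }
        return false;
    }

    // O(n) expected: a single pass that remembers what it has already seen.
    public static bool HasDuplicateLinear(int[] values)
    {
        var seen = new HashSet<int>();
        foreach (int value in values)
        {
            // Add returns false when the value is already in the set.
            if (!seen.Add(value)) return true;
        }
        return false;
    }

    // Copy-and-reverse: allocates a second array as large as the input.
    public static int[] ReverseCopy(int[] values)
    {
        var result = new int[values.Length];
        for (int i = 0; i < values.Length; i++)
        {
            result[values.Length - 1 - i] = values[i];
        }
        return result;
    }

    // In-place: swaps from both ends toward the middle, no extra array.
    public static void ReverseInPlace(int[] values)
    {
        for (int lo = 0, hi = values.Length - 1; lo < hi; lo++, hi--)
        {
            (values[lo], values[hi]) = (values[hi], values[lo]);
        }
    }
}
```

On a handful of sample values both versions in each pair behave identically. The difference only shows once the input is large or allocations are counted.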

The N+1 starter

… ×41 round-trips

✗ Timeout, over budget

The set-based rewrite

1 round-trip

✓ Passed, 12 ms

Same rows returned. The grader shows you what each approach cost.

Track 02

🗄️ Database / EF

Your LINQ runs against a real database, and the grader shows you what it really cost.

  • Enough data that inefficiency shows up in the timings, not just in code review.
  • Expand any test to see the queries your code actually produced, and what they cost.
  • Plan-graded puzzles capture the execution plan the engine chose and grade it: full-table reads flagged in red, index usage in green. The right rows the wrong way fails.
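The difference between an N+1 access pattern and a set-based one can be sketched without a real database. The C# below uses a hypothetical in-memory `CountingStore` that counts one round-trip per query call; it stands in for a database and is not EF or Katabench's grader. The customer and order data are made up for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a database: each Query* call is one round-trip.
public sealed class CountingStore
{
    private readonly List<(int Id, string Name)> _customers;
    private readonly List<(int CustomerId, decimal Total)> _orders;

    public int RoundTrips { get; private set; }

    public CountingStore(int customerCount)
    {
        _customers = Enumerable.Range(1, customerCount)
            .Select(i => (Id: i, Name: "customer-" + i))
            .ToList();
        _orders = _customers
            .Select(c => (CustomerId: c.Id, Total: 10m * c.Id))
            .ToList();
    }

    public List<(int Id, string Name)> QueryCustomers()
    {
        RoundTrips++;
        return _customers.ToList();
    }

    public List<decimal> QueryOrderTotalsFor(int customerId)
    {
        RoundTrips++;
        return _orders.Where(o => o.CustomerId == customerId)
                      .Select(o => o.Total)
                      .ToList();
    }

    public Dictionary<int, decimal> QueryTotalsByCustomer()
    {
        RoundTrips++;
        return _orders.GroupBy(o => o.CustomerId)
                      .ToDictionary(g => g.Key, g => g.Sum(o => o.Total));
    }
}

public static class NPlusOneSketch
{
    // N+1: one query for the customers, then one more query per customer.
    public static Dictionary<int, decimal> TotalsNPlusOne(CountingStore store)
    {
        var totals = new Dictionary<int, decimal>();
        foreach (var customer in store.QueryCustomers())
        {
            totals[customer.Id] = store.QueryOrderTotalsFor(customer.Id).Sum();
        }
        return totals;
    }

    // Set-based: a single grouped query returns every total at once.
    public static Dictionary<int, decimal> TotalsSetBased(CountingStore store)
    {
        return store.QueryTotalsByCustomer();
    }
}
```

With 40 customers in this toy store, the loop makes 41 round-trips and the grouped query makes 1, and both return the same totals.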

method length

limit 25 lines

the starter: 78 ✗ · yours: 18 ✓

cyclomatic complexity

limit 8

the starter: 14 ✗ · yours: 6 ✓

nesting depth

limit 3 levels

the starter: 5 ✗ · yours: 2 ✓

✓ 9 / 9 tests still passing: behavior never changed

Measured directly from your source. The tests stay green; the mess has to go.

Track 03

🧹 Refactoring

Working code you'd hate to inherit. Make it clean, without changing what it does.

  • The behavioral tests pass before you touch anything, and they must still pass when you're done.
  • Structural gates measured from your source: method length, cyclomatic complexity, nesting depth, duplicate blocks.
  • Every flavor of real-world mess (tangled conditionals, god methods, copy-paste blocks, arrow code), including the famous Gilded Rose kata.
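As a small example of the kind of change these gates reward, the C# sketch below flattens "arrow code" into guard clauses. The discount rule and the names are invented for illustration and do not come from any Katabench puzzle.

```csharp
using System;

// Hypothetical example, for illustration only.
public static class GuardClauseSketch
{
    // Arrow code: each condition nests inside the previous one (depth 3).
    public static decimal DiscountNested(bool isMember, int years, decimal total)
    {
        decimal discount = 0m;
        if (isMember)
        {
            if (years >= 2)
            {
                if (total >= 100m)
                {
                    discount = total * 0.10m;
                }
            }
        }
        return discount;
    }

    // Guard clauses: the same rule with every condition at depth 1.
    public static decimal DiscountFlat(bool isMember, int years, decimal total)
    {
        if (!isMember) return 0m;
        if (years < 2) return 0m;
        if (total < 100m) return 0m;
        return total * 0.10m;
    }
}
```

Both methods return the same value for every input, so behavioral tests cannot tell them apart. Note that this change lowers nesting depth only: both versions still contain three decisions, so cyclomatic complexity is unchanged and needs a different refactoring.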

Infrastructure

database, email, HTTP

Application

use cases

Domain

pure: depends on nothing

Domain code reaching for the database fails the grade, instantly, on every submission.

Track 04

🏛️ Architecture

Multi-file refactoring katas, graded on behavior and design together.

  • Dependency direction, layering, and abstraction boundaries are verified automatically.
  • A wrong dependency fails the submission the same way a failing test does.
  • Feedback in seconds, not in a code review three weeks later.
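One common way to keep dependencies pointing inward is dependency inversion: the Domain layer declares an interface, and the Infrastructure layer implements it. The C# sketch below shows that shape in a single file, using namespaces to stand in for the three layers. All type names are hypothetical, and how Katabench itself checks the rule is not shown here.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

namespace Domain
{
    // The Domain layer references nothing outside itself.
    public sealed class Order
    {
        public Order(int id, decimal total) { Id = id; Total = total; }
        public int Id { get; }
        public decimal Total { get; }
    }

    // The abstraction lives in Domain; outer layers depend on it.
    public interface IOrderRepository
    {
        Order? Find(int id);
    }
}

namespace Application
{
    using Domain;

    // A use case: depends on Domain only, never on Infrastructure.
    public sealed class GetOrderTotal
    {
        private readonly IOrderRepository _orders;

        public GetOrderTotal(IOrderRepository orders) => _orders = orders;

        public decimal? Execute(int id) => _orders.Find(id)?.Total;
    }
}

namespace Infrastructure
{
    using Domain;

    // The outermost layer implements the Domain interface.
    // An in-memory dictionary stands in for a real database here.
    public sealed class InMemoryOrderRepository : IOrderRepository
    {
        private readonly Dictionary<int, Order> _rows = new Dictionary<int, Order>
        {
            [1] = new Order(1, 42.50m),
        };

        public Order? Find(int id) => _rows.TryGetValue(id, out Order? order) ? order : null;
    }
}
```

A `using Infrastructure;` line inside the `Domain` namespace would be the kind of wrong-direction dependency this track is described as failing.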

The adversarial suite

../../etc/passwd

path traversal

✓ denied

evil-example.com

suffix confusion

✓ rejected

aaaaaaaaaaaaaaaaaaaaaa!

ReDoS

✓ 0.2 ms, no hang

ada\n[INFO] admin granted

log forging

✓ escaped

✓ 6 / 6 functional | ✓ 4 / 4 exploits blocked
Two suites, one grade: close the hole without breaking the feature.

Track 05

🛡️ Secure Coding

Functionally correct, quietly exploitable. Close the hole without breaking the feature.

  • Two suites grade every submission: functional tests prove the feature still works, adversarial tests throw real attack payloads at it.
  • The payloads are the classics that hit production systems: path traversal, Zip Slip, log forging, injection, suffix confusion.
  • Resource-exhaustion attacks are graded with real budgets: a ReDoS input must return in milliseconds, a decompression bomb must be rejected within a memory cap.
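For a sense of what closing these holes can look like, here is a C# sketch of one common defense for four of the payload classes above. These are illustrative choices, not the reference solutions to any puzzle, and `SecureSketch` and its method names are hypothetical. The path check does not resolve symbolic links, and a regex timeout is only one option against ReDoS; rewriting the pattern to avoid nested quantifiers is another.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helpers, for illustration only.
public static class SecureSketch
{
    // Path traversal: resolve the full path, then require it to stay under the root.
    // The same check applies to archive entry names (Zip Slip).
    public static bool IsInsideRoot(string root, string userPath)
    {
        string separator = Path.DirectorySeparatorChar.ToString();
        string fullRoot = Path.GetFullPath(root);
        if (!fullRoot.EndsWith(separator, StringComparison.Ordinal))
        {
            fullRoot += separator;
        }
        string candidate = Path.GetFullPath(Path.Combine(fullRoot, userPath));
        return candidate.StartsWith(fullRoot, StringComparison.Ordinal);
    }

    // Suffix confusion: match the exact domain or a true subdomain,
    // so "evil-example.com" does not pass as "example.com".
    public static bool IsAllowedHost(string host, string allowedDomain)
    {
        return host.Equals(allowedDomain, StringComparison.OrdinalIgnoreCase)
            || host.EndsWith("." + allowedDomain, StringComparison.OrdinalIgnoreCase);
    }

    // Log forging: escape line breaks so user input cannot start a new log line.
    public static string SanitizeForLog(string value)
    {
        return value.Replace("\r", "\\r").Replace("\n", "\\n");
    }

    // ReDoS: give the regex engine a time budget and treat a timeout as "no match".
    public static bool MatchesWithinBudget(string pattern, string input, TimeSpan budget)
    {
        try
        {
            return Regex.IsMatch(input, pattern, RegexOptions.None, budget);
        }
        catch (RegexMatchTimeoutException)
        {
            return false;
        }
    }
}
```

Each helper still has to let legitimate input through: a file under the root, a real subdomain, and an ordinary user name must all keep working, which is what the functional suite checks.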

Your code stays yours

Submissions are stored so you can see your own history and progress. That's it. We don't publish them, and the sandbox they ran in is gone seconds after they finish. The details live in the Privacy Policy.

See the grading for yourself

One puzzle is all it takes to understand why server-measured beats green checkmarks.

Start solving: it's free