On the design · 8 min read · 2026-05-20

Show your work: why we reward effort, not just correct answers.

Most math software gives credit only when the final answer is right. Koda gives credit for the work along the way — the regrouping mark, the re-tried step, the diagram drawn in the margin. Here's the XP design, and why we kept landing there.

The decision, in one paragraph.

Koda's XP table awards points for showing work, attempting a step, drawing a diagram, and re-trying after a slip — in addition to the points for a correct final answer. The correct-answer bonus is there, but it isn't the largest line item. The reason is simple: in elementary math, the thing we want kids to get better at is the process — the regrouping, the alignment, the re-reading of the word problem, the willingness to draw a picture instead of guessing. The correct answer is the artifact of doing the process well. If we only rewarded the artifact, we'd teach kids that the artifact is the point, which is the exact mistake we're trying to walk back.

What the XP table actually contains.

The rough shape of the per-problem XP table, for a typical 4th-grade arithmetic problem. Numbers shift a little by grade and problem type, but the structure is the same:

  • Wrote a step. The child made a visible move on paper — a number, an arrow, a partial sum, an underlined word in the problem. +2 XP.
  • Drew a representation. A fraction circle, a number line, a bar diagram, a tally. Even a rough one. +3 XP.
  • Tried a step that turned out to be wrong, then changed it. The slip-and-recover pattern we most want to reinforce. +4 XP.
  • Asked for a hint and used it. The child climbed the hint ladder and the next move improved. +2 XP.
  • Finished the problem. Regardless of whether the final number was right. The child saw it through. +3 XP.
  • Final answer correct. +3 XP.

A child who works carefully and gets it right earns roughly 14–17 XP. A child who works carefully and slips on the final number — wrote a 7 where they meant a 9 — still earns around 11–14. A child who guesses, writes nothing, and happens to land the right answer gets the 3-point answer bonus and nothing else. That's deliberate. The shape of the table is the lesson.
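The arithmetic behind those three cases can be checked with a few lines of Python. This is a minimal sketch, not Koda's scoring code: the event names are made up for illustration, the XP values are copied from the table above, and it assumes each occurrence of an event earns its line item (two written steps earn +2 twice).

```python
# XP values copied from the per-problem table above.
# The event names are hypothetical labels chosen for this sketch.
XP_TABLE = {
    "wrote_step": 2,
    "drew_representation": 3,
    "tried_and_changed": 4,
    "used_hint": 2,
    "finished": 3,
    "answer_correct": 3,
}


def problem_xp(events: list[str]) -> int:
    """Sum XP for one problem, counting every occurrence of an event (assumption)."""
    return sum(XP_TABLE[event] for event in events)


# Careful work, correct answer: two steps, a diagram, one slip-and-fix.
careful_and_right = [
    "wrote_step",
    "wrote_step",
    "drew_representation",
    "tried_and_changed",
    "finished",
    "answer_correct",
]
# Same work, but the final number is wrong.
careful_but_slipped = careful_and_right[:-1]
# Nothing written, right answer by luck.
lucky_guess = ["answer_correct"]

print(problem_xp(careful_and_right))    # 17
print(problem_xp(careful_but_slipped))  # 14
print(problem_xp(lucky_guess))          # 3
```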

We're calibrating a per-problem ceiling so a 12-step problem can't outscore a thoughtful 3-step one by an order of magnitude. The current ceiling sits at roughly 25 XP per problem and we'll adjust it as we see what genuinely 10-step problems look like in the wild. The cap is there so step-padding stops paying off before the child notices it's a strategy at all.
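As a rough illustration of what the ceiling does, here is a sketch that clamps a raw total at 25 XP. The function name and the two example problems are assumptions for this sketch; the +2, +3, and +4 values come from the table above, and the cap value is the "roughly 25" quoted in the text.

```python
PER_PROBLEM_CAP = 25  # "roughly 25 XP" in the text; expected to be tuned


def capped_xp(raw_xp: int, cap: int = PER_PROBLEM_CAP) -> int:
    """Clamp a problem's raw XP total at the per-problem ceiling."""
    return min(raw_xp, cap)


# A padded problem: 12 written steps at +2 each, then finished (+3) and correct (+3).
padded_raw = 12 * 2 + 3 + 3
# A thoughtful problem: 3 steps, a diagram (+3), a slip-and-fix (+4), finished, correct.
thoughtful_raw = 3 * 2 + 3 + 4 + 3 + 3

print(padded_raw, capped_xp(padded_raw))          # 30 25
print(thoughtful_raw, capped_xp(thoughtful_raw))  # 19 19
```

Under these assumptions the padded problem ends up only a few points ahead of the thoughtful one, which is the behavior the cap is meant to produce.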

Why the answer-only default is the weakest of the common ones.

Most kid-math apps reward the right answer and only the right answer. The reason is operational, not pedagogical — grading a typed number is cheap; reading the work is hard. But the convenience produces a real artifact: kids learn to optimize for the measured thing. If you measure the final answer, you'll see more guessing, more skipping the messy middle, more "I just knew it." The rational move for the kid is to stop showing the work — there's no reward for it.

The standard classroom response — "show your work or you lose half credit" — is the analog version of what we're doing digitally. The reason elementary teachers insist on shown work is that the work is where the cognitive heavy-lifting happens. Koda's overhead camera lets us read the work; the XP table lets us give credit for it.

What the literature points at.

Two strands of evidence shape this. We name them honestly, including the limits.

Process praise vs. person praise. Mueller & Dweck (1998) ran the original studies: 5th graders who were praised for effort after a task picked harder follow-up problems, persisted longer after failure, and reported more enjoyment than children praised for ability. The direction has been replicated for praise specifically (Henderlong & Lepper (2002) review the adjacent literature carefully), but the broader growth-mindset claim has had a rougher time. Sisk et al. (2018), the first big meta-analysis, found weak overall effects across school interventions. Macnamara & Burgoyne (2023) were more skeptical still. The cleanest positive result, Yeager et al. (2019) in Nature, found the effect concentrated in lower-achieving 9th graders in supportive school contexts, not the universal cognitive intervention the early press suggested. The honest read: the process-praise finding has been less reliable than we'd hoped, and we can't lean on it the way articles from 2010 used to. What survives is simpler: in operant terms, what you reward, you get more of. We picked process for the same reason a 4th-grade teacher does: it's the part we want them to get better at.

Reinforcement schedules, used carefully. A second framing comes from operant conditioning, but we want to be careful about the analogy. Effort-XP gives a child something close to continuous reinforcement: a visible move forward (a step written, a diagram drawn, a hint used well) reliably gets credit. Answer-only grading does the opposite. It provides no reinforcement at all for problem-solving moves; the vacuum is what invites guessing as a strategy. And once a kid is guessing, the eventual correct guess lands on what is — for that kid, in that moment — an intermittent schedule, which is part of why guessing is sticky once it starts. The recent problem-gambling literature (Delfabbro and colleagues, 2023, among others) warns against pulling animal-schedule results directly into classroom claims, and we agree; we're not claiming the schedule literature proves the effort-XP design is correct. We're claiming it explains a mechanism — why "just guess" feels worth trying once the environment has stopped giving credit for thinking.

How Koda reads the work (and where this stands today).

We can't give credit for what we can't see. The whole design depends on the overhead camera plus the handwriting pipeline — and that loop is the next big milestone on the build plan, not in the production session UI today. Today's build collects the problem and answer through typed inputs while the handwriting OCR and capture path are in flight. When the paper loop ships: an overhead camera looks down at the paper at about 15 frames per second during a session; the on-device handwriting model classifies each new mark (digit, operator, scratch-out, diagram, annotation); a state machine in the session manager watches for the patterns we reward (“wrote a step,” “drew a representation,” “tried-and-changed”) and appends an event to the log. The XP rollup is computed by scanning the log at the end of each problem — so what the child sees is a summary on completion, not a live ticker incrementing in real time. That's deliberate: a live counter is the kind of thing a kid optimizes against move by move, and we'd rather not invite it.
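The log-then-rollup shape described above might look something like the sketch below. Everything here is hypothetical: the class and field names, the event labels, and the summary format are assumptions for illustration, and the paper loop that would feed the log has not shipped. The point it shows is that XP is computed by scanning the log once a problem ends, not incremented as each mark arrives.

```python
from dataclasses import dataclass, field

# XP values from the per-problem table; event names are hypothetical.
XP_TABLE = {
    "wrote_step": 2,
    "drew_representation": 3,
    "tried_and_changed": 4,
    "used_hint": 2,
    "finished": 3,
    "answer_correct": 3,
}
PER_PROBLEM_CAP = 25  # the "roughly 25 XP" ceiling mentioned in the text


@dataclass(frozen=True)
class Event:
    problem_id: str
    kind: str


@dataclass
class SessionLog:
    events: list[Event] = field(default_factory=list)

    def append(self, problem_id: str, kind: str) -> None:
        """Record a rewarded pattern; no XP is computed at this point."""
        self.events.append(Event(problem_id, kind))

    def rollup(self, problem_id: str) -> dict:
        """Scan the log for one finished problem and build its summary."""
        counts: dict[str, int] = {}
        for event in self.events:
            if event.problem_id == problem_id and event.kind in XP_TABLE:
                counts[event.kind] = counts.get(event.kind, 0) + 1
        raw = sum(XP_TABLE[kind] * n for kind, n in counts.items())
        return {"counts": counts, "xp": min(raw, PER_PROBLEM_CAP)}


log = SessionLog()
for kind in ["wrote_step", "wrote_step", "drew_representation", "finished"]:
    log.append("p1", kind)

summary = log.rollup("p1")
print(summary["xp"])  # 10
```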

We wrote a separate note on why we watch paper instead of a tablet; it overlaps with this post in that paper is what makes effort visible at all.

The kids for whom “show your work” is a tax.

Any system that gives credit for visible work risks penalizing the kids for whom externalizing thought is the hardest part of the problem. A child with dysgraphia who can solve the long division cleanly in their head but seizes up at the act of writing it down. A child with ADHD who solved it ten seconds ago and now can't get themselves to put it on paper before the moment passes. An autistic mental calculator who skips intermediate steps because they don't need them — and who, under an effort-XP rule, looks like the kid who guessed. Fine-motor fatigue near the end of a worksheet. We've watched this failure mode in our own household and it's the design risk we worried about most.

How Koda handles it: each child profile carries a set of parent-controllable supports. A voice-answer path is on the roadmap for children flagged for dysgraphia: when the paper loop is live and handwriting confidence is low, Koda will ask the child to say what they're thinking, and that voice trace will count for the same XP that written steps would. (The spoken-answer pipeline is designed and on the workbench; the speech-recognition piece isn't implemented yet.) For now, the typed-input session view is the fallback. The hint ladder is configurable per child. And the policy we hold firm on: if Koda can't read the work for a given attempt, the message to the child is never silent and never punitive. They always get credit for showing up, and the gap goes to the parent portal, not to the kid's XP counter.

None of this fixes the underlying difficulty. It just stops us from compounding it. The rule we wrote on the wall when we started: visible work is the preferred signal of effort, not the only signal we'll accept.

What we deliberately don't pay for.

A few categories of "effort" earn no XP, and we want to be honest about why.

  • Time-on-page. Sitting in front of a worksheet for 20 minutes isn't effort if the pencil never moved. Some apps reward minutes; we don't.
  • Repeated identical guesses. Writing "47" three times in the answer slot isn't three tries. The state machine collapses these into a single attempt.
  • Compliance with the hint. If the tutor says "draw the fraction as a circle" and the child draws a perfect circle that has nothing to do with the problem, no XP. The hint has to actually move the work forward.
  • Volume of marks. Filling the page with random numbers earns no points. We give credit for specific patterns, not for ink consumption.
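The repeated-guess rule in the list above is the easiest of these to sketch. The version below is an assumption about how the collapsing could work, not a description of Koda's state machine: it treats only back-to-back identical guesses as one attempt, and the function name is invented for the example.

```python
from itertools import groupby


def count_attempts(guesses: list[str]) -> int:
    """Count runs of consecutive identical guesses as a single attempt each."""
    # groupby starts a new group whenever the (whitespace-stripped) guess changes.
    return sum(1 for _ in groupby(guesses, key=str.strip))


print(count_attempts(["47", "47", "47"]))  # 1
print(count_attempts(["47", "47", "52"]))  # 2
print(count_attempts(["47", "52", "47"]))  # 3
```

Whether a guess that returns after a different one (the third case) should count as a fresh attempt is a design choice this sketch makes arbitrarily.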

The honest counter-arguments.

"Won't kids just learn to game the effort metric?"Some will, sometimes. The research on intelligent tutoring systems is pretty clear that "gaming the system" — going through the motions without engaging cognitively — happens and is correlated with worse learning (Baker, Corbett, Koedinger & Wagner, 2004). The design has to take it seriously. Three concrete anti-gaming choices we made: (1) a per-problem XP cap, currently around 25 XP, so step-padding stops paying off quickly; (2) mastery-aware difficulty — staying on problems the child has already shown they can do doesn't level them up, so “farm the easy stuff” isn't a path; (3) pattern detection on the slip-and-fix bonus specifically (it's the highest single line item, +4 XP, so it's the most attractive thing to game) — when slip-and-fix appears with mechanical regularity, identical structure across consecutive problems, the bonus quietly turns off for that session and the persona pivots: “Hey, you're doing a lot of erasing — want to talk through this one out loud?” Most 8-to-10-year-olds aren't running an optimization loop on the XP table; they're trying to figure out fractions. But the kids who dopattern-match systems deserve a design that takes the pattern seriously and responds, not one that pretends they don't exist.

"Aren't you teaching kids that being wrong is fine?" Being wrong on the way to right is fine. Being wrong at the end is information — the problem isn't done. The XP table doesn't flatten the difference; the final-answer bonus is real, and the system distinguishes "you finished" from "you finished correctly." What it won't do is treat the slip as a punishment. The slip is the thing the kid is supposed to learn from.

"Doesn't this just convert one extrinsic reward into a different extrinsic reward?" Yes. XP is extrinsic, and the classic finding here — Lepper, Greene & Nisbett (1973), the overjustification effect — is that adding extrinsic rewards to an activity a child already finds interesting can reduce intrinsic motivation, even after the rewards stop. That's a real risk and we don't want to wave it away. Our no-streaks post names the same concern outright; we want this post to match its honesty, not soften it. Two things give us some footing. First, the kindof behavior being rewarded matters — Henderlong & Lepper (2002) and the broader self-determination literature suggest that rewarding behaviors that are themselves the learning (drawing a representation, re-trying a step) crowds out intrinsic motivation less than rewarding pure performance outcomes does. Effort-XP is aimed squarely at the former. Second, we'll watch for the warning signs in the data we have: a child who stops showing work when the XP counter is hidden, a child who needs the counter visible to keep going, a child who picks easier problems instead of harder ones as a session progresses. Any of those is a signal that the reward is doing the wrong thing, and the design should flex.

What would change our minds.

A few things would push us off the effort-based design:

  • Field data from Koda families showing that kids on the effort-based table don't actually show more work than a control on an answer-only table. We'll start to have evidence within six months of shipping.
  • A finding that the effort categories we reward don't predict conceptual-understanding gains. If "drew a representation" turns out to be ink the child produces to earn XP rather than a tool they're thinking with, we revisit the line item.
  • Strong evidence of gaming — kids reliably synthesizing the rewarded patterns without the underlying cognitive moves. The session-level pattern detection is the early-warning system; if it fires often, the design is wrong.
  • Crowding-out signals: a kid whose effort drops when the XP counter is hidden, or who systematically picks easier problems to earn faster. The overjustification literature predicts these; if we see them, the reward shape needs to change.

One last thing.

The reason we built the camera, watched the paper, ran the handwriting model — that infrastructure — was so we could give the child credit for the work, not just the answer. The XP table is where that decision shows up in product. It isn't a gamification layer bolted on top; it's the part where the architecture and the pedagogy meet. We'd rather a child finish a session having earned points for three patient attempts on a hard problem they didn't quite land than for five quick right answers they guessed. The first kid is learning math. The second is learning that guessing works — which is a habit that has a half-life, and the half-life isn't long.

If you'd like to know when Koda ships, the waitlist is here. The related notes on why we don't use streaks and why we watch paper cover the rest of the quiet-by-default reward design.