On the pedagogy · 9 min read · 2026-05-20

How Koda decides when NOT to interrupt.

We've written about the triggers that make Koda speak. The harder design problem is the opposite one: how to stay quiet long enough for the kid to do the thinking, when every product instinct in adtech-shaped software pushes the other way.

The companion piece.

A few weeks ago we wrote about how Koda decides when to interrupt: the triggers, the rungs, the guardrails. That post answered the mechanism question. This post answers the prior one: why is restraint the design goal in the first place? If you don't have a strong answer to that, the model's defaults will fill the silence, and what they fill it with will almost always be worse than the silence was.

Most tutors over-talk. The research has known this for forty years.

Mary Budd Rowe's 1986 review in the Journal of Teacher Education is the canonical citation. Rowe timed thousands of classroom exchanges and found that teachers typically waited under one second after asking a question before rephrasing, calling on someone else, or answering it themselves. When teachers were trained to wait three seconds, student response length increased, unsolicited-but-appropriate responses went up, and the share of children who never spoke went down. The paper's title is the counter-intuitive finding: slowing down may be a way of speeding up.

Rowe's study is forty years old and framed around human teachers; the underlying mechanism — that the cost of an early interruption is higher than it looks because it pre-empts the kid's own next move — applies more, not less, to a tutor that can speak at the speed of token-generation. The thing an LLM is best at is filling silence. The thing a struggling child needs most is to be allowed to break the silence themselves.

Productive struggle is a real cognitive mechanism.

The phrase “productive struggle” gets used loosely; the cognitive case for it is more specific than the slogan suggests. Two threads in the research are worth naming.

Desirable difficulty (Bjork & Bjork, 2011). The Bjorks summarize decades of laboratory work showing that conditions which slow acquisition during practice often improve long-term retention and transfer. The everyday version: a child who derives 7 × 8 from 7 × 4 in four effortful seconds is doing more durable cognitive work than a child handed “fifty-six” and moved on. Friction during encoding is part of what makes the encoding stick. A tutor that removes the friction makes the learning easier in the moment and worse over the week. The Bjorks are explicit about the catch, though: a desirable difficulty is only desirable if the learner already has enough background knowledge to engage with it. Strip away the prerequisites and it's just difficulty, which is exactly why Koda's restraint has to be calibrated to what a kid already knows, not applied uniformly.

Productive failure (Kapur, 2008). Manu Kapur gave students problems just beyond their reach before any instruction; they mostly failed, then received the instruction the rest of the class got from the start. The productive-failure group outperformed the direct-instruction group on transfer items. Kapur's reading is that struggling to invent half-formed strategies primes the conceptual structure the instruction lands on.

Here's the boundary condition we have to be honest about: that 2008 study was eleventh-grade Newtonian kinematics, not K-5 arithmetic — and the productive-failure evidence is strongest for older students who already carry a lot of background knowledge. For younger learners the picture is murkier. The meta-analytic read (Sinha & Kapur, 2021) actually leans toward some instruction first in the early grades. So Koda's restraint is deliberately bounded: short productive struggle for a kid who has the prerequisites, not open-ended floundering — and never a substitute for teaching a child something they've never seen. The literature tells us where this lever works and where it doesn't, and we'd rather build to its limits than past them.

Barak Rosenshine's 2012 synthesis pulls the other way. Rosenshine argues for high success rates — his target is roughly 80% correct during guided practice — and warns against letting students flounder, because frustration eats motivation and frustration-encoded errors are hard to unteach. Both can be true. There is a window — somewhere between “immediately bailed out” and “left to struggle past the point of any purchase” — where the difficulty is productive. The tutor's job is to find it, and that window is exactly what Koda's success-rate thresholds (below) are trying to bracket. Most edtech defaults are too quick to bail the kid out.

What Koda actually does (today, not someday).

Some of the restraint we believe in is wired up. Some of it is still aspirational. The honest version, mapped onto shipping code:

Exam mode is silent, and worth being precise about. In the prompt layer, exam mode instructs the tutor to stay quiet: when the active session is in exam mode, the mode-guidance line tells the language model to emit an empty primary string (_mode_guidance in tutor_prompt.py). That's the design intent. But here are the honest mechanics: today's default build doesn't wire an LLM tutor into the session manager at all (create_app in app.py constructs the SessionManager with no tutor_client), so a hint request returns a fixed fallback regardless of mode (_generate_hint in session/manager.py). In other words, exam sessions are silent for a more basic reason than the pedagogy: there's no generative tutor running yet. The pedagogy and the current default happen to agree; we want to be clear it's the prompt-layer rule, not a runtime exam-detector, that encodes the intent. The dyscalculia support flag in the portal (accommodations.py) stores the preference that exam mode should default off for those children, the intent being that the default is the mode where Koda gives the most help, not the least. The flag stores that preference today; the runtime auto-routing that enforces it at session start lands in a future software update (see what dyscalculia actually is).

One hint per attempt, not a paragraph of them. The session manager generates one primary hint per submitted answer, then stops. The tutor prompt's response schema instructs the model to keep that primary line to twenty words or fewer, and the system prompt forbids revealing the final answer when the child is wrong (src/koda/cognition/tutor_prompt.py). To be precise, the twenty-word bound is an instruction in the prompt, not a parser that truncates output, but it's the load-bearing rule against monologuing. The most common failure mode in chat-style tutors is stacking three explanations in one turn because the model is “being helpful.” Twenty words is most of what helpful looks like.

Scaffolding depth adapts to the child's rolling success rate. Before each hint, the session manager picks a learning depth (SCAFFOLD, NEUTRAL, or TERSE) from recent first-try outcomes. After five samples at 80% or better, depth flips to TERSE: the model is told to skip warm-up language and point at one thing to re-check in as few words as possible (_DEPTH_GUIDANCE in tutor_prompt.py, _pick_learning_depth in session/manager.py). A kid who is doing well gets less tutoring, on purpose.
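For readers who think in code, here is a minimal sketch of that depth rule. It is an illustration under stated assumptions, not the contents of session/manager.py: the five-sample minimum and the 80% TERSE cutoff come from the paragraph above, while the window size, the SCAFFOLD cutoff, the NEUTRAL fallback for thin data, and every name other than SCAFFOLD, NEUTRAL, and TERSE are hypothetical.

```python
from enum import Enum


class LearningDepth(Enum):
    SCAFFOLD = "scaffold"
    NEUTRAL = "neutral"
    TERSE = "terse"


MIN_SAMPLES = 5        # from the post: "after five samples"
TERSE_RATE = 0.80      # from the post: "80% or better"
SCAFFOLD_RATE = 0.50   # ASSUMPTION: the post states no SCAFFOLD cutoff
WINDOW = 10            # ASSUMPTION: how many recent outcomes count as "rolling"


def pick_learning_depth(first_try_outcomes):
    """Pick a hint depth from recent first-try outcomes (True = correct)."""
    recent = list(first_try_outcomes)[-WINDOW:]
    if len(recent) < MIN_SAMPLES:
        # ASSUMPTION: with too little data, stay in the middle setting.
        return LearningDepth.NEUTRAL
    rate = sum(recent) / len(recent)
    if rate >= TERSE_RATE:
        return LearningDepth.TERSE
    if rate < SCAFFOLD_RATE:
        return LearningDepth.SCAFFOLD
    return LearningDepth.NEUTRAL
```

In this sketch, a child with four correct first tries out of five lands exactly on the 80% line and gets TERSE; a child with only four samples stays NEUTRAL no matter how well they did.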

Fail-streak threshold of three. When a child misses problems carrying the same topic tag three times in a row, Koda surfaces a remediation video: not on the first miss, not on the second. The threshold is a deliberate productive-struggle knob: earlier “interrupts productive struggle, later leaves the kid stuck,” per the comment above the constant (VIDEO_RECOMMEND_FAIL_STREAK in session/manager.py). The number is editable; the principle is not.
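A rough sketch of how a per-topic fail streak could be tracked. Only the constant name VIDEO_RECOMMEND_FAIL_STREAK and its value of three come from the post; the FailStreakTracker class, its method names, the choice to keep an independent streak per topic tag, and the reset after a recommendation are all assumptions made for illustration.

```python
from collections import defaultdict

VIDEO_RECOMMEND_FAIL_STREAK = 3  # name and value from the post


class FailStreakTracker:
    """Counts consecutive misses per topic tag (hypothetical helper)."""

    def __init__(self, threshold=VIDEO_RECOMMEND_FAIL_STREAK):
        self.threshold = threshold
        # ASSUMPTION: each topic tag keeps its own streak.
        self._streaks = defaultdict(int)

    def record(self, topic_tag, correct):
        """Record one attempt; return True when a video should be surfaced."""
        if correct:
            # A correct answer breaks the streak for that topic.
            self._streaks[topic_tag] = 0
            return False
        self._streaks[topic_tag] += 1
        if self._streaks[topic_tag] >= self.threshold:
            # ASSUMPTION: reset so the same video is not re-surfaced on every later miss.
            self._streaks[topic_tag] = 0
            return True
        return False
```

With the default threshold, the first and second misses on a tag return False and only the third returns True, which is the "not on the first miss, not on the second" behavior described above.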

What's not yet wired. An earlier post described triggers based on pencil-motion and gaze stalls (detecting when a child stops writing or hesitates over the page) as part of how Koda decides when to step in. To be precise: that detection layer is designed but not in the shipping supervisor today; the restraint that is live is the prompt-level and mode-level behavior described here. Today's supervisor watches worksheet stability and face presence as ambient activation, not per-problem stall heuristics. Spoken hints are also not enabled by default in current builds. For now, the “restraint” story is partly a story about features that aren't live. We'd rather say that than imply otherwise.

The honest counter-argument.

“But my kid will give up if no one helps.” The right worry. Productive struggle that tips into unproductive struggle (sustained frustration, no purchase, no win) is bad. Kids encode the frustration and learn that math is a place where they fail. Koda's answer lives in SUCCESS_RATE_THRESHOLDS in mastery.py: a rolling success rate below 70% nudges difficulty easier, above 90% nudges it harder, and the band between is left alone. The dict only defines those two thresholds; “same” isn't a named setting, it's just the gap between them.
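As a sketch of that two-threshold band: the 70% and 90% values and the dict name come from the paragraph above, but the key names, the nudge_difficulty function, and its return values are assumptions, not the contents of mastery.py.

```python
# Values from the post; the key names "easier" and "harder" are ASSUMPTIONS.
SUCCESS_RATE_THRESHOLDS = {"easier": 0.70, "harder": 0.90}


def nudge_difficulty(rolling_success_rate):
    """Return "easier", "harder", or None for the untouched band (hypothetical)."""
    if rolling_success_rate < SUCCESS_RATE_THRESHOLDS["easier"]:
        return "easier"
    if rolling_success_rate > SUCCESS_RATE_THRESHOLDS["harder"]:
        return "harder"
    # No named "same" setting: the band between the thresholds is simply the gap.
    return None
```

Returning None rather than a "same" label mirrors the point that the band is defined only by the absence of a nudge. In this sketch a rate of exactly 70% or exactly 90% falls inside the band, since the post says "below" and "above".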

The dial we argue about in design reviews is the lower edge: is it 65%, 70%, 72%? The literature won't pin a number to the decimal; the cutoff depends on the child, the subject, and the day. We picked 70%, which sits below Rosenshine's 80% guided-practice target; we're deliberately accepting more difficulty than his optimum, betting that a bounded stretch is worth it. We expect to revise the number.

“Your tutor is not aggressive enough.” A reasonable take, especially for parents who grew up with a more direct-instruction model. The thing that changed our minds was watching enough sessions to see the pattern: a kid who is helped before they ask stops trying to figure things out, and the rest of the worksheet runs on the help. We'd rather be on the “quiet by default” side of that error and tune up than be on the “narrate everything” side and tune down. The second direction is much harder to walk back.

The trust we're asking parents to extend.

Restraint, in a kids' product, looks like a feature missing. A quiet tutor watching a child struggle is a strange thing to ship in 2026, when the rest of the category is racing to make the AI talk more. The marketing case for a chatty tutor is easier — you can demo it in thirty seconds. The marketing case for a tutor that knows when to stay out of the way is the kind of thing that needs eight paragraphs and a citation to Mary Budd Rowe.

What we're asking parents to trust is something like this: the moments when Koda is quiet while your kid is working — even when they look momentarily stuck — are not bugs. They are the product. The point of restraint is to make space for a specific cognitive event, which is the child noticing their own miss, reaching for their own next move, and registering the result as something they did. That event is, as far as we can tell, consistent with building conceptual understanding — one of the five intertwined strands of mathematical proficiency the National Research Council describes (the others being procedural fluency, strategic competence, adaptive reasoning, and productive disposition; Kilpatrick, Swafford & Findell, 2001).

We don't always get the calibration right today. We are deliberately on the quiet-by-default side of the trade-off, and we will be wrong sometimes — too quiet for a kid who needed a nudge. What we won't do is fix that by talking more. We'll fix it by getting better at reading the moment.

If you want to know when Koda ships, the waitlist is here. Related notes: how Koda decides when to interrupt, how a 10-minute review beats a 30-minute drill, and what mastery means at home.