On the architecture · 8 min read · 2026-05-01

We don't trust the LLM to grade arithmetic.

Why Koda checks every math answer with a deterministic symbolic-algebra verifier instead of asking the language model whether the kid was right.

The decision, in one paragraph.

Koda has a language model on board. It writes the hints, picks the next teaching move, decides when to interrupt. What it doesn't do is grade your child's math. Every numerical answer your kid writes on the worksheet is checked by a separate deterministic verifier — an open-source symbolic-algebra library that has been quietly checking math for the scientific computing community for over a decade. The verifier is exact in a way no language model is. We picked it because we wanted definitive feedback, not statistical feedback. Here's the longer version of why.

The problem with "ask the LLM."

Modern frontier language models are competent at arithmetic, in the same way a tired adult is competent at arithmetic — they're right most of the time. The exact rate depends on the model, the prompt, and the problem. A 2024 evaluation on the GSM8K benchmark put GPT-4-class models at around 92–96% on grade-school word problems. That sounds high. It is high. It's also nowhere near high enough to be a math grader.
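To put a rough number on "nowhere near high enough": the sketch below assumes, purely for illustration, that the benchmark figure can be read as a per-answer grading accuracy and that each answer on a 20-question worksheet is graded independently. Neither assumption comes from the evaluation itself, and the function name is ours.

```python
def chance_of_at_least_one_misgrade(per_answer_accuracy: float, answers: int) -> float:
    """Probability that at least one of `answers` independent gradings is wrong.

    Assumes every answer is graded independently with the same accuracy,
    which is a simplification made only for this illustration.
    """
    return 1.0 - per_answer_accuracy ** answers


for accuracy in (0.92, 0.96):
    p = chance_of_at_least_one_misgrade(accuracy, answers=20)
    print(f"{accuracy:.0%} per answer, 20 answers: {p:.0%} chance of at least one misgrade")
```

Under those assumptions, a grader that is right 96% of the time still misgrades at least one answer on more than half of all 20-question worksheets.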

Two specific failure modes matter for our use case. First, silent compounding: when the LLM is asked to verify a multi-step problem (say, a child's worked-out long-division attempt), it has to re-compute every step, and an error at step 2 propagates through steps 3, 4, 5. The model often "verifies" the kid's answer by reproducing the kid's mistake. Second, confident wrongness: when the model gets a digit wrong, it doesn't say "I'm not sure." It says "Yes, that's correct" with exactly the same tone it uses for the answers it has right.

For an AI tutor, the cost of a false positive — telling a kid they got it right when they didn't — is higher than the cost of a false negative. A child who's told they're wrong looks again. A child who's told they're right when they aren't carries the misconception forward, and it's the next teacher's problem. We're going to be that next teacher. So: not the LLM.

What the verifier does.

The verifier is a symbolic-algebra library. It does what your high-school algebra teacher did with chalk: it holds an expression as a tree of operators and operands, and it knows the rules for transforming that tree without ever computing a floating-point approximation. It's been used inside scientific computing tools, integrated with numerical libraries, cited in research at NASA-class labs, and taught in university computer-algebra courses for years. Old, boring, exact. We like that combination.

For our purposes, the relevant fact is that the verifier's answer to "does 3/4 + 1/8 equal 7/8?" is not a probability. It's True. There's no temperature parameter. There's no chain-of-thought. The library walks the tree, applies the standard reduction rules, and returns the same answer every time, on every machine, forever.
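A small illustration of what "exact" means here. This sketch uses Python's standard-library fractions module as a stand-in for the symbolic-algebra library, which this post doesn't name; the variable names are ours.

```python
from fractions import Fraction

# The check from the paragraph above, done with exact rationals.
exact_sum_matches = Fraction(3, 4) + Fraction(1, 8) == Fraction(7, 8)
print(exact_sum_matches)    # True

# The same style of check in binary floating point can fail,
# because 0.1, 0.2 and 0.3 have no exact binary representation.
float_sum_matches = 0.1 + 0.2 == 0.3
print(float_sum_matches)    # False

# Kept as fractions, the identical comparison holds.
exact_tenths_match = Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
print(exact_tenths_match)   # True
```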

How the architecture works.

Two systems, two responsibilities.

student wrote:  3/4 + 1/8 = 4/12

  ┌──────────────────┐        ┌──────────────────┐
  │  vision model    │ ─────→ │  parser          │
  │  reads the page  │        │  → expression    │
  └──────────────────┘        └────────┬─────────┘
                                       │
                          ┌────────────┴────────────┐
                          ▼                         ▼
                  ┌────────────────┐        ┌────────────────┐
                  │  verifier      │        │  language      │
                  │  is it true?   │        │  model         │
                  │  ────────      │        │  what to say   │
                  │  False         │        │  if not true   │
                  │                │        │  "you added    │
                  │                │        │   the tops..." │
                  └────────────────┘        └────────────────┘

The vision model reads the handwritten work or, in today's interim build, the child's typed answer comes straight in. Either way, a small parser turns the characters into an expression. The verifier answers one question, "was the math right?", exactly and with no uncertainty. Then the language model picks the next teaching move based on what's wrong: which rung of the hint ladder, which explainer video to offer, which sentence to say. The language model never gets a vote on the math itself.
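One way to picture the split in code: the verdict is produced by the verifier alone, and the language model only receives it. This is a minimal sketch, not Koda's implementation. Verdict, verify, respond and stub_language_model are hypothetical names, and the language model is replaced by a hard-coded stub.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Callable, Tuple


@dataclass(frozen=True)
class Verdict:
    """Result of the math check. Frozen, so later stages cannot alter it."""
    correct: bool
    difference: Fraction  # student answer minus target, exact


def verify(student: Fraction, target: Fraction) -> Verdict:
    # Deterministic step: the only place where correctness is decided.
    diff = student - target
    return Verdict(correct=(diff == 0), difference=diff)


def respond(student: Fraction, target: Fraction,
            choose_words: Callable[[Verdict], str]) -> Tuple[Verdict, str]:
    # The wording function sees the verdict but has no way to change it.
    verdict = verify(student, target)
    return verdict, choose_words(verdict)


def stub_language_model(verdict: Verdict) -> str:
    # Stand-in for the real model: it picks what to say, nothing more.
    return "Nice work." if verdict.correct else "Let's look at the bottoms again."


verdict, words = respond(Fraction(4, 12), Fraction(7, 8), stub_language_model)
print(verdict.correct, words)
```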

The actual verification, simplified, looks something like this:

def is_correct(student: str, target: str) -> bool:
    """Compare two arithmetic expressions for exact equality."""
    s = parse(student, rational=True)
    t = parse(target,  rational=True)
    return simplify(s - t) == 0

The parser reads the expression keeping fractions as fractions (not floating-point approximations); the simplification tests whether the difference reduces to exactly zero. If the student wrote 4/12 for 7/8, that subtraction is -13/24, which is not zero, and the function returns False. If the student wrote 14/16 for 7/8, that subtraction is 0, and the function returns True. (Both fractions name the same number; the verifier doesn't care that they're written differently.)
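For readers who want something they can run, here is a rough stand-in for the snippet above. It is an assumption-heavy sketch: instead of a symbolic-algebra library it uses Python's standard-library ast and fractions modules, the parse helper is our own (and takes no rational flag), and it only handles integers with +, -, * and /.

```python
import ast
import operator
from fractions import Fraction

# Binary operators a worksheet answer is allowed to use in this sketch.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,  # Fraction / Fraction stays a Fraction
}


def parse(text: str) -> Fraction:
    """Evaluate a plain arithmetic expression exactly (hypothetical helper)."""

    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Accept integer literals only; floats and booleans are rejected.
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {text!r}")

    return walk(ast.parse(text.strip(), mode="eval"))


def is_correct(student: str, target: str) -> bool:
    """True when the two expressions differ by exactly zero."""
    return parse(student) - parse(target) == 0


print(is_correct("4/12", "7/8"))        # False
print(is_correct("14/16", "7/8"))       # True
print(is_correct("3/4 + 1/8", "7/8"))   # True
```

Anything outside that small grammar (names, function calls, decimals) raises ValueError rather than being evaluated, which is the behaviour you want from something that reads a child's input.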

That's it. A few lines of code, plus a parser. It is exact. It does not need a GPU. It runs in milliseconds in the same process as the language model.

The honest counter-argument.

Two objections we take seriously.

"Frontier models are pretty good at math now. Aren't you over-engineering?" The frontier models are pretty good. They're not perfect, and the failure modes don't follow human patterns — they get harder problems right and easier problems wrong, they're more wrong on problems that look unusual, they're more confident the more steps they take. None of that is the sort of failure mode you want sitting between a 9-year-old and the answer to 3/4 + 1/8. We don't think the verifier is over-engineering; we think "ask the LLM" is under-engineering, dressed up as elegance.

"What about word problems? A symbolic verifier can't read 'three squirrels share 0.6 kg of acorns.'" Correct. The language model does that step — it reads the prose and turns it into an expression. The verifier then checks whether the expression the language model extracted matches the answer your kid wrote. The language model gets to do language; the verifier gets to do math. (And when the language model gets the parsing wrong, the worst case is that we ask the kid to clarify, which is also what a human tutor would do.)
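A sketch of that hand-off, using the squirrel example. The extracted values below are a stand-in for what the language model would produce, and the question itself ("how much does each squirrel get?") is our assumption, since the example sentence doesn't include one. Python's fractions module again stands in for the real verifier.

```python
from fractions import Fraction

# Stand-in for the language model's extraction from the prose (assumed values).
extracted_total_kg = "0.6"
extracted_shares = 3


def expected_share(total: str, shares: int) -> Fraction:
    # Fraction("0.6") is exactly 3/5, so no binary rounding creeps in.
    return Fraction(total) / shares


def check(student_answer: str) -> bool:
    """Exact comparison of the child's answer with the extracted expression."""
    return Fraction(student_answer) == expected_share(extracted_total_kg, extracted_shares)


print(check("0.2"))   # True
print(check("1/5"))   # True, same number written as a fraction
print(check("0.3"))   # False
```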

What this lets us do.

Definitive feedback. When Koda tells your child their answer is right, it's right — not "the AI thinks it's right." This matters for trust, especially the second or third time the kid disagrees with an AI tutor and needs the AI tutor to actually be right.

Reliable hint-ladder triggering. Because we know exactly where the kid's answer diverges from the target, Koda can pick the right rung of the hint ladder. "You added the tops and the bottoms" only fires when the difference between the student's answer and the target is the specific shape of "top is the sum of tops, bottom is the sum of bottoms." A language model that thought 4/12 was correct would never trigger that hint, because it would never have noticed the misconception.
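As an illustration of what such a trigger could look like, the sketch below tests for that one shape. It is a guess at the idea, not Koda's rule set: the function name and the choice to pass each fraction as a (top, bottom) pair, so the digits stay as written on the page, are ours.

```python
from fractions import Fraction
from typing import Tuple


def added_tops_and_bottoms(left: Tuple[int, int], right: Tuple[int, int],
                           student: Fraction) -> bool:
    """True when the answer equals (a + c) / (b + d) for the problem a/b + c/d."""
    (a, b), (c, d) = left, right
    misconception = Fraction(a + c, b + d)       # tops added, bottoms added
    correct = Fraction(a, b) + Fraction(c, d)    # the real sum
    # Compared by value, so a reduced form such as 1/3 for 4/12 also matches.
    return student == misconception and student != correct


print(added_tops_and_bottoms((3, 4), (1, 8), Fraction(4, 12)))  # True
print(added_tops_and_bottoms((3, 4), (1, 8), Fraction(7, 8)))   # False
print(added_tops_and_bottoms((3, 4), (1, 8), Fraction(5, 8)))   # False
```

A wrong answer that doesn't match the pattern, like 5/8 here, falls through to whatever rung comes next.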

Honest exam scores. Exam mode silences the language model entirely. The score that comes out at the end is the verifier's count of right answers. There's no ambiguity, no model variance, no "the AI gave you partial credit because…" The kid got 18 out of 20. Done.

What we won't do.

We won't let the language model verify alone, even with chain-of-thought. We won't trust "let me check that again" self-corrections from the language model. We won't replace the deterministic verifier with a fine-tuned math model that promises higher accuracy without giving back determinism. The language model is the teacher's voice; the verifier is the answer key. The two roles do not get to merge.

One more thing — for the engineers reading.

This isn't novel. Computer-algebra systems have been around since the 1960s, and "use a CAS to verify, use the LLM to explain" is an obvious architecture once you've watched a frontier model confidently insist that 0.39 > 0.4. What's new is that we can now run both halves locally on a consumer-class device, with the language model and the verifier in the same process, at low enough latency that the tutor feels real-time to a 9-year-old. The real work was the integration, not the algebra.

If you want the architecture context that goes with this, we wrote about the local-only build choice in a separate note. If you want to know when Koda ships, the waitlist is here. If you want to argue with us about any of this — the email is hello@kodatutor.ai.