On the cameras · 8 min read · 2026-05-01

What the camera actually sees (and what it doesn't).

A frame-by-frame walkthrough of Koda's two cameras. What each one captures, how long the data lives, and what the design specifically prevents them from seeing.

The decision, in one paragraph.

Koda has two cameras. The overhead one watches the worksheet. The front one watches the child. We built the system so that what the cameras can see is narrow by design, what gets stored is bounded by architecture, and what leaves the device is zero. This note walks through the cameras one frame at a time — what they capture, what happens after, and what the system specifically can't do even if we wanted it to.

What the overhead camera sees.

The overhead camera is mounted above a desk, pointed straight down at a piece of paper. The frame is roughly 30 cm × 30 cm, about the size of a standard worksheet plus a margin. Inside that frame: paper, a pencil, the child's hand, sometimes their forearm, and the desk surface around the worksheet.

What the overhead camera doesn't see, by physics: the child's face. The geometry doesn't allow it — the camera is mounted above the desk pointing down. A face that's tilted forward to read the worksheet appears as the top of a head. A face that's looking up at the screen is angled away from the lens entirely. The overhead view is bounded by the laws of how cameras work to a worksheet-and-hand view.

Resolution: we capture at 1080p, around 15 frames per second during an active session — that's the capture rate from the camera to the vision pipeline. Higher resolution doesn't help us read a 4th-grader's pencil marks any better, and lower resolution loses the regrouping marks that matter. 15 fps is enough to see a pencil moving without burning power on frames we'd immediately throw away. (The page-stability check that decides a worksheet is ready runs at a lower cadence — roughly two frames per second — because it only needs to confirm the page has held still for a moment, not track fast motion.)
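
A page-stability check of the kind described above can be sketched in a few lines. This is a minimal illustration, not Koda's implementation: the class name, the mean-absolute-difference measure, and both thresholds are assumptions chosen for the example. Frames are small grayscale grids (lists of rows of integers).

```python
def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two equal-sized grayscale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


class PageStabilityCheck:
    """Reports True once several consecutive samples barely differ.

    max_diff and needed are illustrative values, not Koda's.
    """

    def __init__(self, max_diff=2.0, needed=3):
        self.max_diff = max_diff
        self.needed = needed
        self.previous = None
        self.still_count = 0

    def update(self, frame):
        # Compare against the previous sample only; no history of frames is kept.
        if self.previous is not None and mean_abs_diff(self.previous, frame) <= self.max_diff:
            self.still_count += 1
        else:
            self.still_count = 0
        self.previous = frame
        return self.still_count >= self.needed
```

Sampled at roughly two frames per second, `needed=3` would mean the page has held still for about a second and a half. Only one previous sample is held at a time, so the check never accumulates frames.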

What the front camera sees.

The front camera sits where a webcam usually does — on top of the monitor or TV. Its frame contains the child sitting at the desk: face, shoulders, sometimes a sliver of the room behind them.

We crop that frame to the child before doing anything else with it. The face-detection step pulls a bounding box around the kid; everything outside the box is ignored. A sibling walking through the background, a parent in the doorway, the bookshelf: none of those make it into any computation that happens after. The cropped frame is what feeds the face-recognition embedding model and, in the designed but not-yet-shipping version of the supervisor, a gaze-and-pencil-stall heuristic that triggers the hint ladder. Today's supervisor uses face presence (is a face visible?) as the ambient signal, not per-problem gaze or motion detection.
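
The crop step can be pictured as follows. This is a sketch under stated assumptions: `Box` and `crop_to_box` are hypothetical names, and the frame is modelled as a plain grid of pixel values rather than whatever image type the real pipeline uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box in pixel coordinates; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def crop_to_box(frame, box):
    """Return only the pixels inside the box; nothing outside it is copied."""
    return [row[box.left:box.right] for row in frame[box.top:box.bottom]]
```

For example, cropping a 4 × 4 grid with `Box(left=1, top=1, right=3, bottom=3)` returns the 2 × 2 centre. Downstream steps receive only that result, so anything outside the box is never available to them.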

The front camera doesn't see the worksheet (the geometry's wrong), and it doesn't see the parent unless the parent is sitting at the kid's seat. The frame's jobs today are narrow: establish face presence so a session can start, and — once a face is found — identify who the child is. On the roadmap it does more — observing how physical skills and focus develop across a session and surfacing that to you as a parent: support, never a score on your kid.

What happens to a frame.

This is the part most parents want walked through carefully, so we'll do it slowly.

The camera produces a frame. The frame is a 1080p JPEG-ish blob in the device's RAM.
The vision model reads the frame. For the front camera today, that is face detection and — if a face is found — face embedding. For the overhead camera today, that is page-stability watching (confirming the worksheet is still, not reading what's written on it). The handwriting-recognition step and the gaze + pencil-motion heuristic are part of the designed pipeline but are not running in today's build.
The vision model produces structured output: "face matches profile #2 at confidence 0.93," or, once the handwriting-read loop ships, "the worksheet currently shows '52 - 27 = 25'." That structured output is what the rest of Koda reasons about. (Reading handwriting into the live session, and the gaze/pencil-motion signals such as "pencil hasn't moved in 18 seconds, gaze on page," are part of the designed supervisor and appear in the data model, but today's build uses typed input plus face presence, not camera-read handwriting or per-problem stall detection.)
The frame itself is dropped from RAM. Not written to disk. Not buffered. Not queued for "future improvement of our service." Once the structured output is extracted, the pixels are gone.
Next frame, repeat.
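
The steps above can be sketched as a loop in which the only thing that survives each iteration is a small structured record. Everything named here (`Observation`, `run_pipeline`, `fake_face_reader`) is hypothetical and stands in for the real vision model and data model, which this note does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Observation:
    """Structured output only: this type has no field that can carry pixels."""
    kind: str
    detail: str
    confidence: float


def run_pipeline(
    frames: Iterable[bytes],
    read_frame: Callable[[bytes], Observation],
) -> Iterator[Observation]:
    for frame in frames:
        observation = read_frame(frame)
        # Drop this function's reference to the pixels before moving on.
        # A real pipeline would also need the capture layer to release its buffer.
        del frame
        yield observation


def fake_face_reader(frame: bytes) -> Observation:
    # Stand-in for the vision model; it only pretends to look at the frame.
    visible = len(frame) > 0
    return Observation("face_presence", "face visible" if visible else "no face", 1.0)
```

The design point is in the return type: because `Observation` holds only short strings and a number, nothing downstream of the loop can be handed a frame, even by accident.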

Across a 20-minute math session, this happens about 18,000 times per camera, and each frame's lifetime in the system is on the order of tens of milliseconds. There is no "session video" being assembled. There is no recording. The camera produces frames; the vision model reads them; the pixels die.
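
The 18,000 figure follows directly from the capture rate quoted earlier. The variable names are only for illustration.

```python
SESSION_MINUTES = 20
CAPTURE_FPS = 15  # capture rate from the camera to the vision pipeline

frames_per_camera = SESSION_MINUTES * 60 * CAPTURE_FPS  # 18000
frame_interval_ms = 1000 / CAPTURE_FPS                  # about 66.7 ms between frames
```

At 15 fps a new frame arrives roughly every 67 milliseconds, which is consistent with each frame living for tens of milliseconds before the next one replaces it.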

What's stored, what isn't.

Stored on the device, indefinitely until you delete the profile:

The optional enrollment photos your child took when you set up their profile. These are stored as image files in the local profiles directory. They feed the face-matching embedding model at session start.
The face embedding vector derived from those photos. Encrypted in the profile row of the local database. A few hundred numbers per profile. This is what the front camera's live face matching actually compares against.
The event log: structured records of what your child worked on, what hint rung Koda climbed to, what XP they earned. This is the data the parent report (and the v1.1 Friday digest) is built from. No frames, no audio, no voice samples — just structured event records.
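
To make "compares against" concrete, here is one common way to compare a live embedding with stored ones. It is a sketch, not a description of Koda's matcher: cosine similarity, the 0.8 threshold, and the function names are all assumptions, and the vectors are shortened to three numbers for readability.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two equal-length vectors, from -1.0 to 1.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_profile(live, enrolled, threshold=0.8):
    """Return (profile_id, score) for the closest enrolled profile, or None.

    threshold is an illustrative value, not Koda's.
    """
    best = None
    for profile_id, vector in enrolled.items():
        score = cosine_similarity(live, vector)
        if best is None or score > best[1]:
            best = (profile_id, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```

With `enrolled = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}`, a live vector of `[0.1, 0.9, 0.0]` matches profile 2, while `[0.0, 0.0, 1.0]` matches nobody and returns `None`. The comparison needs only the stored numbers, never the enrollment photos or a live frame.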

Not stored, anywhere, ever:

Live session frames from either camera.
Live session audio — see the wake-word section below.
Voice samples of your child. Not in cloud storage (there is no cloud), not on the device, not in any analytics layer.
What was on the worksheet beyond the structured "problem #N answered '25'" record. The literal pixels of the worksheet aren't archived.

The wake-word listener — the audio question.

We're often asked, "is the microphone always listening?" The honest answer requires a little unpacking.

The microphone is on while a session is active, just like the cameras are. It's also optionally on between sessions to listen for the wake word "Hi Koda"; that's the third trigger in our hint ladder (the explicit-ask one). When wake-word listening is on, the audio works the same way the video does: the mic produces a rolling buffer (about 3 seconds' worth), a tiny on-device model checks each window for the literal phrase "Hi Koda," and if it doesn't fire, the window is overwritten by the next one. There's no transcription. The audio doesn't go anywhere except into the wake-word model's input layer, and out the other end of that model is a single number: confidence-this-was-the-wake-word. (Today that model is a generic preset wake-word proxy standing in for the custom-trained "Hi Koda" model, which is still on the workbench.)
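
The rolling buffer described above maps naturally onto a fixed-length queue. This is a minimal sketch under assumptions: the 16 kHz sample rate, the 0.5 threshold, and the names `RollingAudioBuffer`, `listen`, and `score_window` are hypothetical, and `score_window` stands in for the on-device wake-word model.

```python
from collections import deque


class RollingAudioBuffer:
    """Holds at most `seconds` of audio; older samples are overwritten."""

    def __init__(self, sample_rate=16000, seconds=3):
        self.samples = deque(maxlen=sample_rate * seconds)

    def push(self, chunk):
        # Once the deque is full, each new sample pushes the oldest one out.
        self.samples.extend(chunk)

    def window(self):
        return list(self.samples)


def listen(chunks, score_window, buffer, threshold=0.5):
    """Yield one True/False per chunk: did the wake-word score cross the threshold?"""
    for chunk in chunks:
        buffer.push(chunk)
        confidence = score_window(buffer.window())  # a single number, no text
        yield confidence >= threshold
```

Because the buffer has a hard maximum length, it cannot grow into a recording: with room for 12 samples, pushing 20 leaves only the most recent 12.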

If you don't want the wake-word listener on at all, the parent portal has an off switch. We default it to off; the wake word is opt-in. Most families end up turning it on after a week because it's useful, but the default is off because that's the conservative starting point.

In the designed session flow, the mic captures the child's voice for the same reason the camera captures their face: so Koda can respond. The audio is processed on-device; the structured output is "the kid said 'I don't get it'" or "the kid laughed"; the raw audio is dropped. Not stored. Not transmitted. (This in-session speech path is still on the workbench — today's build doesn't transcribe or act on session speech; the only audio path that runs is the optional wake-word check above.)

What the cameras can't see, by design.

We try to write down the things we've made impossible for the system to do, because promises about behavior are weaker than constraints in the architecture.

The cameras can't see the child outside a session. Both cameras have a hardware recording light. When the light is off, the camera is off, because the light is wired to the sensor power rail. (We pick cameras specifically for this property.) The "indicator" isn't a software toggle; it's a circuit.
The cameras can't transmit. The device doesn't make network calls during a session. (We wrote about this in the local-only architecture note.) Even if a frame survived past the vision model — which it doesn't — there's no upload pipeline for it to enter.
Frames don't accumulate. The pixel-discard step happens in-memory, immediately after the structured-output extraction. There's no buffer that grows to a video. There's no log of frames waiting to be uploaded.
The microphone can't transcribe everything. The wake-word listener's model is a single-purpose binary classifier (was-this-the-wake-word? yes/no). It physically can't produce text transcripts of overheard conversation; that would be a different model with different weights.

What you can verify physically.

Architectural promises are good; physical verification is better. Things you can check yourself:

The camera lights. If a light is on, a session is on. If a light is off, the camera is off. There is no software state where the light lies. (If you ever see a Koda update that introduces a lying camera light, that's a serious-enough betrayal that the right move is throwing the box out.)
Unplug a camera. Both cameras are removable. Unplug one, the corresponding camera dies completely. Plug it back in when you want a session.
Cover the lens. Tape works. Once the overhead camera is covered, Koda can't read the worksheet — the system tells you so, in the parent portal status panel.
Watch your network. If you run Little Snitch or pfSense and watch outbound connections from the Koda device during a session, you should see exactly nothing — no calls to cloud LLM endpoints, no telemetry, nothing. (Software updates make calls; sessions don't.)

The honest counter-argument.

We'll say it directly: a camera in your child's room is a real thing. Even with a hardware recording light. Even with a local-only architecture. Even with a discard-after-each-frame pipeline. There are families for whom the answer is "no thank you" and we think those families are reasonable. Koda asks for cameras pointed at the desk for ~30 minutes a day. If that's past the threshold, the right answer is to use a different tool — there are good ones — and we're not going to argue you out of it.

What we will argue against is the version of camera-skepticism that holds Koda to a higher standard than the dozen camera- and microphone-bearing devices already in your house: the iPad, the laptop, the phone, the smart speaker, the doorbell, the modern TV, the in-car telemetry, the always-listening assistant. Most of those default to cloud, default to logged audio, default to facial-recognition datasets that live in someone else's data center. We've tried to build the version that doesn't, and to make the constraints visible. That's the deal.

What we'd ask of you.

If you decide to set Koda up, sit your kid down before the first session and walk them through what the cameras see. Show them the lights. Show them the parent portal. The kids who understand the cameras tend to relax around them quickly; the ones who don't understand tend to feel watched. That's true for any camera, anywhere. The architecture takes care of the part we can take care of; the conversation between you and your kid is what does the rest.

The longer notes that go with this one: the local-only architecture (where the data goes, or rather doesn't), and the colophon (every library and deliberate omission). If you want to know when Koda ships, the waitlist is here. As ever, a few emails. Total.