Notes · 7 May 2026

Walking as ground truth: an eval rig for a grounded LLM.

I sell a printed walking book that’s generated, fresh, against any UK postcode. To make sure the LLM doesn’t lie about which footpaths exist, I built an eval grounded in actually walking the routes. Here’s what 100 walks taught me.

The problem with route-generating LLMs.

The product is simple to describe and harder to ship. Type a UK postcode into fromyourdoor.com; about a minute later, you get a printed book of ten walks that start at that address — routed across OpenStreetMap, sorted into four distance bands (short, morning, half-day, full-day), with narrative descriptions of what you’ll see along the way.

The narrative descriptions are the part where things get interesting. The graph routing (A* over OSM’s footway/footpath/bridleway edges) is solved engineering. What an LLM does on top is the editorial layer: “at the second stile turn left, ignore the bridleway up the hill, the Stag’s Head is half a mile further on the right, decent kitchen.” That’s the bit that makes it a book and not just an exported GPX file.

It’s also the bit that hallucinates if you let it. Three failure modes I saw repeatedly in the early prototypes:

The LLM invents a stile that isn’t there, because “turn left at the stile” is a sentence that fits the rhythm of walking instructions.
The LLM names a pub that doesn’t exist, or one that exists but is given the wrong opening hours, or, worse, one that was demolished in 2019.
The LLM emits a bearing or distance that contradicts the actual geometry of the segment it’s describing: “follow the path north for half a mile” when the path actually heads south-east for 800 metres.

None of these are catastrophic. None of them are caught by the standard LLM benchmarks. None of them get you a refund. And all of them, accumulated across ten walks in a printed book, are exactly the kind of small wrongness that turns a thoughtful gift into a slightly embarrassing one.

Walking benchmarks aren’t a thing. So I built one out of the only ground truth I had: my own feet.

The four-band eval.

The composite eval scores each generated walk on four bands, weighted equally. Each band is the average of three or four sub-checks. The sub-checks are a mix of programmatic (run against the OSM graph + the actual narration text) and LLM-as-judge (a separate, larger model rates clarity on a 0–100 scale, given a strict rubric).

# sketch: composite score is the unweighted mean of four band scores (each 0-100)
from statistics import mean

def composite(routing, narration, waymark, instruction_clarity):
    return mean([
        routing,              # does the route resolve on OSM with no missing edges?
        narration,            # do the named features actually exist in the bbox?
        waymark,              # do bearings + distances match the geometry?
        instruction_clarity,  # is the prose unambiguous to a reasonable walker?
    ])

Routing is the cheapest band: every route is replayed through the OSM graph, and any route with an unrouteable edge or an invalid junction sequence scores zero on that sub-check. This catches about 30% of the failure cases on its own and has no LLM in the loop, which makes it free to run on every commit.
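A minimal sketch of the replay idea, assuming the OSM graph has already been reduced to an adjacency map of node ids. The names `missing_edges` and `routing_subcheck`, the toy graph, and the 0/100 scoring are illustrative assumptions, not the rig’s actual code.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, Set[int]]  # hypothetical shape: node id -> neighbouring node ids


def missing_edges(graph: Graph, route: List[int]) -> List[Tuple[int, int]]:
    """Return every consecutive (a, b) pair in the route with no edge a -> b."""
    return [
        (a, b)
        for a, b in zip(route, route[1:])
        if b not in graph.get(a, set())
    ]


def routing_subcheck(graph: Graph, route: List[int]) -> int:
    """Score 100 if the whole route replays on the graph, else 0."""
    if len(route) < 2:
        return 0  # assumption: a route with no edges counts as a failure
    return 0 if missing_edges(graph, route) else 100


# Toy graph: a square of four nodes, walkable in both directions.
toy_graph: Graph = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {3, 1}}
```

Because the check is a pure graph lookup, it needs no model call, which is consistent with running it on every commit.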

Narration is the most expensive: every named feature in the prose (pubs, churches, gates, viewpoints) is cross-referenced against POIs in the route’s bounding box, with a fuzzy match for slightly wrong names and a hard fail for hallucinations. This is where the LLM-judge does the most work, because the Stag’s Head and The Stag are the same pub, but the Crown Inn and the Crown Tavern are sometimes different ones.
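The programmatic half of that cross-reference might look something like the sketch below. It assumes the POI names for the bounding box are already available as a list of strings, and it uses the standard library’s `difflib` as a stand-in for whatever fuzzy matcher the rig really uses. The `normalise` and `match_feature` helpers and the 0.8 cutoff are hypothetical.

```python
import re
from difflib import get_close_matches
from typing import List, Optional


def normalise(name: str) -> str:
    """Lowercase, drop punctuation and a leading 'the' so near-identical names compare equal."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return re.sub(r"^the ", "", cleaned)


def match_feature(mentioned: str, poi_names: List[str], cutoff: float = 0.8) -> Optional[str]:
    """Return the closest POI name, or None if nothing is close enough (a hard fail)."""
    by_key = {normalise(p): p for p in poi_names}
    hits = get_close_matches(normalise(mentioned), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None
```

A string matcher like this only settles the easy cases. Pairs that land near the cutoff are the ones the text hands to the LLM judge.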

Waymark compares the bearings and distances in the prose against the actual polyline. “Half a mile north” is a hard claim: half a mile is 805 m, ± about 50 m of acceptable drift; north is 0° ± 22.5°. If the segment is actually 1.2 km running south-east, the sub-check fires.
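As a rough sketch of that comparison, the code below measures a polyline with the haversine formula and takes the bearing from its first point to its last. The tolerances (50 m, 22.5°) come from the text; the function names, the eight-point compass table and the start-to-end bearing simplification are assumptions.

```python
import math
from typing import List, Tuple

LatLon = Tuple[float, float]  # (latitude, longitude) in degrees
EARTH_RADIUS_M = 6_371_000.0
COMPASS_DEG = {"north": 0, "north-east": 45, "east": 90, "south-east": 135,
               "south": 180, "south-west": 225, "west": 270, "north-west": 315}


def haversine_m(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def bearing_deg(a: LatLon, b: LatLon) -> float:
    """Initial bearing from a to b, in degrees clockwise from north."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    y = math.sin(lon2 - lon1) * math.cos(lat2)
    x = (math.cos(lat1) * math.sin(lat2)
         - math.sin(lat1) * math.cos(lat2) * math.cos(lon2 - lon1))
    return math.degrees(math.atan2(y, x)) % 360


def waymark_ok(polyline: List[LatLon], claimed_m: float, claimed_dir: str,
               dist_tol_m: float = 50.0, bearing_tol_deg: float = 22.5) -> bool:
    """True if the prose claim matches the polyline within both tolerances."""
    length = sum(haversine_m(p, q) for p, q in zip(polyline, polyline[1:]))
    actual = bearing_deg(polyline[0], polyline[-1])
    # Smallest angular difference, so 350 degrees vs 0 degrees counts as 10, not 350.
    delta = abs((actual - COMPASS_DEG[claimed_dir] + 180) % 360 - 180)
    return abs(length - claimed_m) <= dist_tol_m and delta <= bearing_tol_deg
```

A winding path can have a start-to-end bearing that differs from the direction a walker sets off in, so a fuller version would probably check the first leg as well.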

Instruction clarity is the LLM-judge band: given the narration text and the route, would a reasonable walker carrying the printed page know which way to go at each decision point? This catches the prose that’s technically correct but useless: “continue along the path” at a four-way junction.

Walking is the ground truth.

The eval rig is fast and cheap to run, and it caught most of the obvious wrongness. It did not catch the subtle wrongness, the kind that makes a walk slightly worse without being technically false. The eval said the route was 92/100; my legs said it was 70.

So I started walking the routes. Roughly 100 of them so far, across the south of England, where I live, and a smaller number on visits to the Lake District and Wales. Each walk gets logged as a structured disagreement entry: where I expected to be, where I actually was, what the book told me to do, what I actually did, and why.

{
  "walk_id": "OX7-6YD-morning-loop-3",
  "waypoint": 4,
  "book_says": "Cross the stile and follow the field edge.",
  "actual": "There is no stile. There is a kissing gate, signposted, 30m further on.",
  "category": "named-feature-wrong",
  "bearing_delta_deg": 0,
  "distance_delta_m": 30,
  "fix_hypothesis": "The LLM is over-indexing on 'stile' as a default rural-feature word."
}

About one in eight waypoints, on the early walks, produced a disagreement entry. About one in fifteen on the more recent ones. The improvement comes from feeding the disagreement dataset back into two places: the eval rubric (so the LLM-judge starts catching “stile” when the OSM tag is kissing_gate) and the narration prompt (so the writing model knows to default to the OSM tag rather than the genre word).
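The entry shape shown above can be mirrored as a typed record, which makes rates like “one in eight waypoints” easy to recompute. This is a sketch only: storing entries as one JSON object per line, and the helper names `load_entries`, `disagreement_rate` and `by_category`, are assumptions.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Disagreement:
    walk_id: str
    waypoint: int
    book_says: str
    actual: str
    category: str
    bearing_delta_deg: float
    distance_delta_m: float
    fix_hypothesis: str


def load_entries(json_lines: Iterable[str]) -> List[Disagreement]:
    """Parse one JSON object per line; unknown or missing keys raise TypeError."""
    return [Disagreement(**json.loads(line)) for line in json_lines if line.strip()]


def disagreement_rate(entries: List[Disagreement], waypoints_walked: int) -> float:
    """Fraction of walked waypoints that produced a disagreement entry."""
    return len(entries) / waypoints_walked


def by_category(entries: List[Disagreement]) -> Counter:
    """Tally entries per category, e.g. 'named-feature-wrong'."""
    return Counter(e.category for e in entries)
```

The per-category tally is one way to decide which rubric or prompt change to try first.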

This is the moat, such as it is. The code is six weeks of work and could be replicated by any competent engineering team. The 100 walks of structured disagreement entries, fed back into both the eval and the prompts, are 100 weekends I cannot get back. They’re also the only thing that distinguishes a book that’s correct from a book that’s good.

The 85-plateau.

The composite score for a fresh book on a typical postcode plateaus around 85. I have spent multiple weekends trying to get it above 90. I have stopped trying.

What happens above 85 is that the gains turn out to be noise. The same prompt, the same route, the same model, run twice in quick succession, will score in a range of about ±3 points. Trying to optimise into that range is chasing variance. Below 80, books are noticeably worse — I can feel the difference walking them. Between 80 and 85 there’s a real gradient. Above 85 the eval is no longer measuring anything I can act on.
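One way to make “chasing variance” concrete is to rerun the eval on identical inputs, measure the spread, and only accept a change whose gain exceeds it. The sketch below does that. The function names and the decision rule are assumptions, and any scores fed to it here would be illustrative rather than measured.

```python
from statistics import mean
from typing import Sequence


def noise_band(repeat_scores: Sequence[float]) -> float:
    """Half the spread between the best and worst score of identical reruns."""
    return (max(repeat_scores) - min(repeat_scores)) / 2


def is_real_gain(baseline_runs: Sequence[float], candidate_runs: Sequence[float]) -> bool:
    """True only if the mean improvement exceeds the larger of the two noise bands."""
    gain = mean(candidate_runs) - mean(baseline_runs)
    return gain > max(noise_band(baseline_runs), noise_band(candidate_runs))
```

With a noise band of about 3 points, a prompt tweak that moves the mean from 86 to 87 would be rejected by this rule, which matches the experience described above.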

Knowing this changed how I shipped. The pre-flight check before printing a book is now “composite ≥ 80, with a hard floor on routing and waymark”, not “chase the highest number you can get”. It is cheaper and faster, and the books are no worse for it.

Below 80 the books are bad. Above 85 the eval is measuring noise. Aim for the corridor.
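The pre-flight rule can be sketched as a small gate. The composite threshold of 80 is from the text. The hard-floor values are not stated anywhere in this post, so the 70s below are placeholders, and `preflight` is a hypothetical name.

```python
from statistics import mean
from typing import Dict, Mapping

COMPOSITE_MIN = 80.0
# Assumption: the actual hard floors are not given in the post; 70 is a placeholder.
HARD_FLOORS: Mapping[str, float] = {"routing": 70.0, "waymark": 70.0}


def preflight(bands: Dict[str, float]) -> str:
    """Return 'print' or 'review' for one book's four band scores."""
    composite = mean(list(bands.values()))
    if composite < COMPOSITE_MIN:
        return "review"
    # A strong average must not hide a failing routing or waymark band.
    if any(bands[name] < floor for name, floor in HARD_FLOORS.items()):
        return "review"
    return "print"
```

Note that there is no upper target: a book scoring 81 and a book scoring 92 are treated the same, which is the point of aiming for the corridor.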

Three things I didn’t expect.

The narration is more reliable than the routing. I expected the LLM to be the weak link. Once the narration prompt was tuned against the disagreement dataset, it turned out the routing layer — specifically, the OSM graph itself — produced more user-facing errors than the prose did. Footpaths that don’t exist on the ground but do exist in OSM. Fields that have been re-fenced. Permissive paths that have been revoked. The book can’t know any of this; the LLM gets blamed when really it’s the cartography that’s out of date.

The eval-judge model needs to be bigger than the writer model. Counterintuitively, I get the best per-pound result by drafting the narration with a smaller model and judging it with a larger one. The judge has to spot subtle wrongness; the writer mostly has to be coherent. Reversing this, which was my first instinct, was substantially worse.

Disagreement entries are more useful than success entries. I tried to log good walks too (what worked, what was particularly clear) and feed those back as positive examples. They turned out to be much less useful than the disagreement entries. The model learns more from “this was wrong, here’s what should have been said” than from “this was right, do more like this”. I’ve since stopped logging the success entries; the only data I keep is the failures.

What this is for.

The narrow answer: it’s for shipping a small, peculiar product without lying to the people who buy it. Every paid book has its composite score logged, and any book that scores below the floor is held in a review queue rather than printed. Buyers don’t see the score; they see a book that doesn’t send them across a field that isn’t there.

The wider answer: walking is, as far as I know, an underused source of ground truth for LLM evaluation. Most of the work I’ve seen on grounded LLM evals lives in domains where ground truth is either programmatic (run the code, see if it passes) or annotator-driven (pay people to rate things). Walking sits in between. The world checks the model. You don’t have to pay anyone; you have to put on shoes.

I don’t think this generalises perfectly. Most LLM applications don’t have a physical-world feedback loop you can take a Saturday afternoon to close. But the ones that do — agriculture, civic infrastructure, transit, anything where the model is making claims about a place — might benefit from someone walking the territory and writing down what didn’t match.

That’s the moat I’m building. It’s slow. It’s embarrassingly low-tech. It’s also, on the evidence so far, the only one that holds.

What this gets used for

From Your Door — a printed walking book of ten walks routed from any UK postcode, sold as a housewarming gift. £39, posted in seven working days. The eval rig described above is the gate that keeps a book from being printed if the routing or narration scores below the floor.

Written by Joe Wapshott. I’m a sole trader in the UK; I print and post these books myself. If you’ve built an eval rig for a grounded LLM in a different domain, I’d be interested to compare notes — hello@fromyourdoor.com.