Letters / 24 May 2026
What we measure instead of a summariser score.
On sealed datarooms, atomic truth ledgers, counterfactual guards, and how we run the engine before a partner ever sees an engagement.
Alex Ouellet · 24 May 2026
I wrote about the workflow in an earlier letter — why the model is a parameter inside it, not the product. That piece stopped at the harness in the abstract. This one is about what the harness actually counts, and why most vendor scorecards count the wrong thing.
When a diligence vendor demos a product, the implicit score is summarisation. Did the engine read the PPM and write down the management fee, the fund size, the key-person clause? Those answers are table stakes. An associate with a PDF reader and a quiet afternoon can produce them. What decides whether a memorandum survives a committee eighteen months later is everything summarisation ignores: the cross-document inconsistency buried on page forty-seven of the side letter, the fabricated comparable the model wants to invent because the marketing book is thin, the refusal to write a sentence the materials do not support.
We do not tune to a summariser score. We tune to four measurements that are harder to game and closer to what a partner would check by hand.
Four measurements
Citation coverage is the fraction of claim-bearing fields in the memorandum draft that carry an inline anchor to a source span. If retrieval does not return a span the validator accepts, the claim is dropped before a reviewer sees it. A reviewer in a hurry can miss a claim. They cannot accept one that was never grounded.
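A minimal sketch of that measurement, for readers who think in code. It is an illustration under stated assumptions, not the production validator: the `Field` type, the anchor format, and the choice to compute coverage on the draft before ungrounded claims are dropped are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Field:
    name: str
    text: str
    claim_bearing: bool
    # Hypothetical anchor format: an opaque id pointing at a source span.
    anchor: Optional[str] = None


def citation_coverage(fields: List[Field]) -> float:
    """Fraction of claim-bearing fields that carry an inline anchor."""
    claims = [f for f in fields if f.claim_bearing]
    if not claims:
        return 1.0  # assumption: a draft with no claims is vacuously covered
    return sum(1 for f in claims if f.anchor) / len(claims)


def drop_ungrounded(fields: List[Field]) -> List[Field]:
    """Remove claim-bearing fields without an anchor; keep everything else."""
    return [f for f in fields if not f.claim_bearing or f.anchor]


draft = [
    Field("title", "Manager diligence memorandum", claim_bearing=False),
    Field("management_fee", "Management fee is 2%.", True, "ppm#span-12"),
    Field("fund_size", "Target fund size is $400m.", True, "ppm#span-03"),
    Field("key_person", "Two named key persons.", True, "lpa#span-41"),
    Field("irr", "Net IRR is strong.", True, None),  # never grounded
]

coverage = citation_coverage(draft)
reviewable = drop_ungrounded(draft)
```

Here three of the four claim-bearing fields are anchored, so coverage is 0.75, and the unanchored IRR sentence is removed before review.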
Planted-finding recall is whether the engine surfaces findings we deliberately buried in the dataroom — cross-document inconsistencies, covenant breaches, fee-structure mismatches. Each planted finding carries key tokens a competent underwriter would write in the rationale. The score is paraphrase-tolerant. We are not checking for a magic string. We are checking that the finding showed up at all.
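One way to picture a paraphrase-tolerant check, sketched under assumptions: the tokeniser, the `min_hit` threshold of 0.5, and the finding ids below are hypothetical and chosen only to illustrate matching on key tokens rather than on an exact string.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokeniser."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def planted_recall(
    planted: Dict[str, List[str]],
    rationales: List[str],
    min_hit: float = 0.5,  # assumption: share of key tokens that must appear
) -> float:
    """Fraction of planted findings whose key tokens show up in any rationale."""
    if not planted:
        return 1.0
    rationale_tokens = [tokens(r) for r in rationales]
    surfaced = 0
    for key_tokens in planted.values():
        keys = {k.lower() for k in key_tokens}
        if keys and any(len(keys & r) / len(keys) >= min_hit for r in rationale_tokens):
            surfaced += 1
    return surfaced / len(planted)


planted = {
    "fee_mismatch": ["management", "fee", "side", "letter"],
    "covenant_breach": ["leverage", "covenant", "breach"],
}
rationales = [
    "The side letter states a management fee of 1.5%, which conflicts with the PPM.",
]

recall = planted_recall(planted, rationales)
```

The fee mismatch is found even though the rationale is worded freely; the covenant breach is not mentioned at all, so recall is 0.5.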
Counterfactual rejection is the mirror image. Each dataroom includes a ledger of fabricated claims — statements that sound plausible, cite nothing real, and must not appear in the draft. A summariser that hallucinates a 25% IRR when the track record shows 14% fails this check even if its prose reads well.
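The guard can be sketched as follows. This is an assumption-laden illustration, not the actual check: it treats a fabricated claim as leaked when all of its marker tokens land in a single sentence of the draft, and the ledger entries and claim ids are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercase tokens, keeping a trailing % so '25%' and '14%' stay distinct."""
    return set(re.findall(r"[a-z0-9]+%?", text.lower()))


def leaked_claims(draft: str, ledger: Dict[str, List[str]]) -> List[str]:
    """Ids of fabricated claims whose marker tokens all occur in one sentence."""
    sentences = [tokens(s) for s in re.split(r"(?<=[.!?])\s+", draft)]
    leaked = []
    for claim_id, markers in ledger.items():
        want = {m.lower() for m in markers}
        if want and any(want <= s for s in sentences):
            leaked.append(claim_id)
    return leaked


def counterfactual_rejection(draft: str, ledger: Dict[str, List[str]]) -> float:
    """1.0 means no fabricated claim appeared in the draft."""
    if not ledger:
        return 1.0
    return 1.0 - len(leaked_claims(draft, ledger)) / len(ledger)


ledger = {
    "inflated_irr": ["25%", "irr"],
    "invented_comparable": ["comparable", "fund", "xii"],
}
clean_draft = "The track record shows a net IRR of 14% across realised deals."
bad_draft = "The manager delivered a 25% net IRR. Fees are in line with peers."

clean_score = counterfactual_rejection(clean_draft, ledger)
bad_leaks = leaked_claims(bad_draft, ledger)
```

The clean draft scores 1.0. The second draft reads well and still fails, because the 25% IRR from the fabricated ledger made it into a sentence.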
Truth-claim recall is the newest and the heaviest. For each workflow we maintain an atomic-claim ledger: one factual claim per row, each tied to a source span. Fund vintage, carry rate, auditor opinion date: the kind of sentence an associate would extract and a partner would expect to find without being prompted. The denominator is the whole ledger, not a hand-picked quiz. Manager DD carries 128 claims across twelve documents and four worksheets. ODD carries 80. The smaller workflows carry 60 each. The score is how many of those atomic claims surface somewhere in the draft body with enough token overlap to count as surfaced.
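A sketch of how such a ledger score could be computed. Everything specific here is an assumption made for illustration: the `Claim` row shape, the stopword list, matching per paragraph, and the 0.7 overlap threshold are hypothetical, and the three example rows are invented.

```python
import re
from dataclasses import dataclass
from typing import List, Set

# Assumption: a tiny stopword list so overlap is measured on content tokens.
STOPWORDS = {"the", "a", "an", "of", "is", "was", "in", "on", "at", "to", "and", "by"}


@dataclass(frozen=True)
class Claim:
    claim_id: str
    text: str
    source_span: str  # hypothetical anchor into the dataroom


def content_tokens(text: str) -> Set[str]:
    found = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in found if t not in STOPWORDS}


def truth_recall(ledger: List[Claim], draft_body: str, min_overlap: float = 0.7) -> float:
    """Share of ledger claims surfaced in some paragraph of the draft body."""
    if not ledger:
        return 0.0
    paragraphs = [content_tokens(p) for p in draft_body.split("\n\n")]
    surfaced = 0
    for claim in ledger:
        want = content_tokens(claim.text)
        if want and any(len(want & p) / len(want) >= min_overlap for p in paragraphs):
            surfaced += 1
    return surfaced / len(ledger)  # denominator is the whole ledger


ledger = [
    Claim("c1", "Fund vintage is 2019", "ppm#span-03"),
    Claim("c2", "Carry rate is 20 percent", "lpa#span-41"),
    Claim("c3", "Auditor opinion dated 31 March 2024", "fs#span-02"),
]
draft_body = (
    "The fund is a 2019 vintage vehicle.\n\n"
    "Carried interest is set at a carry rate of 20 percent."
)

score = truth_recall(ledger, draft_body)
```

Two of the three rows surface, so the score is 2/3. The auditor opinion date is missing from the draft and counts against it, because nothing is excluded from the denominator.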
These four sit on top of the claim taxonomy I described in What an underwriter can know: verifiable, inferable, and the ones we refuse to write. The eval does not replace that distinction. It pressure-tests whether the workflow respects it under model rotation.
A sealed dataroom per workflow
We do not score the engine against customer deals. That would be slow, expensive, and impossible to reproduce. Instead we built seven synthetic datarooms — one for each workflow — from public and composite materials. Each room is small enough to read in an afternoon and engineered to contain known problems: buried inconsistencies, claims that should be refused, statements that must never appear because nothing in the room supports them.
Manager DD is the largest. A full manager diligence package — PPM, LPA, side letters, ILPA DDQ, track record, audited financials, team bios, reference summaries — with a dozen cross-document inconsistencies a careful reader should catch. The other six workflows follow the same shape at smaller scale: Direct IDD, ODD, Tax Brief, Co-Invest, Secondary, Governance.
Building these took longer than building most of the product surface. Authoring atomic claims one at a time. Planting inconsistencies across documents that otherwise have to read as a coherent sponsor package. Writing refusals in the register a partner would use. I have spent more time on the Tax Brief room's footnote-level claims than on any marketing page. The dataroom is the specification. The score is whether the engine meets it.
Every change, and every night
We run those seven datarooms at two different costs, and conflating the two is how vendors lie to themselves.
On every change to the codebase we run all seven workflows through a lightweight check. No customer data, no model bill, a few seconds per workflow. The check catches broken extraction, broken structure, broken citation wiring — the plumbing that has to hold regardless of which model we ship. It does not pretend to score analyst-grade prose. A zero on truth recall in the lightweight check does not mean the engine found nothing useful on a live engagement. It means the cheap check is honest about its limits.
Every night we run the same seven datarooms through the full engine: real model calls, the complete workflow, the same validator and refusal rules a customer engagement sees. That is where model drift shows up first. A prompt edit that looked fine in development but breaks citation discipline on a thin dataroom surfaces as a failed night, not a surprised principal. Each night run costs real money. We budget for it.
We are also working on a third comparison — the same datarooms ingested the way a live engagement ingests sources, with retrieval over indexed documents rather than pre-loaded context. Customer datarooms do not arrive pre-read. As of this writing that comparison is not where we need it. I would rather say so here than imply parity we have not measured.
Seven workflows, unequal progress
All seven workflows — Manager DD, Direct IDD, ODD, Tax Brief, Co-Invest, Secondary, Governance — have a sealed dataroom behind the lightweight check. That alignment took months and is non-negotiable for us: a workflow without a dataroom is a workflow we cannot regression-test.
Nightly quality is not uniform, and I will not flatten it into a marketing number.
Secondary and Co-Invest pass the full baseline set on consecutive nightly runs — citation at 1.0, planted recall at or above the floor, truth recall in the high seventies to high nineties, counterfactual rejection clean. Governance and Tax Brief pass our first quality gate on truth, citation, and planted recall; the grader score, which measures prose quality against the gold memo, still runs below the floor we have set internally. Direct IDD and ODD cleared a stricter gate in late May — two consecutive runs with truth, citation, and planted recall all above threshold after a long prompt-and-adapter pass I will not pretend was elegant.
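The stricter gate, two consecutive runs with truth, citation, and planted recall all clearing threshold, can be expressed compactly. This is a sketch only: the floor values and metric names below are hypothetical placeholders, and treating a score equal to the floor as a pass is an assumption.

```python
from typing import Dict, List

# Hypothetical floors for illustration; not the thresholds used internally.
FLOORS = {"truth_recall": 0.70, "citation_coverage": 0.95, "planted_recall": 0.80}


def run_passes(run: Dict[str, float], floors: Dict[str, float]) -> bool:
    """A run passes only if every gated metric is at or above its floor."""
    return all(run.get(metric, 0.0) >= floor for metric, floor in floors.items())


def clears_gate(runs: List[Dict[str, float]], floors: Dict[str, float], consecutive: int = 2) -> bool:
    """True if the most recent `consecutive` runs (oldest first) all pass."""
    streak = 0
    for run in runs:
        streak = streak + 1 if run_passes(run, floors) else 0
    return streak >= consecutive


nightly = [
    {"truth_recall": 0.65, "citation_coverage": 1.00, "planted_recall": 0.90},
    {"truth_recall": 0.78, "citation_coverage": 1.00, "planted_recall": 0.90},
    {"truth_recall": 0.81, "citation_coverage": 1.00, "planted_recall": 0.85},
]

cleared = clears_gate(nightly, FLOORS)
cleared_one_night_earlier = clears_gate(nightly[:2], FLOORS)
```

A single good night does not clear the gate. The streak resets on any failed run, which is why one metric slipping below its floor sends a workflow back to the start.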
Manager DD is the outlier we are still working. The cascade completes — structural completeness reaches 1.0 — but citation coverage sits in the forties on recent nights, truth recall in the low fifties, planted recall around half. That is not a secret inside the company. It is the reason the Manager DD dataroom exists and the reason we keep investing in it. The summariser score on that workflow would look fine. The measurements we care about do not.
If you are comparing vendors, ask for this breakdown per workflow, not a single headline number.
What nightly does not prove
A passing nightly run on a sealed dataroom is necessary evidence that the workflow holds. It is not sufficient evidence that your engagement is done.
Live customer datarooms are messier than the sealed ones we score against. Filenames vary. Scans are skewed. A side letter arrives on day nine. Retrieval over indexed documents, rather than pre-loaded context, still has gaps we track separately. Every memorandum draft is reviewed by a partner before it leaves the chamber; the workflow supports that review, it does not replace it. We will not publish a scorecard or guarantee that a model transition will leave your deal unchanged. We will show you the audit log for your engagement and, on request, describe how the sealed dataroom for your workflow maps to what we measure.
The honest scope is narrow and I prefer it that way. We built these datarooms so we could measure the engine the way partners measure memoranda — by what it finds, what it refuses, and what it never writes. We run them cheaply on every change and expensively every night. Some workflows pass every floor we have set. One still does not. Live retrieval parity is the next gate. That is the actual state of the work.
For a reader who wants the methodology
The workflow argument lives in The intelligence is the workflow. The claim taxonomy lives in What an underwriter can know. The operating posture — chamber isolation, audit log, refusal vocabulary — is on Principles. The commercial shape of what we sell is on Practice.
If you are the principal at a family office evaluating us against another vendor, read those four pages in that order on the way to the meeting. Ask the other vendor what they measure. If the answer is a model name or a summariser demo, you already know enough.
— A.O.