The intelligence is the workflow.

On why the model is the cheapest part of a diligence engine, and what has to be built instead.

Alex Ouellet · 20 February 2026

We get asked which model we use. The question comes up on most intro calls with family offices, and twice now from investors doing their own diligence on us. I never give a useful answer, and I want to write down why.

The model we use today is not the model we used six months ago. It probably isn't the model we'll use six months from now. Saying which one would put the wrong question in the reader's head. The right question, the one that decides whether the product holds, is what we do when it changes.

We've answered that one in code, not in a marketing line, and the answer is what this letter is about.

The harness

We have an internal harness. It's a curated set of real diligence questions, with the answers an experienced underwriter would write down: a continuation-vehicle term sheet, a private-credit covenant pack, an operating partner's reference notes, a sponsor's marketing book paired with the deal-level returns it omits. Most of the answers in the harness are refusals — the kind of "the materials don't support this" sentence I would write by hand on a Friday afternoon. The harness runs every night and scores the run.

We've had three model transitions in the past fourteen months. Two were announced. The third wasn't; the provider rolled a quiet update one Wednesday, we noticed the harness twitch on Thursday morning, and by Friday we'd pulled it apart enough to know what shifted. In all three transitions the public benchmarks moved by ten or twenty points. The harness moved by three or four. After the quiet update, one fixture dropped seven points and three others gained two each; a week in the prompt registry got the loss back.

That's the experiment. It hasn't run long enough to call it conclusive, but the shape has held.
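The Thursday-morning twitch can be sketched as a per-fixture comparison between two nightly runs. This is a minimal illustration, not the harness itself: the fixture names, the scores, and the five-point threshold are all assumptions, chosen only to mirror the shape described above (one fixture down seven, three up two each).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FixtureDelta:
    fixture: str
    before: float
    after: float

    @property
    def change(self) -> float:
        return self.after - self.before


def compare_runs(before: dict[str, float], after: dict[str, float],
                 threshold: float = 5.0) -> tuple[float, list[FixtureDelta]]:
    """Return the mean change across shared fixtures and the fixtures
    that dropped by at least `threshold` points, worst first."""
    deltas = [FixtureDelta(n, before[n], after[n]) for n in sorted(before) if n in after]
    regressions = sorted((d for d in deltas if d.change <= -threshold),
                         key=lambda d: d.change)
    aggregate = mean(d.change for d in deltas) if deltas else 0.0
    return aggregate, regressions


# Hypothetical scores, for illustration only.
previous = {"fixture_a": 80.0, "fixture_b": 70.0, "fixture_c": 65.0, "fixture_d": 75.0}
latest = {"fixture_a": 73.0, "fixture_b": 72.0, "fixture_c": 67.0, "fixture_d": 77.0}

aggregate, regressions = compare_runs(previous, latest)
print(f"mean change: {aggregate:+.2f}")
for d in regressions:
    print(f"{d.fixture}: {d.before:.0f} -> {d.after:.0f} ({d.change:+.0f})")
```

The aggregate barely moves in this example while one fixture clearly regresses, which is why a per-fixture view is worth keeping next to the headline number.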

[Figure: the model ("the cheapest input", anonymous and interchangeable) set against the system that survives a model change: harness, prompt registry, schema, audit log, refusal vocabulary. Sample readout: Harness · Manager IDD, 71%. Caption: The parameter changes. The system holds.]

The thing I'd want a fund partner to take from it is small. The model is the cheapest input we touch. Everything else — the harness above, the prompt registry, the schema the memorandum is rendered against, the audit log, the refusal vocabulary, the validator that drops a claim without a citation — was written by us, lives in the same repository as this letter, and survives any single model transition.

That's the asset.

A fixture, in detail

The fixture I want to describe is a continuation vehicle, because the work it asks for is concrete enough to be honest about.

A continuation vehicle is what a sponsor does near the end of a fund's life when they want to keep one of the portfolio companies. The sponsor sells the asset from the old fund to a new vehicle the same sponsor manages. New LPs in the new vehicle. The same operating team running the asset. The pricing gets set by the sponsor against a fairness opinion the sponsor's own counsel commissions.

There are eight things a careful diligence on one of these has to check. Three are visible in the materials the sponsor sends. The other five aren't, and those are the hard ones. Whether the LPs in the old fund got a real choice, or whether the rollover document was structured so a "no" was effectively a "yes." Whether the fairness opinion's comparables include trades that happened in distress. Whether the operating team's new strike resets to a number that would, in public markets, get called a giveaway. There are two more that I won't put on this page because they vary too much by structure.

An engine that reads what the sponsor sends and writes down the three visible checks is a summariser. An engine that reads the same materials, refuses to make claims about the five invisible checks, and writes down in plain prose why it refused, is a memorandum.

Our harness measures both. The summariser score is a precondition. The refusal score is the one we tune to.

I think today's model is roughly competent at the first and uneven at the second. The model eighteen months ago was poor at both. I expect the model eighteen months from now to be excellent at the first and not much better at the second, because the discipline that's missing in the second isn't a property of the model. It's the discipline of dropping a claim that can't find a citation. We had to build that ourselves.
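The relationship between the two scores can be sketched as follows. This is an assumption-laden illustration rather than the harness's actual scoring: the `CheckOutcome` type, the `"claimed"`/`"refused"` labels, and the rule that the summariser score must be perfect before the refusal score counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckOutcome:
    check_id: str
    supported: bool  # True when the sponsor's materials can support a claim
    outcome: str     # "claimed" or "refused"


def fixture_scores(outcomes: list[CheckOutcome]) -> tuple[float, float]:
    """Summariser score: share of supported checks the engine wrote down.
    Refusal score: share of unsupported checks the engine refused."""
    supported = [o for o in outcomes if o.supported]
    unsupported = [o for o in outcomes if not o.supported]
    summariser = (sum(o.outcome == "claimed" for o in supported) / len(supported)
                  if supported else 0.0)
    refusal = (sum(o.outcome == "refused" for o in unsupported) / len(unsupported)
               if unsupported else 0.0)
    return summariser, refusal


def tuned_score(outcomes: list[CheckOutcome],
                min_summariser: float = 1.0) -> Optional[float]:
    """The refusal score only counts once the summariser precondition is met."""
    summariser, refusal = fixture_scores(outcomes)
    return refusal if summariser >= min_summariser else None


# Eight checks: three supported by the materials, five not (hypothetical run).
run = (
    [CheckOutcome(f"visible_{i}", True, "claimed") for i in range(3)]
    + [CheckOutcome(f"invisible_{i}", False, "refused") for i in range(4)]
    + [CheckOutcome("invisible_4", False, "claimed")]  # an unsupported claim slipped through
)
print(fixture_scores(run), tuned_score(run))
```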

Citation, refusal, the log

Three practices fall out of that posture. They're easier to describe than to install — none of them is exotic, all of them took longer than I expected.

Citations come first. Every claim the engine writes carries an inline anchor to a span: filename, page, paragraph. If retrieval doesn't return a span the model is allowed to cite, the validator drops the claim before it reaches a reviewer. A reviewer in a hurry can miss a claim. They can't accept one that isn't there.
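A post-write validator of this kind can be sketched in a few lines. The `Span` and `Claim` types and the `validate` function are hypothetical names used for illustration; the only behaviour taken from the description above is that a claim without a retrieved span to cite is dropped before review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    filename: str
    page: int
    paragraph: int


@dataclass(frozen=True)
class Claim:
    text: str
    anchor: Optional[Span]  # None when the model cited nothing


def validate(claims: list[Claim], retrieved: set[Span]) -> tuple[list[Claim], list[Claim]]:
    """Split claims into (kept, dropped). A claim is kept only if its
    anchor is one of the spans retrieval actually returned."""
    kept: list[Claim] = []
    dropped: list[Claim] = []
    for claim in claims:
        if claim.anchor is not None and claim.anchor in retrieved:
            kept.append(claim)
        else:
            dropped.append(claim)
    return kept, dropped


# Hypothetical example data.
retrieved = {Span("term_sheet.pdf", 4, 2)}
claims = [
    Claim("The vehicle has a five-year term.", Span("term_sheet.pdf", 4, 2)),
    Claim("Old-fund LPs were offered a status-quo option.", None),
    Claim("The strike resets at close.", Span("side_letter.pdf", 1, 1)),  # span never retrieved
]
kept, dropped = validate(claims, retrieved)
print(len(kept), "kept;", len(dropped), "dropped")
```

Returning the dropped claims as well as the kept ones lets the drop itself be recorded rather than silently lost.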

Refusals come second. The schema requires a short reason in a refused field, written in the partner's prose. Materials don't contain the basis for an answer. The question is one for counsel. The comparable is too thin. We don't allow "I cannot determine," because "I cannot determine" is a model talking; the refusal has to sound like a partner. After a few engagements with us, a partner reads the absence of a sentence as a finding, which is how the trained eye works in this business already.
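One way such a schema rule could be checked is sketched below. The `MemoField` type, the 200-character limit, and the particular phrases treated as "a model talking" are assumptions for illustration; the real refusal vocabulary is not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed pattern for model-voice refusals; a real list would be longer.
MODEL_VOICE = re.compile(r"\bI\s+(?:cannot|can't|am unable to)\b", re.IGNORECASE)
MAX_REASON_CHARS = 200  # assumed meaning of "short"


@dataclass(frozen=True)
class MemoField:
    name: str
    answer: Optional[str] = None
    refused: Optional[str] = None  # the reason, in the partner's prose


def refusal_problems(field: MemoField) -> list[str]:
    """Return a list of schema problems; an empty list means the field passes."""
    if (field.answer is None) == (field.refused is None):
        return ["exactly one of answer or refused must be set"]
    if field.refused is None:
        return []
    problems: list[str] = []
    reason = field.refused.strip()
    if not reason:
        problems.append("a refusal needs a reason")
    if len(reason) > MAX_REASON_CHARS:
        problems.append("the reason is too long")
    if MODEL_VOICE.search(reason):
        problems.append("the reason reads like a model, not a partner")
    return problems


ok = MemoField("lp_election", refused="The materials don't contain the basis for an answer.")
bad = MemoField("lp_election", refused="I cannot determine this from the documents.")
print(refusal_problems(ok), refusal_problems(bad))
```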

The log is the third. Every step of a workflow run records what was retrieved, what was sent to the model, what came back, which fields were coerced into the schema, which were dropped, which were signed. A general counsel can read it on a quiet morning. A reviewer can replay it. A regulator, when one shows up, can read it.
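A log with those properties can be as plain as one JSON record per step, appended in order and read back for replay. The sketch below assumes a hypothetical `StepRecord` shape whose fields follow the sentence above; it writes to an in-memory buffer so it runs without a file or database.

```python
import io
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    step: str
    retrieved: list[str]   # identifiers of the spans retrieval returned
    prompt: str            # what was sent to the model
    response: str          # what came back
    coerced: list[str] = field(default_factory=list)  # fields coerced into the schema
    dropped: list[str] = field(default_factory=list)  # fields dropped by the validator
    signed: list[str] = field(default_factory=list)   # fields a reviewer signed


def append(log: io.TextIOBase, record: StepRecord) -> None:
    """Append one step as a single JSON line."""
    log.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def replay(log: io.TextIOBase) -> list[StepRecord]:
    """Read every step back, in the order it was written."""
    log.seek(0)
    return [StepRecord(**json.loads(line)) for line in log if line.strip()]


# Hypothetical run with two steps.
log = io.StringIO()
first = StepRecord("retrieve", ["term_sheet.pdf#p4.2"], prompt="", response="")
second = StepRecord("draft", ["term_sheet.pdf#p4.2"], prompt="Summarise the term.",
                    response="The vehicle has a five-year term.",
                    dropped=["lp_election"], signed=["term"])
append(log, first)
append(log, second)
print([r.step for r in replay(log)])
```

One line per step keeps the log readable by a person with a text editor, which matters more here than compactness.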

Two operating rules sit underneath the whole thing. The first (docs/adr/0017-cypress-eval-in-process-extractor.md, "Cypress eval uses in-process extractor + chunker, not live Supabase.") is that the continuous-integration eval reads documents through the same extractor and chunker the production engine uses, but doesn't touch a live database. A regression in the extractor has to show up in the eval. A flaky database can't be allowed to fail it. The eval measures the engine, not its surroundings.
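The shape of that first rule can be sketched as an eval that takes the extractor and chunker as ordinary in-process functions and holds fixtures in memory. Everything named here is a stand-in: `extract`, `chunk`, and `run_eval` are hypothetical, and a phrase-survival check is a much cruder score than a real eval would use.

```python
from typing import Callable


def extract(raw: bytes) -> str:
    """Stand-in for the production extractor."""
    return raw.decode("utf-8")


def chunk(text: str) -> list[str]:
    """Stand-in for the production chunker: split on blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def run_eval(documents: dict[str, bytes], expected: dict[str, str],
             extractor: Callable[[bytes], str] = extract,
             chunker: Callable[[str], list[str]] = chunk) -> float:
    """Share of fixtures whose expected phrase survives extraction and
    chunking. No network or database access: fixtures live in memory."""
    if not expected:
        return 0.0
    hits = 0
    for name, phrase in expected.items():
        chunks = chunker(extractor(documents[name]))
        hits += any(phrase in c for c in chunks)
    return hits / len(expected)


# Hypothetical fixture.
documents = {"term_sheet.txt": b"Five-year term.\n\nThe strike resets at close."}
expected = {"term_sheet.txt": "strike resets"}

healthy = run_eval(documents, expected)
# A deliberately broken extractor that truncates the text.
regressed = run_eval(documents, expected, extractor=lambda raw: raw.decode("utf-8")[:10])
print(healthy, regressed)
```

A regression in the extractor moves the score, and nothing outside the process can.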

The second (docs/adr/0018-real-chamber-nightly-eval.md, "Real-chamber nightly eval — GH Actions, 07:00 UTC, against the canned fixture.") is that we run the real chambers, against the real fixtures, every night at seven UTC. The night run is expensive. It calls the model the way a customer call does, on a fixture roughly the size of a private-credit data room. The cheap CI gate doesn't catch model drift. The nightly run does. When something moves under us, the nightly run is the first place that says so.

Without either rule, the workflow is something I claim to have built. With both, I can show it.

For a reader on the way to a meeting

If you're the principal at a family office, this is the part of the letter to look at on the way to a meeting with a vendor.

When the vendor tells you their product is "powered by" a particular model, they're telling you the model is the thing. They want you to trust the model. What they aren't telling you, and what isn't on the deck, is what happens when that model changes on a schedule they don't control. They aren't telling you what happens when the model refuses a claim it should have made, or writes a claim it shouldn't have written. They aren't telling you whether the prompt that produced the demo will produce the answer for your deal six months from now, after the model has been retrained on a corpus you can't see.

The useful question isn't which model. It's how the vendor's product holds up when that model rotates. The honest answer is one of two things. Either there's a harness, a registry, an audit log, and a refusal vocabulary the vendor can show you, or the vendor talks about being thoughtful and pivots back to the model name.

For the investor doing diligence on us as a company, the same question applies. What we're building isn't a wrapper on top of a model. The wrapper, the harness, the registry, the audit log, the validator, the refusal vocabulary, the workflow that strings them together — that's the company. The model is a parameter inside it, and a small one.

The pushback I hear most often is that the frontier moves quickly, and so the workflow we've built will have to be rewritten in a year. Some of it, yes. The post-write validator will get simpler when the model is better at staying inside its citations. The retrieval prompts will get shorter. The reviewer interface will get more direct. None of those are the work, though. The work is the eight checks on a continuation vehicle, the five that aren't visible in the room, the verb tense that distinguishes a fact from an inference, the discipline of dropping a sentence the materials don't support. None of that moves because the model got a better tokenizer.

On not naming the model

Which model do we use, then. The one that scored highest on the harness when we shipped this quarter's release. Probably a different one in the next release. If the customer's chamber is configured for AWS Bedrock the model resolves there; if it's a sovereign deployment the model is an open-weights one the customer runs on their own hardware. The audit log records each resolution.
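Treating the model as a parameter might look like the sketch below. The `Chamber` and `Resolution` types, the deployment labels, and the model identifiers are placeholders invented for illustration; none of them names a real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chamber:
    name: str
    deployment: str  # hypothetical labels: "bedrock" or "sovereign"


@dataclass(frozen=True)
class Resolution:
    chamber: str
    deployment: str
    model_id: str


def resolve(chamber: Chamber, release_models: dict[str, str],
            audit: list[Resolution]) -> Resolution:
    """Pick the model this release maps to the chamber's deployment,
    and record the choice before returning it."""
    if chamber.deployment not in release_models:
        raise ValueError(f"no model configured for deployment {chamber.deployment!r}")
    resolution = Resolution(chamber.name, chamber.deployment,
                            release_models[chamber.deployment])
    audit.append(resolution)
    return resolution


# Placeholder identifiers, not real model names.
release_models = {"bedrock": "hosted-model-placeholder",
                  "sovereign": "open-weights-placeholder"}
audit: list[Resolution] = []

hosted = resolve(Chamber("office_a", "bedrock"), release_models, audit)
on_prem = resolve(Chamber("office_b", "sovereign"), release_models, audit)
print(hosted.model_id, on_prem.model_id, len(audit))
```

Changing the model for a release is then an edit to one mapping, and every resolution leaves a record.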

The intelligence is the workflow. I wish I had a better line for it, because it's the kind of phrase a vendor would put on a homepage, which is why I've left it off ours. The substance of what a customer pays for is the work I described above, which is harder to point to than a model name but is the only honest answer to the question I get asked most.

— A.O.
