Eval is the part of the agentic stack that gets cut first when timelines slip, and the part that regulators and incident reviewers ask about first.
A defensible eval harness covers the model's reasoning, the tool surface, and the orchestrator — not just the model in isolation. Most public benchmarks cover none of these.
This brief produces a working harness, a red-team report, and an audit-trail schema designed to survive review by a regulator who has never seen the system before.
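
As a rough illustration of covering all three layers named above, the sketch below tags each eval case with the layer it exercises and reports any layer that has no cases. Every name here (`Layer`, `EvalCase`, `AuditRecord`, `run_suite`, `uncovered_layers`) is hypothetical and chosen for this example; it is not the harness or the audit-trail schema this brief delivers, only a minimal picture of the coverage idea.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Set


class Layer(Enum):
    """The three layers a harness should exercise (names are illustrative)."""
    MODEL_REASONING = "model_reasoning"
    TOOL_SURFACE = "tool_surface"
    ORCHESTRATOR = "orchestrator"


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    layer: Layer
    check: Callable[[], bool]  # returns True when the behaviour under test passes


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical stand-in for one audit-trail entry: one record per case run."""
    case_id: str
    layer: str
    passed: bool


def run_suite(cases: List[EvalCase]) -> List[AuditRecord]:
    """Run every case and return one audit record per case, in input order."""
    return [AuditRecord(c.case_id, c.layer.value, bool(c.check())) for c in cases]


def uncovered_layers(cases: List[EvalCase]) -> Set[Layer]:
    """Return the layers that no case in the suite exercises."""
    return set(Layer) - {c.layer for c in cases}


# Example suite with placeholder checks: two layers covered, one missing.
cases = [
    EvalCase("plan-001", Layer.MODEL_REASONING, lambda: True),
    EvalCase("tool-001", Layer.TOOL_SURFACE, lambda: False),
]
records = run_suite(cases)
missing = uncovered_layers(cases)  # the orchestrator layer has no cases here
```

A check like `uncovered_layers` can gate a release: a suite that passes every case but leaves a layer empty has evaluated the model in isolation, which is the gap described above.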
