Evaluation, red-teaming, and audit trails

An agentic trading system without an eval harness is a system that is being evaluated in production, by the market, on a schedule the firm doesn't choose. This brief builds the alternative.

Audience

Builders · Regulators · Institutions

Engagement formats

F-01 · F-02

Typical duration

3–6 weeks

Outputs

Eval harness specification and reference implementation · Red-team plan and findings · Audit-trail schema · Reviewer-grade documentation

Last reviewed

2026-03-22

The question

Eval is the part of the agentic stack that gets cut first when timelines slip, and the part regulators and incident reviewers ask about first.

A defensible eval harness covers the model's reasoning, the tool surface, and the orchestrator — not just the model in isolation. Most public benchmarks cover none of these.
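To make the three-layer coverage concrete, here is a minimal sketch of a harness that tags every eval case with the layer it exercises and reports which layers have no cases at all. Everything in it is hypothetical and for illustration only: the `Layer`, `EvalCase`, `run_harness`, and `uncovered_layers` names, the `place_order` stub, and the `MAX_ORDER_QTY` limit are assumptions, not part of any delivered harness.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Layer(Enum):
    """The three surfaces the harness is expected to cover."""
    MODEL = "model"
    TOOLS = "tools"
    ORCHESTRATOR = "orchestrator"


@dataclass(frozen=True)
class EvalCase:
    name: str
    layer: Layer
    check: Callable[[], bool]  # returns True when the behaviour is acceptable


def run_harness(cases: list[EvalCase]) -> dict[Layer, dict[str, int]]:
    """Run every case and tally passes and failures per layer."""
    report = {layer: {"passed": 0, "failed": 0} for layer in Layer}
    for case in cases:
        try:
            ok = bool(case.check())
        except Exception:
            ok = False  # a crashing check counts as a failure, not a skip
        report[case.layer]["passed" if ok else "failed"] += 1
    return report


def uncovered_layers(report: dict[Layer, dict[str, int]]) -> list[Layer]:
    """Layers with zero cases: the gaps a reviewer would ask about."""
    return [layer for layer, tally in report.items()
            if tally["passed"] + tally["failed"] == 0]


# Hypothetical tool stub with an assumed order-size limit.
MAX_ORDER_QTY = 1_000


def place_order(qty: int) -> str:
    if qty <= 0 or qty > MAX_ORDER_QTY:
        raise ValueError("order quantity outside permitted range")
    return "accepted"


def tool_rejects_oversized_order() -> bool:
    try:
        place_order(MAX_ORDER_QTY + 1)
    except ValueError:
        return True
    return False


cases = [EvalCase("tool rejects oversized order", Layer.TOOLS,
                  tool_rejects_oversized_order)]
report = run_harness(cases)
print(uncovered_layers(report))  # MODEL and ORCHESTRATOR have no cases yet
```

With only a tool-level case registered, the report shows the model and orchestrator layers as uncovered. A suite that tests a single layer has the same gap, whichever layer it tests.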

This brief produces a working harness, a red-team report, and an audit-trail schema designed to survive review by a regulator who has never seen the system before.

What this produces

01 An eval harness specification and reference implementation.
02 A red-team plan covering adversarial prompts, tool misuse, and orchestrator-level failure modes.
03 A written findings report with severity grading and remediation paths.
04 An audit-trail schema scoped to the obligations the system is meeting.
05 Reviewer-grade documentation explaining the harness to someone who has not built it.
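As an illustration of what an audit-trail record might look like, the sketch below chains each record to the hash of the previous one so that later edits to earlier records become detectable. Hash chaining is an assumption chosen for the example, not something this brief specifies. The field names, the `append` and `verify` helpers, the timestamps, and the `XYZ` symbol are all hypothetical. A real schema would be scoped to the firm's actual obligations.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

GENESIS = "0" * 64  # assumed placeholder hash for the first record


@dataclass(frozen=True)
class AuditRecord:
    seq: int
    timestamp: str   # ISO 8601, UTC
    actor: str       # e.g. "model", "tool:<name>", "orchestrator"
    action: str
    payload: dict
    prev_hash: str   # digest of the preceding record

    def digest(self) -> str:
        # Canonical JSON so the same record always hashes the same way.
        body = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(trail, timestamp, actor, action, payload):
    """Return a new trail with one more record linked to the last one."""
    prev = trail[-1].digest() if trail else GENESIS
    record = AuditRecord(len(trail), timestamp, actor, action, payload, prev)
    return trail + [record]


def verify(trail) -> bool:
    """Check sequence numbers and hash links from the first record onward."""
    prev = GENESIS
    for i, record in enumerate(trail):
        if record.seq != i or record.prev_hash != prev:
            return False
        prev = record.digest()
    return True


trail = []
trail = append(trail, "2026-01-05T09:30:00Z", "model", "propose_order",
               {"symbol": "XYZ", "qty": 100})
trail = append(trail, "2026-01-05T09:30:01Z", "tool:place_order",
               "order_accepted", {"symbol": "XYZ", "qty": 100})
print(verify(trail))     # True

# Editing an earlier record breaks the link held by the record after it.
tampered = [replace(trail[0], payload={"symbol": "XYZ", "qty": 10_000}), trail[1]]
print(verify(tampered))  # False
```

One limitation of this sketch: a change to the most recent record is only detectable if its digest is also stored somewhere outside the trail.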

How it works

Three methodology steps from the standing approach, scoped to this brief.

01 Frame
Read the regulator filings, the codebase, or the internal memo. Write the question that is actually being asked.

02 Build the artifact
The artifact named in Outputs, above. Working notes during the build are visible.

03 Hand it off
A meeting, not a link. Six weeks of follow-up Q&A is included.

What it’s not

This is not a continuous evaluation operations function — once delivered, the harness is the firm's to run.

This is not a model benchmark report — public benchmark numbers do not appear in this brief.

This does not replace formal model-risk-management procedures the firm may already have.
