
SERVICES / 04

Evaluation, red-teaming, and audit trails.

An agentic trading system without an eval harness is a system that is being evaluated in production, by the market, on a schedule the firm doesn't choose. This brief builds the alternative.


Audience: Builders · Regulators · Institutions
Engagement formats: F-01 · F-02
Typical duration: 3–6 weeks
Outputs: Eval harness specification and reference implementation · Red-team plan and findings · Audit-trail schema · Reviewer-grade documentation
Last reviewed: 2026-03-22

The question

Eval is the part of the agentic stack that gets cut first when timelines slip, and the part that regulators and incident reviewers ask about first.

A defensible eval harness covers the model's reasoning, the tool surface, and the orchestrator — not just the model in isolation. Most public benchmarks cover none of these.

This brief produces a working harness, a red-team report, and an audit-trail schema designed to survive review by a regulator who has never seen the system before.
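
As a rough illustration of what "covers the model's reasoning, the tool surface, and the orchestrator" can mean in practice, the sketch below tags every eval case with the layer it exercises and reports any layer that has no cases at all. It is a minimal sketch, not the reference implementation: `Layer`, `EvalCase`, `run_harness`, `uncovered_layers`, `place_order_stub`, and `MAX_RETRIES` are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional


class Layer(Enum):
    """The three layers the harness is expected to cover."""
    MODEL = "model_reasoning"
    TOOLS = "tool_surface"
    ORCHESTRATOR = "orchestrator"


@dataclass(frozen=True)
class EvalCase:
    name: str
    layer: Layer
    check: Callable[[], bool]  # returns True when behaviour is acceptable


@dataclass(frozen=True)
class EvalResult:
    name: str
    layer: Layer
    passed: bool
    error: Optional[str] = None


def run_harness(cases: List[EvalCase]) -> List[EvalResult]:
    """Run every case; a check that raises is recorded as a failure, not skipped."""
    results = []
    for case in cases:
        try:
            results.append(EvalResult(case.name, case.layer, bool(case.check())))
        except Exception as exc:
            results.append(EvalResult(case.name, case.layer, False, repr(exc)))
    return results


def uncovered_layers(cases: List[EvalCase]) -> List[Layer]:
    """List the layers that no eval case exercises."""
    covered = {case.layer for case in cases}
    return [layer for layer in Layer if layer not in covered]


# --- Hypothetical stand-ins for a real system, for illustration only ---
MAX_RETRIES = 3  # assumed orchestrator setting


def place_order_stub(quantity: int) -> str:
    """Toy tool: a real order tool would sit behind this interface."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return "accepted"


def tool_rejects_non_positive_quantity() -> bool:
    try:
        place_order_stub(-5)
    except ValueError:
        return True
    return False


cases = [
    EvalCase("tool rejects non-positive order size", Layer.TOOLS,
             tool_rejects_non_positive_quantity),
    EvalCase("orchestrator retry budget is bounded", Layer.ORCHESTRATOR,
             lambda: MAX_RETRIES <= 3),
]
results = run_harness(cases)
gaps = uncovered_layers(cases)  # the model-reasoning layer has no cases here
```

The coverage check is the point of the example: a suite can pass every case and still leave an entire layer untested, and that gap is what a reviewer is likely to look for.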

What this produces

  1. An eval harness specification and reference implementation.
  2. A red-team plan covering adversarial prompts, tool misuse, and orchestrator-level failure modes.
  3. A written findings report with severity grading and remediation paths.
  4. An audit-trail schema scoped to the obligations the system is meeting.
  5. Reviewer-grade documentation explaining the harness to someone who has not built it.
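
To make the audit-trail schema in item 4 more concrete, here is one possible shape for a single record, sketched under stated assumptions. The field names (`actor`, `action`, `obligation`, and so on) are hypothetical, and the SHA-256 hash chain is a common tamper-evidence technique chosen for illustration; the brief itself does not prescribe either.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import List

GENESIS = "0" * 64  # placeholder hash preceding the first record


@dataclass(frozen=True)
class AuditRecord:
    sequence: int
    timestamp: str    # ISO 8601, UTC
    actor: str        # e.g. "model", "tool:order_router", "orchestrator"
    action: str
    payload: dict
    obligation: str   # hypothetical label for the obligation this record evidences
    prev_hash: str
    record_hash: str = ""


def _digest(record: AuditRecord) -> str:
    """Hash every field except record_hash, using a canonical JSON encoding."""
    body = asdict(record)
    body.pop("record_hash")
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append(trail: List[AuditRecord], *, timestamp: str, actor: str,
           action: str, payload: dict, obligation: str) -> List[AuditRecord]:
    """Return a new trail with one sealed record linked to the previous one."""
    prev_hash = trail[-1].record_hash if trail else GENESIS
    draft = AuditRecord(len(trail), timestamp, actor, action,
                        payload, obligation, prev_hash)
    return trail + [replace(draft, record_hash=_digest(draft))]


def verify(trail: List[AuditRecord]) -> bool:
    """Check ordering, linkage, and content hashes for the whole trail."""
    prev = GENESIS
    for index, record in enumerate(trail):
        if (record.sequence != index
                or record.prev_hash != prev
                or record.record_hash != _digest(record)):
            return False
        prev = record.record_hash
    return True


trail: List[AuditRecord] = []
trail = append(trail, timestamp="2026-03-22T09:00:00Z", actor="model",
               action="propose_order", payload={"qty": 10},
               obligation="pre-trade-check")
trail = append(trail, timestamp="2026-03-22T09:00:01Z", actor="orchestrator",
               action="approve_order", payload={"qty": 10},
               obligation="pre-trade-check")
```

Because each record carries the hash of the one before it, a reviewer who has never seen the system can call `verify(trail)` to confirm that no record was edited, dropped, or reordered after the fact, without trusting the process that wrote the log.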

How it works

Three methodology steps from the standing approach, scoped to this brief.

  1. Frame

    Read the regulator filings, the codebase, or the internal memo. Write the question that is actually being asked.

  2. Build the artifact

    The artifact named in Outputs, above. Working notes during the build are visible.

  3. Hand it off

    A meeting, not a link. Six weeks of follow-up Q&A is included.

What it’s not

This is not a continuous evaluation operations function — once delivered, the harness is the firm's to run.

This is not a model benchmark report — public benchmark numbers do not appear in this brief.

This does not replace formal model-risk-management procedures the firm may already have.

Begin

Send the question.

Contact form · Schedule a 30-minute call