What does it actually mean to prove what an AI agent did?

Three properties together: a contemporaneous record (captured at the time of the action, not reconstructed later), a tamper-evident mechanism (retroactive edits are mathematically detectable), and a regulator-acceptable format (the evidence presented in a structure an auditor, regulator or court accepts as valid). All three are required. Two out of three is not sufficient.

Why is application logging not enough?

Application logs capture HTTP requests, stack traces and performance metrics — the wrong fields. They have no agent-level decision context (which sub-agent? which resource? what data classification?). They are mutable: anyone with write access to the log store can edit them and there is no cryptographic detection. They are scattered across multiple systems with inconsistent retention policies. A regulator presented with application logs as Article 12 evidence will typically reject them.

What is the minimum evidence I need for EU AI Act Article 12?

Article 12 requires high-risk AI systems to record events automatically over the system's lifetime, so that the logs enable identification of risk situations under Article 79, post-market monitoring under Article 72, and monitoring of operation under Article 26. In practice: every agent action captured at the time, retained for at least six months (the retention floor set out in Articles 19 and 26(6)), tamper-evident, and producible as a focused evidence pack for any specified system and period on a regulator's request.

What's the fastest path to compliance for a fintech with agents already in production?

Five steps and roughly one week of engineering time: (1) install the SDK and wrap your agent runtime, (2) confirm receipts are flowing in the operator dashboard, (3) configure the EU AI Act Article 12 pack template against your agent inventory, (4) enable RFC 3161 notarisation, (5) walk a sample pack through with your auditor and adjust the agent classification metadata they want surfaced. From zero to a defensible Article 12 evidence trail.

How to prove what an AI agent did — a practical guide

8 June 2026 · 11 min read · By Hak Turkel

The question "what did the AI do?" gets asked in four very different contexts: an ICO subject access request, an FCA SYSC operational resilience review, a Lloyd's cyber renewal questionnaire, and an adverse-event class action. The teams that can answer it confidently have the same building blocks. The teams that cannot tend to share a different pattern: an over-reliance on application logs that were never designed to answer compliance questions.

Why application logging fails the question.

Three reasons, in increasing order of severity.

The wrong fields. Application logs are designed to debug infrastructure: HTTP method, status code, latency, trace ID. They are not designed to evidence a decision: which sub-agent fired, what data classification it touched, what the model's confidence was, what tool calls it made, what the final decision was, whether a human approved or overrode it. You can stuff some of this into structured logs, but you are retrofitting evidence onto a debugging tool.
Mutability. Any engineer with write access to your log store can edit, delete or backdate entries. There is no cryptographic mechanism that detects edits. A regulator presented with edit-capable logs as evidence of past behaviour will treat them as informative but not conclusive — and an adversarial counterparty will argue exactly that.
Scatter. Real production AI agents span six to ten systems: model provider logs, application logs, infrastructure logs, database audit logs, message broker, vector store, vector cache, retrieval layer. Each has a different retention policy, a different access-control model, and a different operator. Even producing a complete timeline for one agent run takes hours of correlation work. Repeating that by hand across hundreds of users for a SAR is not practical.

None of these is fixable by writing more application code. They are structural properties of using observability infrastructure for a compliance question.

The five-step pattern.

Every team that ends up with a defensible AI agent audit trail builds the same five things. The shapes differ but the categories are identical.

1. Instrument the agent runtime.

Capture has to happen inside the agent loop, not next to it. Wrap the runtime — the OpenAI Agents SDK, the Claude Agent SDK, an MCP server, LangChain, CrewAI — so every tool call, model call, decision and sub-agent spawn produces a structured event at the moment it happens. Runtime-layer capture cannot be bypassed by the agent's own logic; application-layer instrumentation can.
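As a rough illustration of what wrapping a runtime can look like, the sketch below uses a plain Python decorator to record one structured event per tool call, at the moment of the call and whether or not it succeeds. The names `instrumented` and `EVENTS` are hypothetical and do not come from any of the SDKs listed above; a real integration would hook the runtime's own callback or middleware layer rather than decorate individual functions.

```python
import functools
import json
import time
import uuid
from typing import Any, Callable

# Hypothetical in-memory sink; a real deployment would write to durable storage.
EVENTS: list[dict[str, Any]] = []

def instrumented(agent_id: str, session_id: str, action_type: str = "tool_call"):
    """Wrap a callable so that every invocation records a structured event."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            event = {
                "event_id": uuid.uuid4().hex,
                "agent_id": agent_id,
                "session_id": session_id,
                "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                "action": {"type": action_type, "name": fn.__name__},
                "status": "ok",
            }
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                event["status"] = f"error:{type(exc).__name__}"
                raise
            finally:
                # Recorded on success and on failure, so no call goes unlogged.
                EVENTS.append(event)
        return wrapper
    return decorator

@instrumented(agent_id="claims-triage-v3", session_id="claim-2025-09-001")
def customer_record_read(customer_id: str) -> dict[str, str]:
    # Stand-in for a real tool implementation.
    return {"id": customer_id, "status": "active"}

customer_record_read("cust_abc123")
print(json.dumps(EVENTS[-1], indent=2))
```

The `finally` clause is the design point: the event is appended on every code path, so a failing tool call still leaves a record.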

2. Emit a signed receipt per action.

A receipt is a small, well-defined record:

{
  "event_id":      "01J6Q7T8K3N4P5R6S7V8W9XAYZ",
  "agent_id":      "claims-triage-v3",
  "session_id":    "claim-2025-09-001",
  "ts":            "2026-09-08T14:14:22.331Z",
  "action":        { "type":"tool_call", "name":"customer_record_read" },
  "resource":      { "type":"customer", "id":"cust_abc123",
                     "classification":["PII","financial"] },
  "redacted_input":  "...PII stripped at the SDK before leaving perimeter...",
  "redacted_output": "...",
  "body_hash":     "sha256:..."
}

PII is stripped at the SDK before leaving the customer perimeter — the backend never sees raw email addresses, card numbers or government IDs. The receipt's body_hash is the SHA-256 of the canonical receipt body.
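A minimal sketch of those two steps, redaction and hashing, might look like the following. The canonical form used here (JSON with sorted keys and no insignificant whitespace) is an assumption, as the exact canonicalisation is not specified above, and the single email pattern is only a placeholder for a real redaction rule set. The helper names `redact`, `canonical_bytes` and `seal` are hypothetical. Signing the receipt is not shown, because the Python standard library has no asymmetric signature primitives.

```python
import hashlib
import json
import re
from typing import Any

# Hypothetical redaction rule: a real SDK would cover many more PII patterns.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(text: str) -> str:
    """Replace email addresses before the text leaves the caller's process."""
    return EMAIL_RE.sub("[EMAIL]", text)

def canonical_bytes(body: dict[str, Any]) -> bytes:
    """Assumed canonical form: sorted keys, compact separators, UTF-8."""
    return json.dumps(body, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def seal(receipt: dict[str, Any]) -> dict[str, Any]:
    """Return a copy with body_hash computed over every other field."""
    body = {k: v for k, v in receipt.items() if k != "body_hash"}
    digest = hashlib.sha256(canonical_bytes(body)).hexdigest()
    return {**body, "body_hash": f"sha256:{digest}"}

receipt = seal({
    "event_id": "01J6Q7T8K3N4P5R6S7V8W9XAYZ",
    "agent_id": "claims-triage-v3",
    "session_id": "claim-2025-09-001",
    "action": {"type": "tool_call", "name": "customer_record_read"},
    "redacted_input": redact("Look up the account for jane@example.com"),
})
print(receipt["redacted_input"])
print(receipt["body_hash"])
```

Because the keys are sorted before hashing, two parties that serialise the same receipt with fields in a different order still arrive at the same `body_hash`.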

3. Hash-chain the receipts.

Each receipt embeds the previous receipt's body hash as its prev_hash field. The chain head per (agent, session) is the rolling SHA-256 of every receipt in the session, in order. A single-byte edit anywhere breaks every subsequent hash. The chain provides internal cryptographic consistency: the records are mathematically bound to each other, so none of them can be altered without invalidating the chain head.
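The chaining can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SDK's implementation: the chain head is taken to be the most recent `body_hash` (which transitively commits to every earlier receipt), the first receipt uses an all-zero `prev_hash`, and the helper names `append` and `verify` are hypothetical.

```python
import hashlib
import json
from typing import Any, Optional

# Assumed placeholder prev_hash for the first receipt in a session.
GENESIS = "sha256:" + "0" * 64

def body_hash(body: dict[str, Any]) -> str:
    """SHA-256 over an assumed canonical JSON form of the receipt body."""
    raw = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(raw).hexdigest()

def append(chain: list[dict[str, Any]], body: dict[str, Any]) -> str:
    """Link a new receipt to the current head and return the new head."""
    prev = chain[-1]["body_hash"] if chain else GENESIS
    linked = {**body, "prev_hash": prev}
    linked["body_hash"] = body_hash(linked)  # hashed before body_hash is added
    chain.append(linked)
    return linked["body_hash"]

def verify(chain: list[dict[str, Any]]) -> Optional[int]:
    """Replay the chain; return the index of the first bad receipt, or None."""
    prev = GENESIS
    for i, receipt in enumerate(chain):
        body = {k: v for k, v in receipt.items() if k != "body_hash"}
        if receipt.get("prev_hash") != prev or receipt.get("body_hash") != body_hash(body):
            return i
        prev = receipt["body_hash"]
    return None

chain: list[dict[str, Any]] = []
for name in ("customer_record_read", "risk_score", "decision_emit"):
    head = append(chain, {"agent_id": "claims-triage-v3", "action": name})
intact = verify(chain)

# Tamper with the first receipt: its own hash no longer matches.
chain[0]["action"] = "customer_record_write"
after_edit = verify(chain)

# Recomputing that one hash only moves the break to the next receipt.
chain[0]["body_hash"] = body_hash({k: v for k, v in chain[0].items() if k != "body_hash"})
after_rehash = verify(chain)

print(intact, after_edit, after_rehash)
```

The last step shows why an edit propagates: to hide a change to one receipt, every later receipt in the session would have to be rewritten, and the resulting head would no longer match any head recorded elsewhere.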

4. Notarise the chain head with RFC 3161.

After every chain-head advance, send the new head hash to a Time-Stamping Authority. The TSA signs the hash together with the current UTC time and returns an RFC 3161 token. This is the cryptographic statement "at moment T, this exact chain head existed". Without it, a sufficiently motivated adversary could claim the entire chain was generated after the fact. With it, the record is anchored in wall-clock time by an independent third party whose signature is verifiable offline.
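For readers who want to see what is sent to the TSA, the sketch below hand-encodes a minimal RFC 3161 TimeStampReq in DER for a SHA-256 chain head and posts it over HTTP with the `application/timestamp-query` content type defined by the RFC. Several things here are assumptions: the chain head is taken to be a `sha256:<hex>` string used directly as the message imprint, the request omits the optional nonce and policy fields for brevity (a production client would normally send a nonce), and `tsa_url` is whatever TSA endpoint you have an agreement with. Parsing and verifying the returned token needs an ASN.1/CMS library and is not shown.

```python
import hashlib
import urllib.request

# DER for AlgorithmIdentifier { OID 2.16.840.1.101.3.4.2.1 (SHA-256), NULL }.
SHA256_ALG_ID = bytes.fromhex("300d06096086480165030402010500")

def build_timestamp_request(chain_head: str) -> bytes:
    """Encode a TimeStampReq (version 1, certReq TRUE, no nonce) for the head."""
    algo, _, hex_digest = chain_head.partition(":")
    digest = bytes.fromhex(hex_digest)
    if algo != "sha256" or len(digest) != 32:
        raise ValueError("expected a chain head of the form sha256:<64 hex chars>")
    version = b"\x02\x01\x01"                  # INTEGER 1
    hashed_message = b"\x04\x20" + digest      # OCTET STRING, 32 bytes
    imprint_body = SHA256_ALG_ID + hashed_message
    message_imprint = b"\x30" + bytes([len(imprint_body)]) + imprint_body
    cert_req = b"\x01\x01\xff"                 # BOOLEAN TRUE: include TSA cert
    body = version + message_imprint + cert_req
    return b"\x30" + bytes([len(body)]) + body  # outer SEQUENCE, short-form length

def request_token(tsa_url: str, request_der: bytes, timeout: float = 10.0) -> bytes:
    """POST the request to a TSA and return the raw DER TimeStampResp."""
    req = urllib.request.Request(
        tsa_url,
        data=request_der,
        headers={"Content-Type": "application/timestamp-query"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()

# Build a request for an example head; no network call is made here.
example_head = "sha256:" + hashlib.sha256(b"example chain head").hexdigest()
request_der = build_timestamp_request(example_head)
print(len(request_der), request_der.hex())
```

Only the 32-byte digest leaves your perimeter in this exchange; the TSA never sees the receipts themselves.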


5. Generate a jurisdictional evidence pack on demand.

For any (agent, period, system) the dashboard produces a regulator-ready pack: cover sheet, agent inventory, action log excerpt, hash-chain integrity proof, RFC 3161 notarisation tokens, signed manifest, redacted inputs and outputs. The pack ships as portable JSON + auditor-friendly PDF. Your auditor can re-verify it independently with agentaudit-verify pack.json — no contact with Agent Audit required for the verification step.
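To make the manifest idea concrete, here is a minimal sketch of selecting receipts for one agent and period and summarising them. The field names in the manifest, the helpers `build_manifest` and `manifest_is_consistent`, and the sample hashes are all hypothetical; this is not the format produced by the dashboard or checked by agentaudit-verify. It assumes receipts arrive in chain order and carry UTC timestamps in one fixed ISO 8601 layout, so that string comparison matches time order. The `manifest_hash` is the value a signer would sign; the signature itself is not shown.

```python
import hashlib
import json
from typing import Any

def canonical_bytes(doc: dict[str, Any]) -> bytes:
    """Assumed canonical form: sorted keys, compact separators, UTF-8."""
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")

def build_manifest(receipts: list[dict[str, Any]], agent_id: str,
                   start: str, end: str) -> dict[str, Any]:
    """Summarise one agent's receipts for the half-open period [start, end)."""
    selected = [r for r in receipts
                if r["agent_id"] == agent_id and start <= r["ts"] < end]
    chain_heads: dict[str, str] = {}
    for r in selected:
        # The last receipt seen per session is that session's head in the period.
        chain_heads[r["session_id"]] = r["body_hash"]
    manifest: dict[str, Any] = {
        "agent_id": agent_id,
        "period": {"start": start, "end": end},
        "receipt_hashes": [r["body_hash"] for r in selected],
        "chain_heads": chain_heads,
    }
    digest = hashlib.sha256(canonical_bytes(manifest)).hexdigest()
    return {**manifest, "manifest_hash": f"sha256:{digest}"}

def manifest_is_consistent(manifest: dict[str, Any]) -> bool:
    """Recompute the manifest hash the way an independent verifier would."""
    body = {k: v for k, v in manifest.items() if k != "manifest_hash"}
    expected = "sha256:" + hashlib.sha256(canonical_bytes(body)).hexdigest()
    return manifest.get("manifest_hash") == expected

# Sample receipts with shortened placeholder hashes, for illustration only.
receipts = [
    {"agent_id": "claims-triage-v3", "session_id": "claim-001",
     "ts": "2026-09-08T14:14:22.331Z", "body_hash": "sha256:aa"},
    {"agent_id": "claims-triage-v3", "session_id": "claim-001",
     "ts": "2026-09-08T14:14:25.102Z", "body_hash": "sha256:bb"},
    {"agent_id": "claims-triage-v3", "session_id": "claim-002",
     "ts": "2026-10-01T09:00:00.000Z", "body_hash": "sha256:cc"},
    {"agent_id": "kyc-review-v1", "session_id": "kyc-001",
     "ts": "2026-09-09T10:00:00.000Z", "body_hash": "sha256:dd"},
]
manifest = build_manifest(receipts, "claims-triage-v3",
                          "2026-09-01T00:00:00.000Z", "2026-10-01T00:00:00.000Z")
print(json.dumps(manifest, indent=2))
print(manifest_is_consistent(manifest))
```

A verifier with only the pack in hand can rerun `manifest_is_consistent` and then replay each session's chain against the listed heads, which is the independent check the verification step relies on.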

The jurisdictional templates currently shipped: EU AI Act Article 12, ICO SAR, FCA SYSC, NIST AI RMF, Insurance Claim, Board / Audit Committee. New templates are added as design partners co-design them with their auditors.

What "regulator-acceptable" actually means.

The difference between "we have logs" and "we have evidence" is whether the format is one the regulator already recognises and the auditor can sign off on independently. The three properties an acceptable pack carries:

Property	What it does
Signed manifest	The pack manifest is a single signed document listing every receipt's hash, the chain head per session, and every notarisation. Tampering with any part of the pack invalidates the manifest signature.
Hash-chain proof	The manifest's integrity section enumerates every session, its chain head, and confirms the chain replayed cleanly. An auditor re-runs the replay independently.
RFC 3161 notarisation tokens	Every chain-head advance carries a TSA-signed timestamp. The auditor verifies the token offline against the TSA's certificate and confirms that the hash inside it matches the chain head.

The honest cost-benefit.

Building all of this in-house is a 6 to 12 month engineering programme for a team with relevant crypto and compliance experience. Most firms do not have that team. Even for those that do, engineering is not the slowest part: the slowest part is getting the export format signed off by the auditor's lawyers.

Using a managed solution like Agent Audit collapses this to roughly a week of integration work plus the conversations with your auditor. The hash chain, RFC 3161 notarisation, pack generators, and verify CLI are already built and the export formats have been designed against the actual regulatory texts.

If you are building it yourself anyway, the open-source SDK gives you the receipt format, the chain logic and the verify CLI. The managed backend is what you pay for: hot retention, cold archive, notarisation infrastructure, pack generation, SCIM, SSO, and the jurisdictional pack catalogue.
