In plain terms: Forge does not get to call itself trustworthy. It has to earn it against specific, measurable targets. A capability that misses these stays in a low-autonomy mode until it improves.

The targets

These are the numbers the system holds itself accountable to before any production pilot:
| Metric | Target | Why it matters |
|---|---|---|
| Citation accuracy | > 95% | Every material claim should cite a real source |
| Rule applicability precision | > 90% | False positives create review burden |
| Rule applicability recall | > 85% | Missed rules can be dangerous |
| Missing-evidence detection recall | > 90% | Foundational for compliance and readiness |
| Refusal quality | > 98% | Abstain when evidence is insufficient |
| Numeric geometry grounding | 100% | Every numeric geometry claim cites a real measurement |
| Tenant isolation tests | 100% pass | Zero cross-tenant retrieval allowed, by construction |
| Decision replay completeness | > 95% | Approved decisions must be replayable from sources and evidence |
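
To make the precision and recall rows concrete, here is a minimal sketch of how the two numbers could be computed for one evaluation case. It assumes the system's flagged rules and a reviewer's judgment of applicable rules are both available as sets; the function name, the rule IDs, and the convention of scoring an empty set as 1.0 are illustrative assumptions, not part of Forge.

```python
def precision_recall(predicted: set[str], actual: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for flagged rules versus reviewer labels."""
    true_positives = len(predicted & actual)
    # Precision: share of flagged rules that really apply.
    # A low value means reviewers wade through false positives.
    # Assumption: with nothing flagged, precision is scored as 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    # Recall: share of applicable rules that were actually flagged.
    # A low value means applicable rules were missed.
    # Assumption: with nothing applicable, recall is scored as 1.0.
    recall = true_positives / len(actual) if actual else 1.0
    return precision, recall


# Hypothetical rule IDs, for illustration only.
flagged = {"R-101", "R-102", "R-103", "R-104"}
applicable = {"R-101", "R-102", "R-103", "R-200"}
p, r = precision_recall(flagged, applicable)
```

In this made-up case three of the four flagged rules apply and three of the four applicable rules were flagged, so both values come out at 0.75, below the > 90% precision and > 85% recall targets.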

How the targets are enforced

The targets are not aspirational. They gate what the AI is allowed to do:
A capability that misses these targets stays in observe-only or draft-only autonomy. It does not get promoted to propose-with-review or execute-after-approval until it earns the promotion through observed performance.
This connects directly to L7 · Outcome Observer, which measures real-world performance, and to L5 · Capability Catalog, where autonomy levels live.
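
The gating rule can be sketched as a simple check of observed metrics against the targets. This is an illustration under stated assumptions, not Forge's actual implementation: the metric keys, the `Target` type, and the function names are hypothetical, and treating an unmeasured metric as a miss is a design choice made for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    threshold: float
    strict: bool  # True: observed must be > threshold; False: must be >= threshold


# Thresholds taken from the targets table; the key names are hypothetical.
TARGETS = {
    "citation_accuracy": Target(0.95, True),
    "rule_applicability_precision": Target(0.90, True),
    "rule_applicability_recall": Target(0.85, True),
    "missing_evidence_detection_recall": Target(0.90, True),
    "refusal_quality": Target(0.98, True),
    "numeric_geometry_grounding": Target(1.00, False),
    "tenant_isolation_pass_rate": Target(1.00, False),
    "decision_replay_completeness": Target(0.95, True),
}


def missed_targets(observed: dict[str, float]) -> list[str]:
    """List every target the observed metrics fail to meet."""
    missed = []
    for name, target in TARGETS.items():
        value = observed.get(name)
        if value is None:
            # Assumption: a metric that was never measured counts as missed.
            missed.append(name)
        elif target.strict and not value > target.threshold:
            missed.append(name)
        elif not target.strict and not value >= target.threshold:
            missed.append(name)
    return missed


def eligible_for_promotion(observed: dict[str, float]) -> bool:
    """True only when every target is met; otherwise autonomy stays low."""
    return not missed_targets(observed)


# Hypothetical observations: everything passes except citation accuracy,
# which sits exactly at 95% and therefore fails the strict "> 95%" target.
observed = {name: 1.0 for name in TARGETS}
observed["citation_accuracy"] = 0.95
blocked_by = missed_targets(observed)
```

Returning the list of missed targets, rather than only a yes/no answer, keeps the reason for a blocked promotion visible to whoever reviews the capability.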

Reading a few of these

  • Refusal quality > 98% is an unusual target. It encodes the principle that abstaining when evidence is missing is good behavior, not a failure. See Where Humans Stay in Control.
  • Numeric geometry grounding 100% is non-negotiable: every numeric geometry claim must cite a real measurement. See L3 · CAD & Geometry World Model.
  • Tenant isolation 100% pass means any attempt to read another yard’s data must fail by construction. See Keeping Your Data Yours.
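
One way to read "fail by construction" is that the read path is never handed another tenant's data in the first place, so there is no filter to forget. The toy in-memory sketch below illustrates that idea only; the class names, tenant IDs, and documents are hypothetical and say nothing about how Forge actually stores data.

```python
class TenantView:
    """Read handle bound to exactly one tenant's partition."""

    def __init__(self, partition: dict[str, str]) -> None:
        self._partition = partition

    def get(self, doc_id: str) -> str:
        # Only this tenant's documents exist here; any other ID raises KeyError.
        return self._partition[doc_id]


class TenantScopedStore:
    """Toy store that keeps one separate partition per tenant."""

    def __init__(self) -> None:
        self._partitions: dict[str, dict[str, str]] = {}

    def put(self, tenant_id: str, doc_id: str, text: str) -> None:
        self._partitions.setdefault(tenant_id, {})[doc_id] = text

    def for_tenant(self, tenant_id: str) -> TenantView:
        # The view receives only its own partition, so it has no reference
        # through which another tenant's documents could be reached.
        return TenantView(self._partitions.setdefault(tenant_id, {}))


# Hypothetical tenants and documents, for illustration only.
store = TenantScopedStore()
store.put("yard-a", "doc-1", "weld procedure")
store.put("yard-b", "doc-9", "hull survey")
view_a = store.for_tenant("yard-a")
```

An isolation test in this style asserts that `view_a.get("doc-1")` succeeds while `view_a.get("doc-9")` raises `KeyError`, and the 100% pass target means no such cross-tenant read may ever succeed.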