In plain terms: Forge does not get to call itself trustworthy. It has to earn
it against specific, measurable targets. A capability that misses these stays in a
low-autonomy mode until it improves.
The targets
These are the numbers the system holds itself accountable to before any production pilot:| Metric | Target | Why it matters |
|---|---|---|
| Citation accuracy | > 95% | Every material claim should cite a real source |
| Rule applicability precision | > 90% | False positives create review burden |
| Rule applicability recall | > 85% | Missed rules can be dangerous |
| Missing-evidence detection recall | > 90% | Foundational for compliance and readiness |
| Refusal quality | > 98% | Abstain when evidence is insufficient |
| Numeric geometry grounding | 100% | Every numeric geometry claim cites a real measurement |
| Tenant isolation tests | 100% pass | Zero cross-tenant retrieval allowed, by construction |
| Decision replay completeness | > 95% | Approved decisions must be replayable from sources and evidence |
How the targets are enforced
The targets are not aspirational. They gate what the AI is allowed to do:A capability that misses these targets stays in observe-only or draft-only autonomy. It does not get promoted to propose-with-review or execute-after-approval until it earns the promotion through observed performance.This connects directly to L7 · Outcome Observer, which measures real-world performance, and to L5 · Capability Catalog, where autonomy levels live.
Reading a few of these
- Refusal quality > 98% is unusual to see as a target. It encodes the principle that abstaining when evidence is missing is a good behavior, not a failure. See Where Humans Stay in Control.
- Numeric geometry grounding 100% is non-negotiable: every numeric geometry claim must cite a real measurement. See L3 · CAD & Geometry World Model.
- Tenant isolation 100% pass means any attempt to read another yard’s data must fail by construction. See Keeping Your Data Yours.