Open Alignment & Risk Gaps — A Technical Accounting

AI-2027 Response — Risk & Alignment Section

1. What ForgeRun Currently Addresses

  • Runtime constitutional enforcement
  • Federated cryptographic proofs
  • Temporal synchronization (multi-speed mitigation)
  • Immutable audit trail
  • Governance quorum model
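
Of the items above, the immutable audit trail is the most mechanically concrete. Below is a minimal sketch of one standard construction, a hash-chained append-only log in which each entry commits to its predecessor, so any retroactive edit breaks the chain. The `AuditTrail` class is illustrative only and is not ForgeRun's production implementation.

```python
import hashlib
import json
import time

class AuditTrail:
    """Append-only log: each entry commits to the previous entry's hash,
    so any retroactive edit breaks the chain and is caught by verify()."""

    GENESIS = "0" * 64  # sentinel hash for the first entry

    def __init__(self):
        self.entries = []

    def append(self, event: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        record = {"event": event, "prev": prev, "ts": time.time()}
        # Canonical serialization so the hash is reproducible on verify.
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self.entries.append(record)
        return record["hash"]

    def verify(self) -> bool:
        prev = self.GENESIS
        for record in self.entries:
            body = {k: v for k, v in record.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode()
            if record["prev"] != prev:
                return False
            if hashlib.sha256(payload).hexdigest() != record["hash"]:
                return False
            prev = record["hash"]
        return True
```

Tampering with any stored event invalidates that entry's hash and, through the chain, every later entry's `prev` pointer.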

2. Unresolved High-Risk Areas

A. Mechanistic Interpretability

  • No current neuron-level weight inspection pipeline
  • No circuit-level internal goal analysis
  • No mesa-optimizer detection framework

Status: Research required

B. Strategic Deception Detection

  • No gradient-level honesty probing
  • No adversarial optimization stress harness
  • No deception signature classifier

Status: Framework in design
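
To give a sense of the baseline any real framework must exceed, here is a deliberately crude probe: checking whether a model's answers stay consistent across paraphrases of the same question. The `query_model` callable is a hypothetical interface, and surface-level answer comparison is far weaker than the gradient-level and adversarial tooling listed above as missing.

```python
def consistency_probe(query_model, paraphrases: list) -> float:
    """Return the fraction of answer pairs that agree across paraphrases.

    query_model: hypothetical callable taking a prompt string and
    returning the model's answer as a string.
    """
    answers = [query_model(p).strip().lower() for p in paraphrases]
    # Compare every unordered pair of answers.
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0  # zero or one answer: trivially consistent
    return sum(a == b for a, b in pairs) / len(pairs)
```

A low score flags inconsistency, not deception; a strategically deceptive system could pass this probe while concealing its objectives, which is precisely why the gap remains open.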

C. Capability Escalation Triggers

  • No benchmark-linked compute throttle
  • No automated model-pause thresholds
  • No cross-lab alert layer

Status: Proposed
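
A benchmark-linked pause mechanism could take roughly the following shape: evaluation scores are compared against fixed thresholds, and breaches map to a throttle or pause action. The benchmark names and threshold values below are placeholders for illustration, not proposed policy.

```python
# Placeholder thresholds: benchmark names and values are illustrative,
# not proposed ForgeRun policy.
PAUSE_THRESHOLDS = {
    "autonomous_replication": 0.20,  # fraction of eval tasks passed
    "cyber_offense": 0.50,
}

def escalation_action(scores: dict) -> str:
    """Map benchmark scores to 'continue', 'throttle', or 'pause'."""
    breaches = [name for name, limit in PAUSE_THRESHOLDS.items()
                if scores.get(name, 0.0) >= limit]
    if not breaches:
        return "continue"
    # A single breach throttles compute; simultaneous breaches pause.
    return "pause" if len(breaches) > 1 else "throttle"
```

The hard parts, choosing thresholds that track real risk and wiring the decision into a cross-lab alert layer, are exactly what remains unbuilt.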

D. Agent Containment

  • Execution Passport concept not yet enforced in production
  • No hardware-level sandbox verification

Status: Under development
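
One way to sketch the Execution Passport concept is as a signed capability token that a sandbox checks before running any requested action. The HMAC scheme and key handling below are simplified for illustration; the hardware-level verification noted above as missing is precisely what a software-only check like this lacks.

```python
import hashlib
import hmac

# Placeholder key for illustration only; never hard-code real keys.
SECRET_KEY = b"demo-key"

def issue_passport(agent_id: str, allowed_actions: list) -> dict:
    """Sign an agent's allowed action set into a capability token."""
    body = f"{agent_id}|{','.join(sorted(allowed_actions))}"
    sig = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return {"agent_id": agent_id, "actions": sorted(allowed_actions), "sig": sig}

def authorize(passport: dict, action: str) -> bool:
    """Reject both forged signatures and actions outside the allowed set."""
    body = f"{passport['agent_id']}|{','.join(passport['actions'])}"
    expected = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, passport["sig"]) \
        and action in passport["actions"]
```

An agent that edits its own passport to grant a new action invalidates the signature, so the tampered token fails authorization.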

E. Biological / Dual-Use Safeguards

  • No domain-specific wet-lab risk scoring module
  • No controlled release interface for sensitive outputs

Status: Required
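
A controlled release interface would, at minimum, gate outputs on a risk score and route high-scoring outputs to human review rather than releasing them. The scorer below is a trivial keyword stand-in for illustration; a real dual-use module would require a trained domain classifier, and the term list and threshold here are placeholders.

```python
# Placeholder term list and threshold; a real wet-lab risk scorer would
# be a trained domain model, not keyword matching.
FLAGGED_TERMS = ("synthesis route", "pathogen enhancement")
RISK_THRESHOLD = 0.5

def risk_score(text: str) -> float:
    """Crude score: fraction of flagged terms appearing in the output."""
    lowered = text.lower()
    return sum(term in lowered for term in FLAGGED_TERMS) / len(FLAGGED_TERMS)

def release_decision(text: str) -> str:
    """Hold high-risk outputs for human review instead of releasing."""
    return "hold_for_review" if risk_score(text) >= RISK_THRESHOLD else "release"
```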

3. Invitation to Review

These gaps are not hidden or minimized. They are documented here as a matter of intellectual responsibility and practical necessity. We welcome scrutiny and collaboration on all five domains.

The following researchers and practitioners are invited to review, critique, and propose improvements to this architecture:

  • Daniel Kokotajlo
  • Scott Alexander
  • Thomas Larsen
  • Eli Lifland
  • Romeo Dean
  • Independent red-team researchers

The architecture is open for scrutiny. Improvements are welcome.

4. Open Alignment Gaps (CEA)

Open Alignment Gaps is a structured accounting of five unresolved alignment domains (deceptive alignment, goal specification, interpretability, multi-agent coordination, and distributional shift) that the CEA does not narrow.

Why Gap Accounting Exists

Honest accounting of unresolved gaps prevents overstatement of coverage. The CEA makes no claim to address deceptive alignment, internal goal specification failures, or interpretability challenges.


Five Unresolved Domains

  • Deceptive alignment: the CEA cannot detect or prevent a system that conceals its objectives.
  • Goal specification: the CEA enforces stated constraints but does not evaluate whether those constraints are correctly specified.
  • Interpretability, multi-agent coordination, and distributional shift: similarly unaddressed.

Relationship to Limits

These gaps are a subset of the Known Limits page. They are separated here to provide technical depth for researchers examining the alignment-specific boundaries of the project.