Open Alignment & Risk Gaps — A Technical Accounting

AI-2027 Response — Risk & Alignment Section

1. What ForgeRun Currently Addresses

  • Runtime constitutional enforcement
  • Federated cryptographic proofs
  • Temporal synchronization (multi-speed mitigation)
  • Immutable audit trail
  • Governance quorum model
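
Of the items above, the immutable audit trail is the most mechanically concrete. Below is a minimal sketch of one standard construction, a hash-chained append-only log in which each entry commits to its predecessor, so any retroactive edit breaks the chain. The `AuditTrail` class is illustrative only and is not ForgeRun's production implementation.

```python
import hashlib
import json
import time

class AuditTrail:
    """Append-only log: each entry commits to the previous entry's hash,
    so any retroactive edit breaks the chain and is caught by verify()."""

    GENESIS = "0" * 64  # sentinel hash for the first entry

    def __init__(self):
        self.entries = []

    def append(self, event: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        record = {"event": event, "prev": prev, "ts": time.time()}
        # Canonical serialization so the hash is reproducible on verify.
        payload = json.dumps(record, sort_keys=True).encode()
        record["hash"] = hashlib.sha256(payload).hexdigest()
        self.entries.append(record)
        return record["hash"]

    def verify(self) -> bool:
        prev = self.GENESIS
        for record in self.entries:
            body = {k: v for k, v in record.items() if k != "hash"}
            payload = json.dumps(body, sort_keys=True).encode()
            if record["prev"] != prev:
                return False
            if hashlib.sha256(payload).hexdigest() != record["hash"]:
                return False
            prev = record["hash"]
        return True
```

Tampering with any stored event invalidates that entry's hash and, through the chain, every later entry's `prev` pointer.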

2. Unresolved High-Risk Areas

A. Mechanistic Interpretability

  • No current neuron-level weight inspection pipeline
  • No circuit-level internal goal analysis
  • No mesa-optimizer detection framework

Status: Research required

B. Strategic Deception Detection

  • No gradient-level honesty probing
  • No adversarial optimization stress harness
  • No deception signature classifier

Status: Framework in design
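
To give a sense of the baseline any real framework must exceed, here is a deliberately crude probe: checking whether a model's answers stay consistent across paraphrases of the same question. The `query_model` callable is a hypothetical interface, and surface-level answer comparison is far weaker than the gradient-level and adversarial tooling listed above as missing.

```python
def consistency_probe(query_model, paraphrases: list) -> float:
    """Return the fraction of answer pairs that agree across paraphrases.

    query_model: hypothetical callable taking a prompt string and
    returning the model's answer as a string.
    """
    answers = [query_model(p).strip().lower() for p in paraphrases]
    # Compare every unordered pair of answers.
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    if not pairs:
        return 1.0  # zero or one answer: trivially consistent
    return sum(a == b for a, b in pairs) / len(pairs)
```

A low score flags inconsistency, not deception; a strategically deceptive system could pass this probe while concealing its objectives, which is precisely why the gap remains open.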

C. Capability Escalation Triggers

  • No benchmark-linked compute throttle
  • No automated model-pause thresholds
  • No cross-lab alert layer

Status: Proposed
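
A benchmark-linked pause mechanism could take roughly the following shape: evaluation scores are compared against fixed thresholds, and breaches map to a throttle or pause action. The benchmark names and threshold values below are placeholders for illustration, not proposed policy.

```python
# Placeholder thresholds: benchmark names and values are illustrative,
# not proposed ForgeRun policy.
PAUSE_THRESHOLDS = {
    "autonomous_replication": 0.20,  # fraction of eval tasks passed
    "cyber_offense": 0.50,
}

def escalation_action(scores: dict) -> str:
    """Map benchmark scores to 'continue', 'throttle', or 'pause'."""
    breaches = [name for name, limit in PAUSE_THRESHOLDS.items()
                if scores.get(name, 0.0) >= limit]
    if not breaches:
        return "continue"
    # A single breach throttles compute; simultaneous breaches pause.
    return "pause" if len(breaches) > 1 else "throttle"
```

The hard parts, choosing thresholds that track real risk and wiring the decision into a cross-lab alert layer, are exactly what remains unbuilt.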

D. Agent Containment

  • Execution Passport concept not yet enforced in production
  • No hardware-level sandbox verification

Status: Under development
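
One way to sketch the Execution Passport concept is as a signed capability token that a sandbox checks before running any requested action. The HMAC scheme and key handling below are simplified for illustration; the hardware-level verification noted above as missing is precisely what a software-only check like this lacks.

```python
import hashlib
import hmac

# Placeholder key for illustration only; never hard-code real keys.
SECRET_KEY = b"demo-key"

def issue_passport(agent_id: str, allowed_actions: list) -> dict:
    """Sign an agent's allowed action set into a capability token."""
    body = f"{agent_id}|{','.join(sorted(allowed_actions))}"
    sig = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return {"agent_id": agent_id, "actions": sorted(allowed_actions), "sig": sig}

def authorize(passport: dict, action: str) -> bool:
    """Reject both forged signatures and actions outside the allowed set."""
    body = f"{passport['agent_id']}|{','.join(passport['actions'])}"
    expected = hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, passport["sig"]) \
        and action in passport["actions"]
```

An agent that edits its own passport to grant a new action invalidates the signature, so the tampered token fails authorization.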

E. Biological / Dual-Use Safeguards

  • No domain-specific wet-lab risk scoring module
  • No controlled release interface for sensitive outputs

Status: Required
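
A controlled release interface would, at minimum, gate outputs on a risk score and route high-scoring outputs to human review rather than releasing them. The scorer below is a trivial keyword stand-in for illustration; a real dual-use module would require a trained domain classifier, and the term list and threshold here are placeholders.

```python
# Placeholder term list and threshold; a real wet-lab risk scorer would
# be a trained domain model, not keyword matching.
FLAGGED_TERMS = ("synthesis route", "pathogen enhancement")
RISK_THRESHOLD = 0.5

def risk_score(text: str) -> float:
    """Crude score: fraction of flagged terms appearing in the output."""
    lowered = text.lower()
    return sum(term in lowered for term in FLAGGED_TERMS) / len(FLAGGED_TERMS)

def release_decision(text: str) -> str:
    """Hold high-risk outputs for human review instead of releasing."""
    return "hold_for_review" if risk_score(text) >= RISK_THRESHOLD else "release"
```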

3. Invitation to Review

These gaps are not hidden or minimized. They are documented here as a matter of intellectual responsibility and practical necessity. We welcome scrutiny and collaboration on all five domains.

The following researchers and practitioners are invited to review, critique, and propose improvements to this architecture:

  • Daniel Kokotajlo
  • Scott Alexander
  • Thomas Larsen
  • Eli Lifland
  • Romeo Dean
  • Independent red-team researchers

The architecture is open for scrutiny. Improvements are welcome.

4. Open Alignment Gaps (CEA)

Open Alignment Gaps is a structured accounting of five unresolved alignment domains (deceptive alignment, goal specification, interpretability, multi-agent coordination, and distributional shift) that the CEA does not narrow.

Why Gap Accounting Exists

Honest accounting of unresolved gaps prevents overstatement of coverage. The CEA makes no claim to address deceptive alignment, internal goal specification failures, or interpretability challenges.


Five Unresolved Domains

  • Deceptive alignment: the CEA cannot detect or prevent a system that conceals its objectives.
  • Goal specification: the CEA enforces stated constraints but does not evaluate whether those constraints are correctly specified.
  • Interpretability, multi-agent coordination, and distributional shift: similarly unaddressed.

Relationship to Limits

These gaps are a subset of the Known Limits page. They are separated here to provide technical depth for researchers examining the alignment-specific boundaries of the project.