This preprint analyzes a specific brittleness mechanism in “output-rule alignment” (ORA) for open-world agents and contrasts it with “organizing-objective alignment” (OOA) paired with feasibility constraints.
ORA is modeled as optimizing a proxy score subject to an output-gating interface implemented by a finite, bounded-capacity set of predicates (e.g., explicit do/don’t lists, pattern triggers, refusal filters, or other detector-mediated gates). In open-world regimes (large plan spaces, distribution shift, strong optimization pressure), such finite predicate interfaces induce coarse equivalence classes over actions and plans (“signature classes”). The result is systematic aliasing: many distinct plans share the same gate signature, so an optimizer can search within allowed classes for undesirable plans that remain indistinguishable to the gate.
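For intuition, here is a minimal runnable sketch of the aliasing mechanism; the two predicates and the plan strings are invented for illustration and are not the paper’s model:

```python
# Minimal aliasing sketch (illustrative predicates and plans, not the
# paper's): a gate built from m Boolean predicates induces at most 2^m
# signature classes over plans, so distinct plans with the same signature
# are indistinguishable to the gate.

def make_gate(predicates):
    """Return a function mapping a plan to its tuple of predicate outputs."""
    def signature(plan):
        return tuple(bool(p(plan)) for p in predicates)
    return signature

# Toy predicate interface (m = 2): keyword and length triggers on plan text.
predicates = [
    lambda plan: "weapon" in plan,        # forbidden-keyword trigger
    lambda plan: len(plan.split()) > 10,  # crude length/complexity trigger
]
signature = make_gate(predicates)

benign  = "order office supplies"
harmful = "order precursor chemicals"    # stands in for an undesirable plan

# Both plans alias to the allowed signature (False, False): the gate passes
# either one, and optimization can search freely inside this allowed class.
assert signature(benign) == signature(harmful) == (False, False)
print("shared allowed signature:", signature(benign))
```

Because the gate sees only the signature, a proxy-maximizing search that stays inside the (False, False) class is unconstrained by the interface; the guarantee erodes as allowed classes grow.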
The main technical contributions are intentionally toy-model-based but structurally informative:
• Edge-case explosion lower bound: for DNF-like forbidden-pattern lists (“forbid iff any trigger matches”), perfectly excluding hazards that depend on high-order feature interactions can require a number of patterns exponential in the number of relevant features (a parity-style brute-force check is sketched after this list).
• Finite-interface indistinguishability: with m Boolean predicates, at most 2^m signatures exist, so hard exclusion of catastrophic plans requires either (i) additional structural assumptions tying catastrophe to the predicate interface, or (ii) predicate capacity that scales with the number of distinguishable allowed plans (often at least linearly in the planning horizon); a pigeonhole sketch also follows this list. A bounded-capacity generalization covers “soft” detectors by treating the gate as a finite-information channel.
• Temporal fragility lemma: when the definition of catastrophe changes within a fixed signature class, maintaining hard exclusion guarantees forces interface refinement (added capacity) or coarse overblocking.
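To make the edge-case explosion bullet concrete, the following brute-force check uses an assumed parity-style hazard, a standard worst case for conjunctive trigger coverage (the paper’s exact construction may differ):

```python
# Toy check of the edge-case explosion claim for a parity-style hazard
# (an assumed example in the spirit of the lower bound): any conjunctive
# DNF trigger that fires ONLY on hazardous inputs must fix all n bits,
# so exact exclusion needs 2^(n-1) triggers.

from itertools import product

n = 4                                    # number of relevant features

def hazard(x):
    """Parity hazard: catastrophic iff an odd number of bits is set."""
    return sum(x) % 2 == 1

points = list(product([0, 1], repeat=n))

def matches(trigger, x):
    """trigger: dict {index: required bit}; unspecified bits are wildcards."""
    return all(x[i] == b for i, b in trigger.items())

# Enumerate every conjunctive trigger (each bit fixed to 0, fixed to 1, or
# free) and keep the "sound" ones that never fire on a safe input.
sound = []
for spec in product([0, 1, None], repeat=n):
    trigger = {i: b for i, b in enumerate(spec) if b is not None}
    if all(hazard(x) for x in points if matches(trigger, x)):
        sound.append(trigger)

# Every sound trigger fixes all n bits, i.e., matches exactly one input...
assert all(len(t) == n for t in sound)
# ...so covering the 2^(n-1) hazardous inputs needs 2^(n-1) triggers.
assert len(sound) == 2 ** (n - 1)
print(f"n={n}: all sound triggers are full points; {len(sound)} patterns required")
```

The hazard depends on an order-n interaction among the features, which is precisely the regime in which trigger lists stop compressing.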
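The indistinguishability bullet is a counting statement; the pigeonhole sketch below instantiates it, with placeholder plans and a hash-derived stand-in gate that are not anything from the paper:

```python
# Pigeonhole sketch for finite-interface indistinguishability: m Boolean
# predicates yield at most 2**m signatures, so any population of more than
# 2**m plans must contain two plans the gate cannot tell apart.

import hashlib

m = 6  # gate capacity: m Boolean predicates

def signature(plan: str) -> tuple:
    """Stand-in gate: m bits read off a stable hash of the plan text."""
    byte = hashlib.sha256(plan.encode()).digest()[0]
    return tuple(bool((byte >> i) & 1) for i in range(m))

plans = [f"plan-{k}" for k in range(2 ** m + 1)]  # one more plan than signatures
distinct = {signature(p) for p in plans}

assert len(distinct) <= 2 ** m           # the interface caps the signature count
assert len(distinct) < len(plans)        # hence at least one collision
print(f"{len(plans)} plans map onto {len(distinct)} distinct signatures")
```

The same counting view motivates the temporal fragility lemma: if one member of a collided class later becomes catastrophic, the gate must either block the whole class (coarse overblocking) or grow its capacity m (interface refinement).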
In contrast, OOA is modeled as maximizing a compact organizing objective (a small set of stable, high-level principles) subject to feasibility constraints. A stability “witness” toy model shows that under regularity assumptions (e.g., Lipschitz objectives and convex/regular feasible sets), the induced policy can be Lipschitz-stable across context perturbations—illustrating one way objectives+constraints can generalize without enumerating edge cases.
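A minimal numerical version of the stability witness, under simplified assumptions chosen here (a concave quadratic objective and an interval as the convex feasible set; not the paper’s exact model): the induced policy is the projection of the context onto the feasible set, which is 1-Lipschitz.

```python
# Stability-witness sketch: maximize a Lipschitz, concave objective over a
# fixed convex feasible set and observe that the induced policy varies
# Lipschitz-continuously with the context. The objective -(x - c)^2 and the
# interval [lo, hi] are simplifying assumptions made for this sketch.

lo, hi = -1.0, 1.0                       # convex feasible set: an interval

def policy(c: float) -> float:
    """Argmax of the concave objective -(x - c)^2 over [lo, hi]:
    the projection of the context c onto the feasible set."""
    return min(max(c, lo), hi)

# Empirical 1-Lipschitz check across a grid of perturbed contexts.
contexts = [i / 10 for i in range(-30, 31)]
for c1 in contexts:
    for c2 in contexts:
        assert abs(policy(c1) - policy(c2)) <= abs(c1 - c2) + 1e-12
print("policy is 1-Lipschitz in the context on the sampled grid")
```

Stability here comes from regularity of the objective and the feasible set rather than from enumerating signatures; perturbing the context moves the policy continuously instead of tripping or missing a discrete rule.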
Scope notes:
This paper does not claim to “solve alignment.” The theorems are worst-case and isolate a scaling/aliasing mechanism for ORA implemented via bounded-capacity gates. OOA can also fail under proxy mismeasurement, discontinuous objectives/constraints, or if feasibility enforcement itself collapses to a brittle detector interface. The work is best read as an architectural heuristic: in open-world, strongly optimizing systems, alignment burdens should not live primarily in patch-driven output rules.
This note is intended as a companion technical argument for objective-stack + enforcement-layer ASI-alignment designs, and is compatible with:
Keil, D. “RxR: ASI Alignment from an Entropic Universe” (Zenodo DOI: 10.5281/zenodo.18058276).