Safety Theory
New Paper Argues Verifying AI Safety Is Computationally Intractable
Jasper Yao’s preprint “The Alignment Trap: Complexity Barriers,” circulated this week and already drawing citations across the safety community, argues that verifying the safety of AI systems above a critical capability threshold is coNP-complete, with worst-case verification cost that grows exponentially. The result turns a long-held intuition into a formal barrier: unless P = NP, no polynomial-time process can conclusively verify that a sufficiently capable system is safe, regardless of approach.
For most of the past three years, the case for AI safety verification has rested on an implicit engineering optimism: that with sufficient cleverness, test coverage, and red-team effort, the safety of a frontier model could in principle be established before deployment. Jasper Yao’s newly circulated preprint, “The Alignment Trap: Complexity Barriers,” argues that this optimism is mathematically misplaced. Above a definable capability threshold, the verification problem is not merely hard; it is coNP-complete, with worst-case verification cost growing exponentially in the system’s capability index.
The paper’s central construction defines a verifier as any procedure that, given an AI system and a safety specification, returns a sound certificate of whether the system satisfies the specification across its full input space. Yao demonstrates that for systems above the threshold (roughly, systems capable of compositional reasoning over open-ended input distributions), any sound verifier must in the worst case enumerate a search space whose size scales super-polynomially with capability. The proof proceeds by reduction from Boolean satisfiability and is, by the standards of the field, unusually clean: it does not depend on any specific model architecture or training method.
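The general shape of a reduction like this can be sketched in a few lines. What follows is a generic, textbook-style illustration and not Yao’s actual construction; the symbols SAFE, M_ψ, φ, "bad" and "ok" are assumptions introduced here for exposition, and the sketch assumes that both the system and the specification can be evaluated in polynomial time on any single input.

```latex
% Illustrative sketch only: notation is assumed, not taken from the paper.
% Requires amsmath for the cases environment.

% Safety as a universal statement over n-bit inputs:
\[
\mathrm{SAFE} = \bigl\{\, (M, \varphi) \;:\; \forall x \in \{0,1\}^n,\ \varphi\bigl(x, M(x)\bigr) = 1 \,\bigr\}
\]

% From a Boolean formula \psi over n variables, build a toy system M_\psi
% and a specification \varphi that forbids the output "bad":
\[
M_\psi(x) =
\begin{cases}
\mathrm{bad} & \text{if } \psi(x) = 1,\\
\mathrm{ok}  & \text{otherwise,}
\end{cases}
\qquad
\varphi(x, y) = 1 \iff y \neq \mathrm{bad}.
\]

% The system is safe exactly when the formula has no satisfying assignment:
\[
(M_\psi, \varphi) \in \mathrm{SAFE} \iff \psi \text{ is unsatisfiable}.
\]
```

Under these assumptions, deciding safety is at least as hard as deciding unsatisfiability, which is the standard coNP-complete problem. A single failing input is a short, checkable witness that a system is unsafe, which places the safety question itself in coNP.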
What follows from the result is more consequential than the result itself. The most common framing of pre-release safety testing, that a sufficiently rigorous evaluation suite can certify a system is safe before it ships, is on Yao’s analysis structurally impossible above the threshold. Evaluations can establish the presence of specific harms (an existential claim, in complexity-theoretic terms: one demonstrated failure settles it, and checking that failure is tractable). They cannot establish the absence of unknown harms (a universal claim over the full input space, which is not tractable). The asymmetry is not a matter of effort or budget; it is a matter of computational class.
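The asymmetry can be made concrete with a toy example. The sketch below is illustrative only and is not drawn from the paper: `toy_system`, `always_benign`, `spec`, `TRIGGER`, and `N_BITS` are hypothetical names, the "system" is a trivial function over 12-bit inputs, and brute-force enumeration stands in for just one naive verification strategy.

```python
from itertools import product
from typing import Callable, Tuple

Bits = Tuple[int, ...]

N_BITS = 12  # toy input width (assumption); real input spaces are vastly larger
TRIGGER: Bits = (1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1)  # hypothetical hidden failure input


def toy_system(x: Bits) -> str:
    """Hypothetical stand-in for an AI system: misbehaves on exactly one input."""
    return "harmful" if x == TRIGGER else "benign"


def always_benign(x: Bits) -> str:
    """A second toy system with no failure input at all."""
    return "benign"


def spec(output: str) -> bool:
    """Toy safety specification: any output other than 'harmful' is acceptable."""
    return output != "harmful"


def confirms_harm(system: Callable[[Bits], str], x: Bits) -> bool:
    """Existential side: confirm a claimed failure with a single evaluation."""
    return not spec(system(x))


def certify_safe(system: Callable[[Bits], str], n_bits: int) -> Tuple[bool, int]:
    """Universal side: brute-force every input; returns (is_safe, evaluations used)."""
    evaluations = 0
    for x in product((0, 1), repeat=n_bits):
        evaluations += 1
        if not spec(system(x)):
            return False, evaluations  # stop at the first failure found
    return True, evaluations
```

Calling `confirms_harm(toy_system, TRIGGER)` needs one evaluation to confirm a known failure. Calling `certify_safe(always_benign, N_BITS)` has to visit all 2**12 = 4096 inputs before it can answer, and each additional input bit doubles that count. The sketch shows only the cost of the naive strategy; the claim that no sound verifier can do fundamentally better in the worst case is the paper’s, and rests on its complexity argument rather than on anything demonstrated here.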
The political timing is sharp. The White House has, over the past three weeks, floated an FDA-style pre-release vetting regime for frontier AI systems — a proposal that depends implicitly on the assumption that a federal evaluator could meaningfully certify safety before approval. Industry voluntary commitments to “test before release” carry the same implicit assumption. Yao’s paper does not claim such regimes are useless; it claims they cannot do what their proponents say they do. A regulator can document the absence of known failure modes. It cannot, in any computationally meaningful sense, certify that no failure modes exist.
The paper has been received with a mixture of resignation and relief inside the alignment research community — resignation that the verification ceiling appears lower than many had hoped, relief that the result is now formalized rather than merely felt. Several prominent researchers, including those most associated with scalable oversight research, posted brief endorsements over the weekend. The harder downstream question, taken up in this edition’s Why It Matters feature, is what kind of safety regime survives the result intact.