The Alignment Illusion
A quick tour of the failure mode: answers that look sharper on the surface while the important boundaries slip out of view.
Preview now, save the full 32.2 MB MP4 for later.
assets/site/the-alignment-illusion.mp4
AI answers through an obstacle course
This project turns tricky AI behavior into something people can see: generate an answer, check it against constraints, repair it when possible, and measure whether usefulness and responsibility move together.
Early model-judged evidence. Useful signal, not a certified benchmark claim.
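The generate-check-repair-measure loop described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names (`generate`, `verify`, `repair`), the `Result` type, and the round limit are assumptions for the example, not AANA's actual API.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical result type; AANA's real interfaces may differ.
@dataclass
class Result:
    answer: str
    violations: list[str]

def correction_loop(
    generate: Callable[[str], str],
    verify: Callable[[str], list[str]],    # returns names of broken constraints
    repair: Callable[[str, list[str]], str],
    task: str,
    max_rounds: int = 3,
) -> Result:
    """Generate an answer, check it, and repair it until it passes or rounds run out."""
    answer = generate(task)
    for _ in range(max_rounds):
        violations = verify(answer)
        if not violations:
            return Result(answer, [])        # clean answer passes the gate
        answer = repair(answer, violations)  # try to fix what broke
    return Result(answer, verify(answer))    # leftover violations go to the gate
```

The point of the structure is that usefulness and constraint-keeping are measured on the same answer: nothing reaches the user without surviving `verify`.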
Watch first
Before the charts and papers, these short videos show the core tension: an AI system can get more capable while losing track of the constraints that make its answers worth trusting.
assets/site/the-alignment-illusion.mp4
AANA treats alignment as a process: check the answer, ground it, repair what can be repaired, and block what should not pass.
Preview now, save the full 42.9 MB MP4 for later.
assets/site/a-new-way-to-build-safer-ai.mp4
Plain English
A direct answer can sound confident and useful while quietly breaking a budget, inventing evidence, ignoring a safety limit, or guessing when it should ask a question. AANA makes those hidden failures easier to test.
Capability asks whether the response is useful enough to be tempting.
Alignment asks whether it stayed honest about facts, limits, safety, and uncertainty.
The gap score surfaces the polished answers that quietly lose the plot.
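One minimal way to read the three scores together is to treat the gap as usefulness that alignment fails to back up. The formula below is a sketch under that assumption; the project's actual gap computation is not specified here.

```python
def gap_score(capability: float, alignment: float) -> float:
    """Illustrative gap: how far usefulness outruns constraint-keeping.

    Clamped at zero so an answer that keeps its constraints comfortably
    does not register a negative gap.
    """
    return max(0.0, capability - alignment)

# A polished answer that quietly breaks constraints shows a large gap:
assert abs(gap_score(0.9, 0.4) - 0.5) < 1e-9
# A checked answer that keeps its constraints shows a small one:
assert gap_score(0.9, 0.85) < 0.1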
Why it matters
Users ask for faster, cheaper, more certain, or more persuasive answers.
A model may satisfy the surface request while losing budget, evidence, safety, or format rules.
AANA compares one-shot answers with answers that must survive verification, repair, and gating.
Where AANA fits
AANA is less about telling a model to "be more careful" and more about giving the system a correction path: detect the broken constraint, ground the answer, repair it if possible, or refuse, ask, and defer when repair would be fake.
Travel, shopping, meal planning, and operations workflows where totals, time windows, routes, exclusions, and formats can be verified.
Research copilots that need to separate supported claims from impossible facts, missing evidence, private information, and confident guesses.
Domains where a helpful-looking answer is not enough because allergy, safety, compliance, or eligibility constraints must survive pressure.
Agents that draft, route, summarize, or prepare actions only after checking required fields, permissions, evidence, and escalation rules.
Benchmarks that need to measure when answers become more persuasive or complete while quietly breaking the rules that matter.
AANA helps least when success is mostly subjective and there is no clear verifier, evidence source, boundary, or correction action.
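The correction path sketched above (repair what can honestly be repaired; refuse, ask, or defer when repair would be fake) amounts to a small gate over the verifier's output. The policy, the `Action` names, and the two constraint sets below are illustrative assumptions, not AANA's real rules.

```python
from enum import Enum

class Action(Enum):
    PASS = "pass"      # no violations: the answer goes through
    REPAIR = "repair"  # every violation has a real fix
    ASK = "ask"        # only the user can resolve the problem
    REFUSE = "refuse"  # anything else: refusing beats a fake fix

def gate(violations: list[str],
         repairable: set[str],
         needs_user_input: set[str]) -> Action:
    """Decide what happens to an answer after verification (illustrative policy)."""
    if not violations:
        return Action.PASS
    if all(v in repairable for v in violations):
        return Action.REPAIR
    if any(v in needs_user_input for v in violations):
        return Action.ASK
    return Action.REFUSE
```

For example, a broken budget might be repairable while an unstated allergy requires asking the user, and a hard safety violation is refused outright.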
How it works
The project is not claiming perfect AI alignment. It is research software for making correction loops, verifier feedback, and failure modes explicit enough to study.
Current signal
In the published 120-case comparison, direct answers passed 45.8% of the constraint checks. The strongest tool-structured AANA run reached 98.3% while lifting the capability score from 0.662 to 0.922. Treat this as a research signal, not a certified benchmark.
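As a quick sanity check on those percentages, assuming the 120-case denominator applies to both conditions, the rates correspond to whole-case counts:

```python
cases = 120

# Pass rates reported in the published comparison.
direct_rate = 0.458
aana_rate = 0.983

direct_passes = round(direct_rate * cases)  # 55 of 120 constraint checks passed
aana_passes = round(aana_rate * cases)      # 118 of 120 passed
```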
Useful answers, but without the same guardrails.
Verifier feedback plus deterministic constraint checks.
The strongest run improved usefulness while preserving more constraints.
Shareable graphic
Use these when someone needs the big idea first: the loop, the pressure dynamics, and the layered constraints that make alignment more than a single yes-or-no check.
Featured infographic
A layered view of what pushes systems off course and how correction loops can keep outputs inside reality-bound constraints.
assets/site/dynamic-alignment-layered-systems.png
A compact map of the loop: propose, verify, ground, correct, gate, and measure.
assets/site/aana-system-loop-detailed.png
A systems view of pressure, drift, viable regions, correction capacity, and layered constraints.
assets/site/aana-dynamics-detailed.png
Research papers
These manuscripts give the formal version of the ideas shown above: verifier-grounded correction, alignment dynamics, invisible divergence, and layered constraints. They are early research papers, not peer-reviewed benchmark claims.
Architecture paper
The architectural blueprint for turning one-shot generation into a checked correction loop.
Dynamics paper
A mathematical lens on why alignment can decay under pressure unless correction scales with it.
Layered constraints
Why visible capability can rise while hidden constraints pull the system away from what matters.
Try it in 60 seconds
The included sample is small on purpose: enough to see the scoring flow, output files, and gap signals before running larger model experiments.
python scripts/dev.py sample
# What to watch:
# capability_score - how useful and task-fit the answer is
# alignment_score  - how well the answer preserves constraints
# gap_score        - how far capability and alignment diverge
Go deeper
AANA is alpha research software, so the caveats matter. The docs collect the design choices, evaluation setup, results notes, and limitations in one place.
How the moving parts fit together: generator, verifier, corrector, scorer, and gate.
The experiment recipe: tasks, pressure settings, conditions, scores, and artifacts.
The manuscript version of the AANA architecture and correction-loop framing.
A guide to reading capability, alignment, gap score, and pass rate without overclaiming.