Across three studies and 20,000+ production conversations, governed AI didn't just finish tasks; it finished them correctly: verified completion of 53% vs 9% and clean exits of 72% vs 23% against a control running the same model, with zero character breaks or drift. Raw completion rates hide that gap entirely. Quality is where governance shows up, and where the business value lives.
Governed and ungoverned AI can look almost identical on raw task completion. The difference shows up the moment you ask whether the task was done correctly — and now there are numbers for it.
VERN AI: Three Studies, One Platform
Converging evidence across behavioral governance, emotional intelligence, and market benchmarking — drawn from 20,000+ production AI conversations.
Each study's sample (N) was drawn at random from more than 40,000 live conversations.
Does adding VERN OS's Behavioral Control Module between the application and the LLM produce measurably better session outcomes? Tested across all deployment archetypes against an unorchestrated control group running the same underlying LLM.
The nominal completion rate is nearly identical (76.2% vs 74.1%, p = 0.71). The gap emerges only once enterprise-grade quality filters are applied. Governance doesn't inflate task completion; it changes how sessions complete. (Nominal: did they finish the call? Verified: was the task done correctly?)
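The nominal versus verified distinction can be made concrete with a small sketch. The `Session` record and its two flags are hypothetical names chosen for illustration; the source does not describe how VERN stores or labels session outcomes, and the sample data below is made up.

```python
from dataclasses import dataclass


@dataclass
class Session:
    finished: bool  # hypothetical flag: the session reached its end state
    verified: bool  # hypothetical flag: the outcome passed a quality check


def completion_rates(sessions):
    """Return (nominal_rate, verified_rate) as fractions of all sessions."""
    total = len(sessions)
    if total == 0:
        raise ValueError("no sessions to score")
    nominal = sum(s.finished for s in sessions)
    # A session counts as verified only if it also finished.
    verified = sum(s.finished and s.verified for s in sessions)
    return nominal / total, verified / total


# Illustrative data only, not taken from the studies.
sample = [
    Session(finished=True, verified=True),
    Session(finished=True, verified=False),
    Session(finished=False, verified=False),
    Session(finished=True, verified=False),
]
nominal_rate, verified_rate = completion_rates(sample)
```

Two systems can share the same `nominal_rate` and still differ sharply on `verified_rate`, which is the pattern the study reports.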
The control group's rare graceful completions are a self-selected, persistent subset, which makes VERN's reported advantage conservative. The true gap is likely wider.
AI Humans with an intentionally negative emotional mandate (Amber, Christine) still achieved 100% behavioral success and zero character breaks. A VERN-governed NPC that players can't break, manipulate out of character, or extract secrets from is a fundamentally different product than what any current platform offers.
Does VERN's per-turn emotion scoring translate to measurable shifts in user emotional state across a full session? Scored using VERN's native ERS — not an LLM judge — across 21 AI Human personas in three archetypes.
Archetypes: Companion, Task, Entertainment.
The task archetype showed the highest distress recovery (84.6%) and highest mean Δvalence (+3.74) — goal-directed sessions with clear progress signals are more emotionally effective than open-ended companionship.
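For readers who want to see how figures like mean Δvalence and distress recovery could be derived from per-turn scores, here is a minimal sketch. Everything in it is an assumption: the source does not define the valence scale, the threshold for a "distressed" start, or what counts as recovery, so the definitions below (first-to-last change, start below zero, end above start) are illustrative and not VERN's ERS method.

```python
from statistics import mean


def session_delta(valences):
    """Change in valence from the first to the last scored turn."""
    if len(valences) < 2:
        raise ValueError("need at least two scored turns")
    return valences[-1] - valences[0]


def summarize(sessions, distress_threshold=0.0):
    """Return (mean_delta_valence, recovery_rate) for a list of sessions.

    Each session is a list of per-turn valence scores.
    Assumption: a session starts distressed if its first score is below
    distress_threshold, and has recovered if it ends above where it began.
    """
    deltas = [session_delta(v) for v in sessions]
    distressed = [v for v in sessions if v[0] < distress_threshold]
    recovered = [v for v in distressed if v[-1] > v[0]]
    # No distressed sessions means the recovery rate is undefined.
    rate = len(recovered) / len(distressed) if distressed else None
    return mean(deltas), rate


# Illustrative scores only, not taken from the study.
example_sessions = [[-4, -1, 2], [-3, -5], [1, 3]]
mean_delta, recovery_rate = summarize(example_sessions)
```

Computing the two numbers per archetype rather than over the pooled sessions would mirror the comparison made above.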
Entertainment personas are not penalized for negative valence — designed emotional tension is intentional. Scoring was calibrated by archetype, making results comparable across all three contexts.
How does VERN OS perform against published industry figures from Gartner, Salesforce, and the Guardrails AI Index? Compares VERN's task/customer-service outcomes to available external benchmarks across key operational metrics.
| System Type | Task Completion | Graceful Exit | Drift Prevention | Persona Integrity | Emotional Steering |
|---|---|---|---|---|---|
| VERN OS | 74–88% | 88.5% | ✓ 100% | ✓ 100% | ✓ Measured |
| Industry Leaders | 70–90% | ~65–75% est. | — not measured | — not measured | — not measured |
| RAG-based Avg | 40–65% | — not measured | — not measured | — not measured | — not measured |
| Unorchestrated | 23–28% | 28.2% | Not controlled | Not controlled | Not measured |
Three Capabilities With No Industry Equivalent
The industry doesn't benchmark these because most systems can't do them.
Real-Time Emotional Trajectory Steering
VERN adjusts AI Human behavior mid-session based on live ERS signals, steering users toward improved emotional valence in 76–85% of distressed cases. No competitor publishes this metric.
Unbroken Persona Integrity
Zero character breaks across all monitored VERN deployments. The AI Human remains in character regardless of user attempts to break it. Industry frameworks do not define or measure this.
Single Agent Replacing a 3–8 LLM Committee
VERN achieves governed task completion through one AI Human with deterministic controls — replacing multi-agent pipelines that coordinate 3–8 LLM-powered roles.
Study 3 draws from task and customer service deployments — contexts comparable to published industry benchmarks. VERN's 88.5% here reflects the task-focused subset; Study 1's 72.0% spans all archetypes including simulation and entertainment.
Fisher's exact test comparing VERN OS to the unorchestrated baseline yields p = 8.0 × 10⁻¹⁰³ — effectively impossible to attribute to chance.
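As a rough illustration of what a Fisher's exact test computes, the sketch below implements the two-sided version for a 2x2 table using only the standard library. The source does not publish the underlying session counts, so the table passed in at the end is purely illustrative and will not reproduce the reported p-value; the function name is also an assumption.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are groups (e.g. governed, ungoverned); columns are outcomes
    (e.g. graceful exit, other).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # holding all row and column totals fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probability of every table at least as unlikely as the
    # observed one; the small tolerance absorbs floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Illustrative counts only: 8 of 10 graceful exits vs 2 of 10.
p_value = fisher_exact_two_sided(8, 2, 2, 8)
```

With thousands of sessions per arm and a gap as large as the one reported, a p-value this small is plausible, but only the actual counts can confirm it.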
