Three Studies Show VERN Outperforms Industry Benchmarks

Across three studies and 20,000+ production conversations, governed AI finished about as many tasks as ungoverned AI, but finished far more of them correctly: verified completion of 53% vs 9% and clean exits of 72% vs 23%, against a control running the same model, with zero character breaks or drift. Raw completion rates hide that gap entirely. Quality is where governance shows up, and where the business value lives.

Governed and ungoverned AI can look almost identical on raw task completion. The difference shows up the moment you ask whether the task was done correctly — and now there are numbers for it.

VERN AI Research Portfolio — Three Studies
Research Portfolio · 2025–2026

VERN AI: Three Studies, One Platform

Converging evidence across behavioral governance, emotional intelligence, and market benchmarking — drawn from 20,000+ production AI conversations.
Each study's N was randomly sampled from more than 40,000 live conversations.

Study 1 · Behavioral Governance & Outcomes · 8,374 production sessions
Study 2 · Emotion Recognition System Impact · 7,295 annotated conversations
Study 3 · VERN OS vs Industry Benchmarks · 4,773 task-context sessions

Study 1 · N = 8,374
Governing Stochastic Generation

Does adding VERN OS's Behavioral Control Module between the application and the LLM produce measurably better session outcomes? Tested across all deployment archetypes against an unorchestrated control group running the same underlying LLM.

8.58× More Likely to Exit Gracefully · VERN 72.0% · Control 23.1%
10.9× More Likely, Verified Completion · VERN 52.6% · Control 9.2%
0 Character Breaks · Across all VERN deployments
0 Drift Incidents · Across all VERN deployments

Graceful Exit Rate
Sessions that closed successfully · z = 31.31, p < 0.0001, Cohen's h = 1.024
VERN OS 72.0% · Unorchestrated Control 23.1%
Verified Task Completion
Goal achieved + on-goal + no tangents + graceful close · z = 5.90, p = 5.2×10⁻⁹, OR = 10.9×. Did the task complete all the stated goals?
VERN OS 52.6% · Control 9.2%
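
The effect sizes quoted above can be approximately reproduced from the published rates alone. The sketch below assumes the 8.58× and 10.9× figures are odds ratios (the page labels only the second one "OR") and that Cohen's h uses the standard arcsine transform. The function names are illustrative and do not come from the studies. Small differences from the reported values are expected because the published rates are rounded.

```python
import math


def odds_ratio(p_treatment: float, p_control: float) -> float:
    """Odds of the outcome under treatment divided by the odds under control."""
    treatment_odds = p_treatment / (1 - p_treatment)
    control_odds = p_control / (1 - p_control)
    return treatment_odds / control_odds


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Rates as published for Study 1 (VERN OS vs. unorchestrated control).
graceful_or = odds_ratio(0.720, 0.231)
verified_or = odds_ratio(0.526, 0.092)
graceful_h = cohens_h(0.720, 0.231)

print(f"Graceful exit odds ratio:       {graceful_or:.2f}")
print(f"Verified completion odds ratio: {verified_or:.2f}")
print(f"Graceful exit Cohen's h:        {graceful_h:.3f}")
```

Note that an odds ratio is not the same as a ratio of rates: 72.0% against 23.1% is roughly a 3.1× difference in rate, while the odds differ by a much larger factor.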
What Nominal Completion Misses
Raw task rates look nearly identical — governance reveals the quality gap
Nominal Completion: VERN OS 76.2% · Control 74.1%
Graceful Exit: VERN OS 72.0% · Control 23.1%
Verified Completion: VERN OS 52.6% · Control 9.2%
Track B — Controlled Negative Mandate
Personas with intentional negative emotional design — graceful exit rate
Christine 97.6% · Amber 81.5%
Key Finding

The nominal completion rate is nearly identical (76.2% vs 74.1%, p = 0.71). The gap only emerges with enterprise-grade quality filters. Governance doesn't inflate task completion; it changes how sessions complete. (Nominal: did they finish the call? Verified: did the task get done correctly?)
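
To make the nominal/verified distinction concrete, here is a minimal sketch of how the two rates could diverge on the same sessions. It assumes verified completion is the conjunction of the four criteria listed above (goal achieved, on-goal, no tangents, graceful close). The `Session` fields and the toy data are hypothetical, since the study does not publish its schema.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Session:
    # Hypothetical field names, chosen to mirror the four published criteria.
    finished: bool        # nominal completion: the session ran to its end
    goal_achieved: bool
    on_goal: bool
    no_tangents: bool
    graceful_close: bool


def is_verified(s: Session) -> bool:
    """Verified completion: every quality criterion must hold at once."""
    return s.goal_achieved and s.on_goal and s.no_tangents and s.graceful_close


def rate(sessions: Sequence[Session], predicate: Callable[[Session], bool]) -> float:
    """Share of sessions for which the predicate is true."""
    return sum(1 for s in sessions if predicate(s)) / len(sessions)


# Toy data for illustration only, not study data.
sessions = [
    Session(True, True, True, True, True),     # finished and verified
    Session(True, True, False, True, True),    # finished, but drifted off-goal
    Session(True, True, True, True, False),    # finished, but closed abruptly
    Session(False, False, True, True, False),  # never finished
]

nominal_rate = rate(sessions, lambda s: s.finished)
verified_rate = rate(sessions, is_verified)
print(f"nominal: {nominal_rate:.0%}, verified: {verified_rate:.0%}")
```

In this toy set three of four sessions finish, but only one passes every filter, which is the pattern the headline numbers describe.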

Control Group Note

Control's rare graceful completions are a self-selected persistent subset — making VERN's reported advantage conservative. The true gap is likely wider.

Track B Result

AI Humans with an intentionally negative emotional mandate (Amber, Christine) still achieved 100% behavioral success and zero character breaks. A VERN-governed NPC that players can't break, manipulate out of character, or extract secrets from is a fundamentally different product than what any current platform offers.

Study 2 · N = 7,295
Emotion Recognition System Impact

Does VERN's per-turn emotion scoring translate to measurable shifts in user emotional state across a full session? Scored using VERN's native ERS — not an LLM judge — across 21 AI Human personas in three archetypes.

76.9% Distressed Users Left Calmer · Companion archetype · n = 733
84.6% Distressed Recovery Rate · Task archetype
+3.74 Peak Mean Δvalence · Task archetype · valence = love − distress
21 AI Human Personas · All showed positive Δvalence

Companion Archetype: +2.02 mean Δvalence per session · 76.9% of distressed users left calmer

Task Archetype: +3.74 mean Δvalence per session · 84.6% of distressed users left calmer

Entertainment Archetype: +2.72 mean Δvalence per session · 75.6% of distressed users left calmer

Mean Δvalence by Archetype
Positive = users ended with higher emotional valence than when they arrived (love − distress)
Companion +2.02 · Task +3.74 · Entertainment +2.72
Top Performer Δvalence
Selected AI Human personas — mean emotional valence change per session
Gloria +16.4 · Nick +10.4 · Carrie +8.7

How the Emotion Scoring Works
VERN ERS scores four signals per turn, 0–100. Valence and distress are derived metrics.
😡 Anger · 0–100
😰 Fear · 0–100
😢 Sadness · 0–100
❤️ Love · 0–100
Distress Score = (Anger + Fear + Sadness) ÷ 3 (higher = more distressed)
Valence Score = Love − Distress
Δvalence = close − open per session
Methodological note: ERS scores come from VERN's proprietary emotion signal, not an LLM judging another LLM. This avoids the model-grading-model circularity problem common in AI evaluation.
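
The derived metrics above are simple arithmetic on the four per-turn signals. The sketch below follows the published formulas and assumes that "open" and "close" mean the first and last scored turns of a session. That reading, the `TurnScores` type, and the sample scores are illustrative assumptions, not VERN's actual API or data.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TurnScores:
    # Hypothetical container for the four ERS signals, each on a 0-100 scale.
    anger: float
    fear: float
    sadness: float
    love: float


def distress(t: TurnScores) -> float:
    """Distress = (Anger + Fear + Sadness) / 3; higher means more distressed."""
    return (t.anger + t.fear + t.sadness) / 3


def valence(t: TurnScores) -> float:
    """Valence = Love - Distress."""
    return t.love - distress(t)


def delta_valence(turns: Sequence[TurnScores]) -> float:
    """Valence at close minus valence at open (assumed: last turn vs. first turn)."""
    return valence(turns[-1]) - valence(turns[0])


# Toy session for illustration: a user who arrives upset and leaves calmer.
opening = TurnScores(anger=60, fear=30, sadness=30, love=20)
closing = TurnScores(anger=15, fear=15, sadness=15, love=35)

session_delta = delta_valence([opening, closing])
print(f"open valence:  {valence(opening):+.1f}")
print(f"close valence: {valence(closing):+.1f}")
print(f"delta valence: {session_delta:+.1f}")
```

A per-archetype mean Δvalence would then be the average of `delta_valence` over all sessions in that archetype.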
Strongest Signal

The task archetype showed the highest distress recovery (84.6%) and highest mean Δvalence (+3.74), which suggests that goal-directed sessions with clear progress signals are more emotionally effective than open-ended companionship.

Archetype-Aware Scoring

Entertainment personas are not penalized for negative valence — designed emotional tension is intentional. Scoring was calibrated by archetype, making results comparable across all three contexts.

Study 3 · N = 4,773
VERN OS vs. Industry Benchmarks

How does VERN OS perform against published industry figures from Gartner, Salesforce, and the Guardrails AI Index? Compares VERN's task/customer-service outcomes to available external benchmarks across key operational metrics.

88.5% Graceful Exit Rate · vs. 28.2% unorchestrated · p = 8.0×10⁻¹⁰³
74–88% Task Completion Range · Industry leaders: 70–90%
0 Drift Incidents · No industry benchmark exists for this
+0.28 Avg Sentiment Lift · Across all task sessions

Graceful Exit Rate — VERN vs. Industry
Task and customer service contexts · Sources: Gartner 2025, Salesforce State of Service 2025, Guardrails AI Index 2025
VERN OS 88.5% · Industry Leaders 70% (est.) · RAG Average 52% (est.) · Unorchestrated 28.2%
Task Completion — Range Comparison
Published performance range for each system category
VERN OS  74–88%
Industry Leaders  70–90%
RAG-based Average  40–65%
Unorchestrated  23–28%
Behavioral Integrity
Metrics where VERN has no peer — the industry does not yet measure these
Character Integrity (zero breaks)
100% · VERN OS — no industry benchmark exists
Conversational Drift Prevention
100% · VERN OS — no industry benchmark exists
Simulation Behavioral Success
100% · Track B adversarial conditions
Sentiment Lift (avg per session)
+0.28 average — no industry benchmark exists
System Type | Task Completion | Graceful Exit | Drift Prevention | Persona Integrity | Emotional Steering
VERN OS | 74–88% | 88.5% | ✓ 100% | ✓ 100% | ✓ Measured
Industry Leaders | 70–90% | ~65–75% est. | not measured | not measured | not measured
RAG-based Avg | 40–65% | not measured | not measured | not measured | not measured
Unorchestrated | 23–28% | 28.2% | Not controlled | Not controlled | Not measured

Three Capabilities With No Industry Equivalent

The industry doesn't benchmark these because most systems can't do them.

🎯

Real-Time Emotional Trajectory Steering

VERN adjusts AI Human behavior mid-session based on live ERS signals, steering users toward improved emotional valence in 76–85% of distressed cases. No competitor publishes this metric.

No benchmark exists
🎭

Unbroken Persona Integrity

Zero character breaks across all monitored VERN deployments. The AI Human remains in character regardless of user attempts to break it. Industry frameworks do not define or measure this.

No benchmark exists

Single Agent Replacing a 3–8 LLM Committee

VERN achieves governed task completion through one AI Human with deterministic controls — replacing multi-agent pipelines that coordinate 3–8 LLM-powered roles.

No benchmark exists
Context Note

Study 3 draws from task and customer service deployments — contexts comparable to published industry benchmarks. VERN's 88.5% here reflects the task-focused subset; Study 1's 72.0% spans all archetypes including simulation and entertainment.

Statistical Confidence

Fisher's exact test comparing VERN OS to the unorchestrated baseline yields p = 8.0 × 10⁻¹⁰³ — effectively impossible to attribute to chance.
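
For readers unfamiliar with the test, the sketch below shows how a two-sided Fisher's exact p-value is computed from a 2×2 table of counts using only the standard library. The per-group counts behind Study 3 are not published here, so the table in the example is a small hypothetical one for illustration. It does not reproduce the reported p-value, and the function name is an assumption.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probability of every table with the same
    margins that is no more likely than the observed one.
    """
    row1 = a + b
    col1, col2 = a + c, b + d
    total = a + b + c + d

    def weight(x: int) -> int:
        # Unnormalized hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(col2, row1 - x)

    lowest = max(0, row1 - col2)
    highest = min(row1, col1)
    observed = weight(a)
    # Integer weights keep the "as extreme or more" comparison exact.
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(total, row1)


# Hypothetical counts: rows are two groups, columns are
# graceful exit / no graceful exit.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(f"two-sided p = {p_value:.4f}")
```

With thousands of sessions per group and a gap as wide as 88.5% vs. 28.2%, the same calculation produces extremely small p-values, which is why the reported figure carries so many zeros.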

VERN AI Research Portfolio · Three Studies · 20,000+ Production Conversations
Study 1: Governing Stochastic Generation (N=8,374)  ·  Study 2: ERS Impact (N=7,295)  ·  Study 3: Industry Benchmarks (N=4,773)

Data from live production deployments · Sources: Gartner 2025, Salesforce State of Service 2025, Guardrails AI Index 2025