VERN AI Research Portfolio — Three Studies

Research Portfolio · 2025–2026

VERN AI: Three Studies, One Platform

Converging evidence across behavioral governance, emotional intelligence, and market benchmarking, drawn from 20,000+ production AI conversations.
Each study's sample (N) was drawn at random from a pool of more than 40,000 live conversations.

Study 1

Behavioral Governance & Outcomes

8,374

production sessions

Study 2

Emotion Recognition System Impact

7,295

annotated conversations

Study 3

VERN OS vs Industry Benchmarks

4,773

task-context sessions

Study 1 · N = 8,374

Governing Stochastic Generation

Does adding VERN OS's Behavioral Control Module between the application and the LLM produce measurably better session outcomes? Tested across all deployment archetypes against an unorchestrated control group running the same underlying LLM.

8.58×

Higher Odds of
Graceful Exit

VERN 72.0% · Control 23.1%

10.9×

Higher Odds of
Verified Completion

VERN 52.6% · Control 9.2%

0

Character
Breaks

Across all VERN deployments

0

Drift
Incidents

Across all VERN deployments

Graceful Exit Rate

Sessions that closed successfully · z = 31.31, p < 0.0001, Cohen's h = 1.024

Verified Task Completion

A session counts as verified only if it meets all four criteria: goal achieved, on-goal, no tangents, and graceful close. In other words, did it complete every stated goal? · z = 5.90, p = 5.2×10⁻⁹, OR = 10.9
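The headline effect sizes above can be sanity-checked from the rounded percentages alone. The sketch below is not the study's analysis code; the function names are illustrative, and it assumes the 8.58× and 10.9× figures are odds ratios and that Cohen's h uses the standard arcsine transform. Because the inputs are rounded, the results should land near the reported values rather than match them exactly.

```python
import math


def odds_ratio(p_treated: float, p_control: float) -> float:
    """Odds ratio between two proportions: (p1 / (1 - p1)) / (p2 / (1 - p2))."""
    return (p_treated / (1 - p_treated)) / (p_control / (1 - p_control))


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Rounded rates as reported on this page (VERN vs. control).
graceful_or = odds_ratio(0.720, 0.231)   # graceful exit
verified_or = odds_ratio(0.526, 0.092)   # verified completion
graceful_h = cohens_h(0.720, 0.231)
graceful_rr = 0.720 / 0.231              # plain ratio of rates, for contrast

print(f"graceful exit odds ratio: {graceful_or:.2f}")
print(f"verified completion odds ratio: {verified_or:.2f}")
print(f"graceful exit Cohen's h: {graceful_h:.3f}")
print(f"graceful exit rate ratio: {graceful_rr:.2f}")
```

With these rounded inputs the odds ratios come out at about 8.56 and 10.95, and Cohen's h at about 1.024. The plain ratio of rates is only about 3.1, which is why the 8.58× figure reads as an odds ratio rather than "times more likely".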

What Nominal Completion Misses

Raw task rates look nearly identical — governance reveals the quality gap

Track B — Controlled Negative Mandate

Personas with intentional negative emotional design — graceful exit rate

Key Finding

The nominal completion rate is nearly identical (76.2% vs 74.1%, p = 0.71). The gap emerges only under enterprise-grade quality filters. Governance doesn't inflate task completion; it changes how sessions complete. Nominal completion asks whether the user finished the call. Verified completion asks whether the task was done correctly.
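The distinction between nominal and verified completion can be sketched as two predicates over the same session records. Everything here is hypothetical: the `Session` fields mirror the four criteria named for verified completion, the choice to treat nominal completion as "goal achieved" alone is an assumption, and the four toy records are invented purely to show how one rate can stay high while the other drops.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Session:
    # Hypothetical flags; the study's real schema is not shown on this page.
    goal_achieved: bool
    on_goal: bool
    no_tangents: bool
    graceful_close: bool


def is_nominal(s: Session) -> bool:
    """Assumed definition: the task was nominally finished."""
    return s.goal_achieved


def is_verified(s: Session) -> bool:
    """All four quality criteria must hold at once."""
    return s.goal_achieved and s.on_goal and s.no_tangents and s.graceful_close


def rate(sessions: Sequence[Session], pred: Callable[[Session], bool]) -> float:
    """Share of sessions that satisfy the predicate."""
    return sum(pred(s) for s in sessions) / len(sessions)


# Toy records, not study data.
toy = [
    Session(True, True, True, True),     # nominal and verified
    Session(True, False, True, True),    # finished, but drifted off-goal
    Session(True, True, True, False),    # finished, but closed abruptly
    Session(False, True, True, True),    # never reached the goal
]

print(f"nominal: {rate(toy, is_nominal):.2f}, verified: {rate(toy, is_verified):.2f}")
```

In this toy set three of four sessions complete nominally but only one passes every filter, which is the shape of gap the Key Finding describes.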

Control Group Note

The control group's rare graceful completions come from a self-selected, persistent subset of sessions, so VERN's reported advantage is conservative. The true gap is likely wider.

Track B Result

AI Humans with an intentionally negative emotional mandate (Amber, Christine) still achieved 100% behavioral success and zero character breaks. A VERN-governed NPC that players can't break, manipulate out of character, or extract secrets from is a fundamentally different product from anything current platforms offer.

Study 2 · N = 7,295

Emotion Recognition System Impact

Does VERN's per-turn emotion scoring translate to measurable shifts in user emotional state across a full session? Scored using VERN's native ERS — not an LLM judge — across 21 AI Human personas in three archetypes.

76.9%

Distressed Users
Left Calmer

Companion archetype · n = 733

84.6%

Distressed
Recovery Rate

Task archetype

+3.74

Peak Mean Δvalence

Task archetype · valence = love − distress

21

AI Human
Personas

All showed positive Δvalence

Companion Archetype

+2.02

Mean Δvalence per session

76.9%

of distressed users left calmer

Task Archetype

+3.74

Mean Δvalence per session

84.6%

of distressed users left calmer

Entertainment Archetype

+2.72

Mean Δvalence per session

75.6%

of distressed users left calmer

Mean Δvalence by Archetype

Positive = users ended with higher emotional valence than when they arrived (love − distress)

Top Performer Δvalence

Selected AI Human personas — mean emotional valence change per session

How the Emotion Scoring Works

VERN ERS scores four signals per turn, 0–100. Valence and distress are derived metrics.

😡

Anger

0–100

😰

Fear

0–100

😢

Sadness

0–100

❤️

Love

0–100

Distress Score

(Anger + Fear + Sadness) ÷ 3

Higher = more distressed

Valence Score

Love − Distress

Δvalence = close − open per session
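The formulas above translate directly into a few lines of code. This is a minimal sketch, not VERN's implementation: the `TurnScores` type and function names are illustrative, "open" and "close" are assumed to mean the first and last scored turns of a session, and the two example turns are invented numbers.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TurnScores:
    """Four per-turn signals, each on a 0-100 scale."""
    anger: float
    fear: float
    sadness: float
    love: float


def distress(t: TurnScores) -> float:
    """Distress = (Anger + Fear + Sadness) / 3; higher means more distressed."""
    return (t.anger + t.fear + t.sadness) / 3


def valence(t: TurnScores) -> float:
    """Valence = Love - Distress."""
    return t.love - distress(t)


def delta_valence(turns: Sequence[TurnScores]) -> float:
    """Session-level change: valence at close minus valence at open.

    Assumes the first and last turns stand for open and close.
    """
    return valence(turns[-1]) - valence(turns[0])


# Invented example session with two scored turns.
opening = TurnScores(anger=60, fear=30, sadness=30, love=10)
closing = TurnScores(anger=20, fear=10, sadness=30, love=25)

print(distress(opening), valence(opening))   # 40.0 -30.0
print(delta_valence([opening, closing]))     # 35.0
```

Since distress averages three 0-100 signals, valence ranges from -100 to +100 and Δvalence from -200 to +200, which gives some scale for the per-session means reported above.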

Methodological note: ERS scores come from VERN's proprietary emotion signal — not an LLM judging another LLM. This eliminates the model-grading-model circularity problem common in AI evaluation.

Strongest Signal

The task archetype showed the highest distress recovery (84.6%) and the highest mean Δvalence (+3.74), which suggests that goal-directed sessions with clear progress signals are more emotionally effective than open-ended companionship.

Archetype-Aware Scoring

Entertainment personas are not penalized for negative valence, because their emotional tension is by design. Scoring was calibrated by archetype so that results are comparable across all three contexts.

Study 3 · N = 4,773

VERN OS vs. Industry Benchmarks

How does VERN OS perform against published industry figures from Gartner, Salesforce, and the Guardrails AI Index? Compares VERN's task/customer-service outcomes to available external benchmarks across key operational metrics.

88.5%

Graceful Exit Rate

vs. 28.2% unorchestrated · p = 8.0×10⁻¹⁰³

74–88%

Task
Completion Range

Industry leaders: 70–90%

0

Drift
Incidents

No industry benchmark exists for this

+0.28

Avg Sentiment Lift

Across all
task sessions

Graceful Exit Rate — VERN vs. Industry

Task and customer service contexts · Sources: Gartner 2025, Salesforce State of Service 2025, Guardrails AI Index 2025

Task Completion — Range Comparison

Published performance range for each system category

VERN OS 74–88%

Industry Leaders 70–90%

RAG-based Average 40–65%

Unorchestrated 23–28%

Behavioral Integrity

Metrics where VERN has no peer — the industry does not yet measure these

Character Integrity (zero breaks)

100% · VERN OS — no industry benchmark exists

Conversational Drift Prevention

100% · VERN OS — no industry benchmark exists

Simulation Behavioral Success

100% · Track B adversarial conditions

Sentiment Lift (avg per session)

+0.28 average — no industry benchmark exists

System Type	Task Completion	Graceful Exit	Drift Prevention	Persona Integrity	Emotional Steering
VERN OS	74–88%	88.5%	✓ 100%	✓ 100%	✓ Measured
Industry Leaders	70–90%	~65–75% est.	— not measured	— not measured	— not measured
RAG-based Avg	40–65%	— not measured	— not measured	— not measured	— not measured
Unorchestrated	23–28%	28.2%	Not controlled	Not controlled	Not measured

Three Capabilities With No Industry Equivalent

The industry doesn't benchmark these because most systems can't do them.

🎯

Real-Time Emotional Trajectory Steering

VERN adjusts AI Human behavior mid-session based on live ERS signals, steering users toward improved emotional valence in 76–85% of distressed cases. No competitor publishes this metric.

No benchmark exists

🎭

Unbroken Persona Integrity

Zero character breaks across all monitored VERN deployments. The AI Human remains in character regardless of user attempts to break it. Industry frameworks do not define or measure this.

No benchmark exists

⚡

Single Agent Replacing a 3–8 LLM Committee

VERN achieves governed task completion through one AI Human with deterministic controls — replacing multi-agent pipelines that coordinate 3–8 LLM-powered roles.

No benchmark exists

Context Note

Study 3 draws from task and customer service deployments — contexts comparable to published industry benchmarks. VERN's 88.5% here reflects the task-focused subset; Study 1's 72.0% spans all archetypes including simulation and entertainment.

Statistical Confidence

Fisher's exact test comparing VERN OS to the unorchestrated baseline yields p = 8.0 × 10⁻¹⁰³ — effectively impossible to attribute to chance.
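For readers unfamiliar with the test, the sketch below shows how a two-sided Fisher's exact p-value is computed for a 2×2 table of outcomes. The group counts behind the reported p-value are not given in this summary, so the table here is a small invented example and will not reproduce that figure. The function name is illustrative and the code uses only the Python standard library.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(x: int) -> int:
        # Integer numerator of the hypergeometric probability for top-left cell x.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Comparing integer weights avoids floating-point tie problems.
    extreme = sum(w for w in map(weight, range(lowest, highest + 1)) if w <= observed)
    return extreme / comb(total, col1)


# Invented toy table: rows are two groups, columns are success / failure.
p_toy = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p_toy:.4f}")
```

The same function accepts real counts once they are available. With samples in the thousands and rates as far apart as 88.5% and 28.2%, an extremely small p-value is the expected outcome.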

VERN Admin
