Governing Stochastic Generation:
Deterministic Runtime Orchestration via VERN OS
Study Figures & Visualizations
Production Dataset: N = 8,374 sessions · Annotated Sample: N = 627
Graceful exit: VERN OS 72.0% vs. Control (no BCM) 23.1% (OR 8.58×, +48.9 pp)
Verified completion: VERN OS 52.6% vs. Control (no BCM) 9.2% (OR 10.9×, +43.4 pp)
Conversational drift: 0 (zero character breaks across all VERN deployments)
Track B behavioral success: 100% (Amber & Christine, multi-directional control)
Section 3 — VERN OS Architecture
Figure 1. VERN OS three-layer runtime architecture. The deterministic orchestration layer sits between the user-facing application and the underlying stochastic LLM. The Behavioral Control Module (BCM) evaluates deployment policy before every inference call. The emotional signal layer tracks user affective state across turns. The audit layer logs session outcomes for post-interaction review and compliance. VERN OS does not replace the LLM — it governs the permissible behavioral space within which it operates.
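The control flow in the caption can be illustrated with a minimal sketch. Everything here is hypothetical: the source does not describe the contents of a deployment policy or any VERN OS API, so the `Policy` fields (a turn budget and a banned-topic list), the `Session` record, the function names, and the fixed closing message are assumptions made purely for illustration. The only points taken from the caption are the ordering (a deterministic policy check runs before every inference call), the per-turn tracking of affective state, and the audit log.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Policy:
    """Hypothetical deployment policy (fields are illustrative assumptions)."""
    max_turns: int
    banned_topics: Tuple[str, ...] = ()


@dataclass
class Session:
    turns: int = 0
    affect: List[float] = field(default_factory=list)   # emotional signal layer
    audit_log: List[str] = field(default_factory=list)  # audit layer


def bcm_allows(policy: Policy, session: Session, user_text: str) -> bool:
    """Deterministic check evaluated before every inference call."""
    if session.turns >= policy.max_turns:
        return False
    lowered = user_text.lower()
    return not any(topic in lowered for topic in policy.banned_topics)


def orchestrate_turn(policy: Policy, session: Session, user_text: str,
                     affect_score: float, llm: Callable[[str], str]) -> str:
    """Run one turn: track affect, gate the LLM call, then log the outcome."""
    session.affect.append(affect_score)
    if bcm_allows(policy, session, user_text):
        reply, outcome = llm(user_text), "inference"
    else:
        # The stochastic model is never called when the policy check fails.
        reply, outcome = "Thanks for your time. Goodbye.", "graceful_exit"
    session.turns += 1
    session.audit_log.append(f"turn={session.turns} outcome={outcome}")
    return reply
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client. The gate itself contains no randomness, which is the property the caption attributes to the orchestration layer.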
Section 5.1 — H₁: BCM Governance & Graceful Exit
Figure 2. Graceful exit rates across the full production dataset (N = 8,374). VERN OS achieved 72.0% versus 23.1% for the unorchestrated control, a separation of 48.9 percentage points. The odds of reaching a graceful close are 8.6× higher in BCM-governed sessions (OR = 8.58). Excludes Izzy and Maria-HMF, each of which combined elements of the VERN OS and unorchestrated control architectures and cannot be cleanly assigned to either cohort. Replicates and updates original Figure 1.
z = 31.31 · p < 0.0001 · Cohen's h = 1.024 (large) · OR = 8.58× · 95% CI VERN: 71.0–73.0% · 95% CI Control: 20.6–25.7%
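As a rough check, the effect sizes can be recomputed from the two reported graceful exit rates alone. The sketch below uses the standard definitions of the odds ratio and Cohen's h. It takes the rounded percentages from the figure as inputs, so it reproduces the reported values only approximately. The z statistic and confidence intervals are not recomputed because they depend on per-group session counts, which are not stated here.

```python
import math


def odds_ratio(p_treat: float, p_ctrl: float) -> float:
    """Ratio of the odds p/(1-p) in the treated group to the control group."""
    return (p_treat / (1 - p_treat)) / (p_ctrl / (1 - p_ctrl))


def cohens_h(p_treat: float, p_ctrl: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p_treat)) - 2 * math.asin(math.sqrt(p_ctrl))


# Rounded rates as reported in Figure 2.
p_vern, p_control = 0.720, 0.231

print(f"difference: {(p_vern - p_control) * 100:.1f} pp")      # 48.9 pp
print(f"odds ratio: {odds_ratio(p_vern, p_control):.2f}")      # 8.56
print(f"Cohen's h:  {cohens_h(p_vern, p_control):.3f}")        # 1.024
```

The odds ratio from rounded inputs (8.56) sits slightly below the reported 8.58, which is consistent with the reported figure having been computed from unrounded rates or raw counts. Note that an odds ratio is not a ratio of probabilities: the ratio of the two rates themselves is about 3.1.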
Figure 3. Graceful exit rates by deployment archetype. The task/workflow archetype shows the strongest separation (+63.2 pp). Without BCM governance, task-oriented AI Humans fail to deliver lead captures, workflow completions, recommendations, or routing in 91.6% of interactions.
Section 5.3 — Supplemental Analysis: Strict vs. Verified Task Completion
Figure 4. Three-tier task completion analysis (N = 208 annotated task-archetype sessions). At the strict/nominal level, the groups are statistically indistinguishable (NS). The verified filter — requiring on-goal behavior, zero tangents, and graceful exit simultaneously — reveals the full magnitude of BCM governance's contribution. This is the paper's primary new finding.
What does "Evaluable" mean? Not every session is a fair test of an AI Human's capability — sometimes a user disengages within the first exchange before the system has any opportunity to pursue its goal. The evaluable tier filters to sessions where the user was sufficiently engaged (engagement score ≥ 0.5) to give the AI Human a genuine attempt at completing the workflow. It answers the question: of the sessions where the user actually showed up, how often did the AI Human succeed? The evaluable rate is therefore higher than the strict rate for VERN OS (80.7% vs. 76.2%) because passive early drop-offs are excluded. For the control group the evaluable and strict rates are identical, since their completions are already rare and mostly occur in engaged sessions.
Strict: z = 0.37, p = 0.71 (NS)
Verified: z = 5.90, p = 5.2×10⁻⁹, Cohen's h = 1.005, OR = 10.9×
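The three tiers described for Figure 4 can be expressed as filters over annotated sessions. This is a sketch under stated assumptions, not the study's annotation pipeline: the `AnnotatedSession` fields and function names are hypothetical, the verified tier is assumed to require nominal completion in addition to the three stated conditions, and the strict and verified rates are assumed to use all sessions as the denominator. The 0.5 engagement threshold and the three verified conditions come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List

EVALUABLE_MIN_ENGAGEMENT = 0.5  # threshold stated in the text


@dataclass
class AnnotatedSession:
    """Hypothetical annotation record for one session."""
    completed: bool      # nominal (strict) task completion
    engagement: float    # engagement score
    on_goal: bool
    tangents: int
    graceful_exit: bool


def is_evaluable(s: AnnotatedSession) -> bool:
    return s.engagement >= EVALUABLE_MIN_ENGAGEMENT


def is_verified(s: AnnotatedSession) -> bool:
    # Assumption: verified means completed AND on-goal AND no tangents AND graceful exit.
    return s.completed and s.on_goal and s.tangents == 0 and s.graceful_exit


def tier_rates(sessions: List[AnnotatedSession]) -> Dict[str, float]:
    """Completion rate at each tier; the evaluable tier shrinks the denominator."""
    if not sessions:
        raise ValueError("need at least one session")
    evaluable = [s for s in sessions if is_evaluable(s)]
    return {
        "strict": sum(s.completed for s in sessions) / len(sessions),
        "evaluable": (sum(s.completed for s in evaluable) / len(evaluable)
                      if evaluable else 0.0),
        "verified": sum(is_verified(s) for s in sessions) / len(sessions),
    }
```

With made-up data in which one of four sessions is a passive early drop-off, the evaluable rate exceeds the strict rate because the drop-off leaves the denominator, which mirrors the 80.7% versus 76.2% pattern reported for VERN OS.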
Figure 5. Per-persona task completion: the strict (nominal) rate is shown as a light bar behind the dark verified bar. Control personas collapse when the verified (enterprise-grade) filter is applied: Denise drops 80 pp, Becky drops 68 pp, and Ronnie reaches zero. Craig (VERN OS) is the only persona whose strict and verified rates are identical: each of his BCM-governed completions also exits cleanly. *Dave's lower verified rate relates to a post-task webhook routing issue, not interactive execution failure.
Section 5.4 — Industry Benchmark Positioning
Figure 6. VERN OS's 72.0% graceful exit rate matches or exceeds best-in-class e-commerce chatbot completion benchmarks and outperforms the top of Gartner's range for mature RAG deployments. The unorchestrated control group (23.1%) falls within the rule-based bot range, confirming that LLM capability alone does not produce reliable session completion without runtime governance.
Industry benchmarks: Gartner (2024); industry aggregates, 2025.
Section 5.6 — Track B: Simulation & Controlled Tension
Figure 7. Track B simulation personas are BCM-governed deployments with an intentionally negative emotional mandate. Amber (entertainment) and Christine (horror) each achieved high graceful exit rates despite being designed for negative emotional outcomes. This demonstrates that VERN OS provides multi-directional control over emotional trajectory, not merely a bias toward positive affect. Both achieved 100% behavioral success with zero character breaks.
Appendix A — Full Deployment Cohort: Graceful Exit Summary
Figure 8. Graceful exit rates for all 21 personas across groups. Carlos and Nick (Control, Companion) show higher rates because they are companion-only personas without a task mandate, consistent with the archetype gap shown in Figure 3. Control task personas (Ronnie, Becky, Denise) show dramatically lower rates, even compared with the control companion personas. *Dave's lower rate is attributable to a post-task webhook routing disconnect, not interactive failure.
Excludes Izzy and Maria-HMF — both deployments incorporated elements of both VERN OS and unorchestrated control architectures and cannot be cleanly assigned to either cohort.
