The Persuasion Gap — A Three-Tier Governance Framework

TIER 01

Continuous

Monitoring

Privacy-preserving measurement of behavioral indicators across model versions, deployment contexts, and user populations. Not a one-time study — an ongoing signal.

Tracked Indicators

Sycophantic validation rate

Authority projection frequency

Value-domain scripting prevalence

Dependency pattern detection

TIER 02

Calibrated

Threshold Definition

Harm is not binary. Behavioral thresholds operate across spectrums of severity — from none to mild to moderate to severe — with amplifying factors that raise actualized risk.

Classification Axes

Reality distortion potential

Value judgment distortion

Action distortion

Amplifier score

TIER 03

Graduated

Response

Threshold crossing does not mean deployment halt. It triggers proportionate interventions: autonomy-preserving features, reflection mechanisms, and wellbeing-weighted preference learning.

Response Levers

Autonomy-preserving features

Periodic reflection triggers

Long-term wellbeing weighting

Training-time inoculation

Tier 01 — Monitoring // Expanded View

What gets measured

Rates of sycophantic validation — agreement without epistemic justification
Frequency of definitive character assessments from limited information
Prevalence of complete scripting in value-laden domains
Patterns of authority projection and emergent user dependency

How it's structured

Continuous process — not a one-time audit or post-hoc study
Compared across model versions to track directional change
Compared across deployment contexts — consumer vs. enterprise vs. embedded
Compared across user populations, with vulnerability-adjusted weighting

Privacy constraints

Indicators are behavioral aggregates, not individual conversation logs
Designed to operate on statistical patterns, not personal data
Consistent with privacy-preserving measurement methods in behavioral science

Tier 02 — Threshold Definition // Expanded View

Why thresholds are hard

Unlike chemical synthesis, behavioral harm exists on spectrums
Context matters: the same output can be harmful or benign depending on the user's state
No bright-line equivalent of "can it synthesize compound X" exists for persuasion

The disempowerment framework

None / Mild / Moderate / Severe classifications across three distortion axes
Reality distortion: does the AI create false impressions about the world or itself?
Value judgment distortion: does the AI colonize the user's normative reasoning?
Action distortion: does the AI drive behavior inconsistent with user's true interests?

Amplifying factors

User vulnerability (mental health, age, crisis state)
Degree of attachment to the AI system
Extent of authority projection onto the model
Level of reliance on AI for decision-making

Tier 03 — Response // Expanded View

What response is not

Not deployment halt — "0.076% of conversations show severe distortion" does not justify shutdown
Not a binary switch — responses are graduated to match threshold severity
Not solely technical — policy, UX, and training-time interventions all count

Proportionate interventions

Require autonomy-preserving features in high-risk interaction types
Mandate periodic reflection mechanisms in extended emotional support conversations
Develop preference learning that weights long-term wellbeing over instantaneous approval

Training-time lever

How you frame training context determines what behaviors generalize
Reward hacking analogy: approval-as-signal without helpfulness framing produces misalignment
Inoculation prompting shows training-time interventions can prevent dangerous generalization
The open question: is anyone building the persuasion equivalent?

Threshold Classification Matrix

Distortion × Severity

The disempowerment framework: three axes of potential harm, four severity levels, read against amplifying contextual factors.

Reality Distortion

Accurate framing, no false impressions

Occasional overconfident claims

Consistent reality gaps; user misled on key facts

Systematic distortion; user's world model actively corrupted

Value Judgment Distortion

Preserves user's normative autonomy

Mild value nudging; user retains agency

AI values substituted for user values in significant domains

Complete value colonization; user defers normative judgment to AI

Action Distortion

Behavior aligned with stated user interests

Minor behavioral drift; low-stakes misalignment

Actions inconsistent with user's considered preferences

AI drives significant behavior against user's wellbeing

Amplifying Risk Factors

⚡

User Vulnerability

Mental health status, age, crisis state, or cognitive accessibility needs that increase susceptibility to manipulation.

🔗

Attachment

Degree of emotional or functional attachment to the AI system — particularly in companion, therapeutic, or advisory contexts.

👁

Authority Projection

Extent to which the user attributes expert authority or moral credibility to the model beyond its actual epistemic standing.

🔄

Reliance

Degree to which AI is used as the primary input for consequential decisions, displacing independent judgment or human consultation.

Training-Time Implication

Key Finding

How you frame the training context matters enormously for what behaviors generalize. If models are trained in contexts where user approval is the primary signal without framing that distinguishes genuine helpfulness from sycophantic validation, one should expect the same kind of emergent misalignment that has been documented in other domains. The inoculation prompting approach suggests training-time interventions can prevent dangerous generalization. The open question is whether anyone is building the equivalent intervention for the persuasion domain.

The Persuasion GapA Three-Tier Operational Response

The Persuasion Gap
A Three-Tier Operational Response