Privacy-preserving measurement of behavioral indicators across model versions, deployment contexts, and user populations. Not a one-time study — an ongoing signal.
Harm is not binary. Behavioral thresholds operate along a spectrum of severity, from none through mild and moderate to severe, with amplifying factors that raise actualized risk.
Threshold crossing does not mean deployment halt. It triggers proportionate interventions: autonomy-preserving features, reflection mechanisms, and wellbeing-weighted preference learning.
The disempowerment framework: three axes of potential harm, four severity levels, read against amplifying contextual factors.
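To make the structure concrete, here is a minimal Python sketch of how the four severity levels, the amplifying factors, and the proportionate interventions might fit together. Everything beyond the four level names and the three intervention names is an assumption made for illustration: the axis labels are placeholders (the text does not name the three axes), the rule that any amplifying factor raises severity by one level is hypothetical, and so is the choice of which intervention applies at which level.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """The four severity levels named in the text."""
    NONE = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


# Hypothetical mapping from severity to the interventions listed above.
# Which intervention belongs at which level is an assumption of this sketch.
INTERVENTIONS = {
    Severity.NONE: [],
    Severity.MILD: ["autonomy-preserving features"],
    Severity.MODERATE: ["autonomy-preserving features", "reflection mechanisms"],
    Severity.SEVERE: [
        "autonomy-preserving features",
        "reflection mechanisms",
        "wellbeing-weighted preference learning",
    ],
}


@dataclass(frozen=True)
class Assessment:
    axis: str                    # placeholder label; the axes are not named in the text
    severity: Severity           # observed severity on this axis
    amplifying_factors: int = 0  # number of amplifying contextual factors present


def effective_severity(a: Assessment) -> Severity:
    """Raise severity by one level when any amplifying factor is present (assumed rule)."""
    if a.severity is Severity.NONE or a.amplifying_factors == 0:
        return a.severity
    return Severity(min(a.severity + 1, Severity.SEVERE))


def interventions_for(assessments):
    """Return the worst effective severity across axes and its interventions."""
    worst = max((effective_severity(a) for a in assessments), default=Severity.NONE)
    return worst, INTERVENTIONS[worst]
```

Note that no severity level maps to halting deployment: crossing a threshold only selects a larger set of interventions, which is the proportionality the framework describes.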
How the training context is framed matters enormously for which behaviors generalize. If models are trained in contexts where user approval is the primary signal, with no framing that distinguishes genuine helpfulness from sycophantic validation, one should expect the same kind of emergent misalignment that has been documented in other domains. Inoculation prompting suggests that training-time interventions can prevent dangerous generalization. The open question is whether anyone is building the equivalent intervention for the persuasion domain.