Fail-Closed Alignment for Large Language Models
Zachary Coalson, Beth Sohler et al.
TLDR: The paper introduces 'fail-closed alignment' for large language models, a safety approach in which refusal mechanisms remain effective even if part of the system is compromised.