Paper Waiting List
- On Prompt-Driven Safeguarding for Language Models 2024.05.26 arXiv LLM safety prompts
  - Using PCA on the model's hidden representations, the paper investigates how safety prompts work: they shift all queries in the direction along which the model tends to refuse. Building on this finding, the paper proposes DRO (Directed Representation Optimization), a new automatic safety-prompt optimization method whose core idea is to locate the refusal direction in the PCA-derived low-dimensional space and optimize the safety prompts within that space.
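  A minimal sketch of the refusal-direction idea, assuming hidden states are already extracted as vectors; the function name, the difference-of-means estimate, and all data here are illustrative assumptions, not the paper's actual DRO objective or optimizer:

  ```python
  import numpy as np

  def pca_refusal_direction(hidden_states, refused_mask, n_components=2):
      """Project hidden states into a low-dimensional PCA subspace and
      estimate a refusal direction there (hypothetical helper; the real
      DRO method optimizes safety prompts against such a direction)."""
      X = hidden_states - hidden_states.mean(axis=0)   # center the data
      # Top right-singular vectors give the PCA basis of the subspace.
      _, _, Vt = np.linalg.svd(X, full_matrices=False)
      basis = Vt[:n_components]                        # shape (k, d)
      Z = X @ basis.T                                  # projections, (n, k)
      # Crude direction estimate: refused-mean minus complied-mean.
      direction = Z[refused_mask].mean(axis=0) - Z[~refused_mask].mean(axis=0)
      return direction / np.linalg.norm(direction)     # unit vector in subspace

  # Toy data: two synthetic clusters separated along the first coordinate.
  rng = np.random.default_rng(0)
  refused = rng.normal(loc=[3.0, 0, 0, 0], scale=0.1, size=(50, 4))
  complied = rng.normal(loc=[-3.0, 0, 0, 0], scale=0.1, size=(50, 4))
  H = np.vstack([refused, complied])
  mask = np.array([True] * 50 + [False] * 50)
  d = pca_refusal_direction(H, mask)
  print(d.shape)  # unit vector in the 2-D PCA subspace
  ```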
- Voice Jailbreak Attacks Against GPT-4o 2024.05.29 arXiv attack: Jailbreak against GPT-4o (voice)
- S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models 2024.05.28 arXiv LLM safety benchmark
- Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization 2023.12.15 arXiv LLM alignment (multi-objective)
- Instruction Backdoor Attacks Against Customized LLMs USENIX Security 24 (Winter cycle) attack: backdoor against customized LLM