对齐的理论、技术与评估(Theories, Techniques, and Evaluation of AI Alignment)

Ji Jiaming (吉嘉铭), Qiu Tianyi (邱天异), Chen Boyuan (陈博远), Yang Yaodong (杨耀东)


Abstract
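AI alignment aims to make the behavior of AI systems consistent with human intentions and values. As AI systems become more capable, the risks posed by alignment failures also grow. Hundreds of AI experts and public figures have expressed concern about AI risks, arguing that mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war (CAIS, 2023). To provide a comprehensive and up-to-date overview of the alignment field, this paper examines the core theories, techniques, and evaluation of alignment. First, it identifies four key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality (RICE). Guided by these four principles, it outlines the landscape of current AI alignment research and decomposes it into two key components: forward alignment and backward alignment. The paper aims to provide a comprehensive, beginner-friendly survey of alignment research. It is accompanied by a continuously updated website, www.alignmentsurvey.com, which offers tutorials, paper collections, and other resources. A more detailed discussion and analysis is available at https://arxiv.org/abs/2310.19852.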
Anthology ID:
2024.ccl-2.7
Volume:
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum)
Month:
July
Year:
2024
Address:
Taiyuan, China
Editor:
Xin Zhao
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
120–140
Language:
Chinese
URL:
https://aclanthology.org/2024.ccl-2.7/
Cite (ACL):
Ji Jiaming, Qiu Tianyi, Chen Boyuan, and Yang Yaodong. 2024. 对齐的理论、技术与评估(Theories, Techniques, and Evaluation of AI Alignment). In Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 2: Frontier Forum), pages 120–140, Taiyuan, China. Chinese Information Processing Society of China.
Cite (Informal):
对齐的理论、技术与评估(Theories, Techniques, and Evaluation of AI Alignment) (Ji et al., CCL 2024)
PDF:
https://aclanthology.org/2024.ccl-2.7.pdf