Linhao Luo


2024

pdf bib
RENOVI: A Benchmark Towards Remediating Norm Violations in Socio-Cultural Conversations
Haolan Zhan | Zhuang Li | Xiaoxi Kang | Tao Feng | Yuncheng Hua | Lizhen Qu | Yi Ying | Mei Rianto Chandra | Kelly Rosalin | Jureynolds Jureynolds | Suraj Sharma | Shilin Qu | Linhao Luo | Ingrid Zukerman | Lay-Ki Soon | Zhaleh Semnani Azad | Reza Haf
Findings of the Association for Computational Linguistics: NAACL 2024

Norm violations occur when individuals fail to conform to culturally accepted behaviors, which may lead to potential conflicts. Remediating norm violations requires social awareness and cultural sensitivity of the nuances at play. To equip interactive AI systems with a remediation ability, we offer ReNoVi — a large-scale corpus of 9,258 multi-turn dialogues annotated with social norms, as well as define a sequence of tasks to help understand and remediate norm violations step by step. ReNoVi consists of two parts: 512 human-authored dialogues (real data), and 8,746 synthetic conversations generated by ChatGPT through prompt learning. While collecting sufficient human-authored data is costly, synthetic conversations provide suitable amounts of data to help mitigate the scarcity of training data, as well as the chance to assess the alignment between LLMs and humans in the awareness of social norms. We thus harness the power of ChatGPT to generate synthetic training data for our task. To ensure the quality of both human-authored and synthetic data, we follow a quality control protocol during data collection. Our experimental results demonstrate the importance of remediating norm violations in socio-cultural conversations, as well as the improvement in performance obtained from synthetic data.

2023

pdf bib
Systematic Assessment of Factual Knowledge in Large Language Models
Linhao Luo | Trang Vu | Dinh Phung | Reza Haf
Findings of the Association for Computational Linguistics: EMNLP 2023

Previous studies have relied on existing question-answering benchmarks to evaluate the knowledge stored in large language models (LLMs). However, this approach has limitations regarding factual knowledge coverage, as it mostly focuses on generic domains which may overlap with the pretraining data. This paper proposes a framework to systematically assess the factual knowledge of LLMs by leveraging knowledge graphs (KGs). Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions. We systematically evaluate the state-of-the-art LLMs with KGs in generic and specific domains. The experiment shows that ChatGPT is consistently the top performer across all domains. We also find that LLMs performance depends on the instruction finetuning, domain and question complexity and is prone to adversarial context.