Haofei Yu


2024

pdf bib
SOTOPIA-π: Interactive Learning of Socially Intelligent Language Agents
Ruiyi Wang | Haofei Yu | Wenxin Zhang | Zhengyang Qi | Maarten Sap | Yonatan Bisk | Graham Neubig | Hao Zhu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Humans learn social skills through both imitation and social interaction. This social learning process is largely understudied by existing research on building language agents. Motivated by this gap, we propose an interactive learning method, SOTOPIA-π, that improves the social intelligence of language agents. This method leverages behavior cloning and self-reinforcement training on social interaction data filtered according to large language model (LLM) ratings. We show that our training method allows a 7B LLM to reach the social goal completion ability of an expert model (GPT-4-based agent) without the loss of more generic abilities, such as the ability to answer knowledge-based questions. We also demonstrate that this training paradigm uncovers some weaknesses in standard evaluation and safety training paradigms: (1) LLM-based evaluation of social intelligence overestimates the abilities of language agents trained specifically for social interaction, and (2) despite not training for better safety or question answering (QA) ability, our method improves the safety of language agents and maintains general QA ability on the MMLU benchmark.
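
A minimal sketch of the rating-based data filtering described in the abstract: keep only interaction episodes whose LLM-assigned goal-completion rating clears a threshold, then turn the target agent's turns into fine-tuning pairs. The function names, data layout, and threshold are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch: filter episodes by an LLM rating, then build training pairs.
def filter_episodes(episodes, rate_with_llm, min_rating=7.0):
    """episodes: list of dicts with 'dialogue' (list of turns) and 'agent' keys."""
    kept = []
    for ep in episodes:
        rating = rate_with_llm(ep["dialogue"])  # e.g., a GPT-4 judge scoring goal completion
        if rating >= min_rating:
            kept.append(ep)
    return kept

def to_training_pairs(episodes, agent_name):
    """Turn each of the target agent's utterances into (context, response) pairs."""
    pairs = []
    for ep in episodes:
        history = []
        for turn in ep["dialogue"]:
            if turn["speaker"] == agent_name and history:
                pairs.append({"context": list(history), "response": turn["text"]})
            history.append(turn)
    return pairs

Behavior cloning would apply this pipeline to expert (GPT-4-agent) episodes and self-reinforcement to the trained agent's own episodes, both passed through the same rating filter.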

2023

pdf bib
RFiD: Towards Rational Fusion-in-Decoder for Open-Domain Question Answering
Cunxiang Wang | Haofei Yu | Yue Zhang
Findings of the Association for Computational Linguistics: ACL 2023

Open-Domain Question Answering (ODQA) systems necessitate a reader model capable of generating answers by simultaneously referring to multiple passages. Although representative models like Fusion-in-Decoder (FiD) have been proposed to address this challenge, these systems can inadvertently rely on spurious features instead of genuine causal relationships between the question and the passages to generate answers. To counter this problem, we introduce the Rational Fusion-in-Decoder (RFiD) model. Our model leverages the encoders of FiD to differentiate between causal relationships and spurious features, subsequently guiding the decoder to generate answers informed by this discernment. Experimental results on two ODQA datasets, Natural Questions (NQ) and TriviaQA (TQ), demonstrate that our model surpasses previous methods, achieving improvements of up to 1.5 and 0.7 in Exact Match scores on NQ and TQ, respectively, and exhibits an enhanced ability to identify causal relationships.
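
A rough sketch of the idea of scoring each retrieved passage for a genuine (causal) question-passage relationship on top of FiD encoder outputs, so that the score can inform the decoder. Module structure, tensor shapes, and names are assumptions for illustration, not the released RFiD code.

# Illustrative passage-level rationale scorer on top of FiD-style encoder outputs.
import torch
import torch.nn as nn

class PassageRationaleScorer(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # spurious vs. causal

    def forward(self, encoder_hidden, passage_mask):
        # encoder_hidden: (batch, n_passages, seq_len, hidden)
        # passage_mask:   (batch, n_passages, seq_len), 1 for real tokens
        mask = passage_mask.unsqueeze(-1).float()
        pooled = (encoder_hidden * mask).sum(2) / mask.sum(2).clamp(min=1e-6)
        logits = self.classifier(pooled)              # (batch, n_passages, 2)
        causal_prob = logits.softmax(-1)[..., 1]      # probability a passage is causal
        return causal_prob  # could be used to bias decoder cross-attention toward causal passages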

pdf bib
Uni-Encoder: A Fast and Accurate Response Selection Paradigm for Generation-Based Dialogue Systems
Chiyu Song | Hongliang He | Haofei Yu | Pengfei Fang | Leyang Cui | Zhenzhong Lan
Findings of the Association for Computational Linguistics: ACL 2023

Sample-and-rank is a key decoding strategy for modern generation-based dialogue systems. It helps achieve diverse and high-quality responses by selecting an answer from a small pool of generated candidates. Current state-of-the-art ranking methods mainly use an encoding paradigm called Cross-Encoder, which separately encodes each context-candidate pair and ranks the candidates by their fitness scores. However, Cross-Encoder repeatedly encodes the same lengthy context for each candidate, resulting in high computational costs. Poly-Encoder addresses this problem by reducing the interaction between context and candidates, but at the price of a performance drop. In this work, we develop a new paradigm called Uni-Encoder, which keeps full attention over each context-candidate pair, as in Cross-Encoder, while encoding the context only once, as in Poly-Encoder. Uni-Encoder encodes all candidates together with the context in one forward pass. We use the same positional embeddings for all candidates to ensure they are treated equally and design a new attention mechanism to avoid confusion between them. Uni-Encoder can also simulate other ranking paradigms through different attention and response-concatenation schemes. Extensive experiments show that the proposed paradigm achieves new state-of-the-art results on four benchmark datasets with high computational efficiency; for instance, it improves R10@1 by 2.9% with approximately 4x faster inference on the Ubuntu V2 dataset.
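
A minimal sketch of the single-forward-pass setup described above: the context and all candidates are packed into one sequence, every candidate reuses the same position ids, and an attention mask lets each candidate attend to the context and to itself but not to other candidates. Shapes, names, and the exact masking scheme are illustrative assumptions rather than the paper's implementation.

# Illustrative position ids and attention mask for packing context + all candidates.
import torch

def build_uni_encoder_inputs(ctx_len, cand_lens):
    total = ctx_len + sum(cand_lens)
    position_ids = list(range(ctx_len))
    for L in cand_lens:                                # every candidate restarts at the same positions
        position_ids += list(range(ctx_len, ctx_len + L))
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:ctx_len, :ctx_len] = True                    # context attends only to context
    offset = ctx_len
    for L in cand_lens:
        mask[offset:offset + L, :ctx_len] = True            # candidate -> context
        mask[offset:offset + L, offset:offset + L] = True   # candidate -> itself
        offset += L
    return torch.tensor(position_ids), mask

Because all candidates share position ids and cannot attend to one another, each is scored against the same context representation while that context is encoded only once.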

pdf bib
TRAMS: Training-free Memory Selection for Long-range Language Modeling
Haofei Yu | Cunxiang Wang | Yue Zhang | Wei Bi
Findings of the Association for Computational Linguistics: EMNLP 2023

The Transformer architecture is crucial for numerous AI models, but it still faces challenges in long-range language modeling. Although several specialized Transformer architectures have been designed to tackle long-range dependencies, existing methods such as Transformer-XL are plagued by a high percentage of ineffective memories. In this study, we present a plug-and-play strategy, TRAining-free Memory Selection (TRAMS), which selects the tokens participating in the attention calculation based on one simple metric. This strategy allows us to keep tokens that are likely to have a high attention score with the current queries and to ignore the rest. We have tested our approach on a word-level benchmark (WikiText-103) and a character-level benchmark (enwik8), and the results indicate an improvement without any additional training or added parameters.
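
A rough, training-free sketch of the memory-selection idea: score each cached memory token with a cheap, query-independent metric and keep only the top-k for attention. The key-norm score below is a stand-in chosen for illustration; it is not necessarily the exact metric used in TRAMS.

# Illustrative training-free memory selection before the attention computation.
import torch

def select_memory(memory_keys, memory_values, k):
    # memory_keys / memory_values: (mem_len, d_head)
    scores = memory_keys.norm(dim=-1)                  # cheap proxy for likely attention relevance
    top = scores.topk(min(k, memory_keys.size(0))).indices
    top = top.sort().values                            # preserve the original token order
    return memory_keys[top], memory_values[top]

The selected keys and values would then be concatenated with the current segment's keys and values before the usual attention computation, adding no parameters and requiring no training.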

pdf bib
Counting the Bugs in ChatGPT’s Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Leonie Weissweiler | Valentin Hofmann | Anjali Kantharuban | Anna Cai | Ritam Dutt | Amey Hengle | Anubha Kabra | Atharva Kulkarni | Abhishek Vijayakumar | Haofei Yu | Hinrich Schuetze | Kemal Oflazer | David Mortensen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.
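
For intuition only, a sketch of a wug-test style probe: the model is asked to inflect a nonce word, so memorized vocabulary cannot help and any correct answer reflects morphological generalization. The nonce words and prompt wording here are made up for this sketch and are not the paper's datasets or prompts.

# Illustrative wug-test style prompt for a nonce noun's plural form.
def wug_prompt(nonce_singular="wug"):
    return (
        f"This is a {nonce_singular}. Now there are two of them. "
        f"There are two ___. Fill in the blank with one word."
    )

print(wug_prompt())         # a morphologically competent model should answer "wugs"
print(wug_prompt("blick"))  # expected generalization: "blicks"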