Yekun Chai - ACL Anthology

Yekun Chai

2025

Understanding Subword Compositionality of Large Language Models
Qiwei Peng | Yekun Chai | Anders Søgaard
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) great performance when probing layer by layer their sensitivity to semantic decompositionality; and (iii) three distinct patterns when probing sensitivity to formal features, e.g., character sequence length. These findings provide valuable insights into the compositional dynamics of LLMs and highlight different compositional patterns in how LLMs encode and integrate subword information.

Curiosity-Driven Reinforcement Learning from Human Feedback
Haoran Sun | Yekun Chai | Shuohuan Wang | Yu Sun | Hua Wu | Haifeng Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We will make our code publicly available.

CodeMixBench: Evaluating Code-Mixing Capabilities of LLMs Across 18 Languages
Yilun Yang | Yekun Chai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks like LinCE and GLUECoS are limited by their narrow language pairs and tasks, failing to adequately assess large language models’ (LLMs) code-mixing abilities. Despite the recognized importance of code-mixing for multilingual users, research on LLMs in this context remains sparse. Additionally, current techniques for synthesizing code-mixed data remain underdeveloped. In response, we introduce CodeMixBench, a comprehensive benchmark covering eight tasks, including three specific to LLMs and five traditional NLP tasks, and 18 languages from seven language families. We also propose a new method for generating large-scale synthetic code-mixed texts by combining word substitution with GPT-4 prompting. Our evaluation reveals consistent underperformance of LLMs on code-mixed datasets involving different language families. Enhancements in training data size, model scale, and few-shot learning could improve their performance. The code and dataset are available at https://github.com/Jeromeyluck/CodeMixBench.

Graph-Augmented Open-Domain Multi-Document Summarization
Xiaoping Shen | Yekun Chai
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

In the open-domain multi-document summarization (ODMDS) task, retrieving relevant documents from large repositories and generating coherent summaries are crucial. However, existing methods often treat retrieval and summarization as separate tasks, neglecting the relationships among documents. To address these limitations, we propose an integrated retrieval-summarization framework that captures global document relationships through graph-based clustering, guiding the re-ranking of retrieved documents. This cluster-level thematic information is then used to guide large language models (LLMs) in refining the retrieved documents and generating more accurate, coherent summaries. Experimental results on the ODSUM benchmark demonstrate that our method significantly improves retrieval accuracy and produces summaries that surpass those derived from the oracle documents. These findings highlight the potential of our framework to improve both retrieval and summarization tasks in ODMDS.

EvolKV: Evolutionary KV Cache Compression for LLM Inference
Bohan Yu | Yekun Chai
Findings of the Association for Computational Linguistics: EMNLP 2025

Existing key-value (KV) cache compression methods typically rely on heuristics, such as uniform cache allocation across layers or static eviction policies; however, they ignore the critical interplay among layer-specific feature patterns and task performance, which can lead to degraded generalization. In this paper, we propose EvolKV, an adaptive framework for layer-wise, task-driven KV cache compression that jointly optimizes memory efficiency and task performance. By reformulating cache allocation as a multi-objective optimization problem, EvolKV leverages evolutionary search to dynamically configure layer budgets while directly maximizing downstream performance. Extensive experiments on 11 tasks demonstrate that our approach outperforms all baseline methods across a wide range of KV cache budgets on long-context tasks and surpasses heuristic baselines by up to 7 percentage points on GSM8K. Notably, EvolKV achieves superior performance over the full KV cache setting on code completion while utilizing only 1.5% of the original budget, suggesting the untapped potential in learned compression strategies for KV cache budget allocation.

Debiasing Multilingual LLMs in Cross-lingual Latent Space
Qiwei Peng | Guimin Hu | Yekun Chai | Anders Søgaard
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Debiasing techniques such as SentDebias aim to reduce bias in large language models (LLMs). Previous studies have evaluated their cross-lingual transferability by directly applying these methods to LLM representations, revealing their limited effectiveness across languages. In this work, we therefore propose to perform debiasing in a joint latent space rather than directly on LLM representations. We construct a well-aligned cross-lingual latent space using an autoencoder trained on parallel TED talk scripts. Our experiments with Aya-expanse and two debiasing techniques across four languages (English, French, German, Dutch) demonstrate that a) autoencoders effectively construct a well-aligned cross-lingual latent space, and b) applying debiasing techniques in the learned cross-lingual latent space significantly improves both the overall debiasing performance and cross-lingual transferability.

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
Taishi Nakamura | Mayank Mishra | Simone Tedeschi | Yekun Chai | Jason T. Stillerman | Felix Friedrich | Prateek Yadav | Tanmay Laud | Vu Minh Chien | Terry Yue Zhuo | Diganta Misra | Ben Bogin | Xuan-Son Vu | Marzena Karpinska | Arnav Varma Dantuluri | Wojciech Kusa | Tommaso Furlanello | Rio Yokota | Niklas Muennighoff | Suhas Pai | Tosin Adewumi | Veronika Laippala | Xiaozhe Yao | Adalberto Barbosa Junior | Aleksandr Drozd | Jordan Clive | Kshitij Gupta | Liangyu Chen | Qi Sun | Ken Tsui | Nour Moustafa-Fahmy | Nicolo Monti | Tai Dang | Ziyang Luo | Tien-Tung Bui | Roberto Navigli | Virendra Mehta | Matthew Blumberg | Victor May | Hiep Nguyen | Sampo Pyysalo
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track

Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

2024

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
Qiwei Peng | Yekun Chai | Xuhong Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.

On Training Data Influence of GPT Models
Yekun Chai | Qingyi Liu | Shuohuan Wang | Yu Sun | Qiwei Peng | Hua Wu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available at https://github.com/ernie-research/gptfluence.

Autoregressive Pre-Training on Pixels and Texts
Yekun Chai | Qingyi Liu | Jingwu Xiao | Shuohuan Wang | Yu Sun | Hua Wu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language—both visual and textual—within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.

Tokenization Falling Short: On Subword Robustness in Large Language Models
Yekun Chai | Yewei Fang | Qiwei Peng | Xuhong Li
Findings of the Association for Computational Linguistics: EMNLP 2024

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens; we term these issues *the curse of tokenization*. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We release our evaluation code and data at https://github.com/FloatAI/TKEval.

2023

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
Yekun Chai | Shuohuan Wang | Chao Pang | Yu Sun | Hao Tian | Hua Wu
Findings of the Association for Computational Linguistics: ACL 2023

Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.

ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
Pengfei Zhu | Chao Pang | Yekun Chai | Lei Li | Shuohuan Wang | Yu Sun | Hao Tian | Hua Wu
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations

2022

Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards
Yekun Chai | Shuohuan Wang | Yu Sun | Hao Tian | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2022

Derivative-free prompt learning has emerged as a lightweight alternative to prompt tuning, which only requires model inference to optimize the prompts. However, existing work did not take full advantage of the over-parameterized characteristics of large pre-trained language models (PLMs). In this paper, we propose Clip-Tuning, a simple yet effective method that adopts diverse frozen “thinned” networks of PLMs to obtain *a mixture of rewards* and thus advance the derivative-free prompt learning. The thinned networks consist of all the hidden units that survive a stationary dropout strategy, whose inference predictions reflect an ensemble of partial views over prompted training samples. Our method outperforms previous gradient-free prompt learning methods and achieves parity with gradient-based counterparts on seven language understanding benchmarks under few-shot settings.

X-PuDu at SemEval-2022 Task 6: Multilingual Learning for English and Arabic Sarcasm Detection
Yaqian Han | Yekun Chai | Shuohuan Wang | Yu Sun | Hongyi Huang | Guanghao Chen | Yitong Xu | Yang Yang
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Detecting sarcasm and verbal irony from people’s subjective statements is crucial to understanding their intended meanings and real sentiments and positions in social scenarios. This paper describes the X-PuDu system that participated in SemEval-2022 Task 6, iSarcasmEval - Intended Sarcasm Detection in English and Arabic, which aims at detecting intended sarcasm in various settings of natural language understanding. Our solution finetunes pre-trained language models, such as ERNIE-M and DeBERTa, under the multilingual settings to recognize the irony from Arabic and English texts. Our system ranked second out of 43, and ninth out of 32 in Task A: one-sentence detection in English and Arabic; fifth out of 22 in Task B: binary multi-label classification in English; first out of 16, and fifth out of 13 in Task C: sentence-pair detection in English and Arabic.

Predicate-Argument Based Bi-Encoder for Paraphrase Identification
Qiwei Peng | David Weir | Julie Weeds | Yekun Chai
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Paraphrase identification involves identifying whether a pair of sentences express the same or similar meanings. While cross-encoders have achieved high performance across several benchmarks, bi-encoders such as SBERT have been widely applied to sentence pair tasks. They exhibit substantially lower computational complexity and are better suited to symmetric tasks. In this work, we adopt a bi-encoder approach to the paraphrase identification task, and investigate the impact of explicitly incorporating predicate-argument information into SBERT through weighted aggregation. Experiments on six paraphrase identification datasets demonstrate that, with a minimal increase in parameters, the proposed model is able to outperform SBERT/SRoBERTa significantly. Further, ablation studies reveal that the predicate-argument based component plays a significant role in the performance gain.

2021

COIN: Conversational Interactive Networks for Emotion Recognition in Conversation
Haidong Zhang | Yekun Chai
Proceedings of the Third Workshop on Multimodal Artificial Intelligence

Emotion recognition in conversation has received considerable attention recently because of its practical industrial applications. Existing methods tend to overlook the immediate mutual interaction between different speakers at the speaker-utterance level, or apply a single speaker-agnostic RNN to utterances from different speakers. We propose COIN, a conversational interactive model to mitigate this problem by applying state mutual interaction within history contexts. In addition, we introduce a stacked global interaction module to capture the contextual and inter-dependency representation in a hierarchical manner. To improve the robustness and generalization during training, we generate adversarial examples by applying minor perturbations to multimodal feature inputs, unveiling the benefits of adversarial examples for emotion detection. The proposed model empirically achieves the current state-of-the-art results on the IEMOCAP benchmark dataset.

Counter-Contrastive Learning for Language GANs
Yekun Chai | Haidong Zhang | Qiyue Yin | Junge Zhang
Findings of the Association for Computational Linguistics: EMNLP 2021

Generative Adversarial Networks (GANs) have achieved great success in image synthesis, but have proven difficult to apply to natural language generation. Challenges arise from the uninformative learning signals passed from the discriminator. In other words, the poor learning signals limit the learning capacity for generating languages with rich structures and semantics. In this paper, we propose to adopt the counter-contrastive learning (CCL) method to support the generator’s training in language GANs. In contrast to standard GANs that adopt a simple binary classifier to discriminate whether a sample is real or fake, we employ a counter-contrastive learning signal that advances the training of language synthesizers by (1) pulling the language representations of generated and real samples together and (2) pushing apart representations of real samples to compete with the discriminator and thus prevent the discriminator from being overtrained. We evaluate our method on both synthetic and real benchmarks and yield competitive performance compared to previous GANs for adversarial sequence generation.

2020

Highway Transformer: Self-Gating Enhanced Self-Attentive Networks
Yekun Chai | Shuo Jin | Xinwen Hou
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Self-attention mechanisms have made striking state-of-the-art (SOTA) progress in various sequence learning tasks, building on multi-headed dot product attention that attends to all the global contexts at different locations. Through a pseudo information highway, we introduce a gated component, self-dependency units (SDU), that incorporates LSTM-styled gating units to replenish internal semantic importance within the multi-dimensional latent space of individual representations. The subsidiary content-based SDU gates allow for the information flow of modulated latent embeddings through skipped connections, leading to a clear margin of convergence speed with gradient descent algorithms. We may unveil the role of the gating mechanism in aiding the context-based Transformer modules, hypothesizing that SDU gates, especially on shallow layers, could push the model to step faster towards suboptimal points during the optimization process.

Co-authors

Anders Søgaard 2

Haidong Zhang 2

Tosin Adewumi 1

Matthew Blumberg 1

Tien-Tung Bui 1

Guanghao Chen 1

Liang-Yu Chen 1

Vu Minh Chien 1

Arnav Varma Dantuluri 1

Aleksandr Drozd 1

Felix Friedrich 1

Tommaso Furlanello 1

Kshitij Gupta 1

Adalberto Barbosa Junior 1

Marzena Karpinska 1

Wojciech Kusa 1

Veronika Laippala 1

Virendra Mehta 1

Mayank Mishra 1

Diganta Misra 1

Nour Moustafa-Fahmy 1

Niklas Muennighoff 1

Taishi Nakamura 1

Roberto Navigli 1

Sampo Pyysalo 1

Xiaoping Shen 1

Jason T. Stillerman 1

Simone Tedeschi 1

Prateek Yadav 1

Terry Yue Zhuo 1

Venues