Huayang Li - ACL Anthology

2025

It’s Not Bragging If You Can Back It Up: Can LLMs Understand Braggings?
Jingjie Zeng | Huayang Li | Liang Yang | Yuanyuan Sun | Hongfei Lin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Bragging, as a pervasive social-linguistic phenomenon, reflects complex human interaction patterns. However, the understanding and generation of appropriate bragging behavior in large language models (LLMs) remains underexplored. In this paper, we propose a comprehensive study that combines analytical and controllable approaches to examine bragging in LLMs. We design three tasks (bragging recognition, bragging explanation, and bragging generation) along with novel evaluation metrics to assess the models’ ability to identify bragging intent, judge social appropriateness, and account for context sensitivity. Our analysis reveals the challenges of bragging in the social context, such as recognizing bragging and responding appropriately with bragging in conversation. This work provides new insights into how LLMs process bragging and highlights the need for more research on generating contextually appropriate behavior in LLMs.

DUT_IR at SemEval-2025 Task 11: Enhancing Multi-Label Emotion Classification with an Ensemble of Pre-trained Language Models and Large Language Models
Chao Liu | Junliang Liu | Tengxiao Lv | Huayang Li | Tao Zeng | Ling Luo | Yuanyuan Sun | Hongfei Lin
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

In this work, we tackle the challenge of multi-label emotion classification, where a sentence can simultaneously express multiple emotions. This task is particularly difficult due to the overlapping nature of emotions and the limited context available in short texts. To address these challenges, we propose an ensemble approach that integrates Pre-trained Language Models (BERT-based models) and Large Language Models, each capturing distinct emotional cues within the text. The predictions from these models are aggregated through a voting mechanism, enhancing classification accuracy. Additionally, we incorporate threshold optimization and class weighting techniques to mitigate class imbalance. Our method demonstrates substantial improvements over baseline models. Our approach ranked 4th out of 90 on the English leaderboard of SemEval-2025 Task 11 Track A.
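The voting and thresholding steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the label names, probabilities, thresholds, and the `min_votes` rule are all hypothetical values chosen for illustration.

```python
from typing import Dict, List


def binarize(probs: Dict[str, float], thresholds: Dict[str, float]) -> Dict[str, int]:
    """Apply a separate decision threshold to each emotion label."""
    return {label: int(p >= thresholds[label]) for label, p in probs.items()}


def vote(predictions: List[Dict[str, int]], min_votes: int) -> Dict[str, int]:
    """Keep a label when at least `min_votes` models predicted it."""
    labels = predictions[0].keys()
    return {
        label: int(sum(p[label] for p in predictions) >= min_votes)
        for label in labels
    }


# Hypothetical per-label probabilities from three models for one sentence.
model_probs = [
    {"joy": 0.81, "fear": 0.10, "surprise": 0.42},
    {"joy": 0.66, "fear": 0.35, "surprise": 0.55},
    {"joy": 0.30, "fear": 0.05, "surprise": 0.47},
]
# Hypothetical per-label thresholds (in practice these would be tuned on held-out data).
thresholds = {"joy": 0.5, "fear": 0.3, "surprise": 0.4}

binary = [binarize(p, thresholds) for p in model_probs]
ensemble = vote(binary, min_votes=2)
print(ensemble)
```

Here a label survives only when two of the three models agree, so a single model's low-threshold prediction of "fear" is outvoted.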

DUTtask10 at SemEval-2025 Task 10: ThoughtFlow: Hierarchical Narrative Classification via Stepwise Prompting
Du Py | Huayang Li | Liang Yang | Zhang Shaowu
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper describes our system for SemEval-2025 Task 10: Hierarchical Narrative Classification. We propose a two-step hierarchical approach that combines generative reasoning and fine-tuning for sub-narrative classification. The main techniques of our system are: 1) leveraging a large pre-trained model to generate a reasoning process for better context understanding, 2) fine-tuning the model for precise sub-narrative categorization, 3) using a multi-label classification strategy for more accurate sub-narrative identification, and 4) incorporating data augmentation to increase the diversity and robustness of the training data. Our system ranked 1st in Subtask 2 for Hindi, achieving an F1 macro coarse score of 0.56900 and an F1 samples score of 0.53500. The results demonstrate the effectiveness of our approach in classifying narratives and sub-narratives in a multilingual setting, with the additional benefit of enhanced model performance through data augmentation.

2024

M3T: A New Benchmark Dataset for Multi-Modal Document-Level Machine Translation
Benjamin Hsu | Xiaoyu Liu | Huayang Li | Yoshinari Fujinuma | Maria Nadejde | Xing Niu | Ron Litman | Yair Kittenplon | Raghavendra Pappagari
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Document translation poses a challenge for Neural Machine Translation (NMT) systems. Most document-level NMT systems rely on meticulously curated sentence-level parallel data, assuming flawless extraction of text from documents along with their precise reading order. These systems also tend to disregard additional visual cues such as the document layout, deeming it irrelevant. However, real-world documents often possess intricate text layouts that defy these assumptions. Extracting information from Optical Character Recognition (OCR) or heuristic rules can result in errors, and the layout (e.g., paragraphs, headers) may convey relationships between distant sections of text. This complexity is particularly evident in widely used PDF documents, which represent information visually. This paper addresses this gap by introducing M3T, a novel benchmark dataset tailored to evaluate NMT systems on the comprehensive task of translating semi-structured documents. This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications.

Cross-lingual Contextualized Phrase Retrieval
Huayang Li | Deng Cai | Zhi Qu | Qu Cui | Hidetaka Kamigaito | Lemao Liu | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2024

Phrase-level dense retrieval has shown many appealing characteristics in downstream NLP tasks by leveraging the fine-grained information that phrases offer. In our work, we propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval, which aims to augment cross-lingual applications by addressing polysemy using context information. However, the lack of specific training data and models is the primary challenge to achieving our goal. To address this, we extract pairs of cross-lingual phrases using word alignment information automatically induced from parallel sentences. Subsequently, we train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning, which encourages the hidden representations of phrases with similar contexts and semantics to align closely. Comprehensive experiments on both the cross-lingual phrase retrieval task and a downstream task, i.e., machine translation, demonstrate the effectiveness of CCPR. On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher. When utilizing CCPR to augment the large-language-model-based translator, it achieves average gains of 0.7 and 1.5 in BERTScore for translations from X=>En and vice versa, respectively, on the WMT16 dataset. We will release our code and data.

TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild
Huayang Li | Siheng Li | Deng Cai | Longyue Wang | Lemao Liu | Taro Watanabe | Yujiu Yang | Shuming Shi
Findings of the Association for Computational Linguistics: ACL 2024

Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering LLMs with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. Extensive quantitative and qualitative experiments demonstrate that MIM trained on TextBind achieves remarkable generation capability in multimodal conversations compared to recent baselines.

A Frustratingly Simple Decoding Method for Neural Text Generation
Haoran Yang | Deng Cai | Huayang Li | Wei Bi | Wai Lam | Shuming Shi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We introduce a frustratingly simple, highly efficient, and surprisingly effective decoding method, termed Frustratingly Simple Decoding (FSD), for neural text generation. The idea behind FSD is straightforward: We construct an anti-language model (anti-LM) based on previously generated text, which is employed to penalize the future generation of repetitive content. The anti-LM can be implemented as simply as an n-gram language model or a vectorized variant. In this way, FSD incurs no additional model parameters and negligible computational overhead (FSD can be as fast as greedy search). Despite its simplicity, FSD is surprisingly effective and generalizes across different datasets, models, and languages. Extensive experiments show that FSD outperforms established strong baselines in terms of generation quality, decoding speed, and universality.
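The anti-LM idea in this abstract can be sketched in a few lines. The abstract does not give the scoring rule, so the linear penalty below (LM probability minus a weighted anti-LM probability) is an assumption, as are the function names, the `alpha` weight, and the toy probabilities.

```python
from collections import Counter
from typing import Dict, List


def anti_lm_probs(prefix: List[str], n: int = 2) -> Dict[str, float]:
    """Next-token distribution of an n-gram model estimated only from `prefix`."""
    context = tuple(prefix[-(n - 1):]) if n > 1 else ()
    counts: Counter = Counter()
    for i in range(len(prefix) - n + 1):
        gram = tuple(prefix[i:i + n])
        if gram[:-1] == context:
            counts[gram[-1]] += 1
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}


def penalized_step(lm_probs: Dict[str, float], prefix: List[str],
                   alpha: float = 0.5, n: int = 2) -> str:
    """Pick the token maximizing (1 - alpha) * p_lm - alpha * p_anti (assumed rule)."""
    anti = anti_lm_probs(prefix, n)
    scores = {
        tok: (1 - alpha) * p - alpha * anti.get(tok, 0.0)
        for tok, p in lm_probs.items()
    }
    return max(scores, key=scores.get)


# Hypothetical generated prefix and hypothetical LM probabilities for the next token.
prefix = ["the", "cat", "sat", "on", "the", "cat"]
lm_probs = {"sat": 0.6, "ran": 0.3, "slept": 0.1}

print(penalized_step(lm_probs, prefix, alpha=0.0))  # plain greedy choice
print(penalized_step(lm_probs, prefix, alpha=0.5))  # choice with the repetition penalty
```

Because "cat sat" already occurs in the prefix, the bigram anti-LM assigns "sat" a high probability, and the penalized score steers the choice away from repeating it. The anti-LM is built from counts alone, which is consistent with the abstract's claim of no additional model parameters.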

2023

PandaGPT: One Model To Instruction-Follow Them All
Yixuan Su | Tian Lan | Huayang Li | Jialu Xu | Yan Wang | Deng Cai
Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants!

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation, writing stories inspired by videos, and answering questions about audios. More interestingly, PandaGPT can take multimodal inputs simultaneously and compose their semantics naturally. For example, PandaGPT can connect how objects look in an image/video and how they sound in an audio. To do so, PandaGPT combines the multimodal encoders from ImageBind and the large language models from Vicuna. Notably, only aligned image-text pairs are required for the training of PandaGPT. Thanks to the strong capability of ImageBind in embedding data from different modalities into the same space, PandaGPT displays emergent, i.e. zero-shot, cross-modal behaviors for data other than image and text (e.g., video, audio, depth, thermal, and IMU). We hope that PandaGPT serves as an initial step toward building AGI that can perceive and understand inputs in different modalities holistically, as we humans do.

2022

Visualizing the Relationship Between Encoded Linguistic Information and Task Performance
Jiannan Xiang | Huayang Li | Defu Lian | Guoping Huang | Taro Watanabe | Lemao Liu
Findings of the Association for Computational Linguistics: ACL 2022

Probing is a popular way to analyze whether linguistic information can be captured by a well-trained deep neural model, but it is hard to answer how changes in the encoded linguistic information affect task performance. To this end, we study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality. Its key idea is to obtain a set of models which are Pareto-optimal in terms of both objectives. From this viewpoint, we propose a method to optimize the Pareto-optimal models by formalizing it as a multi-objective optimization problem. We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performance. Experimental results demonstrate that the proposed method is better than a baseline method. Our empirical findings suggest that some syntactic information is helpful for NLP tasks whereas encoding more syntactic information does not necessarily lead to better performance, because the model architecture is also an important factor.
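To make the notion of a Pareto-optimal set of models concrete, the sketch below filters a list of candidates down to those not dominated on two objectives. It illustrates only the definition, not the optimization method proposed in the paper; the score pairs are hypothetical.

```python
from typing import List, Tuple

# (linguistic-information score, task-performance score); higher is better for both.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True when `a` is at least as good as `b` on both objectives and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical scores for five candidate models.
models = [(0.2, 0.9), (0.5, 0.8), (0.4, 0.7), (0.9, 0.3), (0.6, 0.3)]
front = pareto_front(models)
print(front)
```

Each model left on the front represents a different trade-off: improving one objective from that point requires giving up some of the other.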

Residual Learning of Neural Text Generation with n-gram Language Model
Huayang Li | Deng Cai | Jin Xu | Taro Watanabe
Findings of the Association for Computational Linguistics: EMNLP 2022

N-gram language models (LMs) have been largely superseded by neural LMs, as the latter exhibit better performance. However, we find that n-gram models can achieve satisfactory performance on a large proportion of testing cases, indicating they have already captured abundant knowledge of the language with relatively low computational cost. With this observation, we propose to learn a neural LM that fits the residual between an n-gram LM and the real-data distribution. The combination of n-gram LMs and neural LMs not only allows the neural part to focus on deeper understanding of the language, but also provides a flexible way to customize an LM by switching the underlying n-gram model without changing the neural model. Experimental results on three typical language tasks (i.e., language modeling, machine translation, and summarization) demonstrate that our approach attains additional performance gains over popular standalone neural models consistently. We also show that our approach allows for effective domain adaptation by simply switching to a domain-specific n-gram model, without any extra training.
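One way to picture the combination described here is to add a residual score to the n-gram log-probabilities and renormalize. This is a sketch under that assumption; the abstract does not state the exact formulation, and the vocabulary, probabilities, and residual scores below are hypothetical.

```python
import math
from typing import Dict


def combine(ngram_probs: Dict[str, float], residual: Dict[str, float]) -> Dict[str, float]:
    """Renormalize exp(log p_ngram + residual) over the vocabulary (assumed rule)."""
    scores = {tok: math.log(ngram_probs[tok]) + residual[tok] for tok in ngram_probs}
    z = sum(math.exp(s) for s in scores.values())
    return {tok: math.exp(s) / z for tok, s in scores.items()}


# Hypothetical next-token distributions from two domain-specific n-gram models.
news_ngram = {"stock": 0.6, "goal": 0.3, "cell": 0.1}
sports_ngram = {"stock": 0.1, "goal": 0.8, "cell": 0.1}

# Hypothetical residual scores standing in for the output of a fixed neural model.
residual = {"stock": 0.2, "goal": 0.0, "cell": -0.5}

news = combine(news_ngram, residual)
sports = combine(sports_ngram, residual)
print(max(news, key=news.get), max(sports, key=sports.get))
```

The residual scores stay fixed while the n-gram table is swapped, which mirrors the abstract's point that the n-gram model can be switched for domain adaptation without retraining the neural part.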

Investigating Data Variance in Evaluations of Automatic Machine Translation Metrics
Jiannan Xiang | Huayang Li | Yahui Liu | Lemao Liu | Guoping Huang | Defu Lian | Shuming Shi
Findings of the Association for Computational Linguistics: ACL 2022

Current practices in metric evaluation focus on one single dataset, e.g., the Newstest dataset in each year’s WMT Metrics Shared Task. However, in this paper, we qualitatively and quantitatively show that the performances of metrics are sensitive to data. The ranking of metrics varies when the evaluation is conducted on different datasets. This paper then further investigates two potential hypotheses, i.e., insignificant data points and the deviation from the i.i.d. assumption, which may be responsible for the issue of data variance. In conclusion, our findings suggest that when evaluating automatic translation metrics, researchers should take data variance into account and be cautious about reporting results on unreliable datasets, because doing so may lead to results that are inconsistent with those on most other datasets.

2021

Assessing Dialogue Systems with Distribution Distances
Jiannan Xiang | Yahui Liu | Deng Cai | Huayang Li | Defu Lian | Lemao Liu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Data Augmentation for Text Generation Without Any Augmented Data
Wei Bi | Huayang Li | Jiacheng Huang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Data augmentation is an effective way to improve the performance of many neural text generation models. However, current data augmentation methods need to define or choose proper data mapping functions that map the original samples into the augmented samples. In this work, we derive an objective to formulate the problem of data augmentation on text generation tasks without any use of augmented data constructed by specific mapping functions. Our proposed objective can be efficiently optimized and applied to popular loss functions on text generation tasks with a convergence rate guarantee. Experiments on five datasets of two text generation tasks show that our approach can approximate or even surpass popular data augmentation methods.

GWLAN: General Word-Level AutocompletioN for Computer-Aided Translation
Huayang Li | Lemao Liu | Guoping Huang | Shuming Shi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Computer-aided translation (CAT), the use of software to assist a human translator in the translation process, has been proven to be useful in enhancing the productivity of human translators. Autocompletion, which suggests translation results according to the text pieces provided by human translators, is a core function of CAT. There are two limitations in previous research in this line. First, most research works on this topic focus on sentence-level autocompletion (i.e., generating the whole translation as a sentence based on human input), but word-level autocompletion is under-explored so far. Second, almost no public benchmarks are available for the autocompletion task of CAT. This might be among the reasons why research progress in CAT is much slower than in automatic MT. In this paper, we propose the task of general word-level autocompletion (GWLAN) from a real-world CAT scenario, and construct the first public benchmark to facilitate research on this topic. In addition, we propose an effective method for GWLAN and compare it with several strong baselines. Experiments demonstrate that our proposed method can give significantly more accurate predictions than the baseline methods on our benchmark datasets.

Neural Machine Translation with Monolingual Translation Memory
Deng Cai | Yan Wang | Huayang Li | Wai Lam | Lemao Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Prior work has proved that Translation Memory (TM) can boost the performance of Neural Machine Translation (NMT). In contrast to existing work that uses a bilingual corpus as TM and employs source-side similarity search for memory retrieval, we propose a new framework that uses monolingual memory and performs learnable memory retrieval in a cross-lingual manner. Our framework has unique advantages. First, the cross-lingual memory retriever allows abundant monolingual data to serve as TM. Second, the memory retriever and NMT model can be jointly optimized for the ultimate translation goal. Experiments show that the proposed method obtains substantial improvements. Remarkably, it even outperforms strong TM-augmented NMT baselines using bilingual TM. Owing to the ability to leverage monolingual data, our model also demonstrates effectiveness in low-resource and domain adaptation scenarios.

2020

Evaluating Explanation Methods for Neural Machine Translation
Jierui Li | Lemao Liu | Huayang Li | Guanlin Li | Guoping Huang | Shuming Shi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recently, many efforts have been devoted to interpreting the black-box NMT models, but little progress has been made on metrics to evaluate explanation methods. Word Alignment Error Rate can be used as such a metric that matches human understanding; however, it cannot measure explanation methods on those target words that are not aligned to any source word. This paper thereby makes an initial attempt to evaluate explanation methods from an alternative viewpoint. To this end, it proposes a principled metric based on fidelity in regard to the predictive behavior of the NMT model. As the exact computation for this metric is intractable, we employ an efficient approach as its approximation. On six standard translation tasks, we quantitatively evaluate several explanation methods in terms of the proposed metric and we reveal some valuable findings for these explanation methods in our experiments.

On the Branching Bias of Syntax Extracted from Pre-trained Language Models
Huayang Li | Lemao Liu | Guoping Huang | Shuming Shi
Findings of the Association for Computational Linguistics: EMNLP 2020

Many efforts have been devoted to extracting constituency trees from pre-trained language models, often proceeding in two stages: feature definition and parsing. However, methods of this kind may suffer from the branching bias issue, which inflates performance on languages whose branching direction matches the bias. In this work, we propose quantitatively measuring the branching bias by comparing the performance gap on a language and its reversed language, which is agnostic to both language models and extracting methods. Furthermore, we analyze the impacts of three factors on the branching bias, namely feature definitions, parsing algorithms, and language models. Experiments show that several existing works exhibit branching biases, and some implementations of these three factors can introduce the branching bias.

Co-authors

Jiannan Xiang 3

Hongfei Lin (林鸿飞) 2

Yuanyuan Sun (孙媛媛, 孙嫒媛) 2

Liang Yang (杨亮) 2

Yoshinari Fujinuma 1

Jiacheng Huang 1

Hidetaka Kamigaito 1

Yair Kittenplon 1

Ling Luo (罗凌) 1

Maria Nadejde 1

Raghavendra Pappagari 1
