Dawei Zhu - ACL Anthology

Dawei Zhu

2025

Hierarchical Memory Organization for Wikipedia Generation
Eugene J. Yu | Dawei Zhu | Yifan Song | Xiangyu Wong | Jiebin Zhang | Wenxuan Shi | Xiaoguang Li | Qun Liu | Sujian Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generating Wikipedia articles autonomously is a challenging task requiring the integration of accurate, comprehensive, and well-structured information from diverse sources. This paper introduces the Memory Organization-based Generation (MOG) framework, a novel approach to address these challenges by leveraging a hierarchical memory architecture. MOG extracts fine-grained memory units from web documents, recursively organizes them into a Wikipedia-style hierarchical structure, and uses this structure to guide the generation process. This ensures alignment between memory and the article outline, improving both informativeness and verifiability while minimizing hallucinations. Additionally, a citation module is implemented to enhance traceability by linking every generated sentence to specific memory units. Evaluations on our newly created WikiStart dataset demonstrate that MOG outperforms baseline methods in producing informative and reliable articles, making it particularly robust in real-world scenarios.

WIKIGENBENCH:Exploring Full-length Wikipedia Generation under Real-World Scenario
Jiebin Zhang | Eugene J. Yu | Qinyu Chen | Chenhao Xiong | Dawei Zhu | Han Qian | Mingbo Song | Weimin Xiong | Xiaoguang Li | Qun Liu | Sujian Li
Proceedings of the 31st International Conference on Computational Linguistics

It presents significant challenges to generate comprehensive and accurate Wikipedia articles for newly emerging events under real-world scenario. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario where structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess the verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with fine-tuned generation model. Experimental results show that hierarchical-based methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.

EERPD: Leveraging Emotion and Emotion Regulation for Improving Personality Detection
Zheng Li | Sujian Li | Dawei Zhu | Qilong Ma | Weimin Xiong
Proceedings of the 31st International Conference on Computational Linguistics

Personality is a fundamental construct in psychology, reflecting an individual’s behavior, thinking, and emotional patterns. While previous researches have made progress in personality detection, their designed methods generally overlook the important connection between psychological knowledge “emotion regulation” and personality traits. Based on this, we propose a new personality detection method called EERPD. This method introduces the use of emotion regulation, a psychological concept highly correlated with personality, for personality prediction. By combining this concept with emotion features, EERPD retrieves few-shot examples and provides process CoTs for inferring labels from text. This approach enhances the understanding of LLM for personality implicit within text and improves the performance in personality detection. Experimental results demonstrate that EERPD significantly enhances the accuracy and robustness of personality detection, outperforming previous SOTA by 15.05/4.29 in average F1 on the two benchmark datasets.

InternLM-Law: An Open-Sourced Chinese Legal Large Language Model
Zhiwei Fei | Songyang Zhang | Xiaoyu Shen | Dawei Zhu | Xiao Wang | Jidong Ge | Vincent Ng
Proceedings of the 31st International Conference on Computational Linguistics

We introduce InternLM-Law, a large language model (LLM) tailored for addressing diverse legal tasks related to Chinese laws. These tasks range from responding to standard legal questions (e.g., legal exercises in textbooks) to analyzing complex real-world legal situations. Our work contributes to Chinese Legal NLP research by (1) conducting one of the most extensive evaluations of state-of-the-art general-purpose and legal-specific LLMs to date that involves an automatic evaluation on the 20 legal NLP tasks in LawBench, a human evaluation on a challenging version of the Legal Consultation task, and an automatic evaluation of a model’s ability to handle very long legal texts; (2) presenting a methodology for training a Chinese legal LLM that offers superior performance to all of its counterparts in our extensive evaluation; and (3) facilitating future research in this area by making all of our code and model publicly available at https://github.com/InternLM/InternLM-Law.

PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
Yunuo Liu | Dawei Zhu | Zena Al-Khalili | Dai Cheng | Yanjun Chen | Dietrich Klakow | Wei Zhang | Xiaoyu Shen
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We present PricingLogic, the first benchmarkthat probes whether Large Language Mod-els (LLMs) can reliably automate tourism-booking prices when multiple, overlapping farerules apply. Travel agencies are eager to of-fload this error-prone task to AI systems; how-ever, deploying LLMs without verified reliabil-ity could result in significant financial lossesand erode customer trust. PricingLogic com-prises 300 natural-language questions based onbooking requests derived from 42 real-worldpricing policies, spanning two levels of diffi-culty: (i) basic customer-type pricing and (ii)bundled-tour calculations involving interactingdiscounts. Evaluations of a line of LLMs re-veal a steep performance drop on the harder tier,exposing systematic failures in rule interpreta-tion and arithmetic reasoning. These resultshighlight that, despite their general capabilities,today’s LLMs remain unreliable for revenue-critical applications without further safeguardsor domain adaptation. Our code and dataset areavaliable in https://github.com/EIT-NLP/PricingLogic.

Same evaluation, more tokens: On the effect of input length for machine translation evaluation using Large Language Models
Tobias Domhan | Dawei Zhu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Accurately evaluating machine-translated text remains a long-standing challenge, particularly for long documents. Recent work has shown that large language models (LLMs) can serve as reliable and interpretable sentence-level translation evaluators via MQM error span annotations. With modern LLMs supporting larger context windows, a natural question arises: can we feed entire document translations into an LLM for quality assessment? Ideally, evaluation should be invariant to text length, producing consistent error spans regardless of input granularity. However, our analysis shows that text length significantly impacts evaluation: longer texts lead to fewer error spans and reduced system ranking accuracy. To address this limitation, we evaluate several strategies, including granularity-aligned prompting, Focus Sentence Prompting (FSP), and a fine-tuning approach to better align LLMs with the evaluation task. The latter two methods largely mitigate this length bias, making LLMs more reliable for long-form translation evaluation.

Language models can learn implicit multi-hop reasoning, but only if they have lots of training data
Yuekun Yao | Yupei Du | Dawei Zhu | Michael Hahn | Alexander Koller
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Implicit reasoning is the ability of a language model to solve multi-hop reasoning tasks in a single forward pass, without chain of thought.We investigate this capability using GPT2-style language models trained from scratch on controlled k-hop reasoning datasets (k = 2, 3, 4). We show that while such models can indeed learn implicit k-hop reasoning,the required training data grows exponentially in k, and the requirednumber of transformer layers grows linearly in k.We offer a theoretical explanation for why this depth growth is necessary.We further find that the data requirement can be mitigated, but not eliminated,through curriculum learning.

This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating the ability of neural machine translation (NMT) models and large language models (LLMs) to translate between English and these languages, at both the sentence and pseudo-document levels, the outputs being realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieves the best average performance among the standard NMT models, while GPT-4o outperforms general-purpose LLMs. Fine-tuning selected models leads to substantial performance gains, but models trained on sentences struggle to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, over-generation, repetition of words and phrases, and off-target translations, specifically for translation into African languages.

LongAttn: Selecting Long-context Training Data via Token-level Attention
Longyun Wu | Dawei Zhu | Guangxiang Zhao | Zhuocheng Yu | Junfeng Ran | Xiangyu Wong | Lin Sun | Sujian Li
Findings of the Association for Computational Linguistics: ACL 2025

With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with **long-range dependencies** is crucial. Existing methods to select long-context data often rely on sentence-level analysis,which can be greatly optimized in both performance and efficiency. In this paper, we propose a novel token-level framework, **LongAttn**, which leverages the self-attention mechanism of LLMs to measure the **long-range dependencies** for the data. By calculating token-level dependency strength and distribution uniformity of token scores, LongAttn effectively quantifies **long-range dependencies**, enabling more accurate and efficient data selection. We filter **LongABC-32K** from open-source long-context datasets (ArXiv, Book, and Code). Through our comprehensive experiments, LongAttn has demonstrated its excellent **effectiveness**, **scalability**, and **efficiency**. We will release our code and the high-quality long-context dataset **LongABC-32K** in the future.

Chain-of-Thought Matters: Improving Long-Context Language Models with Reasoning Path Supervision
Dawei Zhu | Xiyu Wei | Guangxiang Zhao | Wenhao Wu | Haosheng Zou | Junfeng Ran | XWang | Lin Sun | Xiangzheng Zhang | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models need to reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness for long-context scenarios remains underexplored. Through systematic investigation across diverse tasks, we demonstrate that CoT’s benefits generalize across most long-context scenarios and amplify with increasing context length. Motivated by this, we propose a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol specifically designed for long-context scenarios. This protocol evaluates both answer correctness and process reliability, with the latter decomposed into source faithfulness and intrinsic consistency components for efficient and accurate assessment. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, achieving significant improvements over outcome supervision baselines on both in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data and trained models will be released upon acceptance.

More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Jiebin Zhang | Dawei Zhu | Yifan Song | Wenhao Wu | Chuqiao Kuang | Xiaoguang Li | Lifeng Shang | Qun Liu | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2025

As large language models (LLMs) process increasing context windows, the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimensions separately. However, these works have left the trade-off between these two orthogonal dimensions largely unexplored. In this paper, we leverage the Information Bottleneck principle to formulate KV cache compression within a unified theoretical framework. We demonstrate that a carefully managed token-precision trade-off can achieve an optimal point within the Information Bottleneck compared to standalone KV pruning or KV quantization. Experiments reveal that storing more tokens in the KV cache at lower precision—a strategy we term quantized pruning—can significantly enhance the long-context performance of LLMs. An in-depth analysis of this token-precision trade-off across key aspects shows that quantized pruning achieves substantial improvements in retrieval-related tasks and consistently performs well across varying input lengths. Furthermore, quantized pruning exhibits notable stability and effectiveness across different KV pruning methods, quantization strategies, and model scales. These findings offer valuable insights into optimizing KV cache compression through balanced token-precision trade-off strategies. Our code isavailable at https://github.com/zhzihao/QPruningKV.

From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks
Andreas Stephan | Dawei Zhu | Matthias Aßenmacher | Xiaoyu Shen | Benjamin Roth
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags and find that we can correctly predict 70%-75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on-average better model but largely fail if we use them to improve task performance.

Fine-Grained and Multi-Dimensional Metrics for Document-Level Machine Translation
Yirong Sun | Dawei Zhu | Yanjun Chen | Erjia Xiao | Xinghao Chen | Xiaoyu Shen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT). Unlike prior approaches that require specialized techniques, we evaluate LLMs by directly prompting them to translate entire documents in a single pass. Our results show that this method improves translation quality compared to translating sentences separately, even without document-level fine-tuning. However, this advantage is not reflected in BLEU scores, which often favor sentence-based translations. We propose using the LLM-as-a-judge paradigm for evaluation, where GPT-4 is used to assess document coherence, accuracy, and fluency in a more nuanced way than n-gram-based metrics. Overall, our work demonstrates that instruction-tuned LLMs can effectively leverage document context for translation. However, we caution against using BLEU scores for evaluating docMT, as they often provide misleading outcomes, failing to capture the quality of document-level translation.

Findings of the WMT25 Terminology Translation Task: Terminology is Useful Especially for Good MTs
Kirill Semenov | Xu Huang | Vilém Zouhar | Nathaniel Berger | Dawei Zhu | Arturo Oncevay | Pinzhen Chen
Proceedings of the Tenth Conference on Machine Translation

The WMT25 Terminology Translation Task releases new resources in high-stakes domains and investigates the capabilities of translation systems to accurately and consistently translate specialized terms. This year, we feature new domain and language coverage over previous editions, introducing two distinct tracks: (1) sentence-level translation in the information technology domain for English→German, English→Russian, and English→Spanish, and (2) document-level translation in the finance domain for English↔Traditional Chinese with a document-level one-to-many dictionary. Participants are challenged to translate texts under three modes: no terminology, proper terminology, and random terminology, allowing for a causal analysis of terminology utility. Evaluation combines overall quality, terminology accuracy, and terminology consistency. This shared task attracted broad participation, with 13 teams submitting 20 systems in Track 1 and 4 teams participating in Track 2. The results show that providing proper terminology consistently boosts both overall translation quality and term accuracy, whereas reliance on random terminology yields smaller gains. Despite the near-saturation of sentence-level benchmarks, document-level finance translation still fallsshort, indicating an urgent need for long-form evaluation and more robust metrics tailored to professional domains.

2024

Large Language Models are not Fair Evaluators
Peiyi Wang | Lei Li | Liang Chen | Zefan Cai | Dawei Zhu | Binghuai Lin | Yunbo Cao | Lingpeng Kong | Qi Liu | Tianyu Liu | Zhifang Sui
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we uncover a positional bias in the evaluation paradigm of adopting large language models (LLMs), e.g., GPT-4, as a referee to score and compare the quality of responses generated by candidate models. We find that the quality ranking of candidate responses can be easily hacked by simply altering their order of appearance in the context. This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other, e.g., Vicuna-13B could beat ChatGPT on 66 over 80 tested queries with ChatGPT as an evaluator. We propose a simple yet effective calibration framework to address our discovered positional bias.To evaluate the effectiveness of our framework, we manually annotate the “win/tie/lose” outcomes of responses from ChatGPT and Vicuna-13B in the Vicuna Benchmark’s question prompt. Extensive experiments demonstrate that our approach successfully alleviates evaluation bias, resulting in closer alignment with human judgments.

Fine-Tuning Large Language Models to Translate: Will a Touch of Noisy Data in Misaligned Languages Suffice?
Dawei Zhu | Pinzhen Chen | Miaoran Zhang | Barry Haddow | Xiaoyu Shen | Dietrich Klakow
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Traditionally, success in multilingual machine translation can be attributed to three key factors in training data: large volume, diverse translation directions, and high quality. In the current practice of fine-tuning large language models (LLMs) for translation, we revisit the importance of these factors. We find that LLMs display strong translation capability after being fine-tuned on as few as 32 parallel sentences and that fine-tuning on a single translation direction enables translation in multiple directions. However, the choice of direction is critical: fine-tuning LLMs with only English on the target side can lead to task misinterpretation, which hinders translation into non-English languages. Problems also arise when noisy synthetic data is placed on the target side, especially when the target language is well-represented in LLM pre-training. Yet interestingly, synthesized data in an under-represented language has a less pronounced effect. Our findings suggest that when adapting LLMs to translation, the requirement on data quantity can be eased but careful considerations are still crucial to prevent an LLM from exploiting unintended data biases.

LongEmbed: Extending Embedding Models for Long Context Retrieval
Dawei Zhu | Liang Wang | Nan Yang | Yifan Song | Wenhao Wu | Furu Wei | Sujian Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Embedding models play a pivotal role in modern NLP applications such as document retrieval. However, existing embedding models are limited to encoding short documents of typically 512 tokens, restrained from application scenarios requiring long inputs. This paper explores context window extension of existing embedding models, pushing their input length to a maximum of 32,768. We begin by evaluating the performance of existing embedding models using our newly constructed LongEmbed benchmark, which includes two synthetic and four real-world tasks, featuring documents of varying lengths and dispersed target information. The benchmarking results highlight huge opportunities for enhancement in current models. Via comprehensive experiments, we demonstrate that training-free context window extension strategies can effectively increase the input length of these models by several folds. Moreover, comparison of models using Absolute Position Encoding (APE) and Rotary Position Encoding (RoPE) reveals the superiority of RoPE-based embedding models in context window extension, offering empirical guidance for future models. Our benchmark, code and trained models will be released to advance the research in long context embedding models.

The Accuracy Paradox in RLHF: When Better Reward Models Don’t Yield Better Language Models
Yanjun Chen | Dawei Zhu | Yirong Sun | Xinghao Chen | Wei Zhang | Xiaoyu Shen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Reinforcement Learning from Human Feedback significantly enhances Natural Language Processing by aligning language models with human expectations. A critical factor in this alignment is the strength of reward models used during training. This study explores whether stronger reward models invariably lead to better language models. In this paper, through experiments on relevance, factuality, and completeness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving model performance and how to choose the most suitable reward models.

To Preserve or To Compress: An In-Depth Study of Connector Selection in Multimodal Large Language Models
Junyan Lin | Haoran Chen | Dawei Zhu | Xiaoyu Shen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

In recent years, multimodal large language models (MLLMs) have attracted widespread attention from both industry and academia. Based on the integration position, MLLMs can be categorized into external and internal fusion architectures, with the former being more predominant. However, there remains considerable debate on how to construct the optimal external fusion MLLM architecture, especially regarding the performance of different connectors on tasks with varying granularities. This paper systematically investigates the impact of connectors on MLLM performance. Specifically, we classify connectors into feature-preserving and feature-compressing types. Utilizing a unified classification standard, we categorize sub-tasks from three comprehensive benchmarks, MMBench, MME, and SEED-Bench, into three task types: coarse-grained perception, fine-grained perception, and reasoning, and evaluate the performance from this perspective. Our findings reveal significant performance differences between different types of connectors across various tasks, offering essential guidance for MLLM architecture design and advancing the understanding of MLLM architecture optimization.

LawBench: Benchmarking Legal Knowledge of Large Language Models
Zhiwei Fei | Xiaoyu Shen | Dawei Zhu | Fengzhe Zhou | Zhuo Han | Alan Huang | Songyang Zhang | Kai Chen | Zhixin Yin | Zongwen Shen | Jidong Ge | Vincent Ng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

We present LawBench, the first evaluation benchmark composed of 20 tasks aimed to assess the ability of Large Language Models (LLMs) to perform Chinese legal-related tasks. LawBench is meticulously crafted to enable precise assessment of LLMs’ legal capabilities from three cognitive levels that correspond to the widely accepted Bloom’s cognitive taxonomy. Using LawBench, we present a comprehensive evaluation of 21 popular LLMs and the first comparative analysis of the empirical results in order to reveal their relative strengths and weaknesses. All data, model predictions and evaluation code are accessible from https://github.com/open-compass/LawBench.

Assessing “Implicit” Retrieval Robustness of Large Language Models
Xiaoyu Shen | Rexhina Blloshmi | Dawei Zhu | Jiahuan Pei | Wei Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Retrieval-augmented generation has gained popularity as a framework to enhance large language models with external knowledge. However, its effectiveness hinges on the retrieval robustness of the model. If the model lacks retrieval robustness, its performance is constrained by the accuracy of the retriever, resulting in significant compromises when the retrieved context is irrelevant. In this paper, we evaluate the “implicit” retrieval robustness of various large language models, instructing them to directly output the final answer without explicitly judging the relevance of the retrieved context. Our findings reveal that fine-tuning on a mix of gold and distracting context significantly enhances the model’s robustness to retrieval inaccuracies, while still maintaining its ability to extract correct answers when retrieval is accurate. This suggests that large language models can implicitly handle relevant or irrelevant retrieved context by learning solely from the supervision of the final answer in an end-to-end manner. Introducing an additional process for explicit relevance judgment can be unnecessary and disrupts the end-to-end approach.

AgentBank: Towards Generalized LLM Agents via Fine-Tuning on 50000+ Interaction Trajectories
Yifan Song | Weimin Xiong | Xiutian Zhao | Dawei Zhu | Wenhao Wu | Ke Wang | Cheng Li | Wei Peng | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2024

Fine-tuning on agent-environment interaction trajectory data holds significant promise for surfacing generalized agent capabilities in open-source large language models (LLMs). In this work, we introduce AgentBank, by far the largest trajectory tuning data collection featuring more than 50k diverse high-quality interaction trajectories which comprises 16 tasks covering five distinct agent skill dimensions. Leveraging a novel annotation pipeline, we are able to scale the annotated trajectories and generate a trajectory dataset with minimized difficulty bias. Furthermore, we fine-tune LLMs on AgentBank to get a series of agent models, Samoyed. Our comparative experiments demonstrate the effectiveness of scaling the interaction trajectory data to acquire generalized agent capabilities. Additional studies also reveal some key observations regarding trajectory tuning and agent skill generalization.

CoUDA: Coherence Evaluation via Unified Data Augmentation
Dawei Zhu | Wenhao Wu | Yifan Song | Fangwei Zhu | Ziqiang Cao | Sujian Li
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Coherence evaluation aims to assess the organization and structure of a discourse, which remains challenging even in the era of large language models. Due to the scarcity of annotated data, data augmentation is commonly used for training coherence evaluation models. However, previous augmentations for this task primarily rely on heuristic rules, lacking designing criteria as guidance.In this paper, we take inspiration from linguistic theory of discourse structure, and propose a data augmentation framework named CoUDA. CoUDA breaks down discourse coherence into global and local aspects, and designs augmentation strategies for both aspects, respectively.Especially for local coherence, we propose a novel generative strategy for constructing augmentation samples, which involves post-pretraining a generative model and applying two controlling mechanisms to control the difficulty of generated samples. During inference, CoUDA also jointly evaluates both global and local aspects to comprehensively assess the overall coherence of a discourse.Extensive experiments in coherence evaluation show that, with only 233M parameters, CoUDA achieves state-of-the-art performance in both pointwise scoring and pairwise ranking tasks, even surpassing recent GPT-3.5 and GPT-4 based metrics.

A Preference-driven Paradigm for Enhanced Translation with Large Language Models
Dawei Zhu | Sony Trenous | Xiaoyu Shen | Dietrich Klakow | Bill Byrne | Eva Hasler
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Recent research has shown that large language models (LLMs) can achieve remarkable translation performance through supervised fine-tuning (SFT) using only a small amount of parallel data. However, SFT simply instructs the model to imitate the reference translations at the token level, making it vulnerable to the noise present in the references. Hence, the assistance from SFT often reaches a plateau once the LLMs have achieved a certain level of translation capability, and further increasing the size of parallel data does not provide additional benefits. To overcome this plateau associated with imitation-based SFT, we propose a preference-based approach built upon the Plackett-Luce model. The objective is to steer LLMs towards a more nuanced understanding of translation preferences from a holistic view, while also being more resilient in the absence of gold translations. We further build a dataset named MAPLE to verify the effectiveness of our approach, which includes multiple translations of varying quality for each source sentence. Extensive experiments demonstrate the superiority of our approach in “breaking the plateau” across diverse LLMs and test settings. Our in-depth analysis underscores the pivotal role of diverse translations and accurate preference scores in the success of our approach.

Robust Pronoun Fidelity with English LLMs: Are they Reasoning, Repeating, or Just Biased?
Vagrant Gautam | Eileen Bingert | Dawei Zhu | Anne Lauscher | Dietrich Klakow
Transactions of the Association for Computational Linguistics, Volume 12

Robust, faithful, and harm-free pronoun use for individuals is an important goal for language model development as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: Given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully designed dataset of over 5 million instances to measure robust pronoun fidelity in English, and we evaluate 37 model variants from nine popular families, across architectures (encoder-only, decoder-only, and encoder-decoder) and scales (11M-70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she/her/her, singular they, and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one sentence with a distractor pronoun causes accuracy to drop on average by 34 percentage points. Our results show that pronoun fidelity is not robust, in a simple, naturalistic setting where humans achieve nearly 100% accuracy. We encourage researchers to bridge the gaps we find and to carefully evaluate reasoning in settings where superficial repetition might inflate perceptions of model performance.

2023

Weaker Than You Think: A Critical Look at Weakly Supervised Learning
Dawei Zhu | Xiaoyu Shen | Marius Mosbach | Andreas Stephan | Dietrich Klakow
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Weakly supervised learning is a popular approach for training machine learning models in low-resource settings. Instead of requesting high-quality yet costly human annotations, it allows training models with noisy annotations obtained from various weak sources. Recently, many sophisticated approaches have been proposed for robust training under label noise, reporting impressive results. In this paper, we revisit the setup of these approaches and find that the benefits brought by these approaches are significantly overestimated. Specifically, we find that the success of existing weakly supervised learning approaches heavily relies on the availability of clean validation samples which, as we show, can be leveraged much more efficiently by simply training on them. After using these clean labels in training, the advantages of using these sophisticated approaches are mostly wiped out. This remains true even when reducing the size of the available clean data to just five samples per class, making these approaches impractical. To understand the true value of weakly supervised learning, we thoroughly analyze diverse NLP datasets and tasks to ascertain when and why weakly supervised approaches work. Based on our findings, we provide recommendations for future research.

Meta Self-Refinement for Robust Learning with Weak Supervision
Dawei Zhu | Xiaoyu Shen | Michael Hedderich | Dietrich Klakow
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Training deep neural networks (DNNs) under weak supervision has attracted increasing research attention as it can significantly reduce the annotation cost. However, labels from weak supervision can be noisy, and the high capacity of DNNs enables them to easily overfit the label noise, resulting in poor generalization. Recent methods leverage self-training to build noise-resistant models, in which a teacher trained under weak supervision is used to provide highly confident labels for teaching the students. Nevertheless, the teacher derived from such frameworks may have fitted a substantial amount of noise and therefore produce incorrect pseudo-labels with high confidence, leading to severe error propagation. In this work, we propose Meta Self-Refinement (MSR), a noise-resistant learning framework, to effectively combat label noise from weak supervision. Instead of relying on a fixed teacher trained with noisy labels, we encourage the teacher to refine its pseudo-labels. At each training step, MSR performs a meta gradient descent on the current mini-batch to maximize the student performance on a clean validation set. Extensive experimentation on eight NLP benchmarks demonstrates that MSR is robust against label noise in all settings and outperforms state-of-the-art methods by up to 11.4% in accuracy and 9.26% in F1 score.

InfoCL: Alleviating Catastrophic Forgetting in Continual Text Classification from An Information Theoretic Perspective
Yifan Song | Peiyi Wang | Weimin Xiong | Dawei Zhu | Tianyu Liu | Zhifang Sui | Sujian Li
Findings of the Association for Computational Linguistics: EMNLP 2023

Continual learning (CL) aims to constantly learn new knowledge over time while avoiding catastrophic forgetting on old tasks. We focus on continual text classification under the class-incremental setting. Recent CL studies have identified the severe performance decrease on analogous classes as a key factor for catastrophic forgetting. In this paper, through an in-depth exploration of the representation learning process in CL, we discover that the compression effect of the information bottleneck leads to confusion on analogous classes. To enable the model learn more sufficient representations, we propose a novel replay-based continual text classification method, InfoCL. Our approach utilizes fast-slow and current-past contrastive learning to perform mutual information maximization and better recover the previously learned representations. In addition, InfoCL incorporates an adversarial memory augmentation strategy to alleviate the overfitting problem of replay. Experimental results demonstrate that InfoCL effectively mitigates forgetting and achieves state-of-the-art performance on three text classification tasks.

2022

ConFiguRe: Exploring Discourse-level Chinese Figures of Speech
Dawei Zhu | Qiusi Zhan | Zhejian Zhou | Yifan Song | Jiebin Zhang | Sujian Li
Proceedings of the 29th International Conference on Computational Linguistics

Figures of speech, such as metaphor and irony, are ubiquitous in literature works and colloquial conversations. This poses great challenge for natural language understanding since figures of speech usually deviate from their ostensible meanings to express deeper semantic implications. Previous research lays emphasis on the literary aspect of figures and seldom provide a comprehensive exploration from a view of computational linguistics. In this paper, we first propose the concept of figurative unit, which is the carrier of a figure. Then we select 12 types of figures commonly used in Chinese, and build a Chinese corpus for Contextualized Figure Recognition (ConFiguRe). Different from previous token-level or sentence-level counterparts, ConFiguRe aims at extracting a figurative unit from discourse-level context, and classifying the figurative unit into the right figure type. On ConFiguRe, three tasks, i.e., figure extraction, figure type classification and figure recognition, are designed and the state-of-the-art techniques are utilized to implement the benchmarks. We conduct thorough experiments and show that all three tasks are challenging for existing models, thus requiring further research. Our dataset and code are publicly available at https://github.com/pku-tangent/ConFiguRe.

Is BERT Robust to Label Noise? A Study on Learning with Noisy Labels in Text Classification
Dawei Zhu | Michael A. Hedderich | Fangzhou Zhai | David Ifeoluwa Adelani | Dietrich Klakow
Proceedings of the Third Workshop on Insights from Negative Results in NLP

Incorrect labels in training data occur when human annotators make mistakes or when the data is generated via weak or distant supervision. It has been shown that complex noise-handling techniques - by modeling, cleaning or filtering the noisy instances - are required to prevent models from fitting this label noise. However, we show in this work that, for text classification tasks with modern NLP models like BERT, over a variety of noise types, existing noise-handling methods do not always improve its performance, and may even deteriorate it, suggesting the need for further investigation. We also back our observations with a comprehensive analysis.

2021

Neural Data-to-Text Generation with LM-based Text Augmentation
Ernie Chang | Xiaoyu Shen | Dawei Zhu | Vera Demberg | Hui Su
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

For many new application domains for data-to-text generation, the main obstacle in training neural models consists of a lack of training data. While usually large numbers of instances are available on the data side, often only very few text samples are available. To address this problem, we here propose a novel few-shot approach for this setting. Our approach automatically augments the data available for training by (i) generating new text samples based on replacing specific values by alternative ones from the same category, (ii) generating new text samples based on GPT-2, and (iii) proposing an automatic method for pairing the new text samples with data samples. As the text augmentation can introduce noise to the training data, we use cycle consistency as an objective, in order to make sure that a given data sample can be correctly reconstructed after having been formulated as text (and that text samples can be reconstructed from data). On both the E2E and WebNLG benchmarks, we show that this weakly supervised training paradigm is able to outperform fully supervised sequence-to-sequence models with less than 10% of the training set. By utilizing all annotated data, our model can boost the performance of a standard sequence-to-sequence model by over 5 BLEU points, establishing a new state-of-the-art on both datasets.

2020

Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages
Michael A. Hedderich | David I. Adelani | Dawei Zhu | Jesujoba Alabi | Udia Markus | Dietrich Klakow
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Multilingual transformer models like mBERT and XLM-RoBERTa have obtained great improvements for many NLP tasks on a variety of languages. However, recent works also showed that results from high-resource languages could not be easily transferred to realistic, low-resource scenarios. In this work, we study trends in performance for different amounts of available resources for the three African languages Hausa, isiXhosa and on both NER and topic classification. We show that in combination with transfer learning or distant supervision, these models can achieve with as little as 10 or 100 labeled sentences the same performance as baselines with much more supervised training data. However, we also find settings where this does not hold. Our discussions and additional experiments on assumptions such as time and hardware restrictions highlight challenges and opportunities in low-resource learning.

Co-authors

David Ifeoluwa Adelani 3

Jesujoba Alabi 2

Michael A. Hedderich 2

Andreas Stephan 2

Peiyi Wang (王培懿) 2

Miaoran Zhang 2

Songyang Zhang 2

Guangxiang Zhao 2

David O. Ademuyiwa 1

Idris Akinade 1

Zena Al-Khalili 1

Israel Abebe Azime 1

Matthias Aßenmacher 1

Rachel Bawden 1

Nathaniel Berger 1

Eileen Bingert 1

Rexhina Blloshmi 1

Andrew Caines 1

Tobias Domhan 1

Cristina España-Bonet 1

Vagrant Gautam 1

Michael Hedderich 1

Alexander Koller 1

Lingpeng Kong 1

Chuqiao Kuang 1

Anne Lauscher 1

Marius Mosbach 1

Shamsuddeen Hassan Muhammad 1

Clement Oyeleke Odoje 1

Arturo Oncevay 1

Benjamin Roth 1

Kirill Semenov 1

Chenhao Xiong 1

Fangzhou Zhai 1

Xiangzheng Zhang 1

Vilém Zouhar 1

Venues