Yiduo Guo

2025

English as Defense Proxy: Mitigating Multilingual Jailbreak via Eliciting English Safety Knowledge
Zekai Zhang | Yiduo Guo | Jiuheng Lin | Shanghaoran Quan | Huishuai Zhang | Dongyan Zhao
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) excel in many tasks, but their safety guarantees vary by language, e.g., responses in English tend to be safer than those in low-resource languages. This inconsistency creates a vulnerability, since an attacker can circumvent safety measures by using a less-supported language as an intermediary, even without fluency in that language. Traditional solutions rely on multilingual safety alignment, which demands vast, per-language datasets and introduces significant trade-offs between usefulness and safety (the so-called “alignment tax”). To overcome these limitations, we introduce English as Defense Proxy (E-Proxy), a unified approach that leverages English, usually the advantage language of LLMs, as a universal safety anchor. During multilingual training, E-Proxy uses English jailbreak prompts to extract the model’s existing safety knowledge, then applies simple language-mapping prompts (e.g., “Please answer in target language”) to transfer that knowledge across languages. Our analysis shows that formulating prompts in a high-resource language preserves the model’s utility, while enforcing responses in the target language significantly enhances safety. We evaluate E-Proxy on extensive benchmarks of both attack resistance and task performance. On the MultiJail benchmark, E-Proxy blocks over 99 % of jailbreak attempts while retaining 95 % of average task performance, all with a simply constructed multilingual alignment data.

pdf bib abs

Efficient Domain Continual pretraining by Mitigating the Stability Gap
Yiduo Guo | Jie Fu | Huishuai Zhang | Dongyan Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Continual pretraining enables Large Language Models (LLMs) to adapt to specialized domains like medicine and law. However, we observe a consistent phenomenon across different model sizes and domains: a temporary performance drop at the start of the continual pretraining process, followed by a performance recovery phase. To gain a deeper understanding of this issue, we use the stability gap— a concept adapted from the visual domain—which explains this initial drop arises from instability in the model’s general abilities. We validate this hypothesis through a series of experiments. To address this initial instability and enhance LLM performance within a fixed compute budget, we propose a training strategy that mitigates instability by increasing the number of epochs, alongside two data sampling strategies targeting data domain relevance and corpus distribution. We conduct experiments on Llama-family models to validate the effectiveness of our strategies for continual pretraining and instruction tuning in medical and legal domains. Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% using only 40% of the original training budget, while also enhancing general task performance without causing forgetting. Furthermore, we aPPLy our strategies to continually pre-train and instruction-tune the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among open-source models on several benchmarks and rivals GPT-4 on specific tasks. We release our models at https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct.

2024

pdf bib abs

Large Language Models Can Learn Representation in Natural Language
Yiduo Guo | Yaobo Liang | Dongyan Zhao | Nan Duan
Findings of the Association for Computational Linguistics: ACL 2024

One major challenge for Large Language Models (LLMs) is completing complex tasks involving multiple entities, such as tool APIs. To tackle this, one approach is to retrieve relevant entities to enhance LLMs in task completion. A crucial issue here is obtaining accurate natural language representations for each entity to aid in retriever precision. In this paper, we propose the Natural Language Representation Optimization Problem, which aims to refine entity descriptions for improved retrieval and LLM utilization. We introduce the Learning to Represent with Natural Language method, which utilizes LLMs to optimize entity representations consisting of text patterns based on environmental feedback. We iteratively prompt LLMs to enhance or adjust patterns based on entity samples and evaluate their effectiveness through environmental feedback. Our method successfully learns human-understandable representations for classification tasks (e.g., instructions and documents) and API call tasks (e.g., APIbench and Virtual Home), significantly improving GPT-4’s task performance.

pdf bib abs

Assessing foundation models’ abilities for human-level tasks is crucial for Artificial General Intelligence (AGI) development.Traditional benchmarks, which rely on artificial datasets, may not accurately represent these capabilities. In this paper, we introduce AGIEval, a novel bilingual benchmark designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models on our benchmark. Impressively, we show that GPT-4 exceeds the average human performance in SAT, LSAT, and math contests, with 95% accuracy on SAT Math and 92.5% on the Chinese college entrance English exam. This demonstrates the exceptional performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks requiring complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal their strengths and limitations, providing valuable insights into future directions for enhancing general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a meaningful and robust evaluation of foundation models’ performance in real-world scenarios.

pdf bib abs

PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion
Yiduo Guo | Zekai Zhang | Yaobo Liang | Dongyan Zhao | Nan Duan
Findings of the Association for Computational Linguistics: ACL 2024

Recent evaluations of Large Language Models (LLMs) have centered around testing their zero-shot/few-shot capabilities for basic natural language tasks and their ability to translate instructions into tool APIs. However, the evaluation of LLMs utilizing complex tools to finish multi-turn, multi-modal instructions in a complex multi-modal environment has not been investigated. To address this gap, we introduce the PowerPoint Task Completion (PPTC) benchmark to assess LLMs’ ability to create and edit PPT files based on user instructions. It contains 279 multi-turn sessions covering diverse topics and hundreds of instructions involving multi-modal operations. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence, thus it supports various LLM-generated API sequences. We measure 3 closed LLMs and 6 open-source LLMs. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy. We find three main error causes in our benchmark: error accumulation in the multi-turn session, long PPT template processing, and multi-modality perception. These pose great challenges for future LLM and agent systems .

pdf bib abs

Large Language Models (LLMs) have shown remarkable performance in various basic natural language tasks. For completing the complex task, we still need a plan for the task to guide LLMs to generate the specific solutions step by step. LLMs can directly generate task plans, but these plans may still contain factual errors or are incomplete. A high-quality task plan contains correct step-by-step solutions for solving all situations and behavioral instructions for avoiding mistakes. To obtain it, we propose the Learning to Plan method, which involves two phases: (1) In the first learning task plan phase, it iteratively updates the task plan with new step-by-step solutions and behavioral instructions, which are obtained by prompting LLMs to derive from training error feedback. (2) In the subsequent test phase, the LLM uses the learned task plan to guide the inference of LLM on the test set. We demonstrate the effectiveness of our method on the five different reasoning type tasks (8 datasets). Further, our analysis experiment shows that the task plan learned by one LLM can directly guide another LLM to improve its performance, which reveals a new transfer learning paradigm.

pdf bib abs

PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion
Zekai Zhang | Yiduo Guo | Yaobo Liang | Dongyan Zhao | Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2024

The growing dependence on Large Language Models (LLMs) for finishing user instructions necessitates a comprehensive understanding of their robustness to complex task completion in real-world situations. To address this critical need, we propose the PowerPoint Task Completion-Robustness (PPTC-R) benchmark to measure LLMs’ robustness to the user PPT task instruction and software version (Powerpoint). Specifically, we construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. To assess the robustness of Language Models to software versions, we vary the number of provided APIs to simulate both the newest version and earlier version settings. Subsequently, we test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates these robustness settings, aiming to evaluate how deviations impact LLMs’ API calls for task completion. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark, particularly in the version update and the multilingual settings. However, we find that all LLMs lose their robustness when confronted with multiple challenges (e.g., multi-turn) simultaneously, leading to significant performance drops. We further analyze the robustness behavior and error reasons of LLMs in our benchmark, which provide valuable insights for researchers to understand the LLM’s robustness in task completion and develop more robust LLMs and agents.

2023

pdf bib abs

Class-Incremental Learning based on Label Generation
Yijia Shao | Yiduo Guo | Dongyan Zhao | Bing Liu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Despite the great success of pre-trained language models, it is still a challenge to use these models for continual learning, especially for the class-incremental learning (CIL) setting due to catastrophic forgetting (CF). This paper reports our finding that if we formulate CIL as a continual label generation problem, CF is drastically reduced and the generalizable representations of pre-trained models can be better retained. We thus propose a new CIL method (VAG) that also leverages the sparsity of vocabulary to focus the generation and creates pseudo-replay samples by using label semantics. Experimental results show that VAG outperforms baselines by a large margin.

pdf bib abs

Analyzing and Reducing the Performance Gap in Cross-Lingual Transfer with Fine-tuning Slow and Fast
Yiduo Guo | Yaobo Liang | Dongyan Zhao | Bing Liu | Nan Duan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing research has shown that a multilingual pre-trained language model fine-tuned with one (source) language also performs well on downstream tasks for non-source languages, even though no fine-tuning is done on these languages. However, there is a clear gap between the performance of the source language and that of the non-source languages. This paper analyzes the fine-tuning process, discovers when the performance gap changes and identifies which network weights affect the overall performance most. Additionally, the paper seeks to answer to what extent the gap can be reduced by reducing forgetting. Based on the analysis results, a method named Fine-tuning slow and fast with four training policies is proposed to address these issues. Experimental results show the proposed method outperforms baselines by a clear margin.