Zhicheng Guo

Tsinghua

Other people with similar names: Zhicheng Guo (Xidian)


2024

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
Pinzhen Chen | Simon Yu | Zhicheng Guo | Barry Haddow
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Multilingual large language models are designed, claimed, and expected to cater to speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on translation, which cannot cover language-specific knowledge but can introduce translation defects. It remains unknown whether the nature of the instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues using controlled native or translated data during the instruction tuning and evaluation stages. We show that native or generation benchmarks reveal a notable difference between native and translated instruction data, especially when model performance is high, whereas other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from language-native resources. Finally, we demonstrate that regularization helps bridge this gap on structured but not generative tasks.
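
The comparison between round-trip and single-pass translation lends itself to a short illustration. Below is a minimal sketch, assuming a generic machine translation callable, of how the two data conditions could be constructed; the `translate` interface is an assumption for illustration, not code from the paper.

```python
from typing import Callable, List

# `translate(text, src, tgt)` stands in for any MT backend; it is an assumed
# interface, not one defined in the paper.

def single_pass(english_data: List[str], tgt: str,
                translate: Callable[[str, str, str], str]) -> List[str]:
    # Instructions translated from English: may lack language-specific
    # knowledge and can carry translation defects.
    return [translate(x, "en", tgt) for x in english_data]

def round_trip(native_data: List[str], lang: str,
               translate: Callable[[str, str, str], str],
               pivot: str = "en") -> List[str]:
    # Native instructions translated to a pivot language and back: they keep
    # language-native content while picking up comparable translation defects,
    # which helps separate the two effects.
    return [translate(translate(x, lang, pivot), pivot, lang) for x in native_data]
```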

StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models
Zhicheng Guo | Sijie Cheng | Hao Wang | Shihao Liang | Yujia Qin | Peng Li | Zhiyuan Liu | Maosong Sun | Yang Liu
Findings of the Association for Computational Linguistics: ACL 2024

Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied either on hand-crafted online tools of limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to alleviate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluator system.
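
The caching-plus-simulation idea behind the virtual API server can be pictured with a short, hedged sketch: serve a cached response when one exists, otherwise try the live API, and fall back to an LLM-based simulator when the call fails. The class name and the `real_api`/`simulator` callables below are illustrative assumptions, not StableToolBench's actual interfaces.

```python
import hashlib
import json

class VirtualAPIServer:
    """Sketch of a cache-first tool-call router (illustrative, not the benchmark's code)."""

    def __init__(self, real_api, simulator, cache=None):
        self.real_api = real_api      # callable(api_name, args) -> response from the live API
        self.simulator = simulator    # LLM-backed callable with the same signature
        self.cache = cache if cache is not None else {}

    def _key(self, api_name, args):
        payload = json.dumps({"api": api_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, api_name, args):
        key = self._key(api_name, args)
        if key in self.cache:              # cached responses keep behaviour stable over time
            return self.cache[key]
        try:
            response = self.real_api(api_name, args)
        except Exception:                  # API offline, rate-limited, or changed
            response = self.simulator(api_name, args)
        self.cache[key] = response         # freeze the behaviour for later runs
        return response
```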

Iterative Translation Refinement with Large Language Models
Pinzhen Chen | Zhicheng Guo | Barry Haddow | Kenneth Heafield
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

We propose iteratively prompting a large language model to self-correct a translation, drawing inspiration from LLMs’ strong language capability as well as a human-like translation approach. Interestingly, multi-turn querying reduces the output’s string-based metric scores, but neural metrics suggest comparable or improved quality after two or more iterations. Human evaluations indicate better fluency and naturalness compared to the initial translations and even human references, all while maintaining quality. Ablation studies underscore the importance, for output quality, of anchoring the refinement to the source and starting from a reasonable seed translation. We also discuss the challenges in evaluation and the relation to human performance and translationese.
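
As a rough picture of the refinement loop, the sketch below re-prompts a model with the source sentence and its own previous output for a fixed number of turns. The `llm` callable and the prompt wording are placeholders, not the paper's exact template.

```python
def refine_translation(llm, source, seed_translation, src_lang, tgt_lang, turns=2):
    """Iteratively ask an LLM to self-correct a translation (illustrative sketch)."""
    translation = seed_translation
    for _ in range(turns):
        # Anchor each turn to the source and the current translation.
        prompt = (
            f"Source ({src_lang}): {source}\n"
            f"Current {tgt_lang} translation: {translation}\n"
            "Improve the fluency and naturalness of this translation while staying "
            "faithful to the source. Reply with the translation only."
        )
        translation = llm(prompt).strip()
    return translation
```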

2023

Prompt-Guided Retrieval Augmentation for Non-Knowledge-Intensive Tasks
Zhicheng Guo | Sijie Cheng | Yile Wang | Peng Li | Yang Liu
Findings of the Association for Computational Linguistics: ACL 2023

Retrieval-augmented methods have received increasing attention for supporting downstream tasks by leveraging useful information from external resources. Recent studies mainly focus on exploring retrieval to solve knowledge-intensive (KI) tasks. However, the potential of retrieval for most non-knowledge-intensive (NKI) tasks remains under-explored. There are two main challenges to leveraging retrieval-augmented methods for NKI tasks: 1) the demand for diverse relevance score functions and 2) the dilemma between training cost and task performance. To address these challenges, we propose a two-stage framework for NKI tasks, named PGRA. In the first stage, we adopt a task-agnostic retriever to build a shared static index and select candidate evidence efficiently. In the second stage, we design a prompt-guided reranker to rerank the nearest evidence according to task-specific relevance for the reader. Experimental results show that PGRA outperforms other state-of-the-art retrieval-augmented methods. Our analyses further investigate the factors that influence model performance and demonstrate the generality of PGRA. The code and model will be released for further research.
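
The two-stage design can be summarised in a few lines. In this hedged sketch, a task-agnostic retriever pulls candidates from a shared static index and a prompt-guided scorer reranks them for the reader; both interfaces are assumptions for illustration rather than the released PGRA code.

```python
from typing import Callable, List

def pgra_retrieve(
    query: str,
    dense_retriever: Callable[[str, int], List[str]],  # stage 1: task-agnostic search over a shared static index
    rerank_score: Callable[[str, str, str], float],    # stage 2: prompt-guided, task-specific relevance scorer
    task_prompt: str,                                   # natural-language description of task-specific relevance
    n_candidates: int = 100,
    top_k: int = 5,
) -> List[str]:
    """Two-stage retrieve-then-rerank sketch (illustrative, not the released code)."""
    candidates = dense_retriever(query, n_candidates)
    reranked = sorted(
        candidates,
        key=lambda evidence: rerank_score(task_prompt, query, evidence),
        reverse=True,
    )
    return reranked[:top_k]  # evidence handed to the reader
```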