Yudong Wang
2024
Achilles-Bench: A Challenging Benchmark for Low-Resource Evaluation
Yudong Wang
|
Chang Ma
|
Qingxiu Dong
|
Zhifang Sui
|
Lingpeng Kong
|
Jingjing Xu
Findings of the Association for Computational Linguistics: ACL 2024
With promising yet saturated results in high-resource settings, low-resource datasets have gradually become crucial benchmarks (e.g., BigBench Hard, superGLUE) for evaluating the learning ability of advanced neural networks. In this work, we find that there exists a set of “hard examples” in low-resource settings that challenge neural networks but are not well evaluated, which causes over-estimated performance. We first give a theoretical analysis on which factors bring the difficulty of low-resource learning. It then motivates us to propose a challenging benchmark Achilles-Bench to better evaluate the learning ability, which covers 11 datasets, including 8 natural language process (NLP) datasets and 3 computer vision (CV) datasets. Experiments on a wide range of models show that neural networks, even pre-trained language models, have sharp performance drops on our benchmark, demonstrating the effectiveness of evaluating the weaknesses of neural networks. On NLP tasks, we surprisingly find that despite better results on traditional low-resource benchmarks, pre-trained networks, does not show performance improvements on our benchmarks. there is still a large robustness gap between existing models and human-level performance, highlighting the need for robust low-resource learning models.
Code Needs Comments: Enhancing Code LLMs with Comment Augmentation
Demin Song
|
Honglin Guo
|
Yunhua Zhou
|
Shuhao Xing
|
Yudong Wang
|
Zifan Song
|
Wenwei Zhang
|
Qipeng Guo
|
Hang Yan
|
Xipeng Qiu
|
Dahua Lin
Findings of the Association for Computational Linguistics: ACL 2024
The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on code-focused LLMs’ performance by assessing the comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. We conducted experiments on three code-focused LLMs and observed consistent improvements in performance on two widely-used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation.
Search
Co-authors
- Chang Ma 1
- Qingxiu Dong 1
- Zhifang Sui 1
- Lingpeng Kong 1
- Jingjing Xu 1
- show all...