2024
pdf
bib
abs
DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
Jia Li
|
Ge Li
|
Yunfei Zhao
|
Yongmin Li
|
Huanyu Liu
|
Hao Zhu
|
Lecheng Wang
|
Kaibo Liu
|
Zheng Fang
|
Lanshen Wang
|
Jiazheng Ding
|
Xuanming Zhang
|
Yuqi Zhu
|
Yihong Dong
|
Zhi Jin
|
Binhua Li
|
Fei Huang
|
Yongbin Li
|
Bin Gu
|
Mengfei Yang
Findings of the Association for Computational Linguistics: ACL 2024
How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs.To address the knowledge gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,825 testing samples from 115 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose repository-level code generation and evaluate 8 popular LLMs on DevEval (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs’ coding abilities in real-world code repositories. For example, the highest Pass@1 of gpt-4 only is 53.04% in our experiments. We also analyze LLMs’ failed cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs’ predictions have been released.
pdf
bib
abs
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
Yihong Dong
|
Xue Jiang
|
Huanyu Liu
|
Zhi Jin
|
Bin Gu
|
Mengfei Yang
|
Ge Li
Findings of the Association for Computational Linguistics: ACL 2024
Recent statements about the impressive capabilities of large language models (LLMs) are usually supported by evaluating on open-access benchmarks. Considering the vast size and wide-ranging sources of LLMs’ training data, it could explicitly or implicitly include test data, leading to LLMs being more susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mitigating data contamination for LLMs faces significant challenges. In this paper, we propose CDD, which stands for Contamination Detection via output Distribution for LLMs. CDD necessitates only the sampled texts to detect data contamination, by identifying the peakedness of LLM’s output distribution. To mitigate the impact of data contamination in evaluation, we also present TED: Trustworthy Evaluation via output Distribution, based on the correction of LLM’s output distribution. To facilitate this study, we introduce two benchmarks, i.e., DETCON and COMIEVAL, for data contamination detection and contamination mitigation evaluation tasks. Extensive experimental results show that CDD achieves the average relative improvements of 21.8%-30.2% over other contamination detection approaches in terms of Accuracy, F1 Score, and AUC metrics, and can effectively detect implicit contamination. TED substantially mitigates performance improvements up to 66.9% attributed to data contamination across various contamination setups. In real-world applications, we reveal that ChatGPT exhibits a high potential to suffer from data contamination on HumanEval benchmark.
2023
pdf
bib
abs
Program Translation via Code Distillation
Yufan Huang
|
Mengnan Qi
|
Yongqiang Yao
|
Maoquan Wang
|
Bin Gu
|
Colin Clement
|
Neel Sundaresan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Software version migration and program translation are an important and costly part of the lifecycle of large codebases. Traditional machine translation relies on parallel corpora for supervised translation, which is not feasible for program translation due to a dearth of aligned data. Recent unsupervised neural machine translation techniques have overcome data limitations by included techniques such as back translation and low level compiler intermediate representations (IR). These methods face significant challenges due to the noise in code snippet alignment and the diversity of IRs respectively. In this paper we propose a novel model called Code Distillation (CoDist) whereby we capture the semantic and structural equivalence of code in a language agnostic intermediate representation. Distilled code serves as a translation pivot for any programming language, leading by construction to parallel corpora which scale to all available source code by simply applying the distillation compiler. We demonstrate that our approach achieves state-of-the-art performance on CodeXGLUE and TransCoder GeeksForGeeks translation benchmarks, with an average absolute increase of 12.7% on the TransCoder GeeksforGeeks translation benchmark compare to TransCoder-ST.
pdf
bib
abs
SUT: Active Defects Probing for Transcompiler Models
Mengnan Qi
|
Yufan Huang
|
Maoquan Wang
|
Yongqiang Yao
|
Zihan Liu
|
Bin Gu
|
Colin Clement
|
Neel Sundaresan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Automatic Program translation has enormous application value and hence has been attracting significant interest from AI researchers. However, we observe that current program translation models still make elementary syntax errors, particularly, when the target language does not have syntax elements in the source language. Metrics like BLUE, CodeBLUE and computation accuracy may not expose these issues. In this paper we introduce a new metrics for programming language translation and these metrics address these basic syntax errors. We develop a novel active defects probing suite called Syntactic Unit Tests (SUT) which includes a highly interpretable evaluation harness for accuracy and test scoring. Experiments have shown that even powerful models like ChatGPT still make mistakes on these basic unit tests. Specifically, compared to previous program translation task evaluation dataset, its pass rate on our unit tests has decreased by 26.15%. Further our evaluation harness reveal syntactic element errors in which these models exhibit deficiencies.
2020
pdf
bib
abs
Graph-based Aspect Representation Learning for Entity Resolution
Zhenqi Zhao
|
Yuchen Guo
|
Dingxian Wang
|
Yufan Huang
|
Xiangnan He
|
Bin Gu
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
Entity Resolution (ER) identifies records that refer to the same real-world entity. Deep learning approaches improved the generalization ability of entity matching models, but hardly overcame the impact of noisy or incomplete data sources. In real scenes, an entity usually consists of multiple semantic facets, called aspects. In this paper, we focus on entity augmentation, namely retrieving the values of missing aspects. The relationship between aspects is naturally suitable to be represented by a knowledge graph, where entity augmentation can be modeled as a link prediction problem. Our paper proposes a novel graph-based approach to solve entity augmentation. Specifically, we apply a dedicated random walk algorithm, which uses node types to limit the traversal length, and encodes graph structure into low-dimensional embeddings. Thus, the missing aspects could be retrieved by a link prediction model. Furthermore, the augmented aspects with fixed orders are served as the input of a deep Siamese BiLSTM network for entity matching. We compared our method with state-of-the-art methods through extensive experiments on downstream ER tasks. According to the experiment results, our model outperforms other methods on evaluation metrics (accuracy, precision, recall, and f1-score) to a large extent, which demonstrates the effectiveness of our method.