Wasi U. Ahmad


2024

pdf bib
CoCoMIC: Code Completion by Jointly Modeling In-file and Cross-file Context
Yangruibo Ding | Zijian Wang | Wasi U. Ahmad | Murali Krishna Ramanathan | Ramesh Nallapati | Parminder Bhatia | Dan Roth | Bing Xiang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

While pre-trained language models (LM) for code have achieved great success in code completion, they generate code conditioned only on the contents within the file, i.e., in-file context, but ignore the rich semantics in other files within the same project, i.e., project-level cross-file context, a critical source of information that is especially useful in modern modular software development. Such overlooking constrains code LMs’ capacity in code completion, leading to unexpected behaviors such as generating hallucinated class member functions or function calls with unexpected arguments. In this work, we propose CoCoMIC, a novel framework that jointly learns the in-file and cross-file context on top of code LMs. To empower CoCoMIC, we develop CCFinder, a static-analysis-based tool that locates and retrieves the most relevant project-level cross-file context for code completion. CoCoMIC successfully improves the existing code LM with a 33.94% relative increase in exact match and 28.69% in identifier matching for code completion when the cross-file context is provided. Finally, we perform a series of ablation studies and share valuable insights for future research on integrating cross-file context into code LMs.

pdf bib
On Leveraging Encoder-only Pre-trained Language Models for Effective Keyphrase Generation
Di Wu | Wasi U. Ahmad | Kai-Wei Chang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This study addresses the application of encoder-only Pre-trained Language Models (PLMs) in keyphrase generation (KPG) amidst the broader availability of domain-tailored encoder-only models compared to encoder-decoder models. We investigate three core inquiries: (1) the efficacy of encoder-only PLMs in KPG, (2) optimal architectural decisions for employing encoder-only PLMs in KPG, and (3) a performance comparison between in-domain encoder-only and encoder-decoder PLMs across varied resource settings. Our findings, derived from extensive experimentation in two domains reveal that with encoder-only PLMs, although keyphrase extraction with Conditional Random Fields slightly excels in identifying present keyphrases, the KPG formulation renders a broader spectrum of keyphrase predictions. Additionally, prefix-LM fine-tuning of encoder-only PLMs emerges as a strong and data-efficient strategy for KPG, outperforming general-domain seq2seq PLMs. We also identify a favorable parameter allocation towards model depth rather than width when employing encoder-decoder architectures initialized with encoder-only PLMs. The study sheds light on the potential of utilizing encoder-only PLMs for advancing KPG systems and provides a groundwork for future KPG methods. Our code and pre-trained checkpoints are released at https://github.com/uclanlp/DeepKPG.