Troy Feng
2024
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models
Ansong Ni | Pengcheng Yin | Yilun Zhao | Martin Riddell | Troy Feng | Rui Shen | Stephen Yin | Ye Liu | Semih Yavuz | Caiming Xiong | Shafiq Joty | Yingbo Zhou | Dragomir Radev | Arman Cohan
Transactions of the Association for Computational Linguistics, Volume 12
Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs. Despite promising results, there is a notable lack of a comprehensive evaluation of these models’ language-to-code generation capabilities. Existing studies often focus on specific tasks, model architectures, or learning paradigms, leading to a fragmented understanding of the overall landscape. In this work, we present L2CEval, a systematic evaluation of the language-to-code generation capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing, math reasoning, and Python programming, analyzing the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition, we assess confidence calibration and conduct human evaluations to identify typical failures across different tasks and models. L2CEval offers a comprehensive understanding of the capabilities and limitations of LLMs in language-to-code generation. We release the evaluation framework and all model outputs, hoping to lay the groundwork for future research. All future evaluations (e.g., LLaMA-3, StarCoder2, etc.) will be updated on the project website: https://l2c-eval.github.io/.
2021
SummerTime: Text Summarization Toolkit for Non-experts
Ansong Ni | Zhangir Azerbayev | Mutethia Mutuma | Troy Feng | Yusen Zhang | Tao Yu | Ahmed Hassan Awadallah | Dragomir Radev
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Recent advances in summarization provide models that can generate summaries of higher quality. Such models now exist for a number of summarization tasks, including query-based summarization, dialogue summarization, and multi-document summarization. While such models and tasks are rapidly growing in the research field, it has also become challenging for non-experts to keep track of them. To make summarization methods more accessible to a wider audience, we develop SummerTime by rethinking the summarization task from the perspective of an NLP non-expert. SummerTime is a complete toolkit for text summarization, including various models, datasets, and evaluation metrics, for a full spectrum of summarization-related tasks. SummerTime integrates with libraries designed for NLP researchers and provides users with easy-to-use APIs. With SummerTime, users can locate pipeline solutions, search for the best model with their own data, and visualize the differences, all with a few lines of code. We also provide explanations for models and evaluation metrics to help users understand model behaviors and select models that best suit their needs. Our library, along with a notebook demo, is available at https://github.com/Yale-LILY/SummerTime.
Co-authors
- Arman Cohan 2
- Ansong Ni 2
- Dragomir Radev 2
- Zhangir Azerbayev 1
- Ahmed Hassan 1