Xuanming Zhang


2024

DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection, Reasoning, and Rewriting
Xuanming Zhang | Anthony Diaz | Zixun Chen | Qingyang Wu | Kun Qian | Erik Voss | Zhou Yu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Coherence in writing, an aspect that L2 English learners often struggle with, is crucial in assessing L2 English writing. Existing automated writing evaluation systems primarily use basic surface linguistic features to detect coherence in writing. However, little effort has been made to correct the detected incoherence, which could significantly benefit L2 language learners seeking to improve their writing. To bridge this gap, we introduce DECOR, a novel benchmark that includes expert annotations for detecting incoherence in L2 English writing, identifying the underlying reasons, and rewriting the incoherent sentences. To our knowledge, DECOR is the first coherence assessment dataset specifically designed for improving L2 English writing, featuring pairs of original incoherent sentences alongside their expert-rewritten counterparts. Additionally, we fine-tune models to automatically detect and rewrite incoherence in student essays. We find that incorporating the specific reason for incoherence during fine-tuning consistently improves rewrite quality, yielding rewrites that are favored in both automatic and human evaluations.
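One way to picture "incorporating specific reasons for incoherence during fine-tuning" is as an extra field in the supervised input. The sketch below is hypothetical: the field names, prompt wording, and example sentences are assumptions for illustration, not DECOR's released format.

```python
# A hypothetical sketch of two supervised fine-tuning formats for the rewriting
# task: one conditioned only on the context and the incoherent sentence, and one
# that also includes the annotated reason for the incoherence. Field names,
# prompt wording, and example sentences are illustrative assumptions.

example = {
    "context": "I moved to a new city last year. The winter there was much colder than at home.",
    "incoherent_sentence": "Also, my favorite food is dumplings.",
    "reason": "The sentence introduces an unrelated topic, breaking cohesion with the paragraph.",
    "rewrite": "It took me several months to get used to the colder climate.",
}


def to_sft_pair(ex: dict, with_reason: bool) -> dict:
    """Build an (input, target) pair for supervised fine-tuning."""
    parts = [
        f"Context: {ex['context']}",
        f"Incoherent sentence: {ex['incoherent_sentence']}",
    ]
    if with_reason:
        parts.append(f"Reason for incoherence: {ex['reason']}")
    parts.append("Rewrite the sentence so the paragraph reads coherently:")
    return {"input": "\n".join(parts), "target": ex["rewrite"]}


print(to_sft_pair(example, with_reason=True)["input"])
```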

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories
Jia Li | Ge Li | Yunfei Zhao | Yongmin Li | Huanyu Liu | Hao Zhu | Lecheng Wang | Kaibo Liu | Zheng Fang | Lanshen Wang | Jiazheng Ding | Xuanming Zhang | Yuqi Zhu | Yihong Dong | Zhi Jin | Binhua Li | Fei Huang | Yongbin Li | Bin Gu | Mengfei Yang
Findings of the Association for Computational Linguistics: ACL 2024

How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. To address this gap, we propose a new benchmark named DevEval, which has three advances. (1) DevEval aligns with real-world repositories in multiple dimensions, e.g., code and dependency distributions. (2) DevEval is annotated by 13 developers and contains comprehensive annotations (e.g., requirements, original repositories, reference code, and reference dependencies). (3) DevEval comprises 1,825 testing samples from 115 repositories, covering 10 popular domains (e.g., Internet, Database). Based on DevEval, we propose the task of repository-level code generation and evaluate 8 popular LLMs (e.g., gpt-4, gpt-3.5, StarCoder 2, DeepSeek Coder, CodeLLaMa). Our experiments reveal these LLMs’ coding abilities in real-world code repositories. For example, the highest Pass@1, achieved by gpt-4, is only 53.04% in our experiments. We also analyze LLMs’ failure cases and summarize their shortcomings. We hope DevEval can facilitate the development of LLMs in real code repositories. DevEval, prompts, and LLMs’ predictions have been released.
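Pass@1 here presumably refers to the standard execution-based code-generation metric: generate n samples per task, count how many pass the tests, and estimate the probability that a single sample passes. A minimal sketch of the usual unbiased pass@k estimator follows; the per-task counts are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c pass the tests, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical per-task results: (samples generated, samples that passed).
results = [(10, 6), (10, 0), (10, 10), (10, 3)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"Pass@1 = {pass_at_1:.2%}")
```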

ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution
Xuanming Zhang | Zixun Chen | Zhou Yu
Findings of the Association for Computational Linguistics: ACL 2024

Lexical substitution aims to find appropriate substitutes for a given target word in a context sentence. However, the task does not consider substitutes of equal or higher proficiency than the target, an aspect that could benefit language learners looking to improve their writing. To bridge this gap, we propose a new task: language proficiency-oriented lexical substitution. We also introduce ProLex, a novel benchmark designed to assess systems’ ability to generate not only appropriate substitutes but also substitutes that demonstrate better language proficiency. Besides the benchmark, we propose models that can automatically perform the new task. We show that our best model, a Llama2-13B model fine-tuned with task-specific synthetic data, outperforms ChatGPT by an average of 3.2% in F-score and achieves comparable results with GPT-4 on ProLex.
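The task can be pictured as ordinary lexical substitution with an added proficiency constraint. The toy sketch below assumes a CEFR-style proficiency scale and a tiny hand-written lexicon; both are illustrative assumptions, not ProLex data or the paper's scoring method.

```python
# A toy sketch of proficiency-oriented lexical substitution: keep only
# candidates that fit the context and sit at an equal or higher proficiency
# level than the target word. The CEFR lexicon is an illustrative assumption.

CEFR = {"good": "A1", "nice": "A1", "beneficial": "B2", "advantageous": "C1"}
ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def proficiency_oriented_substitutes(target: str, candidates: list[str]) -> list[str]:
    """Filter candidate substitutes to those at the target's level or above."""
    floor = ORDER.index(CEFR[target])
    return [w for w in candidates if ORDER.index(CEFR[w]) >= floor]


# For "Exercise is good for your health.", keep substitutes at A1 or above.
print(proficiency_oriented_substitutes("good", ["nice", "beneficial", "advantageous"]))
```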

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
Kun Qian | Shunji Wan | Claudia Tang | Youzhi Wang | Xuanming Zhang | Maximillian Chen | Zhou Yu
Findings of the Association for Computational Linguistics: EMNLP 2024

As large language models achieve impressive scores on traditional benchmarks, an increasing number of researchers are becoming concerned about benchmark data leakage during pre-training, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets, keeping the test set labels closed-source. They require anyone wishing to evaluate a language model to submit the model’s predictions for centralized processing and then publish the results on their leaderboard. However, this submission process is inefficient and prevents effective error analysis. To address this issue, we propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time. We applied this variable perturbation method to four datasets: GSM8K, ARC, CommonsenseQA, and TruthfulQA, which cover mathematical generation and multiple-choice tasks. Our experimental results demonstrate that this approach provides a more accurate assessment of the true capabilities of language models, effectively mitigating the contamination problem.
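The variable-perturbation idea can be sketched in a few lines of Python. The GSM8K-style template, variable names, and value ranges below are illustrative assumptions rather than VarBench's released format.

```python
import random

# Minimal sketch: a test case is turned into a template with extracted
# variables, each variable gets a value range, and every evaluation run
# samples fresh values to instantiate a new test case.

TEMPLATE = (
    "Natalia sold clips to {n_friends} of her friends in April, and then "
    "she sold half as many clips in May. How many clips did Natalia sell "
    "altogether in April and May?"
)

# Even values keep "half as many" a whole number.
VARIABLES = {"n_friends": range(10, 100, 2)}


def answer(n_friends: int) -> int:
    # Ground-truth program for this template: April sales plus half of that in May.
    return n_friends + n_friends // 2


def sample_test_case(seed: int):
    """Instantiate a fresh test case by sampling new variable values."""
    rng = random.Random(seed)
    values = {name: rng.choice(list(rng_vals)) for name, rng_vals in VARIABLES.items()}
    return TEMPLATE.format(**values), answer(**values)


if __name__ == "__main__":
    question, gold = sample_test_case(seed=42)
    print(question)
    print("gold answer:", gold)
```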

2023

GrounDialog: A Dataset for Repair and Grounding in Task-oriented Spoken Dialogues for Language Learning
Xuanming Zhang | Rahul Divekar | Rutuja Ubale | Zhou Yu
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

Improving conversational proficiency is a key target for students learning a new language. While acquiring conversational proficiency, students must learn the linguistic mechanisms of Repair and Grounding (R&G) to negotiate meaning and find common ground with their interlocutor so that conversational breakdowns can be resolved. Task-oriented Spoken Dialogue Systems (SDS) have long been sought as a tool to hone conversational proficiency. However, the R&G patterns for language learners interacting with a task-oriented spoken dialogue system are not reflected explicitly in any existing datasets. Therefore, to move the needle in Spoken Dialogue Systems for language learning, we present GrounDialog: an annotated dataset of spoken conversations in which we elicit a rich set of R&G patterns.

2020

Aspect-Based Sentiment Analysis as Fine-Grained Opinion Mining
Gerardo Ocampo Diaz | Xuanming Zhang | Vincent Ng
Proceedings of the Twelfth Language Resources and Evaluation Conference

We show how the general fine-grained opinion mining concepts of opinion target and opinion expression are related to aspect-based sentiment analysis (ABSA) and discuss their benefits for resource creation over popular ABSA annotation schemes. Specifically, we first discuss why modeling opinions solely in terms of (entity, aspect) pairs inadequately captures the meaning of the sentiment originally expressed by authors, and how opinion expressions and opinion targets can be used to avoid this loss of information. We then design a meaning-preserving annotation scheme and apply it to two popular ABSA datasets, the 2016 SemEval ABSA Restaurant and Laptop datasets. Finally, we discuss the importance of opinion expressions and opinion targets for next-generation ABSA systems. We make our datasets publicly available for download.
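The contrast between the two representations can be made concrete on a toy review sentence. The field names and category labels below are illustrative assumptions, not the paper's annotation schema.

```python
# A toy illustration of the two representations discussed above, for the
# review sentence "The battery life is great, but the screen scratches easily."

# Popular ABSA schemes: (entity, aspect) category pairs with a polarity label.
# The literal opinion wording and the exact target span are not preserved.
absa_annotations = [
    {"entity": "LAPTOP", "aspect": "BATTERY", "polarity": "positive"},
    {"entity": "LAPTOP", "aspect": "SCREEN", "polarity": "negative"},
]

# Fine-grained opinion mining: the opinion expression and its target are kept
# as spans of the original text, so the expressed sentiment is preserved.
opinion_annotations = [
    {"target": "battery life", "expression": "is great", "polarity": "positive"},
    {"target": "the screen", "expression": "scratches easily", "polarity": "negative"},
]

print(absa_annotations)
print(opinion_annotations)
```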