Yang Zhong

2025

From Information to Insight: Leveraging LLMs for Open Aspect-Based Educational Summarization
Yang Zhong | Diane Litman
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper addresses the challenge of aspect-based summarization in education by introducing Reflective ASPect-based summarization (ReflectASP), a novel dataset that summarizes student reflections on STEM lectures. Despite the promising performance of large language models in general summarization, their application to nuanced aspect-based summaries remains under-explored. ReflectASP eases the exploration of open-aspect-based summarization (OABS), overcomes the limitations of current datasets, and comes with ample human annotations. We benchmarked different types of zero-shot summarization methods and proposed two refinement methods to improve summaries, supported by both automatic and human evaluations. Additionally, we analyzed suggestions and revisions made during the refinement process, offering a fine-grained study of the editing strategies employed by these methods. We make our models, dataset, and all human evaluation results available at https://github.com/cs329yangzhong/ReflectASP.

A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation
Yang Zhong | Diane Litman
Findings of the Association for Computational Linguistics: ACL 2025

Ensuring factual consistency in summarization remains a challenge, especially for long-document evaluation. While automated, reference-free evaluation models are essential given the impracticality of large-scale human assessment for lengthy texts, it remains unclear how well different evaluation systems handle different summary granularities and evolving model generations. In this work, we conduct a systematic study on diverse factual-consistency evaluation systems across four long-document datasets, encompassing summaries generated by models from non-LLMs to proprietary LLMs. Our analysis reveals that fine-grained continuous scores can provide more reliable assessments of different evaluation systems’ capabilities than binary classification. We also examine the relationship between sentence-level and summary-level model performance, highlighting its dependency on dataset characteristics. Moreover, our study reveals that advanced systems can achieve higher recall in error detection for older summaries, yet struggle with false positives and fine-grained error detection. Our analysis and case studies provide further insights into designing robust factuality evaluation systems, which are increasingly in demand as generative models advance rapidly.

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that, for the first time, integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink’s core innovations include: (i) Structured Data Generation, by establishing an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model’s tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models.

Discourse-Driven Evaluation: Unveiling Factual Inconsistency in Long Document Summarization
Yang Zhong | Diane Litman
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Detecting factual inconsistency for long document summarization remains challenging, given the complex structure of the source article and long summary length. In this work, we study factual inconsistency errors and connect them with a line of discourse analysis. We find that errors are more common in complex sentences and are associated with several discourse features. We propose a framework that decomposes long texts into discourse-inspired chunks and utilizes discourse information to better aggregate sentence-level scores predicted by NLI models. Our approach shows improved performance on top of different model baselines over several evaluation benchmarks, covering rich domains of texts, focusing on long document summarization. This underscores the significance of incorporating discourse features in developing models for scoring summaries for long document factual inconsistency.

How to Align Multiple Signed Language Corpora for Better Sign-to-Sign Translations?
Mert Inan | Yang Zhong | Vidya Ganesh | Malihe Alikhani
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

There are more than 300 documented signed languages worldwide, which are indispensable avenues for computational linguists to study cross-cultural and cross-linguistic factors that affect automatic sign understanding and generation. Yet, these are studied under critically low-resource settings, especially when examining multiple signed languages simultaneously. In this work, we hypothesize that a linguistically informed alignment algorithm can improve the results of sign-to-sign translation models. To this end, we first conduct a qualitative analysis of similarities and differences across three signed languages: American Sign Language (ASL), Chinese Sign Language (CSL), and German Sign Language (DGS). We then introduce a novel generation and alignment algorithm for translating one sign language to another, exploring Large Language Models (LLMs) as intermediary translators and paraphrasers. We also compile a dataset of sign-to-sign translation pairs between these signed languages. Our model trained on this dataset performs well on automatic metrics for sign-to-sign translation and generation. Our code and data will be available for the camera-ready version of the paper.

2024

ReflectSumm: A Benchmark for Course Reflection Summarization
Yang Zhong | Mohamed Elaraby | Diane Litman | Ahmed Ashraf Butt | Muhsin Menekse
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces ReflectSumm, a novel summarization dataset specifically designed for summarizing students’ reflective writing. The goal of ReflectSumm is to facilitate developing and evaluating novel summarization techniques tailored to real-world scenarios with little training data, with potential implications in the opinion summarization domain in general and the educational domain in particular. The dataset encompasses a diverse range of summarization tasks and includes comprehensive metadata, enabling the exploration of various research questions and supporting different applications. To showcase its utility, we conducted extensive evaluations using multiple state-of-the-art baselines. The results provide benchmarks for facilitating further research in this area.

2023

Overview of ImageArg-2023: The First Shared Task in Multimodal Argument Mining
Zhexiong Liu | Mohamed Elaraby | Yang Zhong | Diane Litman
Proceedings of the 10th Workshop on Argument Mining

This paper presents an overview of the ImageArg shared task, the first multimodal Argument Mining shared task co-located with the 10th Workshop on Argument Mining at EMNLP 2023. The shared task comprises two classification subtasks - (1) Subtask-A: Argument Stance Classification; (2) Subtask-B: Image Persuasiveness Classification. The former determines the stance of a tweet containing an image and a piece of text toward a controversial topic (e.g., gun control and abortion). The latter determines whether the image makes the tweet text more persuasive. The shared task received 31 submissions for Subtask-A and 21 submissions for Subtask-B from 9 different teams across 6 countries. The top submission in Subtask-A achieved an F1-score of 0.8647 while the best submission in Subtask-B achieved an F1-score of 0.5561.

Towards Argument-Aware Abstractive Summarization of Long Legal Opinions with Summary Reranking
Mohamed Elaraby | Yang Zhong | Diane Litman
Findings of the Association for Computational Linguistics: ACL 2023

We propose a simple approach for the abstractive summarization of long legal opinions that takes into account the argument structure of the document. Legal opinions often contain complex and nuanced argumentation, making it challenging to generate a concise summary that accurately captures the main points of the legal opinion. Our approach involves using argument role information to generate multiple candidate summaries, then reranking these candidates based on alignment with the document’s argument structure. We demonstrate the effectiveness of our approach on a dataset of long legal opinions and show that it outperforms several strong baselines.

STRONG – Structure Controllable Legal Opinion Summary Generation
Yang Zhong | Diane Litman
Findings of the Association for Computational Linguistics: IJCNLP-AACL 2023 (Findings)

2022

Modeling Intensification for Sign Language Generation: A Computational Approach
Mert Inan | Yang Zhong | Sabit Hassan | Lorna Quandt | Malihe Alikhani
Findings of the Association for Computational Linguistics: ACL 2022

End-to-end sign language generation models do not accurately represent the prosody in sign language. A lack of temporal and spatial variations leads to poor-quality generated presentations that confuse human interpreters. In this paper, we aim to improve the prosody in generated sign languages by modeling intensification in a data-driven manner. We present different strategies grounded in linguistics of sign language that inform how intensity modifiers can be represented in gloss annotations. To employ our strategies, we first annotate a subset of the benchmark PHOENIX-14T, a German Sign Language dataset, with different levels of intensification. We then use a supervised intensity tagger to extend the annotated dataset and obtain labels for the remaining portion of it. This enhanced dataset is then used to train state-of-the-art transformer models for sign language generation. We find that our efforts in intensification modeling yield better results when evaluated with automatic metrics. Human evaluation also indicates a higher preference for the videos generated using our model.

Computing and Exploiting Document Structure to Improve Unsupervised Extractive Summarization of Legal Case Decisions
Yang Zhong | Diane Litman
Proceedings of the Natural Legal Language Processing Workshop 2022

Though many algorithms can be used to automatically summarize legal case decisions, most fail to incorporate domain knowledge about how important sentences in a legal decision relate to a representation of its document structure. For example, analysis of a legal case summarization dataset demonstrates that sentences serving different types of argumentative roles in the decision appear in different sections of the document. In this work, we propose an unsupervised graph-based ranking model that uses a reweighting algorithm to exploit properties of the document structure of legal case decisions. We also explore the impact of using different methods to compute the document structure. Results on the Canadian Legal Case Law dataset show that our proposed method outperforms several strong baselines.

2021

WIKIBIAS: Detecting Multi-Span Subjective Biases in Language
Yang Zhong | Jingfeng Yang | Wei Xu | Diyi Yang
Findings of the Association for Computational Linguistics: EMNLP 2021

Biases continue to be prevalent in modern text and media, especially subjective bias – a special type of bias that introduces improper attitudes or presents a statement with the presupposition of truth. To tackle the problem of detecting and further mitigating subjective bias, we introduce a manually annotated parallel corpus WIKIBIAS with more than 4,000 sentence pairs from Wikipedia edits. This corpus contains annotations towards both sentence-level bias types and token-level biased segments. We present systematic analyses of our dataset and results achieved by a set of state-of-the-art baselines in terms of three tasks: bias classification, tagging biased segments, and neutralizing biased text. We find that current models still struggle with detecting multi-span biases despite their reasonable performances, suggesting that our dataset can serve as a useful research benchmark. We also demonstrate that models trained on our dataset can generalize well to multiple domains such as news and political speeches.

2020

Neural CRF Model for Sentence Alignment in Text Simplification
Chao Jiang | Mounica Maddela | Wuwei Lan | Yang Zhong | Wei Xu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parallel documents but also utilizes a neural sentence pair model to capture semantic similarity. Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, Newsela-Auto and Wiki-Auto, which are much larger and of better quality compared to the existing datasets. A Transformer-based seq2seq model trained on our datasets establishes a new state-of-the-art for text simplification in both automatic and human evaluation.