2025
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Wei Emma Zhang | Xiang Dai | Desmond Elliot | Byron Fang | Mongyuan Sim | Haojie Zhuang | Weitong Chen
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Fine-Tuning Encoder-Decoder Models with Contrastive Learning for In-Context Distractor Generation
Elaf Alhazmi | Quan Z. Sheng | Wei Emma Zhang | Mohammed I. Thanoon | Haojie Zhuang | Behnaz Soltani | Munazza Zaib
Findings of the Association for Computational Linguistics: EMNLP 2025
Distractor generation is the task of automatically generating plausible yet incorrect options (i.e., distractors) for fill-in-the-blank and multiple-choice questions. In assessment, distractors must be contextually relevant to the given question and answer. Although recent research focuses on fine-tuning pre-trained encoder-decoder models with data augmentation techniques to generate distractors, these models often fail to capture the full semantic representation of a given question-answer pair and its related distractors. The augmentation methods often rely on expanding the quantity of proposed candidates (i.e., questions or distractors), which can introduce noise into the models without necessarily enhancing their understanding of the deeper semantic relationships between the question-answer pair and its related distractors. This paper introduces a novel distractor generation model that uses contrastive learning to train the model to recognize the essential semantic features needed to generate in-context distractors. Extensive experiments on two public datasets indicate that contrastive learning yields a strong baseline model for the distractor generation task. It significantly outperforms recent models, increasing the NDCG@3 score from 24.68 to 32.33 on the MCQ dataset and from 26.66 to 36.68 on the SciQ dataset.
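The NDCG@3 figures quoted above rank-weight the relevance of the top three generated distractors. The following is a minimal sketch of NDCG@k, not the paper's evaluation code: it assumes binary relevance labels (1 if a generated distractor matches a reference distractor, 0 otherwise) and normalizes by the ideal ordering of the same labels. The function names `dcg_at_k` and `ndcg_at_k` are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked items."""
    # Rank positions are 0-based here, so the discount is log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering.

    Assumption: the ideal ordering is built from the same relevance labels;
    the paper may normalize differently.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for three generated distractors, in ranked order.
score = ndcg_at_k([1, 0, 1], k=3)
print(round(score, 4))
```

A score reported as 32.33 would correspond to an average of about 0.3233 on this 0-to-1 scale, assuming the paper reports percentages.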
The More, The Better? A Critical Study of Multimodal Context in Radiology Report Summarization
Mong Yuan Sim | Wei Emma Zhang | Xiang Dai | Biaoyan Fang | Sarbin Ranjitkar | Arjun Burlakoti | Jamie Taylor | Haojie Zhuang
Findings of the Association for Computational Linguistics: EMNLP 2025
The Impression section of a radiology report summarizes the report's critical findings and thus plays a crucial role in communication between radiologists and physicians. Research on radiology report summarization mostly focuses on generating the Impression section by summarizing information from the Findings section, which typically details the radiologist’s observations in the radiology images. Recent work has started to explore how to incorporate radiology images as input to multimodal summarization models, on the assumption that the images contain richer information and can therefore improve the quality of the generated summary. However, the real effectiveness of radiology images remains unclear. To answer this, we conduct a thorough analysis to understand whether current multimodal models can utilize radiology images when summarizing the Findings section. Our analysis reveals that current multimodal models often fail to effectively utilize radiology images. For example, masking the image input leads to minimal or no performance drop. An expert annotation study shows that radiology images are unnecessary when experts write the Impression section.
2024
Trainable Hard Negative Examples in Contrastive Learning for Unsupervised Abstractive Summarization
Haojie Zhuang | Wei Emma Zhang | Chang Dong | Jian Yang | Quan Sheng
Findings of the Association for Computational Linguistics: EACL 2024
Contrastive learning has demonstrated promising results in unsupervised abstractive summarization. However, existing methods rely on manually crafted negative examples, demanding substantial human effort and domain knowledge. Moreover, these human-generated negative examples may be poor in quality and lack adaptability during model training. To address these issues, we propose a novel approach that learns trainable negative examples for contrastive learning in unsupervised abstractive summarization, which eliminates the need for manual negative example design. Our framework introduces an adversarial optimization process between a negative example network and a representation network (including the summarizer and encoders). The negative example network is trained to synthesize hard negative examples that are close to the positive examples, driving the representation network to improve the quality of the generated summaries. We evaluate our method on two benchmark datasets for unsupervised abstractive summarization and observe significant performance improvements compared to strong baseline models.
Automatic, Meta and Human Evaluation for Multimodal Summarization with Multimodal Output
Haojie Zhuang | Wei Emma Zhang | Leon Xie | Weitong Chen | Jian Yang | Quan Sheng
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Multimodal summarization with multimodal output (MSMO) has attracted increasing research interest recently, as a multimodal summary can provide more comprehensive information than a text-only summary, effectively improving user experience and satisfaction. As one of the most fundamental components in the development of MSMO, evaluation is an emerging yet underexplored research topic. In this paper, we fill this gap and propose a research framework that studies three research questions of MSMO evaluation: (1) Automatic Evaluation: We propose a novel metric, mLLM-EVAL, which utilizes a multimodal Large Language Model for MSMO EVALuation. (2) Meta-Evaluation: We create a meta-evaluation benchmark dataset by collecting human-annotated scores for multimodal summaries. With our benchmark, we conduct meta-evaluation analysis to assess the quality of different evaluation metrics and show the effectiveness of our proposed mLLM-EVAL. (3) Human Evaluation: To provide more objective and unbiased human annotations for meta-evaluation, we hypothesize and verify three types of cognitive biases in human evaluation. We also incorporate our findings into the human annotation process in the meta-evaluation benchmark. Overall, our research framework provides an evaluation metric, a meta-evaluation benchmark dataset annotated by humans, and an analysis of cognitive biases in human evaluation, which we believe will serve as a valuable and comprehensive resource for the MSMO research community.
2022
Learning From the Source Document: Unsupervised Abstractive Summarization
Haojie Zhuang | Wei Emma Zhang | Jian Yang | Congbo Ma | Yutong Qu | Quan Z. Sheng
Findings of the Association for Computational Linguistics: EMNLP 2022
Most state-of-the-art methods for abstractive text summarization operate in supervised learning settings and rely heavily on high-quality, large-scale parallel corpora. In this paper, we remove the need for reference summaries and present an unsupervised learning method, SCR (Summarize, Contrast and Review), for abstractive summarization. SCR leverages contrastive learning and is the first work to apply contrastive learning to unsupervised abstractive summarization. In particular, we use the true source documents as positive source document examples and strategically generated fake source documents as negative source document examples to train the model to generate good summaries. Furthermore, we consider and improve the writing quality of the generated summaries by guiding them to be similar to human-written texts. Promising results from extensive experiments show that SCR outperforms other unsupervised abstractive summarization baselines, which demonstrates its effectiveness.
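The contrastive setup described above, with the true source document as the positive and fake source documents as negatives, can be illustrated with a generic InfoNCE-style loss. This is a sketch under stated assumptions rather than the SCR objective itself: the abstract does not give the loss, so the cosine similarity, the `temperature` value, the toy embedding vectors, and the name `contrastive_loss` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(summary, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the summary embedding is closer to the
    positive (true source) embedding than to any negative (fake source)."""
    pos_logit = cosine(summary, positive) / temperature
    logits = [pos_logit] + [cosine(summary, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - pos_logit


# Toy 2-d embeddings (assumed values, for illustration only).
aligned = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned < misaligned)
```

In practice the embeddings would come from trained encoders and the loss would be minimized with gradient descent; plain Python lists are used here only to keep the sketch dependency-free.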