Biaoyan Fang


2024

pdf bib
Understanding Faithfulness and Reasoning of Large Language Models on Plain Biomedical Summaries
Biaoyan Fang | Xiang Dai | Sarvnaz Karimi
Findings of the Association for Computational Linguistics: EMNLP 2024

Generating plain biomedical summaries with Large Language Models (LLMs) can enhance the accessibility of biomedical knowledge to the public. However, how faithful the generated summaries are remains an open yet critical question. To address this, we propose FaReBio, a benchmark dataset with expert-annotated Faithfulness and Reasoning on plain Biomedical Summaries. This dataset consists of 175 plain summaries ($,445 sentences) generated by seven different LLMs, paired with source articles. Using our dataset, we identify the performance gap of LLMs in generating faithful plain biomedical summaries and observe a negative correlation between abstractiveness and faithfulness. We also show that current faithfulness evaluation metrics do not work well in the biomedical domain and confirm the over-confident tendency of LLMs as faithfulness evaluators. To better understand the faithfulness judgements, we further benchmark LLMs in retrieving supporting evidence and show the gap of LLMs in reasoning faithfulness evaluation at different abstractiveness levels. Going beyond the binary faithfulness labels, coupled with the annotation of supporting sentences, our dataset could further contribute to the understanding of faithfulness evaluation and reasoning.

pdf bib
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai | Sarvnaz Karimi | Biaoyan Fang
Findings of the Association for Computational Linguistics: EMNLP 2024

Effective summarisation evaluation metrics enable researchers and practitioners to compare different summarisation systems efficiently. Estimating the effectiveness of an automatic evaluation metric, termed meta-evaluation, is a critically important research question. In this position paper, we review recent meta-evaluation practices for summarisation evaluation metrics and find that (1) evaluation metrics are primarily meta-evaluated on datasets consisting of examples from news summarisation datasets, and (2) there has been a noticeable shift in research focus towards evaluating the faithfulness of generated summaries. We argue that the time is ripe to build more diverse benchmarks that enable the development of more robust evaluation metrics and analyze the generalization ability of existing evaluation metrics. In addition, we call for research focusing on user-centric quality dimensions that consider the generated summary’s communicative goal and the role of summarisation in the workflow.

pdf bib
Born Differently Makes a Difference: Counterfactual Study of Bias in Biography Generation from a Data-to-Text Perspective
Biaoyan Fang | Ritvik Dinesh | Xiang Dai | Sarvnaz Karimi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

How do personal attributes affect biography generation? Addressing this question requires an identical pair of biographies where only the personal attributes of interest are different. However, it is rare in the real world. To address this, we propose a counterfactual methodology from a data-to-text perspective, manipulating the personal attributes of interest while keeping the co-occurring attributes unchanged. We first validate that the fine-tuned Flan-T5 model generates the biographies based on the given attributes. This work expands the analysis of gender-centered bias in text generation. Our results confirm the well-known bias in gender and also show the bias in regions, in both individual and its related co-occurring attributes in semantic machining and sentiment.

2023

pdf bib
More than Votes? Voting and Language based Partisanship in the US Supreme Court
Biaoyan Fang | Trevor Cohn | Timothy Baldwin | Lea Frermann
Findings of the Association for Computational Linguistics: EMNLP 2023

Understanding the prevalence and dynamics of justice partisanship and ideology in the US Supreme Court is critical in studying jurisdiction. Most research quantifies partisanship based on voting behavior, and oral arguments in the courtroom — the last essential procedure before the final case outcome — have not been well studied for this purpose. To address this gap, we present a framework for analyzing the language of justices in the courtroom for partisan signals, and study how partisanship in speech aligns with voting patterns. Our results show that the affiliated party of justices can be predicted reliably from their oral contributions. We further show a strong correlation between language partisanship and voting ideology.

pdf bib
Super-SCOTUS: A multi-sourced dataset for the Supreme Court of the US
Biaoyan Fang | Trevor Cohn | Timothy Baldwin | Lea Frermann
Proceedings of the Natural Legal Language Processing Workshop 2023

Given the complexity of the judiciary in the US Supreme Court, various procedures, along with various resources, contribute to the court system. However, most research focuses on a limited set of resources, e.g., court opinions or oral arguments, for analyzing a specific perspective in court, e.g., partisanship or voting. To gain a fuller understanding of these perspectives in the legal system of the US Supreme Court, a more comprehensive dataset, connecting different sources in different phases of the court procedure, is needed. To address this gap, we present a multi-sourced dataset for the Supreme Court, comprising court resources from different procedural phases, connecting language documents with extensive metadata. We showcase its utility through a case study on how different court documents reveal the decision direction (conservative vs. liberal) of the cases. We analyze performance differences across three protected attributes, indicating that different court resources encode different biases, and reinforcing that considering various resources provides a fuller picture of the court procedures. We further discuss how our dataset can contribute to future research directions.

pdf bib
It’s not only What You Say, It’s also Who It’s Said to: Counterfactual Analysis of Interactive Behavior in the Courtroom
Biaoyan Fang | Trevor Cohn | Timothy Baldwin | Lea Frermann
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)

2022

pdf bib
Context-Aware Sentence Classification in Evidence-Based Medicine
Biaoyan Fang | Fajri Koto
Proceedings of the 20th Annual Workshop of the Australasian Language Technology Association

pdf bib
What does it take to bake a cake? The RecipeRef corpus and anaphora resolution in procedural text
Biaoyan Fang | Timothy Baldwin | Karin Verspoor
Findings of the Association for Computational Linguistics: ACL 2022

Procedural text contains rich anaphoric phenomena, yet has not received much attention in NLP. To fill this gap, we investigate the textual properties of two types of procedural text, recipes and chemical patents, and generalize an anaphora annotation framework developed for the chemical domain for modeling anaphoric phenomena in recipes. We apply this framework to annotate the RecipeRef corpus with both bridging and coreference relations. Through comparison to chemical patents, we show the complexity of anaphora resolution in recipes. We demonstrate empirically that transfer learning from the chemical domain improves resolution of anaphora in recipes, suggesting transferability of general procedural knowledge.

2021

pdf bib
Handling Variance of Pretrained Language Models in Grading Evidence in the Medical Literature
Fajri Koto | Biaoyan Fang
Proceedings of the 19th Annual Workshop of the Australasian Language Technology Association

In this paper, we investigate the utility of modern pretrained language models for the evidence grading system in the medical literature based on the ALTA 2021 shared task. We benchmark 1) domain-specific models that are optimized for medical literature and 2) domain-generic models with rich latent discourse representation (i.e. ELECTRA, RoBERTa). Our empirical experiments reveal that these modern pretrained language models suffer from high variance, and the ensemble method can improve the model performance. We found that ELECTRA performs best with an accuracy of 53.6% on the test set, outperforming domain-specific models.1

pdf bib
ChEMU-Ref: A Corpus for Modeling Anaphora Resolution in the Chemical Domain
Biaoyan Fang | Christian Druckenbrodt | Saber A Akhondi | Jiayuan He | Timothy Baldwin | Karin Verspoor
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Chemical patents contain rich coreference and bridging links, which are the target of this research. Specially, we introduce a novel annotation scheme, based on which we create the ChEMU-Ref dataset from reaction description snippets in English-language chemical patents. We propose a neural approach to anaphora resolution, which we show to achieve strong results, especially when jointly trained over coreference and bridging links.