Jingxuan Tu - ACL Anthology

Jingxuan Tu

2025

Beyond Benchmarks: Building a Richer Cross-Document Event Coreference Dataset with Decontextualization
Jin Zhao | Jingxuan Tu | Bingyang Ye | Xinrui Hu | Nianwen Xue | James Pustejovsky
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Cross-Document Event Coreference (CDEC) annotation is challenging and difficult to scale, resulting in existing datasets being small and lacking diversity. We introduce a new approach leveraging large language models (LLMs) to decontextualize event mentions, by simplifying the document-level annotation task to sentence pairs with enriched context, enabling the creation of Richer EventCorefBank (RECB), a denser and more expressive dataset annotated at faster speed. Decontextualization has been shown to improve annotation speed without compromising quality and to enhance model performance. Our baseline experiment indicates that systems trained on RECB achieve comparable results on the EventCorefBank(ECB+) test set, showing the high quality of our dataset and its generalizability on other CDEC datasets. In addition, our evaluation shows that the strong baseline models are still struggling with RECB comparing to other CDEC datasets, suggesting that the richness and diversity of RECB present significant challenges to current CDEC systems.

TRACE: Real-Time Multimodal Common Ground Tracking in Situated Collaborative Dialogues
Hannah VanderHoeven | Brady Bhalla | Ibrahim Khebour | Austin C. Youngren | Videep Venkatesha | Mariah Bradford | Jack Fitzgerald | Carlos Mabrey | Jingxuan Tu | Yifan Zhu | Kenneth Lai | Changsoo Jung | James Pustejovsky | Nikhil Krishnaswamy
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

We present TRACE, a novel system for live *common ground* tracking in situated collaborative tasks. With a focus on fast, real-time performance, TRACE tracks the speech, actions, gestures, and visual attention of participants, uses these multimodal inputs to determine the set of task-relevant propositions that have been raised as the dialogue progresses, and tracks the group’s epistemic position and beliefs toward them as the task unfolds. Amid increased interest in AI systems that can mediate collaborations, TRACE represents an important step forward for agents that can engage with multiparty, multimodal discourse.

ProPara-CRTS: Canonical Referent Tracking for Reliable Evaluation of Entity State Tracking in Process Narratives
Bingyang Ye | Timothy Obiso | Jingxuan Tu | James Pustejovsky
Proceedings of the 16th International Conference on Computational Semantics

Despite the abundance of datasets for procedural texts such as cooking recipes, resources that capture full process narratives, paragraph-long descriptions that follow how multiple entities evolve across a sequence of steps, remain scarce.Although synthetic resources offer useful toy settings, they fail to capture the linguistic variability of naturally occurring prose. ProPara remains the only sizeable, naturally occurring corpus of process narratives, yet ambiguities and inconsistencies in its schema and annotations hinder reliable evaluation of its core task Entity State Tracking (EST).In this paper, we introduce a Canonical Referent Tracking Schema (CRTS) that assigns every surface mention to a unique, immutable discourse referent and records that referent’s existence and location at each step. Applying CRTS to ProPara, we release the re-annotated result as ProPara-CRTS. The new corpus resolves ambiguous participant mentions in ProPara and consistently boosts performance across a variety of models.This suggests that principled schema design and targeted re-annotation can unlock measurable improvements in EST, providing a sharper diagnostic of model capacity in process narratives understanding without any changes to model architecture.

Enhanced Noun-Noun Compound Interpretation through Textual Enrichment
Bingyang Ye | Jingxuan Tu | James Pustejovsky
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Interpreting Noun-Noun Compounds remains a persistent challenge for Large Language Models (LLMs) because the semantic relation between the modifier and the head is rarely stated explicitly. Recent benchmarks frame Noun-Noun Compound Interpretation as a multiple-choice question. While this setting allows LLMs to produce more controlled results, it still faces two key limitations: vague relation descriptions as options and the inability to handle polysemous compounds. We introduce a dual-faceted textual enrichment framework that augments prompts. Description enrichment paraphrases relations into event‐oriented descriptions instantiated with the target compound to explicitly surface the hidden event connecting head and modifier. Conditioned context enrichment identifies polysemous compounds leveraging qualia-role binding and assigns each compound with condition cues for disambiguation. Our method yields consistently higher accuracy across three LLM families. These gains suggest that surfacing latent compositional structure and contextual constraint is a promising path toward deeper semantic understanding in language models.

2024

HaRMoNEE at SemEval-2024 Task 6: Tuning-based Approaches to Hallucination Recognition
Timothy Obiso | Jingxuan Tu | James Pustejovsky
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper presents the Hallucination Recognition Model for New Experiment Evaluation (HaRMoNEE) team’s winning (#1) and #10 submissions for SemEval-2024 Task 6: Shared- task on Hallucinations and Related Observable Overgeneration Mistakes (SHROOM)’s two subtasks. This task challenged its participants to design systems to detect hallucinations in Large Language Model (LLM) outputs. Team HaRMoNEE proposes two architectures: (1) fine-tuning an off-the-shelf transformer-based model and (2) prompt tuning large-scale Large Language Models (LLMs). One submission from the fine-tuning approach outperformed all other submissions for the model-aware subtask; one submission from the prompt-tuning approach is the 10th-best submission on the leaderboard for the model-agnostic subtask. Our systems also include pre-processing, system-specific tuning, post-processing, and evaluation.

GLAMR: Augmenting AMR with GL-VerbNet Event Structure
Jingxuan Tu | Timothy Obiso | Bingyang Ye | Kyeongmin Rim | Keer Xu | Liulu Yue | Susan Windisch Brown | Martha Palmer | James Pustejovsky
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper introduces GLAMR, an Abstract Meaning Representation (AMR) interpretation of Generative Lexicon (GL) semantic components. It includes a structured subeventual interpretation of linguistic predicates, and encoding of the opposition structure of property changes of event arguments. Both of these features are recently encoded in VerbNet (VN), and form the scaffolding for the semantic form associated with VN frame files. We develop a new syntax, concepts, and roles for subevent structure based on VN for connecting subevents to atomic predicates. Our proposed extension is compatible with current AMR specification. We also present an approach to automatically augment AMR graphs by inserting subevent structure of the predicates and identifying the subevent arguments from the semantic roles. A pilot annotation of GLAMR graphs of 65 documents (486 sentences), based on procedural texts as a source, is presented as a public dataset. The annotation includes subevents, argument property change, and document-level anaphoric links. Finally, we provide baseline models for converting text to GLAMR and vice versa, along with the application of GLAMR for generating enriched paraphrases with details on subevent transformation and arguments that are not present in the surface form of the texts.

Common Ground Tracking in Multimodal Dialogue
Ibrahim Khalil Khebour | Kenneth Lai | Mariah Bradford | Yifan Zhu | Richard A. Brutti | Christopher Tam | Jingxuan Tu | Benjamin A. Ibarra | Nathaniel Blanchard | Nikhil Krishnaswamy | James Pustejovsky
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Within Dialogue Modeling research in AI and NLP, considerable attention has been spent on “dialogue state tracking” (DST), which is the ability to update the representations of the speaker’s needs at each turn in the dialogue by taking into account the past dialogue moves and history. Less studied but just as important to dialogue modeling, however, is “common ground tracking” (CGT), which identifies the shared belief space held by all of the participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and ”questions under discussion” (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.

Media Attitude Detection via Framing Analysis with Events and their Relations
Jin Zhao | Jingxuan Tu | Han Du | Nianwen Xue
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Framing is used to present some selective aspects of an issue and making them more salient, which aims to promote certain values, interpretations, or solutions (Entman, 1993). This study investigates the nuances of media framing on public perception and understanding by examining how events are presented within news articles. Unlike previous research that primarily focused on word choice as a framing device, this work explores the comprehensive narrative construction through events and their relations. Our method integrates event extraction, cross-document event coreference, and causal relationship mapping among events to extract framing devices employed by media to assess their role in framing the narrative. We evaluate our approach with a media attitude detection task and show that the use of event mentions, event cluster descriptors, and their causal relations effectively captures the subtle nuances of framing, thereby providing deeper insights into the attitudes conveyed by news articles. The experimental results show the framing device models surpass the baseline models and offers a more detailed and explainable analysis of media framing effects. We make the source code and dataset publicly available.

Linguistically Conditioned Semantic Textual Similarity
Jingxuan Tu | Keer Xu | Liulu Yue | Bingyang Ye | Kyeongmin Rim | James Pustejovsky
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Semantic textual similarity (STS) is a fundamental NLP task that measures the semantic similarity between a pair of sentences. In order to reduce the inherent ambiguity posed from the sentences, a recent work called Conditional STS (C-STS) has been proposed to measure the sentences’ similarity conditioned on a certain aspect. Despite the popularity of C-STS, we find that the current C-STS dataset suffers from various issues that could impede proper evaluation on this task. In this paper, we reannotate the C-STS validation set and observe an annotator discrepancy on 55% of the instances resulting from the annotation errors in the original label, ill-defined conditions, and the lack of clarity in the task definition. After a thorough dataset analysis, we improve the C-STS task by leveraging the models’ capability to understand the conditions under a QA task setting. With the generated answers, we present an automatic error identification pipeline that is able to identify annotation errors from the C-STS data with over 80% F1 score. We also propose a new method that largely improves the performance over baselines on the C-STS data by training the models with the answers. Finally we discuss the conditionality annotation based on the typed-feature structure (TFS) of entity types. We show in examples that the TFS is able to provide a linguistic foundation for constructing C-STS data with new conditions.

2023

Dense Paraphrasing for Textual Enrichment
Jingxuan Tu | Kyeongmin Rim | Eben Holderness | Bingyang Ye | James Pustejovsky
Proceedings of the 15th International Conference on Computational Semantics

Understanding inferences from text requires more than merely recovering surface arguments, adjuncts, or strings associated with the query terms. As humans, we interpret sentences as contextualized components of a narrative or discourse, by both filling in missing information, and reasoning about event consequences. In this paper, we define the process of rewriting a textual expression (lexeme or phrase) such that it reduces ambiguity while also making explicit the underlying semantics that is not (necessarily) expressed in the economy of sentence structure as Dense Paraphrasing (DP). We apply the DP techniques on the English procedural texts from the cooking recipe domain, and provide the scope and design of the application that involves creating a graph representation of events and generating hidden arguments through paraphrasing. We provide insights on how this DP process can enrich a source text by showing that the dense-paraphrased event graph is a good resource to large LLMs such as GPT-3 to generate reliable paraphrases; and by experimenting baselines for automaticDP generation. Finally, we demonstrate the utility of the dataset and event graph structure by providing a case study on the out-of-domain modeling and different DP prompts and GPT models for paraphrasing.

The Coreference under Transformation Labeling Dataset: Entity Tracking in Procedural Texts Using Event Models
Kyeongmin Rim | Jingxuan Tu | Bingyang Ye | Marc Verhagen | Eben Holderness | James Pustejovsky
Findings of the Association for Computational Linguistics: ACL 2023

We demonstrate that coreference resolution in procedural texts is significantly improved when performing transformation-based entity linking prior to coreference relation identification. When events in the text introduce changes to the state of participating entities, it is often impossible to accurately link entities in anaphoric and coreference relations without an understanding of the transformations those entities undergo. We show how adding event semantics helps to better model entity coreference. We argue that all transformation predicates, not just creation verbs, introduce a new entity into the discourse, as a kind of generalized Result Role, which is typically not textually mentioned. This allows us to model procedural texts as process graphs and to compute the coreference type for any two entities in the recipe. We present our annotation methodology and the corpus generated as well as describe experiments on coreference resolution of entity mentions under a process-oriented model of events.

Scalar Anaphora: Annotating Degrees of Coreference in Text
Bingyang Ye | Jingxuan Tu | James Pustejovsky
Proceedings of the Sixth Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2023)

2022

SemEval-2022 Task 9: R2VQ – Competence-based Multimodal Question Answering
Jingxuan Tu | Eben Holderness | Marco Maru | Simone Conia | Kyeongmin Rim | Kelley Lynch | Richard Brutti | Roberto Navigli | James Pustejovsky
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

In this task, we identify a challenge that is reflective of linguistic and cognitive competencies that humans have when speaking and reasoning. Particularly, given the intuition that textual and visual information mutually inform each other for semantic reasoning, we formulate a Competence-based Question Answering challenge, designed to involve rich semantic annotation and aligned text-video objects. The task is to answer questions from a collection of cooking recipes and videos, where each question belongs to a “question family” reflecting a specific reasoning competence. The data and task result is publicly available.

Evaluating Retrieval for Multi-domain Scientific Publications
Nancy Ide | Keith Suderman | Jingxuan Tu | Marc Verhagen | Shanan Peters | Ian Ross | John Lawson | Andrew Borg | James Pustejovsky
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This paper provides an overview of the xDD/LAPPS Grid framework and provides results of evaluating the AskMe retrievalengine using the BEIR benchmark datasets. Our primary goal is to determine a solid baseline of performance to guide furtherdevelopment of our retrieval capabilities. Beyond this, we aim to dig deeper to determine when and why certain approachesperform well (or badly) on both in-domain and out-of-domain data, an issue that has to date received relatively little attention.

Competence-based Question Generation
Jingxuan Tu | Kyeongmin Rim | James Pustejovsky
Proceedings of the 29th International Conference on Computational Linguistics

Models of natural language understanding often rely on question answering and logical inference benchmark challenges to evaluate the performance of a system. While informative, such task-oriented evaluations do not assess the broader semantic abilities that humans have as part of their linguistic competence when speaking and interpreting language. We define competence-based (CB) question generation, and focus on queries over lexical semantic knowledge involving implicit argument and subevent structure of verbs. We present a method to generate such questions and a dataset of English cooking recipes we use for implementing the generation method. Our primary experiment shows that even large pretrained language models perform poorly on CB questions until they are provided with additional contextualized semantic information. The data and the source code is available at: https: //github.com/brandeis-llc/CompQG.

2021

Exploration and Discovery of the COVID-19 Literature through Semantic Visualization
Jingxuan Tu | Marc Verhagen | Brent Cochran | James Pustejovsky
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

We propose semantic visualization as a linguistic visual analytic method. It can enable exploration and discovery over large datasets of complex networks by exploiting the semantics of the relations in them. This involves extracting information, applying parameter reduction operations, building hierarchical data representation and designing visualization. We also present the accompanying COVID-SemViz a searchable and interactive visualization system for knowledge exploration of COVID-19 data to demonstrate the application of our proposed method. In the user studies, users found that semantic visualization-powered COVID-SemViz is helpful in terms of finding relevant information and discovering unknown associations.

To combat COVID-19, both clinicians and scientists need to digest the vast amount of relevant biomedical knowledge in literature to understand the disease mechanism and the related biological functions. We have developed a novel and comprehensive knowledge discovery framework, COVID-KG to extract fine-grained multimedia knowledge elements (entities, relations and events) from scientific literature. We then exploit the constructed multimedia knowledge graphs (KGs) for question answering and report generation, using drug repurposing as a case study. Our framework also provides detailed contextual sentences, subfigures, and knowledge subgraphs as evidence. All of the data, KGs, reports.

TMR: Evaluating NER Recall on Tough Mentions
Jingxuan Tu | Constantine Lignos
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We propose the Tough Mentions Recall (TMR) metrics to supplement traditional named entity recognition (NER) evaluation by examining recall on specific subsets of ”tough” mentions: unseen mentions, those whose tokens or token/type combination were not observed in training, and type-confusable mentions, token sequences with multiple entity types in the test data. We demonstrate the usefulness of these metrics by evaluating corpora of English, Spanish, and Dutch using five recent neural architectures. We identify subtle differences between the performance of BERT and Flair on two English NER corpora and identify a weak spot in the performance of current models in Spanish. We conclude that the TMR metrics enable differentiation between otherwise similar-scoring systems and identification of patterns in performance that would go unnoticed from overall precision, recall, and F1.

2020

Reproducing Neural Ensemble Classifier for Semantic Relation Extraction inScientific Papers
Kyeongmin Rim | Jingxuan Tu | Kelley Lynch | James Pustejovsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

Within the natural language processing (NLP) community, shared tasks play an important role. They define a common goal and allowthe the comparison of different methods on the same data. SemEval-2018 Task 7 involves the identification and classification of relationsin abstracts from computational linguistics (CL) publications. In this paper we describe an attempt to reproduce the methods and resultsfrom the top performing system at for SemEval-2018 Task 7. We describe challenges we encountered in the process, report on the resultsof our system, and discuss the ways that our attempt at reproduction can inform best practices.

Co-authors

Marc Verhagen 3

Mariah Bradford 2

Nikhil Krishnaswamy 2

Martha Palmer 2

Nathaniel Blanchard 1

Susan Windisch Brown 1

Richard Brutti 1

Richard A. Brutti 1

Shih-Fu Chang 1

Aabhas Chauhan 1

Brent Cochran 1

Ahmed Elsayed 1

Jack Fitzgerald 1

Guangxing Han 1

Benjamin A. Ibarra 1

Changsoo Jung 1

Ibrahim Khebour 1

Ibrahim Khalil Khebour 1

Constantine Lignos 1

Carlos Mabrey 1

Roberto Navigli 1

Boyan Onyshkevych 1

Nikolaus Parulian 1

Shanan Peters 1

Cynthia Schneider 1

Xiangchen Song 1

Keith Suderman 1

Christopher Tam 1

Hannah VanderHoeven 1

Videep Venkatesha 1

Austin C. Youngren 1

Ranran Haoran Zhang 1

Venues