Tom Hope - ACL Anthology

Tom Hope

2025

Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities
Peter Jansen | Bhavana Dalvi Mishra | Harsh Trivedi | Bodhisattwa Prasad Majumder | Tom Hope | Tushar Khot | Doug Downey | Eric Horvitz
Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities

We present ScIRIFF (Scientific Resource for Instruction-Following and Finetuning), a dataset of 137K instruction-following instances for training and evaluation, covering 54 tasks. These tasks span five core scientific literature understanding capabilities: information extraction, summarization, question answering, claim verification, and classification. ScIRIFF is unique in being the only entirely expert-written, high-quality instruction-following dataset designed for extracting and synthesizing information from research literature across diverse scientific fields. It features complex instructions with long input contexts, detailed task descriptions, and structured outputs. To demonstrate its utility, we finetune a series of large language models (LLMs) using a mix of general domain and ScIRIFF instructions. On nine out-of-distribution held-out tasks (referred to as SciRIFF-Eval), LLMs finetuned on SciRIFF achieve 70.6% average improvement over our baselines trained only on general-domain instructions. ScIRIFF facilitates the development and evaluation of LLMs to help researchers navigate the rapidly growing body of scientific literature.

Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
Noy Sternlicht | Ariel Gera | Roy Bar-Haim | Tom Hope | Noam Slonim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.

CodeScientist: End-to-End Semi-Automated Scientific Discovery with Code-based Experimentation
Peter Jansen | Oyvind Tafjord | Marissa Radensky | Pao Siangliulue | Tom Hope | Bhavana Dalvi Mishra | Bodhisattwa Prasad Majumder | Daniel S Weld | Peter Clark
Findings of the Association for Computational Linguistics: ACL 2025

Despite the surge of interest in autonomous scientific discovery (ASD) of software artifacts (e.g., improved ML algorithms), current ASD systems face two key limitations: (1) they largely explore variants of existing codebases or similarly constrained design spaces, and (2) they produce large volumes of research artifacts (such as automatically generated papers and code) that are typically evaluated using conference-style paper review with limited evaluation of code. In this work we introduce CodeScientist, a novel ASD system that frames ideation and experiment construction as a form of genetic search jointly over combinations of research articles and codeblocks defining common actions in a domain (like prompting a language model). We use this paradigm to conduct hundreds of automated experiments on machine-generated ideas broadly in the domain of agents and virtual environments, with the system returning 19 discoveries, 6 of which were judged as being both at least minimally sound and incrementally novel after a multi-faceted evaluation beyond that typically conducted in prior work, including external (conference-style) review, code review, and replication attempts. Moreover, the discoveries span new tasks, agents, metrics, and data, suggesting a qualitative shift from benchmark optimization to broader discoveries.

Literature-Grounded Novelty Assessment of Scientific Ideas
Simra Shahid | Marissa Radensky | Raymond Fok | Pao Siangliulue | Daniel S Weld | Tom Hope
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

Automated scientific idea generation systems have made remarkable progress, yet the automatic evaluation of idea novelty remains a critical and underexplored challenge. Manual evaluation of novelty through literature review is labor-intensive, prone to error due to subjectivity, and impractical at scale. To address these issues, we propose the **Idea Novelty Checker**, an LLM-based retrieval-augmented generation (RAG) framework that leverages a two-stage retrieve-then-rerank approach. The Idea Novelty Checker first collects a broad set of relevant papers using keyword and snippet-based retrieval, then refines this collection through embedding-based filtering followed by facet-based LLM re-ranking. It incorporates expert-labeled examples to guide the system in comparing papers for novelty evaluation and in generating literature-grounded reasoning. Our extensive experiments demonstrate that our novelty checker achieves approximately 13% higher agreement than existing approaches. Ablation studies further showcases the importance of the facet-based re-ranker in identifying the most relevant literature for novelty evaluation.

2024

SciMON: Scientific Inspiration Machines Optimized for Novelty
Qingyun Wang | Doug Downey | Heng Ji | Tom Hope
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We explore and enhance the ability of neural language models to generate novel scientific directions grounded in literature. Work on literature-based hypothesis generation has traditionally focused on binary link prediction—severely limiting the expressivity of hypotheses. This line of work also does not focus on optimizing novelty. We take a dramatic departure with a novel setting in which models use as input background contexts (e.g., problems, experimental settings, goals), and output natural language ideas grounded in literature. We present SciMON, a modeling framework that uses retrieval of “inspirations” from past scientific papers, and explicitly optimizes for novelty by iteratively comparing to prior papers and updating idea suggestions until sufficient novelty is achieved. Comprehensive evaluations reveal that GPT-4 tends to generate ideas with overall low technical depth and novelty, while our methods partially mitigate this issue. Our work represents a first step toward evaluating and developing language models that generate new ideas derived from the scientific literature. Code, data, and resources are publicly available for research purposes: https://github.com/eaglew/clbd.

ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
Mike D’Arcy | Alexis Ross | Erin Bransom | Bailey Kuehl | Jonathan Bragg | Tom Hope | Doug Downey
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce the task of automatically revising scientific papers based on peer feedback and release ARIES, a dataset of review comments and their corresponding paper edits. The data is drawn from real reviewer-author interactions from computer science, and we provide labels linking each reviewer comment to the specific paper edits made by the author in response. We automatically create a high-precision silver training set, as well as an expert-labeled test set that shows high inter-annotator agreement. In experiments with 10 models covering the state of the art, we find that they struggle even to identify which edits correspond to a comment—especially when the relationship between the edit and the comment is indirect and requires reasoning to uncover. We also extensively analyze GPT-4’s ability to generate edits given a comment and the original paper. We find that it often succeeds on a superficial level, but tends to rigidly follow the wording of the feedback rather than the underlying intent, and lacks technical details compared to human-written edits.

Parameter-Efficient Fine-Tuning of LLaMA for the Clinical Domain
Aryo Gema | Pasquale Minervini | Luke Daines | Tom Hope | Beatrice Alex
Proceedings of the 6th Clinical Natural Language Processing Workshop

Adapting pretrained language models to novel domains, such as clinical applications, traditionally involves retraining their entire set of parameters. Parameter-Efficient Fine-Tuning (PEFT) techniques for fine-tuning language models significantly reduce computational requirements by selectively fine-tuning small subsets of parameters. In this study, we propose a two-step PEFT framework and evaluate it in the clinical domain. Our approach combines a specialised PEFT adapter layer designed for clinical domain adaptation with another adapter specialised for downstream tasks. We evaluate the framework on multiple clinical outcome prediction datasets, comparing it to clinically trained language models. Our framework achieves a better AUROC score averaged across all clinical downstream tasks compared to clinical language models. In particular, we observe large improvements of 4-5% AUROC in large-scale multilabel classification tasks, such as diagnoses and procedures classification. To our knowledge, this study is the first to provide an extensive empirical analysis of the interplay between PEFT techniques and domain adaptation in an important real-world domain of clinical applications.

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Delia Irazu Hernandez Farias | Tom Hope | Manling Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

CARE: Extracting Experimental Findings From Clinical Literature
Aakanksha Naik | Bailey Kuehl | Erin Bransom | Doug Downey | Tom Hope
Findings of the Association for Computational Linguistics: NAACL 2024

Extracting fine-grained experimental findings from literature can provide dramatic utility for scientific applications. Prior work has developed annotation schemas and datasets for limited aspects of this problem, failing to capture the real-world complexity and nuance required. Focusing on biomedicine, this work presents CARE—a new IE dataset for the task of extracting clinical findings. We develop a new annotation schema capturing fine-grained findings as n-ary relations between entities and attributes, which unifies phenomena challenging for current IE systems such as discontinuous entity spans, nested relations, variable arity n-ary relations and numeric results in a single schema. We collect extensive annotations for 700 abstracts from two sources: clinical trials and case reports. We also demonstrate the generalizability of our schema to the computer science and materials science domains. We benchmark state-of-the-art IE systems on CARE, showing that even models such as GPT4 struggle. We release our resources to advance research on extracting and aggregating literature findings.

Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)
Carl Edwards | Qingyun Wang | Manling Li | Lawrence Zhao | Tom Hope | Heng Ji
Proceedings of the 1st Workshop on Language + Molecules (L+M 2024)

Towards a Human-Computer Collaborative Scientific Paper Lifecycle: A Pilot Study and Hands-On Tutorial
Qingyun Wang | Carl Edwards | Heng Ji | Tom Hope
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024): Tutorial Summaries

Due to the rapid growth of publications varying in quality, there exists a pressing need to help scientists digest and evaluate relevant papers, thereby facilitating scientific discovery. This creates a number of urgent questions; however, computer-human collaboration in the scientific paper lifecycle is still in the exploratory stage and lacks a unified framework for analyzing the relevant tasks. Additionally, with the recent significant success of large language models (LLMs), they have increasingly played an important role in academic writing. In this cutting-edge tutorial, we aim to provide an all-encompassing overview of the paper lifecycle, detailing how machines can augment every stage of the research process for the scientist, including scientific literature understanding, experiment development, manuscript draft writing, and finally draft evaluation. This tutorial is devised for researchers interested in this rapidly-developing field of NLP-augmented paper writing. The tutorial will also feature a session of hands-on exercises during which participants can guide machines in generating ideas and automatically composing key paper elements. Furthermore, we will address current challenges, explore future directions, and discuss potential ethical issues. A toolkit designed for human-computer collaboration throughout the paper lifecycle will also be made publically available.

On-the-fly Definition Augmentation of LLMs for Biomedical NER
Monica Munnangi | Sergey Feldman | Byron Wallace | Silvio Amir | Tom Hope | Aakanksha Naik
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Despite their general capabilities, LLMs still struggle on biomedicalNER tasks, which are difficult due to the presence of specialized terminology and lack of training data. In this work we set out to improve LLM performance on biomedical NER in limited data settings via a new knowledge augmentation approach which incorporates definitions of relevant concepts on-the-fly. During this process, to provide a test bed for knowledge augmentation, we perform a comprehensive exploration of prompting strategies. Our experiments show that definition augmentation is useful for both open source and closed LLMs.For example, it leads to a relative improvement of 15% (on average) in GPT-4 performance (F1) across all (six) of our test datasets. We conduct extensive ablations and analyses to demonstrate that our performance improvements stem from adding relevant definitional knowledge. We find that careful prompting strategies also improve LLM performance, allowing them to outperform fine-tuned language models in few-shot settings. To facilitate future research in this direction, we release our code at https://github.com/allenai/beacon.

2023

CHAMP: Efficient Annotation and Consolidation of Cluster Hierarchies
Arie Cattan | Tom Hope | Doug Downey | Roy Bar-Haim | Lilach Eden | Yoav Kantor | Ido Dagan
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Various NLP tasks require a complex hierarchical structure over nodes, where each node is a cluster of items. Examples include generating entailment graphs, hierarchical cross-document coreference resolution, annotating event and subevent relations, etc. To enable efficient annotation of such hierarchical structures, we release CHAMP, an open source tool allowing to incrementally construct both clusters and hierarchy simultaneously over any type of texts. This incremental approach significantly reduces annotation time compared to the common pairwise annotation approach and also guarantees maintaining transitivity at the cluster and hierarchy levels. Furthermore, CHAMP includes a consolidation mode, where an adjudicator can easily compare multiple cluster hierarchy annotations and resolve disagreements.

Beyond Good Intentions: Reporting the Research Landscape of NLP for Social Good
Fernando Adauto | Zhijing Jin | Bernhard Schölkopf | Tom Hope | Mrinmaya Sachan | Rada Mihalcea
Findings of the Association for Computational Linguistics: EMNLP 2023

With the recent advances in natural language processing (NLP), a vast number of applications have emerged across various use cases. Among the plethora of NLP applications, many academic researchers are motivated to do work that has a positive social impact, in line with the recent initiatives of NLP for Social Good (NLP4SG). However, it is not always obvious to researchers how their research efforts are tackling today’s big social problems. Thus, in this paper, we introduce NLP4SGPapers, a scientific dataset with three associated tasks that can help identify NLP4SG papers and characterize the NLP4SG landscape by: (1) identifying the papers that address a social problem, (2) mapping them to the corresponding UN Sustainable Development Goals (SDGs), and (3) identifying the task they are solving and the methods they are using. Using state-of-the-art NLP models, we address each of these tasks and use them on the entire ACL Anthology, resulting in a visualization workspace that gives researchers a comprehensive overview of the field of NLP4SG. Our website is available at https://nlp4sg.vercel.app . We released our data at https://huggingface.co/datasets/feradauto/NLP4SGPapers and code at https://github.com/feradauto/nlp4sg

2022

ACCoRD: A Multi-Document Approach to Generating Diverse Descriptions of Scientific Concepts
Sonia Murthy | Kyle Lo | Daniel King | Chandra Bhagavatula | Bailey Kuehl | Sophie Johnson | Jonathan Borchardt | Daniel Weld | Tom Hope | Doug Downey
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Systems that automatically define unfamiliar terms hold the promise of improving the accessibility of scientific texts, especially for readers who may lack prerequisite background knowledge. However, current systems assume a single “best” description per concept, which fails to account for the many ways a concept can be described. We present ACCoRD, an end-to-end system tackling the novel task of generating sets of descriptions of scientific concepts. Our system takes advantage of the myriad ways a concept is mentioned across the scientific literature to produce distinct, diverse descriptions oftarget concepts in terms of different reference concepts. In a user study, we find that users prefer (1) descriptions produced by our end-to-end system, and (2) multiple descriptions to a single “best” description. We release the ACCoRD corpus which includes 1,275 labeled contexts and 1,787 expert-authored concept descriptions to support research on our task.

Literature-Augmented Clinical Outcome Prediction
Aakanksha Naik | Sravanthi Parasa | Sergey Feldman | Lucy Lu Wang | Tom Hope
Findings of the Association for Computational Linguistics: NAACL 2022

We present BEEP (Biomedical Evidence-Enhanced Predictions), a novel approach for clinical outcome prediction that retrieves patient-specific medical literature and incorporates it into predictive models. Based on each individual patient’s clinical notes, we train language models (LMs) to find relevant papers and fuse them with information from notes to predict outcomes such as in-hospital mortality. We develop methods to retrieve literature based on noisy, information-dense patient notes, and to augment existing outcome prediction models with retrieved papers in a manner that maximizes predictive accuracy. Our approach boosts predictive performance on three important clinical tasks in comparison to strong recent LM baselines, increasing F1 by up to 5 points and precision@Top-K by a large margin of over 25%.

A Dataset for N-ary Relation Extraction of Drug Combinations
Aryeh Tiktinsky | Vijay Viswanathan | Danna Niezni | Dana Meron Azagury | Yosi Shamay | Hillel Taub-Tabib | Tom Hope | Yoav Goldberg
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Combination therapies have become the standard of care for diseases such as cancer, tuberculosis, malaria and HIV. However, the combinatorial set of available multi-drug treatments creates a challenge in identifying effective combination therapies available in a situation. To assist medical professionals in identifying beneficial drug-combinations, we construct an expert-annotated dataset for extracting information about the efficacy of drug combinations from the scientific literature. Beyond its practical utility, the dataset also presents a unique NLP challenge, as the first relation extraction dataset consisting of variable-length relations. Furthermore, the relations in this dataset predominantly require language understanding beyond the sentence level, adding to the challenge of this task. We provide a promising baseline model and identify clear areas for further improvement. We release our dataset (https://huggingface.co/datasets/allenai/drug-combo-extraction), code (https://github.com/allenai/drug-combo-extraction) and baseline models (https://huggingface.co/allenai/drug-combo-classifier-pubmedbert-dapt) publicly to encourage the NLP community to participate in this task.

Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
Sheshera Mysore | Arman Cohan | Tom Hope
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally-occurring source of supervision: sentences in the full-text of papers that cite multiple papers together (co-citations). Such co-citations not only reflect close paper relatedness, but also provide textual descriptions of how the co-cited papers are related. This novel form of textual supervision is used for learning to match aspects across papers. We develop multi-vector representations where vectors correspond to sentence-level aspects of documents, and present two methods for aspect matching: (1) A fast method that only matches single aspects, and (2) a method that makes sparse multiple matches with an Optimal Transport mechanism that computes an Earth Mover’s Distance between aspects. Our approach improves performance on document similarity tasks in four datasets. Further, our fast single-match method achieves competitive results, paving the way for applying fine-grained similarity to large scientific corpora.

2021

Extracting a Knowledge Base of Mechanisms from COVID-19 Papers
Tom Hope | Aida Amini | David Wadden | Madeleine van Zuylen | Sravanthi Parasa | Eric Horvitz | Daniel Weld | Roy Schwartz | Hannaneh Hajishirzi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The COVID-19 pandemic has spawned a diverse body of scientific literature that is challenging to navigate, stimulating interest in automated tools to help find useful knowledge. We pursue the construction of a knowledge base (KB) of mechanisms—a fundamental concept across the sciences, which encompasses activities, functions and causal relations, ranging from cellular processes to economic impacts. We extract this information from the natural language of scientific papers by developing a broad, unified schema that strikes a balance between relevance and breadth. We annotate a dataset of mechanisms with our schema and train a model to extract mechanism relations from papers. Our experiments demonstrate the utility of our KB in supporting interdisciplinary scientific search over COVID-19 literature, outperforming the prominent PubMed search in a study with clinical experts. Our search engine, dataset and code are publicly available.

2020

Language (Re)modelling: Towards Embodied Language Understanding
Ronen Tamari | Chen Shani | Tom Hope | Miriam R L Petruck | Omri Abend | Dafna Shahaf
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

While natural language understanding (NLU) is advancing rapidly, today’s technology differs from human-like language understanding in fundamental ways, notably in its inferior efficiency, interpretability, and generalization. This work proposes an approach to representation and learning based on the tenets of embodied cognitive linguistics (ECL). According to ECL, natural language is inherently executable (like programming languages), driven by mental simulation and metaphoric mappings over hierarchical compositions of structures and schemata learned through embodied interaction. This position paper argues that the use of grounding by metaphoric reasoning and simulation will greatly benefit NLU systems, and proposes a system architecture along with a roadmap towards realizing this vision.

SciSight: Combining faceted navigation and research group detection for COVID-19 exploratory scientific search
Tom Hope | Jason Portenoy | Kishore Vasan | Jonathan Borchardt | Eric Horvitz | Daniel Weld | Marti Hearst | Jevin West
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

The COVID-19 pandemic has sparked unprecedented mobilization of scientists, generating a deluge of papers that makes it hard for researchers to keep track and explore new directions. Search engines are designed for targeted queries, not for discovery of connections across a corpus. In this paper, we present SciSight, a system for exploratory search of COVID-19 research integrating two key capabilities: first, exploring associations between biomedical facets automatically extracted from papers (e.g., genes, drugs, diseases, patient outcomes); second, combining textual and network information to search and visualize groups of researchers and their ties. SciSight has so far served over 15K users with over 42K page views and 13% returns.

Co-authors

Jonathan Borchardt 2

Bhavana Dalvi 2

Sergey Feldman 2

Hannaneh Hajishirzi 2

Bodhisattwa Prasad Majumder 2

Sravanthi Parasa 2

Marissa Radensky 2

Pao Siangliulue 2

Fernando Adauto 1

Beatrice Alex 1

Nitzan Barzilay 1

Chandra Bhagavatula 1

Jonathan Bragg 1

Mike D’Arcy 1

Yoav Goldberg 1

Marti A. Hearst 1

Delia Irazú Hernández Farías 1

Sophie Johnson 1

Dana Meron Azagury 1

Rada Mihalcea 1

Pasquale Minervini 1

Jacob Morrison 1

Monica Munnangi 1

Sheshera Mysore 1

Miriam R. L. Petruck 1

Jason Portenoy 1

Mrinmaya Sachan 1

Bernhard Schölkopf 1

Shannon Zejiang Shen 1

Luca Soldaini 1

Noy Sternlicht 1

Oyvind Tafjord 1

Hillel Taub-Tabib 1

Aryeh Tiktinsky 1

Harsh Trivedi 1

Kishore Vasan 1

Vijay Viswanathan 1

Byron C. Wallace 1

Lawrence Zhao 1

Madeleine van Zuylen 1

Venues