2024
When and Where Did it Happen? An Encoder-Decoder Model to Identify Scenario Context
Enrique Noriega-Atala | Robert Vacareanu | Salena Torres Ashton | Adarsh Pyarelal | Clayton T. Morrison | Mihai Surdeanu
Findings of the Association for Computational Linguistics: EMNLP 2024
We introduce a neural architecture fine-tuned for the task of scenario context generation: identifying the relevant location and time of an event or entity mentioned in text. Contextualizing information extraction helps scope the validity of automated findings when aggregating them into knowledge graphs. Our approach uses a high-quality curated dataset of time and location annotations in a corpus of epidemiology papers to train an encoder-decoder architecture. We also explore the use of data augmentation techniques during training. Our findings suggest that a relatively small fine-tuned encoder-decoder model performs better than out-of-the-box LLMs and semantic role labeling parsers at accurately predicting the relevant scenario information for a particular entity or event.
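The abstract describes fine-tuning an encoder-decoder model to generate a mention's time and location as text. A minimal sketch of what such fine-tuning might look like with a generic Hugging Face seq2seq model follows; the model choice (t5-small), the event markup, and the prompt/target formats are illustrative assumptions, not the paper's actual setup.

```python
# Sketch: fine-tune a seq2seq model to generate scenario context
# (time/location) for a marked mention. Model, markup, and toy data
# are illustrative assumptions, not the paper's configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Hypothetical training pair: a sentence with a marked event, and the
# gold scenario context serialized as plain text.
source = ("context: In March 2020, an outbreak was reported in Wuhan. "
          "Cases of <event> pneumonia </event> rose sharply.")
target = "time: March 2020 | location: Wuhan"

inputs = tokenizer(source, return_tensors="pt", truncation=True)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

model.train()
loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
loss.backward()
optimizer.step()

# At inference time, generate the context string for a new mention.
model.eval()
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```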
Active Learning Design Choices for NER with Transformers
Robert Vacareanu | Enrique Noriega-Atala | Gus Hahn-Powell | Marco A. Valenzuela-Escarcega | Mihai Surdeanu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We explore multiple important design choices for active learning in token classification with transformer networks that have not previously been analyzed in conjunction: (i) how to select what to annotate, (ii) whether to annotate entire sentences or smaller sentence fragments, (iii) how to train with incomplete token-level annotations, and (iv) how to select the initial seed dataset. We examine whether annotating at the sub-sentence level can improve downstream performance by considering two sub-sentence annotation strategies: (i) entity-level and (ii) token-level. These approaches leave some sentences only partially annotated. To address this issue, we introduce and evaluate multiple strategies for handling partially annotated sentences during training. We show that annotating at the sub-sentence level achieves comparable or better performance than sentence-level annotation with a smaller number of annotated tokens. We then explore the extent to which the performance gap remains once annotation time is accounted for, and find that both annotation schemes perform similarly.
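Two of these choices lend themselves to a compact sketch: uncertainty-based selection of what to annotate next, and training with partially annotated sentences by masking unlabeled tokens out of the loss. The scoring rule and masking value below follow common practice (PyTorch's cross-entropy ignores label -100); they are illustrative, not the paper's exact configuration.

```python
# Sketch of (i) selecting what to annotate via per-token uncertainty and
# (iii) training with partial annotations by ignoring unlabeled tokens.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy; (seq_len, num_labels) -> (seq_len,)."""
    probs = F.softmax(logits, dim=-1)
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def select_fragments(logits: torch.Tensor, k: int) -> list[int]:
    """Pick the k most uncertain token positions to send to the annotator."""
    return token_entropy(logits).topk(k).indices.tolist()

def partial_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy that skips tokens the annotator never labeled (-100)."""
    return F.cross_entropy(logits, labels, ignore_index=-100)

# Toy example: 6 tokens, 3 labels; only tokens 1 and 4 were annotated.
logits = torch.randn(6, 3)
labels = torch.tensor([-100, 2, -100, -100, 0, -100])
print(select_fragments(logits, k=2), partial_loss(logits, labels).item())
```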
2023
Proceedings of the 2nd Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Mihai Surdeanu | Ellen Riloff | Laura Chiticariu | Dayne Freitag | Gus Hahn-Powell | Clayton T. Morrison | Enrique Noriega-Atala | Rebecca Sharp | Marco Valenzuela-Escarcega
2022
A Human-machine Interface for Few-shot Rule Synthesis for Information Extraction
Robert Vacareanu | George C.G. Barbosa | Enrique Noriega-Atala | Gus Hahn-Powell | Rebecca Sharp | Marco A. Valenzuela-Escárcega | Mihai Surdeanu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations
We propose a system that assists a user in constructing transparent information extraction models, consisting of patterns (or rules) written in a declarative language, through program synthesis. Users of our system specify their requirements through examples, which are collected with a search interface. The rule-synthesis system proposes rule candidates along with the results of applying them to a textual corpus; the user can accept a candidate, request another option, or adjust the examples provided to the system. Through an interactive evaluation, we show that our approach generates high-precision rules even in a 1-shot setting. In a second evaluation on a widely-used relation extraction dataset (TACRED), our method generates rules that considerably outperform manually written patterns. Our code, demo, and documentation are available at https://clulab.github.io/odinsynth.
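The propose/accept/revise loop the demo describes can be sketched in a few lines. The functions below are hypothetical stand-ins for the Odinsynth synthesizer and rule engine, not the system's actual API; the trivial literal-pattern generator only illustrates the control flow.

```python
# Sketch of the interactive loop: the system proposes rule candidates, and
# the user accepts one, asks for another, or revises the examples.
# `propose_rules` and `apply_rule` are hypothetical placeholders.
def propose_rules(examples):
    """Yield candidate surface patterns consistent with the example spans."""
    # Trivial stand-in: one literal pattern per distinct example string.
    for span in dict.fromkeys(examples):
        yield f'[word="{span}"]'

def apply_rule(rule, corpus):
    """Return the corpus sentences a (stand-in) literal rule would match."""
    literal = rule.split('"')[1]
    return [s for s in corpus if literal in s]

def synthesis_session(examples, corpus, accept):
    for rule in propose_rules(examples):
        matches = apply_rule(rule, corpus)
        if accept(rule, matches):      # user inspects matches and decides
            return rule
    return None                        # user may revise examples and retry

corpus = ["the kinase phosphorylates AKT", "AKT binds to the receptor"]
rule = synthesis_session(["AKT"], corpus, accept=lambda r, m: len(m) >= 2)
print(rule)
```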
Low Resource Causal Event Detection from Biomedical Literature
Zhengzhong Liang | Enrique Noriega-Atala | Clayton Morrison | Mihai Surdeanu
Proceedings of the 21st Workshop on Biomedical Language Processing
Recognizing causal precedence relations among the chemical interactions described in biomedical literature is crucial to understanding the underlying biological mechanisms. However, detecting such causal relations can be hard because: (1) they are often not expressed explicitly by specific phrases but implied by very diverse expressions in the text, and (2) annotating datasets for causal relation detection requires considerable expert knowledge and effort. In this paper, we propose a strategy that addresses both challenges by training neural models with in-domain pre-training and knowledge distillation. We show that, using a very limited amount of labeled data and a sufficient amount of unlabeled data, the neural models outperform previous baselines on the causal precedence detection task and are ten times faster at inference than the BERT base model.
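The knowledge-distillation step mentioned above has a standard form: a small student is trained to match a larger teacher's softened output distribution on unlabeled text, alongside the usual cross-entropy on the few labeled examples. A minimal sketch of that loss follows; the temperature and mixing weight are illustrative defaults, not the paper's values.

```python
# Sketch of a knowledge-distillation loss: KL divergence between softened
# teacher and student distributions, optionally mixed with hard-label CE.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None,
                      T: float = 2.0, alpha: float = 0.5):
    # KL between temperature-softened distributions; T**2 rescales gradients.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    if labels is None:              # unlabeled batch: distillation signal only
        return soft
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(4, 2, requires_grad=True)
teacher_logits = torch.randn(4, 2)
print(distillation_loss(student_logits, teacher_logits).item())
```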
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Laura Chiticariu | Yoav Goldberg | Gus Hahn-Powell | Clayton T. Morrison | Aakanksha Naik | Rebecca Sharp | Mihai Surdeanu | Marco Valenzuela-Escárcega | Enrique Noriega-Atala
Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision
Enrique Noriega-Atala | Robert Vacareanu | Gus Hahn-Powell | Marco A. Valenzuela-Escárcega
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
We propose a neural approach to rule synthesis designed to bridge the gap between the interpretability, precision, and maintainability of rule-based information extraction systems and the scalability and convenience of statistical information extraction systems. Rather than placing the burden of learning another specialized language on domain experts, we ask them to provide only a small set of examples in the form of highlighted spans of text. We introduce a transformer-based architecture that drives the rule synthesis system, leveraging a self-supervised approach for pre-training a large-scale language model, complemented by an analysis of different loss functions and aggregation mechanisms for variable-length sequences of user-annotated text spans. The results are encouraging and point to different desirable properties, such as speed and quality, depending on the choice of loss and aggregation method.
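The aggregation question the abstract raises, collapsing a variable-length sequence of encoded, user-annotated spans into one vector that conditions the synthesizer, admits several simple options. The sketch below shows masked mean- and max-pooling, two common choices; the dimensions and the pooling candidates are assumptions for illustration, not the paper's full set of mechanisms.

```python
# Sketch: aggregating a variable number of span encodings into one vector,
# using a padding mask so absent spans do not contribute.
import torch

def mean_pool(span_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """span_embs: (n_spans, dim); mask: (n_spans,) with 1 for real spans."""
    m = mask.unsqueeze(-1).float()
    return (span_embs * m).sum(0) / m.sum().clamp_min(1.0)

def max_pool(span_embs: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Masked positions are pushed to -inf so they never win the max."""
    filled = span_embs.masked_fill(mask.unsqueeze(-1) == 0, float("-inf"))
    return filled.max(dim=0).values

embs = torch.randn(5, 8)              # up to 5 spans, 8-dim encodings
mask = torch.tensor([1, 1, 1, 0, 0])  # only 3 spans actually provided
print(mean_pool(embs, mask).shape, max_pool(embs, mask).shape)
```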
Learning Open Domain Multi-hop Search Using Reinforcement Learning
Enrique Noriega-Atala | Mihai Surdeanu | Clayton Morrison
Proceedings of the Workshop on Structured and Unstructured Knowledge Integration (SUKI)
We propose a method for teaching an automated agent to search for multi-hop paths of relations between entities in an open domain. The method learns a policy for directing existing information retrieval and machine reading resources to focus on relevant regions of a corpus. The approach formulates the learning problem as a Markov decision process with a state representation that encodes the dynamics of the search process, and a reward structure that minimizes the number of documents that must be processed while still finding multi-hop paths. We implement the method with an actor-critic reinforcement learning algorithm and evaluate it on a dataset of search problems derived from a subset of English Wikipedia. The algorithm finds a family of policies that succeed in extracting the desired information while processing fewer documents than several baseline heuristic algorithms.
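An actor-critic update for such a search MDP can be sketched compactly: the critic estimates the value of the search state, and the actor is pushed toward actions whose observed return beats that estimate. The state/action encodings and the reward (a per-document penalty) below are schematic stand-ins, not the paper's formulation.

```python
# Sketch of one advantage actor-critic step for a learned search policy.
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 16, 4  # illustrative sizes

actor = nn.Linear(STATE_DIM, N_ACTIONS)   # logits over retrieval/query actions
critic = nn.Linear(STATE_DIM, 1)          # state-value estimate
opt = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(state, action, ret):
    """One update for a visited (state, action, observed return) triple."""
    value = critic(state).squeeze(-1)
    advantage = ret - value
    log_prob = torch.log_softmax(actor(state), dim=-1)[action]
    # Policy gradient weighted by advantage, plus critic regression loss.
    loss = -(advantage.detach() * log_prob) + advantage.pow(2)
    opt.zero_grad(); loss.backward(); opt.step()

state = torch.randn(STATE_DIM)
update(state, action=2, ret=torch.tensor(-3.0))  # e.g., 3 documents read
```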
2019
Understanding the Polarity of Events in the Biomedical Literature: Deep Learning vs. Linguistically-informed Methods
Enrique Noriega-Atala | Zhengzhong Liang | John Bachman | Clayton Morrison | Mihai Surdeanu
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
An important task in the machine reading of biochemical events expressed in biomedical texts is correctly identifying their polarity, i.e., determining whether a biochemical event is a promotion or an inhibition. Here we present a novel dataset for studying polarity attribution accuracy. We use this dataset to train and evaluate several deep learning models for polarity identification, and compare them to a linguistically-informed model. The best-performing deep learning architecture achieves an average F1 of 0.968 in a five-fold cross-validation study, a considerable improvement over the linguistically-informed model's average F1 of 0.862.
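The evaluation protocol reported here, average F1 over five-fold cross-validation, is easy to reproduce in outline. The sketch below uses a bag-of-words logistic-regression stand-in and made-up toy examples purely to illustrate the protocol; it is not the paper's deep architecture or dataset.

```python
# Sketch: five-fold cross-validated F1 for a binary polarity classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical polarity examples: 1 = promotion, 0 = inhibition.
texts = ["A activates B", "A promotes B", "A induces B",
         "A inhibits B", "A suppresses B", "A blocks B"] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(clf, texts, labels, cv=5, scoring="f1")
print(f"average F1 over 5 folds: {scores.mean():.3f}")
```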
2017
Learning what to read: Focused machine reading
Enrique Noriega-Atala | Marco A. Valenzuela-Escárcega | Clayton Morrison | Mihai Surdeanu
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Recent efforts in bioinformatics have achieved tremendous progress in the machine reading of biomedical literature and the assembly of the extracted biochemical interactions into large-scale models such as protein signaling pathways. However, batch machine reading of literature at today's scale (PubMed alone indexes over 1 million papers per year) is infeasible due to both cost and processing overhead. In this work, we introduce a focused reading approach that guides machine reading toward the literature that should be read to answer a biomedical query as efficiently as possible. We introduce a family of algorithms for focused reading, including an intuitive, strong baseline and a second approach that uses a reinforcement learning (RL) framework to learn when to explore (widen the search) or exploit (narrow it). We demonstrate that the RL approach answers more queries than the baseline while being more efficient, i.e., reading fewer documents.
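The explore/exploit decision at the core of focused reading can be illustrated with the simplest possible policy, epsilon-greedy over the entities discovered so far. This is a sketch of the decision structure only; the paper learns this choice with reinforcement learning rather than fixing epsilon, and the frontier representation below is an assumption.

```python
# Sketch: at each step, either widen the search (explore) or follow the
# most promising known entity (exploit), here via a fixed epsilon.
import random

def choose_action(frontier, epsilon=0.2):
    """frontier: list of (entity, score) pairs found so far."""
    if not frontier or random.random() < epsilon:
        return "explore"                       # widen: issue a broader query
    return max(frontier, key=lambda e: e[1])   # exploit: follow best entity

frontier = [("RAS", 0.8), ("MEK", 0.6)]
for _ in range(3):
    print(choose_action(frontier))
```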