Gus Hahn-Powell

Also published as: Gustave Hahn-Powell


2024

Active Learning Design Choices for NER with Transformers
Robert Vacareanu | Enrique Noriega-Atala | Gus Hahn-Powell | Marco A. Valenzuela-Escarcega | Mihai Surdeanu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We explore multiple important choices that have not been analyzed in conjunction regarding active learning for token classification using transformer networks. These choices are: (i) how to select what to annotate, (ii) whether to annotate entire sentences or smaller sentence fragments, (iii) how to train with incomplete annotations at the token level, and (iv) how to select the initial seed dataset. We explore whether annotating at sub-sentence level can translate to improved downstream performance by considering two different sub-sentence annotation strategies: (i) entity-level, and (ii) token-level. These approaches result in some sentences being only partially annotated. To address this issue, we introduce and evaluate multiple strategies to deal with partially-annotated sentences during the training process. We show that annotating at the sub-sentence level achieves comparable or better performance than sentence-level annotations with a smaller number of annotated tokens. We then explore the extent to which the performance gap remains once accounting for the annotation time and find that both annotation schemes perform similarly.

2023

Proceedings of the 2nd Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Mihai Surdeanu | Ellen Riloff | Laura Chiticariu | Dayne Freitag | Gus Hahn-Powell | Clayton T. Morrison | Enrique Noriega-Atala | Rebecca Sharp | Marco Valenzuela-Escarcega
Proceedings of the 2nd Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning

2022

From Examples to Rules: Neural Guided Rule Synthesis for Information Extraction
Robert Vacareanu | Marco A. Valenzuela-Escárcega | George Caique Gouveia Barbosa | Rebecca Sharp | Gustave Hahn-Powell | Mihai Surdeanu
Proceedings of the Thirteenth Language Resources and Evaluation Conference

While deep learning approaches to information extraction have had many successes, they can be difficult to augment or maintain as needs shift. Rule-based methods, on the other hand, can be more easily modified. However, crafting rules requires expertise in linguistics and the domain of interest, making it infeasible for most users. Here we attempt to combine the advantages of these two directions while mitigating their drawbacks. We adapt recent advances from the adjacent field of program synthesis to information extraction, synthesizing rules from provided examples. We use a transformer-based architecture to guide an enumerative search, and show that this reduces the number of steps that need to be explored before a rule is found. Further, we show that without training the synthesis algorithm on the specific domain, our synthesized rules achieve state-of-the-art performance on the 1-shot scenario of a task that focuses on few-shot learning for relation classification, and competitive performance in the 5-shot scenario.

A Human-machine Interface for Few-shot Rule Synthesis for Information Extraction
Robert Vacareanu | George C.G. Barbosa | Enrique Noriega-Atala | Gus Hahn-Powell | Rebecca Sharp | Marco A. Valenzuela-Escárcega | Mihai Surdeanu
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: System Demonstrations

We propose a system that assists a user in constructing transparent information extraction models, consisting of patterns (or rules) written in a declarative language, through program synthesis. Users of our system can specify their requirements through the use of examples, which are collected with a search interface. The rule-synthesis system proposes rule candidates and the results of applying them on a textual corpus; the user has the option to accept the candidate, request another option, or adjust the examples provided to the system. Through an interactive evaluation, we show that our approach generates high-precision rules even in a 1-shot setting. On a second evaluation on a widely-used relation extraction dataset (TACRED), our method generates rules that considerably outperform manually written patterns. Our code, demo, and documentation are available at https://clulab.github.io/odinsynth.

Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning
Laura Chiticariu | Yoav Goldberg | Gus Hahn-Powell | Clayton T. Morrison | Aakanksha Naik | Rebecca Sharp | Mihai Surdeanu | Marco Valenzuela-Escárcega | Enrique Noriega-Atala
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning

Syntax-driven Data Augmentation for Named Entity Recognition
Arie Sutiono | Gus Hahn-Powell
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning

In low-resource settings, data augmentation strategies are commonly leveraged to improve performance. Numerous approaches have attempted document-level augmentation (e.g., text classification), but few studies have explored token-level augmentation. Performed naively, data augmentation can produce semantically incongruent and ungrammatical examples. In this work, we compare simple masked language model replacement and an augmentation method using constituency tree mutations to improve the performance of named entity recognition in low-resource settings, with the aim of preserving linguistic cohesion of the augmented sentences.

Neural-Guided Program Synthesis of Information Extraction Rules Using Self-Supervision
Enrique Noriega-Atala | Robert Vacareanu | Gus Hahn-Powell | Marco A. Valenzuela-Escárcega
Proceedings of the First Workshop on Pattern-based Approaches to NLP in the Age of Deep Learning

We propose a neural-based approach for rule synthesis designed to help bridge the gap between the interpretability, precision and maintainability exhibited by rule-based information extraction systems and the scalability and convenience of statistical information extraction systems. This is achieved by avoiding placing the burden of learning another specialized language on domain experts and instead asking them to provide a small set of examples in the form of highlighted spans of text. We introduce a transformer-based architecture that drives a rule synthesis system that leverages a self-supervised approach for pre-training a large-scale language model, complemented by an analysis of different loss functions and aggregation mechanisms for variable-length sequences of user-annotated spans of text. The results are encouraging and point to different desirable properties, such as speed and quality, depending on the choice of loss and aggregation method.

2021

Country-level Arabic Dialect Identification using RNNs with and without Linguistic Features
Elsayed Issa | Mohammed AlShakhori | Reda Al-Bahrani | Gus Hahn-Powell
Proceedings of the Sixth Arabic Natural Language Processing Workshop

This work investigates the value of augmenting recurrent neural networks with feature engineering for the Second Nuanced Arabic Dialect Identification (NADI) Subtask 1.2: Country-level DA identification. We compare the performance of a simple word-level LSTM using pretrained embeddings with one enhanced using feature embeddings for engineered linguistic features. Our results show that the addition of explicit features to the LSTM is detrimental to performance. We attribute this performance loss to the bivalency of some linguistic items in some text, ubiquity of topics, and participant mobility.

2020

Exploring Interpretability in Event Extraction: Multitask Learning of a Neural Event Classifier and an Explanation Decoder
Zheng Tang | Gus Hahn-Powell | Mihai Surdeanu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

We propose an interpretable approach for event extraction that mitigates the tension between generalization and interpretability by jointly training for the two goals. Our approach uses an encoder-decoder architecture, which jointly trains a classifier for event extraction, and a rule decoder that generates syntactico-semantic rules that explain the decisions of the event classifier. We evaluate the proposed approach on three biomedical events and show that the decoder generates interpretable rules that serve as accurate explanations for the event classifier’s decisions, and, importantly, that the joint training generally improves the performance of the event classifier. Lastly, we show that our approach can be used for semi-supervised learning, and that its performance improves when trained on automatically-labeled data generated by a rule-based system.

Odinson: A Fast Rule-based Information Extraction Framework
Marco A. Valenzuela-Escárcega | Gus Hahn-Powell | Dane Bell
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present Odinson, a rule-based information extraction framework, which couples a simple yet powerful pattern language that can operate over multiple representations of text, with a runtime system that operates in near real time. In the Odinson query language, a single pattern may combine regular expressions over surface tokens with regular expressions over graphs such as syntactic dependencies. To guarantee the rapid matching of these patterns, our framework indexes most of the necessary information for matching patterns, including directed graphs such as syntactic dependencies, into a custom Lucene index. Indexing minimizes the amount of expensive pattern matching that must take place at runtime. As a result, the runtime system matches a syntax-based graph traversal in 2.8 seconds in a corpus of over 134 million sentences, nearly 150,000 times faster than its predecessor.

2019

Enabling Search and Collaborative Assembly of Causal Interactions Extracted from Multilingual and Multi-domain Free Text
George C. G. Barbosa | Zechy Wong | Gus Hahn-Powell | Dane Bell | Rebecca Sharp | Marco A. Valenzuela-Escárcega | Mihai Surdeanu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

Many of the most pressing current research problems (e.g., public health, food security, or climate change) require multi-disciplinary collaborations. In order to facilitate this process, we propose a system that incorporates multi-domain extractions of causal interactions into a single searchable knowledge graph. Our system enables users to search iteratively over direct and indirect connections in this knowledge graph, and collaboratively build causal models in real time. To enable the aggregation of causal information from multiple languages, we extend an open-domain machine reader to Portuguese. The new Portuguese reader extracts over 600 thousand causal statements from 120 thousand Portuguese publications with a precision of 62%, which demonstrates the value of mining multilingual scientific information.

2018

Text Annotation Graphs: Annotating Complex Natural Language Phenomena
Angus Forbes | Kristine Lee | Gus Hahn-Powell | Marco A. Valenzuela-Escárcega | Mihai Surdeanu
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Scientific Discovery as Link Prediction in Influence and Citation Graphs
Fan Luo | Marco A. Valenzuela-Escárcega | Gus Hahn-Powell | Mihai Surdeanu
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)

We introduce a machine learning approach for the identification of “white spaces” in scientific knowledge. Our approach addresses this task as link prediction over a graph that contains over 2M influence statements such as “CTCF activates FOXA1”, which were automatically extracted using open-domain machine reading. We model this prediction task using graph-based features extracted from the above influence graph, as well as from a citation graph that captures scientific communities. We evaluated the proposed approach through backtesting. Although the data is heavily unbalanced (50 times more negative examples than positives), our approach predicts which influence links will be discovered in the “near future” with an F1 score of 27 points, and a mean average precision of 68%.

2017

Swanson linking revisited: Accelerating literature-based discovery across domains using a conceptual influence graph
Gus Hahn-Powell | Marco A. Valenzuela-Escárcega | Mihai Surdeanu
Proceedings of ACL 2017, System Demonstrations

2016

Sieve-based Coreference Resolution in the Biomedical Domain
Dane Bell | Gus Hahn-Powell | Marco A. Valenzuela-Escárcega | Mihai Surdeanu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe challenges and advantages unique to coreference resolution in the biomedical domain, and a sieve-based architecture that leverages domain knowledge for both entity and event coreference resolution. Domain-general coreference resolution algorithms perform poorly on biomedical documents, because the cues they rely on, such as gender, are largely absent in this domain, and because they do not encode domain-specific knowledge such as the number and type of participants required in chemical reactions. Moreover, it is difficult to directly encode this knowledge into most coreference resolution algorithms because they are not rule-based. Our rule-based architecture uses sequentially applied hand-designed “sieves”, with the output of each sieve informing and constraining subsequent sieves. This architecture provides a 3.2% increase in throughput to our Reach event extraction system with precision parallel to that of the stricter system that relies solely on syntactic patterns for extraction.

Odin’s Runes: A Rule Language for Information Extraction
Marco A. Valenzuela-Escárcega | Gus Hahn-Powell | Mihai Surdeanu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Odin is an information extraction framework that applies cascades of finite state automata over both surface text and syntactic dependency graphs. Support for syntactic patterns allows us to concisely define relations that are otherwise difficult to express in languages such as Common Pattern Specification Language (CPSL), which are currently limited to shallow linguistic features. The interaction of lexical and syntactic automata provides robustness and flexibility when writing extraction rules. This paper describes Odin’s declarative language for writing these cascaded automata.

SnapToGrid: From Statistical to Interpretable Models for Biomedical Information Extraction
Marco A. Valenzuela-Escárcega | Gus Hahn-Powell | Dane Bell | Mihai Surdeanu
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

This before That: Causal Precedence in the Biomedical Domain
Gus Hahn-Powell | Dane Bell | Marco A. Valenzuela-Escárcega | Mihai Surdeanu
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

2015

A Domain-independent Rule-based Framework for Event Extraction
Marco A. Valenzuela-Escárcega | Gus Hahn-Powell | Mihai Surdeanu | Thomas Hicks
Proceedings of ACL-IJCNLP 2015 System Demonstrations

Higher-order Lexical Semantic Models for Non-factoid Answer Reranking
Daniel Fried | Peter Jansen | Gustave Hahn-Powell | Mihai Surdeanu | Peter Clark
Transactions of the Association for Computational Linguistics, Volume 3

Lexical semantic models provide robust performance for question answering, but, in general, can only capitalize on direct evidence seen during training. For example, monolingual alignment models acquire term alignment probabilities from semi-structured data such as question-answer pairs; neural network language models learn term embeddings from unstructured text. All this knowledge is then used to estimate the semantic similarity between question and answer candidates. We introduce a higher-order formalism that allows all these lexical semantic models to chain direct evidence to construct indirect associations between question and answer texts, by casting the task as the traversal of graphs that encode direct term associations. Using a corpus of 10,000 questions from Yahoo! Answers, we experimentally demonstrate that higher-order methods are broadly applicable to alignment and language models, across both word and syntactic representations. We show that an important criterion for success is controlling for the semantic drift that accumulates during graph traversal. All in all, the proposed higher-order approach improves five out of the six lexical semantic models investigated, with relative gains of up to +13% over their first-order variants.