Steven Bethard

2025

Proceedings of the 7th Clinical Natural Language Processing Workshop
Asma Ben Abacha | Steven Bethard | Danielle Bitterman | Tristan Naumann | Kirk Roberts
Proceedings of the 7th Clinical Natural Language Processing Workshop

Transformer-Based Temporal Information Extraction and Application: A Review
Xin Su | Phillip Howard | Steven Bethard
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Temporal information extraction (IE) aims to extract structured temporal information from unstructured text, thereby uncovering the implicit timelines within. This technique is applied across domains such as healthcare, newswire, and intelligence analysis, aiding models in these areas to perform temporal reasoning and enabling human users to grasp the temporal structure of text. Transformer-based pre-trained language models have produced revolutionary advancements in natural language processing, demonstrating exceptional performance across a multitude of tasks. Despite the achievements garnered by Transformer-based approaches in temporal IE, there is a lack of comprehensive reviews on these endeavors. In this paper, we aim to bridge this gap by systematically summarizing and analyzing the body of work on temporal IE using Transformers while highlighting potential future research directions.

Applying Transformer Architectures to Detect Cynical Comments in Spanish Social Media
Samuel Gonzalez-Lopez | Steven Bethard | Rogelio Platt-Molina | Francisca Orozco
Proceedings of the Tenth Workshop on Noisy and User-generated Text

Detecting cynical comments in online communication poses a significant challenge in human-computer interaction, especially given the massive proliferation of discussions on platforms like YouTube. These comments often include offensive or disruptive patterns, such as sarcasm, negative feelings, specific reasons, and an attitude of being right. To address this problem, we present a web platform for the Spanish language that leverages natural language processing and machine learning techniques. The platform detects cynical comments and provides valuable information to users by analyzing them. The core models are based on pre-trained architectures, including BETO, SpanBERTa, Multilingual BERT, RoBERTuito, and BERT, enabling robust detection of cynical comments. Our platform was trained and tested with Spanish comments from car analysis channels on YouTube. The results show that models achieve performance above 0.8 F1 for all types of cynical comments in the text classification task but achieve lower performance (around 0.6-0.7 F1) for the more arduous token classification task.

2024

Proceedings of the 6th Clinical Natural Language Processing Workshop
Tristan Naumann | Asma Ben Abacha | Steven Bethard | Kirk Roberts | Danielle Bitterman
Proceedings of the 6th Clinical Natural Language Processing Workshop

Findings of the Association for Computational Linguistics: NAACL 2024
Kevin Duh | Helena Gomez | Steven Bethard
Findings of the Association for Computational Linguistics: NAACL 2024

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Kevin Duh | Helena Gomez | Steven Bethard
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Semi-Structured Chain-of-Thought: Integrating Multiple Sources of Knowledge for Improved Language Model Reasoning
Xin Su | Tiep Le | Steven Bethard | Phillip Howard
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

An important open question in the use of large language models for knowledge-intensive tasks is how to effectively integrate knowledge from three sources: the model’s parametric memory, external structured knowledge, and external unstructured knowledge. Most existing prompting methods either rely on one or two of these sources, or require repeatedly invoking large language models to generate similar or identical content. In this work, we overcome these limitations by introducing a novel semi-structured prompting approach that seamlessly integrates the model’s parametric memory with unstructured knowledge from text documents and structured knowledge from knowledge graphs. Experimental results on open-domain multi-hop question answering datasets demonstrate that our prompting method significantly surpasses existing techniques, even exceeding those that require fine-tuning.

Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
Kevin Duh | Helena Gomez | Steven Bethard
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Improving Toponym Resolution by Predicting Attributes to Constrain Geographical Ontology Entries
Zeyu Zhang | Egoitz Laparra | Steven Bethard
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Geocoding is the task of converting location mentions in text into structured geospatial data. We propose a new prompt-based paradigm for geocoding, where the machine learning algorithm encodes only the location mention and its context. We design a transformer network for predicting the country, state, and feature class of a location mention, and a deterministic algorithm that leverages the country, state, and feature class predictions as constraints in a search for compatible entries in the ontology. Our architecture, GeoPLACE, achieves new state-of-the-art performance on multiple datasets. Code and models are available at https://github.com/clulab/geonorm.

Metadata Enhancement Using Large Language Models
Hyunju Song | Steven Bethard | Andrea Thomer
Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)

In the natural sciences, a common form of scholarly document is a physical sample record, which provides categorical and textual metadata for specimens collected and analyzed for scientific research. Physical sample archives like museums and repositories publish these records in data repositories to support reproducible science and enable the discovery of physical samples. However, the success of resource discovery in such interfaces depends on the completeness of the sample records. We investigate approaches for automatically completing the scientific metadata fields of sample records. We apply large language models in zero- and few-shot settings and incorporate the hierarchical structure of the taxonomy. We show that a combination of record summarization, bottom-up taxonomy traversal, and few-shot prompting yields F1 as high as 0.928 on metadata completion in the Earth science domain.

hinoki at SemEval-2024 Task 7: Numeral-Aware Headline Generation (English)
Hinoki Crum | Steven Bethard
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Numerical reasoning is challenging even for large pre-trained language models. We show that while T5 models are capable of generating relevant headlines with proper numerical values, they can also make mistakes in reading comprehension and miscalculate numerical values. To overcome these issues, we propose a two-step training process: first train models to read text and generate formal representations of calculations, then train models to read calculations and generate numerical values. On the SemEval 2024 Task 7 headline fill-in-the-blank task, our two-stage Flan-T5-based approach achieved 88% accuracy. On the headline generation task, our T5-based approach achieved RougeL of 0.390, BERT F1 Score of 0.453, and MoverScore of 0.587.

CLULab-UofA at SemEval-2024 Task 8: Detecting Machine-Generated Text Using Triplet-Loss-Trained Text Similarity and Text Classification
Mohammadhossein Rezaei | Yeaeun Kwon | Reza Sanayei | Abhyuday Singh | Steven Bethard
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Detecting machine-generated text is a critical task in the era of large language models. In this paper, we present our systems for SemEval-2024 Task 8, which focuses on multi-class classification to discern between human-written texts and machine-generated texts produced by five state-of-the-art large language models. We propose three different systems: unsupervised text similarity, triplet-loss-trained text similarity, and text classification. We show that the triplet-loss-trained text similarity system outperforms the other systems, achieving 80% accuracy on the test set and surpassing the baseline model for this subtask. Additionally, our text classification system, which takes into account sentence paraphrases generated by the candidate models, also outperforms the unsupervised text similarity system, achieving 74% accuracy.

MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity
Reza Sanayei | Abhyuday Singh | Mohammadhossein Rezaei | Steven Bethard
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

The advent of large language models (LLMs) has revolutionized Natural Language Generation (NLG), offering unmatched text generation capabilities. However, this progress introduces significant challenges, notably hallucinations: semantically incorrect yet fluent outputs. This phenomenon undermines content reliability, as traditional detection systems focus more on fluency than accuracy, posing a risk of misinformation spread. Our study addresses these issues by proposing a unified strategy for detecting hallucinations in neural model-generated text, focusing on the SHROOM task in SemEval 2024. We employ diverse methodologies to identify output divergence from the source content. We utilized Sentence Transformers to measure cosine similarity between source-hypothesis and source-target embeddings, experimented with omitting source content in the cosine similarity computations, and leveraged LLMs’ In-Context Learning with detailed task prompts. The varying performance of our different approaches across the subtasks underscores the complexity of Natural Language Understanding tasks, highlighting the importance of addressing the nuances of semantic correctness in the era of advanced language models.

2023

Two-Stage Fine-Tuning for Improved Bias and Variance for Large Pretrained Language Models
Lijing Wang | Yingya Li | Timothy Miller | Steven Bethard | Guergana Savova
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The bias-variance tradeoff is the idea that learning methods need to balance model complexity with data size to minimize both under-fitting and over-fitting. Recent empirical work and theoretical analysis with over-parameterized neural networks challenge the classic bias-variance trade-off notion, suggesting that no such trade-off holds: as the width of the network grows, bias monotonically decreases while variance initially increases followed by a decrease. In this work, we first provide a variance decomposition-based justification criterion to examine whether large pretrained neural models in a fine-tuning setting are generalizable enough to have low bias and variance. We then perform theoretical and empirical analysis using ensemble methods explicitly designed to decrease variance due to optimization. This results in essentially a two-stage fine-tuning algorithm that first ratchets down bias and variance iteratively, and then uses a selected fixed-bias model to further reduce variance due to optimization by ensembling. We also analyze the nature of variance change with the ensemble size in low- and high-resource classes. Empirical results show that this two-stage method obtains strong results on SuperGLUE tasks and clinical information extraction tasks. Code and settings are available: https://github.com/christa60/bias-var-fine-tuning-plms.git

End-to-end clinical temporal information extraction with multi-head attention
Timothy Miller | Steven Bethard | Dmitriy Dligach | Guergana Savova
Proceedings of the 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks

Understanding temporal relationships in text from electronic health records can be valuable for many important downstream clinical applications. Since Clinical TempEval 2017, there has been little work on end-to-end systems for temporal relation extraction, with most work focused on the setting where gold standard events and time expressions are given. In this work, we make use of a novel multi-headed attention mechanism on top of a pre-trained transformer encoder to allow the learning process to attend to multiple aspects of the contextualized embeddings. Our system achieves state-of-the-art results on the THYME corpus by a wide margin, in both the in-domain and cross-domain settings.

Proceedings of the 5th Clinical Natural Language Processing Workshop
Tristan Naumann | Asma Ben Abacha | Steven Bethard | Kirk Roberts | Anna Rumshisky
Proceedings of the 5th Clinical Natural Language Processing Workshop

clulab at MEDIQA-Chat 2023: Summarization and classification of medical dialogues
Kadir Bulut Ozler | Steven Bethard
Proceedings of the 5th Clinical Natural Language Processing Workshop

Clinical Natural Language Processing has been an increasingly popular research area in the NLP community. With the rise of large language models (LLMs) and their impressive abilities in NLP tasks, it is crucial to pay attention to their clinical applications. Sequence to sequence generative approaches with LLMs have been widely used in recent years. To be a part of the research in clinical NLP with recent advances in the field, we participated in task A of MEDIQA-Chat at ACL-ClinicalNLP Workshop 2023. In this paper, we explain our methods and findings as well as our comments on our results and limitations.

pdf bib abs

We explore temporal dependency graph (TDG) parsing in the clinical domain. We leverage existing annotations on the THYME dataset to semi-automatically construct a TDG corpus. Then we propose a new natural language inference (NLI) approach to TDG parsing, and evaluate it both on general domain TDGs from Wikinews and the newly constructed clinical TDG corpus. We achieve competitive performance on general domain TDGs with a much simpler model than prior work. On the clinical TDGs, our method establishes the first result of TDG parsing on clinical data with 0.79/0.88 micro/macro F1.

Fusing Temporal Graphs into Transformers for Time-Sensitive Question Answering
Xin Su | Phillip Howard | Nagib Hakim | Steven Bethard
Findings of the Association for Computational Linguistics: EMNLP 2023

Answering time-sensitive questions from long documents requires temporal reasoning over the times in questions and documents. An important open question is whether large language models can perform such reasoning solely using a provided text document, or whether they can benefit from additional temporal information extracted using other systems. We address this research question by applying existing temporal information extraction systems to construct temporal graphs of events, times, and temporal relations in questions and documents. We then investigate different approaches for fusing these graphs into Transformer models. Experimental results show that our proposed approach for fusing temporal graphs into input text substantially enhances the temporal reasoning capabilities of Transformer models with or without fine-tuning. Additionally, our proposed method outperforms various graph convolution-based approaches and establishes a new state-of-the-art performance on SituatedQA and three splits of TimeQA.

Gallagher at SemEval-2023 Task 5: Tackling Clickbait with Seq2Seq Models
Tugay Bilgis | Nimet Beyza Bozdag | Steven Bethard
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper presents the systems and approaches of the Gallagher team for the SemEval-2023 Task 5: Clickbait Spoiling. We propose a method to classify the type of spoiler (phrase, passage, multi) and a question-answering method to generate spoilers that satisfy the curiosity caused by clickbait posts. We experiment with the state-of-the-art Seq2Seq model T5. To identify the spoiler types we used a fine-tuned T5 classifier (Subtask 1). A mixture of T5 and Flan-T5 was used to generate the spoilers for clickbait posts (Subtask 2). Our system officially ranks first in generating phrase type spoilers in Subtask 2, and achieves the highest precision score for passage type spoilers in Subtask 1.

Arizonans at SemEval-2023 Task 9: Multilingual Tweet Intimacy Analysis with XLM-T
Nimet Beyza Bozdag | Tugay Bilgis | Steven Bethard
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper presents the systems and approaches of the Arizonans team for the SemEval 2023 Task 9: Multilingual Tweet Intimacy Analysis. We finetune the Multilingual RoBERTa model trained with about 200M tweets, XLM-T. Our final model ranked 9th out of 45 overall, 13th in seen languages, and 8th in unseen languages.

Improving Toponym Resolution with Better Candidate Generation, Transformer-based Reranking, and Two-Stage Resolution
Zeyu Zhang | Steven Bethard
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Geocoding is the task of converting location mentions in text into structured data that encodes the geospatial semantics. We propose a new architecture for geocoding, GeoNorm. GeoNorm first uses information retrieval techniques to generate a list of candidate entries from the geospatial ontology. Then it reranks the candidate entries using a transformer-based neural network that incorporates information from the ontology such as the entry’s population. This generate-and-rerank process is applied twice: first to resolve the less ambiguous countries, states, and counties, and second to resolve the remaining location mentions, using the identified countries, states, and counties as context. Our proposed toponym resolution framework achieves state-of-the-art performance on multiple datasets. Code and models are available at https://github.com/clulab/geonorm.

Transformer-based cynical expression detection in a corpus of Spanish YouTube reviews
Samuel Gonzalez-Lopez | Steven Bethard
Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Consumers of services and products exhibit a wide range of behaviors on social networks when they are dissatisfied. In this paper, we consider three types of cynical expressions (negative feelings, specific reasons, and attitude of being right) and annotate a corpus of 3189 comments in Spanish on car analysis channels from YouTube. We evaluate both token classification and text classification settings for this problem, and compare performance of different pre-trained models including BETO, SpanBERTa, Multilingual BERT, and RoBERTuito. The results show that models achieve performance above 0.8 F1 for all types of cynical expressions in the text classification setting, but achieve lower performance (around 0.6-0.7 F1) for the harder token classification setting.

2022

A Comparison of Strategies for Source-Free Domain Adaptation
Xin Su | Yiyun Zhao | Steven Bethard
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Data sharing restrictions are common in NLP, especially in the clinical domain, but there is limited research on adapting models to new domains without access to the original training data, a setting known as source-free domain adaptation. We take algorithms that traditionally assume access to the source-domain training data (active learning, self-training, and data augmentation) and adapt them for source-free domain adaptation. Then we systematically compare these different strategies across multiple tasks and domains. We find that active learning yields consistent gains across all SemEval 2021 Task 10 tasks and domains, but though the shared task saw successful self-trained and data-augmented models, our systematic comparison finds these strategies to be unreliable for source-free domain adaptation.

Proceedings of the 4th Clinical Natural Language Processing Workshop
Tristan Naumann | Steven Bethard | Kirk Roberts | Anna Rumshisky
Proceedings of the 4th Clinical Natural Language Processing Workshop

Ensemble-based Fine-Tuning Strategy for Temporal Relation Extraction from the Clinical Narrative
Lijing Wang | Timothy Miller | Steven Bethard | Guergana Savova
Proceedings of the 4th Clinical Natural Language Processing Workshop

In this paper, we investigate ensemble methods for fine-tuning transformer-based pretrained models for clinical natural language processing tasks, specifically temporal relation extraction from the clinical narrative. Our experimental results on the THYME data show that ensembling as a fine-tuning strategy can further boost model performance over single learners optimized for hyperparameters. Dynamic snapshot ensembling is particularly beneficial as it fine-tunes a wide array of parameters and results in a 2.8% absolute improvement in F1 over the base single learner.

Exploring Text Representations for Generative Temporal Relation Extraction
Dmitriy Dligach | Steven Bethard | Timothy Miller | Guergana Savova
Proceedings of the 4th Clinical Natural Language Processing Workshop

Sequence-to-sequence models are appealing because they allow both encoder and decoder to be shared across many tasks by formulating those tasks as text-to-text problems. Despite recently reported successes of such models, we find that engineering input/output representations for such text-to-text models is challenging. On the Clinical TempEval 2016 relation extraction task, the most natural choice of output representations, where relations are spelled out in simple predicate logic statements, did not lead to good performance. We explore a variety of input/output representations, with the most successful prompting one event at a time, and achieving results competitive with standard pairwise temporal relation extraction systems.

Exploring transformers and time lag features for predicting changes in mood over time
John Culnan | Damian Romero Diaz | Steven Bethard
Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology

This paper presents transformer-based models created for the CLPsych 2022 shared task. Using posts from Reddit users over a period of time, we aim to predict changes in mood from post to post. We test models that preserve timeline information through explicit ordering of posts as well as those that do not order posts but preserve features on the length of time between a user’s posts. We find that a model with temporal information may provide slight benefits over the same model without such information, although a RoBERTa transformer model provides enough information to make similar predictions without custom-encoded time information.

An existing domain taxonomy for normalizing content is often assumed when discussing approaches to information extraction, yet often in real-world scenarios there is none. When one does exist, as the information needs shift, it must be continually extended. This is a slow and tedious task, and one which does not scale well. Here we propose an interactive tool that allows a taxonomy to be built or extended rapidly and with a human in the loop to control precision. We apply insights from text summarization and information extraction to reduce the search space dramatically, then leverage modern pretrained language models to perform contextualized clustering of the remaining concepts to yield candidate nodes for the user to review. We show this allows a user to consider as many as 200 taxonomy concept candidates an hour, to quickly build or extend a taxonomy to better fit information needs.

TEAM-Atreides at SemEval-2022 Task 11: On leveraging data augmentation and ensemble to recognize complex Named Entities in Bangla
Nazia Tasnim | Md. Istiak Shihab | Asif Shahriyar Sushmit | Steven Bethard | Farig Sadeque
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Many areas, such as the biological and healthcare domains, artistic works, and organization names, have nested, overlapping, discontinuous entity mentions that may even be syntactically or semantically ambiguous in practice. Traditional sequence tagging algorithms are unable to recognize these complex mentions because they may violate the assumptions upon which sequence tagging schemes are founded. In this paper, we describe our contribution to SemEval 2022 Task 11 on identifying such complex Named Entities. We have leveraged the ensemble of multiple ELECTRA-based models that were exclusively pretrained on the Bangla language with the performance of ELECTRA-based models pretrained on English to achieve competitive performance on Track-11. Besides providing a system description, we will also present the outcomes of our experiments on architectural decisions, dataset augmentations, and post-competition findings.

UA-KO at SemEval-2022 Task 11: Data Augmentation and Ensembles for Korean Named Entity Recognition
Hyunju Song | Steven Bethard
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

This paper presents the approaches and systems of the UA-KO team for the Korean portion of SemEval-2022 Task 11 on Multilingual Complex Named Entity Recognition. We fine-tuned Korean and multilingual BERT and RoBERTa models, and conducted experiments on data augmentation, ensembles, and task-adaptive pretraining. Our final system ranked 8th out of 17 teams with an F1 score of 0.6749.

2021

Domain adaptation in practice: Lessons from a real-world information extraction pipeline
Timothy Miller | Egoitz Laparra | Steven Bethard
Proceedings of the Second Workshop on Domain Adaptation for NLP

Advances in transfer learning and domain adaptation have raised hopes that once-challenging NLP tasks are ready to be put to use for sophisticated information extraction needs. In this work, we describe an effort to do just that – combining state-of-the-art neural methods for negation detection, document time relation extraction, and aspectual link prediction, with the eventual goal of extracting drug timelines from electronic health record text. We train on the THYME colon cancer corpus and test on both the THYME brain cancer corpus and an internal corpus, and show that performance of the combined systems is unacceptable despite good performance of individual systems. Although domain adaptation shows improvements on each individual system, the model selection problem is a barrier to improving overall pipeline performance.

Triplet-Trained Vector Space and Sieve-Based Search Improve Biomedical Concept Normalization
Dongfang Xu | Steven Bethard
Proceedings of the 20th Workshop on Biomedical Language Processing

Concept normalization, the task of linking textual mentions of concepts to concepts in an ontology, is critical for mining and analyzing biomedical texts. We propose a vector-space model for concept normalization, where mentions and concepts are encoded via transformer networks that are trained via a triplet objective with online hard triplet mining. The transformer networks refine existing pre-trained models, and the online triplet mining makes training efficient even with hundreds of thousands of concepts by sampling training triples within each mini-batch. We introduce a variety of strategies for searching with the trained vector-space model, including approaches that incorporate domain-specific synonyms at search time with no model retraining. Across five datasets, our models that are trained only once on their corresponding ontologies are within 3 points of state-of-the-art models that are retrained for each new domain. Our models can also be trained for each domain, achieving new state-of-the-art on multiple datasets.

EntityBERT: Entity-centric Masking Strategy for Model Pretraining for the Clinical Domain
Chen Lin | Timothy Miller | Dmitriy Dligach | Steven Bethard | Guergana Savova
Proceedings of the 20th Workshop on Biomedical Language Processing

Transformer-based neural language models have led to breakthroughs for a variety of natural language processing (NLP) tasks. However, most models are pretrained on general domain data. We propose a methodology to produce a model focused on the clinical domain: continued pretraining of a model with a broad representation of biomedical terminology (PubMedBERT) on a clinical corpus along with a novel entity-centric masking strategy to infuse domain knowledge in the learning process. We show that such a model achieves superior results on clinical extraction tasks by comparing our entity-centric masking strategy with classic random masking on three clinical NLP tasks: cross-domain negation detection, document time relation (DocTimeRel) classification, and temporal relation extraction. We also evaluate our models on the PubMedQA dataset to measure the models’ performance on a non-entity-centric task in the biomedical domain. The language addressed in this work is English.

Do pretrained transformers infer telicity like humans?
Yiyun Zhao | Jian Gang Ngui | Lucy Hall Hartley | Steven Bethard
Proceedings of the 25th Conference on Computational Natural Language Learning

Pretrained transformer-based language models achieve state-of-the-art performance in many NLP tasks, but it is an open question whether the knowledge acquired by the models during pretraining resembles the linguistic knowledge of humans. We present both humans and pretrained transformers with descriptions of events, and measure their preference for telic interpretations (the event has a natural endpoint) or atelic interpretations (the event does not have a natural endpoint). To measure these preferences and determine what factors influence them, we design an English test and a novel-word test that include a variety of linguistic cues (noun phrase quantity, resultative structure, contextual information, temporal units) that bias toward certain interpretations. We find that humans’ choice of telicity interpretation is reliably influenced by theoretically-motivated cues, transformer models (BERT and RoBERTa) are influenced by some (though not all) of the cues, and transformer models often rely more heavily on temporal units than humans do.

Simplifying annotation of intersections in time normalization annotation: exploring syntactic and semantic validation
Peiwen Su | Steven Bethard
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop

While annotating normalized times in food security documents, we found that the semantically compositional annotation for time normalization (SCATE) scheme required several near-duplicate annotations to get the correct semantics for expressions like Nov. 7th to 11th 2021. To reduce this problem, we explored replacing SCATE’s Sub-Interval property with a Super-Interval property, that is, making the smallest units (e.g., 7th and 11th) rather than the largest units (e.g., 2021) the heads of the intersection chains. To ensure that the semantics of annotated time intervals remained unaltered despite our changes to the syntax of the annotation scheme, we applied several different techniques to validate our changes. These validation techniques detected and allowed us to resolve several important bugs in our automated translation from Sub-Interval to Super-Interval syntax.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Kristina Toutanova | Anna Rumshisky | Luke Zettlemoyer | Dilek Hakkani-Tur | Iz Beltagy | Steven Bethard | Ryan Cotterell | Tanmoy Chakraborty | Yichao Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Explainable Multi-hop Verbal Reasoning Through Internal Monologue
Zhengzhong Liang | Steven Bethard | Mihai Surdeanu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Many state-of-the-art (SOTA) language models have achieved high accuracy on several multi-hop reasoning problems. However, these approaches tend to not be interpretable because they do not make the intermediate reasoning steps explicit. Moreover, models trained on simpler tasks tend to fail when directly tested on more complex problems. We propose the Explainable multi-hop Verbal Reasoner (EVR) to solve these limitations by (a) decomposing multi-hop reasoning problems into several simple ones, and (b) using natural language to guide the intermediate reasoning hops. We implement EVR by extending the classic reasoning paradigm General Problem Solver (GPS) with a SOTA generative language model to generate subgoals and perform inference in natural language at each reasoning step. Evaluation of EVR on the RuleTaker synthetic question answering (QA) dataset shows that EVR achieves SOTA performance while being able to generate all reasoning steps in natural language. Furthermore, EVR generalizes better than other strong methods when trained on simpler tasks or less training data (up to 35.7% and 7.7% absolute improvement respectively).

If You Want to Go Far Go Together: Unsupervised Joint Candidate Evidence Retrieval for Multi-hop Question Answering
Vikas Yadav | Steven Bethard | Mihai Surdeanu
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multi-hop reasoning requires aggregation and inference from multiple facts. To retrieve such facts, we propose a simple approach that retrieves and reranks sets of evidence facts jointly. Our approach first generates unsupervised clusters of sentences as candidate evidence by accounting for links between sentences and coverage with the given query. Then, a RoBERTa-based reranker is trained to bring the most representative evidence cluster to the top. We specifically emphasize the importance of retrieving evidence jointly by showing several comparative analyses against other methods that retrieve and rerank evidence sentences individually. First, we introduce several attention- and embedding-based analyses, which indicate that jointly retrieving and reranking approaches can learn compositional knowledge required for multi-hop reasoning. Second, our experiments show that jointly retrieving candidate evidence leads to substantially higher evidence retrieval performance when fed to the same supervised reranker. In particular, our joint retrieval and then reranking approach achieves new state-of-the-art evidence retrieval performance on two multi-hop question answering (QA) datasets: 30.5 Recall@2 on QASC, and 67.6% F1 on MultiRC. When the evidence text from our joint retrieval approach is fed to a RoBERTa-based answer selection classifier, we achieve new state-of-the-art QA performance on MultiRC and the second-best result on QASC.

This paper presents the Source-Free Domain Adaptation shared task held within SemEval-2021. The aim of the task was to explore adaptation of machine-learning models in the face of data sharing constraints. Specifically, we consider the scenario where annotations exist for a domain but cannot be shared. Instead, participants are provided with models trained on that (source) data. Participants also receive some labeled data from a new (development) domain on which to explore domain adaptation algorithms. Participants are then tested on data representing a new (target) domain. We explored this scenario with two different semantic tasks: negation detection (a text classification task) and time expression recognition (a sequence tagging task).

The University of Arizona at SemEval-2021 Task 10: Applying Self-training, Active Learning and Data Augmentation to Source-free Domain Adaptation
Xin Su | Yiyun Zhao | Steven Bethard
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes our systems for negation detection and time expression recognition in SemEval 2021 Task 10, Source-Free Domain Adaptation for Semantic Processing. We show that self-training, active learning and data augmentation techniques can improve the generalization ability of the model on the unlabeled target domain data without accessing source domain data. We also perform detailed ablation studies and error analyses for our time expression recognition systems to identify the source of the performance improvement and give constructive feedback on the temporal normalization annotation guidelines.

Detection of Puffery on the English Wikipedia
Amanda Bertsch | Steven Bethard
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

On Wikipedia, an online crowdsourced encyclopedia, volunteers enforce the encyclopedia’s editorial policies. Wikipedia’s policy on maintaining a neutral point of view has inspired recent research on bias detection, including “weasel words” and “hedges”. Yet to date, little work has been done on identifying “puffery,” phrases that are overly positive without a verifiable source. We demonstrate that collecting training data for this task requires some care, and construct a dataset by combining Wikipedia editorial annotations and information retrieval techniques. We compare several approaches to predicting puffery, and achieve a 0.963 F1 score by incorporating citation features into a RoBERTa model. Finally, we demonstrate how to integrate our model with Wikipedia’s public infrastructure to give back to the Wikipedia editor community.

2020

Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering
Vikas Yadav | Steven Bethard | Mihai Surdeanu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Evidence retrieval is a critical stage of question answering (QA), necessary not only to improve performance, but also to explain the decisions of the QA method. We introduce a simple, fast, and unsupervised iterative evidence retrieval method, which relies on three ideas: (a) an unsupervised alignment approach to soft-align questions and answers with justification sentences using only GloVe embeddings, (b) an iterative process that reformulates queries focusing on terms that are not covered by existing justifications, which (c) stops when the terms in the given question and candidate answers are covered by the retrieved justifications. Despite its simplicity, our approach outperforms all the previous methods (including supervised methods) on the evidence selection task on two datasets: MultiRC and QASC. When these evidence sentences are fed into a RoBERTa answer classification component, we achieve state-of-the-art QA performance on these two datasets.
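
The iterative loop described in this abstract (retrieve, drop the covered terms from the query, stop once everything is covered) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes exact token overlap in place of the GloVe-based soft alignment, and the function name `iterative_retrieve` and the `max_hops` parameter are hypothetical.

```python
def iterative_retrieve(query_terms, sentences, max_hops=5):
    """Greedy iterative evidence retrieval (illustrative sketch).

    ASSUMPTION: coverage is measured by exact token overlap, not by the
    embedding-based soft alignment the abstract describes.
    `sentences` is a list of token lists.
    """
    remaining = set(query_terms)          # query terms not yet covered
    selected = []                         # indices of retrieved sentences
    candidates = list(range(len(sentences)))
    for _ in range(max_hops):
        if not remaining:                 # stop: all query terms are covered
            break
        # Pick the sentence covering the most uncovered terms.
        best = max(candidates,
                   key=lambda i: len(remaining & set(sentences[i])),
                   default=None)
        if best is None or not remaining & set(sentences[best]):
            break                         # nothing left that adds coverage
        selected.append(best)
        candidates.remove(best)
        # Reformulate the query: keep only the terms still uncovered.
        remaining -= set(sentences[best])
    return selected, remaining
```

The function returns both the selected sentence indices and any query terms left uncovered, so a caller can tell whether the loop stopped because of full coverage or because no candidate helped.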

How does BERT’s attention change when you fine-tune? An analysis methodology and a case study in negation scope
Yiyun Zhao | Steven Bethard
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Large pretrained language models like BERT, after fine-tuning to a downstream task, have achieved high performance on a variety of NLP problems. Yet explaining their decisions is difficult despite recent work probing their internal representations. We propose a procedure and analysis methods that take a hypothesis of how a transformer-based model might encode a linguistic phenomenon, and test the validity of that hypothesis based on a comparison between knowledge-related downstream tasks with downstream control tasks, and measurement of cross-dataset consistency. We apply this methodology to test BERT and RoBERTa on a hypothesis that some attention heads will consistently attend from a word in negation scope to the negation cue. We find that after fine-tuning BERT and RoBERTa on a negation scope task, the average attention head improves its sensitivity to negation and its attention consistency across negation datasets compared to the pre-trained models. However, only the base models (not the large models) improve compared to a control task, indicating there is evidence for a shallow encoding of negation only in the base models.

A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization
Dongfang Xu | Zeyu Zhang | Steven Bethard
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Concept normalization, the task of linking textual mentions of concepts to concepts in an ontology, is challenging because ontologies are large. In most cases, annotated datasets cover only a small sample of the concepts, yet concept normalizers are expected to predict all concepts in the ontology. In this paper, we propose an architecture consisting of a candidate generator and a list-wise ranker based on BERT. The ranker considers pairings of concept mentions and candidate concepts, allowing it to make predictions for any concept, not just those seen during training. We further enhance this list-wise approach with a semantic type regularizer that allows the model to incorporate semantic type information from the ontology during training. Our proposed concept normalization framework achieves state-of-the-art performance on multiple datasets.

Incivility is a problem on social media, and it comes in many forms (name-calling, vulgarity, threats, etc.) and domains (microblog posts, online news comments, Wikipedia edits, etc.). Training machine learning models to detect such incivility must handle the multi-label and multi-domain nature of the problem. We present a BERT-based model for incivility detection and propose several approaches for training it for multi-label and multi-domain datasets. We find that individual binary classifiers outperform a joint multi-label classifier, and that simply combining multiple domains of training data outperforms other recently-proposed fine tuning strategies. We also establish new state-of-the-art performance on several incivility detection datasets.

Assisting Undergraduate Students in Writing Spanish Methodology Sections
Samuel González-López | Steven Bethard | Aurelio Lopez-Lopez
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

In undergraduate theses, a good methodology section should describe the series of steps that were followed in performing the research. To assist students in this task, we develop machine-learning models and an app that uses them to provide feedback while students write. We construct an annotated corpus that identifies sentences representing methodological steps and labels when a methodology contains a logical sequence of such steps. We train machine-learning models based on language modeling and lexical features that can identify sentences representing methodological steps with 0.939 f-measure, and identify methodology sections containing a logical sequence of steps with an accuracy of 87%. We incorporate these models into a Microsoft Office Add-in, and show that students who improved their methodologies according to the model feedback received better grades on their methodologies.

Recently, BERT has achieved state-of-the-art performance in temporal relation extraction from clinical Electronic Medical Records text. However, the current approach is inefficient as it requires multiple passes through each input sequence. We extend a recently-proposed one-pass model for relation classification to a one-pass model for relation extraction. We augment this framework by introducing global embeddings to help with long-distance relation inference, and by multi-task learning to increase model performance and generalizability. Our proposed model produces results on par with the state-of-the-art in temporal relation extraction on the THYME corpus and is much “greener” in computational cost.

Proceedings of the 3rd Clinical Natural Language Processing Workshop
Anna Rumshisky | Kirk Roberts | Steven Bethard | Tristan Naumann
Proceedings of the 3rd Clinical Natural Language Processing Workshop

A Dataset and Evaluation Framework for Complex Geographical Description Parsing
Egoitz Laparra | Steven Bethard
Proceedings of the 28th International Conference on Computational Linguistics

Much previous work on geoparsing has focused on identifying and resolving individual toponyms in text like Adrano, S.Maria di Licodia or Catania. However, geographical locations occur not only as individual toponyms, but also as compositions of reference geolocations joined and modified by connectives, e.g., “. . . between the towns of Adrano and S.Maria di Licodia, 32 kilometres northwest of Catania”. Ideally, a geoparser should be able to take such text, and the geographical shapes of the toponyms referenced within it, and parse these into a geographical shape, formed by a set of coordinates, that represents the location described. But creating a dataset for this complex geoparsing task is difficult and, if done manually, would require a huge amount of effort to annotate the geographical shapes of not only the geolocation described but also the reference toponyms. We present an approach that automates most of the process by combining Wikipedia and OpenStreetMap. As a result, we have gathered a collection of 360,187 uncurated complex geolocation descriptions, from which we have manually curated 1,000 examples intended to be used as a test set. To accompany the data, we define a new geoparsing evaluation framework along with a scoring methodology and a set of baselines.

We present refinements over existing temporal relation annotations in the Electronic Medical Record clinical narrative. We refined the THYME corpus annotations to more faithfully represent nuanced temporality and nuanced temporal-coreferential relations. The main contributions are in re-defining CONTAINS and OVERLAP relations into CONTAINS, CONTAINS-SUBEVENT, OVERLAP and NOTED-ON. We demonstrate that these refinements lead to substantial gains in learnability for state-of-the-art transformer models as compared to previously reported results on the original THYME corpus. We thus establish a baseline for the automatic extraction of these refined temporal relations. Although our study is done on clinical narrative, we believe it addresses far-reaching challenges that are corpus- and domain-agnostic.

TTUI at SemEval-2020 Task 11: Propaganda Detection with Transfer Learning and Ensembles
Moonsung Kim | Steven Bethard
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we describe our approaches and systems for the SemEval-2020 Task 11 on propaganda technique detection. We fine-tuned BERT and RoBERTa pre-trained models then merged them with an average ensemble. We conducted several experiments for input representations dealing with long texts and preserving context as well as for the imbalanced class problem. Our system ranked 20th out of 36 teams with 0.398 F1 in the SI task and 14th out of 31 teams with 0.556 F1 in the TC task.
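
As a rough illustration of the "average ensemble" mentioned above, the sketch below averages per-class probabilities from several models and picks the highest-scoring class. The abstract does not say whether probabilities or logits were averaged, so averaging probabilities is an assumption here, and the function name `average_ensemble` is hypothetical.

```python
def average_ensemble(prob_lists):
    """Average class probabilities across models (illustrative sketch).

    ASSUMPTION: each model outputs one probability per class for the same
    example; the paper's actual ensembling details are not given.
    Returns (averaged probabilities, index of the predicted class).
    """
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    predicted = max(range(n_classes), key=avg.__getitem__)
    return avg, predicted


# Hypothetical outputs from two fine-tuned models for one example.
bert_probs = [0.6, 0.4]
roberta_probs = [0.2, 0.8]
avg, label = average_ensemble([bert_probs, roberta_probs])
```

Here the second model's stronger confidence outweighs the first model's preference, so the ensemble predicts class 1.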

2019

Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering
Vikas Yadav | Steven Bethard | Mihai Surdeanu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We propose an unsupervised strategy for the selection of justification sentences for multi-hop question answering (QA) that (a) maximizes the relevance of the selected sentences, (b) minimizes the overlap between the selected facts, and (c) maximizes the coverage of both question and answer. This unsupervised sentence selection can be coupled with any supervised QA model. We show that the sentences selected by our method improve the performance of a state-of-the-art supervised QA model on two multi-hop QA datasets: AI2’s Reasoning Challenge (ARC) and Multi-Sentence Reading Comprehension (MultiRC). We obtain new state-of-the-art performance on both datasets among systems that do not use external resources for training the QA system: 56.82% F1 on ARC (41.24% on Challenge and 64.49% on Easy) and 26.1% EM0 on MultiRC. Our justification sentences have higher quality than the justifications selected by a strong information retrieval baseline, e.g., by 5.4% F1 in MultiRC. We also show that our unsupervised selection of justification sentences is more stable across domains than a state-of-the-art supervised sentence selection method.

Alignment over Heterogeneous Embeddings for Question Answering
Vikas Yadav | Steven Bethard | Mihai Surdeanu
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a simple, fast, and mostly-unsupervised approach for non-factoid question answering (QA) called Alignment over Heterogeneous Embeddings (AHE). AHE simply aligns each word in the question and candidate answer with the most similar word in the retrieved supporting paragraph, and weighs each alignment score with the inverse document frequency of the corresponding question/answer term. AHE’s similarity function operates over embeddings that model the underlying text at different levels of abstraction: character (FLAIR), word (BERT and GloVe), and sentence (InferSent), where the latter is the only supervised component in the proposed approach. Despite its simplicity and lack of supervision, AHE obtains a new state-of-the-art performance on the “Easy” partition of the AI2 Reasoning Challenge (ARC) dataset (64.6% accuracy), top-two performance on the “Challenge” partition of ARC (34.1%), and top-three performance on the WikiQA dataset (74.08% MRR), outperforming many other complex, supervised approaches. Our error analysis indicates that alignments over character, word, and sentence embeddings capture substantially different semantic information. We exploit this with a simple meta-classifier that learns how much to trust the predictions over each representation, which further improves the performance of unsupervised AHE.
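
The core alignment step described above (match each question/answer word to its most similar paragraph word, then weight by inverse document frequency) can be sketched as follows. This is a simplified illustration over a single toy embedding table, not the AHE system itself: the names `cosine` and `alignment_score`, the default IDF of 1.0, and the handling of out-of-vocabulary words are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def alignment_score(query_tokens, paragraph_tokens, embeddings, idf):
    """IDF-weighted max-similarity alignment (illustrative sketch).

    ASSUMPTIONS: `embeddings` maps token -> vector for one embedding type;
    tokens without an embedding are skipped; missing IDF values default to 1.0.
    """
    score = 0.0
    for q in query_tokens:
        if q not in embeddings:
            continue
        # Best-matching paragraph word for this query word.
        best = max(
            (cosine(embeddings[q], embeddings[p])
             for p in paragraph_tokens if p in embeddings),
            default=0.0,
        )
        score += idf.get(q, 1.0) * best
    return score
```

In a multi-embedding setting like the one the abstract describes, a score of this kind would be computed once per embedding type and the results combined; how that combination is done is left out of this sketch.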

Building causal models of complicated phenomena such as food insecurity is currently a slow and labor-intensive manual process. In this paper, we introduce an approach that builds executable probabilistic models from raw, free text. The proposed approach is implemented through three systems: Eidos, INDRA, and Delphi. Eidos is an open-domain machine reading system designed to extract causal relations from natural language. It is rule-based, allowing for rapid domain transfer, customizability, and interpretability. INDRA aggregates multiple sources of causal information and performs assembly to create a coherent knowledge base and assess its reliability. This assembled knowledge serves as the starting point for modeling. Delphi is a modeling framework that assembles quantified causal fragments and their contexts into executable probabilistic models that respect the semantics of the original text, and can be used to support decision making.

Pre-trained Contextualized Character Embeddings Lead to Major Improvements in Time Normalization: a Detailed Analysis
Dongfang Xu | Egoitz Laparra | Steven Bethard
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

Recent studies have shown that pre-trained contextual word embeddings, which assign the same word different vectors in different contexts, improve performance in many tasks. But while contextual embeddings can also be trained at the character level, the effectiveness of such embeddings has not been studied. We derive character-level contextual embeddings from Flair (Akbik et al., 2018), and apply them to a time normalization task, yielding major performance improvements over the previous state-of-the-art: 51% error reduction in news and 33% in clinical notes. We analyze the sources of these improvements, and find that pre-trained contextual character embeddings are more robust to term variations, infrequent terms, and cross-domain changes. We also quantify the size of context that pre-trained contextual character embeddings take advantage of, and show that such embeddings capture features like part-of-speech and capitalization.

Incivility in public discourse has been a major concern in recent times as it can affect the quality and tenacity of the discourse negatively. In this paper, we present neural models that can learn to detect name-calling and vulgarity from a newspaper comment section. We show that in contrast to prior work on detecting toxic language, fine-grained incivilities like name-calling cannot be accurately detected by simple models like logistic regression. We apply the models trained on the newspaper comments data to detect uncivil comments in a Russian troll dataset, and find that despite the change of domain, the model makes accurate predictions.

University of Arizona at SemEval-2019 Task 12: Deep-Affix Named Entity Recognition of Geolocation Entities
Vikas Yadav | Egoitz Laparra | Ti-Tai Wang | Mihai Surdeanu | Steven Bethard
Proceedings of the 13th International Workshop on Semantic Evaluation

We present the Named Entity Recognition (NER) and disambiguation model used by the University of Arizona team (UArizona) for the SemEval 2019 task 12. We achieved fourth place on tasks 1 and 3. We implemented a deep-affix based LSTM-CRF NER model for task 1, which utilizes only character, word, prefix and suffix information for the identification of geolocation entities. Despite using just the training data provided by task organizers and not using any lexicon features, we achieved 78.85% strict micro F-score on task 1. We used the unsupervised population heuristics for task 3 and achieved 52.99% strict micro-F1 score in this task.

Proceedings of the 2nd Clinical Natural Language Processing Workshop
Anna Rumshisky | Kirk Roberts | Steven Bethard | Tristan Naumann
Proceedings of the 2nd Clinical Natural Language Processing Workshop

A BERT-based Universal Model for Both Within- and Cross-sentence Clinical Temporal Relation Extraction
Chen Lin | Timothy Miller | Dmitriy Dligach | Steven Bethard | Guergana Savova
Proceedings of the 2nd Clinical Natural Language Processing Workshop

Classic methods for clinical temporal relation extraction focus on relational candidates within a sentence. On the other hand, break-through Bidirectional Encoder Representations from Transformers (BERT) are trained on large quantities of arbitrary spans of contiguous text instead of sentences. In this study, we aim to build a sentence-agnostic framework for the task of CONTAINS temporal relation extraction. We establish a new state-of-the-art result for the task, 0.684F for in-domain (0.055-point improvement) and 0.565F for cross-domain (0.018-point improvement), by fine-tuning BERT and pre-training domain-specific BERT models on sentence-agnostic temporal relation instances with WordPiece-compatible encodings, and augmenting the labeled data with automatically generated “silver” instances.

Inferring missing metadata from environmental policy texts
Steven Bethard | Egoitz Laparra | Sophia Wang | Yiyun Zhao | Ragheb Al-Ghezi | Aaron Lien | Laura López-Hoffman
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

The National Environmental Policy Act (NEPA) provides a trove of data on how environmental policy decisions have been made in the United States over the last 50 years. Unfortunately, there is no central database for this information and it is too voluminous to assess manually. We describe our efforts to enable systematic research over US environmental policy by extracting and organizing metadata from the text of NEPA documents. Our contributions include collecting more than 40,000 NEPA-related documents, and evaluating rule-based baselines that establish the difficulty of three important tasks: identifying lead agencies, aligning document versions, and detecting reused text.

2018

A Survey on Recent Advances in Named Entity Recognition from Deep Learning models
Vikas Yadav | Steven Bethard
Proceedings of the 27th International Conference on Computational Linguistics

Named Entity Recognition (NER) is a key component in NLP systems for question answering, information retrieval, relation extraction, etc. NER systems have been studied and developed widely for decades, but accurate systems using deep neural networks (NN) have only been introduced in the last few years. We present a comprehensive survey of deep neural network architectures for NER, and contrast them with previous approaches to NER based on feature engineering and other supervised or semi-supervised learning algorithms. Our results highlight the improvements achieved by neural networks, and show how incorporating some of the lessons learned from past work on feature-based NER systems can yield further improvements.

From Characters to Time Intervals: New Paradigms for Evaluation and Neural Parsing of Time Normalizations
Egoitz Laparra | Dongfang Xu | Steven Bethard
Transactions of the Association for Computational Linguistics, Volume 6

This paper presents the first model for time normalization trained on the SCATE corpus. In the SCATE schema, time expressions are annotated as a semantic composition of time entities. This novel schema favors machine learning approaches, as it can be viewed as a semantic parsing task. In this work, we propose a character level multi-output neural network that outperforms previous state-of-the-art built on the TimeML schema. To compare predictions of systems that follow both SCATE and TimeML, we present a new scoring metric for time intervals. We also apply this new metric to carry out a comparative analysis of the annotations of both schemes in the same corpus.

SemEval 2018 Task 6: Parsing Time Normalizations
Egoitz Laparra | Dongfang Xu | Ahmed Elsayed | Steven Bethard | Martha Palmer
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper presents the outcomes of the Parsing Time Normalization shared task held within SemEval-2018. The aim of the task is to parse time expressions into the compositional semantic graphs of the Semantically Compositional Annotation of Time Expressions (SCATE) schema, which allows the representation of a wider variety of time expressions than previous approaches. Two tracks were included, one to evaluate the parsing of individual components of the produced graphs, in a classic information extraction way, and another one to evaluate the quality of the time intervals resulting from the interpretation of those graphs. Though 40 participants registered for the task, only one team submitted output, achieving 0.55 F1 in Track 1 (parsing) and 0.70 F1 in Track 2 (intervals).

Deep Affix Features Improve Neural Named Entity Recognizers
Vikas Yadav | Rebecca Sharp | Steven Bethard
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics

We propose a practical model for named entity recognition (NER) that combines word and character-level information with a specific learned representation of the prefixes and suffixes of the word. We apply this approach to multilingual and multi-domain NER and show that it achieves state of the art results on the CoNLL 2002 Spanish and Dutch and CoNLL 2003 German NER datasets, consistently achieving 1.5-2.3 percent over the state of the art without relying on any dictionary features. Additionally, we show improvement on SemEval 2013 task 9.1 DrugNER, achieving state of the art results on the MedLine dataset and the second best results overall (-1.3% from state of the art). We also establish a new benchmark on the I2B2 2010 Clinical NER dataset with 84.70 F-score.

Self-training improves Recurrent Neural Networks performance for Temporal Relation Extraction
Chen Lin | Timothy Miller | Dmitriy Dligach | Hadi Amiri | Steven Bethard | Guergana Savova
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

Neural network models are oftentimes restricted by limited labeled instances and resort to advanced architectures and features for cutting edge performance. We propose to build a recurrent neural network with multiple semantically heterogeneous embeddings within a self-training framework. Our framework makes use of labeled, unlabeled, and social media data, operates on basic features, and is scalable and generalizable. With this method, we establish the state-of-the-art result for both in- and cross-domain for a clinical temporal relation extraction task.

2017

Neural Temporal Relation Extraction
Dmitriy Dligach | Timothy Miller | Chen Lin | Steven Bethard | Guergana Savova
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We experiment with neural architectures for temporal relation extraction and establish a new state-of-the-art for several scenarios. We find that neural models with only tokens as input outperform state-of-the-art hand-engineered feature-based models, that convolutional neural networks outperform LSTM models, and that encoding relation arguments with XML tags outperforms a traditional position-based encoding.
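
The XML-tag encoding of relation arguments mentioned above can be illustrated with a short sketch that wraps two argument spans in tag tokens before the sequence is fed to a model. The tag names `<e1>` and `<e2>` and the function name `mark_arguments` are hypothetical; the abstract does not specify the tags actually used.

```python
def mark_arguments(tokens, arg1_span, arg2_span):
    """Wrap two relation arguments in XML-style tag tokens (sketch).

    ASSUMPTIONS: spans are (start, end) token offsets with end exclusive,
    and the two spans do not overlap. Tag names are hypothetical.
    """
    opens = {arg1_span[0]: "<e1>", arg2_span[0]: "<e2>"}
    closes = {arg1_span[1]: "</e1>", arg2_span[1]: "</e2>"}
    out = []
    for i in range(len(tokens) + 1):
        if i in closes:          # close a span that ends before token i
            out.append(closes[i])
        if i in opens:           # open a span that starts at token i
            out.append(opens[i])
        if i < len(tokens):
            out.append(tokens[i])
    return out


tokens = "The patient had surgery last week".split()
marked = mark_arguments(tokens, (3, 4), (4, 6))
# marked == ['The', 'patient', 'had', '<e1>', 'surgery', '</e1>',
#            '<e2>', 'last', 'week', '</e2>']
```

Compared with a position-based encoding, which attaches a relative-distance feature to every token, this approach keeps the input a plain token sequence and lets the model learn what the tags mean.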

Improving Implicit Semantic Role Labeling by Predicting Semantic Frame Arguments
Quynh Ngoc Thi Do | Steven Bethard | Marie-Francine Moens
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Implicit semantic role labeling (iSRL) is the task of predicting the semantic roles of a predicate that do not appear as explicit arguments, but rather regard common sense knowledge or are mentioned earlier in the discourse. We introduce an approach to iSRL based on a predictive recurrent neural semantic frame model (PRNSFM) that uses a large unannotated corpus to learn the probability of a sequence of semantic arguments given a predicate. We leverage the sequence probabilities predicted by the PRNSFM to estimate selectional preferences for predicates and their arguments. On the NomBank iSRL test set, our approach improves state-of-the-art performance on implicit semantic role labeling with less reliance than prior work on manually constructed language resources.

Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
Steven Bethard | Marine Carpuat | Marianna Apidianaki | Saif M. Mohammad | Daniel Cer | David Jurgens
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

SemEval-2017 Task 12: Clinical TempEval
Steven Bethard | Guergana Savova | Martha Palmer | James Pustejovsky
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Clinical TempEval 2017 aimed to answer the question: how well do systems trained on annotated timelines for one medical condition (colon cancer) perform in predicting timelines on another medical condition (brain cancer)? Nine sub-tasks were included, covering problems in time expression identification, event expression identification and temporal relation identification. Participant systems were evaluated on clinical and pathology notes from Mayo Clinic cancer patients, annotated with an extension of TimeML for the clinical domain. 11 teams participated in the tasks, with the best systems achieving F1 scores above 0.55 for time expressions, above 0.70 for event expressions, and above 0.40 for temporal relations. Most tasks observed about a 20 point drop over Clinical TempEval 2016, where systems were trained and evaluated on the same domain (colon cancer).

pdf bib abs

Unsupervised Domain Adaptation for Clinical Negation Detection
Timothy Miller | Steven Bethard | Hadi Amiri | Guergana Savova
Proceedings of the 16th BioNLP Workshop

Detecting negated concepts in clinical texts is an important part of NLP information extraction systems. However, generalizability of negation systems is lacking, as cross-domain experiments suffer dramatic performance losses. We examine the performance of multiple unsupervised domain adaptation algorithms on clinical negation detection, finding only modest gains that fall well short of in-domain performance.

pdf bib abs

Representations of Time Expressions for Temporal Relation Extraction with Convolutional Neural Networks
Chen Lin | Timothy Miller | Dmitriy Dligach | Steven Bethard | Guergana Savova
Proceedings of the 16th BioNLP Workshop

Token sequences are often used as the input for Convolutional Neural Networks (CNNs) in natural language processing. However, they might not be an ideal representation for time expressions, which are long, highly varied, and semantically complex. We describe a method for representing time expressions with single pseudo-tokens for CNNs. With this method, we establish a new state-of-the-art result for a clinical temporal relation extraction task.

2016

pdf bib abs

Facing the most difficult case of Semantic Role Labeling: A collaboration of word embeddings and co-training
Quynh Ngoc Thi Do | Steven Bethard | Marie-Francine Moens
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

We present a successful collaboration of word embeddings and co-training to tackle the most difficult test case of semantic role labeling: predicting out-of-domain and unseen semantic frames. Despite the fact that co-training is a successful traditional semi-supervised method, its application in SRL is very limited, especially when a huge amount of labeled data is available. In this work, co-training is used together with word embeddings to improve the performance of a system trained on a large training dataset. We also introduce a semantic role labeling system with a simple learning architecture and effective inference that is easily adaptable to semi-supervised settings with new training data and/or new features. On the out-of-domain testing set of the standard benchmark CoNLL 2009 data, our simple approach achieves high performance and improves state-of-the-art results.

pdf bib abs

Health support forums have become a rich source of data that can be used to improve health care outcomes. A user profile, including information such as age and gender, can support targeted analysis of forum data. But users might not always disclose their age and gender. It is desirable then to be able to automatically extract this information from users’ content. However, to the best of our knowledge there is no such resource for author profiling of health forum data. Here we present a large corpus, with close to 85,000 users, for profiling and also outline our approach and benchmark results to automatically detect a user’s age and gender from their forum posts. We use a mix of features from a user’s text as well as forum specific features to obtain accuracy well above the baseline, thus showing that both our dataset and our method are useful and valid.

pdf bib abs

A Semantically Compositional Annotation Scheme for Time Normalization
Steven Bethard | Jonathan Parker
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a new annotation scheme for normalizing time expressions, such as “three days ago”, to computer-readable forms, such as 2016-03-07. The annotation scheme addresses several weaknesses of the existing TimeML standard, allowing the representation of time expressions that align to more than one calendar unit (e.g., “the past three summers”), that are defined relative to events (e.g., “three weeks postoperative”), and that are unions or intersections of smaller time expressions (e.g., “Tuesdays and Thursdays”). It achieves this by modeling time expression interpretation as the semantic composition of temporal operators like UNION, NEXT, and AFTER. We have applied the annotation scheme to 34 documents so far, producing 1104 annotations, and achieving inter-annotator agreement of 0.821.

pdf bib

Domain Adaptation for Authorship Attribution: Improved Structural Correspondence Learning
Upendra Sapkota | Thamar Solorio | Manuel Montes | Steven Bethard
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib

Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)
Steven Bethard | Marine Carpuat | Daniel Cer | David Jurgens | Preslav Nakov | Torsten Zesch
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib

DLS@CU at SemEval-2016 Task 1: Supervised Models of Sentence Similarity
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

pdf bib

Improving Temporal Relation Extraction with Training Instance Augmentation
Chen Lin | Timothy Miller | Dmitriy Dligach | Steven Bethard | Guergana Savova
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

pdf bib

Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)
Anna Rumshisky | Kirk Roberts | Steven Bethard | Tristan Naumann
Proceedings of the Clinical Natural Language Processing Workshop (ClinicalNLP)

pdf bib

Visualizing the Content of a Children’s Story in a Virtual World: Lessons Learned
Quynh Ngoc Thi Do | Steven Bethard | Marie-Francine Moens
Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods

pdf bib

Why Do They Leave: Modeling Participation in Online Depression Forums
Farig Sadeque | Ted Pedersen | Thamar Solorio | Prasha Shrestha | Nicolas Rey-Villamizar | Steven Bethard
Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media

2015

pdf bib

Feature-Rich Two-Stage Logistic Regression for Monolingual Alignment
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib

Adapting Coreference Resolution for Narrative Processing
Quynh Ngoc Thi Do | Steven Bethard | Marie-Francine Moens
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution
Upendra Sapkota | Steven Bethard | Manuel Montes | Thamar Solorio
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib

DLS@CU: Sentence Similarity from Word Alignment and Semantic Vector Composition
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib

CUAB: Supervised Learning of Disorders and their Attributes using Relations
James Gung | John David Osborne | Steven Bethard
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib

SemEval-2015 Task 6: Clinical TempEval
Steven Bethard | Leon Derczynski | Guergana Savova | James Pustejovsky | Marc Verhagen
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf bib

Developing Language-tagged Corpora for Code-switching Tweets
Suraj Maharjan | Elizabeth Blair | Steven Bethard | Thamar Solorio
Proceedings of the 9th Linguistic Annotation Workshop

pdf bib

Predicting Continued Participation in Online Health Forums
Farig Sadeque | Thamar Solorio | Ted Pedersen | Prasha Shrestha | Steven Bethard
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

pdf bib

Extracting Time Expressions from Clinical Text
Timothy Miller | Steven Bethard | Dmitriy Dligach | Chen Lin | Guergana Savova
Proceedings of BioNLP 15

2014

pdf bib

Cross-Topic Authorship Attribution: Will Out-Of-Topic Data Help?
Upendra Sapkota | Thamar Solorio | Manuel Montes | Steven Bethard | Paolo Rosso
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib abs

ClearTK 2.0: Design Patterns for Machine Learning in UIMA
Steven Bethard | Philip Ogren | Lee Becker
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

ClearTK adds machine learning functionality to the UIMA framework, providing wrappers to popular machine learning libraries, a rich feature extraction library that works across different classifiers, and utilities for applying and evaluating machine learning models. Since its inception in 2008, ClearTK has evolved in response to feedback from developers and the community. This evolution has followed a number of important design principles including: conceptually simple annotator interfaces, readable pipeline descriptions, minimal collection readers, type system agnostic code, modules organized for ease of import, and assisting user comprehension of the complex UIMA framework.

pdf bib

An Annotation Framework for Dense Event Ordering
Taylor Cassidy | Bill McDowell | Nathanael Chambers | Steven Bethard
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib

The Stanford CoreNLP Natural Language Processing Toolkit
Christopher Manning | Mihai Surdeanu | John Bauer | Jenny Finkel | Steven Bethard | David McClosky
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib abs

This article discusses the requirements of a formal specification for the annotation of temporal information in clinical narratives. We discuss the implementation and extension of ISO-TimeML for annotating a corpus of clinical notes, known as the THYME corpus. To reflect the information task and the heavily inference-based reasoning demands in the domain, a new annotation guideline has been developed, “the THYME Guidelines to ISO-TimeML (THYME-TimeML)”. To clarify what relations merit annotation, we distinguish between linguistically-derived and inferentially-derived temporal orderings in the text. We also apply a top performing TempEval 2013 system against this new resource to measure the difficulty of adapting systems to the clinical domain. The corpus is available to the community and has been proposed for use in a SemEval 2015 task.

pdf bib abs

Back to Basics for Monolingual Alignment: Exploiting Word Similarity and Contextual Evidence
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Transactions of the Association for Computational Linguistics, Volume 2

We present a simple, easy-to-replicate monolingual aligner that demonstrates state-of-the-art performance while relying on almost no supervision and a very small number of external resources. Based on the hypothesis that words with similar meanings represent potential pairs for alignment if located in similar contexts, we propose a system that operates by finding such pairs. In two intrinsic evaluations on alignment test data, our system achieves F1 scores of 88–92%, demonstrating 1–3% absolute improvement over the previous best system. Moreover, in two extrinsic evaluations our aligner outperforms existing aligners, and even a naive application of the aligner approaches state-of-the-art performance in each extrinsic task.

pdf bib abs

Dense Event Ordering with a Multi-Pass Architecture
Nathanael Chambers | Taylor Cassidy | Bill McDowell | Steven Bethard
Transactions of the Association for Computational Linguistics, Volume 2

The past 10 years of event ordering research has focused on learning partial orderings over document events and time expressions. The most popular corpus, the TimeBank, contains a small subset of the possible ordering graph. Many evaluations follow suit by only testing certain pairs of events (e.g., only main verbs of neighboring sentences). This has led most research to focus on specific learners for partial labelings. This paper attempts to nudge the discussion from identifying some relations to all relations. We present new experiments on strongly connected event graphs that contain ∼10 times more relations per document than the TimeBank. We also describe a shift away from the single learner to a sieve-based architecture that naturally blends multiple learners into a precision-ranked cascade of sieves. Each sieve adds labels to the event graph one at a time, and earlier sieves inform later ones through transitive closure. This paper thus describes innovations in both approach and task. We experiment on the densest event graphs to date and show a 14% gain over state-of-the-art.

pdf bib

DLS@CU: Sentence Similarity from Word Alignment
Md Arafat Sultan | Steven Bethard | Tamara Sumner
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

pdf bib

Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL)
Oleksandr Kolomiyets | Marie-Francine Moens | Martha Palmer | James Pustejovsky | Steven Bethard
Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL)

2013

pdf bib

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
David Yarowsky | Timothy Baldwin | Anna Korhonen | Karen Livescu | Steven Bethard
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib

A Synchronous Context Free Grammar for Time Normalization
Steven Bethard
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib

51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop
Anik Dey | Sebastian Krause | Ivelina Nikolova | Eva Vecchi | Steven Bethard | Preslav I. Nakov | Feiyu Xu
51st Annual Meeting of the Association for Computational Linguistics Proceedings of the Student Research Workshop

pdf bib

DLS@CU-CORE: A Simple Machine Learning Model of Semantic Textual Similarity
Md. Sultan | Steven Bethard | Tamara Sumner
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

pdf bib

ClearTK-TimeML: A minimalist approach to TempEval 2013
Steven Bethard
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib

SemEval-2013 Task 3: Spatial Role Labeling
Oleksandr Kolomiyets | Parisa Kordjamshidi | Marie-Francine Moens | Steven Bethard
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib

CU : Computational Assessment of Short Free Text Answers - A Tool for Evaluating Students’ Understanding
Ifeyinwa Okoye | Steven Bethard | Tamara Sumner
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf bib

Skip N-grams and Ranking Functions for Predicting Script Events
Bram Jans | Steven Bethard | Ivan Vulić | Marie Francine Moens
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib abs

Annotating Story Timelines as Temporal Dependency Structures
Steven Bethard | Oleksandr Kolomiyets | Marie-Francine Moens
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an approach to annotating timelines in stories where events are linked together by temporal relations into a temporal dependency tree. This approach avoids the disconnected timeline problems of prior work, and results in timelines that are more suitable for temporal reasoning. We show that annotating timelines as temporal dependency trees is possible with high levels of inter-annotator agreement - Krippendorff's Alpha of 0.822 on selecting event pairs, and of 0.700 on selecting temporal relation labels - even with the moderately sized relation set of BEFORE, AFTER, INCLUDES, IS-INCLUDED, IDENTITY and OVERLAP. We also compare several annotation schemes for identifying story events, and show that higher inter-annotator agreement can be reached by focusing on only the events that are essential to forming the timeline, skipping words in negated contexts, modal contexts and quoted speech.

pdf bib

Extracting Narrative Timelines as Temporal Dependency Structures
Oleksandr Kolomiyets | Steven Bethard | Marie-Francine Moens
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib

SemEval-2012 Task 3: Spatial Role Labeling
Parisa Kordjamshidi | Steven Bethard | Marie-Francine Moens
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib

Identifying science concepts and student misconceptions in an interactive essay writing tutor
Steven Bethard | Ifeyinwa Okoye | Md. Arafat Sultan | Haojie Hang | James H. Martin | Tamara Sumner
Proceedings of the Seventh Workshop on Building Educational Applications Using NLP

While recent corpus annotation efforts cover a wide variety of semantic structures, work on temporal and causal relations is still in its early stages. Annotation efforts have typically considered either temporal relations or causal relations, but not both, and no corpora currently exist that allow the relation between temporals and causals to be examined empirically. We have annotated a corpus of 1000 event pairs for both temporal and causal relations, focusing on a relatively frequent construction in which the events are conjoined by the word “and”. Temporal relations were annotated using an extension of the BEFORE and AFTER scheme used in the TempEval competition, and causal relations were annotated using a scheme based on connective phrases like “and as a result”. The annotators achieved 81.2% agreement on temporal relations and 77.8% agreement on causal relations. Analysis of the resulting corpus revealed some interesting findings, for example, that over 30% of CAUSAL relations do not have an underlying BEFORE relation. The corpus was also explored using machine learning methods, and while model performance exceeded all baselines, the results suggested that simple grammatical cues may be insufficient for identifying the more difficult temporal and causal relations.

pdf bib

Learning Semantic Links from a Corpus of Parallel Temporal and Causal Relations
Steven Bethard | James H. Martin
Proceedings of ACL-08: HLT, Short Papers