Arman Cohan


2022

MultiVerS: Improving scientific claim verification with weak supervision and full-document context
David Wadden | Kyle Lo | Lucy Lu Wang | Arman Cohan | Iz Beltagy | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: NAACL 2022

The scientific claim verification task requires an NLP system to label scientific documents which Support or Refute an input claim, and to select evidentiary sentences (or rationales) justifying each predicted label. In this work, we present MultiVerS, which predicts a fact-checking label and identifies rationales in a multitask fashion based on a shared encoding of the claim and full document context. This approach accomplishes two key modeling goals. First, it ensures that all relevant contextual information is incorporated into each labeling decision. Second, it enables the model to learn from instances annotated with a document-level fact-checking label, but lacking sentence-level rationales. This allows MultiVerS to perform weakly-supervised domain adaptation by training on scientific documents labeled using high-precision heuristics. Our approach outperforms two competitive baselines on three scientific claim verification datasets, with particularly strong performance in zero / few-shot domain adaptation experiments. Our code and data are available at https://github.com/dwadden/multivers.

MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting
Anne Lauscher | Brandon Ko | Bailey Kuehl | Sophie Johnson | Arman Cohan | David Jurgens | Kyle Lo
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Citation context analysis (CCA) is an important task in natural language processing that studies how and why scholars discuss each other's work. Despite decades of study, computational methods for CCA have largely relied on overly simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that span multiple sentences and express multiple intents concurrently. Yet CCA is often approached in recent work as a single-sentence, single-label classification task, and thus many datasets used to develop modern computational approaches fail to capture this interesting discourse. To address this research gap, we highlight three understudied phenomena for CCA and release MULTICITE, a new dataset of 12.6K citation contexts from 1.2K computational linguistics papers that fully models these phenomena. Not only is it the largest collection of expert-annotated citation contexts to date, but MULTICITE also contains multi-sentence, multi-label citation contexts annotated throughout entire full paper texts. We demonstrate how MULTICITE can enable the development of new computational methods on three important CCA tasks. We release our code and dataset at https://github.com/allenai/multicite.

Long Context Question Answering via Supervised Contrastive Learning
Avi Caciularu | Ido Dagan | Jacob Goldberger | Arman Cohan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Long-context question answering (QA) tasks require reasoning over a long document or multiple documents. Addressing these tasks often benefits from identifying a set of evidence spans (e.g., sentences), which provide supporting evidence for answering the question. In this work, we propose a novel method for equipping long-context QA models with an additional sequence-level objective for better identification of the supporting evidence. We achieve this via an additional contrastive supervision signal in finetuning, where the model is encouraged to explicitly discriminate supporting evidence sentences from negative ones by maximizing question-evidence similarity. The proposed additional loss exhibits consistent improvements on three different strong long-context transformer models, across two challenging question answering benchmarks: HotpotQA and QAsper.
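
The abstract does not spell out the loss, so the following is only a minimal sketch of one common contrastive formulation (an InfoNCE-style softmax over question-sentence similarities), not necessarily the authors' objective. The function name, the `temperature` parameter, and the assumption that similarities are precomputed scalars are all hypothetical.

```python
import math

def contrastive_evidence_loss(similarities, evidence_idx, temperature=1.0):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact form).

    similarities: question-sentence similarity scores, one per candidate sentence.
    evidence_idx: indices of the sentences annotated as supporting evidence.
    """
    if not evidence_idx:
        raise ValueError("at least one evidence sentence is required")
    scaled = [s / temperature for s in similarities]
    # Log-sum-exp over all candidates, shifted by the max for numerical stability.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    # Average negative log-probability assigned to the evidence sentences.
    return -sum(scaled[i] - log_denominator for i in evidence_idx) / len(evidence_idx)
```

Under this sketch the loss shrinks as evidence sentences become more similar to the question than the negatives are, which matches the "maximizing question-evidence similarity" description.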

Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity
Sheshera Mysore | Arman Cohan | Tom Hope
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a new scientific document similarity model based on matching fine-grained aspects of texts. To train our model, we exploit a naturally-occurring source of supervision: sentences in the full-text of papers that cite multiple papers together (co-citations). Such co-citations not only reflect close paper relatedness, but also provide textual descriptions of how the co-cited papers are related. This novel form of textual supervision is used for learning to match aspects across papers. We develop multi-vector representations where vectors correspond to sentence-level aspects of documents, and present two methods for aspect matching: (1) A fast method that only matches single aspects, and (2) a method that makes sparse multiple matches with an Optimal Transport mechanism that computes an Earth Mover’s Distance between aspects. Our approach improves performance on document similarity tasks in four datasets. Further, our fast single-match method achieves competitive results, paving the way for applying fine-grained similarity to large scientific corpora.
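
As a rough illustration of the fast single-match idea, the sketch below scores two documents by the distance between their closest pair of aspect vectors. The use of Euclidean distance and the function name are assumptions made for illustration; the abstract does not state which distance the authors use, and the Optimal Transport variant is not shown.

```python
import math

def single_match_distance(aspects_a, aspects_b):
    """Distance between two documents as the minimum pairwise distance
    between their sentence-level aspect vectors (illustrative assumption)."""
    return min(math.dist(u, v) for u in aspects_a for v in aspects_b)

# Two toy documents, each represented by two 2-d aspect vectors.
doc_a = [[0.0, 0.0], [10.0, 10.0]]
doc_b = [[0.0, 1.0], [50.0, 50.0]]
print(single_match_distance(doc_a, doc_b))  # prints 1.0
```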

Proceedings of the Third Workshop on Scholarly Document Processing
Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard | Lucy Lu Wang
Proceedings of the Third Workshop on Scholarly Document Processing

Overview of the Third Workshop on Scholarly Document Processing
Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Michal Shmueli-Scheuer | Anita de Waard | Lucy Lu Wang
Proceedings of the Third Workshop on Scholarly Document Processing

With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 3rd Workshop on Scholarly Document Processing (SDP) at COLING as a hybrid event (https://sdproc.org/2022/). The SDP workshop consisted of a research track, three invited talks and five Shared Tasks: 1) MSLR22: Multi-Document Summarization for Literature Reviews, 2) DAGPap22: Detecting automatically generated scientific papers, 3) SV-Ident 2022: Survey Variable Identification in Social Science Publications, 4) SKGG: Scholarly Knowledge Graph Generation, 5) MuP 2022: Multi Perspective Scientific Document Summarization. The program was geared towards NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.

Overview of the First Shared Task on Multi Perspective Scientific Document Summarization (MuP)
Arman Cohan | Guy Feigenblat | Tirthankar Ghosal | Michal Shmueli-Scheuer
Proceedings of the Third Workshop on Scholarly Document Processing

We present the main findings of the MuP 2022 shared task, the first shared task on multi-perspective scientific document summarization. The task provides a testbed representing challenges for summarization of scientific documents, and facilitates development of better models to leverage summaries generated from multiple perspectives. We received 139 total submissions from 9 teams. We evaluated submissions both by automated metrics (i.e., Rouge) and by human judgments on faithfulness, coverage, and readability, which provided a more nuanced view of the differences between the systems. While we observe encouraging results from the participating teams, we conclude that there is still significant room left for improving summarization leveraging multiple references. Our dataset is available at https://github.com/allenai/mup.

ABNIRML: Analyzing the Behavior of Neural IR Models
Sean MacAvaney | Sergey Feldman | Nazli Goharian | Doug Downey | Arman Cohan
Transactions of the Association for Computational Linguistics, Volume 10

Pretrained contextualized language models such as BERT and T5 have established a new state-of-the-art for ad-hoc search. However, it is not yet well understood why these methods are so effective, what makes some variants more effective than others, and what pitfalls they may have. We present a new comprehensive framework for Analyzing the Behavior of Neural IR ModeLs (ABNIRML), which includes new types of diagnostic probes that allow us to test several characteristics (such as writing styles, factuality, sensitivity to paraphrasing and word order) that are not addressed by previous techniques. To demonstrate the value of the framework, we conduct an extensive empirical study that yields insights into the factors that contribute to the neural model’s gains, and identify potential unintended biases the models exhibit. Some of our results confirm conventional wisdom, for example, that recent neural ranking models rely less on exact term overlap with the query, and instead leverage richer linguistic information, evidenced by their higher sensitivity to word and sentence order. Other results are more surprising, such as that some models (e.g., T5 and ColBERT) are biased towards factually correct (rather than simply relevant) texts. Further, some characteristics vary even for the same base language model, and other characteristics can appear due to random variations during model training.

Generating Scientific Claims for Zero-Shot Scientific Fact Checking
Dustin Wright | David Wadden | Kyle Lo | Bailey Kuehl | Arman Cohan | Isabelle Augenstein | Lucy Lu Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automated scientific fact checking is difficult due to the complexity of scientific language and a lack of significant amounts of training data, as annotation requires domain expertise. To address this challenge, we propose scientific claim generation, the task of generating one or more atomic and verifiable claims from scientific sentences, and demonstrate its usefulness in zero-shot fact checking for biomedical claims. We propose CLAIMGEN-BART, a new supervised method for generating claims supported by the literature, as well as KBIN, a novel method for generating claim negations. Additionally, we adapt an existing unsupervised entity-centric method of claim generation to biomedical claims, which we call CLAIMGEN-ENTITY. Experiments on zero-shot fact checking demonstrate that both CLAIMGEN-ENTITY and CLAIMGEN-BART, coupled with KBIN, achieve up to 90% performance of fully supervised models trained on manually annotated claims and evidence. A rigorous evaluation study demonstrates significant improvement in generated claim and negation quality over existing baselines.

PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization
Wen Xiao | Iz Beltagy | Giuseppe Carenini | Arman Cohan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce PRIMERA, a pre-trained model for multi-document representation with a focus on summarization that reduces the need for dataset-specific architectures and large amounts of fine-tuning labeled data. PRIMERA uses our newly proposed pre-training objective designed to teach the model to connect and aggregate information across documents. It also uses efficient encoder-decoder transformers to simplify the processing of concatenated input documents. In extensive experiments on 6 multi-document summarization datasets from 3 different domains in zero-shot, few-shot, and fully supervised settings, PRIMERA outperforms current state-of-the-art dataset-specific and pre-trained models in most of these settings by large margins.

Improving the Generalizability of Depression Detection by Leveraging Clinical Questionnaires
Thong Nguyen | Andrew Yates | Ayah Zirikly | Bart Desmet | Arman Cohan
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automated methods have been widely used to identify and analyze mental health conditions (e.g., depression) from various sources of information, including social media. Yet, deployment of such models in real-world healthcare applications faces challenges including poor out-of-domain generalization and lack of trust in black box models. In this work, we propose approaches for depression detection that are constrained to different degrees by the presence of symptoms described in PHQ9, a questionnaire used by clinicians in the depression screening process. In dataset-transfer experiments on three social media datasets, we find that grounding the model in PHQ9’s symptoms substantially improves its ability to generalize to out-of-distribution data compared to a standard BERT-based approach. Furthermore, this approach can still perform competitively on in-domain data. These results and our qualitative analyses suggest that grounding model predictions in clinically-relevant symptoms can improve generalizability while producing a model that is easier to inspect.

Zero- and Few-Shot NLP with Pretrained Language Models
Iz Beltagy | Arman Cohan | Robert Logan IV | Sewon Min | Sameer Singh
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

The ability to efficiently learn from little-to-no data is critical to applying NLP to tasks where data collection is costly or otherwise difficult. This is a challenging setting both academically and practically, particularly because training neural models typically requires large amounts of labeled data. More recently, advances in pretraining on unlabelled data have raised the potential for better zero-shot or few-shot learning (Devlin et al., 2019; Brown et al., 2020). In particular, over the past year, a great deal of research has been conducted to better learn from limited data using large-scale language models. In this tutorial, we aim to bring interested NLP researchers up to speed on recent and ongoing techniques for zero- and few-shot learning with pretrained language models. Additionally, our goal is to reveal new research opportunities to the audience, which will hopefully bring us closer to addressing existing challenges in this domain.

2021

Proceedings of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing

Overview of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang
Proceedings of the Second Workshop on Scholarly Document Processing

With the ever-increasing pace of research and high volume of scholarly communication, scholars face a daunting task. Not only must they keep up with the growing literature in their own and related fields, scholars increasingly also need to rebut pseudo-science and disinformation. These needs have motivated an increasing focus on computational methods for enhancing search, summarization, and analysis of scholarly documents. However, the various strands of research on scholarly document processing remain fragmented. To reach out to the broader NLP and AI/ML community, pool distributed efforts in this area, and enable shared access to published research, we held the 2nd Workshop on Scholarly Document Processing (SDP) at NAACL 2021 as a virtual event (https://sdproc.org/2021/). The SDP workshop consisted of a research track, three invited talks, and three Shared Tasks (LongSumm 2021, SCIVER, and 3C). The program was geared towards the application of NLP, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges.

CDLM: Cross-Document Language Modeling
Avi Caciularu | Arman Cohan | Iz Beltagy | Matthew Peters | Arie Cattan | Ido Dagan
Findings of the Association for Computational Linguistics: EMNLP 2021

We introduce a new pretraining approach geared for multi-document language modeling, incorporating two key ideas into the masked language modeling self-supervised objective. First, instead of considering documents in isolation, we pretrain over sets of multiple related documents, encouraging the model to learn cross-document relationships. Second, we improve over recent long-range transformers by introducing dynamic global attention that has access to the entire input to predict masked tokens. We release CDLM (Cross-Document Language Model), a new general language model for the multi-document setting that can be easily applied to downstream tasks. Our extensive analysis shows that both ideas are essential for the success of CDLM, and work in synergy to set new state-of-the-art results for several multi-text tasks.

ParsiNLU: A Suite of Language Understanding Challenges for Persian
Daniel Khashabi | Arman Cohan | Siamak Shakeri | Pedram Hosseini | Pouya Pezeshkpour | Malihe Alikhani | Moin Aminnaseri | Marzieh Bitaab | Faeze Brahman | Sarik Ghazarian | Mozhdeh Gheini | Arman Kabiri | Rabeeh Karimi Mahabagdi | Omid Memarrast | Ahmadreza Mosallanezhad | Erfan Noury | Shahab Raji | Mohammad Sadegh Rasooli | Sepideh Sadeghi | Erfan Sadeqi Azer | Niloofar Safi Samghabadi | Mahsa Shafaei | Saber Sheybani | Ali Tazarv | Yadollah Yaghoobzadeh
Transactions of the Association for Computational Linguistics, Volume 9

Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains concentrated on resource-rich languages like English. This work focuses on the Persian language, one of the most widely spoken languages in the world, for which few NLU datasets are available. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce ParsiNLU, the first benchmark in the Persian language that includes a range of language understanding tasks, such as reading comprehension and textual entailment. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Additionally, we present the first results on state-of-the-art monolingual and multilingual pre-trained language models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope ParsiNLU fosters further research and advances in Persian language understanding.

A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
Pradeep Dasigi | Kyle Lo | Iz Beltagy | Arman Cohan | Noah A. Smith | Matt Gardner
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present Qasper, a dataset of 5049 questions over 1585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.

Beyond Paragraphs: NLP for Long Sequences
Iz Beltagy | Arman Cohan | Hannaneh Hajishirzi | Sewon Min | Matthew E. Peters
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials

In this tutorial, we aim to bring interested NLP researchers up to speed on recent and ongoing techniques for document-level representation learning. Additionally, our goal is to reveal new research opportunities to the audience, which will hopefully bring us closer to addressing existing challenges in this domain.

2020

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Arman Cohan | Sergey Feldman | Iz Beltagy | Doug Downey | Daniel Weld
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose SPECTER, a new method to generate document-level embeddings of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
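
The abstract does not state the training objective. One common way to learn from a citation signal is a triplet margin loss that pulls a paper towards a paper it cites and away from one it does not; the sketch below assumes that setup, with Euclidean distance and a hypothetical `margin` default, and should not be read as the authors' exact implementation.

```python
import math

def triplet_margin_loss(query, positive, negative, margin=1.0):
    """Hinge loss on embedding distances (assumed objective, for illustration).

    query:    embedding of the paper being represented.
    positive: embedding of a related paper (e.g. one the query cites).
    negative: embedding of an unrelated paper.
    """
    gap = math.dist(query, positive) - math.dist(query, negative)
    return max(gap + margin, 0.0)

# The loss is zero once the positive is closer than the negative by at least the margin.
print(triplet_margin_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # prints 0.0
```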

SUPP.AI: finding evidence for supplement-drug interactions
Lucy Lu Wang | Oyvind Tafjord | Arman Cohan | Sarthak Jain | Sam Skjonsberg | Carissa Schoenick | Nick Botner | Waleed Ammar
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Dietary supplements are used by a large portion of the population, but information on their pharmacologic interactions is incomplete. To address this challenge, we present SUPP.AI, an application for browsing evidence of supplement-drug interactions (SDIs) extracted from the biomedical literature. We train a model to automatically extract supplement information and identify such interactions from the scientific literature. To address the lack of labeled data for SDI identification, we use labels of the closely related task of identifying drug-drug interactions (DDIs) for supervision. We fine-tune the contextualized word representations of the RoBERTa language model using labeled DDI data, and apply the fine-tuned model to identify supplement interactions. We extract 195k evidence sentences from 22M articles (P=0.82, R=0.58, F1=0.68) for 60k interactions. We create the SUPP.AI application for users to search evidence sentences extracted by our model. SUPP.AI is an attempt to close the information gap on dietary supplements by making up-to-date evidence on SDIs more discoverable for researchers, clinicians, and consumers. An informational video on how to use SUPP.AI is available at: https://youtu.be/dR0ucKdORwc
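
The reported F1 follows from the reported precision and recall through the standard harmonic-mean definition, as this small check shows. Only the three figures come from the abstract; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# P=0.82 and R=0.58 as reported in the abstract.
print(round(f1_score(0.82, 0.58), 2))  # prints 0.68
```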

TLDR: Extreme Summarization of Scientific Documents
Isabel Cachola | Kyle Lo | Arman Cohan | Daniel Weld
Findings of the Association for Computational Linguistics: EMNLP 2020

We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden. We propose CATTS, a simple yet effective learning strategy for generating TLDRs that exploits titles as an auxiliary training signal. CATTS improves upon strong baselines under both automated metrics and human evaluations. Data and code are publicly available at https://github.com/allenai/scitldr.

SLEDGE-Z: A Zero-Shot Baseline for COVID-19 Literature Search
Sean MacAvaney | Arman Cohan | Nazli Goharian
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

With worldwide concerns surrounding the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), there is a rapidly growing body of scientific literature on the virus. Clinicians, researchers, and policy-makers need to be able to search these articles effectively. In this work, we present a zero-shot ranking algorithm that adapts to COVID-related scientific literature. Our approach filters training data from another collection down to medical-related queries, uses a neural re-ranking model pre-trained on scientific text (SciBERT), and filters the target document collection. This approach ranks top among zero-shot methods on the TREC COVID Round 1 leaderboard, and exhibits a P@5 of 0.80 and an nDCG@10 of 0.68 when evaluated on both Round 1 and 2 judgments. Despite not relying on TREC-COVID data, our method outperforms models that do. As one of the first search methods to thoroughly evaluate COVID-19 search, we hope that this serves as a strong baseline and helps in the global crisis.
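
For readers unfamiliar with the two metrics quoted above, the sketch below gives one common definition of each. It assumes graded relevance labels listed in ranked order, a linear gain, and an ideal ranking built from the same list; official TREC tooling may differ in such details (for example by using judgments for documents outside the ranked list), so this is illustrative only.

```python
import math

def precision_at_k(relevance, k):
    """Fraction of the top-k results with a positive relevance label."""
    return sum(1 for r in relevance[:k] if r > 0) / k

def dcg_at_k(relevance, k):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(r / math.log2(rank + 2) for rank, r in enumerate(relevance[:k]))

def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the same labels in ideal order."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Four of the top five results are relevant.
print(precision_at_k([1, 1, 0, 1, 1, 0], 5))  # prints 0.8
```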

Fact or Fiction: Verifying Scientific Claims
David Wadden | Shanchuan Lin | Kyle Lo | Lucy Lu Wang | Madeleine van Zuylen | Arman Cohan | Hannaneh Hajishirzi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce scientific claim verification, a new task to select abstracts from the research literature containing evidence that SUPPORTS or REFUTES a given scientific claim, and to identify rationales justifying each decision. To study this task, we construct SciFact, a dataset of 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales. We develop baseline models for SciFact, and demonstrate that simple domain adaptation techniques substantially improve performance compared to models trained on Wikipedia or political news. We show that our system is able to verify claims related to COVID-19 by identifying evidence from the CORD-19 corpus. Our experiments indicate that SciFact will provide a challenging testbed for the development of new systems designed to retrieve and reason over corpora containing specialized domain knowledge. Data and code for this new task are publicly available at https://github.com/allenai/scifact. A leaderboard and COVID-19 fact-checking demo are available at https://scifact.apps.allenai.org.

GUIR @ LongSumm 2020: Learning to Generate Long Summaries from Scientific Documents
Sajad Sotudeh Gharebagh | Arman Cohan | Nazli Goharian
Proceedings of the First Workshop on Scholarly Document Processing

This paper presents our methods for the LongSumm 2020: Shared Task on Generating Long Summaries for Scientific Documents, where the task is to generate long summaries given a set of scientific papers provided by the organizers. We explore 3 main approaches for this task: 1. An extractive approach using a BERT-based summarization model; 2. A two-stage model that additionally includes an abstraction step using BART; and 3. A new multi-tasking approach that incorporates document structure into the summarizer. We found that our new multi-tasking approach outperforms the two other methods by large margins. Among 9 participants in the shared task, our best model ranks top according to Rouge-1 score (53.11%) while staying competitive in terms of Rouge-2.

2019

SciBERT: A Pretrained Language Model for Scientific Text
Iz Beltagy | Kyle Lo | Arman Cohan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised pretraining on a large multi-domain corpus of scientific publications to improve performance on downstream scientific NLP tasks. We evaluate on a suite of tasks including sequence tagging, sentence classification and dependency parsing, with datasets from a variety of scientific domains. We demonstrate statistically significant improvements over BERT and achieve new state-of-the-art results on several of these tasks. The code and pretrained models are available at https://github.com/allenai/scibert/.

Pretrained Language Models for Sequential Sentence Classification
Arman Cohan | Iz Beltagy | Daniel King | Bhavana Dalvi | Dan Weld
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

As a step toward better document-level understanding, we explore classification of a sequence of sentences into their corresponding categories, a task that requires understanding sentences in the context of the document. Recent successful models for this task have used hierarchical models to contextualize sentence representations, and Conditional Random Fields (CRFs) to incorporate dependencies between subsequent labels. In this work, we show that pretrained language models, BERT (Devlin et al., 2018) in particular, can be used for this task to capture contextual dependencies without the need for hierarchical encoding or a CRF. Specifically, we construct a joint sentence representation that allows BERT Transformer layers to directly utilize contextual information from all words in all sentences. Our approach achieves state-of-the-art results on four datasets, including a new dataset of structured scientific abstracts.
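
One plausible way to build such a joint representation is to concatenate all sentences into a single input, separated by a delimiter token, and classify each sentence from the encoder output at its delimiter. The sketch below shows only the input construction. The whitespace tokenizer, the `[CLS]`/`[SEP]` token strings, and the choice to read sentence representations at the separator positions are assumptions for illustration, not details confirmed by the abstract.

```python
def build_joint_input(sentences, cls="[CLS]", sep="[SEP]"):
    """Concatenate sentences into one token sequence.

    Returns the token list and the index of the separator that follows
    each sentence (a hypothetical place to read per-sentence outputs).
    """
    tokens = [cls]
    sep_positions = []
    for sentence in sentences:
        tokens.extend(sentence.split())  # toy whitespace tokenizer
        sep_positions.append(len(tokens))
        tokens.append(sep)
    return tokens, sep_positions

tokens, positions = build_joint_input(["a b", "c"])
print(tokens)     # prints ['[CLS]', 'a', 'b', '[SEP]', 'c', '[SEP]']
print(positions)  # prints [3, 5]
```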

Structural Scaffolds for Citation Intent Classification in Scientific Publications
Arman Cohan | Waleed Ammar | Madeleine van Zuylen | Field Cady
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Identifying the intent of a citation in scientific papers (e.g., background information, use of methods, comparing results) is critical for machine reading of individual publications and automated analysis of the scientific literature. We propose structural scaffolds, a multitask model to incorporate structural information of scientific papers into citations for effective classification of citation intents. Our model achieves a new state-of-the-art on an existing ACL Anthology dataset (ACL-ARC) with a 13.3% absolute increase in F1 score, without relying on external linguistic resources or hand-engineered features as done in existing methods. In addition, we introduce a new dataset of citation intents (SciCite) that is more than five times larger than existing datasets and covers multiple scientific domains. Our code and data are available at: https://github.com/allenai/scicite.

2018

RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses
Sean MacAvaney | Bart Desmet | Arman Cohan | Luca Soldaini | Andrew Yates | Ayah Zirikly | Nazli Goharian
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

Self-reported diagnosis statements have been widely employed in studying language related to mental health in social media. However, existing research has largely ignored the temporality of mental health diagnoses. In this work, we introduce RSDD-Time: a new dataset of 598 manually annotated self-reported depression diagnosis posts from Reddit that include temporal information about the diagnosis. Annotations include whether a mental health condition is present and how recently the diagnosis happened. Furthermore, we include exact temporal spans that relate to the date of diagnosis. This information is valuable for various computational methods to examine mental health through social media because one’s mental health state is not static. We also test several baseline classification and extraction approaches, which suggest that extracting temporal information from self-reported diagnosis statements is challenging.

Helping or Hurting? Predicting Changes in Users’ Risk of Self-Harm Through Online Community Interactions
Luca Soldaini | Timothy Walsh | Arman Cohan | Julien Han | Nazli Goharian
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

In recent years, online communities have formed around suicide and self-harm prevention. While these communities offer support in moments of crisis, they can also normalize harmful behavior, discourage professional treatment, and instigate suicidal ideation. In this work, we focus on how interaction with others in such a community affects the mental state of users who are seeking support. We first build a dataset of conversation threads between users in a distressed state and community members offering support. We then show how to construct a classifier to predict whether distressed users are helped or harmed by the interactions in the thread, and we achieve a macro-F1 score of up to 0.69.
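For readers unfamiliar with the reported metric, the sketch below shows how a macro-F1 score is computed: F1 is calculated per class and the per-class values are averaged without weighting by class size. The labels and predictions are hypothetical toy data chosen for illustration, not results from the paper.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one label")
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels, not data from the paper.
y_true = ["helped", "helped", "harmed", "harmed"]
y_pred = ["helped", "harmed", "harmed", "harmed"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.7333
```

Macro averaging gives each class equal weight, so a classifier cannot score well by favoring only the more frequent outcome.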

A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents
Arman Cohan | Franck Dernoncourt | Doo Soon Kim | Trung Bui | Seokhwan Kim | Walter Chang | Nazli Goharian
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Neural abstractive summarization models have led to promising results in summarizing relatively short documents. We propose the first model for abstractive summarization of single, longer-form documents (e.g., research papers). Our approach consists of a new hierarchical encoder that models the discourse structure of a document, and an attentive discourse-aware decoder to generate the summary. Empirical results on two large-scale datasets of scientific papers show that our model significantly outperforms state-of-the-art models.

GU IRLAB at SemEval-2018 Task 7: Tree-LSTMs for Scientific Relation Classification
Sean MacAvaney | Luca Soldaini | Arman Cohan | Nazli Goharian
Proceedings of the 12th International Workshop on Semantic Evaluation

SemEval 2018 Task 7 focuses on relation extraction and classification in scientific literature. In this work, we present our tree-based LSTM network for this shared task. Our approach placed 9th (of 28) for subtask 1.1 (relation classification), and 5th (of 20) for subtask 1.2 (relation classification with noisy entities). We also provide an ablation study of features included as input to the network.

SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions
Arman Cohan | Bart Desmet | Andrew Yates | Luca Soldaini | Sean MacAvaney | Nazli Goharian
Proceedings of the 27th International Conference on Computational Linguistics

Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled data without the need for manual labelling. We introduce the SMHD (Self-reported Mental Health Diagnoses) dataset and make it available. SMHD is a novel large dataset of social media posts from users with one or multiple mental health conditions along with matched control users. We examine distinctions in users’ language, as measured by linguistic and psychological variables. We further explore text classification methods to identify individuals with mental conditions through their language.
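The idea of a high-precision pattern for self-reported diagnoses can be sketched with a regular expression. The pattern and the condition list below are hypothetical illustrations and are not the patterns used to build SMHD; the sketch only shows how a strict first-person template trades recall for precision.

```python
import re
from typing import List

# Hypothetical condition list and template, for illustration only.
CONDITIONS = ["depression", "anxiety", "bipolar disorder"]
PATTERN = re.compile(
    r"\bI (?:was|have been|got) diagnosed with ("
    + "|".join(re.escape(c) for c in CONDITIONS)
    + r")\b",
    re.IGNORECASE,
)


def find_self_reported_diagnoses(post: str) -> List[str]:
    """Return conditions the author explicitly reports being diagnosed with."""
    return [match.group(1).lower() for match in PATTERN.finditer(post)]


print(find_self_reported_diagnoses("I was diagnosed with depression last year."))
print(find_self_reported_diagnoses("My brother was diagnosed with anxiety."))
```

Requiring a first-person subject means the second post, which reports someone else's diagnosis, yields no match.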

2017

GUIR at SemEval-2017 Task 12: A Framework for Cross-Domain Clinical Temporal Information Extraction
Sean MacAvaney | Arman Cohan | Nazli Goharian
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

Clinical TempEval 2017 (SemEval 2017 Task 12) addresses the task of cross-domain temporal extraction from clinical text. We present a system for this task that uses supervised learning for the extraction of temporal expression and event spans with corresponding attributes and narrative container relations. Approaches include conditional random fields and decision tree ensembles, using lexical, syntactic, semantic, distributional, and rule-based features. Our system received the best or second-best scores in TIMEX3 span, EVENT span, and CONTAINS relation extraction.

Depression and Self-Harm Risk Assessment in Online Forums
Andrew Yates | Arman Cohan | Nazli Goharian
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Users suffering from mental health conditions often turn to online resources for support, including specialized online support communities or general communities such as Twitter and Reddit. In this work, we present a framework for supporting and studying users in both types of communities. We propose methods for identifying posts in support communities that may indicate a risk of self-harm, and demonstrate that our approach outperforms strong previously proposed methods for identifying such posts. Self-harm is closely related to depression, which makes identifying depressed users on general forums a crucial related task. We introduce a large-scale general forum dataset consisting of users with self-reported depression diagnoses matched with control users. We show how our method can be applied to effectively identify depressed users from their use of language alone. We demonstrate that our method outperforms strong baselines on this general forum dataset.

2016

Revisiting Summarization Evaluation for Scientific Articles
Arman Cohan | Nazli Goharian
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Evaluation of text summarization approaches has been mostly based on metrics that measure similarities of system-generated summaries with a set of human-written gold-standard summaries. The most widely used metric in summarization evaluation has been the ROUGE family. ROUGE relies solely on lexical overlaps between the terms and phrases in the sentences; therefore, in cases of terminology variations and paraphrasing, ROUGE is not as effective. Scientific article summarization is one such case that is different from general domain summarization (e.g. newswire data). We provide an extensive analysis of ROUGE’s effectiveness as an evaluation metric for scientific summarization; we show that, contrary to the common belief, ROUGE is not very reliable in evaluating scientific summaries. We furthermore show how different variants of ROUGE result in very different correlations with the manual Pyramid scores. Finally, we propose an alternative metric for summarization evaluation which is based on the content relevance between a system-generated summary and the corresponding human-written summaries. We call our metric SERA (Summarization Evaluation by Relevance Analysis). Unlike ROUGE, SERA consistently achieves high correlations with manual scores, which shows its effectiveness in evaluation of scientific article summarization.
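The weakness of purely lexical overlap under paraphrasing can be seen in a small sketch. The function below is a simplified unigram-recall score in the spirit of ROUGE-1; it is not the official ROUGE toolkit (no stemming, stopword handling, or multi-reference support) and it is not SERA. The example sentences are hypothetical.

```python
from collections import Counter


def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum((cand_counts & ref_counts).values())  # clipped match counts
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


# Hypothetical sentences: the paraphrase keeps the meaning but shares no words.
reference = "the drug reduces tumor growth"
verbatim = "the drug reduces tumor growth"
paraphrase = "this medication inhibits cancer progression"

print(unigram_recall(verbatim, reference))    # 1.0
print(unigram_recall(paraphrase, reference))  # 0.0
```

A faithful paraphrase scores zero here even though it conveys the same content, which is the kind of terminology variation the abstract argues is common in scientific summaries.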

Triaging Mental Health Forum Posts
Arman Cohan | Sydney Young | Nazli Goharian
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

GUIR at SemEval-2016 task 12: Temporal Information Processing for Clinical Narratives
Arman Cohan | Kevin Meurer | Nazli Goharian
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

Scientific Article Summarization Using Citation-Context and Article’s Discourse Structure
Arman Cohan | Nazli Goharian
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Matching Citation Text and Cited Spans in Biomedical Literature: a Search-Oriented Approach
Arman Cohan | Luca Soldaini | Nazli Goharian
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
