Cornelia Caragea


2022

A Data Cartography based MixUp for Pre-trained Language Models
Seo Yeon Park | Cornelia Caragea
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

MixUp is a data augmentation strategy where additional samples are generated during training by combining random pairs of training samples and their labels. However, selecting random pairs is potentially not an optimal choice. In this work, we propose TDMixUp, a novel MixUp strategy that leverages Training Dynamics and allows more informative samples to be combined for generating new data samples. Our proposed TDMixUp first measures confidence and variability (Swayamdipta et al., 2020) and Area Under the Margin (AUM) (Pleiss et al., 2020) to identify the characteristics of training samples (e.g., as easy-to-learn or ambiguous samples), and then interpolates these characterized samples. We empirically validate that our method not only achieves competitive performance using a smaller subset of the training data compared with strong baselines, but also yields lower expected calibration error on the pre-trained language model, BERT, on both in-domain and out-of-domain settings in a wide range of NLP tasks. We publicly release our code.
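
As a rough illustration of the ingredients named in this abstract, the sketch below computes confidence and variability from per-epoch gold-label probabilities (their mean and standard deviation, following the definitions in Swayamdipta et al., 2020) and interpolates one pair of samples with their labels. It is not the released TDMixUp code: the function names, the Beta(alpha, alpha) mixing coefficient with alpha = 0.4, and the use of plain feature vectors are assumptions made for illustration, and the step that decides which samples to pair is not shown.

```python
import random
import statistics


def confidence_and_variability(gold_probs):
    """Return (mean, population std) of the gold-label probability across epochs."""
    if not gold_probs:
        raise ValueError("need at least one epoch of probabilities")
    return statistics.fmean(gold_probs), statistics.pstdev(gold_probs)


def sample_lambda(alpha=0.4, rng=random):
    """Draw a mixing coefficient in [0, 1]; alpha = 0.4 is an illustrative value."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convexly combine two feature vectors and their one-hot (or soft) labels."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired samples must have matching shapes")
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y
```

With `lam` close to 1 the new sample stays near the first input; values near 0.5 produce the most strongly blended inputs and labels.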

On the Calibration of Pre-trained Language Models using Mixup Guided by Area Under the Margin and Saliency
Seo Yeon Park | Cornelia Caragea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A well-calibrated neural model produces confidence (probability outputs) closely approximated by the expected accuracy. While prior studies have shown that mixup training as a data augmentation technique can improve model calibration on image classification tasks, little is known about using mixup for model calibration on natural language understanding (NLU) tasks. In this paper, we explore mixup for model calibration on several NLU tasks and propose a novel mixup strategy for pre-trained language models that improves model calibration further. Our proposed mixup is guided by both the Area Under the Margin (AUM) statistic (Pleiss et al., 2020) and the saliency map of each sample (Simonyan et al., 2013). Moreover, we combine our mixup strategy with model miscalibration correction techniques (i.e., label smoothing and temperature scaling) and provide detailed analyses of their impact on our proposed mixup. We focus on systematically designing experiments on three NLU tasks: natural language inference, paraphrase detection, and commonsense reasoning. Our method achieves the lowest expected calibration error compared to strong baselines on both in-domain and out-of-domain test samples while maintaining competitive accuracy.
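
The expected calibration error reported in this abstract can be sketched as a weighted average, over confidence bins, of the gap between accuracy and mean confidence. The version below is a minimal sketch and not the authors' evaluation code: the function name, the choice of ten equal-width bins, and the half-open bin boundaries are assumptions of this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Assign each prediction to a bin by its confidence; 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model whose confidence matches its accuracy in every bin scores 0; larger values indicate over- or under-confidence.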

SciNLI: A Corpus for Natural Language Inference on Scientific Text
Mobashir Sadat | Cornelia Caragea
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing Natural Language Inference (NLI) datasets, while being instrumental in the advancement of Natural Language Understanding (NLU) research, are not related to scientific text. In this paper, we introduce SciNLI, a large dataset for NLI that captures the formality in scientific text and contains 107,412 sentence pairs extracted from scholarly papers on NLP and computational linguistics. Given that the text used in scientific literature differs vastly from the text used in everyday language both in terms of vocabulary and sentence structure, our dataset is well suited to serve as a benchmark for the evaluation of scientific NLU models. Our experiments show that SciNLI is harder to classify than the existing NLI datasets. Our best performing model with XLNet achieves a Macro F1 score of only 78.18% and an accuracy of 78.23%, showing that there is substantial room for improvement.

2021

Stance Detection in COVID-19 Tweets
Kyle Glandt | Sarthak Khanal | Yingjie Li | Doina Caragea | Cornelia Caragea
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The prevalence of the COVID-19 pandemic in day-to-day life has yielded large amounts of stance detection data on social media sites, as users turn to social media to share their views regarding various issues related to the pandemic, e.g., stay-at-home mandates and wearing face masks when out in public. We set out to make use of this data by collecting the stance expressed by Twitter users with respect to topics revolving around the pandemic. We annotate a new stance detection dataset, called COVID-19-Stance. Using this newly annotated dataset, we train several established stance detection models to ascertain a baseline performance for this specific task. To further improve the performance, we employ self-training and domain adaptation approaches to take advantage of large amounts of unlabeled data and existing stance detection datasets. The dataset, code, and other resources are available on GitHub.

eMLM: A New Pre-training Objective for Emotion Related Tasks
Tiberiu Sosea | Cornelia Caragea
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

BERT has been shown to be extremely effective on a wide variety of natural language processing tasks, including sentiment analysis and emotion detection. However, the proposed pretraining objectives of BERT do not induce any sentiment or emotion-specific biases into the model. In this paper, we present Emotion Masked Language Modelling, a variation of Masked Language Modelling aimed at improving the BERT language representation model for emotion detection and sentiment analysis tasks. Using the same pre-training corpora as the original model, Wikipedia and BookCorpus, our BERT variation manages to improve the downstream performance on 4 tasks from emotion detection and sentiment analysis by an average of 1.2% F1. Moreover, our approach shows an increased performance in our task-specific robustness tests.

Knowledge Distillation with BERT for Image Tag-Based Privacy Prediction
Chenye Zhao | Cornelia Caragea
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Text in the form of tags associated with online images is often informative for predicting private or sensitive content from images. When using privacy prediction systems running on social networking sites that decide whether each uploaded image should get posted or be protected, users may be reluctant to share real images that may reveal their identity but may share image tags. In such cases, privacy-aware tags become good indicators of image privacy and can be utilized to generate privacy decisions. In this paper, our aim is to learn tag representations for images to improve tag-based image privacy prediction. To achieve this, we explore self-distillation with BERT, in which we utilize knowledge in the form of soft probability distributions (soft labels) from the teacher model to help with the training of the student model. Our approach effectively learns better tag representations with improved performance on private image identification and outperforms state-of-the-art models for this task. Moreover, we utilize the idea of knowledge distillation to improve tag representations in a semi-supervised learning task. Our semi-supervised approach with only 20% of annotated data achieves similar performance compared with its supervised learning counterpart. Last, we provide a comprehensive analysis to get a better understanding of our approach.

Studying the Evolution of Scientific Topics and their Relationships
Ana Sabina Uban | Cornelia Caragea | Liviu P. Dinu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

A Multi-Task Learning Framework for Multi-Target Stance Detection
Yingjie Li | Cornelia Caragea
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

P-Stance: A Large Dataset for Stance Detection in Political Domain
Yingjie Li | Tiberiu Sosea | Aditya Sawant | Ajith Jayaraman Nair | Diana Inkpen | Cornelia Caragea
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Distilling Knowledge for Empathy Detection
Mahshid Hosseini | Cornelia Caragea
Findings of the Association for Computational Linguistics: EMNLP 2021

Empathy is the link between self and others. Detecting and understanding empathy is a key element for improving human-machine interaction. However, annotating data for detecting empathy at a large scale is a challenging task. This paper employs multi-task training with knowledge distillation to incorporate knowledge from available resources (emotion and sentiment) to detect empathy from natural language in different domains. This approach yields better results on an existing news-related empathy dataset compared to strong baselines. In addition, we build a new dataset from Twitter for empathy prediction with fine-grained empathy direction (seeking or providing empathy). We release our dataset for research purposes.

Target-Aware Data Augmentation for Stance Detection
Yingjie Li | Cornelia Caragea
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The goal of stance detection is to identify whether the author of a text is in favor of, neutral toward, or against a specific target. Despite substantial progress on this task, one of the remaining challenges is the scarcity of annotations. Data augmentation is commonly used to address annotation scarcity by generating more training samples. However, the augmented sentences that are generated by existing methods are either less diversified or inconsistent with the given target and stance label. In this paper, we formulate the data augmentation of stance detection as a conditional masked language modeling task and augment the dataset by predicting the masked word conditioned on both its context and the auxiliary sentence that contains target and label information. Moreover, we propose another simple yet effective method that generates target-aware sentences by replacing a target mention with another. Experimental results show that our proposed methods significantly outperform previous augmentation methods on 11 targets.

Identifying Medical Self-Disclosure in Online Communities
Mina Valizadeh | Pardis Ranjbar-Noiey | Cornelia Caragea | Natalie Parde
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Self-disclosure in online health conversations may offer a host of benefits, including earlier detection and treatment of medical issues that may have otherwise gone unaddressed. However, research analyzing medical self-disclosure in online communities is limited. We address this shortcoming by introducing a new dataset of health-related posts collected from online social platforms, categorized into three groups (No Self-Disclosure, Possible Self-Disclosure, and Clear Self-Disclosure) with high inter-annotator agreement (k = 0.88). We make this data available to the research community. We also release a predictive model trained on this dataset that achieves an accuracy of 81.02%, establishing a strong performance benchmark for this task.

Improving Stance Detection with Multi-Dataset Learning and Knowledge Distillation
Yingjie Li | Chenye Zhao | Cornelia Caragea
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Stance detection determines whether the author of a text is in favor of, against, or neutral to a specific target and provides valuable insights into important events such as legalization of abortion. Despite significant progress on this task, one of the remaining challenges is the scarcity of annotations. Besides, most previous works focused on hard-label training, in which meaningful similarities among categories are discarded during training. To address these challenges, first, we evaluate multi-target and multi-dataset training settings by training one model on each dataset and on datasets of different domains, respectively. We show that models can learn more universal representations with respect to targets in these settings. Second, we investigate knowledge distillation in stance detection and observe that transferring knowledge from a teacher model to a student model can be beneficial in our proposed training settings. Moreover, we propose an Adaptive Knowledge Distillation (AKD) method that applies instance-specific temperature scaling to the teacher and student predictions. Results show that the multi-dataset model performs best on all datasets and that it can be further improved by the proposed AKD, outperforming the state-of-the-art by a large margin. We publicly release our code.
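
To make the role of temperature in knowledge distillation concrete, the sketch below softens teacher and student logits with a temperature and scores the student against the teacher's soft labels. It shows generic temperature-scaled distillation only: the abstract does not say how AKD picks its instance-specific temperature, so here the temperature is simply passed in by the caller, and all function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; higher temperatures flatten the output."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    teacher = softmax_with_temperature(teacher_logits, temperature)
    student = softmax_with_temperature(student_logits, temperature)
    # Floor the student probability to avoid log(0) on extreme logits.
    return -sum(p * math.log(max(q, 1e-12)) for p, q in zip(teacher, student))
```

Unlike a hard label, the softened teacher distribution keeps non-zero mass on related classes, which is the similarity information the abstract says hard-label training discards.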

Exploiting Position and Contextual Word Embeddings for Keyphrase Extraction from Scientific Papers
Krutarth Patel | Cornelia Caragea
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Keyphrases associated with research papers provide an effective way to find useful information in the large and growing scholarly digital collections. In this paper, we present KPRank, an unsupervised graph-based algorithm for keyphrase extraction that incorporates both positional information and contextual word embeddings into a biased PageRank. Our experimental results on five benchmark datasets show that KPRank, which uses contextual word embeddings with an additional position signal, outperforms previous approaches and strong baselines for this task.

2020

On the Use of Web Search to Improve Scientific Collections
Krutarth Patel | Cornelia Caragea | Sujatha Das Gollapalli
Proceedings of the First Workshop on Scholarly Document Processing

Despite the advancements in search engine features, ranking methods, technologies, and the availability of programmable APIs, current-day open-access digital libraries still rely on crawl-based approaches for acquiring their underlying document collections. In this paper, we propose a novel search-driven framework for acquiring documents for such scientific portals. Within our framework, publicly available research paper titles and author names are used as queries to a Web search engine. We were able to obtain ~267,000 unique research papers through our fully automated framework using ~76,000 queries, resulting in almost 200,000 more papers than the number of queries. Moreover, through a combination of title and author name search, we were able to recover 78% of the original searched titles.

CancerEmo: A Dataset for Fine-Grained Emotion Detection
Tiberiu Sosea | Cornelia Caragea
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Emotions are an important element of human nature, often affecting the overall wellbeing of a person. Therefore, it is no surprise that the health domain is a valuable area of interest for emotion detection, as it can provide medical staff or caregivers with essential information about patients. However, progress on this task has been hampered by the absence of large labeled datasets. To this end, we introduce CancerEmo, an emotion dataset created from an online health community and annotated with eight fine-grained emotions. We perform a comprehensive analysis of these emotions and develop deep learning models on the newly created dataset. Our best BERT model achieves an average F1 of 71%, which we improve further using domain-specific pre-training.

Scientific Keyphrase Identification and Classification by Pre-Trained Language Models Intermediate Task Transfer Learning
Seoyeon Park | Cornelia Caragea
Proceedings of the 28th International Conference on Computational Linguistics

Scientific keyphrase identification and classification is the task of detecting keyphrases in scholarly text and classifying them into types from a set of predefined classes. This task has a wide range of benefits, but it remains challenging due to the lack of large amounts of labeled data required for training deep neural models. In order to overcome this challenge, we explore pre-trained language models BERT and SciBERT with intermediate task transfer learning, using 42 data-rich related intermediate-target task combinations. We reveal that intermediate task transfer learning on SciBERT induces a better starting point for target task fine-tuning compared with BERT and achieves competitive performance in scientific keyphrase identification and classification compared to both previous works and strong baselines. Interestingly, we observe that BERT with intermediate task transfer learning fails to improve the performance of scientific keyphrase identification and classification, potentially due to significant catastrophic forgetting. This result highlights that scientific knowledge acquired during the pre-training of language models on large scientific collections plays an important role in the target tasks. We also observe that sequence tagging related intermediate tasks, especially syntactic structure learning tasks such as POS Tagging, tend to work best for scientific keyphrase identification and classification.

Detecting Perceived Emotions in Hurricane Disasters
Shrey Desai | Cornelia Caragea | Junyi Jessy Li
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Natural disasters (e.g., hurricanes) affect millions of people each year, causing widespread destruction in their wake. People have recently taken to social media websites (e.g., Twitter) to share their sentiments and feelings with the larger community. Consequently, these platforms have become instrumental in understanding and perceiving emotions at scale. In this paper, we introduce HurricaneEmo, an emotion dataset of 15,000 English tweets spanning three hurricanes: Harvey, Irma, and Maria. We present a comprehensive study of fine-grained emotions and propose classification tasks to discriminate between coarse-grained emotion groups. Our best BERT model, even after task-guided pre-training which leverages unlabeled Twitter data, achieves only 68% accuracy (averaged across all groups). HurricaneEmo serves not only as a challenging benchmark for models but also as a valuable resource for analyzing emotions in disaster-centric domains.

Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup
Jishnu Ray Chowdhury | Cornelia Caragea | Doina Caragea
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Distinguishing informative and actionable messages from a social media platform like Twitter is critical for facilitating disaster management. For this purpose, we compile a multilingual dataset of over 130K samples for multi-label classification of disaster-related tweets. We present a masking-based loss function for partially labelled samples and demonstrate the effectiveness of Manifold Mixup in the text domain. Our main model is based on Multilingual BERT, which we further improve with Manifold Mixup. We show that our model generalizes to unseen disasters in the test set. Furthermore, we analyze the capability of our model for zero-shot generalization to new languages. Our code, dataset, and other resources are available on GitHub.

Dynamic Classification in Web Archiving Collections
Krutarth Patel | Cornelia Caragea | Mark Phillips
Proceedings of the 12th Language Resources and Evaluation Conference

Web-archived data usually contains high-quality documents that are very useful for creating specialized collections of documents. To create such collections, there is a substantial need for automatic approaches that can distinguish the documents of interest for a collection from the large collections (millions of documents in size) held by Web archiving institutions. However, the patterns of the documents of interest can differ substantially from one document to another, which makes the automatic classification task very challenging. In this paper, we explore dynamic fusion models to find, on the fly, the model or combination of models that performs best on a variety of document types. Our experimental results show that the approach that fuses different models outperforms individual models and other ensemble methods on three datasets.

2019

The Myth of Double-Blind Review Revisited: ACL vs. EMNLP
Cornelia Caragea | Ana Uban | Liviu P. Dinu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The review and selection process for scientific paper publication is essential for the quality of scholarly publications in a scientific field. The double-blind review system, which enforces author anonymity during the review period, is widely used by prestigious conferences and journals to ensure the integrity of this process. Although the notion of anonymity in the double-blind review has been questioned before, the availability of full text paper collections brings new opportunities for exploring the question: Is the double-blind review process really double-blind? We study this question on the ACL and EMNLP paper collections and present an analysis of how well deep learning techniques can infer the authors of a paper. Specifically, we explore Convolutional Neural Networks trained on various aspects of a paper, e.g., content, style features, and references, to understand the extent to which we can infer the authors of a paper and what aspects contribute the most. Our results show that the authors of a paper can be inferred with accuracy as high as 87% on ACL and 78% on EMNLP for the top 100 most prolific authors.

Multi-Task Stance Detection with Sentiment and Stance Lexicons
Yingjie Li | Cornelia Caragea
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Stance detection aims to detect whether the opinion holder is in support of or against a given target. Recent works show improvements in stance detection by using either the attention mechanism or sentiment information. In this paper, we propose a multi-task framework that incorporates a target-specific attention mechanism and at the same time takes sentiment classification as an auxiliary task. Moreover, we use a sentiment lexicon and construct a stance lexicon to provide guidance for the attention layer. Experimental results show that the proposed model significantly outperforms state-of-the-art deep learning methods on the SemEval-2016 dataset.

2018

Exploring Optimism and Pessimism in Twitter Using Deep Learning
Cornelia Caragea | Liviu P. Dinu | Bogdan Dumitru
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Identifying optimistic and pessimistic viewpoints and users on Twitter is useful for providing better social support to those who need it, for minimizing negative influence among users, and for maximizing the spread of positive attitudes and ideas. In this paper, we explore a range of deep learning models to predict optimism and pessimism on Twitter at both tweet and user level and show that these models substantially outperform traditional machine learning classifiers used in prior work. In addition, we show evidence that a sentiment classifier would not be sufficient for accurately predicting optimism and pessimism on Twitter. Last, we study verb tense usage as well as the presence of polarity words in optimistic and pessimistic tweets.

Fine-Grained Emotion Detection in Health-Related Online Posts
Hamed Khanpour | Cornelia Caragea
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Detecting fine-grained emotions in online health communities provides insightful information about patients’ emotional states. However, current computational approaches to emotion detection from health-related posts focus only on identifying messages that contain emotions, with no emphasis on the emotion type, using a set of handcrafted features. In this paper, we take a step further and propose to detect fine-grained emotion types from health-related posts and show how high-level and abstract features derived from deep neural networks combined with lexicon-based features can be employed to detect emotions.

2017

Identifying Empathetic Messages in Online Health Communities
Hamed Khanpour | Cornelia Caragea | Prakhar Biyani
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Empathy captures one’s ability to relate to and understand others’ emotional states and experiences. Messages with empathetic content are considered one of the main advantages of joining online health communities due to their potential to improve people’s moods. Unfortunately, to date, no computational studies exist that automatically identify empathetic messages in online health communities. We propose a combination of Convolutional Neural Networks (CNN) and Long Short Term Memory (LSTM) networks, and show that the proposed model outperforms each individual model (CNN and LSTM) as well as several baselines.

PositionRank: An Unsupervised Approach to Keyphrase Extraction from Scholarly Documents
Corina Florescu | Cornelia Caragea
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The large and growing amounts of online scholarly data present both challenges and opportunities to enhance knowledge discovery. One such challenge is to automatically extract a small set of keyphrases from a document that can accurately describe the document’s content and can facilitate fast information processing. In this paper, we propose PositionRank, an unsupervised model for keyphrase extraction from scholarly documents that incorporates information from all positions of a word’s occurrences into a biased PageRank. Our model obtains remarkable improvements in performance over PageRank models that do not take into account word positions as well as over strong baselines for this task. Specifically, on several datasets of research papers, PositionRank achieves improvements as high as 29.09%.
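
The idea of folding word positions into a biased PageRank can be sketched as follows. This is a simplified illustration rather than the published PositionRank implementation: the inverse-position weighting of each occurrence, the co-occurrence window, the damping factor of 0.85, the fixed iteration count, and the function name are all assumptions of this example, and candidate-phrase formation and filtering are omitted.

```python
from collections import defaultdict


def position_biased_pagerank(tokens, window=1, damping=0.85, iterations=50):
    """Score words with PageRank whose teleport vector favors early, frequent words."""
    # Bias: sum 1/position over every occurrence of a word, then normalize.
    bias = defaultdict(float)
    for position, token in enumerate(tokens, start=1):
        bias[token] += 1.0 / position
    total = sum(bias.values())
    bias = {token: weight / total for token, weight in bias.items()}

    # Undirected co-occurrence graph: link each token to the next `window` tokens.
    weights = defaultdict(lambda: defaultdict(float))
    for i, a in enumerate(tokens):
        for b in tokens[i + 1 : i + 1 + window]:
            if a != b:
                weights[a][b] += 1.0
                weights[b][a] += 1.0

    # Power iteration, starting from the bias distribution.
    scores = dict(bias)
    for _ in range(iterations):
        updated = {}
        for token in bias:
            incoming = sum(
                scores[neighbor] * w / sum(weights[neighbor].values())
                for neighbor, w in weights[token].items()
            )
            updated[token] = (1.0 - damping) * bias[token] + damping * incoming
        scores = updated
    return scores
```

In this sketch a word that appears early and often receives a larger share of the teleport probability, so it ranks higher than an equally connected word that appears only late in the document.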

2016

Supervised Keyphrase Extraction as Positive Unlabeled Learning
Lucas Sterckx | Cornelia Caragea | Thomas Demeester | Chris Develder
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

Co-Training for Topic Classification of Scholarly Data
Cornelia Caragea | Florin Bulgarov | Rada Mihalcea
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction
Sujatha Das Gollapalli | Cornelia Caragea | Xiaoli Li | C. Lee Giles
Proceedings of the ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction

2014

Citation-Enhanced Keyphrase Extraction from Research Papers: A Supervised Approach
Cornelia Caragea | Florin Adrian Bulgarov | Andreea Godea | Sujatha Das Gollapalli
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Identifying Emotional and Informational Support in Online Health Communities
Prakhar Biyani | Cornelia Caragea | Prasenjit Mitra | John Yen
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2012

Thread Specific Features are Helpful for Identifying Subjectivity Orientation of Online Forum Threads
Prakhar Biyani | Sumit Bhatia | Cornelia Caragea | Prasenjit Mitra
Proceedings of COLING 2012