Thamar Solorio - ACL Anthology

Thamar Solorio

2026

A Scalable Framework for Automated NER Annotation Correction in Low-Resource Languages
Toqeer Ehsan | Thamar Solorio
Findings of the Association for Computational Linguistics: EACL 2026

Poor quality or noisy annotations in Named Entity Recognition (NER), as in any other NLP task, make it challenging to achieve state-of-the-art performance. In this paper, we present a multi-step framework to enhance the annotation quality of NER datasets by employing automated techniques. We propose a frequency-based iterative approach that leverages self-training and a dual-threshold mechanism to enhance inference confidence. Experimental evaluations on different NER datasets demonstrate significant improvements in NER performance with respect to the original datasets. This work further explores the potential of generative Large Language Models (LLMs) to perform NER for low-resource languages.

Beyond Understanding: Evaluating the Pragmatic Gap in LLMs’ Cultural Processing of Figurative Language
Mena Attia | Aashiq Muhamed | Mai Alkhamissi | Thamar Solorio | Mona T. Diab
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a comprehensive evaluation of large language models’ (LLMs) ability to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and social nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation across Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Results show a consistent hierarchy: accuracy on Arabic proverbs is 4.29% lower than on English proverbs, and performance on Egyptian idioms is 10.28% lower than on Arabic proverbs. On the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing idioms’ contextual sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with full inter-annotator agreement. Figurative language thus serves as an effective diagnostic for cultural reasoning, revealing that while LLMs often interpret figurative meaning, they still face major challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.

2025

Emotion Train at SemEval-2025 Task 11: Comparing Generative and Discriminative Models in Emotion Recognition
Anastasiia Demidova | Injy Hamed | Teresa Lynn | Thamar Solorio
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

The emotion recognition task has become increasingly popular as it has a wide range of applications in many fields, such as mental health, product management, and population mood state monitoring. SemEval 2025 Task 11 Track A framed the emotion recognition problem as a multi-label classification task. This paper presents our proposed system submissions in the following languages: English, Algerian and Moroccan Arabic, Brazilian and Mozambican Portuguese, German, Spanish, Nigerian-Pidgin, Russian, and Swedish. Here, we compare the emotion-detecting abilities of generative and discriminative pre-trained language models, exploring multiple approaches, including curriculum learning, in-context learning, and instruction and few-shot fine-tuning. We also propose an extended architecture method with a feature fusion technique enriched with emotion scores and a self-attention mechanism. We find that BERT-based models fine-tuned on data of a corresponding language achieve the best results across multiple languages for multi-label text-based emotion classification, outperforming both baseline and generative models.

Enhancing Depression Detection via Question-wise Modality Fusion
Aishik Mandal | Dana Atzil-Slonim | Thamar Solorio | Iryna Gurevych
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)

Depression is a highly prevalent and disabling condition that incurs substantial personal and societal costs. Current depression diagnosis involves determining the depression severity of a person through self-reported questionnaires or interviews conducted by clinicians. This often leads to delayed treatment and involves substantial human resources. Thus, several works try to automate the process using multimodal data. However, they usually overlook the following: i) The variable contribution of each modality for each question in the questionnaire and ii) Using ordinal classification for the task. This results in sub-optimal fusion and training methods. In this work, we propose a novel Question-wise Modality Fusion (QuestMF) framework trained with a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues. The performance of our framework is comparable to the current state-of-the-art models on the E-DAIC dataset and enhances interpretability by predicting scores for each question. This will help clinicians identify an individual’s symptoms, allowing them to customise their interventions accordingly. We also make the code for the QuestMF framework publicly available.

Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation
Toqeer Ehsan | Thamar Solorio
Proceedings of the Tenth Workshop on Noisy and User-generated Text

Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, in this paper, we propose a data augmentation technique that generates culturally plausible sentences and experiments on four low-resource Pakistani languages; Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

MoMentS: A Comprehensive Multimodal Benchmark for Theory of Mind
Emilio Villa-Cueva | S M Masrur Ahmed | Rendi Chevi | Jan Christian Blaise Cruz | Kareem Elzeky | Fermin Cristobal | Alham Fikri Aji | Skyler Wang | Rada Mihalcea | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2025

Understanding Theory of Mind is essential for building socially intelligent multimodal agents capable of perceiving and interpreting human behavior. We introduce MoMentS (Multimodal Mental States), a comprehensive benchmark designed to assess the ToM capabilities of multimodal large language models (LLMs) through realistic, narrative-rich scenarios presented in short films. MoMentS includes over 2,300 multiple-choice questions spanning seven distinct ToM categories. The benchmark features long video context windows and realistic social interactions that provide deeper insight into characters’ mental states. We evaluate several MLLMs and find that although vision generally improves performance, models still struggle to integrate it effectively. For audio, models that process dialogues as audio do not consistently outperform transcript-based inputs. Our findings highlight the need to improve multimodal integration and point to open challenges that must be addressed to advance AI’s social understanding.

ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding
Israel Abebe Azime | Atnafu Lambebo Tonja | Tadesse Destaw Belay | Yonas Chanie | Bontu Fufa Balcha | Negasi Haile Abadi | Henok Biadglign Ademtew | Mulubrhan Abebe Nerea | Debela Desalegn Yadeta | Derartu Dagne Geremew | Assefa Atsbiha Tesfu | Philipp Slusallek | Thamar Solorio | Dietrich Klakow
Findings of the Association for Computational Linguistics: NAACL 2025

A Survey of Code-switched Arabic NLP: Progress, Challenges, and Future Directions
Injy Hamed | Caroline Sabty | Slim Abdennadher | Ngoc Thang Vu | Thamar Solorio | Nizar Habash
Proceedings of the 31st International Conference on Computational Linguistics

Language in the Arab world presents a complex diglossic and multilingual setting, involving the use of Modern Standard Arabic, various dialects and sub-dialects, as well as multiple European languages. This diverse linguistic landscape has given rise to code-switching, both within Arabic varieties and between Arabic and foreign languages. The widespread occurrence of code-switching across the region makes it vital to address these linguistic needs when developing language technologies. In this paper, we provide a review of the current literature in the field of code-switched Arabic NLP, offering a broad perspective on ongoing efforts, challenges, research gaps, and recommendations for future research directions.

Translating cultural content poses challenges for machine translation systems due to the differences in conceptualizations between cultures, where language alone may fail to convey sufficient context to capture region-specific meanings. In this work, we investigate whether images can act as cultural context in multimodal translation. We introduce CaMMT, a human-curated benchmark of over 5,800 triples of images along with parallel captions in English and regional languages. Using this dataset, we evaluate five Vision Language Models (VLMs) in text-only and text+image settings. Through automatic and human evaluations, we find that visual context generally improves translation quality, especially in handling Culturally-Specific Items (CSIs), disambiguation, and correct gender marking. By releasing CaMMT, our objective is to support broader efforts to build and evaluate multimodal translation systems that are better aligned with cultural nuance and regional variations.

Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching
Genta Indra Winata | Sudipta Kar | Marina Zhukova | Thamar Solorio | Xi Ai | Injy Hamed | Mahardika Krisna Krisna Ihsani | Derry Tanti Wijaya | Garry Kuwanto
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching

2024

Context-aware Adversarial Attack on Named Entity Recognition
Shuguang Chen | Leonardo Neves | Thamar Solorio
Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024)

In recent years, large pre-trained language models (PLMs) have achieved remarkable performance on many natural language processing benchmarks. Despite their success, prior studies have shown that PLMs are vulnerable to attacks from adversarial examples. In this work, we focus on the named entity recognition task and study context-aware adversarial attack methods to examine the model’s robustness. Specifically, we propose perturbing the most informative words for recognizing entities to create adversarial examples and investigate different candidate replacement methods to generate natural and plausible adversarial examples. Experiments and analyses show that our methods are more effective in deceiving the model into making wrong predictions than strong baselines.

Positive and Risky Message Assessment for Music Products
Yigeng Zhang | Mahsa Shafaei | Fabio Gonzalez | Thamar Solorio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this work, we introduce a pioneering research challenge: evaluating positive and potentially harmful messages within music products. We initiate by setting a multi-faceted, multi-task benchmark for music content assessment. Subsequently, we introduce an efficient multi-task predictive model fortified with ordinality-enforcement to address this challenge. Our findings reveal that the proposed method not only significantly outperforms robust task-specific alternatives but also possesses the capability to assess multiple aspects simultaneously. Furthermore, through detailed case studies, where we employed Large Language Models (LLMs) as surrogates for content assessment, we provide valuable insights to inform and guide future research on this topic. The code for dataset creation and model implementation is publicly available at https://github.com/RiTUAL-UH/music-message-assessment.

OATS: A Challenge Dataset for Opinion Aspect Target Sentiment Joint Detection for Aspect-Based Sentiment Analysis
Siva Uday Sampreeth Chebolu | Franck Dernoncourt | Nedim Lipka | Thamar Solorio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Aspect-based sentiment analysis (ABSA) delves into understanding sentiments specific to distinct elements within a user-generated review. It aims to analyze user-generated reviews to determine a) the target entity being reviewed, b) the high-level aspect to which it belongs, c) the sentiment words used to express the opinion, and d) the sentiment expressed toward the targets and the aspects. While various benchmark datasets have fostered advancements in ABSA, they often come with domain limitations and data granularity challenges. Addressing these, we introduce the OATS dataset, which encompasses three fresh domains and consists of 27,470 sentence-level quadruples and 17,092 review-level tuples. Our initiative seeks to bridge specific observed gaps in existing datasets: the recurrent focus on familiar domains like restaurants and laptops, limited data for intricate quadruple extraction tasks, and an occasional oversight of the synergy between sentence and review-level sentiments. Moreover, to elucidate OATS’s potential and shed light on various ABSA subtasks that OATS can solve, we conducted experiments, establishing initial baselines. We hope the OATS dataset augments current resources, paving the way for an encompassing exploration of ABSA (https://github.com/RiTUAL-UH/OATS-ABSA).

Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.

Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei | Mahsa Shafaei | Yigeng Zhang | Hugo Jair Escalante | Thamar Solorio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We address the challenge of detecting questionable content in online media, specifically the subcategory of comic mischief. This type of content combines elements such as violence, adult content, or sarcasm with humor, making it difficult to detect. Employing a multimodal approach is vital to capture the subtle details inherent in comic mischief content. To tackle this problem, we propose a novel end-to-end multimodal system for the task of comic mischief detection. As part of this contribution, we release a novel dataset for the targeted task consisting of three modalities: video, text (video captions and subtitles), and audio. We also design a HIerarchical Cross-attention model with CAPtions (HICCAP) to capture the intricate relationships among these modalities. The results show that the proposed approach makes a significant improvement over robust baselines and state-of-the-art models for comic mischief detection and its type classification. This emphasizes the potential of our system to empower users, to make informed decisions about the online content they choose to see.

Question-Instructed Visual Descriptions for Zero-Shot Video Answering
David Mogrovejo | Thamar Solorio
Findings of the Association for Computational Linguistics: ACL 2024

We present Q-ViD, a simple approach for video question answering (video QA), that unlike prior methods, which are based on complex architectures, computationally expensive pipelines or use closed models like GPTs, Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) to tackle videoQA using frame descriptions. Specifically, we create captioning instruction prompts that rely on the target questions about the videos and leverage InstructBLIP to obtain video frame captions that are useful to the task at hand. Subsequently, we form descriptions of the whole video using the question-dependent frame captions, and feed that information, along with a question-answering prompt, to a large language model (LLM). The LLM is our reasoning module, and performs the final step of multiple-choice QA. Our simple Q-ViD framework achieves competitive or even higher performances than current state of the art models on a diverse range of videoQA benchmarks, including NExT-QA, STAR, How2QA, TVQA and IntentQA.

We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.

Adaptive Cross-lingual Text Classification through In-Context One-Shot Demonstrations
Emilio Cueva | Adrian Lopez Monroy | Fernando Sánchez-Vega | Thamar Solorio
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Zero-Shot Cross-lingual Transfer (ZS-XLT) utilizes a model trained in a source language to make predictions in another language, often with a performance loss. To alleviate this, additional improvements can be achieved through subsequent adaptation using examples in the target language. In this paper, we exploit In-Context Tuning (ICT) for One-Shot Cross-lingual transfer in the classification task by introducing In-Context Cross-lingual Transfer (IC-XLT). The novel concept involves training a model to learn from context examples and subsequently adapting it during inference to a target language by prepending a One-Shot context demonstration in that language. Our results show that IC-XLT successfully leverages target-language examples to improve the cross-lingual capabilities of the evaluated mT5 model, outperforming prompt-based models in the Zero and Few-shot scenarios adapted through fine-tuning. Moreover, we show that when source-language data is limited, the fine-tuning framework employed for IC-XLT performs comparably to prompt-based fine-tuning with significantly more training data in the source language.

The Zeno’s Paradox of ‘Low-Resource’ Languages
Hellina Hailu Nigatu | Atnafu Lambebo Tonja | Benjamin Rosman | Thamar Solorio | Monojit Choudhury
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low vs high-resourced. However, there is limited consensus on what exactly qualifies as a ‘low-resource language.’ To understand how NLP papers define and study ‘low resource’ languages, we qualitatively analyzed 150 papers from the ACL Anthology and popular speech-processing conferences that mention the keyword ‘low-resource.’ Based on our analysis, we show how several interacting axes contribute to ‘low-resourcedness’ of a language and why that makes it difficult to track progress for each individual language. We hope our work (1) elicits explicit definitions of the terminology when it is used in papers and (2) provides grounding for the different axes to consider when connoting a language as low-resource.

NLP Progress in Indigenous Latin American Languages
Atnafu Tonja | Fazlourrahman Balouchzahi | Sabur Butt | Olga Kolesnikova | Hector Ceballos | Alexander Gelbukh | Thamar Solorio
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancements. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancements that respect indigenous community perspectives. We show the NLP progress of indigenous Latin American languages and the survey that covers the status of indigenous languages in Latin America, their representation in NLP, and the challenges and innovations required for their preservation and development. The paper contributes to the current literature in understanding the need and progress of NLP for indigenous communities of Latin America, specifically low-resource and indigenous communities in general.

Interpreting Themes from Educational Stories
Yigeng Zhang | Fabio Gonzalez | Thamar Solorio
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Reading comprehension continues to be a crucial research focus in the NLP community. Recent advances in Machine Reading Comprehension (MRC) have mostly centered on literal comprehension, referring to the surface-level understanding of content. In this work, we focus on the next level - interpretive comprehension, with a particular emphasis on inferring the themes of a narrative text. We introduce the first dataset specifically designed for interpretive comprehension of educational narratives, providing corresponding well-edited theme texts. The dataset spans a variety of genres and cultural origins and includes human-annotated theme keywords with varying levels of granularity. We further formulate NLP tasks under different abstractions of interpretive comprehension toward the main idea of a story. After conducting extensive experiments with state-of-the-art methods, we found the task to be both challenging and significant for NLP research. The dataset and source code have been made publicly available to the research community at https://github.com/RiTUAL-UH/EduStory.

2023

SafeWebUH at SemEval-2023 Task 11: Learning Annotator Disagreement in Derogatory Text: Comparison of Direct Training vs Aggregation
Sadat Shahriar | Thamar Solorio
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Subjectivity and difference of opinion are key social phenomena, and it is crucial to take these into account in the annotation and detection process of derogatory textual content. In this paper, we use four datasets provided by SemEval-2023 Task 11 and fine-tune a BERT model to capture the disagreement in the annotation. We find individual annotator modeling and aggregation lowers the Cross-Entropy score by an average of 0.21, compared to the direct training on the soft labels. Our findings further demonstrate that annotator metadata contributes to the average 0.029 reduction in the Cross-Entropy score.

Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching
Genta Winata | Sudipta Kar | Marina Zhukova | Thamar Solorio | Mona Diab | Sunayana Sitaram | Monojit Choudhury | Kalika Bali
Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching

A Review of Datasets for Aspect-based Sentiment Analysis
Siva Uday Sampreeth Chebolu | Franck Dernoncourt | Nedim Lipka | Thamar Solorio
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

The Decades Progress on Code-Switching Research in NLP: A Systematic Survey on Trends and Challenges
Genta Winata | Alham Fikri Aji | Zheng Xin Yong | Thamar Solorio
Findings of the Association for Computational Linguistics: ACL 2023

Code-Switching, a common phenomenon in written text and conversation, has been studied over decades by the natural language processing (NLP) research community. Initially, code-switching is intensively explored by leveraging linguistic theories and, currently, more machine-learning oriented approaches to develop models. We introduce a comprehensive systematic survey on code-switching research in natural language processing to understand the progress of the past decades and conceptualize the challenges and tasks on the code-switching topic. Finally, we summarize the trends and findings and conclude with a discussion for future direction and open questions for further investigation.

Distillation of encoder-decoder transformers for sequence labelling
Marco Farina | Duccio Pappadopulo | Anant Gupta | Leslie Huang | Ozan Irsoy | Thamar Solorio
Findings of the Association for Computational Linguistics: EACL 2023

Driven by encouraging results on a wide range of tasks, the field of NLP is experiencing an accelerated race to develop bigger language models. This race for bigger models has also underscored the need to continue the pursuit of practical distillation approaches that can leverage the knowledge acquired by these big models in a compute-efficient manner. Having this goal in mind, we build on recent work to propose a hallucination-free framework for sequence tagging that is especially suited for distillation. We show empirical results of new state-of-the-art performance across multiple sequence labelling datasets and validate the usefulness of this framework for distilling a large model in a few-shot learning scenario.

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: how capable are these systems in generating code-mixed data? In this paper, we explore prompting multilingual LLMs in a zero-shot manner to generate code-mixed data for seven languages in South East Asia (SEA), namely Indonesian, Malay, Chinese, Tagalog, Vietnamese, Tamil, and Singlish. We find that publicly available multilingual instruction-tuned models such as BLOOMZ and Flan-T5-XXL are incapable of producing texts with phrases or clauses from different languages. ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its per-formance varies depending on the prompt template and language pairing. For instance, ChatGPT generates fluent and natural Singlish texts (an English-based creole spoken in Singapore), but for English-Tamil language pair, the system mostly produces grammatically incorrect or semantically meaningless utterances. Furthermore, it may erroneously introduce languages not specified in the prompt. Based on our investigation, existing multilingual LLMs exhibit a wide range of proficiency in code-mixed data generation for SEA languages. As such, we advise against using LLMs in this context without extensive human checks.

2022

Style Transfer as Data Augmentation: A Case Study on Named Entity Recognition
Shuguang Chen | Leonardo Neves | Thamar Solorio
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

In this work, we take the named entity recognition task in the English language as a case study and explore style transfer as a data augmentation method to increase the size and diversity of training data in low-resource scenarios. We propose a new method to effectively transform the text from a high-resource domain to a low-resource domain by changing its style-related attributes to generate synthetic data for training. Moreover, we design a constrained decoding algorithm along with a set of key ingredients for data selection to guarantee the generation of valid and coherent data. Experiments and analysis on five different domain pairs under different data regimes demonstrate that our approach can significantly improve results compared to current state-of-the-art data augmentation methods. Our approach is a practical solution to data scarcity, and we expect it to be applicable to other NLP tasks.

Cross-lingual Few-Shot Learning on Unseen Languages
Genta Winata | Shijie Wu | Mayank Kulkarni | Thamar Solorio | Daniel Preotiuc-Pietro
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Large pre-trained language models (LMs) have demonstrated the ability to obtain good performance on downstream tasks with limited examples in cross-lingual settings. However, this was mostly studied for relatively resource-rich languages, where at least enough unlabeled data is available to be included in pre-training a multilingual language model. In this paper, we explore the problem of cross-lingual transfer in unseen languages, where no unlabeled data is available for pre-training a model. We use a downstream sentiment analysis task across 12 languages, including 8 unseen languages, to analyze the effectiveness of several few-shot learning strategies across the three major types of model architectures and their learning dynamics. We also compare strategies for selecting languages for transfer and contrast findings across languages seen in pre-training compared to those that are not. Our findings contribute to the body of knowledge on cross-lingual models for low-resource settings that is paramount to increasing coverage, diversity, and equity in access to NLP technology. We show that, in few-shot learning, linguistically similar and geographically similar languages are useful for cross-lingual adaptation, but taking the context from a mixture of random source languages is surprisingly more effective. We also compare different model architectures and show that the encoder-only model, XLM-R, gives the best downstream task performance.

2021

From None to Severe: Predicting Severity in Movie Scripts
Yigeng Zhang | Mahsa Shafaei | Fabio Gonzalez | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2021

In this paper, we introduce the task of predicting severity of age-restricted aspects of movie content based solely on the dialogue script. We first investigate categorizing the ordinal severity of movies on 5 aspects: Sex, Violence, Profanity, Substance consumption, and Frightening scenes. The problem is handled using a siamese network-based multitask framework which concurrently improves the interpretability of the predictions. The experimental results show that our method outperforms the previous state-of-the-art model and provides useful information to interpret model predictions. The proposed dataset and source code are publicly available at our GitHub repository.

Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Thamar Solorio | Shuguang Chen | Alan W. Black | Mona Diab | Sunayana Sitaram | Victor Soto | Emre Yilmaz | Anirudh Srinivasan
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Char2Subword: Extending the Subword Embedding Space Using Robust Character Compositionality
Gustavo Aguilar | Bryan McCann | Tong Niu | Nazneen Rajani | Nitish Shirish Keskar | Thamar Solorio
Findings of the Association for Computational Linguistics: EMNLP 2021

Byte-pair encoding (BPE) is a ubiquitous algorithm in the subword tokenization process of language models as it provides multiple benefits. However, this process is solely based on pre-training data statistics, making it hard for the tokenizer to handle infrequent spellings. On the other hand, though robust to misspellings, pure character-level models often lead to unreasonably long sequences and make it harder for the model to learn meaningful words. To alleviate these challenges, we propose a character-based subword module (char2subword) that learns the subword embedding table in pre-trained models like BERT. Our char2subword module builds representations from characters out of the subword vocabulary, and it can be used as a drop-in replacement of the subword embedding table. The module is robust to character-level alterations such as misspellings, word inflection, casing, and punctuation. We integrate it further with BERT through pre-training while keeping BERT transformer parameters fixed–and thus, providing a practical method. Finally, we show that incorporating our module to mBERT significantly improves the performance on the social media linguistic code-switching evaluation (LinCE) benchmark.

Mitigating Temporal-Drift: A Simple Approach to Keep NER Models Crisp
Shuguang Chen | Leonardo Neves | Thamar Solorio
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Performance of neural models for named entity recognition degrades over time, becoming stale. This degradation is due to temporal drift, the change in our target variables’ statistical properties over time. This issue is especially problematic for social media data, where topics change rapidly. In order to mitigate the problem, data annotation and retraining of models is common. Despite its usefulness, this process is expensive and time-consuming, which motivates new research on efficient model updating. In this paper, we propose an intuitive approach to measure the potential trendiness of tweets and use this metric to select the most informative instances to use for training. We conduct experiments on three state-of-the-art models on the Temporal Twitter Dataset. Our approach shows larger increases in prediction accuracy with less training data than the alternatives, making it an attractive, practical solution.

PSED: A Dataset for Selecting Emphasis in Presentation Slides
Amirreza Shirani | Giai Tran | Hieu Trinh | Franck Dernoncourt | Nedim Lipka | Jose Echevarria | Thamar Solorio | Paul Asente
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Can images help recognize entities? A study of the role of images for Multimodal NER
Shuguang Chen | Gustavo Aguilar | Leonardo Neves | Thamar Solorio
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Multimodal named entity recognition (MNER) requires to bridge the gap between language understanding and visual context. While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the model’s ability to leverage multimodal interactions remains poorly understood. In this work, we conduct in-depth analyses of existing multimodal fusion techniques from different perspectives and describe the scenarios where adding information from the image does not always boost performance. We also study the use of captions as a way to enrich the context for MNER. Experiments on three datasets from popular social platforms expose the bottleneck of existing multimodal models and the situations where using captions is beneficial.

Exploring Conditional Text Generation for Aspect-Based Sentiment Analysis
Siva Uday Sampreeth Chebolu | Franck Dernoncourt | Nedim Lipka | Thamar Solorio
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

A Case Study of Deep Learning-Based Multi-Modal Methods for Labeling the Presence of Questionable Content in Movie Trailers
Mahsa Shafaei | Christos Smailis | Ioannis Kakadiaris | Thamar Solorio
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this work, we explore different approaches to combine modalities for the problem of automated age-suitability rating of movie trailers. First, we introduce a new dataset containing videos of movie trailers in English downloaded from IMDB and YouTube, along with their corresponding age-suitability rating labels. Secondly, we propose a multi-modal deep learning pipeline addressing the movie trailer age suitability rating problem. This is the first attempt to combine video, audio, and speech information for this problem, and our experimental results show that multi-modal approaches significantly outperform the best mono and bimodal models in this task.

Normalization and Back-Transliteration for Code-Switched Data
Dwija Parikh | Thamar Solorio
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-switching is an omnipresent phenomenon in multilingual communities all around the world but remains a challenge for NLP systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated to the Roman script which prevents from utilizing monolingual resources available in the native Devanagari script. In this paper, we propose a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. We also release a dataset of script-corrected Hindi-English code-switched sentences labeled for the named entity recognition and part-of-speech tagging tasks to facilitate further research.

Data Augmentation for Cross-Domain Named Entity Recognition
Shuguang Chen | Gustavo Aguilar | Leonardo Neves | Thamar Solorio
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Current work in named entity recognition (NER) shows that data augmentation techniques can produce more robust models. However, most existing techniques focus on augmenting in-domain data in low-resource scenarios where annotated data is quite limited. In this work, we take this research direction to the opposite and study cross-domain data augmentation for the NER task. We investigate the possibility of leveraging data from high-resource domains by projecting it into the low-resource domains. Specifically, we propose a novel neural architecture to transform the data representation from a high-resource to a low-resource domain by learning the patterns (e.g. style, noise, abbreviations, etc.) in the text that differentiate them and a shared feature space where both domains are aligned. We experiment with diverse datasets and show that transforming the data to the low-resource domain representation achieves significant improvements over only using data from high-resource domains.

2020

From English to Code-Switching: Transfer Learning with Strong Morphological Clues
Gustavo Aguilar | Thamar Solorio
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Linguistic Code-switching (CS) is still an understudied phenomenon in natural language processing. The NLP community has mostly focused on monolingual and multi-lingual scenarios, but little attention has been given to CS in particular. This is partly because of the lack of resources and annotated data, despite its increasing occurrence in social media platforms. In this paper, we aim at adapting monolingual models to code-switched text in various tasks. Specifically, we transfer English knowledge from a pre-trained ELMo model to different code-switched language pairs (i.e., Nepali-English, Spanish-English, and Hindi-English) using the task of language identification. Our method, CS-ELMo, is an extension of ELMo with a simple yet effective position-aware attention mechanism inside its character convolutions. We show the effectiveness of this transfer learning step by outperforming multilingual BERT and homologous CS-unaware ELMo models and establishing a new state of the art in CS tasks, such as NER and POS tagging. Our technique can be expanded to more English-paired code-switched languages, providing more resources to the CS community.

Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Thamar Solorio | Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Amitava Das | Mona Diab
Proceedings of the 4th Workshop on Computational Approaches to Code Switching

Age Suitability Rating: Predicting the MPAA Rating Based on Movie Dialogues
Mahsa Shafaei | Niloofar Safi Samghabadi | Sudipta Kar | Thamar Solorio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Movies help us learn and inspire societal change. But they can also contain objectionable content that negatively affects viewers’ behaviour, especially children. In this paper, our goal is to predict the suitability of movie content for children and young adults based on scripts. The criterion that we use to measure suitability is the MPAA rating that is specifically designed for this purpose. We create a corpus for movie MPAA ratings and propose an RNN based architecture with attention that jointly models the genre and the emotions in the script to predict the MPAA rating. We achieve 81% weighted F1-score for the classification model that outperforms the traditional machine learning method by 7%.

Let Me Choose: From Verbal Context to Font Selection
Amirreza Shirani | Franck Dernoncourt | Jose Echevarria | Paul Asente | Nedim Lipka | Thamar Solorio
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we aim to learn associations between visual attributes of fonts and the verbal context of the texts they are typically applied to. Compared to related work leveraging the surrounding visual context, we choose to focus only on the input text, which can enable new applications for which the text is the only visual element in the document. We introduce a new dataset, containing examples of different topics in social media posts and ads, labeled through crowd-sourcing. Due to the subjective nature of the task, multiple fonts might be perceived as acceptable for an input text, which makes this problem challenging. To this end, we investigate different end-to-end models to learn label distributions on crowd-sourced data, to capture inter-subjectivity across all annotations.

Detecting Early Signs of Cyberbullying in Social Media
Niloofar Safi Samghabadi | Adrián Pastor López Monroy | Thamar Solorio
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

Nowadays, the amount of users’ activities on online social media is growing dramatically. These online environments provide excellent opportunities for communication and knowledge sharing. However, some people misuse them to harass and bully others online, a phenomenon called cyberbullying. Due to its harmful effects on people, especially youth, it is imperative to detect cyberbullying as early as possible before it causes irreparable damages to victims. Most of the relevant available resources are not explicitly designed to detect cyberbullying, but related content, such as hate speech and abusive language. In this paper, we propose a new approach to create a corpus suited for cyberbullying detection. We also investigate the possibility of designing a framework to monitor the streams of users’ online messages and detects the signs of cyberbullying as early as possible.

Aggression and Misogyny Detection using BERT: A Multi-Task Approach
Niloofar Safi Samghabadi | Parth Patwa | Srinivas PYKL | Prerana Mukherjee | Amitava Das | Thamar Solorio
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In recent times, the focus of the NLP community has increased towards offensive language, aggression, and hate-speech detection. This paper presents our system for TRAC-2 shared task on “Aggression Identification” (sub-task A) and “Misogynistic Aggression Identification” (sub-task B). The data for this shared task is provided in three different languages - English, Hindi, and Bengali. Each data instance is annotated into one of the three aggression classes - Not Aggressive, Covertly Aggressive, Overtly Aggressive, as well as one of the two misogyny classes - Gendered and Non-Gendered. We propose an end-to-end neural model using attention on top of BERT that incorporates a multi-task learning paradigm to address both the sub-tasks simultaneously. Our team, “na14”, scored 0.8579 weighted F1-measure on the English sub-task B and secured 3rd rank out of 15 teams for the task. The code and the model weights are publicly available at https://github.com/NiloofarSafi/TRAC-2. Keywords: Aggression, Misogyny, Abusive Language, Hate-Speech Detection, BERT, NLP, Neural Networks, Social Media

SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets
Parth Patwa | Gustavo Aguilar | Sudipta Kar | Suraj Pandey | Srinivas PYKL | Björn Gambäck | Tanmoy Chakraborty | Thamar Solorio | Amitava Das
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present the results of the SemEval-2020 Task 9 on Sentiment Analysis of Code-Mixed Tweets (SentiMix 2020). We also release and describe our Hinglish (Hindi-English)and Spanglish (Spanish-English) corpora annotated with word-level language identification and sentence-level sentiment labels. These corpora are comprised of 20K and 19K examples, respectively. The sentiment labels are - Positive, Negative, and Neutral. SentiMix attracted 89 submissions in total including 61 teams that participated in the Hinglish contest and 28 submitted systems to the Spanglish competition. The best performance achieved was 75.0% F1 score for Hinglish and 80.6% F1 for Spanglish. We observe that BERT-like models and ensemble methods are the most common and successful approaches among the participants.

Attending the Emotions to Detect Online Abusive Language
Niloofar Safi Samghabadi | Afsheen Hatami | Mahsa Shafaei | Sudipta Kar | Thamar Solorio
Proceedings of the Fourth Workshop on Online Abuse and Harms

In recent years, abusive behavior has become a serious issue in online social networks. In this paper, we present a new corpus for the task of abusive language detection that is collected from a semi-anonymous online platform, and unlike the majority of other available resources, is not created based on a specific list of bad words. We also develop computational models to incorporate emotions into textual cues to improve aggression identification. We evaluate our proposed methods on a set of corpora related to the task and show promising results with respect to abusive language detection.

SemEval-2020 Task 10: Emphasis Selection for Written Text in Visual Media
Amirreza Shirani | Franck Dernoncourt | Nedim Lipka | Paul Asente | Jose Echevarria | Thamar Solorio
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present the main findings and compare the results of SemEval-2020 Task 10, Emphasis Selection for Written Text in Visual Media. The goal of this shared task is to design automatic methods for emphasis selection, i.e. choosing candidates for emphasis in textual content to enable automated design assistance in authoring. The main focus is on short text instances for social media, with a variety of examples, from social media posts to inspirational quotes. Participants were asked to model emphasis using plain text with no additional context from the user or other design considerations. SemEval-2020 Emphasis Selection shared task attracted 197 participants in the early phase and a total of 31 teams made submissions to this task. The highest-ranked submission achieved 0.823 Matchm score. The analysis of systems submitted to the task indicates that BERT and RoBERTa were the most common choice of pre-trained models used, and part of speech tag (POS) was the most useful feature. Full results can be found on the task’s website.

LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation
Gustavo Aguilar | Sudipta Kar | Thamar Solorio
Proceedings of the Twelfth Language Resources and Evaluation Conference

Recent trends in NLP research have raised an interest in linguistic code-switching (CS); modern approaches have been proposed to solve a wide range of NLP tasks on multiple language pairs. Unfortunately, these proposed methods are hardly generalizable to different code-switched languages. In addition, it is unclear whether a model architecture is applicable for a different task while still being compatible with the code-switching setting. This is mainly because of the lack of a centralized benchmark and the sparse corpora that researchers employ based on their specific needs and interests. To facilitate research in this direction, we propose a centralized benchmark for Linguistic Code-switching Evaluation (LinCE) that combines eleven corpora covering four different code-switched language pairs (i.e., Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic) and four tasks (i.e., language identification, named entity recognition, part-of-speech tagging, and sentiment analysis). As part of the benchmark centralization effort, we provide an online platform where researchers can submit their results while comparing with others in real-time. In addition, we provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT so that the NLP community can compare against state-of-the-art systems. LinCE is a continuous effort, and we will expand it with more low-resource languages and tasks.

Multi-view Story Characterization from Movie Plot Synopses and Reviews
Sudipta Kar | Gustavo Aguilar | Mirella Lapata | Thamar Solorio
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

This paper considers the problem of characterizing stories by inferring properties such as theme and style using written synopses and reviews of movies. We experiment with a multi-label dataset of movie synopses and a tagset representing various attributes of stories (e.g., genre, type of events). Our proposed multi-view model encodes the synopses and reviews using hierarchical attention and shows improvement over methods that only use synopses. Finally, we demonstrate how we can take advantage of such a model to extract a complementary set of story-attributes from reviews without direct supervision. We have made our dataset and source code publicly available at https://ritual.uh.edu/multiview-tag-2020.

2019

Jointly Learning Author and Annotated Character N-gram Embeddings: A Case Study in Literary Text
Suraj Maharjan | Deepthi Mave | Prasha Shrestha | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

An author’s way of presenting a story through his/her writing style has a great impact on whether the story will be liked by readers or not. In this paper, we learn representations for authors of literary texts together with representations for character n-grams annotated with their functional roles. We train a neural character n-gram based language model using an external corpus of literary texts and transfer learned representations for use in downstream tasks. We show that augmenting the knowledge from external works of authors produces results competitive with other style-based methods for book likability prediction, genre classification, and authorship attribution.

Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Jill Burstein | Christy Doran | Thamar Solorio
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Learning Emphasis Selection for Written Text in Visual Media from Crowd-Sourced Label Distributions
Amirreza Shirani | Franck Dernoncourt | Paul Asente | Nedim Lipka | Seokhwan Kim | Jose Echevarria | Thamar Solorio
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In visual communication, text emphasis is used to increase the comprehension of written text to convey the author’s intent. We study the problem of emphasis selection, i.e. choosing candidates for emphasis in short written text, to enable automated design assistance in authoring. Without knowing the author’s intent and only considering the input text, multiple emphasis selections are valid. We propose a model that employs end-to-end label distribution learning (LDL) on crowd-sourced data and predicts a selection distribution, capturing the inter-subjectivity (common-sense) in the audience as well as the ambiguity of the input. We compare the model with several baselines in which the problem is transformed to single-label learning by mapping label distributions to absolute labels via majority voting.

2018

Early Text Classification Using Multi-Resolution Concept Representations
Adrian Pastor López-Monroy | Fabio A. González | Manuel Montes | Hugo Jair Escalante | Thamar Solorio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The intensive use of e-communications in everyday life has given rise to new threats and risks. When the vulnerable asset is the user, detecting these potential attacks before they cause serious damages is extremely important. This paper proposes a novel document representation to improve the early detection of risks in social media sources. The goal is to effectively identify the potential risk using as few text as possible and with as much anticipation as possible. Accordingly, we devise a Multi-Resolution Representation (MulR), which allows us to generate multiple “views” of the analyzed text. These views capture different semantic meanings for words and documents at different levels of detail, which is very useful in early scenarios to model the variable amounts of evidence. Intuitively, the representation captures better the content of short documents (very early stages) in low resolutions, whereas large documents (medium/large stages) are better modeled with higher resolutions. We evaluate the proposed ideas in two different tasks where anticipation is critical: sexual predator detection and depression detection. The experimental evaluation for these early tasks revealed that the proposed approach outperforms previous methodologies by a considerable margin.

Language Identification and Analysis of Code-Switched Social Media Text
Deepthi Mave | Suraj Maharjan | Thamar Solorio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

In this paper, we detail our work on comparing different word-level language identification systems for code-switched Hindi-English data and a standard Spanish-English dataset. In this regard, we build a new code-switched dataset for Hindi-English. To understand the code-switching patterns in these language pairs, we investigate different code-switching metrics. We find that the CRF model outperforms the neural network based models by a margin of 2-5 percentage points for Spanish-English and 3-5 percentage points for Hindi-English.

Proceedings of the Second Workshop on Stylistic Variation
Julian Brooke | Lucie Flekova | Moshe Koppel | Thamar Solorio
Proceedings of the Second Workshop on Stylistic Variation

Proceedings of ACL 2018, System Demonstrations
Fei Liu | Thamar Solorio
Proceedings of ACL 2018, System Demonstrations

Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media
Gustavo Aguilar | Adrian Pastor López-Monroy | Fabio González | Thamar Solorio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recognizing named entities in a document is a key task in many NLP applications. Although current state-of-the-art approaches to this task reach a high performance on clean text (e.g. newswire genres), those algorithms dramatically degrade when they are moved to noisy environments such as social media domains. We present two systems that address the challenges of processing social media data using character-level phonetics and phonology, word embeddings, and Part-of-Speech tags as features. The first model is a multitask end-to-end Bidirectional Long Short-Term Memory (BLSTM)-Conditional Random Field (CRF) network whose output layer contains two CRF classifiers. The second model uses a multitask BLSTM network as feature extractor that transfers the learning to a CRF classifier for the final prediction. Our systems outperform the current F1 scores of the state of the art on the Workshop on Noisy User-generated Text 2017 dataset by 2.45% and 3.69%, establishing a more suitable approach for social media environments.

MPST: A Corpus of Movie Plot Synopses with Tags
Sudipta Kar | Suraj Maharjan | A. Pastor López-Monroy | Thamar Solorio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Folksonomication: Predicting Tags for Movies from Plot Synopses using Emotion Flow Encoded Neural Network
Sudipta Kar | Suraj Maharjan | Thamar Solorio
Proceedings of the 27th International Conference on Computational Linguistics

Folksonomy of movies covers a wide range of heterogeneous information about movies, like the genre, plot structure, visual experiences, soundtracks, metadata, and emotional experiences from watching a movie. Being able to automatically generate or predict tags for movies can help recommendation engines improve retrieval of similar movies, and help viewers know what to expect from a movie in advance. In this work, we explore the problem of creating tags for movies from plot synopses. We propose a novel neural network model that merges information from synopses and emotion flows throughout the plots to predict a set of tags for movies. We compare our system with multiple baselines and found that the addition of emotion flows boosts the performance of the network by learning ≈18% more tags than a traditional machine learning system.

Letting Emotions Flow: Success Prediction by Modeling the Flow of Emotions in Books
Suraj Maharjan | Sudipta Kar | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Books have the power to make us feel happiness, sadness, pain, surprise, or sorrow. An author’s dexterity in the use of these emotions captivates readers and makes it difficult for them to put the book down. In this paper, we model the flow of emotions over a book using recurrent neural networks and quantify its usefulness in predicting success in books. We obtained the best weighted F1-score of 69% for predicting books’ success in a multitask setting (simultaneously predicting success and genre of books).

Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Thamar Solorio | Mona Diab | Julia Hirschberg
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

A Genre-Aware Attention Model to Improve the Likability Prediction of Books
Suraj Maharjan | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Likability prediction of books has many uses. Readers, writers, as well as the publishing industry, can all benefit from automatic book likability prediction systems. In order to make reliable decisions, these systems need to assimilate information from different aspects of a book in a sensible way. We propose a novel multimodal neural architecture that incorporates genre supervision to assign weights to individual feature types. Our proposed method is capable of dynamically tailoring weights given to feature types based on the characteristics of each book. Our architecture achieves competitive results and even outperforms state-of-the-art for this task.

RiTUAL-UH at TRAC 2018 Shared Task: Aggression Identification
Niloofar Safi Samghabadi | Deepthi Mave | Sudipta Kar | Thamar Solorio
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

This paper presents our system for “TRAC 2018 Shared Task on Aggression Identification”. Our best systems for the English dataset use a combination of lexical and semantic features. However, for Hindi data using only lexical features gave us the best results. We obtained weighted F1-measures of 0.5921 for the English Facebook task (ranked 12th), 0.5663 for the English Social Media task (ranked 6th), 0.6292 for the Hindi Facebook task (ranked 1st), and 0.4853 for the Hindi Social Media task (ranked 2nd).

Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Mona Diab | Julia Hirschberg | Thamar Solorio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

In the third shared task of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focus on Named Entity Recognition (NER) on code-switched social-media data. We divide the shared task into two competitions based on the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. We use Twitter data and 9 entity types to establish a new dataset for code-switched NER benchmarks. In addition to the CS phenomenon, the diversity of the entities and the social media challenges make the task considerably hard to process. As a result, the best scores of the competitions are 63.76% and 71.61% for ENG-SPA and MSA-EGY, respectively. We present the scores of 9 participants and discuss the most common challenges among submissions.

2017

A Multi-task Approach for Named Entity Recognition in Social Media Data
Gustavo Aguilar | Suraj Maharjan | Adrian Pastor López-Monroy | Thamar Solorio
Proceedings of the 3rd Workshop on Noisy User-generated Text

Named Entity Recognition for social media data is challenging because of its inherent noisiness. In addition to improper grammatical structures, it contains spelling inconsistencies and numerous informal abbreviations. We propose a novel multi-task approach by employing a more general secondary task of Named Entity (NE) segmentation together with the primary task of fine-grained NE categorization. The multi-task neural network architecture learns higher order feature representations from word and character sequences along with basic Part-of-Speech tags and gazetteer information. This neural network acts as a feature extractor to feed a Conditional Random Fields classifier. We were able to obtain the first position in the 3rd Workshop on Noisy User-generated Text (WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.

Proceedings of the Workshop on Stylistic Variation
Julian Brooke | Thamar Solorio | Moshe Koppel
Proceedings of the Workshop on Stylistic Variation

Convolutional Neural Networks for Authorship Attribution of Short Texts
Prasha Shrestha | Sebastian Sierra | Fabio González | Manuel Montes | Paolo Rosso | Thamar Solorio
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a model to perform authorship attribution of tweets using Convolutional Neural Networks (CNNs) over character n-grams. We also present a strategy that improves model interpretability by estimating the importance of input text fragments in the predicted classification. The experimental evaluation shows that text CNNs perform competitively and are able to outperform previous methods.

RiTUAL-UH at SemEval-2017 Task 5: Sentiment Analysis on Financial Data Using Neural Networks
Sudipta Kar | Suraj Maharjan | Thamar Solorio
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper, we present our systems for the “SemEval-2017 Task-5 on Fine-Grained Sentiment Analysis on Financial Microblogs and News”. In our system, we combined hand-engineered lexical, sentiment and metadata features, the representations learned from Convolutional Neural Networks (CNN) and Bidirectional Gated Recurrent Unit (Bi-GRU) with Attention model applied on top. With this architecture we obtained weighted cosine similarity scores of 0.72 and 0.74 for subtask-1 and subtask-2, respectively. Using the official scoring system, our system ranked the second place for subtask-2 and eighth place for the subtask-1. It ranked first for both of the subtasks by the scores achieved by an alternate scoring system.

Detecting Nastiness in Social Media
Niloofar Safi Samghabadi | Suraj Maharjan | Alan Sprague | Raquel Diaz-Sprague | Thamar Solorio
Proceedings of the First Workshop on Abusive Language Online

Although social media has made it easy for people to connect on a virtually unlimited basis, it has also opened doors to people who misuse it to undermine, harass, humiliate, threaten and bully others. There is a lack of adequate resources to detect and hinder its occurrence. In this paper, we present our initial NLP approach to detect invective posts as a first step to eventually detect and deter cyberbullying. We crawl data containing profanities and then determine whether or not it contains invective. Annotations on this data are improved iteratively by in-lab annotations and crowdsourcing. We pursue different NLP approaches containing various typical and some newer techniques to distinguish the use of swear words in a neutral way from those instances in which they are used in an insulting way. We also show that this model not only works for our data set, but also can be successfully applied to different data sets.

A Multi-task Approach to Predict Likability of Books
Suraj Maharjan | John Arevalo | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We investigate the value of feature engineering and neural network models for predicting successful writing. Similar to previous work, we treat this as a binary classification task and explore new strategies to automatically learn representations from book contents. We evaluate our feature set on two different corpora created from Project Gutenberg books. The first presents a novel approach for generating the gold standard labels for the task and the other is based on prior research. Using a combination of hand-crafted and recurrent neural network learned representations in a dual learning setting, we obtain the best performance of 73.50% weighted F1-score.

2016

Domain Adaptation for Authorship Attribution: Improved Structural Correspondence Learning
Upendra Sapkota | Thamar Solorio | Manuel Montes | Steven Bethard
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Semi-supervised CLPsych 2016 Shared Task System Submission
Nicolas Rey-Villamizar | Prasha Shrestha | Thamar Solorio | Farig Sadeque | Steven Bethard | Ted Pedersen
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

Why Do They Leave: Modeling Participation in Online Depression Forums
Farig Sadeque | Ted Pedersen | Thamar Solorio | Prasha Shrestha | Nicolas Rey-Villamizar | Steven Bethard
Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media

Proceedings of the Second Workshop on Computational Approaches to Code Switching
Mona Diab | Pascale Fung | Mahmoud Ghoneim | Julia Hirschberg | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Analysis of Anxious Word Usage on Online Health Forums
Nicolas Rey-Villamizar | Prasha Shrestha | Farig Sadeque | Steven Bethard | Ted Pedersen | Arjun Mukherjee | Thamar Solorio
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering
Marc Franco-Salvador | Sudipta Kar | Thamar Solorio | Paolo Rosso
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Younes Samih | Suraj Maharjan | Mohammed Attia | Laura Kallmeyer | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Overview for the Second Shared Task on Language Identification in Code-Switched Data
Giovanni Molina | Fahad AlGhamdi | Mahmoud Ghoneim | Abdelati Hawwari | Nicolas Rey-Villamizar | Mona Diab | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings
Mohammed Attia | Suraj Maharjan | Younes Samih | Laura Kallmeyer | Thamar Solorio
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

This paper describes our system submission to the CogALex-2016 Shared Task on Corpus-Based Identification of Semantic Relations. Our system won first place for Task-1 and second place for Task-2. The evaluation results of our system on the test set is 88.1% (79.0% for TRUE only) f-measure for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in the binary classification (Task-1), while CNNs have better performance in the multi-class semantic classification (Task-2). We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balancing the distribution of labels in the training data.

Age and Gender Prediction on Health Forum Data
Prasha Shrestha | Nicolas Rey-Villamizar | Farig Sadeque | Ted Pedersen | Steven Bethard | Thamar Solorio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Health support forums have become a rich source of data that can be used to improve health care outcomes. A user profile, including information such as age and gender, can support targeted analysis of forum data. But users might not always disclose their age and gender. It is desirable then to be able to automatically extract this information from users’ content. However, to the best of our knowledge there is no such resource for author profiling of health forum data. Here we present a large corpus, with close to 85,000 users, for profiling and also outline our approach and benchmark results to automatically detect a user’s age and gender from their forum posts. We use a mix of features from a user’s text as well as forum specific features to obtain accuracy well above the baseline, thus showing that both our dataset and our method are useful and valid.

Part of Speech Tagging for Code Switched Data
Fahad AlGhamdi | Giovanni Molina | Mona Diab | Thamar Solorio | Abdelati Hawwari | Victor Soto | Julia Hirschberg
Proceedings of the Second Workshop on Computational Approaches to Code Switching

2015

Developing Language-tagged Corpora for Code-switching Tweets
Suraj Maharjan | Elizabeth Blair | Steven Bethard | Thamar Solorio
Proceedings of the 9th Linguistic Annotation Workshop

Predicting Continued Participation in Online Health Forums
Farig Sadeque | Thamar Solorio | Ted Pedersen | Prasha Shrestha | Steven Bethard
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Yang Liu | Thamar Solorio
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

Not All Character N-grams Are Created Equal: A Study in Authorship Attribution
Upendra Sapkota | Steven Bethard | Manuel Montes | Thamar Solorio
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

Overview for the First Shared Task on Language Identification in Code-Switched Data
Thamar Solorio | Elizabeth Blair | Suraj Maharjan | Steven Bethard | Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Julia Hirschberg | Alison Chang | Pascale Fung
Proceedings of the First Workshop on Computational Approaches to Code Switching

Proceedings of the First Workshop on Computational Approaches to Code Switching
Mona Diab | Julia Hirschberg | Pascale Fung | Thamar Solorio
Proceedings of the First Workshop on Computational Approaches to Code Switching

Sockpuppet Detection in Wikipedia: A Corpus of Real-World Deceptive Writing for Linking Identities
Thamar Solorio | Ragib Hasan | Mainul Mizan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes a corpus of sockpuppet cases from Wikipedia. A sockpuppet is an online user account created with a fake identity for the purpose of covering abusive behavior and/or subverting the editing regulation process. We used a semi-automated method for crawling and curating a dataset of real sockpuppet investigation cases. To the best of our knowledge, this is the first corpus available on real-world deceptive writing. We describe the process for crawling the data and some preliminary results that can be used as baseline for benchmarking research. The dataset has been released under a Creative Commons license from our project website (http://docsig.cis.uab.edu/tools-and-datasets/).

Cross-Topic Authorship Attribution: Will Out-Of-Topic Data Help?
Upendra Sapkota | Thamar Solorio | Manuel Montes | Steven Bethard | Paolo Rosso
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

2013

Using Latent Dirichlet Allocation for Child Narrative Analysis
Khairun-nisa Hassanali | Yang Liu | Thamar Solorio
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

Native Language Identification: a Simple n-gram Based Approach
Binod Gyawali | Gabriela Ramirez | Thamar Solorio
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

Exploring Word Class N-grams to Measure Language Development in Children
Gabriela Ramírez de la Rosa | Thamar Solorio | Manuel Montes | Yang Liu | Lisa Bedore | Elizabeth Peña | Aquiles Iglesias
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

A Case Study of Sockpuppet Detection in Wikipedia
Thamar Solorio | Ragib Hasan | Mainul Mizan
Proceedings of the Workshop on Language Analysis in Social Media

2012

On the Impact of Sentiment and Emotion Based Features in Detecting Online Sexual Predators
Dasha Bogdanova | Paolo Rosso | Thamar Solorio
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

Modelling Fixated Discourse in Chats with Cyberpedophiles
Dasha Bogdanova | Paolo Rosso | Thamar Solorio
Proceedings of the Workshop on Computational Approaches to Deception Detection

Grading the Quality of Medical Evidence
Binod Gyawali | Thamar Solorio | Yassine Benajiba
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

UABCoRAL: A Preliminary study for Resolving the Scope of Negation
Binod Gyawali | Thamar Solorio
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2011

Local Histograms of Character N-grams for Authorship Attribution
Hugo Jair Escalante | Thamar Solorio | Manuel Montes-y-Gómez
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the ACL 2011 Student Session
Sasa Petrovic | Ethan Selfridge | Emily Pitler | Miles Osborne | Thamar Solorio
Proceedings of the ACL 2011 Student Session

Modality Specific Meta Features for Authorship Attribution in Web Forum Posts
Thamar Solorio | Sangita Pillay | Sindhu Raghavan | Manuel Montes y Gómez
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas
Thamar Solorio | Ted Pedersen
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2009

A Corpus-Based Approach for the Prediction of Language Impairment in Monolingual English and Spanish-English Bilingual Children
Keyur Gabani | Melissa Sherman | Thamar Solorio | Yang Liu | Lisa Bedore | Elizabeth Peña
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

Using Language Models to Identify Language Impairment in Spanish-English Bilingual Children
Thamar Solorio | Yang Liu
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

Learning to Predict Code-Switching Points
Thamar Solorio | Yang Liu
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Part-of-Speech Tagging for English-Spanish Code-Switched Text
Thamar Solorio | Yang Liu
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

A Filter-Based Approach to Detect End-of-Utterances from Prosody in Dialog Systems
Olac Fuentes | David Vera | Thamar Solorio
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

2006

Improving Name Discrimination: A Language Salad Approach
Ted Pedersen | Anagha Kulkarni | Roxana Angheluta | Zornitsa Kozareva | Thamar Solorio
Proceedings of the Cross-Language Knowledge Induction Workshop

2005

Exploiting Named Entity Taggers in a Second Language
Thamar Solorio
Proceedings of the ACL Student Research Workshop

2004

A Language Independent Method for Question Classification
Thamar Solorio | Manuel Pérez-Coutiño | Manuel Montes-y-Gómez | Luis Villaseñor-Pineda | Aurelio López-López
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Co-authors

Steven Bethard 10

Fabio A. González 10

Franck Dernoncourt 7

Yang Liu (刘扬) 7

Prasha Shrestha 7

Shuguang Chen 6

Julia Hirschberg 6

Niloofar Safi Samghabadi 6

Mahsa Shafaei 6

Genta Indra Winata 6

Alham Fikri Aji 5

Fahad AlGhamdi 5

Adrian Pastor Lopez Monroy 5

Leonardo Neves 5

Nicolas Rey-Villamizar 5

Farig Sadeque 5

Jose Echevarria 4

Amirreza Shirani 4

Vladimir Araujo 3

Siva Uday Sampreeth Chebolu 3

Monojit Choudhury 3

Jan Christian Blaise Cruz 3

Hugo Jair Escalante 3

Mahmoud Ghoneim 3

Binod Gyawali 3

Abdelati Hawwari 3

Upendra Sapkota 3

Sunayana Sitaram 3

Atnafu Lambebo Tonja 3

Mohamed Abdalla 2

Idris Abdulmumin 2

Henok Biadglign Ademtew 2

Ibrahim Said Ahmad 2

Sanchit Ahuja 2

Mohammed Attia 2

Israel Abebe Azime 2

Meriem Beloucif 2

Elizabeth Blair 2

Dasha Bogdanova 2

Julian Brooke 2

Fermin Cristobal 2

Kareem Elzeky 2

Oumaima Hourrane 2

Laura Kallmeyer 2

Aishik Mandal 2

Saif Mohammad 2

Giovanni Molina 2

Shamsuddeen Hassan Muhammad 2

Nedjma Ousidhoum 2

Elizabeth Peña 2

Srinivas Pykl 2

Manish Shrivastava 2

Nirmal Surange 2

Emilio Villa-Cueva 2

Krishnapriya Vishnubhotla 2

Debela Desalegn Yadeta 2

Seid Muhie Yimam 2

Zheng Xin Yong 2

Marina Zhukova 2

Christine de Kock 2

Negasi Haile Abadi 1

Slim Abdennadher 1

S M Masrur Ahmed 1

Mai Alkhamissi 1

Roxana Angheluta 1

Dana Atzil-Slonim 1

Abinew Ali Ayele 1

Elaheh Baharlouei 1

Bontu Fufa Balcha 1

Fazlourrahman Balouchzahi 1

Pavan Baswani 1

Tadesse Destaw Belay 1

Frederico Belcavello 1

Yassine Benajiba 1

Chris Biemann 1

Alan W. Black 1

Sholpan Bolatzhanova 1

Sofia Bourhim 1

Jill Burstein 1

Samuel Cahyawijaya 1

Hector Ceballos 1

Tanmoy Chakraborty 1

Anastasiia Demidova 1

Raquel Diaz-Sprague 1

Christy Doran 1

Naome A. Etori 1

Fauzan Farooqui 1

Lucie Flekova 1

Jessica Zosa Forde 1

Marc Franco-Salvador 1

Björn Gambäck 1

Rowena Garcia 1

Alexander Gelbukh 1

Derartu Dagne Geremew 1

Iryna Gurevych 1

Khairun-nisa Hassanali 1

Afsheen Hatami 1

Aquiles Iglesias 1

Mahardika Krisna Krisna Ihsani 1

Thanmay Jayakumar 1

Soyeong Jeong 1

Ioannis Kakadiaris 1

Gopichand Kanumolu 1

Nitish Shirish Keskar 1

Dietrich Klakow 1

Olga Kolesnikova 1

Zornitsa Kozareva 1

Anagha Kulkarni 1

Mayank Kulkarni 1

Garry Kuwanto 1

Mirella Lapata 1

Zheng Wei Lim 1

Adrian Lopez Monroy 1

Aurelio López-López 1

Lokesh Madasu 1

Sofía Martinelli 1

Rada Mihalcea 1

Mihail Minkov Mihaylov 1

David Mogrovejo 1

Aashiq Muhamed 1

Prerana Mukherjee 1

Arjun Mukherjee 1

Mulubrhan Abebe Nerea 1

Hellina Hailu Nigatu 1

Miles Osborne 1

Duccio Pappadopulo 1

Saša Petrović 1

Sangita Pillay 1

Aniket Pramanick 1

Daniel Preoţiuc-Pietro 1

Sukannya Purkayastha 1

Manuel Pérez-Coutiño 1

Sindhu Raghavan 1

Nazneen Rajani 1

Gabriela Ramirez 1

Gabriela Ramírez de la Rosa 1

Benjamin Rosman 1

Samuel Rutunda 1

Caroline Sabty 1

Israfel Salazar 1

Fernando Sanchez-Vega 1

Ethan Selfridge 1

Sadat Shahriar 1

Melissa Sherman 1

Sebastian Sierra 1

Philipp Slusallek 1

Christos Smailis 1

Anirudh Srinivasan 1

Arjun Subramonian 1

Lintang Sutawika 1

Assefa Atsbiha Tesfu 1

Hailegnaw Tilaye 1

Tiago Timponi Torrent 1

Diana Turmakhan 1

Luis Villaseñor-Pineda 1

Ngoc Thang Vu 1

Derry Tanti Wijaya 1

Ruochen Zhang 1

Venues