Thamar Solorio


2021

pdf bib
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Thamar Solorio | Shuguang Chen | Alan W. Black | Mona Diab | Sunayana Sitaram | Victor Soto | Emre Yilmaz | Anirudh Srinivasan
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

pdf bib
Normalization and Back-Transliteration for Code-Switched Data
Dwija Parikh | Thamar Solorio
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

Code-switching is an omnipresent phenomenon in multilingual communities all around the world but remains a challenge for NLP systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated to the Roman script which prevents from utilizing monolingual resources available in the native Devanagari script. In this paper, we propose a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. We also release a dataset of script-corrected Hindi-English code-switched sentences labeled for the named entity recognition and part-of-speech tagging tasks to facilitate further research.

pdf bib
Mitigating Temporal-Drift: A Simple Approach to Keep NER Models Crisp
Shuguang Chen | Leonardo Neves | Thamar Solorio
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

Performance of neural models for named entity recognition degrades over time, becoming stale. This degradation is due to temporal drift, the change in our target variables’ statistical properties over time. This issue is especially problematic for social media data, where topics change rapidly. In order to mitigate the problem, data annotation and retraining of models is common. Despite its usefulness, this process is expensive and time-consuming, which motivates new research on efficient model updating. In this paper, we propose an intuitive approach to measure the potential trendiness of tweets and use this metric to select the most informative instances to use for training. We conduct experiments on three state-of-the-art models on the Temporal Twitter Dataset. Our approach shows larger increases in prediction accuracy with less training data than the alternatives, making it an attractive, practical solution.

pdf bib
PSED: A Dataset for Selecting Emphasis in Presentation Slides
Amirreza Shirani | Giai Tran | Hieu Trinh | Franck Dernoncourt | Nedim Lipka | Jose Echevarria | Thamar Solorio | Paul Asente
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Proceedings of the The 4th Workshop on Computational Approaches to Code Switching
Thamar Solorio | Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Amitava Das | Mona Diab
Proceedings of the The 4th Workshop on Computational Approaches to Code Switching

pdf bib
Age Suitability Rating: Predicting the MPAA Rating Based on Movie Dialogues
Mahsa Shafaei | Niloofar Safi Samghabadi | Sudipta Kar | Thamar Solorio
Proceedings of the 12th Language Resources and Evaluation Conference

Movies help us learn and inspire societal change. But they can also contain objectionable content that negatively affects viewers’ behaviour, especially children. In this paper, our goal is to predict the suitability of movie content for children and young adults based on scripts. The criterion that we use to measure suitability is the MPAA rating that is specifically designed for this purpose. We create a corpus for movie MPAA ratings and propose an RNN based architecture with attention that jointly models the genre and the emotions in the script to predict the MPAA rating. We achieve 81% weighted F1-score for the classification model that outperforms the traditional machine learning method by 7%.

pdf bib
LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation
Gustavo Aguilar | Sudipta Kar | Thamar Solorio
Proceedings of the 12th Language Resources and Evaluation Conference

Recent trends in NLP research have raised an interest in linguistic code-switching (CS); modern approaches have been proposed to solve a wide range of NLP tasks on multiple language pairs. Unfortunately, these proposed methods are hardly generalizable to different code-switched languages. In addition, it is unclear whether a model architecture is applicable for a different task while still being compatible with the code-switching setting. This is mainly because of the lack of a centralized benchmark and the sparse corpora that researchers employ based on their specific needs and interests. To facilitate research in this direction, we propose a centralized benchmark for Linguistic Code-switching Evaluation (LinCE) that combines eleven corpora covering four different code-switched language pairs (i.e., Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic) and four tasks (i.e., language identification, named entity recognition, part-of-speech tagging, and sentiment analysis). As part of the benchmark centralization effort, we provide an online platform where researchers can submit their results while comparing with others in real-time. In addition, we provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT so that the NLP community can compare against state-of-the-art systems. LinCE is a continuous effort, and we will expand it with more low-resource languages and tasks.

pdf bib
From English to Code-Switching: Transfer Learning with Strong Morphological Clues
Gustavo Aguilar | Thamar Solorio
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Linguistic Code-switching (CS) is still an understudied phenomenon in natural language processing. The NLP community has mostly focused on monolingual and multi-lingual scenarios, but little attention has been given to CS in particular. This is partly because of the lack of resources and annotated data, despite its increasing occurrence in social media platforms. In this paper, we aim at adapting monolingual models to code-switched text in various tasks. Specifically, we transfer English knowledge from a pre-trained ELMo model to different code-switched language pairs (i.e., Nepali-English, Spanish-English, and Hindi-English) using the task of language identification. Our method, CS-ELMo, is an extension of ELMo with a simple yet effective position-aware attention mechanism inside its character convolutions. We show the effectiveness of this transfer learning step by outperforming multilingual BERT and homologous CS-unaware ELMo models and establishing a new state of the art in CS tasks, such as NER and POS tagging. Our technique can be expanded to more English-paired code-switched languages, providing more resources to the CS community.

pdf bib
Let Me Choose: From Verbal Context to Font Selection
Amirreza Shirani | Franck Dernoncourt | Jose Echevarria | Paul Asente | Nedim Lipka | Thamar Solorio
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we aim to learn associations between visual attributes of fonts and the verbal context of the texts they are typically applied to. Compared to related work leveraging the surrounding visual context, we choose to focus only on the input text, which can enable new applications for which the text is the only visual element in the document. We introduce a new dataset, containing examples of different topics in social media posts and ads, labeled through crowd-sourcing. Due to the subjective nature of the task, multiple fonts might be perceived as acceptable for an input text, which makes this problem challenging. To this end, we investigate different end-to-end models to learn label distributions on crowd-sourced data, to capture inter-subjectivity across all annotations.

pdf bib
Multi-view Story Characterization from Movie Plot Synopses and Reviews
Sudipta Kar | Gustavo Aguilar | Mirella Lapata | Thamar Solorio
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

This paper considers the problem of characterizing stories by inferring properties such as theme and style using written synopses and reviews of movies. We experiment with a multi-label dataset of movie synopses and a tagset representing various attributes of stories (e.g., genre, type of events). Our proposed multi-view model encodes the synopses and reviews using hierarchical attention and shows improvement over methods that only use synopses. Finally, we demonstrate how we can take advantage of such a model to extract a complementary set of story-attributes from reviews without direct supervision. We have made our dataset and source code publicly available at https://ritual.uh.edu/multiview-tag-2020.

pdf bib
SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets
Parth Patwa | Gustavo Aguilar | Sudipta Kar | Suraj Pandey | Srinivas PYKL | Björn Gambäck | Tanmoy Chakraborty | Thamar Solorio | Amitava Das
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present the results of the SemEval-2020 Task 9 on Sentiment Analysis of Code-Mixed Tweets (SentiMix 2020). We also release and describe our Hinglish (Hindi-English)and Spanglish (Spanish-English) corpora annotated with word-level language identification and sentence-level sentiment labels. These corpora are comprised of 20K and 19K examples, respectively. The sentiment labels are - Positive, Negative, and Neutral. SentiMix attracted 89 submissions in total including 61 teams that participated in the Hinglish contest and 28 submitted systems to the Spanglish competition. The best performance achieved was 75.0% F1 score for Hinglish and 80.6% F1 for Spanglish. We observe that BERT-like models and ensemble methods are the most common and successful approaches among the participants.

pdf bib
SemEval-2020 Task 10: Emphasis Selection for Written Text in Visual Media
Amirreza Shirani | Franck Dernoncourt | Nedim Lipka | Paul Asente | Jose Echevarria | Thamar Solorio
Proceedings of the Fourteenth Workshop on Semantic Evaluation

In this paper, we present the main findings and compare the results of SemEval-2020 Task 10, Emphasis Selection for Written Text in Visual Media. The goal of this shared task is to design automatic methods for emphasis selection, i.e. choosing candidates for emphasis in textual content to enable automated design assistance in authoring. The main focus is on short text instances for social media, with a variety of examples, from social media posts to inspirational quotes. Participants were asked to model emphasis using plain text with no additional context from the user or other design considerations. SemEval-2020 Emphasis Selection shared task attracted 197 participants in the early phase and a total of 31 teams made submissions to this task. The highest-ranked submission achieved 0.823 Matchm score. The analysis of systems submitted to the task indicates that BERT and RoBERTa were the most common choice of pre-trained models used, and part of speech tag (POS) was the most useful feature. Full results can be found on the task’s website.

pdf bib
Attending the Emotions to Detect Online Abusive Language
Niloofar Safi Samghabadi | Afsheen Hatami | Mahsa Shafaei | Sudipta Kar | Thamar Solorio
Proceedings of the Fourth Workshop on Online Abuse and Harms

In recent years, abusive behavior has become a serious issue in online social networks. In this paper, we present a new corpus for the task of abusive language detection that is collected from a semi-anonymous online platform, and unlike the majority of other available resources, is not created based on a specific list of bad words. We also develop computational models to incorporate emotions into textual cues to improve aggression identification. We evaluate our proposed methods on a set of corpora related to the task and show promising results with respect to abusive language detection.

pdf bib
Aggression and Misogyny Detection using BERT: A Multi-Task Approach
Niloofar Safi Samghabadi | Parth Patwa | Srinivas PYKL | Prerana Mukherjee | Amitava Das | Thamar Solorio
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

In recent times, the focus of the NLP community has increased towards offensive language, aggression, and hate-speech detection.This paper presents our system for TRAC-2 shared task on “Aggression Identification” (sub-task A) and “Misogynistic Aggression Identification” (sub-task B). The data for this shared task is provided in three different languages - English, Hindi, and Bengali. Each data instance is annotated into one of the three aggression classes - Not Aggressive, Covertly Aggressive, Overtly Aggressive, as well as one of the two misogyny classes - Gendered and Non-Gendered. We propose an end-to-end neural model using attention on top of BERT that incorporates a multi-task learning paradigm to address both the sub-tasks simultaneously. Our team, “na14”, scored 0.8579 weighted F1-measure on the English sub-task B and secured 3rd rank out of 15 teams for the task. The code and the model weights are publicly available at https://github.com/NiloofarSafi/TRAC-2. Keywords: Aggression, Misogyny, Abusive Language, Hate-Speech Detection, BERT, NLP, Neural Networks, Social Media

pdf bib
Detecting Early Signs of Cyberbullying in Social Media
Niloofar Safi Samghabadi | Adrián Pastor López Monroy | Thamar Solorio
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

Nowadays, the amount of users’ activities on online social media is growing dramatically. These online environments provide excellent opportunities for communication and knowledge sharing. However, some people misuse them to harass and bully others online, a phenomenon called cyberbullying. Due to its harmful effects on people, especially youth, it is imperative to detect cyberbullying as early as possible before it causes irreparable damages to victims. Most of the relevant available resources are not explicitly designed to detect cyberbullying, but related content, such as hate speech and abusive language. In this paper, we propose a new approach to create a corpus suited for cyberbullying detection. We also investigate the possibility of designing a framework to monitor the streams of users’ online messages and detects the signs of cyberbullying as early as possible.

2019

pdf bib
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Jill Burstein | Christy Doran | Thamar Solorio
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

pdf bib
Learning Emphasis Selection for Written Text in Visual Media from Crowd-Sourced Label Distributions
Amirreza Shirani | Franck Dernoncourt | Paul Asente | Nedim Lipka | Seokhwan Kim | Jose Echevarria | Thamar Solorio
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In visual communication, text emphasis is used to increase the comprehension of written text to convey the author’s intent. We study the problem of emphasis selection, i.e. choosing candidates for emphasis in short written text, to enable automated design assistance in authoring. Without knowing the author’s intent and only considering the input text, multiple emphasis selections are valid. We propose a model that employs end-to-end label distribution learning (LDL) on crowd-sourced data and predicts a selection distribution, capturing the inter-subjectivity (common-sense) in the audience as well as the ambiguity of the input. We compare the model with several baselines in which the problem is transformed to single-label learning by mapping label distributions to absolute labels via majority voting.

pdf bib
Jointly Learning Author and Annotated Character N-gram Embeddings: A Case Study in Literary Text
Suraj Maharjan | Deepthi Mave | Prasha Shrestha | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

An author’s way of presenting a story through his/her writing style has a great impact on whether the story will be liked by readers or not. In this paper, we learn representations for authors of literary texts together with representations for character n-grams annotated with their functional roles. We train a neural character n-gram based language model using an external corpus of literary texts and transfer learned representations for use in downstream tasks. We show that augmenting the knowledge from external works of authors produces results competitive with other style-based methods for book likability prediction, genre classification, and authorship attribution.

2018

pdf bib
Folksonomication: Predicting Tags for Movies from Plot Synopses using Emotion Flow Encoded Neural Network
Sudipta Kar | Suraj Maharjan | Thamar Solorio
Proceedings of the 27th International Conference on Computational Linguistics

Folksonomy of movies covers a wide range of heterogeneous information about movies, like the genre, plot structure, visual experiences, soundtracks, metadata, and emotional experiences from watching a movie. Being able to automatically generate or predict tags for movies can help recommendation engines improve retrieval of similar movies, and help viewers know what to expect from a movie in advance. In this work, we explore the problem of creating tags for movies from plot synopses. We propose a novel neural network model that merges information from synopses and emotion flows throughout the plots to predict a set of tags for movies. We compare our system with multiple baselines and found that the addition of emotion flows boosts the performance of the network by learning ≈18% more tags than a traditional machine learning system.

pdf bib
Early Text Classification Using Multi-Resolution Concept Representations
Adrian Pastor López-Monroy | Fabio A. González | Manuel Montes | Hugo Jair Escalante | Thamar Solorio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The intensive use of e-communications in everyday life has given rise to new threats and risks. When the vulnerable asset is the user, detecting these potential attacks before they cause serious damages is extremely important. This paper proposes a novel document representation to improve the early detection of risks in social media sources. The goal is to effectively identify the potential risk using as few text as possible and with as much anticipation as possible. Accordingly, we devise a Multi-Resolution Representation (MulR), which allows us to generate multiple “views” of the analyzed text. These views capture different semantic meanings for words and documents at different levels of detail, which is very useful in early scenarios to model the variable amounts of evidence. Intuitively, the representation captures better the content of short documents (very early stages) in low resolutions, whereas large documents (medium/large stages) are better modeled with higher resolutions. We evaluate the proposed ideas in two different tasks where anticipation is critical: sexual predator detection and depression detection. The experimental evaluation for these early tasks revealed that the proposed approach outperforms previous methodologies by a considerable margin.

pdf bib
Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media
Gustavo Aguilar | Adrian Pastor López-Monroy | Fabio González | Thamar Solorio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recognizing named entities in a document is a key task in many NLP applications. Although current state-of-the-art approaches to this task reach a high performance on clean text (e.g. newswire genres), those algorithms dramatically degrade when they are moved to noisy environments such as social media domains. We present two systems that address the challenges of processing social media data using character-level phonetics and phonology, word embeddings, and Part-of-Speech tags as features. The first model is a multitask end-to-end Bidirectional Long Short-Term Memory (BLSTM)-Conditional Random Field (CRF) network whose output layer contains two CRF classifiers. The second model uses a multitask BLSTM network as feature extractor that transfers the learning to a CRF classifier for the final prediction. Our systems outperform the current F1 scores of the state of the art on the Workshop on Noisy User-generated Text 2017 dataset by 2.45% and 3.69%, establishing a more suitable approach for social media environments.

pdf bib
Letting Emotions Flow: Success Prediction by Modeling the Flow of Emotions in Books
Suraj Maharjan | Sudipta Kar | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Books have the power to make us feel happiness, sadness, pain, surprise, or sorrow. An author’s dexterity in the use of these emotions captivates readers and makes it difficult for them to put the book down. In this paper, we model the flow of emotions over a book using recurrent neural networks and quantify its usefulness in predicting success in books. We obtained the best weighted F1-score of 69% for predicting books’ success in a multitask setting (simultaneously predicting success and genre of books).

pdf bib
MPST: A Corpus of Movie Plot Synopses with Tags
Sudipta Kar | Suraj Maharjan | A. Pastor López-Monroy | Thamar Solorio
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Proceedings of ACL 2018, System Demonstrations
Fei Liu | Thamar Solorio
Proceedings of ACL 2018, System Demonstrations

pdf bib
Proceedings of the Second Workshop on Stylistic Variation
Julian Brooke | Lucie Flekova | Moshe Koppel | Thamar Solorio
Proceedings of the Second Workshop on Stylistic Variation

pdf bib
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Thamar Solorio | Mona Diab | Julia Hirschberg
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

pdf bib
Language Identification and Analysis of Code-Switched Social Media Text
Deepthi Mave | Suraj Maharjan | Thamar Solorio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

In this paper, we detail our work on comparing different word-level language identification systems for code-switched Hindi-English data and a standard Spanish-English dataset. In this regard, we build a new code-switched dataset for Hindi-English. To understand the code-switching patterns in these language pairs, we investigate different code-switching metrics. We find that the CRF model outperforms the neural network based models by a margin of 2-5 percentage points for Spanish-English and 3-5 percentage points for Hindi-English.

pdf bib
Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Mona Diab | Julia Hirschberg | Thamar Solorio
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

In the third shared task of the Computational Approaches to Linguistic Code-Switching (CALCS) workshop, we focus on Named Entity Recognition (NER) on code-switched social-media data. We divide the shared task into two competitions based on the English-Spanish (ENG-SPA) and Modern Standard Arabic-Egyptian (MSA-EGY) language pairs. We use Twitter data and 9 entity types to establish a new dataset for code-switched NER benchmarks. In addition to the CS phenomenon, the diversity of the entities and the social media challenges make the task considerably hard to process. As a result, the best scores of the competitions are 63.76% and 71.61% for ENG-SPA and MSA-EGY, respectively. We present the scores of 9 participants and discuss the most common challenges among submissions.

pdf bib
RiTUAL-UH at TRAC 2018 Shared Task: Aggression Identification
Niloofar Safi Samghabadi | Deepthi Mave | Sudipta Kar | Thamar Solorio
Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018)

This paper presents our system for “TRAC 2018 Shared Task on Aggression Identification”. Our best systems for the English dataset use a combination of lexical and semantic features. However, for Hindi data using only lexical features gave us the best results. We obtained weighted F1-measures of 0.5921 for the English Facebook task (ranked 12th), 0.5663 for the English Social Media task (ranked 6th), 0.6292 for the Hindi Facebook task (ranked 1st), and 0.4853 for the Hindi Social Media task (ranked 2nd).

pdf bib
A Genre-Aware Attention Model to Improve the Likability Prediction of Books
Suraj Maharjan | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Likability prediction of books has many uses. Readers, writers, as well as the publishing industry, can all benefit from automatic book likability prediction systems. In order to make reliable decisions, these systems need to assimilate information from different aspects of a book in a sensible way. We propose a novel multimodal neural architecture that incorporates genre supervision to assign weights to individual feature types. Our proposed method is capable of dynamically tailoring weights given to feature types based on the characteristics of each book. Our architecture achieves competitive results and even outperforms state-of-the-art for this task.

2017

pdf bib
Detecting Nastiness in Social Media
Niloofar Safi Samghabadi | Suraj Maharjan | Alan Sprague | Raquel Diaz-Sprague | Thamar Solorio
Proceedings of the First Workshop on Abusive Language Online

Although social media has made it easy for people to connect on a virtually unlimited basis, it has also opened doors to people who misuse it to undermine, harass, humiliate, threaten and bully others. There is a lack of adequate resources to detect and hinder its occurrence. In this paper, we present our initial NLP approach to detect invective posts as a first step to eventually detect and deter cyberbullying. We crawl data containing profanities and then determine whether or not it contains invective. Annotations on this data are improved iteratively by in-lab annotations and crowdsourcing. We pursue different NLP approaches containing various typical and some newer techniques to distinguish the use of swear words in a neutral way from those instances in which they are used in an insulting way. We also show that this model not only works for our data set, but also can be successfully applied to different data sets.

pdf bib
A Multi-task Approach for Named Entity Recognition in Social Media Data
Gustavo Aguilar | Suraj Maharjan | Adrian Pastor López-Monroy | Thamar Solorio
Proceedings of the 3rd Workshop on Noisy User-generated Text

Named Entity Recognition for social media data is challenging because of its inherent noisiness. In addition to improper grammatical structures, it contains spelling inconsistencies and numerous informal abbreviations. We propose a novel multi-task approach by employing a more general secondary task of Named Entity (NE) segmentation together with the primary task of fine-grained NE categorization. The multi-task neural network architecture learns higher order feature representations from word and character sequences along with basic Part-of-Speech tags and gazetteer information. This neural network acts as a feature extractor to feed a Conditional Random Fields classifier. We were able to obtain the first position in the 3rd Workshop on Noisy User-generated Text (WNUT-2017) with a 41.86% entity F1-score and a 40.24% surface F1-score.

pdf bib
Proceedings of the Workshop on Stylistic Variation
Julian Brooke | Thamar Solorio | Moshe Koppel
Proceedings of the Workshop on Stylistic Variation

pdf bib
RiTUAL-UH at SemEval-2017 Task 5: Sentiment Analysis on Financial Data Using Neural Networks
Sudipta Kar | Suraj Maharjan | Thamar Solorio
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

In this paper, we present our systems for the “SemEval-2017 Task-5 on Fine-Grained Sentiment Analysis on Financial Microblogs and News”. In our system, we combined hand-engineered lexical, sentiment and metadata features, the representations learned from Convolutional Neural Networks (CNN) and Bidirectional Gated Recurrent Unit (Bi-GRU) with Attention model applied on top. With this architecture we obtained weighted cosine similarity scores of 0.72 and 0.74 for subtask-1 and subtask-2, respectively. Using the official scoring system, our system ranked the second place for subtask-2 and eighth place for the subtask-1. It ranked first for both of the subtasks by the scores achieved by an alternate scoring system.

pdf bib
A Multi-task Approach to Predict Likability of Books
Suraj Maharjan | John Arevalo | Manuel Montes | Fabio A. González | Thamar Solorio
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We investigate the value of feature engineering and neural network models for predicting successful writing. Similar to previous work, we treat this as a binary classification task and explore new strategies to automatically learn representations from book contents. We evaluate our feature set on two different corpora created from Project Gutenberg books. The first presents a novel approach for generating the gold standard labels for the task and the other is based on prior research. Using a combination of hand-crafted and recurrent neural network learned representations in a dual learning setting, we obtain the best performance of 73.50% weighted F1-score.

pdf bib
Convolutional Neural Networks for Authorship Attribution of Short Texts
Prasha Shrestha | Sebastian Sierra | Fabio González | Manuel Montes | Paolo Rosso | Thamar Solorio
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a model to perform authorship attribution of tweets using Convolutional Neural Networks (CNNs) over character n-grams. We also present a strategy that improves model interpretability by estimating the importance of input text fragments in the predicted classification. The experimental evaluation shows that text CNNs perform competitively and are able to outperform previous methods.

2016

pdf bib
Domain Adaptation for Authorship Attribution: Improved Structural Correspondence Learning
Upendra Sapkota | Thamar Solorio | Manuel Montes | Steven Bethard
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Age and Gender Prediction on Health Forum Data
Prasha Shrestha | Nicolas Rey-Villamizar | Farig Sadeque | Ted Pedersen | Steven Bethard | Thamar Solorio
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Health support forums have become a rich source of data that can be used to improve health care outcomes. A user profile, including information such as age and gender, can support targeted analysis of forum data. But users might not always disclose their age and gender. It is desirable then to be able to automatically extract this information from users’ content. However, to the best of our knowledge there is no such resource for author profiling of health forum data. Here we present a large corpus, with close to 85,000 users, for profiling and also outline our approach and benchmark results to automatically detect a user’s age and gender from their forum posts. We use a mix of features from a user’s text as well as forum specific features to obtain accuracy well above the baseline, thus showing that both our dataset and our method are useful and valid.

pdf bib
Semi-supervised CLPsych 2016 Shared Task System Submission
Nicolas Rey-Villamizar | Prasha Shrestha | Thamar Solorio | Farig Sadeque | Steven Bethard | Ted Pedersen
Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology

pdf bib
CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings
Mohammed Attia | Suraj Maharjan | Younes Samih | Laura Kallmeyer | Thamar Solorio
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

This paper describes our system submission to the CogALex-2016 Shared Task on Corpus-Based Identification of Semantic Relations. Our system won first place for Task-1 and second place for Task-2. The evaluation results of our system on the test set is 88.1% (79.0% for TRUE only) f-measure for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in the binary classification (Task-1), while CNNs have better performance in the multi-class semantic classification (Task-2). We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balancing the distribution of labels in the training data.

pdf bib
Proceedings of the Second Workshop on Computational Approaches to Code Switching
Mona Diab | Pascale Fung | Mahmoud Ghoneim | Julia Hirschberg | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Overview for the Second Shared Task on Language Identification in Code-Switched Data
Giovanni Molina | Fahad AlGhamdi | Mahmoud Ghoneim | Abdelati Hawwari | Nicolas Rey-Villamizar | Mona Diab | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Younes Samih | Suraj Maharjan | Mohammed Attia | Laura Kallmeyer | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Part of Speech Tagging for Code Switched Data
Fahad AlGhamdi | Giovanni Molina | Mona Diab | Thamar Solorio | Abdelati Hawwari | Victor Soto | Julia Hirschberg
Proceedings of the Second Workshop on Computational Approaches to Code Switching

pdf bib
Analysis of Anxious Word Usage on Online Health Forums
Nicolas Rey-Villamizar | Prasha Shrestha | Farig Sadeque | Steven Bethard | Ted Pedersen | Arjun Mukherjee | Thamar Solorio
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis

pdf bib
Why Do They Leave: Modeling Participation in Online Depression Forums
Farig Sadeque | Ted Pedersen | Thamar Solorio | Prasha Shrestha | Nicolas Rey-Villamizar | Steven Bethard
Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media

pdf bib
UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering
Marc Franco-Salvador | Sudipta Kar | Thamar Solorio | Paolo Rosso
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

pdf bib
Developing Language-tagged Corpora for Code-switching Tweets
Suraj Maharjan | Elizabeth Blair | Steven Bethard | Thamar Solorio
Proceedings of The 9th Linguistic Annotation Workshop

pdf bib
Predicting Continued Participation in Online Health Forums
Farig Sadeque | Thamar Solorio | Ted Pedersen | Prasha Shrestha | Steven Bethard
Proceedings of the Sixth International Workshop on Health Text Mining and Information Analysis

pdf bib
Not All Character N-grams Are Created Equal: A Study in Authorship Attribution
Upendra Sapkota | Steven Bethard | Manuel Montes | Thamar Solorio
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Yang Liu | Thamar Solorio
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts

2014

pdf bib
Proceedings of the First Workshop on Computational Approaches to Code Switching
Mona Diab | Julia Hirschberg | Pascale Fung | Thamar Solorio
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Overview for the First Shared Task on Language Identification in Code-Switched Data
Thamar Solorio | Elizabeth Blair | Suraj Maharjan | Steven Bethard | Mona Diab | Mahmoud Ghoneim | Abdelati Hawwari | Fahad AlGhamdi | Julia Hirschberg | Alison Chang | Pascale Fung
Proceedings of the First Workshop on Computational Approaches to Code Switching

pdf bib
Cross-Topic Authorship Attribution: Will Out-Of-Topic Data Help?
Upendra Sapkota | Thamar Solorio | Manuel Montes | Steven Bethard | Paolo Rosso
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib
Sockpuppet Detection in Wikipedia: A Corpus of Real-World Deceptive Writing for Linking Identities
Thamar Solorio | Ragib Hasan | Mainul Mizan
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes a corpus of sockpuppet cases from Wikipedia. A sockpuppet is an online user account created with a fake identity for the purpose of covering abusive behavior and/or subverting the editing regulation process. We used a semi-automated method for crawling and curating a dataset of real sockpuppet investigation cases. To the best of our knowledge, this is the first corpus available on real-world deceptive writing. We describe the process for crawling the data and some preliminary results that can be used as baseline for benchmarking research. The dataset has been released under a Creative Commons license from our project website (http://docsig.cis.uab.edu/tools-and-datasets/).

2013

pdf bib
A Case Study of Sockpuppet Detection in Wikipedia
Thamar Solorio | Ragib Hasan | Mainul Mizan
Proceedings of the Workshop on Language Analysis in Social Media

pdf bib
Native Language Identification: a Simple n-gram Based Approach
Binod Gyawali | Gabriela Ramirez | Thamar Solorio
Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Exploring Word Class N-grams to Measure Language Development in Children
Gabriela Ramírez de la Rosa | Thamar Solorio | Manuel Montes | Yang Liu | Lisa Bedore | Elizabeth Peña | Aquiles Iglesias
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

pdf bib
Using Latent Dirichlet Allocation for Child Narrative Analysis
Khairun-nisa Hassanali | Yang Liu | Thamar Solorio
Proceedings of the 2013 Workshop on Biomedical Natural Language Processing

2012

pdf bib
UABCoRAL: A Preliminary study for Resolving the Scope of Negation
Binod Gyawali | Thamar Solorio
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

pdf bib
Modelling Fixated Discourse in Chats with Cyberpedophiles
Dasha Bogdanova | Paolo Rosso | Thamar Solorio
Proceedings of the Workshop on Computational Approaches to Deception Detection

pdf bib
Grading the Quality of Medical Evidence
Binod Gyawali | Thamar Solorio | Yassine Benajiba
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

pdf bib
On the Impact of Sentiment and Emotion Based Features in Detecting Online Sexual Predators
Dasha Bogdanova | Paolo Rosso | Thamar Solorio
Proceedings of the 3rd Workshop in Computational Approaches to Subjectivity and Sentiment Analysis

2011

pdf bib
Modality Specific Meta Features for Authorship Attribution in Web Forum Posts
Thamar Solorio | Sangita Pillay | Sindhu Raghavan | Manuel Montes y Gómez
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Local Histograms of Character N-grams for Authorship Attribution
Hugo Jair Escalante | Thamar Solorio | Manuel Montes-y-Gómez
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the ACL 2011 Student Session
Sasa Petrovic | Ethan Selfridge | Emily Pitler | Miles Osborne | Thamar Solorio
Proceedings of the ACL 2011 Student Session

2010

pdf bib
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas
Thamar Solorio | Ted Pedersen
Proceedings of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas

2009

pdf bib
A Corpus-Based Approach for the Prediction of Language Impairment in Monolingual English and Spanish-English Bilingual Children
Keyur Gabani | Melissa Sherman | Thamar Solorio | Yang Liu | Lisa Bedore | Elizabeth Peña
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

pdf bib
Using Language Models to Identify Language Impairment in Spanish-English Bilingual Children
Thamar Solorio | Yang Liu
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

pdf bib
Learning to Predict Code-Switching Points
Thamar Solorio | Yang Liu
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
Part-of-Speech Tagging for English-Spanish Code-Switched Text
Thamar Solorio | Yang Liu
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

pdf bib
A Filter-Based Approach to Detect End-of-Utterances from Prosody in Dialog Systems
Olac Fuentes | David Vera | Thamar Solorio
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

2006

pdf bib
Improving Name Discrimination: A Language Salad Approach
Ted Pedersen | Anagha Kulkarni | Roxana Angheluta | Zornitsa Kozareva | Thamar Solorio
Proceedings of the Cross-Language Knowledge Induction Workshop

2005

pdf bib
Exploiting Named Entity Taggers in a Second Language
Thamar Solorio
Proceedings of the ACL Student Research Workshop

2004

pdf bib
A Language Independent Method for Question Classification
Thamar Solorio | Manuel Pérez-Coutiño | Manuel Montes-y-Gómez | Luis Villaseñor-Pineda | Aurelio López-López
COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics

Search
Co-authors