Workshop on Noisy User-generated Text (2022)


Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Changes in Tweet Geolocation over Time: A Study with Carmen 2.0
Jingyu Zhang | Alexandra DeLucia | Mark Dredze

Researchers across disciplines use Twitter geolocation tools to filter data for desired locations. These tools have largely been trained and tested on English tweets, often originating in the United States from almost a decade ago. Despite the importance of these tools for data curation, the impact of tweet language, country of origin, and creation date on tool performance remains largely unknown. We explore these issues with Carmen, a popular tool for Twitter geolocation. To support this study we introduce Carmen 2.0, a major update which includes the incorporation of GeoNames, a gazetteer that provides much broader coverage of locations. We evaluate using two new Twitter datasets, one for multilingual, multiyear geolocation evaluation, and another for usage trends over time. We found that language, country of origin, and time do impact geolocation tool performance.

Extracting Mathematical Concepts from Text
Jacob Collard | Valeria de Paiva | Brendan Fong | Eswaran Subrahmanian

We investigate different systems for extracting mathematical entities from English texts in the mathematical field of category theory as a first step for constructing a mathematical knowledge graph. We consider four different term extractors and compare their results. This small experiment showcases some of the issues with the construction and evaluation of terms extracted from noisy domain text. We also make available two open corpora in research mathematics, in particular in category theory: a small corpus of 755 abstracts from the journal TAC (3188 sentences), and a larger corpus from the nLab community wiki (15,000 sentences).

Data-driven Approach to Differentiating between Depression and Dementia from Noisy Speech and Language Data
Malikeh Ehghaghi | Frank Rudzicz | Jekaterina Novikova

A significant number of studies apply acoustic and linguistic characteristics of human speech as prominent markers of dementia and depression. However, studies on discriminating depression from dementia are rare. Co-morbid depression is frequent in dementia and these clinical conditions share many overlapping symptoms, but the ability to distinguish between depression and dementia is essential as depression is often curable. In this work, we investigate the ability of clustering approaches to distinguish between depression and dementia from human speech. We introduce a novel aggregated dataset, which combines narrative speech data from multiple conditions, i.e., Alzheimer’s disease, mild cognitive impairment, healthy control, and depression. We compare linear and non-linear clustering approaches and show that non-linear clustering techniques distinguish better between distinct disease clusters. Our interpretability analysis shows that the main differentiating symptoms between dementia and depression are acoustic abnormality, repetitiveness (or circularity) of speech, word finding difficulty, coherence impairment, and differences in lexical complexity and richness.

Cross-Dialect Social Media Dependency Parsing for Social Scientific Entity Attribute Analysis
Chloe Eggleston | Brendan O’Connor

In this paper, we utilize recent advancements in social media natural language processing to obtain state-of-the-art syntactic dependency parsing results for social media English. We observe performance gains of 3.4 UAS and 4.0 LAS against the previous state-of-the-art as well as less disparity between African-American and Mainstream American English dialects. We demonstrate the computational social scientific utility of this parser for the task of socially embedded entity attribute analysis: for a specified entity, derive its semantic relationships from parses’ rich syntax, and accumulate and compare them across social variables. We conduct a case study on politicized views of U.S. official Anthony Fauci during the COVID-19 pandemic.

Impact of Environmental Noise on Alzheimer’s Disease Detection from Speech: Should You Let a Baby Cry?
Jekaterina Novikova

Research related to automatically detecting Alzheimer’s disease (AD) is important, given the high prevalence of AD and the high cost of traditional methods. Since AD significantly affects the acoustics of spontaneous speech, speech processing and machine learning (ML) provide promising techniques for reliably detecting AD. However, speech audio may be affected by different types of background noise and it is important to understand how the noise influences the accuracy of ML models detecting AD from speech. In this paper, we study the effect of fifteen types of environmental noise from five different categories on the performance of four ML models trained with three types of acoustic representations. We perform a thorough analysis showing how ML models and acoustic features are affected by different types of acoustic noise. We show that acoustic noise is not necessarily harmful: certain types of noise are beneficial for AD detection models and help increase accuracy by up to 4.8%. We provide recommendations on how to utilize acoustic noise in order to achieve the best performance results with the ML models deployed in the real world.

Exploring Multimodal Features and Fusion Strategies for Analyzing Disaster Tweets
Raj Pranesh

Social media platforms, such as Twitter, often provide firsthand news during the outbreak of a crisis. It is essential to process these facts quickly to plan the response efforts for minimal loss. Therefore, in this paper, we present an analysis of various multimodal feature fusion techniques to analyze and classify disaster tweets into multiple crisis events via transfer learning. In our study, we utilized three image models pre-trained on the ImageNet dataset and three fine-tuned language models to learn the visual and textual features of the data and combine them to make predictions. We have presented a systematic analysis of multiple intra-modal and cross-modal fusion strategies and their effect on the performance of the multimodal disaster classification system. In our experiment, we used 8,242 disaster tweets, each comprising image and text data, with five disaster event classes. The results show that the multimodal model with a transformer-attention mechanism and factorized bilinear pooling (FBP) for intra-modal and cross-modal feature fusion, respectively, achieved the best performance.

NTULM: Enriching Social Media Text Representations with Non-Textual Units
Jinning Li | Shubhanshu Mishra | Ahmed El-Kishky | Sneha Mehta | Vivek Kulkarni

On social media, additional context is often present in the form of annotations and meta-data such as the post’s author, mentions, hashtags, and hyperlinks. We refer to these annotations as Non-Textual Units (NTUs). We posit that NTUs provide social context beyond their textual semantics and that leveraging these units can enrich social media text representations. In this work we construct an NTU-centric social heterogeneous network to co-embed NTUs. We then principally integrate these NTU embeddings into a large pretrained language model by fine-tuning with these additional units. This adds context to noisy short-text social media. Experiments show that utilizing NTU-augmented text representations significantly outperforms existing text-only baselines by 2-5% relative points on many downstream tasks, highlighting the importance of context to social media NLP. We also highlight that including NTU context in the initial layers of the language model alongside text is better than using it after the text embedding is generated. Our work leads to the generation of holistic, general-purpose social media content embeddings.

Robust Candidate Generation for Entity Linking on Short Social Media Texts
Liam Hebert | Raheleh Makki | Shubhanshu Mishra | Hamidreza Saghir | Anusha Kamath | Yuval Merhav

Entity Linking (EL) is the gateway into Knowledge Bases. Recent advances in EL utilize dense retrieval approaches for Candidate Generation, which address some of the shortcomings of the lookup-based approach of matching NER mentions against pre-computed dictionaries. In this work, we show that in the domain of Tweets, such methods suffer as users often write with informal spelling, limited context, and a lack of specificity, among other issues. We investigate these challenges on a large and recent Tweets benchmark for EL, empirically evaluate lookup and dense retrieval approaches, and demonstrate that a hybrid solution using long contextual representations from Wikipedia is necessary to achieve considerable gains over previous work, achieving 0.93 recall.

TransPOS: Transformers for Consolidating Different POS Tagset Datasets
Alex Li | Ilyas Bankole-Hameed | Ranadeep Singh | Gabriel Ng | Akshat Gupta

In the hope of expanding training data, researchers often want to merge two or more datasets that are created using different labeling schemes. This paper considers two datasets that label part-of-speech (POS) tags under different tagging schemes and leverages the supervised labels of one dataset to help generate labels for the other dataset. This paper further discusses the theoretical difficulties of this approach and proposes a novel supervised architecture employing Transformers to tackle the problem of consolidating two completely disjoint datasets. The results diverge from initial expectations and discourage exploration into the use of disjoint labels to consolidate datasets with different labels.

An Effective, Performant Named Entity Recognition System for Noisy Business Telephone Conversation Transcripts
Xue-Yong Fu | Cheng Chen | Md Tahmid Rahman Laskar | Shashi Bhushan Tn | Simon Corston-Oliver

We present a simple yet effective method to train a named entity recognition (NER) model that operates on business telephone conversation transcripts that contain noise due to the nature of spoken conversation and artifacts of automatic speech recognition. We first fine-tune LUKE, a state-of-the-art NER model, on a limited number of transcripts, then use it as the teacher model to teach a smaller DistilBERT-based student model using a large amount of weakly labeled data and a small amount of human-annotated data. The model achieves high accuracy while also satisfying the practical constraints for inclusion in a commercial telephony product: realtime performance when deployed on cost-effective CPUs rather than GPUs. In this paper, we introduce the fine-tune-then-distill method for entity recognition on real world noisy data to deploy our NER model in a limited budget production environment. By generating pseudo-labels using a large teacher model pre-trained on typed text while fine-tuned on noisy speech text to train a smaller student model, we make the student model 75 times faster while preserving 99.09% of its accuracy. These findings demonstrate that our proposed approach is very effective in limited budget scenarios to alleviate the need for human labeling of a large amount of noisy data.

Leveraging Semantic and Sentiment Knowledge for User-Generated Text Sentiment Classification
Jawad Khan | Niaz Ahmad | Aftab Alam | Youngmoon Lee

Sentiment analysis is essential to process and understand unstructured user-generated content for better data analytics and decision-making. State-of-the-art techniques suffer from a high dimensional feature space because of noisy and irrelevant features from the noisy user-generated text. Our goal is to mitigate such problems using DNN-based text classification and popular word embeddings (GloVe, fastText, and BERT) in conjunction with statistical filter feature selection (mRMR and PCA) to select relevant sentiment features and discard unessential or irrelevant ones. We propose an effective way of integrating the traditional feature construction methods with the DNN-based methods to improve the performance of sentiment classification. We evaluate our model on three real-world benchmark datasets, demonstrating that our proposed method improves the classification performance of several existing methods.

An Emotional Journey: Detecting Emotion Trajectories in Dutch Customer Service Dialogues
Sofie Labat | Amir Hadifar | Thomas Demeester | Veronique Hoste

The ability to track fine-grained emotions in customer service dialogues has many real-world applications, but has not been studied extensively. This paper measures the potential of prediction models on that task, based on a real-world dataset of Dutch Twitter conversations in the domain of customer service. We find that modeling emotion trajectories has a small, but measurable benefit compared to predictions based on isolated turns. The models used in our study are shown to generalize well to different companies and economic sectors.

Supervised and Unsupervised Evaluation of Synthetic Code-Switching
Evgeny Orlov | Ekaterina Artemova

Code-switching (CS) is a phenomenon of mixing words and phrases from multiple languages within a single sentence or conversation. The ever-growing amount of CS communication among multilingual speakers in social media has highlighted the need to adapt existing NLP products for CS speakers and led to a rising interest in solving CS NLP tasks. A large number of contemporary approaches use synthetic CS data for training. As previous work has shown the positive effect of pretraining on high-quality CS data, the task of evaluating synthetic CS becomes crucial. In this paper, we address the task of evaluating synthetic CS in two settings. In the supervised setting, we apply Hinglish finetuned models to solve the quality rating prediction task of the HinglishEval competition and establish a new SOTA. In the unsupervised setting, we employ the method of acceptability measures with the same models. We find that in both settings, models finetuned on CS data consistently outperform their original counterparts.

ArabGend: Gender Analysis and Inference on Arabic Twitter
Hamdy Mubarak | Shammur Absar Chowdhury | Firoj Alam

Gender analysis of Twitter can reveal important socio-cultural differences between male and female users. Significant effort has gone into analyzing and automatically inferring gender for content in the most widely spoken languages; however, to our knowledge, very limited work has been done for Arabic. In this paper, we perform an extensive analysis of differences between male and female users on the Arabic Twitter-sphere. We study differences in user engagement, topics of interest, and the gender gap in professions. Along with gender analysis, we also propose a method to infer gender by utilizing usernames, profile pictures, tweets, and networks of friends. In order to do so, we manually annotated gender and locations for ~166K Twitter accounts associated with ~92K user locations, which we plan to make publicly available. Our proposed gender inference method achieves an F1 score of 82.1% (47.3% higher than the majority baseline). We also developed a demo and made it publicly available.

Automatic Identification of 5C Vaccine Behaviour on Social Media
Ajay Hemanth Sampath Kumar | Aminath Shausan | Gianluca Demartini | Afshin Rahimi

Monitoring vaccine behaviour through social media can guide health policy. We present a new dataset of 9471 tweets posted in Australia from 2020 to 2022, annotated with sentiment toward vaccines and also 5C, the five types of behaviour toward vaccines, a scheme commonly used in health psychology literature. We benchmark our dataset using BERT and Gradient Boosting Machine and show that jointly training both sentiment and 5C tasks (F1=48) outperforms individual training (F1=39) in this highly imbalanced data. Our sentiment analysis indicates a close correlation between the sentiments and prominent events during the pandemic. We hope that our dataset and benchmark models will inform further work in online monitoring of vaccine behaviour. The dataset and benchmark methods are accessible online.

Automatic Extraction of Structured Mineral Drillhole Results from Unstructured Mining Company Reports
Adam Dimeski | Afshin Rahimi

Aggregate mining exploration results can help companies and governments to optimise and police mining permits and operations, a necessity for transition to a renewable energy future; however, these results are buried in unstructured text. We present a novel dataset from 23 Australian mining company reports, framing the extraction of structured drillhole information as a sequence labelling task. Our two benchmark models, based on Bi-LSTM-CRF and BERT, show their effectiveness in this task with F1 scores of 77% and 87%, respectively. Our dataset and benchmarks are accessible online.

“Kanglish alli names!” Named Entity Recognition for Kannada-English Code-Mixed Social Media Data
Sumukh S | Manish Shrivastava

Code-mixing (CM) is a frequently observed phenomenon on social media platforms in multilingual societies such as India. While the increase in code-mixed content on these platforms provides a good amount of data for studying various aspects of code-mixing, the lack of automated text analysis tools makes such studies difficult. To overcome this, tools such as language identifiers and part-of-speech (POS) taggers for analysing code-mixed data have been developed. One such tool is Named Entity Recognition (NER), an important Natural Language Processing (NLP) task, which is not only a subtask of Information Extraction, but is also needed for downstream NLP tasks such as semantic role labeling. While entity extraction from social media data is generally difficult due to its informal nature, code-mixed data further complicates the problem due to its informal, unstructured and incomplete information. In this work, we present the first-ever corpus for Kannada-English code-mixed social media data with the corresponding named entity tags for NER. We provide strong baselines with machine learning classification models such as CRF, Bi-LSTM, and Bi-LSTM-CRF on our corpus with word, character, and lexical features.

Span Extraction Aided Improved Code-mixed Sentiment Classification
Ramaneswaran S | Sean Benhur | Sreyan Ghosh

Sentiment classification is a fundamental NLP task of detecting the sentiment polarity of a given text. In this paper we show how solving sentiment span extraction as an auxiliary task can help improve final sentiment classification performance in a low-resource code-mixed setup. To be precise, we don’t solve a simple multi-task learning (MTL) objective, but rather design a unified transformer framework that exploits the bidirectional connection between the two tasks simultaneously. To facilitate research in this direction we release a gold-standard human-annotated sentiment span extraction dataset for Tamil-English code-switched texts. Extensive experiments and strong baselines show that our proposed approach improves sentiment and span prediction by 1.27% and 2.78%, respectively, over the best-performing MTL baseline. We also establish the generalizability of our approach on the Twitter Sentiment Extraction dataset. We make our code and data publicly available on GitHub.

AdBERT: An Effective Few Shot Learning Framework for Aligning Tweets to Superbowl Advertisements
Debarati Das | Roopana Chenchu | Maral Abdollahi | Jisu Huh | Jaideep Srivastava

The tremendous increase in social media usage for sharing Television (TV) experiences has provided a unique opportunity in the Public Health and Marketing sectors to understand viewer engagement and attitudes through viewer-generated content on social media. However, this opportunity also comes with associated technical challenges. Specifically, given a televised event and related tweets about this event, we need methods to effectively align these tweets and the corresponding event. In this paper, we consider the specific ecosystem of the Superbowl 2020 and map viewer tweets to advertisements they are referring to. Our proposed model, AdBERT, is an effective few-shot learning framework that is able to handle the technical challenges of establishing ad-relatedness, class imbalance, and the scarcity of labeled data. As part of this study, we have curated and developed two datasets that can prove to be useful for Social TV research: 1) a dataset of ad-related tweets and 2) a dataset of ad descriptions of Superbowl advertisements. Explaining connections to SentenceBERT, we describe the advantages of AdBERT that allow us to make the most out of a challenging and interesting dataset, which we will open-source along with the models developed in this paper.

Increasing Robustness for Cross-domain Dialogue Act Classification on Social Media Data
Marcus Vielsted | Nikolaj Wallenius | Rob van der Goot

Automatically detecting the intent of an utterance is important for various downstream natural language processing tasks. This task is also called Dialogue Act Classification (DAC) and was primarily researched on spoken one-to-one conversations. The rise of social media has made this an interesting data source to explore within DAC, although it comes with some difficulties: non-standard form, variety of language types (across and within platforms), and quickly evolving norms. We therefore investigate the robustness of DAC on social media data in this paper. More concretely, we provide a benchmark that includes cross-domain data splits, as well as a variety of improvements on our transformer-based baseline. Our experiments show that lexical normalization is not beneficial in this setup, balancing the labels through resampling is beneficial in some cases, and incorporating context is crucial for this task and leads to the highest performance improvements (7 F1 percentage points in-domain and 20 cross-domain).

Disfluency Detection for Vietnamese
Mai Hoang Dao | Thinh Hung Truong | Dat Quoc Nguyen

In this paper, we present the first empirical study for Vietnamese disfluency detection. To conduct this study, we first create a disfluency detection dataset for Vietnamese, with manual annotations over two disfluency types. We then empirically perform experiments using strong baseline models, and find that: automatic Vietnamese word segmentation improves the disfluency detection performances of the baselines, and the highest performance results are obtained by fine-tuning pre-trained language models in which the monolingual model PhoBERT for Vietnamese does better than the multilingual model XLM-R.

A multi-level approach for hierarchical Ticket Classification
Matteo Marcuzzo | Alessandro Zangari | Michele Schiavinato | Lorenzo Giudice | Andrea Gasparetto | Andrea Albarelli

The automatic categorization of support tickets is a fundamental tool for modern businesses. Such requests are most commonly composed of concise textual descriptions that are noisy and filled with technical jargon. In this paper, we test the effectiveness of pre-trained LMs for the classification of issues related to software bugs. First, we test several strategies to produce single, ticket-wise representations starting from their BERT-generated word embeddings. Then, we showcase a simple yet effective way to build a multi-level classifier for the categorization of documents with two hierarchically dependent labels. We experiment on a public bugs dataset and compare our results with standard BERT-based and traditional SVM classifiers. Our findings suggest that both embedding strategies and hierarchical label dependencies considerably impact classification accuracy.

Towards better structured and less noisy Web data: Oscar with Register annotations
Veronika Laippala | Anna Salmela | Samuel Rönnqvist | Alham Fikri Aji | Li-Hsin Chang | Asma Dhifallah | Larissa Goulart | Henna Kortelainen | Marc Pàmies | Deise Prina Dutra | Valtteri Skantsi | Lintang Sutawika | Sampo Pyysalo

Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, that is, whether the texts are, e.g., forum discussions, lyrical or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files that originate from Web crawls, removing remnants of boilerplate and other elements not belonging to the main text of the Web page. The register-labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.

True or False? Detecting False Information on Social Media Using Graph Neural Networks
Samyo Rode-Hasinger | Anna Kruspe | Xiao Xiang Zhu

In recent years, false information such as fake news, rumors and conspiracy theories on many relevant issues in society has proliferated. This phenomenon has been significantly amplified by the fast and inexorable spread of misinformation on social media and instant messaging platforms. With this work, we contribute to containing the negative impact on society caused by fake news. We propose a graph neural network approach for detecting false information on Twitter. We leverage the inherent structure of graph-based social media data, aggregating information from short text messages (tweets), user profiles and social interactions. We use knowledge from pre-trained language models efficiently, and show that user-defined descriptions of profiles provide useful information for improved prediction performance. The empirical results indicate that our proposed framework significantly outperforms text- and user-based methods on misinformation datasets from two different domains, even in a difficult multilingual setting.

Analyzing the Real Vulnerability of Hate Speech Detection Systems against Targeted Intentional Noise
Piush Aggarwal | Torsten Zesch

Hate speech detection systems have been shown to be vulnerable against obfuscation attacks, where a potential hater tries to circumvent detection by deliberately introducing noise in their posts. In previous work, noise is often introduced for all words (which is likely overestimating the impact) or single untargeted words (likely underestimating the vulnerability). We perform a user study asking people to select words they would obfuscate in a post. Using this realistic setting, we find that the real vulnerability of hate speech detection systems against deliberately introduced noise is almost as high as when using a whitebox attack and much more severe than when using a non-targeted dictionary. Our results are based on 4 different datasets, 12 different obfuscation strategies, and hate speech detection systems using different paradigms.