2024
Multilingual Topic Classification in X: Dataset and Analysis
Dimosthenis Antypas | Asahi Ushio | Francesco Barbieri | Jose Camacho-Collados
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In the dynamic realm of social media, diverse topics are discussed daily, transcending linguistic boundaries. However, the complexities of understanding and categorising this content across various languages remain an important challenge, with traditional techniques like topic modelling often struggling to accommodate this multilingual diversity. In this paper, we introduce X-Topic, a multilingual dataset featuring content in four distinct languages (English, Spanish, Japanese, and Greek), crafted for the purpose of tweet topic classification. Our dataset includes a wide range of topics, tailored for social media content, making it a valuable resource for scientists and professionals working on cross-linguistic analysis, the development of robust multilingual models, and computational scientists studying online dialogue. Finally, we leverage X-Topic to perform a comprehensive cross-linguistic and multilingual analysis, and compare the capabilities of current general- and domain-specific language models.
How Are Metaphors Processed by Language Models? The Case of Analogies
Joanne Boisson | Asahi Ushio | Hsuvas Borkakoty | Kiamehr Rezaee | Dimosthenis Antypas | Zara Siddique | Nina White | Jose Camacho-Collados
Proceedings of the 28th Conference on Computational Natural Language Learning
The ability to compare by analogy, metaphorically or not, lies at the core of how humans understand the world and communicate. In this paper, we study the likelihood of metaphoric outputs, and the capability of a wide range of pretrained transformer-based language models to distinguish metaphors from other types of analogies, including anomalous ones. In particular, we are interested in discovering whether language models recognise metaphorical analogies as well as other types of analogies, and whether the model size has an impact on this ability. The results show that there are relevant differences using perplexity as a proxy, with the larger models reducing the gap when it comes to analogical processing, and for distinguishing metaphors from incorrect analogies. This behaviour does not result in increased difficulties for larger generative models in identifying metaphors in comparison to other types of analogies from anomalous sentences in a zero-shot generation setting, when perplexity values of metaphoric and non-metaphoric analogies are similar.
A Multi-Faceted NLP Analysis of Misinformation Spreaders in Twitter
Dimosthenis Antypas | Alun Preece | Jose Camacho-Collados
Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis
Social media is an integral part of the daily life of an increasingly large number of people worldwide. Used for entertainment, communication and news updates, it constitutes a source of information that has been extensively used to study human behaviour. Unfortunately, the open nature of social media platforms along with the difficult task of supervising their content has led to a proliferation of misinformation posts. In this paper, we aim to identify the textual differences between the profiles of users that share misinformation from questionable sources and those that do not. Our goal is to better understand user behaviour in order to be better equipped to combat this issue. To this end, we identify Twitter (X) accounts of potential misinformation spreaders and apply transformer models specialised in social media to extract characteristics such as sentiment, emotion, topic and presence of hate speech. Our results indicate that, while there may be some differences between the behaviour of users that share misinformation and those that do not, there are no large differences when it comes to the type of content shared.
2023
SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research
Dimosthenis Antypas | Asahi Ushio | Francesco Barbieri | Leonardo Neves | Kiamehr Rezaee | Luis Espinosa-Anke | Jiaxin Pei | Jose Camacho-Collados
Findings of the Association for Computational Linguistics: EMNLP 2023
Despite its relevance, the maturity of NLP for social media pales in comparison with general-purpose models, metrics and benchmarks. This fragmented landscape makes it hard for the community to know, for instance, given a task, which is the best performing model and how it compares with others. To alleviate this issue, we introduce a unified benchmark for NLP evaluation in social media, SuperTweetEval, which includes a heterogeneous set of tasks and datasets combined, adapted and constructed from scratch. We benchmarked the performance of a wide range of models on SuperTweetEval and our results suggest that, despite the recent advances in language modelling, social media remains challenging.
Robust Hate Speech Detection in Social Media: A Cross-Dataset Empirical Evaluation
Dimosthenis Antypas | Jose Camacho-Collados
The 7th Workshop on Online Abuse and Harms (WOAH)
The automatic detection of hate speech online is an active research area in NLP. Most of the studies to date are based on social media datasets that contribute to the creation of hate speech detection models trained on them. However, data creation processes contain their own biases, and models inherently learn from these dataset-specific biases. In this paper, we perform a large-scale cross-dataset comparison where we fine-tune language models on different hate speech detection datasets. This analysis shows how some datasets are more generalizable than others when used as training data. Crucially, our experiments show how combining hate speech detection datasets can contribute to the development of robust hate speech detection models. This robustness holds even when controlling for data size and when compared with the best individual datasets.
2022
TweetNLP: Cutting-Edge Natural Language Processing for Social Media
Jose Camacho-Collados | Kiamehr Rezaee | Talayeh Riahi | Asahi Ushio | Daniel Loureiro | Dimosthenis Antypas | Joanne Boisson | Luis Espinosa Anke | Fangyu Liu | Eugenio Martínez Cámara
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
In this paper, we present TweetNLP, an integrated platform for Natural Language Processing (NLP) in social media. TweetNLP supports a diverse set of NLP tasks, including generic focus areas such as sentiment analysis and named entity recognition, as well as social media-specific tasks such as emoji prediction and offensive language identification. Task-specific systems are powered by reasonably sized Transformer-based language models specialized on social media text (in particular, Twitter), which can be run without the need for dedicated hardware or cloud services. The main contributions of TweetNLP are: (1) an integrated Python library for a modern toolkit supporting social media analysis using our various task-specific models adapted to the social domain; (2) an interactive online demo for codeless experimentation using our models; and (3) a tutorial covering a wide variety of typical social media applications.
Twitter Topic Classification
Dimosthenis Antypas | Asahi Ushio | Jose Camacho-Collados | Vitor Silva | Leonardo Neves | Francesco Barbieri
Proceedings of the 29th International Conference on Computational Linguistics
Social media platforms host discussions about a wide variety of topics that arise every day. Making sense of all the content and organising it into categories is an arduous task. A common way to deal with this issue is relying on topic modeling, but topics discovered using this technique are difficult to interpret and can differ from corpus to corpus. In this paper, we present a new task based on tweet topic classification and release two associated datasets. Given a wide range of topics covering the most important discussion points in social media, we provide training and testing data from recent time periods that can be used to evaluate tweet classification models. Moreover, we perform a quantitative evaluation and analysis of current general- and domain-specific language models on the task, which provide more insights on the challenges and nature of the task.
2021
COVID-19 and Misinformation: A Large-Scale Lexical Analysis on Twitter
Dimosthenis Antypas | Jose Camacho-Collados | Alun Preece | David Rogers
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Social media is often used by individuals and organisations as a platform to spread misinformation. With the recent coronavirus pandemic we have seen a surge of misinformation on Twitter, posing a danger to public health. In this paper, we compile a large COVID-19 Twitter misinformation corpus and perform an analysis to discover patterns with respect to vocabulary usage. Among other findings, our analysis reveals that the variety of topics and vocabulary usage are considerably more limited and negative in tweets related to misinformation than in randomly extracted tweets. In addition to our qualitative analysis, our experimental results show that a simple linear model based only on lexical features is effective in identifying misinformation-related tweets (with accuracy over 80%), providing evidence that the vocabulary used in misinformation differs largely from that of generic tweets.