2024
Women Are Beautiful, Men Are Leaders: Gender Stereotypes in Machine Translation and Language Modeling
Matúš Pikuliak | Stefan Oresko | Andrea Hrckova | Marian Simko
Findings of the Association for Computational Linguistics: EMNLP 2024
We present GEST – a new manually created dataset designed to measure gender-stereotypical reasoning in language models and machine translation systems. GEST contains samples for 16 gender stereotypes about men and women (e.g., Women are beautiful, Men are leaders) that are compatible with the English language and 9 Slavic languages. The definition of said stereotypes was informed by gender experts. We used GEST to evaluate English and Slavic masked LMs, English generative LMs, and machine translation systems. We discovered significant and consistent amounts of gender-stereotypical reasoning in almost all the evaluated models and languages. Our experiments confirm the previously postulated hypothesis that the larger the model, the more stereotypical it usually is.
Disinformation Capabilities of Large Language Models
Ivan Vykopal | Matúš Pikuliak | Ivan Srba | Robert Moro | Dominik Macko | Maria Bielikova
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Automated disinformation generation is often listed as one of the risks of large language models (LLMs). The theoretical ability to flood the information space with disinformation content might have dramatic consequences for democratic societies around the world. This paper presents a comprehensive study of the capabilities of the current generation of LLMs to generate false news articles in the English language. In our study, we evaluated the capabilities of 10 LLMs using 20 disinformation narratives. We evaluated several aspects of the LLMs: how good they are at generating news articles, how strongly they tend to agree or disagree with the disinformation narratives, how often they generate safety warnings, etc. We also evaluated the abilities of detection models to detect these articles as LLM-generated. We conclude that LLMs are able to generate convincing news articles that agree with dangerous disinformation narratives.
2023
MULTITuDE: Large-Scale Multilingual Machine-Generated Text Detection Benchmark
Dominik Macko | Robert Moro | Adaku Uchendu | Jason Lucas | Michiharu Yamashita | Matúš Pikuliak | Ivan Srba | Thai Le | Dongwon Lee | Jakub Simko | Maria Bielikova
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
There is a lack of research into the capabilities of recent LLMs to generate convincing text in languages other than English and into the performance of detectors of machine-generated text in multilingual settings. This is also reflected in the available benchmarks, which lack authentic texts in languages other than English and predominantly cover older generators. To fill this gap, we introduce MULTITuDE, a novel benchmarking dataset for multilingual machine-generated text detection comprising 74,081 authentic and machine-generated texts in 11 languages (ar, ca, cs, de, en, es, nl, pt, ru, uk, and zh) generated by 8 multilingual LLMs. Using this benchmark, we compare the performance of zero-shot (statistical and black-box) and fine-tuned detectors. Considering the multilinguality, we evaluate 1) how these detectors generalize to unseen languages (linguistically similar as well as dissimilar) and unseen LLMs and 2) whether the detectors improve their performance when trained on multiple languages.
Multilingual Previously Fact-Checked Claim Retrieval
Matúš Pikuliak | Ivan Srba | Robert Moro | Timo Hromadka | Timotej Smoleň | Martin Melišek | Ivan Vykopal | Jakub Simko | Juraj Podroužek | Maria Bielikova
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Fact-checkers are often hampered by the sheer amount of online content that needs to be fact-checked. NLP can help them by retrieving already existing fact-checks relevant to the content being investigated. This paper introduces a new multilingual dataset for previously fact-checked claim retrieval. We collected 28k posts in 27 languages from social media, 206k fact-checks in 39 languages written by professional fact-checkers, as well as 31k connections between these two groups. This is the most extensive and the most linguistically diverse dataset of this kind to date. We evaluated how different unsupervised methods fare on this dataset and its various dimensions. We show that evaluating such a diverse dataset has its complexities and proper care needs to be taken before interpreting the results. We also evaluated a supervised fine-tuning approach, improving upon the unsupervised method significantly.
In-Depth Look at Word Filling Societal Bias Measures
Matúš Pikuliak | Ivana Beňová | Viktor Bachratý
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics
Many measures of societal bias in language models have been proposed in recent years. A popular approach is to use a set of word filling prompts to evaluate the behavior of the language models. In this work, we analyze the validity of two such measures – StereoSet and CrowS-Pairs. We show that these measures produce unexpected and illogical results when appropriate control group samples are constructed. Based on this, we believe that they are problematic and using them in the future should be reconsidered. We propose a way forward with an improved testing protocol. Finally, we also introduce a new gender bias dataset for Slovak.
2022
SlovakBERT: Slovak Masked Language Model
Matúš Pikuliak | Štefan Grivalský | Martin Konôpka | Miroslav Blšták | Martin Tamajka | Viktor Bachratý | Marian Simko | Pavol Balážik | Michal Trnka | Filip Uhlárik
Findings of the Association for Computational Linguistics: EMNLP 2022
We introduce a new Slovak masked language model called SlovakBERT. To the best of our knowledge, this is the first paper discussing Slovak transformer-based language models. We evaluate our model on several NLP tasks and achieve state-of-the-art results. This evaluation is likewise the first attempt to establish a benchmark for Slovak language models. We publish the masked language model, as well as the fine-tuned models for part-of-speech tagging, sentiment analysis and semantic textual similarity.
Average Is Not Enough: Caveats of Multilingual Evaluation
Matúš Pikuliak | Marian Simko
Proceedings of the 2nd Workshop on Multi-lingual Representation Learning (MRL)
This position paper discusses the problem of multilingual evaluation. Using simple statistics, such as average language performance, might inject linguistic biases in favor of dominant language families into evaluation methodology. We argue that a qualitative analysis of multilingual results, informed by comparative linguistics, is needed to detect this kind of bias. We show in our case study that results in published works can indeed be linguistically biased, and we demonstrate that visualization based on the URIEL typological database can detect it.
2019
STUFIIT at SemEval-2019 Task 5: Multilingual Hate Speech Detection on Twitter with MUSE and ELMo Embeddings
Michal Bojkovský | Matúš Pikuliak
Proceedings of the 13th International Workshop on Semantic Evaluation
We present a number of models used for hate speech detection for SemEval-2019 Task 5: HatEval. We evaluate the viability of multilingual learning for this task. We also experiment with adversarial learning as a means of creating a multilingual model. Ultimately, our multilingual models performed worse than their monolingual counterparts. We find that the choice of word representations (word embeddings) is crucial for deep learning, as a simple switch between MUSE and ELMo embeddings has shown a 3-4% increase in accuracy. This also shows the importance of context when dealing with online content.
2018
Improving Moderation of Online Discussions via Interpretable Neural Models
Andrej Švec | Matúš Pikuliak | Marián Šimko | Mária Bieliková
Proceedings of the 2nd Workshop on Abusive Language Online (ALW2)
The growing number of comments makes online discussions difficult to moderate by human moderators alone. Antisocial behavior is a common occurrence that often discourages other users from participating in discussion. We propose a neural-network-based method that partially automates the moderation process. It consists of two steps. First, we detect inappropriate comments for moderators to see. Second, we highlight inappropriate parts within these comments to make the moderation faster. We evaluated our method on data from a major Slovak news discussion platform.