Alexander Shvets


2024

Community OSCAR: A Community Effort for Multilingual Web Data
Manuel Brack | Malte Ostendorff | Pedro Ortiz Suarez | José Javier Saiz | Iñaki Lacunza Castilla | Jorge Palomar-Giner | Alexander Shvets | Patrick Schramowski | Georg Rehm | Marta Villegas | Kristian Kersting
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)

The development of large language models (LLMs) relies heavily on extensive, high-quality datasets. Publicly available datasets focus predominantly on English, leaving other language communities behind. To address this issue, we introduce Community OSCAR, a multilingual dataset initiative designed to address the gap between English and non-English data availability. Through a collective effort, Community OSCAR covers over 150 languages with 45 billion documents, totaling over 345 TiB of data. Initial results indicate that Community OSCAR provides valuable raw data for training LLMs and enhancing the performance of multilingual models. This work aims to contribute to the ongoing advancements in multilingual NLP and to support a more inclusive AI ecosystem by making high-quality, multilingual data more accessible to those working with low-resource languages.

GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection?
Yiping Jin | Leo Wanner | Alexander Shvets
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Online hate detection suffers from biases incurred in data sampling, annotation, and model pre-training. Therefore, measuring the averaged performance over all examples in held-out test data is inadequate. Instead, we must identify specific model weaknesses and be informed when it is more likely to fail. A recent proposal in this direction is HateCheck, a suite for testing fine-grained model functionalities on synthesized data generated using templates of the kind “You are just a [slur] to me.” However, despite enabling more detailed diagnostic insights, the HateCheck test cases are often generic and have simplistic sentence structures that do not match the real-world data. To address this limitation, we propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch by instructing large language models (LLMs). We employ an additional natural language inference (NLI) model to verify the generations. Crowd-sourced annotation demonstrates that the generated test cases are of high quality. Using the new functional tests, we can uncover model weaknesses that would be overlooked using the original HateCheck dataset.

2023

Error syntax aware augmentation of feedback comment generation dataset
Nikolay Babakov | Maria Lysyuk | Alexander Shvets | Lilya Kazakova | Alexander Panchenko
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges

This paper presents a solution to the GenChal 2022 shared task dedicated to feedback comment generation for writing learning. In this task, given a text with an error and the span of the error, a system generates an explanatory note that helps the writer (a language learner) improve their writing skills. Our solution is based on fine-tuning the T5 model on the initial dataset augmented according to the syntactic dependencies of the words located within the indicated error span. The solution of our team ‘nigula’ obtained second place according to manual evaluation by the organizers.

A Pipeline for Extracting Abstract Dependency Templates for Data-to-Text Natural Language Generation
Simon Mille | Josep Ricci | Alexander Shvets | Anya Belz
Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)

We present work in progress that aims to address the coverage issue faced by rule-based text generators. We propose a pipeline for extracting abstract dependency templates (predicate-argument structures) from Wikipedia text to be used as input for generating text from structured data with the FORGe system. The pipeline comprises three main components: (i) candidate sentence retrieval, (ii) clause extraction, ranking and selection, and (iii) conversion to predicate-argument form. We present an approach and preliminary evaluation for the ranking and selection module.

Towards Weakly-Supervised Hate Speech Classification Across Datasets
Yiping Jin | Leo Wanner | Vishakha Kadam | Alexander Shvets
The 7th Workshop on Online Abuse and Harms (WOAH)

As pointed out by several scholars, current research on hate speech (HS) recognition is characterized by unsystematic data creation strategies and diverging annotation schemata. Consequently, supervised-learning models tend to generalize poorly to datasets they were not trained on, and the performance of the models trained on datasets labeled using different HS taxonomies cannot be compared. To ease this problem, we propose applying extremely weak supervision that only relies on the class name rather than on class samples from the annotated data. We demonstrate the effectiveness of a state-of-the-art weakly-supervised text classification model in various in-dataset and cross-dataset settings. Furthermore, we conduct an in-depth quantitative and qualitative analysis of the source of poor generalizability of HS classification models.

2022

Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers
Luis Espinosa Anke | Alexander Shvets | Alireza Mohammadshahi | James Henderson | Leo Wanner
Proceedings of the 11th Joint Conference on Lexical and Computational Semantics

Recognizing and categorizing lexical collocations in context is useful for language learning, dictionary compilation and downstream NLP. However, it is a challenging task due to the varying degrees of frozenness lexical collocations exhibit. In this paper, we put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context. Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.

2021

Targets and Aspects in Social Media Hate Speech
Alexander Shvets | Paula Fortuna | Juan Soler | Leo Wanner
Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021)

Mainstream research on hate speech has so far focused predominantly on the task of classifying mainly social media posts with respect to predefined typologies of rather coarse-grained hate speech categories. This may be sufficient if the goal is to detect and delete abusive language posts. However, removal is not always possible due to the legislation of a country. Also, there is evidence that hate speech cannot be successfully combated by merely removing hate speech posts; they should be countered by education and counter-narratives. For this purpose, we need to identify (i) who is the target in a given hate speech post, and (ii) what aspects (or characteristics) of the target are attributed to the target in the post. As a first approximation, we propose to adapt a generic state-of-the-art concept extraction model to the hate speech domain. The outcome of the experiments is promising and can serve as inspiration for further work on the task.

2018

Sentence Packaging in Text Generation from Semantic Graphs as a Community Detection Problem
Alexander Shvets | Simon Mille | Leo Wanner
Proceedings of the 11th International Conference on Natural Language Generation

An increasing amount of research tackles the challenge of text generation from abstract ontological or semantic structures, which are in their very nature potentially large connected graphs. These graphs must be “packaged” into sentence-wise subgraphs. We interpret the problem of sentence packaging as a community detection problem with post optimization. Experiments on the texts of the VerbNet/FrameNet structure annotated-Penn Treebank, which have been converted into graphs by a coreference merge using Stanford CoreNLP, show a high F1-score of 0.738.

2012

Speech and Language Resources for LVCSR of Russian
Sergey Zablotskiy | Alexander Shvets | Maxim Sidorov | Eugene Semenkin | Wolfgang Minker
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

A syllable-based language model reduces the lexicon size by hundreds of times. It is especially beneficial in the case of highly inflected languages like Russian due to the abundance of word forms according to various grammatical categories. However, the main arising challenge is the concatenation of recognised syllables into the originally spoken sentence or phrase, particularly in the presence of syllable recognition mistakes. Natural fluent speech does not usually incorporate clear information about the outside borders of the spoken words. In this paper, a method for syllable concatenation and error correction is suggested and tested. It is based on the designed co-evolutionary asymptotic probabilistic genetic algorithm for the determination of the most likely sentence corresponding to the recognized chain of syllables within an acceptable time frame. The advantage of this genetic algorithm modification is the minimum number of settings to be manually adjusted compared to the standard algorithm. Data used for acoustic and language modelling are also described here. A special issue is the preprocessing of the textual data, particularly the handling of abbreviations and of Arabic and Roman numerals, since their inflection mostly depends on the context and grammar.