Silvia Bernardini

2023

On the Identification and Forecasting of Hate Speech in Inceldom
Paolo Gajo | Arianna Muti | Katerina Korre | Silvia Bernardini | Alberto Barrón-Cedeño
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Spotting hate speech in social media posts is crucial to increase the civility of the Web and has been thoroughly explored in the NLP community. For the first time, we introduce a multilingual corpus for the analysis and identification of hate speech in the domain of inceldom, built from incel Web forums in English and Italian, including expert annotation at the post level for two kinds of hate speech: misogyny and racism. This resource paves the way for the development of mono- and cross-lingual models for (a) the identification of hateful (misogynous and racist) posts and (b) the forecasting of the number of hateful responses that a post is likely to trigger. Our experiments aim at improving the performance of Transformer-based models using masked language modeling pre-training and dataset merging. The results show that these strategies boost the models’ performance in all settings (binary classification, multi-label classification and forecasting), especially in the cross-lingual scenarios.

Hate Speech Detection in an Italian Incel Forum Using Bilingual Data for Pre-Training and Fine-Tuning
Paolo Gajo | Silvia Bernardini | Adriano Ferraresi | Alberto Barrón-Cedeño
Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)

Return to the Source: Assessing Machine Translation Suitability
Francesco Fernicola | Silvia Bernardini | Federico Garcea | Adriano Ferraresi | Alberto Barrón-Cedeño
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

We approach the task of assessing the suitability of a source text for translation by transferring the knowledge from established MT evaluation metrics to a model able to predict MT quality a priori from the source text alone. To open the door to experiments in this regard, we start from reference English-German parallel corpora to build a corpus of 14,253 source text-quality score tuples. The tuples include four state-of-the-art metrics: cushLEPOR, BERTScore, COMET, and TransQuest. With this new resource at hand, we fine-tune XLM-RoBERTa, both in a single-task and a multi-task setting, to predict these evaluation scores from the source text alone. Results for this methodology are promising: the single-task model approximates well-established MT evaluation and quality estimation metrics without looking at the actual machine translations, achieving low RMSE values in the [0.1-0.2] range and Pearson correlation scores up to 0.688.

2019

Do translator trainees trust machine translation? An experiment on post-editing and revision
Randy Scansani | Silvia Bernardini | Adriano Ferraresi | Luisa Bentivogli
Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks

MAGMATic: A Multi-domain Academic Gold Standard with Manual Annotation of Terminology for Machine Translation Evaluation
Randy Scansani | Luisa Bentivogli | Silvia Bernardini | Adriano Ferraresi
Proceedings of Machine Translation Summit XVII: Research Track

2017

Enhancing Machine Translation of Academic Course Catalogues with Terminological Resources
Randy Scansani | Silvia Bernardini | Adriano Ferraresi | Federico Gaspari | Marcello Soffritti
Proceedings of the Workshop Human-Informed Translation and Interpreting Technology

This paper describes an approach to translating course unit descriptions from Italian and German into English, using a phrase-based machine translation (MT) system. The genre is very prominent among those requiring translation by universities in European countries in which English is a non-native language. For each language combination, an in-domain bilingual corpus including course unit and degree program descriptions is used to train an MT engine, whose output is then compared to a baseline engine trained on the Europarl corpus. In a subsequent experiment, a bilingual terminology database is added to the training sets in both engines and its impact on the output quality is evaluated based on BLEU and post-editing score. Results suggest that the use of domain-specific corpora boosts the engines' quality for both language combinations, especially for German-English, whereas adding terminological resources does not seem to bring notable benefits.

2009

Proceedings of the Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning
Iustina Ilisei | Viktor Pekar | Silvia Bernardini
Proceedings of the Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography, and Language Learning

2008

Introducing and evaluating ukWaC, a very large web-derived corpus of English
Adriano Ferraresi | Eros Zanchetta | Marco Baroni | Silvia Bernardini
Proceedings of the 4th Web as Corpus Workshop

In this paper we introduce ukWaC, a large corpus of English constructed by crawling the .uk Internet domain. The corpus contains more than 2 billion tokens and is one of the largest freely available linguistic resources for English. The paper describes the tools and methodology used in the construction of the corpus and provides a qualitative evaluation of its contents, carried out through a vocabulary-based comparison with the BNC. We conclude by giving practical information about the availability and format of the corpus.