Alice Fedotova


2024

Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription
Alice Fedotova | Adriano Ferraresi | Maja Miličević Petrović | Alberto Barrón-Cedeño
Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024)

This paper presents a novel pipeline for constructing multimodal and multilingual parallel corpora, with a focus on evaluating state-of-the-art ASR tools for verbatim transcription. Our findings indicate that current technologies can streamline corpus construction, with fine-tuning yielding better transcription quality than out-of-the-box Whisper models. The lowest overall WER achieved for English was 0.180, using a fine-tuned Whisper-small model. For Italian, the fine-tuned Whisper-small model obtained a WER of 0.201, compared to 0.219 for the baseline Whisper-small. While limitations remain, the updated pipeline is expected to drastically reduce the human effort involved.
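The WER figures above are the standard word error rate: the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. A minimal pure-Python sketch of that metric (not the paper's actual evaluation code, which likely uses an established toolkit) might look like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # cost of deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # cost of inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c d", "a x c")` counts one substitution and one deletion against four reference words, giving 0.5. A reported WER of 0.180 thus means roughly one word-level error per 5.5 reference words.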

A Corpus for Sentence-Level Subjectivity Detection on English News Articles
Francesco Antici | Federico Ruggeri | Andrea Galassi | Katerina Korre | Arianna Muti | Alessandra Bardi | Alice Fedotova | Alberto Barrón-Cedeño
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We develop novel annotation guidelines for sentence-level subjectivity detection, which are not limited to language-specific cues. We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics. Our corpus paves the way for subjectivity detection in English and across other languages without relying on language-specific tools, such as lexicons or machine translation. We evaluate state-of-the-art multilingual transformer-based models on the task in mono-, multi-, and cross-language settings. For this purpose, we re-annotate an existing Italian corpus. We observe that models trained in the multilingual setting achieve the best performance on the task.
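The mono-, multi-, and cross-language settings differ only in which language's data is used for training versus testing. As an illustration (the helper name and language codes are assumptions, not from the paper), the three configurations can be sketched as:

```python
def experimental_setting(setting: str, target: str, languages: list[str]):
    """Return (train_languages, test_language) for a given evaluation setting.

    mono:  train and test on the target language only
    multi: train on all languages, test on the target language
    cross: train on the other languages, test on the unseen target language
    """
    others = [lang for lang in languages if lang != target]
    if setting == "mono":
        return [target], target
    if setting == "multi":
        return sorted(languages), target
    if setting == "cross":
        return others, target
    raise ValueError(f"unknown setting: {setting}")
```

With English and Italian corpora, the cross-language setting for an Italian test set would train only on English, which is what makes it a test of transfer without language-specific tools.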