Alessia Battisti


2024

pdf bib
Automatic Annotation Elaboration as Feedback to Sign Language Learners
Alessia Battisti | Sarah Ebling
Proceedings of The 18th Linguistic Annotation Workshop (LAW-XVIII)

Beyond enabling linguistic analyses, linguistic annotations may serve as training material for developing automatic language assessment models as well as for providing textual feedback to language learners. Yet these linguistic annotations in their original form are often not easily comprehensible for learners. In this paper, we explore the utilization of GPT-4, as an example of a large language model (LLM), to process linguistic annotations into clear and understandable feedback on their productions for language learners, specifically sign language learners.

pdf bib
Advancing Annotation for Continuous Data in Swiss German Sign Language
Alessia Battisti | Katja Tissi | Sandra Sidler-Miserez | Sarah Ebling
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources

pdf bib
Person Identification from Pose Estimates in Sign Language
Alessia Battisti | Emma van den Bold | Anne Göhring | Franz Holzknecht | Sarah Ebling
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources

2023

pdf bib
First WMT Shared Task on Sign Language Translation (WMT-SLT22)
Mathias Müller | Sarah Ebling | Eleftherios Avramidis | Alessia Battisti | Michèle Berger | Richard Bowden | Annelies Braffort | Necati Cihan Camgoz | Cristina España-Bonet | Roman Grundkiewicz | Zifan Jiang | Oscar Koller | Amit Moryossef | Regula Perrollaz | Sabine Reinhard | Annette Rios Gonzales | Dimitar Shterionov | Sandra Sidler-Miserez | Katja Tissi | Davy Van Landuyt
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper is a brief summary of the First WMT Shared Task on Sign Language Translation (WMT-SLT22), a project partly funded by EAMT. The focus of this shared task is automatic translation between signed and spoken languages. Details can be found on our website (https://www.wmt-slt.com/) or in the findings paper (Müller et al., 2022).

2022

pdf bib
Findings of the First WMT Shared Task on Sign Language Translation (WMT-SLT22)
Mathias Müller | Sarah Ebling | Eleftherios Avramidis | Alessia Battisti | Michèle Berger | Richard Bowden | Annelies Braffort | Necati Cihan Camgöz | Cristina España-bonet | Roman Grundkiewicz | Zifan Jiang | Oscar Koller | Amit Moryossef | Regula Perrollaz | Sabine Reinhard | Annette Rios | Dimitar Shterionov | Sandra Sidler-miserez | Katja Tissi
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper presents the results of the First WMT Shared Task on Sign Language Translation (WMT-SLT22).This shared task is concerned with automatic translation between signed and spoken languages. The task is novel in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT).The task featured two tracks, translating from Swiss German Sign Language (DSGS) to German and vice versa. Seven teams participated in this first edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora, reproducible baseline systems and new protocols and software for human evaluation. Finally, the task also resulted in the first publicly available set of system outputs and human evaluation scores for sign language translation.

pdf bib
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

2020

pdf bib
A Corpus for Automatic Readability Assessment and Text Simplification of German
Alessia Battisti | Dominik Pfütze | Andreas Säuberli | Marek Kostrzewa | Sarah Ebling
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German, the first of its kind for this language. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.