Alessia Battisti
2025
ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords
Sina Ahmadi | Micha David Hess | Elena Álvarez-Mellado | Alessia Battisti | Cui Ding | Anne Göhring | Yingqiang Gao | Zifan Jiang | Andrianos Michail | Peshmerge Morad | Joel Niklaus | Maria Christina Panagiotopoulou | Stefano Perrella | Juri Opitz | Anastassia Shaitarova | Rico Sennrich
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Lexical borrowing, the adoption of words from one language into another, is a ubiquitous linguistic phenomenon influenced by geopolitical, societal, and technological factors. This paper introduces ConLoan, a novel contrastive dataset comprising sentences with and without loanwords across 10 languages. Through systematic evaluation using this dataset, we investigate how state-of-the-art machine translation and language models process loanwords compared to their native alternatives. Our experiments reveal that these systems show systematic preferences for loanwords over native terms and exhibit varying performance across languages. These findings provide valuable insights for developing more linguistically robust NLP systems.
2024
Automatic Annotation Elaboration as Feedback to Sign Language Learners
Alessia Battisti | Sarah Ebling
Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII)
Beyond enabling linguistic analyses, linguistic annotations may serve as training material for developing automatic language assessment models as well as for providing textual feedback to language learners. Yet these linguistic annotations in their original form are often not easily comprehensible for learners. In this paper, we explore the use of GPT-4, as an example of a large language model (LLM), to turn linguistic annotations into clear and understandable feedback for language learners, specifically sign language learners, on their productions.
Advancing Annotation for Continuous Data in Swiss German Sign Language
Alessia Battisti | Katja Tissi | Sandra Sidler-Miserez | Sarah Ebling
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources
Person Identification from Pose Estimates in Sign Language
Alessia Battisti | Emma van den Bold | Anne Göhring | Franz Holzknecht | Sarah Ebling
Proceedings of the LREC-COLING 2024 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources
2023
First WMT Shared Task on Sign Language Translation (WMT-SLT22)
Mathias Müller | Sarah Ebling | Eleftherios Avramidis | Alessia Battisti | Michèle Berger | Richard Bowden | Annelies Braffort | Necati Cihan Camgoz | Cristina España-Bonet | Roman Grundkiewicz | Zifan Jiang | Oscar Koller | Amit Moryossef | Regula Perrollaz | Sabine Reinhard | Annette Rios Gonzales | Dimitar Shterionov | Sandra Sidler-Miserez | Katja Tissi | Davy Van Landuyt
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
This paper is a brief summary of the First WMT Shared Task on Sign Language Translation (WMT-SLT22), a project partly funded by EAMT. The focus of this shared task is automatic translation between signed and spoken languages. Details can be found on our website (https://www.wmt-slt.com/) or in the findings paper (Müller et al., 2022).
2022
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Julia Kreutzer | Isaac Caswell | Lisa Wang | Ahsan Wahab | Daan van Esch | Nasanbayar Ulzii-Orshikh | Allahsera Tapo | Nishant Subramani | Artem Sokolov | Claytone Sikasote | Monang Setyawan | Supheakmungkol Sarin | Sokhar Samb | Benoît Sagot | Clara Rivera | Annette Rios | Isabel Papadimitriou | Salomey Osei | Pedro Ortiz Suarez | Iroro Orife | Kelechi Ogueji | Andre Niyongabo Rubungo | Toan Q. Nguyen | Mathias Müller | André Müller | Shamsuddeen Hassan Muhammad | Nanda Muhammad | Ayanda Mnyakeni | Jamshidbek Mirzakhalov | Tapiwanashe Matangira | Colin Leong | Nze Lawson | Sneha Kudugunta | Yacine Jernite | Mathias Jenny | Orhan Firat | Bonaventure F. P. Dossou | Sakhile Dlamini | Nisansa de Silva | Sakine Çabuk Ballı | Stella Biderman | Alessia Battisti | Ahmed Baruwa | Ankur Bapna | Pallavi Baljekar | Israel Abebe Azime | Ayodele Awokoya | Duygu Ataman | Orevaoghene Ahia | Oghenefego Ahia | Sweta Agrawal | Mofetoluwa Adeyemi
Transactions of the Association for Computational Linguistics, Volume 10
With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.
Findings of the First WMT Shared Task on Sign Language Translation (WMT-SLT22)
Mathias Müller | Sarah Ebling | Eleftherios Avramidis | Alessia Battisti | Michèle Berger | Richard Bowden | Annelies Braffort | Necati Cihan Camgöz | Cristina España-Bonet | Roman Grundkiewicz | Zifan Jiang | Oscar Koller | Amit Moryossef | Regula Perrollaz | Sabine Reinhard | Annette Rios | Dimitar Shterionov | Sandra Sidler-Miserez | Katja Tissi
Proceedings of the Seventh Conference on Machine Translation (WMT)
This paper presents the results of the First WMT Shared Task on Sign Language Translation (WMT-SLT22). This shared task is concerned with automatic translation between signed and spoken languages. The task is novel in the sense that it requires processing visual information (such as video frames or human pose estimation) beyond the well-known paradigm of text-to-text machine translation (MT). The task featured two tracks, translating from Swiss German Sign Language (DSGS) to German and vice versa. Seven teams participated in this first edition of the task, all submitting to the DSGS-to-German track. Besides a system ranking and system papers describing state-of-the-art techniques, this shared task makes the following scientific contributions: novel corpora, reproducible baseline systems and new protocols and software for human evaluation. Finally, the task also resulted in the first publicly available set of system outputs and human evaluation scores for sign language translation.
2020
A Corpus for Automatic Readability Assessment and Text Simplification of German
Alessia Battisti | Dominik Pfütze | Andreas Säuberli | Marek Kostrzewa | Sarah Ebling
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German, the first of its kind for this language. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.
2019
Co-authors
- Sarah Ebling 7
- Zifan Jiang 3
- Mathias Müller 3
- Annette Rios Gonzales 3
- Sandra Sidler-Miserez 3
- Katja Tissi 3
- Eleftherios Avramidis 2
- Michèle Berger 2
- Richard Bowden 2
- Annelies Braffort 2
- Necati Cihan Camgöz 2
- Cristina España-Bonet 2
- Roman Grundkiewicz 2
- Anne Göhring 2
- Oscar Koller 2
- Amit Moryossef 2
- Regula Perrollaz 2
- Sabine Reinhard 2
- Dimitar Shterionov 2
- Mofetoluwa Adeyemi 1
- Sweta Agrawal 1
- Orevaoghene Ahia 1
- Oghenefego Ahia 1
- Sina Ahmadi 1
- Duygu Ataman 1
- Ayodele Awokoya 1
- Israel Abebe Azime 1
- Pallavi Baljekar 1
- Ankur Bapna 1
- Ahmed Baruwa 1
- Stella Biderman 1
- Isaac Caswell 1
- Nisansa De Silva 1
- Cui Ding 1
- Sakhile Dlamini 1
- Bonaventure F. P. Dossou 1
- Orhan Firat 1
- Yingqiang Gao 1
- Micha David Hess 1
- Franz Holzknecht 1
- Mathias Jenny 1
- Yacine Jernite 1
- Marek Kostrzewa 1
- Julia Kreutzer 1
- Sneha Kudugunta 1
- Davy Van Landuyt 1
- Nze Lawson 1
- Colin Leong 1
- Tapiwanashe Matangira 1
- Andrianos Michail 1
- Jamshidbek Mirzakhalov 1
- Ayanda Mnyakeni 1
- Peshmerge Morad 1
- Shamsuddeen Hassan Muhammad 1
- Nanda Muhammad 1
- André Müller 1
- Toan Q. Nguyen 1
- Joel Niklaus 1
- Kelechi Ogueji 1
- Juri Opitz 1
- Iroro Orife 1
- Pedro Ortiz Suarez 1
- Salomey Osei 1
- Maria Christina Panagiotopoulou 1
- Isabel Papadimitriou 1
- Stefano Perrella 1
- Dominik Pfütze 1
- Clara Rivera 1
- Andre Niyongabo Rubungo 1
- Benoît Sagot 1
- Sokhar Samb 1
- Supheakmungkol Sarin 1
- Rico Sennrich 1
- Monang Setyawan 1
- Anastassia Shaitarova 1
- Claytone Sikasote 1
- Artem Sokolov 1
- Nishant Subramani 1
- Andreas Säuberli 1
- Allahsera Tapo 1
- Nasanbayar Ulzii-Orshikh 1
- Martin Volk 1
- Ahsan Wahab 1
- Lisa Wang 1
- Daan van Esch 1
- Emma van den Bold 1
- Elena Álvarez-Mellado 1
- Sakine Çabuk Ballı 1