Yves Scherrer

2025

pdf bib abs
Dialects, Topic Models, and Border Effects: The Rusyn Case
Achim Rabus | Yves Scherrer
Proceedings of the 10th Workshop on Slavic Natural Language Processing (Slavic NLP 2025)

In this contribution, we present, discuss, and apply a data-driven approach for analyzing varieties of the Slavic minority language Carpathian Rusyn spoken in different countries in the Carpathian region. Using topic modeling, a method originally developed for text mining, we show that the Rusyn varieties are subject to border effects, i.e., vertical convergence and horizontal divergence, due to language contacts with their respective umbrella languages Polish, Slovak and Standard Ukrainian. Additionally, we show that the method is suitable for uncovering fieldworker isoglosses, i.e., different transcription principles in an otherwise homogeneous dataset.

pdf bib abs
Explaining novel senses using definition generation with open language models
Mariia Fedorova | Andrey Kutuzov | Francesco Periti | Yves Scherrer
Findings of the Association for Computational Linguistics: EMNLP 2025

We apply definition generators based on open-weights large language models to the task of creating explanations of novel senses, taking target word usages as an input. To this end, we employ the datasets from the AXOLOTL’24 shared task on explainable semantic change modeling, which features Finnish, Russian and German languages. We fine-tune and provide publicly the open-source models performing higher than the best submissions of the aforementioned shared task, which employed closed proprietary LLMs. In addition, we find that encoder-decoder definition generators perform on par with their decoder-only counterparts.

pdf bib abs
Functional Lexicon in Subword Tokenization
Zachary William Hopton | Yves Scherrer | Tanja Samardzic
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

The distinction between function and content units of the lexicon has been somewhat neglected in recent NLP work, but it could still be useful when working with low-resource languages, and, in particular, to improve cross-lingual transfer. In this paper, we investigate to what extent BPE subword tokenization can be used to identify units of the functional lexicon in a language without any annotated data. We analyze subword tokens in terms of their productivity and attempt to find thresholds that best distinguish function from content tokens. On a sample of seven diverse languages, we find that the best results are obtained with 50 BPE merges. We also show that this subword tokenization setting can be beneficial for the interlinear glossing task.

pdf bib abs
OpusDistillery: A Configurable End-to-End Pipeline for Systematic Multilingual Distillation of Open NMT Models
Ona de Gibert | Tommi Nieminen | Yves Scherrer | Jörg Tiedemann
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

In this work, we introduce OpusDistillery, a novel framework to streamline the Knowledge Distillation (KD) process of multilingual NMT models. OpusDistillery’s main features are the integration of openly available teacher models from OPUS-MT and Hugging Face, comprehensive multilingual support and robust GPU utilization tracking. We describe the tool in detail and discuss the individual contributions of its pipeline components, demonstrating its flexibility for different use cases. OpusDistillery is open-source and released under a permissive license, aiming to facilitate further research and development in the field of multilingual KD for any sequence-to-sequence task. Our code is available at https://github.com/Helsinki-NLP/OpusDistillery.

pdf bib abs
Interactive maps for corpus-based dialectology
Yves Scherrer | Olli Kuparinen
Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Traditional data collection methods in dialectology rely on structured surveys, whose results can be easily presented on printed or digital maps. But in recent years, corpora of transcribed dialect speech have become a precious alternative data source for data-driven linguistic analysis. For example, topic models can be advantageously used to discover both general dialectal variation patterns and specific linguistic features that are most characteristic for certain dialects. Multilingual (or rather, multilectal) language modeling tasks can also be used to learn speaker-specific embeddings. In connection with this paper, we introduce a website that presents the results of two recent studies in the form of interactive maps, allowing visitors to explore the effects of various parameter settings. The website covers two tasks (topic models and speaker embeddings) and three language areas (Finland, Norway, and German-speaking Switzerland). It is available at https://www.corcodial.net/ .

pdf bib abs
Multi-label Scandinavian Language Identification (SLIDE)
Mariia Fedorova | Jonas Sebulon Frydenberg | Victoria Handford | Victoria Ovedie Chruickshank Langø | Solveig Helene Willoch | Marthe Løken Midtgaard | Yves Scherrer | Petter Mæhlum | David Samuel
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.

pdf bib
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Preslav Nakov | Jorg Tiedemann | Marcos Zampieri
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib abs
Findings of the VarDial Evaluation Campaign 2025: The NorSID Shared Task on Norwegian Slot, Intent and Dialect Identification
Yves Scherrer | Rob van der Goot | Petter Mæhlum
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

The VarDial Evaluation Campaign 2025 was organized as part of the twelfth workshop on Natural Language Processing for Similar Languages, Varieties and Dialects (VarDial), colocated with COLING 2025. It consisted of one shared task with three subtasks: intent detection, slot filling and dialect identification for Norwegian dialects. This report presents the results of this shared task. Four participating teams have submitted systems with very high performance (> 97% accuracy) for intent detection, whereas slot detection and dialect identification showed to be much more challenging, with respectively span-F1 scores up to 89%, and weighted dialect F1 scores of 84%.

pdf bib abs
LTG at VarDial 2025 NorSID: More and Better Training Data for Slot and Intent Detection
Marthe Midtgaard | Petter Mæhlum | Yves Scherrer
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the LTG submission to the VarDial 2025 shared task, where we participate in the Norwegian slot and intent detection subtasks. The shared task focuses on Norwegian dialects, which present challenges due to their low-resource nature and variation. We test a variety of neural models and training data configurations, with the focus on improving and extending the available Norwegian training data. This includes automatically re-aligning slot spans in Norwegian Bokmål, as well as re-translating the original English training data into both Bokmål and Nynorsk. % to address dialectal diversity. We also re-annotate an external Norwegian dataset to augment the training data. Our best models achieve first place in both subtasks, achieving an span F1 score of 0.893 for slot filling and an accuracy of 0.980 for intent detection. Our results indicate that while translation quality is less critical, improving the slot labels has a notable impact on slot performance. Moreover, adding more standard Norwegian data improves performance, but incorporating even small amounts of dialectal data leads to greater gains.

pdf bib abs
Improved Norwegian Bokmål Translations for FLORES
Petter Mæhlum | Anders Næss Evensen | Yves Scherrer
Proceedings of the Tenth Conference on Machine Translation

FLORES+ is a collection of parallel datasets obtained by translation from originally English source texts. FLORES+ contains Norwegian translations for the two official written variants of Norwegian: Norwegian Bokmål and Norwegian Nynorsk. However, the earliest Bokmål version contained non-native-like mistakes, and even after a later revision, the dataset contained grammatical and lexical errors. This paper aims at correcting unambiguous mistakes, and thus creating a new version of the Bokmål dataset. At the same time, we provide a translation into Radical Bokmål, a sub-variety of Norwegian which is closer to Nynorsk in some aspects, while still being within the official norms for Bokmål. We discuss existing errors and differences in the various translations and the corrections that we provide.

pdf bib abs
EdinHelsOW WMT 2025 CreoleMT System Description: Improving Lusophone Creole Translation through Data Augmentation, Model Merging and LLM Post-editing
Jacqueline Rowe | Ona De Gibert | Mateusz Klimaszewski | Coleman Haley | Alexandra Birch | Yves Scherrer
Proceedings of the Tenth Conference on Machine Translation

In this work, we present our submissions to the unconstrained track of the System subtask of the WMT 2025 Creole Language Translation Shared Task. Of the 52 Creole languages included in the task, we focus on translation between English and seven Lusophone Creoles. Our approach leverages known strategies for low-resource machine translation, including back-translation and distillation of data, fine-tuning pre-trained multilingual models, and post-editing with large language models and lexicons. We also demonstrate that adding high-quality parallel Portuguese data in training, initialising Creole embeddings with Portuguese embedding weights, and strategically merging best checkpoints of different fine-tuned models all produce considerable gains in performance in certain translation directions. Our best models outperform the baselines on the Task test set for eight out of fourteen translation directions. When evaluated on a more diverse test set, they surpass the baselines in all but one direction.

2024

pdf bib abs
System Description of the NordicsAlps Submission to the AmericasNLP 2024 Machine Translation Shared Task
Joseph Attieh | Zachary Hopton | Yves Scherrer | Tanja Samardžić
Proceedings of the 4th Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP 2024)

This paper presents the system description of the NordicsAlps team for the AmericasNLP 2024 Machine Translation Shared Task 1. We investigate the effect of tokenization on translation quality by exploring two different tokenization schemes: byte-level and redundancy-driven tokenization. We submitted three runs per language pair. The redundancy-driven tokenization ranked first among all submissions, scoring the highest average chrF2++, chrF, and BLEU metrics (averaged across all languages). These findings demonstrate the importance of carefully tailoring the tokenization strategies of machine translation systems, particularly in resource-constrained scenarios.

pdf bib abs
Definition generation for lexical semantic change detection
Mariia Fedorova | Andrey Kutuzov | Yves Scherrer
Findings of the Association for Computational Linguistics: ACL 2024

We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as ‘senses’, and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.

pdf bib abs
Explainability of machine learning approaches in forensic linguistics: a case study in geolinguistic authorship profiling
Dana Roemling | Yves Scherrer | Aleksandra Miletić
Proceedings of the First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security

Forensic authorship profiling uses linguistic markers to infer characteristics about an author of a text. This task is paralleled in dialect classification, where a prediction is made about the linguistic variety of a text based on the text itself. While there have been significant advances in recent years in variety classification, forensic linguistics rarely relies on these approaches due to their lack of transparency, among other reasons. In this paper we therefore explore the explainability of machine learning approaches considering the forensic context. We focus on variety classification as a means of geolinguistic profiling of unknown texts based on social media data from the German-speaking area. For this, we identify the lexical items that are the most impactful for the variety classification. We find that the extracted lexical features are indeed representative of their respective varieties and note that the trained models also rely on place names for classifications.

pdf bib
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Marcos Zampieri | Preslav Nakov | Jörg Tiedemann
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

pdf bib abs
VarDial Evaluation Campaign 2024: Commonsense Reasoning in Dialects and Multi-Label Similar Language Identification
Adrian-Gabriel Chifu | Goran Glavaš | Radu Tudor Ionescu | Nikola Ljubešić | Aleksandra Miletić | Filip Miletić | Yves Scherrer | Ivan Vulić
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2024. The campaign is part of the eleventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2024. Two shared tasks were included this year: dialectal causal commonsense reasoning (DIALECT-COPA), and Multi-label classification of similar languages (DSL-ML). Both tasks were organized for the first time this year, but DSL-ML partially overlaps with the DSL-TL task organized in 2023.

pdf bib abs
NoMusic - The Norwegian Multi-Dialectal Slot and Intent Detection Corpus
Petter Mæhlum | Yves Scherrer
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)

This paper presents a new textual resource for Norwegian and its dialects. The NoMusic corpus contains Norwegian translations of the xSID dataset, an evaluation dataset for spoken language understanding (slot and intent detection). The translations cover Norwegian Bokmål, as well as eight dialects from three of the four major Norwegian dialect areas. To our knowledge, this is the first multi-parallel resource for written Norwegian dialects, and the first evaluation dataset for slot and intent detection focusing on non-standard Norwegian varieties. In this paper, we describe the annotation process and provide some analyses on the types of linguistic variation that can be found in the dataset.

pdf bib abs
Hybrid Distillation from RBMT and NMT: Helsinki-NLP’s Submission to the Shared Task on Translation into Low-Resource Languages of Spain
Ona De Gibert | Mikko Aulamo | Yves Scherrer | Jörg Tiedemann
Proceedings of the Ninth Conference on Machine Translation

The Helsinki-NLP team participated in the 2024 Shared Task on Translation into Low-Resource languages of Spain with four multilingual systems covering all language pairs. The task consists in developing Machine Translation (MT) models to translate from Spanish into Aragonese, Aranese and Asturian. Our models leverage known approaches for multilingual MT, namely, data filtering, fine-tuning, data tagging, and distillation. We use distillation to merge the knowledge from neural and rule-based systems and explore the trade-offs between translation quality and computational efficiency. We demonstrate that our distilled models can achieve competitive results while significantly reducing computational costs. Our best models ranked 4th, 5th, and 2nd in the open submission track for Spanish–Aragonese, Spanish–Aranese, and Spanish–Asturian, respectively. We release our code and data publicly at https://github.com/Helsinki-NLP/lowres-spain-st.

2023

pdf bib abs
Four Approaches to Low-Resource Multilingual NMT: The Helsinki Submission to the AmericasNLP 2023 Shared Task
Ona De Gibert | Raúl Vázquez | Mikko Aulamo | Yves Scherrer | Sami Virpioja | Jörg Tiedemann
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)

The Helsinki-NLP team participated in the AmericasNLP 2023 Shared Task with 6 submissions for all 11 language pairs arising from 4 different multilingual systems. We provide a detailed look at the work that went into collecting and preprocessing the data that led to our submissions. We explore various setups for multilingual Neural Machine Translation (NMT), namely knowledge distillation and transfer learning, multilingual NMT including a high-resource language (English), language-specific fine-tuning, and multilingual NMT exclusively using low-resource data. Our multilingual Model B ranks first in 4 out of the 11 language pairs.

pdf bib abs
The Helsinki-NLP Submissions at NADI 2023 Shared Task: Walking the Baseline
Yves Scherrer | Aleksandra Miletić | Olli Kuparinen
Proceedings of ArabicNLP 2023

The Helsinki-NLP team participated in the NADI 2023 shared tasks on Arabic dialect translation with seven submissions. We used statistical (SMT) and neural machine translation (NMT) methods and explored character- and subword-based data preprocessing. Our submissions placed second in both tracks. In the open track, our winning submission is a character-level SMT system with additional Modern Standard Arabic language models. In the closed track, our best BLEU scores were obtained with the leave-as-is baseline, a simple copy of the input, and narrowly followed by SMT systems. In both tracks, fine-tuning existing multilingual models such as AraT5 or ByT5 did not yield superior performance compared to SMT.

pdf bib abs
CorCoDial - Machine translation techniques for corpus-based computational dialectology
Yves Scherrer | Olli Kuparinen | Aleksandra Miletic
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

This paper presents CorCoDial, a research project funded by the Academy of Finland aiming to leverage machine translation technology for corpus-based computational dialectology. In this paper, we briefly present intermediate results of our project-related research.

pdf bib abs
Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Olli Kuparinen | Aleksandra Miletić | Yves Scherrer
Findings of the Association for Computational Linguistics: EMNLP 2023

Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization – i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety – as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained byT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.

pdf bib abs
Changing usage of Low Saxon auxiliary and modal verbs
Janine Siewert | Martijn Wieling | Yves Scherrer
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

We investigate the usage of auxiliary and modal verbs in Low Saxon dialects from both Germany and the Netherlands based on word vectors, and compare developments in the modern language to Middle Low Saxon. Although most of these function words have not been affected by lexical replacement, changes in usage that likely at least partly result from contact with the state languages can still be observed.

pdf bib abs
Character alignment methods for dialect-to-standard normalization
Yves Scherrer
Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper evaluates various character alignment methods on the task of sentence-level standardization of dialect transcriptions. We compare alignment methods from different scientific traditions (dialectometry, speech processing, machine translation) and apply them to Finnish, Norwegian and Swiss German dialect datasets. In the absence of gold alignments, we evaluate the methods on a set of characteristics that are deemed undesirable for the task. We find that trained alignment methods only show marginal benefits to simple Levenshtein distance. On this particular task, eflomal outperforms related methods such as GIZA++ or fast_align by a large margin.

pdf bib
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Preslav Nakov | Jörg Tiedemann | Marcos Zampieri
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

pdf bib abs
Dialect Representation Learning with Neural Dialect-to-Standard Normalization
Olli Kuparinen | Yves Scherrer
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

Language label tokens are often used in multilingual neural language modeling and sequence-to-sequence learning to enhance the performance of such models. An additional product of the technique is that the models learn representations of the language tokens, which in turn reflect the relationships between the languages. In this paper, we study the learned representations of dialects produced by neural dialect-to-standard normalization models. We use two large datasets of typologically different languages, namely Finnish and Norwegian, and evaluate the learned representations against traditional dialect divisions of both languages. We find that the inferred dialect embeddings correlate well with the traditional dialects. The methodology could be further used in noisier settings to find new insights into language variation.

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages – True Labels (DSL-TL), and Discriminating Between Similar Languages – Speech (DSL-S). All three tasks were organized for the first time this year.

2022

pdf bib
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Luciana Benotti | Naoaki Okazaki | Yves Scherrer | Marcos Zampieri
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib abs
Low Saxon dialect distances at the orthographic and syntactic level
Janine Siewert | Yves Scherrer | Martijn Wieling
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

We compare five Low Saxon dialects from the 19th and 21st century from Germany and the Netherlands with each other as well as with modern Standard Dutch and Standard German. Our comparison is based on character n-grams on the one hand and PoS n-grams on the other and we show that these two lead to different distances. Particularly in the PoS-based distances, one can observe all of the 21st century Low Saxon dialects shifting towards the modern majority languages.

pdf bib
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Preslav Nakov | Jörg Tiedemann | Marcos Zampieri
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2022. The campaign is part of the ninth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2022. Three separate shared tasks were included this year: Identification of Languages and Dialects of Italy (ITDI), French Cross-Domain Dialect Identification (FDI), and Dialectal Extractive Question Answering (DialQA). All three tasks were organized for the first time this year.

pdf bib abs
OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan
Aleksandra Miletic | Yves Scherrer
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with five off-the-shelf tools, including the new fasttext’s language identification model from Meta AI’s No Language Left Behind initiative, released in July 2022.

2021

pdf bib abs
The Helsinki submission to the AmericasNLP shared task
Raúl Vázquez | Yves Scherrer | Sami Virpioja | Jörg Tiedemann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs. Our multilingual NMT models reached the first rank on all language pairs in track 1, and first rank on nine out of ten language pairs in track 2. We focused our efforts on three aspects: (1) the collection of additional data from various sources such as Bibles and political constitutions, (2) the cleaning and filtering of training data with the OpusFilter toolkit, and (3) different multilingual training techniques enabled by the latest version of the OpenNMT-py toolkit to make the most efficient use of the scarce data. This paper describes our efforts in detail.

pdf bib
Towards a balanced annotated Low Saxon dataset for diachronic investigation of dialectal variation
Janine Siewert | Yves Scherrer | Jörg Tiedemann
Proceedings of the 17th Conference on Natural Language Processing (KONVENS 2021)

pdf bib abs
Boosting Neural Machine Translation from Finnish to Northern Sámi with Rule-Based Backtranslation
Mikko Aulamo | Sami Virpioja | Yves Scherrer | Jörg Tiedemann
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We consider a low-resource translation task from Finnish into Northern Sámi. Collecting all available parallel data between the languages, we obtain around 30,000 sentence pairs. However, there exists a significantly larger monolingual Northern Sámi corpus, as well as a rule-based machine translation (RBMT) system between the languages. To make the best use of the monolingual data in a neural machine translation (NMT) system, we use the backtranslation approach to create synthetic parallel data from it using both NMT and RBMT systems. Evaluating the results on an in-domain test set and a small out-of-domain set, we find that the RBMT backtranslation outperforms NMT backtranslation clearly for the out-of-domain test set, but also slightly for the in-domain data, for which the NMT backtranslation model provided clearly better BLEU scores than the RBMT. In addition, combining both backtranslated data sets improves the RBMT approach only for the in-domain test set. This suggests that the RBMT system provides general-domain knowledge that cannot be found from the relative small parallel training data.

pdf bib
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer | Tommi Jauhiainen
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2021. The campaign was part of the eighth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2021. Four separate shared tasks were included this year: Dravidian Language Identification (DLI), Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). DLI was organized for the first time and the other three continued a series of tasks from previous evaluation campaigns.

pdf bib abs
Social Media Variety Geolocation with geoBERT
Yves Scherrer | Nikola Ljubešić
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the Helsinki–Ljubljana contribution to the VarDial 2021 shared task on social media variety geolocation. Following our successful participation at VarDial 2020, we again propose constrained and unconstrained systems based on the BERT architecture. In this paper, we report experiments with different tokenization settings and different pre-trained models, and we contrast our parameter-free regression approach with various classification schemes proposed by other participants at VarDial 2020. Both the code and the best-performing pre-trained models are made freely available.

pdf bib abs
Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization
Yves Scherrer | Nikola Ljubešić
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization. Our system is based on a BERT token classification preprocessing step, where for each token the type of the necessary transformation is predicted (none, uppercase, lowercase, capitalize, modify), and a character-level SMT step where the text is translated from original to normalized given the BERT-predicted transformation constraints. For some languages, depending on the results on development data, the training data was extended by back-translating OpenSubtitles data. In the final ordering of the ten participating teams, the HEL-LJU team has taken the second place, scoring better than the previous state-of-the-art.

2020

pdf bib abs
Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation
Alessandro Raganato | Yves Scherrer | Jörg Tiedemann
Findings of the Association for Computational Linguistics: EMNLP 2020

Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different parts of the input. However, recent works have shown that most attention heads learn simple, and often redundant, positional patterns. In this paper, we propose to replace all but one attention head of each encoder layer with simple fixed – non-learnable – attentive patterns that are solely based on position and do not require any external knowledge. Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality and even increases BLEU scores by up to 3 points in low-resource scenarios.

pdf bib abs
Paraphrase Generation and Evaluation on Colloquial-Style Sentences
Eetu Sjöblom | Mathias Creutz | Yves Scherrer
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we investigate paraphrase generation in the colloquial domain. We use state-of-the-art neural machine translation models trained on the Opusparcus corpus to generate paraphrases in six languages: German, English, Finnish, French, Russian, and Swedish. We perform experiments to understand how data selection and filtering for diverse paraphrase pairs affects the generated paraphrases. We compare two different model architectures, an RNN and a Transformer model, and find that the Transformer does not generally outperform the RNN. We also conduct human evaluation on five of the six languages and compare the results to the automatic evaluation metrics BLEU and the recently proposed BERTScore. The results advance our understanding of the trade-offs between the quality and novelty of generated paraphrases, affected by the data selection method. In addition, our comparison of the evaluation methods shows that while BLEU correlates well with human judgments at the corpus level, BERTScore outperforms BLEU in both corpus and sentence-level evaluation.

pdf bib abs
An Evaluation Benchmark for Testing the Word Sense Disambiguation Capabilities of Machine Translation Systems
Alessandro Raganato | Yves Scherrer | Jörg Tiedemann
Proceedings of the Twelfth Language Resources and Evaluation Conference

Lexical ambiguity is one of the many challenging linguistic phenomena involved in translation, i.e., translating an ambiguous word with its correct sense. In this respect, previous work has shown that the translation quality of neural machine translation systems can be improved by explicitly modeling the senses of ambiguous words. Recently, several evaluation test sets have been proposed to measure the word sense disambiguation (WSD) capability of machine translation systems. However, to date, these evaluation test sets do not include any training data that would provide a fair setup measuring the sense distributions present within the training data itself. In this paper, we present an evaluation benchmark on WSD for machine translation for 10 language pairs, comprising training data with known sense distributions. Our approach for the construction of the benchmark builds upon the wide-coverage multilingual sense inventory of BabelNet, the multilingual neural parsing pipeline TurkuNLP, and the OPUS collection of translated texts from the web. The test suite is available at http://github.com/Helsinki-NLP/MuCoW.

pdf bib abs
TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages
Yves Scherrer
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences “meaning the same thing”. This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200 - 250 000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists. The dataset is available at https://doi.org/10.5281/zenodo.3707949.

pdf bib
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper presents the results of the VarDial Evaluation Campaign 2020 organized as part of the seventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2020. The campaign included three shared tasks each focusing on a different challenge of language and dialect identification: Romanian Dialect Identification (RDI), Social Media Variety Geolocation (SMG), and Uralic Language Identification (ULI). The campaign attracted 30 teams who enrolled to participate in one or multiple shared tasks and 14 of them submitted runs across the three shared tasks. Finally, 11 papers describing participating systems are published in the VarDial proceedings and referred to in this report.

pdf bib abs
LSDC - A comprehensive dataset for Low Saxon Dialect Classification
Janine Siewert | Yves Scherrer | Martijn Wieling | Jörg Tiedemann
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

We present a new comprehensive dataset for the unstandardised West-Germanic language Low Saxon covering the last two centuries, the majority of modern dialects and various genres, which will be made openly available in connection with the final version of this paper. Since so far no such comprehensive dataset of contemporary Low Saxon exists, this provides a great contribution to NLP research on this language. We also test the use of this dataset for dialect classification by training a few baseline models comparing statistical and neural approaches. The performance of these models shows that in spite of an imbalance in the amount of data per dialect, enough features can be learned for a relatively high classification accuracy.

pdf bib abs
HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models
Yves Scherrer | Nikola Ljubešić
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation. Our solutions are based on the BERT Transformer models, the constrained versions of our models reaching 1st place in two subtasks and 3rd place in one subtask, while our unconstrained models outperform all the constrained systems by a large margin. We show in our analyses that Transformer-based models outperform traditional models by far, and that improvements obtained by pre-training models on large quantities of (mostly standard) text are significant, but not drastic, with single-language models also outperforming multilingual models. Our manual analysis shows that two types of signals are the most crucial for a (mis)prediction: named entities and dialectal features, both of which are handled well by our models.

pdf bib abs
The MUCOW word sense disambiguation test suite at WMT 2020
Yves Scherrer | Alessandro Raganato | Jörg Tiedemann
Proceedings of the Fifth Conference on Machine Translation

This paper reports on our participation with the MUCOW test suite at the WMT 2020 news translation task. We introduced MUCOW at WMT 2019 to measure the ability of MT systems to perform word sense disambiguation (WSD), i.e., to translate an ambiguous word with its correct sense. MUCOW is created automatically using existing resources, and the evaluation process is also entirely automated. We evaluate all participating systems of the language pairs English -> Czech, English -> German, and English -> Russian and compare the results with those obtained at WMT 2019. While current NMT systems are fairly good at handling ambiguous source words, we could not identify any substantial progress - at least to the extent that it is measurable by the MUCOW method - in that area over the last year.

pdf bib abs
The University of Helsinki and Aalto University submissions to the WMT 2020 news and low-resource translation tasks
Yves Scherrer | Stig-Arne Grönroos | Sami Virpioja
Proceedings of the Fifth Conference on Machine Translation

This paper describes the joint participation of University of Helsinki and Aalto University to two shared tasks of WMT 2020: the news translation between Inuktitut and English and the low-resource translation between German and Upper Sorbian. For both tasks, our efforts concentrate on efficient use of monolingual and related bilingual corpora with scheduled multi-task learning as well as an optimized subword segmentation with sampling. Our submission obtained the highest score for Upper Sorbian -> German and was ranked second for German -> Upper Sorbian according to BLEU scores. For English–Inuktitut, we reached ranks 8 and 10 out of 11 according to BLEU scores.

2019

pdf bib abs
Analysing concatenation approaches to document-level NMT in two different domains
Yves Scherrer | Jörg Tiedemann | Sharid Loáiciga
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

In this paper, we investigate how different aspects of discourse context affect the performance of recent neural MT systems. We describe two popular datasets covering news and movie subtitles and we provide a thorough analysis of the distribution of various document-level features in their domains. Furthermore, we train a set of context-aware MT models on both datasets and propose a comparative evaluation scheme that contrasts coherent context with artificially scrambled documents and absent context, arguing that the impact of discourse-aware MT models will become visible in this way. Our results show that the models are indeed affected by the manipulation of the test data, providing a different view on document-level translation quality than absolute sentence-level scores.

In this paper, we present the findings of the Third VarDial Evaluation Campaign organized as part of the sixth edition of the workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2019. This year, the campaign included five shared tasks, including one task re-run – German Dialect Identification (GDI) – and four new tasks – Cross-lingual Morphological Analysis (CMA), Discriminating between Mainland and Taiwan variation of Mandarin Chinese (DMT), Moldavian vs. Romanian Cross-dialect Topic identification (MRC), and Cuneiform Language Identification (CLI). A total of 22 teams submitted runs across the five shared tasks. After the end of the competition, we received 14 system description papers, which are published in the VarDial workshop proceedings and referred to in this report.

pdf bib abs
Measuring Semantic Abstraction of Multilingual NMT with Paraphrase Recognition and Generation Tasks
Jörg Tiedemann | Yves Scherrer
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

In this paper, we investigate whether multilingual neural translation models learn stronger semantic abstractions of sentences than bilingual ones. We test this hypotheses by measuring the perplexity of such models when applied to paraphrases of the source language. The intuition is that an encoder produces better representations if a decoder is capable of recognizing synonymous sentences in the same language even though the model is never trained for that task. In our setup, we add 16 different auxiliary languages to a bidirectional bilingual baseline model (English-French) and test it with in-domain and out-of-domain paraphrases in English. The results show that the perplexity is significantly reduced in each of the cases, indicating that meaning can be grounded in translation. This is further supported by a study on paraphrase generation that we also include at the end of the paper.

In this paper we present the University of Helsinki submissions to the WMT 2019 shared news translation task in three language pairs: English-German, English-Finnish and Finnish-English. This year we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German we trained both sentence-level transformer models as well as compared different document-level translation approaches. For Finnish-English and English-Finnish we focused on different segmentation approaches and we also included a rule-based system for English-Finnish.

pdf bib abs
The MuCoW Test Suite at WMT 2019: Automatically Harvested Multilingual Contrastive Word Sense Disambiguation Test Sets for Machine Translation
Alessandro Raganato | Yves Scherrer | Jörg Tiedemann
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

Supervised Neural Machine Translation (NMT) systems currently achieve impressive translation quality for many language pairs. One of the key features of a correct translation is the ability to perform word sense disambiguation (WSD), i.e., to translate an ambiguous word with its correct sense. Existing evaluation benchmarks on WSD capabilities of translation systems rely heavily on manual work and cover only few language pairs and few word types. We present MuCoW, a multilingual contrastive test suite that covers 16 language pairs with more than 200 thousand contrastive sentence pairs, automatically built from word-aligned parallel corpora and the wide-coverage multilingual sense inventory of BabelNet. We evaluate the quality of the ambiguity lexicons and of the resulting test suite on all submissions from 9 language pairs presented in the WMT19 news shared translation task, plus on other 5 language pairs using NMT pretrained models. The MuCoW test suite is available at http://github.com/Helsinki-NLP/MuCoW.

pdf bib abs
The University of Helsinki Submissions to the WMT19 Similar Language Translation Task
Yves Scherrer | Raúl Vázquez | Sami Virpioja
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

This paper describes the University of Helsinki Language Technology group’s participation in the WMT 2019 similar language translation task. We trained neural machine translation models for the language pairs Czech <-> Polish and Spanish <-> Portuguese. Our experiments focused on different subword segmentation methods, and in particular on the comparison of a cognate-aware segmentation method, Cognate Morfessor, with character segmentation and unsupervised segmentation methods for which the data from different languages were simply concatenated. We did not observe major benefits from cognate-aware segmentation methods, but further research may be needed to explore larger parts of the parameter space. Character-level models proved to be competitive for translation between Spanish and Portuguese, but they are slower in training and decoding.

2018

pdf bib abs
The University of Helsinki submissions to the IWSLT 2018 low-resource translation task
Yves Scherrer
Proceedings of the 15th International Conference on Spoken Language Translation

This paper presents the University of Helsinki submissions to the Basque–English low-resource translation task. Our primary system is a standard bilingual Transformer system, trained on the available parallel data and various types of synthetic data. We describe the creation of the synthetic datasets, some of which use a pivoting approach, in detail. One of our contrastive submissions is a multilingual model trained on comparable data, but without the synthesized parts. Our bilingual model with synthetic data performed best, obtaining 25.25 BLEU on the test data.

pdf bib
Crowdsourcing Regional Variation Data and Automatic Geolocalisation of Speakers of European French
Jean-Philippe Goldman | Yves Scherrer | Julie Glikman | Mathieu Avanzi | Christophe Benzitoun | Philippe Boula de Mareüil
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING’2018. This year, the campaign included five shared tasks, including two task re-runs – Arabic Dialect Identification (ADI) and German Dialect Identification (GDI) –, and three new tasks – Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks, and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.

pdf bib abs
The University of Helsinki submissions to the WMT18 news task
Alessandro Raganato | Yves Scherrer | Tommi Nieminen | Arvi Hurskainen | Jörg Tiedemann
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the University of Helsinki’s submissions to the WMT18 shared news translation task for English-Finnish and English-Estonian, in both directions. This year, our main submissions employ a novel neural architecture, the Transformer, using the open-source OpenNMT framework. Our experiments couple domain labeling and fine tuned multilingual models with shared vocabularies between the source and target language, using the provided parallel data of the shared task and additional back-translations. Finally, we compare, for the English-to-Finnish case, the effectiveness of different machine translation architectures, starting from a rule-based approach to our best neural model, analyzing the output and highlighting future research.

Progress in the quality of machine translation output calls for new automatic evaluation procedures and metrics. In this paper, we extend the Morpheval protocol introduced by Burlot and Yvon (2017) for the English-to-Czech and English-to-Latvian translation directions to three additional language pairs, and report its use to analyze the results of WMT 2018’s participants for these language pairs. Considering additional, typologically varied source and target languages also enables us to draw some generalizations regarding this morphology-oriented evaluation procedure.

2017

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

pdf bib abs
Multi-source morphosyntactic tagging for spoken Rusyn
Yves Scherrer | Achim Rabus
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolkit, we show that a tagger trained on a balanced set of the four source languages outperforms single language taggers by about 9%, and that additional automatically induced morphosyntactic lexicons lead to further improvements. The best observed accuracies for Rusyn are 82.4% for part-of-speech tagging and 75.5% for full morphological tagging.

pdf bib abs
Lexicon Induction for Spoken Rusyn – Challenges and Results
Achim Rabus | Yves Scherrer
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper reports on challenges and results in developing NLP resources for spoken Rusyn. Being a Slavic minority language, Rusyn does not have any resources to make use of. We propose to build a morphosyntactic dictionary for Rusyn, combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn by using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which results in a tagging recall increased by 11.6% relative (9.1% absolute). Our research confirms and expands the results of previous studies showing the efficiency of using NLP resources from neighboring languages for low-resourced languages.

pdf bib
The Helsinki Neural Machine Translation System
Robert Östling | Yves Scherrer | Jörg Tiedemann | Gongbo Tang | Tommi Nieminen
Proceedings of the Second Conference on Machine Translation

pdf bib abs
Neural Machine Translation with Extended Context
Jörg Tiedemann | Yves Scherrer
Proceedings of the Third Workshop on Discourse in Machine Translation

We investigate the use of extended context in attention-based neural machine translation. We base our experiments on translated movie subtitles and discuss the effect of increasing the segments beyond single translation units. We study the use of extended source language context as well as bilingual context extensions. The models learn to distinguish between information from different segments and are surprisingly robust with respect to translation quality. In this pilot study, we observe interesting cross-sentential attention patterns that improve textual coherence in translation at least in some selected cases.

2016

pdf bib abs
Cartopho : un site web de cartographie de variantes de prononciation en français (Cartopho: a website for mapping pronunciation variants in French)
Philippe Boula de Mareüil | Jean-Philippe Goldman | Albert Rilliard | Yves Scherrer | Frédéric Vernier
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP

Le présent travail se propose de renouveler les traditionnels atlas dialectologiques pour cartographier les variantes de prononciation en français, à travers un site internet. La toile est utilisée non seulement pour collecter des données, mais encore pour disséminer les résultats auprès des chercheurs et du grand public. La méthodologie utilisée, à base de crowdsourcing (ou « production participative »), nous a permis de recueillir des informations auprès de 2500 francophones d’Europe (France, Belgique, Suisse). Une plateforme dynamique à l’interface conviviale a ensuite été développée pour cartographier la prononciation de 70 mots dans les différentes régions des pays concernés (des mots notamment à voyelle moyenne ou dont la consonne finale peut être prononcée ou non). Les options de visualisation par département/canton/province ou par région, combinant plusieurs traits de prononciation et ensembles de mots, sous forme de pastilles colorées, de hachures, etc. sont présentées dans cet article. On peut ainsi observer immédiatement un /E/ plus fermé (ainsi qu’un /O/ plus ouvert) dans le Nord-Pas-de-Calais et le sud de la France, pour des mots comme parfait ou rose, un /Œ/ plus fermé en Suisse pour un mot comme gueule, par exemple.

pdf bib abs
On-line Multilingual Linguistic Services
Eric Wehrli | Yves Scherrer | Luka Nerima
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this demo, we present our free on-line multilingual linguistic services which allow to analyze sentences or to extract collocations from a corpus directly on-line, or by uploading a corpus. They are available for 8 European languages (English, French, German, Greek, Italian, Portuguese, Romanian, Spanish) and can also be accessed as web services by programs. While several open systems are available for POS-tagging and dependency parsing or terminology extraction, their integration into an application requires some computational competence. Furthermore, none of the parsers/taggers handles MWEs very satisfactorily, in particular when the two terms of the collocation are distant from each other or in reverse order. Our tools, on the other hand, are specifically designed for users with no particular computational literacy. They do not require from the user any download, installation or adaptation if used on-line, and their integration in an application, using one the scripts described below is quite easy. Furthermore, by default, the parser handles collocations and other MWEs, as well as anaphora resolution (limited to 3rd person personal pronouns). When used in the tagger mode, it can be set to display grammatical functions and collocations.

pdf bib abs
ArchiMob - A Corpus of Spoken Swiss German
Tanja Samardžić | Yves Scherrer | Elvira Glaser
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Swiss dialects of German are, unlike most dialects of well standardised languages, widely used in everyday communication. Despite this fact, automatic processing of Swiss German is still a considerable challenge due to the fact that it is mostly a spoken variety rarely recorded and that it is subject to considerable regional variation. This paper presents a freely available general-purpose corpus of spoken Swiss German suitable for linguistic research, but also for training automatic tools. The corpus is a result of a long design process, intensive manual work and specially adapted computational processing. We first describe how the documents were transcribed, segmented and aligned with the sound source, and how inconsistent transcriptions were unified through an additional normalisation layer. We then present a bootstrapping approach to automatic normalisation using different machine-translation-inspired methods. Furthermore, we evaluate the performance of part-of-speech taggers on our data and show how the same bootstrapping approach improves part-of-speech tagging by 10% over four rounds. Finally, we present the modalities of access of the corpus as well as the data format.

2014

pdf bib abs
SwissAdmin: A multilingual tagged parallel corpus of press releases
Yves Scherrer | Luka Nerima | Lorenza Russo | Maria Ivanova | Eric Wehrli
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

SwissAdmin is a new multilingual corpus of press releases from the Swiss Federal Administration, available in German, French, Italian and English. We provide SwissAdmin in three versions: (i) plain texts of approximately 6 to 8 million words per language; (ii) sentence-aligned bilingual texts for each language pair; (iii) a part-of-speech-tagged version consisting of annotations in both the Universal tagset and the richer Fips tagset, along with grammatical functions, verb valencies and collocations. The SwissAdmin corpus is freely available at www.latl.unige.ch/swissadmin.

pdf bib abs
A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages
Yves Scherrer | Benoît Sagot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. We first induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS information is transferred from the resourced language along translation pairs to the non-resourced language and used for tagging the corpus. We evaluate our methods on three language families, consisting of five Romance languages, three Germanic languages and five Slavic languages. We obtain tagging accuracies of up to 91.6%.

pdf bib
Unsupervised adaptation of supervised part-of-speech taggers for closely related languages
Yves Scherrer
Proceedings of the First Workshop on Applying NLP Tools to Similar Languages, Varieties and Dialects

2013

pdf bib
Modernizing historical Slovene words with character-based SMT
Yves Scherrer | Tomaž Erjavec
Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

pdf bib
Lexicon induction and part-of-speech tagging of non-resourced languages without any bilingual resources
Yves Scherrer | Benoît Sagot
Proceedings of the Workshop on Adaptation of Language Resources and Tools for Closely Related Languages and Language Variants

2012

pdf bib abs
The Trilingual ALLEGRA Corpus: Presentation and Possible Use for Lexicon Induction
Yves Scherrer | Bruno Cartoni
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we present a trilingual parallel corpus for German, Italian and Romansh, a Swiss minority language spoken in the canton of Grisons. The corpus called ALLEGRA contains press releases automatically gathered from the website of the cantonal administration of Grisons. Texts have been preprocessed and aligned with a current state-of-the-art sentence aligner. The corpus is one of the first of its kind, and can be of great interest, particularly for the creation of natural language processing resources and tools for Romansh. We illustrate the use of such a trilingual resource for automatic induction of bilingual lexicons, which is a real challenge for under-represented languages. We induce a bilingual lexicon for German-Romansh by phrase alignment and evaluate the resulting entries with the help of a reference lexicon. We then show that the use of the third language of the corpus ― Italian ― as a pivot language can improve the precision of the induced lexicon, without loss in terms of quality of the extracted pairs.

pdf bib
Recovering dialect geography from an unaligned comparable corpus
Yves Scherrer
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

2011

pdf bib abs
Étude inter-langues de la distribution et des ambiguïtés syntaxiques des pronoms (A study of cross-language distribution and syntactic ambiguities of pronouns)
Lorenza Russo | Yves Scherrer | Jean-Philippe Goldman | Sharid Loáiciga | Luka Nerima | Éric Wehrli
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Ce travail décrit la distribution des pronoms selon le style de texte (littéraire ou journalistique) et selon la langue (français, anglais, allemand et italien). Sur la base d’un étiquetage morpho-syntaxique effectué automatiquement puis vérifié manuellement, nous pouvons constater que la proportion des différents types de pronoms varie selon le type de texte et selon la langue. Nous discutons les catégories les plus ambiguës de manière détaillée. Comme nous avons utilisé l’analyseur syntaxique Fips pour l’étiquetage des pronoms, nous l’avons également évalué et obtenu une précision moyenne de plus de 95%.

pdf bib abs
La traduction automatique des pronoms. Problèmes et perspectives (Automatic translation of pronouns. Problems and perspectives)
Yves Scherrer | Lorenza Russo | Jean-Philippe Goldman | Sharid Loáiciga | Luka Nerima | Éric Wehrli
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Dans cette étude, notre système de traduction automatique, Its-2, a fait l’objet d’une évaluation manuelle de la traduction des pronoms pour cinq paires de langues et sur deux corpus : un corpus littéraire et un corpus de communiqués de presse. Les résultats montrent que les pourcentages d’erreurs peuvent atteindre 60% selon la paire de langues et le corpus. Nous discutons ainsi deux pistes de recherche pour l’amélioration des performances de Its-2 : la résolution des ambiguïtés d’analyse et la résolution des anaphores pronominales.

pdf bib
Syntactic transformations for Swiss German dialects
Yves Scherrer
Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties

2010

pdf bib abs
Des cartes dialectologiques numérisées pour le TALN
Yves Scherrer
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations

Cette démonstration présente une interface web pour des données numérisées de l’atlas linguistique de la Suisse allemande. Nous présentons d’abord l’intégration des données brutes et des données interpolées de l’atlas dans une interface basée sur Google Maps. Ensuite, nous montrons des prototypes de systèmes de traduction automatique et d’identification de dialectes qui s’appuient sur ces données dialectologiques numérisées.

pdf bib
Word-Based Dialect Identification with Georeferenced Rules
Yves Scherrer | Owen Rambow
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2009

pdf bib abs
Un système de traduction automatique paramétré par des atlas dialectologiques
Yves Scherrer
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

Contrairement à la plupart des systèmes de traitement du langage, qui s’appliquent à des langues écrites et standardisées, nous présentons ici un système de traduction automatique qui prend en compte les spécificités des dialectes. En général, les dialectes se caractérisent par une variation continue et un manque de données textuelles en qualité et quantité suffisantes. En même temps, du moins en Europe, les dialectologues ont étudié en détail les caractéristiques linguistiques des dialectes. Nous soutenons que des données provenant d’atlas dialectologiques peuvent être utilisées pour paramétrer un système de traduction automatique. Nous illustrons cette idée avec le prototype d’un système de traduction basé sur des règles, qui traduit de l’allemand standard vers les différents dialectes de Suisse allemande. Quelques exemples linguistiquement motivés serviront à exposer l’architecture de ce système.

pdf bib
Deep Linguistic Multilingual Translation and Bilingual Dictionaries
Eric Wehrli | Luka Nerima | Yves Scherrer
Proceedings of the Fourth Workshop on Statistical Machine Translation

2008

pdf bib abs
Transducteurs à fenêtre glissante pour l’induction lexicale
Yves Scherrer
Actes de la 15ème conférence sur le Traitement Automatique des Langues Naturelles. REncontres jeunes Chercheurs en Informatique pour le Traitement Automatique des Langues

Nous appliquons différents modèles de similarité graphique à la tâche de l’induction de lexiques bilingues entre un dialecte de Suisse allemande et l’allemand standard. Nous comparons des transducteurs stochastiques utilisant des fenêtres glissantes de 1 à 3 caractères, entraînés à l’aide de l’algorithme de maximisation de l’espérance avec des corpus d’entraînement de tailles différentes. Si les transducteurs à unigrammes donnent des résultats satisfaisants avec des corpus très petits, nous montrons que les transducteurs à bigrammes les dépassent à partir de 750 paires de mots d’entraînement. En général, les modèles entraînés nous ont permis d’améliorer la F-mesure de 7% à 15% par rapport à la distance de Levenshtein.

pdf bib
Part-of-Speech Tagging with a Symbolic Full Parser: Using the TIGER Treebank to Evaluate Fips
Yves Scherrer
Proceedings of the Workshop on Parsing German