2024
pdf
bib
abs
VarDial Evaluation Campaign 2024: Commonsense Reasoning in Dialects and Multi-Label Similar Language Identification
Adrian-Gabriel Chifu
|
Goran Glavaš
|
Radu Tudor Ionescu
|
Nikola Ljubešić
|
Aleksandra Miletić
|
Filip Miletić
|
Yves Scherrer
|
Ivan Vulić
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2024. The campaign is part of the eleventh workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with NAACL 2024. Two shared tasks were included this year: dialectal causal commonsense reasoning (DIALECT-COPA), and Multi-label classification of similar languages (DSL-ML). Both tasks were organized for the first time this year, but DSL-ML partially overlaps with the DSL-TL task organized in 2023.
pdf
bib
abs
A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian
Aleksandra Miletić
|
Filip Miletić
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Bosnian, Croatian, Montenegrin and Serbian are the official standard linguistic varieties in Bosnia and Herzegovina, Croatia, Montenegro, and Serbia, respectively. When these four countries were part of the former Yugoslavia, the varieties were considered to share a single linguistic standard. After the individual countries were established, the national standards emerged. Today, a central question about these varieties remains the following: How different are they from each other? How hard is it to distinguish them? While this has been addressed in NLP as part of the task on Distinguishing Between Similar Languages (DSL), little is known about human performance, making it difficult to contextualize system results. We tackle this question by reannotating the existing BCMS dataset for DSL with annotators from all target regions. We release a new gold standard, replacing the original single-annotator, single-label annotation by a multi-annotator, multi-label one, thus improving annotation reliability and explicitly coding the existence of ambiguous instances. We reassess a previously proposed DSL system on the new gold standard and establish the human upper bound on the task. Finally, we identify sources of annotation difficulties and provide linguistic insights into the BCMS dialect continuum, with multiple indicators highlighting an intermediate position of Bosnian and Montenegrin.
pdf
bib
abs
Loflòc: A Morphological Lexicon for Occitan using Universal Dependencies
Marianne Vergez-Couret
|
Myriam Bras
|
Aleksandra Miletić
|
Clamença Poujade
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
This paper presents Loflòc (Lexic obèrt flechit Occitan – Open Inflected Lexicon of Occitan), a morphological lexicon for Occitan. Even though the lexicon no longer occupies the same place in the NLP pipeline since the advent of large language models, it remains a crucial resource for low-resourced languages. Occitan is a Romance language spoken in the south of France and in parts of Italy and Spain. It is not recognized as an official language in France and no standard variety is shared across the area. To the best of our knowledge, Loflòc is the first publicly available lexicon for Occitan. It contains 650 thousand entries for 57 thousand lemmas. Each entry is accompanied by the corresponding Universal Dependencies Part-of-Speech tag. We show that the lexicon has solid coverage on the existing freely available corpora of Occitan in four major dialects. Coverage gaps on multi-dialect corpora are overwhelmingly driven by dialectal variation, which affects both open and closed classes. Based on this analysis we propose directions for future improvements.
2023
pdf
bib
abs
Dialect-to-Standard Normalization: A Large-Scale Multilingual Evaluation
Olli Kuparinen
|
Aleksandra Miletić
|
Yves Scherrer
Findings of the Association for Computational Linguistics: EMNLP 2023
Text normalization methods have been commonly applied to historical language or user-generated content, but less often to dialectal transcriptions. In this paper, we introduce dialect-to-standard normalization – i.e., mapping phonetic transcriptions from different dialects to the orthographic norm of the standard variety – as a distinct sentence-level character transduction task and provide a large-scale analysis of dialect-to-standard normalization methods. To this end, we compile a multilingual dataset covering four languages: Finnish, Norwegian, Swiss German and Slovene. For the two biggest corpora, we provide three different data splits corresponding to different use cases for automatic normalization. We evaluate the most successful sequence-to-sequence model architectures proposed for text normalization tasks using different tokenization approaches and context sizes. We find that a character-level Transformer trained on sliding windows of three words works best for Finnish, Swiss German and Slovene, whereas the pre-trained byT5 model using full sentences obtains the best results for Norwegian. Finally, we perform an error analysis to evaluate the effect of different data splits on model performance.
pdf
bib
abs
Outiller l’occitan : nouvelles ressources et lemmatisation
Aleksandra Miletić
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs
Ce travail présente des contributions récentes à l’effort de doter l’occitan de ressources et outils pour le TAL. Plusieurs ressources existantes ont été modifiées ou adaptées, notamment un tokéniseur à base de règles, un lexique morphosyntaxique et un corpus arboré. Ces ressources ont été utilisées pour entraîner et évaluer des modèles neuronaux pour la lemmatisation. Dans le cadre de ces expériences, un nouveau corpus plus large (2 millions de tokens) provenant du Wikipédia a été annoté en parties du discours, lemmatisé et diffusé.
pdf
bib
abs
CorCoDial - Machine translation techniques for corpus-based computational dialectology
Yves Scherrer
|
Olli Kuparinen
|
Aleksandra Miletic
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
This paper presents CorCoDial, a research project funded by the Academy of Finland aiming to leverage machine translation technology for corpus-based computational dialectology. In this paper, we briefly present intermediate results of our project-related research.
pdf
bib
abs
The Helsinki-NLP Submissions at NADI 2023 Shared Task: Walking the Baseline
Yves Scherrer
|
Aleksandra Miletić
|
Olli Kuparinen
Proceedings of ArabicNLP 2023
The Helsinki-NLP team participated in the NADI 2023 shared tasks on Arabic dialect translation with seven submissions. We used statistical (SMT) and neural machine translation (NMT) methods and explored character- and subword-based data preprocessing. Our submissions placed second in both tracks. In the open track, our winning submission is a character-level SMT system with additional Modern Standard Arabic language models. In the closed track, our best BLEU scores were obtained with the leave-as-is baseline, a simple copy of the input, and narrowly followed by SMT systems. In both tracks, fine-tuning existing multilingual models such as AraT5 or ByT5 did not yield superior performance compared to SMT.
pdf
bib
abs
Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan
Aleksandra Miletić
|
Janine Siewert
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learning-based approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.
2022
pdf
bib
abs
OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan
Aleksandra Miletic
|
Yves Scherrer
Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects
This paper presents OcWikiDisc, a new freely available corpus in Occitan, as well as language identification experiments on Occitan done as part of the corpus building process. Occitan is a regional language spoken mainly in the south of France and in parts of Spain and Italy. It exhibits rich diatopic variation, it is not standardized, and it is still low-resourced, especially when it comes to large downloadable corpora. We introduce OcWikiDisc, a corpus extracted from the talk pages associated with the Occitan Wikipedia. The version of the corpus with the most restrictive language filtering contains 8K user messages for a total of 618K tokens. The language filtering is performed based on language identification experiments with five off-the-shelf tools, including the new fasttext’s language identification model from Meta AI’s No Language Left Behind initiative, released in July 2022.
pdf
bib
abs
Pro-TEXT: an Annotated Corpus of Keystroke Logs
Aleksandra Miletic
|
Christophe Benzitoun
|
Georgeta Cislaru
|
Santiago Herrera-Yanez
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Pro-TEXT is a corpus of keystroke logs written in French. Keystroke logs are recordings of the writing process executed through a keyboard, which keep track of all actions taken by the writer (character additions, deletions, substitutions). As such, the Pro-TEXT corpus offers new insights into text genesis and underlying cognitive processes from the production perspective. A subset of the corpus is linguistically annotated with parts of speech, lemmas and syntactic dependencies, making it suitable for the study of interactions between linguistic and behavioural aspects of the writing process. The full corpus contains 202K tokens, while the annotated portion is currently 30K tokens large. The annotated content is progressively being made available in a database-like CSV format and in CoNLL format, and the work on an HTML-based visualisation tool is currently under way. To the best of our knowledge, Pro-TEXT is the first corpus of its kind in French.
2020
pdf
bib
abs
A Four-Dialect Treebank for Occitan: Building Process and Parsing Experiments
Aleksandra Miletic
|
Myriam Bras
|
Marianne Vergez-Couret
|
Louise Esher
|
Clamença Poujade
|
Jean Sibille
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Occitan is a Romance language spoken mainly in the south of France. It has no official status in the country, it is not standardized and displays important diatopic variation resulting in a rich system of dialects. Recently, a first treebank for this language was created. However, this corpus is based exclusively on texts in the Lengadocian dialect. Our paper describes the work aimed at extending the existing corpus with content in three new dialects, namely Gascon, Provençau and Lemosin. We describe both the annotation of initial content in these new varieties of Occitan and experiments allowing us to identify the most efficient method for further enrichment of the corpus. We observe that parsing models trained on Occitan dialects achieve better results than a delexicalized model trained on other Romance languages despite the latter training corpus being much larger (20K vs 900K tokens). The results of the native Occitan models show an important impact of cross-dialectal lexical variation, whereas syntactic variation seems to affect the systems less. We hope that the resulting corpus, incorporating several Occitan varieties, will facilitate the training of robust NLP tools, capable of processing all kinds of Occitan texts.
pdf
bib
abs
Building a Universal Dependencies Treebank for Occitan
Aleksandra Miletic
|
Myriam Bras
|
Marianne Vergez-Couret
|
Louise Esher
|
Clamença Poujade
|
Jean Sibille
Proceedings of the Twelfth Language Resources and Evaluation Conference
This paper outlines the ongoing effort of creating the first treebank for Occitan, a low-ressourced regional language spoken mainly in the south of France. We briefly present the global context of the project and report on its current status. We adopt the Universal Dependencies framework for this project. Our methodology is based on two main principles. Firstly, in order to guarantee the annotation quality, we use the agile annotation approach. Secondly, we rely on pre-processing using existing tools (taggers and parsers) to facilitate the work of human annotators, mainly through a delexicalized cross-lingual parsing approach. We present the results available at this point (annotation guidelines and a sub-corpus annotated with PoS tags and lemmas) and give the timeline for the rest of the work.
2019
pdf
bib
abs
Transformation d’annotations en parties du discours et lemmes vers le format Universal Dependencies : étude de cas pour l’alsacien et l’occitan (Converting POS-tag and Lemma Annotations into the Universal Dependencies Format : A Case Study on Alsatian and Occitan )
Aleksandra Miletić
|
Delphine Bernhard
|
Myriam Bras
|
Anne-Laure Ligozat
|
Marianne Vergez-Couret
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts
Cet article présente un retour d’expérience sur la transformation de corpus annotés pour l’alsacien et l’occitan vers le format CONLL-U défini dans le projet Universal Dependencies. Il met en particulier l’accent sur divers points de vigilance à prendre en compte, concernant la tokénisation et la définition des catégories pour l’annotation.
pdf
bib
Building a treebank for Occitan: what use for Romance UD corpora?
Aleksandra Miletic
|
Myriam Bras
|
Louise Esher
|
Jean Sibille
|
Marianne Vergez-Couret
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)
2018
pdf
bib
De la constitution d’un corpus arboré à l’analyse syntaxique du serbe [From the constitution of a treebank to the syntactic analysis of the Serbian language]
Aleksandra Miletic
|
Cécile Fabre
|
Dejan Stosic
Traitement Automatique des Langues, Volume 59, Numéro 3 : Traitement automatique des langues peu dotées [NLP for Under-Resourced Languages]
2017
pdf
bib
Non-Projectivity in Serbian: Analysis of Formal and Linguistic Properties
Aleksandra Miletic
|
Assaf Urieli
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)
2016
pdf
bib
abs
Mise au point d’une méthode d’annotation morphosyntaxique fine du serbe (Developping a method for detailed morphosyntactic tagging of Serbian)
Aleksandra Miletic
|
Cécile Fabre
|
Dejan Stosic
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 2 : TALN (Posters)
Cet article présente une expérience d’annotation morphosyntaxique fine du volet serbe du corpus parallèle ParCoLab (corpus serbe-français-anglais). Elle a consisté à enrichir une annotation existante en parties du discours avec des traits morphosyntaxiques fins, afin de préparer une étape ultérieure de parsing. Nous avons comparé trois approches : 1) annotation manuelle ; 2) préannotation avec un étiqueteur entraîné sur le croate suivie d’une correction manuelle ; 3) réentraînement de l’outil sur un petit échantillon validé du corpus, suivi de l’annotation automatique et de la correction manuelle. Le modèle croate maintient une stabilité globale en passant au serbe, mais les différences entre les deux jeux d’étiquettes exigent des interventions manuelles importantes. Le modèle ré-entraîné sur un échantillon de taille limité (20K tokens) atteint la même exactitude que le modèle existant et le gain de temps observé montre que cette méthode optimise la phase de correction.
2014
pdf
bib
abs
TALC-sef A Manually-Revised POS-TAgged Literary Corpus in Serbian, English and French
Antonio Balvet
|
Dejan Stosic
|
Aleksandra Miletic
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
In this paper, we present a parallel literary corpus for Serbian, English and French, the TALC-sef corpus. The corpus includes a manually-revised pos-tagged reference Serbian corpus of over 150,000 words. The initial objective was to devise a reference parallel corpus in the three languages, both for literary and linguistic studies. The French and English sub-corpora had been pos-tagged from the onset, using TreeTagger (Schmid, 1994), but the corpus lacked, until now, a tagged version of the Serbian sub-corpus. Here, we present the original parallel literary corpus, then we address issues related to pos-tagging a large collection of Serbian text: from the conception of an appropriate tagset for Serbian, to the choice of an automatic pos-tagger adapted to the task, and then to some quantitative and qualitative results. We then move on to a discussion of perspectives in the near future for further annotations of the whole parallel corpus.