2024
Surveying the Technology Support of Languages
Annika Grützner-Zahn | Federico Gaspari | Maria Giagkou | Stefanie Hegele | Andy Way | Georg Rehm
Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024
Many of the world’s languages are left behind when it comes to Language Technology applications, since most of these are available only in a limited number of languages, creating a digital divide that affects millions of users worldwide. It is crucial, therefore, to monitor and quantify the progress of technology support for individual languages, which also enables comparisons across language communities. In this way, efforts can be directed towards reducing language barriers, promoting economic and social inclusion, and ensuring that all citizens can use their preferred language in the digital age. This paper critically reviews and compares recent quantitative approaches to measuring technology support for languages. Despite using different approaches and methodologies, the findings of all analysed papers demonstrate the unequal distribution of technology support and emphasise the existence of a digital divide among languages.
Occiglot at WMT24: European Open-source Large Language Models Evaluated on Translation
Eleftherios Avramidis | Annika Grützner-Zahn | Manuel Brack | Patrick Schramowski | Pedro Ortiz Suarez | Malte Ostendorff | Fabio Barth | Shushen Manakhimova | Vivien Macketanz | Georg Rehm | Kristian Kersting
Proceedings of the Ninth Conference on Machine Translation
This document describes the submission of the very first version of the Occiglot open-source large language model to the General MT Shared Task of the 9th Conference on Machine Translation (WMT24). Occiglot is an open-source, community-based LLM based on Mistral-7B, which went through language-specific continual pre-training and subsequent instruction tuning, including instructions relevant to machine translation. We examine the automatic metric scores for translating the WMT24 test set and provide a detailed linguistically-motivated analysis. Although Occiglot performs worse than many of the other system submissions, we observe that it performs better than Mistral-7B, the model on which it is based, which indicates the positive effect of the language-specific continual pre-training and instruction tuning. We see the submission of this very early version of the model as a motivation to unite community forces and pursue future LLM research on the translation task.
Common European Language Data Space
Georg Rehm | Stelios Piperidis | Khalid Choukri | Andrejs Vasiļjevs | Katrin Marheinecke | Victoria Arranz | Aivars Bērziņš | Miltos Deligiannis | Dimitris Galanis | Maria Giagkou | Katerina Gkirtzou | Dimitris Gkoumas | Annika Grützner-Zahn | Athanasia Kolovou | Penny Labropoulou | Andis Lagzdiņš | Elena Leitner | Valérie Mapelli | Hélène Mazo | Simon Ostermann | Stefania Racioppa | Mickaël Rigault | Leon Voukoutis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The Common European Language Data Space (LDS) is an integral part of the EU data strategy, which aims at developing a single market for data. Its decentralised technical infrastructure and governance scheme are currently being developed by the LDS project, which also has dedicated tasks for proof-of-concept prototypes, handling legal aspects, raising awareness and promoting the LDS through events and social media channels. The LDS is part of a broader vision for establishing all necessary components to develop European large language models.
2022
Introducing the Digital Language Equality Metric: Contextual Factors
Annika Grützner-Zahn | Georg Rehm
Proceedings of the Workshop Towards Digital Language Equality within the 13th Language Resources and Evaluation Conference
In our digital age, digital language equality is an important goal to enable participation in society for all citizens, independent of the language they speak. To assess the current state of play with regard to Europe’s languages, we developed, in the project European Language Equality, a metric for digital language equality that consists of two parts: technological and contextual (i.e., non-technological) factors. We present a metric for calculating the contextual factors for over 80 European languages. For each language, a score is calculated that reflects the broader context or socio-economic ecosystem of that language, which has a direct impact on technology and resource development. It is important to note, though, that aspects related to Language Technologies and Resources are reflected by the technological factors. To reduce the vast number of potential contextual factors to an adequate number, five different configurations were calculated and evaluated with a panel of experts. The best results were achieved by a configuration in which 12 manually curated factors were included. In the factor selection process, attention was paid to data quality, automatic updatability, inclusion of data from different domains, and a balance between different data types. The evaluation shows that this specific configuration is stable for the official EU languages, while for regional and minority languages, as well as national non-official EU languages, there is room for improvement.