Common European Language Data Space
Georg Rehm
Stelios Piperidis
Khalid Choukri
Andrejs Vasiļjevs
Katrin Marheinecke
Victoria Arranz
Aivars Bērziņš
Miltos Deligiannis
Dimitris Galanis
Maria Giagkou
Katerina Gkirtzou
Dimitris Gkoumas
Annika Grützner-Zahn
Athanasia Kolovou
Penny Labropoulou
Andis Lagzdiņš
Elena Leitner
Valérie Mapelli
Hélène Mazo
Simon Ostermann
Stefania Racioppa
Mickaël Rigault
Leon Voukoutis
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The Common European Language Data Space (LDS) is an integral part of the EU data strategy, which aims at developing a single market for data. Its decentralised technical infrastructure and governance scheme are currently being developed by the LDS project, which also has dedicated tasks for proof-of-concept prototypes, handling legal aspects, raising awareness and promoting the LDS through events and social media channels. The LDS is part of a broader vision for establishing all necessary components to develop European large language models.
Language Resources to Support Language Diversity – the ELRA Achievements
Valérie Mapelli
Victoria Arranz
Khalid Choukri
Hélène Mazo
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This article highlights ELRA’s latest achievements in the field of Language Resources (LRs) identification, sharing and production. It also reports on ELRA’s involvement in several national and international projects, as well as in the organization of events for the support of LRs and related Language Technologies, including for under-resourced languages. Over the past few years, ELRA, together with its operational agency ELDA, has continued to increase its catalogue offer of LRs, establishing worldwide partnerships for the production of various types of LRs (SMS, tweets, crawled data, MT aligned data, speech LRs, sentiment-based data, etc.). Through their consistent involvement in EU-funded projects, ELRA and ELDA have contributed to improving access to multilingual information in the context of the pandemic, developing tools for the de-identification of texts in the legal and medical domains, supporting the EU eTranslation Machine Translation system, and setting up a European platform providing access to both resources and services. In December 2019, ELRA co-organized the LT4All conference, whose main topics were Language Technologies for enabling linguistic diversity and multilingualism worldwide. Moreover, although LREC was cancelled in 2020, ELRA published the LREC 2020 proceedings for the Main conference and Workshops papers, and carried on its dissemination activities while targeting the new LREC edition for 2022.
Categorizing legal features in a metadata-oriented task: defining the conditions of use
Mickaël Rigault
Victoria Arranz
Valérie Mapelli
Penny Labropoulou
Stelios Piperidis
Proceedings of the Workshop on Ethical and Legal Issues in Human Language Technologies and Multilingual De-Identification of Sensitive Data In Language Resources within the 13th Language Resources and Evaluation Conference
In recent times, the Human Language Technology (HLT) community has paid more attention to the legal framework for making available and reusing Language Resources (LRs) and tools. Licensing is now an issue that is foreseen in most research projects and that is essential to provide legal certainty for repositories when distributing resources. Some repositories such as Zenodo or Quantum Stat do not offer the possibility to search for resources by license, which can make searching for relevant resources a very complex task. Other repositories such as Hugging Face propose a search feature by license which may make it difficult to figure out what use can be made of such resources. During the European Language Grid (ELG) project, we moved a step forward to link metadata with the terms and conditions of use. In this paper, we document the process we undertook to categorize legal features of licenses listed in the SPDX license list and widely used in the HLT community as well as those licenses used within the ELG platform.
European Language Grid: A Joint Platform for the European Language Technology Community
Georg Rehm
Stelios Piperidis
Kalina Bontcheva
Jan Hajic
Victoria Arranz
Andrejs Vasiļjevs
Gerhard Backfried
Jose Manuel Gomez-Perez
Ulrich Germann
Rémi Calizzano
Nils Feldhus
Stefanie Hegele
Florian Kintzel
Katrin Marheinecke
Julian Moreno-Schneider
Dimitris Galanis
Penny Labropoulou
Miltos Deligiannis
Katerina Gkirtzou
Athanasia Kolovou
Dimitris Gkoumas
Leon Voukoutis
Ian Roberts
Jana Hamrlova
Dusan Varis
Lukas Kacena
Khalid Choukri
Valérie Mapelli
Mickaël Rigault
Julija Melnika
Miro Janosik
Katja Prinz
Andres Garcia-Silva
Cristian Berrio
Ondrej Klejch
Steve Renals
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Europe is a multilingual society, in which dozens of languages are spoken. The only option to enable and to benefit from multilingualism is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which is targeted to evolve into the primary platform and marketplace for LT in Europe by providing one umbrella platform for the European LT landscape, including research and industry, enabling all stakeholders to upload, share and distribute their services, products and resources. At the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approx. 1300 services for all European languages as well as thousands of data sets.
Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid
Penny Labropoulou
Katerina Gkirtzou
Maria Gavriilidou
Miltos Deligiannis
Dimitris Galanis
Stelios Piperidis
Georg Rehm
Maria Berger
Valérie Mapelli
Michael Rigault
Victoria Arranz
Khalid Choukri
Gerhard Backfried
José Manuel Gómez-Pérez
Andres Garcia-Silva
Proceedings of the Twelfth Language Resources and Evaluation Conference
The current scientific and technological landscape is characterised by the increasing availability of data resources and processing tools and services. In this setting, metadata have emerged as a key factor facilitating management, sharing and usage of such digital assets. In this paper we present ELG-SHARE, a rich metadata schema catering for the description of Language Resources and Technologies (processing and generation services and tools, models, corpora, term lists, etc.), as well as related entities (e.g., organizations, projects, supporting documents, etc.). The schema powers the European Language Grid platform that aims to be the primary hub and marketplace for industry-relevant Language Technology in Europe. ELG-SHARE has been based on various metadata schemas, vocabularies, and ontologies, as well as related recommendations and guidelines.
Data Management Plan (DMP) for Language Data under the New General Data Protection Regulation (GDPR)
Pawel Kamocki
Valérie Mapelli
Khalid Choukri
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
European Language Resource Coordination: Collecting Language Resources for Public Sector Multilingual Information Management
Andrea Lösch
Valérie Mapelli
Stelios Piperidis
Andrejs Vasiļjevs
Lilli Smal
Thierry Declerck
Eileen Schnur
Khalid Choukri
Josef van Genabith
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
New directions in ELRA activities
Valérie Mapelli
Victoria Arranz
Hélène Mazo
Pawel Kamocki
Vladimir Popescu
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
ELRA Activities and Services
Khalid Choukri
Valérie Mapelli
Hélène Mazo
Vladimir Popescu
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
After celebrating its 20th anniversary in 2015, ELRA is carrying on its strong involvement in the HLT field. To share ELRA’s expertise of those 21 past years, this article begins with a presentation of ELRA’s strategic Data and LR Management Plan for a wide use by the language communities. Then, we further report on ELRA’s activities and services provided since LREC 2014. When looking at the cataloguing and licensing activities, we can see that ELRA has been active in moving the Meta-Share repository toward new development steps, supporting Europe to obtain accurate LRs within the Connecting Europe Facility programme, promoting the use of LR citation, and creating the ELRA License Wizard web portal. The article further elaborates on the recent LR production activities of various written, speech and video resources, commissioned by public and private customers. In parallel, ELDA has also worked on several EU-funded projects centred on strategic issues related to the European Digital Single Market. The last part gives an overview of the latest dissemination activities, with a special focus on the celebration of its 20th anniversary organised in Dubrovnik (Croatia) and the following up of LREC, as well as the launching of the new ELRA portal.
Language Resource Citation: the ISLRN Dissemination and Further Developments
Valérie Mapelli
Vladimir Popescu
Lin Liu
Khalid Choukri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This article presents the latest dissemination activities and technical developments that were carried out for the International Standard Language Resource Number (ISLRN) service. It also recalls the main principle and submission process for providers to obtain their 13-digit ISLRN identifier. Up to March 2016, 2100 Language Resources were allocated an ISLRN number, not only ELRA’s and LDC’s catalogued Language Resources, but also the ones from other important organisations like the Joint Research Centre (JRC) and the Resource Management Agency (RMA) who expressed their strong support to this initiative. In the research field, not only is assigning a unique identification number important, but referring to a Language Resource as an object per se (like publications) has now also become an obvious requirement. The ISLRN could also become an important parameter to be considered to compute a Language Resource Impact Factor (LRIF) in order to recognize the merits of the producers of Language Resources. Integrating the ISLRN number into an LR-oriented bibliographical reference is thus part of the objective. The idea is to make use of a BibTeX entry that would take into account Language Resources items, including ISLRN. The ISLRN being a requested field within the LREC 2016 submission, we expect that several other LRs will be allocated an ISLRN number by the conference date. With this expansion, this number aims to be a widely used LR citation instrument within works referring to LRs.
The ELRA License Wizard
Valérie Mapelli
Vladimir Popescu
Lin Liu
Meritxell Fernández Barrera
Khalid Choukri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
To allow an easy understanding of the various licenses that exist for the use of Language Resources (ELRA’s, META-SHARE’s, Creative Commons’, etc.), ELRA has developed a License Wizard to help the right-holders share/distribute their resources under the appropriate license. It also aims to be exploited by users to better understand the legal obligations that apply in various licensing situations. The present paper elaborates on the functionalities of this web configurator, which enables right-holders to select a number of legal features and obtain the user license adapted to their selection, to define which user licenses they would like to select in order to distribute their Language Resources, and to integrate the user license terms into a Distribution Agreement that could be proposed to ELRA or META-SHARE for further distribution through the ELRA Catalogue of Language Resources. Thanks to a flexible back office, the structure of the legal feature selection can easily be reviewed to include other features that may be relevant for other licenses. Integrating contributions from other initiatives thus aims to be one of the obvious next steps, with a special focus on CLARIN and Linked Data experiences.
Review on the Existing Language Resources for Languages of France
Thibault Grouas
Valérie Mapelli
Quentin Samier
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
With the support of the DGLFLF, ELDA conducted an inventory of existing language resources for the regional languages of France. The main aim of this inventory was to assess the exploitability of the identified resources within technologies. A total of 2,299 Language Resources were identified. As a second step, a deeper analysis of a set of three language groups (Breton, Occitan, overseas languages) was carried out along with a focus on their exploitability within three technologies: automatic translation, voice recognition/synthesis and spell checkers. The survey was followed by the organisation of the TLRF2015 Conference which aimed to present the state of the art in the field of the Technologies for Regional Languages of France. The next step will be to activate the network of specialists built up during the TLRF conference and to begin the organisation of a second TLRF conference. Meanwhile, the French Ministry of Culture continues its actions related to linguistic diversity and technology, in particular through a project with Wikimedia France related to contributions to Wikipedia in regional languages, the upcoming new version of the “Corpus de la Parole” and the reinforcement of the DGLFLF’s Observatory of Linguistic Practices.
ELRA’s Consolidated Services for the HLT Community
Victoria Arranz
Khalid Choukri
Valérie Mapelli
Hélène Mazo
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper emphasises ELRA’s contribution to the HLT field thanks to the consolidation of its services since LREC 2012. Among the most recent contributions is the establishment of the International Standard Language Resource Number (ISLRN), with the creation and exploitation of an associated web portal to enable the procurement of unique identifiers for Language Resources. Interoperability, consolidation and synchronization also remain a strong focus in ELRA’s cataloguing work, in particular with ELRA’s involvement in the META-SHARE project, whose platform is to become ELRA’s next instrument for sharing LRs. Since the last LREC, ELRA has continued its action to offer free LRs to the research community. Cooperation is another watchword within ELRA’s activities on multiple aspects: 1) at the legal level, ELRA is supporting the EC in identifying the gaps to be filled to reach harmonized copyright regulations for the HLT community in Europe; 2) at the production level, ELRA is participating in several international projects, in the field of LR production and evaluation of technologies; 3) at the communication level, ELRA has organised the NLP12 meeting with the aim of boosting co-operation and strengthening the bridges between various communities.
ELRA in the heart of a cooperative HLT world
Valérie Mapelli
Victoria Arranz
Matthieu Carré
Hélène Mazo
Djamel Mostefa
Khalid Choukri
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper aims at giving an overview of ELRA’s recent activities. The first part elaborates on ELRA’s means of boosting the sharing of Language Resources (LRs) within the HLT community through its catalogues, LRE-Map initiative, as well as its work towards the integration of its LRs within the META-SHARE open infrastructure. The second part shows how ELRA helps in the development and evaluation of HLT, in particular through its participation in numerous collaborative projects for the production of resources and platforms to facilitate their production and exploitation. A third part focuses on ELRA’s work for clearing IPR issues in an HLT-oriented context, one of its latest initiatives being its involvement in a Fair Research Act proposal to promote easy access to LRs for the widest community. Finally, the last part elaborates on recent actions for disseminating information and promoting cooperation in the field, e.g. the Language Library being launched at LREC2012 and the creation of an International Standard LR Number, a unique LR identifier to enable the accurate identification of LRs. Among the other messages ELRA will be conveying to the attendees are the announcement of a set of freely available resources, the establishment of an LR and Evaluation forum, etc.
The REPERE Corpus: a multimodal corpus for person recognition
Aude Giraudel
Matthieu Carré
Valérie Mapelli
Juliette Kahn
Olivier Galibert
Ludovic Quintard
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
The REPERE Challenge aims to support research on people recognition in multimodal conditions. To assess the technology progression, annual evaluation campaigns will be organized from 2012 to 2014. In this context, the REPERE corpus, a French video corpus with multimodal annotation, has been developed. This paper presents datasets collected for the dry run test that took place at the beginning of 2012. Specific annotation tools and guidelines are the main focus of the description. At present, 6 hours of data have been collected and annotated. The last section presents analyses of annotation distribution and interaction between modalities in the corpus.
The META-SHARE Metadata Schema for the Description of Language Resources
Maria Gavrilidou
Penny Labropoulou
Elina Desipri
Stelios Piperidis
Haris Papageorgiou
Monica Monachini
Francesca Frontini
Thierry Declerck
Gil Francopoulo
Victoria Arranz
Valerie Mapelli
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper presents a metadata model for the description of language resources proposed in the framework of the META-SHARE infrastructure, aiming to cover both datasets and tools/technologies used for their processing. It places the model in the overall framework of metadata models, describes the basic principles and features of the model, elaborates on the distinction between minimal and maximal versions thereof, briefly presents the integrated environment supporting the LRs description and search and retrieval processes and concludes with work to be done in the future for the improvement of the model.
A Metadata Schema for the Description of Language Resources (LRs)
Maria Gavrilidou
Penny Labropoulou
Stelios Piperidis
Monica Monachini
Francesca Frontini
Gil Francopoulo
Victoria Arranz
Valérie Mapelli
Proceedings of the Workshop on Language Resources, Technology and Services in the Sharing Paradigm
Latest Developments in ELRA’s Services
Valérie Mapelli
Victoria Arranz
Hélène Mazo
Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
This paper describes the latest developments in ELRA’s services within the field of Language Resources (LR). These developments focus on 4 main groups of activities: the identification and distribution of Language Resources; the production of LRs; the evaluation of Human Language Technology (HLT), and the dissemination of information in the field. ELRA’s initial work on the distribution of language resources has evolved throughout the years, currently covering a much wider range of activities that have been considered crucial for the current needs of the R&D community and the good health of the LR world. Regarding distribution, considerable work has been done on a broader identification, which does not only consider resources to be immediately negotiated for distribution but which aims to inform on all available resources. This has been the seed for the Universal Catalogue. Furthermore, a Catalogue of LRs with favourable conditions for R&D has also been created. Moreover, the different activities regarding identification on demand, production within different frameworks, evaluation of language technologies and participation in evaluation campaigns, as well as our very specific focus on information dissemination, are described in detail in this paper.
A Guide for the Production of Reusable Language Resources
Victoria Arranz
Franck Gandcher
Valérie Mapelli
Khalid Choukri
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The project described in this paper is funded by the French Ministry of Research. It aims at providing producers of Language Resources, and HLT players in general, with a guide which offers technical, legal and strategic recommendations/guidelines for the reuse of their Language Resources. The guide is dedicated in particular to academic laboratories which produce Language Resources and may benefit from further advice to start development, but also to any HLT player who wishes to follow the best practices in this field. The guidelines focus on different steps of a Language Resource’s life, i.e. specifications, production, validation, distribution, and maintenance. This paper gives a brief overview of the guide, and describes a) technical formats, standards and best practices which correspond to the current state of the art, for different types of resources, whether written or spoken, at different steps of the production line, b) legal issues and models/templates which can be used for the dissemination of Language Resources as widely as possible, c) strategic issues, by offering a dissemination plan which takes into account all types of constraints faced by HLT community players.
ENABLER Thematic Network of National Projects: Technical, Strategic and Political Issues of LRs
Nicoletta Calzolari
Khalid Choukri
Maria Gavrilidou
Bente Maegaard
Paola Baroni
Hanne Fersøe
Alessandro Lenci
Valérie Mapelli
Monica Monachini
Stelios Piperidis
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
Technolangue: A Permanent Evaluation and Information Infrastructure
Valérie Mapelli
Maria Nava
Sylvain Surcin
Djamel Mostefa
Khalid Choukri
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual romance corpus
Emanuela Cresti
Massimo Moneglia
Fernanda Bacelar do Nascimento
Antonio Moreno Sandoval
Jean Veronis
Philippe Martin
Kalid Choukri
Valerie Mapelli
Daniele Falavigna
Antonio Cid
Claude Blum
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
For a Repository of NLP Tools
Stéphane Chaudiron
Khalid Choukri
Audrey Mance
Valérie Mapelli
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
Recent Developments within the European Language Resources Association (ELRA)
Khalid Choukri
Audrey Mance
Valérie Mapelli
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)