2024
SPACE-IDEAS: A Dataset for Salient Information Detection in Space Innovation
Andres Garcia-Silva | Cristian Berrio | Jose Manuel Gomez-Perez
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Detecting salient parts in text using natural language processing has been widely used to mitigate the effects of information overload. Nevertheless, most of the datasets available for this task are derived mainly from academic publications. We introduce SPACE-IDEAS, a dataset for salient information detection in innovation ideas related to the Space domain. The text in SPACE-IDEAS varies greatly and includes informal, technical, academic and business-oriented writing styles. In addition to a manually annotated dataset, we release an extended version annotated using a large generative language model. We train different sentence and sequential sentence classifiers, and show that the automatically annotated dataset can be leveraged through multitask learning to train better classifiers.
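The multitask setup mentioned in the abstract (a shared encoder feeding one head per annotation source, gold vs. LLM-annotated) can be sketched in miniature. Everything below, including the hashed bag-of-words encoder, the linear heads and the loss weight `alpha`, is an illustrative stand-in, not the paper's actual implementation.

```python
# Toy sketch of multitask learning over two annotation sources: a shared
# encoder feeds two task heads, and their losses are combined into one
# training objective. All names and numbers here are illustrative.

def encode(sentence, dim=8):
    # Stand-in for a shared sentence encoder: hashed bag-of-words.
    vec = [0.0] * dim
    for token in sentence.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def head_score(features, weights, bias=0.0):
    # A linear task head on top of the shared representation.
    return sum(f * w for f, w in zip(features, weights)) + bias

def multitask_loss(loss_gold, loss_llm, alpha=0.7):
    # Weighted combination of the two task losses: the manually annotated
    # (gold) task dominates, while the LLM-annotated auxiliary task acts
    # as extra supervision for the shared encoder.
    return alpha * loss_gold + (1 - alpha) * loss_llm
```

The design point being illustrated is that the auxiliary (automatically annotated) task never replaces the gold objective; it only contributes a weighted term to the loss that shapes the shared representation.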
2022
Generating Quizzes to Support Training on Quality Management and Assurance in Space Science and Engineering
Andres Garcia-Silva | Cristian Berrio Aroca | Jose Manuel Gomez-Perez | Jose Martinez | Patrick Fleith | Stefano Scaglioni
Proceedings of the 15th International Conference on Natural Language Generation: System Demonstrations
Quality management and assurance is key for space agencies to guarantee the success of space missions, which are high-risk and extremely costly. In this paper, we present a system that generates quizzes, a common resource for evaluating the effectiveness of training sessions, from documents about quality assurance procedures in the Space domain. Our system leverages state-of-the-art sequence-to-sequence models such as T5 and BART to generate questions, and a RoBERTa model to extract answers to those questions, thus verifying their suitability.
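The generate-then-verify pipeline described above can be sketched with stubs: `generate_question` stands in for the T5/BART question generator and `extract_answer` for the RoBERTa extractive QA model. The cloze-style generation and first-unseen-word extraction below are purely illustrative toys, not the system's actual models.

```python
# Hedged sketch of the quiz pipeline: generate a question, then verify it
# by checking that an extractive QA step recovers the intended answer.

def generate_question(sentence):
    # Stub question generator: turns a declarative sentence into a
    # cloze-style question by masking its last word.
    words = sentence.rstrip(".").split()
    return " ".join(words[:-1]) + " what?", words[-1]

def extract_answer(question, context):
    # Stub extractive QA: returns the first context word that does not
    # already appear in the question.
    asked = set(question.lower().split())
    for word in context.rstrip(".").split():
        if word.lower() not in asked:
            return word
    return ""

def make_quiz_item(sentence):
    question, expected = generate_question(sentence)
    extracted = extract_answer(question, sentence)
    # Verification step: keep the question only if the QA model recovers
    # the intended answer from the source sentence.
    if extracted == expected:
        return {"question": question, "answer": expected}
    return None
```

The structural point is the verification loop: a generated question is discarded unless an independent answer-extraction step confirms it is answerable from the source text.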
2021
European Language Grid: A Joint Platform for the European Language Technology Community
Georg Rehm | Stelios Piperidis | Kalina Bontcheva | Jan Hajic | Victoria Arranz | Andrejs Vasiļjevs | Gerhard Backfried | Jose Manuel Gomez-Perez | Ulrich Germann | Rémi Calizzano | Nils Feldhus | Stefanie Hegele | Florian Kintzel | Katrin Marheinecke | Julian Moreno-Schneider | Dimitris Galanis | Penny Labropoulou | Miltos Deligiannis | Katerina Gkirtzou | Athanasia Kolovou | Dimitris Gkoumas | Leon Voukoutis | Ian Roberts | Jana Hamrlova | Dusan Varis | Lukas Kacena | Khalid Choukri | Valérie Mapelli | Mickaël Rigault | Julija Melnika | Miro Janosik | Katja Prinz | Andres Garcia-Silva | Cristian Berrio | Ondrej Klejch | Steve Renals
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Europe is a multilingual society in which dozens of languages are spoken. The only way to enable and benefit from multilingualism is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which aims to evolve into the primary platform and marketplace for LT in Europe by providing one umbrella platform for the European LT landscape, including research and industry, enabling all stakeholders to upload, share and distribute their services, products and resources. By the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approximately 1,300 services for all European languages as well as thousands of datasets.
2020
European Language Grid: An Overview
Georg Rehm | Maria Berger | Ela Elsholz | Stefanie Hegele | Florian Kintzel | Katrin Marheinecke | Stelios Piperidis | Miltos Deligiannis | Dimitris Galanis | Katerina Gkirtzou | Penny Labropoulou | Kalina Bontcheva | David Jones | Ian Roberts | Jan Hajič | Jana Hamrlová | Lukáš Kačena | Khalid Choukri | Victoria Arranz | Andrejs Vasiļjevs | Orians Anvari | Andis Lagzdiņš | Jūlija Meļņika | Gerhard Backfried | Erinç Dikici | Miroslav Janosik | Katja Prinz | Christoph Prinz | Severin Stampler | Dorothea Thomas-Aniola | José Manuel Gómez-Pérez | Andres Garcia Silva | Christian Berrío | Ulrich Germann | Steve Renals | Ondrej Klejch
Proceedings of the Twelfth Language Resources and Evaluation Conference
With 24 official EU and many additional languages, multilingualism in Europe and an inclusive Digital Single Market can only be enabled through Language Technologies (LTs). European LT business is dominated by hundreds of SMEs and a few large players. Many are world-class, with technologies that outperform the global players. However, European LT business is also fragmented – by nation states, languages, verticals and sectors, significantly holding back its impact. The European Language Grid (ELG) project addresses this fragmentation by establishing the ELG as the primary platform for LT in Europe. The ELG is a scalable cloud platform, providing, in an easy-to-integrate way, access to hundreds of commercial and non-commercial LTs for all European languages, including running tools and services as well as data sets and resources. Once fully operational, it will enable the commercial and non-commercial European LT community to deposit and upload their technologies and data sets into the ELG, to deploy them through the grid, and to connect with other resources. The ELG will boost the Multilingual Digital Single Market towards a thriving European LT community, creating new jobs and opportunities. Furthermore, the ELG project organises two open calls for up to 20 pilot projects. It also sets up 32 national competence centres and the European LT Council for outreach and coordination purposes.
Making Metadata Fit for Next Generation Language Technology Platforms: The Metadata Schema of the European Language Grid
Penny Labropoulou | Katerina Gkirtzou | Maria Gavriilidou | Miltos Deligiannis | Dimitris Galanis | Stelios Piperidis | Georg Rehm | Maria Berger | Valérie Mapelli | Michael Rigault | Victoria Arranz | Khalid Choukri | Gerhard Backfried | José Manuel Gómez-Pérez | Andres Garcia-Silva
Proceedings of the Twelfth Language Resources and Evaluation Conference
The current scientific and technological landscape is characterised by the increasing availability of data resources and processing tools and services. In this setting, metadata have emerged as a key factor facilitating the management, sharing and usage of such digital assets. In this paper we present ELG-SHARE, a rich metadata schema catering for the description of Language Resources and Technologies (processing and generation services and tools, models, corpora, term lists, etc.), as well as related entities (e.g., organizations, projects, supporting documents). The schema powers the European Language Grid platform, which aims to be the primary hub and marketplace for industry-relevant Language Technology in Europe. ELG-SHARE builds on various metadata schemas, vocabularies, and ontologies, as well as related recommendations and guidelines.
2019
An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection
Andres Garcia-Silva | Cristian Berrio | José Manuel Gómez-Pérez
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Fine-tuning pre-trained language models has significantly advanced the state of the art in a wide range of downstream NLP tasks. Usually, such language models are learned from large and well-formed text corpora, e.g., encyclopedic resources, books or news. However, a significant amount of the text to be analyzed nowadays is Web data, often from social media. In this paper we consider the research question: how well do standard pre-trained language models generalize to and capture the peculiarities of the rather short, informal and frequently automatically generated text found in social media? To answer this question, we focus on bot detection in Twitter as our evaluation task and compare fine-tuning approaches based on language models against popular neural architectures such as LSTMs and CNNs combined with pre-trained and contextualized embeddings. Our results also show strong performance variations among the different language model approaches, which suggests the need for further research.
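One of the two transfer regimes compared in the abstract, feature extraction over fixed representations as opposed to end-to-end fine-tuning, can be sketched as follows. The surface features and the rule-based head below are toy stand-ins, not the pre-trained embeddings or classifiers the paper actually evaluates.

```python
# Toy illustration of the feature-extraction transfer regime: a fixed
# "embedding" function (never updated during training) plus a lightweight
# classifier head on top. Both are illustrative stand-ins.

def frozen_features(tweet):
    # Stand-in for frozen pre-trained embeddings: crude surface statistics
    # (token count and number of links) computed once and never updated.
    tokens = tweet.split()
    n_links = sum(tok.startswith("http") for tok in tokens)
    return [len(tokens), n_links]

def classifier_head(features):
    # Stand-in for the trained classification head: short, link-heavy
    # tweets are flagged as likely bot-generated.
    length, n_links = features
    return "bot" if n_links > 0 and length < 8 else "human"

def predict(tweet):
    return classifier_head(frozen_features(tweet))
```

In the fine-tuning regime, by contrast, the equivalent of `frozen_features` would itself be updated jointly with the head, which is the key difference the paper's comparison probes.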