Radovan Garabík

Also published as: Radovan Garabik


2024

From Linguistic Linked Data to Big Data
Dimitar Trajanov | Elena Apostol | Radovan Garabik | Katerina Gkirtzou | Dagmar Gromann | Chaya Liebeskind | Cosimo Palma | Michael Rosner | Alexia Sampri | Gilles Sérasset | Blerina Spahiu | Ciprian-Octavian Truică | Giedre Valunaite Oleskeviciene
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

With advances in the field of Linked (Open) Data (LOD), language data on the LOD cloud has grown in number, size, and variety. With an increased volume and variety of language data, optimizations of methods for distributing, storing, and querying these data become more central. To this end, this position paper investigates use cases at the intersection of LLOD and Big Data, reviews existing approaches to utilizing Big Data techniques within the context of linked data, and discusses the challenges and benefits of this union.

MultiLexBATS: Multilingual Dataset of Lexical Semantic Relations
Dagmar Gromann | Hugo Goncalo Oliveira | Lucia Pitarch | Elena-Simona Apostol | Jordi Bernad | Eliot Bytyçi | Chiara Cantone | Sara Carvalho | Francesca Frontini | Radovan Garabik | Jorge Gracia | Letizia Granata | Fahad Khan | Timotej Knez | Penny Labropoulou | Chaya Liebeskind | Maria Pia Di Buono | Ana Ostroški Anić | Sigita Rackevičienė | Ricardo Rodrigues | Gilles Sérasset | Linas Selmistraitis | Mahammadou Sidibé | Purificação Silvano | Blerina Spahiu | Enriketa Sogutlu | Ranka Stanković | Ciprian-Octavian Truică | Giedre Valunaite Oleskeviciene | Slavko Zitnik | Katerina Zdravkova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Understanding the relation between the meanings of words is an important part of comprehending natural language. Prior work has, with some exceptions, focused either on analysing lexical semantic relations in word embeddings or on probing pretrained language models (PLMs). Given the rarity of highly multilingual benchmarks, it is unclear to what extent PLMs capture relational knowledge and are able to transfer it across languages. To start addressing this question, we propose MultiLexBATS, a multilingual parallel dataset of lexical semantic relations adapted from BATS in 15 languages including low-resource languages, such as Bambara, Lithuanian, and Albanian. As an experiment on cross-lingual transfer of relational knowledge, we test the PLMs’ ability to (1) capture analogies across languages, and (2) predict translation targets. We find considerable differences across relation types and languages with a clear preference for hypernymy and antonymy as well as Romance languages.

2023

Validation of the Bigger Analogy Test Set Translation into Croatian, Lithuanian and Slovak
Radovan Garabík | Ana Ostroški Anić | Sigita Rackevičienė | Giedrė Valūnaitė-Oleškevičienė | Linas Selmistraitis | Andrius Utka
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

Introducing the CURLICAT Corpora: Seven-language Domain Specific Annotated Corpora from Curated Sources
Tamás Váradi | Bence Nyéki | Svetla Koeva | Marko Tadić | Vanja Štefanec | Maciej Ogrodniczuk | Bartłomiej Nitoń | Piotr Pęzik | Verginica Barbu Mititelu | Elena Irimia | Maria Mitrofan | Dan Tufiș | Radovan Garabík | Simon Krek | Andraž Repar
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This article presents the current outcomes of the CURLICAT CEF Telecom project, which aims to collect and deeply annotate a set of large corpora from selected domains. The CURLICAT corpus includes 7 monolingual corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing selected samples from respective national corpora. These corpora are automatically tokenized, lemmatized and morphologically analysed, and the named entities are annotated. The annotations are uniformly provided for each language-specific corpus while the common metadata schema is harmonised across the languages. Additionally, the corpora are annotated for IATE terms in all languages. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and three extra columns specific to our corpora as defined by Váradi et al. (2020). The CURLICAT corpora represent a rich and valuable source not just for training NMT models, but also for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.

Curated Multilingual Language Resources for CEF AT (CURLICAT): overall view
Tamás Váradi | Marko Tadić | Svetla Koeva | Maciej Ogrodniczuk | Dan Tufiş | Radovan Garabík | Simon Krek | Andraž Repar
Proceedings of the 23rd Annual Conference of the European Association for Machine Translation

The work in progress on the CEF Action CURLICAT is presented. The general aim of the Action is to compile curated datasets in seven languages of the consortium in domains of relevance to European Digital Service Infrastructures (DSIs) in order to enhance the eTranslation services.

2020

The European Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the Twelfth Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI – including many opportunities, synergies but also misconceptions – has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of the Twelfth Language Resources and Evaluation Conference

This article presents the current outcomes of the MARCELL CEF Telecom project aiming to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are uniformly provided for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with the IATE and EUROVOC labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.

2014

The Strategic Impact of META-NET on the Regional, National and International Level
Georg Rehm | Hans Uszkoreit | Sophia Ananiadou | Núria Bel | Audronė Bielevičienė | Lars Borin | António Branco | Gerhard Budin | Nicoletta Calzolari | Walter Daelemans | Radovan Garabík | Marko Grobelnik | Carmen García-Mateo | Josef van Genabith | Jan Hajič | Inma Hernáez | John Judge | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Joseph Mariani | John McNaught | Maite Melero | Monica Monachini | Asunción Moreno | Jan Odijk | Maciej Ogrodniczuk | Piotr Pęzik | Stelios Piperidis | Adam Przepiórkowski | Eiríkur Rögnvaldsson | Michael Rosner | Bolette Pedersen | Inguna Skadiņa | Koenraad De Smedt | Marko Tadić | Paul Thompson | Dan Tufiş | Tamás Váradi | Andrejs Vasiļjevs | Kadri Vider | Jolanta Zabarskaite
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field.

Semi-Automatic Detection of Multiword Expressions in the Slovak Dependency Treebank
Daniela Majchrakova | Ondrej Dusek | Jan Hajic | Agata Karcova | Radovan Garabik
Proceedings of the First International Conference on Computational Linguistics in Bulgaria (CLIB 2014)

We describe a method for semi-automatic extraction of Slovak multiword expressions (MWEs) from a dependency treebank. The process uses an automatic conversion from dependency syntactic trees to deep syntax and automatic tagging of verbal argument nodes based on a valency dictionary. Both the valency dictionary and the treebank conversion were adapted from the corresponding Czech versions; the automatically translated valency dictionary has been manually proofread and corrected. There are two main achievements – a valency dictionary of Slovak MWEs with direct links to corresponding expressions in the Czech dictionary, PDT-Vallex, and a method of extraction of MWEs from the Slovak Dependency Treebank. The extraction reached very high precision but lower recall in a manual evaluation. This is a work in progress, the overall goal of which is twofold: to create a Slovak language valency dictionary paralleling the Czech one, with bilingual links; and to use the extracted verbal frames in a collocation dictionary of Slovak verbs.

2011

Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora
Kiril Simov | Petya Osenova | Jörg Tiedemann | Radovan Garabik
Proceedings of the Second Workshop on Annotation and Exploitation of Parallel Corpora
