2024
LLODIA: A Linguistic Linked Open Data Model for Diachronic Analysis
Florentina Armaselu | Chaya Liebeskind | Paola Marongiu | Barbara McGillivray | Giedre Valunaite Oleskeviciene | Elena-Simona Apostol | Ciprian-Octavian Truica | Daniela Gifu
Proceedings of the 9th Workshop on Linked Data in Linguistics @ LREC-COLING 2024
This article proposes a linguistic linked open data model for diachronic analysis (LLODIA) that combines data derived from diachronic analysis of multilingual corpora with dictionary-based evidence. A humanities use case was devised as a proof of concept that includes examples in five languages (French, Hebrew, Latin, Lithuanian and Romanian) related to various meanings of the term “revolution” considered at different time intervals. The examples were compiled through diachronic word embedding and dictionary alignment.
From Linguistics to Practice: a Case Study of Offensive Language Taxonomy in Hebrew
Chaya Liebeskind | Marina Litvak | Natalia Vanetik
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
The perception of offensive language varies based on cultural, social, and individual perspectives. With the spread of social media, there has been an increase in offensive content online, necessitating advanced solutions for its identification and moderation. This paper addresses the practical application of an offensive language taxonomy, specifically targeting Hebrew social media texts. By introducing a newly annotated dataset, modeled after the taxonomy of explicit offensive language of Lewandowska-Tomaszczyk et al. (2023), we provide a comprehensive examination of various degrees and aspects of offensive language. Our findings indicate the complexities involved in the classification of such content. We also outline the implications of relying on fixed taxonomies for Hebrew.
Self-Evaluation of Generative AI Prompts for Linguistic Linked Open Data Modelling in Diachronic Analysis
Florentina Armaselu | Chaya Liebeskind | Giedre Valunaite Oleskeviciene
Proceedings of the Workshop on Deep Learning and Linked Data (DLnLD) @ LREC-COLING 2024
This article addresses the question of evaluating generative AI prompts designed for specific tasks such as linguistic linked open data modelling and refining of word embedding results. The prompts were created to assist the pre-modelling phase in the construction of LLODIA, a linguistic linked open data model for diachronic analysis. We present a self-evaluation framework based on the method known in literature as LLM-Eval. The discussion includes prompts related to the RDF-XML conception of the model, and neighbour list refinement, dictionary alignment and contextualisation for the term revolution in French, Hebrew and Lithuanian, as a proof of concept.
From Linguistic Linked Data to Big Data
Dimitar Trajanov | Elena Apostol | Radovan Garabik | Katerina Gkirtzou | Dagmar Gromann | Chaya Liebeskind | Cosimo Palma | Michael Rosner | Alexia Sampri | Gilles Sérasset | Blerina Spahiu | Ciprian-Octavian Truică | Giedre Valunaite Oleskeviciene
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
With advances in the field of Linked (Open) Data (LOD), language data on the LOD cloud has grown in number, size, and variety. With an increased volume and variety of language data, optimizations of methods for distributing, storing, and querying these data become more central. To this end, this position paper investigates use cases at the intersection of LLOD and Big Data, existing approaches to utilizing Big Data techniques within the context of linked data, and discusses the challenges and benefits of this union.
MultiLexBATS: Multilingual Dataset of Lexical Semantic Relations
Dagmar Gromann | Hugo Goncalo Oliveira | Lucia Pitarch | Elena-Simona Apostol | Jordi Bernad | Eliot Bytyçi | Chiara Cantone | Sara Carvalho | Francesca Frontini | Radovan Garabik | Jorge Gracia | Letizia Granata | Fahad Khan | Timotej Knez | Penny Labropoulou | Chaya Liebeskind | Maria Pia Di Buono | Ana Ostroški Anić | Sigita Rackevičienė | Ricardo Rodrigues | Gilles Sérasset | Linas Selmistraitis | Mahammadou Sidibé | Purificação Silvano | Blerina Spahiu | Enriketa Sogutlu | Ranka Stanković | Ciprian-Octavian Truică | Giedre Valunaite Oleskeviciene | Slavko Zitnik | Katerina Zdravkova
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Understanding the relation between the meanings of words is an important part of comprehending natural language. Prior work has either focused on analysing lexical semantic relations in word embeddings or probing pretrained language models (PLMs), with some exceptions. Given the rarity of highly multilingual benchmarks, it is unclear to what extent PLMs capture relational knowledge and are able to transfer it across languages. To start addressing this question, we propose MultiLexBATS, a multilingual parallel dataset of lexical semantic relations adapted from BATS in 15 languages including low-resource languages, such as Bambara, Lithuanian, and Albanian. As an experiment on cross-lingual transfer of relational knowledge, we test the PLMs’ ability to (1) capture analogies across languages, and (2) predict translation targets. We find considerable differences across relation types and languages with a clear preference for hypernymy and antonymy as well as Romance languages.
2023
PARSEME corpus release 1.3
Agata Savary | Cherifa Ben Khelil | Carlos Ramisch | Voula Giouli | Verginica Barbu Mititelu | Najet Hadj Mohamed | Cvetana Krstev | Chaya Liebeskind | Hongzhi Xu | Sara Stymne | Tunga Güngör | Thomas Pickard | Bruno Guillaume | Eduard Bejček | Archna Bhatia | Marie Candito | Polona Gantar | Uxoa Iñurrieta | Albert Gatt | Jolanta Kovalevskaite | Timm Lichte | Nikola Ljubešić | Johanna Monti | Carla Parra Escartín | Mehrnoush Shamsfard | Ivelina Stoyanova | Veronika Vincze | Abigail Walsh
Proceedings of the 19th Workshop on Multiword Expressions (MWE 2023)
We present version 1.3 of the PARSEME multilingual corpus annotated with verbal multiword expressions. Since the previous version, new languages have joined the undertaking of creating such a resource, some of the already existing corpora have been enriched with new annotated texts, while others have been enhanced in various ways. The PARSEME multilingual corpus now represents 26 languages. All monolingual corpora therein use the Universal Dependencies v.2 tagset. They are (re-)split observing the PARSEME v.1.2 standard, which puts emphasis on unseen VMWEs. With the current iteration, the corpus release process has been detached from shared tasks; instead, a process for continuous improvement and systematic releases has been introduced.
JCT_DM at SemEval-2023 Task 10: Detection of Online Sexism: from Classical Models to Transformers
Efrat Luzzon | Chaya Liebeskind
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
This paper presents experiments with systems for detecting online sexism relying on classical models, deep learning models, and transformer-based models. The systems aim to provide a comprehensive approach to handling the intricacies of online language, including slang and neologisms. The dataset consists of labeled and unlabeled data from Gab and Reddit, which allows for the development of unsupervised or semi-supervised models. The systems utilize TF-IDF with classical models, bidirectional models with embedding, and pre-trained transformer models. The paper discusses the experimental setup and results, demonstrating the effectiveness of the systems in detecting online sexism.
Towards a Conversational Web? A Benchmark for Analysing Semantic Change with Conversational Knowledge Bots and Linked Open Data
Florentina Armaselu | Elena-Simona Apostol | Christian Chiarcos | Anas Fahad Khan | Chaya Liebeskind | Barbara McGillivray | Ciprian-Octavian Truica | Andrius Utka | Giedrė Valūnaitė-Oleškevičienė
Proceedings of the 4th Conference on Language, Data and Knowledge
Workflow Reversal and Data Wrangling in Multilingual Diachronic Analysis and Linguistic Linked Open Data Modelling
Florentina Armaselu | Barbara McGillivray | Chaya Liebeskind | Giedrė Valūnaitė Oleškevičienė | Andrius Utka | Daniela Gifu | Anas Fahad Khan | Elena-Simona Apostol | Ciprian-Octavian Truica
Proceedings of the 4th Conference on Language, Data and Knowledge
Validation of Language Agnostic Models for Discourse Marker Detection
Mariana Damova | Kostadin Mishev | Giedrė Valūnaitė-Oleškevičienė | Chaya Liebeskind | Purificação Silvano | Dimitar Trajanov | Ciprian-Octavian Truica | Elena-Simona Apostol | Christian Chiarcos | Anna Baczkowska
Proceedings of the 4th Conference on Language, Data and Knowledge
Multi-word Expressions as Discourse Markers in Multilingual TED-ELH Parallel Corpus
Giedrė Valūnaitė-Oleškevičienė | Chaya Liebeskind
Proceedings of the 4th Conference on Language, Data and Knowledge
2022
Cross-Lingual Link Discovery for Under-Resourced Languages
Michael Rosner | Sina Ahmadi | Elena-Simona Apostol | Julia Bosque-Gil | Christian Chiarcos | Milan Dojchinovski | Katerina Gkirtzou | Jorge Gracia | Dagmar Gromann | Chaya Liebeskind | Giedrė Valūnaitė Oleškevičienė | Gilles Sérasset | Ciprian-Octavian Truică
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper, we provide an overview of current technologies for cross-lingual link discovery, and we discuss challenges, experiences and prospects of their application to under-resourced languages. We first introduce the goals of cross-lingual linking and associated technologies, and in particular, the role that the Linked Data paradigm (Bizer et al., 2011) applied to language data can play in this context. We define under-resourced languages with a specific focus on languages actively used on the internet, i.e., languages with a digitally versatile speaker community, but limited support in terms of language technology. We argue that for languages for which considerable amounts of textual data and (at least) a bilingual word list are available, techniques for cross-lingual linking can be readily applied, and that these enable the implementation of downstream applications for under-resourced languages via the localisation and adaptation of existing technologies and resources.
ISO-based Annotated Multilingual Parallel Corpus for Discourse Markers
Purificação Silvano | Mariana Damova | Giedrė Valūnaitė Oleškevičienė | Chaya Liebeskind | Christian Chiarcos | Dimitar Trajanov | Ciprian-Octavian Truică | Elena-Simona Apostol | Anna Baczkowska
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Discourse markers carry information about the discourse structure and organization, and also signal local dependencies or the epistemological stance of the speaker. They provide instructions on how to interpret the discourse, and their study is paramount to understanding the mechanism underlying discourse organization. This paper presents a new language resource, an ISO-based annotated multilingual parallel corpus for discourse markers. The corpus comprises nine languages: Bulgarian, Lithuanian, German, European Portuguese, Hebrew, Romanian, Polish, and Macedonian, with English as a pivot language. In order to represent the meaning of the discourse markers, we propose an annotation scheme of discourse relations from ISO 24617-8 with a plug-in to ISO 24617-2 for communicative functions. We describe an experiment in which we applied the annotation scheme to assess its validity. The results reveal that, although some extensions are required to cover all the multilingual data, it provides a proper representation of the value of discourse markers. Additionally, we report some relevant contrastive phenomena concerning the interpretation of discourse markers and their role in discourse. This first step will allow us to develop deep learning methods to identify and extract discourse relations and communicative functions, and to represent that information as Linguistic Linked Open Data (LLOD).
Offensive language detection in Hebrew: can other languages help?
Marina Litvak | Natalia Vanetik | Chaya Liebeskind | Omar Hmdia | Rizek Abu Madeghem
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Unfortunately, offensive language in social media is a common phenomenon nowadays. It harms many people and vulnerable groups. Therefore, automated detection of offensive language is in high demand and it is a serious challenge in multilingual domains. Various machine learning approaches combined with natural language techniques have been applied for this task lately. This paper contributes to this area from several aspects: (1) it introduces a new dataset of annotated Facebook comments in Hebrew; (2) it describes a case study with multiple supervised models and text representations for a task of offensive language detection in three languages, including two Semitic (Hebrew and Arabic) languages; (3) it reports evaluation results of cross-lingual and multilingual learning for detection of offensive content in Semitic languages; and (4) it discusses the limitations of these settings.
Morphological Complexity of Children Narratives in Eight Languages
Gordana Hržica | Chaya Liebeskind | Kristina Š. Despot | Olga Dontcheva-Navratilova | Laura Kamandulytė-Merfeldienė | Sara Košutar | Matea Kramarić | Giedrė Valūnaitė Oleškevičienė
Proceedings of the Thirteenth Language Resources and Evaluation Conference
The aim of this study was to compare the morphological complexity in a corpus representing the language production of younger and older children across different languages. The language samples were taken from the Frog Story subcorpus of the CHILDES corpora, which comprises oral narratives collected by various researchers between 1990 and 2005. We extracted narratives by typically developing, monolingual, middle-class children. Additionally, samples of Lithuanian language, collected according to the same principles, were added. The corpus comprises 249 narratives evenly distributed across eight languages: Croatian, English, French, German, Italian, Lithuanian, Russian and Spanish. Two subcorpora were formed for each language: a younger children corpus and an older children corpus. Four measures of morphological complexity were calculated for each subcorpus: Bane, Kolmogorov, Word entropy and Relative entropy of word structure. The results showed that younger children corpora had lower morphological complexity than older children corpora for all four measures for Spanish and Russian. Reversed results were obtained for English and French, and the results for the remaining four languages showed variation. Relative entropy of word structure proved to be indicative of age differences. Word entropy and relative entropy of word structure show potential to demonstrate typological differences.
2021
JCT at SemEval-2021 Task 1: Context-aware Representation for Lexical Complexity Prediction
Chaya Liebeskind | Otniel Elkayam | Shmuel Liebeskind
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
In this paper, we present our contribution in SemEval-2021 Task 1: Lexical Complexity Prediction, where we integrate linguistic, statistical, and semantic properties of the target word and its context as features within a Machine Learning (ML) framework for predicting lexical complexity. In particular, we use BERT contextualized word embeddings to represent the semantic meaning of the target word and its context. We participated in the sub-task of predicting the complexity score of single words.
Multiword expressions as discourse markers in Hebrew and Lithuanian
Giedre Valunaite Oleskeviciene | Chaya Liebeskind
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age
2020
JCT at SemEval-2020 Task 1: Combined Semantic Vector Spaces Models for Unsupervised Lexical Semantic Change Detection
Efrat Amar | Chaya Liebeskind
Proceedings of the Fourteenth Workshop on Semantic Evaluation
In this paper, we present our contribution in SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection, where we systematically combine existing models for unsupervised capturing of lexical semantic change across time in text corpora of German, English, Latin and Swedish. In particular, we analyze the score distribution of existing models. Then we define a general threshold, adjust it independently to each of the models and measure the models’ score reliability. Finally, using both the threshold and score reliability, we aggregate the models for the two sub-tasks: binary classification and ranking.
Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions
Carlos Ramisch | Agata Savary | Bruno Guillaume | Jakub Waszczuk | Marie Candito | Ashwini Vaidya | Verginica Barbu Mititelu | Archna Bhatia | Uxoa Iñurrieta | Voula Giouli | Tunga Güngör | Menghan Jiang | Timm Lichte | Chaya Liebeskind | Johanna Monti | Renata Ramisch | Sara Stymne | Abigail Walsh | Hongzhi Xu
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
We present edition 1.2 of the PARSEME shared task on identification of verbal multiword expressions (VMWEs). Lessons learned from previous editions indicate that VMWEs have low ambiguity, and that the major challenge lies in identifying test instances never seen in the training data. Therefore, this edition focuses on unseen VMWEs. We have split annotated corpora so that the test corpora contain around 300 unseen VMWEs, and we provide non-annotated raw corpora to be used by complementary discovery methods. We released annotated and raw corpora in 14 languages, and this semi-supervised challenge attracted 7 teams who submitted 9 system results. This paper describes the effort of corpus creation, the task design, and the results obtained by the participating systems, especially their performance on unseen expressions.
Automatic Construction of Aramaic-Hebrew Translation Lexicon
Chaya Liebeskind | Shmuel Liebeskind
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Aramaic is an ancient Semitic language with a 3,000-year history. However, since the number of Aramaic speakers in the world has declined, Aramaic is in danger of extinction. In this paper, we suggest a methodology for automatic construction of an Aramaic-Hebrew translation lexicon. First, we generate an initial translation lexicon by a state-of-the-art word alignment translation model. Then, we filter the initial lexicon using string similarity measures of three types: similarity between terms in the target language, similarity between a source and a target term, and similarity between terms in the source language. In our experiments, we use a parallel corpus of Biblical Aramaic-Hebrew sentence pairs and evaluate various string similarity measures for each type of similarity. We illustrate the empirical benefit of our methodology and its effect on precision and F1. In particular, we demonstrate that our filtering method significantly exceeds a filtering approach based on the probability scores given by a state-of-the-art word alignment translation model.
2018
Automatic Thesaurus Construction for Modern Hebrew
Chaya Liebeskind | Ido Dagan | Jonathan Schler
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions
Carlos Ramisch | Silvio Ricardo Cordeiro | Agata Savary | Veronika Vincze | Verginica Barbu Mititelu | Archna Bhatia | Maja Buljan | Marie Candito | Polona Gantar | Voula Giouli | Tunga Güngör | Abdelati Hawwari | Uxoa Iñurrieta | Jolanta Kovalevskaitė | Simon Krek | Timm Lichte | Chaya Liebeskind | Johanna Monti | Carla Parra Escartín | Behrang QasemiZadeh | Renata Ramisch | Nathan Schneider | Ivelina Stoyanova | Ashwini Vaidya | Abigail Walsh
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)
This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.
2016
A Lexical Resource of Hebrew Verb-Noun Multi-Word Expressions
Chaya Liebeskind | Yaakov HaCohen-Kerner
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
A verb-noun Multi-Word Expression (MWE) is a combination of a verb and a noun with or without other words, in which the combination has a meaning different from the meaning of the words considered separately. In this paper, we present a new lexical resource of Hebrew Verb-Noun MWEs (VN-MWEs). The VN-MWEs of this resource were manually collected and annotated from five different web resources. In addition, we analyze the lexical properties of Hebrew VN-MWEs by classifying them into three types: morphological, syntactic, and semantic. These two contributions are essential for designing algorithms for automatic VN-MWE extraction. The analysis suggests some interesting features of VN-MWEs for exploration. The lexical resource makes it possible to sample a set of positive examples of Hebrew VN-MWEs. This set of examples can be used either for training supervised algorithms or as seeds in unsupervised bootstrapping algorithms. Thus, this resource is a first step towards automatic identification of Hebrew VN-MWEs, which is important for natural language understanding, generation and translation systems.
Semantically Motivated Hebrew Verb-Noun Multi-Word Expressions Identification
Chaya Liebeskind | Yaakov HaCohen-Kerner
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
Identification of Multi-Word Expressions (MWEs) lies at the heart of many natural language processing applications. In this research, we deal with a particular type of Hebrew MWEs, Verb-Noun MWEs (VN-MWEs), which combine a verb and a noun with or without other words. Most prior work on MWEs classification focused on linguistic and statistical information. In this paper, we claim that it is essential to utilize semantic information. To this end, we propose a semantically motivated indicator for classifying VN-MWE and define features that are related to various semantic spaces and combine them as features in a supervised classification framework. We empirically demonstrate that our semantic feature set yields better performance than the common linguistic and statistical feature sets and that combining semantic features contributes to the VN-MWEs identification task.
2015
Integrating Query Performance Prediction in Term Scoring for Diachronic Thesaurus
Chaya Liebeskind | Ido Dagan
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)
2013
Semi-automatic Construction of Cross-period Thesaurus
Chaya Liebeskind | Ido Dagan | Jonathan Schler
Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities
2012
Statistical Thesaurus Construction for a Morphologically Rich Language
Chaya Liebeskind | Ido Dagan | Jonathan Schler
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)