Meeting of the Conference on Language, Data and Knowledge (2025)



pdf (full)
bib (full)
Proceedings of the 5th Conference on Language, Data and Knowledge

pdf bib
Proceedings of the 5th Conference on Language, Data and Knowledge
Mehwish Alam | Andon Tchechmedjiev | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov

pdf bib
DiaSafety-CC: Annotating Dialogues with Safety Labels and Reasons for Cross-Cultural Analysis
Tunde Oluwaseyi Ajayi | Mihael Arcan | Paul Buitelaar

A dialogue dataset developed in a language can receive diverse safety annotations when presented to raters from different cultures. What is considered acceptable in one culture can be perceived as offensive in another. Cultural differences in dialogue safety annotation are yet to be fully explored. In this work, we use the geopolitical entity Country as the basis for our cultural study. We extend DiaSafety, an existing English dialogue safety dataset originally annotated by raters from Western cultures, to create a new dataset, DiaSafety-CC. In our work, three raters each from Nigeria and India reannotate the DiaSafety dataset and provide reasons for their choice of labels. We perform pairwise comparisons of the annotations across the cultures studied. Furthermore, we compare the representative labels of each rater group to those of an existing large language model (LLM). Due to the subjectivity of the dialogue annotation task, only 32.6% of the considered dialogues achieve unanimous annotation consensus across the labels of DiaSafety and the six raters. In our analyses, we observe that the Unauthorized Expertise and Biased Opinion categories contain the dialogues with the highest label disagreement ratio across the cultures studied. On manual inspection of the reasons provided for the choice of labels, we observe that raters across the cultures in DiaSafety-CC are more sensitive to dialogues directed at target groups than to dialogues directed at individuals. We also observe that GPT-4o annotations show a more positive agreement with DiaSafety labels in terms of F1 score and phi coefficient.
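
The agreement analysis described above compares group labels using F1 score and the phi coefficient. A minimal sketch of such a comparison (the labels below are invented; for binary labels, the phi coefficient equals the Matthews correlation coefficient as computed by scikit-learn):

```python
from sklearn.metrics import f1_score, matthews_corrcoef

# Invented binary safety labels: 1 = unsafe, 0 = safe.
diasafety_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels       = [1, 0, 1, 0, 0, 1, 1, 0]

# F1 measures overlap on the positive (unsafe) class; for binary labels,
# the phi coefficient equals the Matthews correlation coefficient.
print("F1: ", f1_score(diasafety_labels, llm_labels))
print("phi:", matthews_corrcoef(diasafety_labels, llm_labels))
```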

pdf bib
The Leibniz List as Linguistic Linked Data in the LiLa Knowledge Base
Lisa Sophie Albertelli | Giulia Calvi | Francesco Mambrini

This paper presents the integration of the Leibniz List, a concept list from the Concepticon project, into the LiLa Knowledge Base of Latin interoperable resources. The modeling experiment was conducted using W3C standards like Ontolex and SKOS. This work, which originated in a project for a university course, is limited to a short list of words, but it already enables interoperability between the Concepticon and the language resources in a LOD architecture like LiLa. The integration enriches the LiLa ecosystem, allowing users to explore the Latin lexicon from an onomasiological perspective and linking concepts to lexical entries from various dictionaries and corpus attestations. The work showcases how standard Semantic Web technologies can effectively model and connect historical concept lists within larger linguistic knowledge infrastructures and provides an example for further experiments with the Concepticon’s data.
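
The modeling relies on W3C vocabularies such as SKOS and OntoLex. A minimal rdflib sketch of the general linking pattern, connecting a Concepticon-style concept to a Latin lemma (the URIs are illustrative placeholders, not the actual LiLa or Concepticon identifiers):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")

g = Graph()
g.bind("skos", SKOS)
g.bind("ontolex", ONTOLEX)

# Placeholder URIs standing in for a Concepticon concept and a LiLa lemma.
concept = URIRef("http://example.org/concepticon/WATER")
lemma = URIRef("http://example.org/lila/lemma/aqua")

g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("WATER", lang="en")))
# In OntoLex, a lexical entry evokes a lexical concept.
g.add((lemma, ONTOLEX.evokes, concept))

print(g.serialize(format="turtle"))
```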

pdf bib
Benchmarking Hindi Term Extraction in Education: A Dataset and Analysis
Shubhanker Banerjee | Bharathi Raja Chakravarthi | John Philip McCrae

This paper introduces the HTEC Hindi Term Extraction Dataset 2.0, a resource designed to support terminology extraction and classification tasks within the education domain. HTEC 2.0 has been developed with the objective of providing a high-quality benchmark dataset for the evaluation of term recognition and classification methodologies in Hindi educational discourse. The dataset consists of 97 documents sourced from Hindi Wikipedia, covering a diverse range of topics relevant to the education sector. Within these documents, 1,702 terms have been manually annotated, where each term is defined as a single-word or multi-word expression that conveys a domain-specific meaning. The annotated terms in HTEC 2.0 are systematically categorized into seven distinct classes. Furthermore, this paper outlines the development of annotation guidelines, detailing the criteria used to determine term boundaries and category assignments. By offering a structured dataset with clearly defined term classifications, HTEC 2.0 serves as a valuable resource for researchers working on terminology extraction, domain-specific named entity recognition, and text classification in Hindi.

pdf bib
CoWoYTP1Att: A Social Media Comment Dataset on Gender Discourse with Appraisal Theory Annotations
Valentina Tretti Beckles | Adrian Vergara Heidke | Natalia Molina-Valverde

This paper presents the Corpus on Women in YouTube on Performance with Attitude Annotations (CoWoYTP1Att), developed based on Appraisal Theory (Martin & White, 2005). Between September 2020 and May 2021, 14,883 comments were extracted from a YouTube video featuring a compilation of the performance “Un violador en tu camino” (A Rapist in Your Path) by the feminist collective LasTesis, published on the channel of the Costa Rican newspaper La Nación. The extracted comments were manually and automatically classified based on several criteria to determine their relevance to the video; as a result, 5,939 comments were identified as related to the video. These comments were annotated with the three attitude subdomains (affect, judgement, and appreciation) proposed in Appraisal Theory (Martin & White, 2005), as well as their polarity, target, fragment, and whether the attitude was implicit or explicit. The statistical analysis of the corpus highlights the predominantly negative evaluation of individuals in the comments on this social media platform.

pdf bib
Detecting Changing Culinary Trends Through Historical Recipes
Gauri Bhagwat | Marieke van Erp | Teresa Paccosi | Rik Hoekstra

Culinary trends evolve in response to social, economic, and cultural influences, reflecting broader historical transformations. We present an exploration of Dutch culinary trends from 1910 to 1995 by analysing recipes from housekeeping school cookbooks and newspaper recipe collections. Using computational techniques, we extract and examine ingredient frequency, recipe complexity, and shifts in recipe categories to identify trends in Dutch cuisine from a quantitative point of view. Additionally, we experiment with Large Language Models (LLMs) to structure and extract recipe features, demonstrating their potential for historical recipe parsing.

pdf bib
Towards Multilingual Haikus: Representing Accentuation to Build Poems
Fernando Bobillo | Maxim Ionov | Eduardo Mena | Carlos Bobed

The paradigm of neuro-symbolic Artificial Intelligence has received increasing attention in recent years as a way to improve the results of intelligent systems by combining symbolic and subsymbolic methods. For example, existing Large Language Models (LLMs) could be enriched by taking into account background knowledge encoded using semantic technologies, such as Linguistic Linked Data (LLD). In this paper, we claim that LLD can aid Large Language Models by providing the necessary information to compute the number of poetic syllables, which would help LLMs correctly generate poems with a valid meter. To do so, we propose an encoding for syllabic structure based on an extension of RDF vocabularies widely used in the field: POSTDATA and OntoLex-Lemon.

pdf bib
Assigning FrameNet Frames to a Croatian Verb Lexicon
Ivana Brač | Ana Ostroški Anić

This paper presents the Croatian verb lexicon Verbion that describes verbs on multiple levels. The semantic level includes verb senses, corresponding semantic classes according to VerbNet and WordNet, as well as semantic frames based on FrameNet. Each verb sense is linked to one or more valency frames, which include corpus-based examples accompanied by syntactic, morphological, and semantic analyses of each argument. This study focuses on assigning FrameNet frames to the verb misliti ‘think’ and its prefixed forms. Based on 170 manually annotated sentences, the paper discusses the advantages and challenges of assigning semantic frames to Croatian verbs.

pdf bib
Putting Low German on the Map (of Linguistic Linked Open Data)
Christian Chiarcos | Tabea Gröger | Christian Fäth

We describe the creation of a cross-dialectal lexical resource for Low German, a regional language spoken primarily in Germany and the Netherlands, based on the application of Linguistic Linked Open Data (LLOD) technologies. We argue that this approach is particularly well suited for a language without a written standard, but with multiple, incompatible orthographies and considerable internal variation in phonology, spelling and grammar. A major hurdle in preserving and documenting this variety, and in creating educational materials (such as texts and dictionaries) for it, is its internal degree of linguistic and orthographic variation, intensified by mutually exclusive influences from different national languages and their respective orthographies. We thus aim to provide a “digital Rosetta stone” to unify lexical materials from different dialects by linking dictionaries and mapping corresponding words without the need for a standard variety. This involves two components: a mapping between different orthographies and phonological systems, and a technology for linking regional dictionaries maintained by different hosts and developed by or for different communities of speakers.

pdf bib
Tracing Organisation Evolution in Wikidata
Marieke van Erp | Jiaqi Zhu | Vera Provatorova

Entities change over time, and while information about entity change is contained in knowledge graphs (KGs), it is often not stated explicitly. This makes KGs less useful for investigating entities over time, or for downstream tasks such as historical entity linking. In this paper, we present an approach and experiments that make entity change in Wikidata explicit. Our contributions are a mapping between an existing change ontology and Wikidata properties to identify types of change, and a dataset of entities with explicit evolution information, along with analytics on this dataset.
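
One concrete example of change information that is implicit in Wikidata is the “replaced by” property (P1366). A sketch of how such statements can be retrieved for organisations (the property choice is illustrative; the paper’s actual ontology-to-property mapping may differ):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative query: organisations (Q43229) that were replaced by another
# entity, one kind of implicit change signal. P1366 = "replaced by".
query = """
SELECT ?org ?orgLabel ?successor ?successorLabel WHERE {
  ?org wdt:P31/wdt:P279* wd:Q43229 ;
       wdt:P1366 ?successor .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="entity-evolution-sketch/0.1")
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["orgLabel"]["value"], "->", row["successorLabel"]["value"])
```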

pdf bib
Automated Concept Map Extraction from Text
Martina Galletti | Inès Blin | Eleni Ilkou

Concept Maps are semantic graph summary representations of relations between concepts in text. They are particularly beneficial for students with difficulty in reading comprehension, such as those with special educational needs and disabilities. The field of concept map extraction from text is currently outdated, relying on old baselines, limited datasets, and limited performance, with F1 scores below 20%. We propose a novel neuro-symbolic pipeline and a GPT-3.5-based method for automated concept map extraction from text, evaluated on the WIKI dataset. The pipeline is a robust, modularized, open-source architecture, the first to use semantic and neural techniques for automatic concept map extraction while also using a preliminary summarization component to reduce processing time and optimize computational resources. Furthermore, we investigate the large language model with zero-shot, one-shot, and decomposed prompting for concept map generation. Our approaches achieve state-of-the-art results on the METEOR metric, with F1 scores of 25.7 and 28.5, respectively, and in ROUGE-2 recall, with a score of 24.3 for both. This contribution advances the task of automated concept map extraction from text, opening doors to wider applications such as education and speech-language therapy. The code is openly available.

pdf bib
Ligt: Towards an Ecosystem for Managing Interlinear Glossed Texts with Linguistic Linked Data
Maxim Ionov

Ligt is an RDF vocabulary developed for representing Interlinear Glossed Text, a common representation of language material used in particular in field linguistics and linguistic typology. In this paper, we look at its current status and different aspects of its adoption. More specifically, we explore the questions of data conversion, storage, and exploitation. We present ligttools, a set of newly developed converters, report on a series of experiments regarding querying Ligt datasets, and analyse the performance with various infrastructure configurations.

pdf bib
A Corpus of Early Modern Decision-Making - the Resolutions of the States General of the Dutch Republic
Marijn Koolen | Rik Hoekstra

This paper presents a corpus of early modern Dutch resolutions made in the daily meetings of the States General, the central governing body of the Dutch Republic, over a period of 220 years, from 1576 to 1796. This corpus has been digitised from over half a million scans of mostly handwritten text, segmented into individual resolutions (decisions), and enriched with named entities and metadata extracted from the text of the resolutions. We developed a pipeline for automatic text recognition for historic Dutch, and a document segmentation approach that combines ML classifiers trained on annotated data with rule-based fuzzy matching of the highly formulaic language of the resolutions. The decisions that the States General made were often based on propositions (requests or proposals) submitted in writing by other governing bodies and by citizens of the republic. The resolutions contain information about these submitted propositions, including the persons and organisations who submitted them. The second part of this paper includes an analysis of the information about these proposition documents that can be extracted from the resolutions, and the potential to link the resolutions to their corresponding propositions using named entities and extracted metadata. We find that for the overwhelming majority of propositions, we can identify the name of the person or organisation who submitted it, making it feasible to (semi-)automatically link the resolutions to their corresponding proposition documents. This will allow historians and genealogists to study not only the decision-making of the States General in the early modern period, but also the concerns put forward by both high-ranking officials and regular citizens of the Republic.
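
The segmentation step combines ML classifiers with rule-based fuzzy matching of the resolutions’ formulaic language. A sketch of the fuzzy-matching idea using rapidfuzz (the opening formulas and threshold below are invented for illustration, not the project’s actual patterns):

```python
from rapidfuzz import fuzz

# Invented examples of formulaic resolution openers in early modern Dutch;
# a real system would use patterns and a threshold derived from the corpus.
FORMULAS = [
    "is gelesen de missive van",
    "op de requeste van",
]

def opens_new_resolution(line: str, threshold: int = 80) -> bool:
    """Return True if a line fuzzily matches a known resolution opener,
    tolerating spelling variation and text-recognition noise."""
    prefix = line.lower()[:40]
    return any(fuzz.partial_ratio(prefix, f) >= threshold for f in FORMULAS)

print(opens_new_resolution("Is gelesen de missiue van den ambassadeur"))
```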

pdf bib
Culturally Aware Content Moderation for Facebook Reels: A Cross-Modal Attention-Based Fusion Model for Bengali Code-Mixed Data
Momtazul Arefin Labib | Samia Rahman | Hasan Murad

The advancement of high-speed internet and affordable bandwidth has led to a significant increase in video content, bringing challenges in content moderation due to how quickly unsafe or harmful narratives spread. The rise of short-form videos like “Reels”, which are easy to create and consume, has intensified these challenges even more. Existing content moderation systems struggle with culture-specific Bengali content. To tackle these challenges within the culture-specific Bengali code-mixed domain, this paper introduces “UNBER”, a novel dataset of 1,111 multimodal Bengali code-mixed Facebook Reels categorized into four classes: Safe, Adult, Harmful, and Suicidal. Our contribution also involves the development of a unique annotation tool, “ReelAn”, to enable an efficient annotation process for reels. While many existing content moderation techniques have focused on resource-rich or monolingual languages, approaches for multimodal datasets in Bengali are rare. To fill this gap, we propose a culturally aware cross-modal attention-based fusion framework to enhance the analysis of these fast-paced videos, which achieved a macro F1 score of 0.75. Our contributions aim to significantly advance multimodal content moderation and lay the groundwork for future research in this area.
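
A minimal PyTorch sketch of cross-modal attention-based fusion in the spirit described above, where text features attend over video-frame features before classification (dimensions, pooling and head sizes are assumptions, not the paper’s exact architecture):

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Text tokens attend over video frame features; the pooled result is
    classified into the four classes. A sketch, not the paper's exact model."""
    def __init__(self, dim: int = 768, num_classes: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)  # Safe/Adult/Harmful/Suicidal

    def forward(self, text_feats, video_feats):
        # text_feats: (B, T_text, dim); video_feats: (B, T_frames, dim)
        fused, _ = self.attn(query=text_feats, key=video_feats, value=video_feats)
        return self.classifier(fused.mean(dim=1))  # mean-pool over text tokens

model = CrossModalFusion()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 4])
```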

pdf bib
LiITA: a Knowledge Base of Interoperable Resources for Italian
Eleonora Litta | Marco Carlo Passarotti | Valerio Basile | Cristina Bosco | Andrea Di Fabio | Paolo Brasolin

This paper describes the LiITA Knowledge Base of interoperable linguistic resources for Italian. By adhering to the Linked Open Data principles, LiITA ensures and facilitates interoperability between distributed resources. The paper outlines the lemma-centered architecture of the Knowledge Base and details its core component: the Lemma Bank, a collection of Italian lemmas designed to interlink distributed lexical and textual resources.

pdf bib
On the Feasibility of LLM-based Automated Generation and Filtering of Competency Questions for Ontologies
Zola Mahlaza | C. Maria Keet | Nanee Chahinian | Batoul Haydar

Competency questions (CQs) for ontologies are used in a number of ontology development tasks. The questions’ sentence structure has been analysed to inform ontology authoring and validation. One hurdle to making this a seamless process is writing good CQs manually, or offering automated assistance in writing them. In this paper, we propose an enhanced and automated pipeline in which each step can be traced meticulously, using a mini-corpus, T5, and the SQuAD dataset to generate questions, and the CLaRO controlled language, semantic similarity, and other steps for filtering. This was evaluated with two corpora of different genres in the same broad domain and assessed with domain experts. Across the experiments, around 25% of the final output questions were in scope and relevant, and around 45% were of unproblematic quality. Technically, the pipeline provided ample insight into trade-offs in generation and filtering: relaxing filtering increased sentence structure diversity but also led to more spurious sentences that required additional processing.

pdf bib
Terminology Enhanced Retrieval Augmented Generation for Spanish Legal Corpora
Patricia Martín Chozas | Pablo Calleja | Carlos Rodríguez Limón

This paper intends to highlight the importance of reusing terminologies in the context of Large Language Models (LLMs), particularly within a Retrieval-Augmented Generation (RAG) scenario. We explore the application of query expansion techniques using a controlled terminology enriched with synonyms. Our case study focuses on the Spanish legal domain, investigating both query expansion and improvements in retrieval effectiveness within the RAG model. The experimental setup includes various LLMs, such as Mistral, LLaMA3.2, and Granite 3, along with multiple Spanish-language embedding models. The results demonstrate that integrating current neural approaches with linguistic resources enhances RAG performance, reinforcing the role of structured lexical and terminological knowledge in modern NLP pipelines.
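
The query-expansion step can be pictured as appending controlled-vocabulary synonyms of recognised terms to the user query before retrieval. A minimal sketch (the terminology entries are invented; a real system would load the controlled legal terminology, e.g. from SKOS altLabels):

```python
# Invented synonym entries standing in for a controlled legal terminology.
TERMINOLOGY = {
    "demanda": ["reclamación", "acción judicial"],
    "sentencia": ["fallo", "resolución judicial"],
}

def expand_query(query: str) -> str:
    """Append known synonyms of recognised terms so the retriever can
    match documents that use variant wordings."""
    extra = [syn for term, syns in TERMINOLOGY.items()
             if term in query.lower() for syn in syns]
    return query if not extra else f"{query} {' '.join(extra)}"

print(expand_query("¿Qué plazo tiene una demanda civil?"))
```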

pdf bib
Cuaċ: Fast and Small Universal Representations of Corpora
John Philip McCrae | Bernardo Stearns | Alamgir Munir Qazi | Shubhanker Banerjee | Atul Kr. Ojha

The increasing size and diversity of corpora in natural language processing require highly efficient processing frameworks. Building on the universal corpus format Teanga, we present Cuaċ, a format for the compact representation of corpora. We describe this methodology, based on short-string compression and indexing techniques, and show that the files it creates are comparable in size to compressed human-readable serializations and can be further compressed using lossless compression. We also show that this introduces no computational penalty on the time to process files. This methodology aims to speed up natural language processing pipelines and is the basis for a fast database system for corpora.

pdf bib
Systematic Textual Availability of Manuscripts
Hadar Miller | Samuel Londner | Tsvi Kuflik | Daria Vasyutinsky Shapira | Nachum Dershowitz | Moshe Lavee

The digital era has made millions of manuscript images in Hebrew available to all. However, despite major advancements in handwritten text recognition over the past decade, an efficient pipeline for large-scale and accurate conversion of these manuscripts into useful machine-readable form is still sorely lacking. We propose a pipeline that significantly improves recognition models for automatic transcription of Hebrew manuscripts. Transfer learning is used to fine-tune pretrained models. For post-recognition correction, the pipeline leverages text reuse, a common phenomenon in medieval manuscripts, and state-of-the-art large language models for medieval Hebrew. The framework successfully handles noisy transcriptions and consistently suggests alternate, better readings. Initial results show that word-level accuracy increased by 10% for new readings proposed by text-reuse detection. Moreover, character-level accuracy improved by 18% by fine-tuning models on the first few pages of each manuscript.

pdf bib
Towards Semantic Integration of Opinions: Unified Opinion Concepts Ontology and Extraction Task
Gaurav Negi | Dhairya Dalal | Omnia Zayed | Paul Buitelaar

This paper introduces the Unified Opinion Concepts (UOC) ontology to integrate opinions within their semantic context. The UOC ontology bridges the gap between the semantic representation of opinion across different formulations. It is a unified conceptualisation based on the facets of opinions studied extensively in NLP and semantic structures described through symbolic descriptions. We further propose the Unified Opinion Concept Extraction (UOCE) task of extracting opinions from the text with enhanced expressivity. Additionally, we provide a manually extended and re-annotated evaluation dataset for this task and tailored evaluation metrics to assess the adherence of extracted opinions to UOC semantics. Finally, we establish baseline performance for the UOCE task using state-of-the-art generative models.

pdf bib
Creating and enriching a repository of 177k interlinearized examples in 1611 mostly lesser-resourced languages
Sebastian Nordhoff

Much of NLP is concerned with languages for which dictionaries, thesauri, word nets or treebanks are available. This contribution focuses on languages for which all we have might be some isolated examples with word-to-word translation. We detail the collection, aggregation, storage and querying of this database of 177k examples from 1611 languages, with a special eye on enrichment via Named Entity Recognition and links to the Wikidata ontology. We also discuss pitfalls of the approach and the legal status of interlinear examples.

pdf bib
Linking the Lexicala Latin-French Dictionary to the LiLa Knowledge Base
Adriano De Paoli | Marco Carlo Passarotti | Paolo Ruffolo | Giovanni Moretti | Ilan Kernerman

This paper presents the integration of the Lexicala Latin–French Dictionary into the LiLa Knowledge Base of linguistic resources for Latin made interoperable through their publication as Linked Open Data. The entries of the dictionary are linked to the large collection of Latin lemmas of LiLa (Lemma Bank), enabling interaction with the other resources published therein. The paper details the data modelling process, the linking methodology, and a couple of practical use cases, showing how interlinking resources via LOD can support advancement in (multilingual) linguistic research.

pdf bib
DynaMorphPro: A New Diachronic and Multilingual Lexical Resource in the LLOD ecosystem
Matteo Pellegrini | Valeria Irene Boano | Francesco Gardani | Francesco Mambrini | Giovanni Moretti | Marco Carlo Passarotti

This paper describes the release as Linguistic Linked Open Data of DynaMorphPro, a lexical resource recording loanwords, conversions and class-shifts from Latin to Old Italian. We show how existing vocabularies are reused and integrated to allow for a rich semantic representation of these data. Our main reference is the OntoLex-lemon model for lexical information, but classes and properties from many other ontologies are also reused to express other aspects. In particular, we identify the CIDOC Conceptual Reference Model as the ideal tool to convey chronological information on historical processes of lexical innovation and change, and describe how it can be integrated with OntoLex-lemon.

pdf bib
Exploring Medium-Sized LLMs for Knowledge Base Construction
Tomás Cerveira Da Cruz Pinto | Hugo Gonçalo Oliveira | Chris-Bennet Fleger

Knowledge base construction (KBC) is one of the great challenges in Natural Language Processing (NLP) and of fundamental importance to the growth of the Semantic Web. Large Language Models (LLMs) may be useful for extracting structured knowledge, including subject-predicate-object triples. We tackle the LM-KBC 2023 Challenge by leveraging LLMs for KBC, utilizing its dataset and benchmarking our results against challenge participants. Prompt engineering and ensemble strategies are tested for object prediction with pretrained LLMs in the 0.5-2B parameter range, which lies between the limits of tracks 1 and 2 of the challenge. Selected models are assessed with zero-shot and few-shot learning approaches when predicting the objects of 21 relations. Results demonstrate that instruction-tuned LLMs outperform generative baselines by up to four times, with relation-adapted prompts playing a crucial role in performance. The ensemble approach further enhances triple extraction, with a relation-based selection strategy achieving the highest F1 score. These findings highlight the potential of medium-sized LLMs and prompt engineering methods for efficient KBC.
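
Relation-adapted prompting for object prediction can be sketched as filling a per-relation template with few-shot examples (the template and examples below are invented for illustration, not the submission’s actual prompts):

```python
# Sketch of a relation-adapted few-shot prompt for object prediction in the
# style of the LM-KBC setting; template and examples are invented.
def build_prompt(subject, relation, examples):
    templates = {  # one natural-language template per relation
        "CountryBordersCountry": "{} shares a border with",
    }
    lines = [templates[relation].format(s) + f" {o}." for s, o in examples]
    lines.append(templates[relation].format(subject))  # model completes this
    return "\n".join(lines)

print(build_prompt("Austria", "CountryBordersCountry",
                   [("France", "Spain"), ("Norway", "Sweden")]))
```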

pdf bib
Breaking Ties: Some Methods for Refactoring RST Convergences
Andrew Potter

Among the set of schemata specified by Rhetorical Structure Theory is a pattern known variously as the request schema, satellite tie, multisatellite nucleus, or convergence. The essential feature of this schema is that it permits multiple satellites to attach to a single nucleus. Although the schema has long been considered fundamental to RST, it has never been subjected to detailed evaluation. This paper provides such an assessment. Close examination shows that it results in structures that are ambiguous, disjoint, incomplete, and sometimes incoherent. Fortunately, however, further examination shows it to be unnecessary. This paper describes the difficulties with convergences and presents methods for refactoring them as explicit specifications of text structure. The study shows that convergences can be more clearly rendered not as flat relational conjunctions, but rather as organized expressions of cumulative rhetorical moves, wherein each move asserts an identifiable structural integrity and the expressions conform to specifiable scoping rules.

pdf bib
Enhancing Information Extraction with Large Language Models: A Comparison with Human Annotation and Rule-Based Methods in a Real Estate Case Study
Renzo Alva Principe | Marco Viviani | Nicola Chiarini

Information Extraction (IE) is a key task in Natural Language Processing (NLP) that transforms unstructured text into structured data. This study compares human annotation, rule-based systems, and Large Language Models (LLMs) for domain-specific IE, focusing on real estate auction documents. We assess each method in terms of accuracy, scalability, and cost-efficiency, highlighting the associated trade-offs. Our findings provide valuable insights into the effectiveness of using LLMs for the considered task and, more broadly, offer guidance on how organizations can balance automation, maintainability, and performance when selecting the most suitable IE solution.

pdf bib
When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection
Alamgir Munir Qazi | John Philip McCrae | Jamal Nasir

The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
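
The retrieve-then-classify idea behind DeReC can be sketched with general-purpose sentence embeddings: embed the claim, retrieve the nearest evidence, and hand the (claim, evidence) pair to a lightweight classifier. A sketch only (model name and corpus are illustrative; DeReC’s actual components may differ):

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative encoder and toy evidence corpus.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
corpus = [
    "The city council approved the budget in March.",
    "No such law was ever passed by parliament.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

claim = "Parliament passed the law last year."
claim_emb = encoder.encode(claim, convert_to_tensor=True)

hits = util.semantic_search(claim_emb, corpus_emb, top_k=1)[0]
evidence = corpus[hits[0]["corpus_id"]]
# A specialized classifier (not shown) would score the (claim, evidence) pair.
print("Top evidence:", evidence)
```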

pdf bib
Old Reviews, New Aspects: Aspect Based Sentiment Analysis and Entity Typing for Book Reviews with LLMs
Andrea Schimmenti | Stefano De Giorgis | Fabio Vitali | Marieke van Erp

This paper addresses the problem of the limited availability of datasets for Aspect-Based Sentiment Analysis (ABSA) in the Cultural Heritage domain. Currently, the main datasets for ABSA are product or restaurant reviews; we expand this to book reviews. Our methodology employs an LLM to maintain domain relevance while preserving the linguistic authenticity and natural variations found in genuine reviews. Entity types are annotated with the tool Text2AMR2FRED and evaluated manually. Additionally, we fine-tuned Llama 3.1 8B as a baseline model that not only performs ABSA but also performs Entity Typing (ET) with a set of classes from the DOLCE foundational ontology, enabling precise categorization of target aspects within book reviews. We present three key contributions as a step towards expanding ABSA: 1) a semi-synthetic set of book reviews, 2) an evaluation of Llama-3.1-Instruct 8B on the ABSA task, and 3) a fine-tuned version of Llama-3.1-Instruct 8B for ABSA.

pdf bib
Making Sign Language Research Findable: The sign-lang@LREC Anthology and the Sign Language Dataset Compendium
Marc Schulder | Thomas Hanke | Maria Kopf

Resources and research on sign languages are sparse and can often be difficult to locate. Few centralised sources of information exist. This article presents two repositories that aim to improve the findability of such information through the implementation of open science best practices. The sign-lang@LREC Anthology is a repository of publications on sign languages in the series of sign-lang@LREC workshops and related events, enhanced with indices cataloguing what datasets, tools, languages and projects are addressed by these publications. The Sign Language Dataset Compendium provides an overview of existing linguistic corpora, lexical resources and data collection tasks. We describe the evolution of these repositories, covering topics such as supplementary information structures, rich metadata, interoperability, and dealing with the challenges of reference rot.

pdf bib
Conversational Lexicography: Querying Lexicographic Data on Knowledge Graphs with SPARQL through Natural Language
Kilian Sennrich | Sina Ahmadi

Knowledge graphs offer an excellent solution for representing the lexical-semantic structures of lexicographic data. However, working with the SPARQL query language represents a considerable hurdle for many non-expert users who could benefit from the advantages of this technology. This paper addresses the challenge of creating natural language interfaces for lexicographic data retrieval on knowledge graphs such as Wikidata. We develop a multidimensional taxonomy capturing the complexity of Wikidata’s lexicographic data ontology module through four dimensions and create a template-based dataset with over 1.2 million mappings from natural language utterances to SPARQL queries. Our experiments with GPT-2 (124M), Phi-1.5 (1.3B), and GPT-3.5-Turbo reveal significant differences in model capabilities. While all models perform well on familiar patterns, only GPT-3.5-Turbo demonstrates meaningful generalization capabilities, suggesting that model size and diverse pre-training are crucial for adaptability in this domain. However, significant challenges remain in achieving robust generalization, handling diverse linguistic data, and developing scalable solutions that can accommodate the full complexity of lexicographic knowledge representation.
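
Template-based generation of (utterance, SPARQL) pairs over Wikidata’s lexicographic module can be sketched as follows (a simplified, hypothetical template; the dataset’s actual templates cover four dimensions of complexity):

```python
# One hypothetical template pairing a natural-language question with a
# SPARQL query over Wikidata lexemes.
TEMPLATE_NL = "What are the senses of the {language} lexeme '{lemma}'?"
TEMPLATE_SPARQL = """
SELECT ?sense ?gloss WHERE {{
  ?lexeme dct:language wd:{lang_qid} ;
          wikibase:lemma "{lemma}"@{lang_code} ;
          ontolex:sense ?sense .
  ?sense skos:definition ?gloss .
}}"""

def make_pair(lemma, language, lang_qid, lang_code):
    return (TEMPLATE_NL.format(language=language, lemma=lemma),
            TEMPLATE_SPARQL.format(lang_qid=lang_qid, lemma=lemma,
                                   lang_code=lang_code))

nl, sparql = make_pair("run", "English", "Q1860", "en")
print(nl)
print(sparql)
```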

pdf bib
GrEma: an HTR model for automated transcriptions of the Girifalco asylum’s medical records
Grazia Serratore | Emanuela Nicole Donato | Erika Pasceri | Antonietta Folino | Maria Chiaravalloti

This paper deals with the digitization and transcription of medical records from the historical archive of the former psychiatric hospital of Girifalco (Catanzaro, Italy). The digitization is carried out in the premises where the asylum once stood and the historical archive is stored. Using the ScanSnap SV600 flatbed scanner, a copy faithful to the original is produced for each document contained within the medical records. Subsequently, the different training phases of a Handwritten Text Recognition model with the Transkribus tool are presented. The transcription aims to obtain texts in an interoperable format, and it was applied exclusively to the clinical documents, such as the informative form, the nosological table and the clinical diary. This paper describes the training phases of a customized model for medical record transcription, named GrEma, presenting its benefits, limitations and possible future applications. This work was carried out ensuring compliance with current legislation on the protection of personal data. It also highlights the importance of digitization and transcription for the recovery and preservation of historical archives from former psychiatric institutions, ensuring these valuable documents remain accessible for future research and potential users.

pdf bib
Constructing a liberal identity via political speech: Tracking lifespan change in the Icelandic Gigaword Corpus
Lilja Björk Stefánsdóttir | Johanna Mechler | Anton Karl Ingason

We examine individual lifespan change in the speech of an Icelandic MP, Þorgerður Gunnarsdóttir, who style-shifts after she switches parties, by becoming less formal as her political stance becomes more liberal. We make use of the resources of the Icelandic Gigaword Corpus, more specifically the Parliament section of that corpus, demonstrating how the reinvention of an identity in politics can be tracked by studying the collection of speeches given by a politician over time.

pdf bib
Towards Sense to Sense Linking across DBnary Languages
Gilles Sérasset

Since 2012, the DBnary project extracts lexical information from different Wiktionary language editions (26 editions in 2025) and makes it available to the community as queryable RDF data (modeled using ontolex-lemon ontology). This dataset contains more than 12M translations linking languages at the level of Lexical Entries. This paper presents an effort to automatically link the DBnary languages at the Lexical Sense level. For this we explore different ways to compute cross-lingual semantic similarity, using multilingual language models.
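
Cross-lingual semantic similarity between sense definitions can be sketched with a multilingual sentence encoder (the model choice and definitions below are illustrative, not the paper’s exact setup):

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative multilingual encoder; definitions are invented examples.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sense_en = "A domesticated carnivorous mammal kept as a pet."   # 'dog'
senses_fr = [
    "Mammifère carnivore domestique apparenté au loup.",        # 'chien', sense 1
    "Pièce d'une arme à feu servant à percuter l'amorce.",      # 'chien', sense 2
]

emb_en = model.encode(sense_en, convert_to_tensor=True)
emb_fr = model.encode(senses_fr, convert_to_tensor=True)
scores = util.cos_sim(emb_en, emb_fr)[0]
best = int(scores.argmax())
print(f"Best matching sense: {best} (cos = {float(scores[best]):.2f})")
```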

pdf bib
Empowering Recommender Systems using Automatically Generated Knowledge Graphs and Reinforcement Learning
Ghanshyam Verma | Simanta Sarkar | Devishree Pillai | Huan Chen | John Philip McCrae | János A. Perge | Shovon Sengupta | Paul Buitelaar

Personalized recommender systems play a crucial role in direct marketing, particularly in financial services, where delivering relevant content can enhance customer engagement and promote informed decision-making. This study explores interpretable knowledge graph (KG)-based recommender systems by proposing two distinct approaches for personalized article recommendations within a multinational financial services firm. The first approach leverages Reinforcement Learning (RL) to traverse a KG constructed from both structured (tabular) and unstructured (textual) data, enabling interpretability through Path Directed Reasoning (PDR). The second approach employs the XGBoost algorithm, with post-hoc explainability techniques such as SHAP and ELI5 to enhance transparency. By integrating machine learning with automatically generated KGs, our methods not only improve recommendation accuracy but also provide interpretable insights, facilitating more informed decision-making in customer relationship management.

pdf bib
The EuroVoc Thesaurus: Management, Applications, and Future Directions
Lucy Walhain | Sébastien Albouze | Anikó Gerencsér | Mihai Paunescu | Vassilis Tzouvaras | Cosimo Palma

This paper provides a comprehensive overview of EuroVoc, the European Union’s multilingual thesaurus. The paper highlights EuroVoc’s significance in the legislative and publications domain, examining its applications in improving information retrieval systems and multi-label text classification methods. Various technological tools developed specifically for EuroVoc classification, including JEX, PyEuroVoc, and KEVLAR, are reviewed, demonstrating the evolution from basic classification systems to sophisticated neural architectures. Additionally, the paper addresses the management practices governing EuroVoc’s continuous updating and expansion through collaborative tools such as VocBench, emphasising the role of interinstitutional committees and specialised teams in maintaining the thesaurus’s accuracy and relevance. A substantial part of the paper is dedicated to EuroVoc’s alignment with other semantic resources like Wikidata and UNESCO, detailing the challenges and methodologies adopted to facilitate semantic interoperability across diverse information systems. Finally, the paper identifies future directions, including modular extensions of EuroVoc, federated models, linked data approaches, thematic hubs, selective integration, and collaborative governance frameworks.


pdf (full)
bib (full)
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion

pdf bib
Proceedings of the 5th Conference on Language, Data and Knowledge: Fifth Workshop on Language Technology for Equality, Diversity, Inclusion
Katerina Gkirtzou | Slavko Žitnik | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov

pdf bib
SSNCSE@LT-EDI-2025: Detecting Misogyny Memes using Pretrained Deep Learning models
Sreeja K | Bharathi B

Misogyny meme detection is the task of identifying memes that are harmful or offensive to women. These memes can hide hate behind jokes or images, making them difficult to identify, and detecting them is important for a safer and more respectful internet. We propose a multimodal method for misogyny meme detection in Chinese social media that combines the textual and visual aspects of memes. The training and evaluation data were part of a shared task on detecting misogynistic content. We used a pretrained ResNet-50 architecture to extract visual representations of the memes and processed the meme transcriptions with BERT. The model fuses the modality-specific representations with a feed-forward neural network for classification. The pretrained encoders were frozen to avoid overfitting and to enhance generalization across all classes, and only the final classifier was fine-tuned on the labelled meme collection. The model achieved a macro F1-score of 0.70345 on the test data. Our results validate lightweight multimodal fusion approaches on noisy social media data in the context of hostile meme detection.

pdf bib
SSNCSE@LT-EDI-2025: Speech Recognition for Vulnerable Individuals in Tamil
Sreeja K | Bharathi B

Speech recognition is a helpful tool for accessing technology and allows people to interact with technology naturally. This is especially true for people who may encounter challenges interacting with technology in traditional formats, such as the elderly or people from the transgender community. This research presents an Automatic Speech Recognition (ASR) system developed for Tamil-speaking elderly and transgender people, who are generally underrepresented in mainstream ASR training datasets. The proposed work used the speech data shared by the task organisers of LT-EDI 2025. We fine-tuned OpenAI’s Whisper model using Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA), along with SpecAugment and the AdamW optimizer. The model achieved an overall Word Error Rate (WER) of 42.3% on the untranscribed test data. A key feature of our work is that it demonstrates the potential of equitable and accessible ASR systems that address the linguistic and acoustic features of vulnerable groups.
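
LoRA-based parameter-efficient fine-tuning of Whisper can be sketched with the Hugging Face transformers and peft libraries (rank, target modules and model size below are common illustrative defaults, not necessarily the submission’s settings):

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative base model and LoRA hyperparameters.
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable

# Training would then proceed with an AdamW optimizer and SpecAugment-style
# masking applied to the input features (not shown).
```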

pdf bib
CrewX@LT-EDI-2025: Transformer-Based Tamil ASR Fine-Tuning with AVMD Denoising and GRU-VAD for Enhanced Transcription Accuracy
Ganesh Sundhar S | Hari Krishnan N | Arun Prasad T D | Shruthikaa V | Jyothish Lal G

This research presents an improved Tamil Automatic Speech Recognition (ASR) system designed to enhance accessibility for elderly and transgender populations by addressing unique language challenges. We address the challenges of Tamil ASR—including limited high-quality curated datasets, unique phonetic characteristics, and word-merging tendencies—through a comprehensive pipeline. Our methodology integrates Adaptive Variational Mode Decomposition (AVMD) for selective noise reduction based on signal characteristics, Silero Voice Activity Detection (VAD) with GRU architecture to eliminate non-speech segments, and fine-tuning of OpenAI’s Whisper model optimized for Tamil transcription. The system employs beam search decoding during inference to further improve accuracy. Our approach achieved state-of-the-art performance with a Word Error Rate (WER) of 31.9, winning first place in the LT-EDI 2025 shared task.

pdf bib
JUNLP@LT-EDI-2025: Efficient Low-Rank Adaptation of Whisper for Inclusive Tamil Speech Recognition Targeting Vulnerable Populations
Priyobroto Acharya | Soham Chaudhuri | Sayan Das | Dipanjan Saha | Dipankar Das

Speech recognition has received extensive research attention in recent years. It becomes much more challenging when the speaker’s age, gender and other factors introduce variations in the speech. In this work, we propose a fine-tuned automatic speech recognition model derived from OpenAI’s whisper-large-v2. Though we experimented with both Whisper-large and Wav2vec2-XLSR-large, the lower WER of Whisper-large showed it to be the superior model. We secured 4th rank in the LT-EDI-2025 shared task. Our implementation details and code are available at our GitHub repository.

pdf bib
SKVtrio@LT-EDI-2025: Hybrid TF-IDF and BERT Embeddings for Multilingual Homophobia and Transphobia Detection in Social Media Comments
Konkimalla Laxmi Vignesh | Mahankali Sri Ram Krishna | Dondluru Keerthana | Premjith B

This paper describes our submission to the Shared Task on Homophobia and Transphobia Detection in Social Media Comments, LT-EDI at LDK 2025. We propose a hybrid approach to detecting homophobic and transphobic content in low-resource languages using Term Frequency-Inverse Document Frequency (TF-IDF) and Bidirectional Encoder Representations from Transformers (BERT) for contextual embeddings. TF-IDF captures token importance, whereas BERT generates contextualized embeddings. This hybridization produces an embedding that combines statistical surface-level patterns with deep semantic understanding. The system uses principal component analysis (PCA) and a random forest classifier. PCA converts a sparse, very high-dimensional embedding into a dense representation by keeping only the most relevant features. The model achieved robust performance across eight Indian languages, with the highest accuracy in Hindi. However, lower performance in Marathi highlights challenges in low-resource settings. Combining TF-IDF and BERT embeddings leads to better classification results, showing the benefits of integrating simple and complex language models. Limitations include potential feature redundancy and poor performance in languages with complex word forms, indicating a need for future adjustments to support multiple languages and address imbalances.
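
The hybrid feature idea can be sketched by concatenating TF-IDF vectors with contextual embeddings, compressing with PCA, and classifying with a random forest (a toy sketch; here a multilingual sentence encoder stands in for pooled BERT embeddings, and all data are invented):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

# Invented toy data; a real run would use the shared-task comments.
texts = ["example comment one", "another example comment",
         "a third comment", "yet another comment"]
labels = [0, 1, 0, 1]

tfidf = TfidfVectorizer().fit_transform(texts).toarray()
bert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2").encode(texts)

features = np.hstack([tfidf, bert])        # statistical + contextual features
features = PCA(n_components=3).fit_transform(features)  # toy dimensionality

clf = RandomForestClassifier(random_state=0).fit(features, labels)
print(clf.predict(features[:2]))
```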

pdf bib
Dll5143A@LT-EDI 2025: Bias-Aware Detection of Racial Hoaxes in Code-Mixed Social Media Data (BaCoHoax)
Ashok Yadav | Vrijendra Singh

The proliferation of racial hoaxes that associate individuals or groups with fabricated crimes or incidents presents unique challenges in multilingual social media contexts. This paper introduces BaCoHoax, a novel framework for detecting race-based misinformation in code-mixed content. We address this problem by participating in the shared task “Detecting Racial Hoaxes in Code-Mixed Hindi-English Social Media Data: LT-EDI@LDK 2025”. BaCoHoax is a bias-aware detection system built on a DeBERTa-based architecture, enhanced with disentangled attention mechanisms, a dynamic bias discovery module that adapts to emerging narrative patterns, and an adaptive contrastive learning objective. We evaluated BaCoHoax on the HoaxMixPlus corpus, a collection of 5,105 YouTube comments annotated for racial hoaxes, achieving a competitive macro F1 score of 0.67 and securing 7th place among participating teams in the shared task. Our findings contribute to the growing field of multilingual misinformation detection and highlight the importance of culturally informed approaches to identifying harmful content in linguistically diverse online spaces.

pdf bib
Hope_for_best@LT-EDI 2025: Detecting Racial Hoaxes in Code-Mixed Hindi-English Social Media Data using a multi-phase fine-tuning strategy
Abhishek Singh Yadav | Deepawali Sharma | Aakash Singh | Vivek Kumar Singh

In the age of digital communication, social media platforms have become a medium for the spread of misinformation, with racial hoaxes posing a particularly insidious threat. These hoaxes falsely associate individuals or communities with crimes or misconduct, perpetuating harmful stereotypes and inflaming societal tensions. This paper describes the submission of team “Hope_for_best”, which addresses the challenge of detecting racial hoaxes in code-mixed Hindi-English (Hinglish) social media content and secured 2nd rank in the shared task (Chakravarthi et al., 2025). To address this challenge, the study employs the HoaxMixPlus dataset, developed by LT-EDI 2025, and adopts a multi-phase fine-tuning strategy. Initially, models are sensitized using the THAR dataset of targeted hate speech against religion (Sharma et al., 2024) to adjust weights toward contextually relevant biases. Further fine-tuning was then performed on the HoaxMixPlus dataset. This work employed data-balancing sampling strategies to mitigate class imbalance. Among the evaluated models, HingBERT achieved the highest macro F1-score of 73%, demonstrating promising capabilities in detecting racially charged misinformation in code-mixed Hindi-English texts.

pdf bib
CVF-NITT@LT-EDI-2025: Misogyny Detection
Radhika K T | Sitara K

Online platforms have enabled users to create and share multimodal content, fostering new forms of personal expression and cultural interaction. Among these, memes—combinations of images and text—have become a prevalent mode of digital communication, often used for humor, satire, or social commentary. However, memes can also serve as vehicles for spreading misogynistic messages, reinforcing harmful gender stereotypes, and targeting individuals based on gender. In this work, we investigate the effectiveness of various multimodal models for detecting misogynistic content in memes. We propose a BERT+CLIP+LR model that integrates BERT’s deep contextual language understanding with CLIP’s powerful visual encoder, followed by Logistic Regression for classification. This approach leverages complementary strengths of vision-language models for robust cross-modal representation. We compare our proposed model with several baselines, including the original CLIP+LR, and traditional early fusion methods such as BERT + ResNet50 and CNN + InceptionV3. Our focus is on accurately identifying misogynistic content in Chinese memes, with careful attention to the interplay between visual elements and textual cues. Experimental results show that the BERT+CLIP+LR model achieves a macro F1 score of 0.87, highlighting the effectiveness of vision-language models in addressing harmful content on social media platforms.

pdf bib
Wise@LT-EDI-2025: Combining Classical and Neural Representations with Multi-scale Ensemble Learning for Code-mixed Hate Speech Detection
Ganesh Sundhar S | Durai Singh K | Gnanasabesan G | Hari Krishnan N | Mc Dhanush

Detecting hate speech targeting caste and migration communities in code-mixed Tamil-English social media content is challenging due to limited resources and socio-cultural complexities. This paper proposes a multi-scale hybrid architecture combining classical and neural representations with hierarchical ensemble learning. We employ advanced preprocessing, including transliteration and character repetition removal, then extract features using classical TF-IDF vectors at multiple scales (512, 1024, 2048) processed through linear layers, alongside contextual embeddings from five transformer models: Google BERT, XLM-RoBERTa (Base and Large), SeanBenhur BERT, and IndicBERT. These concatenated representations encode both statistical and contextual information and are input to multiple ML classification heads (Random Forest, SVM, etc.). A three-level hierarchical ensemble strategy combines predictions across classifiers, transformer-TF-IDF combinations, and dimensional scales for enhanced robustness. Our method scored an F1-score of 0.818, ranking 3rd in the LT-EDI-2025 shared task, showing the efficacy of blending classical and neural methods with multi-level ensemble learning for hate speech detection in low-resource languages.

pdf bib
CUET’s_White_Walkers@LT-EDI 2025: Racial Hoax Detection in Code-Mixed on Social Media Data
Md. Mizanur Rahman | Jidan Al Abrar | Md. Siddikul Imam Kawser | Ariful Islam | Md. Mubasshir Naib | Hasan Murad

False narratives that manipulate racial tensions are increasingly prevalent on social media, often blending languages and cultural references to enhance reach and believability. Among them, racial hoaxes produce unique harm by fabricating events targeting specific communities, sowing social division and fueling misinformation. This paper presents a novel approach to detecting racial hoaxes in code-mixed Hindi-English social media data. Using a carefully constructed training pipeline, we fine-tuned the XLM-RoBERTa-base multilingual transformer on the shared task data. Our approach incorporated task-specific preprocessing, a clear methodology, and extensive hyperparameter tuning. After developing our model, we tested and evaluated it on the LT-EDI@LDK 2025 shared task dataset. Our system achieved the highest performance among all the international participants with an F1-score of 0.75, ranking 1st on the official leaderboard.

pdf bib
CUET’s_White_Walkers@LT-EDI-2025: A Multimodal Framework for the Detection of Misogynistic Memes in Chinese Online Content
Md. Mubasshir Naib | Md. Mizanur Rahman | Jidan Al Abrar | Md. Mehedi Hasan | Md. Siddikul Imam Kawser | Mohammad Shamsul Arefin

Memes, combining visual and textual elements, have emerged as a prominent medium for both expression and the spread of harmful ideologies, including misogyny. To address this issue in Chinese online content, we present a multimodal framework for misogyny meme detection as part of the LT-EDI@LDK 2025 Shared Task. Our study investigates a range of machine learning (ML) methods such as Logistic Regression, Support Vector Machines, and Random Forests, as well as deep learning (DL) architectures including CNNs and hybrid models like BiLSTM-CNN and CNN-GRU for extracting textual features. On the transformer side, we explored multiple pretrained models including mBERT, MuRIL, and BERT-base-chinese to capture nuanced language representations. These textual models were fused with visual features extracted from pretrained ResNet50 and DenseNet121 architectures using both early and decision-level fusion strategies. Among all evaluated configurations, the BERT-base-chinese + ResNet50 early fusion model achieved the best overall performance, with a macro F1-score of 0.8541, ranking 4th in the shared task. These findings underscore the effectiveness of combining pretrained vision and language models for tackling multimodal hate speech detection.

pdf bib
CUET’s_White_Walkers@LT-EDI 2025: Transformer-Based Model for the Detection of Caste and Migration Hate Speech
Jidan Al Abrar | Md. Mizanur Rahman | Ariful Islam | Md. Mehedi Hasan | Md. Mubasshir Naib | Mohammad Shamsul Arefin

Hate speech on social media is an evolving problem, particularly in low-resource languages like Tamil, where traditional hate speech detection approaches remain underdeveloped. In this work, we provide a focused solution for caste- and migration-based hate speech detection using Tamil-BERT, a Tamil-specialized pre-trained transformer model. One of the key challenges in hate speech detection is the severe class imbalance in the dataset, with hate speech being the minority class. We address this using focal loss, a loss function that gives more importance to harder-to-classify examples, improving the performance of the model in detecting minority classes. We train our model on a publicly available dataset of Tamil text labeled as hate and non-hate speech. Under strict evaluation, our approach achieves impressive results, outperforming baseline models by a considerable margin. The model achieves an F1 score of 0.8634, with good precision, recall, and accuracy, making it a robust solution for hate speech detection in Tamil. The results show that fine-tuning transformer-based models like Tamil-BERT, coupled with techniques like focal loss, can substantially improve hate speech detection performance for low-resource languages. This work contributes to this growing body of research and provides insights into how to tackle class imbalance in NLP tasks.
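
Focal loss down-weights easy examples so that the minority hate class contributes more to the gradient. A minimal sketch (Lin et al., 2017); the gamma and alpha values below are common defaults, not the paper’s tuned settings:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss sketch: scale cross-entropy by (1 - p_t)^gamma so that
    well-classified examples contribute little to the total loss."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                    # probability of the true class
    return (alpha * (1 - p_t) ** gamma * ce).mean()

logits = torch.tensor([[2.0, 0.5], [0.2, 1.5]])
targets = torch.tensor([0, 1])
print(focal_loss(logits, targets))
```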

pdf bib
NS@LT-EDI-2025: Caste and Migration-Based Hate Speech Detection
Nishanth S | Shruthi Rengarajan | Sachin Kumar S

Hate speech directed at caste and migrant communities is a widespread problem on social media, frequently taking the form of insults specific to a given region, coded language, and disparaging slurs. This type of abuse not only perpetuates discrimination but also seriously jeopardizes individual well-being and social harmony. In order to promote safer and more inclusive digital environments, it is imperative that this challenge be addressed. However, linguistic subtleties, code-mixing, and the lack of extensive annotated datasets make it difficult to detect such hate speech in Indian languages like Tamil. We suggest a supervised machine learning system that uses FastText embeddings specifically designed for Tamil-language content and Whisper-based speech recognition to address these issues. This strategy aims to precisely identify hate speech connected to caste and migration, supporting the larger endeavor to reduce online abuse in low-resource languages like Tamil.

pdf bib
SSN_IT_HATE@LT-EDI-2025: Caste and Migration Hate Speech Detection
Maria Nancy C | Radha N | Swathika R

This paper proposes a transformer-based methodology for detecting hate speech in Tamil, developed as part of the shared task on Caste and Migration Hate Speech Detection. Leveraging the multilingual BERT (mBERT) model, we fine-tune it to classify Tamil social media content into caste/migration-related hate speech and non-hate-speech categories. Our approach achieves a macro F1-score of 0.72462 on the development dataset, demonstrating the effectiveness of multilingual pretrained models in low-resource language settings. The code for this work is available on GitHub (Hate-Speech Deduction).

pdf bib
ItsAllGoodMan@LT-EDI-2025: Fusing TF-IDF and MuRIL Embeddings for Detecting Caste and Migration Hate Speech
Amritha Nandini K L | Vishal S | Giri Prasath R | Anerud Thiyagarajan | Sachin Kumar S

Caste and migration hate speech detection is a critical task in the context of increasingly multilingual and diverse online discourse. In this work, we address the problem of identifying hate speech targeting caste and migrant communities across a multilingual social media dataset containing Tamil, Tamil written in English script, and English. We explore and compare different feature representations, including TF-IDF vectors and embeddings from pretrained transformer-based models, to train various machine learning classifiers. Our experiments show that a Soft Voting Classifier that makes use of both TF-IDF vectors and MuRIL embeddings performs best, achieving a macro F1 score of 0.802 on the test set. This approach was evaluated as part of the Shared Task on Caste and Migration Hate Speech Detection at LT-EDI@LDK 2025, where it ranked 6th overall.

pdf bib
NSR_LT-EDI-2025 Automatic speech recognition in Tamil
Nishanth S | Shruthi Rengarajan | Burugu Rahul | Jyothish Lal G

Automatic Speech Recognition (ASR) technology can potentially make valuable services more accessible to marginalized communities. However, older adults and transgender speakers are often highly disadvantaged in accessing such services due to low digital literacy and social biases. In Tamil-speaking regions, these challenges are further compounded by the inability of ASR models to handle their unique speech types, accents, and spontaneous speaking styles. To bridge this gap, the LT-EDI-2025 shared task is designed to develop robust ASR systems for Tamil speech from vulnerable populations. Using Whisper-based models, the task aims to improve recognition rates on speech data collected from older adults and transgender speakers in naturalistic settings such as banks, hospitals, and public offices. By addressing the linguistic heterogeneity and acoustic variability of this underrepresented population, the shared task is designed to develop inclusive AI solutions that break communication barriers and empower vulnerable populations in Tamil Nadu.
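For reference, transcribing Tamil audio with an off-the-shelf Whisper checkpoint via the Hugging Face pipeline looks roughly like this; the checkpoint name and file path are placeholders, and participating systems would typically fine-tune on the task data.

```python
from transformers import pipeline

# Hypothetical setup: a Whisper checkpoint fine-tuned on the task's Tamil
# data would be substituted for the base model name below.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("elderly_speaker_clip.wav",
             generate_kwargs={"language": "tamil", "task": "transcribe"})
print(result["text"])
```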

pdf bib
Solvers@LT-EDI-2025: Caste and Migration Hate Speech Detection in Tamil-English Code-Mixed Text
Ananthakumar S | Bharath P | Devasri A | Anirudh Sriram K S | Mohanapriya K T

Hate speech detection in low-resource languages such as Tamil presents significant challenges due to linguistic complexity, limited annotated data, and the sociocultural sensitivity of the subject matter. This study focuses on identifying caste- and migration-related hate speech in Tamil social media texts, as part of the LT-EDI@LDK 2025 Shared Task. The dataset used consists of 5,512 training instances and 787 development instances, annotated for binary classification into caste/migration-related and non-caste/migration-related hate speech. We employ a range of models, including Support Vector Machines (SVM), Convolutional Neural Networks (CNN), and transformer-based architectures such as BERT and multilingual BERT (mBERT). A central focus of this work is evaluating model performance using macro F1-score, which provides a balanced assessment across this imbalanced dataset. Experimental results demonstrate that transformer-based models, particularly mBERT, significantly outperform traditional approaches by effectively capturing the contextual and implicit nature of hate speech. This research underscores the importance of culturally informed NLP solutions for fostering safer online environments in underrepresented linguistic communities such as Tamil.

pdf bib
CUET_N317@LT-EDI2025: Detecting Hate Speech Related to Caste and Migration with Transformer Models
Md. Nur Siddik Ruman | Md. Tahfim Juwel Chowdhury | Hasan Murad

Language that criticizes, threatens, or discriminates against people or groups because of their caste, social rank, or status is known as caste and migration hate speech, and it has grown incredibly common on social media. Such speech not only contributes to social disruption and inequity, but also puts at risk the safety and mental health of the targeted groups. Due to the absence of labeled data, the subtlety of culturally unique insults, and the lack of strong linguistic resources for deep text recognition, it is especially difficult to detect caste and migration hate speech in low-resource Dravidian languages like Tamil. In this work, we address the Caste and Migration Hate Speech Detection task, aiming to automatically classify user-generated content as either hateful or non-hateful. We evaluate a range of approaches, including a traditional TF-IDF-based machine learning pipeline using SVM and Logistic Regression, alongside five transformer-based models: mBERT, XLM-R, MuRIL, Tamil BERT, and Tamilhate-BERT. Among these, the domain-adapted Tamilhate-BERT achieved the highest macro-F1 score of 0.88 on the test data, securing 1st place in the Shared Task on Caste and Migration Hate Speech Detection at LT-EDI@LDK 2025. Our findings highlight the strong performance of transformer models, particularly those fine-tuned on domain-specific data, in detecting nuanced hate speech in low-resource, code-mixed languages like Tamil.
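A minimal sketch of the traditional TF-IDF baseline mentioned above, scored with macro F1 (the metric used throughout these shared tasks); the data variables are assumed to be the task's train/test splits, and the hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),
])
baseline.fit(train_texts, train_labels)   # assumed parallel lists
preds = baseline.predict(test_texts)
# Macro F1 averages per-class F1, so the minority class counts equally
print(f1_score(test_labels, preds, average="macro"))
```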

pdf bib
KEC-Elite-Analysts@LT-EDI 2025: Leveraging Deep Learning for Racial Hoax Detection in Code-Mixed Hindi-English Tweets
Malliga Subramanian | Aruna A | Amudhavan M | Jahaganapathi S | Kogilavani Shanmugavadivel

Detecting misinformation in code-mixed languages, particularly Hindi-English, poses significant challenges in natural language processing due to the linguistic diversity found on social media. This paper focuses on racial hoax detection—false narratives that target specific communities—within Hindi-English YouTube comments. We evaluate the effectiveness of several machine learning models, including Logistic Regression, Random Forest, Support Vector Machine, Naive Bayes, and Multi-Layer Perceptron, using a dataset of 5,105 annotated comments. Model performance is assessed using accuracy, precision, recall, and F1-score. Experimental results indicate that neural and ensemble models consistently outperform traditional classifiers. Future work will explore the use of transformer-based architectures and data augmentation techniques to enhance detection in low-resource, code-mixed scenarios.

pdf bib
Team_Luminaries_0227@LT-EDI-2025: A Transformer-Based Fusion Approach to Misogyny Detection in Chinese Memes
Adnan Faisal | Shiti Chowdhury | Momtazul Arefin Labib | Hasan Murad

Memes, originally crafted for humor or cultural commentary, have evolved into powerful tools for spreading harmful content, particularly misogynistic ideologies. These memes sustain damaging gender stereotypes, further entrenching social inequality and encouraging toxic behavior across online platforms. While progress has been made in detecting harmful memes in English, identifying misogynistic content in Chinese remains challenging due to the language’s complexities and cultural subtleties. The multimodal nature of memes, combining text and images, adds to the detection difficulty. In the LT-EDI@LDK 2025 Shared Task on Misogyny Meme Detection, we have focused on analyzing both text and image elements to identify misogynistic content in Chinese memes. For text-based models, we have experimented with Chinese BERT, XLM-RoBERTa and DistilBERT, with Chinese BERT yielding the highest performance, achieving an F1 score of 0.86. In terms of image models, VGG16 outperformed ResNet and ViT, also achieving an F1 score of 0.85. Among all model combinations, the integration of Chinese BERT with VGG16 emerged as the most impactful, delivering superior performance, highlighting the benefit of a multimodal approach. By exploiting these two modalities, our model has effectively captured the subtle details present in memes, improving its ability to accurately detect misogynistic content. This approach has resulted in a macro F1 score of 0.90355, securing 3rd rank in the task.

pdf bib
Hinterwelt@LT-EDI 2025: A Transformer-Based Approach for Identifying Racial Hoaxes in Code-Mixed Hindi-English Social Media Narratives
Md. Abdur Rahman | Md. Al Amin | Sabik Aftahee | Md. Ashiqur Rahman

This paper presents our system for the detection of racial hoaxes in code-mixed Hindi-English social media narratives, a task that amounts to debunking online disinformation claiming fake incidents against a racial group. We experiment with different modeling techniques on the HoaxMixPlus dataset of 5,102 annotated YouTube comments. In our approach, we utilize traditional machine learning classifiers (SVM, LR, RF), deep learning models (CNN, CNN-LSTM, CNN-BiLSTM), and transformer-based architectures (MuRIL, XLM-RoBERTa, HingRoBERTa-mixed). Experiments show that transformer-based methods substantially outperform traditional approaches, with HingRoBERTa-mixed performing best at an F1 score of 0.7505. An error analysis identifies the difficulty of recognizing implicit bias and nuanced contexts in complex hoaxes. Our team placed 5th in the challenge with an F1 score of 0.69. This work contributes to combating online misinformation in low-resource linguistic environments and highlights the effectiveness of specialized language models for code-mixed content.

pdf bib
CUET_12033@LT-EDI-2025: Misogyny Detection
Mehreen Rahman | Faozia Fariha | Nabilah Tabassum | Samia Rahman | Hasan Murad

Misogynistic memes spread harmful stereotypes and toxic content across social media platforms, often combining sarcastic text and offensive visuals that make them difficult to detect using traditional methods. As part of the Shared Task on Misogyny Meme Detection at LT-EDI@LDK 2025, our research identifies misogynistic memes using a deep learning-based multimodal approach that leverages both textual and visual information for accurate classification. We experiment with various models, including CharBERT, BiLSTM, and CLIP for text and image encoding, and explore fusion strategies such as early and gated fusion. Our best performing model, CharBERT + BiLSTM + CLIP with gated fusion, achieves strong results, showing the effectiveness of combining features from both modalities. To address challenges like language mixing and class imbalance, we apply preprocessing techniques (e.g., Romanizing Chinese text) and data augmentation (e.g., image transformations, text back-translation). The results demonstrate significant improvements over unimodal baselines, highlighting the value of multimodal learning in detecting subtle and harmful content online.
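A minimal PyTorch sketch of the gated fusion idea named above: a learned sigmoid gate interpolates, per dimension, between projected text and image features. Dimensions and the final classifier are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Learns a gate that decides how much to trust the text representation
    versus the image representation for each hidden dimension."""
    def __init__(self, text_dim, image_dim, hidden_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, text_feat, image_feat):
        t = torch.tanh(self.text_proj(text_feat))
        v = torch.tanh(self.image_proj(image_feat))
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        fused = g * t + (1 - g) * v   # gated combination of modalities
        return self.classifier(fused)
```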

pdf bib
CUET_Blitz_Aces@LT-EDI-2025: Leveraging Transformer Ensembles and Majority Voting for Hate Speech Detection
Shahriar Farhan Karim | Anower Sha Shajalal Kashmary | Hasan Murad

The rapid growth of the internet and social media has given people an open space to share their opinions, but it has also led to a rise in hate speech targeting different social, cultural, and political groups. While much of the research on hate speech detection has focused on widely spoken languages, languages like Tamil, which are less commonly studied, still face significant gaps in this area. To tackle this, the Shared Task on Caste and Migration Hate Speech Detection was organized at the Fifth Workshop on Language Technology for Equality, Diversity, and Inclusion (LT-EDI-2025). This paper aims to create an automatic system that can detect caste and migration-related hate speech in Tamil-language social media content. We broke down our approach into two phases: in the first phase, we tested seven machine learning models and five transformer-based models. In the second phase, we combined the predictions from the fine-tuned transformers using a majority voting technique. This ensemble approach outperformed all other models, achieving the highest macro F1 score of 0.81682, which earned us 4th place in the competition.
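The second-phase ensemble can be sketched in a few lines; the prediction arrays are placeholders for the fine-tuned transformers, and with an odd number of models the majority label is simply whichever class more than half of them predict.

```python
import numpy as np

# Binary (0/1) predictions from each fine-tuned transformer on the test set;
# the array names are placeholders for the paper's five models.
preds = np.stack([xlmr_preds, muril_preds, tamil_bert_preds])
majority = (preds.sum(axis=0) > preds.shape[0] / 2).astype(int)
```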

pdf bib
Hinterwelt@LT-EDI 2025: A Transformer-Based Detection of Caste and Migration Hate Speech in Tamil Social Media
Md. Al Amin | Sabik Aftahee | Md. Abdur Rahman | Md. Sajid Hossain Khan | Md. Ashiqur Rahman

This paper presents our system for detecting caste and migration-related hate speech in Tamil social media comments, addressing the challenges in this low-resource language setting. We experimented with multiple approaches on a dataset of 7,875 annotated comments. Our methodology encompasses traditional machine learning classifiers (SVM, Random Forest, KNN), deep learning models (CNN, CNN-BiLSTM), and transformer-based architectures (MuRIL, IndicBERT, XLM-RoBERTa). Comprehensive evaluations demonstrate that transformer-based models substantially outperform traditional approaches, with MuRIL-large achieving the highest performance with a macro F1 score of 0.8092. Error analysis reveals challenges in detecting implicit and culturally-specific hate speech expressions requiring deeper socio-cultural context. Our team ranked 5th in the LT-EDI@LDK 2025 shared task with an F1 score of 0.80916. This work contributes to combating harmful online content in low-resource languages and highlights the effectiveness of large pre-trained multilingual models for nuanced text classification tasks.

pdf bib
EM-26@LT-EDI 2025: Detecting Racial Hoaxes in Code-Mixed Social Media Data
Tewodros Achamaleh | Fatima Uroosa | Nida Hafeez | Tolulope Olalekan Abiola | Mikiyas Mebraihtu | Sara Getachew | Grigori Sidorov | Rolando Quintero

Social media platforms and user-generated content, such as tweets, comments, and blog posts, often contain offensive language, including racial hate speech, personal attacks, and sexual harassment. Detecting such inappropriate language is essential to ensure user safety and to prevent the spread of hateful behavior and online aggression. Approaches based on conventional machine learning and deep learning have shown robust results for high-resource languages like English but find it hard to deal with code-mixed text, which is common in bilingual communication. We participated in the shared task “LT-EDI@LDK 2025” organized by DravidianLangTech, applying the BERT-base multilingual cased model and achieving an F1 score of 0.63. These results demonstrate how our model effectively processes and interprets the unique linguistic features of code-mixed content. The source code is available on GitHub.

pdf bib
EM-26@LT-EDI 2025: Caste and Migration Hate Speech Detection in Tamil-English Code-Mixed Social Media Texts
Tewodros Achamaleh | Tolulope Olalekan Abiola | Mikiyas Mebraihtu | Sara Getachew | Grigori Sidorov

In this paper, we describe the system developed by Team EM-26 for the Shared Task on Caste and Migration Hate Speech Detection at LT-EDI@LDK 2025. The task addresses the challenge of recognizing caste-based and migration-related hate speech in Tamil social media text, a language that is both nuanced and under-resourced for machine learning. To tackle this, we fine-tuned the multilingual transformer XLM-RoBERTa-Large on the provided training data, leveraging its cross-lingual strengths to detect both explicit and implicit hate speech. To improve performance, we applied social media-focused preprocessing techniques, including Tamil text normalization and noise removal. Our model achieved a macro F1-score of 0.6567 on the test set, highlighting the effectiveness of multilingual transformers for low-resource hate speech detection. Additionally, we discuss key challenges and errors in Tamil hate speech classification, which may guide future work toward building more ethical and inclusive AI systems. The source code is available on GitHub.

pdf bib
Hoax Terminators@LT-EDI 2025: CharBERT’s dominance over LLM Models in the Detection of Racial Hoaxes in Code-Mixed Hindi-English Social Media Data
Abrar Hafiz Rabbani | Diganta Das Droba | Momtazul Arefin Labib | Samia Rahman | Hasan Murad

This paper presents our system for the LT-EDI 2025 Shared Task on Racial Hoax Detection, addressing the critical challenge of identifying racially charged misinformation in code-mixed Hindi-English (Hinglish) social media—a low-resource, linguistically complex domain with real-world impact. We adopt a two-pronged strategy, independently fine-tuning a transformer-based model and a large language model. CharBERT was optimized using Optuna, while XLM-RoBERTa and DistilBERT were fine-tuned for the classification task. FLAN-T5-base was fine-tuned with SMOTE-based oversampling, semantic-preserving back translation, and prompt engineering, whereas LLaMA was used solely for inference. Our preprocessing included Hinglish-specific normalization, noise reduction, sentiment-aware corrections and a custom weighted loss to emphasize the minority Hoax class. Despite using FLAN-T5-base due to resource limits, our models performed well. CharBERT achieved a macro F1 of 0.70 and FLAN-T5 followed at 0.69, both outperforming baselines like DistilBERT and LLaMA-3.2-1B. Our submission ranked 4th of 11 teams, underscoring the promise of our approach for scalable misinformation detection in code-switched contexts. Future work will explore larger LLMs, adversarial training and context-aware decoding.
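As an illustration of the Optuna-based optimization mentioned for CharBERT, a minimal sketch follows; train_and_eval is a hypothetical wrapper that fine-tunes the model with the sampled hyperparameters and returns dev macro F1, and the search space is an assumption.

```python
import optuna

def objective(trial):
    # Sample a hyperparameter configuration for one fine-tuning run.
    lr = trial.suggest_float("lr", 1e-5, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    epochs = trial.suggest_int("epochs", 2, 5)
    # train_and_eval is a hypothetical helper returning dev macro F1.
    return train_and_eval(lr=lr, batch_size=batch_size, epochs=epochs)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```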

pdf bib
CUET_Ignite@LT-EDI-2025: A Multimodal Transformer-Based Approach for Detecting Misogynistic Memes in Chinese Social Media
Md. Mahadi Rahman | Mohammad Minhaj Uddin | Mohammad Oman | Mohammad Shamsul Arefin

Misogynistic content in memes on social media platforms poses a significant challenge for content moderation, particularly in languages like Chinese, where cultural nuances and multimodal elements complicate detection. Addressing this issue is critical for creating safer online environments. A shared task on multimodal misogyny identification in Chinese memes, organized by LT-EDI@LDK 2025, provided a curated dataset for this purpose. Since memes mix pictures and words, we used two pretrained encoders: ResNet-50 to understand the images and Chinese RoBERTa to make sense of the text. The dataset consisted of Chinese social media memes annotated with binary labels (Misogynistic and Non-Misogynistic), capturing explicit misogyny, implicit biases, and stereotypes. Our experiments demonstrated that ResNet-50 combined with Chinese RoBERTa achieved a macro F1 score of 0.91, placing second in the competition and underscoring its effectiveness in handling the complex interplay of text and visuals in Chinese memes. This research advances multimodal misogyny detection and contributes to natural language and vision processing for low-resource languages, particularly in combating gender-based abuse online.

pdf bib
girlsteam@LT-EDI-2025: Caste/Migration based hate speech Detection
Towshin Hossain Tushi | Walisa Alam | Rehenuma Ilman | Samia Rahman

The proliferation of caste- and migration-based hate speech on social media poses a significant challenge, particularly in low-resource languages like Tamil. This paper presents our approach to the LT-EDI@LDK 2025 shared task, addressing this issue through a hybrid transformer-based framework. We explore a range of Machine Learning (ML), Deep Learning (DL), and multilingual transformer models, culminating in a novel m-BERT+BiLSTM hybrid architecture. This model integrates contextual embeddings from m-BERT with lexical features from TF-IDF and FastText, feeding the enriched representations into a BiLSTM to capture bidirectional semantic dependencies. Empirical results demonstrate the superiority of this hybrid architecture, achieving a macro-F1 score of 0.76 on the test set and surpassing the performance of standalone models such as MuRIL and IndicBERT. These results affirm the effectiveness of hybrid multilingual models for hate speech detection in low-resource and culturally complex linguistic settings.
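One plausible shape for such a hybrid is sketched below: a BiLSTM runs over m-BERT token embeddings, and the pooled output is concatenated with a sentence-level lexical feature vector (e.g., TF-IDF/FastText) before classification. All dimensions and the pooling choice are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class BertBiLSTM(nn.Module):
    """BiLSTM over contextual token embeddings, with sentence-level lexical
    features concatenated before the classifier. Dimensions are illustrative."""
    def __init__(self, lex_dim, hidden=256):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden + lex_dim, 2)

    def forward(self, input_ids, attention_mask, lex_feats):
        tokens = self.bert(input_ids, attention_mask=attention_mask)
        out, _ = self.lstm(tokens.last_hidden_state)
        pooled = out.mean(dim=1)  # average the BiLSTM states over time
        return self.classifier(torch.cat([pooled, lex_feats], dim=-1))
```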

pdf bib
CUET_320@LT-EDI-2025: A Multimodal Approach for Misogyny Meme Detection in Chinese Social Media
Madiha Ahmed Chowdhury | Lamia Tasnim Khan | Md. Shafiqul Hasan | Ashim Dey

Detecting misogyny in memes is challenging due to their complex interplay of images and text that often disguise offensive content. Current AI models struggle with these cross-modal relationships and contain inherent biases. We tested multiple approaches for the Misogyny Meme Detection task at LT-EDI@LDK 2025: ChineseBERT, mBERT, and XLM-R for text; DenseNet, ResNet, and InceptionV3 for images. Our best-performing system fused fine-tuned ChineseBERT and DenseNet features, concatenating them before final classification through a fully connected network. This multimodal approach achieved a 0.93035 macro F1-score, winning 1st place in the competition and demonstrating the effectiveness of our strategy for analyzing the subtle ways misogyny manifests in visual-textual content.
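A minimal sketch of the winning fusion recipe as described: the two feature vectors are concatenated and passed through a fully connected network. The feature dimensions (768 for a BERT-base encoder, 1024 for DenseNet-121 pooled features) and the hidden size are typical values, assumed rather than reported.

```python
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    """Concatenates a text feature vector and an image feature vector,
    then classifies through a small fully connected network."""
    def __init__(self, text_dim=768, image_dim=1024, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 2),  # misogynistic vs. non-misogynistic
        )

    def forward(self, text_feat, image_feat):
        return self.net(torch.cat([text_feat, image_feat], dim=-1))
```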

pdf bib
Speech Personalization using Parameter Efficient Fine-Tuning for Nepali Speakers
Kiran Pantha | Rupak Raj Ghimire | Bal Krishna Bal

The performance of Automatic Speech Recognition (ASR) systems has improved significantly, driven by advancements in large-scale pre-trained models. However, adapting such models to low-resource languages such as Nepali is challenging due to the lack of labeled data and computational resources. Additionally, adapting a model to the unique speech characteristics of an individual speaker is also a challenging task; personalization helps target the model to fit a particular speaker. This work investigates parameter-efficient fine-tuning (PEFT) methods, namely Low-Rank Adaptation (LoRA) and Weight-Decomposed Low-Rank Adaptation (DoRA), to improve the performance of fine-tuned Whisper ASR models on Nepali ASR tasks through personalization. The experiments demonstrate that the PEFT methods obtain competitive results while significantly reducing the number of trainable parameters compared to full fine-tuning. LoRA and DoRA show relative WER increments over FTBase of 34.93% and 36.79%, respectively, and relative CER increments over FTBase of 49.50% and 50.03%, respectively. Furthermore, the results highlight a 99.74% reduction in total training parameters.
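A minimal sketch of wrapping Whisper with LoRA via the peft library; the rank, alpha, and target modules below are illustrative choices, and recent peft versions expose a use_dora flag to switch the same configuration to DoRA.

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
config = LoraConfig(
    r=8, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    # use_dora=True would switch to DoRA in recent peft versions
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # typically well under 1% of the total
```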

pdf bib
An Overview of the Misogyny Meme Detection Shared Task for Chinese Social Media
Bharathi Raja Chakravarthi | Rahul Ponnusamy | Ping Du | Xiaojian Zhuang | Saranya Rajiakodi | Paul Buitelaar | Premjith B | Bhuvaneswari Sivagnanam | Anshid K A | Sk Lavanya

The increasing prevalence of misogynistic content in online memes has raised concerns about their impact on digital discourse. Culture-specific images and informal text usage in memes present considerable challenges for automatic detection systems, especially in low-resource languages. While previous shared tasks have addressed misogyny detection in English and several European languages, misogynistic meme detection in Chinese has remained largely unexplored. To address this gap, we introduced a shared task focused on binary classification of Chinese-language memes as misogynistic or non-misogynistic. The task featured memes collected from Chinese social media and annotated by native speakers. A total of 45 teams registered, with 8 teams submitting predictions from their multimodal models, integrating textual and visual features through diverse fusion strategies. The best-performing system achieved a macro F1-score of 0.93035, highlighting the effectiveness of lightweight pretrained encoder fusion. This system used Chinese BERT and DenseNet-121 for text and image feature extraction, respectively; a feedforward network was trained as a classifier on the concatenation of the text and image features.

pdf bib
Findings of the Shared Task Multilingual Bias and Propaganda Annotation in Political Discourse
Shunmuga Priya Muthusamy Chinnan | Bharathi Raja Chakravarthi | Meghann Drury-Grogan | Senthil Kumar B | Saranya Rajiakodi | Angel Deborah S

The Multilingual Bias and Propaganda Annotation task focuses on annotating biased and propagandist content in political discourse across English and Tamil. This paper presents the findings of that shared task. The task involves two subtasks, one in English and another in Tamil, both annotation tasks in which text comments are to be labeled. With a particular emphasis on polarizing policy debates such as the US Gender Policy and India's Three Language Policy, the shared task invited participants to build annotation systems capable of labeling textual bias and propaganda. The dataset was curated by collecting comments from YouTube videos. Our curated dataset consists of 13,010 English sentences on the US Gender Policy and the Russia-Ukraine War, and 5,880 Tamil sentences on the Three Language Policy. Participants were instructed to annotate at the sentence level, following the guidelines, with fine-grained, domain-specific bias labels and four propaganda labels. Participants were encouraged to leverage existing tools or develop novel approaches to perform fine-grained annotations that capture the complex socio-political nuances present in the data.

pdf bib
Findings of the Shared Task Caste and Migration Hate Speech Detection
Saranya Rajiakodi | Bharathi Raja Chakravarthi | Rahul Ponnusamy | Shunmuga Priya Muthusamy Chinnan | Prasanna Kumar Kumaresan | Sathiyaraj Thangasamy | Bhuvaneswari Sivagnanam | Balasubramanian Palani | Kogilavani Shanmugavadivel | Abirami Murugappan | Charmathi Rajkumar

Hate speech targeting caste and migration communities is a growing concern on online platforms, particularly in linguistically diverse regions. By focusing on Tamil-language text content, this task provides a unique opportunity to tackle caste- and migration-related hate speech detection in Tamil, a low-resource language, contributing to a safer digital space. We present the results and main findings of the shared task on caste and migration hate speech detection. The task is binary classification: determining whether a text is caste/migration-related hate speech or not. The task attracted 17 participating teams, experimenting with a wide range of methodologies from traditional machine learning to advanced multilingual transformers. The top-performing system achieved a macro F1-score of 0.88105, leveraging an ensemble of fine-tuned transformer models including XLM-R and MuRIL. Our analysis highlights the effectiveness of multilingual transformers in low-resource settings, of ensemble learning, and of culturally informed, socio-political-context-aware techniques.

pdf bib
Overview of the Shared Task on Detecting Racial Hoaxes in Code-Mixed Hindi-English Social Media Data
Bharathi Raja Chakravarthi | Prasanna Kumar Kumaresan | Shanu Dhawale | Saranya Rajiakodi | Sajeetha Thavareesan | Subalalitha Chinnaudayar Navaneethakrishnan | Thenmozhi Durairaj

The widespread use of social media has made it easier for false information to proliferate, particularly racially motivated hoaxes that can encourage violence and hatred. Such content is frequently shared in code-mixed languages in multilingual nations like India, which presents special difficulties for automated detection systems because of the casual language, erratic grammar, and rich cultural background. The shared task on detecting racial hoaxes in code-mixed social media data aims to identify racial hoaxes in Hindi-English data. It is a binary classification task with more than 5,000 labeled instances. A total of 11 teams participated in the task, and the results were evaluated using the macro-F1 score. The team that employed XLM-RoBERTa secured the first position in the task.

pdf bib
Overview of Homophobia and Transphobia Span Detection in Social Media Comments
Prasanna Kumar Kumaresan | Bharathi Raja Chakravarthi | Ruba Priyadharshini | Paul Buitelaar | Malliga Subramanian | Kishore Kumar Ponnusamy

The rise and intensity of harassment and hate speech against LGBTQ+ communities on social media platforms is a growing concern. This work is an initiative to address this problem by conducting a shared task focused on the detection of homophobic and transphobic content in multilingual settings. The task comprises two subtasks: (1) multi-class classification of content into Homophobia, Transphobia, or Non-anti-LGBT+ categories across eight languages, and (2) span-level detection to identify specific toxic segments within comments in English, Tamil, and Marathi. This initiative supports the development of explainable and socially responsible AI tools for combating identity-based harm in digital spaces. Multiple teams registered for the task; however, only two teams submitted their results, which were evaluated using the macro F1 score.

pdf bib
Overview of the Fifth Shared Task on Speech Recognition for Vulnerable Individuals in Tamil
Bharathi B | Bharathi Raja Chakravarthi | Sripriya N | Rajeswari Natarajan | Ratnavel Rajalakshmi | Suhasini S

In this paper, an overview of the shared task on speech recognition for vulnerable individuals in Tamil (LT-EDI@LDK2025) is presented. The task comes with a Tamil dataset collected from elderly individuals who identify as male, female, or transgender. The audio samples were recorded in public places such as markets, vegetable shops, and hospitals. The dataset was released in two phases: training and testing. Participants were required to process the audio signals using various models and techniques and submit transcriptions of the provided test samples. Participants' results were assessed using Word Error Rate (WER). Participants employed transformer-based approaches to achieve automatic speech recognition. This overview paper discusses the findings and the various pre-trained transformer-based models that the participants employed.
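For reference, WER counts word-level substitutions (S), deletions (D), and insertions (I) against the reference length N: WER = (S + D + I) / N. A minimal check with the jiwer library (the example strings are placeholders):

```python
import jiwer

reference = "reference transcription produced by a human annotator"
hypothesis = "transcription produced by the asr system"
print(jiwer.wer(reference, hypothesis))  # 0.0 is perfect; lower is better
```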

up

pdf (full)
bib (full)
Proceedings of the 5th Conference on Language, Data and Knowledge: The 5th OntoLex Workshop

pdf bib
Proceedings of the 5th Conference on Language, Data and Knowledge: The 5th OntoLex Workshop
Katerina Gkirtzou | Slavko Žitnik | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov

pdf bib
Inferring Adjective Hypernyms with Language Models to Increase the Connectivity of Open English Wordnet
Lorenzo Augello | John Philip McCrae

Open English Wordnet is a key resource published in OntoLex-lemon as part of the linguistic linked open data cloud. There are, however, many links missing in the resource, and in this paper, we look at how we can establish hypernymy between adjectives. We present a theoretical discussion of the hypernymy relation and how it differs for adjectives in contrast to nouns and verbs. We develop a new resource for adjective hypernymy and fine-tune large language models to predict adjective hypernymy, showing that the methodology of TaxoLLaMa can be adapted to this task.

pdf bib
Bringing IATE into the Semantic Web Family
Paula Diez Ibarbia | Patricia Martín Chozas | Elena Montiel Ponsoda

This paper is an extension of previous work by the authors and other researchers that studies the application of the OntoLex-lemon model for representing the InterActive Terminology for Europe (IATE) database in the Semantic Web. While traditional XML-based approaches have been effective for multilingual terminological work, the Semantic Web enables richer, more interoperable representations. The study evaluates the suitability of OntoLex-lemon for modeling IATE's complex structure and identifies limitations in existing vocabularies. To address these, this paper identifies other existing vocabularies and ontologies that could cover those limitations, which include term reliability, regional usage, lifecycle statuses, lookup forms, and concept cross-references. Still, some representation requirements are not covered by existing vocabularies and may need to be further discussed within the community.

pdf bib
Ontologies for historical languages: using the LiLa and OntoLex-Lemon framework to build a Lemma Bank for Old Irish
Theodorus Fransen

This paper presents a Linked Data approach to digitising and structuring Old Irish linguistic resources using the LiLa (Linking Latin) ontology, which is itself largely based on the OntoLex-Lemon framework (Cimiano et al., 2016). Old Irish, as an ancient Celtic language with fragmented textual traditions, presents unique challenges for the creation and interoperability of digital resources. This work is part of the MOLOR project, whose aim is to create a knowledge base for Old Irish by interlinking texts, lexicons, and inflectional data. The first step in this ambitious endeavour is described here: the creation of an RDF linguistic Linked Data hub known as a Lemma Bank, similar to the one created as part of the LiLa project, addressing specific linguistic challenges and opportunities while adhering to the LiLa ontology.

pdf bib
A Lightweight String Based Method of Encoding Etymologies in Linked Data Lexical Resources
Anas Fahad Khan | Maxim Ionov | Paola Marongiu | Ana Salgado

In this submission we propose an approach to encoding etymological information as strings (“etymology strings”). We begin by discussing the advantages of such an approach with respect to one in which etymologies and etymons are explicitly represented as RDF individuals. Next we give a formal description of the regular language underlying our approach as an Extended Backus-Naur Form grammar (EBNF). We use the Chamuça Hindi lexicon as a test case for our approach and show some of the kinds of SPARQL queries which can be made using etymological strings.

pdf bib
Ontolex-Lemon in Wikidata and other Wikibase instances
David Lindemann

This paper provides insight into how the core elements of the Ontolex-Lemon model are integrated in the Wikibase Ontology, the data model fundamental to any instance of the Wikibase software, including Wikidata lexemes, which today is probably the largest Ontolex-Lemon use case, a dataset collaboratively built by the community of Wikidata users. We describe how lexical entries are modeled on a Wikibase, including the linguistic description of lexemes, the linking of lexical entries, lexical senses and lexical forms across resources, and links across the domain of lexemes and the ontological part of a Wikibase knowledge graph. Our aim is to present Wikibase as a solution for storing and collaboratively editing lexical data following Semantic Web standards, and to identify relevant research questions to be addressed in future work.

pdf bib
Philosophising Lexical Meaning as an OntoLex-Lemon Extension
Veruska Zamborlini | Jiaqi Zhu | Marieke van Erp | Arianna Betti

OntoLex-Lemon is a model for representing lexical information, focusing on the use of lexical entries in texts rather than their definitions. This work proposes an extension to the model that aims to capture the definition of senses attributed to lexical entries. We explicitly represent a conceptual setup authored by an agent that operates on lexical content. It either proposes new senses for existing lexical entries in a language or coins new terms to express proposed senses. It provides textual and/or formal definitions to senses/concepts, and can serve as an interpretation of other senses/concepts through rephrasing, translation, formalization, or comparison. Because a conceptual setup and its interpretations may not be unanimously accepted, it is important to support the selection of relevant meanings, as for example, those proposed by a certain author. We illustrate the application of our proposed extension with two case studies, one about the philosophical definition of the concept of idea and its interpretations, and one about historical attributions of meaning to the Dutch East India Company (VOC).

up

pdf (full)
bib (full)
Proceedings of the 5th Conference on Language, Data and Knowledge: TermTrends 2025

pdf bib
Proceedings of the 5th Conference on Language, Data and Knowledge: TermTrends 2025
Katerina Gkirtzou | Slavko Žitnik | Jorge Gracia | Dagmar Gromann | Maria Pia di Buono | Johanna Monti | Maxim Ionov

pdf bib
The LegISTyr Test Set: Investigating Off-the-Shelf Instruction-Tuned LLMs for Terminology-Constrained Translation in a Low-Resource Language Variety
Paolo Di Natale | Egon W. Stemle | Elena Chiocchetti | Marlies Alber | Natascia Ralli | Isabella Stanizzi | Elena Benini

We investigate the effect of terminology injection for terminology-constrained translation in a low-resource language variety, with a particular focus on off-the-shelf instruction-tuned Large Language Models (LLMs). We compare a total of 9 models: 4 instruction-tuned LLMs from the Tower and EuroLLM suites, which have been specifically trained for translation-related tasks; 2 generic open-weight LLMs (LLaMA-8B and Mistral-7B); 3 Neural Machine Translation (NMT) systems (an adapted version of MarianMT and ModernMT with and without the glossary function). To this end, we release LegISTyr, a manually curated test set of 2,000 Italian sentences from the legal domain, paired with source Italian terms and target terms in the South Tyrolean standard variety of German. We select only real-world sources and design constraints on length, syntactic clarity, and referential coherence to ensure high quality. LegISTyr includes a homonym subset, which challenges systems on the selection of the correct homonym where sense disambiguation is deducible from the context. Results show that while generic LLMs achieve the highest raw term insertion rates (approximately 64%), translation-specialized LLMs deliver superior fluency (∆ COMET up to 0.04), reduce incorrect homonym selection by half, and generate more controllable output. We posit that models trained on translation-related data are better able to focus on source-side information, producing more coherent translations.
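As a rough illustration of terminology injection for instruction-tuned LLMs, a glossary constraint can be spelled out directly in the prompt. The wording, glossary entry, and helper below are illustrative assumptions, not the paper's actual prompt or data.

```python
# Hypothetical glossary entry mapping an Italian legal term to a
# South Tyrolean German equivalent (illustrative, not from LegISTyr).
glossary = {"decreto legislativo": "Legislativdekret"}

def build_prompt(source_sentence: str, glossary: dict) -> str:
    """Builds a translation prompt that injects term constraints."""
    constraints = "\n".join(f"- translate '{src}' as '{tgt}'"
                            for src, tgt in glossary.items())
    return ("Translate the following Italian legal sentence into the South "
            "Tyrolean standard variety of German, obeying these term "
            "constraints:\n"
            f"{constraints}\n\nSentence: {source_sentence}\nTranslation:")
```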

pdf bib
Terminology Management Meets AI: The ISO/TC 37/SC 3/WG 6 Initiative
Mohamed Khemakhem | Cristina Valentini | Natascia Ralli | Sérgio Barros | Georg Löckinger | Federica Vezzani | Ana Salgado | Zhenling Zhang | Sabine Mahr | Sara Carvalho | Klaus Fleischmann | Rute Costa

The integration of artificial intelligence (AI) with terminology management (TM) has opened new avenues for enhancing efficiency and precision in both fields, necessitating standardized approaches to ensure interoperability and ethical application. The newly formed ISO/TC 37/SC 3/WG 6 represents the first dedicated initiative to study the standardization of the mutual improvements of AI and TM. This group aims to develop standardized frameworks and guidelines that optimize the interaction between AI technologies and terminology resources, benefiting professionals, systems, and practices in both domains. This article presents the state-of-the-art in the mutual relationship between AI and TM, highlighting opportunities for bidirectional advancements. It also addresses limitations and challenges from a standardization perspective. By tackling these issues, ISO/TC 37/SC 3/WG 6 seeks to establish principles that ensure scalability, precision, and ethical considerations, shaping future standards to support global communication and knowledge exchange.

pdf bib
Inferring Semantic Relations Between Terms with Large Language Models
Giulia Speranza

The purpose of this paper is to investigate the ability of Large Language Models (LLMs) to identify relations among terms, with the goal of facilitating and accelerating the construction of thesauri and terminological resources. We investigate whether the use of LLMs in this context can provide a valuable initial set of relations, serving as a basis upon which professional terminologists can build, validate, and enrich domain-specific knowledge representations.