The Vedic Compound Dataset

Sven Sellmer, Oliver Hellwig


Abstract
This paper introduces the Vedic Compound Dataset (VCD), the first resource providing annotated compounds from Vedic Sanskrit, a South Asian Indo-European language used from ca. 1500 to 500 BCE. The VCD aims to facilitate the study of language change in early Indo-Iranian and offers comparative material for quantitative cross-linguistic research on compounds. Annotating Vedic compounds is complex because they include five of the six basic compound types defined by Scalise & Bisetto (2005), and these types are not consistently marked in morphosyntax, which makes their automatic classification a significant challenge. The paper details the process of collecting and preprocessing the relevant data, with a particular focus on the question of how to distinguish exocentric from endocentric usage. It further discusses experiments with a simple ML classifier that uses compound-internal syntactic relations, outlines the composition of the dataset, and sketches directions for future research.
Anthology ID:
2024.mwe-1.8
Volume:
Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Archna Bhatia, Gosse Bouma, A. Seza Doğruöz, Kilian Evang, Marcos Garcia, Voula Giouli, Lifeng Han, Joakim Nivre, Alexandre Rademaker
Venues:
MWE | UDW | WS
SIGs:
SIGLEX | SIGPARSE
Publisher:
ELRA and ICCL
Pages:
50–55
URL:
https://aclanthology.org/2024.mwe-1.8
Cite (ACL):
Sven Sellmer and Oliver Hellwig. 2024. The Vedic Compound Dataset. In Proceedings of the Joint Workshop on Multiword Expressions and Universal Dependencies (MWE-UD) @ LREC-COLING 2024, pages 50–55, Torino, Italia. ELRA and ICCL.
Cite (Informal):
The Vedic Compound Dataset (Sellmer & Hellwig, MWE-UDW-WS 2024)
PDF:
https://aclanthology.org/2024.mwe-1.8.pdf