Embeddings models for Buddhist Sanskrit

Ligeia Lugli, Matej Martinc, Andraž Pelicon, Senja Pollak


Abstract
The paper presents novel resources and experiments for Buddhist Sanskrit, broadly defined here to include all the varieties of Sanskrit in which Buddhist texts have been transmitted. We release a novel corpus of Buddhist texts, a novel corpus of general Sanskrit, and word similarity and word analogy datasets for the intrinsic evaluation of Buddhist Sanskrit embeddings models. We compare the performance of word2vec and fastText static embeddings models, with default and optimized parameter settings, as well as the contextual models BERT and GPT-2, under different training regimes (including a transfer-learning approach that uses the general Sanskrit corpus) and different embedding-construction regimes (i.e., different combinations of encoder layers). The results show that fastText embeddings yield the best results for semantic similarity, while BERT embeddings work best for word analogy tasks. We also show that for contextual models the optimal layer combination for embedding construction is task dependent, and that pretraining the contextual models on a reference corpus of general Sanskrit is beneficial, a promising finding for the future development of embeddings for less-resourced languages and domains.
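Since the abstract mentions constructing contextual embeddings from combinations of encoder layers, the following is a minimal, hypothetical sketch of that general technique using the Hugging Face transformers library. It is not the authors' released code: the model name is a placeholder, and the layer selection merely illustrates that the combination is a tunable, task-dependent setting, as the paper reports.

```python
# Minimal sketch (assumed setup, not the paper's released code): build a
# word embedding by averaging a chosen combination of BERT encoder layers.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # placeholder; not the paper's model
LAYERS = [-4, -3, -2, -1]                    # layer combination; task dependent

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Average the chosen layers, then average the target word's subword
    pieces. Assumes a whitespace-tokenized, punctuation-free sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).hidden_states  # tuple of [1, seq, dim] tensors
    layer_avg = torch.stack([hidden[i] for i in LAYERS]).mean(dim=0).squeeze(0)
    target = sentence.split().index(word)    # word position in the sentence
    piece_ids = [i for i, w in enumerate(enc.word_ids()) if w == target]
    return layer_avg[piece_ids].mean(dim=0)  # one [dim] vector for the word

# Cosine similarity between two such vectors is the kind of score a
# word-similarity evaluation would correlate against gold human judgements.
v1 = word_embedding("dharma is taught", "dharma")
v2 = word_embedding("the buddha taught dharma", "dharma")
print(torch.cosine_similarity(v1, v2, dim=0).item())
```

An intrinsic evaluation along the lines the abstract describes would then correlate such similarity scores with a gold word-similarity dataset (e.g., via Spearman correlation), repeating the process for each candidate layer combination.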
Anthology ID: 2022.lrec-1.411
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3861–3871
URL: https://aclanthology.org/2022.lrec-1.411
Cite (ACL): Ligeia Lugli, Matej Martinc, Andraž Pelicon, and Senja Pollak. 2022. Embeddings models for Buddhist Sanskrit. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3861–3871, Marseille, France. European Language Resources Association.
Cite (Informal): Embeddings models for Buddhist Sanskrit (Lugli et al., LREC 2022)
PDF: https://aclanthology.org/2022.lrec-1.411.pdf