Cross-Level Semantic Similarity for Serbian Newswire Texts

Vuk Batanović, Maja Miličević Petrović


Abstract
Cross-Level Semantic Similarity (CLSS) is a measure of the level of semantic overlap between texts of different lengths. Although this problem was formulated almost a decade ago, research on it has been sparse, and limited exclusively to the English language. In this paper, we present the first CLSS dataset in another language, in the form of CLSS.news.sr – a corpus of 1000 phrase-sentence and 1000 sentence-paragraph newswire text pairs in Serbian, manually annotated with fine-grained semantic similarity scores using a 0–4 similarity scale. We describe the methodology of data collection and annotation, and compare the resulting corpus to its preexisting counterpart in English, SemEval CLSS, following up with a preliminary linguistic analysis of the newly created dataset. State-of-the-art pre-trained language models are then fine-tuned and evaluated on the CLSS task in Serbian using the produced data, and their settings and results are discussed. The CLSS.news.sr corpus and the guidelines used in its creation are made publicly available.
Anthology ID:
2022.lrec-1.180
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1691–1699
URL:
https://aclanthology.org/2022.lrec-1.180
Cite (ACL):
Vuk Batanović and Maja Miličević Petrović. 2022. Cross-Level Semantic Similarity for Serbian Newswire Texts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 1691–1699, Marseille, France. European Language Resources Association.
Cite (Informal):
Cross-Level Semantic Similarity for Serbian Newswire Texts (Batanović & Miličević Petrović, LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.180.pdf