A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian

Aleksandra Miletić, Filip Miletić


Abstract
Bosnian, Croatian, Montenegrin and Serbian are the official standard linguistic varieties in Bosnia and Herzegovina, Croatia, Montenegro, and Serbia, respectively. When these four countries were part of the former Yugoslavia, the varieties were considered to share a single linguistic standard. After the individual countries were established, the national standards emerged. Today, a central question about these varieties remains the following: How different are they from each other? How hard is it to distinguish them? While this has been addressed in NLP as part of the task on Distinguishing Between Similar Languages (DSL), little is known about human performance, making it difficult to contextualize system results. We tackle this question by reannotating the existing BCMS dataset for DSL with annotators from all target regions. We release a new gold standard, replacing the original single-annotator, single-label annotation by a multi-annotator, multi-label one, thus improving annotation reliability and explicitly coding the existence of ambiguous instances. We reassess a previously proposed DSL system on the new gold standard and establish the human upper bound on the task. Finally, we identify sources of annotation difficulties and provide linguistic insights into the BCMS dialect continuum, with multiple indicators highlighting an intermediate position of Bosnian and Montenegrin.
Anthology ID:
2024.humeval-1.4
Volume:
Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Simone Balloccu, Anya Belz, Rudali Huidrom, Ehud Reiter, Joao Sedoc, Craig Thomson
Venues:
HumEval | WS
SIG:
Publisher:
ELRA and ICCL
Note:
Pages:
36–46
Language:
URL:
https://aclanthology.org/2024.humeval-1.4
DOI:
Bibkey:
Cite (ACL):
Aleksandra Miletić and Filip Miletić. 2024. A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian. In Proceedings of the Fourth Workshop on Human Evaluation of NLP Systems (HumEval) @ LREC-COLING 2024, pages 36–46, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Gold Standard with Silver Linings: Scaling Up Annotation for Distinguishing Bosnian, Croatian, Montenegrin and Serbian (Miletić & Miletić, HumEval-WS 2024)
Copy Citation:
PDF:
https://aclanthology.org/2024.humeval-1.4.pdf