Knowledge and Rule-Based Diacritic Restoration in Serbian

Cvetana Krstev, Ranka Stanković, Duško Vitas


Abstract
In this paper we present a procedure for restoring diacritics in Serbian texts written in the degraded Latin alphabet. The procedure relies on comprehensive lexical resources for Serbian: morphological electronic dictionaries, the Corpus of Contemporary Serbian (SrpKor), and local grammars. The dictionaries are used to identify possible candidates for restoration, while data obtained from SrpKor and the local grammars help to choose among several candidates in cases of ambiguity. The evaluation results show that, depending on the text, accuracy ranges from 95.03% to 99.36%, and that precision (average 98.93%) is always higher than recall (average 94.94%).
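The two-stage idea described in the abstract (dictionary lookup for candidates, then corpus evidence to pick one) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `degrade`, `build_index`, and `restore` are hypothetical, the word list and frequency counts are invented toy data rather than figures from SrpKor, and the mapping of đ to "dj" is an assumption. Local grammars are not modeled here; only a plain frequency comparison is shown.

```python
from collections import defaultdict

# Serbian Latin letters with diacritics and an assumed degraded ASCII form.
# Assumption: "đ" is written as "dj"; some texts use a plain "d" instead.
DEGRADE = {"š": "s", "č": "c", "ć": "c", "ž": "z", "đ": "dj"}


def degrade(word):
    """Strip diacritics from a lowercase word using the DEGRADE mapping."""
    return "".join(DEGRADE.get(ch, ch) for ch in word)


def build_index(lexicon):
    """Map each degraded form to the set of dictionary forms that produce it."""
    index = defaultdict(set)
    for form in lexicon:
        index[degrade(form)].add(form)
    return index


def restore(token, index, freq):
    """Return the most frequent candidate for a lowercase degraded token.

    Tokens with no dictionary candidate are returned unchanged.
    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = index.get(token)
    if not candidates:
        return token
    return max(sorted(candidates), key=lambda w: freq.get(w, 0))


# Toy lexicon and invented frequency counts, for illustration only.
LEXICON = ["kuća", "kuca", "šuma", "suma", "reč"]
FREQ = {"kuća": 120, "kuca": 15, "šuma": 40, "suma": 60, "reč": 80}
INDEX = build_index(LEXICON)

restored = [restore(t, INDEX, FREQ) for t in ["kuca", "suma", "rec", "xyz"]]
print(restored)
```

In this sketch the token "kuca" has two candidates and the more frequent one wins, "rec" has a single candidate, and "xyz" is left untouched because the toy lexicon does not cover it. A frequency-only choice ignores context, which is the gap that the corpus data and local grammars mentioned in the abstract are meant to address.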
Anthology ID:
2018.clib-1.7
Volume:
Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018)
Month:
May
Year:
2018
Address:
Sofia, Bulgaria
Venue:
CLIB
Publisher:
Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences
Pages:
41–51
URL:
https://aclanthology.org/2018.clib-1.7
Cite (ACL):
Cvetana Krstev, Ranka Stanković, and Duško Vitas. 2018. Knowledge and Rule-Based Diacritic Restoration in Serbian. In Proceedings of the Third International Conference on Computational Linguistics in Bulgaria (CLIB 2018), pages 41–51, Sofia, Bulgaria. Department of Computational Linguistics, Institute for Bulgarian Language, Bulgarian Academy of Sciences.
Cite (Informal):
Knowledge and Rule-Based Diacritic Restoration in Serbian (Krstev et al., CLIB 2018)
PDF:
https://aclanthology.org/2018.clib-1.7.pdf