A Parallel English - Serbian - Bulgarian - Macedonian Lexicon of Named Entities

Aleksandar Petrovski


Abstract
This paper describes the creation of a parallel multilingual lexicon of named entities from English to three South Slavic languages (Serbian, Bulgarian, and Macedonian), with Wikipedia as the source. The basics of the proposed methodology are well known. It offers a cheap way to build multilingual lexicons without requiring expertise in the target languages. Wikipedia’s database dumps can be freely downloaded in SQL and XML formats. The method has been implemented as a Python application that extracts the parallel English – Serbian – Bulgarian – Macedonian titles from Wikipedia and classifies them using the English Wikipedia category system. The extracted named entities are assigned to five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The classification relies on Wikipedia metadata. Its quality was checked manually on 1,000 randomly chosen named entities, yielding 97% precision and 90% recall.
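The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's application: the function names `classify` and `build_lexicon`, the keyword rules in `CLASS_KEYWORDS`, and the toy input data are all assumptions made for illustration, and the sketch presumes that interlanguage links and category names have already been parsed from the Wikipedia dumps into plain dictionaries.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical keyword rules: the abstract does not list the actual rules
# used to map English Wikipedia categories to the five classes.
CLASS_KEYWORDS: List[Tuple[str, Tuple[str, ...]]] = [
    ("PERSON", ("births", "deaths", "people")),
    ("ORGANIZATION", ("companies", "organizations", "universities")),
    ("LOCATION", ("cities", "countries", "rivers", "mountains")),
    ("PRODUCT", ("products", "software", "vehicles")),
]


def classify(categories: Iterable[str]) -> str:
    """Return the first class whose keyword occurs in a category name, else MISC."""
    lowered = [c.lower() for c in categories]
    for label, keywords in CLASS_KEYWORDS:
        if any(k in c for c in lowered for k in keywords):
            return label
    return "MISC"


def build_lexicon(
    langlinks: Dict[str, Dict[str, str]],
    categories: Dict[str, List[str]],
    targets: Tuple[str, ...] = ("sr", "bg", "mk"),
) -> List[Dict[str, str]]:
    """Keep English titles that have a linked title in every target language."""
    lexicon = []
    for en_title, links in langlinks.items():
        if all(lang in links for lang in targets):
            entry = {"en": en_title, "class": classify(categories.get(en_title, []))}
            entry.update({lang: links[lang] for lang in targets})
            lexicon.append(entry)
    return lexicon


# Toy input standing in for data parsed from the dumps (illustrative only).
LANGLINKS = {
    "Belgrade": {"sr": "Београд", "bg": "Белград", "mk": "Белград"},
    "Nikola Tesla": {"sr": "Никола Тесла", "bg": "Никола Тесла", "mk": "Никола Тесла"},
    "Untranslated page": {"sr": "Само српски"},
}
CATEGORIES = {
    "Belgrade": ["Capitals in Europe", "Cities in Serbia"],
    "Nikola Tesla": ["1856 births", "1943 deaths"],
}

if __name__ == "__main__":
    for row in build_lexicon(LANGLINKS, CATEGORIES):
        print(row)
```

In this sketch an English title is kept only when it has a counterpart in all three target languages, which is one possible reading of "parallel titles"; a real system might also keep partial alignments.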
Anthology ID:
2022.clib-1.17
Volume:
Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022)
Month:
September
Year:
2022
Address:
Sofia, Bulgaria
Venue:
CLIB
Publisher:
Department of Computational Linguistics, IBL -- BAS
Pages:
146–151
URL:
https://aclanthology.org/2022.clib-1.17
Cite (ACL):
Aleksandar Petrovski. 2022. A Parallel English - Serbian - Bulgarian - Macedonian Lexicon of Named Entities. In Proceedings of the 5th International Conference on Computational Linguistics in Bulgaria (CLIB 2022), pages 146–151, Sofia, Bulgaria. Department of Computational Linguistics, IBL -- BAS.
Cite (Informal):
A Parallel English - Serbian - Bulgarian - Macedonian Lexicon of Named Entities (Petrovski, CLIB 2022)
PDF:
https://aclanthology.org/2022.clib-1.17.pdf