Dealing with Abbreviations in the Slovenian Biographical Lexicon

Angel Daza, Antske Fokkens, Tomaž Erjavec


Abstract
Abbreviations present a significant challenge for NLP systems because they cause tokenization and out-of-vocabulary errors. They can also make the text less readable, especially in printed reference books, where they are used extensively. Abbreviations are especially problematic in low-resource settings, where systems are less robust to begin with. In this paper, we propose a new method for addressing the problems caused by a high density of domain-specific abbreviations in a text. We apply this method to the case of a Slovenian biographical lexicon and evaluate it on a newly developed gold-standard dataset of 51 Slovenian biographies. Our abbreviation identification method performs significantly better than commonly used ad-hoc solutions, especially at identifying unseen abbreviations. We also propose a method for expanding the identified abbreviations in context and present its results.
Anthology ID:
2022.emnlp-main.596
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8715–8720
URL:
https://aclanthology.org/2022.emnlp-main.596
DOI:
10.18653/v1/2022.emnlp-main.596
Cite (ACL):
Angel Daza, Antske Fokkens, and Tomaž Erjavec. 2022. Dealing with Abbreviations in the Slovenian Biographical Lexicon. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8715–8720, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Dealing with Abbreviations in the Slovenian Biographical Lexicon (Daza et al., EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.596.pdf