Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text

Nikola Ljubešić, Tomaž Erjavec, Darja Fišer


Abstract
In this paper we present adaptations of a state-of-the-art tagger for South Slavic languages to non-standard text, using Slovene as an example. We investigate the impact of introducing in-domain training data as well as additional supervision through external resources or tools such as word clusters and word normalization. Training the tagger on a combination of standard and non-standard data removes more than half of the error that the standard tagger makes on non-standard text, while enriching the data representation with external resources removes an additional 11 percent of the error. The final configuration achieves a tagging accuracy of 87.41% on the full morphosyntactic description, which is nevertheless still quite far from the accuracy of 94.27% achieved on standard text.
Anthology ID:
W17-1410
Volume:
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Tomaž Erjavec, Jakub Piskorski, Lidia Pivovarova, Jan Šnajder, Josef Steinberger, Roman Yangarber
Venue:
BSNLP
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
Pages:
60–68
URL:
https://aclanthology.org/W17-1410/
DOI:
10.18653/v1/W17-1410
Cite (ACL):
Nikola Ljubešić, Tomaž Erjavec, and Darja Fišer. 2017. Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 60–68, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Adapting a State-of-the-Art Tagger for South Slavic Languages to Non-Standard Text (Ljubešić et al., BSNLP 2017)
PDF:
https://aclanthology.org/W17-1410.pdf