New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian

Nikola Ljubešić; Filip Klubicka; Željko Agić; Ivo-Pavao Jazbec

New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian

Nikola Ljubešić, Filip Klubička, Željko Agić, Ivo-Pavao Jazbec

Abstract

In this paper we present newly developed inflectional lexcions and manually annotated corpora of Croatian and Serbian. We introduce hrLex and srLex - two freely available inflectional lexicons of Croatian and Serbian - and describe the process of building these lexicons, supported by supervised machine learning techniques for lemma and paradigm prediction. Furthermore, we introduce hr500k, a manually annotated corpus of Croatian, 500 thousand tokens in size. We showcase the three newly developed resources on the task of morphosyntactic annotation of both languages by using a recently developed CRF tagger. We achieve best results yet reported on the task for both languages, beating the HunPos baseline trained on the same datasets by a wide margin.

Anthology ID:: L16-1676
Volume:: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:: May
Year:: 2016
Address:: Portorož, Slovenia
Editors:: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:: 4264–4270
Language:
URL:: https://aclanthology.org/L16-1676/
DOI:
Bibkey:
Cite (ACL):: Nikola Ljubešić, Filip Klubička, Željko Agić, and Ivo-Pavao Jazbec. 2016. New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4264–4270, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):: New Inflectional Lexicons and Training Corpora for Improved Morphosyntactic Annotation of Croatian and Serbian (Ljubešić et al., LREC 2016)
Copy Citation:
PDF:: https://aclanthology.org/L16-1676.pdf

PDF Cite Search Fix data