Training and Evaluation of POS Taggers on the French MULTITAG Corpus

Alexandre Allauzen; Hélène Bonneau-Maynard

Training and Evaluation of POS Taggers on the French MULTITAG Corpus

Alexandre Allauzen, Hélène Bonneau-Maynard

Abstract

The explicit introduction of morphosyntactic information into statistical machine translation approaches is receiving an important focus of attention. The current freely available Part of Speech (POS) taggers for the French language are based on a limited tagset which does not account for some flectional particularities. Moreover, there is a lack of a unified framework of training and evaluation for these kinds of linguistic resources. Therefore in this paper, three standard POS taggers (Treetagger, Brills tagger and the standard HMM POS tagger) are trained and evaluated in the same conditions on the French MULTITAG corpus. This POS-tagged corpus provides a tagset richer than the usual ones, including gender and number distinctions, for example. Experimental results show significant differences of performance between the taggers. According to the tagging accuracy estimated with a tagset of 300 items, taggers may be ranked as follows: Treetagger (95.7%), Brills tagger (94.6%), HMM tagger (93.4%). Examples of translation outputs illustrate how considering gender and number distinctions in the POS tagset can be relevant.

Anthology ID:: L08-1368
Volume:: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Month:: May
Year:: 2008
Address:: Marrakech, Morocco
Editors:: Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Venue:: LREC
SIG:
Publisher:: European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:: http://www.lrec-conf.org/proceedings/lrec2008/pdf/856_paper.pdf
DOI:
Bibkey:
Cite (ACL):: Alexandre Allauzen and Hélène Bonneau-Maynard. 2008. Training and Evaluation of POS Taggers on the French MULTITAG Corpus. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA).
Cite (Informal):: Training and Evaluation of POS Taggers on the French MULTITAG Corpus (Allauzen & Bonneau-Maynard, LREC 2008)
Copy Citation:
PDF:: http://www.lrec-conf.org/proceedings/lrec2008/pdf/856_paper.pdf

PDF Cite Search Fix data