Fast and Robust POS tagger for Arabic Tweets Using Agreement-based Bootstrapping

Fahad Albogamy, Allan Ramsay


Abstract
Part-of-Speech (POS) tagging is a key step in many NLP algorithms. However, tweets are difficult to POS tag because they are short, do not always follow formal grammar and standard spelling, and often use abbreviations to fit within the length limit. Arabic tweets also exhibit further linguistic phenomena, such as the use of different dialects, romanised Arabic, and borrowed foreign words. In this paper, we present an evaluation and a detailed error analysis of state-of-the-art POS taggers for Arabic when applied to Arabic tweets. On the basis of this analysis, we combine normalisation and external knowledge to handle the noisiness of the domain, and exploit bootstrapping to construct extra training data in order to improve POS tagging for Arabic tweets. Our results show significant improvements over the performance of a number of well-known taggers for Arabic.
Anthology ID:
L16-1238
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1500–1506
URL:
https://aclanthology.org/L16-1238/
Cite (ACL):
Fahad Albogamy and Allan Ramsay. 2016. Fast and Robust POS tagger for Arabic Tweets Using Agreement-based Bootstrapping. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1500–1506, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Fast and Robust POS tagger for Arabic Tweets Using Agreement-based Bootstrapping (Albogamy & Ramsay, LREC 2016)
PDF:
https://aclanthology.org/L16-1238.pdf