Embed More Ignore Less (EMIL): Exploiting Enriched Representations for Arabic NLP

Ahmed Younes, Julie Weeds


Abstract
Our research focuses on the potential for improving neural NLP by exploiting language-specific characteristics in the form of embeddings. More specifically, we investigate the capability of neural techniques and embeddings to represent language-specific characteristics in two sequence labeling tasks: named entity recognition (NER) and part-of-speech (POS) tagging. In both tasks, our preprocessing is designed to produce an enriched Arabic representation by adding diacritics to undiacritized text. In POS tagging, we test the ability of a neural model to capture the syntactic characteristics encoded within these diacritics by incorporating an embedding layer for diacritics alongside embedding layers for words and characters. In NER, our architecture incorporates diacritic and POS embeddings alongside word and character embeddings. Our experiments are conducted on 7 datasets (4 NER and 3 POS). We show that embedding the information encoded in automatically acquired Arabic diacritics improves performance across all datasets on both tasks. Embedding the information in automatically assigned POS tags further improves performance on the NER task.
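The multi-embedding idea described above can be sketched as follows: each token's representation is the concatenation of a word embedding, pooled character embeddings, pooled diacritic embeddings, and a POS-tag embedding. This is an illustrative sketch only — the vocabulary sizes, embedding dimensions, and mean-pooling over characters and diacritics are assumptions for demonstration, not the paper's actual configuration (which may, e.g., use a character-level LSTM or CNN instead of pooling).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding dimensions (illustrative only).
WORD_VOCAB, CHAR_VOCAB, DIAC_VOCAB, POS_VOCAB = 1000, 60, 15, 20
D_WORD, D_CHAR, D_DIAC, D_POS = 50, 25, 10, 10

# Randomly initialised embedding tables; in practice these are learned.
word_emb = rng.normal(size=(WORD_VOCAB, D_WORD))
char_emb = rng.normal(size=(CHAR_VOCAB, D_CHAR))
diac_emb = rng.normal(size=(DIAC_VOCAB, D_DIAC))
pos_emb = rng.normal(size=(POS_VOCAB, D_POS))

def token_representation(word_id, char_ids, diac_ids, pos_id):
    """Concatenate word, pooled-character, pooled-diacritic, and POS embeddings."""
    char_vec = char_emb[char_ids].mean(axis=0)   # mean-pool over the token's characters
    diac_vec = diac_emb[diac_ids].mean(axis=0)   # mean-pool over the token's diacritics
    return np.concatenate([word_emb[word_id], char_vec, diac_vec, pos_emb[pos_id]])

vec = token_representation(5, [3, 7, 9], [1, 2, 1], 4)
print(vec.shape)  # (95,) = 50 + 25 + 10 + 10
```

The resulting per-token vectors would then feed a standard sequence labeler (e.g. a BiLSTM-CRF); the point of the sketch is that diacritic and POS information enter the model as additional concatenated embedding channels rather than being discarded in preprocessing.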
Anthology ID:
2020.wanlp-1.13
Volume:
Proceedings of the Fifth Arabic Natural Language Processing Workshop
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Imed Zitouni, Muhammad Abdul-Mageed, Houda Bouamor, Fethi Bougares, Mahmoud El-Haj, Nadi Tomeh, Wajdi Zaghouani
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
139–154
URL:
https://aclanthology.org/2020.wanlp-1.13
Cite (ACL):
Ahmed Younes and Julie Weeds. 2020. Embed More Ignore Less (EMIL): Exploiting Enriched Representations for Arabic NLP. In Proceedings of the Fifth Arabic Natural Language Processing Workshop, pages 139–154, Barcelona, Spain (Online). Association for Computational Linguistics.
Cite (Informal):
Embed More Ignore Less (EMIL): Exploiting Enriched Representations for Arabic NLP (Younes & Weeds, WANLP 2020)
PDF:
https://aclanthology.org/2020.wanlp-1.13.pdf