LAMAD: A Linguistic Attentional Model for Arabic Text Diacritization

Raeed Al-Sabri, Jianliang Gao


Abstract
In Arabic, diacritics specify both meaning and pronunciation. However, they are often omitted from written text, which increases the number of possible meanings and pronunciations. This makes the text ambiguous and makes computational processing of undiacritized text more difficult. In this paper, we propose a Linguistic Attentional Model for Arabic text Diacritization (LAMAD). LAMAD introduces a new linguistic feature representation that uses both word and character contextual features. A linguistic attention mechanism is then proposed to capture the important linguistic features. In addition, we explore the impact of linguistic features extracted from the text on Arabic text diacritization (ATD) by introducing them to the linguistic attention mechanism. Extensive experimental results on three datasets of different sizes show that LAMAD outperforms existing state-of-the-art models.
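To make the task concrete, the following minimal sketch shows what "undiacritized text" means at the character level and how ATD is commonly framed as assigning a diacritic label to each base character. It is an illustration only, not the LAMAD model or its preprocessing: the helper names `strip_diacritics` and `to_char_labels` are hypothetical, and the code assumes the diacritics of interest are the eight Unicode marks from U+064B (fathatan) to U+0652 (sukun).

```python
# Illustrative only: not the LAMAD preprocessing pipeline.
# Assumption: the diacritics are the Unicode marks U+064B..U+0652
# (fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun).
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Return the text as it is usually written, with diacritics omitted."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def to_char_labels(text: str) -> list[tuple[str, str]]:
    """Pair each base character with the diacritics that follow it.

    An empty label means the character carries no diacritic.
    """
    pairs: list[tuple[str, str]] = []
    for ch in text:
        if ch not in DIACRITICS:
            pairs.append((ch, ""))
        elif pairs:
            base, label = pairs[-1]
            pairs[-1] = (base, label + ch)
        # A diacritic with no preceding base character is dropped.
    return pairs


# The fully diacritized word "kataba" (he wrote): kaf, teh, beh, each with fatha.
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
undiacritized = strip_diacritics(kataba)
labels = to_char_labels(kataba)
```

Here `undiacritized` holds only the three consonants, a form that is compatible with several different readings, while `labels` is the per-character target a diacritization model would have to predict from that ambiguous input.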
Anthology ID:
2021.findings-emnlp.317
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2021
Month:
November
Year:
2021
Address:
Punta Cana, Dominican Republic
Editors:
Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Venue:
Findings
SIG:
SIGDAT
Publisher:
Association for Computational Linguistics
Pages:
3757–3764
URL:
https://aclanthology.org/2021.findings-emnlp.317
DOI:
10.18653/v1/2021.findings-emnlp.317
Cite (ACL):
Raeed Al-Sabri and Jianliang Gao. 2021. LAMAD: A Linguistic Attentional Model for Arabic Text Diacritization. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3757–3764, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Cite (Informal):
LAMAD: A Linguistic Attentional Model for Arabic Text Diacritization (Al-Sabri & Gao, Findings 2021)
PDF:
https://aclanthology.org/2021.findings-emnlp.317.pdf
Video:
https://aclanthology.org/2021.findings-emnlp.317.mp4
Data
Arabic Text Diacritization