LAMAD: A Linguistic Attentional Model for Arabic Text Diacritization

In the Arabic language, diacritics are used to specify meanings as well as pronunciations. However, diacritics are often omitted from written texts, which increases the number of possible meanings and pronunciations. This leads to ambiguous text and makes computational processing of undiacritized text more difficult. In this paper, we propose a Linguistic Attentional Model for Arabic text Diacritization (LAMAD). In LAMAD, a new linguistic feature representation is presented, which utilizes both word and character contextual features. Then, a linguistic attention mechanism is proposed to capture the important linguistic features. In addition, we explore the impact of the linguistic features extracted from the text on Arabic text diacritization (ATD) by introducing them to the linguistic attention mechanism. Extensive experimental results on three datasets of different sizes illustrate that LAMAD outperforms existing state-of-the-art models.


Introduction
Arabic is one of the most widely spoken Semitic languages, the official language of about 27 countries, spoken by more than 400 million people around the world (Ma et al., 2020). Diacritics are marks added above or below letters to give a word its correct meaning and pronunciation (Alansary, 2018). However, more than 97% of Arabic texts (e.g., magazines, newspapers, books, etc.) are not diacritized (Neme and Paumier, 2020), which increases text ambiguity and poses a challenge for computational models built on diacritized text (Abbad and Xiong, 2020; Hadjir et al., 2019). For example, machine translation of undiacritized Arabic sentences faces difficulties. Figure 1 shows the translation results for two Arabic sentences, without and with diacritization, using Google Translate. Noticeably, the undiacritized sentences are translated incorrectly: one undiacritized sentence is translated as "According to Ahmed, his money." while the correct meaning is "Ahmed calculated his money." The main reason for this failure is that an undiacritized word admits too many possible meanings; the exact meaning is revealed by the word's context and diacritization, which is difficult for machine translation to recover even in the presence of contextual undiacritized words. The diacritized sentences, on the other hand, are translated correctly because the diacritics specify the word meanings. Arabic diacritizers can help resolve this ambiguity and improve the performance of various NLP applications that rely on diacritized text: automatic speech recognition (Abed et al., 2019), Arabic machine translation (Ameur et al., 2020), text-to-speech (Zine and Meziane, 2017), and Part-of-Speech (POS) tagging (AbuZeina and Abdalbaset, 2019); indexing diacritized text also enables search engines to exclude unwanted matches.
Recent diacritization models adopting convolutional or recurrent neural networks have improved Arabic text diacritization (ATD) (Mubarak et al., 2019; Alqahtani et al., 2020; Abbad and Xiong, 2020; Fadel et al., 2019b; Zalmout and Habash, 2017). However, most of these models are based on a character-level representation, which helps the model generalize but loses useful linguistic features such as part of speech, word number, etc. To solve this problem, we propose a novel linguistic attentional model in which we introduce a linguistic feature representation at the character level utilizing word and character linguistic features, and investigate their impact on ATD. Then, a linguistic attention mechanism for ATD is proposed to capture the most crucial features influencing word diacritization. Our main contributions are summarized as follows:

• We propose a novel linguistic attentional model for ATD by introducing a new linguistic feature representation utilizing word and contextual character features and presenting a linguistic attention mechanism that focuses on the effective features.
• We conduct extensive experiments on three benchmark datasets to explore the impact of linguistic features on ATD. The results show that linguistic features efficiently improve diacritization performance and that our proposed model outperforms various state-of-the-art models.
The Proposed LAMAD
Figure 2 shows the architecture of the proposed linguistic attentional model. In our model, we introduce a new linguistic feature representation to extract linguistic contextual features. Then, a linguistic attention mechanism for ATD is presented. The attention mechanism is adopted to distinguish the varying importance of those features and to capture the most crucial ones, which have a decisive effect on diacritization; to our knowledge, such a mechanism is designed for Arabic diacritization for the first time, and our experiments demonstrate its effectiveness.

Initial Linguistic Feature
We extract contextual linguistic features that affect ATD. In Arabic, the diacritization of a word is affected by linguistic features in the text context such as part of speech, gender, named entities, word number, etc. Therefore, we utilize these features in our model and present a new linguistic character-level representation that helps the model improve diacritization accuracy; to our knowledge, this representation is introduced for the diacritization problem for the first time (see Appendix A.3 for details).

Linguistic Feature Embedding
In our model, the aim of the linguistic feature embedding is to convert the sequence of linguistic features into a sequence of low-dimensional vectors. The linguistic feature embedding layer receives the character features and produces a predefined vector representation for each feature. Given a vector consisting of $T$ linguistic features $C = \{f_1, f_2, \ldots, f_T\}$, every feature $f_i$ is represented as a real-valued vector $x_i$. For each feature in $C$, the embedding matrix $C^{cf} \in \mathbb{R}^{d_c \times |L|}$ is looked up, where $L$ is the fixed-size set of character features and $d_c$ is the character-feature embedding size. The matrix $C^{cf}$ is a learned parameter, and $d_c$ is a hyper-parameter chosen by the user. The character linguistic feature $f_i$ is converted into its feature embedding $x_i$ using the matrix-vector multiplication

$$x_i = C^{cf} l_i,$$

where $l_i$ is a one-hot indicator vector of size $|L|$.
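As an illustration, the lookup above is equivalent to an embedding-table lookup. The following is a minimal sketch in PyTorch, not the authors' released code; the vocabulary size and embedding dimension are assumed for the example:

```python
import torch
import torch.nn as nn

NUM_FEATURE_VALUES = 512   # |L|: assumed size of the character-feature inventory
EMBED_DIM = 64             # d_c: the embedding size, a user-chosen hyper-parameter

# nn.Embedding stores C_cf as a (|L| x d_c) table; indexing row f_i is the same
# as the matrix-vector product C_cf * l_i with a one-hot l_i.
embedding = nn.Embedding(NUM_FEATURE_VALUES, EMBED_DIM)

feature_ids = torch.tensor([[3, 17, 42]])  # a toy sequence of T = 3 feature indices
embs = embedding(feature_ids)              # shape: (1, 3, EMBED_DIM)
```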

Linguistic Feature Learning
Considering the close relevance between successive characters, we utilize a Bi-LSTM as an encoder to capture features from both directions. The input to this layer is the sequence of embedded character-level linguistic features, a real-valued vector sequence $\text{embs} = \{x_1, x_2, \ldots, x_T\}$. An LSTM is composed of three main gates: the forget gate $f_t$, which removes unnecessary information; the input gate $i_t$, which adds information to the cell; and the output gate $o_t$, which filters and outputs the necessary information. The current input $x_t$, the hidden state of the previous step $h_{t-1}$, and the previous cell state $c_{t-1}$ (peephole) are used to decide whether to take the inputs, forget the stored memory, and output the state, as expressed by the following equations:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1} + b_f)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1} + b_i)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o c_t + b_o)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid function and $f$, $i$, $c$, and $o$ are the forget, input, memory cell activation, and output vectors, respectively. The bias vectors $b$ and the weight matrices $W$, $U$, and $V$ are learned during training. Each Bi-LSTM cell is composed of two LSTM cells: one processes the input from right to left and the other from left to right.

Figure 2: A Linguistic Attentional Model for ATD (LAMAD).
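A minimal sketch of the encoder in PyTorch follows. Note that nn.LSTM implements the standard LSTM without the peephole connections described above, and all sizes are assumed for illustration:

```python
import torch
import torch.nn as nn

EMBED_DIM, HIDDEN_DIM = 64, 128   # assumed sizes for illustration

# One LSTM reads the feature sequence left-to-right, the other right-to-left;
# their hidden states are concatenated at each position.
encoder = nn.LSTM(EMBED_DIM, HIDDEN_DIM, batch_first=True, bidirectional=True)

embs = torch.randn(1, 10, EMBED_DIM)   # a toy batch: T = 10 embedded features
H, _ = encoder(embs)                   # H: (1, 10, 2 * HIDDEN_DIM), one h_i per position
```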

Linguistic Attention mechanism
The linguistic attention mechanism aims to capture the linguistic features most effective for ATD. Let $H$ be a matrix consisting of the output vectors $[h_1, h_2, \ldots, h_T]$ produced by the Bi-LSTM layer, where $T$ is the length of the character linguistic feature vector. The attention weight $\alpha_i$ is computed as follows:

$$\alpha_i = \frac{\exp\left(w^{\top}\tanh(h_i) + b\right)}{\sum_{j=1}^{T}\exp\left(w^{\top}\tanh(h_j) + b\right)}$$

where $w$ and $b$ are trained parameters of the attention layer; the resulting weights correspond to the features in the input character sequence. The output representation $r_i$ is then given by:

$$r_i = \alpha_i h_i$$
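One plausible implementation of these equations, sketched in PyTorch; the exact parameterization of $w$ and $b$ as a single linear unit is our assumption, since only their roles are stated above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinguisticAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # A single linear unit holds the trained vector w and bias b.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, H):                    # H: (batch, T, hidden_dim) from the Bi-LSTM
        scores = self.score(torch.tanh(H))   # w^T tanh(h_i) + b, shape (batch, T, 1)
        alpha = F.softmax(scores, dim=1)     # attention weight alpha_i per position
        return alpha * H                     # r_i = alpha_i * h_i

attn = LinguisticAttention(256)
r = attn(torch.randn(1, 10, 256))            # (1, 10, 256) weighted representations
```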

Experiments and Results

Datasets
Three diacritized datasets covering various genres and sizes are used: the Quran, the holy Islamic book and the most accurately diacritized dataset (Hamed and Zesch, 2017); Tashkeela (Fadel et al., 2019a), assembled from articles, books, speeches, etc.; and Sahih Al-Bukhary (Al-Thubaity et al., 2020), a collection of Islamic hadith. The datasets include small and large corpora with long and short sentences. Table 1 shows the dataset statistics.

Evaluation Metrics
Diacritization Error Rate (DER) and Word Error Rate (WER) are the two main evaluation metrics used to assess the performance of an Arabic diacritizer (Hamed and Zesch, 2017). DER is the proportion of characters labeled with an incorrect diacritic. WER is the percentage of words in which at least one letter is incorrectly diacritized.
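As a concrete reading of these definitions, the following sketch computes both metrics from aligned gold and predicted diacritic labels; the per-word list-of-labels format is our assumption:

```python
def der_wer(gold_words, pred_words):
    """gold_words/pred_words: aligned lists of words, each word a list of
    per-letter diacritic labels. Returns (DER, WER)."""
    char_total = char_errors = word_errors = 0
    for gold, pred in zip(gold_words, pred_words):
        errors = sum(g != p for g, p in zip(gold, pred))
        char_total += len(gold)
        char_errors += errors
        word_errors += errors > 0          # a word counts as wrong if any letter is
    return char_errors / char_total, word_errors / len(gold_words)

# Toy example: the second word has one mislabeled letter.
der, wer = der_wer([["a", "u"], ["i", "o"]], [["a", "u"], ["i", "a"]])
print(der, wer)                            # 0.25 0.5
```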

Baselines
To evaluate the performance of our model (LAMAD), we compare it with five state-of-the-art models that use the character-level representation. We also compare our model to a hybrid system that uses morphological and syntactic diacritization rules together with statistical treatments (Chennoufi and Mazroui, 2017) and to a Byte Pair Encoding (BPE) method with a sub-word unit dictionary (Hifny, 2019); both are computationally costly and fail to generalize to text in different contexts.

Preprocessing
Because the datasets do not have a unified structure, a preprocessing step is performed: the datasets are divided into lines, each containing one sentence, with "?", "!", and "." used as separators. Then, we remove extra diacritics, such as the Sukun on " ", duplicated diacritics, and diacritics that appear on non-Arabic letters. We also unify the position of compound diacritics, e.g., the Shaddah must come first, and a diacritic that appears before its letter is moved after it. Moreover, lines that are undiacritized or have less than 80% of their characters diacritized are removed.
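A minimal sketch of this pipeline, assuming the standard Unicode ranges for Arabic letters and diacritics; the exact filtering rule is our reading of the 80% threshold:

```python
import re

ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
DIACRITIC = re.compile(r"[\u064B-\u0652]")   # Fathatan .. Sukun

def split_sentences(text):
    """Split the raw text into sentences on '.', '?' and '!'."""
    return [s.strip() for s in re.split(r"[.?!]", text) if s.strip()]

def diacritized_ratio(line):
    """Fraction of Arabic letters immediately followed by a diacritic mark."""
    letters = [i for i, ch in enumerate(line) if ARABIC_LETTER.match(ch)]
    if not letters:
        return 0.0
    marked = sum(1 for i in letters
                 if i + 1 < len(line) and DIACRITIC.match(line[i + 1]))
    return marked / len(letters)

raw_text = "..."  # the raw corpus, loaded elsewhere

# Keep only sentences with at least 80% of their letters diacritized.
corpus = [s for s in split_sentences(raw_text) if diacritized_ratio(s) >= 0.8]
```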

Results and Comparisons
In this paper, we first investigate the performance of linguistic features on ATD using the proposed linguistic attentional model. We explore the impact of different linguistic representations in order to choose the most effective one. In our work, we also report the diacritization errors with and without case endings. In addition, we use the redefinition of DER and WER by Fadel et al. (2019b), in which irrelevant characters such as punctuation and numbers are excluded when counting the percentage of mislabeled characters. Tables 5 and 6 show DER and WER with and without case endings, both including and excluding characters that carry no diacritic.
For a more in-depth analysis, we randomly choose 100 words with diacritization errors from each test dataset to analyze the types of common errors. Table 7 displays the number of words with one, two, and three or more diacritization errors. The results show that the model behaves similarly across datasets in terms of the number of diacritization errors per word; for example, most words (88 on average) have a single diacritic error. These observations demonstrate the consistency of the model's performance regardless of data type. Table 8 shows the diacritization errors that appear at the beginning, middle, and end of words. We observe that the model is sensitive to syntactic roles, since most of the errors appear at the end of words. Table 9 shows the distribution of diacritization errors over three major Arabic POS categories: verbs, nouns, and particles. We observe that most errors appear on nouns, mainly because nouns occur more frequently in Arabic.

Conclusion
In this paper, we propose a linguistic attentional model to tackle the Arabic diacritization problem. A new feature representation method is presented, and the impact of morphological and syntactic information, extracted as features, is investigated. To evaluate the proposed system, three diacritized Arabic corpora are used; two of them are small datasets and one is a large dataset with long sentences. The proposed LAMAD achieves the best results compared with the state-of-the-art models.

Table 7: Number of sampled words (out of 100 per dataset) with one, two, and three or more diacritization errors.

Dataset           One   Two   Three or more
Quran             59    36    5
Sahih AlBukhary   87    10    3
Tashkeela         82    16    2

A.2 Arabic Diacritics
An Arabic word is composed of letters (the Arabic alphabet), which are always written, and diacritics (Table 10), which are omitted in most written Arabic texts because adding them is time-consuming and relies on Arabic linguistic experts. Diacritics are marks that appear above or below a word's letters, giving it its pronunciation, meaning, syntactic role, and form distinction in various languages such as Arabic (Rashwan et al., 2015) and Yorùbá (Orife, 2018).

A.3 Linguistic Features

Characters

Because the diacritization of an Arabic word is affected by the sentence context, each sentence character is represented as a 40-dimensional or 60-dimensional vector for the short- and long-sentence datasets, respectively. The first half of the vector holds the undiacritized characters before the current character in the sentence, while the second half holds the current character and the undiacritized characters after it. A padding token is used when there is no character to feed (a construction sketched in code at the end of this appendix).

Prior

This feature is represented by a binary 15-dimensional vector for each character, indicating, for each Arabic diacritic mark, whether the character can accept it, as decided from the diacritics observed per word segment in the training set.

Part of Speech (POS)

The diacritization of an Arabic word varies according to its POS (Chennoufi and Mazroui, 2017). Determining the POS of the word in which the character appears can help the model predict the appropriate character diacritic. The POS tagger presented in (Zhang et al., 2015) is used.

Gender and Number

The agreement of gender (male, female, or unknown) and number (singular, plural, dual, or unknown) of the word may allow or disallow specific case-ending diacritization. The number/gender tagger introduced in (Zhang et al., 2015) is used to extract gender and number information.

Named Entity

This feature is a binary value that determines whether the word in which the character appears is a named entity. Arabic named entities mostly have Sukun case endings; as a result, this feature may help predict the case-ending diacritic of a named-entity word. The simple approach for named entity recognition from Arabic text presented in (Darwish and Gao, 2014) is used to extract the named entities from the corpora.

Segment Position
The position of a character within a word segment may affect its diacritization. For example, a character at the beginning or in the middle of a segment never carries Tanween diacritics. We mark characters at the beginning, middle, or end of a segment as "B", "M", and "E", respectively; a character forming a single-character segment is marked as "S" (see the sketch below). The Farasa segmenter (Darwish and Mubarak, 2016), which achieves high segmentation accuracy, is used.
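For illustration, tagging the characters of a segment with these position marks reduces to a few lines (a sketch, not the paper's code):

```python
def position_tags(segment):
    """Mark each character of a word segment as B/M/E, or S for a
    single-character segment."""
    if len(segment) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(segment) - 2) + ["E"]

print(position_tags("ktb"))   # ['B', 'M', 'E']
print(position_tags("w"))     # ['S']
```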

Affixes and Stem
Determining whether a character appears in an affix or in the word stem influences its diacritization (Chennoufi and Mazroui, 2017). Each character is represented by a binary 3-dimensional vector indicating whether it occurs in the prefix, stem, or suffix part of the word.

Case Ending

A binary feature that determines whether the character expects a case-ending diacritic or a core-word diacritic.
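Finally, a hedged sketch of the fixed-size character-context window from the Characters feature above; the padding scheme is our reading of the description, and char_window and PAD are illustrative names:

```python
PAD = "<pad>"

def char_window(sentence, idx, size=40):
    """Context for sentence[idx]: the first half holds the preceding characters,
    the second half the current character and those after it, padded with PAD
    where the sentence runs out."""
    half = size // 2
    before = list(sentence[max(0, idx - half):idx])
    after = list(sentence[idx:idx + half])
    before = [PAD] * (half - len(before)) + before
    after = after + [PAD] * (half - len(after))
    return before + after

# Toy undiacritized example with a window of 8 instead of 40:
print(char_window("ktab", 1, size=8))
# ['<pad>', '<pad>', '<pad>', 'k', 't', 'a', 'b', '<pad>']
```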