Improving Arabic Diacritization with Regularized Decoding and Adversarial Training

Arabic diacritization is a fundamental task for Arabic language processing. Previous studies have demonstrated that automatically generated knowledge can be helpful to this task. However, these studies regard the auto-generated knowledge instances as gold references, which limits their effectiveness since such knowledge is not always accurate and inferior instances can lead to incorrect predictions. In this paper, we propose to use regularized decoding and adversarial training to appropriately learn from such noisy knowledge for diacritization. Experimental results on two benchmark datasets show that, even with quite flawed auto-generated knowledge, our model can still learn adequate diacritics and outperform all previous studies on both datasets.


Introduction
Modern standard Arabic (MSA) is generally written without diacritics, which poses a challenge to text processing and understanding in downstream applications, such as text-to-speech generation (Drago et al., 2008) and reading comprehension (Hermena et al., 2015). Restoration of such diacritics, known as diacritization, is therefore an important task for Arabic natural language processing (NLP). Among different diacritization methods (Pasha et al., 2014; Shahrour et al., 2015; Zitouni et al., 2006; Habash and Rambow, 2007; Darwish et al., 2017), the neural ones (Abandah et al., 2015a; Fadel et al., 2019a,b; Zalmout and Habash, 2019, 2020; Darwish et al., 2020) achieve the best performance due to their better capability in incorporating contextual features. To further improve diacritization, automatically generated knowledge from off-the-shelf toolkits, such as morphological features, part-of-speech tags, and automatic diacritization results, has been extensively applied to this task (Zitouni et al., 2006; Arabiyat, 2015; Darwish et al., 2017, 2020). However, current models treat such knowledge instances as gold references and always directly concatenate them with input embeddings (Arabiyat, 2015; Darwish et al., 2020), which may lead to inferior results since the knowledge may be inaccurate, especially if the toolkits were trained on data with different criteria. The code and models involved in this paper are released at https://github.com/cuhksz-nlp/AD-RDAT.
Diacritization can be performed by character-based sequence labeling (Zitouni et al., 2006; Belinkov and Glass, 2015; Fadel et al., 2019b). We follow this paradigm and propose a neural approach in this paper, using regularized decoding and adversarial training to incorporate auto-generated knowledge (i.e., the diacritization results generated from off-the-shelf toolkits). Specifically, the regularized decoder treats the auto-generated knowledge as separate gold labels and learns to predict them in a separate decoding process, which is then used to update the main model. The adversarial training is applied to the encoding process by determining whether the diacritization for an input follows the gold label or the auto-generated knowledge. In doing so, our model can dynamically distinguish between auto-generated knowledge instances instead of treating them all as gold references, so as to effectively identify what knowledge should be leveraged for different inputs. Importantly, regularized decoding and adversarial training are applied exclusively at the training stage; we only need the main tagger for inference once the model has been trained. Experimental results and further analyses illustrate the effectiveness of our approach, where our model outperforms strong baselines and achieves state-of-the-art results on two benchmark datasets: the Arabic Treebank (ATB) (Maamouri et al., 2004) and Tashkeela (Zerrouki and Balla, 2017).

The Proposed Approach
As shown in Figure 1, our approach for diacritization follows the sequence labeling paradigm, with two training stages for the main tagger (M). In the first training stage (presented in the orange box in Figure 1), M is enhanced by regularized decoding (RD) and adversarial training (AT) to discriminatively learn from the auto-generated labels. Specifically, given an input Arabic character sequence X = x_1 ··· x_i ··· x_n, M and RD aim to predict two types of diacritization labels, Y and Y^K, which follow the gold and auto-generated label criteria, respectively. AT ensures that the main tagger only learns useful information from either gold or auto-generated labels. Therefore, the first training stage can be conceptually formalized by

$$\hat{Y} = M(H^S), \quad \hat{Y}^K = RD(H^S), \quad H^S = SE(X) \qquad (1)$$

where H^S denotes the output vectors of the shared encoder SE (whose input is X) that is designed to learn the information shared by the gold and auto-generated labels. As a result, the goal of this training stage is to minimize the loss defined by

$$L = L_M + L_K + L_A \qquad (2)$$

where L_M, L_K and L_A refer to the losses that come from M, RD, and AT, respectively.
Afterwards, in the second training stage (presented in the green box in Figure 1), M is further trained alone on the gold labels Y, without using the auto-generated Y^K, RD, or AT, to fine-tune its parameters, where all parameters in SE obtained through the first training stage are fixed. For inference, only M is used, without requiring any additional input other than X, to obtain the diacritization results. In the following sections, we first describe M and then elaborate on the details of RD and AT.
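To make the two training stages concrete, below is a minimal PyTorch-style sketch of how they could be organized, using the components described in the following subsections; all module names and interfaces (shared_enc, main_tagger, reg_decoder, discriminator and their loss/adv_loss methods) are illustrative assumptions and are not taken from the released AD-RDAT code.

```python
# Illustrative two-stage training schedule (a sketch, not the released code).
# Stage 1 jointly minimizes L = L_M + L_K + L_A; stage 2 fine-tunes the main
# tagger M on the gold labels alone, with the shared encoder SE frozen.

def train_stage1(shared_enc, main_tagger, reg_decoder, discriminator,
                 loader, optimizer, lam=1.0):
    for char_emb, gold_labels, auto_labels, source in loader:
        h_s = shared_enc(char_emb)                             # H^S = SE(X)
        loss_m = main_tagger.loss(char_emb, h_s, gold_labels)  # L_M
        loss_k = reg_decoder.loss(h_s, auto_labels)            # L_K
        loss_a = discriminator.adv_loss(h_s, source, lam)      # L_A
        loss = loss_m + loss_k + loss_a                        # total stage-1 loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def train_stage2(shared_enc, main_tagger, loader, optimizer):
    for p in shared_enc.parameters():     # SE parameters are fixed in stage 2
        p.requires_grad_(False)
    for char_emb, gold_labels, _, _ in loader:
        h_s = shared_enc(char_emb)
        loss = main_tagger.loss(char_emb, h_s, gold_labels)    # L_M only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```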

The Main Tagger
Figure 1: The architecture of our model, where the left shows the main tagger (M) and the right shows the regularized decoding (RD) and adversarial training (AT) modules. The diacritization labels for an example following different criteria are illustrated in M and RD, with the mismatching labels marked in green and red. E.g., for " " (highlighted in yellow), its gold and auto-generated labels are "#" (null) and "o" (sukun).

The main tagger uses an encoder-decoder architecture, as shown in Figure 1, in which a shared encoder SE and a private encoder PE_M are applied to model the contextual information. Particularly, SE is proposed to facilitate the process of leveraging auto-generated knowledge, and is expected to learn information shared by the gold labels and the auto-generated knowledge. It takes the character embeddings of X (the embedding of x_i is denoted as e_i) as input and encodes them into the shared hidden vectors (the one for x_i is denoted as h^S_i). Similarly, PE_M is also applied to the embeddings and produces the result h^M_i. Then, we concatenate h^S_i and h^M_i and map the resulting vector to the output space with a fully connected layer:

$$o^M_i = W^M \left( h^S_i \oplus h^M_i \right) + b^M \qquad (3)$$

where ⊕ is concatenation and W^M and b^M are the trainable matrix and bias vector, respectively. Finally, a softmax decoder is applied to o^M_i to predict the label y_i:

$$\hat{y}_i = \arg\max_{t \in T} \frac{\exp\left(o^{M,t}_i\right)}{\sum_{t' \in T} \exp\left(o^{M,t'}_i\right)} \qquad (4)$$

where T denotes the set of all diacritization labels and o^{M,t}_i is the value at dimension t in o^M_i. Therefore, the loss for M is

$$L_M = - \sum_{i=1}^{n} \log p\left(y^*_i \mid X\right) \qquad (5)$$

where p(y^*_i | X) denotes the probability of labeling x_i with the gold label y^*_i.
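As a rough illustration of the main tagger described above, the following PyTorch sketch (class names, layer choices, and dimensions are hypothetical, not the released implementation) runs a private BiLSTM over the character embeddings, concatenates h^M_i with the shared vectors h^S_i, maps the result to the label space with a fully connected layer, and computes L_M as the negative log-likelihood of the gold labels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MainTagger(nn.Module):
    """Sketch of the main tagger M; dimensions and layers are illustrative."""

    def __init__(self, emb_dim, private_dim, shared_dim, num_labels):
        super().__init__()
        # Private encoder PE_M over the character embeddings e_i.
        self.private_enc = nn.LSTM(emb_dim, private_dim,
                                   batch_first=True, bidirectional=True)
        # Fully connected layer mapping [h^S_i ; h^M_i] to the label space T.
        self.fc = nn.Linear(shared_dim + 2 * private_dim, num_labels)

    def forward(self, char_emb, h_shared):
        h_private, _ = self.private_enc(char_emb)               # h^M_i
        o = self.fc(torch.cat([h_shared, h_private], dim=-1))   # o^M_i
        return o                                                 # logits over T

    def loss(self, char_emb, h_shared, gold_labels):
        # L_M: negative log-likelihood of the gold diacritics y*_i.
        logits = self.forward(char_emb, h_shared)   # (batch, seq, |T|)
        return F.cross_entropy(logits.transpose(1, 2), gold_labels)
```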

Regularized Decoding
Table 1: Experimental results (i.e., DER and WER, with and without the case ending being considered, and accuracy) of the baselines and our models with RD and AT using AraBERT (a) and ZEN 2.0 (b) on the test sets of ATB and Tashkeela. "BiLSTM" and "Transformer" denote the encoders (i.e., SE and PE) used in the models.

When leveraging auto-generated knowledge, it is important to note that such knowledge may be inaccurate or follow different annotation criteria, which needs to be appropriately addressed to prevent the noise in the auto-generated knowledge from significantly hurting the model performance (Tang et al., 2020; Nie et al., 2020; Chen et al., 2020; Mandya et al., 2020; Tian et al., 2020a,b, 2021a; Chen et al., 2021). To tackle this challenge, we propose to learn from a special decoding process, which is integrated into the main diacritization model, in order to reduce error propagation compared to directly using the knowledge instances or their features. As shown in Figure 1, the proposed regularized decoding is an extra output process separated from the main tagger and performed on another sequence of labels Y^{K*}, which are the auto-generated knowledge instances (diacritization labels) annotated by an existing toolkit. Therefore, the loss L_K from RD is computed through

$$L_K = - \sum_{i=1}^{n} \log p\left(y^{K*}_i \mid X\right) \qquad (6)$$

where p(y^{K*}_i | X) is the probability that RD assigns to the auto-generated label y^{K*}_i of x_i. In the first training stage, all trainable parameters in SE are updated through the information back-propagated from RD.
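A minimal sketch of such a regularized decoder is given below, under the same hypothetical PyTorch setup as the main tagger sketch (the actual decoder in the released code may be richer): a separate classification head over the shared vectors h^S_i whose loss L_K is the negative log-likelihood of the auto-generated labels, so that the knowledge reaches the model only through gradients flowing into SE.

```python
import torch.nn as nn
import torch.nn.functional as F

class RegularizedDecoder(nn.Module):
    """Sketch of RD: an extra decoding head over the shared vectors h^S_i
    that is trained to predict the auto-generated labels Y^{K*}."""

    def __init__(self, shared_dim, num_labels):
        super().__init__()
        self.fc = nn.Linear(shared_dim, num_labels)

    def loss(self, h_shared, auto_labels):
        # L_K: negative log-likelihood of the auto-generated labels y^{K*}_i.
        logits = self.fc(h_shared)   # (batch, seq, |T|)
        return F.cross_entropy(logits.transpose(1, 2), auto_labels)
```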

Adversarial Training
Although auto-generated knowledge can be back-propagated through RD, it could be overwhelmed by the information directly learned from the gold labels. We further improve our model by balancing the information learned from both M and RD with AT, which is proposed to equalize both sides and emphasize the shared information from them. In doing so, we connect a discriminator, which is a binary classifier, to SE. The discriminator takes all h^S_i from SE, averages them by h̄^S = (1/n) Σ_{i=1}^{n} h^S_i, and then passes the resulting vector to a fully connected layer with a softmax function to compute its bias towards either type (i.e., the gold or auto-generated) of diacritization labels:

$$[p_m, p_k] = \mathrm{softmax}\left(W^D \cdot \bar{h}^S + b^D\right) \qquad (7)$$

where W^D and b^D are the trainable matrix and bias vector, respectively, that map h̄^S to a two-dimensional vector, with p_m and p_k representing normalized probabilities that satisfy p_m + p_k = 1 and indicating the bias of SE towards gold and auto-generated labels, respectively. Then we apply a negative log-likelihood loss function to the discriminator, formalized as

$$L_D = - \log p_c, \quad c \in \{m, k\} \qquad (8)$$

where c indicates whether the current training instance follows the gold or the auto-generated labels, and an adversarial loss to the parameters in SE via

$$L_S = - \left( p_m \log p_m + p_k \log p_k \right) \qquad (9)$$

As a result, the goal of AT is to minimize the loss

$$L_A = L_D - \lambda L_S \qquad (10)$$

where λ is a positive coefficient that controls the influence of L_S in the adversarial training, so as to minimize L_D and maximize L_S synchronously.
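The sketch below shows one possible implementation of the discriminator and the adversarial loss under the same hypothetical PyTorch setup; in particular, the entropy-style term used here for L_S is one common way of pushing SE towards representations the discriminator cannot separate, and it is an assumption rather than a statement of the paper's exact formulation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SourceDiscriminator(nn.Module):
    """Sketch of the AT discriminator: mean-pools h^S_i over the sequence and
    predicts whether the current instance uses gold or auto-generated labels."""

    def __init__(self, shared_dim):
        super().__init__()
        self.fc = nn.Linear(shared_dim, 2)   # logits for [p_m, p_k]

    def adv_loss(self, h_shared, source, lam):
        # source: 0 for gold labels (m), 1 for auto-generated labels (k).
        pooled = h_shared.mean(dim=1)                  # average of h^S_i
        log_p = F.log_softmax(self.fc(pooled), dim=-1)
        loss_d = F.nll_loss(log_p, source)             # L_D
        p = log_p.exp()
        loss_s = -(p * log_p).sum(dim=-1).mean()       # entropy-style L_S
        return loss_d - lam * loss_s                   # L_A = L_D - lam * L_S
```

Minimizing the returned value lowers L_D while raising L_S, which matches the description above of training the discriminator and SE against each other.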

Overall Results
In the main experiment, we run the baselines and our models under different configurations (i.e., using AraBERT or ZEN 2.0 embeddings and using BiLSTM or Transformer encoders) with and without RD and AT. The experimental results (DER and WER with and without considering the case endings, and accuracy) on the test sets of ATB and Tashkeela are reported in Table 1. There are several observations. First, under different configurations (i.e., using AraBERT or ZEN 2.0 and with BiLSTM or Transformer encoders), RD improves the baseline on both datasets, which shows that RD effectively helps diacritization with auto-generated knowledge even if it follows different criteria. Second, further consistent improvement can be observed when AT is applied on top of RD, with only 3K (0.015‰ of the entire model size) more trainable parameters required to achieve this effect. These observations confirm the effectiveness of forcing SE to learn from the information shared by gold and auto-generated labels with an appropriate model design.

Figure 2: An example input sentence and its diacritization results ("∼u", "#", "i" and "u") from Farasa, BiLSTM, and our approach (BiLSTM+RD+AT) with AraBERT. All results matching gold labels are highlighted in green; the mismatching results from Farasa and BiLSTM are in orange and red, respectively.
In addition, we also compare the results of our best models (with RD and AT) with previous studies (including Farasa's results) on the test sets of both datasets. The results are shown in Table 2, where our model with the BiLSTM encoder outperforms previous models and achieves state-of-the-art performance on both datasets.

Case Study
To explore how our approach with RD and AT leverages auto-generated knowledge, we conduct a case study on an example sentence from the test set of ATB. The input and its diacritization results from Farasa, the BiLSTM baseline, and our approach with AraBERT (BiLSTM+RD+AT) are illustrated in Figure 2, where the correct diacritization results are highlighted in green, and the incorrect ones from Farasa and BiLSTM are highlighted in orange and red, respectively. It is clearly observed that our approach leverages the necessary information learned from Farasa (i.e., the "∼u" label) and prevents its unreliable results from affecting the final diacritics. Specifically, for the highlighted Arabic character " ", where the Farasa output suggests the diacritic "i" (kasra), our approach leverages this knowledge and corrects the BiLSTM baseline. For the other two highlighted characters, although the Farasa output (i.e., "∼u" (Shadda+Damma) for " " and "#" (No Diacritic) for " ") also produces diacritization results that differ from the BiLSTM baseline and do not match the gold standard, our approach is able to learn from their patterns and make correct predictions. Therefore, although Farasa's output does not match the gold labels in most cases (see the Farasa results in Table 2), the proposed RD and AT can leverage such knowledge and improve the main tagger accordingly.

Conclusion
In this paper, we propose to incorporate auto-generated knowledge (diacritization labels following a different annotation criterion) for Arabic diacritization with regularized decoding and adversarial training. In detail, the regularized decoding treats the auto-generated knowledge as separate "gold" labels and learns to predict them in another decoding process; the adversarial training is used to ensure that the information shared by the gold and auto-generated labels is learned to help diacritization. With the regularized decoding and adversarial training, the main tagger in our approach is able to smartly leverage auto-generated knowledge provided by an existing diacritization tagger. Experimental results on two benchmark datasets illustrate the validity and effectiveness of our model, where state-of-the-art performance is obtained on both datasets.

Appendix F. Mean and Deviation of the Results
In the experiments, we test models with different configurations. For each model, we train it with the best hyper-parameter setting using five different random seeds. We report the mean (µ) and standard deviation (σ) of DER and WER (with case ending) on the test sets of ATB and Tashkeela in Table 7. Table 8 reports the number of trainable parameters and the inference speed (lines per second) of the baselines (i.e., BiLSTM and Transformer encoders with and without regularized decoding (RD)) and our models with both RD and adversarial training (AT) on ATB and Tashkeela. All models are run on NVIDIA Quadro RTX 6000 GPUs.