Pipeline Signed Japanese Translation Focusing on a Post-positional Particle Complement and Conjugation in a Low-resource Setting



Introduction
It is essential to build a social infrastructure that allows hearing-impaired and hearing communities to share sufficient information, so that they can quickly obtain information necessary in daily life and during disasters and lead a safe and secure life. Sign language used in deaf communities has different vocabulary and grammar from spoken language. There are two variations of sign language in Japan (Chonan, 2001): (1) Japanese Sign Language (JSL) and (2) Manually Coded Japanese (MCJ). JSL is often used by early signers, and its syntax, such as word order and language structure, differs from that of spoken Japanese. By contrast, the syntax of MCJ is similar to that of spoken Japanese in terms of word order. It is used by late signers or people with acquired hearing impairment. However, the two variations are said to be used interchangeably, and there is no clear boundary between them. In this work, we consider an intermediate between JSL and MCJ, and denote it as Signed Japanese (SJ) in the following discussion.
Translation from sign language to spoken language is typically performed in two steps. First, consecutive signs are recognized from a video signal and transformed into an intermediate representation called a gloss; then the gloss is translated into a sentence in the spoken language. Current state-of-the-art sign language recognition and translation methods (Camgöz et al., 2020; Yin and Read, 2020) require a large amount of data and pay little attention to differences between sign language and the corresponding spoken language. Therefore, the success of these approaches relies heavily on large paired corpora, and resource-poor sign language studies, including SJ, cannot take advantage of such approaches. In sign language, function words, such as pre-positional or post-positional particles and determiners, do not tend to be explicitly signed, and inflectional morphemes associated with verbal predicates that express categories, such as tense, mood, and aspect, are generally not manually signed. For example, in SJ, post-positional particles are generally not explicitly signed, and the signs associated with verbal predicates are not conjugated, whereas in Japanese, verbs, adjectives, and auxiliary verbs are conjugated. Therefore, the gloss does not include such language constructs. Gloss notation commonly writes, in capital letters, a series of spoken-language words that correspond to each sign; however, because of the lack of sign language resources, its quality and size differ greatly across languages (Bungeroth et al., 2008). SJ signs are heavily polysemous, and their meaning is often context sensitive. Additionally, there is no publicly available corpus for SJ translation studies. Therefore, in this study, we use an in-house corpus that uses our gloss notation method. The details of the corpus and its notation are described in Section 2.
To address these challenges, we propose a novel pipeline method to translate from SJ written in gloss to Japanese. In particular, we focus on the linguistic differences between SJ and Japanese and estimate the missing post-positional particles and the appropriate morphological inflection of words. Our method assumes that the ground truth gloss of the signed sentence is available. This assumption is compatible with the two-step sign language translation setting described above. Our method first uses phrase-based statistical machine translation to map the SJ gloss to Japanese words. Then we refine the results further using transformer-based seq2seq (Sutskever et al., 2014) models, which are trained using a large out-of-domain parallel corpus. Specifically, we use three different seq2seq models (1) to complement the post-positional particles, (2) to apply morphological inflection by conjugating verbs, adjectives, and auxiliary verbs, and (3) to re-estimate the post-positional particles over the previous output. We repeatedly apply these models sequentially and adjust the translation results until they converge.
The proposed method works robustly, even for small training datasets, which typically contain on the order of a thousand pairs; our results show that a state-of-the-art neural method is inferior to the SMT baseline in this low-resource setting. We found that iterative updates of translations are effective for improving the grammaticality and fluency of the translation output. Our experimental results show that the proposed model provides +4.4/+4.9 higher translation performance for BLEU3/BLEU4 scores compared with the SMT baseline.

Materials
We use two corpora: one is a small in-house SJ and Japanese parallel corpus, and the other is a large out-of-domain Japanese monolingual corpus. We describe the details of each corpus as follows.

SJ and Japanese parallel corpus
The locally organized in-house parallel corpus contains 1,086 sentence pairs with >7.5K glosses from a vocabulary of 655 words, and >11K Japanese words from a vocabulary of >1.2K words. The average length of a gloss sentence is 6.9 words, with a maximum length of 12 words and a minimum length of 2 words, and the average length of a Japanese sentence is 10.3 words, with a maximum length of 21 words and a minimum length of 5 words. The corpus consists of the ground truth gloss transcriptions of signs and their translations into Japanese sentences. The sentences are spontaneous conversational utterances of the kind that take place at municipal offices, such as asking for a certified copy of the resident register, pension, and unemployment insurance. In the corpus, a gloss word is written in the form gN, where N corresponds to an arbitrary unique number. We adopt this notation instead of using Japanese words because glosses in SJ are heavily polysemous, and a sign maps to different Japanese words depending on the context. Instead, we use an auxiliary dictionary to map each gloss to spoken words or phrases. This notation method also helps the proposed method to select the appropriate Japanese word or phrase within the phrase-based statistical machine translation model that we use in the study.
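The gloss-to-word mapping described above can be sketched as follows. The dictionary entries here are hypothetical (English stand-ins for the Japanese candidates); the same lookup with a first-candidate choice also corresponds to the 'naive' baseline evaluated later.

```python
# Hypothetical gloss dictionary: each gloss "gN" maps to one or more
# candidate spoken words or phrases (SJ glosses are polysemous).
GLOSS_DICT = {
    "g12": ["library", "book room"],
    "g34": ["borrow", "rent"],
    "g7":  ["want"],
}

def naive_translate(gloss_sentence):
    """Replace each gloss with the first dictionary candidate
    (a sketch of the 'naive' baseline; unknown glosses are kept as-is)."""
    out = []
    for g in gloss_sentence.split():
        candidates = GLOSS_DICT.get(g, [g])
        out.append(candidates[0])
    return " ".join(out)

print(naive_translate("g12 g34 g7"))  # -> "library borrow want"
```

The proper translation models instead choose among the candidates in context; the dictionary alone cannot resolve polysemy.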
Because of the sparsity of the parallel corpus, approximately 2.3% of the glosses are singletons, so we add all gloss dictionary items as additional parallel data to reduce OOV issues at test time.

Out-of-domain Japanese corpus
We use a subset of the Balanced Corpus of Contemporary Written Japanese 1 as an out-of-domain Japanese corpus to manually generate pseudo-parallel corpora. The details of the corpus generation procedure are described in Section 3. To select the subsets, we use pattern matching to select sentences whose sentence-final patterns express a question, admission, request, confirmation, intention, or desire. These patterns were chosen so that the selected sentences are similar to the target Japanese in the paired corpus. The monolingual corpus contains >195K sentences with >3.9M Japanese words from a vocabulary of >70K Japanese words.
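The subset selection described above can be sketched as a regex filter over sentence endings. The patterns below are illustrative English stand-ins; the actual Japanese sentence-final patterns from the paper are not reproduced here.

```python
import re

# Illustrative sentence-final patterns (assumed stand-ins for the paper's
# Japanese patterns expressing questions, requests, intentions, etc.).
PATTERNS = [
    re.compile(r"\?$"),                          # question
    re.compile(r"(want to|intend to) .*\.$"),    # intention / desire
    re.compile(r"please .*\.$", re.IGNORECASE),  # request
]

def select_subset(sentences):
    """Keep only sentences matching at least one sentence-final pattern."""
    return [s for s in sentences if any(p.search(s) for p in PATTERNS)]

corpus = [
    "Where is the city office?",
    "I want to apply for a pension.",
    "The weather was fine yesterday.",
]
print(select_subset(corpus))  # keeps the first two sentences
```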

Methodology
The overall proposed pipeline translation system is shown in Fig. 1. G represents a gloss sequence and S represents a Japanese sequence. We define two subscripts for S, that is, PP, which denotes 'post-positional particle,' and C, which denotes 'conjugation,' with the prefix + or − for each subscript, which denotes the existence or non-existence, respectively, of post-positional particles and conjugation. The definition of each term is provided in Table 1.

Translation method
The proposed pipeline translation consists of six steps. Algorithm 1 shows the steps applied in sequence to gradually convert G into S+PP+C, which is the final translation of this algorithm. The details of each step are as follows: Step 0: Translate G into S−PP−C We use phrase-based statistical machine translation (pbsmt) (Koehn et al., 2007) to translate G into S−PP−C. In this step, we map each gloss phrase to the appropriate Japanese phrases without considering the post-positional particles and conjugations of the output Japanese sequence.
Step 1: Translate S−PP−C into S+PP−C We use a transformer-based seq2seq model (s2s m1) (Vaswani et al., 2017). In this step, the model estimates missing post-positional particles in S−PP−C and inserts them to generate S+PP−C.
Step 2: Translate S +P P −C into S +P P +C We use another transformer-based seq2seq model (s2s m2) to translate S +P P −C into S +P P +C . In this step, the model estimates the appropriate morphological inflection or conjugated form for verbs, adjectives, and auxiliary verbs in S +P P −C to generate S +P P +C .
Step 3: Convert S +P P +C to S −P P +C In this step, we remove the previously estimated post-positional particles in S +P P +C from Step 2 to generate S −P P +C .
Step 4: Translate S−PP+C into S+PP+C We use the other transformer-based seq2seq model (s2s m3) to translate S−PP+C into S+PP+C by re-estimating missing post-positional particles. In Step 1, we estimated the post-positional particles over a Japanese sequence in the canonical form (S−PP−C); however, in the present step, we estimate them over the conjugated word sequence (S−PP+C). We assume that this corrects the previous estimation using the conjugated form of the word sequence obtained in the previous steps.
Step 5: Convert S+PP+C to S+PP−C In this step, we transform S+PP+C to S+PP−C by converting words in S+PP+C to their canonical form. Steps 3, 4, 5, and 2 are repeated until the translation output converges or the number of iterations reaches the maximum limit (10).

Algorithm 1 Translation of a sign gloss sequence into a Japanese sentence
  Step 0: G → S−PP−C
  Step 1: S−PP−C → S+PP−C
  Step 2: S+PP−C → S+PP+C
  Snext = S+PP+C
  while Snext has not converged and the iteration count < 10 do
    Step 3: S+PP+C → S−PP+C
    Step 4: S−PP+C → S+PP+C
    Step 5: S+PP+C → S+PP−C
    Step 2: S+PP−C → S+PP+C
    Snext = S+PP+C
  end while
  return Snext
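The control flow of the six steps can be sketched as follows. The translation models (pbsmt, s2s_m1, s2s_m2, s2s_m3) and the particle-removal and lemmatization utilities are stand-ins for the trained components described in the text.

```python
MAX_ITER = 10  # maximum number of refinement iterations, as in the paper

def translate(gloss, pbsmt, s2s_m1, s2s_m2, s2s_m3,
              drop_particles, to_canonical):
    """Sketch of the pipeline: Steps 0-2 once, then Steps 3-5 and 2
    repeated until the output converges or MAX_ITER is reached."""
    s = pbsmt(gloss)           # Step 0: G -> S-PP-C
    s = s2s_m1(s)              # Step 1: insert particles -> S+PP-C
    s = s2s_m2(s)              # Step 2: conjugate -> S+PP+C
    prev = None
    for _ in range(MAX_ITER):
        if s == prev:          # converged: output unchanged
            break
        prev = s
        t = drop_particles(s)  # Step 3: S+PP+C -> S-PP+C
        t = s2s_m3(t)          # Step 4: re-estimate particles -> S+PP+C
        t = to_canonical(t)    # Step 5: lemmatize -> S+PP-C
        s = s2s_m2(t)          # Step 2 again: conjugate -> S+PP+C
    return s
```

With identity functions plugged in, the loop converges immediately; with real models, each pass lets the particle estimator see the conjugated context produced by the previous pass.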

Statistical machine translation model in Step 0
We use Moses (Koehn et al., 2007) to train the phrase-based statistical machine translation model to translate from G to S−PP−C in Step 0. To train the model, we use the parallel corpus and preprocess the target Japanese sequences by deleting post-positional particles and converting conjugated words, such as verbs, adjectives, and auxiliary verbs, to their canonical forms using MeCab 2 . Note that we leave any post-positional particles untouched if gloss words corresponding to them exist. For example, the Japanese word ka, a post-positional particle bound to the end of an interrogative sentence, has a corresponding gloss word in SJ. The translation model is described by the following noisy-channel model, which estimates the best target Japanese sentence ŝ for a source gloss sentence g ∈ G as

  ŝ = argmax_{s ∈ S−PP−C} p(g|s) p_LM(s),

where p_LM(s) is a language model based on the n-grams of S−PP−C, and p(g|s) is decomposed into a phrase-based formula using a phrase translation table and a phrase reordering model (Koehn et al., 2007). For the language model, we use a 3-gram model with modified Kneser-Ney smoothing.
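The target-side preprocessing can be sketched as below. Rather than calling MeCab directly, the sketch assumes the sentence has already been tagged into (surface, part-of-speech, lemma) triples, which is the information MeCab provides; the token names are illustrative.

```python
def preprocess_target(tagged_tokens, keep_particles=frozenset({"ka"})):
    """Delete post-positional particles (except those with a corresponding
    gloss, e.g. sentence-final 'ka') and lemmatize conjugated words."""
    out = []
    for surface, pos, lemma in tagged_tokens:
        if pos == "particle":
            if surface in keep_particles:
                out.append(surface)  # explicitly signed in SJ, so keep it
            continue                 # all other particles are deleted
        if pos in ("verb", "adjective", "auxiliary_verb"):
            out.append(lemma)        # convert to canonical form
        else:
            out.append(surface)
    return out

sent = [("library", "noun", "library"), ("ni", "particle", "ni"),
        ("itta", "verb", "iku"), ("ka", "particle", "ka")]
print(preprocess_target(sent))  # -> ['library', 'iku', 'ka']
```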

Encoder-decoder translation models in Steps 1, 2, and 4
For the seq2seq models used in Steps 1, 2, and 4 in Section 3.1, we use a transformer-based encoder-decoder model (Ott et al., 2019). For all models, we use an encoder and decoder with an embedding of size 512, an FFN embedding of size 2048, and six layers with eight attention heads. We use the Adam optimizer with label-smoothed cross-entropy loss with a smoothing factor of 0.1. We set the initial learning rate to 5e-4 with 4,000 warmup updates and use the inverse sqrt learning rate scheduler. We set the maximum number of tokens in a batch to 4K. We use tied embeddings for the input and output layers. We obtained the hyperparameters from a non-exhaustive parameter search, and the results are shown in Table 9 of the Appendix. We randomly split the corpus into training, validation, and test sets in an 8:1:1 ratio. For tokenization, we use byte pair encoding (Sennrich et al., 2016), trained on the training dataset. We set the number of merge operations to 10K for the tokenizer of each seq2seq model. For the seq2seq model in Step 1, we pre-process the out-of-domain monolingual corpus as follows: • Source: preprocess Japanese sequences by deleting post-positional particles and converting all conjugated words, such as verbs, adjectives, and auxiliary verbs, to their canonical forms.
• Target: preprocess Japanese sequences by leaving post-positional particles untouched and converting all conjugated words, such as verbs, adjectives, and auxiliary verbs, to their canonical forms.
The pre-processed corpus becomes the pseudo-parallel corpus used to train the seq2seq model in Step 1 to translate S−PP−C into S+PP−C. The training corpora of the other seq2seq models, that is, s2s m2 in Step 2 and s2s m3 in Step 4, are similarly preprocessed and independently trained using the pseudo-parallel corpus. We observed that training the seq2seq model in Step 2 took more than a few hundred epochs, whereas training the seq2seq models in Steps 1 and 4 took fewer than 30 epochs. We used the models with the lowest validation loss for the experiments. The results on the test set showed BLEU4 scores of 74.20, 98.75, and 75.06 for s2s m1, s2s m2, and s2s m3, respectively. This indicates that post-positional particle estimation is more uncertain than the estimation of morphological inflection.
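The pseudo-parallel pair generation for the three seq2seq models can be sketched as below, again over (surface, part-of-speech, lemma) triples; the function and label names are illustrative.

```python
def strip(tokens, drop_particles=False, lemmatize=False):
    """Produce a word sequence with particles optionally deleted and
    conjugated words optionally reduced to their canonical forms."""
    out = []
    for surface, pos, lemma in tokens:
        if drop_particles and pos == "particle":
            continue
        if lemmatize and pos in ("verb", "adjective", "auxiliary_verb"):
            out.append(lemma)
        else:
            out.append(surface)
    return out

def make_pairs(tokens):
    """One tagged monolingual sentence yields one (source, target) training
    pair for each of the three seq2seq models."""
    s_pp_c = strip(tokens)                                            # S+PP+C
    s_pp_noc = strip(tokens, lemmatize=True)                          # S+PP-C
    s_nopp_c = strip(tokens, drop_particles=True)                     # S-PP+C
    s_nopp_noc = strip(tokens, drop_particles=True, lemmatize=True)   # S-PP-C
    return {
        "s2s_m1": (s_nopp_noc, s_pp_noc),  # insert particles (canonical forms)
        "s2s_m2": (s_pp_noc, s_pp_c),      # apply conjugation
        "s2s_m3": (s_nopp_c, s_pp_c),      # re-insert particles (conjugated)
    }
```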

Experiments
To evaluate the proposed method, we conducted 100 experimental runs; for each run, we randomly selected 10 samples from the parallel corpus for testing and used the remaining samples to retrain the statistical machine translation model in Step 0. For each run, we fine-tuned the seq2seq models (s2s m1, s2s m2, and s2s m3) on the training data and used a beam size of 5 for decoding. We averaged the results over the 100 runs to compute the performance metrics.
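This repeated random-subsampling protocol can be sketched as follows; train_fn and score_fn are placeholders for the actual pipeline training and metric computation.

```python
import random

def evaluate(pairs, train_fn, score_fn, runs=100, test_size=10, seed=0):
    """Repeated random subsampling: hold out `test_size` pairs per run,
    train on the rest, and average the metric over all runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = pairs[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:test_size], shuffled[test_size:]
        model = train_fn(train)
        scores.append(score_fn(model, test))
    return sum(scores) / len(scores)  # metric averaged over all runs
```

With only ~1K pairs, this protocol uses the data far more efficiently than a single fixed split, at the cost of retraining per run.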
We denote the proposed method in Section 3.1 by SMT+Iterative s2s and compared its performance with the following baselines (naive, LSTM, and SMT), variants of the proposed method (SMT+1step s2s and SMT+2step s2s), and the transformer-based end-to-end Gloss2Text (G2T) model proposed by Yin and Read (2020). The following are brief descriptions of each model.
• naive: This baseline replaces each gloss word with a Japanese word using the gloss dictionary. If more than one Japanese word is defined for a gloss, the first word is used.
• LSTM: This baseline uses an encoder-decoder LSTM with an attention mechanism (Bahdanau et al., 2015) to directly translate G into S +P P +C . The model is trained using the parallel corpus without the out-of-domain corpus and is configured with several different hyperparameter settings.
• SMT: This baseline uses only the statistical machine translation model to directly translate G into S +P P +C . This model is trained using the parallel corpus without the out-of-domain corpus.
• SMT+1step s2s: This model is a variant of the proposed model which first executes Step 0 of Algorithm 1 to translate G into S −P P −C .
Then it uses another seq2seq model trained using the out-of-domain corpus to directly translate S −P P −C into S +P P +C . We compared the performance of this model, which jointly estimates post-positional particles and conjugations, with the model that estimates them separately using different models.
• SMT+2step s2s: This model is another variant of the proposed model which performs Steps 0-2, but does not iteratively update the translation result as it does in Algorithm 1. We examined how the iterative updates of the result with SMT+Iterative s2s contribute to the performance compared with the model without them.
For G2T, we changed the original hyperparameters suggested by Yin and Read (2020) and found the following parameters to be optimal via a hyperparameter search on our parallel corpus. Encoder and decoder: embed-size = 256, FFN-embed-size = 1024, num-layers = 1, num-attention-heads = 4.

Results
Table 2 shows the results of the experiment. To evaluate performance, we used the following metrics: BLEU-1/2/3/4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and word error rate (WER), and averaged the scores to obtain the results. The results show that the proposed model (SMT+Iterative s2s) outperformed the other models. The poor performance of the naive model indicates that simple lookup using the gloss dictionary does not produce successful results. LSTMs with hyperparameters varying in the dimensions of the embedding and hidden layers (256, 512, 1024) and the number of layers (1, 2) provide baseline performance for directly translating G into S+PP+C. Among them, the LSTM with embedding and hidden dimensions of 1024 and a single layer performed best. The best LSTM and G2T were inferior to the SMT because there were insufficient samples to train neural models with large capacity. All the pipeline models that combine the SMT and seq2seq models outperformed the models that directly translate G into S+PP+C. This clearly demonstrates the effectiveness of the pipeline approach. Table 8 in the Appendix illustrates the translation samples at each step of SMT+Iterative s2s.
Table 3: Error propagation analysis of SMT+Iterative s2s. The score is the exact match for the correct ratio (%) (GS = gold standard, EP = error propagation).
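The WER metric used in the evaluation can be computed as a standard token-level Levenshtein distance; the paper does not specify its WER implementation, so the following is a generic sketch.

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance over tokens, normalized by
    reference length."""
    r, h = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(r)][len(h)] / len(r)

print(wer("i went to the library", "i go to library"))  # 2 edits / 5 words = 0.4
```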
We investigated whether adding the monolingual Japanese corpus in Section 2.2 to train the target language model improved the performance of the SMT baseline. On the contrary, performance slightly degraded. We believe that this was because of a domain mismatch between the corpora. The statistical significance test results confirmed that the performance of SMT+Iterative s2s was significantly better than that of SMT, SMT+1step s2s, and SMT+2step s2s (see Table 10 in the Appendix). Table 3 shows the error propagation analysis of SMT+Iterative s2s. The score was measured using the exact match by counting the outputs that exactly matched the references at each step. Column 'GS' represents the gold standard score when using the ground truth input, and column 'EP' represents the score when using the output from the previous pipeline stage as the input, propagating the errors. Clearly, a large portion of the error originated from Step 0 when translating G into S−PP−C. The GS score of s2s m2 was much higher than that of s2s m1 and s2s m3, consistent with its higher BLEU score in the model evaluation on the test set described in Section 4. We verified that the EP score of s2s m3 was 2.8% greater than that of s2s m2, illustrating the efficacy of the retrospective complement of post-positional particles. Note that the EP score of s2s m3 was measured by taking the EP output of s2s m2, removing all post-positional particles, and using the result as input. Table 4 shows the frequency of iterative update counts by SMT+Iterative s2s. Approximately 72% of the results converged at the first iteration, and approximately 26% of the results converged at the second iteration. Counts above 6 occurred when the same phrase was repeatedly generated, a phenomenon known as hallucination (Wang and Sennrich, 2020). If we detected such an error, we removed the repeating phrase to shorten the output.
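The repeated-phrase cleanup mentioned above can be sketched as collapsing an immediately repeating n-gram into a single occurrence. The paper does not detail its detection rule, so this is one simple assumed implementation.

```python
def collapse_repeats(tokens, max_n=4):
    """Collapse immediately repeating n-grams (longest first) into a
    single occurrence, e.g. 'tell me tell me tell me' -> 'tell me'."""
    out = list(tokens)
    for n in range(max_n, 0, -1):
        i = 0
        while i + 2 * n <= len(out):
            if out[i:i + n] == out[i + n:i + 2 * n]:
                del out[i + n:i + 2 * n]  # drop the duplicate n-gram
            else:
                i += 1
    return out

print(collapse_repeats("please tell me tell me tell me the way".split()))
# -> ['please', 'tell', 'me', 'the', 'way']
```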
Qualitative Evaluation
Table 5 shows the qualitative evaluation of the results of the proposed model and the other models, together with BLEU4, WER, and perplexity (PPL) scores. PPL in the last column was measured with a transformer-based language model pretrained on the 494M-word Japanese Wikipedia.
We observed that the results of SMT and G2T had more post-positional particle selection errors than the pipeline models, and the results of SMT+1step s2s had more verb conjugation errors than SMT+2step s2s, which suggests the efficacy of estimating post-positional particles and conjugations independently. We confirmed that the post-positional particle estimations using SMT+Iterative s2s were either more natural or less error-prone than those using SMT+2step s2s, which made the translation results more fluent.
Table 4: Frequency of the iteration counts of Algorithm 1 until the translation output converged using SMT+Iterative s2s.
Table 6 shows the average perplexity scores of the results of the SMT and pipeline models. While the perplexities of the pipeline models were much lower than that of SMT, the perplexity of the proposed SMT+Iterative s2s was not the lowest. This result suggests that word-based perplexity is not suitable for evaluating equally acceptable translation outputs.

Discussion
In Table 5, most of the outputs of SMT+2step s2s and SMT+Iterative s2s were grammatically acceptable Japanese sentences with slight differences in the post-positional particle selections. As shown in the second and third examples in Table 5, the PPL scores of SMT+2step s2s were lower than those of SMT+Iterative s2s, but the BLEU4 and WER scores of SMT+Iterative s2s were better than those of SMT+2step s2s, even though the meanings of the sentences were almost the same. By contrast, the sentences of SMT+2step s2s and SMT+Iterative s2s in the first and last examples had different meanings, even though the PPL, BLEU4, and WER scores indicated that the results of SMT+Iterative s2s were better than those of SMT+2step s2s. However, depending on the context, the results of SMT+2step s2s may be more appropriate. The main cause of the ambiguity issue is related to the information bottleneck raised by Yin and Read (2020) regarding the gloss notation of sign language. Currently, our parallel corpus does not include any non-manual signals (NMSs), such as facial expression, eye gaze, mouth, and movement of the head and shoulders. However, NMSs act as grammatical markings for syntactic information (Valli et al., 2011; Koizumi et al., 2002). NMSs are not expressed in sequence, but simultaneously with manual signs, and their subtleties make sign recognition and annotation more difficult. Perhaps this is one of the reasons that most existing sign language corpora contain no, or only partial, NMS labels along with glosses. As suggested by Yin and Read (2020), the performance of G2T translation may not impose an upper bound for sign-to-text translation unless the gloss faithfully describes the signed sentences. We are interested in investigating whether incorporating visual features from signs would improve the proposed G2T translation method. Because of the limited space in this paper, we leave this issue for future work.
Table 7 depicts examples of the translation errors by SMT+Iterative s2s, categorized into gloss word translation errors, post-positional particle exchanges, and conjugation exchanges. As shown in Table 3, a large portion of the translation errors originated from gloss word translations. These errors mostly occurred because of the incorrect selection of Japanese wording for gloss phrases. For instance, the phrase hisaichi 'disaster area' in the reference of the second example in Table 7, which is expressed as a sequence of three glosses ('receive', 'disaster', and 'area'), was translated into the ungrammatical phrase ukete saigaibasho. This is because the correct mapping from glosses to compound nouns cannot be learned by the phrase-based SMT unless they appear in the training set. The second major source of translation errors was post-positional particle exchanges. These errors can change the semantics relative to the reference, as in the fifth example of Table 7: "how to donate books to the library" vs. "how to give a book from the library". As mentioned above, some of these errors are difficult to handle because the system output may be correct in another context. Translation errors relating to conjugation exchanges rarely occurred, and even when they did, their impact was minimal.

Table 7: Examples of the translation errors by SMT+Iterative s2s, categorized into gloss word translation error, post-positional particle exchange, and conjugation exchange. We highlight the wrong words or phrases in bold.
  "How can I study abroad?" — gloss word translation error
  "I would like to volunteer in the disaster area." — gloss word translation error
  "What are you stockpiling in case of a disaster?" — gloss word translation error
  "Can taxi fare be deducted from medical expenses?" — post-positional particle exchange
  "Please tell me how to donate books to the library." — post-positional particle exchange
  "Is it possible to register a seal with only the surname?" — post-positional particle exchange
  "Do you pay insurance premiums even if you have no income?" — post-positional particle exchange
  "Do you have any good information?" — conjugation exchange
  "Is there a place where I can leave my child on holidays?" — conjugation exchange

Related works
Camgöz et al. (2018) proposed end-to-end sign language translation in the framework of neural machine translation, jointly learning the spatial sign representation, the underlying language model, and the mapping between sign and spoken language using the PHOENIX-Weather 2014T (Camgöz et al., 2018) corpus. Their later work (Camgöz et al., 2020) further improved the model by introducing a transformer-based architecture that jointly learns sign language recognition and translation, trainable end-to-end using a connectionist temporal classification loss to bind the recognition and translation problems into a single unified architecture.

In a similar research line, Yin and Read (2020) proposed the G2T model using a transformer-based seq2seq model and evaluated its performance on PHOENIX-Weather 2014T (Camgöz et al., 2018) and ASLG-PC12 (Othman and Jemni, 2012) under various settings by changing the numbers of encoder-decoder layers and the embedding schemes. All of these end-to-end state-of-the-art sign language translation methods rely on large datasets and cannot be used for resource-poor datasets.

The pipeline method that we proposed is related to the transfer learning method proposed by Mocialov et al. (2018). They proposed transfer learning to improve British Sign Language modeling at the gloss level by fine-tuning or layer substitution on neural network models pre-trained on the Penn Treebank dataset. Although the purpose of their work was not to translate sign language, it is similar to ours in that it takes advantage of the linguistic commonality between a resource-poor sign language and its spoken language. Our approach to converting non-grammatical sentences into grammatical sentences is related to previous work on grammatical error correction (Imamura et al., 2012; Liu et al., 2018; Oyama et al., 2013). These studies used insert or replace operations to correct particle or morphological inflection errors in a monolithic model, and we believe that the proposed seq2seq-based iterative method using multiple models can be applied to similar tasks.

Conclusion
We proposed a pipeline machine translation method from SJ to Japanese, assuming that the gloss of the signs is provided. We focused on grammatical differences between SJ and Japanese, particularly post-positional particles and morphological inflections, and proposed a pipeline model cascading phrase-based statistical machine translation and three transformer-based seq2seq models, which effectively addresses the resource-poor issue of the sign language corpus. The statistical machine translation model first maps each gloss phrase to a Japanese phrase; then three seq2seq models pre-trained using the monolingual corpus transform the initial translation by complementing post-positional particles and applying conjugations for verbs, auxiliary verbs, and adjectives. Translation is repeated until the output converges. We confirmed that the proposed method outperformed the SMT baseline by +4.4/+4.9 points for BLEU-3/4.