Explore More Guidance: A Task-aware Instruction Network for Sign Language Translation Enhanced with Data Augmentation

Sign language recognition and translation first uses a recognition module to generate glosses from sign language videos and then employs a translation module to translate glosses into spoken sentences. Most existing works focus on the recognition step, while paying less attention to sign language translation. In this work, we propose a task-aware instruction network, namely TIN-SLT, for sign language translation, by introducing an instruction module and a learning-based feature fusion strategy into a Transformer network. In this way, the pre-trained model's language ability can be well explored and utilized to further boost the translation performance. Moreover, by exploring the representation space of sign language glosses and target spoken language, we propose a multi-level data augmentation scheme to adjust the data distribution of the training set. We conduct extensive experiments on two challenging benchmark datasets, PHOENIX-2014-T and ASLG-PC12, on which our method outperforms the previous best solutions by 1.65 and 1.42 in terms of BLEU-4. Our code is published at https://github.com/yongcaoplus/TIN-SLT.


Introduction
Sign language recognition and translation aims to transform sign language videos into spoken languages, which builds a bridge for communication between deaf and hearing people. Considering the unique grammar of sign languages, current effective recognition and translation systems involve two steps: a tokenization module to generate glosses from sign language videos, and a translation module to translate the recognized glosses into spoken natural languages. Previous works (Sincan and Keles, 2020; Sharma and Kumar, 2021; Kumar et al., 2020) have proposed various solutions to address the first step, but paid less attention to the translation system. Hence, this paper aims to solve the problem of sign language translation (SLT) with the goal of translating multiple recognized independent glosses into a complete sentence.
* Equal Contribution. # Corresponding author: Min Chen.
To do so, most existing works (Ko et al., 2019; Stoll et al., 2018) directly apply advanced techniques from neural machine translation to SLT, e.g., the Seq2Seq model (Sutskever et al., 2014) or the Transformer (Vaswani et al., 2017). However, unlike the lingual translation task in neural machine translation, SLT poses several unique challenges. First, it is hard to collect and annotate a large amount of sign language corpus. It remains an open question how to explore more guidance and external information for the SLT task by incorporating pre-trained language models built on masses of unlabeled corpora. Second, since sign languages developed independently from spoken languages with quite different linguistic features, the discrepancy of representation space between glosses and spoken sentences is significant, thus increasing the translation difficulty.
To address the above issues, we propose a novel task-aware instruction network, called TIN-SLT for sign language translation, further enhanced with a multi-level data augmentation scheme. Our TIN-SLT is capable of encoding pre-trained language model's ability into the translation model and also decreasing the discrepancy between the representation space of glosses and texts.
To begin with, we leverage the extracted hidden features from the pre-trained model as extra information to guide the sign language translation. Besides, we apply an instruction module to transform general token features into task-aware features. In this way, we can fully utilize the language skills originating from the external world, thus reducing the demand for sign language training data.
Next, to better inject the information from pretrained model into the SLT model, we design a learning-based feature fusion strategy, which has been analyzed and validated to be effective compared with existing commonly-used fusion ways.
Finally, considering the large difference between the sign language glosses and texts in terms of the representation space, we propose a multilevel data augmentation scheme to enrich the coverage and variety of existing datasets.
In summary, our contributions are threefold: (i) a novel TIN-SLT network to explore more guidance of pre-trained models, (ii) a learning-based feature fusion strategy, and (iii) a multi-level data augmentation scheme. Extensive experiments on challenging benchmark datasets validate the superiority of our TIN-SLT over state-of-the-art approaches; see Figure 1 for example results.

Related Works
Methods for sign language recognition. The SLR task mainly focuses on the extraction of extended spatial and temporal multi-cue features (Koller et al., 2017). Most existing works (Yin et al., 2016; Qiu et al., 2017; Wei et al., 2019; Cui et al., 2019) study strong representations of sign language videos, such as multi-semantic (Cui et al., 2019) and multi-modality analysis. Although extracting representative features from sign language videos has been fully explored, how to effectively conduct the subsequent translation by considering the unique linguistic features of sign language is often ignored in these SLR works.

Figure 2: Comparing the sample distribution between the input sign glosses (yellow dots) and the output translated texts (red dots) on two datasets: (a) vocab distribution on the PH14 dataset; (b) vocab distribution on the ASLG dataset.
Methods for sign language translation. Early approaches for SLT rely on the seq2seq model and attention mechanism (Arvanitis et al., 2019), which struggle with long-term dependencies. Later, motivated by the ability of the Transformer (Vaswani et al., 2017), many researchers utilized it to effectively improve SLT performance. For example, Camgoz et al. (2020) use the Transformer for both recognition and translation, promoting the joint optimization of sign language recognition and translation. The subsequent work (Yin and Read, 2020) proposed the STMC-Transformer network, which first uses STMC networks to achieve better results for SLR, and then exploits a Transformer for translation to obtain better SLT performance.
General neural machine translation. Broadly speaking, sign language translation belongs to the field of neural machine translation, with the goal of carrying out automated text translation. Earlier approaches deployed recurrent network (Bahdanau et al., 2014), convolutional network (Gehring et al., 2017), or Transformer (Vaswani et al., 2017) as encoder-decoder module. Among them, Transformer has achieved state-of-the-art results, but the translation performance still needs to be improved due to the limited training corpus. In addition, there are some explorations in bringing the pre-trained models into neural machine translation (Imamura and Sumita, 2019;Shavarani and Sarkar, 2021;Zhu et al., 2020).

Challenges
The goal of this work is to translate the recognized multiple independent glosses (network input) into a complete spoken sentence (expected output). Compared with general neural machine translation tasks, SLT faces two main challenges:

Figure 3: Network architecture of TIN-SLT. As shown in the bottom row, we first employ the STMC model to recognize sign language videos into independent glosses. Next, we design a multi-level data augmentation scheme to enrich the existing data pool for better feature embedding from glosses. Then, we design a task-aware instruction network with a novel instruction module to translate glosses into a complete spoken sentence.
Limited annotated corpus: Compared with natural languages, the data resources of sign languages are scarce (Bragg et al., 2019). As a result, SLT models trained on limited data often suffer from the overfitting problem with poor generalization.
Discrepancy between glosses (input) and texts (output): Figure 2 shows the representation space of sign glosses (yellow dots) and translated texts (red dots) using Word2Vec (Mikolov et al., 2013) on two different datasets. We can observe that the representation space of sign glosses is clearly smaller than that of the target spoken language, thus increasing the difficulty of network learning.

Our Approach
To address the above challenges, we propose TIN-SLT by effectively introducing the pre-trained model into SLT task and further designing a multilevel data augmentation scheme. Figure 3 depicts the detailed network architecture. In the following subsections, we will firstly introduce the network architecture of TIN-SLT, followed by our solutions to address the above two challenges.

Network Architecture of TIN-SLT
Given a sign language video V = {V_1, ..., V_T} with T frames, like existing approaches, we also adopt a two-step pipeline by first (i) recognizing V into a sequence G = {g_1, ..., g_L} of L independent glosses and then (ii) translating G into a complete spoken sentence S = {w_1, ..., w_M} of M words, but we pay more attention to solving step (ii). Hence, for step (i), as shown in the bottom-left part of Figure 3, we empirically use the spatial-temporal multi-cue (STMC) network, which consists of a spatial multi-cue module and a temporal multi-cue module; for more technical details, please refer to the original STMC work. Below, we shall mainly elaborate on the details of addressing step (ii).
After obtaining the sequence G of sign glosses, considering that the representation space of glosses is much smaller than that of texts (see Figure 2), we thus design a multi-level data augmentation scheme to expand the gloss representation space; see the top-left part of Figure 3 as an illustration and we shall present its details in Section 4.3.
Next, as shown in the bottom-middle part of Figure 3, the key of our design is a task-aware instruction network, where we adopt Transformer as the network backbone consisting of several encoder and decoder layers, whose objective is to learn the conditional probabilities p(S|G). Since SLT is an extremely low-data-resource task as we have discussed in Section 3, we thus focus on exploring more task-aware guidance by learning external world knowledge, which is dynamically incorporated into the Transformer backbone via our designed task-aware instruction module. We shall present its details in Section 4.2.
Lastly, the outputs of the last decoder layer are passed through a non-linear point-wise feed-forward layer, and we obtain the predicted sentence S via a linear transform and a softmax layer.

Task-aware Instruction Module
As is shown in Figure 3, our task-aware instruction network is composed of a series of encoder and decoder layers. To handle the limited training data, we propose to leverage the learned external knowledge from natural language datasets to guide the learning of sign languages. More specifically, we design a task-aware instruction module to dynamically inject external knowledge from pre-trained models into our encoder and decoder. Below, we shall present the details.
Encoder. Given the recognized glosses, let H_I denote the instruction features encoded by the pre-trained model (PTM), and let H_E and H_E' denote the input and output of the encoder, which is randomly initialized. As shown in Figure 4, H_I and H_E are fed into the task-aware instruction module for feature fusion. Then, the output of the instruction module is fed into a residual connection (Add&Norm) and a feed-forward network (FFN).
The light yellow box of Figure 4 shows the detailed design of the task-aware instruction module. Specifically, we feed H_E into a self-attention module to learn the contextual relationship between the features of glosses, while H_I is fed into a PTM-attention module, which has the same architecture as self-attention. Different from existing works which employ a PTM in a general neural network (Zhu et al., 2020), we insert an adaptive layer to fine-tune the PTM-attention output for the SLT task, transforming general gloss features into task-aware features:
h_t^I = σ(Attn_I(h_t, H_I, H_I)),   (1)

where σ(·) denotes the adaptive layer (we set it as fully connected layers here), and h_t denotes the gloss features at time step t. Then, the outputs of the two attention modules are combined via the α strategy. The whole process is formulated as follows:

h_t' = (1 − α) · Attn_E(h_t, H_E, H_E) + α · h_t^I,   (2)

where Attn_E and Attn_I are two attention layers with different parameters, which follow (Vaswani et al., 2017). The way of setting an optimal α will be introduced later.
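The fused encoder update can be sketched in a few lines of code. This is a simplified, single-head sketch under our own assumptions: `attention` is plain scaled dot-product attention, a `tanh` over `W_adapt` stands in for the adaptive layer σ(·) (the text says fully connected layers), and α is fixed to 0.5 purely for illustration; all helper names are ours.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector q
    scores = K @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ V

def instruction_fuse(h_t, H_E, H_I, W_adapt, alpha):
    # Blend self-attention over encoder states (Attn_E) with
    # adapted PTM-attention over instruction features (Attn_I)
    self_out = attention(h_t, H_E, H_E)
    ptm_out = attention(h_t, H_I, H_I)
    ptm_out = np.tanh(W_adapt @ ptm_out)   # adaptive layer sigma(.)
    return (1 - alpha) * self_out + alpha * ptm_out

rng = np.random.default_rng(0)
d = 8
H_E = rng.normal(size=(5, d))      # encoder token features
H_I = rng.normal(size=(5, d))      # PTM instruction features
W_adapt = rng.normal(size=(d, d))
fused = instruction_fuse(H_E[0], H_E, H_I, W_adapt, alpha=0.5)
```

In the real network this fusion runs per head and per layer; the sketch only illustrates the (1 − α)/α blend of Eq. (2).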
Decoder. Let S_D and S_D' denote the input and output of the decoder, s_t denote the hidden state at time step t, and s_0 denote the beginning token of a sentence, i.e., <bos>. The hidden states are passed to a masked self-attention ensuring that each token may only use its predecessors:

s_t' = Attn_D(s_t, s_{≤t}, s_{≤t}).   (3)

Representations H_E and H_I extracted from the encoder and the PTM are fed into the decoder-attention and PTM-attention modules, respectively, as shown in the right part of Figure 4. Similar to the encoder, we formulate the decoding output as:

ŝ_t = (1 − α) · Attn_D(s_t', H_E, H_E) + α · σ(Attn_I(s_t', H_I, H_I)),   (4)

where Attn_D represents decoder-attention, and ŝ_t is the output of the decoder instruction module.
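The "predecessors only" constraint of the masked self-attention is commonly enforced with a causal (lower-triangular) mask over the score matrix. The sketch below is a generic illustration of this standard mechanism, not the paper's exact implementation:

```python
import numpy as np

def masked_self_attention(S, d):
    # Each position t may only attend to positions <= t, enforced by
    # setting masked-out scores to -inf before the softmax.
    scores = S @ S.T / np.sqrt(d)
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ S

S = np.random.default_rng(1).normal(size=(4, 8))  # 4 decoder states
out = masked_self_attention(S, 8)
```

Because position 0 can attend only to itself, its output equals its input state, which makes the causal constraint easy to verify.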
Learning-based feature fusion. As shown in Eq. (2), representations extracted from both PTM-attention and self-attention are fused via a parameter α. How to set a reasonable and optimal α directly affects the learning performance, which is a problem worthy of exploration. Instead of manually setting a constant α, we propose a learning-based strategy that encourages the network to learn the optimal α by itself for better feature fusion.
Specifically, the learning-based strategy means that we adopt the back-propagation algorithm to update α during the network training process:

α ← α − Γ(g_t),   g_t = ∂L/∂α,   (5)

where g_t indicates the gradient of the training loss L with respect to α and Γ(·) represents the optimization algorithm. Though the idea of self-learning is straightforward, we shall show in the experiment section that it is quite effective compared with many other strategies.
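As an illustration of the learning-based update, the sketch below treats α as a trainable scalar updated by gradient steps on a toy quadratic surrogate loss. In TIN-SLT the gradient g_t would instead come from back-propagating the translation loss through the fusion of Eq. (2), and Γ(·) would be the network's optimizer rather than plain SGD; the names and the target value here are hypothetical.

```python
def learn_alpha(alpha=0.5, lr=0.1, steps=100, target=0.7):
    # Toy surrogate loss L(alpha) = (alpha - target)^2; alpha is
    # updated by gradient descent and clipped to [0, 1].
    for _ in range(steps):
        g_t = 2 * (alpha - target)   # gradient of the surrogate loss
        alpha -= lr * g_t            # Gamma(.) here is plain SGD
        alpha = min(max(alpha, 0.0), 1.0)
    return alpha

learned = learn_alpha()  # converges toward the loss minimum at 0.7
```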

Multi-level Data Augmentation
To decrease the discrepancy between glosses (input) and texts (output), we propose a multi-level data augmentation scheme. Our key idea is that, besides existing gloss-text pairs, we use upsampling as our data augmentation algorithm and generate text-text pairs as extended samples to introduce text information into glosses, thus enlarging the feature distribution space of glosses. There is a trade-off between augmentation and overfitting, which means the upsampling ratio Φ_upsamp should be determined by the degree of gloss-text difference. We propose four factors φ = [φ_v, φ_r, φ_s, φ_d] to measure the difference at the token, sentence, and dataset levels, and set the weighted φ as Φ_upsamp.
Token level. The Vocabulary Different Ratio (VDR, φ_v) is used to measure the difference between the gloss vocabulary space and the text's, as calculated by Eq. (6):

φ_v = 1 − |W_G ∩ W_S| / |W_S|,   (6)

where W_G and W_S represent the gloss and text vocabularies, and |·| denotes the size of a set. We further present the Rare Vocabulary Ratio (RVR, φ_r) to calculate the ratio of rare words:

φ_r = Σ_{w∈W_G} #(Counter(G)[w] < τ_r) / |W_G|,   (7)

where #(·) is 1 if the condition is true and 0 otherwise, Counter(G) calculates the gloss vocabulary frequency, and τ_r is the empirical threshold frequency determined by the vocabulary frequency, which is empirically set to 2.
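A toy computation of the two token-level factors, under our set-based reading of the definitions; the helper names are ours, not the paper's, and the exact normalization is an assumption.

```python
from collections import Counter

def vocab_diff_ratio(gloss_vocab, text_vocab):
    # VDR (phi_v): fraction of the text vocabulary that the gloss
    # vocabulary does not cover
    return 1 - len(gloss_vocab & text_vocab) / len(text_vocab)

def rare_vocab_ratio(gloss_tokens, tau_r=2):
    # RVR (phi_r): fraction of gloss vocabulary entries whose corpus
    # frequency falls below tau_r
    counts = Counter(gloss_tokens)
    return sum(1 for c in counts.values() if c < tau_r) / len(counts)

phi_v = vocab_diff_ratio({"go", "school"}, {"go", "to", "the", "school"})
phi_r = rare_vocab_ratio(["go", "go", "school"])
```

Here the glosses cover 2 of 4 text words (φ_v = 0.5), and "school" appears only once, below τ_r = 2 (φ_r = 0.5).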
Sentence level. We propose the Sentence Cover Ratio (SCR, φ_s) to compute the gloss-text pair similarity and covered ratio:

r_i = |G_i ∩ S_i| / |S_i|,   φ_s = (1/N) Σ_i #(r_i > τ_c) · r_i,   (8)

where r_i denotes the covered ratio of the gloss-text pair G_i and S_i, and τ_c is an empirical threshold (set to τ_c = 0.5). We label the gloss-text pairs that satisfy r_i > τ_c as candidates C.
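One plausible reading of the sentence-level factor can be sketched as follows: r_i counts how much of a text sentence is covered by its gloss sequence, and pairs above τ_c become upsampling candidates C. The exact normalization and helper names are our assumptions.

```python
def cover_ratio(gloss, text):
    # r_i: fraction of text tokens that also appear in the gloss sequence
    g = set(gloss)
    return sum(1 for w in text if w in g) / len(text)

def sentence_cover_factor(pairs, tau_c=0.5):
    # phi_s: mean covered ratio over pairs exceeding tau_c; those pairs
    # are kept as upsampling candidates C
    ratios = [cover_ratio(g, s) for g, s in pairs]
    candidates = [p for p, r in zip(pairs, ratios) if r > tau_c]
    phi_s = sum(r for r in ratios if r > tau_c) / len(ratios)
    return phi_s, candidates

pairs = [(["i", "go"], ["i", "go"]),
         (["go", "school"], ["i", "go", "to", "school"])]
phi_s, C = sentence_cover_factor(pairs)
```

In this toy example the first pair is fully covered (r = 1.0 > τ_c) and becomes a candidate, while the second (r = 0.5) does not.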
Dataset level. We use the Dataset Length-difference Ratio (DLR, φ_d) to calculate the sentence-length distance:

φ_d = (1/N) Σ_i ||S_i| − |G_i|| / |S_i|.   (9)

Then we obtain the upsampling ratio by:

Φ_upsamp = θ · φ^T,   (10)

where the weight matrix θ is empirically set as [0.1, 0.1, 0.6, 0.2], corresponding to the weights of [φ_v, φ_r, φ_s, φ_d], as we suppose the sentence level matters the most and the weight of the token level is the same as that of the dataset level. Lastly, we obtain the upsampling ratio and apply the upsampling strategy among all candidates C to enrich the dataset.

Experiments

Datasets. ASLG-PC12, i.e., ASLG, is a parallel corpus of English written texts and American Sign Language (ASL) glosses, which is constructed based on a rule-based approach. It contains more than one hundred million pairs of sentences between English sentences and ASL glosses.
Evaluation metrics. To fairly evaluate the effectiveness of our TIN-SLT, we follow (Yin and Read, 2020) and use the commonly-used BLEU-N (with N-grams ranging from 1 to 4) (Papineni et al., 2002), ROUGE-L (Lin, 2004), and METEOR (Banerjee and Lavie, 2005) as evaluation metrics.
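For intuition on BLEU-N, its core is modified n-gram precision; the minimal sketch below omits the brevity penalty and the geometric mean over n, so real evaluation should use a standard BLEU implementation rather than this illustration.

```python
from collections import Counter

def ngram_precision(cand, ref, n):
    # Modified n-gram precision: clip each candidate n-gram's count by
    # its count in the reference, then normalize by candidate total.
    c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(cnt, r[g]) for g, cnt in c.items())
    return overlap / max(sum(c.values()), 1)
```

The clipping matters: a degenerate candidate that repeats one reference word does not get full credit, e.g. `ngram_precision(["the", "the"], ["the", "cat"], 1)` is 0.5, not 1.0.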
Experimental setup. The experiments are conducted on an Ubuntu 18.04 system with two NVIDIA V100 GPUs. Our Transformers are built using 2048 hidden units and 8 heads in each layer. Besides, we adopt Adam (Kingma and Ba, 2014) as the optimization algorithm with β_1 = 0.9, β_2 = 0.998 and use an inverse-sqrt learning-rate scheduler with a weight decay of 10^-3. Please refer to the Appendix for more hyper-parameter settings.
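The inverse-sqrt schedule mentioned above can be sketched as follows. The warmup length of 4000 steps and the base rate are illustrative assumptions (the paper does not state the warmup); the shape, linear warmup followed by 1/sqrt(step) decay, is the standard fairseq-style behavior.

```python
import math

def inverse_sqrt_lr(step, base_lr=3e-4, warmup=4000):
    # Linear warmup to base_lr, then decay proportional to 1/sqrt(step)
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup / step)
```

For example, the rate is halfway up at step 2000, peaks at step 4000, and has fallen back to half the peak by step 16000.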


Comparison with Others
To compare our TIN-SLT against state-of-the-art approaches on sign language translation task, we conducted two groups of experiments, Gloss2Text (G2T) and Sign2Gloss2Text (S2G2T).
Evaluation on G2T. G2T is a text-to-text translation task whose objective is to translate ground-truth sign glosses into spoken language sentences. Specifically, for the PH14 dataset the output is German spoken sentences, while for the ASLG dataset the output is English sentences. Table 1 summarizes the comparison results. Clearly, our TIN-SLT achieves the highest values on most evaluation metrics by a significant margin. Particularly, the superiority of our method on the PH14 dataset is more obvious, where almost all the evaluation values are the highest. Thanks to our multi-level data augmentation scheme, the integrity of translated sentences has been improved, which is reflected in the significant improvement of the BLEU-N metric. In addition, the strong guidance from external knowledge also encourages our network to generate translated sentences with correct grammar, consistent tense, and appropriate word order. As for the lower ROUGE-L metric, we believe that although the instruction module clearly helps improve the accuracy and fluency of translation results, it leads to a slight decrease in the recall rate of continuous texts in this task.
Evaluation on S2G2T. S2G2T is an extended task beyond G2T, which aims to recognize sign language videos into sign glosses, and then translate the recognized glosses into spoken sentences. Hence, unlike the task of G2T, in this comparison we focus on evaluating the whole two-step pipeline, that is, obtaining spoken language sentences from sign language videos. Considering that only PH14 contains sign language videos, we conduct experiments on this dataset for the S2G2T task, and the results are reported in Table 2 alongside those of (Yin and Read, 2020). Note that, for the recognition step, we employ the STMC model to realize vision-based sequence learning. From the comparison we can see that our TIN-SLT still outperforms existing approaches on most evaluation metrics.

Analysis and Discussions
Here, we conducted a series of detailed experiments to analyze our method and give some insights behind our network design.

Effect of learning-based feature fusion. In this work, we propose a learning-based strategy to set α dynamically. Here, we compare this strategy with four alternatives: (1) cosine annealing (Loshchilov and Hutter, 2016), (2) cosine increment, (3) cosine decrement, and (4) a constant value. The update of α by the three cosine strategies is calculated as Eq. (11) with different settings of the epoch cycle coefficient T_c:

α = α_min + (1/2)(α_max − α_min)(1 + cos(T_t π / T_c + γ)),   (11)

where α is the fusion ratio, T_t is the current epoch step, and γ is the time-shift constant. We set T_c as (25, 100, 100) and γ as (0, 0, π) for cosine annealing, cosine decrement, and cosine increment, respectively. The minimum value α_min and maximum value α_max of α are set to 0 and 1.
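The three cosine variants of Eq. (11) can be sketched in one function; the closed form below is our reconstruction from the surrounding text (with α_min, α_max, T_c, and γ as defined above).

```python
import math

def alpha_schedule(T_t, T_c, gamma, a_min=0.0, a_max=1.0):
    # Eq. (11): cosine-based update of the fusion ratio alpha
    return a_min + 0.5 * (a_max - a_min) * (1 + math.cos(math.pi * T_t / T_c + gamma))

# cosine annealing:  T_c = 25,  gamma = 0   (alpha cycles every 25 epochs)
# cosine decrement:  T_c = 100, gamma = 0   (alpha falls from 1 toward 0)
# cosine increment:  T_c = 100, gamma = pi  (alpha rises from 0 toward 1)
```

With γ = 0 the schedule starts at α_max = 1 and decays; shifting by γ = π mirrors it so α starts at α_min = 0 and grows.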
Figures 5(a)-5(b) show the experimental results on the two datasets. We can observe that the learning-based strategy (red line) achieves the best result on ASLG and a result comparable to the constant setting (α = 0.8) on PH14, while still beating the other three cosine strategies. Moreover, we visualize the learned value of α during the training process, as shown in Figures 5(c)-5(d), to find out the contribution ratio of the BERT model to the final performance. We can see that the value of α gradually decreases on PH14, meaning that the model depends more on the BERT pre-trained knowledge at the beginning of the training process and gradually inclines toward our employed training corpus. The observation is the opposite on ASLG: since it is a much larger dataset than PH14, our model relies more on BERT to further boost the performance near the end of training.

Table 3: Ablation analysis of our major network components on the G2T task.
Analysis on major network components. In our TIN-SLT, there are two major components: the multi-level data augmentation scheme and the instruction module. To validate the effectiveness of each component, we conduct an ablation analysis on the G2T task with the following cases.
• Baseline: We use a two-layer Transformer (Yin and Read, 2020) without the data augmentation scheme and instruction module as the baseline.
• w/ DataAug: Based on the baseline, we add our data augmentation scheme back.
• w/ Encoder: Based on w/ DataAug, we fuse instruction module only into the encoder.
• w/ Decoder: Based on w/ DataAug, we fuse instruction module only into the decoder.
As a contrast, in our full pipeline, the instruction module is inserted into both the encoder and the decoder. Table 3 shows the evaluation results on both PH14 and ASLG. By comparing the results of Baseline and w/ DataAug, we can see that our data augmentation improves the translation performance, especially on the PH14 dataset. A reasonable interpretation is that the translation task on PH14 is more difficult than on ASLG, so our data augmentation contributes more. On the other hand, w/ Encoder, w/ Decoder, and the full pipeline explore the best location to introduce PTM information into the model. Results in Table 3 show that our full model achieves the best performance. Particularly, by comparing the results of w/ Encoder and w/ Decoder against the results of SOTA methods (Tables 1 & 3), we observe that as long as we employ the pre-trained model, no matter where it is inserted into the network, the performance is always better than existing methods.

Effect of different pre-trained models. We here explore the translation performance of different pre-trained models; see Table 4. We analyze the model size and the vocabulary coverage of the pre-trained model with respect to the gloss and text of our datasets (the pre-trained model links are listed in the Appendix). We can see that introducing a pre-trained model with larger vocabulary coverage of the target dataset gains better performance, since such a model can inject more knowledge learned from external unlabeled corpora into the translation task. For ASLG, although the vocabulary coverage is the same, the bigger model performs better since it can learn contextual representations better.
Analysis on hyper-parameters. To search for the best settings of our hyper-parameters, we employ Neural Network Intelligence (NNI) (Microsoft, 2018), a lightweight but powerful toolkit. As shown in Figures 5(e)-5(h), we explore how beam size, layer number, learning rate, and dropout rate affect the model performance on the PH14 dataset. First, beam search enables exploring more candidates, but large beam widths do not always yield better performance, as shown in Figure 5(e); the optimal beam size on PH14 is 10. Second, the layer number decides the model size and capacity, where a larger model would overfit on a small dataset. In Figure 5(f), we find the optimal layer number to be 3 on PH14. Lastly, as shown in Figures 5(g) & 5(h), we adopt an early-stopping strategy to avoid overfitting and find the best learning rate and dropout rate to be 0.0003 and 0.45, respectively.
Case study. Table 5 presents some intuitive translation results on ASLG by reporting the translated spoken sentences. Overall, the translation quality is good; even translated sentences with low BLEU-4 still convey the same information. We can also observe that our translated sentences are basically the same as the ground truth, albeit with different expressions, e.g., "decision making" vs. "decision made". The translation results on PH14 are reported in the Appendix.

Conclusion
In this paper, we proposed a task-aware instruction network for sign language translation. To address the problem of limited data for SLT, we introduced a pre-trained model into the Transformer and designed an instruction module to adapt it to the SLT task. Besides, due to the discrepancy between the representation spaces of sign glosses and spoken sentences, we proposed a multi-level data augmentation scheme. Extensive experiments validate our superior performance compared with state-of-the-art approaches. While there is an obvious improvement in most evaluation metrics, the complexity of our model is also increased, causing a longer training period. In the future, we would like to explore the design of a lightweight model to achieve real-time efficiency.