IIE-NLP-Eyas at SemEval-2021 Task 4: Enhancing PLM for ReCAM with Special Tokens, Re-Ranking, Siamese Encoders and Back Translation

This paper introduces our systems for all three subtasks of SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning. To help our model better represent and understand abstract concepts in natural language, we design several simple and effective approaches adapted to the backbone model (RoBERTa). Specifically, we formalize the subtasks in a multiple-choice question answering format and add special tokens around abstract concepts; the final QA prediction is then taken as the result for each subtask. Additionally, we employ several fine-tuning tricks to improve performance. Experimental results show that our approach achieves significant gains over the baseline systems. Our system ranks eighth (87.51%) on subtask 1 and tenth (89.64%) on subtask 2 of the official blind test set.


Introduction
A computer's ability to understand, represent, and express abstract meaning is a fundamental problem on the path to true natural language understanding. SemEval-2021 Task 4: Reading Comprehension of Abstract Meaning (ReCAM) provides a well-formed benchmark for studying a machine's ability to represent and understand abstract concepts (Zheng et al., 2021).
The ReCAM task is divided into three subtasks: Imperceptibility, Nonspecificity, and Interaction. Please refer to the task description paper (Zheng et al., 2021) for more details. To address the above challenges in ReCAM, we first formalize all subtasks as a type of multiple-choice Question Answering (QA) task, as in Xing et al. (2020). Our code is publicly available at https://github.com/indexfziq/IIE-NLP-Eyas-SemEval2021.
Recently, large Pre-trained Language Models (PLMs) such as GPT-2 (Radford et al., 2019), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019) have demonstrated excellent ability in various natural language understanding tasks (Wang et al., 2018; Zellers et al., 2018, 2019). We therefore employ the state-of-the-art PLM, RoBERTa, as our backbone model. Moreover, we design several simple and effective approaches to improve the performance of the backbone model, such as adding special tokens, sentence re-ranking, label smoothing and back translation. This paper describes the approaches for all subtasks developed by the IIE-NLP-Eyas Team (Natural Language Processing group of the Institute of Information Engineering of the Chinese Academy of Sciences). Our contributions are summarized as follows:
• We design several simple and effective approaches to improve the performance of PLMs on all three subtasks, such as special tokens, sentence re-ranking, siamese encoders, back translation and label smoothing;
• Experiments demonstrate that our proposed methods achieve significant improvements over the baselines, and we obtain 8th place in subtask-1 and 10th place in subtask-2 in the final official evaluation.

Approaches
Since all tasks in ReCAM share the same format, we use a unified framework to address them.
The details of our methods are as follows.
Task Definition We first present the symbols used in this paper. Formally, there are seven key elements in all subtasks, i.e. $\{D, Q, A_1, A_2, A_3, A_4, A_5\}$. $D$ denotes the given article, $Q$ denotes the summary of the article with a placeholder, and $A_*$ denotes the candidate abstract concepts, one of which fills in the placeholder.

Multi-Choice Based Model
Pre-trained language models have made a great contribution to MRC tasks. A significant recent milestone is BERT (Devlin et al., 2019), which set new state-of-the-art results on eleven natural language processing tasks. In this section, we describe the multi-choice based model that we use in all subtasks. Since the BERT-style model RoBERTa (Liu et al., 2019), which uses more data and bigger models for better performance, is stronger than BERT, we adopt it as our backbone model. A multiple-choice based QA model $M$ consists of a PLM encoder and a task-specific classification layer, which includes a feed-forward neural network $f(\cdot)$ and a softmax operation. For each question-answer pair, the calculation of $M$ is given by Equation (1), where $[\cdot]$ is the input constructed according to the instructions of the PLM, and $S_*$ is the final hidden state of the first token (<s>). For more details, we refer to the original work on the PLM (Liu et al., 2019). The candidate answer with the highest score is taken as the final prediction. The model $M$ is trained end-to-end with the cross-entropy objective function.
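One way to write a model that matches the description of Equation (1) is sketched below. It is offered as an illustration and is not necessarily the exact original equation: the concatenation order $[D; Q; A_i]$, the score symbol $s_i$ and the probability symbol $P_i$ are assumptions introduced here.

```latex
\begin{aligned}
S_i &= \mathrm{PLM}\big([D;\, Q;\, A_i]\big), \\
s_i &= f(S_i), \\
P_i &= \frac{\exp(s_i)}{\sum_{j=1}^{5} \exp(s_j)}, \qquad i = 1, \dots, 5 .
\end{aligned}
```

Under this reading, the predicted answer is $\arg\max_i P_i$, and the cross-entropy objective for a gold option index $y$ is $-\log P_y$.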
Special Tokens Considering the strong performance of special tokens in entity and relation extraction (Zhong and Chen, 2021), as well as of prompt templates in commonsense reasoning (Xing et al., 2020), we attach special tokens to highlight the semantic representation of candidate abstract concepts in the input layer. To help the PLM represent and understand the abstract concept (i.e. the option word in ReCAM tasks) within its textual description (i.e. the summary of the article in ReCAM tasks), we place <e> and </e> at both ends of the abstract concept, i.e. <e> abstract concept </e>. Interestingly, the special tokens are useful features that account for most of the system's improvement. We have also tried several other special tokens, which are discussed in Section 4.
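The marking step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the placeholder string `@placeholder` and the helper names `mark_concept` and `build_inputs` are hypothetical.

```python
from typing import List

# Assumed placeholder string in the summary; adjust to the dataset's own marker.
PLACEHOLDER = "@placeholder"


def mark_concept(question: str, option: str,
                 open_tok: str = "<e>", close_tok: str = "</e>") -> str:
    """Fill the placeholder with the option wrapped in special tokens."""
    if PLACEHOLDER not in question:
        raise ValueError("question has no placeholder")
    marked = f"{open_tok} {option} {close_tok}"
    # Replace only the first occurrence of the placeholder.
    return question.replace(PLACEHOLDER, marked, 1)


def build_inputs(question: str, options: List[str]) -> List[str]:
    """Build one marked question per candidate abstract concept."""
    return [mark_concept(question, opt) for opt in options]
```

With a real tokenizer, the two markers would also need to be registered as special tokens so that they are not split into sub-words; that step is omitted here.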
Sentence Ranking As the given passage is often too long to be handled by the Pre-trained Language Models (PLMs), we refine the passage input by rearranging the order of its sentences. With this reordering, sentences that are more critical to the question appear at the beginning of the passage. Although the passage's sequential information is sacrificed, we keep more of the question-relevant information. Suppose the passage $D$ contains $N$ sentences, i.e., $D = \{W_1, W_2, \dots, W_N\}$, and denote the given cloze-style question as $Q$. To rank the sentences in $D$, we use BERT to compute the similarity score between each sentence $W_n$ and $Q$, following the algorithm in Zhang et al. (2020). After ranking, the sentences in $D$ are sorted in descending order of similarity score, giving a rearranged passage $\hat{D}$ that serves as the passage input to the QA model. In our implementation, $\hat{D}$ is truncated to fit the maximum input length we set for the PLM encoder.
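The reorder-then-truncate procedure can be illustrated as follows. This is only a sketch: the BERT-based similarity of Zhang et al. (2020) is replaced by a toy Jaccard token overlap so that the example runs without any model, truncation counts whitespace tokens rather than sub-word tokens, and the names `overlap_similarity` and `rerank_passage` are hypothetical.

```python
import re
from typing import Callable, List


def overlap_similarity(sentence: str, question: str) -> float:
    """Toy stand-in for the BERT-based similarity: Jaccard token overlap."""
    s = set(re.findall(r"\w+", sentence.lower()))
    q = set(re.findall(r"\w+", question.lower()))
    if not s or not q:
        return 0.0
    return len(s & q) / len(s | q)


def rerank_passage(sentences: List[str], question: str,
                   similarity: Callable[[str, str], float] = overlap_similarity,
                   max_tokens: int = 256) -> str:
    """Sort sentences by descending similarity to the question, then truncate."""
    # sorted() is stable, so sentences with equal scores keep their original order.
    ranked = sorted(sentences, key=lambda w: similarity(w, question), reverse=True)
    tokens = " ".join(ranked).split()
    # Whitespace-token truncation stands in for the encoder's max-length cut.
    return " ".join(tokens[:max_tokens])
```

Any scoring function with the same signature can be passed as `similarity`, which is where a learned sentence-question scorer would be plugged in.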
Siamese Encoders When exploring the dataset, we find that the complete question statement, i.e. the statement obtained by replacing the placeholder token with a candidate option, also contains semantic information that can help in judging the options. Based on this observation, we propose a siamese-encoder architecture that injects the additional complete-question-statement information without affecting the input that contains the passage. It can also be seen as introducing an auxiliary task to assist the main task. Specifically, in the training of the siamese-encoder architecture, $\mathrm{PLM}(\cdot)$ stands for the PLM encoder, $\hat{Q}_i$ is the complete question statement, $i$ indicates the $i$-th candidate answer, and $f(\cdot)$ is the feed-forward network. To coordinate the two losses, we opt for an uncertainty loss (Kendall et al., 2018) that balances them adaptively through $\sigma_{\{1,2\}}$: $\mathcal{L}(\theta, \sigma_1, \sigma_2) = \frac{1}{2\sigma_1^2}\mathcal{L}_1(\theta) + \frac{1}{2\sigma_2^2}\mathcal{L}_2(\theta) + \log \sigma_1^2 \sigma_2^2$, where $\mathcal{L}_{\{1,2\}}$ are the cross-entropy losses between the model predictions $P_{\{1,2\}}$ and the ground-truth label, respectively.
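The uncertainty-weighted combination of the two losses can be checked numerically with a small sketch. It evaluates the formula as written in this section for fixed values; in training, $\sigma_1$ and $\sigma_2$ would be learned jointly with the model parameters, which is not shown. The function name `uncertainty_loss` is hypothetical.

```python
import math


def uncertainty_loss(loss_main: float, loss_aux: float,
                     sigma1: float, sigma2: float) -> float:
    """Combine two task losses with uncertainty weights sigma1 and sigma2."""
    if sigma1 <= 0 or sigma2 <= 0:
        raise ValueError("sigma values must be positive")
    # Each loss is scaled down as its sigma grows.
    weighted = loss_main / (2 * sigma1 ** 2) + loss_aux / (2 * sigma2 ** 2)
    # The log term penalises large sigmas so the weights cannot collapse to zero.
    regularizer = math.log(sigma1 ** 2 * sigma2 ** 2)
    return weighted + regularizer
```

A larger $\sigma$ lowers the weight of its loss but raises the logarithmic penalty, which is what lets the balance between the main and auxiliary tasks adapt during training.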
Back Translation Generally speaking, successful neural networks have a large number of parameters, often in the millions. Training such networks well requires a lot of data, but in practice there is often less data available than desired. Data augmentation serves two purposes. One is to increase the amount of training data and improve the generalization ability of the model. The other is to add noisy data and improve the robustness of the model. Many works (Buslaev et al., 2018; Bloice et al., 2019; Chen et al., 2020; Cubuk et al., 2020; Sato et al., 2018; Zhu et al., 2020) use data augmentation to improve performance. In computer vision, many works (Buslaev et al., 2018; Bloice et al., 2019; Chen et al., 2020; Cubuk et al., 2020) apply operations such as flipping, translation or rotation to existing data to create more data, so that neural networks generalize better. Adding Gaussian noise in text processing (Sato et al., 2018) can also achieve the effect of data augmentation. Besides, some works (Miyato et al., 2017; Zhu et al., 2020) use adversarial training methods for data augmentation. For convenience and simplicity, we adopt back translation (Sennrich et al., 2016), which is used to construct pseudo-parallel corpora in unsupervised machine translation (Lample et al., 2018), to increase the amount of training data. Specifically, we use the Google API to translate the passage into French and then translate the French text back into English. The pseudo-parallel corpus can be obtained as $\{D'\} = \mathrm{bkt}(\{D\})$, where $\{D'\}$ denotes the back-translated English corpus that we use for data augmentation and $\mathrm{bkt}$ denotes back translation.
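The augmentation loop can be sketched as below. No real translation service is called: the translator is passed in as a function `translate(text, source, target)`, which is a hypothetical interface, and the field name `article` is likewise an assumption about how examples are stored. Only the passage is back-translated; every other field of the example is copied unchanged.

```python
from typing import Callable, Dict, List

Example = Dict[str, object]
Translator = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text


def back_translate(text: str, translate: Translator, pivot: str = "fr") -> str:
    """Translate English text to the pivot language and back to English."""
    pivot_text = translate(text, "en", pivot)
    return translate(pivot_text, pivot, "en")


def augment(examples: List[Example], translate: Translator) -> List[Example]:
    """Return the original examples plus one back-translated copy of each."""
    augmented = list(examples)
    for ex in examples:
        new_ex = dict(ex)  # shallow copy keeps the other fields as they are
        new_ex["article"] = back_translate(str(ex["article"]), translate)
        augmented.append(new_ex)
    return augmented
```

Passing the translator as an argument keeps the sketch independent of any particular service and makes it easy to test with a stub.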
As for the question, the special placeholder token means that forced translation may introduce grammatical errors and semantic gaps. Therefore, the questions and options are kept in their original form. After obtaining the pseudo-parallel corpus, we train our model on it together with the original training data, using the cross-entropy loss function.
Label Smoothing We also consider training the model with label smoothing (Miller et al., 1996; Pereyra et al., 2017). Label smoothing maintains uncertainty over the label space during training. When training a classifier with label smoothing, the hard one-hot label distribution is replaced with a softened label distribution controlled by a smoothing value $\alpha$, which is a hyperparameter. Specifically, in the hard one-hot label distribution, the target category has probability 1.0 and all other categories have probability 0.0. Label smoothing softens this distribution by discounting the target probability: the target category's probability becomes $1-\alpha$, and the probability of each remaining category is $\alpha/(K-1)$, where $K$ is the number of categories. In our experiments, we set the smoothing value $\alpha = 0.1$.
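The softened target distribution can be written out directly. The sketch below builds the smoothed labels for $K$ categories and computes the cross-entropy against them; the function names `smooth_labels` and `cross_entropy` are hypothetical, and $K = 5$ in the usage corresponds to the five candidate options.

```python
import math
from typing import List


def smooth_labels(num_classes: int, target: int, alpha: float = 0.1) -> List[float]:
    """Return 1 - alpha for the target class and alpha / (K - 1) elsewhere."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0 <= target < num_classes:
        raise ValueError("target index out of range")
    off_value = alpha / (num_classes - 1)
    return [1.0 - alpha if k == target else off_value for k in range(num_classes)]


def cross_entropy(probs: List[float], target_dist: List[float]) -> float:
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target_dist) if t > 0)
```

For example, `smooth_labels(5, 2, 0.1)` assigns 0.9 to the third option and 0.025 to each of the other four, so the distribution still sums to one.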

Experimental Setup
The scale of each subtask is shown in Table 1. We train the model on the training data and the related pseudo data generated by back translation, select hyper-parameters based on the best-performing model on the dev set, and then report results on the test set.
Our system is implemented with PyTorch, and we use the PyTorch version of the pre-trained language models. We employ the RoBERTa (Liu et al., 2019) large model as our PLM encoder in Equation 2. The Adam optimizer (Kingma and Ba, 2014) is used to fine-tune the model. We report the detailed setup of the model that performs best on the development set. For subtask-1 and subtask-2, the hyper-parameters are shown in Table 2.

Evaluation Results
Imperceptibility Table 3 shows the results of our system on subtask-1 of ReCAM.
Compared to the backbone RoBERTa large model, our methods achieve significant improvements. Interestingly, the special tokens are the most helpful component for the Imperceptibility subtask.
Nonspecificity Table 4 summarizes the results of our approaches on subtask-2 of ReCAM. In the Nonspecificity subtask, the model with special tokens and label smoothing performs best. Compared to the backbone RoBERTa large model, all our methods achieve better performance.
Interaction We also take part in subtask-3 of ReCAM, Interaction, which aims to provide more insight into the relationship between the two views of abstractness. In this task, we test the performance of our system when it is trained on one definition and evaluated on the other. The results of our system on the Imperceptibility and Nonspecificity subtasks are shown in Table 5. We find that our model is relatively robust across different abstract concepts.

Ablation Study
In this part, we perform an ablation study of our approaches (special tokens, sentence re-ranking, label smoothing, siamese encoders and back translation). Tables 3 and 4 show that our proposed methods help the backbone model better represent and understand abstract concepts. Note that the special tokens bring the largest improvements to the PLM in both subtask-1 and subtask-2. It is possible that the special tokens teach the model to focus more strongly on the abstract concept. The other common tricks bring only small improvements.

Discussion of Special Tokens
We also search for the best special tokens for ReCAM on the dev set of subtask-1. Here e stands for the word entity, and # and $ are common special tokens in NLP downstream applications.
As shown in Table 6, <e> </e> enhance the representations of abstract concepts best of all. # and $ also work well. In addition, <> and </> can help PLMs pay attention to the abstract concepts. Interestingly, every special token we tried helps PLMs choose the right abstract concepts, which are otherwise submerged in long token sequences (including the article and the summary). This result strengthens the point that special tokens can enhance the representation of abstract concepts in PLM-based approaches.

Conclusion
In this paper, we design several simple and effective approaches to improve the performance of PLMs on all three subtasks. Experiments demonstrate that the proposed methods achieve significant improvements over the PLM baseline, and we obtain eighth place in subtask-1 and tenth place in subtask-2 in the final official evaluation. Moreover, we show that special tokens are useful features that account for most of the system's improvement and work well in enhancing PLMs for representing and understanding abstract concepts.