Structure-aware Sentence Encoder in BERT-Based Siamese Network

Recently, impressive performance on various natural language understanding tasks has been achieved by explicitly incorporating syntax and semantic information into pre-trained models, such as BERT and RoBERTa. However, this approach depends on problem-specific fine-tuning, and, as widely noted, BERT-like models exhibit weak performance and are inefficient when applied to unsupervised similarity comparison tasks. Sentence-BERT (SBERT) has been proposed as a general-purpose sentence embedding method, suited to both similarity comparison and downstream tasks. In this work, we show that by incorporating structural information into SBERT, the resulting model outperforms SBERT and previous general sentence encoders on unsupervised semantic textual similarity (STS) datasets and transfer classification tasks.


Introduction
Pre-trained models like BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) have demonstrated promising results across a variety of downstream NLP tasks. Though BERT-like models have been shown to capture hidden syntax structures (Clark et al., 2019; Hewitt and Manning, 2019; Jawahar et al., 2019), recent works have achieved performance improvements on various natural language understanding (NLU) tasks through the use of graph networks that capture syntax and semantic information. Xu and Yang (2019) demonstrate the value of syntax information for pronoun resolution tasks, using Relational Graph Convolutional Networks (RGCNs) (Schlichtkrull et al., 2018) to incorporate syntactic dependency graphs. Wu et al. (2021) argue that semantics has not been brought to the surface of pre-trained models and propose to introduce semantic label information into RoBERTa via RGCNs. Similar ideas have been applied to information extraction (Santosh et al., 2020), sentence-pair classification (Liu et al., 2020) and sentiment analysis (Yin et al., 2020). Though problem-specific fine-tuning is required, these improvements suggest that structural supervision is useful and that RGCNs serve as an effective structure encoder.
BERT can also be used as a general sentence encoder, either by taking the CLS token (the first token of the BERT output) or by applying pooling over its outputs. However, this fails to produce sentence embeddings that can be used effectively for similarity comparison. Furthermore, using BERT directly for similarity comparison is extremely inefficient, as it requires sentence pairs to be concatenated and passed through BERT for every possible comparison. In response, Sentence-BERT (SBERT) has been proposed to alleviate this by fine-tuning BERT on natural language inference (NLI) datasets using a siamese structure (Reimers and Gurevych, 2019). The resulting general-purpose sentence embeddings outperform previous sentence encoders on both similarity comparison and transfer tasks.
In this paper, we show that it is possible to improve the SBERT sentence encoder through the use of explicit syntactic or semantic structure. Inspired by SBERT's success in producing general sentence representations and by previous efforts to introduce structural information into pre-trained models, we propose a model that combines the two by training a BERT-RGCN model in a siamese structure. Under specific structural supervision, the proposed model is able to produce structure-aware, general-purpose sentence embeddings. Our empirical results show that it outperforms SBERT and previous sentence encoders on unsupervised similarity comparison and transfer classification tasks. Furthermore, we find that the produced sentence representations generalise better, especially on fine-grained classification tasks.

Related Work
Sentence encoders have been studied extensively in recent years. Skip-Thought (Kiros et al., 2015) is trained in a self-supervised fashion to predict the sentences surrounding a given sentence. Hill et al. (2016) proposed a sequential denoising autoencoder (SDAE) that learns sentence representations by reconstructing a sentence from a corrupted version of it. InferSent (Conneau et al., 2017), on the other hand, used labelled NLI datasets to train a general-purpose sentence encoder with a BiLSTM-based siamese structure. Cer et al. (2018) proposed the Universal Sentence Encoder (USE) model based on transformers (Vaswani et al., 2017), and trained it with both unsupervised tasks and supervised NLI tasks. Inspired by InferSent, Sentence-BERT (SBERT) (Reimers and Gurevych, 2019) produces general-purpose sentence embeddings by fine-tuning BERT on NLI datasets in a siamese structure, showing improved performance on a variety of tasks.
Hidden syntax structures in pre-trained models have been well explored, and various probing methods have been used to investigate them (Clark et al., 2019; Hewitt and Manning, 2019; Jawahar et al., 2019). The benefit of adding external structures to pre-trained models has also been questioned: Glavaš and Vulić (2021) examined the benefits of incorporating universal dependencies into pre-trained models, and Dai et al. (2021) showed that trees induced from pre-trained models can produce competitive results compared with external trees. However, recent improvements have still been observed on various NLU tasks by incorporating structural information into pre-trained models. Yin et al. (2020) proposed SentiBERT to incorporate constituency trees into BERT for sentiment analysis. Xu and Yang (2019) modelled each sentence as a directed dependency graph using RGCNs, and achieved large improvements on pronoun resolution. Zhang et al. (2020) proposed a semantics-aware BERT model by further encoding semantic information with BERT using a GRU (Chung et al., 2014). RGCNs have also been used by Wu et al. (2021) to introduce semantic information into RoBERTa, achieving consistent improvements when fine-tuned on problem-specific datasets. Similar efforts provide syntax information via the self-attention mechanism (Bai et al., 2021).

Model
Inspired by Reimers and Gurevych (2019), we train our model in a siamese network, updating the weights so as to produce similarity-comparable sentence representations. The model we propose consists of two components, as shown in Figure 1.
BERT: Each sentence is first fed into the pre-trained BERT-base model to produce both a sentence representation, obtained by mean-pooling, and the original contextualised token representations over the full sequence, which are used to initialise an RGCN.
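For concreteness, the BERT side of the encoder can be sketched as follows (a minimal illustration using the HuggingFace transformers library; the function and variable names are ours, not from a released implementation):

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode(sentences):
    # Tokenise a batch of sentences and run them through BERT.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    token_repr = bert(**batch).last_hidden_state          # (batch, seq_len, 768)

    # Mean-pool over non-padding tokens to obtain the sentence representation.
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
    sent_repr = (token_repr * mask).sum(1) / mask.sum(1)

    # token_repr is kept to initialise the RGCN node states described below.
    return sent_repr, token_repr, batch
```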
Structure Information: We use the spaCy dependency parser (Honnibal et al., 2020) with its middle model to obtain dependency parse trees for all input sentences. We also experimented with the use of semantic graphs, since Wu et al. (2021) have shown that semantic information benefits pre-trained models. However, we found semantic graphs to be less effective than syntactic dependency trees when evaluated on our development set; as a result, in the experiments below, we restrict our attention to syntactic dependency graphs.
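A dependency graph of this kind can be extracted with spaCy roughly as follows (an illustrative sketch; the exact edge format fed to the RGCN is an assumption, and en_core_web_md is spaCy's medium English model, which we take the "middle model" to refer to):

```python
import spacy

# spaCy's medium English model.
nlp = spacy.load("en_core_web_md")

def dependency_edges(sentence):
    """Return the sentence as words plus labelled, directed edges (head -> dependent)."""
    doc = nlp(sentence)
    edges = []
    for token in doc:
        if token.head.i != token.i:              # skip the root token's self-reference
            edges.append((token.head.i, token.dep_, token.i))
    return [t.text for t in doc], edges

words, edges = dependency_edges("The cat chased the mouse.")
# e.g. edges like (2, 'nsubj', 1) and (2, 'dobj', 4), depending on the parse
```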
RGCN: RGCNs, proposed by Schlichtkrull et al. (2018), can be viewed as a weighted message-passing process. At each RGCN layer, each node's representation is updated by collecting information from its neighbours and applying edge-specific weighting:
$$h_i^{(l+1)} = \sigma\Big( \sum_{r \in R} \sum_{j \in N_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)} \Big)$$
where $N_i^r$ and $W_r^{(l)}$ are the neighbours of node $i$ and the weight of relation $r \in R$, respectively. $c_{i,r}$ is a normalisation constant, normally set to $|N_i^r|$, the number of neighbours of node $i$ under relation $r$, and $W_0^{(l)}$ is the self-loop weight. In our case, each sentence is first parsed into a dependency tree and then modelled as a labelled directed graph by an RGCN, where nodes are words and edges are dependency relations. Following Schlichtkrull et al. (2018), we allow information to flow in both directions (from head to dependent and from dependent to head). Following Wu et al. (2021), we pass the BERT output through an embedding projection consisting of an affine transformation and a ReLU nonlinearity, and use the transformed representations to initialise the RGCN's node representations. Since BERT and spaCy use different tokenisation strategies, we align them by taking the first subtoken of each word from BERT as that word's representation in the RGCN. A structure-aware sentence representation is derived from the RGCN's output by applying mean-pooling over its node representations. During training, rather than using $c_{i,r} = |N_i^r|$, we found it best to apply a normalisation factor shared across relation types, $c_{i,r} = c_i = \sum_r |N_i^r|$, the total number of neighbours. We use a one-layer RGCN, as we find that a deeper network lowers the performance.
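A minimal PyTorch sketch of a single RGCN layer implementing the update above (illustrative only; it uses a dense relation-indexed adjacency tensor, and the class and variable names are not from a released implementation):

```python
import torch
import torch.nn as nn

class RGCNLayer(nn.Module):
    """One relational GCN layer: h_i' = ReLU( sum_r sum_{j in N_i^r} W_r h_j / c_i + W_0 h_i )."""

    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.rel_weights = nn.Parameter(torch.empty(num_relations, in_dim, out_dim))
        self.self_weight = nn.Linear(in_dim, out_dim, bias=False)
        nn.init.xavier_uniform_(self.rel_weights)

    def forward(self, h, adj):
        # h:   (num_nodes, in_dim) node states (projected BERT word vectors)
        # adj: (num_relations, num_nodes, num_nodes) float tensor, adj[r, i, j] = 1 if j is an
        #      r-neighbour of i; both edge directions are included as distinct relations.
        messages = torch.einsum("rij,jd,rdo->io", adj, h, self.rel_weights)
        # Normalise by the total number of neighbours across all relations (c_i above).
        degree = adj.sum(dim=(0, 2)).clamp(min=1).unsqueeze(-1)
        return torch.relu(messages / degree + self.self_weight(h))
```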
Connect BERT and RGCN: The concatenation of the BERT and RGCN sentence representations is then passed through a layer-normalisation layer to form the final sentence representation. The sentence embeddings of a given sentence pair are then combined before being passed to the final classifier for training. For this interaction, we use the concatenation of the sentence embeddings u and v and the element-wise difference |u − v|, which Reimers and Gurevych (2019) found to be the best concatenation mode. In this siamese structure, all parameters are shared and updated accordingly. We use cross-entropy loss for optimisation. For all experiments on baseline models, we use their released pre-trained models and scripts to produce sentence embeddings.
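The way the two views of a sentence are combined and trained can be summarised in the following sketch (hypothetical module names; u and v denote the final embeddings of the two sentences in a pair, and the 512-dimensional RGCN output is taken from the training details below):

```python
import torch
import torch.nn as nn

hidden_dim = 768 + 512                      # BERT sentence vector + RGCN sentence vector
layer_norm = nn.LayerNorm(hidden_dim)
classifier = nn.Linear(3 * hidden_dim, 3)   # 3 NLI labels: entailment / neutral / contradiction
loss_fn = nn.CrossEntropyLoss()

def sentence_embedding(bert_sent, rgcn_sent):
    # Concatenate the two views of the sentence and apply layer normalisation.
    return layer_norm(torch.cat([bert_sent, rgcn_sent], dim=-1))

def nli_loss(u, v, labels):
    # (u, v, |u - v|) interaction, as in Reimers and Gurevych (2019).
    features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
    return loss_fn(classifier(features), labels)
```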

Training Details
In order to produce general-purpose sentence embeddings, we follow SBERT in training the model on a combination of the SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets. They contain 570,000 and 430,000 sentence pairs, respectively, annotated as contradiction, entailment or neutral. Our model is trained for one epoch with a batch size of 16, using the Adam optimizer with learning rate 2e-5 and a linear learning-rate warm-up over 10% of the training data. For the RGCN layer, we use dropout of 0.2 and a hidden dimension of 512. Following SBERT, we evaluate our model on the STS benchmark development set using Spearman rank correlation every 1,000 steps during training, and save the best model.
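The optimisation setup above can be sketched as follows (hyperparameters are those stated in the text; the use of get_linear_schedule_with_warmup and the placeholder model are illustrative assumptions):

```python
import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

# Hyperparameters from the text.
batch_size, lr, num_epochs, warmup_frac = 16, 2e-5, 1, 0.1
num_pairs = 570_000 + 430_000                     # SNLI + MNLI sentence pairs
total_steps = num_pairs // batch_size * num_epochs
warmup_steps = int(warmup_frac * total_steps)     # linear warm-up over 10% of training

model = nn.Linear(10, 3)                          # stand-in for the BERT-RGCN siamese model
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
rgcn_dropout = nn.Dropout(p=0.2)                  # dropout applied inside the RGCN layer
```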

Evaluation - Unsupervised STS
First, we evaluate our model on semantic textual similarity (STS) datasets. We use the STS12-16 tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016), the SICK-Relatedness (SICK-R) test set (Marelli et al., 2014) and the STS benchmark (STSb) test set (Cer et al., 2017). Sentence pairs in these datasets are labelled from 0 to 5 according to their semantic relatedness. We obtain these datasets via SentEval (Conneau and Kiela, 2018). In this evaluation, we test the encoders' performance without using any task-specific training data. The results are given in Table 1: our model outperforms SBERT on all 7 tasks, obtaining the highest average score and demonstrating the benefits of including explicit syntax structure during supervision. Both SBERT and our model perform worse than USE on SICK-R. However, as observed by Reimers and Gurevych (2019), USE is trained on various datasets including question-answering pairs, NLI, online forums and news, which appears to be particularly suitable for SICK-R. Both BERT-AVG and BERT-CLS perform poorly, which reflects their weakness as general-purpose sentence encoders.
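For reference, the unsupervised STS protocol scores each pair by the cosine similarity of its two sentence embeddings and correlates these scores with the gold labels (Spearman rank correlation, as on the development set); a minimal sketch with dummy embeddings:

```python
import numpy as np
from scipy.stats import spearmanr

def sts_score(emb_a, emb_b, gold):
    # Cosine similarity between the two sentence embeddings of each pair ...
    cos = (emb_a * emb_b).sum(-1) / (np.linalg.norm(emb_a, axis=-1) * np.linalg.norm(emb_b, axis=-1))
    # ... correlated (Spearman) with the 0-5 gold relatedness labels.
    rho, _pvalue = spearmanr(cos, gold)
    return rho

# Dummy example: 3 pairs of 4-dimensional embeddings and their gold scores.
a, b = np.random.rand(3, 4), np.random.rand(3, 4)
print(sts_score(a, b, gold=np.array([0.5, 3.2, 4.8])))
```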

Evaluation - Transfer Tasks
As shown in Table 2, the proposed model generally outperforms previous encoders, though the difference between SBERT and our model is relatively small. Our model performs significantly worse than USE on TREC, which may be due to the fact that USE is pre-trained on question-answering data, which appears to be beneficial for the TREC question-type classification task. In contrast to their poor performance on the STS datasets, BERT-CLS and BERT-AVG produce good results on the classification tasks. This shows that the relevant information is encoded in BERT-CLS and BERT-AVG; they simply lack the ability to produce similarity-comparable sentence embeddings. Both SBERT and our model perform worse than BERT-AVG and BERT-CLS on the SUBJ task, which suggests that, while gaining on sentiment analysis tasks, fine-tuning on NLI datasets leads to a loss of information relevant to recognising the subjectivity of a sentence.
Extraction Difficulty: As we have seen, the difference between SBERT and our model in the previous transfer comparison is small. Our hypothesis is that, since we concatenate the outputs of BERT and the RGCN, the representations produced by our model are more complex, and simple logistic regression lacks the ability to extract useful information from such complex embeddings. To assess this, we replace the logistic regression with a single-hidden-layer MLP (128 hidden units), which is widely used as a probing classifier. We focus on the comparison between our model and SBERT, re-running these two models with 5 random seeds, and report accuracy in the same fashion, except that we adopt a stricter strategy for marking results in bold (as explained in the caption of Table 3).

Table 3: Results on SentEval evaluation with MLP. Cells are marked in bold only when the mean minus the standard deviation is no worse than the mean plus the standard deviation of the other model.
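A minimal version of such a probing classifier (illustrative; the optimiser and training procedure for the probe are not shown, and details beyond the hidden size are assumptions):

```python
import torch.nn as nn

class ProbingMLP(nn.Module):
    """Single-hidden-layer MLP (128 units) used in place of logistic regression."""

    def __init__(self, embed_dim, num_classes, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, sentence_embeddings):
        # Sentence embeddings are kept frozen; only the probe is trained.
        return self.net(sentence_embeddings)
```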
As shown in Table 3, for some tasks, e.g. MR and CR, both models show stable performance across different classifiers, and their performance remains similar when this more powerful extractor is used. However, for SST-5 (5-way sentiment classification) and TREC (6-way question-type classification), clear improvements are obtained by our model, suggesting that the additional syntax supervision that we introduce through RGCNs is beneficial for fine-grained classification tasks. A similar pattern of results was found when we experimented with a two-hidden-layer MLP.

Conclusion
In this work, we show that SBERT can be improved by explicitly incorporating structural information. By using RGCNs to incorporate syntactic structure into supervision, our model is able to produce structure-aware, general-purpose sentence embeddings that achieve improved results on both unsupervised similarity comparison and transfer classification tasks when compared against previous sentence encoders. By using a stronger probing classifier, we further show that our syntax-informed supervision method is particularly beneficial for fine-grained tasks such as SST-5 and TREC.

Acknowledgement
We thank all anonymous reviewers for their helpful comments, and NVIDIA for the donation of the GPU that supported our work.