Knowledge-Guided Paraphrase Identification

Paraphrase identification (PI), a fundamental task in natural language processing, aims to identify whether two sentences express the same or similar meaning; it is a binary classification problem. Recently, BERT-like pretrained language models have been a popular choice as the backbone of various PI models, but almost all existing methods consider general-domain text. When these approaches are applied to a specific domain, existing models cannot make accurate predictions due to the lack of professional knowledge. In light of this challenge, we propose a novel framework, namely Knowing, which can leverage external unstructured Wikipedia knowledge to accurately identify paraphrases. We propose to mine outline knowledge of concepts related to given sentences from Wikipedia via the BM25 model. After retrieving related outline knowledge, Knowing makes predictions based on both the semantic information of the two sentences and the outline knowledge. Besides, we propose a gating mechanism to aggregate the semantic information-based prediction and the knowledge-based prediction. Extensive experiments are conducted on two public datasets: PARADE (a computer science domain dataset) and clinicalSTS2019 (a biomedical domain dataset). The results show that the proposed Knowing outperforms state-of-the-art methods.


Introduction
Paraphrase identification (PI) is a classical yet fundamental natural language processing (NLP) task, which aims to determine whether a pair of sentences express the same or similar meaning (Bhagat and Hovy, 2013). Such a task can be used to examine whether a machine learning model really understands the semantic meanings of input sentences, and it is helpful for many other NLP tasks such as machine translation (Madnani et al., 2012) and question answering (Dong et al., 2017; Rinaldi et al.). To identify paraphrases automatically, machine learning models have been proposed. Traditional models (Mihalcea et al., 2006; Kozareva and Montoyo, 2006; Islam and Inkpen, 2009; Wan et al., 2006; Xu et al., 2014) focus on leveraging lexical and syntactic features to measure the similarity between two sentences. Recently, deep learning models have been introduced and achieve state-of-the-art performance. These models adopt convolutional neural networks (CNNs) (He et al., 2015; Filice et al., 2015), Long Short-Term Memory (LSTM) networks (Parikh et al., 2016; Chen et al., 2017; He and Lin, 2016; Nie and Bansal, 2017), or pretrained language models like BERT (Devlin et al., 2019). They directly learn the implicit relation between a pair of input sentences. However, existing approaches all ignore the importance of knowledge associated with the input sentences.
In fact, each meaningful sentence usually belongs to a certain domain and contains domain-specific knowledge (He et al., 2020a). When domain experts identify whether two sentences are paraphrases or not, they first read the sentences to comprehend the semantic meanings, then analyze them based on the domain knowledge associated with the sentences, and finally make a decision. As shown in Fig. 1, although S1 and S2 contain several matching words, experts know that they are not paraphrases because the first sentence describes the concept "dataset" but the second sentence does not. S3 and S4 do not even share many lexical and syntactic features, but they have the same meaning since binary instructions are made up of 0s and 1s, which corresponds to computer science domain knowledge. Therefore, for the PI task, relying only on lexical and syntactic features is insufficient, and it is indispensable to introduce domain knowledge into the models.
We will meet several technical challenges when introducing domain knowledge into the PI task: (1) Knowledge selection. Even in a specific domain, there is a huge volume of domain knowledge. The format of domain knowledge is either structured knowledge base or unstructured text. Thus, using which kind of domain knowledge and how to retrieve related knowledge for each sentence efficiently and accurately are new challenges.
(2) Knowledge fusion. After we choose appropriate knowledge for each sentence (e.g., a description of a knowledge concept), the challenge is how to automatically incorporate such unstructured knowledge into state-of-the-art identification models to make more accurate predictions.
To solve the aforementioned challenges, in this paper, we propose a knowledge-infused gated model (named Knowing), which is shown in Fig. 2. Knowing consists of four main components: a base prediction module, a knowledge selection module, a knowledge fusion module, and an aggregation module. The base module encodes sentence pairs and makes predictions based on their lexical and syntactic information. We then incorporate domain knowledge, the first step of which is to collect related knowledge. The knowledge selection module retrieves the top-m related knowledge outlines from Wikipedia via BM25 (Sanderson et al., 2010) for each sentence pair. The knowledge fusion module then encodes the knowledge via an attention mechanism and produces the knowledge-based prediction. In the end, the aggregation module aggregates the lexical and syntactic feature-based prediction and the knowledge-based prediction via a novel gate mechanism.
The main contributions can be summarized as follows: • To the best of our knowledge, we are the first to focus on domain-specific paraphrase identification and the first to infuse unstructured Wikipedia knowledge into BERT for paraphrase identification.
• We propose an effective and efficient way to use unstructured Wikipedia knowledge, which uses the outline of each concept and retrieves them via BM25.
• We propose a novel gated mechanism to automatically aggregate the lexical and syntactic feature-based prediction and the knowledge-based prediction.
• The proposed model outperforms state-of-the-art paraphrase identification models on two public domain-specific datasets.

Preliminaries
Before formally introducing the proposed model Knowing, we first mathematically define our task and the knowledge that we use in this paper.

Problem Formulation
Given a sentence pair (S_i^1, S_i^2) ∈ X and an external knowledge base B, our goal is to learn a function F(S_i^1, S_i^2, B) → {0, 1} that determines whether the two sentences S_i^1 and S_i^2 have the same or similar semantic meaning, where X = {(S_1^1, S_1^2), ..., (S_n^1, S_n^2)} denotes the training dataset and the corresponding labels are Y = {y_1, ..., y_n}.

External Knowledge
One major contribution of this work is to incorporate external knowledge to enhance the performance of paraphrase identification. Thus, the selection of the knowledge base is crucial. In this paper, we use the most popular knowledge base, i.e., Wikipedia. Each concept in Wikipedia is associated with an extensive description, including the outline, the definition, its functions, related concepts, and so on. The length of the whole knowledge description is usually greater than 512 tokens, which exceeds the maximum input length of BERT-like pretrained language models. In fact, compared with the other sections of the description, the outline is informative and is a high-level abstraction of the corresponding knowledge. Thus, instead of using the whole description, we use the outlines of concepts as the external knowledge.
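Concretely, the outline corresponds to the lead text of a Wikipedia article, i.e., everything before the first section heading. A minimal, runnable sketch of this extraction step (the sample page text and the helper name are illustrative, not the authors' code):

```python
def extract_outline(wikitext: str) -> str:
    """Return the lead section (the 'outline') of a Wikipedia article:
    everything before the first section heading such as '== History =='."""
    lead = []
    for line in wikitext.splitlines():
        if line.strip().startswith("=="):  # first section heading ends the lead
            break
        lead.append(line.strip())
    return " ".join(l for l in lead if l)

page = (
    "Quickbrowse was a Web-based subscription service that enables users to\n"
    "browse multiple Web pages more quickly by combining them vertically.\n"
    "== History ==\n"
    "Quickbrowse launched as one of the early metabrowsing services.\n"
)
print(extract_outline(page))
```

Because the lead is short, it typically fits comfortably within the 512-token input limit mentioned above.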

Figure 3: The outline of the concept "Quickbrowse": "Quickbrowse was a Web-based subscription service that enables users to browse multiple Web pages more quickly by combining them vertically into a single Web page. It was one of the early metabrowsing services."

We take the concept "Quickbrowse" (https://en.wikipedia.org/wiki/Quickbrowse) as an example, which is shown in Fig. 3. The content selected in the red box is the outline, which contains most of the important information in only a few words. Such outlines are more suitable for BERT-like pretrained language models.

Methodology

Overview
The goal of this work is to effectively incorporate external knowledge to further improve the state-of-the-art performance of the paraphrase identification task. Towards this aim, we propose a new model Knowing as shown in Fig. 2, which includes four modules: (1) base prediction module, (2) knowledge selection module, (3) knowledge fusion module, and (4) aggregation module.
The base prediction module uses a pre-trained BERT model to encode sentence pairs via their lexical and syntactic features and makes predictions based on the sentence pair representations. However, this base predictor ignores the importance of external knowledge. To leverage external knowledge, the knowledge selection module is designed to retrieve relevant outline knowledge from Wikipedia for each sentence pair. Since there may be several related outlines, to take them into account simultaneously, the knowledge fusion module first encodes each outline using the pre-trained BERT and then uses an attention mechanism to synthesize the outline knowledge representation. The sentence pair representation obtained from the base prediction module and the fused knowledge representation learned by the knowledge fusion module are the inputs of the aggregation module. In particular, we design a gated function to aggregate them to learn the final representation that is used to identify paraphrases. Next, we provide the details of each module.

Base Prediction Module
Given a pair of sentences, the simplest way is to first learn a representation for each sentence by extracting lexical and syntactic features and then train a classifier to identify their relation. However, this simple approach may not achieve satisfactory performance since it does not model the interactions at the word level. To address this issue, we propose to use the powerful pre-trained language model BERT, which can model interactions among words between two sentences by directly feeding the sentence pair to BERT, i.e., e_s = BERT(S), where S = S_i^1 ⊕ S_i^2 and ⊕ represents concatenation. The sentence pair representation e_s is further used to identify the paraphrase relation by utilizing a fully connected layer (FC) followed by the sigmoid function, i.e., P_s = σ(FC(e_s)), where σ(·) is the sigmoid function. This base prediction module may achieve satisfactory performance, but it may still make mistakes on sentence pairs that are difficult to match or distinguish. To further improve the performance of the PI task, we need to consider the utilization of domain knowledge. Next, we introduce how to select related knowledge for sentence pairs and then describe how to use the selected knowledge.
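As a rough, runnable sketch of this module's data flow: BERT itself is replaced here by a random placeholder embedding and the FC layer is untrained, so this only illustrates the shapes and the FC-plus-sigmoid prediction head, not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# e_s stands in for the embedding BERT produces for the concatenated
# pair S = S_i^1 (+) S_i^2; here it is just a random placeholder vector.
d = 768                     # BERT-base hidden size
e_s = rng.normal(size=d)

# A fully connected layer followed by a sigmoid gives the base prediction P_s.
W = rng.normal(size=d) * 0.01   # untrained FC weights (illustrative only)
b = 0.0
P_s = sigmoid(W @ e_s + b)
print(round(float(P_s), 4))     # a probability in (0, 1)
```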

Knowledge Selection Module
The goal of the knowledge selection module is to automatically retrieve relevant knowledge for a given sentence pair from the set of knowledge outlines B. A straightforward solution is to adopt the pre-trained BERT model to encode sentence pairs and outlines, then calculate their similarity scores, and finally select the top-m outlines with the highest scores. However, such an approach is inefficient, and the computation and space complexity could be high due to the huge number of concepts in Wikipedia (about 5,903,527 concepts).
To avoid this complexity bottleneck, we propose to use the classical model BM25 (Sanderson et al., 2010) to estimate the relevance scores between outlines and sentence pairs. BM25 ranks a set of documents based on the query terms appearing in each document. Sentence pairs in a specific domain usually contain professional terms, and we can treat them as the query terms, which makes it possible to use the simple but effective BM25 for knowledge selection. As an example, suppose we have the sentence pair "a computer that manages web site services, such as supplying a web page to multiple users on demand." and "provides information and services to web surfers.". If an outline in the knowledge base contains the terms "web", "services", and "users", it may be useful for determining whether the two sentences talk about the same thing.
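The scoring step can be sketched with a minimal Okapi BM25 implementation; the parameter values k1 = 1.5 and b = 0.75 are common defaults, not necessarily the paper's configuration, and the toy outlines are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each distinct query term
    df = {t: sum(1 for d in docs if t in d) for t in set(query_terms)}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        s = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(s)
    return scores

outlines = [
    "a web service provides services to users over the web".split(),
    "a binary file contains machine readable data".split(),
]
query = "web site services users".split()
print(bm25_scores(query, outlines))  # the first outline scores higher
```

In practice one would score all outlines in B this way and keep the top-m.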
Mathematically, given a sentence pair (S_i^1, S_i^2), the knowledge selection module retrieves m relevant outlines {k_1, k_2, ..., k_m} via BM25, and the corresponding relevance scores of the m outlines are denoted as {s_{k_1}, s_{k_2}, ..., s_{k_m}}.

Knowledge Fusion Module
There are m relevant outlines selected by the knowledge selection module, and intuitively each of them contains informative knowledge. However, the outlines differ in the amount of useful information they can provide. Thus, we need to automatically learn a relevance score to distinguish the importance of outlines and then use the weighted sum operation to fuse all the outlines for synthesizing outline knowledge representation.
Towards this aim, we first encode the outline knowledge. Similar to the encoding of sentence pairs, we use BERT to encode each outline, and the representation of knowledge k_i is obtained as e_{k_i} = BERT(k_i). To distinguish the importance of the m outlines for the prediction, we take advantage of the attention mechanism (Chorowski et al., 2015; Lian et al., 2020; Vaswani et al., 2017) to automatically assign an attention weight to each outline. Formally, the importance can be computed via α = softmax(e_s^T M E), where M is a learnable square matrix and E = [e_{k_1}, e_{k_2}, ..., e_{k_m}]. Then we represent the whole knowledge via the weighted sum based on the importance values, i.e., e_k = E α. Using the learned knowledge representation e_k and the learned sentence pair representation e_s, we can make a prediction. To enable them to fully interact with each other, we propose to use a fully connected layer (FC) followed by a sigmoid function to get the prediction P_k = σ(FC([e_s; e_k])).
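The fusion step can be sketched in plain NumPy. The bilinear attention form e_s^T M E followed by a softmax is one plausible reading of the mechanism described above; random vectors stand in for the learned BERT representations and parameters, and the toy dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    x = x - x.max()            # numerical stability
    e = np.exp(x)
    return e / e.sum()

d, m = 8, 3                    # toy embedding size and number of outlines
e_s = rng.normal(size=d)       # sentence pair representation
E = rng.normal(size=(d, m))    # columns are outline representations e_{k_i}
M = rng.normal(size=(d, d))    # learnable square matrix (random stand-in)

alpha = softmax(e_s @ M @ E)   # one attention weight per outline
e_k = E @ alpha                # weighted sum = fused knowledge representation

print(alpha.round(3), e_k.shape)
```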

Aggregation Module
Finally, the aggregation module synthesizes the predictions P_s and P_k. A direct way is to use P_k or (P_s + P_k)/2 as the synthesized result. However, the outline knowledge is retrieved via BM25, so it may not be entirely accurate. Therefore, directly aggregating P_s and P_k as P_k or (P_s + P_k)/2 may introduce noise if the knowledge is not that relevant to the sentence pair. To solve this problem, we design a gated mechanism to automatically control the weight of knowledge in the final prediction, i.e., the more relevant the knowledge, the larger the weight, and vice versa. Considering that BM25 also outputs the relevance scores of the knowledge while retrieving it, we use these relevance scores as the gate input to learn the weight of knowledge. Formally, the gate can be represented as g = σ(W_2 tanh(W_1 s)), where s = [s_{k_1}, s_{k_2}, ..., s_{k_m}], and W_1 and W_2 are parameters to be learned. Finally, the final prediction is obtained by combining the two predictions according to the gate, P = g P_s + (1 − g) P_k, and the loss function of the proposed Knowing is the binary cross-entropy L = −Σ_i [y_i log P_i + (1 − y_i) log(1 − P_i)].
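A toy sketch of the gate under stated assumptions: the two-layer form with a tanh nonlinearity, and the choice of which prediction the scalar gate g weights, are assumptions made for illustration (the exact equations are not reproduced here), and the parameters are randomly initialized rather than learned.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# BM25 relevance scores of the m retrieved outlines form the gate input s.
s = np.array([12.3, 7.1, 2.4])
W1 = rng.normal(size=(4, 3)) * 0.1   # gate parameters (random stand-ins)
W2 = rng.normal(size=(1, 4)) * 0.1

g = sigmoid(W2 @ np.tanh(W1 @ s)).item()  # scalar gate in (0, 1)

P_s, P_k = 0.30, 0.85                # base and knowledge-based predictions (toy)
P = g * P_s + (1.0 - g) * P_k        # gated combination (assumed direction)
print(round(g, 3), round(P, 3))
```

Whichever prediction g weights, the final prediction is a convex combination of P_s and P_k, so it always lies between the two.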

Experiments
In this section, we empirically validate the effectiveness of the proposed Knowing model. To gain insight into Knowing, we further analyze the role of Wikipedia knowledge in specific domains and the role of the proposed gated mechanism.

Experiment Settings
Knowledge base. In this paper, we use Wikipedia as the knowledge base to assist the paraphrase identification task. More specifically, the number of collected knowledge outlines from Wikipedia is 5,903,527.

Datasets.
We use two public datasets, PARADE (He et al., 2020a) and clinicalSTS2019, to evaluate the model performance. PARADE is a computer science domain benchmark dataset for paraphrase identification, while clinicalSTS2019 belongs to the biomedical domain. For the PARADE dataset, we use the same training, validation, and testing splits as He et al. (2020a). In clinicalSTS2019, the similarity score of each sentence pair ranges from 0 to 5, where 0 indicates irrelevance and 5 indicates equivalence in semantic meaning between the two sentences. Since paraphrase identification is a binary classification task, we convert the six classes into two categories: we set the labels of instances with scores 0, 1, and 2 as 0, and the remaining ones as 1. Since there is no validation set in the clinicalSTS2019 dataset, we construct one by randomly sampling 10% of the pairs in its training set. The statistics are shown in Table 1.
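The label conversion and validation split described above can be sketched as follows; the tuple layout and sample scores are illustrative, not the dataset's actual format:

```python
import random

def binarize(score: int) -> int:
    """Map a clinicalSTS2019 similarity score in {0, ..., 5} to a binary label:
    scores 0, 1, 2 become 0 (non-paraphrase); scores 3, 4, 5 become 1."""
    return 0 if score <= 2 else 1

pairs = [("s1a", "s1b", 1), ("s2a", "s2b", 4), ("s3a", "s3b", 2),
         ("s4a", "s4b", 5), ("s5a", "s5b", 0)]
labeled = [(a, b, binarize(sc)) for a, b, sc in pairs]

# Hold out 10% of the training pairs (at least one) as a validation set.
random.seed(0)
random.shuffle(labeled)
n_val = max(1, len(labeled) // 10)
val, train = labeled[:n_val], labeled[n_val:]
print(len(val), len(train))
```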

Comparison with Methods Pre-trained on Domain Specific Corpora
In this section, we explore how to effectively introduce external knowledge into language models. Besides using external knowledge as a knowledge base, as in the proposed model Knowing, another option is to pre-train language models on domain-specific corpora so that external knowledge is stored in the model parameters. To analyze these two options, we adopt as baselines several methods that pre-train BERT on biomedical domain corpora, such as BlueBERT (Peng et al., 2019), BioMedBERT (Chakraborty et al., 2020), and SciBERT (Beltagy et al., 2019), as well as on computer science domain corpora, such as SciBERT (pre-trained on Semantic Scholar with 18% computer science papers and 82% biomedical papers). The performance comparison is shown in Table 3. First, we observe that the variants of BERT pre-trained on domain corpora outperform vanilla BERT. BlueBERT, BioMedBERT, and SciBERT perform significantly better than BERT on clinicalSTS2019, with up to a 5.5% improvement in F1 score, and SciBERT performs up to 2.1% better than BERT in F1 score on PARADE. This confirms the importance of external knowledge and the effectiveness of pre-training BERT on biomedical and computer science domain corpora.
Figure 4: The performance of Knowing and its two variants on two datasets.

Figure 5: A qualitative example: Sentence 1 ("a computer that manages web site services, such as supplying a web page to multiple users on demand.") together with the retrieved knowledge outlines (Quickbrowse, code on demand, Web service, Google Charts, static web page) and their attention scores.

Even though pre-training on domain-specific corpora can incorporate external knowledge into the parameters, it is less effective than the proposed Knowing. Compared to BlueBERT, BioMedBERT, and SciBERT, Knowing(BERT) brings at least a 3.7% improvement in F1 score on clinicalSTS2019. Compared to SciBERT, Knowing(BERT) also brings a 1.4% improvement in F1 score on PARADE. Such improvements demonstrate that it is more effective to treat external knowledge as a knowledge base for retrieval than to pre-train on domain corpora. Besides, Knowing(SciBERT) shows limited gain over Knowing(BERT), which demonstrates that Knowing(BERT) can take good advantage of external knowledge, while additional pre-training may not bring further benefits.

Effectiveness of the Gated Mechanism
In this section, we explore the role of the proposed gated mechanism by analyzing two variants: (1) we set the gate value to 0, so that P(S_i^1, S_i^2) = P_k; and (2) we aggregate P_s and P_k via the average operation, i.e., P(S_i^1, S_i^2) = (P_s + P_k)/2. We report the performance comparison between these two variants and the proposed Knowing in Fig. 4.
First, we compare the proposed model Knowing against the variant with g = 0. When g = 0, the model makes predictions solely based on external knowledge, and according to Fig. 4 its performance degrades in terms of Accuracy and F1 score compared with Knowing. This shows that accurate predictions cannot be achieved from knowledge alone without taking semantic information into account.
Second, we compare the proposed model Knowing with the variant based on the average operation (Knowing with g = 0.5). The proposed Knowing brings improvements in terms of Accuracy and F1 score on both datasets, especially on clinicalSTS2019. This observation illustrates the necessity of adaptively combining external knowledge and the semantic information of sentences instead of using a fixed ratio. These results clearly show that the proposed gated mechanism is able to automatically tune how much external knowledge is incorporated, and this mechanism further improves the model performance.

Case Study
In this section, we use a concrete example from the PARADE dataset to show how Knowing works, as shown in Fig. 5. The sentence pair is "a computer that manages web site services, such as supplying a web page to multiple users on demand." and "provides information and services to web surfers.". The PARADE dataset contains a computer science topic attribute for each sentence pair. Since such an attribute is not common in other datasets, we do not take the topic attribute as part of the inputs to our model, but we can use this information to verify the knowledge selection for analysis purposes. The topic attribute of the given example is Web Service. The attention mechanism of Knowing assigns the largest score to the third knowledge concept, "Web service". This selection aligns well with the given topic attribute value. Correspondingly, the gate value for the external knowledge is 0.6402, which indicates that the selected knowledge concept is helpful in making the final prediction, which also aligns well with our observation.

Related Work

Paraphrase Identification
Traditional methods for paraphrase identification (PI) are based on word or string similarity measurements. VBS (Mihalcea et al., 2006) applies cosine similarity with tf-idf weighting. STS (Islam and Inkpen, 2009) and KM (Kozareva and Montoyo, 2006) measure the similarity based on both semantic and string similarity. MCS (Mihalcea et al., 2006) obtains the similarity scores based on multiple word similarity computation methods.
BERT (Devlin et al., 2019) and other pre-trained language models achieve state-of-the-art performance on PI. However, existing works do not incorporate domain knowledge for PI and hence cannot achieve satisfactory performance on domain-specific PI tasks. Different from existing works, the proposed method exploits unstructured knowledge and applies a novel gating mechanism to automatically aggregate the lexical and syntactic feature-based prediction and the knowledge-based prediction.

Knowledge-enhanced Language Models
Incorporating external knowledge into language models is effective for downstream tasks and has recently attracted much attention.
Recent works (Zhang et al., 2019; Xiong et al., 2019; Peters et al., 2019; Cui et al., 2020; Song et al., 2021; Hu et al., 2019; Ye et al., 2019) explore how to introduce knowledge graphs to enhance language models for downstream tasks. However, a knowledge graph may not be available for every domain, since its construction requires substantial human effort. Moreover, a structured knowledge graph contains entities and relations, but the knowledge associated with each entity may be incomplete and thus may not provide enough support for paraphrase identification.
To take advantage of unstructured knowledge, many works (Chakraborty et al., 2020; He et al., 2020b; Beltagy et al., 2019; Peng et al., 2019; Huang et al., 2019; Lee et al., 2020) propose to pre-train language models on domain-specific text. However, the pre-training objective function is usually not designed to capture knowledge concepts and their explanations, and it only leads to limited improvement at an intensive computation cost. Compared with existing works, our proposed method can leverage external knowledge effectively to achieve significant improvements without a computationally expensive pre-training stage.

Conclusions
In this paper, we investigated the important and challenging task of domain-specific paraphrase identification. Since domain-specific text is difficult to understand without domain knowledge, we proposed to incorporate Wikipedia knowledge into our model. However, there are two major challenges: (1) how to select knowledge, and (2) how to automatically incorporate the selected knowledge into state-of-the-art paraphrase identification models.
To solve these challenges, we introduced Wikipedia as an external knowledge base and proposed a knowledge-infused gated model named Knowing to fuse Wikipedia knowledge with BERT. Knowing contains four modules: a base prediction module, a knowledge selection module, a knowledge fusion module, and an aggregation module. The base prediction module learns sentence pair representations and makes predictions based only on the sentence pairs themselves. The knowledge selection module retrieves relevant knowledge from Wikipedia, and the knowledge fusion module then applies an attention mechanism to synthesize the knowledge representations. Finally, the aggregation module aggregates the base module's predictions and the knowledge-based predictions via a gate function. Experiments on two public domain-specific datasets show that the proposed Knowing outperforms state-of-the-art baselines.