Knowledge Injection with Perturbation-based Constrained Attention Network for Word Sense Disambiguation

Supervised Word Sense Disambiguation (WSD) has been studied intensively for over three decades. However, disentangling diverse contexts is still a challenging problem. This paper addresses the problem and proposes a Perturbation-based constrained attention network (Pconan) for injecting lexical knowledge derived from WordNet. Pconan models beneficial dependencies between the segments/words within the input sequence with the mask-attention technique. We incorporate a perturbation method into our model to mitigate the overfitting problem resulting from intensive learning. Experimental results on a benchmark dataset show that our method is comparable to the SOTA WSD methods. Our source code is available online (https://github.com/fukumoto-lab/Pconan).


Introduction
Computational lexicons such as WordNet (George A. Miller and Miller, 1990) and ACQUILEX (Edward, 1991) have been popular knowledge resources for NLP tasks. There is a large body of WSD work based on neural networks that leverages rich information derived from these resources (Luo et al., 2018b; Vial et al., 2019; Kumar and Talukdar, 2019; Huang et al., 2019; Blevins and Zettlemoyer, 2020; Bevilacqua and Navigli, 2020; Conia and Navigli, 2021). These works demonstrated that an external knowledge base is beneficial for disambiguating senses. However, it is often the case that the inputs are long sequences, which hampers WSD attempts with external knowledge. Several authors have attempted to alleviate the issue. (Blevins and Zettlemoyer, 2020) independently embedded the target word with its surrounding contexts and the dictionary definition of each sense. (Bevilacqua and Navigli, 2020) extended the method of (Blevins and Zettlemoyer, 2020) and integrated relational knowledge into the architecture through a simple additional sparse dot product operation. Their results on a benchmark dataset were beyond 80%.
The attention mechanism is also one of the major techniques for capturing long-term dependencies in a sequence (Vaswani et al., 2017). (Luo et al., 2018a) introduced a co-attention mechanism to generate co-dependent representations that capture both word- and sentence-level information. Their assumption is that lexical knowledge such as gloss sentences and context sentences can help each other to highlight the important words within these sentences, while the sense definition candidates are not all taken into account at once during the training process. Several authors focused on this issue (Wang and Wang, 2021). (Barba et al., 2021b) proposed a joint-learning approach that learns the input context and target word definitions jointly. Subsequently, (Barba et al., 2021a) conditioned the disambiguation of a target word not only on its context but also on the explicit senses assigned to the surrounding words, although their model does not leverage external lexical knowledge.
Inspired by the previous work mentioned above, we propose a method to inject lexical knowledge, i.e. example sentences from WordNet, to effectively learn a context sentence and lexical knowledge simultaneously. Our model, called Perturbation-based constrained attention network (Pconan), models dependencies between the segments/words within the input sequence with the mask-attention technique. The technique makes it possible to concentrate on learning beneficial dependencies only, entirely discarding the others. However, this causes an overfitting problem, especially when the available training data is limited. To alleviate this issue, Pconan utilizes the perturbation technique (Sato et al., 2019). More specifically, we add noise to the training data, and the model learns sense distinctions of the same word by using these noisy data to assign the correct label.
Let SI_tw(i) be a sense inventory and tw(i) the current, i-th target word appearing in the context sentence c. Let g^1, ..., g^N be the candidate definitions (CDs) for tw(i), along with their example sentences (ESs) s^1, ..., s^N, where g^k_j and s^k_j denote the j-th word of the k-th CD g^k and of the k-th ES s^k (1 ≤ k ≤ N), respectively. N is the total number of CDs, and <def> stands for the start of each segment. For a given c including <d>tw(i)</d>, we create an input sequence X. The inputs to each encoder are padded with the DEBERTa-specific start and end symbols [CLS] and [SEP] (He et al., 2021). The goal of the WSD task is, for the input sequence, to find the correct definition g ∈ SI_tw(i).
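As a concrete illustration, the input sequence X can be assembled as follows. This is a minimal sketch under our reading of the notation above; the `build_input` helper and the exact token layout are our own illustration, not the authors' code.

```python
def build_input(context, target, candidates):
    """Assemble a WSD input sequence (illustrative sketch).

    context    : context sentence c containing the target word
    target     : surface form of the target word tw(i)
    candidates : list of (definition, example_sentence) pairs,
                 i.e. the CDs and ESs from the sense inventory
    """
    # Mark the target word in the context with <d> ... </d>.
    marked = context.replace(target, f"<d> {target} </d>", 1)
    parts = ["[CLS]", marked, "[SEP]"]
    # Each candidate definition (CD) and its example sentence (ES)
    # is appended as a <def>-prefixed segment.
    for definition, example in candidates:
        parts.append(f"<def> {definition}")
        parts.append(f"<def> {example}")
    parts.append("[SEP]")
    return " ".join(parts)

seq = build_input(
    "The social and intellectual world .",
    "world",
    [("the world of scholarship and science.",
      "his work was well known in the academic world.")],
)
```

In practice the sequence would then be tokenized by the DEBERTa tokenizer before encoding.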

Constrained Attention Network
Our model applies (1) keyword identification and (2) information transfer to learn relevant contextual features for WSD. The top of Figure 1 illustrates an example of the input sequence X.
For the input sequence, we apply the so-called hard attention technique (Xu et al., 2015; Shen et al., 2018), in which the model concentrates solely on learning beneficial dependencies to identify keywords in the sequence, entirely discarding the others. The middle picture of Figure 1 illustrates masked attention as a two-dimensional matrix: the words aligned on the horizontal axis are heads, and those aligned on the vertical axis are dependents. As illustrated in Figure 1, we discard some segment pairs, each consisting of a head and a dependent in the sequence, by masking them, since these pairs are not semantically related to each other and do not include keywords that help identify the sense of the target word. The output is a sequence of words with attention weights; in Figure 1, keywords with high weight values are marked in red. From the result of keyword identification, for each candidate definition we create a sequence for the target candidate definition starting from <def> as follows:
• The context sentence segment is perceived as immediately before the CD segment ((i) in Figure 1).
• The <def> of the target CD has two branches. One witnesses the other CDs, their ESs, and the context definitions ((ii) in Figure 1). The other witnesses its own ES, where the relative position of the special token <def> is set to 1 ((iii) in Figure 1).
As shown in "(2) Transferring information" of Figure 1, this structure leverages the relative positions of <def>, including the keyword representations, to learn the model more accurately.
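The hard-attention masking can be sketched as follows. This is an illustrative NumPy version, not the authors' code; in particular, how the discarded segment pairs of Table 1 are encoded in `keep_mask` is our assumption.

```python
import numpy as np

def masked_attention(scores, keep_mask):
    """Hard attention over a head-by-dependent score matrix.

    Positions where keep_mask is False are set to -inf before the
    softmax, so the discarded head/dependent pairs receive exactly
    zero attention weight.
    """
    scores = np.where(keep_mask, scores, -np.inf)
    # Numerically stable row-wise softmax over the head axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Toy example: 3 tokens; discard attention from dependent 0 to head 2.
scores = np.zeros((3, 3))
keep = np.ones((3, 3), dtype=bool)
keep[0, 2] = False
A = masked_attention(scores, keep)
```

The masked pairs are removed from learning entirely, rather than merely down-weighted, which is what distinguishes this hard attention from ordinary soft attention.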

Model Architecture
Figure 2 illustrates an overview of our model. Let E = [e_1, e_2, ..., e_L] ∈ R^{d×L} be the concatenation of word/token embeddings for the input sequence X, and E_rpm ∈ R^{d×L} be the relative position matrix of X. Here, e_k and its relative position are obtained by DEBERTa encoding; d refers to the dimension of the embedding, and L is the number of words/tokens in X. We further utilize a special symbol in X so that the model can capture the difference between the context sentence and the others. Let also r_k ∈ R^d be a perturbation vector for the k-th word x_k in the input X. The perturbed input embedding ê_k is computed from the gradient of the loss as follows:

ê_k = e_k + r_k,   r_k = ε · g_k / ||g_k||,   g_k = ∇_{e_k} L_1(θ),   (1)

where ε refers to a hyperparameter that controls the norm of the perturbation and L_1(θ) indicates the cross-entropy loss, which is given by:

L_1(θ) = - (1/M) Σ_{i=1}^{M} Σ_{j=1}^{N_i} y_ij log ŷ_ij,   (2)

where M is the total number of target words in the training data, N_i is the number of word senses of the i-th target word, and y_ij and ŷ_ij are the true and predicted probabilities that the i-th target word belongs to the j-th candidate definition. As shown in Figure 2, we apply Eq. (1) to each embedding e_k and obtain the perturbed sequence; let Ê be the concatenation of the perturbed sequence. Our constrained attention network aims to learn relevant contextual keywords for WSD, and finally outputs attention weights A for the inputs Ê and E_rpm. We further obtain Â by applying the mask-attention procedure to A.
Ê is linearly projected to obtain V_c. We multiply V_c and Â by matrix multiplication; keyword information is transferred by this operation. The result is fed into a feed-forward network, combined with layer normalization and residual connections. Each encoder layer takes the output of the previous layer as input, and the number of layers is N. We obtain the matrix H as the output of the encoder. Each <def> vector that corresponds to the start of a candidate definition is extracted from the matrix H and passed to the fully connected layer, and finally we obtain the probability score ŷ_ij by the softmax function. The final loss L(θ) of our model is given by:

L(θ) = L_1(θ) + (α/D) Σ_{x} KL(p(· | x; θ) || p(· | x + r; θ)),   (3)

where D refers to the number of training instances, α indicates a hyperparameter, KL(p||q) denotes the KL-divergence between distributions p and q, and r is the concatenated vector of r_k for all x_k (1 ≤ k ≤ L). We train the whole architecture to minimize L(θ). Similar to the approach of (Barba et al., 2021a), at training time we use teacher forcing on the context definitions, and at prediction time we use a greedy decoding strategy, whereby the model deems g the most likely definition for the current target word.
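The perturbation and the combined loss can be sketched as follows. This is a minimal NumPy sketch under our reading of Eqs. (1)-(3); the gradient is supplied analytically here, and the function names and exact placement of the KL term are our own assumptions, not the authors' implementation.

```python
import numpy as np

def perturbation(e, grad, eps):
    """Sketch of Eq. (1): shift the embedding e by eps along the
    normalized gradient of the loss with respect to e."""
    r = eps * grad / (np.linalg.norm(grad) + 1e-12)
    return e + r, r

def kl(p, q):
    """KL divergence between two discrete distributions p and q."""
    return float(np.sum(p * np.log(p / q)))

def total_loss(ce_loss, p_clean, p_pert, alpha, D):
    """Sketch of Eq. (3): cross-entropy loss plus an alpha-weighted
    KL term between the clean and perturbed output distributions,
    averaged over the D training instances."""
    return ce_loss + (alpha / D) * kl(p_clean, p_pert)

# Toy example: a 2-d embedding perturbed along gradient [3, 4].
e_hat, r = perturbation(np.array([1.0, 0.0]),
                        np.array([3.0, 4.0]), eps=0.1)
```

The KL term penalizes the model when noisy inputs change its predicted sense distribution, which is how the perturbation acts as a regularizer against overfitting.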

Model settings and evaluation metrics
We utilized the hyperparameters with the best performance on SE07 as follows: the dimension of word embedding d was 1,024; the maximum number of words per batch was 1,536; the gradient accumulation and the maximum number of steps were 8 and 25,000, respectively; the number of layers N of DEBERTa was 24; and the learning rate was 3e-6. The initial perturbation was set to 1e-2, and the ε value in Eq. (1) was 3e-6. α in Eq. (3) was set to 1.0. We used Rectified Adam as the optimizer (Weijie Liu, 2020). The experiments were conducted using PyTorch on an Nvidia GeForce RTX A6000 (48GB memory). We used the F1-score, following (Alessandro Raganato, 2017b).

Results
The results are summarized in Table 2. The performance of the joint-learning approaches was better than that of the frequency-based and knowledge-source-integration approaches on all test sets and part-of-speech (POS) patterns. This indicates that models that learn the input context and target word definitions jointly are effective for disambiguation. Our model was statistically significantly better than the second-best method on all test sets and POS patterns except SE3, SE15, Adj, and Adv.

Ablation study
We conducted ablation studies to empirically examine our mask-attention technique and perturbation (Prtb). Table 3 shows the results. When we did not utilize the mask-attention and perturbation procedures, the F1-score was 82.1%, which shows no significant difference from ConSec (82.0%) even though ESs are injected. When we applied the mask-attention technique, the improvement was 0.5% at maximum, reaching 82.6%. Among the masked attentions, there is also no statistically significant difference between masking (a) context sentence and ES pairs (82.5%) and the combination of (b)∼(d) (82.3%). However, we gained a 0.6% improvement by using (a)∼(d), and a further 0.4% improvement by perturbation. From these observations, we conclude that the perturbation approach helps the mask-attention procedure to boost WSD performance.

Qualitative analysis of errors
We performed an error analysis to provide feedback for further improvement of our method. The number of errors was 590 for nouns, 420 for verbs, 120 for adjectives, and 38 for adverbs, 1,168 words in all. The average number of senses for these POS words was 6.7 for nouns, 12.2 for verbs, 5.9 for adjectives, and 5.2 for adverbs. We randomly picked 100 words from the 1,168 and found that there are mainly two types of errors:
1. Sense distribution: When the sense distribution of the target word in the training data is unbalanced, most target words tend to be assigned to the sense with the most training instances. This was the most frequent error type; 51 words were classified into this type.
2. Similarity between candidate definitions: When the words that appear in one candidate definition are semantically similar or identical to those of other candidate definitions, it is difficult to model beneficial dependencies to identify keywords. For example, in Figure 3, since "move forward" appears in both candidate definitions and in the example sentence, only a few words such as "car" and "seat" are clues to predict beneficial dependencies, which causes an error. 26 words were classified into this type.
We focused on example sentences extracted from WordNet as lexical knowledge. (Vial et al., 2019; Conia and Navigli, 2021) utilized the semantic relationships between senses, such as synonymy, hypernymy, and hyponymy, derived from WordNet, and reported that such external knowledge contributes to improving WSD performance. This is definitely worth trying with Pconan.

Conclusion
We presented a WSD approach for injecting lexical knowledge from WordNet with a perturbation-based constrained attention network. Comparative results with the SOTA WSD methods showed the effectiveness of our method. Future work will include: (i) evaluating our model with other lexical knowledge such as the semantic relationships between senses, (ii) investigating other perturbation techniques (Gal and Ghahramani, 2016; Wang and Wang, 2021) to improve performance, and (iii) applying methods (Liu et al., 2020; Xiong et al., 2021) to reduce the overall self-attention complexity for further gains in efficiency.


Figure 1: Identifying keywords and transferring information.

[Context sentence] Skilled ringers use their wrists to <d> advance </d> or retard the next swing so that one bell can swap places with another in the following change.
[Candidate sense & definition & example sentence] advance#1 <def> move forward, also in the metaphorical sense. <def> Times marches on. advance#5 <def> cause to move forward <def> Can you move the car seat forward?

Figure 3: An example with similar candidate definitions: the correct sense of <d> advance </d> is advance#5.


Table 1: Masked attention between head and dependent.
Table 1 shows the pairs that we masked. For example, (b) pairs of candidate definitions and example sentences that do not correspond to each other are masked (purple box in Figure 1).

Table 2: Performance comparison. The best score is in boldface and the second best is underlined. * denotes the method (if any) whose score is not statistically significantly different from the best one. We used a t-test, p-value < 0.01.

Table 3: Ablation test over the Pconan components. "w/o Prtb" refers to the result without perturbation. * denotes a method whose score is not statistically significantly different from the baseline, ConSec.