Understanding Jargon: Combining Extraction and Generation for Definition Modeling

Can machines know what twin prime is? From the composition of this phrase, machines may guess that a twin prime is a certain kind of prime, but it is still difficult to deduce exactly what twin stands for without additional knowledge. Here, twin prime is a jargon term: a specialized term used by experts in a particular field. Explaining jargon is challenging since it usually requires domain knowledge to understand. Recently, there has been increasing interest in extracting and generating definitions of words automatically. However, existing approaches, both extractive and generative, perform poorly on jargon. In this paper, we propose to combine extraction and generation for jargon definition modeling: we first extract self- and correlative definitional information of the target jargon from the Web and then generate the final definition by incorporating the extracted definitional information. Our framework is remarkably simple but effective: experiments demonstrate that our method can generate high-quality definitions for jargon and outperforms state-of-the-art models significantly, e.g., improving the BLEU score from 8.76 to 22.66 and the human-annotated score from 2.34 to 4.04.


Introduction
Jargon terms are specialized terms associated with a particular discipline or field. To understand a jargon term, a straightforward approach is to read its definition, a highly summarized sentence that captures its main characteristics. For instance, given the jargon twin prime, people can learn its meaning by reading its definition: "A twin prime is a prime number that is either 2 less or 2 more than another prime number." Recently, acquiring definitions of words/phrases automatically has attracted increasing interest. There are two main approaches: extractive, corresponding to definition extraction, where definitions are extracted from existing corpora automatically (Anke and Schockaert, 2018; Veyseh et al., 2020; Kang et al., 2020); and abstractive, corresponding to definition generation, where definitions are generated conditioned on the target words/phrases and the contexts in which they are used (Noraset et al., 2017; Gadetsky et al., 2018; Bevilacqua et al., 2020; August et al., 2022; Gardner et al., 2022).
In this paper, we study jargon definition modeling, which aims to acquire definitions for jargon automatically. Jargon definition modeling is important since definitions of jargon are less likely to be organized in an existing dictionary/encyclopedia, and such terms are difficult for non-experts to understand without explanations (Bullock et al., 2019). This is particularly true for new jargon from fast-advancing fields. For instance, neither the Oxford dictionary (Butterfield et al., 2016) nor Wikipedia includes few-shot learning, an important setup in machine learning.
However, to acquire definitions for jargon, both extractive and abstractive approaches may fail. Extracting high-quality definitions is difficult due to the incompleteness and low quality of data sources (this issue is more serious for jargon, since jargon is usually used less frequently than general words/phrases). E.g., a good definition may not be available in the corpus; even if it exists, it may be difficult to select from a large set of candidate sentences (Kang et al., 2020). Generating definitions for jargon is challenging since jargon terms are usually technical terms that require domain knowledge to understand, while the contexts in which they are used cannot provide sufficient knowledge. For instance, it is almost impossible for a model to generate the definition of twin prime only with the context "proof of this conjecture would also imply the existence of an infinite number of twin primes", since the context does not explain twin prime, and the specific meaning is difficult to infer from the surface form, leading to hallucinations, i.e., generating irrelevant or contradicted facts (Bevilacqua et al., 2020). Consequently, existing models designed for general words/phrases perform poorly on jargon. In our evaluation (Tables 5 and 6), we find most definitions produced by the state-of-the-art model contain wrong information.

Jargon: few-shot learning

Self-Definitional Information (SDI)
• Few-shot learning, or one-shot learning in this case, is a hot topic for machine learning applications where the model is supposed to predict something based on a few training examples.
• Few-shot learning is a sub-area of machine learning.
• …

Correlative Definitional Information (CDI)
• Zero-shot learning (ZSL) is a problem setup in machine learning, where at test time, a learner observes samples from classes that were not observed during training, and needs to predict the class they belong to.
• Meta learning is a subfield of machine learning where automatic learning algorithms are applied to metadata about machine learning experiments.
• …

Definition: Few-shot learning (FSL) is a problem setup in machine learning, where predictions are made based on a few training examples.

Figure 1: The overview of the proposed framework. In this example, the definition of few-shot learning is generated based on both the SDI (e.g., "predictions are made based on a few training examples") and the CDI (e.g., "is a problem setup in machine learning").
Fortunately, definition extraction and definition generation can complement each other naturally. On one hand, the definition generator has the potential to help the extractor by refining and synthesizing the extracted definitions; therefore, the extracted sentences are not required to be perfect definitions of the target jargon. On the other hand, the definition extractor can retrieve useful definitional information as knowledge for the generator to produce definitions of jargon. Surprisingly, however, existing works are either extractive or abstractive, and do not even connect or compare the two.
Therefore, in this work, we propose to combine definition extraction and definition generation for jargon definition modeling. We achieve this by introducing a framework consisting of two processes: extraction, where definitional information of jargon is extracted from the Web; and generation, where the final definition is generated with the help of the extracted definitional information.
We build models for extraction and generation based on pre-trained language models (Devlin et al., 2019; Lewis et al., 2020a; Brown et al., 2020). Specifically, for extraction, we propose a BERT-based definition extractor to extract self-definitional information (i.e., definitional sentences of the target jargon). We also suggest that related terms can help define the target jargon and leverage Wikipedia as the external knowledge source to retrieve correlative definitional information (i.e., definitions of related terms). For generation, we design a BART-based definition generator to produce the final definition by incorporating the extracted knowledge. An example is shown in Figure 1.
Our framework for jargon definition modeling is remarkably simple and can easily be further extended by leveraging more advanced language models; e.g., we can replace the BART generator with larger models such as Meta OPT (Zhang et al., 2022) with a simple modification. Besides, our framework does not require a domain-specific corpus or ontology like the ones used in Vanetik et al. (2020) and Liu et al. (2021).

Prior work on definition extraction falls into three categories: 1) rule-based, which extracts definitions with manually designed rules and patterns (Fahmi and Bouma, 2006); 2) machine learning-based, which extracts definitions by statistical machine learning with carefully designed features (Westerhout, 2009; Jin et al., 2013); and 3) deep learning-based, the state-of-the-art approach for definition extraction, which is based on deep learning models such as CNN, LSTM, and BERT (Anke and Schockaert, 2018; Veyseh et al., 2020; Kang et al., 2020; Vanetik et al., 2020). Definition generation has mainly been studied for defining a word/phrase in a given context (Gadetsky et al., 2018; Ishiwatari et al., 2019; Washio et al., 2019; Mickus et al., 2019; Li et al., 2020; Reid et al., 2020; Bevilacqua et al., 2020; Huang et al., 2021a). For example, Bevilacqua et al. (2020) apply pre-trained BART (Lewis et al., 2020a) for definition generation with a simple context encoding scheme, and Huang et al. (2021a) employ three T5 models (Raffel et al., 2020) with a re-ranking mechanism to model the specificity of definitions. There are also recent works on definition modeling for other languages, e.g., Chinese, by incorporating the special properties of the specific language (Yang et al., 2020; Zheng et al., 2021). However, although definition extraction and definition generation are closely related tasks, surprisingly, existing works do not connect or compare them. In this work, we report the first attempt to combine them.

Methodology
Our framework for jargon definition modeling consists of two processes: extraction, which extracts self- and correlative definitional information of the target jargon from the Web; and generation, which generates the final definition by incorporating the extracted definitional information. The overview of the framework is shown in Figure 1.

Self-Definitional Information
Since jargon terms are specialized terms used in a particular field, understanding them requires background knowledge. To acquire useful information for defining a jargon term, it is natural to refer to definitional sentences containing the target jargon, which we call Self-Definitional Information (SDI). We obtain SDI by first extracting sentences containing the target jargon from the Web (more details are in Section 4.1) and then using a classifier to rank the extracted sentences.
To build the classifier, we apply the BERT model (Devlin et al., 2019), which has achieved excellent results on various text classification tasks. We adopt a simple encoding scheme, "[CLS] jargon [DEF] sentence", e.g., "[CLS] machine learning [DEF] machine learning is the study of computer algorithms that improve automatically through experience and by the use of data." The final hidden state of the first token [CLS] is used as the representation of the whole sequence, and a classification layer is added on top. After fine-tuning on jargon-sentence pairs, the model can distinguish whether a sentence contains representative definitional information of the target jargon. SDI is then obtained as the top definitional sentences by ranking the candidates according to the confidence of the prediction. We refer to this model as SDI-Extractor.
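The SDI extraction step above can be sketched as follows. This is a minimal, illustrative sketch only: the fine-tuned BERT classifier is stubbed out as `score_pair` (a hand-written heuristic standing in for the positive-class probability of a BertForSequenceClassification model), and `[DEF]` is assumed to be an extra special token added to the tokenizer in the real system.

```python
# Sketch of SDI extraction: encode jargon-sentence pairs with the paper's
# "[CLS] jargon [DEF] sentence" scheme and rank candidates by classifier
# confidence. The classifier is a stub here, not the actual BERT model.

def encode_pair(jargon: str, sentence: str) -> str:
    """Build the classifier input; [CLS] is prepended by the tokenizer."""
    return f"{jargon} [DEF] {sentence}"

def score_pair(encoded: str) -> float:
    """Stub confidence score counting definitional cue phrases.

    Placeholder for the fine-tuned BERT classifier's softmax probability.
    """
    text = encoded.split(" [DEF] ", 1)[1].lower()
    cues = ("is a", "is the", "refers to", "is defined as")
    return sum(cue in text for cue in cues) / len(cues)

def extract_sdi(jargon, candidates, top_k=5):
    """Rank candidate sentences by confidence and return the top-k as SDI."""
    scored = [(score_pair(encode_pair(jargon, s)), s) for s in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [s for _, s in scored[:top_k]]

candidates = [
    "Few-shot learning is a sub-area of machine learning.",
    "We evaluate few-shot learning on five benchmarks.",
]
print(extract_sdi("few-shot learning", candidates, top_k=1))
```

In the real pipeline only `score_pair` changes: it tokenizes the encoded pair and reads the classifier's confidence; the ranking logic stays the same.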

Correlative Definitional Information
To explain a jargon term, in addition to utilizing SDI, we can also refer to the definitions of its related terms, i.e., Correlative Definitional Information (CDI). For instance, to define few-shot learning, we can incorporate the definitions of zero-shot learning and meta learning, from which we can learn the meaning of "shot" and "learning" and may define few-shot learning similarly to zero-shot learning.
To get related terms and their definitions, we leverage Wikipedia as the external knowledge source, which covers a wide range of domains and contains high-quality definitions for a large number of terms. Specifically, we follow the core-fringe notion in Huang et al. (2021b), where core terms are terms that have corresponding Wikipedia pages, and fringe terms are ones that are not associated with a Wikipedia page. For each jargon term, we treat it as a query to retrieve the most relevant core terms via document ranking based on Elasticsearch (Gormley and Tong, 2015), and extract the first sentences of the corresponding Wikipedia pages as the definitions of related terms. We refer to this model as CDI-Extractor.
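A sketch of this CDI step is below, under loud assumptions: the Elasticsearch document ranking is replaced by a toy token-overlap score over a hypothetical two-page in-memory "Wikipedia" (`PAGES`), so only the overall flow (rank core terms, take each page's first sentence) reflects the paper.

```python
# Sketch of CDI extraction: use the jargon as a query to retrieve related
# core terms (terms with Wikipedia pages), then take the first sentence of
# each retrieved page as a correlative definition.

import re

# Hypothetical mini "Wikipedia": core term -> page summary text.
PAGES = {
    "zero-shot learning": (
        "Zero-shot learning (ZSL) is a problem setup in machine learning. "
        "At test time, a learner observes samples from unseen classes."
    ),
    "meta learning": (
        "Meta learning is a subfield of machine learning. "
        "Automatic learning algorithms are applied to metadata."
    ),
}

def first_sentence(text: str) -> str:
    """Return the first sentence of a page summary (naive punctuation split)."""
    match = re.match(r"(.+?[.!?])(\s|$)", text)
    return match.group(1) if match else text

def retrieve_core_terms(jargon: str, top_n: int = 5):
    """Stub ranking by query-term overlap; stands in for Elasticsearch BM25."""
    q = set(jargon.lower().split())
    scored = sorted(
        PAGES,
        key=lambda term: len(q & set(term.lower().split())),
        reverse=True,
    )
    return scored[:top_n]

def extract_cdi(jargon: str, top_n: int = 2):
    """Definitions of the top-n related core terms."""
    return [first_sentence(PAGES[t]) for t in retrieve_core_terms(jargon, top_n)]
```

With a real Elasticsearch index over Wikipedia, `retrieve_core_terms` would issue a match query against page titles/text; everything downstream is unchanged.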

Generation
After extraction, we acquire the self- and correlative definitional information of the jargon. This information captures important characteristics of the jargon and can be further refined and synthesized into the final definition by a definition generator.
Definition generation can be formulated as a conditioned sentence generation task: generating a coherent sentence to define the target jargon. Formally, we apply the standard sequence-to-sequence formulation: given jargon x, together with the extracted sentences S_s (for SDI) and S_c (for CDI), the probability of the generated definition d is computed auto-regressively:

P(d | x, S_s, S_c) = ∏_{i=1}^{m} P(d_i | d_0, ..., d_{i-1}, x, S_s, S_c),

where m is the length of d, d_i is the ith token of d, and d_0 is a special start token.
Following Bevilacqua et al. (2020), to build the generator, we employ BART (Lewis et al., 2020a), a pre-trained transformer-based encoder-decoder model that can be fine-tuned to perform specific conditional language generation tasks given suitable training input-output pairs. Different from existing works (Gadetsky et al., 2018; Ishiwatari et al., 2019; Bevilacqua et al., 2020), which aim to learn to define a word/phrase in a given context, we propose to learn to define a jargon term using the extracted knowledge. Specifically, we fine-tune the BART model to generate the definition of the target jargon based on the surface name of the jargon and the extracted definitional information.
To apply the BART model, for a target jargon, we adopt the following encoding scheme: "jargon [DEF] sent_1 [SEP] sent_2 ... [SEP] sent_k [DEF] sent'_1 [SEP] sent'_2 ... [SEP] sent'_k'", where sent_i and sent'_i are the ith sentences ranked by SDI-Extractor and CDI-Extractor, respectively. We fine-tune BART to produce the ground-truth definition conditioned on the encoded input.
After training, given a new jargon term, we obtain the corresponding SDI and CDI as described in Section 3.1. We encode the jargon and the top k ranked sentences of SDI and the top k' ranked sentences of CDI as described above, and use the generator to produce the final definition. We refer to this model as CDM-Sk,Ck', i.e., Combined Definition Modeling.
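The generator's input construction can be sketched as below. The `[DEF]`/`[SEP]` marker layout follows the encoding scheme described above; the fine-tuned BART call itself is out of scope, so this only shows how jargon, SDI, and CDI are assembled into one input string.

```python
# Sketch of the generator input encoding:
# "jargon [DEF] sent_1 [SEP] ... [SEP] sent_k [DEF] sent'_1 [SEP] ... "
# where the first [DEF] block holds SDI sentences and the second holds CDI.

def encode_generator_input(jargon, sdi, cdi, k=5, k_prime=5):
    """Join the jargon with the top-k SDI and top-k' CDI sentences."""
    parts = [jargon]
    if sdi:
        parts.append("[DEF] " + " [SEP] ".join(sdi[:k]))
    if cdi:
        parts.append("[DEF] " + " [SEP] ".join(cdi[:k_prime]))
    return " ".join(parts)

enc = encode_generator_input(
    "few-shot learning",
    sdi=["Few-shot learning is a sub-area of machine learning."],
    cdi=["Zero-shot learning (ZSL) is a problem setup in machine learning."],
)
print(enc)
```

The same function covers the ablation variants: passing an empty `cdi` yields a CDM-Sk input, and an empty `sdi` yields CDM-Ck'.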
Our combined definition modeling framework is modular and can be applied to different extractor-generator combinations commonly proposed for definition extraction/generation, which means the proposed framework can improve the performance of a variety of definition modeling systems. For instance, we can replace the BART generator with a GPT-2/3 generator (Radford et al., 2019; Brown et al., 2020) or DMAS (Huang et al., 2021a) by simply modifying the encoding scheme.

Datasets
Existing datasets for definition modeling mainly cover general words/phrases. In this paper, we build several datasets (UJ-CS, UJ-Math, UJ-Phy) for jargon based on Wikipedia and CFL (Huang et al., 2021b). Compared to general words/phrases, jargon terms are less ambiguous but more specialized, i.e., a jargon term usually has only one meaning, but it requires domain knowledge to understand. We also conduct experiments on the dataset (Sci&Med) provided in August et al. (2022), which contains definitions of scientific and medical terms derived from Wikipedia science glossaries and MedQuAD (Ben Abacha and Demner-Fushman, 2019).
Definition Extraction. We build a dataset for jargon definition extraction with Wikipedia. We first collect jargon terms from Wikipedia categories. Specifically, we traverse from three root categories, Category:Subfields of computer science, Category:Fields of mathematics, and Category:Subfields of physics, and collect pages at the first three levels of the hierarchies. For each page, we process the title with lemmatization to obtain the jargon, extract the first sentence of the summary section as the corresponding definition, and sample ≤ 5 sentences containing the target jargon from other sections as negatives (they are less likely to be definitional sentences). We filter out jargon with surface name frequency < 5 in the arXiv corpus (https://www.kaggle.com/Cornell-University/arxiv) to remove noisy phrases, e.g., List of artificial intelligence projects. The dataset contains 26,559 positive and 121,975 negative examples, and the train/valid/test split is 0.8/0.1/0.1.
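The per-page example construction above can be sketched as follows. Page fetching, lemmatization, and the arXiv frequency filter are omitted; the sample page content is hypothetical, hand-written data, not real Wikipedia output.

```python
# Sketch of building extraction-dataset examples from one Wikipedia page:
# the first sentence of the summary is the positive (definition), and up to
# 5 sentences from other sections that mention the jargon are negatives.

import random

def build_examples(jargon, summary_sentences, other_sentences, max_neg=5, seed=0):
    """Return (positive, negatives) for one page."""
    positive = summary_sentences[0]  # first sentence of the summary section
    # Negatives must still contain the target jargon (naive substring match).
    mentions = [s for s in other_sentences if jargon.lower() in s.lower()]
    rng = random.Random(seed)
    negatives = rng.sample(mentions, min(max_neg, len(mentions)))
    return positive, negatives

pos, negs = build_examples(
    "twin prime",
    ["A twin prime is a prime number that is either 2 less or 2 more than another prime number."],
    [
        "The question of whether there exist infinitely many twin primes is open.",
        "This section discusses prime gaps in general.",
    ],
)
```

Running this over all collected pages yields jargon-sentence pairs labeled positive/negative, the training data for SDI-Extractor.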
Definition Generation. We collect jargon terms in two ways. For computer science, we collect jargon (author-assigned keywords) by web scraping from Springer publications on computer science, filtering out terms with frequency < 5. For mathematics and physics, we collect jargon with the CFL model proposed in Huang et al. (2021b); specifically, we keep terms with domain relevance score > 0.5. For each jargon term in the list, the URLs of the top 20 results from Google search are visited, and the sentences containing the target jargon are extracted. For training and evaluation, we only keep jargon terms that have a corresponding Wikipedia page and extract the first sentence of each page as the ground-truth definition. Table 1 summarizes the statistics of the data.

Experimental Setup
Baselines. For extraction, we compare SDI-Extractor with a CNN baseline and a CNN-BiLSTM baseline proposed in Anke and Schockaert (2018). Note that more recent models (Veyseh et al., 2020; Kang et al., 2020) cannot be compared directly since those works focus on a fine-grained sequence labeling task, where the training data also requires additional labeling. Besides, extraction is not the focus of this paper; therefore, we put more emphasis on the evaluation for generation. For generation, we evaluate the following models:
• Gen (w/o context): A simple version of Generationary (Bevilacqua et al., 2020), where BART (Lewis et al., 2020a) is fine-tuned on jargon-definition pairs.
• Gen (w/ context): Generationary with a sentence containing the target jargon as context, where BART is fine-tuned on context-definition pairs.
• DMAS (Huang et al., 2021a): A definition modeling model with three T5 models (Raffel et al., 2020), where a re-ranking mechanism is included to model the specificity of definitions. Context is given by a sentence containing the target jargon.
• BART NO SD and BART SD: For the Sci&Med dataset (August et al., 2022), we also compare with the two best methods introduced in their paper: BART SD, where BART is fine-tuned with the term question, e.g., What is (are) carbon nanotubes?, concatenated with a supporting document; and BART NO SD, where BART is fine-tuned with just the question and definition, without supporting documents.
• Extractive: An extractive baseline, which outputs the candidate definition with the highest confidence score predicted by SDI-Extractor (Section 3.1.1).
• CDM-Sk,Ck': The combined definition modeling model introduced in Section 3.2. Sk or Ck' is omitted when k or k' is equal to 0.
Metrics. For extraction, we use the standard precision, recall, and F1 scores to evaluate the performance. For generation, we follow Bevilacqua et al. (2020) and report automatic metrics such as BLEU, complemented by human evaluation (Section 4.2).
Implementation Details. For SDI extraction, we adopt BERT-base-uncased from the huggingface transformers framework (Wolf et al., 2020). We apply BertForSequenceClassification (with a linear layer on top of the pooled output). We use the default hyperparameters and fine-tune the model using Adam (Kingma and Ba, 2015) with a learning rate of 2 × 10^-6. All layers of the BERT model are fine-tuned. For the two baselines, we train the models on our data with the official implementations. For the extracted SDI, we exclude sentences from Wikipedia to prevent the models from seeing the ground truth.
For CDI extraction, following Huang et al. (2021a), we use the built-in Elasticsearch-based Wikipedia search engine to collect related core terms for each jargon term; we then extract the first sentence of the corresponding Wikipedia page as the definition of each related term.
For generation, we employ the fairseq library to build the BART-base generator and adopt the hyperparameters and settings suggested in Bevilacqua et al. (2020). We set the learning rate to 5 × 10^-5 and use a batch size of 1,024 tokens, updating every 16 iterations, with 1,000 warmup steps. For all the datasets, we use the same trained SDI-Extractor as described above to extract SDI. We adopt the default/suggested hyperparameters for the baselines. We train and evaluate all the baselines and variants on the same train/valid/test split on NVIDIA Quadro RTX 5000 GPUs. The training of CDM can be finished in one hour.

Definition Extraction
Table 2 reports the results of definition extraction. We observe that SDI-Extractor outperforms the baselines significantly, and its performance is quite satisfactory (with an F1 score higher than 0.97), which indicates our definition extractor can extract useful self-definitional information for jargon.

Definition Generation
We provide both quantitative and qualitative evaluations for definition generation.

Automatic Evaluation
Tables 3 and 4 show the results on automatic metrics. We observe that the proposed CDM model outperforms the SOTA baselines significantly. Comparing Gen (w/ context) with Gen (w/o context), we find contexts (random sentences containing the target jargon) provide only limited help for jargon definition modeling. Besides, CDM-S5 outperforms CDM-S3, which in turn outperforms CDM-S1, indicating that the sentences extracted by SDI-Extractor provide important definitional information. Comparing CDM-C5 with Gen (w/ context) and Gen (w/o context), we can verify that CDI is also helpful for definition generation, although the improvement is not as significant as for the models with SDI, e.g., CDM-S5. Among all the models, CDM-S5,C5 usually achieves the best performance, which demonstrates that the combination of SDI and CDI is the most effective for jargon definition modeling.
An interesting finding is that our simple extractive model is comparable to the SOTA abstractive baselines (except in Table 4, because most of the definitions in that dataset are not complete sentences, e.g., "the science of automatic control systems" for cybernetics, while SDI-Extractor usually extracts complete sentences). We suppose this is because, compared to general words/phrases, jargon is more difficult to define without external knowledge. For instance, it is almost impossible for a model to generate the definition of twin prime only with the context "proof of this conjecture would also imply the existence of an infinite number of twin primes", while the definition can often be retrieved from the Web. The results also demonstrate that existing context-aware definition modeling systems struggle to handle jargon, while our proposed extraction-generation framework is quite practical for jargon definition modeling.

Human Evaluation
We conduct human evaluation for the computer science field (UJ-CS). Specifically, we randomly sample 50 jargon terms from the test set and ask three human annotators to evaluate the definitions produced by different models with the rating scale described in Section 4.2. Table 5 reports the human evaluation results, where the average pairwise Cohen's κ is 0.69 (good agreement). We observe that the state-of-the-art baseline Gen (w/ context) struggles to generate reasonable definitions for jargon. In contrast, the proposed CDM-S5,C5 model can produce high-quality definitions in most cases (with a human-annotated score higher than 4). The human evaluation results are also consistent with the automatic evaluation results presented in Table 3.

Sensitivity to Frequency
To investigate the sensitivity of the models with respect to the popularity of jargon, we report results broken down by jargon frequency in Figure 2. We observe that Generationary (Bevilacqua et al., 2020) achieves slightly worse performance for less popular jargon on all metrics, while CDM performs well for low-frequency jargon, which indicates our framework can produce high-quality definitions for long-tail jargon. We suppose this is because, although long-tail jargon is less frequent, we can still extract useful definitional information from the entire Web and incorporate it for definition generation.

Generation Examples and Error Analysis
In Table 6, we show sample outputs on the test set from three models: Extractive, Gen (w/ context), and CDM-S5,C5, with ground-truth definitions from Wikipedia (Gold) as references.
From the results, we observe that although the extractive baseline can produce reasonable sentences, the output sentences may not be high-quality definitional sentences of the target jargon.

Table 6 examples (Extractive / Gen (w/ context) / CDM-S5,C5 / Gold):

Jargon: intelligent user interfaces
• Extractive: ACM IUI 2021 is the 26th annual premier international forum for reporting outstanding research and development on intelligent user interfaces.
• Gen (w/ context): An intelligent user interface (IUI) is a user interface (UI) that is designed to respond to the needs and wants of the user.
• CDM-S5,C5: In computing, an intelligent user interface (UI) is a user interface that has been developed by a group of researchers interested in enabling natural human-computer interaction by combining techniques from computer vision, machine learning, computer graphics, human-computer interaction and psychology.
• Gold: An intelligent user interface (Intelligent UI, IUI, or sometimes Interface Agent) is a user interface (UI) that involves some aspect of artificial intelligence (AI or computational intelligence).

Jargon: wear leveling
• Extractive: The term preemptive wear leveling (PWL) has been used by Western Digital to describe their preservation technique used on hard disk drives (HDDs) designed for storing audio and video data.
• Gen (w/ context): Wear leveling is the process of reducing the wear of a Flash die.
• CDM-S5,C5: Wear leveling is a technique used to increase the lifetime of a solid-state drive (SSD).
• Gold: Wear leveling (also written as wear levelling) is a technique for prolonging the service life of some kinds of erasable computer storage media, such as flash memory, which is used in solid-state drives (SSDs) and USB flash drives, and phase-change memory.

Jargon: gittins index
• Extractive: In applied mathematics, the "Gittins index" is a real scalar value associated to the state of a stochastic process with a reward function and with a probability of termination.
• Gen (w/ context): The Gittins index is a decision-making tool used in decision-making and project management.
• CDM-S5,C5: In applied mathematics, the Gittins index is a real scalar value associated to the state of a stochastic process with a reward function and with a probability of termination.
• Gold: The Gittins index is a measure of the reward that can be achieved through a given stochastic process with certain properties, namely: the process has an ultimate termination state and evolves with an option, at each intermediate state, of terminating.

Jargon: reduplication
• Extractive: The term "compensatory reduplication" refers to duplication that serves a phonological purpose.
• Gen (w/ context): In mathematics, reduplication is a generalization of the concept of reduplication.
• CDM-S5,C5: Reduplication is the repetition of an entire word, word stem (root with one or more affixes), or root.
• Gold: In linguistics, reduplication is a morphological process in which the root or stem of a word (or part of it) or even the whole word is repeated exactly or with a slight change.

Jargon: power delay profile
• Extractive: The power delay profile of a channel represents the average power of the received signal in terms of the delay with respect to the first arrival path in multi-path transmission.
• Gen (w/ context): A power delay profile (PDP) is a measure of the time delay between the transmission and reception of a signal.
• CDM-S5,C5: In telecommunications, the power delay profile (PDP) of a multipath channel represents the average power of the received signal in terms of the delay with respect to the first arrival path in multi-path transmission.
• Gold: The power delay profile (PDP) gives the intensity of a signal received through a multipath channel as a function of time delay.
For instance, the extracted sentence for wear leveling is in fact the definition of preemptive wear leveling. We also find Gen (w/ context) suffers severely from hallucinations, i.e., generating irrelevant or contradicted facts. For instance, gittins index is described as a decision-making tool instead of a measure/value, which is completely wrong. This is mainly because the contexts of jargon may not provide sufficient knowledge to define it. In contrast, the quality of definitions generated by CDM-S5,C5 is high: all the generated definitions capture the main characteristics of the target jargon correctly.
Error Analysis. To further understand the results and identify the remaining challenges, we analyze the human evaluation results. We find that errors can be introduced in either the extraction or the generation process. E.g., 1) for intelligent user interfaces in Table 6, the extracted sentence describes the ACM IUI conference rather than the concept itself.

Discussion
In this work, we focus on jargon definition modeling. The proposed framework can be further extended to general words/phrases in a context-aware setting (Gadetsky et al., 2018). For instance, to retrieve the definitional information, we can incorporate the context in which the target word/phrase is used. E.g., the BERT extractor can be trained with a modified encoding scheme, "[CLS] word/phrase [SEP] context [DEF] sentence". Similarly, the generator can produce the final definition conditioned on the context; e.g., the input of the generator can be encoded as "word/phrase [SEP] context [DEF] sent_1 [SEP] sent_2 ... [SEP] sent_k [DEF] sent'_1 [SEP] sent'_2 ... [SEP] sent'_k'". Since our framework is modular, the BERT extractor and BART generator can also be replaced with more advanced language models. It is also interesting to train the extractor and generator jointly or iteratively (Guu et al., 2020; Lewis et al., 2020b). We keep the proposed model simple and leave context-aware combined definition modeling and more complicated combinations as future work.

Conclusion
We present the first combination of definition extraction and definition generation. We show that, by incorporating extracted self- and correlative definitional information, the generator can produce high-quality definitions for jargon. Experimental results demonstrate the effectiveness of our framework, where the proposed method outperforms recent baselines by a large margin. We also publish several datasets for jargon definition modeling. In future work, we plan to improve our framework as discussed in Section 5 and apply our methods to construct several online domain dictionaries.

Limitations
One limitation of this paper is that it does not consider the diversity of definitions. Definitions from different perspectives can facilitate a more comprehensive understanding. For instance, to define artificial intelligence, we may relate it to or contrast it with other concepts, e.g., "artificial intelligence refers to systems or machines that mimic human intelligence to perform tasks and can iteratively improve themselves based on the information they collect." or "artificial intelligence is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans." Recent work has started to model the specificity and complexity of definitions (Huang et al., 2021a; Gardner et al., 2022); however, the diversity of generated definitions is still limited. We believe our framework can benefit diversity, since the generator has the potential to generate definitions in different styles by incorporating diverse definitional information extracted from the Web.

Figure 2: Results of definition generation with respect to jargon frequency in Springer (author-assigned keywords). Best viewed in color.

Table 1: The statistics of the data.

Table 2: Results of definition extraction.

Table 3: Results of definition generation on automatic metrics. The best results are in bold and the second best are underlined.

2) For markup languages, although SDI-Extractor extracts reasonable definitions (e.g., "Markup languages are languages used by a computer to annotate a document."), the generator mistakenly synthesizes the SDI and CDI into "A markup language is a series of tags mixed with plain text." Nonetheless, compared to existing models that do not combine extraction and generation, CDM greatly reduces hallucination.