PeerDA: Data Augmentation via Modeling Peer Relation for Span Identification Tasks

Span identification aims at identifying specific text spans from text input and classifying them into pre-defined categories. Different from previous works that merely leverage the Subordinate (SUB) relation (i.e. if a span is an instance of a certain category) to train models, this paper for the first time explores the Peer (PR) relation, which indicates that two spans are instances of the same category and share similar features. Specifically, a novel Peer Data Augmentation (PeerDA) approach is proposed which employs span pairs with the PR relation as the augmentation data for training. PeerDA has two unique advantages: (1) There are a large number of PR span pairs for augmenting the training data. (2) The augmented data can prevent the trained model from over-fitting the superficial span-category mapping by pushing the model to leverage the span semantics. Experimental results on ten datasets over four diverse tasks across seven domains demonstrate the effectiveness of PeerDA. Notably, PeerDA achieves state-of-the-art results on six of them.


Introduction
Span Identification (SpanID) is a family of Natural Language Processing (NLP) tasks with the goal of detecting a small portion of text spans and further classifying them into categories (Papay et al., 2020). For example, Named Entity Recognition (NER) is the most famous SpanID task. SpanID serves as the initial step for complex text analysis by narrowing down the search scope of important spans, and thus holds a pivotal position in the field of NLP (Ding et al., 2021). Recently, different domain-specific SpanID tasks, such as social media NER (Derczynski et al., 2017), Aspect Based Sentiment Analysis (ABSA) (Liu, 2012), Contract Clause Extraction (CCE) (Chalkidis et al., 2017), and Span Based Propaganda Detection (SBPD) (Da San Martino et al., 2019), have emerged due to the increasing demands of various NLP applications. Precisely, as shown in Figure 1 (a), the process of SpanID can be summarized as accurately extracting the span-category Subordinate (SUB) relation, i.e., whether a span is an instance of a certain category. Early works (Chiu and Nichols, 2016) typically tackle SpanID tasks as a sequence tagging problem, where the SUB relation is recognized by predicting the category for each input token under a certain context. Recently, to better exploit the knowledge from pretrained language models (Devlin et al., 2019), many efforts (Liu et al., 2020) have been made to reformulate SpanID tasks as a Machine Reading Comprehension (MRC) problem. Taking the example from Figure 1 (b), MRC creates a unique query for each category and recognizes the SUB relation by detecting relevant spans as answers to the category query.
However, only leveraging the SUB relation in the training data to build SpanID models exhibits two major limitations: 1) Over-fitting: With only the SUB relation, SpanID models easily over-fit the superficial span-category mapping of existing categories and generalize poorly to unseen data. 2) Scarcity: For low-resource applications or long-tailed categories, it is insufficient to learn an effective SpanID model given limited span-category pairs with the SUB relation (SUB pairs).
In light of these limitations, we further explore the utility of the span-span Peer (PR) relation, which indicates that two spans are different instances of the same category. Note that the PR relation is a by-product of the SUB relation and does not require additional annotation. The major difference between them is that the PR relation does not involve the direct supervision signal, i.e., the category information. For example, "Hawaii" and "London" are connected by the PR relation because they are two different instances of the "LOC" category in Figure 1 (a). Recognizing the PR relation requires the two involved spans to compare their individual semantic information about their category. With pretrained contextualized representations (Devlin et al., 2019; Liu et al., 2019), this allows one span to fuse additional category information from its peer and thus learn a better span representation, which can enhance the model's capability of capturing the SUB relation. In addition, the number of span-span pairs with the PR relation (PR pairs) grows quadratically (i.e. $O(N^2)$) in the number of SUB pairs. Therefore, a considerable number of PR pairs are available for alleviating the insufficient learning of span semantics in low-resource applications or long-tailed categories.
In this paper, with the aim of leveraging the PR relation to enhance SpanID models, we propose a Peer Data Augmentation (PeerDA) approach that treats PR pairs as a kind of augmented training data. As depicted in Figure 1 (b), PeerDA is realized in the paradigm of MRC such that the query can be customized either as a category or a similar span to explore the SUB and PR relation respectively. To achieve this, we extend the usage of the original training data in two views: (1) The SUB-based training data. It directly solves SpanID tasks by extracting the SUB relation in the original training data.
(2) The PR-based training data. It is augmented to enrich the semantics of spans by extracting the PR relation in the original training data, where one span is asked to identify its peer from the context.
We evaluate the effectiveness of PeerDA on ten datasets across seven domains, from four different SpanID tasks, namely NER, ABSA, CCE, and SBPD. Experimental results show that extracting the PR relation benefits the learning of span semantics and encourages models to identify more possible spans. As a result, PeerDA sets new state-of-the-art (SOTA) results on seven SpanID tasks. We also conduct diverse analyses to explore more potential scenarios for applying PeerDA.
Our contributions are summarized as follows:
• We propose a novel PeerDA approach to tackle SpanID tasks from the perspective of extracting the PR relation.
• We conduct extensive experiments on ten datasets, covering four different SpanID tasks across seven domains, and achieve SOTA performance on seven SpanID tasks.
• We conduct various detailed analyses to explore more potential scenarios for applying PeerDA.
Related Work

DA for SpanID: DA is a widely-adopted solution to data scarcity, where models are fed with automatically generated data to improve generalization and robustness. In the scope of SpanID, DA approaches can be roughly classified into three groups: (1) Word Replacement keeps the labels unchanged but replaces or paraphrases some context tokens, either using simple rules (Wei and Zou, 2019; Dai and Adel, 2020) or strong language models (Kobayashi, 2018; Wu et al., 2019; Li et al., 2020a).
(2) Self-training first trains a model with manually labeled data, then uses the model to automatically label unlabeled data, and finally leverages both the manually and automatically labeled data to enhance itself (Xie et al., 2019, 2020). It shows promising results in many SpanID tasks, including NER (Wang et al., 2020) and propaganda detection (Hou et al., 2021).

SpanID as MRC: In the context of SpanID, Li et al. (2020b) address the nested NER issue by decomposing nested entities into multiple queries. Mao et al. (2021) tackle ABSA by combining aspect term extraction and sentiment polarity classification in a dual MRC framework. Hendrycks et al. (2021) tackle CCE with MRC to deal with the extraction of long clauses. Moreover, other tasks such as relation extraction (Li et al., 2019a), event detection (Liu et al., 2020), and summarization (McCann et al., 2018) are also reported to benefit from the MRC paradigm.

SpanID as Multi-Span MRC
MRC is initially designed to only identify one span as the answer to each query, which is not suitable for SpanID that requires identifying multiple spans. For example, in Figure 2, it cannot simultaneously detect the two spans given the query of "Work of Art". To tackle this issue, we follow the design of Li et al. (2020b) to build a Multi-Span MRC model that is capable of simultaneously extracting multiple spans and implement our PeerDA on it.
Overview of SpanID: Given the input text $X = \{x_1, \ldots, x_n\}$, SpanID is to detect all important spans $\{x^k_{s,e}\}_{k=1}^{K}$ and classify them with proper labels $\{y^k\}_{k=1}^{K}$, where each span $x^k_{s,e} = \{x_{s_k}, x_{s_k+1}, \ldots, x_{e_k-1}, x_{e_k}\}$ is a continuous subsequence of $X$ satisfying $s_k \leq e_k$, and the label comes from a predefined category set $Y$ that depends on the SpanID task (e.g. "Person" in NER).
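To make this formulation concrete, a minimal sketch of how such examples might be represented in code (names and structure are illustrative, not the authors' implementation):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span:
    start: int   # s_k: index of the first token of the span
    end: int     # e_k: index of the last token (s_k <= e_k)
    label: str   # category y^k from the predefined set Y

@dataclass
class SpanIDExample:
    tokens: List[str]   # input text X = {x_1, ..., x_n}
    spans: List[Span]   # all K annotated spans with categories

# e.g., a sentence with two "LOC" entities, as in Figure 1 (a)
example = SpanIDExample(
    tokens="Hawaii is far away from London".split(),
    spans=[Span(0, 0, "LOC"), Span(5, 5, "LOC")],
)
```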

Multi-Span MRC Model

The model consists of a text encoder, a span predictor, and a start-end selector. First, the text encoder (a pretrained language model) encodes the concatenation of the query and the context into hidden states $H \in \mathbb{R}^{n \times d}$ for the $n$ context tokens. Second, the span predictor consists of two binary classifiers, one to predict whether each context token is the start index of the answer based on its hidden state, and the other to predict whether each context token is the end index:

$$P_{start} = \mathrm{softmax}(H W_s), \qquad P_{end} = \mathrm{softmax}(H W_e)$$

where $W_s, W_e \in \mathbb{R}^{d \times 2}$ are the weights of the two classifiers and $d$ is the dimension of the hidden states. The span predictor can output multiple start and end indexes for the given query and context. Finally, the start-end selector matches each start index to each end index and selects the most probable spans from all combinations as the output answers:

$$P_{s,e} = \mathrm{sigmoid}(W_{sel}[h_s; h_e])$$

where $P_{s,e}$ denotes the probability that $X^{MRC}_{s:e}$ forms a possible answer, $h_s$ and $h_e$ are the hidden states at the start and end indexes, $W_{sel} \in \mathbb{R}^{1 \times 2d}$ is the weight matrix, and $[;]$ is the concatenation operation.
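A minimal PyTorch sketch of the span predictor and start-end selector described above (the 0.5 decision threshold and other details are our assumptions; this is not the authors' released code):

```python
import torch
import torch.nn as nn

class MultiSpanHeads(nn.Module):
    """Span predictor (start/end classifiers) plus start-end selector."""

    def __init__(self, d: int):
        super().__init__()
        self.start_clf = nn.Linear(d, 2)     # W_s in R^{d x 2}
        self.end_clf = nn.Linear(d, 2)       # W_e in R^{d x 2}
        self.selector = nn.Linear(2 * d, 1)  # W_sel in R^{1 x 2d}

    def forward(self, hidden: torch.Tensor):
        # hidden: (seq_len, d) contextualized states of the context tokens
        start_logits = self.start_clf(hidden)  # (seq_len, 2)
        end_logits = self.end_clf(hidden)      # (seq_len, 2)

        # candidate start/end indexes: tokens whose positive class wins
        starts = (start_logits.argmax(-1) == 1).nonzero(as_tuple=True)[0]
        ends = (end_logits.argmax(-1) == 1).nonzero(as_tuple=True)[0]

        # match every predicted start to every predicted end (s <= e) and
        # score the pair: P_{s,e} = sigmoid(W_sel [h_s; h_e])
        spans = []
        for s in starts.tolist():
            for e in ends.tolist():
                if s <= e:
                    pair = torch.cat([hidden[s], hidden[e]])
                    p = torch.sigmoid(self.selector(pair)).item()
                    if p > 0.5:  # assumed acceptance threshold
                        spans.append((s, e, p))
        return start_logits, end_logits, spans
```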
The overall training objective is to minimize the cross-entropy loss (CE) between the three predictions and their corresponding ground-truth labels:

$$\mathcal{L} = \mathrm{CE}(P_{start}, Y_{start}) + \alpha\, \mathrm{CE}(P_{end}, Y_{end}) + \beta\, \mathrm{CE}(P_{s,e}, Y_{s,e}) \qquad (4)$$

where $\alpha$, $\beta$ are balance rates and $Y_{start}$, $Y_{end}$, $Y_{s,e}$ are the golden start, end, and span indices respectively.
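The loss can be sketched as follows, assuming per-token start/end labels and binary labels for candidate span pairs (attaching $\alpha$ to the end term is our assumption; both rates are set to 1 in the experiments anyway):

```python
import torch.nn.functional as F

def multi_span_loss(start_logits, end_logits, span_logits,
                    y_start, y_end, y_span, alpha=1.0, beta=1.0):
    """L = CE(P_start, Y_start) + alpha*CE(P_end, Y_end) + beta*CE(P_{s,e}, Y_{s,e})."""
    loss_start = F.cross_entropy(start_logits, y_start)  # logits: (n, 2), labels: (n,)
    loss_end = F.cross_entropy(end_logits, y_end)
    # the start-end selector is a binary classifier over candidate (s, e) pairs
    loss_span = F.binary_cross_entropy_with_logits(span_logits, y_span.float())
    return loss_start + alpha * loss_end + beta * loss_span
```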

PeerDA
In this work, we propose the PeerDA approach to perform data augmentation. The training data $D$ consists of two parts: (1) The SUB-based training data $D_{SUB}$, where the query is about a category and the MRC context is the input text. (2) The PR-based training data $D_{PR}$, constructed from PR pairs, where one span is used to create the query and the input text containing the second span serves as the MRC context.

SUB-based Training Data
First, we transform the original training examples into (query, context, answers) triples following the paradigm of multi-span MRC. To extract the SUB relation between categories and relevant spans, a natural language query $q^{SUB}_y$ is constructed to reflect the semantics of each category $y$.
Given the input text $X$ as the context, the answers to $q^{SUB}_y$ are the spans whose category is $y$. We thus obtain one MRC triple, denoted as $(q^{SUB}_y, X, \{x^k_{s,e} \mid x^k_{s,e} \in X, y^k = y\})$. By enumerating all categories, we create $|Y|$ training examples for each input text to allow the identification of spans from all categories.
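A sketch of this construction, reusing the SpanIDExample class from the earlier sketch (the query template is illustrative; the paper's exact per-category query wording is not reproduced here):

```python
def build_sub_triples(example, categories):
    """Create one (query, context, answers) MRC triple per category y in Y."""
    triples = []
    for y in categories:
        # illustrative template reflecting the semantics of category y
        query = f"Highlight the parts (if any) related to {y}."
        answers = [(sp.start, sp.end) for sp in example.spans if sp.label == y]
        triples.append((query, example.tokens, answers))  # many are unanswerable
    return triples
```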

PR-based Training Data
To construct augmented data that explores the PR relation, we first create a category-wise span set $S_y$ that includes all training spans with category $y$:

$$S_y = \{x^k_{s,e} \mid y^k = y\}$$

Obviously, any two different spans in $S_y$ have the same category and thus hold the PR relation. Therefore, we pair every two different spans in $S_y$ to create a peer set $P_y$:

$$P_y = \{(x^q_{s,e}, x^a_{s,e}) \mid x^q_{s,e}, x^a_{s,e} \in S_y,\ x^q_{s,e} \neq x^a_{s,e}\} \qquad (7)$$

For each PR pair in $P_y$, we construct one training example by building the query with the first span $x^q_{s,e}$:

$q^{PR}_y$ = "Highlight the parts (if any) similar to $x^q_{s,e}$."
Then the text $X^a$ containing the second span $x^a_{s,e}$ serves as the MRC context, and the spans in $X^a$ with the same category $y$ are the answers. Finally, we obtain the MRC triple with respect to the PR relation as $(q^{PR}_y, X^a, \{x^k_{s,e} \mid x^k_{s,e} \in X^a, y^k = y\})$. Theoretically, given the span set $S_y$, there are only $|S_y|$ SUB pairs in the training data, but we can obtain $|S_y| \times (|S_y| - 1)$ PR pairs to construct $D_{PR}$. Such a large amount of augmented data holds great potential to enrich spans' semantics. However, putting all PR-based examples into training would make the data distribution more skewed, since long-tailed categories get fewer PR pairs for augmentation, and would also increase the training cost. Therefore, as a first step toward DA with the PR relation, we propose three augmentation strategies to control the size and distribution of the augmented data. We leave more advanced augmentation strategies to future work.
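The construction of $P_y$ and the PR-based triples might look like the following sketch (again reusing the earlier dataclasses; surface strings stand in for spans, and the helper names are ours):

```python
from collections import defaultdict
from itertools import permutations

def build_pr_triples(examples, categories):
    """Enumerate PR pairs per category and turn them into MRC triples."""
    span_sets = defaultdict(set)     # S_y: surface strings of spans with category y
    occurrences = defaultdict(list)  # (span text, y) -> examples containing it
    for ex in examples:
        for sp in ex.spans:
            text = " ".join(ex.tokens[sp.start:sp.end + 1])
            span_sets[sp.label].add(text)
            occurrences[(text, sp.label)].append(ex)

    triples = []
    for y in categories:
        # P_y: ordered pairs of two *different* spans of category y (Eqn. 7),
        # i.e., |S_y| * (|S_y| - 1) pairs in total
        for q_span, a_span in permutations(span_sets[y], 2):
            query = f"Highlight the parts (if any) similar to {q_span}."
            for ex in occurrences[(a_span, y)]:
                answers = [(sp.start, sp.end) for sp in ex.spans if sp.label == y]
                triples.append((query, ex.tokens, answers))
    return triples
```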
PeerDA-Size: This strategy increases the size of the augmented data while keeping the data distribution unchanged. Specifically, for each category $y$, we randomly sample $\lambda|S_y|$ PR pairs from $P_y$. We then collect all sampled PR pairs to construct $D_{PR}$, where $\lambda$ is the DA rate controlling the size of $D_{PR}$.
PeerDA-Categ: Categories are not evenly distributed in the training data, and SpanID models generally perform poorly on long-tailed categories because of the insufficient learning of semantics from a limited number of SUB pairs. To tackle this issue, we propose the PeerDA-Categ strategy, which augments more training data for long-tailed categories. Specifically, let $y^*$ denote the category with the largest span set, of size $|S_{y^*}|$. We randomly sample $|S_{y^*}| - |S_y|$ PR pairs from $P_y$ for each category $y$ and construct a category-balanced training set $D_{PR}$ from all sampled pairs. Theoretically, we obtain the same amount of training data for each category after the augmentation, which significantly increases the exposure of spans from long-tailed categories.
PeerDA-Both: To combine the advantages of the above two strategies, we further propose PeerDA-Both, which effectively increases the size of the training data while also balancing its distribution. In PeerDA-Both, we randomly sample $\max(\lambda|S_{y^*}| + (|S_{y^*}| - |S_y|),\ 0)$ PR pairs from $P_y$ for each category $y$ to construct $D_{PR}$, where $\lambda|S_{y^*}|$ controls the size and $|S_{y^*}| - |S_y|$ controls the data distribution.
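The three sampling strategies can be sketched as one function (peer_sets maps each category to its pair list $P_y$ and span_sizes to $|S_y|$; the guard against sampling more pairs than exist is our addition):

```python
import random

def sample_pr_pairs(peer_sets, span_sizes, strategy="both", lam=1.0):
    """Sample PR pairs per category under the Size / Categ / Both strategies."""
    s_max = max(span_sizes.values())  # |S_{y*}|, the largest span set
    sampled = []
    for y, pairs in peer_sets.items():
        if strategy == "size":    # keep the distribution, scale the size
            n = int(lam * span_sizes[y])
        elif strategy == "categ": # balance the categories
            n = s_max - span_sizes[y]
        else:                     # "both": max(lam*|S_{y*}| + (|S_{y*}| - |S_y|), 0)
            n = max(int(lam * s_max) + (s_max - span_sizes[y]), 0)
        n = min(n, len(pairs))    # cannot sample more pairs than P_y contains
        sampled.extend(random.sample(pairs, n))
    return sampled
```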

Training
We combine the $D_{SUB}$ and $D_{PR}$ constructed above to obtain the final training data. However, an input text is unlikely to contain all kinds of categories. When converting the text into the MRC paradigm, many of the $|Y|$ examples are unanswerable. If a SpanID model is trained on this unbalanced data, it is likely to output empty spans, since it has been taught by the unanswerable examples not to extract spans. To mitigate such imbalance, we follow Hendrycks et al. (2021) and down-sample the unanswerable examples.

Tasks and Datasets

ABSA is a fine-grained sentiment analysis task centering on aspect terms. We explore two ABSA sub-tasks:
• Aspect Term Extraction (ATE) is to extract aspect terms, where there is only one query asking if there are any aspect terms in the input text.
• Unified Aspect Based Sentiment Analysis (UABSA) is to jointly extract aspect terms (i.e. spans) and predict their sentiment polarities (i.e. categories), namely positive, negative, and neutral.
We evaluate the two sub-tasks on two datasets, the laptop domain Lap14 and the restaurant domain Rest14 from the SemEval shared tasks (Pontiki et al., 2014). We use span-level micro-averaged F1 as the evaluation metric.
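For reference, a minimal sketch of span-level micro-averaged F1 over exact-match (start, end, label) tuples (official scorers may differ in details such as tokenization and partial matching):

```python
def span_micro_f1(gold, pred):
    """gold, pred: one set of (start, end, label) tuples per example."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```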
CCE is a legal task to detect and classify contract clauses (i.e. spans) into relevant clause types (i.e. categories), such as "Governing Law". We conduct CCE experiments on CUAD (Hendrycks et al., 2021), which annotates contracts with 41 clause types. We follow Hendrycks et al. (2021) to split the contracts into segments within the length limitation of pretrained language models and treat each segment as one example. We also follow them in using Area Under the Precision-Recall Curve (AUPR) and Precision at 80% Recall (P@0.8R) as the evaluation metrics.

SBPD aims to detect both the text fragment where a persuasion technique is used (i.e. spans) and its technique type (i.e. categories). We use News20 and Social21 from SemEval shared tasks (Da San Martino et al., 2020; Dimitrov et al., 2021). Since the test set of News20 is blind, we evaluate its model performance on the dev set. We use the official evaluation metrics from the SemEval shared tasks, a modified version of micro-averaged F1, Precision, and Recall that accounts for partial matching between two spans.

Implementations
Since legal SpanID tasks have a lower tolerance for missing important spans, we do not include the start-end selector or the $\beta\,\mathrm{CE}(P_{s,e}, Y_{s,e})$ term of Eqn. 4 in the CCE models, but follow Hendrycks et al. (2021) and output the top 20 spans from the span predictor for each input example in order to extract as many spans as possible. For NER, ABSA, and SBPD, we use the original multi-span MRC architecture. For fair comparison, our multi-span MRC models utilize BERT (Devlin et al., 2019) as the text encoder for ABSA and RoBERTa (Liu et al., 2019) for NER, CCE, and SBPD. Detailed configurations can be found in Appendix A.
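Top-k span extraction without the start-end selector might look like the sketch below; scoring a candidate by the product of its start and end probabilities is our assumption, not necessarily the authors' exact scheme:

```python
import torch

def topk_spans(start_logits, end_logits, k=20, max_span_len=512):
    """Return the k highest-scoring (start, end) candidates with s <= e."""
    p_start = start_logits.softmax(-1)[:, 1]  # P(token is a start index)
    p_end = end_logits.softmax(-1)[:, 1]      # P(token is an end index)
    n = p_start.size(0)
    candidates = []
    for s in range(n):
        for e in range(s, min(s + max_span_len, n)):
            candidates.append((float(p_start[s] * p_end[e]), s, e))
    candidates.sort(reverse=True)  # highest joint probability first
    return [(s, e) for _, s, e in candidates[:k]]
```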

Baselines
Note that our main contribution is to provide a new perspective that treats the PR relation as a kind of training data for augmentation. We do not focus on pushing the SOTA results to new heights, even though we compare PeerDA with some SOTA approaches. NER: We compare with Tagging (Liu et al., 2019) and MRC (Li et al., 2020b) baselines. We also report the previous best approaches for each dataset, including RB-CRF+RM (Lin et al., 2021), CL-KL (Wang et al., 2021), T-NER (Ushio and Camacho-Collados, 2021), KaNa (Nie et al., 2021), and W²NER. ABSA: In addition to the MRC baseline, we also compare with previous approaches built on top of BERT (approaches with stronger pretrained backbones are not included).

Main Results

NER: (Table: NER results on OntoNotes5 and WNUT17.) Apart from that, there are several interesting observations: 1) MRC performs worse than Tagging in our experiments, especially when the amount of training data is small. We conjecture that MRC suffers more from data scarcity as it requires a large amount of data to learn good domain-specific semantics.

ABSA: (Table: ABSA results on Lap14 and Rest14.) Compared with previous approaches, the PeerDA approaches mostly achieve better results on the two sub-tasks. Especially on ATE, PeerDA-Both exceeds the best baseline by 1.0 and 0.8 F1 on the two datasets respectively. However, due to their context-aware nature, sentiment polarities show a weak correlation with the aspect terms in UABSA. In fact, we find a large overlap of aspect terms across the different span sets. Therefore, the common semantics of a PR pair cannot indicate a particular sentiment polarity; it only indicates that both spans are aspect terms. Under this circumstance, PeerDA contributes to detecting aspect terms but does little for sentiment classification. On the other hand, PeerDA-Both and PeerDA-Categ alter the data distribution, making the long-tailed category (i.e. Neutral) more likely to be predicted. Such behavior becomes a disadvantage for classification. This explains why PeerDA-Size can still be beneficial while PeerDA-Both and PeerDA-Categ do not perform well on UABSA.
CCE: The results of CCE are shown in Table 4. PeerDA-Both surpasses MRC by 8.7 AUPR and 13.3 P@0.8R, and even surpasses the previous best model (DeBERTa-xlarge), which is of a much larger size, by 4.5 AUPR, achieving SOTA performance on CUAD. Apart from the superior performance, we also find that CCE benefits more from PeerDA-Categ than from PeerDA-Size. Similar to NER, this finding suggests that the skewed data distribution caused by the large number of long-tailed categories indeed hinders model training, and our PeerDA-Categ, which creates category-balanced training data, provides an effective solution to this issue.

SBPD: The results of the two SBPD tasks are presented in Table 5. PeerDA-Both outperforms MRC by 9.9 and 6.4 F1, achieving SOTA performance on News20 and Social21 respectively. We also observe that both tasks benefit substantially from PeerDA-Categ. Such a consistent phenomenon across NER, CCE, and SBPD implies that augmented data should be increased in proportion to category abundance in different SpanID tasks.

Further Discussions
In this section, we make further discussions to bring valuable insights into our PeerDA approach. We use the PeerDA-Both strategy in the following experiments because of its overall superiority.

Cross-domain Transfer: We conduct cross-domain transfer experiments on four English NER datasets, where the model is trained on OntoNotes5, the largest dataset among them, and evaluated on the test sets of the other three datasets. Since these four datasets come from different domains and differ substantially in their categories, this setting eliminates the possibility that models rely on the superficial span-category mapping, so we can purely evaluate the model's ability to learn semantics. The results are presented in Table 6, and full Precision and Recall scores can be found in Appendix A.4. PeerDA significantly exceeds MRC on all three transfer pairs. On average, PeerDA achieves 1.3 and 4.6 F1 gains over base-size MRC and large-size MRC respectively. These results verify our postulation that modeling the PR relation can enrich spans' semantics and thus improve the model's generalization ability.
Low-resource Scenarios: We simulate low-resource scenarios by randomly selecting 10%, 30%, 50%, and 100% of the training data for training SpanID models, and show the comparison between PeerDA and MRC on four SpanID tasks in Figure 3. The general trends are consistent across the four tasks: PeerDA boosts the performance of SpanID tasks at all training data sizes. When trained with 50% of the training data, PeerDA can reach or almost exceed the performance of MRC trained on the full training set. These results suggest that PeerDA uses training data more efficiently, achieving comparable performance with less data. This is especially valuable when annotation is expensive; for example, CUAD cost about $2 million to annotate legal contracts (Hendrycks et al., 2021).

Figure 3: Performance on low-resource scenarios. We select one dataset for each SpanID task and report the test results (AUPR for CCE and F1 for others) from models trained on different proportions of the training data.

Robustness: We evaluate each type of adversarial attack independently, including entity attack, which replaces entities with other entities not present in the training set, and context attack, which replaces the context of entities. It shows that PeerDA does not work well against entity attack, because we only use entities from the training set to conduct data augmentation, which is intrinsically ineffective against this attack. This motivates us to incorporate external knowledge (e.g. Wikipedia) into our PeerDA approach in future work. On Lap14, PeerDA significantly improves Tagging and MRC by 6.5 and 4.1 F1 on the adversarial set respectively.
Peer-driven DA: We compare PeerDA with Mention Replacement (MenReplace) (Dai and Adel, 2020), another peer-driven DA approach that randomly replaces a span mention in the context with another mention of the same category from the training set. The results on four SpanID tasks are presented in Table 8. PeerDA exhibits better performance than MenReplace on all four tasks. In addition, MenReplace can easily break text coherence by putting span mentions into an incompatible context, while PeerDA performs a more natural augmentation without altering the context.
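For contrast, a sketch of MenReplace as described above: a gold mention is swapped for another mention of the same category, which can break coherence when the substitute does not fit the surrounding text (dataclasses reused from the earlier sketches; a full implementation would also re-index the offsets of the remaining spans):

```python
import random

def mention_replace(example, span_sets):
    """Replace one random gold span with another mention of the same category."""
    sp = random.choice(example.spans)
    substitute = random.choice(sorted(span_sets[sp.label])).split()
    return (example.tokens[:sp.start] + substitute
            + example.tokens[sp.end + 1:])
```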

Conclusion
In this paper, we propose a novel PeerDA approach for SpanID tasks that augments training data from the perspective of capturing the PR relation. PeerDA has two unique advantages: (1) It is capable of leveraging the abundant but previously unused PR relation as training data to enrich the semantics of spans.
(2) It alleviates the over-fitting issue of neural network models by pushing the models to rely more on semantics. We conduct extensive experiments to verify the effectiveness of PeerDA and make further in-depth analyses to explore more potential scenarios for applying it.

References
Ilias Chalkidis, Ion Androutsopoulos, and Achilleas Michos. 2017. Extracting contract elements. In Proceedings of the 16th edition of the International Conference on Artificial Intelligence and Law, pages 19-28.

A.1 Tasks

ABSA (Li et al., 2019b; Chen and Qian, 2020) is a fine-grained sentiment analysis task centering on aspect terms. We explore two ABSA sub-tasks:
• Aspect Term Extraction (ATE) is to extract aspect terms, where there is only one query asking if there are any aspect terms in the input text.
• Unified Aspect Based Sentiment Analysis (UABSA) is to jointly extract aspect terms and predict their sentiment polarities. We formulate it as a SpanID task by treating the sentiment polarities, namely, positive, negative, and neutral, as three category labels, and aspect terms as spans.
We evaluate the two sub-tasks on two datasets, including the laptop domain dataset Lap14 and restaurant domain dataset Rest14 from SemEval Shared tasks (Pontiki et al., 2014). We use the processed data from .
CCE is a legal NLP task to detect and classify contract clauses into relevant clause types, such as "Governing Law" and "Uncapped Liability". The goal of CCE is to reduce the labor of legal professionals in reviewing contracts that are dozens or hundreds of pages long. CCE is also a kind of SpanID task, where spans are the contract clauses that warrant review or analysis and labels are predefined clause types. We conduct CCE experiments on CUAD (Hendrycks et al., 2021), which annotates contracts from Electronic Data Gathering, Analysis and Retrieval (EDGAR) with 41 clause types. We follow Hendrycks et al. (2021) to split the contracts into segments within the length limitation of pretrained language models and treat each segment as one example. We also follow their data split strategy.
SBPD (Da San Martino et al., 2019) is a typical SpanID task that aims to detect both the text fragment (i.e. spans) where a persuasion technique is used and its technique type (i.e. category labels). We use News20 and Social21 from two SemEval shared tasks (Da San Martino et al., 2020; Dimitrov et al., 2021) and follow the official data split strategy. Note that News20 does not provide golden labels for the test set; therefore, we evaluate News20 on the dev set.

A.2 Implementations
We use Huggingface's implementations of BERT and RoBERTa (Wolf et al., 2020). The hyper-parameters can be found in Table 9, where we set the maximum input length according to the length of the input text for each dataset and set the batch size according to the memory of a Titan RTX GPU card. We follow the default learning rate schedule and dropout settings used in BERT. We use AdamW (Loshchilov and Hutter, 2019) as our optimizer. The balance rates $\alpha$, $\beta$ are all set to 1 following Li et al. (2020b).

Table 9: Hyper-parameter settings.
Figure 4: Performance in terms of different DA rate λ. We vary λ to get different volumes of PR-based training data.

A.3 Effect of DA Rate
We vary the DA rate $\lambda$ to investigate how the volume of PR-based training data affects the performance of SpanID models. Figure 4 shows the effect of different $\lambda$ on four SpanID tasks. PeerDA improves over MRC in almost all trials of $\lambda$, and we suggest that some tuning of $\lambda$ is beneficial to obtain optimal results.
Another observation is that a too-large $\lambda$ harms performance. Especially on CCE, due to the skewed distribution and the large number of categories, PeerDA can produce a huge amount of PR-based training data. We speculate that too much PR-based training data interferes with learning from the SUB-based training data and thus with the model's ability to solve a SpanID task, causing the optimal $\lambda$ to be a negative value. In addition, too much PR-based training data also increases the training cost. As a result, we should maintain an appropriate ratio of SUB-based to PR-based training data to keep reasonable performance on SpanID tasks.

A.4 Full Results of Cross-domain Transfer
Precision/Recall/F 1 scores for each MRC and PeerDA model of the cross-domain transfer experiments are presented in